new

[book_gpu.git] / BookGPU / Chapters / chapter9 / ch9.tex
diff --git a/BookGPU/Chapters/chapter9/ch9.tex b/BookGPU/Chapters/chapter9/ch9.tex

index a78e63396984732a861dc76d756c739ca43cffc8..caf17c421092f521eb8e3982932d58d46aecf54d 100644 (file)
--- a/BookGPU/Chapters/chapter9/ch9.tex
+++ b/BookGPU/Chapters/chapter9/ch9.tex
@@ -6,7 +6,7 @@
  
  \chapterauthor{Nouredine Melab}{Université Lille 1, LIFL/UMR CNRS 8022, 59655-Villeneuve d'Ascq cedex, France}
  
  
  \chapterauthor{Nouredine Melab}{Université Lille 1, LIFL/UMR CNRS 8022, 59655-Villeneuve d'Ascq cedex, France}
  
-\chapter{Parallel GPU-accelerated Metaheuristics}
+\chapter{Parallel GPU-accelerated metaheuristics}
  \label{chapter9}
  \section{Introduction}
  This chapter presents GPU-based parallel
  \label{chapter9}
  \section{Introduction}
  This chapter presents GPU-based parallel
@@ -26,8 +26,7 @@ Section~\ref{ch8:sec:challenges} highlights the  main challenges
  related to the GPU implementation of metaheuristics. A
  state-of-the-art of GPU-based parallel
  metaheuristics\index{Metaheuristics!parallel~metaheuristics} is
  related to the GPU implementation of metaheuristics. A
  state-of-the-art of GPU-based parallel
  metaheuristics\index{Metaheuristics!parallel~metaheuristics} is
-summarized in Section~\ref{ch8:sec:state}. In Section
-Section~\ref{ch8:sec:GPU_APIs}, the main developed GPU-based
+summarized in Section~\ref{ch8:sec:state}. In Section~\ref{ch8:sec:frameworks}, the main developed GPU-based
  frameworks for metaheuristics are described. Finally, a case study
  is presented in Section~\ref{ch8:sec:case} and some concluding
  remarks are given in Section~\ref{ch8:conclusion}
  frameworks for metaheuristics are described. Finally, a case study
  is presented in Section~\ref{ch8:sec:case} and some concluding
  remarks are given in Section~\ref{ch8:conclusion}
@@ -56,7 +55,7 @@ for solving an optimization problem, two classes of optimization
  methods can be distinguished: \emph{exact methods} and
  \emph{approximate methods}. Exact methods allow one to reach
  optimal solution(s) of the handled optimization problem with a
  methods can be distinguished: \emph{exact methods} and
  \emph{approximate methods}. Exact methods allow one to reach
  optimal solution(s) of the handled optimization problem with a
-proof of its or their optimality. The most known methods of this
+proof of its or their optimality. The known methods of this
  class are the \emph{branch and bound technique}, \emph{dynamic
  programming}, \emph{constraint programming}, and \emph{A*
  algorithm}. However, optimization problems, whether practical or
  class are the \emph{branch and bound technique}, \emph{dynamic
  programming}, \emph{constraint programming}, and \emph{A*
  algorithm}. However, optimization problems, whether practical or
@@ -73,7 +72,7 @@ these problems. Indeed, these methods allow one to reach good
  quality solutions in reasonable computation time compared to exact
  methods but with no guarantee to find optimal or even bounded
  solutions. This is due to the nature of the search process adopted
  quality solutions in reasonable computation time compared to exact
  methods but with no guarantee to find optimal or even bounded
  solutions. This is due to the nature of the search process adopted
-by these approaches which consists in performing a
+by these approaches which consists of performing a
  search in a subset of the whole search space.\\
  
  Regarding the number of solutions considered at each iteration in
  search in a subset of the whole search space.\\
  
  Regarding the number of solutions considered at each iteration in
@@ -120,7 +119,7 @@ methods and the solving of large-sized optimization problems.
    \item Sequential processor architectures have reached their
  physical limit which prevents creating faster processors. The
  current trend of microprocessor manufacturers consists of placing
    \item Sequential processor architectures have reached their
  physical limit which prevents creating faster processors. The
  current trend of microprocessor manufacturers consists of placing
-multiple cores on a single chip. Nowadays, laptops and
+multicores on a single chip. Nowadays, laptops and
  workstations are multicore processors. In addition, the evolution
  of network technologies and the proliferation of broadband
  networks have made possible the emergence of clusters of
  workstations are multicore processors. In addition, the evolution
  of network technologies and the proliferation of broadband
  networks have made possible the emergence of clusters of
@@ -131,7 +130,7 @@ high-performance computing.
  
  From the granularity of parallelism point of view, three major parallel
  models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{Metaheuristics!algorithmic-level parallelism},
  
  From the granularity of parallelism point of view, three major parallel
  models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{Metaheuristics!algorithmic-level parallelism},
-\emph{iteration-level} \index{Metaheuristics!iteration-level parallelism}, and \emph{solution-level} as illustrated in Figure~\ref{ch8:fig:paraMeta}. \\
+\emph{iteration-level}\index{Metaheuristics!iteration-level parallelism}, and \emph{solution-level} as illustrated in Figure~\ref{ch8:fig:paraMeta}. \\
  
  \begin{figure}[h!]
  \centerline{\includegraphics[width=0.6\textwidth]{Chapters/chapter9/figures/paraMeta.pdf}}
  
  \begin{figure}[h!]
  \centerline{\includegraphics[width=0.6\textwidth]{Chapters/chapter9/figures/paraMeta.pdf}}
@@ -271,7 +270,7 @@ GPU is achieved when launching a large amount of threads
  (thousands of threads)~\cite{CUDA}. However, the execution order
  of these thousands of threads is unknown by the programmer which
  makes the prediction of their execution order a challenging issue.
  (thousands of threads)~\cite{CUDA}. However, the execution order
  of these thousands of threads is unknown by the programmer which
  makes the prediction of their execution order a challenging issue.
-On an other hand, the developer has to control explicitly the
+Plus, the developer has to control explicitly the
  threads through the insertion of barrier synchronizations in the
  codes to avoid concurrent accesses to data structures and to meet
  some requirements related to data-dependent synchronizations.
  threads through the insertion of barrier synchronizations in the
  codes to avoid concurrent accesses to data structures and to meet
  some requirements related to data-dependent synchronizations.
@@ -290,7 +289,7 @@ algorithm\index{Metaheuristics!genetic~algorithm} and the
  exploration of different paths using an ant
  colony\index{Metaheuristics!ant~colony~optimization} are random
  operations. Threads of the same warp have to execute
  exploration of different paths using an ant
  colony\index{Metaheuristics!ant~colony~optimization} are random
  operations. Threads of the same warp have to execute
-simultaneously instructions leading to different branches whereas
+instructions simultaneously  leading to different branches whereas
  in an SIMD model the threads of a same warp execute the same
  instruction. Consequently, the different branches of a conditional
  instruction which is data-dependent lead to a serial execution of
  in an SIMD model the threads of a same warp execute the same
  instruction. Consequently, the different branches of a conditional
  instruction which is data-dependent lead to a serial execution of
@@ -339,12 +338,11 @@ of General Perpose GPU (GPGPU) brings new challenges for parallel metaheuristics
  The first works on metaheuristic algorithms implemented on GPUs
  started on old graphics cards before the appearance of modern GPUs
  equipped with high-level programming interfaces such as CUDA and
  The first works on metaheuristic algorithms implemented on GPUs
  started on old graphics cards before the appearance of modern GPUs
  equipped with high-level programming interfaces such as CUDA and
-OpenCL. Among these pioneering works we cite the work of Wong
-\emph{et al.}~\cite{wongOldGPU2006} dealing with the
+OpenCL. Among these pioneering works we cite the work of Wong et al.~\cite{wongOldGPU2006} dealing with the
  implementation
  implementation
-of EAs on graphics processing cards and the work by Catala \emph{et al.} in~\cite{catala2007} where the ACO\index{Metaheuristics!ant~colony~optimization} algorithm
-is implemented on old GPU architectures. Yu \emph{et al.}~\cite{yu2005} and
-Li \emph{et al.}~\cite{li2007} proposed a full parallelization of genetic
+of EAs on graphics processing cards and the work by Catala et al. in~\cite{catala2007} where the ACO\index{Metaheuristics!ant~colony~optimization} algorithm
+is implemented on old GPU architectures. Yu et al.~\cite{yu2005} and
+Li et al.~\cite{li2007} proposed a full parallelization of genetic
  algorithm\index{Metaheuristics!genetic~algorithm}s  on old GPU architectures using
  shader libraries based on Direct3D and OpenGL.\\
  
  algorithm\index{Metaheuristics!genetic~algorithm}s  on old GPU architectures using
  shader libraries based on Direct3D and OpenGL.\\
  
@@ -386,12 +384,12 @@ overhead due to the communication and memory latencies. Therefore,
  large neighborhoods are necessary for efficient implementation of
  local searches on GPUs.\\
  
  large neighborhoods are necessary for efficient implementation of
  local searches on GPUs.\\
  
-Luong \emph{et al.}~\cite{luong2010large} proposed efficient
+Luong et al.~\cite{luong2010large} proposed efficient
  mappings for large neighborhood structures on GPUs. In this work,
  three different neighborhoods are studied and mapped to the
  hierarchical GPU for binary problems. The three neighborhoods are
  mappings for large neighborhood structures on GPUs. In this work,
  three different neighborhoods are studied and mapped to the
  hierarchical GPU for binary problems. The three neighborhoods are
-based on the \emph{Hamming} distance. The move operators used in
-the three neighborhoods consider \emph{Hamming} distances of 1, 
+based on the Hamming distance. The move operators used in
+the three neighborhoods consider Hamming distances of 1, 
  2, and 3 (this consists on flipping the binary value of one, two,
  or three positions at a time in the candidate binary vector).
  In~\cite{luong2010large}, each thread is associated to a unique
  2, and 3 (this consists on flipping the binary value of one, two,
  or three positions at a time in the candidate binary vector).
  In~\cite{luong2010large}, each thread is associated to a unique
@@ -402,7 +400,7 @@ solution associated to each CUDA thread's \textit{id}.
  %For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes  in the candidate vector.
  The three neighborhoods are  implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{Metaheuristics!tabu~search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes.\\ % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
  
  %For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes  in the candidate vector.
  The three neighborhoods are  implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{Metaheuristics!tabu~search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes.\\ % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
  
-In the same context, Deevacq \emph{et al.}~\cite{audreyANT}
+In the same context, Deevacq et al.~\cite{audreyANT}
  proposed two parallelization strategies inspired by the multi-walk
  parallelization strategy, of a 3-opt iterated local
  search\index{Metaheuristics!iterated local search} algorithm (ILS)
  proposed two parallelization strategies inspired by the multi-walk
  parallelization strategy, of a 3-opt iterated local
  search\index{Metaheuristics!iterated local search} algorithm (ILS)
@@ -419,7 +417,7 @@ shared memory to store the related data structures. The two
  strategies have been experimented on standard benchmarks of
  the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cities. The results indicate that increasing the number of solutions to be explored simultaneously improves the speedup in the two strategies, but at the same time it decreases final solution quality. The greater speedup factor reached by the second strategy is $6 \times$.\\
  
  strategies have been experimented on standard benchmarks of
  the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cities. The results indicate that increasing the number of solutions to be explored simultaneously improves the speedup in the two strategies, but at the same time it decreases final solution quality. The greater speedup factor reached by the second strategy is $6 \times$.\\
  
-The same strategy is used by Luong \emph{et al.}
+The same strategy is used by Luong et al.
  in~\cite{luongMultiStart} to implement multistart parallel local
  search algorithms (a special case of the
  algorithmic-level\index{Metaheuristics!algorithmic-level
  in~\cite{luongMultiStart} to implement multistart parallel local
  search algorithms (a special case of the
  algorithmic-level\index{Metaheuristics!algorithmic-level
@@ -451,10 +449,10 @@ parallelism} with usage of texture memory rise from $7.8\times$ to
  $12\times$ (for different
  QAP benchmark sizes). \\
  
  $12\times$ (for different
  QAP benchmark sizes). \\
  
-Janiak \emph{et al.}~\cite{Janiak_et_al_2008} implemented two
+Janiak et al.~\cite{Janiak_et_al_2008} implemented two
  algorithms for TSP and the flow-shop scheduling problem (FSP).
  These algorithms are based on a multistart tabu
  algorithms for TSP and the flow-shop scheduling problem (FSP).
  These algorithms are based on a multistart tabu
-search\index{Metaheuristics!tabu~search} model. Both of the two
+search\index{Metaheuristics!tabu~search} model. Both of the 
  algorithms exploit multicore CPU and GPU. A full parallelization
  on GPU is adopted using shader libraries where each thread is
  mapped with one tabu search\index{Metaheuristics!tabu~search}.
  algorithms exploit multicore CPU and GPU. A full parallelization
  on GPU is adopted using shader libraries where each thread is
  mapped with one tabu search\index{Metaheuristics!tabu~search}.
@@ -462,9 +460,9 @@ However, even though their experiments report that the use of GPU
  speedups the serial execution almost $16 \times$, the mapping of
  one thread with one tabu search\index{Metaheuristics!tabu~search}
  requires a large number of local search algorithms to
  speedups the serial execution almost $16 \times$, the mapping of
  one thread with one tabu search\index{Metaheuristics!tabu~search}
  requires a large number of local search algorithms to
-cover the memory access latency. The same mapping policy is adopted by Zhu \emph{et al.} in~\cite{zhu_et_al_2008} (one thread is associated to one local search) solving the quadratic assignment problem but using the CUDA toolkit instead of shader libraries.\\
+cover the memory access latency. The same mapping policy is adopted by Zhu et al. in~\cite{zhu_et_al_2008} (one thread is associated to one local search) solving the quadratic assignment problem but using the CUDA toolkit instead of shader libraries.\\
  
  
-Luong \emph{et al.}~\cite{luong2012ppsn} proposed a GPU-based
+Luong et al.~\cite{luong2012ppsn} proposed a GPU-based
  implementation of hybrid metaheuristics on heterogeneous parallel
  architectures (multicore CPU  coupled to one GPU).  The challenge
  of using such a heterogeneous architecture is how to distribute
  implementation of hybrid metaheuristics on heterogeneous parallel
  architectures (multicore CPU  coupled to one GPU).  The challenge
  of using such a heterogeneous architecture is how to distribute
@@ -473,7 +471,7 @@ optimal performances. Among the three traditional parallel models
  (solution-level\index{Metaheuristics!solution-level~parallelism},
  iteration-level\index{Metaheuristics!iteration-level parallelism},
  and algorithmic-level\index{Metaheuristics!algorithmic-level
  (solution-level\index{Metaheuristics!solution-level~parallelism},
  iteration-level\index{Metaheuristics!iteration-level parallelism},
  and algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism}), the authors pointed out that the most convenient
+parallelism}), the authors point out that the most convenient
  model for the considered heterogeneous architecture is a hybrid
  model combining
  iteration-level\index{Metaheuristics!iteration-level parallelism}
  model for the considered heterogeneous architecture is a hybrid
  model combining
  iteration-level\index{Metaheuristics!iteration-level parallelism}
@@ -484,7 +482,7 @@ associated to one CPU core and  used to accelerate the
  neighborhood calculation of several S-metaheuristics at the same
  time.
   In order to efficiently exploit the remaining CPU cores, a load-balancing heuristic is also proposed in order to decide on the number of additional
  neighborhood calculation of several S-metaheuristics at the same
  time.
   In order to efficiently exploit the remaining CPU cores, a load-balancing heuristic is also proposed in order to decide on the number of additional
-S-metaheuristics to launch on the remaining CPU cores relative to the efficiency of the GPU calculations. The proposed approach is applied to the QAP using several instances of the fast ant colony algorithm (FANT)~\cite{taillardFant}. \\
+S-metaheuristics to launch on the remaining CPU cores relative to the efficiency of the GPU calculations. The proposed approach is applied to the QAP using several instances of the Fast Ant Colony Algorithm (FANT)~\cite{taillardFant}. \\
  
  All the previously noted works  exploit the same parallel models
  used on CPUs based on the task parallelism. A different
  
  All the previously noted works  exploit the same parallel models
  used on CPUs based on the task parallelism. A different
@@ -497,18 +495,18 @@ swap moves are calculated and stored relative to the initial
  permutation $p$. For the GPU implementation, the authors used the
  parallel implementation of neighborhood exploration. The
  time-consuming tasks in the SA-matrix are identified by the
  permutation $p$. For the GPU implementation, the authors used the
  parallel implementation of neighborhood exploration. The
  time-consuming tasks in the SA-matrix are identified by the
-authors as: updating the matrix and passing through it to select
+authors as updating the matrix and passing through it to select
  the next accepted move. To initialize the delta matrix, several
  threads from different blocks explore different segments of the
  matrix (different moves) at the same time.  In order to select the
  the next accepted move. To initialize the delta matrix, several
  threads from different blocks explore different segments of the
  matrix (different moves) at the same time.  In order to select the
-next accepted, swap several threads are also used. Starting from
+next accepted swap, several threads are also used. Starting from
  the last move, a group of threads explores different subsets of
  the delta matrix. The shared memory is used to preload all the
  necessary elements for a given group of threads responsible for
  the updating of the delta matrix. The main difference in this work
  compared to the previous works resides in the introduction of a
  data parallelism using the precalculated delta matrix. The use of
  the last move, a group of threads explores different subsets of
  the delta matrix. The shared memory is used to preload all the
  necessary elements for a given group of threads responsible for
  the updating of the delta matrix. The main difference in this work
  compared to the previous works resides in the introduction of a
  data parallelism using the precalculated delta matrix. The use of
-this matrix allows the increasing of the number of threads
+this matrix allows the increase in the number of threads
  involved in the evaluation of a single move. Experimentations are
  done on standard QAP instances from the
  QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are
  involved in the evaluation of a single move. Experimentations are
  done on standard QAP instances from the
  QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are
@@ -530,10 +528,10 @@ colony\index{Metaheuristics!ant~colony~optimization} optimization
  swarm optimization\index{Metaheuristics!particle swarm
  optimization} (PSO).
  
  swarm optimization\index{Metaheuristics!particle swarm
  optimization} (PSO).
  
-\subsubsection*{Evolutionary Algorithms}
+\subsubsection*{Evolutionary algorithms}
  
  Traditional parallel models for EAs are classified in three main
  
  Traditional parallel models for EAs are classified in three main
-classes: coarse grain models using several sub-populations
+classes: coarse-grain models using several subpopulations
  (islands), master-slave models used for the parallelization of CPU
  intensive steps (evaluation and transformation), and cellular
  models based on the use of one population disposed (generally) on
  (islands), master-slave models used for the parallelization of CPU
  intensive steps (evaluation and transformation), and cellular
  models based on the use of one population disposed (generally) on
@@ -566,15 +564,15 @@ related to each individual. The obtained speedups range from  $10
  \times$ to $47 \times$ for population sizes
  ranging from $50$ to $10000$. \\
  
  \times$ to $47 \times$ for population sizes
  ranging from $50$ to $10000$. \\
  
-Maitre \emph{et al.}~\cite{maitre2009} also exploited the
+Maitre et al.~\cite{maitre2009} also exploited the
  master-slave parallelism of EAs on GPUs using EASEA. EASEA is a
  C-like metalanguage for easy development of EAs. The user writes a
  master-slave parallelism of EAs on GPUs using EASEA. EASEA is a
  C-like metalanguage for easy development of EAs. The user writes a
-description  of the problem specific components (fitness function,
+description  of the problem-specific components (fitness function,
  problem representation, etc) in EASEA. The code  is then compiled
  to obtain a ready-to-use evolutionary
  algorithm\index{Metaheuristics!evolutionary~algorithm}. The EASEA
  compiler  uses genetic
  problem representation, etc) in EASEA. The code  is then compiled
  to obtain a ready-to-use evolutionary
  algorithm\index{Metaheuristics!evolutionary~algorithm}. The EASEA
  compiler  uses genetic
-algorithm\index{Metaheuristics!genetic~algorithm}LIB and EO
+algorithm\index{Metaheuristics!genetic~algorithm} LIB and EO
  Libraries to produce C++ or JAVA written EA codes.
  In~\cite{maitre2009}, the authors  proposed an extension of EASEA
  to produce CUDA code from the EASEA files. This extension has been
  Libraries to produce C++ or JAVA written EA codes.
  In~\cite{maitre2009}, the authors  proposed an extension of EASEA
  to produce CUDA code from the EASEA files. This extension has been
@@ -585,18 +583,18 @@ during the experiments: a benchmark mathematical function and a
  real problem (molecular structure prediction). In order to
  maximize the GPU occupation, very large populations are used (from
  $2000$ to $20000$). Even though transferring such large
  real problem (molecular structure prediction). In order to
  maximize the GPU occupation, very large populations are used (from
  $2000$ to $20000$). Even though transferring such large
-populations from the CPU to the GPU device memory at every generation is very costly, the authors reported important speedups on the two problems on a GTX260 card:  $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management\index{GPU Challenges!memory~management} of the populations on GPU.\\
+populations from the CPU to the GPU device memory at every generation is very costly, the authors report important speedups on the two problems on a GTX260 card:  $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management\index{GPU Challenges!memory~management} of the populations on GPU.\\
  
  The master-slave  model is efficient when the fitness function is
  highly time intensive. Nevertheless, it requires the use of
  large-sized populations in order to saturate the GPU unless the
  per-block is used (one individual perblock) when the acceleration
  of the fitness function itself is possible. The use of many
  
  The master-slave  model is efficient when the fitness function is
  highly time intensive. Nevertheless, it requires the use of
  large-sized populations in order to saturate the GPU unless the
  per-block is used (one individual perblock) when the acceleration
  of the fitness function itself is possible. The use of many
-sub-populations of medium sizes is another way to obtain a maximum
+subpopulations of medium sizes is another way to obtain a maximum
  occupation of the GPU. This is coarse-grained parallelism (island
  model).\\
  
  occupation of the GPU. This is coarse-grained parallelism (island
  model).\\
  
-The coarse grained model is  used by Pospichal \emph{et al.}
+The coarse-grained model is  used by Pospichal et al.
  in~\cite{pospichal10} to implement a parallel genetic
  algorithm\index{Metaheuristics!genetic~algorithm} on GPU. In this
  work the entire genetic
  in~\cite{pospichal10} to implement a parallel genetic
  algorithm\index{Metaheuristics!genetic~algorithm} on GPU. In this
  work the entire genetic
@@ -630,17 +628,16 @@ compared to a single core implementation of a coarse-grained
  genetic algorithm\index{Metaheuristics!genetic~algorithm} on a
  Intel i7 processor.\\
  
  genetic algorithm\index{Metaheuristics!genetic~algorithm} on a
  Intel i7 processor.\\
  
-Nowotniak \emph{et al.}~\cite{nowotniak} proposed a GPU-based
-implementation of a quantum inspired genetic
-algorithm\index{Metaheuristics!genetic~algorithm} (QIGA). The used
-parallel model is a hierarchical model based on two levels: each
+Nowotniak and Kucharski~\cite{nowotniak} proposed a GPU-based
+implementation of a Quantum Inspired Genetic Algorithm\index{Metaheuristics!genetic~algorithm} (QIGA). The 
+parallel model used is a hierarchical model based on two levels: each
  thread in a block transforms a unique individual and a different
  population is assigned to each block. The algorithm is run
  entirely on GPU. A different instance of the QIGA is run on each
  block and the computations have been shared between 8 GPUs. This
  approach is very convenient to speed up the experimental process
  on metaheuristics that require several independent runs of the
  thread in a block transforms a unique individual and a different
  population is assigned to each block. The algorithm is run
  entirely on GPU. A different instance of the QIGA is run on each
  block and the computations have been shared between 8 GPUs. This
  approach is very convenient to speed up the experimental process
  on metaheuristics that require several independent runs of the
-same algorithm in order to asses statistical efficiency. The
+same algorithm in order to assess statistical efficiency. The
  populations are stored in the shared memory, the data matrix used
  for fitness evaluation is placed in read only constant memory, and
  finally seeds for random numbers generated on the GPU are stored
  populations are stored in the shared memory, the data matrix used
  for fitness evaluation is placed in read only constant memory, and
  finally seeds for random numbers generated on the GPU are stored
@@ -654,7 +651,7 @@ EAs) have been used by many authors to implement parallel EAs on
  GPUs. Indeed, the fine-grained parallelism of EAs fits perfectly
  to the SIMD architecture of the GPU. \\
  
  GPUs. Indeed, the fine-grained parallelism of EAs fits perfectly
  to the SIMD architecture of the GPU. \\
  
-Pinel \emph{et al.} in~\cite{pinel2012JPDC} developed a highly
+Pinel et al. in~\cite{pinel2012JPDC} developed a highly
  parallel synchronous cellular genetic
  algorithm\index{Metaheuristics!genetic~algorithm} (CGA), called
  GraphCell, to solve the independent task scheduling problem on GPU
  parallel synchronous cellular genetic
  algorithm\index{Metaheuristics!genetic~algorithm} (CGA), called
  GraphCell, to solve the independent task scheduling problem on GPU
@@ -693,7 +690,7 @@ performed on a set of standard benchmark functions with different
  grid sizes ranging from $32^2$ to $512^2$. The speedups reached by
  the GPU implementation against the CPU version range from
  $5\times$ to $24\times$ and increase as the size of the population
  grid sizes ranging from $32^2$ to $512^2$. The speedups reached by
  the GPU implementation against the CPU version range from
  $5\times$ to $24\times$ and increase as the size of the population
-increases. However, the CPU implementation run faster than the GPU
+increases. However, the CPU implementation runs faster than the GPU
  version for all the tested  benchmarks when the size of the
  population is set to $32^2$. When the size of the population is
  too small, there are not enough computations to cover the overhead
  version for all the tested  benchmarks when the size of the
  population is set to $32^2$. When the size of the population is
  too small, there are not enough computations to cover the overhead
@@ -701,10 +698,10 @@ created by the call of kernel functions, CPU/GPU
  communication\index{GPU Challenges!CPU/GPU~communication}s,
  synchronization, and access to global memory. Finally, an
  interesting review on GPU parallel computation in bio-inspired
  communication\index{GPU Challenges!CPU/GPU~communication}s,
  synchronization, and access to global memory. Finally, an
  interesting review on GPU parallel computation in bio-inspired
-algorithms is proposed by Arenas \emph{et al.}
+algorithms is proposed by Arenas et al.
  in~\cite{arenas2011}. \\
  
  in~\cite{arenas2011}. \\
  
-\subsubsection*{Ant Colony Optimization}
+\subsubsection*{Ant colony optimization}
  
  Ant colony optimization
  (ACO\index{Metaheuristics!ant~colony~optimization}) is another
  
  Ant colony optimization
  (ACO\index{Metaheuristics!ant~colony~optimization}) is another
@@ -715,7 +712,7 @@ accelerating the tour construction step performed by each ant by
  taking a task-based parallelism approach, with pheromone
  deposition on the CPU.\\
  
  taking a task-based parallelism approach, with pheromone
  deposition on the CPU.\\
  
-In~\cite{cecilia}, Cecilia \emph{et al.}  present a GPU-based
+In~\cite{cecilia}, Cecilia et al.  present a GPU-based
  implementation of
  ACO\index{Metaheuristics!ant~colony~optimization} for TSP where
  the two steps (tour construction and pheromone update) are
  implementation of
  ACO\index{Metaheuristics!ant~colony~optimization} for TSP where
  the two steps (tour construction and pheromone update) are
@@ -737,10 +734,10 @@ matrices that are easily updated by several threads (one thread
  per matrix entry). The achieved speedups are $21 \times$ for tour
  construction and  $20 \times$ for pheromone updates.\\
  
  per matrix entry). The achieved speedups are $21 \times$ for tour
  construction and  $20 \times$ for pheromone updates.\\
  
-In another work, Tsutsui \emph{et al.}~\cite{tsutsui} propose  a
+In another work, Tsutsui and Fujimoto~\cite{tsutsui} propose  a
  hybrid algorithm combining
  ACO\index{Metaheuristics!ant~colony~optimization} metaheuristic
  hybrid algorithm combining
  ACO\index{Metaheuristics!ant~colony~optimization} metaheuristic
-and tabu search (TS)\index{Metaheuristics!tabu~search} implemented
+and Tabu Search (TS)\index{Metaheuristics!tabu~search} implemented
  on GPU to solve the QAP. A solution of QAP is represented as a
  permutation of ${1,2,..,n}$ with $n$ being the size of the
  problem. The TS algorithm is based on the 2-opt neighborhood
  on GPU to solve the QAP. A solution of QAP is represented as a
  permutation of ${1,2,..,n}$ with $n$ being the size of the
  problem. The TS algorithm is based on the 2-opt neighborhood
@@ -771,7 +768,7 @@ compared to the CPU implementation,  compared with a speedup of
  only $5 \times$
  when MATA is not used. \\
  
  only $5 \times$
  when MATA is not used. \\
  
-\subsubsection*{Particle Swarm Optimization}
+\subsubsection*{Particle swarm optimization}
  In~\cite{zhou2009} Zhou and Tan propose a full GPU implementation
  of a standard PSO algorithm. All the data is stored in global
  memory (velocities, positions, swarm population, etc). Only
  In~\cite{zhou2009} Zhou and Tan propose a full GPU implementation
  of a standard PSO algorithm. All the data is stored in global
  memory (velocities, positions, swarm population, etc). Only
@@ -897,7 +894,7 @@ Algorithmic-level parallelism consists of launching several self-contained algor
  
  The implementation of the multistart model is based on two
  different mapping strategies~\cite{luongMultiStart, audreyANT}:
  
  The implementation of the multistart model is based on two
  different mapping strategies~\cite{luongMultiStart, audreyANT}:
-(1) each local search (LS) is associated to a unique thread and
+(1) each Local Search (LS) is associated to a unique thread and
  (2) each solution (from multiple neighborhoods) is associated to a
  unique thread. The first mapping strategy (one thread per LS)
  presents a big drawback: the number of LS to use should be very
  (2) each solution (from multiple neighborhoods) is associated to a
  unique thread. The first mapping strategy (one thread per LS)
  presents a big drawback: the number of LS to use should be very
@@ -943,13 +940,13 @@ The three frameworks reviewed in this section are
  PUGACE\index{GPU-based frameworks!PUGACE}~\cite{pugace},
  ParadisEO-MO-GPU\index{GPU-based
  frameworks!ParadisEO-MO-GPU}~\cite{paradiseoGPU}, and
  PUGACE\index{GPU-based frameworks!PUGACE}~\cite{pugace},
  ParadisEO-MO-GPU\index{GPU-based
  frameworks!ParadisEO-MO-GPU}~\cite{paradiseoGPU}, and
-libCudaOptimize\index{GPU-based
-frameworks!libCudaOptimize}~\cite{libcuda}. PUGACE\index{GPU-based
+libCUDAOptimize\index{GPU-based
+frameworks!libCUDAOptimize}~\cite{libcuda}. PUGACE\index{GPU-based
  frameworks!PUGACE} is a framework for implementing EAs on GPUs.
  ParadisEO-MO-GPU is an extension of the framework
  ParadisEO~\cite{paradiseo} for implementing s-metaheuristics on
  frameworks!PUGACE} is a framework for implementing EAs on GPUs.
  ParadisEO-MO-GPU is an extension of the framework
  ParadisEO~\cite{paradiseo} for implementing s-metaheuristics on
-GPU. Finally, libCudaOptimize\index{GPU-based
-frameworks!libCudaOptimize} is a  library intended for the
+GPU. Finally, libCUDAOptimize\index{GPU-based
+frameworks!libCUDAOptimize} is a  library intended for the
  implementation of p-metaheuristics on GPU. The three frameworks
  are presented in more detail in the following.
  
  implementation of p-metaheuristics on GPU. The three frameworks
  are presented in more detail in the following.
  
@@ -965,7 +962,7 @@ implementation on GPUs compared to other parallel models for EAs
  (island, master-slave). The main standard evolutionary operators
  are already implemented in PUGACE\index{GPU-based
  frameworks!PUGACE}: different selection strategies, standard
  (island, master-slave). The main standard evolutionary operators
  are already implemented in PUGACE\index{GPU-based
  frameworks!PUGACE}: different selection strategies, standard
-crossover, and mutation operators (\emph{PMX, swap, 2-exchange},
+crossover, and mutation operators (PMX, swap, 2-exchange,
  etc.). Different problem encoding is also supported. The framework
  is organized as a set of modules in which the different
  functionalities  are separated as much as possible in order to
  etc.). Different problem encoding is also supported. The framework
  is organized as a set of modules in which the different
  functionalities  are separated as much as possible in order to
@@ -986,8 +983,7 @@ crossover is taken at the block level and applied to all the
  individuals within the block. It is the decision on the choose
  of the cutting point for the crossover.\\
  
  individuals within the block. It is the decision on the choose
  of the cutting point for the crossover.\\
  
-The framework is validated on standard benchmarks of the quadratic
-assignment problem (QAP). Speedups of $15.44 \times$  to $18.41
+The framework is validated on standard benchmarks of the QAP. Speedups of $15.44 \times$  to $18.41
  \times$ are achieved compared to a CPU implementation of a cEA
  using population sizes rising from 512 to 1024 and from 1024 to
  2048.
  \times$ are achieved compared to a CPU implementation of a cEA
  using population sizes rising from 512 to 1024 and from 1024 to
  2048.
@@ -1005,7 +1001,7 @@ using population sizes rising from 512 to 1024 and from 1024 to
  
  \subsection{ParadisEO-MO-GPU}
  
  
  \subsection{ParadisEO-MO-GPU}
  
-Melab \emph{et al.}~\cite{paradiseoGPU} developed a reusable
+Melab et al.~\cite{paradiseoGPU} developed a reusable
  framework ParadisEO-MO-GPU\index{GPU-based
  frameworks!ParadisEO-MO-GPU} for parallel local search
  metaheuristics (s-metaheuristics) on GPUs. It focuses on the
  framework ParadisEO-MO-GPU\index{GPU-based
  frameworks!ParadisEO-MO-GPU} for parallel local search
  metaheuristics (s-metaheuristics) on GPUs. It focuses on the
@@ -1046,11 +1042,11 @@ evaluation. After all the neighborhood is generated and evaluated,
  it is sent back to the CPU which selects the best solution (See
  Figure~\ref{ch1:fig:paradiseoGPU}).
  
  it is sent back to the CPU which selects the best solution (See
  Figure~\ref{ch1:fig:paradiseoGPU}).
  
-\subsection{libCudaOptimize: an open source library of GPU-based metaheuristics}
+\subsection{libCUDAOptimize: an open source library of GPU-based metaheuristics}
  \index{GPU-based frameworks!libCudaOptimize}
  \index{GPU-based frameworks!libCudaOptimize}
-LibCudaOptimize~\cite{libcuda} is a C++/Cuda open source library
+LibCUDAOptimize~\cite{libcuda} is a C++/CUDA open source library
  for the design and implementation of metaheuristics on GPUs. Until
  for the design and implementation of metaheuristics on GPUs. Until
-now, the metaheuristics supported by LibCudaOptimize are: scatter
+now, the metaheuristics supported by LibCUDAOptimize are: scatter
  search, differential evolution, and particle swarm
  optimization\index{Metaheuristics!particle swarm optimization}.
  Nevertheless, the library is designed in such a way to allow
  search, differential evolution, and particle swarm
  optimization\index{Metaheuristics!particle swarm optimization}.
  Nevertheless, the library is designed in such a way to allow
@@ -1065,9 +1061,9 @@ on CPU.
  
  In this case study, a large neighborhood GPU-based local
  search\index{GPU-based Metaheuristics!GPU-based~local~search}
  
  In this case study, a large neighborhood GPU-based local
  search\index{GPU-based Metaheuristics!GPU-based~local~search}
-method for solving the quadratic 3-dimensional assignment problem
-(Q3AP) will be presented. The local search method is an iterated
-local search\index{Metaheuristics!iterated local search}
+method for solving the Quadratic 3-dimensional Assignment Problem
+(Q3AP) will be presented. The local search method is an Iterated
+Local Search\index{Metaheuristics!iterated local search}
  (ILS)~\cite{stutzle2006ILSforQAP} using an embedded
  TS\index{Metaheuristics!tabu~search} algorithm. The ILS principle
  consists of executing iteratively the embedded local search, each
  (ILS)~\cite{stutzle2006ILSforQAP} using an embedded
  TS\index{Metaheuristics!tabu~search} algorithm. The ILS principle
  consists of executing iteratively the embedded local search, each
@@ -1089,7 +1085,7 @@ ILS algorithm is given by the Algorithm~\ref{ch8:ils_algorithm_template}.\\
       $s^*$=AcceptationCriteria($s^*,s^{*'},history$)\;
  
      }
       $s^*$=AcceptationCriteria($s^*,s^{*'},history$)\;
  
      }
-\caption{Iterated local search template}
+\caption{iterated local search template}
  \label{ch8:ils_algorithm_template}
  \end{algorithm}
  
  \label{ch8:ils_algorithm_template}
  \end{algorithm}
  
@@ -1117,7 +1113,7 @@ ILS algorithm is given by the Algorithm~\ref{ch8:ils_algorithm_template}.\\
  \subsection{The quadratic 3-dimensional assignment problem}
  
  The Q3AP is an extension of the well-known NP-hard QAP. The latter
  \subsection{The quadratic 3-dimensional assignment problem}
  
  The Q3AP is an extension of the well-known NP-hard QAP. The latter
-is one of the most studied problem by the combinatorial
+is one of the most studied problems by the combinatorial
  optimization community due to its wide range of practical
  applications (site planning, schedule problem, computer-aided
  design, etc.) and to its computational challenges since it is
  optimization community due to its wide range of practical
  applications (site planning, schedule problem, computer-aided
  design, etc.) and to its computational challenges since it is
@@ -1127,7 +1123,7 @@ optimization problems.\\
  The Q3AP was first introduced by William P. Pierskalla in
  1967~\cite{Pierskalla1967Q3AP} and, unlike the QAP, the Q3AP is a
  less studied problem. Indeed, the Q3AP was revisited only this
  The Q3AP was first introduced by William P. Pierskalla in
  1967~\cite{Pierskalla1967Q3AP} and, unlike the QAP, the Q3AP is a
  less studied problem. Indeed, the Q3AP was revisited only this
-past years and has recently been used to model some advanced
+past year and has recently been used to model some advanced
  assignment problems such as the symbol-mapping problem encountered
  in wireless communication systems and described
  in~\cite{Hahn2008q3ap}. The Q3AP optimization problem can be
  assignment problems such as the symbol-mapping problem encountered
  in wireless communication systems and described
  in~\cite{Hahn2008q3ap}. The Q3AP optimization problem can be
@@ -1146,7 +1142,7 @@ where
  
  \begin{eqnarray}
    X=(x_{ijl})\in I \cap J \cap L, \label{Q3APConstraints1}\\
  
  \begin{eqnarray}
    X=(x_{ijl})\in I \cap J \cap L, \label{Q3APConstraints1}\\
-  x_{ijl}\in \{0,1\}, \quad i,j,l=0,1,...,n-1. \label{Q3APConstraints2}
+  x_{ijl}\in \{0,1\}, \quad i,j,l=0,1,...,n-1 \label{Q3APConstraints2}
  \end{eqnarray} $I$, $J$, and $L$ sets are defined as follows:
  \begin{displaymath} I=\lbrace X=(x_{ijl}):
  \sum_{j=0}^{n-1}\sum_{l=0}^{n-1}x_{ijl}=1, \quad
  \end{eqnarray} $I$, $J$, and $L$ sets are defined as follows:
  \begin{displaymath} I=\lbrace X=(x_{ijl}):
  \sum_{j=0}^{n-1}\sum_{l=0}^{n-1}x_{ijl}=1, \quad
@@ -1156,7 +1152,7 @@ i=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
  j=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
  \begin{displaymath} L=\lbrace X=(x_{ijl}):
  \sum_{i=0}^{n-1}\sum_{j=0}^{n-1}x_{ijl}=1, \quad
  j=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
  \begin{displaymath} L=\lbrace X=(x_{ijl}):
  \sum_{i=0}^{n-1}\sum_{j=0}^{n-1}x_{ijl}=1, \quad
-l=0,1,...,n-1\rbrace \mathrm{.} \end{displaymath}
+l=0,1,...,n-1\rbrace  \end{displaymath}
  % fin------------------------------------------------------------------------------
  
  Other equivalent formulations of the Q3AP can be found in the literature. A particularly useful one is the
  % fin------------------------------------------------------------------------------
  
  Other equivalent formulations of the Q3AP can be found in the literature. A particularly useful one is the
@@ -1193,12 +1189,12 @@ procedure~\cite{Glover1989TS} starts from an initial feasible
  solution and tries, at each step, to move to a neighboring
  solution that minimizes the fitness (for a minimization case). If
  no such move exists, the neighbor solution that less degrades the
  solution and tries, at each step, to move to a neighboring
  solution that minimizes the fitness (for a minimization case). If
  no such move exists, the neighbor solution that less degrades the
-fitness is chosen as a next move. This enables, TS process to
+fitness is chosen as a next move. This enables the TS process to
  escape local optima. However, this strategy may generate cycles,
  i.e., previous moves can be selected again. To avoid cycles, the
  TS manages a short-term memory that contains the moves that have
  been recently performed. A TS\index{Metaheuristics!tabu~search}
  escape local optima. However, this strategy may generate cycles,
  i.e., previous moves can be selected again. To avoid cycles, the
  TS manages a short-term memory that contains the moves that have
  been recently performed. A TS\index{Metaheuristics!tabu~search}
-template is given by the Algorithm \ref{TS_pseudo_code}.\\
+template is given by Algorithm \ref{TS_pseudo_code}.\\
  %
  %  \begin{algorithm}
  %  \label{TS_pseudo_code}
  %
  %  \begin{algorithm}
  %  \label{TS_pseudo_code}
@@ -1231,7 +1227,7 @@ template is given by the Algorithm \ref{TS_pseudo_code}.\\
            $t = t + 1$\;
  
      }
            $t = t + 1$\;
  
      }
-\caption{Tabu search template}
+\caption{tabu search template}
  \label{TS_pseudo_code}
  \end{algorithm}
  
  \label{TS_pseudo_code}
  \end{algorithm}
  
@@ -1405,7 +1401,7 @@ given number of iterations has been reached.
          Update the tabu list\;
          Copy the chosen solution on GPU device memory\;
      }
          Update the tabu list\;
          Copy the chosen solution on GPU device memory\;
      }
-\caption{Template of an iterated tabu search on GPU for solving the Q3AP}
+\caption{template of an iterated tabu search on GPU for solving the Q3AP}
  \label{ch8:algoITS}
  \end{algorithm}
  
  \label{ch8:algoITS}
  \end{algorithm}
  
@@ -1458,7 +1454,7 @@ each average measurement is shown in small type characters.\\
  \small
  \begin{tabular}{|l|r|r|r|r|r|r|r|r|} \hline
  \multicolumn{1}{|c|}{Instance }& \multicolumn{1}{|c|}{Optimal}& \multicolumn{1}{|c|}{Average}  & \multicolumn{1}{|c|}{Maximal}  & \multicolumn{1}{|c|}{Hits} & \multicolumn{1}{|c|}{CPU} & \multicolumn{1}{|c|}{GPU} & \multicolumn{1}{|c|}{Speed-} & \multicolumn{1}{|c|}{ITS} \\
  \small
  \begin{tabular}{|l|r|r|r|r|r|r|r|r|} \hline
  \multicolumn{1}{|c|}{Instance }& \multicolumn{1}{|c|}{Optimal}& \multicolumn{1}{|c|}{Average}  & \multicolumn{1}{|c|}{Maximal}  & \multicolumn{1}{|c|}{Hits} & \multicolumn{1}{|c|}{CPU} & \multicolumn{1}{|c|}{GPU} & \multicolumn{1}{|c|}{Speed-} & \multicolumn{1}{|c|}{ITS} \\
-& \multicolumn{1}{|c|}{/BKV\footnotemark }& \multicolumn{1}{|c|}{value }& \multicolumn{1}{|c|}{value }& & \multicolumn{1}{|c|}{time }& \multicolumn{1}{|c|}{time }&  \multicolumn{1}{|c|}{up}& \multicolumn{1}{|c|}{iteration}\\ \hline
+& \multicolumn{1}{|c|}{/BKV\footnotemark }& \multicolumn{1}{|c|}{value }& \multicolumn{1}{|c|}{value }& \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{time }& \multicolumn{1}{|c|}{time }&  \multicolumn{1}{|c|}{up}& \multicolumn{1}{|c|}{iter.}\\ \hline
  Nug12-d & $580$   & $615.4$    & $744$   & $35$\% & $87.7$ & $2.5$ & $34.7 \times$& $16$ \\
      &   & \tiny{$41.7$}    & && \tiny{$40.9$} & \tiny{$0.9$} & & \tiny{$6.3$} \\ \hline
  Nug13-d & $1912$  & $1985.4$   & $2100$  & $20$\% & $209.2$ & $3.3$ & $63.5 \times$ & $17$ \\
  Nug12-d & $580$   & $615.4$    & $744$   & $35$\% & $87.7$ & $2.5$ & $34.7 \times$& $16$ \\
      &   & \tiny{$41.7$}    & && \tiny{$40.9$} & \tiny{$0.9$} & & \tiny{$6.3$} \\ \hline
  Nug13-d & $1912$  & $1985.4$   & $2100$  & $20$\% & $209.2$ & $3.3$ & $63.5 \times$ & $17$ \\
@@ -1518,7 +1514,7 @@ three-dimensional assignment problem (Q3AP). \\
  Implementing parallel
  metaheuristics\index{Metaheuristics!parallel~metaheuristics} on
  GPU architectures poses new issues and challenges such as memory
  Implementing parallel
  metaheuristics\index{Metaheuristics!parallel~metaheuristics} on
  GPU architectures poses new issues and challenges such as memory
-management\index{GPU Challenges!memory~management},  finding
+management\index{GPU Challenges!memory~management};  finding
  efficient mapping strategies between tasks to parallelize; and the
  GPU threads, thread divergence\index{GPU
  Challenges!thread~divergence}, and synchronization. Actually, most
  efficient mapping strategies between tasks to parallelize; and the
  GPU threads, thread divergence\index{GPU
  Challenges!thread~divergence}, and synchronization. Actually, most