new

[book_gpu.git] / BookGPU / Chapters / chapter9 / ch9.tex
diff --git a/BookGPU/Chapters/chapter9/ch9.tex b/BookGPU/Chapters/chapter9/ch9.tex

index dce18898e9d1b9caf9e747c7d285782c4743d62e..0ebc4464899aa36c9f29dc45822ff385b20a2adb 100644 (file)
--- a/BookGPU/Chapters/chapter9/ch9.tex
+++ b/BookGPU/Chapters/chapter9/ch9.tex
@@ -1,81 +1,111 @@
-\chapterauthor{Malika Mehdi}{CERIST Research Center, DTISI, 3 rue des frères Aissou, 16030 Ben-Aknoun, Algiers, Algeria}
+\chapterauthor{Malika Mehdi and Ahc\`{e}ne Bendjoudi}{CERIST Research Center, DTISI, 3 rue des frères Aissou, 16030 Ben-Aknoun, Algiers, Algeria}
  
  \chapterauthor{Lakhdar Loukil}{University of Oran, Algeria}
  
  
  \chapterauthor{Lakhdar Loukil}{University of Oran, Algeria}
  
-\chapterauthor{Ahc\`{e}ne Bendjoudi}{CERIST Research Center, DTISI, 3 rue des frères Aissou, 16030 Ben-Aknoun, Algiers, Algeria}
+%\chapterauthor{Ahc\`{e}ne Bendjoudi}{CERIST Research Center, DTISI, 3 rue des frères Aissou, 16030 Ben-Aknoun, Algiers, Algeria}
  
  
-\chapterauthor{Nouredine Melab}{Université Lille 1, LIFL/UMR CNRS 8022, 59655-Villeneuve d’Ascq cedex, France}
+\chapterauthor{Nouredine Melab}{Université Lille 1, LIFL/UMR CNRS 8022, 59655-Villeneuve d'Ascq cedex, France}
  
  
-\chapter{Parallel GPU-accelerated Metaheuristics}
+\chapter{Parallel GPU-accelerated metaheuristics}
  \label{chapter9}
  \section{Introduction}
  \label{chapter9}
  \section{Introduction}
-This chapter presents GPU-based parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics}, challenges and issues related to the particularities of the GPU architecture and a synthesis on the different implementation strategies used in the literature. The implementation of parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} on GPUs is not straightforward. The traditional models used in CPUs must be rethought to meet the new requirements of GPU architectures. This chapter is organized as follows. 
-Combinatorial optimization\index{Combinatorial~optimization} and  resolution methods are introduced in Section~\ref{ch8:sec:optim}. The main traditional parallel models used for metaheuristics are recalled in Section~\ref{ch8:sec:paraMeta}. Section~\ref{ch8:sec:challenges} highlights the  main challenges related to the GPU implementation of metaheuristics. A 
-state-of-the-art of GPU-based parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} is summarized in Section~\ref{ch8:sec:state}. Finally, a case study is presented in Section~\ref{ch8:sec:case} and some concluding remarks are given in Section~\ref{ch8:conclusion}
-
-\section{Combinatorial optimization} 
+This chapter presents GPU-based parallel
+metaheuristics\index{metaheuristics!parallel metaheuristics},
+challenges, and issues related to the particularities of the GPU
+architecture and a synthesis on the different implementation
+strategies used in the literature. The implementation of parallel
+metaheuristics on
+GPUs is not straightforward. The traditional models used in CPUs
+must be rethought to meet the new requirements of GPU
+architectures. This chapter is organized as follows. Combinatorial
+optimization\index{combinatorial optimization} and  resolution
+methods are introduced in Section~\ref{ch8:sec:optim}. The main
+traditional parallel models used for metaheuristics are recalled
+in Section~\ref{ch8:sec:paraMeta}.
+Section~\ref{ch8:sec:challenges} highlights the  main challenges
+related to the GPU implementation of metaheuristics. A
+state-of-the-art of GPU-based parallel
+metaheuristics is
+summarized in Section~\ref{ch8:sec:state}. In Section~\ref{ch8:sec:frameworks}, the main developed GPU-based
+frameworks for metaheuristics are described. Finally, a case study
+is presented in Section~\ref{ch8:sec:case} and some concluding
+remarks are given in Section~\ref{ch8:conclusion}
+
+\section{Combinatorial optimization}
  \label{ch8:sec:optim}
  
  \label{ch8:sec:optim}
  
-Combinatorial optimization\index{Combinatorial~optimization} (CO) is a branch of applied and discrete mathematics. 
-It consists in finding optimal configuration(s) among a finite set of possible configurations 
-(or solutions) of a given combinatorial optimization problem (COP).  The set of all possible solutions noted $S$ is called solution space or search space. Each solution  in $S$ is defined by its real cost calculated by an objective function. COPs are generally defined as follows~\cite{blumMeta}:\\  %(Definition~\ref{def:cops})
+Combinatorial optimization (CO) is a branch of applied and discrete mathematics.
+It consists in finding optimal configuration(s) among a finite set of possible configurations
+(or solutions) of a given combinatorial optimization problem (COP).  The set of all possible solutions noted $S$ is called solution space or search space. Each solution  in $S$ is defined by its real cost calculated by an objective function. COPs are generally defined as follows~\cite{blumMeta}:  %(Definition~\ref{def:cops})
  
  
  %\begin{minipage}{0.5\linewidth}
  
  
  %\begin{minipage}{0.5\linewidth}
-A combinatorial problem $P=(S,f)$ can be defined by: 
-\begin{itemize} 
+A combinatorial problem $P=(S,f)$ can be defined by
+\begin{itemize}
  \item a set of decision variables $X$,
  \item an objective function $f$ to optimize (minimize or maximize) over the set $S$,
  \item a set of decision variables $X$,
  \item an objective function $f$ to optimize (minimize or maximize) over the set $S$,
-\item subject to constraints on the decision variables.\\
+\item subject to constraints on the decision variables.
  \end{itemize}
  %\end{minipage}
  
  \end{itemize}
  %\end{minipage}
  
-COPs are generally formulated as mixed integer programming problems (MIPS) and most of them are NP-hard~\cite{garey}. According to the quality level of solutions and deadlines required
+COPs are generally formulated as mixed integer programming
+problems (MIPS) and most of them are NP-hard~\cite{garey}.
+According to the quality level of solutions and deadlines required
  for solving an optimization problem, two classes of optimization
  methods can be distinguished: \emph{exact methods} and
  for solving an optimization problem, two classes of optimization
  methods can be distinguished: \emph{exact methods} and
-\emph{approximate methods}. Exact methods allow one to reach optimal
-solution(s) of the handled optimization problem with a proof of its or 
-their optimality. The most known methods of this class are the
-\emph{branch and bound technique}, \emph{dynamic programming},
-\emph{constraint programming}, \emph{A* algorithm}, etc. However,
-optimization problems, whether practical or academic, are often
-complex and NP-hard. Moreover, a large number of real-life
-optimization problems encountered in science, engineering, economics
-and business are usually large-size problems for which the size of
-the potential solution domain increases dramatically with the size
-of the problem instance. Such problems cannot be tackled using exact
-methods due to the excessive computation time needed by these
-methods to catch optimal solution(s). In such
-situation, approximate methods (or \textit{metaheuristics}) offer an
-alternative approach to solve these problems. Indeed, these methods
-allow one to reach good quality solutions in reasonable computation time
-compared to exact methods but with no guarantee to find optimal or
-even bounded solutions. This is due to the nature of the search
-process adopted by these approaches which consists in performing a
-search in a subset of the whole search space.\\
+\emph{approximate methods}. Exact methods allow one to reach
+optimal solution(s) of the handled optimization problem with a
+proof of its or their optimality. The known methods of this
+class are the \emph{branch and bound technique}, \emph{dynamic
+programming}, \emph{constraint programming}, and \emph{A*
+algorithm}. However, optimization problems, whether practical or
+academic, are often complex and NP-hard. Moreover, a large number
+of real-life optimization problems encountered in science,
+engineering, economics, and business are usually large-sized
+problems for which the size of the potential solution domain
+increases dramatically with the size of the problem instance. Such
+problems cannot be tackled using exact methods due to the
+excessive computation time needed by these methods to find optimal
+solution(s). In such a situation, approximate methods (or
+\textit{metaheuristics}) offer an alternative approach to solve
+these problems. Indeed, these methods allow one to reach good
+quality solutions in reasonable computation time compared to exact
+methods but with no guarantee to find optimal or even bounded
+solutions. This is due to the nature of the search process adopted
+by these approaches which consists of performing a
+search in a subset of the whole search space.
  
  Regarding the number of solutions considered at each iteration in
  the search process, two classes of metaheuristics can be
  distinguished~\cite{talbi2009mfdti}: \textit{solution-based} and
  
  Regarding the number of solutions considered at each iteration in
  the search process, two classes of metaheuristics can be
  distinguished~\cite{talbi2009mfdti}: \textit{solution-based} and
-\textit{population-based} metaheuristics. In the rest of this chapter, the term \emph{s-metaheuristic} refers to solution-based metaheuristic and the term \emph{p-metaheuristic} refers to population-based metaheuristic. In s-metaheuristics, the search process starts with a single solution (generally set at random) and iteratively improves it by exploring
-its neighborhood in the search space. The most known
-methods in this class are local search methods that include
-\emph{simulated annealing}\index{Metaheuristics!simulated~annealing}~\cite{Kirkpatrick1983SA}, \emph{tabu
-search}~\cite{Glover1989TS}, \emph{iterated local search\index{Metaheuristics!iterated local search}}~\cite{stutzle2006ILSforQAP},  \emph{variable neighborhood search}~\cite{HansenMladenovic1997VNS}, etc. 
-
-Unlike s-metaheuristics, p-metaheuristics start with a population of solutions and implement an iterative process that evolves the current population towards a new
-population of better quality solutions. The process is repeated
-until a stopping criterion is satisfied. \emph{Evolutionary
-algorithms}, \emph{swarm optimization} and \emph{ant colonies} fall
-into this class.
+\textit{population-based} metaheuristics. In the rest of this
+chapter, the term \emph{s-metaheuristic} refers to solution-based
+metaheuristic and the term \emph{p-metaheuristic} refers to
+population-based metaheuristic. In s-metaheuristics, the search
+process starts with a single solution (generally set at random)
+and iteratively improves it by exploring its neighborhood in the
+search space. The most known methods in this class are local
+search methods that include \emph{simulated
+annealing}\index{metaheuristics!simulated annealing}~\cite{Kirkpatrick1983SA},
+\emph{tabu search}~\cite{Glover1989TS}, \emph{iterated local
+search\index{metaheuristics!iterated local
+search}}~\cite{stutzle2006ILSforQAP}, and \emph{variable
+neighborhood search}~\cite{HansenMladenovic1997VNS}.
+
+Unlike s-metaheuristics, p-metaheuristics start with a population
+of solutions and implement an iterative process that evolves the
+current population towards a new population of better quality
+solutions. The process is repeated until a stopping criterion is
+satisfied. \emph{Evolutionary algorithms}, \emph{swarm
+optimization}, and \emph{ant colonies} fall into this class.
  
  
  \section{Parallel models for metaheuristics}\label{ch8:sec:paraMeta}
  
  
  \section{Parallel models for metaheuristics}\label{ch8:sec:paraMeta}
-Optimization problems, whether real-life or academic, are more often
-NP-hard and CPU time and/or memory consuming. Metaheuristics allow
-to significantly reduce the computational time of the search process
-but remain time-consuming particularly when it comes to tackle
-large-size problems. \\
+Optimization problems, whether real-life or academic, are more
+often NP-hard and CPU time and/or memory consuming. Metaheuristics
+allow the significant reduction of the computational time of the search
+process but remain time-consuming particularly when it comes
+dealing with large-sized problems. 
  
  The use of parallelism in the design of metaheuristics is a relevant
  approach that is widely adopted by the combinatorial optimization
  
  The use of parallelism in the design of metaheuristics is a relevant
  approach that is widely adopted by the combinatorial optimization
@@ -84,334 +114,713 @@ community for various reasons:
  \begin{itemize}
    \item One of the main goals of parallelism is to reduce the search
  time. This will allow the design of high performance optimization
  \begin{itemize}
    \item One of the main goals of parallelism is to reduce the search
  time. This will allow the design of high performance optimization
-methods and to solve large-size optimization problems.
+methods and the solving of large-sized optimization problems.
  
    \item Sequential processor architectures have reached their
  
    \item Sequential processor architectures have reached their
-physical limit that prevents creating faster processors. The current
-trend of microprocessor manufacturers consists in placing multiple
-cores on a single chip. Nowadays, laptops and workstations are
-multi-core processors. In addition, the evolution of network
-technologies and the proliferation of broadband networks have made
-possible the emergence of clusters of workstations (COWs), networks
-of workstations (NOWS) and large-scale networks of machines (grids)
-as platforms for high performance computing.
+physical limit which prevents creating faster processors. The
+current trend of microprocessor manufacturers consists of placing
+multicores on a single chip. Nowadays, laptops and
+workstations are multicore processors. In addition, the evolution
+of network technologies and the proliferation of broadband
+networks have made possible the emergence of clusters of
+workstations (COWs), networks of workstations (NOWS), and
+large-scale networks of machines (grids) as platforms for
+high-performance computing.
  \end{itemize}
  
  From the granularity of parallelism point of view, three major parallel
  \end{itemize}
  
  From the granularity of parallelism point of view, three major parallel
-models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{Metaheuristics!algorithmic-level parallelism},
-\emph{iteration-level} \index{Metaheuristics!iteration-level parallelism} and \emph{solution-level} as illustrated in Figure~\ref{ch8:fig:paraMeta}. \\
+models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{metaheuristics!algorithmic-level parallelism},
+\emph{iteration-level}\index{metaheuristics!iteration-level parallelism}, and \emph{solution-level} as illustrated in Figure~\ref{ch8:fig:paraMeta}. 
  
  \begin{figure}[h!]
  \centerline{\includegraphics[width=0.6\textwidth]{Chapters/chapter9/figures/paraMeta.pdf}}
  
  \begin{figure}[h!]
  \centerline{\includegraphics[width=0.6\textwidth]{Chapters/chapter9/figures/paraMeta.pdf}}
-\caption{Parallel models for metaheuristics}
+\caption{Parallel models for metaheuristics.}
  \label{ch8:fig:paraMeta}
  \end{figure}
  
  \begin{itemize}
  
  \label{ch8:fig:paraMeta}
  \end{figure}
  
  \begin{itemize}
  
-\item{In algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} parallel model, several self-contained
-metaheuristics are launched in parallel. The parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics}
-may start with identical or different solutions (s-metaheuristics case) or populations (p-metaheuristics
-case), their parameter settings such as the size of tabu list for
-tabu search\index{Metaheuristics!tabu~search}, transition probabilities for ant colonies, mutation and
-crossover probabilities for evolutionary algorithm\index{Metaheuristics!evolutionary~algorithm}s may be the same
-or different. The parallel processes may evolve independently or in
-cooperative manner. In cooperative parallel models, the algorithms
-exchange information related to the search during evolution in order 
-to find better and more robust solutions.}
-
-\item{In iteration-level\index{Metaheuristics!iteration-level parallelism} parallel model, the focus is on the
-parallelization of each iteration of the metaheuristic. Indeed,
-metaheuristics are generally iterative search processes. Moreover,
-the most resource-consuming part of a metaheuristic is the
-evaluation of the generated solutions at each iteration. For
-s-metaheuristics (e. g. tabu search\index{Metaheuristics!tabu~search}, simulated
+\item{In the
+algorithmic-level parallel model, several self-contained metaheuristics
+are launched in parallel. The parallel
+metaheuristics\index{metaheuristics!parallel metaheuristics} may
+start with identical or different solutions (s-metaheuristics
+case) or populations (p-metaheuristics case). Their parameter
+settings such as the size of tabu list for tabu
+search\index{metaheuristics!tabu search}, transition probabilities
+for ant colonies, mutation and crossover probabilities for
+evolutionary
+algorithm\index{metaheuristics!evolutionary algorithm}s may be the
+same or different. The parallel processes may evolve independently
+or in a cooperative manner. In cooperative parallel models, the
+algorithms exchange information related to the search during
+evolution in order to find better and more robust solutions.}
+
+\item{In the iteration-level parallel model, the focus is on the parallelization
+of each iteration of the metaheuristic. Indeed, metaheuristics are
+generally iterative search processes. Moreover, the most
+resource-consuming part of a metaheuristic is the evaluation of
+the generated solutions at each iteration. For s-metaheuristics
+(e.g., tabu search\index{metaheuristics!tabu search}, simulated
  annealing, variable neighborhood search), the evaluation and
  generation of the neighborhood is the most time-consuming step of
  annealing, variable neighborhood search), the evaluation and
  generation of the neighborhood is the most time-consuming step of
-the algorithm particularly when it comes to deal with large
+the algorithm particularly when it comes to dealing with large
  neighborhood sets. In this parallel model, the neighborhood is
  neighborhood sets. In this parallel model, the neighborhood is
-decomposed into partitions, each partition is evaluated in a
-parallel way. For p-metaheuristics (e. g.
-evolutionary algorithm\index{Metaheuristics!evolutionary~algorithm}s, ant colonies, swarm optimization),
-iteration-level\index{Metaheuristics!iteration-level parallelism} parallel model arises naturally since these
-metaheuristics deal with a population of independent solutions. In
-evolutionary algorithm\index{Metaheuristics!evolutionary~algorithm}s, for instance, the iteration-level\index{Metaheuristics!iteration-level parallelism} model
-consists in decomposing the whole population into several partitions where
-each partition is evaluated in parallel.}
-
-\item{In solution-level\index{Metaheuristics!solution-level~parallelism} parallel model, the focus is on the
-parallelization of the evaluation of a single solution. This model
-is useful when the objective function and/or the constraints are
-time and/or memory consuming. Unlike the two previous parallel
-models, the solution-level\index{Metaheuristics!solution-level~parallelism} parallel model is problem-dependent.}
+decomposed into partitions, and each partition is evaluated in a
+parallel way. For p-metaheuristics (e.g., evolutionary
+algorithms, ant
+colonies, swarm optimization), the
+iteration-level
+parallel model arises naturally since these metaheuristics deal
+with a population of independent solutions. In evolutionary
+algorithms, for
+instance, the iteration-level model consists of decomposing the whole population
+into several partitions where each partition is evaluated in
+parallel.}
+
+\item{In the
+solution-level
+parallel model, the focus is on the parallelization of the
+evaluation of a single solution. This model is useful when the
+objective function and/or the constraints are time and/or memory
+consuming. Unlike the two previous parallel models, the
+solution-level\index{metaheuristics!solution-level parallelism}
+parallel model is problem-dependent.}
  \end{itemize}
  \end{itemize}
-
+\clearpage
  \section{Challenges for the design of GPU-based metaheuristics}
  \section{Challenges for the design of GPU-based metaheuristics}
-\label{ch8:sec:challenges} 
-
-Developing GPU-based parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} is not straightforward. 
-The parallel models have to be rethought to meet the new requirements 
-of the GPU architecture. Several major issues have to be taken into account both at design and 
-implementation levels to develop efficient metaheuristics on GPU. 
-These issues are mainly related to the size and latency of the GPU memories, 
-thread synchronization and divergence, the distribution of tasks and data
-transfer between the CPU and GPU~\cite{luongMultiStart}.
+\label{ch8:sec:challenges}
+
+Developing GPU-based parallel
+metaheuristics\index{metaheuristics!parallel metaheuristics} is
+not straightforward. The parallel models have to be rethought to
+meet the new requirements of the GPU architecture. Several major
+issues have to be taken into account both at design and
+implementation levels to develop efficient metaheuristics on GPU.
+These issues are mainly related to the size and latency of the GPU
+memories, thread synchronization and divergence, the distribution
+of tasks, and data transfer between the CPU and
+GPU~\cite{luongMultiStart}.
  
  \subsection{Data placement on a hierarchical memory}
  
  \subsection{Data placement on a hierarchical memory}
-\index{GPU Challenges!data~placement}
-During the execution of metaheuristics on GPU, the different threads 
-may access multiple data structures from multiple memory spaces. These 
-memories have different sizes and access latencies. Nevertheless, faster 
-memories (registers, shared and constant memories) are generally very 
-small in size and the larger memories (global memory) are relatively slower. 
-On the other hand, p-metaheuristics require the exploration of a large 
-amount of individuals to diversify the search. Moreover, an efficient 
-execution of s-metaheuristics requires exploring large neighborhoods. 
-Thus, programmers have to take into account this point to efficiently place the different data structures of the metaheuristic on the different memories to benefit from both the 
-faster memories and the larger ones. A deep study has to be performed 
-on both the metaheuristic data structures (size and access frequency) 
-and the GPU memories (size and access latency) to identify which data 
-will be placed on which memory. Generaly, the most accessed ones should 
-be put on faster memories (registers, shared memory) and larger ones on 
-the lager memories (global memory). Moreover, an efficient mapping between 
-threads and the corresponding metaheuristic elements (one neighbor per thread, 
-one individual per thread, single population per threads block, one ant per 
-thread, etc.) must be defined to unsure a maximum occupancy of the GPU and 
-to cover CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication} and memory access times.\\
-
-According to the used metaheuristic and to the handled problem, the data 
-values may have different types and different ranges of their values. The data 
-types should then be chosen carefully and the ranges of the data values should 
-be analyzed to minimize the amount of occupied memory space.\\
-
-In addition to the size and latency of GPU memories issues, the memory access pattern 
-is another important issue to be dealt with to speedup GPU-based metaheuristics. 
-Indeed, the different memories have been designed to achieve specific features that 
-programmers must take into account to optimize their codes and then to benefit from 
-these features. For instance, the global memory is optimized for coalesced accesses. 
-The texture, respectively the constant memory are read-only memories and optimized 
-for uncoalesced accesses respectively for simultaneously accesses of all threads to 
-the same location~\cite{luongMultiStart}. Therefore, to improve the performance of the kernel execution, the programmers should put coalesced data on global memory, uncoalesced read-only (e.g. metaheuristic instance data) on texture, concurrent read-only on constant memory (e.g. 
-data for fitness evaluation that all threads concurrently access), and the most accessed 
-data structures on the shared memory (e.g. population of individuals for a CUDA thread 
-block).
+\index{GPU!data placement} During the execution of
+metaheuristics on GPU, the different threads may access multiple
+data structures from multiple memory spaces. These memories have
+different sizes and access latencies. Nevertheless, faster
+memories (registers, shared and constant memories) are generally
+very small in size, and the larger memories (global memory) are
+relatively slower. However, p-metaheuristics require the
+exploration of a large amount of individuals to diversify the
+search. Moreover, an efficient execution of s-metaheuristics
+requires exploring large neighborhoods. Thus, programmers have to
+take into account this point to efficiently place the different
+data structures of the metaheuristic on the different memories to
+benefit from both the faster memories and the larger ones. A deep
+study has to be performed on both the metaheuristic data
+structures (size and access frequency) and the GPU memories (size
+and access latency) to identify which data will be placed on which
+memory. Generally, the most accessed ones should be put on faster
+memories (registers, shared memory) and larger ones on the larger
+memories (global memory). Also, an efficient mapping between
+threads and the corresponding metaheuristic elements (one neighbor
+per thread, one individual per thread, single population per
+threads block, one ant per thread, etc.) must be defined to ensure
+a maximum occupancy of the GPU and
+to cover CPU/GPU communication\index{GPU!CPU/GPU communication} and memory access times.
+
+According to the used metaheuristic and to the handled problem, the data
+values may have different types and different ranges of their values. The data
+types should then be chosen carefully and the ranges of the data values should
+be analyzed to minimize the amount of occupied memory space.
+
+In addition to the size and latency of GPU memory issues, the
+memory access pattern is another important issue to be dealt with
+to speedup GPU-based metaheuristics. Indeed, the different
+memories have been designed to achieve specific features that
+programmers must take into account to optimize their codes and
+then to benefit from these features. For instance, the global
+memory is optimized for coalesced accesses. The texture and the
+constant memory are read-only memories. The texture is optimized
+for uncoalesced accesses and the constant memory is optimized for
+simultaneously accesses of all threads to the same
+location~\cite{luongMultiStart}. Therefore, to improve the
+performance of the kernel execution, the programmers should put
+coalesced data on global memory, uncoalesced read-only data (e.g.,
+metaheuristic instance data) on the texture, concurrent read-only
+data (e.g., data for fitness evaluation that all threads
+concurrently access) on the constant memory, and the most accessed
+data structures (e.g., population of individuals for a CUDA thread
+block) on the shared memory.
  
  \subsection{Threads synchronization}
  
  \subsection{Threads synchronization}
-\index{GPU Challenges!threads~synchronization}
-The thread synchronization issue is caused by both the GPU architecture and the synchronization requirements 
-of the implemented method. Indeed, GPUs are based on a multi-core architecture 
-organized into several multi-processors (Streaming Multiprocessor SM) supporting the 
-SPMD model (Single Program Multiple Data). Each SM contains several cores executing 
-the same instruction of different threads following the SIMD model (Single Instruction 
-Multiple Data). These threads belong to a warp (a group of 32 threads) and handle 
-different data elements. Furthermore, an efficient execution of applications on GPU 
-is achieved when launching a large amount of threads (thousands of threads)~\cite{CUDA}. 
-However, the execution order of these thousands of threads is unknown by the programmer 
-making the prediction of their execution order a challenging issue. On an other hand, 
-the developer has to control explicitly the threads through the insertion of barrier 
-synchronizations in the codes to avoid concurrent accesses to data structures and to meet some   requirements related to data-dependent synchronizations. 
+\index{GPU!threads synchronization} The thread
+synchronization issue is caused by both the GPU architecture and
+the synchronization requirements of the implemented method.
+Indeed, GPUs are based on a multicore architecture organized into
+several multi-processors (Streaming Multiprocessor SM) supporting
+the SPMD model (Single Program Multiple Data). Each SM contains
+several cores executing the same instruction of different threads
+following the SIMD model (Single Instruction Multiple Data). These
+threads belong to a warp (a group of 32 threads) and handle
+different data elements. An efficient execution of applications on
+GPU is achieved when launching a large amount of threads
+(thousands of threads)~\cite{CUDA}. However, the execution order
+of these thousands of threads is unknown by the programmer which
+makes the prediction of their execution order a challenging issue.
+Plus, the developer has to control explicitly the
+threads through the insertion of barrier synchronizations in the
+codes to avoid concurrent accesses to data structures and to meet
+some requirements related to data-dependent synchronizations.
  
  \subsection{Thread divergence}
  
  
  \subsection{Thread divergence}
  
-Thread divergence\index{GPU Challenges!thread~divergence} is another challenging issue in GPU-based metaheuristics~\cite{cecilia, pugace, audreyANT}. Generally, metaheuristics contain irregular loops and conditional instructions when generating and evaluating the neighborhood (s-metaheuristics), respectively the population (p-metaheuristics) in the same block. In addition,  the decision of applying or not a crossover or a mutation on an individual in a genetic algorithm\index{Metaheuristics!genetic~algorithm} and the exploration of different paths using an ant colony\index{Metaheuristics!ant~colony~optimization} are random operations. Threads of the same warp have to execute simultaneously instructions leading to different branches whereas in a SIMD model the threads of a same warp execute the same instruction at a time. Consequently, the different branches of a conditional instruction which is data-dependent lead to a serial execution of the different threads degrading the performance of the application in 
-terms of execution time. The challenge here is then to revisit the traditional irregular metaheuristic codes to eliminate these divergences. 
-
+Thread divergence\index{GPU!thread divergence} is
+another challenging issue in GPU-based
+metaheuristics~\cite{cecilia, pugace, audreyANT}. Generally,
+metaheuristics contain irregular loops and conditional
+instructions when generating and evaluating the neighborhood
+(s-metaheuristics), and the population (p-metaheuristics) in the
+same block. In addition,  the decision to apply a crossover or a
+mutation on an individual in a genetic
+algorithm and the
+exploration of different paths using an ant
+colony\index{metaheuristics!ant colony optimization} are random
+operations. Threads of the same warp have to execute
+instructions simultaneously  leading to different branches whereas
+in an SIMD model the threads of a same warp execute the same
+instruction. Consequently, the different branches of a conditional
+instruction which is data-dependent lead to a serial execution of
+the different threads degrading the performance of the application
+in terms of execution time. The challenge here is then to revisit
+the traditional irregular metaheuristic codes to eliminate these
+divergences.
+\clearpage
  \subsection{Task distribution and CPU/GPU communication}
  
  \subsection{Task distribution and CPU/GPU communication}
  
-The performance of GPU-based metaheuristics in terms of execution time could be improved 
-by choosing the most appropriate parallel model (algorithmic-level\index{Metaheuristics!algorithmic-level parallelism}, instruction-level, 
-solution-level\index{Metaheuristics!solution-level~parallelism}). Moreover, an efficient decomposition of the metaheuristic and efficient 
-assignment of code portions between the CPU and GPU should be adopted. The objective 
-is to take benefit from the GPU computing power without affecting the efficiency and 
-the behavior of the metaheuristic and without losing performance in CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication}
-and memory accesses. In order to decide which part of the metaheuristic will be executed 
-on which component, one should perform a careful analysis on the serial code of the 
-metaheuristic, identify the compute-intensive tasks (e.g. exploration of the neighborhood, 
-evaluation of individuals, etc.), and then offload them to the GPU, while the reminder 
-tasks still run on the CPU in a serial way. \\
-
-On another hand, the CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication} is done through the global memory which is a 
-slow memory making the memory transfer between the CPU and GPU time-consuming and can 
-significantly degrade the performance of the application. Therefore, accesses to this 
-memory should be optimized by minimizing the amount of transferred data between the 
-two components in order to reduce the communication time and then the whole execution 
-time of the metaheuristic. 
+The performance of GPU-based metaheuristics in terms of execution
+time could be improved by choosing the most appropriate parallel
+model (algorithmic-level, instruction-level,
+solution-level).
+Moreover, an efficient decomposition of the metaheuristic and an
+efficient assignment of code portions between the CPU and GPU
+should be adopted. The objective is to take benefit from the GPU
+computing power without affecting the efficiency and the behavior
+of the metaheuristic and without losing performance in CPU/GPU
+communication\index{GPU!CPU/GPU communication} and
+memory accesses. In order to decide which part of the
+metaheuristic will be executed on which component, one should
+perform a careful analysis on the serial code of the
+metaheuristic, identify the compute-intensive tasks (e.g.,
+exploration of the neighborhood, evaluation of individuals), and
+then offload them to the GPU, while the remaining
+tasks still run on the CPU in a serial way. 
+
+The CPU/GPU communication is done through the global
+memory which is a slow memory making the memory transfer between
+the CPU and GPU time-consuming which can significantly degrade the
+performance of the application. Accesses to this memory should be
+optimized by minimizing the amount of transferred data between the
+two components in order to reduce the communication time and,
+therefore, the whole execution time of the metaheuristic.
  
  \section{State-of-the-art parallel metaheuristics on GPUs}
  \label{ch8:sec:state}
  
  \section{State-of-the-art parallel metaheuristics on GPUs}
  \label{ch8:sec:state}
-After more than two decades of research  devoted by the combinatorial optimisation 
-community  for developing adequate parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} for different types of 
-parallel architectures (clusters, supercomputers and grids), the actual developement 
-of GPGPU brings new challenges for parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} on SIMD architectures.\\
-
-The first works on metaheuristic algorithms implemented on GPUs have started on old 
-graphics cards before the appearance of modern GPUs equipped with high level 
-programming interfaces such as CUDA and OpenCL. Among these pioneering works we cite 
-the work of Wong \emph{et al.}~\cite{wongOldGPU2006} ($2006$) dealing with the implementation 
-of EAs on graphics processing cards and the work by Catala \emph{et al.} in~\cite{catala2007} ($2007$) where the ACO\index{Metaheuristics!ant~colony~optimization} algorithm is implemented on old GPU architectures. Yu \emph{et al.}~\cite{yu2005} and Li \emph{et al.}~\cite{li2007} proposed a full parallelization of genetic algorithm\index{Metaheuristics!genetic~algorithm}s  on old GPU architectures using shader libraries based on Direct3D and OpenGL.\\
-
-Such architectures are based on pre-configured pipelined stages used to accelerate 
-the transformation of 3D geometric primitives into pixels. Implementing a general-purpose 
-algorithm on such pre-configured architectures is very hard and requires the 
-tailoring of the algorithm's data and instructions to fit the pipelined stages 
-of the GPU. Since then, GPU architectures have evolved to become programmable using high level 
-programming interfaces. In this section we will focus only on recent state-of-the-art works 
-dealing with metaheuristics implementation on modern programmable GPUs. 
-In this review two classes are considered: (1) s-metaheuristics on GPUs and  
-(2) p-metaheuristics on GPUs. A comparative study is 
-done between the main works and a classification of the different existing strategies is proposed in Section~\ref{ch8:sec:synthesis}. 
+After more than two decades of research by the combinatorial optimisation
+community devoted to developing adequate parallel metaheuristics for different types of
+parallel architectures (clusters, supercomputers and grids), the actual developement
+of General Perpose GPU (GPGPU) brings new challenges for parallel metaheuristics on SIMD architectures.
+
+The first works on metaheuristic algorithms implemented on GPUs
+started on old graphics cards before the appearance of modern GPUs
+equipped with high-level programming interfaces such as CUDA and
+OpenCL. Among these pioneering works we cite the work of Wong et al.~\cite{wongOldGPU2006} dealing with the
+implementation
+of EAs on graphics processing cards and the work by Catala et al. in~\cite{catala2007} where the ACO\index{metaheuristics!ant colony optimization} algorithm
+is implemented on old GPU architectures. Yu et al.~\cite{yu2005} and
+Li et al.~\cite{li2007} proposed a full parallelization of genetic
+algorithms  on old GPU architectures using
+shader libraries based on Direct3D and OpenGL.
+
+Such architectures are based on preconfigured pipelined stages
+used to accelerate the transformation of 3D geometric primitives
+into pixels. Implementing a general-purpose algorithm on such
+preconfigured architectures is very hard and requires the
+tailoring of the algorithm's data and instructions to fit the
+pipelined stages of the GPU. Since then, GPU architectures have
+evolved to become programmable using high-level programming
+interfaces. In this section we will focus only on recent
+state-of-the-art works dealing with metaheuristics implementation
+on modern programmable GPUs. In this review two classes are
+considered: (1) s-metaheuristics on GPUs and (2) p-metaheuristics
+on GPUs. A comparative study is done of the main works and a
+classification of the different existing strategies is proposed in
+Section~\ref{ch8:sec:synthesis}.
  
  \subsection{Implementing solution-based metaheuristics on GPUs}
  
  
  
  \subsection{Implementing solution-based metaheuristics on GPUs}
  
  
-A very basic local search algorithm starts with an initial solution generated either at random 
-or by the mean of a specific heuristic and is based on two elementary components: the generation 
-of neighborhood structures using an elementary move function and a selection function which 
-determines which solution in the current neighborhood 
-will replace the actual search point. The search continues as long as there is 
-improvement and stops when there is no better solution in the current neighborhood. 
-The exploration (or evaluation) of the different moves of a given neighborhood is done 
-independently for each move. Thus, the easiest way to accelerate a local search algorithm 
-is to parallelize the evaluation of the neighborhood (instruction-level parallelism). This 
-is  by far the most used scheme in the literature for parallelizing local search algorithms 
-on GPUs. Nevertheless, small neighborhoods may lead to  non optimal occupation of the CUDA 
-threads which may lead in its turn to an overhead due to the communication and memory latencies. 
-Therefore, large neighborhoods are necessary for efficient implementation of 
-local searches on GPUs.\\
-
-Luong \emph{et al.}~\cite{luong2010large} proposed efficient mappings for large neighborhood 
-structures on GPUs. In this work, three different neighborhoods are studied and mapped to the 
-hierarchical GPU for binary problems. The three neighborhoods are based on the \emph{Hamming} distance. The move operators used in the three neighborhoods consider \emph{Hamming} distances of 1, 2 and 3 respectively (it consists on flipping the binary value of one, two or three positions at a time in the candidate binary vector, respectively). In~\cite{luong2010large}, each thread is associated to a unique solution in the neighborhood. The addressed issue is how to efficiently map the different neighborhoods on the device memory. More explicitly, how to calculate the memory index of each solution associated to each CUDA thread's \textit{id}. 
-%For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes  in the candidate vector. 
-The three neighborhoods are  implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{Metaheuristics!tabu~search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes.\\ % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
-
-In the same context, Deevacq \emph{et al.}~\cite{audreyANT} proposed two parallelization strategies inspired by the multi-walk parallelization strategy, of a 3-opt iterated local search\index{Metaheuristics!iterated local search} algorithm (ILS) over a CPU/GPU architecture. In the first strategy, each LS is associated to a unique CUDA thread and improves a unique solution by generating 
-its neighborhood. On the contrary, the second strategy associates each solution to a 
-CUDA block and the neighborhood exploration is parallelized on the block threads. In the first strategy, since several LS are executed on different solutions on each MP, the data structures 
-should be stored on the global memory while the exploration of a single solution at a 
-time in the second strategy allows the use of the shared memory to store the related 
-data structures. The two strategies have been experimented on standard benchmarks of 
-the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cities. The results indicate that increasing the number of solutions to be explored simultaneously improves the speedup in the two strategies  but on the same time it decreases final solution quality. The greater speedup factor reached by the second strategy is $6 \times$.\\
-
-The same strategy is used by Luong \emph{et al.} in~\cite{luongMultiStart} to implement multi-start parallel local search algorithms (a special case of the algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} parallel model 
-where several homogeneous LS algorithms are used). The multi-start model is combined with iteration-level\index{Metaheuristics!iteration-level parallelism} parallelism: several LS algorithms are managed by the CPU and the neighborhood evaluation step of each algorithm is parallelized on the GPU (each GPU thread is 
-associated with one neighbor and executes the same evaluation function kernel). 
-The advantage of such a model is that it allows a  high occupancy of the GPU 
-threads. Nevertheless, memory management\index{GPU Challenges!memory~management} causes new issues due to the quantity of 
-data to store and to communicate between CPU    and GPU. A second proposition for 
-implementing the same model on GPU consists in implementing the whole LS processes 
-on GPU and each GPU thread is associated to a unique LS algorithm. This solves the 
-communication issue encountered in the first model. In addition, a memory management\index{GPU Challenges!memory~management}
-strategy is proposed to improve the efficiency of the algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} model: 
-texture memory is used to avoid memory latency due to uncoalesced memory accesses. 
-The proposed approaches are implemented on the quadratic assignment problem (QAP) using 
-CUDA. The acceleration rates obtained for the 
-algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} with usage of texture memory rise from $7.8\times$ to $12\times$ (for different 
-QAP benchmark sizes). \\
-
-Janiak \emph{et al.}~\cite{Janiak_et_al_2008} implemented two algorithms for 
-TSP and the flow shop scheduling problem (FSP). These algorithms are based on a 
-multi-start tabu search\index{Metaheuristics!tabu~search} model. Both of the two algorithms exploit multi-core CPU and GPU. 
-A full parallelization on GPU is adopted using shader libraries where each thread is 
-mapped with one tabu search\index{Metaheuristics!tabu~search}. However, even though their experiments report that the 
-use of GPU speedups the serial execution almost $16 \times$, the mapping of one 
-thread with one tabu search\index{Metaheuristics!tabu~search} requires a large number of local search algorithms to 
-cover the memory access latency. The same mapping policy is adopted by Zhu \emph{et al.} in~\cite{zhu_et_al_2008} (one thread is associated to one local search) solving the quadratic assignment problem but using the CUDA toolkit instead of shader libraries.\\
-
-Luong \emph{et al.}~\cite{luong2012ppsn} proposed a GPU-based implementation of hybrid metaheuristics on heterogeneous parallel architectures (multi core CPU  coupled to one GPU).  The challenges of using such a heterogeneous architecture consist on the task distribution between the CPU cores and the GPU in such a way to have optimal performances. Among the three traditional parallel models (solution-level\index{Metaheuristics!solution-level~parallelism}, iteration-level\index{Metaheuristics!iteration-level parallelism} and algorithmic-level\index{Metaheuristics!algorithmic-level parallelism}), the authors pointed out that the most convenient model for the 
-considered heterogeneous architecture is a hybrid  model combining iteration-level\index{Metaheuristics!iteration-level parallelism} 
-and algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} models. Several CPU threads execute several instances of the same 
-S-metaheuristic in parallel while the GPU device is associated to one CPU core and  used 
-to accelerate the neighborhood calculation of several S-metaheuristics in the same time. 
- In order to efficiently exploit the remaining CPU cores, a load 
-balancing heuristic is also proposed in order to decide on the number of additional 
-S-metaheuristics to launch on the remaining CPU cores relatively to the efficiency of the GPU calculations. The proposed approach is applied to the QAP using several instances of FANT algorithm~\cite{taillardFant}. \\
-
-All the previous works  exploit the same parallel models used on CPUs based on the 
-task parallelism. A different implementation approach is proposed by Paul in~\cite{gerald2012} 
-to implement a simulated annealing\index{Metaheuristics!simulated~annealing} (SA) algorithm for the QAP on GPUs. Indeed, the author used 
-a preinitialized matrix \emph{delta} in which the incremental evaluation of simple swap moves are 
-calculated and stored relatively to the initial  permutation $p$. For the GPU implementation, 
-the authors used the parallel implementation of neighborhood exploration. The time consuming 
-tasks in the SA-matrix are identified by the authors as: updating the matrix and passing through it to select the next accepted move. To initialize the delta matrix, several threads from different blocks explore different segments of the matrix (different moves) at the same time.  In order to select the next accepted swap several threads are also used. Starting from the last move, a group of threads explore different subsets of the delta matrix. The shared memory is used to pre-load all the necessary elements for a given group of threads responsible of the updating of the delta matrix. The main difference in this work compared to the previous works resids in the introduction of a data parallelism using the precalculated delta matrix. The use of this matrix allows to increase the number of threads involved in the evaluation of a single move. Experimentations are done on standard QAP instances from the QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are achieved by the GPU implementation compared 
-to the same sequential implementation on CPU using SA-matrix.\\
-
-\subsection{Implementing population-based metaheuristics on GPUs} 
-
-State-of-the-art works dealing with the implementation of p-metaheuristics on GPUs generally rely 
-on parallel models and research efforts done for parallelizing different classes of p-metaheuristics over different types of parallel architectures: supercomputers, clusters and computational grids. Three main classes of p-metaheuristics are considered in this section: evolutionary algorithm\index{Metaheuristics!evolutionary~algorithm}s (EAs), ant colony\index{Metaheuristics!ant~colony~optimization} optimization (ACO\index{Metaheuristics!ant~colony~optimization}) and particle swarm optimization\index{Metaheuristics!particle swarm optimization} (PSO). 
-
-\subsubsection*{Evolutionary Algorithms} 
-
-Traditional parallel models for EAs are classified in three main classes: coarse grain models using several sub-populations (islands), master-slave models used for the parallelization of CPU intensive steps (evaluation and transformation) and cellular models based on the use of one population disposed (generally) on a toroidal grid. 
-
-The three traditional models have been implemented on GPUs by several researchers for different 
-optimization problems. The main chalenges to be raised when implementing the traditional models on GPUs concern (1) the saturation of the GPU in order to cover memory latency by calculations, and (2) efficent usage of the hierarchical GPU  memories.\\
-
-In~\cite{kannan}, Kannan and Ganji presented a CUDA implementation of the drug discovery application 
-Autodock (molecular docking application). Autodock uses a genetic algorithm\index{Metaheuristics!genetic~algorithm} to find optimal docking 
-positions of a ligand to a protein. The most time consuming task in Autodock is the fitness function 
-evaluation. The fitness function used for a docking problem consists in calculating the energy  of the ligand-protein complex (sum of intermolecular energies). The authors explored two different approaches to evaluate the fitness function on GPU. In the first approach, each GPU thread calculates the energy function of a single individual. This approach requires the use of large size populations to saturate the GPU (thousands of individuals). In the second approach each individual is associated to one thread block. The evaluation of the energy function is performed by the threads of the same block. This allows the use of medium population sizes (hundreds of individuals) and the acceleration of a single fitness 
-evaluation. Another great advantage of the per block approach resides in the use of shared memory instead of global memory to store all the information related to each individual. The obtained speedups range from  $10 \times$ to $47 \times$ for population sizes 
-ranging from $50$ to $10000$. \\ 
-
-Maitre \emph{et al.}~\cite{maitre2009} also exploited the master-slave parallelism of EAs on GPUs using EASEA. EASEA is a C-like metalanguage for easy development of EAs. The user writes a description  of the problem specific components (fitness function, problem representation, etc) in EASEA. The code  is then compiled to obtain a ready to use evolutionary algorithm\index{Metaheuristics!evolutionary~algorithm}. The EASEA compiler  uses genetic algorithm\index{Metaheuristics!genetic~algorithm}LIB and EO Libraries to produce C++ or JAVA written EA codes. In~\cite{maitre2009}, the authors  proposed an extension of EASEA to produce CUDA code from the EASEA files. This extension has been used  to generate a master-worker parallel EA in which the sequential algorithm is maintained on CPU and the population is sent to GPU for evaluation. Two problems have been considered during the experiments: a benchmark mathematical function and a real problem (molecular structure prediction). In order to maximize the GPU 
-occupation, very large populations are used (from $2000$ to $20000$). Even though transferring such large 
-populations from the CPU to the GPU device memory at every generation is very costly, the authors reported important speedups on the two problems on a GTX260 card:  $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management\index{GPU Challenges!memory~management} of the populations on GPU.\\ 
-
-The master-slave  model is efficient when the fitness function is highly time intensive. 
-Neverethless, it requires the use of large size populations in order to saturate the GPU unless the per-block is used (one individual per-block) when the acceleration of the fitness function itself is possible. The use of many sub-populations of medium sizes is another way to obtain a maximum occupation of the GPU. This is coarse grained prallelism (island model).\\
-
-The coarse grained model is  used by Pospichal \emph{et al.} in~\cite{pospichal10} to implement a parallel genetic algorithm\index{Metaheuristics!genetic~algorithm} on GPU. In this work the entire genetic algorithm\index{Metaheuristics!genetic~algorithm} is implemented on GPU. This choice is motivated by the overhead induced by the CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication} when only population evaluation is performed on GPU. Each population island is affected to a CUDA thread block and each thread is responsible of a unique individual. Sub-populations are stored on shared memory of each block. Nevertheless, because inter-block communications are not possible on the CUDA architecture, the islands evolve independently in each block and migrations are performed 
-asynchronously through the global memory. That is, after a given number of generations, selected individuals for migration from each island are copied to the GPU global memory part of the neighbor island and then to its shared memory to replace the worst individuals in the local population. The experiments are performed on three benchmark mathematical functions. During these experiments, the island sizes are varied from $2$ to $256$ individuals and island numbers from $1$ to $1024$. The maximum performance is achieved for high number of islands and increasing population sizes.\\ 
-
-The same strategy is also adopted by Tsuitsui and Fujimoto in~\cite{tsutsuiGAQAP}  to implement a coarse grained genetic algorithm\index{Metaheuristics!genetic~algorithm} on GPU to solve the quadratic assignment problem (QAP). Initially, several sub-populations are created on CPU and transfered to the global memory. The sub-populations are organized in the global memory into blocks of $8$ individuals in such a way to allow coalesced memory access by the threads of the same  thread block. Each sub-population is allocated to a single thread block in the GPU and transfered to the shared memory to start evolution. Population evaluation and transformation are done in parallel by the different threads of a given block. Migration is also done through the global memory. 
-Experiments are performed on standard QAP benchmarks from the QAPLIB~\cite{burkard1991qaplib}. The GPU implementation reached speedups of $2.9\times$ to $12.6 \times$ compared to a single core implementation of a coarse grained genetic algorithm\index{Metaheuristics!genetic~algorithm} on a Intel i7 processor.\\
-
-Nowotniak \emph{et al.}~\cite{nowotniak} proposed a GPU-based implementation of quantum inspired genetic algorithm\index{Metaheuristics!genetic~algorithm} called QIGA. The used parallel model is a hierarchical model based on two levels: each thread in a block transforms a unique individual and a different population is assigned to each block. The algorithm is entirely run on GPU. A different instance of the QIGA is run on each block and the computations have been shared between 8 GPUs. This approach is very convenient to speedup the experimental process on metaheuristics 
-that require several independent runs of the same algorithm in order to asses statistical efficiency. The populations are stored in the shared memory, the data matrix used for fitness evaluation is placed in read only constant memory and finally seeds for random numbers generated on the GPU are stored in the global memory.\\
-
-In coarse grained parallelism, the use of the per-block approach to implement 
-the islands (one sub-population per thread block) is almost natural  and it 
-allows the use of shared memory to store the sub-populations. Fine grained 
-parallel models for EAs (cellular EAs) have been used by many authors to implement 
-parallel EAs on GPUs. Indeed, the fine grained parallelism of cEAs fits perfectly to 
-the SIMD architecture of the GPU. \\
-
-Pinel \emph{et al.} in~\cite{pinel2012JPDC} developed a highly parallel synchronous cellular genetic algorithm\index{Metaheuristics!genetic~algorithm} (CGA), called GraphCell, to solve the independent task scheduling problem on GPU architectures. In CGAs, the population is arranged into a two dimensional toroidal grid where only neighboring solutions are allowed to interact with each other  during the recombination step. In GraphCell, two recombination operators for CGA are especially designed to run efficiently on GPU. Indeed, instead of assigning a single thread to each solution of the population to perform the recombination, in GraphCell, a single thread is assigned to each task  of each solution. Offsprings are created by independently modifying the assignment of some tasks in the current solution. Mainly, each thread chooses 
-one neighboring solution in the grid as second parent using different selection strategies and assigns one  task of  the solution (first parent) to the machine for which the same task is assigned in the second parent.  This 
-way, the number of threads used for the recombination step is equal to: population size $\times$  size of the solution (number of tasks). This leads to a high number of threads used to accelerate the recombination operators especially when dealing with large instances of the problem. In addition to the recombination operators, the rest of the CGA steps are  also parallelized on GPU (fitness evaluation, mutation and replacement).\\
-
-A similar work is proposed by Vidal and Alba  in~\cite{albaCGAGPU} where a  CGA using a toroidal grid is completely implemented on GPU. A direct mapping between the population and the GPU threads is done. At each step, several threads execute the same operations on each individual independently: initialization, computing the neighborhood, selection, crossover, mutation and evaluation.  A synchronization is done for all threads to perform the replacement stage and form the next generation. 
-All the data of the algorithm is placed on the global memory. Several experiments have been performed on a set of standard benchmark functions with different grid sizes ranging from $32^2$ to $512^2$. 
-The speedups reached by the GPU implementation against the CPU version range from $5\times$ to $24\times$ and increases as the size of the population increases. However, the CPU implementation run faster than the GPU version for all the tested  benchmarks when the size of the population is set to $32^2$. When the size of the population is too small, there is not enough computations to cover the overhead created by the call of kernel functions, CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication}s, synchronisation and acces to global memory. Finally, an interesting review on GPU parallel computation in bio-inspired algorithms is proposed by Arenas \emph{et al.} in~\cite{arenas2011}. \\
-
-\subsubsection*{Ant Colony Optimization} 
-
-Ant colony optimization (ACO\index{Metaheuristics!ant~colony~optimization}) is another p-metaheuristic subject to parallelization on GPUs. State-of-the-art works on parallelizing ACO\index{Metaheuristics!ant~colony~optimization} focuses on accelerating the tour construction step performed by each ant by taking a task-based parallelism approach, with pheromone deposition on the CPU. 
-
-In~\cite{cecilia}, Cecilia \emph{et al.}  presented a GPU-based implementation of ACO\index{Metaheuristics!ant~colony~optimization} for TSP where the two steps (tour construction and pheromone update) are parallelized on the GPU. A data parallelism approach is used to enhance the performance of the tour  construction step. The authors used two categories of artificial ants: queen ants associated with CUDA thread-blocks and worker ants associated with CUDA threads. A queen ant represents a simulated ant and worker ants collaborate with the queen ant to accelerate the decision about the  next city to visit. The tour construction step of each queen ant is accelerated. Each worker ant  maintains a history of the search in a tabu list containing a chronological 
-ordering of the already visited cities. This memory is used to determine the feasible neighborhood. After all queen ants have constructed their tours, the pheromone trails are updated. For pheromone update, several GPU techniques are also used to increase the data bandwidth of the application mainly by the use of precalculated matrices that are easily updated by several threads (one thread per matrix entry). The achieved speedups are $21 \times$ for tour construction and  $20 \times$ for pheromone updates.\\
-
-In another work, Tsuitsui \emph{et al.}~\cite{tsutsui} proposed  a hybrid algorithm combining ACO\index{Metaheuristics!ant~colony~optimization} metaheuristic and Tabu search (TS)\index{Metaheuristics!tabu~search} implemented on GPU to solve the QAP. A solution of QAP is represented as a permutation of ${1,2,..,n}$ with $n$ being the size of the problem. The TS algorithm is based on the 2-opt neighborhood (swapping of two elements $(i,j)$ in the permutation). The authors pointed out that the move cost of each neighbor depends on the couple $(i,j)$. Two groups of moves are formed according to the move cost. In order to avoid thread divergence\index{GPU Challenges!thread~divergence} within the same warp, the neighborhood evaluation is parallelized in such a way to assign only moves of the same cost to each thread warp. This strategy is called MATA for Move-cost Adjusted Thread Assignment. Concerning the memory management\index{GPU Challenges!memory~management}, all the data of the ACO\index{Metaheuristics!
-ant~colony~optimization} (population, pheromone matrix), QAP matrices, TS tabu list, are placed on the global memory of the GPU. Shared memory is used only for working data common to all threads in a given block.
- All the 
-steps of the hybrid algorithm ACO\index{Metaheuristics!ant~colony~optimization}-TS (ACO\index{Metaheuristics!ant~colony~optimization} initialization, pheromone update, construct solutions, applying TS) are implemented as kernel functions on the GPU. The GPU/CPU communication are only used to transfer the best so far solution in order to verify termination conditions. The experiments performed on standard QAP benchmarks showed that the GPU implementation using MATA obtained a  speedup of $19 \times$  compared to the CPU implementation,  against a speedup of only $5 \times$  when MATA is not used. \\ 
-
-\subsubsection*{Particle Swarm Optimization} 
-In~\cite{zhou2009} Zhou and Tan proposed a full GPU implementation of a standard PSO algorithm. All the data is stored in global memory (velocities, positions, swarm population, etc). Only working data is copied to shared memory by each thread. The four steps of the PSO have been parallelized on GPU:  Fitness evaluation of the swarm, update of local best and global best  of each particle and update of velocity and position of each particle. The same strategy is used to parallelize the first and last  steps: the evaluation of fitness functions is performed in parallel for each dimension  by several threads. It is the case for the update of position and velocities of each particle: one dimension at a time is updated for the whole swarm by several threads. Random numbers needed for  updating the velocities and positions for the whole PSO processes are generated on CPU at the starting of the algorithm and transferred to the GPU global memory. For the steps 2 and 3 (update of local 
-best and global best of each particle), the GPU threads are mapped to the N particles of the swarm one to one. Experiments done on four benchmark functions show speedups ranging from $3.7 \times$ to $9.0 \times$ for swarm sizes ranging from $400$ to $2800$.\\
+A very basic local search algorithm starts with an initial
+solution generated either at random or by the mean of a specific
+heuristic and is based on two elementary components: the
+generation of neighborhood structures using an elementary move
+function and a selection function that determines which solution
+in the current neighborhood will replace the actual search point.
+The search continues as long as there is improvement and stops
+when there is no better solution in the current neighborhood. The
+exploration (or evaluation) of the different moves of a given
+neighborhood is done independently for each move. Thus, the
+easiest way to accelerate a local search algorithm is to
+parallelize the evaluation of the neighborhood (instruction-level
+parallelism). This is  by far the most used scheme in the
+literature for parallelizing local search algorithms on GPUs.
+Nevertheless, small neighborhoods may lead to  nonoptimal
+occupation of the CUDA threads which may lead in turn to an
+overhead due to the communication and memory latencies. Therefore,
+large neighborhoods are necessary for efficient implementation of
+local searches on GPUs.
+
+Luong et al.~\cite{luong2010large} proposed efficient
+mappings for large neighborhood structures on GPUs. In this work,
+three different neighborhoods are studied and mapped to the
+hierarchical GPU for binary problems. The three neighborhoods are
+based on the Hamming distance. The move operators used in
+the three neighborhoods consider Hamming distances of 1, 
+2, and 3 (this consists on flipping the binary value of one, two,
+or three positions at a time in the candidate binary vector).
+In~\cite{luong2010large}, each thread is associated to a unique
+solution in the neighborhood. The addressed issue is how to
+efficiently map the different neighborhoods on the device memory,
+more explicitly, how to calculate the memory index of each
+solution associated to each CUDA thread's \textit{id}.
+%For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes  in the candidate vector.
+The three neighborhoods are  implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{metaheuristics!tabu search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes. % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
+
+In the same context, Deevacq et al.~\cite{audreyANT}
+proposed two parallelization strategies inspired by the multiwalk
+parallelization strategy, of a 3-opt iterated local
+search algorithm (ILS)
+over a CPU/GPU architecture. In the first strategy, each Local
+Search (LS) is associated to a unique CUDA thread and improves a
+unique solution by generating its neighborhood. The second
+strategy associates each solution to a CUDA block and the
+neighborhood exploration is parallelized on the block threads. In
+the first strategy, since several LS are executed on different
+solutions on each Multi Processor (MP), the data structures should
+be stored on the global memory, while the exploration of a single
+solution at a time in the second strategy allows the use of the
+shared memory to store the related data structures. The two
+strategies have been experimented on standard benchmarks of
+the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cities. The results indicate that increasing the number of solutions to be explored simultaneously improves the speedup in the two strategies, but at the same time it decreases final solution quality. The greater speedup factor reached by the second strategy is $6 \times$.
+
+The same strategy is used by Luong et al.
+in~\cite{luongMultiStart} to implement multistart parallel local
+search algorithms (a special case of the
+algorithmic-level parallel model where several homogeneous LS
+algorithms are used). The multistart model is combined with
+iteration-level
+parallelism: several LS algorithms are managed by the CPU and the
+neighborhood evaluation step of each algorithm is parallelized on
+the GPU (each GPU thread is associated with one neighbor and
+executes the same evaluation function kernel). The advantage of
+such a model is that it allows a  high occupancy of the GPU
+threads. Nevertheless, memory management causes new issues due to the
+quantity of data to store and to communicate between CPU     and
+GPU. A second proposition for implementing the same model on GPU
+consists of implementing the whole LS processes on GPU with each
+GPU thread being associated to a unique LS algorithm. This solves
+the communication issue encountered in the first model. In
+addition, a memory management strategy is proposed to improve the
+efficiency of the
+algorithmic-level model: texture memory is used to avoid memory latency
+due to uncoalesced memory accesses. The proposed approaches are
+implemented on the quadratic assignment problem (QAP) using CUDA.
+The acceleration rates obtained for the
+algorithmic-level with usage of texture memory rise from $7.8\times$ to
+$12\times$ (for different
+QAP benchmark sizes). 
+
+Janiak et al.~\cite{Janiak_et_al_2008} implemented two
+algorithms for TSP and the flow-shop scheduling problem (FSP).
+These algorithms are based on a multistart tabu
+search model. Both of the 
+algorithms exploit multicore CPU and GPU. A full parallelization
+on GPU is adopted using shader libraries where each thread is
+mapped with one tabu search.
+However, even though their experiments report that the use of GPU
+speedups the serial execution almost $16 \times$, the mapping of
+one thread with one tabu search
+requires a large number of local search algorithms to
+cover the memory access latency. The same mapping policy is adopted by Zhu et al. in~\cite{zhu_et_al_2008} (one thread is associated to one local search) solving the quadratic assignment problem but using the CUDA toolkit instead of shader libraries.
+
+Luong et al.~\cite{luong2012ppsn} proposed a GPU-based
+implementation of hybrid metaheuristics on heterogeneous parallel
+architectures (multicore CPU  coupled to one GPU).  The challenge
+of using such a heterogeneous architecture is how to distribute
+tasks between the CPU cores and the GPU in such a way to have
+optimal performances. Among the three traditional parallel models
+(solution-level,
+iteration-level
+and algorithmic-level), the authors point out that the most convenient
+model for the considered heterogeneous architecture is a hybrid
+model combining
+iteration-level
+and algorithmic-level models. Several CPU threads execute several instances
+of the same S-metaheuristic in parallel while the GPU device is
+associated to one CPU core and  used to accelerate the
+neighborhood calculation of several S-metaheuristics at the same
+time.
+ In order to efficiently exploit the remaining CPU cores, a load-balancing heuristic is also proposed in order to decide on the number of additional
+S-metaheuristics to launch on the remaining CPU cores relative to the efficiency of the GPU calculations. The proposed approach is applied to the QAP using several instances of the Fast Ant Colony Algorithm (FANT)~\cite{taillardFant}. 
+
+All the previously noted works  exploit the same parallel models
+used on CPUs based on the task parallelism. A different
+implementation approach is proposed by Paul in~\cite{gerald2012}
+to implement a simulated
+annealing (SA) algorithm
+for the QAP on GPUs. Indeed, the author used a preinitialized
+matrix \emph{delta} in which the incremental evaluation of simple
+swap moves are calculated and stored relative to the initial
+permutation $p$. For the GPU implementation, the authors used the
+parallel implementation of neighborhood exploration. The
+time-consuming tasks in the SA-matrix are identified by the
+authors as updating the matrix and passing through it to select
+the next accepted move. To initialize the delta matrix, several
+threads from different blocks explore different segments of the
+matrix (different moves) at the same time.  In order to select the
+next accepted swap, several threads are also used. Starting from
+the last move, a group of threads explores different subsets of
+the delta matrix. The shared memory is used to preload all the
+necessary elements for a given group of threads responsible for
+the updating of the delta matrix. The main difference in this work
+compared to the previous works resides in the introduction of a
+data parallelism using the precalculated delta matrix. The use of
+this matrix allows the increase in the number of threads
+involved in the evaluation of a single move. Experimentations are
+done on standard QAP instances from the
+QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are
+achieved by the GPU implementation compared
+to the same sequential implementation on CPU using SA-matrix.
+
+\subsection[Implementing population-based metaheuristics\hfill\break on GPUs]{Implementing population-based metaheuristics on GPUs}
+
+State-of-the-art works dealing with the implementation of
+p-metaheuristics on GPUs generally rely on parallel models and
+research efforts done for parallelizing different classes of
+p-metaheuristics over different types of parallel architectures:
+supercomputers, clusters, and computational grids. Three main
+classes of p-metaheuristics are considered in this section:
+evolutionary
+algorithms (EAs), ant
+colony optimization
+(ACO), and particle
+swarm optimization\index{metaheuristics!particle swarm
+optimization} (PSO).
+\clearpage
+\subsubsection*{Evolutionary algorithms}
+
+Traditional parallel models for EAs are classified in three main
+classes: coarse-grain models using several subpopulations
+(islands), master-slave models used for the parallelization of CPU
+intensive steps (evaluation and transformation), and cellular
+models based on the use of one population disposed (generally) on
+a toroidal grid.
+
+The three traditional models have been implemented on GPUs by several researchers for different
+optimization problems. The main chalenges to be raised when implementing the traditional models on GPUs concern (1) the saturation of the GPU in order to cover memory latency by calculations, and (2) efficent usage of the hierarchical GPU  memories.
+
+In~\cite{kannan}, Kannan and Ganji present a CUDA implementation
+of the drug discovery application Autodock (molecular docking
+application). Autodock uses a genetic
+algorithm to find optimal
+docking positions of a ligand to a protein. The most
+time-consuming task in Autodock is the fitness function
+evaluation. The fitness function used for a docking problem
+consists of calculating the energy  of the ligand-protein complex
+(sum of intermolecular energies). The authors explore two
+different approaches to evaluate the fitness function on GPU. In
+the first approach, each GPU thread calculates the energy function
+of a single individual. This approach requires the use of
+large-sized populations to saturate the GPU (thousands of
+individuals). In the second approach each individual is associated
+with one thread block. The evaluation of the energy function is
+performed by the threads of the same block. This allows the use of
+medium population sizes (hundreds of individuals) and the
+acceleration of a single fitness evaluation. Another great
+advantage of the per block approach resides in the use of shared
+memory instead of global memory to store all the information
+related to each individual. The obtained speedups range from  $10
+\times$ to $47 \times$ for population sizes
+ranging from $50$ to $10000$. 
+
+Maitre et al.~\cite{maitre2009} also exploited the
+master-slave parallelism of EAs on GPUs using EASEA. EASEA is a
+C-like metalanguage for easy development of EAs. The user writes a
+description  of the problem-specific components (fitness function,
+problem representation, etc) in EASEA. The code  is then compiled
+to obtain a ready-to-use evolutionary
+algorithm. The EASEA
+compiler  uses genetic
+algorithm LIB and EO
+Libraries to produce C++ or JAVA written EA codes.
+In~\cite{maitre2009}, the authors  proposed an extension of EASEA
+to produce CUDA code from the EASEA files. This extension has been
+used  to generate a master-slave parallel EA in which the
+sequential algorithm is maintained on CPU and the population is
+sent to GPU for evaluation. Two problems have been considered
+during the experiments: a benchmark mathematical function and a
+real problem (molecular structure prediction). In order to
+maximize the GPU occupation, very large populations are used (from
+$2000$ to $20000$). Even though transferring such large
+populations from the CPU to the GPU device memory at every generation is very costly, the authors report important speedups on the two problems on a GTX260 card:  $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management of the populations on GPU.
+
+The master-slave  model is efficient when the fitness function is
+highly time intensive. Nevertheless, it requires the use of
+large-sized populations in order to saturate the GPU unless the
+per-block is used (one individual perblock) when the acceleration
+of the fitness function itself is possible. The use of many
+subpopulations of medium sizes is another way to obtain a maximum
+occupation of the GPU. This is coarse-grained parallelism (island
+model).
+
+The coarse-grained model is  used by Pospichal et al.
+in~\cite{pospichal10} to implement a parallel genetic
+algorithm on GPU. In this
+work the entire genetic
+algorithm is implemented
+on GPU. This choice is motivated by the overhead engendered by the
+CPU/GPU communication
+when only population evaluation is performed on GPU. Each
+population island is mapped with a CUDA thread block and each
+thread is responsible for a unique individual. Subpopulations are
+stored on shared memory of each block. Nevertheless, because
+interblock communications are not possible on the CUDA
+architecture, the islands evolve independently in each block, and
+migrations are performed
+asynchronously through the global memory. That is, after a given number of generations, selected individuals for migration from each island are copied to the GPU global memory part of the neighbor island and then to its shared memory to replace the worst individuals in the local population. The experiments are performed on three benchmark mathematical functions. During these experiments, the island sizes are varied from $2$ to $256$ individuals and island numbers from $1$ to $1024$. The maximum performance is achieved for high number of islands and increasing population sizes.
+
+The same strategy is also adopted by Tsutsui and Fujimoto
+in~\cite{tsutsuiGAQAP}  to implement a coarse-grained genetic
+algorithm on GPU to solve
+the QAP. Initially, several subpopulations are created on CPU and
+transferred to the global memory. The subpopulations are organized
+in the global memory into blocks of $8$ individuals in such a way
+to allow coalesced memory access by the threads of the same thread
+block. Each sub-population is allocated to a single thread block
+in the GPU and transfered to the shared memory to start evolution.
+Population evaluation and transformation are done in parallel by
+the different threads of a given block. Migration is also done
+through the global memory. Experiments are performed on standard
+QAP benchmarks from the QAPLIB~\cite{burkard1991qaplib}. The GPU
+implementation reached speedups of $2.9\times$ to $12.6 \times$
+compared to a single core implementation of a coarse-grained
+genetic algorithm on a
+Intel i7 processor.
+
+Nowotniak and Kucharski~\cite{nowotniak} proposed a GPU-based
+implementation of a Quantum Inspired Genetic Algorithm (QIGA). The 
+parallel model used is a hierarchical model based on two levels: each
+thread in a block transforms a unique individual and a different
+population is assigned to each block. The algorithm is run
+entirely on GPU. A different instance of the QIGA is run on each
+block and the computations have been shared between 8 GPUs. This
+approach is very convenient to speed up the experimental process
+on metaheuristics that require several independent runs of the
+same algorithm in order to assess statistical efficiency. The
+populations are stored in the shared memory, the data matrix used
+for fitness evaluation is placed in read only constant memory, and
+finally seeds for random numbers generated on the GPU are stored
+in the global memory.
+
+In coarse-grained parallelism, the use of the per-block approach
+to implement the islands (one subpopulation per thread block) is
+almost natural  and it allows the use of shared memory to store
+the subpopulations. Fine-grained parallel models for EAs (cellular
+EAs) have been used by many authors to implement parallel EAs on
+GPUs. Indeed, the fine-grained parallelism of EAs fits perfectly
+to the SIMD architecture of the GPU. 
+
+Pinel et al. in~\cite{pinel2012JPDC} developed a highly
+parallel synchronous cellular genetic
+algorithm (CGA), called
+GraphCell, to solve the independent task scheduling problem on GPU
+architectures. In CGAs, the population is arranged into a
+two-dimensional toroidal grid where only neighboring solutions are
+allowed to interact with each other  during the recombination
+step. In GraphCell, two recombination operators for CGA are
+especially designed to run efficiently on GPU. Indeed, instead of
+assigning a single thread to each solution of the population to
+perform the recombination, in GraphCell, a single thread is
+assigned to each task  of each solution. Offsprings are created by
+independently modifying the assignment of some tasks in the
+current solution. Mainly, each thread chooses one neighboring
+solution in the grid as second parent using different selection
+strategies and assigns one  task of  the solution (first parent)
+to the machine for which the same task is assigned in the second
+parent.  This way, the number of threads used for the
+recombination step is equal to population size $\times$  size of
+the solution (number of tasks). This leads to a high number of
+threads used to accelerate the recombination operators especially
+when dealing with large instances of the problem. In addition to
+the recombination operators, the rest of the CGA steps are  also
+parallelized on GPU (fitness evaluation, mutation, and
+replacement).
+
+A similar work is proposed by Vidal and Alba  in~\cite{albaCGAGPU}
+where a  CGA using a toroidal grid is completely implemented on
+GPU. A direct mapping between the population and the GPU threads
+is done. At each step, several threads execute the same operations
+on each individual independently: initialization, computing the
+neighborhood, selection, crossover, mutation, and evaluation.  A
+synchronization is done for all threads to perform the replacement
+stage and form the next generation. All the data of the algorithm
+is placed on the global memory. Several experiments have been
+performed on a set of standard benchmark functions with different
+grid sizes ranging from $32^2$ to $512^2$. The speedups reached by
+the GPU implementation against the CPU version range from
+$5\times$ to $24\times$ and increase as the size of the population
+increases. However, the CPU implementation runs faster than the GPU
+version for all the tested  benchmarks when the size of the
+population is set to $32^2$. When the size of the population is
+too small, there are not enough computations to cover the overhead
+created by the call of kernel functions, CPU/GPU
+communications,
+synchronization, and access to global memory. Finally, an
+interesting review on GPU parallel computation in bio-inspired
+algorithms is proposed by Arenas et al.
+in~\cite{arenas2011}. 
+
+\subsubsection*{Ant colony optimization}
+
+Ant colony optimization
+(ACO) is another
+p-metaheuristic subject to parallelization on GPUs.
+State-of-the-art works on parallelizing
+ACO focus on
+accelerating the tour construction step performed by each ant by
+taking a task-based parallelism approach, with pheromone
+deposition on the CPU.
+
+In~\cite{cecilia}, Cecilia et al.  present a GPU-based
+implementation of
+ACO for TSP where
+the two steps (tour construction and pheromone update) are
+parallelized on the GPU. A data parallelism approach is used to
+enhance the performance of the tour  construction step. The
+authors use two categories of artificial ants: queen ants
+associated with CUDA thread-blocks and worker ants associated with
+CUDA threads. A queen ant represents a simulated ant and worker
+ants collaborate with the queen ant to accelerate the decision
+about the  next city to visit. The tour construction step of each
+queen ant is accelerated. Each worker ant  maintains a history of
+the search in a tabu list containing a chronological ordering of
+the already visited cities. This memory is used to determine the
+feasible neighborhoods. After all queen ants have constructed
+their tours, the pheromone trails are updated. For pheromone
+update, several GPU techniques are also used to increase the data
+bandwidth of the application mainly by the use of precalculated
+matrices that are easily updated by several threads (one thread
+per matrix entry). The achieved speedups are $21 \times$ for tour
+construction and  $20 \times$ for pheromone updates.
+
+In another work, Tsutsui and Fujimoto~\cite{tsutsui} propose  a
+hybrid algorithm combining
+ACO metaheuristic
+and Tabu Search (TS) implemented
+on GPU to solve the QAP. A solution of QAP is represented as a
+permutation of ${1,2,..,n}$ with $n$ being the size of the
+problem. The TS algorithm is based on the 2-opt neighborhood
+(swapping of two elements $(i,j)$ in the permutation). The authors
+point out that the move cost of each neighbor depends on the
+couple $(i,j)$. Two groups of moves are formed according to the
+move cost. In order to avoid thread divergence\index{GPU!thread divergence} within the same warp, the
+neighborhood evaluation is parallelized in such a way to assign
+only moves of the same cost to each thread warp. This strategy is
+called MATA for Move-cost Adjusted Thread Assignment. Concerning
+the memory management\index{GPU!memory management}, all
+the data of the ACO\index{metaheuristics!ant colony optimization}
+(population, pheromone matrix), QAP matrices, and tabu list are
+placed on the global memory of the GPU. Shared memory is used only
+for working data common to all threads in a given block.
+ All the
+steps of the hybrid algorithm
+ACO-TS
+(ACO initialization,
+pheromone update, construct solutions, applying TS) are
+implemented as kernel functions on the GPU. The GPU/CPU
+communications are only used to transfer the best-so-far solution
+in order to verify termination conditions. The experiments
+performed on standard QAP benchmarks showed that the GPU
+implementation using MATA obtained a  speedup of $19 \times$
+compared to the CPU implementation,  compared with a speedup of
+only $5 \times$
+when MATA is not used. 
+
+\subsubsection*{Particle swarm optimization}
+In~\cite{zhou2009} Zhou and Tan propose a full GPU implementation
+of a standard PSO algorithm. All the data is stored in global
+memory (velocities, positions, swarm population, etc). Only
+working data is copied to shared memory by each thread. The four
+steps of the PSO have been parallelized on GPU: fitness evaluation
+of the swarm, update of local best and global best  of each
+particle, and update of velocity and position of each particle.
+The same strategy is used to parallelize the first and last
+steps: the evaluation of fitness functions is performed in
+parallel for each dimension  by several threads. It is the case
+for the update of position and velocities of each particle: one
+dimension at a time is updated for the whole swarm by several
+threads. Random numbers needed for  updating the velocities and
+positions for the whole PSO processes are generated on CPU at the
+starting of the algorithm and transferred to the GPU global
+memory. For the steps 2 and 3 (update of local best and global
+best of each particle), the GPU threads are mapped to the \emph{N}
+particles of the swarm one to one. Experiments done on four
+benchmark functions show speedups ranging from $3.7 \times$ to
+$9.0 \times$ for swarm sizes
+ranging from $400$ to $2800$.
  
  \subsection{Synthesis of the implementation strategies}
  
  \subsection{Synthesis of the implementation strategies}
-\label{ch8:sec:synthesis}
-After reviewing some works dealing with GPU-based implementation of metaheuristics, 
-in this section we will try to come up with a classification of the different 
-strategies used in the literature for the implementation of metaheuristics on GPUs. 
-One may distinguish between the parallel models adopted in each metaheuristic (design level)  and the way they are exploited on GPU architectures (implementation level). Indeed, even though the parallelization models used in most works for GPUs are inspired from the traditional parallel models of each metaheuristic (on CPU), their implementation could take a different way and sometimes it may result in new parallel models customized for GPUs.\\
-
-Traditional parallel models for metaheuristics are based on an intuitive task parallelism: the independent tasks of the algorithms are simply parallelized. For example, in the case of s-metaheuristics, the evaluation of large neighborhoods could be done in parallel since there is no synchronization at this step of the algorithm. That is the case of EAs  when it comes to apply the evaluation step. Nevertheless, because of the particularity of the GPU architecture, some authors used new implementation techniques to enhance the data parallelism in the sequential algorithms in order to increase the data throughput of the application.\\ 
-
-From this observation we propose the following classification based on 2 levels: design level and implementation level as illustrated in Figure~\ref{ch8:fig:classification}. The design level regroups the three classes of parallel models used in metaheuristics (solution-level\index{Metaheuristics!solution-level~parallelism}, iteration-level\index{Metaheuristics!iteration-level parallelism}, algorithmic-level\index{Metaheuristics!algorithmic-level parallelism}) with examples for s-metaheuristics, EAs, ACO\index{Metaheuristics!ant~colony~optimization} and PCO. This classification  is principally built from the reviewed state-of-the-art works  in the previous section. The implementation level refers to the way these parallel design models are implemented on GPU. This classification focuses only on the mapping strategies between the GPU threads and the parallelized tasks (neighborhood evaluation, solution construction and so on). The different implementation strategies are explained in the following sections.\\
+\label{ch8:sec:synthesis} After reviewing some works dealing with
+GPU-based implementation of metaheuristics, in this section we
+will try to come up with a classification of the different
+strategies used in the literature for the implementation of
+metaheuristics on GPUs. One may distinguish between the parallel
+models adopted in each metaheuristic (design level)  and the way
+they are exploited on GPU architectures (implementation level).
+Indeed, even though the parallelization models used in most works
+for GPUs are derived from the traditional parallel models of each
+metaheuristic (on CPU), their implementation could take a
+different way and sometimes it may result in new parallel models
+customized for GPUs.
+
+Traditional parallel models for metaheuristics are based on an
+intuitive task parallelism: the independent tasks of the
+algorithms are simply parallelized. For example, in the case of
+s-metaheuristics, the evaluation of large neighborhoods could be
+done in parallel since there is no synchronization at this step of
+the algorithm. That is the case of EAs  when it comes to applying
+the evaluation step. Nevertheless, because of the particularity of
+the GPU architecture, some authors have used new implementation
+techniques to enhance the data parallelism in the sequential
+algorithms in order to increase the data throughput of the
+application.
+
+From this observation we propose the following classification
+based on 2 levels: design level and implementation level as
+illustrated in Figure~\ref{ch8:fig:classification}. The design
+level regroups the three classes of parallel models used in
+metaheuristics
+(solution-level,
+iteration-level,
+algorithmic-level) with examples for s-metaheuristics, EAs,
+ACO and PCO. This
+classification  is principally built from the reviewed
+state-of-the-art works  in the previous section. The
+implementation level refers to the way these parallel design
+models are implemented on GPU. This classification focuses only on
+the mapping strategies between the GPU threads and the
+parallelized tasks (neighborhood evaluation, solution
+construction, and so on). The different implementation strategies
+are explained in the following sections.
  
  \begin{figure}[h]
  \centerline{\includegraphics[width=1\textwidth]{Chapters/chapter9/figures/classification.pdf}}
  
  \begin{figure}[h]
  \centerline{\includegraphics[width=1\textwidth]{Chapters/chapter9/figures/classification.pdf}}
@@ -422,164 +831,286 @@ From this observation we propose the following classification based on 2 levels:
  
  
  \subsubsection*{GPU thread mapping for solution-level parallelism}
  
  
  \subsubsection*{GPU thread mapping for solution-level parallelism}
-\index{GPU-based Metaheuristics!GPU-thread mapping}
-Parallel models at solution level consists on parallelizing a time intensive atomic task of the algorithm. Generally, it consists on the fitness evaluation~\cite{kannan}. Nevertheless, crossover operators have also been parallelized by some authors~\cite{pinel2012JPDC}. This kind of models are not always possible as they are problem dependent. The GPU implementation of solution level models uses the per-block mapping: each solution is associated to a block of threads. A second level of parallelism is used inside each block to parallelize the fitness evaluation of a single solution. This kind of mapping allows the use of shared memory to store the data structures of the solution and do not require the use of very large neighborhoods or populations.\\ 
-%data parallelism in SA-matrix to parallelize 
-\subsubsection*{GPU thread mapping for iteration-level parallelism} 
-\index{GPU-based Metaheuristics!GPU-thread mapping}
-Iteration-level parallelism consists on parallelizing the tasks performed independently on different solutions. Different mapping strategies are adopted in the reviewed works to implement these models.
-
-In Figure \ref{ch8:fig:classification}, the first example of iteration-level\index{Metaheuristics!iteration-level parallelism} parallelism is the parallel evaluation of neighborhoods in s-metaheuristics. In most of the reviewed works,  a per-thread mapping approach is used: each solution of the neighborhood is mapped to a unique thread in the GPU for evaluation~\cite{luong2010large, audreyANT}. The same mapping is used for master-slave parallel EAs to accelerate the evaluation of large populations. This kind of mappings is only efficient for very large neighborhoods and very large populations (to saturate the GPU). Many authors have pointed out that the use of such large populations (or neighborhoods) may lead to an overhead due to the communication costs between the CPU and the GPU (if the sequential part of the algorithm is placed on CPU). In order to circumvent this issue, many authors have implemented the entire algorithm on GPU~\cite{pospichal10}. On the other hand, as several solutions may be affected 
-to the same thread block in the GPU, shared memory 
-could not be used and all the data should be placed on global memory.
+\index{GPU!thread mapping} Parallel
+models at solution level consist of parallelizing a time intensive
+atomic task of the algorithm. Generally, it consists of the
+fitness evaluation~\cite{kannan}. Nevertheless, crossover
+operators have also been parallelized by some
+authors~\cite{pinel2012JPDC}. These kinds of models are not always
+possible as they are problem-dependent. The GPU implementation of
+solution-level models uses the perblock mapping: each solution is
+associated to a block of threads. A second level of parallelism is
+used inside each block to parallelize the fitness evaluation of a
+single solution. This kind of mapping allows the use of shared
+memory to store the data structures of the solution and does not
+require the use of very large neighborhoods or populations.
+%data parallelism in SA-matrix to parallelize
+
+\subsubsection*{GPU thread mapping for iteration-level parallelism}
+\index{GPU!thread mapping}
+Iteration-level parallelism consists of parallelizing the tasks
+performed independently on different solutions. Different mapping
+strategies are adopted in the reviewed works to implement these
+models.
+
+In Figure \ref{ch8:fig:classification}, the first example of
+iteration-level
+parallelism is the parallel evaluation of neighborhoods in
+s-metaheuristics. In most of the reviewed works,  a per-thread
+mapping approach is used: each solution of the neighborhood is
+mapped to a unique thread in the GPU for
+evaluation~\cite{luong2010large, audreyANT}. The same mapping is
+used for master-slave parallel EAs to accelerate the evaluation of
+large populations. This kind of mapping is only efficient for very
+large neighborhoods and very large populations (to saturate the
+GPU). Many authors have pointed out that the use of such large
+populations (or neighborhoods) may lead to an overhead due to the
+communication costs between the CPU and the GPU (if the sequential
+part of the algorithm is placed on CPU). In order to circumvent
+this issue, many authors have implemented the entire algorithm on
+GPU~\cite{pospichal10}. On the other hand, as several solutions
+may be mapped with the same thread block in the GPU, shared memory
+could not be used and all the data should be placed on global
+memory.
  %pheromone update data parallelism matrices
  
  \subsubsection*{GPU thread mapping  for algorithmic-level parallelism}
  %pheromone update data parallelism matrices
  
  \subsubsection*{GPU thread mapping  for algorithmic-level parallelism}
-\index{GPU-based Metaheuristics!GPU-thread mapping}
-Algorithmic-level parallelism consists on launching several self-contained algorithms in parallel. In the previously reviewed works two algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} models have been used: the multi-start model and the island model (parallel EAs).\\
-
-The implementation of the multi-start model is based on two different mapping strategies~\cite{luongMultiStart, audreyANT}: (1) each local search is associated to a unique thread, (2) each solution (from multiple neighborhoods) is associated to a unique thread. 
-The first mapping strategy (one thread per LS) presents a big drawback: the number of LS to use should be very large to saturate the GPU and cover the memory access latency. On the other hand, the evaluation of the neighborhood of a single LS by one thread is time intensive. Furthermore, shared memory could not be used to store the huge data generated by the different neighborhoods. In the second mapping strategy, the LS algorithms are placed on CPU and the neighborhood evaluation of each LS is parallelized on GPU using per-thread mapping strategy (one thread per solution). This comes to a hierarchical parallel model combining algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} parallelism (multi-start) with iteration-level\index{Metaheuristics!iteration-level parallelism} parallelism (master-worker).\\ 
-
-In the island model, the same mapping is used in all the reviewed works~\cite{tsutsuiGAQAP, nowotniak, maitre2009}: each sub-population (island) is associated to one thread block in the GPU. A second level of parallelism is used inside the block to parallelize the evaluation step of the local population. Migrations are always performed through the global memory as inter-block communications  are impossible in CUDA. The first advantage of this hierarchical implementation is that it allows the occupation of a large number of threads even for medium population sizes. The second advantage consists on using shared memory to store sub-populations to benefit from the low latency of shared memory.\\
-
-\section{Frameworks for metaheuristics on GPUs} 
+Algorithmic-level parallelism consists of launching several self-contained algorithms in parallel. In the previously reviewed works two algorithmic-level models have been used: the multistart model and the island model (parallel EAs).
+
+The implementation of the multistart model is based on two
+different mapping strategies~\cite{luongMultiStart, audreyANT}:
+(1) each Local Search (LS) is associated to a unique thread and
+(2) each solution (from multiple neighborhoods) is associated to a
+unique thread. The first mapping strategy (one thread per LS)
+presents a big drawback: the number of LS to use should be very
+large to saturate the GPU and cover the memory access latency. On
+the other hand, the evaluation of the neighborhood of a single LS
+by one thread is time intensive. Furthermore, shared memory could
+not be used to store the huge data generated by the different
+neighborhoods. In the second mapping strategy, the LS algorithms
+are placed on CPU and the neighborhood evaluation of each LS is
+parallelized on GPU using per-thread mapping strategy (one thread
+per solution). This consists of a hierarchical parallel model
+combining algorithmic-level parallelism (multistart) with
+iteration-level
+parallelism (master-worker).
+
+In the island model, the same mapping is used in all the reviewed
+works~\cite{tsutsuiGAQAP, nowotniak, maitre2009}: each
+subpopulation (island) is associated to one thread block in the
+GPU. A second level of parallelism is used inside the block to
+parallelize the evaluation step of the local population.
+Migrations are always performed through the global memory as
+interblock communications  are impossible in CUDA. The first
+advantage of this hierarchical implementation is that it allows
+the occupation of a large number of threads even for medium
+population sizes. The second advantage consists of using shared
+memory to store subpopulations to benefit from the low latency of
+shared memory.
+
+\section{Frameworks for metaheuristics on GPUs}
  \label{ch8:sec:frameworks}
  
  \label{ch8:sec:frameworks}
  
-After the first pioneering works of metaheuristics on graphics processing units, 
-the next challenge is to provide easy to use frameworks and libraries for rapid development 
-of metaheuristics on GPUs. Although the works on this subject are not yet mature 
-and do not cover the main metaheuristic algorithms, we will present the only three 
-works to our knowledge, which propose open source frameworks for developing 
-metaheuristics on GPUs.\\
-
-The three frameworks reviewed in this section are: 
- PUGACE\index{GPU-based frameworks!PUGACE}~\cite{pugace}, ParadisEO-MO-GPU\index{GPU-based frameworks!ParadisEO-MO-GPU}~\cite{paradiseoGPU}, and libCudaOptimize\index{GPU-based frameworks!libCudaOptimize}~\cite{libcuda}. PUGACE\index{GPU-based frameworks!PUGACE} is a framework for implementing EAs on GPUs. Paradiseo-GPU is an extension of the framework ParadisEO~\cite{paradiseo} for implementing s-metaheuristics on GPU. Finally, libCudaOptimize\index{GPU-based frameworks!libCudaOptimize} 
-is a  library intended for the implementation of p-metaheuristics on GPU. 
-The three frameworks are presented in more details in the following. 
-
-\subsection{PUGACE\index{GPU-based frameworks!PUGACE}: A framework for implementing evolutionary computation on GPUs}
-PUGACE\index{GPU-based frameworks!PUGACE} is a generic  framework for easy implementation of cellular evolutionary 
-algorithms on GPUs implemented using C and CUDA. It is based on the frameworks MALLBA 
-and JCell (a framework for cellular algorithms). The authors justified the choose of 
-cellular evolutionary algorithm\index{Metaheuristics!evolutionary~algorithm} by the good feedback found in the literature concerning 
-its efficient implementation on GPUs compared to other parallel models for EAs (island, 
-master-slave). The main standard evolutionary operators are already implemented in PUGACE\index{GPU-based frameworks!PUGACE}: 
-different selection strategies, standard crossover and mutation operators (\emph{PMX, swap, 
-2-exchange}, etc.). Different problem encoding are also supported. The framework is 
-organized as a set of modules in which the different functionalities  are separated 
-as much as possible in order to facilitate the extension of the framework. Despite, all 
-the functions and procedures that execute on GPU are implemented in the same file kernel.cu. \\
-
-The implementation strategy adopted on the GPU is as follows. Population initialization 
-is done on the CPU side and the population is transferred to the GPU. On the GPU side, 
-each individual is associated to a unique CUDA thread. The function evaluation and mutation 
-are done on the GPU while selection and replacement are maintained on the CPU. In order 
-to avoid thread divergence\index{GPU Challenges!thread~divergence} appearing in the same CUDA thread block at the crossover step 
-(because of the probability of application which may give different results from one 
-thread to the other), the decision of applying or not a crossover is taken at the block 
-level and applied to all the individuals within the block. It is the decision on the choose 
-of the cutting point for the crossover.\\
-
-The framework is validated on standard benchmarks of the quadratic assignment problem (QAP). 
-Speedups of $15.44 \times$  to $18.41 \times$ are achieved compared to a CPU implementation of a cEA using population sizes rising from 512 to 1024 and from 1024 to 2048.
-
-% The authors noticed an 
-% improvement on the speedup when the size of the population increases. This is best explained 
-% by the linear increasing of execution time on CPU side while a sublinear increasing is 
-% occuring on GPU side (due to a better exploitation of the GPU threads). Neverethless, 
-% dispite the positive impact of large populations on th parallel efficieny of the implementation 
-% on GPU, a study on the impact on the final solution quality when using such large populations 
-% is to be done. Indeed, too large populations on EAs may lead to high diversification of the 
-% search and poor intensification capabilities, which leads to poor results. 
-% For the moment, plateforms whith more than one CPU and one GPU are not yet supported in PUGACE\index{GPU-based frameworks!PUGACE}. 
-% No details on how to get PUGACE\index{GPU-based frameworks!PUGACE} code source are given in the paper. 
-
-\subsection{Paradiseo-MO-GPU}
-
-Melab \emph{et al.}~\cite{paradiseoGPU} developed a reusable framework ParadisEO-MO-GPU\index{GPU-based frameworks!ParadisEO-MO-GPU} for parallel local search metaheuristics (s-metaheuristics) on GPUs. It focuses on 
-iteration-level\index{Metaheuristics!iteration-level parallelism} parallel model of S-Metaheuristics which consists in exploring 
-in parallel the neighborhood of a problem solution on GPU. The framework, 
-implemented using C++ and CUDA, is an extension of the ParadisEO~\cite{paradiseo} framework 
-previously developed by the same team for parallel and distributed metaheuristics on both dedicated parallel hardware platforms and computational grids. The objective of this framework is to facilitate the development of GPU-based metaheuristics providing a transparent use of 
-GPUs to users that are unfamiliar with advanced features of all parallelization 
-techniques and deployment on GPUs. The framework allows one to efficiently manage 
-the hierarchical organization of the memories (latencies and sizes) of the GPU 
-and its communication with the CPU as well as the minimizing of the user 
-involvement in its management.\\
+After the first pioneering works of metaheuristics on graphics
+processing units, the next challenge is to provide easy-to-use
+frameworks and libraries for rapid development of metaheuristics
+on GPUs. Although the works on this subject are not yet mature and
+do not cover the main metaheuristic algorithms, we will present
+the only three works to our knowledge, which propose open source
+frameworks for developing
+metaheuristics on GPUs.
+
+The three frameworks reviewed in this section are
+PUGACE\index{GPU-based frameworks!PUGACE}~\cite{pugace},
+ParadisEO-MO-GPU\index{GPU-based
+frameworks!ParadisEO-MO-GPU}~\cite{paradiseoGPU}, and
+libCUDAOptimize\index{GPU-based
+frameworks!libCUDAOptimize}~\cite{libcuda}. PUGACE\index{GPU-based
+frameworks!PUGACE} is a framework for implementing EAs on GPUs.
+ParadisEO-MO-GPU is an extension of the framework
+ParadisEO~\cite{paradiseo} for implementing s-metaheuristics on
+GPU. Finally, libCUDAOptimize\index{GPU-based
+frameworks!libCUDAOptimize} is a  library intended for the
+implementation of p-metaheuristics on GPU. The three frameworks
+are presented in more detail in the following.
+
+\subsection{PUGACE: framework for implementing evolutionary computation on GPUs}
+PUGACE is a generic  framework
+for easy implementation of cellular evolutionary algorithms on
+GPUs implemented using C and CUDA. It is based on the frameworks
+MALLBA and JCell (a framework for cellular algorithms). The
+authors justified the choice of cellular evolutionary
+algorithm by the good
+feedback found in the literature concerning its efficient
+implementation on GPUs compared to other parallel models for EAs
+(island, master-slave). The main standard evolutionary operators
+are already implemented in PUGACE: different selection strategies, standard
+crossover, and mutation operators (PMX, swap, 2-exchange,
+etc.). Different problem encoding is also supported. The framework
+is organized as a set of modules in which the different
+functionalities  are separated as much as possible in order to
+facilitate the extension of the framework. All
+the functions and procedures that execute on GPU are implemented in the same file kernel.cu. 
+
+The implementation strategy adopted on the GPU is as follows.
+Population initialization is done on the CPU side and the
+population is transferred to the GPU. On the GPU side, each
+individual is associated to a unique CUDA thread. The function
+evaluation and mutation are done on the GPU while selection and
+replacement are maintained on the CPU. In order to avoid thread
+divergence\index{GPU!thread divergence} appearing in
+the same CUDA thread block at the crossover step (because of the
+probability of application which may give different results from
+one thread to the other), the decision of whether to apply a
+crossover is taken at the block level and applied to all the
+individuals within the block. It is the decision on the choose
+of the cutting point for the crossover.
+
+The framework is validated on standard benchmarks of the QAP. Speedups of $15.44 \times$  to $18.41
+\times$ are achieved compared to a CPU implementation of a cEA
+using population sizes rising from 512 to 1024 and from 1024 to
+2048.
+
+% The authors noticed an
+% improvement on the speedup when the size of the population increases. This is best explained
+% by the linear increasing of execution time on CPU side while a sublinear increasing is
+% occuring on GPU side (due to a better exploitation of the GPU threads). Neverethless,
+% dispite the positive impact of large populations on th parallel efficieny of the implementation
+% on GPU, a study on the impact on the final solution quality when using such large populations
+% is to be done. Indeed, too large populations on EAs may lead to high diversification of the
+% search and poor intensification capabilities, which leads to poor results.
+% For the moment, plateforms whith more than one CPU and one GPU are not yet supported in PUGACE\index{GPU-based frameworks!PUGACE}.
+% No details on how to get PUGACE\index{GPU-based frameworks!PUGACE} code source are given in the paper.
+
+\subsection{ParadisEO-MO-GPU}
+
+Melab et al.~\cite{paradiseoGPU} developed a reusable
+framework ParadisEO-MO-GPU\index{GPU-based
+frameworks!ParadisEO-MO-GPU} for parallel local search
+metaheuristics (s-metaheuristics) on GPUs. It focuses on the
+iteration-level
+parallel model of s-metaheuristics which consists of exploring in
+parallel the neighborhood of a problem solution on GPU. The
+framework, implemented using C++ and CUDA, is an extension of the
+ParadisEO~\cite{paradiseo} framework previously developed by the
+same team for parallel and distributed metaheuristics on both
+dedicated parallel hardware platforms and computational grids. The
+objective of this framework is to facilitate the development of
+GPU-based metaheuristics providing a transparent use of GPUs to
+users who are unfamiliar with advanced features of all
+parallelization techniques and deployment on GPUs. The framework
+allows one to efficiently manage the hierarchical organization of
+the memories (latencies and sizes) of the GPU and its
+communication with the CPU as well as the minimizing of the user
+involvement in its management.
  
  \begin{figure}[h]
  \centerline{\includegraphics[width=0.8\textwidth]{Chapters/chapter9/figures/paradiseoGPU.pdf}}
  
  \begin{figure}[h]
  \centerline{\includegraphics[width=0.8\textwidth]{Chapters/chapter9/figures/paradiseoGPU.pdf}}
-\caption{The squelton of the ParadisEO-MO-GPU\index{GPU-based frameworks!ParadisEO-MO-GPU}.}
+\caption{The skeleton of the ParadisEO-MO-GPU\index{GPU-based
+frameworks!ParadisEO-MO-GPU}.}
  %[Comparison of number of cores in a CPU and in a GPU]
  \label{ch1:fig:paradiseoGPU}
  \end{figure}
  
  %[Comparison of number of cores in a CPU and in a GPU]
  \label{ch1:fig:paradiseoGPU}
  \end{figure}
  
-The framework is based on a master-worker model where the master is the CPU and the workers are threads executed by the processing cores of the GPU. The CPU executes the serial part of the metaheuristic and sends only the current solution to the GPU to minimize the transfer cost. The GPU, on its side, generates the neighboring of the received solution and evaluates them at each iteration. All the threads execute the same kernel and according to a static mapping table between the threads and the neighbors where each thread is associated with exactly one neighbor evaluation. After all the neighborhood is generated and evaluated, 
-it is sent back to the CPU which selects the best solution.
-
-\subsection{libCudaOptimize: an open source library of GPU-based metaheuristics}
-\index{GPU-based frameworks!libCudaOptimize}
-LibCudaOptimize~\cite{libcuda} is a C++/Cuda open source library for the design and implementation of metaheuristics on GPUs. Until now, the metaheuristics supported by LibCudaOptimize are: scatter search, differential evolution and particle swarm optimization\index{Metaheuristics!particle swarm optimization}. Nevertheless, the library is designed in such a way to allow further extensions for other metaheuristics and it is still on development phase by the authors. The parallelization strategy adopted on GPU is principally based on fitness evaluation. The sequential structure of the optimization algorithms are maintained on CPU. 
-
-\section{Case study: Accelerating large neighborhood LS method on GPUs for solving the Q3AP}
+The framework is based on a master-worker model where the master
+is the CPU and the workers are threads executed by the processing
+cores of the GPU. The CPU executes the serial part of the
+metaheuristic and sends only the current solution to the GPU to
+minimize the transfer cost. The GPU, on its side, generates the
+neighboring of the received solution and evaluates them at each
+iteration. All the threads execute the same kernel and according
+to a static mapping table between the threads and the neighbors
+where each thread is associated with exactly one neighbor
+evaluation. After all the neighborhood is generated and evaluated,
+it is sent back to the CPU which selects the best solution (See
+Figure~\ref{ch1:fig:paradiseoGPU}).
+
+\subsection{libCUDAOptimize: an open source library of GPU-based metaheuristics}
+\index{GPU-based frameworks!libCUDAOptimize}
+LibCUDAOptimize~\cite{libcuda} is a C++/CUDA open source library
+for the design and implementation of metaheuristics on GPUs. Until
+now, the metaheuristics supported by LibCUDAOptimize are: scatter
+search, differential evolution, and particle swarm
+optimization\index{metaheuristics!particle swarm optimization}.
+Nevertheless, the library is designed in such a way to allow
+further extensions for other metaheuristics and it is still in
+development phase by the authors. The parallelization strategy
+adopted on GPU is principally based on fitness evaluation. The
+sequential structure of the optimization algorithms is maintained
+on CPU.
+
+\section{Case study: accelerating large neighborhood LS method on GPUs for solving the Q3AP}
  \label{ch8:sec:case}
  
  \label{ch8:sec:case}
  
-In this case study, a large neighborhood GPU-based local search\index{GPU-based Metaheuristics!GPU-based~local~search} method for solving the quadratic 3-dimensional assignment problem (Q3AP) will be presented. The local search method is an iterated local search\index{Metaheuristics!iterated local search} (ILS)~\cite{stutzle2006ILSforQAP} 
-using an embedded tabu search\index{Metaheuristics!tabu~search} (TS) algorithm. The ILS principle consists in executing iteratively the embedded local search, 
-each iteration of the local search process starting from a disrupted local optima reached by the previous local search process. 
-The disruption heuristic is a performance parameter of an ILS algorithm and should be judiciously defined. A template of an 
-ILS algorithm is given by the algorithm \ref{ch8:ils_algorithm_template}.\\
+In this case study, a large neighborhood GPU-based local
+search\index{GPU!based local search}
+method for solving the Quadratic 3-dimensional Assignment Problem
+(Q3AP) will be presented. The local search method is an Iterated
+Local Search\index{metaheuristics!iterated local search}
+(ILS)~\cite{stutzle2006ILSforQAP} using an embedded
+TS algorithm. The ILS principle
+consists of executing iteratively the embedded local search, each
+iteration which starts from a disrupted local optima reached by
+the previous local search process. The disruption heuristic is a
+performance parameter of an ILS algorithm and should be
+judiciously defined. A template of an
+ILS algorithm is given by the Algorithm~\ref{ch8:ils_algorithm_template}.
  
  \begin{algorithm}[H]
  
  \begin{algorithm}[H]
-       \SetAlgoLined
-       %\KwData{this text}
-       %\KwResult{how to write algorithm with \LaTeX2e }
+    \SetAlgoLined
+    %\KwData{this text}
+    %\KwResult{how to write algorithm with \LaTeX2e }
     $s_{0}$=GenerateInitSol()\;
     $s^*$=LocalSearch($s_0$)\;
     $s_{0}$=GenerateInitSol()\;
     $s^*$=LocalSearch($s_0$)\;
-       \Repeat{a maximum number of iterations is reached}{
+    \Repeat{a maximum number of iterations is reached}{
       $s'$=Perturbate($s^*,history$)\;
       $s^{*'}$=LocalSearch($s'$)\;
       $s^*$=AcceptationCriteria($s^*,s^{*'},history$)\;
  
       $s'$=Perturbate($s^*,history$)\;
       $s^{*'}$=LocalSearch($s'$)\;
       $s^*$=AcceptationCriteria($s^*,s^{*'},history$)\;
  
-       }
-\caption{Iterated local search template}
+    }
+\caption{iterated local search template}
  \label{ch8:ils_algorithm_template}
  \end{algorithm}
  
  \label{ch8:ils_algorithm_template}
  \end{algorithm}
  
-%  
+%
  %   \begin{algorithm}
  %   \begin{algorithm}
-%  % \label{ch8:ils_algorithm_template} 
+%  % \label{ch8:ils_algorithm_template}
  %   \caption{iterated local search\index{Metaheuristics!iterated local search} template}
  %   \begin{algorithmic}[1]
  %   \caption{iterated local search\index{Metaheuristics!iterated local search} template}
  %   \begin{algorithmic}[1]
-%   
+%
  %   \STATE $s_{0}$=\texttt{GenerateInitSol}();
  %   \STATE $s_{0}$=\texttt{GenerateInitSol}();
-%   
+%
  %   \STATE $s^*$=\texttt{LocalSearch}($s_0$);
  %   \STATE $s^*$=\texttt{LocalSearch}($s_0$);
-%   
+%
  %   \REPEAT
  %     \STATE $s'$=\texttt{Perturbate}($s^*,history$);
  %     \STATE $s^{*'}$=\texttt{LocalSearch}($s'$);
  %     \STATE $s^*$=\texttt{AcceptationCriteria}($s^*,s^{*'},history$);
  %   \REPEAT
  %     \STATE $s'$=\texttt{Perturbate}($s^*,history$);
  %     \STATE $s^{*'}$=\texttt{LocalSearch}($s'$);
  %     \STATE $s^*$=\texttt{AcceptationCriteria}($s^*,s^{*'},history$);
-%   
+%
  %   \UNTIL {a maximum number of iterations is reached}
  %   \UNTIL {a maximum number of iterations is reached}
-%   
+%
  %   \end{algorithmic}
  %   \end{algorithmic}
-%   
+%
  %   \end{algorithm}
  
  \subsection{The quadratic 3-dimensional assignment problem}
  
  %   \end{algorithm}
  
  \subsection{The quadratic 3-dimensional assignment problem}
  
-The quadratic 3-dimensional assignment problem (Q3AP) is an extension of the well-known NP-hard quadratic assignment 
-problem (QAP). The latter is one of the most studied problem by the combinatorial optimization community due to its  
-wide range of practical applications (site planning, schedule problem, computer-aided design, etc.) and to its 
-computational challenges since it is considered as one of the most computationally difficult optimization problem.
-
-The Q3AP was firstly introduced by William P. Pierskalla in 1967~\cite{Pierskalla1967Q3AP} and, unlike the QAP, 
-the Q3AP is a more less studied problem. Indeed, the Q3AP was revisited only this last years and has recently 
-been used to model some advanced assignment problems such as the symbol-mapping problem encountered in wireless 
-communication systems and described in~\cite{Hahn2008q3ap}. The Q3AP optimization problem can be mathematically 
-expressed as follows:
+The Q3AP is an extension of the well-known NP-hard QAP. The latter
+is one of the most studied problems by the combinatorial
+optimization community due to its wide range of practical
+applications (site planning, schedule problem, computer-aided
+design, etc.) and to its computational challenges since it is
+considered as one of the most computationally difficult
+optimization problems.
+
+The Q3AP was first introduced by William P. Pierskalla in
+1967~\cite{Pierskalla1967Q3AP} and, unlike the QAP, the Q3AP is a
+less studied problem. Indeed, the Q3AP was revisited only this
+past year and has recently been used to model some advanced
+assignment problems such as the symbol-mapping problem encountered
+in wireless communication systems and described
+in~\cite{Hahn2008q3ap}. The Q3AP optimization problem can be
+mathematically expressed as follows:
  
  % debut -----------------------------------------------------------------------------
  \begin{equation*} min \left \{
  
  % debut -----------------------------------------------------------------------------
  \begin{equation*} min \left \{
@@ -590,12 +1121,12 @@ expressed as follows:
  \sum_{i=0}^{n-1}\sum_{j=0}^{n-1}\sum_{l=0}^{n-1}b_{ijl}x_{ijl}
  \right \} \end{equation}
  
  \sum_{i=0}^{n-1}\sum_{j=0}^{n-1}\sum_{l=0}^{n-1}b_{ijl}x_{ijl}
  \right \} \end{equation}
  
-where:
+where
  
  \begin{eqnarray}
    X=(x_{ijl})\in I \cap J \cap L, \label{Q3APConstraints1}\\
  
  \begin{eqnarray}
    X=(x_{ijl})\in I \cap J \cap L, \label{Q3APConstraints1}\\
-  x_{ijl}\in \{0,1\}, \quad i,j,l=0,1,...,n-1. \label{Q3APConstraints2}
-\end{eqnarray} $I$, $J$ and $L$ sets are defined as follows:
+  x_{ijl}\in \{0,1\}, \quad i,j,l=0,1,...,n-1 \label{Q3APConstraints2}
+\end{eqnarray} $I$, $J$, and $L$ sets are defined as follows:
  \begin{displaymath} I=\lbrace X=(x_{ijl}):
  \sum_{j=0}^{n-1}\sum_{l=0}^{n-1}x_{ijl}=1, \quad
  i=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
  \begin{displaymath} I=\lbrace X=(x_{ijl}):
  \sum_{j=0}^{n-1}\sum_{l=0}^{n-1}x_{ijl}=1, \quad
  i=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
@@ -604,11 +1135,10 @@ i=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
  j=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
  \begin{displaymath} L=\lbrace X=(x_{ijl}):
  \sum_{i=0}^{n-1}\sum_{j=0}^{n-1}x_{ijl}=1, \quad
  j=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
  \begin{displaymath} L=\lbrace X=(x_{ijl}):
  \sum_{i=0}^{n-1}\sum_{j=0}^{n-1}x_{ijl}=1, \quad
-l=0,1,...,n-1\rbrace \mathrm{.} \end{displaymath}
+l=0,1,...,n-1\rbrace  \end{displaymath}
  % fin------------------------------------------------------------------------------
  
  % fin------------------------------------------------------------------------------
  
-
-Other equivalent formulations of the Q3AP can be found in the literature. A particularly useful one is the 
+Other equivalent formulations of the Q3AP can be found in the literature. A particularly useful one is the
  \textit{permutation-based formulation} wherein the Q3AP can be expressed as follows:
  
   \begin{equation*}
  \textit{permutation-based formulation} wherein the Q3AP can be expressed as follows:
  
   \begin{equation*}
@@ -621,25 +1151,33 @@ Other equivalent formulations of the Q3AP can be found in the literature. A part
   \left. \sum_{i=0}^{n-1}b_{i\pi_{1(i)}\pi_{2(i)}}\right
   \} \label{Q3APPermutationBasedFormulation}
  \end{equation}
   \left. \sum_{i=0}^{n-1}b_{i\pi_{1(i)}\pi_{2(i)}}\right
   \} \label{Q3APPermutationBasedFormulation}
  \end{equation}
-
-where $\pi_1$ and $\pi_2$ are permutations over the set $\left\lbrace 0,1,\ldots,n-1\right\rbrace$. According to this 
-formulation, minimizing the Q3AP consists in finding a double permutation $(\pi_1,\pi_2)$ which minimizes the objective 
-function (\ref{Q3APPermutationBasedFormulation}).
-
-The Q3AP is proven to be a NP-hard problem since it is an extension of the quadratic assignment problem and of the axial
- 3-dimensional assignment problem which are both NP-hard problems. It is particularly computationally difficult since the 
+where $\pi_1$ and $\pi_2$ are permutations over the set
+$\left\lbrace 0,1,\ldots,n-1\right\rbrace$. According to this
+formulation, minimizing the Q3AP consists of finding a double
+permutation $(\pi_1,\pi_2)$ which minimizes the objective function
+(\ref{Q3APPermutationBasedFormulation}).
+
+The Q3AP is proven to be an NP-hard problem since it is an
+extension of the quadratic assignment problem and of the axial
+ 3-dimensional assignment problem which are both NP-hard problems. It is particularly computationally difficult since the
  number of feasible solutions of an instance of size $n$ is $n! \times n!$.
  
  \subsection{Iterated tabu search algorithm for the Q3AP}
  
  number of feasible solutions of an instance of size $n$ is $n! \times n!$.
  
  \subsection{Iterated tabu search algorithm for the Q3AP}
  
-To tackle large size instances of the Q3AP and speed up the search process, a parallel iterated local search\index{Metaheuristics!iterated local search} (ILS) 
-algorithm has been designed. The local search embedded in the ILS is a tabu search\index{Metaheuristics!tabu~search} (TS). A TS procedure~\cite{Glover1989TS} 
-starts from an initial feasible solution and tries, at each step, to move to a neighboring solution that minimizes the 
-fitness (for a minimization case). If no such move exists, the neighbor solution that less degrade the fitness is chosen 
-as a next move. This enables TS process to escape local optima. However, this strategy may generate cycles, i. e. previous 
-moves can be selected again. To avoid cycles, the TS manages a short-term memory that contains the moves that have been 
-recently performed. A tabu search\index{Metaheuristics!tabu~search} template is given by the algorithm \ref{TS_pseudo_code}.\\
-%  
+To tackle large-sized instances of the Q3AP and speed up the
+search process, a parallel ILS algorithm has been designed. The local search embedded in
+the ILS is a TS. A TS
+procedure~\cite{Glover1989TS} starts from an initial feasible
+solution and tries, at each step, to move to a neighboring
+solution that minimizes the fitness (for a minimization case). If
+no such move exists, the neighbor solution that less degrades the
+fitness is chosen as a next move. This enables the TS process to
+escape local optima. However, this strategy may generate cycles,
+i.e., previous moves can be selected again. To avoid cycles, the
+TS manages a short-term memory that contains the moves that have
+been recently performed. A TS
+template is given by Algorithm \ref{TS_pseudo_code}.
+%
  %  \begin{algorithm}
  %  \label{TS_pseudo_code}
  %  \caption{Tabu search template}
  %  \begin{algorithm}
  %  \label{TS_pseudo_code}
  %  \caption{Tabu search template}
@@ -653,39 +1191,46 @@ recently performed. A tabu search\index{Metaheuristics!tabu~search} template is
  %          \STATE TabuList = TabuList $\bigcup \{m(t)\}$;
  %          \STATE $t = t + 1$;
  %  \UNTIL{a maximum number of iterations is reached}
  %          \STATE TabuList = TabuList $\bigcup \{m(t)\}$;
  %          \STATE $t = t + 1$;
  %  \UNTIL{a maximum number of iterations is reached}
-%  
+%
  %  \end{algorithmic}
  %  \end{algorithm}
  
  \begin{algorithm}[H]
  %  \end{algorithmic}
  %  \end{algorithm}
  
  \begin{algorithm}[H]
-       \SetAlgoLined
-       %\KwData{this text}
-       %\KwResult{how to write algorithm with \LaTeX2e }
-       GenerateInitSol($s_0$)\;
-       TabuList=$\phi$\;
-       $t=0$\;
-       \Repeat{a maximum number of iterations is reached}{
+    \SetAlgoLined
+    %\KwData{this text}
+    %\KwResult{how to write algorithm with \LaTeX2e }
+    GenerateInitSol($s_0$)\;
+    TabuList=$\phi$\;
+    $t=0$\;
+    \Repeat{a maximum number of iterations is reached}{
            $m(t)$ = SelectBestMove($s(t)$)\;
            $s(t+1)$ = ApplyMove($m(t),s(t)$)\;
            TabuList = TabuList $\bigcup \{m(t)\}$\;
            $t = t + 1$\;
  
            $m(t)$ = SelectBestMove($s(t)$)\;
            $s(t+1)$ = ApplyMove($m(t),s(t)$)\;
            TabuList = TabuList $\bigcup \{m(t)\}$\;
            $t = t + 1$\;
  
-       }
-\caption{Tabu search template}
+    }
+\caption{tabu search template}
  \label{TS_pseudo_code}
  \end{algorithm}
  
  
  \subsection{Neighborhood functions for the Q3AP}
  
  \label{TS_pseudo_code}
  \end{algorithm}
  
  
  \subsection{Neighborhood functions for the Q3AP}
  
-The neighborhood function is a crucial parameter in any local search algorithm. Indeed, if the neighborhood function is 
-not adequate to the problem and/or does not consider the targeted computing framework, any local search algorithm will 
-fail to reach good quality solutions of the search space.
-
-Regarding the Q3AP, many neighborhood structures can be considered. A basic neighborhood was proposed and was investigated 
-in different works of the literature~\cite{Hahn2008q3ap,DBLP:journals/ijfcs/LoukilMMTB12} and consists in the set of all 
-solutions (double-permutations) generated from the current one by an exchange of two positions in either the first ($\pi_1$) 
-or the second ($\pi_2$) permutation. This neighborhood that we denote by $N_b$ can be formalized as follows:
+The neighborhood function is a crucial parameter in any local
+search algorithm. Indeed, if the neighborhood function is not
+adequate to the problem and/or does not consider the targeted
+computing framework, any local search algorithm will fail to reach
+good quality solutions of the search space.
+
+Regarding the Q3AP, many neighborhood structures can be
+considered. A basic neighborhood was proposed and investigated in
+different works of the
+literature~\cite{Hahn2008q3ap,DBLP:journals/ijfcs/LoukilMMTB12}
+and consists of the set of all solutions (double-permutations)
+generated from the current one by an exchange of two positions in
+either the first ($\pi_1$) or the second ($\pi_2$) permutation.
+This neighborhood that we denote by $N_b$ can be formalized as
+follows:
  
  \begin{equation}
  \begin{array}{rl}
  
  \begin{equation}
  \begin{array}{rl}
@@ -700,23 +1245,29 @@ N_b(\pi_1,\pi_2)=\{&(\pi_1^{'},\pi_2^{'}): \pi_1^{'}[k]=\pi_1[l], \pi_1^{'}[l]=\
                  & \pi_1^{'}[j]=\pi_1[j], \quad 0\leq j\neq k,l < n\\
  \}& \end{array}
  \end{equation}
                  & \pi_1^{'}[j]=\pi_1[j], \quad 0\leq j\neq k,l < n\\
  \}& \end{array}
  \end{equation}
-
-
-where $(\pi_1,\pi_2)$ is the current solution and $n$ is the size of the Q3AP instance. The size $|N_b|$ of such neighborhood is equal to:
+where $(\pi_1,\pi_2)$ is the current solution and $n$ is the size
+of the Q3AP instance. The size $|N_b|$ of such neighborhood is
+equal to
  
  \begin{equation}
  
  \begin{equation}
- |N_b| = n \times (n-1).
+ |N_b| = n \times (n-1)
  \end{equation}
  
  
  \end{equation}
  
  
-In our GPU-based implementations, a large-size neighborhood structure has been experimented. In fact, theoretical 
-and experimental studies have shown that the use of large neighborhood structures may improve the effectiveness of 
-LS algorithms~\cite{Ahuja:2007:VLN:1528422.1528438}. However, for such neighborhood, the generation/evaluation step 
-of an LS becomes a time-consuming task and may dramatically increase the computational time of the LS process. This 
-justifies the use of intensive data-parallelism provided by GPUs where all neighboring solutions may be concurrently evaluated. \\
+In our GPU-based implementations, a large-sized neighborhood
+structure has been used for experimentation. In fact, theoretical
+and experimental studies have shown that the use of large
+neighborhood structures may improve the effectiveness of LS
+algorithms~\cite{Ahuja:2007:VLN:1528422.1528438}. However, for
+such a neighborhood, the generation/evaluation step of an LS
+becomes a time-consuming task and may dramatically increase the
+computational time of the LS process. This
+justifies the use of intensive data-parallelism provided by GPUs where all neighboring solutions may be concurrently evaluated. 
  
  
-The considered large-size neighborhood consists in swapping two positions in both permutations $\pi_1$ and $\pi_2$. 
-This neighborhood structure, $N(\pi_1,\pi_2)$, can be formalized as follows:
+The considered large-sized neighborhood consists of swapping two
+positions in both permutations $\pi_1$ and $\pi_2$. This
+neighborhood structure, $N(\pi_1,\pi_2)$, can be formalized as
+follows:
  
  \begin{equation} \begin{array}{rl}
  N(\pi_1,\pi_2)=\{&(\pi_1^{'},\pi_2^{'}): \pi_1^{'}[k]=\pi_1[l], \pi_1^{'}[l]=\pi_1[k],\\
  
  \begin{equation} \begin{array}{rl}
  N(\pi_1,\pi_2)=\{&(\pi_1^{'},\pi_2^{'}): \pi_1^{'}[k]=\pi_1[l], \pi_1^{'}[l]=\pi_1[k],\\
@@ -736,33 +1287,55 @@ So, for a given Q3AP instance of size $n$, the size $|N|$ of the advanced neighb
  
  \subsection{Design and implementation of a GPU-based iterated tabu search algorithm for the Q3AP} \label{ch8:ITS-Q3APSection}
  
  
  \subsection{Design and implementation of a GPU-based iterated tabu search algorithm for the Q3AP} \label{ch8:ITS-Q3APSection}
  
-The use of GPU devices to speed up the search process of local search methods is not a straightforward task. 
-This requires to consider, at the same time, the characteristics and underlined issues of the GPU architecture 
-and the parallel models of LS methods. The main challenges that must be faced when designing a local search 
-algorithm are the efficient distribution of the search process among the CPU and the GPU minimizing the data 
-transfer between them, the hierarchical memory management\index{GPU Challenges!memory~management} and the capacity constraints of GPU memories, the 
-thread synchronization, etc. All these issues must be regarded when designing parallel LS models to allow 
-solving of large scale optimization problems on GPU architectures.\\
-
-To go back to our problem (i. e. Q3AP), we propose in algorithm  \ref{ch8:algoITS} an iterated tabu search\index{Metaheuristics!tabu~search} on GPU 
-(GPU-ITS). The parallel model is in agreement with the iteration-level\index{Metaheuristics!iteration-level parallelism} parallel model of LS methods presented 
-in Section \ref{ch8:sec:paraMeta} (Fig. \ref{ch8:fig:paraMeta}). This algorithm can be seen as a cooperative 
-model between the CPU and the GPU. Indeed, the GPU is used as a coprocessor in a synchronous manner. The 
-resource-consuming part i.e. the generation and evaluation kernel is calculated by the GPU and the rest of 
-the LS process is handled by the CPU. First of all, at initialization stage, memory allocations on GPU are 
-made: the input matrices (distance and flow matrices) and the initial candidate solution of the Q3AP must 
-be allocated (lines 4 and 5). Since GPUs require massive computations with predictable memory accesses, a 
-structure has to be allocated for storing all the neighborhood fitnesses at different addresses (line 6). 
-Second, the matrices and the initial candidate solution have to be copied on the GPU (lines 7 and 8). We 
-notice that the input matrices are read-only structures and do not change during all the execution of the 
-LS algorithm. Therefore, their associated memory is copied only once during all the execution. Third, comes 
-the parallel iteration-level\index{Metaheuristics!iteration-level parallelism}, in which each neighboring solution is generated, evaluated and copied into the 
-neighborhood fitnesses structure (from lines 10 to 13). Fourth, since the order in which candidate neighbors 
-are evaluated is undefined, the neighborhood fitnesses structure has to be copied to the host CPU (line 14). 
-Then the selection strategy is applied to this structure (line 15): the exploration of the neighborhood 
-fitnesses structure is done by the CPU. Finally, after a new candidate has been selected, this latter is 
-copied to the GPU (line 17). The process is repeated until a given number of iterations has been reached.
- 
+The use of GPU devices to speed up the search process of local
+search methods is not a straightforward task. It requires one to
+consider, at the same time, the characteristics and underlying
+issues of the GPU architecture and the parallel models of LS
+methods. The main challenges that must be faced when designing a
+local search algorithm are the efficient distribution of the
+search process between the CPU and the GPU minimizing the data
+transfer between them, the hierarchical memory
+management\index{GPU!memory management} and the
+capacity constraints of GPU memories, and the thread
+synchronization. All these issues must be regarded when designing
+parallel LS models to allow
+solving of large scale optimization problems on GPU architectures.
+
+To go back to our problem (i.e., Q3AP), we propose in
+Algorithm~\ref{ch8:algoITS} an iterated tabu
+search on GPU (GPU-ITS). The
+parallel model is in agreement with the
+iteration-level
+parallel model of LS methods presented in Section
+\ref{ch8:sec:paraMeta} (Fig. \ref{ch8:fig:paraMeta}). This
+algorithm can be seen as a cooperative model between the CPU and
+the GPU. Indeed, the GPU is used as a coprocessor in a synchronous
+manner. The resource-consuming part, i.e., the generation and
+evaluation kernel, is calculated by the GPU and the rest of the LS
+process is handled by the CPU. First of all, at initialization
+stage, memory allocations on GPU are made; the input matrices
+(distance and flow matrices) and the initial candidate solution of
+the Q3AP must be allocated (lines 4 and 5). Since GPUs require
+massive computations with predictable memory accesses, a structure
+has to be allocated for storing all the neighborhood fitnesses at
+different addresses (line 6). Second, the matrices and the initial
+candidate solution have to be copied on the GPU (lines 7 and 8).
+We notice that the input matrices are read-only structures and do
+not change during all the execution of the LS algorithm.
+Therefore, their associated memory is copied only once during all
+the execution. Third, comes the parallel
+iteration-level,
+in which each neighboring solution is generated, evaluated, and
+copied into the neighborhood fitnesses structure (from lines 10 to
+14). Fourth, since the order in which candidate neighbors are
+evaluated is undefined, the neighborhood fitnesses structure has
+to be copied to the host CPU (line 15). Then the selection
+strategy is applied to this structure (line 16): the exploration
+of the neighborhood fitnesses structure is done by the CPU.
+Finally, after a new candidate has been selected, this information
+is copied to the GPU (line 18). The process is repeated until a
+given number of iterations has been reached.
+
  % \begin{algorithm}
  % \caption{Template of an iterated tabu search\index{Metaheuristics!tabu~search} on GPU for solving the Q3AP} \label{LSGPU}
  %  \begin{algorithmic}[1]
  % \begin{algorithm}
  % \caption{Template of an iterated tabu search\index{Metaheuristics!tabu~search} on GPU for solving the Q3AP} \label{LSGPU}
  %  \begin{algorithmic}[1]
@@ -788,79 +1361,95 @@ copied to the GPU (line 17). The process is repeated until a given number of ite
  %  \end{algorithm}
  
  \begin{algorithm}[H]
  %  \end{algorithm}
  
  \begin{algorithm}[H]
-       \SetAlgoLined
-       %\KwData{this text}
-       %\KwResult{how to write algorithm with \LaTeX2e }
-       Choose an initial solution\;
-       Evaluate the solution\;
-       Initialize the tabu list\;
-       Allocate the Q3AP input data on GPU device memory\;
-       Allocate a solution on GPU device memory\;
-       Allocate a neighborhood fitnesses structure on GPU device memory\;
-       Copy the Q3AP input data on GPU device memory\;
-       Copy the solution on GPU device memory\;
-       $t=0$\;
-       \Repeat{a maximum number of iterations is reached}{
-               \For{each generated neighbor on GPU}{
+    \SetAlgoLined
+    %\KwData{this text}
+    %\KwResult{how to write algorithm with \LaTeX2e }
+    Choose an initial solution\;
+    Evaluate the solution\;
+    Initialize the tabu list\;
+    Allocate the Q3AP input data on GPU device memory\;
+    Allocate a solution on GPU device memory\;
+    Allocate a neighborhood fitnesses structure on GPU device memory\;
+    Copy the Q3AP input data on GPU device memory\;
+    Copy the solution on GPU device memory\;
+    $t=0$\;
+    \Repeat{a maximum number of iterations is reached}{
+        \For{each generated neighbor on GPU}{
            Incremental evaluation of the candidate solution\;
            Insert the resulting fitness into the neighborhood fitnesses structure\;
            Incremental evaluation of the candidate solution\;
            Insert the resulting fitness into the neighborhood fitnesses structure\;
-               }
-               Copy the neighborhood fitnesses structure on CPU host memory\;
-       Select the best admissible neighboring solution\;
-       Update the tabu list\;
-       Copy the chosen solution on GPU device memory\;
-       }
-\caption{Template of an iterated tabu search on GPU for solving the Q3AP}
+        }
+        Copy the neighborhood fitnesses structure on CPU host memory\;
+        Select the best admissible neighboring solution\;
+        Update the tabu list\;
+        Copy the chosen solution on GPU device memory\;
+    }
+\caption{template of an iterated tabu search on GPU for solving the Q3AP}
  \label{ch8:algoITS}
  \end{algorithm}
  
  
  \subsection {Experimental results}
  
  \label{ch8:algoITS}
  \end{algorithm}
  
  
  \subsection {Experimental results}
  
-In this section, some experimental results related to the approach presented in Section \ref{ch8:ITS-Q3APSection} are 
-reported. We recall that the approach is a GPU-based iterated tabu search\index{Metaheuristics!tabu~search} (GPU-ITS) method consisting in an iterated 
-local search (ILS) embedding a tabu search\index{Metaheuristics!tabu~search} (TS) and where the generation/evaluation step of the TS process is executed 
-on GPU. The ILS is used to improve the quality of successive local optima provided by TS methods. This is achieved by 
-perturbing the local optima reached by the current TS process and reconsidering it as initial solution of the following 
-TS process. Regarding our algorithm, the applied perturbation is a random number $\mu $ of swaps in either the first or 
-the second permutation where $\mu \in [2:n]$ ($n$ is the instance size). 
-
-
-Experiments have been carried out on a node of the Chirloute cluster of Lille site. 
-The latter is one of the 10 sites that currently make up Grid5000~\cite{grid5000}, 
-the French experimental computational grid. A Chirloute cluster node consists of an Intel Xeon E5620 
-CPU and a NVIDIA Tesla Fermi M2050 (448 cores) GPU type. The number of ILS iterations and the number 
-of TS iterations were set respectively to 20 and 500. The tabu list 
-size has been initalized to $\frac{m}{4}$, $m$ being the size of the neighborhood set. 
-
-Table \ref{ch8:ITSQ3APResults} reports the obtained results for the GPU-ITS using our large-size neighborhood structure.
- The method have been tested on 5 Q3AP instances derived from the QAP Nugent instances in QAPLIB~\cite{burkard1991qaplib}. The average time 
-measurement for 20 executions is reported in seconds and acceleration factors compared to a standalone CPU are 
-also considered. The algorithm is stopped when a maximum number of iterations has been reached or when the 
-optimal (or best known) value has been discovered. Average and max values of the evaluation function of the parallel GPU version have been measured. 
-The number of successful tries (hits) and the average number of ILS iterations to converge to the optimal/best 
-known value are also represented. The associated standard deviation for each average measurement are shown in small size characters.
+In this section, some experimental results related to the approach
+presented in Section \ref{ch8:ITS-Q3APSection} are reported. We
+recall that the approach is a GPU-based iterated tabu
+search (GPU-ITS) method
+consisting in an iterated local search (ILS) embedding a tabu
+search (TS) and where the
+generation/evaluation step of the TS process is executed on GPU.
+The ILS is used to improve the quality of successive local optima
+provided by TS methods. This is achieved by perturbing the local
+optima reached by the current TS process and reconsidering it as
+initial solution of the following TS process. Regarding our
+algorithm, the applied perturbation is a random number $\mu $ of
+swaps in either the first or the second permutation where $\mu \in
+[2:n]$ ($n$ is the instance size).
+
+Experiments have been carried out on a node of the Chirloute
+cluster of the Lille site. This is one of the 10 sites that
+currently make up Grid5000~\cite{grid5000}, the French
+experimental computational grid. A Chirloute cluster node consists
+of an Intel Xeon E5620 CPU and a NVIDIA Tesla Fermi M2050 (448
+cores) GPU type. The number of ILS iterations and the number of TS
+iterations were set respectively to 20 and 500. The tabu list size
+has been initalized to $\frac{m}{4}$, $m$ being the size of the
+neighborhood set.
+
+Table \ref{ch8:ITSQ3APResults} reports the obtained results for
+the GPU-ITS using our large-sized neighborhood structure. The
+method has been tested on 5 Q3AP instances derived from the QAP
+Nugent instances in QAPLIB~\cite{burkard1991qaplib}. The average
+time measurement for 20 executions is reported in seconds and
+acceleration factors compared to a standalone CPU are also
+considered. The algorithm is stopped when a maximum number of
+iterations has been reached or when the optimal (or best known)
+value has been discovered. Average and max values of the
+evaluation function of the parallel GPU version have been
+measured. The number of successful tries (hits) and the average
+number of ILS iterations to converge to the optimal/best known
+value are also represented. The associated standard deviation for
+each average measurement is shown in small type characters.
  
  \begin{table}
  %\tiny
  %\footnotesize
  \small
  
  \begin{table}
  %\tiny
  %\footnotesize
  \small
-\caption{Results of the GPU-based iterated tabu search for different Q3AP instances}
-\label{ch8:ITSQ3APResults} \center
-\begin{tabular}{|l|r|r|r|r|r|r|r|r|} \hline 
-\multicolumn{1}{|c|}{Instance }& \multicolumn{1}{|c|}{Optimal}& \multicolumn{1}{|c|}{Average}  & \multicolumn{1}{|c|}{Maximal}  & \multicolumn{1}{|c|}{Hits} & \multicolumn{1}{|c|}{CPU} & \multicolumn{1}{|c|}{GPU} & \multicolumn{1}{|c|}{Speed-} & \multicolumn{1}{|c|}{ITS} \\ 
-& \multicolumn{1}{|c|}{/BKV\footnotemark }& \multicolumn{1}{|c|}{value }& \multicolumn{1}{|c|}{value }& & \multicolumn{1}{|c|}{time }& \multicolumn{1}{|c|}{time }&  \multicolumn{1}{|c|}{up}& \multicolumn{1}{|c|}{iteration}\\ \hline
-Nug12-d & $580$   & $615.4$    & $744$   & $35$\% & $87.7$ & $2.5$ & $34.7 \times$& $16$ \\ 
-       &   & \tiny{$41.7$}    & && \tiny{$40.9$} & \tiny{$0.9$} & & \tiny{$6.3$} \\ \hline 
-Nug13-d & $1912$  & $1985.4$   & $2100$  & $20$\% & $209.2$ & $3.3$ & $63.5 \times$ & $17$ \\ 
-&  & \tiny{$51.0$}   &  & & \tiny{$1.3$} & \tiny{$1.0$} &  & \tiny{$5.6$} \\ \hline 
-Nug15-d & $2230$  & $2418.1$  & $2580$  & $30$\% & $305.5$ & $5.2$ & $58.8 \times$ & $17$ \\ 
- &  & \tiny{$135.3$}  & & & \tiny{$164.5$} & \tiny{$1.3$} & & \tiny{$5.0$} \\ \hline 
-Nug18-d & $17836$ & $18110.2$ & $18506$ & $10$\% & $1375.9$ & $12.8$ & $107.4 \times$ & $19$ \\ 
-& & \tiny{$157.8$} & & & \tiny{$123.5$} & \tiny{$2.6$} & & \tiny{$4.2$} \\ \hline 
+\begin{tabular}{|l|r|r|r|r|r|r|r|r|} \hline
+\multicolumn{1}{|c|}{Instance }& \multicolumn{1}{|c|}{Optimal}& \multicolumn{1}{|c|}{Average}  & \multicolumn{1}{|c|}{Maximal}  & \multicolumn{1}{|c|}{Hits} & \multicolumn{1}{|c|}{CPU} & \multicolumn{1}{|c|}{GPU} & \multicolumn{1}{|c|}{Speed-} & \multicolumn{1}{|c|}{ITS} \\
+& \multicolumn{1}{|c|}{/BKV\footnotemark }& \multicolumn{1}{|c|}{value }& \multicolumn{1}{|c|}{value }& \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{time }& \multicolumn{1}{|c|}{time }&  \multicolumn{1}{|c|}{up}& \multicolumn{1}{|c|}{iter.}\\ \hline
+Nug12-d & $580$   & $615.4$    & $744$   & $35$\% & $87.7$ & $2.5$ & $34.7 \times$& $16$ \\
+    &   & \tiny{$41.7$}    & && \tiny{$40.9$} & \tiny{$0.9$} & & \tiny{$6.3$} \\ \hline
+Nug13-d & $1912$  & $1985.4$   & $2100$  & $20$\% & $209.2$ & $3.3$ & $63.5 \times$ & $17$ \\
+&  & \tiny{$51.0$}   &  & & \tiny{$1.3$} & \tiny{$1.0$} &  & \tiny{$5.6$} \\ \hline
+Nug15-d & $2230$  & $2418.1$  & $2580$  & $30$\% & $305.5$ & $5.2$ & $58.8 \times$ & $17$ \\
+ &  & \tiny{$135.3$}  & & & \tiny{$164.5$} & \tiny{$1.3$} & & \tiny{$5.0$} \\ \hline
+Nug18-d & $17836$ & $18110.2$ & $18506$ & $10$\% & $1375.9$ & $12.8$ & $107.4 \times$ & $19$ \\
+& & \tiny{$157.8$} & & & \tiny{$123.5$} & \tiny{$2.6$} & & \tiny{$4.2$} \\ \hline
  Nug22-d & $42476$ & $43282.1$ & $44140$ & $25$\% & $4506.5$ & $32.7$ & $137.8 \times$ & $18$ \\
  & & \tiny{$529.6$} & & & \tiny{$341.1$} & \tiny{$6.6$} & & \tiny{$4.0$} \\ \hline
  \end{tabular}
  Nug22-d & $42476$ & $43282.1$ & $44140$ & $25$\% & $4506.5$ & $32.7$ & $137.8 \times$ & $18$ \\
  & & \tiny{$529.6$} & & & \tiny{$341.1$} & \tiny{$6.6$} & & \tiny{$4.0$} \\ \hline
  \end{tabular}
+\caption{Results of the GPU-based iterated tabu search for
+different Q3AP instances.} \label{ch8:ITSQ3APResults} % \center %Shashi
  \end{table}
  
  %\begin{table}
  \end{table}
  
  %\begin{table}
@@ -868,36 +1457,58 @@ Nug22-d & $42476$ & $43282.1$ & $44140$ & $25$\% & $4506.5$ & $32.7$ & $137.8 \t
  %%\footnotesize
  %\caption{Results of the GPU-based iterated tabu search for different Q3AP instances}
  %\label{ch8:ITSQ3APResults} \center
  %%\footnotesize
  %\caption{Results of the GPU-based iterated tabu search for different Q3AP instances}
  %\label{ch8:ITSQ3APResults} \center
-%\begin{tabular}{|c|c|c|c|c|c|c|c|c|} \hline 
+%\begin{tabular}{|c|c|c|c|c|c|c|c|c|} \hline
  %Instance & Opt./BKV\footnotemark & Average val. & Max val. & Hits & CPU time & GPU time & Accel. & ITS iter.\\ \hline
  %Instance & Opt./BKV\footnotemark & Average val. & Max val. & Hits & CPU time & GPU time & Accel. & ITS iter.\\ \hline
-%Nug12-d & $580$   & $615.4_{41.7}$    & $744$   & $35$\% & $87.7_{40.9}$ & $2.5_{0.9}$ & $\times 34.7 $& $16_{6.3}$ \\ \hline 
-%Nug13-d & $1912$  & $1985.4_{51.0}$   & $2100$  & $20$\% & $209.2_{1.3}$ & $3.3_{1.0}$ & $\times 63.5$ & $17_{5.6}$ \\ \hline 
-%Nug15-d & $2230$  & $2418.1_{135.3}$  & $2580$  & $30$\% & $305.5_{164.5}$ & $5.2_{1.3}$ & $\times 58.8 $ & $17_{5.0}$ \\ \hline 
-%Nug18-d & $17836$ & $18110.2_{157.8}$ & $18506$ & $10$\% & $1375.9_{123.5}$ & $12.8_{2.6}$ & $\times 107.4$ & $19_{4.2}$ \\ \hline 
+%Nug12-d & $580$   & $615.4_{41.7}$    & $744$   & $35$\% & $87.7_{40.9}$ & $2.5_{0.9}$ & $\times 34.7 $& $16_{6.3}$ \\ \hline
+%Nug13-d & $1912$  & $1985.4_{51.0}$   & $2100$  & $20$\% & $209.2_{1.3}$ & $3.3_{1.0}$ & $\times 63.5$ & $17_{5.6}$ \\ \hline
+%Nug15-d & $2230$  & $2418.1_{135.3}$  & $2580$  & $30$\% & $305.5_{164.5}$ & $5.2_{1.3}$ & $\times 58.8 $ & $17_{5.0}$ \\ \hline
+%Nug18-d & $17836$ & $18110.2_{157.8}$ & $18506$ & $10$\% & $1375.9_{123.5}$ & $12.8_{2.6}$ & $\times 107.4$ & $19_{4.2}$ \\ \hline
  %Nug22-d & $42476$ & $43282.1_{529.6}$ & $44140$ & $25$\% & $4506.5_{341.1}$ & $32.7_{6.6}$ & $\times 137.8$ & $18_{4.0}$ \\ \hline
  %\end{tabular}
  %\end{table}
  
  \footnotetext{Best known value.}
  
  %Nug22-d & $42476$ & $43282.1_{529.6}$ & $44140$ & $25$\% & $4506.5_{341.1}$ & $32.7_{6.6}$ & $\times 137.8$ & $18_{4.0}$ \\ \hline
  %\end{tabular}
  %\end{table}
  
  \footnotetext{Best known value.}
  
-
-Regarding the execution time, the generation and evaluation of the neighborhood in parallel on GPU provides 
-an efficient way to speedup the search process in comparison with a single CPU. In fact, for the smaller 
-instance Nug12-d, the GPU version starts to be faster than the CPU one (acceleration factor of $34.7~\times$). 
-As long as the problem size increases, the speedup grows significantly (up to $137.8 \times$ for the Nug22-d instance). 
-This means that the use of GPU provides an efficient way to deal with large neighborhoods. Indeed, the 
-proposed neighborhood were unpractical in terms of single CPU computational resources for large Q3AP 
-instances such as Nug22-d (estimated to around $2$ hours per run). So, implementing this algorithm on 
-GPU has allowed to exploit parallelism in such neighborhood and improve the robustness/quality of 
-provided solutions.
+Regarding the execution time, the generation and evaluation of the
+neighborhood in parallel on GPU provides an efficient way to
+speedup the search process in comparison with a single CPU. In
+fact, for the smaller instance Nug12-d, the GPU version starts to
+be faster than the CPU one (acceleration factor of $34.7~\times$).
+As long as the problem size increases, the speedup grows
+significantly (up to $137.8 \times$ for the Nug22-d instance).
+This means that the use of GPU provides an efficient way to deal
+with large neighborhoods. Indeed, the proposed neighborhoods were
+unpractical in terms of single CPU computational resources for
+large Q3AP instances such as Nug22-d (estimated to around $2$
+hours per run). So, implementing this algorithm on GPU has allowed
+the exploitation of  parallelism in such neighborhood and improved
+the robustness/quality of provided solutions.
  
  \section{Conclusion}
  \label{ch8:conclusion}
  
  
  \section{Conclusion}
  \label{ch8:conclusion}
  
-This chapter presents state-of-the-art GPU-based parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} and  a case study on implementing large neighborhood local search methods on GPUs for solving large benchmarks of the quadratic three dimensional  assignment problem (Q3AP). \\
- 
-Implementing parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} on GPU architectures poses new issues and challenges such as memory management\index{GPU Challenges!memory~management},  finding efficient mapping strategies between tasks to parallelize and the GPU threads, thread divergence\index{GPU Challenges!thread~divergence} and synchronization. Actually, most of metaheuristics have been implemented on GPU using different implementation strategies. In this chapter, a two level classification of the reviewed works has been proposed: design level and implementation level. Design level regroups traditional parallel models used for metaheuristics while implementation level refers to the way those models are mapped to the GPU architecture. This classification  focuses mainly on  the mapping between the metaheuristic tasks to parallelize and the GPU threads. Indeed, the choice of a given mapping strategy influences strongly  the other challenges (memory usage, communication, thread divergence\
-index{GPU Challenges!thread~divergence}).  
+This chapter has presented state-of-the-art GPU-based parallel
+metaheuristics and
+a case study on implementing large neighborhood local search
+methods on GPUs for solving large benchmarks of the quadratic
+three-dimensional assignment problem (Q3AP). 
+
+Implementing parallel
+metaheuristics on
+GPU architectures poses new issues and challenges such as memory
+management;  finding
+efficient mapping strategies between tasks to parallelize; and the
+GPU threads, thread divergence, and synchronization. Actually, most
+of metaheuristics have been implemented on GPU using different
+implementation strategies. In this chapter, a two-level
+classification of the reviewed works has been proposed: design
+level and implementation level. Design level regroups traditional
+parallel models used for metaheuristics while implementation level
+refers to the way those models are mapped to the GPU architecture.
+This classification focuses mainly on  the mapping between the
+metaheuristic tasks to parallelize and the GPU threads. Indeed,
+the choice of a given mapping strategy strongly influences the
+other challenges (memory usage, communication, thread
+divergence).
  
  \putbib[Chapters/chapter9/biblio9]
  
  \putbib[Chapters/chapter9/biblio9]
-