\chapterauthor{Nouredine Melab}{Université Lille 1, LIFL/UMR CNRS 8022, 59655-Villeneuve d'Ascq cedex, France}
\chapterauthor{Nouredine Melab}{Université Lille 1, LIFL/UMR CNRS 8022, 59655-Villeneuve d'Ascq cedex, France}
challenges, and issues related to the particularities of the GPU
architecture and a synthesis on the different implementation
strategies used in the literature. The implementation of parallel
challenges, and issues related to the particularities of the GPU
architecture and a synthesis on the different implementation
strategies used in the literature. The implementation of parallel
GPUs is not straightforward. The traditional models used in CPUs
must be rethought to meet the new requirements of GPU
architectures. This chapter is organized as follows. Combinatorial
GPUs is not straightforward. The traditional models used in CPUs
must be rethought to meet the new requirements of GPU
architectures. This chapter is organized as follows. Combinatorial
methods are introduced in Section~\ref{ch8:sec:optim}. The main
traditional parallel models used for metaheuristics are recalled
in Section~\ref{ch8:sec:paraMeta}.
Section~\ref{ch8:sec:challenges} highlights the main challenges
related to the GPU implementation of metaheuristics. A
state-of-the-art of GPU-based parallel
methods are introduced in Section~\ref{ch8:sec:optim}. The main
traditional parallel models used for metaheuristics are recalled
in Section~\ref{ch8:sec:paraMeta}.
Section~\ref{ch8:sec:challenges} highlights the main challenges
related to the GPU implementation of metaheuristics. A
state-of-the-art of GPU-based parallel
-metaheuristics\index{Metaheuristics!parallel~metaheuristics} is
-summarized in Section~\ref{ch8:sec:state}. In Section
-Section~\ref{ch8:sec:GPU_APIs}, the main developed GPU-based
+metaheuristics is
+summarized in Section~\ref{ch8:sec:state}. In Section~\ref{ch8:sec:frameworks}, the main developed GPU-based
frameworks for metaheuristics are described. Finally, a case study
is presented in Section~\ref{ch8:sec:case} and some concluding
remarks are given in Section~\ref{ch8:conclusion}
frameworks for metaheuristics are described. Finally, a case study
is presented in Section~\ref{ch8:sec:case} and some concluding
remarks are given in Section~\ref{ch8:conclusion}
It consists in finding optimal configuration(s) among a finite set of possible configurations
(or solutions) of a given combinatorial optimization problem (COP). The set of all possible solutions noted $S$ is called solution space or search space. Each solution in $S$ is defined by its real cost calculated by an objective function. COPs are generally defined as follows~\cite{blumMeta}:\\ %(Definition~\ref{def:cops})
It consists in finding optimal configuration(s) among a finite set of possible configurations
(or solutions) of a given combinatorial optimization problem (COP). The set of all possible solutions noted $S$ is called solution space or search space. Each solution in $S$ is defined by its real cost calculated by an objective function. COPs are generally defined as follows~\cite{blumMeta}:\\ %(Definition~\ref{def:cops})
methods can be distinguished: \emph{exact methods} and
\emph{approximate methods}. Exact methods allow one to reach
optimal solution(s) of the handled optimization problem with a
methods can be distinguished: \emph{exact methods} and
\emph{approximate methods}. Exact methods allow one to reach
optimal solution(s) of the handled optimization problem with a
class are the \emph{branch and bound technique}, \emph{dynamic
programming}, \emph{constraint programming}, and \emph{A*
algorithm}. However, optimization problems, whether practical or
class are the \emph{branch and bound technique}, \emph{dynamic
programming}, \emph{constraint programming}, and \emph{A*
algorithm}. However, optimization problems, whether practical or
quality solutions in reasonable computation time compared to exact
methods but with no guarantee to find optimal or even bounded
solutions. This is due to the nature of the search process adopted
quality solutions in reasonable computation time compared to exact
methods but with no guarantee to find optimal or even bounded
solutions. This is due to the nature of the search process adopted
search in a subset of the whole search space.\\
Regarding the number of solutions considered at each iteration in
search in a subset of the whole search space.\\
Regarding the number of solutions considered at each iteration in
and iteratively improves it by exploring its neighborhood in the
search space. The most known methods in this class are local
search methods that include \emph{simulated
and iteratively improves it by exploring its neighborhood in the
search space. The most known methods in this class are local
search methods that include \emph{simulated
search}}~\cite{stutzle2006ILSforQAP}, and \emph{variable
neighborhood search}~\cite{HansenMladenovic1997VNS}.\\
search}}~\cite{stutzle2006ILSforQAP}, and \emph{variable
neighborhood search}~\cite{HansenMladenovic1997VNS}.\\
\item Sequential processor architectures have reached their
physical limit which prevents creating faster processors. The
current trend of microprocessor manufacturers consists of placing
\item Sequential processor architectures have reached their
physical limit which prevents creating faster processors. The
current trend of microprocessor manufacturers consists of placing
workstations are multicore processors. In addition, the evolution
of network technologies and the proliferation of broadband
networks have made possible the emergence of clusters of
workstations are multicore processors. In addition, the evolution
of network technologies and the proliferation of broadband
networks have made possible the emergence of clusters of
-models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{Metaheuristics!algorithmic-level parallelism},
-\emph{iteration-level} \index{Metaheuristics!iteration-level parallelism}, and \emph{solution-level} as illustrated in Figure~\ref{ch8:fig:paraMeta}. \\
+models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{metaheuristics!algorithmic-level parallelism},
+\emph{iteration-level}\index{metaheuristics!iteration-level parallelism}, and \emph{solution-level} as illustrated in Figure~\ref{ch8:fig:paraMeta}. \\
start with identical or different solutions (s-metaheuristics
case) or populations (p-metaheuristics case). Their parameter
settings such as the size of tabu list for tabu
start with identical or different solutions (s-metaheuristics
case) or populations (p-metaheuristics case). Their parameter
settings such as the size of tabu list for tabu
same or different. The parallel processes may evolve independently
or in a cooperative manner. In cooperative parallel models, the
algorithms exchange information related to the search during
evolution in order to find better and more robust solutions.}
same or different. The parallel processes may evolve independently
or in a cooperative manner. In cooperative parallel models, the
algorithms exchange information related to the search during
evolution in order to find better and more robust solutions.}
-\item{In the iteration-level\index{Metaheuristics!iteration-level
-parallelism} parallel model, the focus is on the parallelization
+\item{In the iteration-level parallel model, the focus is on the parallelization
of each iteration of the metaheuristic. Indeed, metaheuristics are
generally iterative search processes. Moreover, the most
resource-consuming part of a metaheuristic is the evaluation of
the generated solutions at each iteration. For s-metaheuristics
of each iteration of the metaheuristic. Indeed, metaheuristics are
generally iterative search processes. Moreover, the most
resource-consuming part of a metaheuristic is the evaluation of
the generated solutions at each iteration. For s-metaheuristics
annealing, variable neighborhood search), the evaluation and
generation of the neighborhood is the most time-consuming step of
the algorithm particularly when it comes to dealing with large
neighborhood sets. In this parallel model, the neighborhood is
decomposed into partitions, and each partition is evaluated in a
parallel way. For p-metaheuristics (e.g., evolutionary
annealing, variable neighborhood search), the evaluation and
generation of the neighborhood is the most time-consuming step of
the algorithm particularly when it comes to dealing with large
neighborhood sets. In this parallel model, the neighborhood is
decomposed into partitions, and each partition is evaluated in a
parallel way. For p-metaheuristics (e.g., evolutionary
parallel model arises naturally since these metaheuristics deal
with a population of independent solutions. In evolutionary
parallel model arises naturally since these metaheuristics deal
with a population of independent solutions. In evolutionary
-algorithm\index{Metaheuristics!evolutionary~algorithm}s, for
-instance, the iteration-level\index{Metaheuristics!iteration-level
-parallelism} model consists of decomposing the whole population
+algorithms, for
+instance, the iteration-level model consists of decomposing the whole population
parallel model, the focus is on the parallelization of the
evaluation of a single solution. This model is useful when the
objective function and/or the constraints are time and/or memory
consuming. Unlike the two previous parallel models, the
parallel model, the focus is on the parallelization of the
evaluation of a single solution. This model is useful when the
objective function and/or the constraints are time and/or memory
consuming. Unlike the two previous parallel models, the
not straightforward. The parallel models have to be rethought to
meet the new requirements of the GPU architecture. Several major
issues have to be taken into account both at design and
not straightforward. The parallel models have to be rethought to
meet the new requirements of the GPU architecture. Several major
issues have to be taken into account both at design and
metaheuristics on GPU, the different threads may access multiple
data structures from multiple memory spaces. These memories have
different sizes and access latencies. Nevertheless, faster
metaheuristics on GPU, the different threads may access multiple
data structures from multiple memory spaces. These memories have
different sizes and access latencies. Nevertheless, faster
per thread, one individual per thread, single population per
threads block, one ant per thread, etc.) must be defined to ensure
a maximum occupancy of the GPU and
per thread, one individual per thread, single population per
threads block, one ant per thread, etc.) must be defined to ensure
a maximum occupancy of the GPU and
According to the used metaheuristic and to the handled problem, the data
values may have different types and different ranges of their values. The data
According to the used metaheuristic and to the handled problem, the data
values may have different types and different ranges of their values. The data
synchronization issue is caused by both the GPU architecture and
the synchronization requirements of the implemented method.
Indeed, GPUs are based on a multicore architecture organized into
synchronization issue is caused by both the GPU architecture and
the synchronization requirements of the implemented method.
Indeed, GPUs are based on a multicore architecture organized into
(thousands of threads)~\cite{CUDA}. However, the execution order
of these thousands of threads is unknown by the programmer which
makes the prediction of their execution order a challenging issue.
(thousands of threads)~\cite{CUDA}. However, the execution order
of these thousands of threads is unknown by the programmer which
makes the prediction of their execution order a challenging issue.
threads through the insertion of barrier synchronizations in the
codes to avoid concurrent accesses to data structures and to meet
some requirements related to data-dependent synchronizations.
\subsection{Thread divergence}
threads through the insertion of barrier synchronizations in the
codes to avoid concurrent accesses to data structures and to meet
some requirements related to data-dependent synchronizations.
\subsection{Thread divergence}
another challenging issue in GPU-based
metaheuristics~\cite{cecilia, pugace, audreyANT}. Generally,
metaheuristics contain irregular loops and conditional
another challenging issue in GPU-based
metaheuristics~\cite{cecilia, pugace, audreyANT}. Generally,
metaheuristics contain irregular loops and conditional
(s-metaheuristics), and the population (p-metaheuristics) in the
same block. In addition, the decision to apply a crossover or a
mutation on an individual in a genetic
(s-metaheuristics), and the population (p-metaheuristics) in the
same block. In addition, the decision to apply a crossover or a
mutation on an individual in a genetic
in an SIMD model the threads of a same warp execute the same
instruction. Consequently, the different branches of a conditional
instruction which is data-dependent lead to a serial execution of
in an SIMD model the threads of a same warp execute the same
instruction. Consequently, the different branches of a conditional
instruction which is data-dependent lead to a serial execution of
The performance of GPU-based metaheuristics in terms of execution
time could be improved by choosing the most appropriate parallel
The performance of GPU-based metaheuristics in terms of execution
time could be improved by choosing the most appropriate parallel
-model (algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism}, instruction-level,
-solution-level\index{Metaheuristics!solution-level~parallelism}).
+model (algorithmic-level, instruction-level,
+solution-level).
Moreover, an efficient decomposition of the metaheuristic and an
efficient assignment of code portions between the CPU and GPU
should be adopted. The objective is to take benefit from the GPU
computing power without affecting the efficiency and the behavior
of the metaheuristic and without losing performance in CPU/GPU
Moreover, an efficient decomposition of the metaheuristic and an
efficient assignment of code portions between the CPU and GPU
should be adopted. The objective is to take benefit from the GPU
computing power without affecting the efficiency and the behavior
of the metaheuristic and without losing performance in CPU/GPU
memory accesses. In order to decide which part of the
metaheuristic will be executed on which component, one should
perform a careful analysis on the serial code of the
memory accesses. In order to decide which part of the
metaheuristic will be executed on which component, one should
perform a careful analysis on the serial code of the
then offload them to the GPU, while the remaining
tasks still run on the CPU in a serial way. \\
then offload them to the GPU, while the remaining
tasks still run on the CPU in a serial way. \\
memory which is a slow memory making the memory transfer between
the CPU and GPU time-consuming which can significantly degrade the
performance of the application. Accesses to this memory should be
memory which is a slow memory making the memory transfer between
the CPU and GPU time-consuming which can significantly degrade the
performance of the application. Accesses to this memory should be
\section{State-of-the-art parallel metaheuristics on GPUs}
\label{ch8:sec:state}
After more than two decades of research by the combinatorial optimisation
\section{State-of-the-art parallel metaheuristics on GPUs}
\label{ch8:sec:state}
After more than two decades of research by the combinatorial optimisation
The first works on metaheuristic algorithms implemented on GPUs
started on old graphics cards before the appearance of modern GPUs
equipped with high-level programming interfaces such as CUDA and
The first works on metaheuristic algorithms implemented on GPUs
started on old graphics cards before the appearance of modern GPUs
equipped with high-level programming interfaces such as CUDA and
-of EAs on graphics processing cards and the work by Catala \emph{et al.} in~\cite{catala2007} where the ACO\index{Metaheuristics!ant~colony~optimization} algorithm
-is implemented on old GPU architectures. Yu \emph{et al.}~\cite{yu2005} and
-Li \emph{et al.}~\cite{li2007} proposed a full parallelization of genetic
-algorithm\index{Metaheuristics!genetic~algorithm}s on old GPU architectures using
+of EAs on graphics processing cards and the work by Catala et al. in~\cite{catala2007} where the ACO\index{metaheuristics!ant colony optimization} algorithm
+is implemented on old GPU architectures. Yu et al.~\cite{yu2005} and
+Li et al.~\cite{li2007} proposed a full parallelization of genetic
+algorithms on old GPU architectures using
shader libraries based on Direct3D and OpenGL.\\
Such architectures are based on preconfigured pipelined stages
shader libraries based on Direct3D and OpenGL.\\
Such architectures are based on preconfigured pipelined stages
mappings for large neighborhood structures on GPUs. In this work,
three different neighborhoods are studied and mapped to the
hierarchical GPU for binary problems. The three neighborhoods are
mappings for large neighborhood structures on GPUs. In this work,
three different neighborhoods are studied and mapped to the
hierarchical GPU for binary problems. The three neighborhoods are
-based on the \emph{Hamming} distance. The move operators used in
-the three neighborhoods consider \emph{Hamming} distances of 1,
+based on the Hamming distance. The move operators used in
+the three neighborhoods consider Hamming distances of 1,
2, and 3 (this consists on flipping the binary value of one, two,
or three positions at a time in the candidate binary vector).
In~\cite{luong2010large}, each thread is associated to a unique
2, and 3 (this consists on flipping the binary value of one, two,
or three positions at a time in the candidate binary vector).
In~\cite{luong2010large}, each thread is associated to a unique
more explicitly, how to calculate the memory index of each
solution associated to each CUDA thread's \textit{id}.
%For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes in the candidate vector.
more explicitly, how to calculate the memory index of each
solution associated to each CUDA thread's \textit{id}.
%For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes in the candidate vector.
-The three neighborhoods are implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{Metaheuristics!tabu~search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes.\\ % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
+The three neighborhoods are implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{metaheuristics!tabu search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes.\\ % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
proposed two parallelization strategies inspired by the multi-walk
parallelization strategy, of a 3-opt iterated local
proposed two parallelization strategies inspired by the multi-walk
parallelization strategy, of a 3-opt iterated local
over a CPU/GPU architecture. In the first strategy, each Local
Search (LS) is associated to a unique CUDA thread and improves a
unique solution by generating its neighborhood. The second
over a CPU/GPU architecture. In the first strategy, each Local
Search (LS) is associated to a unique CUDA thread and improves a
unique solution by generating its neighborhood. The second
strategies have been experimented on standard benchmarks of
the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cities. The results indicate that increasing the number of solutions to be explored simultaneously improves the speedup in the two strategies, but at the same time it decreases final solution quality. The greater speedup factor reached by the second strategy is $6 \times$.\\
strategies have been experimented on standard benchmarks of
the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cities. The results indicate that increasing the number of solutions to be explored simultaneously improves the speedup in the two strategies, but at the same time it decreases final solution quality. The greater speedup factor reached by the second strategy is $6 \times$.\\
in~\cite{luongMultiStart} to implement multistart parallel local
search algorithms (a special case of the
in~\cite{luongMultiStart} to implement multistart parallel local
search algorithms (a special case of the
parallelism: several LS algorithms are managed by the CPU and the
neighborhood evaluation step of each algorithm is parallelized on
the GPU (each GPU thread is associated with one neighbor and
executes the same evaluation function kernel). The advantage of
such a model is that it allows a high occupancy of the GPU
parallelism: several LS algorithms are managed by the CPU and the
neighborhood evaluation step of each algorithm is parallelized on
the GPU (each GPU thread is associated with one neighbor and
executes the same evaluation function kernel). The advantage of
such a model is that it allows a high occupancy of the GPU
quantity of data to store and to communicate between CPU and
GPU. A second proposition for implementing the same model on GPU
consists of implementing the whole LS processes on GPU with each
GPU thread being associated to a unique LS algorithm. This solves
the communication issue encountered in the first model. In
quantity of data to store and to communicate between CPU and
GPU. A second proposition for implementing the same model on GPU
consists of implementing the whole LS processes on GPU with each
GPU thread being associated to a unique LS algorithm. This solves
the communication issue encountered in the first model. In
due to uncoalesced memory accesses. The proposed approaches are
implemented on the quadratic assignment problem (QAP) using CUDA.
The acceleration rates obtained for the
due to uncoalesced memory accesses. The proposed approaches are
implemented on the quadratic assignment problem (QAP) using CUDA.
The acceleration rates obtained for the
algorithms for TSP and the flow-shop scheduling problem (FSP).
These algorithms are based on a multistart tabu
algorithms for TSP and the flow-shop scheduling problem (FSP).
These algorithms are based on a multistart tabu
algorithms exploit multicore CPU and GPU. A full parallelization
on GPU is adopted using shader libraries where each thread is
algorithms exploit multicore CPU and GPU. A full parallelization
on GPU is adopted using shader libraries where each thread is
However, even though their experiments report that the use of GPU
speedups the serial execution almost $16 \times$, the mapping of
However, even though their experiments report that the use of GPU
speedups the serial execution almost $16 \times$, the mapping of
implementation of hybrid metaheuristics on heterogeneous parallel
architectures (multicore CPU coupled to one GPU). The challenge
of using such a heterogeneous architecture is how to distribute
tasks between the CPU cores and the GPU in such a way to have
optimal performances. Among the three traditional parallel models
implementation of hybrid metaheuristics on heterogeneous parallel
architectures (multicore CPU coupled to one GPU). The challenge
of using such a heterogeneous architecture is how to distribute
tasks between the CPU cores and the GPU in such a way to have
optimal performances. Among the three traditional parallel models
-(solution-level\index{Metaheuristics!solution-level~parallelism},
-iteration-level\index{Metaheuristics!iteration-level parallelism},
-and algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism}), the authors pointed out that the most convenient
+(solution-level,
+iteration-level
+and algorithmic-level), the authors point out that the most convenient
-iteration-level\index{Metaheuristics!iteration-level parallelism}
-and algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism} models. Several CPU threads execute several instances
+iteration-level
+and algorithmic-level models. Several CPU threads execute several instances
of the same S-metaheuristic in parallel while the GPU device is
associated to one CPU core and used to accelerate the
neighborhood calculation of several S-metaheuristics at the same
time.
In order to efficiently exploit the remaining CPU cores, a load-balancing heuristic is also proposed in order to decide on the number of additional
of the same S-metaheuristic in parallel while the GPU device is
associated to one CPU core and used to accelerate the
neighborhood calculation of several S-metaheuristics at the same
time.
In order to efficiently exploit the remaining CPU cores, a load-balancing heuristic is also proposed in order to decide on the number of additional
All the previously noted works exploit the same parallel models
used on CPUs based on the task parallelism. A different
implementation approach is proposed by Paul in~\cite{gerald2012}
to implement a simulated
All the previously noted works exploit the same parallel models
used on CPUs based on the task parallelism. A different
implementation approach is proposed by Paul in~\cite{gerald2012}
to implement a simulated
for the QAP on GPUs. Indeed, the author used a preinitialized
matrix \emph{delta} in which the incremental evaluation of simple
swap moves are calculated and stored relative to the initial
permutation $p$. For the GPU implementation, the authors used the
parallel implementation of neighborhood exploration. The
time-consuming tasks in the SA-matrix are identified by the
for the QAP on GPUs. Indeed, the author used a preinitialized
matrix \emph{delta} in which the incremental evaluation of simple
swap moves are calculated and stored relative to the initial
permutation $p$. For the GPU implementation, the authors used the
parallel implementation of neighborhood exploration. The
time-consuming tasks in the SA-matrix are identified by the
the next accepted move. To initialize the delta matrix, several
threads from different blocks explore different segments of the
matrix (different moves) at the same time. In order to select the
the next accepted move. To initialize the delta matrix, several
threads from different blocks explore different segments of the
matrix (different moves) at the same time. In order to select the
the last move, a group of threads explores different subsets of
the delta matrix. The shared memory is used to preload all the
necessary elements for a given group of threads responsible for
the updating of the delta matrix. The main difference in this work
compared to the previous works resides in the introduction of a
data parallelism using the precalculated delta matrix. The use of
the last move, a group of threads explores different subsets of
the delta matrix. The shared memory is used to preload all the
necessary elements for a given group of threads responsible for
the updating of the delta matrix. The main difference in this work
compared to the previous works resides in the introduction of a
data parallelism using the precalculated delta matrix. The use of
involved in the evaluation of a single move. Experimentations are
done on standard QAP instances from the
QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are
achieved by the GPU implementation compared
to the same sequential implementation on CPU using SA-matrix.\\
involved in the evaluation of a single move. Experimentations are
done on standard QAP instances from the
QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are
achieved by the GPU implementation compared
to the same sequential implementation on CPU using SA-matrix.\\
State-of-the-art works dealing with the implementation of
p-metaheuristics on GPUs generally rely on parallel models and
State-of-the-art works dealing with the implementation of
p-metaheuristics on GPUs generally rely on parallel models and
supercomputers, clusters, and computational grids. Three main
classes of p-metaheuristics are considered in this section:
evolutionary
supercomputers, clusters, and computational grids. Three main
classes of p-metaheuristics are considered in this section:
evolutionary
-algorithm\index{Metaheuristics!evolutionary~algorithm}s (EAs), ant
-colony\index{Metaheuristics!ant~colony~optimization} optimization
-(ACO\index{Metaheuristics!ant~colony~optimization}), and particle
-swarm optimization\index{Metaheuristics!particle swarm
+algorithms (EAs), ant
+colony optimization
+(ACO), and particle
+swarm optimization\index{metaheuristics!particle swarm
(islands), master-slave models used for the parallelization of CPU
intensive steps (evaluation and transformation), and cellular
models based on the use of one population disposed (generally) on
(islands), master-slave models used for the parallelization of CPU
intensive steps (evaluation and transformation), and cellular
models based on the use of one population disposed (generally) on
In~\cite{kannan}, Kannan and Ganji present a CUDA implementation
of the drug discovery application Autodock (molecular docking
application). Autodock uses a genetic
In~\cite{kannan}, Kannan and Ganji present a CUDA implementation
of the drug discovery application Autodock (molecular docking
application). Autodock uses a genetic
docking positions of a ligand to a protein. The most
time-consuming task in Autodock is the fitness function
evaluation. The fitness function used for a docking problem
docking positions of a ligand to a protein. The most
time-consuming task in Autodock is the fitness function
evaluation. The fitness function used for a docking problem
master-slave parallelism of EAs on GPUs using EASEA. EASEA is a
C-like metalanguage for easy development of EAs. The user writes a
master-slave parallelism of EAs on GPUs using EASEA. EASEA is a
C-like metalanguage for easy development of EAs. The user writes a
problem representation, etc) in EASEA. The code is then compiled
to obtain a ready-to-use evolutionary
problem representation, etc) in EASEA. The code is then compiled
to obtain a ready-to-use evolutionary
Libraries to produce C++ or JAVA written EA codes.
In~\cite{maitre2009}, the authors proposed an extension of EASEA
to produce CUDA code from the EASEA files. This extension has been
Libraries to produce C++ or JAVA written EA codes.
In~\cite{maitre2009}, the authors proposed an extension of EASEA
to produce CUDA code from the EASEA files. This extension has been
real problem (molecular structure prediction). In order to
maximize the GPU occupation, very large populations are used (from
$2000$ to $20000$). Even though transferring such large
real problem (molecular structure prediction). In order to
maximize the GPU occupation, very large populations are used (from
$2000$ to $20000$). Even though transferring such large
-populations from the CPU to the GPU device memory at every generation is very costly, the authors reported important speedups on the two problems on a GTX260 card: $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management\index{GPU Challenges!memory~management} of the populations on GPU.\\
+populations from the CPU to the GPU device memory at every generation is very costly, the authors report important speedups on the two problems on a GTX260 card: $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management of the populations on GPU.\\
The master-slave model is efficient when the fitness function is
highly time intensive. Nevertheless, it requires the use of
large-sized populations in order to saturate the GPU unless the
per-block is used (one individual perblock) when the acceleration
of the fitness function itself is possible. The use of many
The master-slave model is efficient when the fitness function is
highly time intensive. Nevertheless, it requires the use of
large-sized populations in order to saturate the GPU unless the
per-block is used (one individual perblock) when the acceleration
of the fitness function itself is possible. The use of many
when only population evaluation is performed on GPU. Each
population island is mapped with a CUDA thread block and each
thread is responsible for a unique individual. Subpopulations are
when only population evaluation is performed on GPU. Each
population island is mapped with a CUDA thread block and each
thread is responsible for a unique individual. Subpopulations are
The same strategy is also adopted by Tsutsui and Fujimoto
in~\cite{tsutsuiGAQAP} to implement a coarse-grained genetic
The same strategy is also adopted by Tsutsui and Fujimoto
in~\cite{tsutsuiGAQAP} to implement a coarse-grained genetic
the QAP. Initially, several subpopulations are created on CPU and
transferred to the global memory. The subpopulations are organized
in the global memory into blocks of $8$ individuals in such a way
the QAP. Initially, several subpopulations are created on CPU and
transferred to the global memory. The subpopulations are organized
in the global memory into blocks of $8$ individuals in such a way
QAP benchmarks from the QAPLIB~\cite{burkard1991qaplib}. The GPU
implementation reached speedups of $2.9\times$ to $12.6 \times$
compared to a single core implementation of a coarse-grained
QAP benchmarks from the QAPLIB~\cite{burkard1991qaplib}. The GPU
implementation reached speedups of $2.9\times$ to $12.6 \times$
compared to a single core implementation of a coarse-grained
-Nowotniak \emph{et al.}~\cite{nowotniak} proposed a GPU-based
-implementation of a quantum inspired genetic
-algorithm\index{Metaheuristics!genetic~algorithm} (QIGA). The used
-parallel model is a hierarchical model based on two levels: each
+Nowotniak and Kucharski~\cite{nowotniak} proposed a GPU-based
+implementation of a Quantum Inspired Genetic Algorithm (QIGA). The
+parallel model used is a hierarchical model based on two levels: each
thread in a block transforms a unique individual and a different
population is assigned to each block. The algorithm is run
entirely on GPU. A different instance of the QIGA is run on each
block and the computations have been shared between 8 GPUs. This
approach is very convenient to speed up the experimental process
on metaheuristics that require several independent runs of the
thread in a block transforms a unique individual and a different
population is assigned to each block. The algorithm is run
entirely on GPU. A different instance of the QIGA is run on each
block and the computations have been shared between 8 GPUs. This
approach is very convenient to speed up the experimental process
on metaheuristics that require several independent runs of the
populations are stored in the shared memory, the data matrix used
for fitness evaluation is placed in read only constant memory, and
finally seeds for random numbers generated on the GPU are stored
populations are stored in the shared memory, the data matrix used
for fitness evaluation is placed in read only constant memory, and
finally seeds for random numbers generated on the GPU are stored
GPUs. Indeed, the fine-grained parallelism of EAs fits perfectly
to the SIMD architecture of the GPU. \\
GPUs. Indeed, the fine-grained parallelism of EAs fits perfectly
to the SIMD architecture of the GPU. \\
GraphCell, to solve the independent task scheduling problem on GPU
architectures. In CGAs, the population is arranged into a
two-dimensional toroidal grid where only neighboring solutions are
GraphCell, to solve the independent task scheduling problem on GPU
architectures. In CGAs, the population is arranged into a
two-dimensional toroidal grid where only neighboring solutions are
grid sizes ranging from $32^2$ to $512^2$. The speedups reached by
the GPU implementation against the CPU version range from
$5\times$ to $24\times$ and increase as the size of the population
grid sizes ranging from $32^2$ to $512^2$. The speedups reached by
the GPU implementation against the CPU version range from
$5\times$ to $24\times$ and increase as the size of the population
version for all the tested benchmarks when the size of the
population is set to $32^2$. When the size of the population is
too small, there are not enough computations to cover the overhead
created by the call of kernel functions, CPU/GPU
version for all the tested benchmarks when the size of the
population is set to $32^2$. When the size of the population is
too small, there are not enough computations to cover the overhead
created by the call of kernel functions, CPU/GPU
synchronization, and access to global memory. Finally, an
interesting review on GPU parallel computation in bio-inspired
synchronization, and access to global memory. Finally, an
interesting review on GPU parallel computation in bio-inspired
accelerating the tour construction step performed by each ant by
taking a task-based parallelism approach, with pheromone
deposition on the CPU.\\
accelerating the tour construction step performed by each ant by
taking a task-based parallelism approach, with pheromone
deposition on the CPU.\\
the two steps (tour construction and pheromone update) are
parallelized on the GPU. A data parallelism approach is used to
enhance the performance of the tour construction step. The
the two steps (tour construction and pheromone update) are
parallelized on the GPU. A data parallelism approach is used to
enhance the performance of the tour construction step. The
per matrix entry). The achieved speedups are $21 \times$ for tour
construction and $20 \times$ for pheromone updates.\\
per matrix entry). The achieved speedups are $21 \times$ for tour
construction and $20 \times$ for pheromone updates.\\
on GPU to solve the QAP. A solution of QAP is represented as a
permutation of ${1,2,..,n}$ with $n$ being the size of the
problem. The TS algorithm is based on the 2-opt neighborhood
(swapping of two elements $(i,j)$ in the permutation). The authors
point out that the move cost of each neighbor depends on the
couple $(i,j)$. Two groups of moves are formed according to the
on GPU to solve the QAP. A solution of QAP is represented as a
permutation of ${1,2,..,n}$ with $n$ being the size of the
problem. The TS algorithm is based on the 2-opt neighborhood
(swapping of two elements $(i,j)$ in the permutation). The authors
point out that the move cost of each neighbor depends on the
couple $(i,j)$. Two groups of moves are formed according to the
-move cost. In order to avoid thread divergence\index{GPU
-Challenges!thread~divergence} within the same warp, the
+move cost. In order to avoid thread divergence\index{GPU!thread divergence} within the same warp, the
neighborhood evaluation is parallelized in such a way to assign
only moves of the same cost to each thread warp. This strategy is
called MATA for Move-cost Adjusted Thread Assignment. Concerning
neighborhood evaluation is parallelized in such a way to assign
only moves of the same cost to each thread warp. This strategy is
called MATA for Move-cost Adjusted Thread Assignment. Concerning
-the memory management\index{GPU Challenges!memory~management}, all
-the data of the ACO\index{Metaheuristics! ant~colony~optimization}
+the memory management\index{GPU!memory management}, all
+the data of the ACO\index{metaheuristics!ant colony optimization}
(population, pheromone matrix), QAP matrices, and tabu list are
placed on the global memory of the GPU. Shared memory is used only
for working data common to all threads in a given block.
All the
steps of the hybrid algorithm
(population, pheromone matrix), QAP matrices, and tabu list are
placed on the global memory of the GPU. Shared memory is used only
for working data common to all threads in a given block.
All the
steps of the hybrid algorithm
pheromone update, construct solutions, applying TS) are
implemented as kernel functions on the GPU. The GPU/CPU
communications are only used to transfer the best-so-far solution
pheromone update, construct solutions, applying TS) are
implemented as kernel functions on the GPU. The GPU/CPU
communications are only used to transfer the best-so-far solution
In~\cite{zhou2009} Zhou and Tan propose a full GPU implementation
of a standard PSO algorithm. All the data is stored in global
memory (velocities, positions, swarm population, etc). Only
In~\cite{zhou2009} Zhou and Tan propose a full GPU implementation
of a standard PSO algorithm. All the data is stored in global
memory (velocities, positions, swarm population, etc). Only
illustrated in Figure~\ref{ch8:fig:classification}. The design
level regroups the three classes of parallel models used in
metaheuristics
illustrated in Figure~\ref{ch8:fig:classification}. The design
level regroups the three classes of parallel models used in
metaheuristics
-(solution-level\index{Metaheuristics!solution-level~parallelism},
-iteration-level\index{Metaheuristics!iteration-level parallelism},
-algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism}) with examples for s-metaheuristics, EAs,
-ACO\index{Metaheuristics!ant~colony~optimization} and PCO. This
+(solution-level,
+iteration-level,
+algorithmic-level) with examples for s-metaheuristics, EAs,
+ACO and PCO. This
classification is principally built from the reviewed
state-of-the-art works in the previous section. The
implementation level refers to the way these parallel design
classification is principally built from the reviewed
state-of-the-art works in the previous section. The
implementation level refers to the way these parallel design
models at solution level consist of parallelizing a time intensive
atomic task of the algorithm. Generally, it consists of the
fitness evaluation~\cite{kannan}. Nevertheless, crossover
models at solution level consist of parallelizing a time intensive
atomic task of the algorithm. Generally, it consists of the
fitness evaluation~\cite{kannan}. Nevertheless, crossover
%data parallelism in SA-matrix to parallelize
\subsubsection*{GPU thread mapping for iteration-level parallelism}
%data parallelism in SA-matrix to parallelize
\subsubsection*{GPU thread mapping for iteration-level parallelism}
Iteration-level parallelism consists of parallelizing the tasks
performed independently on different solutions. Different mapping
strategies are adopted in the reviewed works to implement these
models.\\
In Figure \ref{ch8:fig:classification}, the first example of
Iteration-level parallelism consists of parallelizing the tasks
performed independently on different solutions. Different mapping
strategies are adopted in the reviewed works to implement these
models.\\
In Figure \ref{ch8:fig:classification}, the first example of
parallelism is the parallel evaluation of neighborhoods in
s-metaheuristics. In most of the reviewed works, a per-thread
mapping approach is used: each solution of the neighborhood is
parallelism is the parallel evaluation of neighborhoods in
s-metaheuristics. In most of the reviewed works, a per-thread
mapping approach is used: each solution of the neighborhood is
%pheromone update data parallelism matrices
\subsubsection*{GPU thread mapping for algorithmic-level parallelism}
%pheromone update data parallelism matrices
\subsubsection*{GPU thread mapping for algorithmic-level parallelism}
-\index{GPU-based Metaheuristics!GPU-thread mapping}
-Algorithmic-level parallelism consists of launching several self-contained algorithms in parallel. In the previously reviewed works two algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} models have been used: the multistart model and the island model (parallel EAs).\\
+Algorithmic-level parallelism consists of launching several self-contained algorithms in parallel. In the previously reviewed works two algorithmic-level models have been used: the multistart model and the island model (parallel EAs).\\
The implementation of the multistart model is based on two
different mapping strategies~\cite{luongMultiStart, audreyANT}:
The implementation of the multistart model is based on two
different mapping strategies~\cite{luongMultiStart, audreyANT}:
(2) each solution (from multiple neighborhoods) is associated to a
unique thread. The first mapping strategy (one thread per LS)
presents a big drawback: the number of LS to use should be very
(2) each solution (from multiple neighborhoods) is associated to a
unique thread. The first mapping strategy (one thread per LS)
presents a big drawback: the number of LS to use should be very
are placed on CPU and the neighborhood evaluation of each LS is
parallelized on GPU using per-thread mapping strategy (one thread
per solution). This consists of a hierarchical parallel model
are placed on CPU and the neighborhood evaluation of each LS is
parallelized on GPU using per-thread mapping strategy (one thread
per solution). This consists of a hierarchical parallel model
-combining algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism} parallelism (multistart) with
-iteration-level\index{Metaheuristics!iteration-level parallelism}
+combining algorithmic-level parallelism (multistart) with
+iteration-level
parallelism (master-worker).\\
In the island model, the same mapping is used in all the reviewed
parallelism (master-worker).\\
In the island model, the same mapping is used in all the reviewed
PUGACE\index{GPU-based frameworks!PUGACE}~\cite{pugace},
ParadisEO-MO-GPU\index{GPU-based
frameworks!ParadisEO-MO-GPU}~\cite{paradiseoGPU}, and
PUGACE\index{GPU-based frameworks!PUGACE}~\cite{pugace},
ParadisEO-MO-GPU\index{GPU-based
frameworks!ParadisEO-MO-GPU}~\cite{paradiseoGPU}, and
frameworks!PUGACE} is a framework for implementing EAs on GPUs.
ParadisEO-MO-GPU is an extension of the framework
ParadisEO~\cite{paradiseo} for implementing s-metaheuristics on
frameworks!PUGACE} is a framework for implementing EAs on GPUs.
ParadisEO-MO-GPU is an extension of the framework
ParadisEO~\cite{paradiseo} for implementing s-metaheuristics on
-GPU. Finally, libCudaOptimize\index{GPU-based
-frameworks!libCudaOptimize} is a library intended for the
+GPU. Finally, libCUDAOptimize\index{GPU-based
+frameworks!libCUDAOptimize} is a library intended for the
implementation of p-metaheuristics on GPU. The three frameworks
are presented in more detail in the following.
implementation of p-metaheuristics on GPU. The three frameworks
are presented in more detail in the following.
-\subsection{PUGACE\index{GPU-based frameworks!PUGACE}: framework for implementing evolutionary computation on GPUs}
-PUGACE\index{GPU-based frameworks!PUGACE} is a generic framework
+\subsection{PUGACE: framework for implementing evolutionary computation on GPUs}
+PUGACE is a generic framework
for easy implementation of cellular evolutionary algorithms on
GPUs implemented using C and CUDA. It is based on the frameworks
MALLBA and JCell (a framework for cellular algorithms). The
authors justified the choice of cellular evolutionary
for easy implementation of cellular evolutionary algorithms on
GPUs implemented using C and CUDA. It is based on the frameworks
MALLBA and JCell (a framework for cellular algorithms). The
authors justified the choice of cellular evolutionary
feedback found in the literature concerning its efficient
implementation on GPUs compared to other parallel models for EAs
(island, master-slave). The main standard evolutionary operators
feedback found in the literature concerning its efficient
implementation on GPUs compared to other parallel models for EAs
(island, master-slave). The main standard evolutionary operators
-are already implemented in PUGACE\index{GPU-based
-frameworks!PUGACE}: different selection strategies, standard
-crossover, and mutation operators (\emph{PMX, swap, 2-exchange},
+are already implemented in PUGACE: different selection strategies, standard
+crossover, and mutation operators (PMX, swap, 2-exchange,
etc.). Different problem encoding is also supported. The framework
is organized as a set of modules in which the different
functionalities are separated as much as possible in order to
etc.). Different problem encoding is also supported. The framework
is organized as a set of modules in which the different
functionalities are separated as much as possible in order to
individual is associated to a unique CUDA thread. The function
evaluation and mutation are done on the GPU while selection and
replacement are maintained on the CPU. In order to avoid thread
individual is associated to a unique CUDA thread. The function
evaluation and mutation are done on the GPU while selection and
replacement are maintained on the CPU. In order to avoid thread
the same CUDA thread block at the crossover step (because of the
probability of application which may give different results from
one thread to the other), the decision of whether to apply a
the same CUDA thread block at the crossover step (because of the
probability of application which may give different results from
one thread to the other), the decision of whether to apply a
individuals within the block. It is the decision on the choose
of the cutting point for the crossover.\\
individuals within the block. It is the decision on the choose
of the cutting point for the crossover.\\
\times$ are achieved compared to a CPU implementation of a cEA
using population sizes rising from 512 to 1024 and from 1024 to
2048.
\times$ are achieved compared to a CPU implementation of a cEA
using population sizes rising from 512 to 1024 and from 1024 to
2048.
framework ParadisEO-MO-GPU\index{GPU-based
frameworks!ParadisEO-MO-GPU} for parallel local search
metaheuristics (s-metaheuristics) on GPUs. It focuses on the
framework ParadisEO-MO-GPU\index{GPU-based
frameworks!ParadisEO-MO-GPU} for parallel local search
metaheuristics (s-metaheuristics) on GPUs. It focuses on the
parallel model of s-metaheuristics which consists of exploring in
parallel the neighborhood of a problem solution on GPU. The
framework, implemented using C++ and CUDA, is an extension of the
parallel model of s-metaheuristics which consists of exploring in
parallel the neighborhood of a problem solution on GPU. The
framework, implemented using C++ and CUDA, is an extension of the
Nevertheless, the library is designed in such a way to allow
further extensions for other metaheuristics and it is still in
development phase by the authors. The parallelization strategy
Nevertheless, the library is designed in such a way to allow
further extensions for other metaheuristics and it is still in
development phase by the authors. The parallelization strategy
-search\index{GPU-based Metaheuristics!GPU-based~local~search}
-method for solving the quadratic 3-dimensional assignment problem
-(Q3AP) will be presented. The local search method is an iterated
-local search\index{Metaheuristics!iterated local search}
+search\index{GPU!based local search}
+method for solving the Quadratic 3-dimensional Assignment Problem
+(Q3AP) will be presented. The local search method is an Iterated
+Local Search\index{metaheuristics!iterated local search}
consists of executing iteratively the embedded local search, each
iteration which starts from a disrupted local optima reached by
the previous local search process. The disruption heuristic is a
consists of executing iteratively the embedded local search, each
iteration which starts from a disrupted local optima reached by
the previous local search process. The disruption heuristic is a
\subsection{The quadratic 3-dimensional assignment problem}
The Q3AP is an extension of the well-known NP-hard QAP. The latter
\subsection{The quadratic 3-dimensional assignment problem}
The Q3AP is an extension of the well-known NP-hard QAP. The latter
optimization community due to its wide range of practical
applications (site planning, schedule problem, computer-aided
design, etc.) and to its computational challenges since it is
optimization community due to its wide range of practical
applications (site planning, schedule problem, computer-aided
design, etc.) and to its computational challenges since it is
The Q3AP was first introduced by William P. Pierskalla in
1967~\cite{Pierskalla1967Q3AP} and, unlike the QAP, the Q3AP is a
less studied problem. Indeed, the Q3AP was revisited only this
The Q3AP was first introduced by William P. Pierskalla in
1967~\cite{Pierskalla1967Q3AP} and, unlike the QAP, the Q3AP is a
less studied problem. Indeed, the Q3AP was revisited only this
assignment problems such as the symbol-mapping problem encountered
in wireless communication systems and described
in~\cite{Hahn2008q3ap}. The Q3AP optimization problem can be
assignment problems such as the symbol-mapping problem encountered
in wireless communication systems and described
in~\cite{Hahn2008q3ap}. The Q3AP optimization problem can be
- x_{ijl}\in \{0,1\}, \quad i,j,l=0,1,...,n-1. \label{Q3APConstraints2}
+ x_{ijl}\in \{0,1\}, \quad i,j,l=0,1,...,n-1 \label{Q3APConstraints2}
\end{eqnarray} $I$, $J$, and $L$ sets are defined as follows:
\begin{displaymath} I=\lbrace X=(x_{ijl}):
\sum_{j=0}^{n-1}\sum_{l=0}^{n-1}x_{ijl}=1, \quad
\end{eqnarray} $I$, $J$, and $L$ sets are defined as follows:
\begin{displaymath} I=\lbrace X=(x_{ijl}):
\sum_{j=0}^{n-1}\sum_{l=0}^{n-1}x_{ijl}=1, \quad
j=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
\begin{displaymath} L=\lbrace X=(x_{ijl}):
\sum_{i=0}^{n-1}\sum_{j=0}^{n-1}x_{ijl}=1, \quad
j=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
\begin{displaymath} L=\lbrace X=(x_{ijl}):
\sum_{i=0}^{n-1}\sum_{j=0}^{n-1}x_{ijl}=1, \quad
-l=0,1,...,n-1\rbrace \mathrm{.} \end{displaymath}
+l=0,1,...,n-1\rbrace \end{displaymath}
% fin------------------------------------------------------------------------------
Other equivalent formulations of the Q3AP can be found in the literature. A particularly useful one is the
% fin------------------------------------------------------------------------------
Other equivalent formulations of the Q3AP can be found in the literature. A particularly useful one is the
\subsection{Iterated tabu search algorithm for the Q3AP}
To tackle large-sized instances of the Q3AP and speed up the
\subsection{Iterated tabu search algorithm for the Q3AP}
To tackle large-sized instances of the Q3AP and speed up the
-search process, a parallel ILS\index{Metaheuristics!iterated local
-search} algorithm has been designed. The local search embedded in
-the ILS is a TS\index{Metaheuristics!tabu~search}. A TS
+search process, a parallel ILS algorithm has been designed. The local search embedded in
+the ILS is a TS. A TS
procedure~\cite{Glover1989TS} starts from an initial feasible
solution and tries, at each step, to move to a neighboring
solution that minimizes the fitness (for a minimization case). If
no such move exists, the neighbor solution that less degrades the
procedure~\cite{Glover1989TS} starts from an initial feasible
solution and tries, at each step, to move to a neighboring
solution that minimizes the fitness (for a minimization case). If
no such move exists, the neighbor solution that less degrades the
escape local optima. However, this strategy may generate cycles,
i.e., previous moves can be selected again. To avoid cycles, the
TS manages a short-term memory that contains the moves that have
escape local optima. However, this strategy may generate cycles,
i.e., previous moves can be selected again. To avoid cycles, the
TS manages a short-term memory that contains the moves that have
-been recently performed. A TS\index{Metaheuristics!tabu~search}
-template is given by the Algorithm \ref{TS_pseudo_code}.\\
+been recently performed. A TS
+template is given by Algorithm \ref{TS_pseudo_code}.\\
local search algorithm are the efficient distribution of the
search process between the CPU and the GPU minimizing the data
transfer between them, the hierarchical memory
local search algorithm are the efficient distribution of the
search process between the CPU and the GPU minimizing the data
transfer between them, the hierarchical memory
capacity constraints of GPU memories, and the thread
synchronization. All these issues must be regarded when designing
parallel LS models to allow
capacity constraints of GPU memories, and the thread
synchronization. All these issues must be regarded when designing
parallel LS models to allow
To go back to our problem (i.e., Q3AP), we propose in
Algorithm~\ref{ch8:algoITS} an iterated tabu
To go back to our problem (i.e., Q3AP), we propose in
Algorithm~\ref{ch8:algoITS} an iterated tabu
parallel model of LS methods presented in Section
\ref{ch8:sec:paraMeta} (Fig. \ref{ch8:fig:paraMeta}). This
algorithm can be seen as a cooperative model between the CPU and
parallel model of LS methods presented in Section
\ref{ch8:sec:paraMeta} (Fig. \ref{ch8:fig:paraMeta}). This
algorithm can be seen as a cooperative model between the CPU and
not change during all the execution of the LS algorithm.
Therefore, their associated memory is copied only once during all
the execution. Third, comes the parallel
not change during all the execution of the LS algorithm.
Therefore, their associated memory is copied only once during all
the execution. Third, comes the parallel
in which each neighboring solution is generated, evaluated, and
copied into the neighborhood fitnesses structure (from lines 10 to
14). Fourth, since the order in which candidate neighbors are
in which each neighboring solution is generated, evaluated, and
copied into the neighborhood fitnesses structure (from lines 10 to
14). Fourth, since the order in which candidate neighbors are
In this section, some experimental results related to the approach
presented in Section \ref{ch8:ITS-Q3APSection} are reported. We
recall that the approach is a GPU-based iterated tabu
In this section, some experimental results related to the approach
presented in Section \ref{ch8:ITS-Q3APSection} are reported. We
recall that the approach is a GPU-based iterated tabu
generation/evaluation step of the TS process is executed on GPU.
The ILS is used to improve the quality of successive local optima
provided by TS methods. This is achieved by perturbing the local
generation/evaluation step of the TS process is executed on GPU.
The ILS is used to improve the quality of successive local optima
provided by TS methods. This is achieved by perturbing the local
\small
\begin{tabular}{|l|r|r|r|r|r|r|r|r|} \hline
\multicolumn{1}{|c|}{Instance }& \multicolumn{1}{|c|}{Optimal}& \multicolumn{1}{|c|}{Average} & \multicolumn{1}{|c|}{Maximal} & \multicolumn{1}{|c|}{Hits} & \multicolumn{1}{|c|}{CPU} & \multicolumn{1}{|c|}{GPU} & \multicolumn{1}{|c|}{Speed-} & \multicolumn{1}{|c|}{ITS} \\
\small
\begin{tabular}{|l|r|r|r|r|r|r|r|r|} \hline
\multicolumn{1}{|c|}{Instance }& \multicolumn{1}{|c|}{Optimal}& \multicolumn{1}{|c|}{Average} & \multicolumn{1}{|c|}{Maximal} & \multicolumn{1}{|c|}{Hits} & \multicolumn{1}{|c|}{CPU} & \multicolumn{1}{|c|}{GPU} & \multicolumn{1}{|c|}{Speed-} & \multicolumn{1}{|c|}{ITS} \\
Nug12-d & $580$ & $615.4$ & $744$ & $35$\% & $87.7$ & $2.5$ & $34.7 \times$& $16$ \\
& & \tiny{$41.7$} & && \tiny{$40.9$} & \tiny{$0.9$} & & \tiny{$6.3$} \\ \hline
Nug13-d & $1912$ & $1985.4$ & $2100$ & $20$\% & $209.2$ & $3.3$ & $63.5 \times$ & $17$ \\
Nug12-d & $580$ & $615.4$ & $744$ & $35$\% & $87.7$ & $2.5$ & $34.7 \times$& $16$ \\
& & \tiny{$41.7$} & && \tiny{$40.9$} & \tiny{$0.9$} & & \tiny{$6.3$} \\ \hline
Nug13-d & $1912$ & $1985.4$ & $2100$ & $20$\% & $209.2$ & $3.3$ & $63.5 \times$ & $17$ \\
& & \tiny{$529.6$} & & & \tiny{$341.1$} & \tiny{$6.6$} & & \tiny{$4.0$} \\ \hline
\end{tabular}
\caption{Results of the GPU-based iterated tabu search for
& & \tiny{$529.6$} & & & \tiny{$341.1$} & \tiny{$6.6$} & & \tiny{$4.0$} \\ \hline
\end{tabular}
\caption{Results of the GPU-based iterated tabu search for
a case study on implementing large neighborhood local search
methods on GPUs for solving large benchmarks of the quadratic
three-dimensional assignment problem (Q3AP). \\
Implementing parallel
a case study on implementing large neighborhood local search
methods on GPUs for solving large benchmarks of the quadratic
three-dimensional assignment problem (Q3AP). \\
Implementing parallel
-GPU threads, thread divergence\index{GPU
-Challenges!thread~divergence}, and synchronization. Actually, most
+GPU threads, thread divergence, and synchronization. Actually, most
of metaheuristics have been implemented on GPU using different
implementation strategies. In this chapter, a two-level
classification of the reviewed works has been proposed: design
of metaheuristics have been implemented on GPU using different
implementation strategies. In this chapter, a two-level
classification of the reviewed works has been proposed: design
metaheuristic tasks to parallelize and the GPU threads. Indeed,
the choice of a given mapping strategy strongly influences the
other challenges (memory usage, communication, thread
metaheuristic tasks to parallelize and the GPU threads. Indeed,
the choice of a given mapping strategy strongly influences the
other challenges (memory usage, communication, thread