\chapterauthor{Nouredine Melab}{Université Lille 1, LIFL/UMR CNRS 8022, 59655-Villeneuve d'Ascq cedex, France}
\chapterauthor{Nouredine Melab}{Université Lille 1, LIFL/UMR CNRS 8022, 59655-Villeneuve d'Ascq cedex, France}
related to the GPU implementation of metaheuristics. A
state-of-the-art of GPU-based parallel
metaheuristics\index{Metaheuristics!parallel~metaheuristics} is
related to the GPU implementation of metaheuristics. A
state-of-the-art of GPU-based parallel
metaheuristics\index{Metaheuristics!parallel~metaheuristics} is
-summarized in Section~\ref{ch8:sec:state}. In Section
-Section~\ref{ch8:sec:GPU_APIs}, the main developed GPU-based
+summarized in Section~\ref{ch8:sec:state}. In Section~\ref{ch8:sec:frameworks}, the main developed GPU-based
frameworks for metaheuristics are described. Finally, a case study
is presented in Section~\ref{ch8:sec:case} and some concluding
remarks are given in Section~\ref{ch8:conclusion}
frameworks for metaheuristics are described. Finally, a case study
is presented in Section~\ref{ch8:sec:case} and some concluding
remarks are given in Section~\ref{ch8:conclusion}
methods can be distinguished: \emph{exact methods} and
\emph{approximate methods}. Exact methods allow one to reach
optimal solution(s) of the handled optimization problem with a
methods can be distinguished: \emph{exact methods} and
\emph{approximate methods}. Exact methods allow one to reach
optimal solution(s) of the handled optimization problem with a
class are the \emph{branch and bound technique}, \emph{dynamic
programming}, \emph{constraint programming}, and \emph{A*
algorithm}. However, optimization problems, whether practical or
class are the \emph{branch and bound technique}, \emph{dynamic
programming}, \emph{constraint programming}, and \emph{A*
algorithm}. However, optimization problems, whether practical or
quality solutions in reasonable computation time compared to exact
methods but with no guarantee to find optimal or even bounded
solutions. This is due to the nature of the search process adopted
quality solutions in reasonable computation time compared to exact
methods but with no guarantee to find optimal or even bounded
solutions. This is due to the nature of the search process adopted
search in a subset of the whole search space.\\
Regarding the number of solutions considered at each iteration in
search in a subset of the whole search space.\\
Regarding the number of solutions considered at each iteration in
\item Sequential processor architectures have reached their
physical limit which prevents creating faster processors. The
current trend of microprocessor manufacturers consists of placing
\item Sequential processor architectures have reached their
physical limit which prevents creating faster processors. The
current trend of microprocessor manufacturers consists of placing
workstations are multicore processors. In addition, the evolution
of network technologies and the proliferation of broadband
networks have made possible the emergence of clusters of
workstations are multicore processors. In addition, the evolution
of network technologies and the proliferation of broadband
networks have made possible the emergence of clusters of
From the granularity of parallelism point of view, three major parallel
models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{Metaheuristics!algorithmic-level parallelism},
From the granularity of parallelism point of view, three major parallel
models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{Metaheuristics!algorithmic-level parallelism},
(thousands of threads)~\cite{CUDA}. However, the execution order
of these thousands of threads is unknown by the programmer which
makes the prediction of their execution order a challenging issue.
(thousands of threads)~\cite{CUDA}. However, the execution order
of these thousands of threads is unknown by the programmer which
makes the prediction of their execution order a challenging issue.
threads through the insertion of barrier synchronizations in the
codes to avoid concurrent accesses to data structures and to meet
some requirements related to data-dependent synchronizations.
threads through the insertion of barrier synchronizations in the
codes to avoid concurrent accesses to data structures and to meet
some requirements related to data-dependent synchronizations.
exploration of different paths using an ant
colony\index{Metaheuristics!ant~colony~optimization} are random
operations. Threads of the same warp have to execute
exploration of different paths using an ant
colony\index{Metaheuristics!ant~colony~optimization} are random
operations. Threads of the same warp have to execute
in an SIMD model the threads of a same warp execute the same
instruction. Consequently, the different branches of a conditional
instruction which is data-dependent lead to a serial execution of
in an SIMD model the threads of a same warp execute the same
instruction. Consequently, the different branches of a conditional
instruction which is data-dependent lead to a serial execution of
The first works on metaheuristic algorithms implemented on GPUs
started on old graphics cards before the appearance of modern GPUs
equipped with high-level programming interfaces such as CUDA and
The first works on metaheuristic algorithms implemented on GPUs
started on old graphics cards before the appearance of modern GPUs
equipped with high-level programming interfaces such as CUDA and
-of EAs on graphics processing cards and the work by Catala \emph{et al.} in~\cite{catala2007} where the ACO\index{Metaheuristics!ant~colony~optimization} algorithm
-is implemented on old GPU architectures. Yu \emph{et al.}~\cite{yu2005} and
-Li \emph{et al.}~\cite{li2007} proposed a full parallelization of genetic
+of EAs on graphics processing cards and the work by Catala et al. in~\cite{catala2007} where the ACO\index{Metaheuristics!ant~colony~optimization} algorithm
+is implemented on old GPU architectures. Yu et al.~\cite{yu2005} and
+Li et al.~\cite{li2007} proposed a full parallelization of genetic
algorithm\index{Metaheuristics!genetic~algorithm}s on old GPU architectures using
shader libraries based on Direct3D and OpenGL.\\
algorithm\index{Metaheuristics!genetic~algorithm}s on old GPU architectures using
shader libraries based on Direct3D and OpenGL.\\
mappings for large neighborhood structures on GPUs. In this work,
three different neighborhoods are studied and mapped to the
hierarchical GPU for binary problems. The three neighborhoods are
mappings for large neighborhood structures on GPUs. In this work,
three different neighborhoods are studied and mapped to the
hierarchical GPU for binary problems. The three neighborhoods are
-based on the \emph{Hamming} distance. The move operators used in
-the three neighborhoods consider \emph{Hamming} distances of 1,
+based on the Hamming distance. The move operators used in
+the three neighborhoods consider Hamming distances of 1,
2, and 3 (this consists on flipping the binary value of one, two,
or three positions at a time in the candidate binary vector).
In~\cite{luong2010large}, each thread is associated to a unique
2, and 3 (this consists on flipping the binary value of one, two,
or three positions at a time in the candidate binary vector).
In~\cite{luong2010large}, each thread is associated to a unique
%For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes in the candidate vector.
The three neighborhoods are implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{Metaheuristics!tabu~search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes.\\ % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
%For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes in the candidate vector.
The three neighborhoods are implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{Metaheuristics!tabu~search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes.\\ % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
proposed two parallelization strategies inspired by the multi-walk
parallelization strategy, of a 3-opt iterated local
search\index{Metaheuristics!iterated local search} algorithm (ILS)
proposed two parallelization strategies inspired by the multi-walk
parallelization strategy, of a 3-opt iterated local
search\index{Metaheuristics!iterated local search} algorithm (ILS)
strategies have been experimented on standard benchmarks of
the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cities. The results indicate that increasing the number of solutions to be explored simultaneously improves the speedup in the two strategies, but at the same time it decreases final solution quality. The greater speedup factor reached by the second strategy is $6 \times$.\\
strategies have been experimented on standard benchmarks of
the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cities. The results indicate that increasing the number of solutions to be explored simultaneously improves the speedup in the two strategies, but at the same time it decreases final solution quality. The greater speedup factor reached by the second strategy is $6 \times$.\\
in~\cite{luongMultiStart} to implement multistart parallel local
search algorithms (a special case of the
algorithmic-level\index{Metaheuristics!algorithmic-level
in~\cite{luongMultiStart} to implement multistart parallel local
search algorithms (a special case of the
algorithmic-level\index{Metaheuristics!algorithmic-level
algorithms for TSP and the flow-shop scheduling problem (FSP).
These algorithms are based on a multistart tabu
algorithms for TSP and the flow-shop scheduling problem (FSP).
These algorithms are based on a multistart tabu
algorithms exploit multicore CPU and GPU. A full parallelization
on GPU is adopted using shader libraries where each thread is
mapped with one tabu search\index{Metaheuristics!tabu~search}.
algorithms exploit multicore CPU and GPU. A full parallelization
on GPU is adopted using shader libraries where each thread is
mapped with one tabu search\index{Metaheuristics!tabu~search}.
speedups the serial execution almost $16 \times$, the mapping of
one thread with one tabu search\index{Metaheuristics!tabu~search}
requires a large number of local search algorithms to
speedups the serial execution almost $16 \times$, the mapping of
one thread with one tabu search\index{Metaheuristics!tabu~search}
requires a large number of local search algorithms to
implementation of hybrid metaheuristics on heterogeneous parallel
architectures (multicore CPU coupled to one GPU). The challenge
of using such a heterogeneous architecture is how to distribute
implementation of hybrid metaheuristics on heterogeneous parallel
architectures (multicore CPU coupled to one GPU). The challenge
of using such a heterogeneous architecture is how to distribute
(solution-level\index{Metaheuristics!solution-level~parallelism},
iteration-level\index{Metaheuristics!iteration-level parallelism},
and algorithmic-level\index{Metaheuristics!algorithmic-level
(solution-level\index{Metaheuristics!solution-level~parallelism},
iteration-level\index{Metaheuristics!iteration-level parallelism},
and algorithmic-level\index{Metaheuristics!algorithmic-level
model for the considered heterogeneous architecture is a hybrid
model combining
iteration-level\index{Metaheuristics!iteration-level parallelism}
model for the considered heterogeneous architecture is a hybrid
model combining
iteration-level\index{Metaheuristics!iteration-level parallelism}
neighborhood calculation of several S-metaheuristics at the same
time.
In order to efficiently exploit the remaining CPU cores, a load-balancing heuristic is also proposed in order to decide on the number of additional
neighborhood calculation of several S-metaheuristics at the same
time.
In order to efficiently exploit the remaining CPU cores, a load-balancing heuristic is also proposed in order to decide on the number of additional
All the previously noted works exploit the same parallel models
used on CPUs based on the task parallelism. A different
All the previously noted works exploit the same parallel models
used on CPUs based on the task parallelism. A different
permutation $p$. For the GPU implementation, the authors used the
parallel implementation of neighborhood exploration. The
time-consuming tasks in the SA-matrix are identified by the
permutation $p$. For the GPU implementation, the authors used the
parallel implementation of neighborhood exploration. The
time-consuming tasks in the SA-matrix are identified by the
the next accepted move. To initialize the delta matrix, several
threads from different blocks explore different segments of the
matrix (different moves) at the same time. In order to select the
the next accepted move. To initialize the delta matrix, several
threads from different blocks explore different segments of the
matrix (different moves) at the same time. In order to select the
the last move, a group of threads explores different subsets of
the delta matrix. The shared memory is used to preload all the
necessary elements for a given group of threads responsible for
the updating of the delta matrix. The main difference in this work
compared to the previous works resides in the introduction of a
data parallelism using the precalculated delta matrix. The use of
the last move, a group of threads explores different subsets of
the delta matrix. The shared memory is used to preload all the
necessary elements for a given group of threads responsible for
the updating of the delta matrix. The main difference in this work
compared to the previous works resides in the introduction of a
data parallelism using the precalculated delta matrix. The use of
involved in the evaluation of a single move. Experimentations are
done on standard QAP instances from the
QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are
involved in the evaluation of a single move. Experimentations are
done on standard QAP instances from the
QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are
(islands), master-slave models used for the parallelization of CPU
intensive steps (evaluation and transformation), and cellular
models based on the use of one population disposed (generally) on
(islands), master-slave models used for the parallelization of CPU
intensive steps (evaluation and transformation), and cellular
models based on the use of one population disposed (generally) on
master-slave parallelism of EAs on GPUs using EASEA. EASEA is a
C-like metalanguage for easy development of EAs. The user writes a
master-slave parallelism of EAs on GPUs using EASEA. EASEA is a
C-like metalanguage for easy development of EAs. The user writes a
problem representation, etc) in EASEA. The code is then compiled
to obtain a ready-to-use evolutionary
algorithm\index{Metaheuristics!evolutionary~algorithm}. The EASEA
compiler uses genetic
problem representation, etc) in EASEA. The code is then compiled
to obtain a ready-to-use evolutionary
algorithm\index{Metaheuristics!evolutionary~algorithm}. The EASEA
compiler uses genetic
Libraries to produce C++ or JAVA written EA codes.
In~\cite{maitre2009}, the authors proposed an extension of EASEA
to produce CUDA code from the EASEA files. This extension has been
Libraries to produce C++ or JAVA written EA codes.
In~\cite{maitre2009}, the authors proposed an extension of EASEA
to produce CUDA code from the EASEA files. This extension has been
real problem (molecular structure prediction). In order to
maximize the GPU occupation, very large populations are used (from
$2000$ to $20000$). Even though transferring such large
real problem (molecular structure prediction). In order to
maximize the GPU occupation, very large populations are used (from
$2000$ to $20000$). Even though transferring such large
-populations from the CPU to the GPU device memory at every generation is very costly, the authors reported important speedups on the two problems on a GTX260 card: $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management\index{GPU Challenges!memory~management} of the populations on GPU.\\
+populations from the CPU to the GPU device memory at every generation is very costly, the authors report important speedups on the two problems on a GTX260 card: $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management\index{GPU Challenges!memory~management} of the populations on GPU.\\
The master-slave model is efficient when the fitness function is
highly time intensive. Nevertheless, it requires the use of
large-sized populations in order to saturate the GPU unless the
per-block is used (one individual perblock) when the acceleration
of the fitness function itself is possible. The use of many
The master-slave model is efficient when the fitness function is
highly time intensive. Nevertheless, it requires the use of
large-sized populations in order to saturate the GPU unless the
per-block is used (one individual perblock) when the acceleration
of the fitness function itself is possible. The use of many
in~\cite{pospichal10} to implement a parallel genetic
algorithm\index{Metaheuristics!genetic~algorithm} on GPU. In this
work the entire genetic
in~\cite{pospichal10} to implement a parallel genetic
algorithm\index{Metaheuristics!genetic~algorithm} on GPU. In this
work the entire genetic
-Nowotniak \emph{et al.}~\cite{nowotniak} proposed a GPU-based
-implementation of a quantum inspired genetic
-algorithm\index{Metaheuristics!genetic~algorithm} (QIGA). The used
-parallel model is a hierarchical model based on two levels: each
+Nowotniak and Kucharski~\cite{nowotniak} proposed a GPU-based
+implementation of a Quantum Inspired Genetic Algorithm\index{Metaheuristics!genetic~algorithm} (QIGA). The
+parallel model used is a hierarchical model based on two levels: each
thread in a block transforms a unique individual and a different
population is assigned to each block. The algorithm is run
entirely on GPU. A different instance of the QIGA is run on each
block and the computations have been shared between 8 GPUs. This
approach is very convenient to speed up the experimental process
on metaheuristics that require several independent runs of the
thread in a block transforms a unique individual and a different
population is assigned to each block. The algorithm is run
entirely on GPU. A different instance of the QIGA is run on each
block and the computations have been shared between 8 GPUs. This
approach is very convenient to speed up the experimental process
on metaheuristics that require several independent runs of the
populations are stored in the shared memory, the data matrix used
for fitness evaluation is placed in read only constant memory, and
finally seeds for random numbers generated on the GPU are stored
populations are stored in the shared memory, the data matrix used
for fitness evaluation is placed in read only constant memory, and
finally seeds for random numbers generated on the GPU are stored
GPUs. Indeed, the fine-grained parallelism of EAs fits perfectly
to the SIMD architecture of the GPU. \\
GPUs. Indeed, the fine-grained parallelism of EAs fits perfectly
to the SIMD architecture of the GPU. \\
parallel synchronous cellular genetic
algorithm\index{Metaheuristics!genetic~algorithm} (CGA), called
GraphCell, to solve the independent task scheduling problem on GPU
parallel synchronous cellular genetic
algorithm\index{Metaheuristics!genetic~algorithm} (CGA), called
GraphCell, to solve the independent task scheduling problem on GPU
grid sizes ranging from $32^2$ to $512^2$. The speedups reached by
the GPU implementation against the CPU version range from
$5\times$ to $24\times$ and increase as the size of the population
grid sizes ranging from $32^2$ to $512^2$. The speedups reached by
the GPU implementation against the CPU version range from
$5\times$ to $24\times$ and increase as the size of the population
version for all the tested benchmarks when the size of the
population is set to $32^2$. When the size of the population is
too small, there are not enough computations to cover the overhead
version for all the tested benchmarks when the size of the
population is set to $32^2$. When the size of the population is
too small, there are not enough computations to cover the overhead
communication\index{GPU Challenges!CPU/GPU~communication}s,
synchronization, and access to global memory. Finally, an
interesting review on GPU parallel computation in bio-inspired
communication\index{GPU Challenges!CPU/GPU~communication}s,
synchronization, and access to global memory. Finally, an
interesting review on GPU parallel computation in bio-inspired
implementation of
ACO\index{Metaheuristics!ant~colony~optimization} for TSP where
the two steps (tour construction and pheromone update) are
implementation of
ACO\index{Metaheuristics!ant~colony~optimization} for TSP where
the two steps (tour construction and pheromone update) are
per matrix entry). The achieved speedups are $21 \times$ for tour
construction and $20 \times$ for pheromone updates.\\
per matrix entry). The achieved speedups are $21 \times$ for tour
construction and $20 \times$ for pheromone updates.\\
on GPU to solve the QAP. A solution of QAP is represented as a
permutation of ${1,2,..,n}$ with $n$ being the size of the
problem. The TS algorithm is based on the 2-opt neighborhood
on GPU to solve the QAP. A solution of QAP is represented as a
permutation of ${1,2,..,n}$ with $n$ being the size of the
problem. The TS algorithm is based on the 2-opt neighborhood
In~\cite{zhou2009} Zhou and Tan propose a full GPU implementation
of a standard PSO algorithm. All the data is stored in global
memory (velocities, positions, swarm population, etc). Only
In~\cite{zhou2009} Zhou and Tan propose a full GPU implementation
of a standard PSO algorithm. All the data is stored in global
memory (velocities, positions, swarm population, etc). Only
The implementation of the multistart model is based on two
different mapping strategies~\cite{luongMultiStart, audreyANT}:
The implementation of the multistart model is based on two
different mapping strategies~\cite{luongMultiStart, audreyANT}:
(2) each solution (from multiple neighborhoods) is associated to a
unique thread. The first mapping strategy (one thread per LS)
presents a big drawback: the number of LS to use should be very
(2) each solution (from multiple neighborhoods) is associated to a
unique thread. The first mapping strategy (one thread per LS)
presents a big drawback: the number of LS to use should be very
PUGACE\index{GPU-based frameworks!PUGACE}~\cite{pugace},
ParadisEO-MO-GPU\index{GPU-based
frameworks!ParadisEO-MO-GPU}~\cite{paradiseoGPU}, and
PUGACE\index{GPU-based frameworks!PUGACE}~\cite{pugace},
ParadisEO-MO-GPU\index{GPU-based
frameworks!ParadisEO-MO-GPU}~\cite{paradiseoGPU}, and
frameworks!PUGACE} is a framework for implementing EAs on GPUs.
ParadisEO-MO-GPU is an extension of the framework
ParadisEO~\cite{paradiseo} for implementing s-metaheuristics on
frameworks!PUGACE} is a framework for implementing EAs on GPUs.
ParadisEO-MO-GPU is an extension of the framework
ParadisEO~\cite{paradiseo} for implementing s-metaheuristics on
-GPU. Finally, libCudaOptimize\index{GPU-based
-frameworks!libCudaOptimize} is a library intended for the
+GPU. Finally, libCUDAOptimize\index{GPU-based
+frameworks!libCUDAOptimize} is a library intended for the
implementation of p-metaheuristics on GPU. The three frameworks
are presented in more detail in the following.
implementation of p-metaheuristics on GPU. The three frameworks
are presented in more detail in the following.
(island, master-slave). The main standard evolutionary operators
are already implemented in PUGACE\index{GPU-based
frameworks!PUGACE}: different selection strategies, standard
(island, master-slave). The main standard evolutionary operators
are already implemented in PUGACE\index{GPU-based
frameworks!PUGACE}: different selection strategies, standard
-crossover, and mutation operators (\emph{PMX, swap, 2-exchange},
+crossover, and mutation operators (PMX, swap, 2-exchange,
etc.). Different problem encoding is also supported. The framework
is organized as a set of modules in which the different
functionalities are separated as much as possible in order to
etc.). Different problem encoding is also supported. The framework
is organized as a set of modules in which the different
functionalities are separated as much as possible in order to
individuals within the block. It is the decision on the choose
of the cutting point for the crossover.\\
individuals within the block. It is the decision on the choose
of the cutting point for the crossover.\\
\times$ are achieved compared to a CPU implementation of a cEA
using population sizes rising from 512 to 1024 and from 1024 to
2048.
\times$ are achieved compared to a CPU implementation of a cEA
using population sizes rising from 512 to 1024 and from 1024 to
2048.
framework ParadisEO-MO-GPU\index{GPU-based
frameworks!ParadisEO-MO-GPU} for parallel local search
metaheuristics (s-metaheuristics) on GPUs. It focuses on the
framework ParadisEO-MO-GPU\index{GPU-based
frameworks!ParadisEO-MO-GPU} for parallel local search
metaheuristics (s-metaheuristics) on GPUs. It focuses on the
search, differential evolution, and particle swarm
optimization\index{Metaheuristics!particle swarm optimization}.
Nevertheless, the library is designed in such a way to allow
search, differential evolution, and particle swarm
optimization\index{Metaheuristics!particle swarm optimization}.
Nevertheless, the library is designed in such a way to allow
In this case study, a large neighborhood GPU-based local
search\index{GPU-based Metaheuristics!GPU-based~local~search}
In this case study, a large neighborhood GPU-based local
search\index{GPU-based Metaheuristics!GPU-based~local~search}
-method for solving the quadratic 3-dimensional assignment problem
-(Q3AP) will be presented. The local search method is an iterated
-local search\index{Metaheuristics!iterated local search}
+method for solving the Quadratic 3-dimensional Assignment Problem
+(Q3AP) will be presented. The local search method is an Iterated
+Local Search\index{Metaheuristics!iterated local search}
(ILS)~\cite{stutzle2006ILSforQAP} using an embedded
TS\index{Metaheuristics!tabu~search} algorithm. The ILS principle
consists of executing iteratively the embedded local search, each
(ILS)~\cite{stutzle2006ILSforQAP} using an embedded
TS\index{Metaheuristics!tabu~search} algorithm. The ILS principle
consists of executing iteratively the embedded local search, each
\subsection{The quadratic 3-dimensional assignment problem}
The Q3AP is an extension of the well-known NP-hard QAP. The latter
\subsection{The quadratic 3-dimensional assignment problem}
The Q3AP is an extension of the well-known NP-hard QAP. The latter
optimization community due to its wide range of practical
applications (site planning, schedule problem, computer-aided
design, etc.) and to its computational challenges since it is
optimization community due to its wide range of practical
applications (site planning, schedule problem, computer-aided
design, etc.) and to its computational challenges since it is
The Q3AP was first introduced by William P. Pierskalla in
1967~\cite{Pierskalla1967Q3AP} and, unlike the QAP, the Q3AP is a
less studied problem. Indeed, the Q3AP was revisited only this
The Q3AP was first introduced by William P. Pierskalla in
1967~\cite{Pierskalla1967Q3AP} and, unlike the QAP, the Q3AP is a
less studied problem. Indeed, the Q3AP was revisited only this
assignment problems such as the symbol-mapping problem encountered
in wireless communication systems and described
in~\cite{Hahn2008q3ap}. The Q3AP optimization problem can be
assignment problems such as the symbol-mapping problem encountered
in wireless communication systems and described
in~\cite{Hahn2008q3ap}. The Q3AP optimization problem can be
- x_{ijl}\in \{0,1\}, \quad i,j,l=0,1,...,n-1. \label{Q3APConstraints2}
+ x_{ijl}\in \{0,1\}, \quad i,j,l=0,1,...,n-1 \label{Q3APConstraints2}
\end{eqnarray} $I$, $J$, and $L$ sets are defined as follows:
\begin{displaymath} I=\lbrace X=(x_{ijl}):
\sum_{j=0}^{n-1}\sum_{l=0}^{n-1}x_{ijl}=1, \quad
\end{eqnarray} $I$, $J$, and $L$ sets are defined as follows:
\begin{displaymath} I=\lbrace X=(x_{ijl}):
\sum_{j=0}^{n-1}\sum_{l=0}^{n-1}x_{ijl}=1, \quad
j=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
\begin{displaymath} L=\lbrace X=(x_{ijl}):
\sum_{i=0}^{n-1}\sum_{j=0}^{n-1}x_{ijl}=1, \quad
j=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
\begin{displaymath} L=\lbrace X=(x_{ijl}):
\sum_{i=0}^{n-1}\sum_{j=0}^{n-1}x_{ijl}=1, \quad
-l=0,1,...,n-1\rbrace \mathrm{.} \end{displaymath}
+l=0,1,...,n-1\rbrace \end{displaymath}
% fin------------------------------------------------------------------------------
Other equivalent formulations of the Q3AP can be found in the literature. A particularly useful one is the
% fin------------------------------------------------------------------------------
Other equivalent formulations of the Q3AP can be found in the literature. A particularly useful one is the
solution and tries, at each step, to move to a neighboring
solution that minimizes the fitness (for a minimization case). If
no such move exists, the neighbor solution that less degrades the
solution and tries, at each step, to move to a neighboring
solution that minimizes the fitness (for a minimization case). If
no such move exists, the neighbor solution that less degrades the
escape local optima. However, this strategy may generate cycles,
i.e., previous moves can be selected again. To avoid cycles, the
TS manages a short-term memory that contains the moves that have
been recently performed. A TS\index{Metaheuristics!tabu~search}
escape local optima. However, this strategy may generate cycles,
i.e., previous moves can be selected again. To avoid cycles, the
TS manages a short-term memory that contains the moves that have
been recently performed. A TS\index{Metaheuristics!tabu~search}
\small
\begin{tabular}{|l|r|r|r|r|r|r|r|r|} \hline
\multicolumn{1}{|c|}{Instance }& \multicolumn{1}{|c|}{Optimal}& \multicolumn{1}{|c|}{Average} & \multicolumn{1}{|c|}{Maximal} & \multicolumn{1}{|c|}{Hits} & \multicolumn{1}{|c|}{CPU} & \multicolumn{1}{|c|}{GPU} & \multicolumn{1}{|c|}{Speed-} & \multicolumn{1}{|c|}{ITS} \\
\small
\begin{tabular}{|l|r|r|r|r|r|r|r|r|} \hline
\multicolumn{1}{|c|}{Instance }& \multicolumn{1}{|c|}{Optimal}& \multicolumn{1}{|c|}{Average} & \multicolumn{1}{|c|}{Maximal} & \multicolumn{1}{|c|}{Hits} & \multicolumn{1}{|c|}{CPU} & \multicolumn{1}{|c|}{GPU} & \multicolumn{1}{|c|}{Speed-} & \multicolumn{1}{|c|}{ITS} \\
Nug12-d & $580$ & $615.4$ & $744$ & $35$\% & $87.7$ & $2.5$ & $34.7 \times$& $16$ \\
& & \tiny{$41.7$} & && \tiny{$40.9$} & \tiny{$0.9$} & & \tiny{$6.3$} \\ \hline
Nug13-d & $1912$ & $1985.4$ & $2100$ & $20$\% & $209.2$ & $3.3$ & $63.5 \times$ & $17$ \\
Nug12-d & $580$ & $615.4$ & $744$ & $35$\% & $87.7$ & $2.5$ & $34.7 \times$& $16$ \\
& & \tiny{$41.7$} & && \tiny{$40.9$} & \tiny{$0.9$} & & \tiny{$6.3$} \\ \hline
Nug13-d & $1912$ & $1985.4$ & $2100$ & $20$\% & $209.2$ & $3.3$ & $63.5 \times$ & $17$ \\
Implementing parallel
metaheuristics\index{Metaheuristics!parallel~metaheuristics} on
GPU architectures poses new issues and challenges such as memory
Implementing parallel
metaheuristics\index{Metaheuristics!parallel~metaheuristics} on
GPU architectures poses new issues and challenges such as memory
efficient mapping strategies between tasks to parallelize; and the
GPU threads, thread divergence\index{GPU
Challenges!thread~divergence}, and synchronization. Actually, most
efficient mapping strategies between tasks to parallelize; and the
GPU threads, thread divergence\index{GPU
Challenges!thread~divergence}, and synchronization. Actually, most