BookGPU/Chapters/chapter9/ch9.tex

   1 \chapterauthor{Malika Mehdi and Ahc\`{e}ne Bendjoudi}{CERIST Research Center, DTISI, 3 rue des frères Aissou, 16030 Ben-Aknoun, Algiers, Algeria}
   2
   3 \chapterauthor{Lakhdar Loukil}{University of Oran, Algeria}
   4
   5 %\chapterauthor{Ahc\`{e}ne Bendjoudi}{CERIST Research Center, DTISI, 3 rue des frères Aissou, 16030 Ben-Aknoun, Algiers, Algeria}
   6
   7 \chapterauthor{Nouredine Melab}{Université Lille 1, LIFL/UMR CNRS 8022, 59655-Villeneuve d'Ascq cedex, France}
   8
   9 \chapter{Parallel GPU-accelerated Metaheuristics}
  10 \label{chapter9}
  11 \section{Introduction}
  12 This chapter presents GPU-based parallel
  13 metaheuristics\index{Metaheuristics!parallel~metaheuristics},
  14 challenges, and issues related to the particularities of the GPU
  15 architecture and a synthesis on the different implementation
  16 strategies used in the literature. The implementation of parallel
  17 metaheuristics\index{Metaheuristics!parallel~metaheuristics} on
  18 GPUs is not straightforward. The traditional models used in CPUs
  19 must be rethought to meet the new requirements of GPU
  20 architectures. This chapter is organized as follows. Combinatorial
  21 optimization\index{Combinatorial~optimization} and  resolution
  22 methods are introduced in Section~\ref{ch8:sec:optim}. The main
  23 traditional parallel models used for metaheuristics are recalled
  24 in Section~\ref{ch8:sec:paraMeta}.
  25 Section~\ref{ch8:sec:challenges} highlights the  main challenges
  26 related to the GPU implementation of metaheuristics. A
  27 state-of-the-art of GPU-based parallel
  28 metaheuristics\index{Metaheuristics!parallel~metaheuristics} is
  29 summarized in Section~\ref{ch8:sec:state}. In Section
  30 Section~\ref{ch8:sec:GPU_APIs}, the main developed GPU-based
  31 frameworks for metaheuristics are described. Finally, a case study
  32 is presented in Section~\ref{ch8:sec:case} and some concluding
  33 remarks are given in Section~\ref{ch8:conclusion}
  34
  35 \section{Combinatorial optimization}
  36 \label{ch8:sec:optim}
  37
  38 Combinatorial optimization\index{Combinatorial~optimization} (CO) is a branch of applied and discrete mathematics.
  39 It consists in finding optimal configuration(s) among a finite set of possible configurations
  40 (or solutions) of a given combinatorial optimization problem (COP).  The set of all possible solutions noted $S$ is called solution space or search space. Each solution  in $S$ is defined by its real cost calculated by an objective function. COPs are generally defined as follows~\cite{blumMeta}:\\  %(Definition~\ref{def:cops})
  41
  42
  43 %\begin{minipage}{0.5\linewidth}
  44 A combinatorial problem $P=(S,f)$ can be defined by
  45 \begin{itemize}
  46 \item a set of decision variables $X$,
  47 \item an objective function $f$ to optimize (minimize or maximize) over the set $S$,
  48 \item subject to constraints on the decision variables.\\
  49 \end{itemize}
  50 %\end{minipage}
  51
  52 COPs are generally formulated as mixed integer programming
  53 problems (MIPS) and most of them are NP-hard~\cite{garey}.
  54 According to the quality level of solutions and deadlines required
  55 for solving an optimization problem, two classes of optimization
  56 methods can be distinguished: \emph{exact methods} and
  57 \emph{approximate methods}. Exact methods allow one to reach
  58 optimal solution(s) of the handled optimization problem with a
  59 proof of its or their optimality. The most known methods of this
  60 class are the \emph{branch and bound technique}, \emph{dynamic
  61 programming}, \emph{constraint programming}, and \emph{A*
  62 algorithm}. However, optimization problems, whether practical or
  63 academic, are often complex and NP-hard. Moreover, a large number
  64 of real-life optimization problems encountered in science,
  65 engineering, economics, and business are usually large-sized
  66 problems for which the size of the potential solution domain
  67 increases dramatically with the size of the problem instance. Such
  68 problems cannot be tackled using exact methods due to the
  69 excessive computation time needed by these methods to find optimal
  70 solution(s). In such a situation, approximate methods (or
  71 \textit{metaheuristics}) offer an alternative approach to solve
  72 these problems. Indeed, these methods allow one to reach good
  73 quality solutions in reasonable computation time compared to exact
  74 methods but with no guarantee to find optimal or even bounded
  75 solutions. This is due to the nature of the search process adopted
  76 by these approaches which consists in performing a
  77 search in a subset of the whole search space.\\
  78
  79 Regarding the number of solutions considered at each iteration in
  80 the search process, two classes of metaheuristics can be
  81 distinguished~\cite{talbi2009mfdti}: \textit{solution-based} and
  82 \textit{population-based} metaheuristics. In the rest of this
  83 chapter, the term \emph{s-metaheuristic} refers to solution-based
  84 metaheuristic and the term \emph{p-metaheuristic} refers to
  85 population-based metaheuristic. In s-metaheuristics, the search
  86 process starts with a single solution (generally set at random)
  87 and iteratively improves it by exploring its neighborhood in the
  88 search space. The most known methods in this class are local
  89 search methods that include \emph{simulated
  90 annealing}\index{Metaheuristics!simulated~annealing}~\cite{Kirkpatrick1983SA},
  91 \emph{tabu search}~\cite{Glover1989TS}, \emph{iterated local
  92 search\index{Metaheuristics!iterated local
  93 search}}~\cite{stutzle2006ILSforQAP}, and \emph{variable
  94 neighborhood search}~\cite{HansenMladenovic1997VNS}.\\
  95
  96 Unlike s-metaheuristics, p-metaheuristics start with a population
  97 of solutions and implement an iterative process that evolves the
  98 current population towards a new population of better quality
  99 solutions. The process is repeated until a stopping criterion is
 100 satisfied. \emph{Evolutionary algorithms}, \emph{swarm
 101 optimization}, and \emph{ant colonies} fall into this class.
 102
 103
 104 \section{Parallel models for metaheuristics}\label{ch8:sec:paraMeta}
 105 Optimization problems, whether real-life or academic, are more
 106 often NP-hard and CPU time and/or memory consuming. Metaheuristics
 107 allow the significant reduction of the computational time of the search
 108 process but remain time-consuming particularly when it comes
 109 dealing with large-sized problems. \\
 110
 111 The use of parallelism in the design of metaheuristics is a relevant
 112 approach that is widely adopted by the combinatorial optimization
 113 community for various reasons:
 114
 115 \begin{itemize}
 116   \item One of the main goals of parallelism is to reduce the search
 117 time. This will allow the design of high performance optimization
 118 methods and the solving of large-sized optimization problems.
 119
 120   \item Sequential processor architectures have reached their
 121 physical limit which prevents creating faster processors. The
 122 current trend of microprocessor manufacturers consists of placing
 123 multiple cores on a single chip. Nowadays, laptops and
 124 workstations are multicore processors. In addition, the evolution
 125 of network technologies and the proliferation of broadband
 126 networks have made possible the emergence of clusters of
 127 workstations (COWs), networks of workstations (NOWS), and
 128 large-scale networks of machines (grids) as platforms for
 129 high-performance computing.
 130 \end{itemize}
 131
 132 From the granularity of parallelism point of view, three major parallel
 133 models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{Metaheuristics!algorithmic-level parallelism},
 134 \emph{iteration-level} \index{Metaheuristics!iteration-level parallelism}, and \emph{solution-level} as illustrated in Figure~\ref{ch8:fig:paraMeta}. \\
 135
 136 \begin{figure}[h!]
 137 \centerline{\includegraphics[width=0.6\textwidth]{Chapters/chapter9/figures/paraMeta.pdf}}
 138 \caption{Parallel models for metaheuristics.}
 139 \label{ch8:fig:paraMeta}
 140 \end{figure}
 141
 142 \begin{itemize}
 143
 144 \item{In the
 145 algorithmic-level\index{Metaheuristics!algorithmic-level
 146 parallelism} parallel model, several self-contained metaheuristics
 147 are launched in parallel. The parallel
 148 metaheuristics\index{Metaheuristics!parallel~metaheuristics} may
 149 start with identical or different solutions (s-metaheuristics
 150 case) or populations (p-metaheuristics case). Their parameter
 151 settings such as the size of tabu list for tabu
 152 search\index{Metaheuristics!tabu~search}, transition probabilities
 153 for ant colonies, mutation and crossover probabilities for
 154 evolutionary
 155 algorithm\index{Metaheuristics!evolutionary~algorithm}s may be the
 156 same or different. The parallel processes may evolve independently
 157 or in a cooperative manner. In cooperative parallel models, the
 158 algorithms exchange information related to the search during
 159 evolution in order to find better and more robust solutions.}
 160
 161 \item{In the iteration-level\index{Metaheuristics!iteration-level
 162 parallelism} parallel model, the focus is on the parallelization
 163 of each iteration of the metaheuristic. Indeed, metaheuristics are
 164 generally iterative search processes. Moreover, the most
 165 resource-consuming part of a metaheuristic is the evaluation of
 166 the generated solutions at each iteration. For s-metaheuristics
 167 (e.g., tabu search\index{Metaheuristics!tabu~search}, simulated
 168 annealing, variable neighborhood search), the evaluation and
 169 generation of the neighborhood is the most time-consuming step of
 170 the algorithm particularly when it comes to dealing with large
 171 neighborhood sets. In this parallel model, the neighborhood is
 172 decomposed into partitions, and each partition is evaluated in a
 173 parallel way. For p-metaheuristics (e.g., evolutionary
 174 algorithm\index{Metaheuristics!evolutionary~algorithm}s, ant
 175 colonies, swarm optimization), the
 176 iteration-level\index{Metaheuristics!iteration-level parallelism}
 177 parallel model arises naturally since these metaheuristics deal
 178 with a population of independent solutions. In evolutionary
 179 algorithm\index{Metaheuristics!evolutionary~algorithm}s, for
 180 instance, the iteration-level\index{Metaheuristics!iteration-level
 181 parallelism} model consists of decomposing the whole population
 182 into several partitions where each partition is evaluated in
 183 parallel.}
 184
 185 \item{In the
 186 solution-level\index{Metaheuristics!solution-level~parallelism}
 187 parallel model, the focus is on the parallelization of the
 188 evaluation of a single solution. This model is useful when the
 189 objective function and/or the constraints are time and/or memory
 190 consuming. Unlike the two previous parallel models, the
 191 solution-level\index{Metaheuristics!solution-level~parallelism}
 192 parallel model is problem-dependent.}
 193 \end{itemize}
 194
 195 \section{Challenges for the design of GPU-based metaheuristics}
 196 \label{ch8:sec:challenges}
 197
 198 Developing GPU-based parallel
 199 metaheuristics\index{Metaheuristics!parallel~metaheuristics} is
 200 not straightforward. The parallel models have to be rethought to
 201 meet the new requirements of the GPU architecture. Several major
 202 issues have to be taken into account both at design and
 203 implementation levels to develop efficient metaheuristics on GPU.
 204 These issues are mainly related to the size and latency of the GPU
 205 memories, thread synchronization and divergence, the distribution
 206 of tasks, and data transfer between the CPU and
 207 GPU~\cite{luongMultiStart}.
 208
 209 \subsection{Data placement on a hierarchical memory}
 210 \index{GPU Challenges!data~placement} During the execution of
 211 metaheuristics on GPU, the different threads may access multiple
 212 data structures from multiple memory spaces. These memories have
 213 different sizes and access latencies. Nevertheless, faster
 214 memories (registers, shared and constant memories) are generally
 215 very small in size, and the larger memories (global memory) are
 216 relatively slower. However, p-metaheuristics require the
 217 exploration of a large amount of individuals to diversify the
 218 search. Moreover, an efficient execution of s-metaheuristics
 219 requires exploring large neighborhoods. Thus, programmers have to
 220 take into account this point to efficiently place the different
 221 data structures of the metaheuristic on the different memories to
 222 benefit from both the faster memories and the larger ones. A deep
 223 study has to be performed on both the metaheuristic data
 224 structures (size and access frequency) and the GPU memories (size
 225 and access latency) to identify which data will be placed on which
 226 memory. Generally, the most accessed ones should be put on faster
 227 memories (registers, shared memory) and larger ones on the larger
 228 memories (global memory). Also, an efficient mapping between
 229 threads and the corresponding metaheuristic elements (one neighbor
 230 per thread, one individual per thread, single population per
 231 threads block, one ant per thread, etc.) must be defined to ensure
 232 a maximum occupancy of the GPU and
 233 to cover CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication} and memory access times.\\
 234
 235 According to the used metaheuristic and to the handled problem, the data
 236 values may have different types and different ranges of their values. The data
 237 types should then be chosen carefully and the ranges of the data values should
 238 be analyzed to minimize the amount of occupied memory space.\\
 239
 240 In addition to the size and latency of GPU memory issues, the
 241 memory access pattern is another important issue to be dealt with
 242 to speedup GPU-based metaheuristics. Indeed, the different
 243 memories have been designed to achieve specific features that
 244 programmers must take into account to optimize their codes and
 245 then to benefit from these features. For instance, the global
 246 memory is optimized for coalesced accesses. The texture and the
 247 constant memory are read-only memories. The texture is optimized
 248 for uncoalesced accesses and the constant memory is optimized for
 249 simultaneously accesses of all threads to the same
 250 location~\cite{luongMultiStart}. Therefore, to improve the
 251 performance of the kernel execution, the programmers should put
 252 coalesced data on global memory, uncoalesced read-only data (e.g.,
 253 metaheuristic instance data) on the texture, concurrent read-only
 254 data (e.g., data for fitness evaluation that all threads
 255 concurrently access) on the constant memory, and the most accessed
 256 data structures (e.g., population of individuals for a CUDA thread
 257 block) on the shared memory.
 258
 259 \subsection{Threads synchronization}
 260 \index{GPU Challenges!threads~synchronization} The thread
 261 synchronization issue is caused by both the GPU architecture and
 262 the synchronization requirements of the implemented method.
 263 Indeed, GPUs are based on a multicore architecture organized into
 264 several multi-processors (Streaming Multiprocessor SM) supporting
 265 the SPMD model (Single Program Multiple Data). Each SM contains
 266 several cores executing the same instruction of different threads
 267 following the SIMD model (Single Instruction Multiple Data). These
 268 threads belong to a warp (a group of 32 threads) and handle
 269 different data elements. An efficient execution of applications on
 270 GPU is achieved when launching a large amount of threads
 271 (thousands of threads)~\cite{CUDA}. However, the execution order
 272 of these thousands of threads is unknown by the programmer which
 273 makes the prediction of their execution order a challenging issue.
 274 On an other hand, the developer has to control explicitly the
 275 threads through the insertion of barrier synchronizations in the
 276 codes to avoid concurrent accesses to data structures and to meet
 277 some requirements related to data-dependent synchronizations.
 278
 279 \subsection{Thread divergence}
 280
 281 Thread divergence\index{GPU Challenges!thread~divergence} is
 282 another challenging issue in GPU-based
 283 metaheuristics~\cite{cecilia, pugace, audreyANT}. Generally,
 284 metaheuristics contain irregular loops and conditional
 285 instructions when generating and evaluating the neighborhood
 286 (s-metaheuristics), and the population (p-metaheuristics) in the
 287 same block. In addition,  the decision to apply a crossover or a
 288 mutation on an individual in a genetic
 289 algorithm\index{Metaheuristics!genetic~algorithm} and the
 290 exploration of different paths using an ant
 291 colony\index{Metaheuristics!ant~colony~optimization} are random
 292 operations. Threads of the same warp have to execute
 293 simultaneously instructions leading to different branches whereas
 294 in an SIMD model the threads of a same warp execute the same
 295 instruction. Consequently, the different branches of a conditional
 296 instruction which is data-dependent lead to a serial execution of
 297 the different threads degrading the performance of the application
 298 in terms of execution time. The challenge here is then to revisit
 299 the traditional irregular metaheuristic codes to eliminate these
 300 divergences.
 301
 302 \subsection{Task distribution and CPU/GPU communication}
 303
 304 The performance of GPU-based metaheuristics in terms of execution
 305 time could be improved by choosing the most appropriate parallel
 306 model (algorithmic-level\index{Metaheuristics!algorithmic-level
 307 parallelism}, instruction-level,
 308 solution-level\index{Metaheuristics!solution-level~parallelism}).
 309 Moreover, an efficient decomposition of the metaheuristic and an
 310 efficient assignment of code portions between the CPU and GPU
 311 should be adopted. The objective is to take benefit from the GPU
 312 computing power without affecting the efficiency and the behavior
 313 of the metaheuristic and without losing performance in CPU/GPU
 314 communication\index{GPU Challenges!CPU/GPU~communication} and
 315 memory accesses. In order to decide which part of the
 316 metaheuristic will be executed on which component, one should
 317 perform a careful analysis on the serial code of the
 318 metaheuristic, identify the compute-intensive tasks (e.g.,
 319 exploration of the neighborhood, evaluation of individuals), and
 320 then offload them to the GPU, while the remaining
 321 tasks still run on the CPU in a serial way. \\
 322
 323 The CPU/GPU communication\index{GPU
 324 Challenges!CPU/GPU~communication} is done through the global
 325 memory which is a slow memory making the memory transfer between
 326 the CPU and GPU time-consuming which can significantly degrade the
 327 performance of the application. Accesses to this memory should be
 328 optimized by minimizing the amount of transferred data between the
 329 two components in order to reduce the communication time and,
 330 therefore, the whole execution time of the metaheuristic.
 331
 332 \section{State-of-the-art parallel metaheuristics on GPUs}
 333 \label{ch8:sec:state}
 334 After more than two decades of research by the combinatorial optimisation
 335 community devoted to developing adequate parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} for different types of
 336 parallel architectures (clusters, supercomputers and grids), the actual developement
 337 of General Perpose GPU (GPGPU) brings new challenges for parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} on SIMD architectures.\\
 338
 339 The first works on metaheuristic algorithms implemented on GPUs
 340 started on old graphics cards before the appearance of modern GPUs
 341 equipped with high-level programming interfaces such as CUDA and
 342 OpenCL. Among these pioneering works we cite the work of Wong
 343 \emph{et al.}~\cite{wongOldGPU2006} dealing with the
 344 implementation
 345 of EAs on graphics processing cards and the work by Catala \emph{et al.} in~\cite{catala2007} where the ACO\index{Metaheuristics!ant~colony~optimization} algorithm
 346 is implemented on old GPU architectures. Yu \emph{et al.}~\cite{yu2005} and
 347 Li \emph{et al.}~\cite{li2007} proposed a full parallelization of genetic
 348 algorithm\index{Metaheuristics!genetic~algorithm}s  on old GPU architectures using
 349 shader libraries based on Direct3D and OpenGL.\\
 350
 351 Such architectures are based on preconfigured pipelined stages
 352 used to accelerate the transformation of 3D geometric primitives
 353 into pixels. Implementing a general-purpose algorithm on such
 354 preconfigured architectures is very hard and requires the
 355 tailoring of the algorithm's data and instructions to fit the
 356 pipelined stages of the GPU. Since then, GPU architectures have
 357 evolved to become programmable using high-level programming
 358 interfaces. In this section we will focus only on recent
 359 state-of-the-art works dealing with metaheuristics implementation
 360 on modern programmable GPUs. In this review two classes are
 361 considered: (1) s-metaheuristics on GPUs and (2) p-metaheuristics
 362 on GPUs. A comparative study is done of the main works and a
 363 classification of the different existing strategies is proposed in
 364 Section~\ref{ch8:sec:synthesis}.
 365
 366 \subsection{Implementing solution-based metaheuristics on GPUs}
 367
 368
 369 A very basic local search algorithm starts with an initial
 370 solution generated either at random or by the mean of a specific
 371 heuristic and is based on two elementary components: the
 372 generation of neighborhood structures using an elementary move
 373 function and a selection function that determines which solution
 374 in the current neighborhood will replace the actual search point.
 375 The search continues as long as there is improvement and stops
 376 when there is no better solution in the current neighborhood. The
 377 exploration (or evaluation) of the different moves of a given
 378 neighborhood is done independently for each move. Thus, the
 379 easiest way to accelerate a local search algorithm is to
 380 parallelize the evaluation of the neighborhood (instruction-level
 381 parallelism). This is  by far the most used scheme in the
 382 literature for parallelizing local search algorithms on GPUs.
 383 Nevertheless, small neighborhoods may lead to  nonoptimal
 384 occupation of the CUDA threads which may lead in turn to an
 385 overhead due to the communication and memory latencies. Therefore,
 386 large neighborhoods are necessary for efficient implementation of
 387 local searches on GPUs.\\
 388
 389 Luong \emph{et al.}~\cite{luong2010large} proposed efficient
 390 mappings for large neighborhood structures on GPUs. In this work,
 391 three different neighborhoods are studied and mapped to the
 392 hierarchical GPU for binary problems. The three neighborhoods are
 393 based on the \emph{Hamming} distance. The move operators used in
 394 the three neighborhoods consider \emph{Hamming} distances of 1,
 395 2, and 3 (this consists on flipping the binary value of one, two,
 396 or three positions at a time in the candidate binary vector).
 397 In~\cite{luong2010large}, each thread is associated to a unique
 398 solution in the neighborhood. The addressed issue is how to
 399 efficiently map the different neighborhoods on the device memory,
 400 more explicitly, how to calculate the memory index of each
 401 solution associated to each CUDA thread's \textit{id}.
 402 %For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes  in the candidate vector.
 403 The three neighborhoods are  implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{Metaheuristics!tabu~search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes.\\ % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
 404
 405 In the same context, Deevacq \emph{et al.}~\cite{audreyANT}
 406 proposed two parallelization strategies inspired by the multi-walk
 407 parallelization strategy, of a 3-opt iterated local
 408 search\index{Metaheuristics!iterated local search} algorithm (ILS)
 409 over a CPU/GPU architecture. In the first strategy, each Local
 410 Search (LS) is associated to a unique CUDA thread and improves a
 411 unique solution by generating its neighborhood. The second
 412 strategy associates each solution to a CUDA block and the
 413 neighborhood exploration is parallelized on the block threads. In
 414 the first strategy, since several LS are executed on different
 415 solutions on each Multi Processor (MP), the data structures should
 416 be stored on the global memory, while the exploration of a single
 417 solution at a time in the second strategy allows the use of the
 418 shared memory to store the related data structures. The two
 419 strategies have been experimented on standard benchmarks of
 420 the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cities. The results indicate that increasing the number of solutions to be explored simultaneously improves the speedup in the two strategies, but at the same time it decreases final solution quality. The greater speedup factor reached by the second strategy is $6 \times$.\\
 421
 422 The same strategy is used by Luong \emph{et al.}
 423 in~\cite{luongMultiStart} to implement multistart parallel local
 424 search algorithms (a special case of the
 425 algorithmic-level\index{Metaheuristics!algorithmic-level
 426 parallelism} parallel model where several homogeneous LS
 427 algorithms are used). The multistart model is combined with
 428 iteration-level\index{Metaheuristics!iteration-level parallelism}
 429 parallelism: several LS algorithms are managed by the CPU and the
 430 neighborhood evaluation step of each algorithm is parallelized on
 431 the GPU (each GPU thread is associated with one neighbor and
 432 executes the same evaluation function kernel). The advantage of
 433 such a model is that it allows a  high occupancy of the GPU
 434 threads. Nevertheless, memory management\index{GPU
 435 Challenges!memory~management} causes new issues due to the
 436 quantity of data to store and to communicate between CPU     and
 437 GPU. A second proposition for implementing the same model on GPU
 438 consists of implementing the whole LS processes on GPU with each
 439 GPU thread being associated to a unique LS algorithm. This solves
 440 the communication issue encountered in the first model. In
 441 addition, a memory management\index{GPU
 442 Challenges!memory~management} strategy is proposed to improve the
 443 efficiency of the
 444 algorithmic-level\index{Metaheuristics!algorithmic-level
 445 parallelism} model: texture memory is used to avoid memory latency
 446 due to uncoalesced memory accesses. The proposed approaches are
 447 implemented on the quadratic assignment problem (QAP) using CUDA.
 448 The acceleration rates obtained for the
 449 algorithmic-level\index{Metaheuristics!algorithmic-level
 450 parallelism} with usage of texture memory rise from $7.8\times$ to
 451 $12\times$ (for different
 452 QAP benchmark sizes). \\
 453
 454 Janiak \emph{et al.}~\cite{Janiak_et_al_2008} implemented two
 455 algorithms for TSP and the flow-shop scheduling problem (FSP).
 456 These algorithms are based on a multistart tabu
 457 search\index{Metaheuristics!tabu~search} model. Both of the two
 458 algorithms exploit multicore CPU and GPU. A full parallelization
 459 on GPU is adopted using shader libraries where each thread is
 460 mapped with one tabu search\index{Metaheuristics!tabu~search}.
 461 However, even though their experiments report that the use of GPU
 462 speedups the serial execution almost $16 \times$, the mapping of
 463 one thread with one tabu search\index{Metaheuristics!tabu~search}
 464 requires a large number of local search algorithms to
 465 cover the memory access latency. The same mapping policy is adopted by Zhu \emph{et al.} in~\cite{zhu_et_al_2008} (one thread is associated to one local search) solving the quadratic assignment problem but using the CUDA toolkit instead of shader libraries.\\
 466
 467 Luong \emph{et al.}~\cite{luong2012ppsn} proposed a GPU-based
 468 implementation of hybrid metaheuristics on heterogeneous parallel
 469 architectures (multicore CPU  coupled to one GPU).  The challenge
 470 of using such a heterogeneous architecture is how to distribute
 471 tasks between the CPU cores and the GPU in such a way to have
 472 optimal performances. Among the three traditional parallel models
 473 (solution-level\index{Metaheuristics!solution-level~parallelism},
 474 iteration-level\index{Metaheuristics!iteration-level parallelism},
 475 and algorithmic-level\index{Metaheuristics!algorithmic-level
 476 parallelism}), the authors pointed out that the most convenient
 477 model for the considered heterogeneous architecture is a hybrid
 478 model combining
 479 iteration-level\index{Metaheuristics!iteration-level parallelism}
 480 and algorithmic-level\index{Metaheuristics!algorithmic-level
 481 parallelism} models. Several CPU threads execute several instances
 482 of the same S-metaheuristic in parallel while the GPU device is
 483 associated to one CPU core and  used to accelerate the
 484 neighborhood calculation of several S-metaheuristics at the same
 485 time.
 486  In order to efficiently exploit the remaining CPU cores, a load-balancing heuristic is also proposed in order to decide on the number of additional
 487 S-metaheuristics to launch on the remaining CPU cores relative to the efficiency of the GPU calculations. The proposed approach is applied to the QAP using several instances of the fast ant colony algorithm (FANT)~\cite{taillardFant}. \\
 488
 489 All the previously noted works  exploit the same parallel models
 490 used on CPUs based on the task parallelism. A different
 491 implementation approach is proposed by Paul in~\cite{gerald2012}
 492 to implement a simulated
 493 annealing\index{Metaheuristics!simulated~annealing} (SA) algorithm
 494 for the QAP on GPUs. Indeed, the author used a preinitialized
 495 matrix \emph{delta} in which the incremental evaluation of simple
 496 swap moves are calculated and stored relative to the initial
 497 permutation $p$. For the GPU implementation, the authors used the
 498 parallel implementation of neighborhood exploration. The
 499 time-consuming tasks in the SA-matrix are identified by the
 500 authors as: updating the matrix and passing through it to select
 501 the next accepted move. To initialize the delta matrix, several
 502 threads from different blocks explore different segments of the
 503 matrix (different moves) at the same time.  In order to select the
 504 next accepted, swap several threads are also used. Starting from
 505 the last move, a group of threads explores different subsets of
 506 the delta matrix. The shared memory is used to preload all the
 507 necessary elements for a given group of threads responsible for
 508 the updating of the delta matrix. The main difference in this work
 509 compared to the previous works resides in the introduction of a
 510 data parallelism using the precalculated delta matrix. The use of
 511 this matrix allows the increasing of the number of threads
 512 involved in the evaluation of a single move. Experimentations are
 513 done on standard QAP instances from the
 514 QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are
 515 achieved by the GPU implementation compared
 516 to the same sequential implementation on CPU using SA-matrix.\\
 517
 518 \subsection{Implementing population-based metaheuristics on GPUs}
 519
 520 State-of-the-art works dealing with the implementation of
 521 p-metaheuristics on GPUs generally rely on parallel models and
 522 research efforts done for parallelizing different classes of
 523 p-metaheuristics over different types of parallel architectures:
 524 supercomputers, clusters, and computational grids. Three main
 525 classes of p-metaheuristics are considered in this section:
 526 evolutionary
 527 algorithm\index{Metaheuristics!evolutionary~algorithm}s (EAs), ant
 528 colony\index{Metaheuristics!ant~colony~optimization} optimization
 529 (ACO\index{Metaheuristics!ant~colony~optimization}), and particle
 530 swarm optimization\index{Metaheuristics!particle swarm
 531 optimization} (PSO).
 532
 533 \subsubsection*{Evolutionary Algorithms}
 534
 535 Traditional parallel models for EAs are classified in three main
 536 classes: coarse grain models using several sub-populations
 537 (islands), master-slave models used for the parallelization of CPU
 538 intensive steps (evaluation and transformation), and cellular
 539 models based on the use of one population disposed (generally) on
 540 a toroidal grid.
 541
 542 The three traditional models have been implemented on GPUs by several researchers for different
 543 optimization problems. The main chalenges to be raised when implementing the traditional models on GPUs concern (1) the saturation of the GPU in order to cover memory latency by calculations, and (2) efficent usage of the hierarchical GPU  memories.\\
 544
 545 In~\cite{kannan}, Kannan and Ganji present a CUDA implementation
 546 of the drug discovery application Autodock (molecular docking
 547 application). Autodock uses a genetic
 548 algorithm\index{Metaheuristics!genetic~algorithm} to find optimal
 549 docking positions of a ligand to a protein. The most
 550 time-consuming task in Autodock is the fitness function
 551 evaluation. The fitness function used for a docking problem
 552 consists of calculating the energy  of the ligand-protein complex
 553 (sum of intermolecular energies). The authors explore two
 554 different approaches to evaluate the fitness function on GPU. In
 555 the first approach, each GPU thread calculates the energy function
 556 of a single individual. This approach requires the use of
 557 large-sized populations to saturate the GPU (thousands of
 558 individuals). In the second approach each individual is associated
 559 with one thread block. The evaluation of the energy function is
 560 performed by the threads of the same block. This allows the use of
 561 medium population sizes (hundreds of individuals) and the
 562 acceleration of a single fitness evaluation. Another great
 563 advantage of the per block approach resides in the use of shared
 564 memory instead of global memory to store all the information
 565 related to each individual. The obtained speedups range from  $10
 566 \times$ to $47 \times$ for population sizes
 567 ranging from $50$ to $10000$. \\
 568
 569 Maitre \emph{et al.}~\cite{maitre2009} also exploited the
 570 master-slave parallelism of EAs on GPUs using EASEA. EASEA is a
 571 C-like metalanguage for easy development of EAs. The user writes a
 572 description  of the problem specific components (fitness function,
 573 problem representation, etc) in EASEA. The code  is then compiled
 574 to obtain a ready-to-use evolutionary
 575 algorithm\index{Metaheuristics!evolutionary~algorithm}. The EASEA
 576 compiler  uses genetic
 577 algorithm\index{Metaheuristics!genetic~algorithm}LIB and EO
 578 Libraries to produce C++ or JAVA written EA codes.
 579 In~\cite{maitre2009}, the authors  proposed an extension of EASEA
 580 to produce CUDA code from the EASEA files. This extension has been
 581 used  to generate a master-slave parallel EA in which the
 582 sequential algorithm is maintained on CPU and the population is
 583 sent to GPU for evaluation. Two problems have been considered
 584 during the experiments: a benchmark mathematical function and a
 585 real problem (molecular structure prediction). In order to
 586 maximize the GPU occupation, very large populations are used (from
 587 $2000$ to $20000$). Even though transferring such large
 588 populations from the CPU to the GPU device memory at every generation is very costly, the authors reported important speedups on the two problems on a GTX260 card:  $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management\index{GPU Challenges!memory~management} of the populations on GPU.\\
 589
 590 The master-slave  model is efficient when the fitness function is
 591 highly time intensive. Nevertheless, it requires the use of
 592 large-sized populations in order to saturate the GPU unless the
 593 per-block is used (one individual perblock) when the acceleration
 594 of the fitness function itself is possible. The use of many
 595 sub-populations of medium sizes is another way to obtain a maximum
 596 occupation of the GPU. This is coarse-grained parallelism (island
 597 model).\\
 598
 599 The coarse grained model is  used by Pospichal \emph{et al.}
 600 in~\cite{pospichal10} to implement a parallel genetic
 601 algorithm\index{Metaheuristics!genetic~algorithm} on GPU. In this
 602 work the entire genetic
 603 algorithm\index{Metaheuristics!genetic~algorithm} is implemented
 604 on GPU. This choice is motivated by the overhead engendered by the
 605 CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication}
 606 when only population evaluation is performed on GPU. Each
 607 population island is mapped with a CUDA thread block and each
 608 thread is responsible for a unique individual. Subpopulations are
 609 stored on shared memory of each block. Nevertheless, because
 610 interblock communications are not possible on the CUDA
 611 architecture, the islands evolve independently in each block, and
 612 migrations are performed
 613 asynchronously through the global memory. That is, after a given number of generations, selected individuals for migration from each island are copied to the GPU global memory part of the neighbor island and then to its shared memory to replace the worst individuals in the local population. The experiments are performed on three benchmark mathematical functions. During these experiments, the island sizes are varied from $2$ to $256$ individuals and island numbers from $1$ to $1024$. The maximum performance is achieved for high number of islands and increasing population sizes.\\
 614
 615 The same strategy is also adopted by Tsutsui and Fujimoto
 616 in~\cite{tsutsuiGAQAP}  to implement a coarse-grained genetic
 617 algorithm\index{Metaheuristics!genetic~algorithm} on GPU to solve
 618 the QAP. Initially, several subpopulations are created on CPU and
 619 transferred to the global memory. The subpopulations are organized
 620 in the global memory into blocks of $8$ individuals in such a way
 621 to allow coalesced memory access by the threads of the same thread
 622 block. Each sub-population is allocated to a single thread block
 623 in the GPU and transfered to the shared memory to start evolution.
 624 Population evaluation and transformation are done in parallel by
 625 the different threads of a given block. Migration is also done
 626 through the global memory. Experiments are performed on standard
 627 QAP benchmarks from the QAPLIB~\cite{burkard1991qaplib}. The GPU
 628 implementation reached speedups of $2.9\times$ to $12.6 \times$
 629 compared to a single core implementation of a coarse-grained
 630 genetic algorithm\index{Metaheuristics!genetic~algorithm} on a
 631 Intel i7 processor.\\
 632
 633 Nowotniak \emph{et al.}~\cite{nowotniak} proposed a GPU-based
 634 implementation of a quantum inspired genetic
 635 algorithm\index{Metaheuristics!genetic~algorithm} (QIGA). The used
 636 parallel model is a hierarchical model based on two levels: each
 637 thread in a block transforms a unique individual and a different
 638 population is assigned to each block. The algorithm is run
 639 entirely on GPU. A different instance of the QIGA is run on each
 640 block and the computations have been shared between 8 GPUs. This
 641 approach is very convenient to speed up the experimental process
 642 on metaheuristics that require several independent runs of the
 643 same algorithm in order to asses statistical efficiency. The
 644 populations are stored in the shared memory, the data matrix used
 645 for fitness evaluation is placed in read only constant memory, and
 646 finally seeds for random numbers generated on the GPU are stored
 647 in the global memory.\\
 648
 649 In coarse-grained parallelism, the use of the per-block approach
 650 to implement the islands (one subpopulation per thread block) is
 651 almost natural  and it allows the use of shared memory to store
 652 the subpopulations. Fine-grained parallel models for EAs (cellular
 653 EAs) have been used by many authors to implement parallel EAs on
 654 GPUs. Indeed, the fine-grained parallelism of EAs fits perfectly
 655 to the SIMD architecture of the GPU. \\
 656
 657 Pinel \emph{et al.} in~\cite{pinel2012JPDC} developed a highly
 658 parallel synchronous cellular genetic
 659 algorithm\index{Metaheuristics!genetic~algorithm} (CGA), called
 660 GraphCell, to solve the independent task scheduling problem on GPU
 661 architectures. In CGAs, the population is arranged into a
 662 two-dimensional toroidal grid where only neighboring solutions are
 663 allowed to interact with each other  during the recombination
 664 step. In GraphCell, two recombination operators for CGA are
 665 especially designed to run efficiently on GPU. Indeed, instead of
 666 assigning a single thread to each solution of the population to
 667 perform the recombination, in GraphCell, a single thread is
 668 assigned to each task  of each solution. Offsprings are created by
 669 independently modifying the assignment of some tasks in the
 670 current solution. Mainly, each thread chooses one neighboring
 671 solution in the grid as second parent using different selection
 672 strategies and assigns one  task of  the solution (first parent)
 673 to the machine for which the same task is assigned in the second
 674 parent.  This way, the number of threads used for the
 675 recombination step is equal to population size $\times$  size of
 676 the solution (number of tasks). This leads to a high number of
 677 threads used to accelerate the recombination operators especially
 678 when dealing with large instances of the problem. In addition to
 679 the recombination operators, the rest of the CGA steps are  also
 680 parallelized on GPU (fitness evaluation, mutation, and
 681 replacement).\\
 682
 683 A similar work is proposed by Vidal and Alba  in~\cite{albaCGAGPU}
 684 where a  CGA using a toroidal grid is completely implemented on
 685 GPU. A direct mapping between the population and the GPU threads
 686 is done. At each step, several threads execute the same operations
 687 on each individual independently: initialization, computing the
 688 neighborhood, selection, crossover, mutation, and evaluation.  A
 689 synchronization is done for all threads to perform the replacement
 690 stage and form the next generation. All the data of the algorithm
 691 is placed on the global memory. Several experiments have been
 692 performed on a set of standard benchmark functions with different
 693 grid sizes ranging from $32^2$ to $512^2$. The speedups reached by
 694 the GPU implementation against the CPU version range from
 695 $5\times$ to $24\times$ and increase as the size of the population
 696 increases. However, the CPU implementation run faster than the GPU
 697 version for all the tested  benchmarks when the size of the
 698 population is set to $32^2$. When the size of the population is
 699 too small, there are not enough computations to cover the overhead
 700 created by the call of kernel functions, CPU/GPU
 701 communication\index{GPU Challenges!CPU/GPU~communication}s,
 702 synchronization, and access to global memory. Finally, an
 703 interesting review on GPU parallel computation in bio-inspired
 704 algorithms is proposed by Arenas \emph{et al.}
 705 in~\cite{arenas2011}. \\
 706
 707 \subsubsection*{Ant Colony Optimization}
 708
 709 Ant colony optimization
 710 (ACO\index{Metaheuristics!ant~colony~optimization}) is another
 711 p-metaheuristic subject to parallelization on GPUs.
 712 State-of-the-art works on parallelizing
 713 ACO\index{Metaheuristics!ant~colony~optimization} focus on
 714 accelerating the tour construction step performed by each ant by
 715 taking a task-based parallelism approach, with pheromone
 716 deposition on the CPU.\\
 717
 718 In~\cite{cecilia}, Cecilia \emph{et al.}  present a GPU-based
 719 implementation of
 720 ACO\index{Metaheuristics!ant~colony~optimization} for TSP where
 721 the two steps (tour construction and pheromone update) are
 722 parallelized on the GPU. A data parallelism approach is used to
 723 enhance the performance of the tour  construction step. The
 724 authors use two categories of artificial ants: queen ants
 725 associated with CUDA thread-blocks and worker ants associated with
 726 CUDA threads. A queen ant represents a simulated ant and worker
 727 ants collaborate with the queen ant to accelerate the decision
 728 about the  next city to visit. The tour construction step of each
 729 queen ant is accelerated. Each worker ant  maintains a history of
 730 the search in a tabu list containing a chronological ordering of
 731 the already visited cities. This memory is used to determine the
 732 feasible neighborhoods. After all queen ants have constructed
 733 their tours, the pheromone trails are updated. For pheromone
 734 update, several GPU techniques are also used to increase the data
 735 bandwidth of the application mainly by the use of precalculated
 736 matrices that are easily updated by several threads (one thread
 737 per matrix entry). The achieved speedups are $21 \times$ for tour
 738 construction and  $20 \times$ for pheromone updates.\\
 739
 740 In another work, Tsutsui \emph{et al.}~\cite{tsutsui} propose  a
 741 hybrid algorithm combining
 742 ACO\index{Metaheuristics!ant~colony~optimization} metaheuristic
 743 and tabu search (TS)\index{Metaheuristics!tabu~search} implemented
 744 on GPU to solve the QAP. A solution of QAP is represented as a
 745 permutation of ${1,2,..,n}$ with $n$ being the size of the
 746 problem. The TS algorithm is based on the 2-opt neighborhood
 747 (swapping of two elements $(i,j)$ in the permutation). The authors
 748 point out that the move cost of each neighbor depends on the
 749 couple $(i,j)$. Two groups of moves are formed according to the
 750 move cost. In order to avoid thread divergence\index{GPU
 751 Challenges!thread~divergence} within the same warp, the
 752 neighborhood evaluation is parallelized in such a way to assign
 753 only moves of the same cost to each thread warp. This strategy is
 754 called MATA for Move-cost Adjusted Thread Assignment. Concerning
 755 the memory management\index{GPU Challenges!memory~management}, all
 756 the data of the ACO\index{Metaheuristics! ant~colony~optimization}
 757 (population, pheromone matrix), QAP matrices, and tabu list are
 758 placed on the global memory of the GPU. Shared memory is used only
 759 for working data common to all threads in a given block.
 760  All the
 761 steps of the hybrid algorithm
 762 ACO\index{Metaheuristics!ant~colony~optimization}-TS
 763 (ACO\index{Metaheuristics!ant~colony~optimization} initialization,
 764 pheromone update, construct solutions, applying TS) are
 765 implemented as kernel functions on the GPU. The GPU/CPU
 766 communications are only used to transfer the best-so-far solution
 767 in order to verify termination conditions. The experiments
 768 performed on standard QAP benchmarks showed that the GPU
 769 implementation using MATA obtained a  speedup of $19 \times$
 770 compared to the CPU implementation,  compared with a speedup of
 771 only $5 \times$
 772 when MATA is not used. \\
 773
 774 \subsubsection*{Particle Swarm Optimization}
 775 In~\cite{zhou2009} Zhou and Tan propose a full GPU implementation
 776 of a standard PSO algorithm. All the data is stored in global
 777 memory (velocities, positions, swarm population, etc). Only
 778 working data is copied to shared memory by each thread. The four
 779 steps of the PSO have been parallelized on GPU: fitness evaluation
 780 of the swarm, update of local best and global best  of each
 781 particle, and update of velocity and position of each particle.
 782 The same strategy is used to parallelize the first and last
 783 steps: the evaluation of fitness functions is performed in
 784 parallel for each dimension  by several threads. It is the case
 785 for the update of position and velocities of each particle: one
 786 dimension at a time is updated for the whole swarm by several
 787 threads. Random numbers needed for  updating the velocities and
 788 positions for the whole PSO processes are generated on CPU at the
 789 starting of the algorithm and transferred to the GPU global
 790 memory. For the steps 2 and 3 (update of local best and global
 791 best of each particle), the GPU threads are mapped to the \emph{N}
 792 particles of the swarm one to one. Experiments done on four
 793 benchmark functions show speedups ranging from $3.7 \times$ to
 794 $9.0 \times$ for swarm sizes
 795 ranging from $400$ to $2800$.\\
 796
 797 \subsection{Synthesis of the implementation strategies}
 798 \label{ch8:sec:synthesis} After reviewing some works dealing with
 799 GPU-based implementation of metaheuristics, in this section we
 800 will try to come up with a classification of the different
 801 strategies used in the literature for the implementation of
 802 metaheuristics on GPUs. One may distinguish between the parallel
 803 models adopted in each metaheuristic (design level)  and the way
 804 they are exploited on GPU architectures (implementation level).
 805 Indeed, even though the parallelization models used in most works
 806 for GPUs are derived from the traditional parallel models of each
 807 metaheuristic (on CPU), their implementation could take a
 808 different way and sometimes it may result in new parallel models
 809 customized for GPUs.\\
 810
 811 Traditional parallel models for metaheuristics are based on an
 812 intuitive task parallelism: the independent tasks of the
 813 algorithms are simply parallelized. For example, in the case of
 814 s-metaheuristics, the evaluation of large neighborhoods could be
 815 done in parallel since there is no synchronization at this step of
 816 the algorithm. That is the case of EAs  when it comes to applying
 817 the evaluation step. Nevertheless, because of the particularity of
 818 the GPU architecture, some authors have used new implementation
 819 techniques to enhance the data parallelism in the sequential
 820 algorithms in order to increase the data throughput of the
 821 application.\\
 822
 823 From this observation we propose the following classification
 824 based on 2 levels: design level and implementation level as
 825 illustrated in Figure~\ref{ch8:fig:classification}. The design
 826 level regroups the three classes of parallel models used in
 827 metaheuristics
 828 (solution-level\index{Metaheuristics!solution-level~parallelism},
 829 iteration-level\index{Metaheuristics!iteration-level parallelism},
 830 algorithmic-level\index{Metaheuristics!algorithmic-level
 831 parallelism}) with examples for s-metaheuristics, EAs,
 832 ACO\index{Metaheuristics!ant~colony~optimization} and PCO. This
 833 classification  is principally built from the reviewed
 834 state-of-the-art works  in the previous section. The
 835 implementation level refers to the way these parallel design
 836 models are implemented on GPU. This classification focuses only on
 837 the mapping strategies between the GPU threads and the
 838 parallelized tasks (neighborhood evaluation, solution
 839 construction, and so on). The different implementation strategies
 840 are explained in the following sections.\\
 841
 842 \begin{figure}[h]
 843 \centerline{\includegraphics[width=1\textwidth]{Chapters/chapter9/figures/classification.pdf}}
 844 \caption{A two level classification of state-of-the-art GPU-based parallel metaheuristics.}
 845 %[Comparison of number of cores in a CPU and in a GPU]
 846 \label{ch8:fig:classification}
 847 \end{figure}
 848
 849
 850 \subsubsection*{GPU thread mapping for solution-level parallelism}
 851 \index{GPU-based Metaheuristics!GPU-thread mapping} Parallel
 852 models at solution level consist of parallelizing a time intensive
 853 atomic task of the algorithm. Generally, it consists of the
 854 fitness evaluation~\cite{kannan}. Nevertheless, crossover
 855 operators have also been parallelized by some
 856 authors~\cite{pinel2012JPDC}. These kinds of models are not always
 857 possible as they are problem-dependent. The GPU implementation of
 858 solution-level models uses the perblock mapping: each solution is
 859 associated to a block of threads. A second level of parallelism is
 860 used inside each block to parallelize the fitness evaluation of a
 861 single solution. This kind of mapping allows the use of shared
 862 memory to store the data structures of the solution and does not
 863 require the use of very large neighborhoods or populations.\\
 864 %data parallelism in SA-matrix to parallelize
 865
 866 \subsubsection*{GPU thread mapping for iteration-level parallelism}
 867 \index{GPU-based Metaheuristics!GPU-thread mapping}
 868 Iteration-level parallelism consists of parallelizing the tasks
 869 performed independently on different solutions. Different mapping
 870 strategies are adopted in the reviewed works to implement these
 871 models.\\
 872
 873 In Figure \ref{ch8:fig:classification}, the first example of
 874 iteration-level\index{Metaheuristics!iteration-level parallelism}
 875 parallelism is the parallel evaluation of neighborhoods in
 876 s-metaheuristics. In most of the reviewed works,  a per-thread
 877 mapping approach is used: each solution of the neighborhood is
 878 mapped to a unique thread in the GPU for
 879 evaluation~\cite{luong2010large, audreyANT}. The same mapping is
 880 used for master-slave parallel EAs to accelerate the evaluation of
 881 large populations. This kind of mapping is only efficient for very
 882 large neighborhoods and very large populations (to saturate the
 883 GPU). Many authors have pointed out that the use of such large
 884 populations (or neighborhoods) may lead to an overhead due to the
 885 communication costs between the CPU and the GPU (if the sequential
 886 part of the algorithm is placed on CPU). In order to circumvent
 887 this issue, many authors have implemented the entire algorithm on
 888 GPU~\cite{pospichal10}. On the other hand, as several solutions
 889 may be mapped with the same thread block in the GPU, shared memory
 890 could not be used and all the data should be placed on global
 891 memory.
 892 %pheromone update data parallelism matrices
 893
 894 \subsubsection*{GPU thread mapping  for algorithmic-level parallelism}
 895 \index{GPU-based Metaheuristics!GPU-thread mapping}
 896 Algorithmic-level parallelism consists of launching several self-contained algorithms in parallel. In the previously reviewed works two algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} models have been used: the multistart model and the island model (parallel EAs).\\
 897
 898 The implementation of the multistart model is based on two
 899 different mapping strategies~\cite{luongMultiStart, audreyANT}:
 900 (1) each local search (LS) is associated to a unique thread and
 901 (2) each solution (from multiple neighborhoods) is associated to a
 902 unique thread. The first mapping strategy (one thread per LS)
 903 presents a big drawback: the number of LS to use should be very
 904 large to saturate the GPU and cover the memory access latency. On
 905 the other hand, the evaluation of the neighborhood of a single LS
 906 by one thread is time intensive. Furthermore, shared memory could
 907 not be used to store the huge data generated by the different
 908 neighborhoods. In the second mapping strategy, the LS algorithms
 909 are placed on CPU and the neighborhood evaluation of each LS is
 910 parallelized on GPU using per-thread mapping strategy (one thread
 911 per solution). This consists of a hierarchical parallel model
 912 combining algorithmic-level\index{Metaheuristics!algorithmic-level
 913 parallelism} parallelism (multistart) with
 914 iteration-level\index{Metaheuristics!iteration-level parallelism}
 915 parallelism (master-worker).\\
 916
 917 In the island model, the same mapping is used in all the reviewed
 918 works~\cite{tsutsuiGAQAP, nowotniak, maitre2009}: each
 919 subpopulation (island) is associated to one thread block in the
 920 GPU. A second level of parallelism is used inside the block to
 921 parallelize the evaluation step of the local population.
 922 Migrations are always performed through the global memory as
 923 interblock communications  are impossible in CUDA. The first
 924 advantage of this hierarchical implementation is that it allows
 925 the occupation of a large number of threads even for medium
 926 population sizes. The second advantage consists of using shared
 927 memory to store subpopulations to benefit from the low latency of
 928 shared memory.\\
 929
 930 \section{Frameworks for metaheuristics on GPUs}
 931 \label{ch8:sec:frameworks}
 932
 933 After the first pioneering works of metaheuristics on graphics
 934 processing units, the next challenge is to provide easy-to-use
 935 frameworks and libraries for rapid development of metaheuristics
 936 on GPUs. Although the works on this subject are not yet mature and
 937 do not cover the main metaheuristic algorithms, we will present
 938 the only three works to our knowledge, which propose open source
 939 frameworks for developing
 940 metaheuristics on GPUs.\\
 941
 942 The three frameworks reviewed in this section are
 943 PUGACE\index{GPU-based frameworks!PUGACE}~\cite{pugace},
 944 ParadisEO-MO-GPU\index{GPU-based
 945 frameworks!ParadisEO-MO-GPU}~\cite{paradiseoGPU}, and
 946 libCudaOptimize\index{GPU-based
 947 frameworks!libCudaOptimize}~\cite{libcuda}. PUGACE\index{GPU-based
 948 frameworks!PUGACE} is a framework for implementing EAs on GPUs.
 949 ParadisEO-MO-GPU is an extension of the framework
 950 ParadisEO~\cite{paradiseo} for implementing s-metaheuristics on
 951 GPU. Finally, libCudaOptimize\index{GPU-based
 952 frameworks!libCudaOptimize} is a  library intended for the
 953 implementation of p-metaheuristics on GPU. The three frameworks
 954 are presented in more detail in the following.
 955
 956 \subsection{PUGACE\index{GPU-based frameworks!PUGACE}: framework for implementing evolutionary computation on GPUs}
 957 PUGACE\index{GPU-based frameworks!PUGACE} is a generic  framework
 958 for easy implementation of cellular evolutionary algorithms on
 959 GPUs implemented using C and CUDA. It is based on the frameworks
 960 MALLBA and JCell (a framework for cellular algorithms). The
 961 authors justified the choice of cellular evolutionary
 962 algorithm\index{Metaheuristics!evolutionary~algorithm} by the good
 963 feedback found in the literature concerning its efficient
 964 implementation on GPUs compared to other parallel models for EAs
 965 (island, master-slave). The main standard evolutionary operators
 966 are already implemented in PUGACE\index{GPU-based
 967 frameworks!PUGACE}: different selection strategies, standard
 968 crossover, and mutation operators (\emph{PMX, swap, 2-exchange},
 969 etc.). Different problem encoding is also supported. The framework
 970 is organized as a set of modules in which the different
 971 functionalities  are separated as much as possible in order to
 972 facilitate the extension of the framework. All
 973 the functions and procedures that execute on GPU are implemented in the same file kernel.cu. \\
 974
 975 The implementation strategy adopted on the GPU is as follows.
 976 Population initialization is done on the CPU side and the
 977 population is transferred to the GPU. On the GPU side, each
 978 individual is associated to a unique CUDA thread. The function
 979 evaluation and mutation are done on the GPU while selection and
 980 replacement are maintained on the CPU. In order to avoid thread
 981 divergence\index{GPU Challenges!thread~divergence} appearing in
 982 the same CUDA thread block at the crossover step (because of the
 983 probability of application which may give different results from
 984 one thread to the other), the decision of whether to apply a
 985 crossover is taken at the block level and applied to all the
 986 individuals within the block. It is the decision on the choose
 987 of the cutting point for the crossover.\\
 988
 989 The framework is validated on standard benchmarks of the quadratic
 990 assignment problem (QAP). Speedups of $15.44 \times$  to $18.41
 991 \times$ are achieved compared to a CPU implementation of a cEA
 992 using population sizes rising from 512 to 1024 and from 1024 to
 993 2048.
 994
 995 % The authors noticed an
 996 % improvement on the speedup when the size of the population increases. This is best explained
 997 % by the linear increasing of execution time on CPU side while a sublinear increasing is
 998 % occuring on GPU side (due to a better exploitation of the GPU threads). Neverethless,
 999 % dispite the positive impact of large populations on th parallel efficieny of the implementation
1000 % on GPU, a study on the impact on the final solution quality when using such large populations
1001 % is to be done. Indeed, too large populations on EAs may lead to high diversification of the
1002 % search and poor intensification capabilities, which leads to poor results.
1003 % For the moment, plateforms whith more than one CPU and one GPU are not yet supported in PUGACE\index{GPU-based frameworks!PUGACE}.
1004 % No details on how to get PUGACE\index{GPU-based frameworks!PUGACE} code source are given in the paper.
1005
1006 \subsection{ParadisEO-MO-GPU}
1007
1008 Melab \emph{et al.}~\cite{paradiseoGPU} developed a reusable
1009 framework ParadisEO-MO-GPU\index{GPU-based
1010 frameworks!ParadisEO-MO-GPU} for parallel local search
1011 metaheuristics (s-metaheuristics) on GPUs. It focuses on the
1012 iteration-level\index{Metaheuristics!iteration-level parallelism}
1013 parallel model of s-metaheuristics which consists of exploring in
1014 parallel the neighborhood of a problem solution on GPU. The
1015 framework, implemented using C++ and CUDA, is an extension of the
1016 ParadisEO~\cite{paradiseo} framework previously developed by the
1017 same team for parallel and distributed metaheuristics on both
1018 dedicated parallel hardware platforms and computational grids. The
1019 objective of this framework is to facilitate the development of
1020 GPU-based metaheuristics providing a transparent use of GPUs to
1021 users who are unfamiliar with advanced features of all
1022 parallelization techniques and deployment on GPUs. The framework
1023 allows one to efficiently manage the hierarchical organization of
1024 the memories (latencies and sizes) of the GPU and its
1025 communication with the CPU as well as the minimizing of the user
1026 involvement in its management.\\
1027
1028 \begin{figure}[h]
1029 \centerline{\includegraphics[width=0.8\textwidth]{Chapters/chapter9/figures/paradiseoGPU.pdf}}
1030 \caption{The skeleton of the ParadisEO-MO-GPU\index{GPU-based
1031 frameworks!ParadisEO-MO-GPU}.}
1032 %[Comparison of number of cores in a CPU and in a GPU]
1033 \label{ch1:fig:paradiseoGPU}
1034 \end{figure}
1035
1036 The framework is based on a master-worker model where the master
1037 is the CPU and the workers are threads executed by the processing
1038 cores of the GPU. The CPU executes the serial part of the
1039 metaheuristic and sends only the current solution to the GPU to
1040 minimize the transfer cost. The GPU, on its side, generates the
1041 neighboring of the received solution and evaluates them at each
1042 iteration. All the threads execute the same kernel and according
1043 to a static mapping table between the threads and the neighbors
1044 where each thread is associated with exactly one neighbor
1045 evaluation. After all the neighborhood is generated and evaluated,
1046 it is sent back to the CPU which selects the best solution (See
1047 Figure~\ref{ch1:fig:paradiseoGPU}).
1048
1049 \subsection{libCudaOptimize: an open source library of GPU-based metaheuristics}
1050 \index{GPU-based frameworks!libCudaOptimize}
1051 LibCudaOptimize~\cite{libcuda} is a C++/Cuda open source library
1052 for the design and implementation of metaheuristics on GPUs. Until
1053 now, the metaheuristics supported by LibCudaOptimize are: scatter
1054 search, differential evolution, and particle swarm
1055 optimization\index{Metaheuristics!particle swarm optimization}.
1056 Nevertheless, the library is designed in such a way to allow
1057 further extensions for other metaheuristics and it is still in
1058 development phase by the authors. The parallelization strategy
1059 adopted on GPU is principally based on fitness evaluation. The
1060 sequential structure of the optimization algorithms is maintained
1061 on CPU.
1062
1063 \section{Case study: accelerating large neighborhood LS method on GPUs for solving the Q3AP}
1064 \label{ch8:sec:case}
1065
1066 In this case study, a large neighborhood GPU-based local
1067 search\index{GPU-based Metaheuristics!GPU-based~local~search}
1068 method for solving the quadratic 3-dimensional assignment problem
1069 (Q3AP) will be presented. The local search method is an iterated
1070 local search\index{Metaheuristics!iterated local search}
1071 (ILS)~\cite{stutzle2006ILSforQAP} using an embedded
1072 TS\index{Metaheuristics!tabu~search} algorithm. The ILS principle
1073 consists of executing iteratively the embedded local search, each
1074 iteration which starts from a disrupted local optima reached by
1075 the previous local search process. The disruption heuristic is a
1076 performance parameter of an ILS algorithm and should be
1077 judiciously defined. A template of an
1078 ILS algorithm is given by the Algorithm~\ref{ch8:ils_algorithm_template}.\\
1079
1080 \begin{algorithm}[H]
1081     \SetAlgoLined
1082     %\KwData{this text}
1083     %\KwResult{how to write algorithm with \LaTeX2e }
1084    $s_{0}$=GenerateInitSol()\;
1085    $s^*$=LocalSearch($s_0$)\;
1086     \Repeat{a maximum number of iterations is reached}{
1087      $s'$=Perturbate($s^*,history$)\;
1088      $s^{*'}$=LocalSearch($s'$)\;
1089      $s^*$=AcceptationCriteria($s^*,s^{*'},history$)\;
1090
1091     }
1092 \caption{Iterated local search template}
1093 \label{ch8:ils_algorithm_template}
1094 \end{algorithm}
1095
1096 %
1097 %   \begin{algorithm}
1098 %  % \label{ch8:ils_algorithm_template}
1099 %   \caption{iterated local search\index{Metaheuristics!iterated local search} template}
1100 %   \begin{algorithmic}[1]
1101 %
1102 %   \STATE $s_{0}$=\texttt{GenerateInitSol}();
1103 %
1104 %   \STATE $s^*$=\texttt{LocalSearch}($s_0$);
1105 %
1106 %   \REPEAT
1107 %     \STATE $s'$=\texttt{Perturbate}($s^*,history$);
1108 %     \STATE $s^{*'}$=\texttt{LocalSearch}($s'$);
1109 %     \STATE $s^*$=\texttt{AcceptationCriteria}($s^*,s^{*'},history$);
1110 %
1111 %   \UNTIL {a maximum number of iterations is reached}
1112 %
1113 %   \end{algorithmic}
1114 %
1115 %   \end{algorithm}
1116
1117 \subsection{The quadratic 3-dimensional assignment problem}
1118
1119 The Q3AP is an extension of the well-known NP-hard QAP. The latter
1120 is one of the most studied problem by the combinatorial
1121 optimization community due to its wide range of practical
1122 applications (site planning, schedule problem, computer-aided
1123 design, etc.) and to its computational challenges since it is
1124 considered as one of the most computationally difficult
1125 optimization problems.\\
1126
1127 The Q3AP was first introduced by William P. Pierskalla in
1128 1967~\cite{Pierskalla1967Q3AP} and, unlike the QAP, the Q3AP is a
1129 less studied problem. Indeed, the Q3AP was revisited only this
1130 past years and has recently been used to model some advanced
1131 assignment problems such as the symbol-mapping problem encountered
1132 in wireless communication systems and described
1133 in~\cite{Hahn2008q3ap}. The Q3AP optimization problem can be
1134 mathematically expressed as follows:
1135
1136 % debut -----------------------------------------------------------------------------
1137 \begin{equation*} min \left \{
1138 \sum_{i=0}^{n-1}\sum_{j=0}^{n-1}\sum_{l=0}^{n-1}\sum_{k=0}^{n-1}\sum_{s=0}^{n-1}\sum_{r=0}^{n-1}c_{ijlksr}x_{ijl}x_{ksr}
1139 \quad+ \right . \end{equation*}
1140
1141 \begin{equation} \label{Q3APFormulation} \left .
1142 \sum_{i=0}^{n-1}\sum_{j=0}^{n-1}\sum_{l=0}^{n-1}b_{ijl}x_{ijl}
1143 \right \} \end{equation}
1144
1145 where
1146
1147 \begin{eqnarray}
1148   X=(x_{ijl})\in I \cap J \cap L, \label{Q3APConstraints1}\\
1149   x_{ijl}\in \{0,1\}, \quad i,j,l=0,1,...,n-1. \label{Q3APConstraints2}
1150 \end{eqnarray} $I$, $J$, and $L$ sets are defined as follows:
1151 \begin{displaymath} I=\lbrace X=(x_{ijl}):
1152 \sum_{j=0}^{n-1}\sum_{l=0}^{n-1}x_{ijl}=1, \quad
1153 i=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
1154 \begin{displaymath} J=\lbrace X=(x_{ijl}):
1155 \sum_{i=0}^{n-1}\sum_{l=0}^{n-1}x_{ijl}=1, \quad
1156 j=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
1157 \begin{displaymath} L=\lbrace X=(x_{ijl}):
1158 \sum_{i=0}^{n-1}\sum_{j=0}^{n-1}x_{ijl}=1, \quad
1159 l=0,1,...,n-1\rbrace \mathrm{.} \end{displaymath}
1160 % fin------------------------------------------------------------------------------
1161
1162 Other equivalent formulations of the Q3AP can be found in the literature. A particularly useful one is the
1163 \textit{permutation-based formulation} wherein the Q3AP can be expressed as follows:
1164
1165  \begin{equation*}
1166  min\left\{ f(\pi_1,\pi_2) =
1167  \sum_{i=0}^{n-1}\sum_{j=0}^{n-1}c_{i\pi_{1(i)}\pi_{2(i)}j\pi_{1(j)}\pi_{2(j)}} \quad
1168  + \right.
1169 \end{equation*}
1170
1171 \begin{equation}
1172  \left. \sum_{i=0}^{n-1}b_{i\pi_{1(i)}\pi_{2(i)}}\right
1173  \} \label{Q3APPermutationBasedFormulation}
1174 \end{equation}
1175 where $\pi_1$ and $\pi_2$ are permutations over the set
1176 $\left\lbrace 0,1,\ldots,n-1\right\rbrace$. According to this
1177 formulation, minimizing the Q3AP consists of finding a double
1178 permutation $(\pi_1,\pi_2)$ which minimizes the objective function
1179 (\ref{Q3APPermutationBasedFormulation}).\\
1180
1181 The Q3AP is proven to be an NP-hard problem since it is an
1182 extension of the quadratic assignment problem and of the axial
1183  3-dimensional assignment problem which are both NP-hard problems. It is particularly computationally difficult since the
1184 number of feasible solutions of an instance of size $n$ is $n! \times n!$.
1185
1186 \subsection{Iterated tabu search algorithm for the Q3AP}
1187
1188 To tackle large-sized instances of the Q3AP and speed up the
1189 search process, a parallel ILS\index{Metaheuristics!iterated local
1190 search} algorithm has been designed. The local search embedded in
1191 the ILS is a TS\index{Metaheuristics!tabu~search}. A TS
1192 procedure~\cite{Glover1989TS} starts from an initial feasible
1193 solution and tries, at each step, to move to a neighboring
1194 solution that minimizes the fitness (for a minimization case). If
1195 no such move exists, the neighbor solution that less degrades the
1196 fitness is chosen as a next move. This enables, TS process to
1197 escape local optima. However, this strategy may generate cycles,
1198 i.e., previous moves can be selected again. To avoid cycles, the
1199 TS manages a short-term memory that contains the moves that have
1200 been recently performed. A TS\index{Metaheuristics!tabu~search}
1201 template is given by the Algorithm \ref{TS_pseudo_code}.\\
1202 %
1203 %  \begin{algorithm}
1204 %  \label{TS_pseudo_code}
1205 %  \caption{Tabu search template}
1206 %  \begin{algorithmic}[1]
1207 %  \STATE GenerateInitSol($s_0$);
1208 %  \STATE TabuList=$\phi$;
1209 %  \STATE $t=0$;
1210 %  \REPEAT
1211 %          \STATE $m(t)$ = SelectBestMove($s(t)$);
1212 %          \STATE $s(t+1)$ = ApplyMove($m(t),s(t)$);
1213 %          \STATE TabuList = TabuList $\bigcup \{m(t)\}$;
1214 %          \STATE $t = t + 1$;
1215 %  \UNTIL{a maximum number of iterations is reached}
1216 %
1217 %  \end{algorithmic}
1218 %  \end{algorithm}
1219
1220 \begin{algorithm}[H]
1221     \SetAlgoLined
1222     %\KwData{this text}
1223     %\KwResult{how to write algorithm with \LaTeX2e }
1224     GenerateInitSol($s_0$)\;
1225     TabuList=$\phi$\;
1226     $t=0$\;
1227     \Repeat{a maximum number of iterations is reached}{
1228           $m(t)$ = SelectBestMove($s(t)$)\;
1229           $s(t+1)$ = ApplyMove($m(t),s(t)$)\;
1230           TabuList = TabuList $\bigcup \{m(t)\}$\;
1231           $t = t + 1$\;
1232
1233     }
1234 \caption{Tabu search template}
1235 \label{TS_pseudo_code}
1236 \end{algorithm}
1237
1238
1239 \subsection{Neighborhood functions for the Q3AP}
1240
1241 The neighborhood function is a crucial parameter in any local
1242 search algorithm. Indeed, if the neighborhood function is not
1243 adequate to the problem and/or does not consider the targeted
1244 computing framework, any local search algorithm will fail to reach
1245 good quality solutions of the search space.\\
1246
1247 Regarding the Q3AP, many neighborhood structures can be
1248 considered. A basic neighborhood was proposed and investigated in
1249 different works of the
1250 literature~\cite{Hahn2008q3ap,DBLP:journals/ijfcs/LoukilMMTB12}
1251 and consists of the set of all solutions (double-permutations)
1252 generated from the current one by an exchange of two positions in
1253 either the first ($\pi_1$) or the second ($\pi_2$) permutation.
1254 This neighborhood that we denote by $N_b$ can be formalized as
1255 follows:
1256
1257 \begin{equation}
1258 \begin{array}{rl}
1259 N_b(\pi_1,\pi_2)=\{&(\pi_1^{'},\pi_2^{'}): \pi_1^{'}[k]=\pi_1[l], \pi_1^{'}[l]=\pi_1[k]\\
1260                        & 0\leq k \neq l < n;\\
1261                 & \pi_1^{'}[i]=\pi_1[i], \quad 0\leq i\neq k,l < n;\\
1262                 & \pi_2^{'}[j]=\pi_2[j], \quad 0\leq j<n\\ \} \\
1263 \bigcup \\
1264 \{& (\pi_1^{'},\pi_2^{'}): \pi_2^{'}[k]=\pi_2[l], \pi_2^{'}[l]=\pi_2[k]\\
1265                 & 0\leq k \neq l < n;\\
1266                 & \pi_2^{'}[i]=\pi_2[i], \quad 0\leq i < n;\\
1267                 & \pi_1^{'}[j]=\pi_1[j], \quad 0\leq j\neq k,l < n\\
1268 \}& \end{array}
1269 \end{equation}
1270 where $(\pi_1,\pi_2)$ is the current solution and $n$ is the size
1271 of the Q3AP instance. The size $|N_b|$ of such neighborhood is
1272 equal to
1273
1274 \begin{equation}
1275  |N_b| = n \times (n-1)
1276 \end{equation}
1277
1278
1279 In our GPU-based implementations, a large-sized neighborhood
1280 structure has been used for experimentation. In fact, theoretical
1281 and experimental studies have shown that the use of large
1282 neighborhood structures may improve the effectiveness of LS
1283 algorithms~\cite{Ahuja:2007:VLN:1528422.1528438}. However, for
1284 such a neighborhood, the generation/evaluation step of an LS
1285 becomes a time-consuming task and may dramatically increase the
1286 computational time of the LS process. This
1287 justifies the use of intensive data-parallelism provided by GPUs where all neighboring solutions may be concurrently evaluated. \\
1288
1289 The considered large-sized neighborhood consists of swapping two
1290 positions in both permutations $\pi_1$ and $\pi_2$. This
1291 neighborhood structure, $N(\pi_1,\pi_2)$, can be formalized as
1292 follows:
1293
1294 \begin{equation} \begin{array}{rl}
1295 N(\pi_1,\pi_2)=\{&(\pi_1^{'},\pi_2^{'}): \pi_1^{'}[k]=\pi_1[l], \pi_1^{'}[l]=\pi_1[k],\\
1296                 & \pi_2^{'}[r]=\pi_2[s], \pi_2^{'}[s]=\pi_2[r]\\
1297                 & 0\leq k \neq l < n,0\leq r \neq s < n;\\
1298                 & \pi_1^{'}[i]=\pi_1[i], \quad 0\leq i\neq k,l < n;\\
1299                 & \pi_2^{'}[j]=\pi_2^{'}[j], \quad 0\leq j\neq r,s < n\\
1300
1301 \}& \end{array}
1302 \end{equation}
1303
1304 So, for a given Q3AP instance of size $n$, the size $|N|$ of the advanced neighborhood set can be expressed by the following formula:
1305
1306 \begin{equation}
1307   |N|=\left(\frac{n \times (n-1)}{2}\right )^2
1308 \end{equation}
1309
1310 \subsection{Design and implementation of a GPU-based iterated tabu search algorithm for the Q3AP} \label{ch8:ITS-Q3APSection}
1311
1312 The use of GPU devices to speed up the search process of local
1313 search methods is not a straightforward task. It requires one to
1314 consider, at the same time, the characteristics and underlying
1315 issues of the GPU architecture and the parallel models of LS
1316 methods. The main challenges that must be faced when designing a
1317 local search algorithm are the efficient distribution of the
1318 search process between the CPU and the GPU minimizing the data
1319 transfer between them, the hierarchical memory
1320 management\index{GPU Challenges!memory~management} and the
1321 capacity constraints of GPU memories, and the thread
1322 synchronization. All these issues must be regarded when designing
1323 parallel LS models to allow
1324 solving of large scale optimization problems on GPU architectures.\\
1325
1326 To go back to our problem (i.e., Q3AP), we propose in
1327 Algorithm~\ref{ch8:algoITS} an iterated tabu
1328 search\index{Metaheuristics!tabu~search} on GPU (GPU-ITS). The
1329 parallel model is in agreement with the
1330 iteration-level\index{Metaheuristics!iteration-level parallelism}
1331 parallel model of LS methods presented in Section
1332 \ref{ch8:sec:paraMeta} (Fig. \ref{ch8:fig:paraMeta}). This
1333 algorithm can be seen as a cooperative model between the CPU and
1334 the GPU. Indeed, the GPU is used as a coprocessor in a synchronous
1335 manner. The resource-consuming part, i.e., the generation and
1336 evaluation kernel, is calculated by the GPU and the rest of the LS
1337 process is handled by the CPU. First of all, at initialization
1338 stage, memory allocations on GPU are made; the input matrices
1339 (distance and flow matrices) and the initial candidate solution of
1340 the Q3AP must be allocated (lines 4 and 5). Since GPUs require
1341 massive computations with predictable memory accesses, a structure
1342 has to be allocated for storing all the neighborhood fitnesses at
1343 different addresses (line 6). Second, the matrices and the initial
1344 candidate solution have to be copied on the GPU (lines 7 and 8).
1345 We notice that the input matrices are read-only structures and do
1346 not change during all the execution of the LS algorithm.
1347 Therefore, their associated memory is copied only once during all
1348 the execution. Third, comes the parallel
1349 iteration-level\index{Metaheuristics!iteration-level parallelism},
1350 in which each neighboring solution is generated, evaluated, and
1351 copied into the neighborhood fitnesses structure (from lines 10 to
1352 14). Fourth, since the order in which candidate neighbors are
1353 evaluated is undefined, the neighborhood fitnesses structure has
1354 to be copied to the host CPU (line 15). Then the selection
1355 strategy is applied to this structure (line 16): the exploration
1356 of the neighborhood fitnesses structure is done by the CPU.
1357 Finally, after a new candidate has been selected, this information
1358 is copied to the GPU (line 18). The process is repeated until a
1359 given number of iterations has been reached.
1360
1361 % \begin{algorithm}
1362 % \caption{Template of an iterated tabu search\index{Metaheuristics!tabu~search} on GPU for solving the Q3AP} \label{LSGPU}
1363 %  \begin{algorithmic}[1]
1364 %  \STATE Choose an initial solution
1365 %  \STATE Evaluate the solution
1366 %  \STATE Initialize the tabu list
1367 %  \STATE Allocate the Q3AP input data on GPU device memory
1368 %  \STATE Allocate a solution on GPU device memory
1369 %  \STATE Allocate a neighborhood fitnesses structure on GPU device memory
1370 %  \STATE Copy the Q3AP input data on GPU device memory
1371 %  \STATE Copy the solution on GPU device memory
1372 %  \REPEAT
1373 %      \FOR{each generated neighbor on GPU}
1374 %          \STATE Incremental evaluation of the candidate solution
1375 %          \STATE Insert the resulting fitness into the neighborhood fitnesses structure
1376 %      \ENDFOR
1377 %      \STATE Copy the neighborhood fitnesses structure on CPU host memory
1378 %      \STATE Select the best admissible neighboring solution
1379 %      \STATE Update the tabu list
1380 %      \STATE Copy the chosen solution on GPU device memory
1381 %  \UNTIL{a maximum number of iterations reached}
1382 %  \end{algorithmic}
1383 %  \end{algorithm}
1384
1385 \begin{algorithm}[H]
1386     \SetAlgoLined
1387     %\KwData{this text}
1388     %\KwResult{how to write algorithm with \LaTeX2e }
1389     Choose an initial solution\;
1390     Evaluate the solution\;
1391     Initialize the tabu list\;
1392     Allocate the Q3AP input data on GPU device memory\;
1393     Allocate a solution on GPU device memory\;
1394     Allocate a neighborhood fitnesses structure on GPU device memory\;
1395     Copy the Q3AP input data on GPU device memory\;
1396     Copy the solution on GPU device memory\;
1397     $t=0$\;
1398     \Repeat{a maximum number of iterations is reached}{
1399         \For{each generated neighbor on GPU}{
1400           Incremental evaluation of the candidate solution\;
1401           Insert the resulting fitness into the neighborhood fitnesses structure\;
1402         }
1403         Copy the neighborhood fitnesses structure on CPU host memory\;
1404         Select the best admissible neighboring solution\;
1405         Update the tabu list\;
1406         Copy the chosen solution on GPU device memory\;
1407     }
1408 \caption{Template of an iterated tabu search on GPU for solving the Q3AP}
1409 \label{ch8:algoITS}
1410 \end{algorithm}
1411
1412
1413 \subsection {Experimental results}
1414
1415 In this section, some experimental results related to the approach
1416 presented in Section \ref{ch8:ITS-Q3APSection} are reported. We
1417 recall that the approach is a GPU-based iterated tabu
1418 search\index{Metaheuristics!tabu~search} (GPU-ITS) method
1419 consisting in an iterated local search (ILS) embedding a tabu
1420 search\index{Metaheuristics!tabu~search} (TS) and where the
1421 generation/evaluation step of the TS process is executed on GPU.
1422 The ILS is used to improve the quality of successive local optima
1423 provided by TS methods. This is achieved by perturbing the local
1424 optima reached by the current TS process and reconsidering it as
1425 initial solution of the following TS process. Regarding our
1426 algorithm, the applied perturbation is a random number $\mu $ of
1427 swaps in either the first or the second permutation where $\mu \in
1428 [2:n]$ ($n$ is the instance size).\\
1429
1430 Experiments have been carried out on a node of the Chirloute
1431 cluster of the Lille site. This is one of the 10 sites that
1432 currently make up Grid5000~\cite{grid5000}, the French
1433 experimental computational grid. A Chirloute cluster node consists
1434 of an Intel Xeon E5620 CPU and a NVIDIA Tesla Fermi M2050 (448
1435 cores) GPU type. The number of ILS iterations and the number of TS
1436 iterations were set respectively to 20 and 500. The tabu list size
1437 has been initalized to $\frac{m}{4}$, $m$ being the size of the
1438 neighborhood set.\\
1439
1440 Table \ref{ch8:ITSQ3APResults} reports the obtained results for
1441 the GPU-ITS using our large-sized neighborhood structure. The
1442 method has been tested on 5 Q3AP instances derived from the QAP
1443 Nugent instances in QAPLIB~\cite{burkard1991qaplib}. The average
1444 time measurement for 20 executions is reported in seconds and
1445 acceleration factors compared to a standalone CPU are also
1446 considered. The algorithm is stopped when a maximum number of
1447 iterations has been reached or when the optimal (or best known)
1448 value has been discovered. Average and max values of the
1449 evaluation function of the parallel GPU version have been
1450 measured. The number of successful tries (hits) and the average
1451 number of ILS iterations to converge to the optimal/best known
1452 value are also represented. The associated standard deviation for
1453 each average measurement is shown in small type characters.\\
1454
1455 \begin{table}
1456 %\tiny
1457 %\footnotesize
1458 \small
1459 \begin{tabular}{|l|r|r|r|r|r|r|r|r|} \hline
1460 \multicolumn{1}{|c|}{Instance }& \multicolumn{1}{|c|}{Optimal}& \multicolumn{1}{|c|}{Average}  & \multicolumn{1}{|c|}{Maximal}  & \multicolumn{1}{|c|}{Hits} & \multicolumn{1}{|c|}{CPU} & \multicolumn{1}{|c|}{GPU} & \multicolumn{1}{|c|}{Speed-} & \multicolumn{1}{|c|}{ITS} \\
1461 & \multicolumn{1}{|c|}{/BKV\footnotemark }& \multicolumn{1}{|c|}{value }& \multicolumn{1}{|c|}{value }& & \multicolumn{1}{|c|}{time }& \multicolumn{1}{|c|}{time }&  \multicolumn{1}{|c|}{up}& \multicolumn{1}{|c|}{iteration}\\ \hline
1462 Nug12-d & $580$   & $615.4$    & $744$   & $35$\% & $87.7$ & $2.5$ & $34.7 \times$& $16$ \\
1463     &   & \tiny{$41.7$}    & && \tiny{$40.9$} & \tiny{$0.9$} & & \tiny{$6.3$} \\ \hline
1464 Nug13-d & $1912$  & $1985.4$   & $2100$  & $20$\% & $209.2$ & $3.3$ & $63.5 \times$ & $17$ \\
1465 &  & \tiny{$51.0$}   &  & & \tiny{$1.3$} & \tiny{$1.0$} &  & \tiny{$5.6$} \\ \hline
1466 Nug15-d & $2230$  & $2418.1$  & $2580$  & $30$\% & $305.5$ & $5.2$ & $58.8 \times$ & $17$ \\
1467  &  & \tiny{$135.3$}  & & & \tiny{$164.5$} & \tiny{$1.3$} & & \tiny{$5.0$} \\ \hline
1468 Nug18-d & $17836$ & $18110.2$ & $18506$ & $10$\% & $1375.9$ & $12.8$ & $107.4 \times$ & $19$ \\
1469 & & \tiny{$157.8$} & & & \tiny{$123.5$} & \tiny{$2.6$} & & \tiny{$4.2$} \\ \hline
1470 Nug22-d & $42476$ & $43282.1$ & $44140$ & $25$\% & $4506.5$ & $32.7$ & $137.8 \times$ & $18$ \\
1471 & & \tiny{$529.6$} & & & \tiny{$341.1$} & \tiny{$6.6$} & & \tiny{$4.0$} \\ \hline
1472 \end{tabular}
1473 \caption{Results of the GPU-based iterated tabu search for
1474 different Q3AP instances.} \label{ch8:ITSQ3APResults} \center
1475 \end{table}
1476
1477 %\begin{table}
1478 %\tiny
1479 %%\footnotesize
1480 %\caption{Results of the GPU-based iterated tabu search for different Q3AP instances}
1481 %\label{ch8:ITSQ3APResults} \center
1482 %\begin{tabular}{|c|c|c|c|c|c|c|c|c|} \hline
1483 %Instance & Opt./BKV\footnotemark & Average val. & Max val. & Hits & CPU time & GPU time & Accel. & ITS iter.\\ \hline
1484 %Nug12-d & $580$   & $615.4_{41.7}$    & $744$   & $35$\% & $87.7_{40.9}$ & $2.5_{0.9}$ & $\times 34.7 $& $16_{6.3}$ \\ \hline
1485 %Nug13-d & $1912$  & $1985.4_{51.0}$   & $2100$  & $20$\% & $209.2_{1.3}$ & $3.3_{1.0}$ & $\times 63.5$ & $17_{5.6}$ \\ \hline
1486 %Nug15-d & $2230$  & $2418.1_{135.3}$  & $2580$  & $30$\% & $305.5_{164.5}$ & $5.2_{1.3}$ & $\times 58.8 $ & $17_{5.0}$ \\ \hline
1487 %Nug18-d & $17836$ & $18110.2_{157.8}$ & $18506$ & $10$\% & $1375.9_{123.5}$ & $12.8_{2.6}$ & $\times 107.4$ & $19_{4.2}$ \\ \hline
1488 %Nug22-d & $42476$ & $43282.1_{529.6}$ & $44140$ & $25$\% & $4506.5_{341.1}$ & $32.7_{6.6}$ & $\times 137.8$ & $18_{4.0}$ \\ \hline
1489 %\end{tabular}
1490 %\end{table}
1491
1492 \footnotetext{Best known value.}
1493
1494 Regarding the execution time, the generation and evaluation of the
1495 neighborhood in parallel on GPU provides an efficient way to
1496 speedup the search process in comparison with a single CPU. In
1497 fact, for the smaller instance Nug12-d, the GPU version starts to
1498 be faster than the CPU one (acceleration factor of $34.7~\times$).
1499 As long as the problem size increases, the speedup grows
1500 significantly (up to $137.8 \times$ for the Nug22-d instance).
1501 This means that the use of GPU provides an efficient way to deal
1502 with large neighborhoods. Indeed, the proposed neighborhoods were
1503 unpractical in terms of single CPU computational resources for
1504 large Q3AP instances such as Nug22-d (estimated to around $2$
1505 hours per run). So, implementing this algorithm on GPU has allowed
1506 the exploitation of  parallelism in such neighborhood and improved
1507 the robustness/quality of provided solutions.
1508
1509 \section{Conclusion}
1510 \label{ch8:conclusion}
1511
1512 This chapter has presented state-of-the-art GPU-based parallel
1513 metaheuristics\index{Metaheuristics!parallel~metaheuristics} and
1514 a case study on implementing large neighborhood local search
1515 methods on GPUs for solving large benchmarks of the quadratic
1516 three-dimensional assignment problem (Q3AP). \\
1517
1518 Implementing parallel
1519 metaheuristics\index{Metaheuristics!parallel~metaheuristics} on
1520 GPU architectures poses new issues and challenges such as memory
1521 management\index{GPU Challenges!memory~management},  finding
1522 efficient mapping strategies between tasks to parallelize; and the
1523 GPU threads, thread divergence\index{GPU
1524 Challenges!thread~divergence}, and synchronization. Actually, most
1525 of metaheuristics have been implemented on GPU using different
1526 implementation strategies. In this chapter, a two-level
1527 classification of the reviewed works has been proposed: design
1528 level and implementation level. Design level regroups traditional
1529 parallel models used for metaheuristics while implementation level
1530 refers to the way those models are mapped to the GPU architecture.
1531 This classification focuses mainly on  the mapping between the
1532 metaheuristic tasks to parallelize and the GPU threads. Indeed,
1533 the choice of a given mapping strategy strongly influences the
1534 other challenges (memory usage, communication, thread
1535 divergence\index{GPU Challenges!thread~divergence}).
1536
1537 \putbib[Chapters/chapter9/biblio9]