BookGPU/Chapters/chapter9/ch9.tex

   1 \chapterauthor{Malika Mehdi and Ahc\`{e}ne Bendjoudi}{CERIST Research Center, DTISI, 3 rue des frères Aissou, 16030 Ben-Aknoun, Algiers, Algeria}
   2
   3 \chapterauthor{Lakhdar Loukil}{University of Oran, Algeria}
   4
   5 %\chapterauthor{Ahc\`{e}ne Bendjoudi}{CERIST Research Center, DTISI, 3 rue des frères Aissou, 16030 Ben-Aknoun, Algiers, Algeria}
   6
   7 \chapterauthor{Nouredine Melab}{Université Lille 1, LIFL/UMR CNRS 8022, 59655-Villeneuve d'Ascq cedex, France}
   8
   9 \chapter{Parallel GPU-accelerated metaheuristics}
  10 \label{chapter9}
  11 \section{Introduction}
  12 This chapter presents GPU-based parallel
  13 metaheuristics\index{Metaheuristics!parallel~metaheuristics},
  14 challenges, and issues related to the particularities of the GPU
  15 architecture and a synthesis on the different implementation
  16 strategies used in the literature. The implementation of parallel
  17 metaheuristics\index{Metaheuristics!parallel~metaheuristics} on
  18 GPUs is not straightforward. The traditional models used in CPUs
  19 must be rethought to meet the new requirements of GPU
  20 architectures. This chapter is organized as follows. Combinatorial
  21 optimization\index{Combinatorial~optimization} and  resolution
  22 methods are introduced in Section~\ref{ch8:sec:optim}. The main
  23 traditional parallel models used for metaheuristics are recalled
  24 in Section~\ref{ch8:sec:paraMeta}.
  25 Section~\ref{ch8:sec:challenges} highlights the  main challenges
  26 related to the GPU implementation of metaheuristics. A
  27 state-of-the-art of GPU-based parallel
  28 metaheuristics\index{Metaheuristics!parallel~metaheuristics} is
  29 summarized in Section~\ref{ch8:sec:state}. In Section~\ref{ch8:sec:frameworks}, the main developed GPU-based
  30 frameworks for metaheuristics are described. Finally, a case study
  31 is presented in Section~\ref{ch8:sec:case} and some concluding
  32 remarks are given in Section~\ref{ch8:conclusion}
  33
  34 \section{Combinatorial optimization}
  35 \label{ch8:sec:optim}
  36
  37 Combinatorial optimization\index{Combinatorial~optimization} (CO) is a branch of applied and discrete mathematics.
  38 It consists in finding optimal configuration(s) among a finite set of possible configurations
  39 (or solutions) of a given combinatorial optimization problem (COP).  The set of all possible solutions noted $S$ is called solution space or search space. Each solution  in $S$ is defined by its real cost calculated by an objective function. COPs are generally defined as follows~\cite{blumMeta}:\\  %(Definition~\ref{def:cops})
  40
  41
  42 %\begin{minipage}{0.5\linewidth}
  43 A combinatorial problem $P=(S,f)$ can be defined by
  44 \begin{itemize}
  45 \item a set of decision variables $X$,
  46 \item an objective function $f$ to optimize (minimize or maximize) over the set $S$,
  47 \item subject to constraints on the decision variables.\\
  48 \end{itemize}
  49 %\end{minipage}
  50
  51 COPs are generally formulated as mixed integer programming
  52 problems (MIPS) and most of them are NP-hard~\cite{garey}.
  53 According to the quality level of solutions and deadlines required
  54 for solving an optimization problem, two classes of optimization
  55 methods can be distinguished: \emph{exact methods} and
  56 \emph{approximate methods}. Exact methods allow one to reach
  57 optimal solution(s) of the handled optimization problem with a
  58 proof of its or their optimality. The known methods of this
  59 class are the \emph{branch and bound technique}, \emph{dynamic
  60 programming}, \emph{constraint programming}, and \emph{A*
  61 algorithm}. However, optimization problems, whether practical or
  62 academic, are often complex and NP-hard. Moreover, a large number
  63 of real-life optimization problems encountered in science,
  64 engineering, economics, and business are usually large-sized
  65 problems for which the size of the potential solution domain
  66 increases dramatically with the size of the problem instance. Such
  67 problems cannot be tackled using exact methods due to the
  68 excessive computation time needed by these methods to find optimal
  69 solution(s). In such a situation, approximate methods (or
  70 \textit{metaheuristics}) offer an alternative approach to solve
  71 these problems. Indeed, these methods allow one to reach good
  72 quality solutions in reasonable computation time compared to exact
  73 methods but with no guarantee to find optimal or even bounded
  74 solutions. This is due to the nature of the search process adopted
  75 by these approaches which consists of performing a
  76 search in a subset of the whole search space.\\
  77
  78 Regarding the number of solutions considered at each iteration in
  79 the search process, two classes of metaheuristics can be
  80 distinguished~\cite{talbi2009mfdti}: \textit{solution-based} and
  81 \textit{population-based} metaheuristics. In the rest of this
  82 chapter, the term \emph{s-metaheuristic} refers to solution-based
  83 metaheuristic and the term \emph{p-metaheuristic} refers to
  84 population-based metaheuristic. In s-metaheuristics, the search
  85 process starts with a single solution (generally set at random)
  86 and iteratively improves it by exploring its neighborhood in the
  87 search space. The most known methods in this class are local
  88 search methods that include \emph{simulated
  89 annealing}\index{Metaheuristics!simulated~annealing}~\cite{Kirkpatrick1983SA},
  90 \emph{tabu search}~\cite{Glover1989TS}, \emph{iterated local
  91 search\index{Metaheuristics!iterated local
  92 search}}~\cite{stutzle2006ILSforQAP}, and \emph{variable
  93 neighborhood search}~\cite{HansenMladenovic1997VNS}.\\
  94
  95 Unlike s-metaheuristics, p-metaheuristics start with a population
  96 of solutions and implement an iterative process that evolves the
  97 current population towards a new population of better quality
  98 solutions. The process is repeated until a stopping criterion is
  99 satisfied. \emph{Evolutionary algorithms}, \emph{swarm
 100 optimization}, and \emph{ant colonies} fall into this class.
 101
 102
 103 \section{Parallel models for metaheuristics}\label{ch8:sec:paraMeta}
 104 Optimization problems, whether real-life or academic, are more
 105 often NP-hard and CPU time and/or memory consuming. Metaheuristics
 106 allow the significant reduction of the computational time of the search
 107 process but remain time-consuming particularly when it comes
 108 dealing with large-sized problems. \\
 109
 110 The use of parallelism in the design of metaheuristics is a relevant
 111 approach that is widely adopted by the combinatorial optimization
 112 community for various reasons:
 113
 114 \begin{itemize}
 115   \item One of the main goals of parallelism is to reduce the search
 116 time. This will allow the design of high performance optimization
 117 methods and the solving of large-sized optimization problems.
 118
 119   \item Sequential processor architectures have reached their
 120 physical limit which prevents creating faster processors. The
 121 current trend of microprocessor manufacturers consists of placing
 122 multicores on a single chip. Nowadays, laptops and
 123 workstations are multicore processors. In addition, the evolution
 124 of network technologies and the proliferation of broadband
 125 networks have made possible the emergence of clusters of
 126 workstations (COWs), networks of workstations (NOWS), and
 127 large-scale networks of machines (grids) as platforms for
 128 high-performance computing.
 129 \end{itemize}
 130
 131 From the granularity of parallelism point of view, three major parallel
 132 models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{Metaheuristics!algorithmic-level parallelism},
 133 \emph{iteration-level}\index{Metaheuristics!iteration-level parallelism}, and \emph{solution-level} as illustrated in Figure~\ref{ch8:fig:paraMeta}. \\
 134
 135 \begin{figure}[h!]
 136 \centerline{\includegraphics[width=0.6\textwidth]{Chapters/chapter9/figures/paraMeta.pdf}}
 137 \caption{Parallel models for metaheuristics.}
 138 \label{ch8:fig:paraMeta}
 139 \end{figure}
 140
 141 \begin{itemize}
 142
 143 \item{In the
 144 algorithmic-level\index{Metaheuristics!algorithmic-level
 145 parallelism} parallel model, several self-contained metaheuristics
 146 are launched in parallel. The parallel
 147 metaheuristics\index{Metaheuristics!parallel~metaheuristics} may
 148 start with identical or different solutions (s-metaheuristics
 149 case) or populations (p-metaheuristics case). Their parameter
 150 settings such as the size of tabu list for tabu
 151 search\index{Metaheuristics!tabu~search}, transition probabilities
 152 for ant colonies, mutation and crossover probabilities for
 153 evolutionary
 154 algorithm\index{Metaheuristics!evolutionary~algorithm}s may be the
 155 same or different. The parallel processes may evolve independently
 156 or in a cooperative manner. In cooperative parallel models, the
 157 algorithms exchange information related to the search during
 158 evolution in order to find better and more robust solutions.}
 159
 160 \item{In the iteration-level\index{Metaheuristics!iteration-level
 161 parallelism} parallel model, the focus is on the parallelization
 162 of each iteration of the metaheuristic. Indeed, metaheuristics are
 163 generally iterative search processes. Moreover, the most
 164 resource-consuming part of a metaheuristic is the evaluation of
 165 the generated solutions at each iteration. For s-metaheuristics
 166 (e.g., tabu search\index{Metaheuristics!tabu~search}, simulated
 167 annealing, variable neighborhood search), the evaluation and
 168 generation of the neighborhood is the most time-consuming step of
 169 the algorithm particularly when it comes to dealing with large
 170 neighborhood sets. In this parallel model, the neighborhood is
 171 decomposed into partitions, and each partition is evaluated in a
 172 parallel way. For p-metaheuristics (e.g., evolutionary
 173 algorithm\index{Metaheuristics!evolutionary~algorithm}s, ant
 174 colonies, swarm optimization), the
 175 iteration-level\index{Metaheuristics!iteration-level parallelism}
 176 parallel model arises naturally since these metaheuristics deal
 177 with a population of independent solutions. In evolutionary
 178 algorithm\index{Metaheuristics!evolutionary~algorithm}s, for
 179 instance, the iteration-level\index{Metaheuristics!iteration-level
 180 parallelism} model consists of decomposing the whole population
 181 into several partitions where each partition is evaluated in
 182 parallel.}
 183
 184 \item{In the
 185 solution-level\index{Metaheuristics!solution-level~parallelism}
 186 parallel model, the focus is on the parallelization of the
 187 evaluation of a single solution. This model is useful when the
 188 objective function and/or the constraints are time and/or memory
 189 consuming. Unlike the two previous parallel models, the
 190 solution-level\index{Metaheuristics!solution-level~parallelism}
 191 parallel model is problem-dependent.}
 192 \end{itemize}
 193
 194 \section{Challenges for the design of GPU-based metaheuristics}
 195 \label{ch8:sec:challenges}
 196
 197 Developing GPU-based parallel
 198 metaheuristics\index{Metaheuristics!parallel~metaheuristics} is
 199 not straightforward. The parallel models have to be rethought to
 200 meet the new requirements of the GPU architecture. Several major
 201 issues have to be taken into account both at design and
 202 implementation levels to develop efficient metaheuristics on GPU.
 203 These issues are mainly related to the size and latency of the GPU
 204 memories, thread synchronization and divergence, the distribution
 205 of tasks, and data transfer between the CPU and
 206 GPU~\cite{luongMultiStart}.
 207
 208 \subsection{Data placement on a hierarchical memory}
 209 \index{GPU Challenges!data~placement} During the execution of
 210 metaheuristics on GPU, the different threads may access multiple
 211 data structures from multiple memory spaces. These memories have
 212 different sizes and access latencies. Nevertheless, faster
 213 memories (registers, shared and constant memories) are generally
 214 very small in size, and the larger memories (global memory) are
 215 relatively slower. However, p-metaheuristics require the
 216 exploration of a large amount of individuals to diversify the
 217 search. Moreover, an efficient execution of s-metaheuristics
 218 requires exploring large neighborhoods. Thus, programmers have to
 219 take into account this point to efficiently place the different
 220 data structures of the metaheuristic on the different memories to
 221 benefit from both the faster memories and the larger ones. A deep
 222 study has to be performed on both the metaheuristic data
 223 structures (size and access frequency) and the GPU memories (size
 224 and access latency) to identify which data will be placed on which
 225 memory. Generally, the most accessed ones should be put on faster
 226 memories (registers, shared memory) and larger ones on the larger
 227 memories (global memory). Also, an efficient mapping between
 228 threads and the corresponding metaheuristic elements (one neighbor
 229 per thread, one individual per thread, single population per
 230 threads block, one ant per thread, etc.) must be defined to ensure
 231 a maximum occupancy of the GPU and
 232 to cover CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication} and memory access times.\\
 233
 234 According to the used metaheuristic and to the handled problem, the data
 235 values may have different types and different ranges of their values. The data
 236 types should then be chosen carefully and the ranges of the data values should
 237 be analyzed to minimize the amount of occupied memory space.\\
 238
 239 In addition to the size and latency of GPU memory issues, the
 240 memory access pattern is another important issue to be dealt with
 241 to speedup GPU-based metaheuristics. Indeed, the different
 242 memories have been designed to achieve specific features that
 243 programmers must take into account to optimize their codes and
 244 then to benefit from these features. For instance, the global
 245 memory is optimized for coalesced accesses. The texture and the
 246 constant memory are read-only memories. The texture is optimized
 247 for uncoalesced accesses and the constant memory is optimized for
 248 simultaneously accesses of all threads to the same
 249 location~\cite{luongMultiStart}. Therefore, to improve the
 250 performance of the kernel execution, the programmers should put
 251 coalesced data on global memory, uncoalesced read-only data (e.g.,
 252 metaheuristic instance data) on the texture, concurrent read-only
 253 data (e.g., data for fitness evaluation that all threads
 254 concurrently access) on the constant memory, and the most accessed
 255 data structures (e.g., population of individuals for a CUDA thread
 256 block) on the shared memory.
 257
 258 \subsection{Threads synchronization}
 259 \index{GPU Challenges!threads~synchronization} The thread
 260 synchronization issue is caused by both the GPU architecture and
 261 the synchronization requirements of the implemented method.
 262 Indeed, GPUs are based on a multicore architecture organized into
 263 several multi-processors (Streaming Multiprocessor SM) supporting
 264 the SPMD model (Single Program Multiple Data). Each SM contains
 265 several cores executing the same instruction of different threads
 266 following the SIMD model (Single Instruction Multiple Data). These
 267 threads belong to a warp (a group of 32 threads) and handle
 268 different data elements. An efficient execution of applications on
 269 GPU is achieved when launching a large amount of threads
 270 (thousands of threads)~\cite{CUDA}. However, the execution order
 271 of these thousands of threads is unknown by the programmer which
 272 makes the prediction of their execution order a challenging issue.
 273 Plus, the developer has to control explicitly the
 274 threads through the insertion of barrier synchronizations in the
 275 codes to avoid concurrent accesses to data structures and to meet
 276 some requirements related to data-dependent synchronizations.
 277
 278 \subsection{Thread divergence}
 279
 280 Thread divergence\index{GPU Challenges!thread~divergence} is
 281 another challenging issue in GPU-based
 282 metaheuristics~\cite{cecilia, pugace, audreyANT}. Generally,
 283 metaheuristics contain irregular loops and conditional
 284 instructions when generating and evaluating the neighborhood
 285 (s-metaheuristics), and the population (p-metaheuristics) in the
 286 same block. In addition,  the decision to apply a crossover or a
 287 mutation on an individual in a genetic
 288 algorithm\index{Metaheuristics!genetic~algorithm} and the
 289 exploration of different paths using an ant
 290 colony\index{Metaheuristics!ant~colony~optimization} are random
 291 operations. Threads of the same warp have to execute
 292 instructions simultaneously  leading to different branches whereas
 293 in an SIMD model the threads of a same warp execute the same
 294 instruction. Consequently, the different branches of a conditional
 295 instruction which is data-dependent lead to a serial execution of
 296 the different threads degrading the performance of the application
 297 in terms of execution time. The challenge here is then to revisit
 298 the traditional irregular metaheuristic codes to eliminate these
 299 divergences.
 300
 301 \subsection{Task distribution and CPU/GPU communication}
 302
 303 The performance of GPU-based metaheuristics in terms of execution
 304 time could be improved by choosing the most appropriate parallel
 305 model (algorithmic-level\index{Metaheuristics!algorithmic-level
 306 parallelism}, instruction-level,
 307 solution-level\index{Metaheuristics!solution-level~parallelism}).
 308 Moreover, an efficient decomposition of the metaheuristic and an
 309 efficient assignment of code portions between the CPU and GPU
 310 should be adopted. The objective is to take benefit from the GPU
 311 computing power without affecting the efficiency and the behavior
 312 of the metaheuristic and without losing performance in CPU/GPU
 313 communication\index{GPU Challenges!CPU/GPU~communication} and
 314 memory accesses. In order to decide which part of the
 315 metaheuristic will be executed on which component, one should
 316 perform a careful analysis on the serial code of the
 317 metaheuristic, identify the compute-intensive tasks (e.g.,
 318 exploration of the neighborhood, evaluation of individuals), and
 319 then offload them to the GPU, while the remaining
 320 tasks still run on the CPU in a serial way. \\
 321
 322 The CPU/GPU communication\index{GPU
 323 Challenges!CPU/GPU~communication} is done through the global
 324 memory which is a slow memory making the memory transfer between
 325 the CPU and GPU time-consuming which can significantly degrade the
 326 performance of the application. Accesses to this memory should be
 327 optimized by minimizing the amount of transferred data between the
 328 two components in order to reduce the communication time and,
 329 therefore, the whole execution time of the metaheuristic.
 330
 331 \section{State-of-the-art parallel metaheuristics on GPUs}
 332 \label{ch8:sec:state}
 333 After more than two decades of research by the combinatorial optimisation
 334 community devoted to developing adequate parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} for different types of
 335 parallel architectures (clusters, supercomputers and grids), the actual developement
 336 of General Perpose GPU (GPGPU) brings new challenges for parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} on SIMD architectures.\\
 337
 338 The first works on metaheuristic algorithms implemented on GPUs
 339 started on old graphics cards before the appearance of modern GPUs
 340 equipped with high-level programming interfaces such as CUDA and
 341 OpenCL. Among these pioneering works we cite the work of Wong et al.~\cite{wongOldGPU2006} dealing with the
 342 implementation
 343 of EAs on graphics processing cards and the work by Catala et al. in~\cite{catala2007} where the ACO\index{Metaheuristics!ant~colony~optimization} algorithm
 344 is implemented on old GPU architectures. Yu et al.~\cite{yu2005} and
 345 Li et al.~\cite{li2007} proposed a full parallelization of genetic
 346 algorithm\index{Metaheuristics!genetic~algorithm}s  on old GPU architectures using
 347 shader libraries based on Direct3D and OpenGL.\\
 348
 349 Such architectures are based on preconfigured pipelined stages
 350 used to accelerate the transformation of 3D geometric primitives
 351 into pixels. Implementing a general-purpose algorithm on such
 352 preconfigured architectures is very hard and requires the
 353 tailoring of the algorithm's data and instructions to fit the
 354 pipelined stages of the GPU. Since then, GPU architectures have
 355 evolved to become programmable using high-level programming
 356 interfaces. In this section we will focus only on recent
 357 state-of-the-art works dealing with metaheuristics implementation
 358 on modern programmable GPUs. In this review two classes are
 359 considered: (1) s-metaheuristics on GPUs and (2) p-metaheuristics
 360 on GPUs. A comparative study is done of the main works and a
 361 classification of the different existing strategies is proposed in
 362 Section~\ref{ch8:sec:synthesis}.
 363
 364 \subsection{Implementing solution-based metaheuristics on GPUs}
 365
 366
 367 A very basic local search algorithm starts with an initial
 368 solution generated either at random or by the mean of a specific
 369 heuristic and is based on two elementary components: the
 370 generation of neighborhood structures using an elementary move
 371 function and a selection function that determines which solution
 372 in the current neighborhood will replace the actual search point.
 373 The search continues as long as there is improvement and stops
 374 when there is no better solution in the current neighborhood. The
 375 exploration (or evaluation) of the different moves of a given
 376 neighborhood is done independently for each move. Thus, the
 377 easiest way to accelerate a local search algorithm is to
 378 parallelize the evaluation of the neighborhood (instruction-level
 379 parallelism). This is  by far the most used scheme in the
 380 literature for parallelizing local search algorithms on GPUs.
 381 Nevertheless, small neighborhoods may lead to  nonoptimal
 382 occupation of the CUDA threads which may lead in turn to an
 383 overhead due to the communication and memory latencies. Therefore,
 384 large neighborhoods are necessary for efficient implementation of
 385 local searches on GPUs.\\
 386
 387 Luong et al.~\cite{luong2010large} proposed efficient
 388 mappings for large neighborhood structures on GPUs. In this work,
 389 three different neighborhoods are studied and mapped to the
 390 hierarchical GPU for binary problems. The three neighborhoods are
 391 based on the Hamming distance. The move operators used in
 392 the three neighborhoods consider Hamming distances of 1,
 393 2, and 3 (this consists on flipping the binary value of one, two,
 394 or three positions at a time in the candidate binary vector).
 395 In~\cite{luong2010large}, each thread is associated to a unique
 396 solution in the neighborhood. The addressed issue is how to
 397 efficiently map the different neighborhoods on the device memory,
 398 more explicitly, how to calculate the memory index of each
 399 solution associated to each CUDA thread's \textit{id}.
 400 %For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes  in the candidate vector.
 401 The three neighborhoods are  implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{Metaheuristics!tabu~search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes.\\ % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
 402
 403 In the same context, Deevacq et al.~\cite{audreyANT}
 404 proposed two parallelization strategies inspired by the multi-walk
 405 parallelization strategy, of a 3-opt iterated local
 406 search\index{Metaheuristics!iterated local search} algorithm (ILS)
 407 over a CPU/GPU architecture. In the first strategy, each Local
 408 Search (LS) is associated to a unique CUDA thread and improves a
 409 unique solution by generating its neighborhood. The second
 410 strategy associates each solution to a CUDA block and the
 411 neighborhood exploration is parallelized on the block threads. In
 412 the first strategy, since several LS are executed on different
 413 solutions on each Multi Processor (MP), the data structures should
 414 be stored on the global memory, while the exploration of a single
 415 solution at a time in the second strategy allows the use of the
 416 shared memory to store the related data structures. The two
 417 strategies have been experimented on standard benchmarks of
 418 the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cities. The results indicate that increasing the number of solutions to be explored simultaneously improves the speedup in the two strategies, but at the same time it decreases final solution quality. The greater speedup factor reached by the second strategy is $6 \times$.\\
 419
 420 The same strategy is used by Luong et al.
 421 in~\cite{luongMultiStart} to implement multistart parallel local
 422 search algorithms (a special case of the
 423 algorithmic-level\index{Metaheuristics!algorithmic-level
 424 parallelism} parallel model where several homogeneous LS
 425 algorithms are used). The multistart model is combined with
 426 iteration-level\index{Metaheuristics!iteration-level parallelism}
 427 parallelism: several LS algorithms are managed by the CPU and the
 428 neighborhood evaluation step of each algorithm is parallelized on
 429 the GPU (each GPU thread is associated with one neighbor and
 430 executes the same evaluation function kernel). The advantage of
 431 such a model is that it allows a  high occupancy of the GPU
 432 threads. Nevertheless, memory management\index{GPU
 433 Challenges!memory~management} causes new issues due to the
 434 quantity of data to store and to communicate between CPU     and
 435 GPU. A second proposition for implementing the same model on GPU
 436 consists of implementing the whole LS processes on GPU with each
 437 GPU thread being associated to a unique LS algorithm. This solves
 438 the communication issue encountered in the first model. In
 439 addition, a memory management\index{GPU
 440 Challenges!memory~management} strategy is proposed to improve the
 441 efficiency of the
 442 algorithmic-level\index{Metaheuristics!algorithmic-level
 443 parallelism} model: texture memory is used to avoid memory latency
 444 due to uncoalesced memory accesses. The proposed approaches are
 445 implemented on the quadratic assignment problem (QAP) using CUDA.
 446 The acceleration rates obtained for the
 447 algorithmic-level\index{Metaheuristics!algorithmic-level
 448 parallelism} with usage of texture memory rise from $7.8\times$ to
 449 $12\times$ (for different
 450 QAP benchmark sizes). \\
 451
 452 Janiak et al.~\cite{Janiak_et_al_2008} implemented two
 453 algorithms for TSP and the flow-shop scheduling problem (FSP).
 454 These algorithms are based on a multistart tabu
 455 search\index{Metaheuristics!tabu~search} model. Both of the
 456 algorithms exploit multicore CPU and GPU. A full parallelization
 457 on GPU is adopted using shader libraries where each thread is
 458 mapped with one tabu search\index{Metaheuristics!tabu~search}.
 459 However, even though their experiments report that the use of GPU
 460 speedups the serial execution almost $16 \times$, the mapping of
 461 one thread with one tabu search\index{Metaheuristics!tabu~search}
 462 requires a large number of local search algorithms to
 463 cover the memory access latency. The same mapping policy is adopted by Zhu et al. in~\cite{zhu_et_al_2008} (one thread is associated to one local search) solving the quadratic assignment problem but using the CUDA toolkit instead of shader libraries.\\
 464
 465 Luong et al.~\cite{luong2012ppsn} proposed a GPU-based
 466 implementation of hybrid metaheuristics on heterogeneous parallel
 467 architectures (multicore CPU  coupled to one GPU).  The challenge
 468 of using such a heterogeneous architecture is how to distribute
 469 tasks between the CPU cores and the GPU in such a way to have
 470 optimal performances. Among the three traditional parallel models
 471 (solution-level\index{Metaheuristics!solution-level~parallelism},
 472 iteration-level\index{Metaheuristics!iteration-level parallelism},
 473 and algorithmic-level\index{Metaheuristics!algorithmic-level
 474 parallelism}), the authors point out that the most convenient
 475 model for the considered heterogeneous architecture is a hybrid
 476 model combining
 477 iteration-level\index{Metaheuristics!iteration-level parallelism}
 478 and algorithmic-level\index{Metaheuristics!algorithmic-level
 479 parallelism} models. Several CPU threads execute several instances
 480 of the same S-metaheuristic in parallel while the GPU device is
 481 associated to one CPU core and  used to accelerate the
 482 neighborhood calculation of several S-metaheuristics at the same
 483 time.
 484  In order to efficiently exploit the remaining CPU cores, a load-balancing heuristic is also proposed in order to decide on the number of additional
 485 S-metaheuristics to launch on the remaining CPU cores relative to the efficiency of the GPU calculations. The proposed approach is applied to the QAP using several instances of the Fast Ant Colony Algorithm (FANT)~\cite{taillardFant}. \\
 486
 487 All the previously noted works  exploit the same parallel models
 488 used on CPUs based on the task parallelism. A different
 489 implementation approach is proposed by Paul in~\cite{gerald2012}
 490 to implement a simulated
 491 annealing\index{Metaheuristics!simulated~annealing} (SA) algorithm
 492 for the QAP on GPUs. Indeed, the author used a preinitialized
 493 matrix \emph{delta} in which the incremental evaluation of simple
 494 swap moves are calculated and stored relative to the initial
 495 permutation $p$. For the GPU implementation, the authors used the
 496 parallel implementation of neighborhood exploration. The
 497 time-consuming tasks in the SA-matrix are identified by the
 498 authors as updating the matrix and passing through it to select
 499 the next accepted move. To initialize the delta matrix, several
 500 threads from different blocks explore different segments of the
 501 matrix (different moves) at the same time.  In order to select the
 502 next accepted swap, several threads are also used. Starting from
 503 the last move, a group of threads explores different subsets of
 504 the delta matrix. The shared memory is used to preload all the
 505 necessary elements for a given group of threads responsible for
 506 the updating of the delta matrix. The main difference in this work
 507 compared to the previous works resides in the introduction of a
 508 data parallelism using the precalculated delta matrix. The use of
 509 this matrix allows the increase in the number of threads
 510 involved in the evaluation of a single move. Experimentations are
 511 done on standard QAP instances from the
 512 QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are
 513 achieved by the GPU implementation compared
 514 to the same sequential implementation on CPU using SA-matrix.\\
 515
 516 \subsection{Implementing population-based metaheuristics on GPUs}
 517
 518 State-of-the-art works dealing with the implementation of
 519 p-metaheuristics on GPUs generally rely on parallel models and
 520 research efforts done for parallelizing different classes of
 521 p-metaheuristics over different types of parallel architectures:
 522 supercomputers, clusters, and computational grids. Three main
 523 classes of p-metaheuristics are considered in this section:
 524 evolutionary
 525 algorithm\index{Metaheuristics!evolutionary~algorithm}s (EAs), ant
 526 colony\index{Metaheuristics!ant~colony~optimization} optimization
 527 (ACO\index{Metaheuristics!ant~colony~optimization}), and particle
 528 swarm optimization\index{Metaheuristics!particle swarm
 529 optimization} (PSO).
 530
 531 \subsubsection*{Evolutionary algorithms}
 532
 533 Traditional parallel models for EAs are classified in three main
 534 classes: coarse-grain models using several subpopulations
 535 (islands), master-slave models used for the parallelization of CPU
 536 intensive steps (evaluation and transformation), and cellular
 537 models based on the use of one population disposed (generally) on
 538 a toroidal grid.
 539
 540 The three traditional models have been implemented on GPUs by several researchers for different
 541 optimization problems. The main chalenges to be raised when implementing the traditional models on GPUs concern (1) the saturation of the GPU in order to cover memory latency by calculations, and (2) efficent usage of the hierarchical GPU  memories.\\
 542
 543 In~\cite{kannan}, Kannan and Ganji present a CUDA implementation
 544 of the drug discovery application Autodock (molecular docking
 545 application). Autodock uses a genetic
 546 algorithm\index{Metaheuristics!genetic~algorithm} to find optimal
 547 docking positions of a ligand to a protein. The most
 548 time-consuming task in Autodock is the fitness function
 549 evaluation. The fitness function used for a docking problem
 550 consists of calculating the energy  of the ligand-protein complex
 551 (sum of intermolecular energies). The authors explore two
 552 different approaches to evaluate the fitness function on GPU. In
 553 the first approach, each GPU thread calculates the energy function
 554 of a single individual. This approach requires the use of
 555 large-sized populations to saturate the GPU (thousands of
 556 individuals). In the second approach each individual is associated
 557 with one thread block. The evaluation of the energy function is
 558 performed by the threads of the same block. This allows the use of
 559 medium population sizes (hundreds of individuals) and the
 560 acceleration of a single fitness evaluation. Another great
 561 advantage of the per block approach resides in the use of shared
 562 memory instead of global memory to store all the information
 563 related to each individual. The obtained speedups range from  $10
 564 \times$ to $47 \times$ for population sizes
 565 ranging from $50$ to $10000$. \\
 566
 567 Maitre et al.~\cite{maitre2009} also exploited the
 568 master-slave parallelism of EAs on GPUs using EASEA. EASEA is a
 569 C-like metalanguage for easy development of EAs. The user writes a
 570 description  of the problem-specific components (fitness function,
 571 problem representation, etc) in EASEA. The code  is then compiled
 572 to obtain a ready-to-use evolutionary
 573 algorithm\index{Metaheuristics!evolutionary~algorithm}. The EASEA
 574 compiler  uses genetic
 575 algorithm\index{Metaheuristics!genetic~algorithm} LIB and EO
 576 Libraries to produce C++ or JAVA written EA codes.
 577 In~\cite{maitre2009}, the authors  proposed an extension of EASEA
 578 to produce CUDA code from the EASEA files. This extension has been
 579 used  to generate a master-slave parallel EA in which the
 580 sequential algorithm is maintained on CPU and the population is
 581 sent to GPU for evaluation. Two problems have been considered
 582 during the experiments: a benchmark mathematical function and a
 583 real problem (molecular structure prediction). In order to
 584 maximize the GPU occupation, very large populations are used (from
 585 $2000$ to $20000$). Even though transferring such large
 586 populations from the CPU to the GPU device memory at every generation is very costly, the authors report important speedups on the two problems on a GTX260 card:  $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management\index{GPU Challenges!memory~management} of the populations on GPU.\\
 587
 588 The master-slave  model is efficient when the fitness function is
 589 highly time intensive. Nevertheless, it requires the use of
 590 large-sized populations in order to saturate the GPU unless the
 591 per-block is used (one individual perblock) when the acceleration
 592 of the fitness function itself is possible. The use of many
 593 subpopulations of medium sizes is another way to obtain a maximum
 594 occupation of the GPU. This is coarse-grained parallelism (island
 595 model).\\
 596
 597 The coarse-grained model is  used by Pospichal et al.
 598 in~\cite{pospichal10} to implement a parallel genetic
 599 algorithm\index{Metaheuristics!genetic~algorithm} on GPU. In this
 600 work the entire genetic
 601 algorithm\index{Metaheuristics!genetic~algorithm} is implemented
 602 on GPU. This choice is motivated by the overhead engendered by the
 603 CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication}
 604 when only population evaluation is performed on GPU. Each
 605 population island is mapped with a CUDA thread block and each
 606 thread is responsible for a unique individual. Subpopulations are
 607 stored on shared memory of each block. Nevertheless, because
 608 interblock communications are not possible on the CUDA
 609 architecture, the islands evolve independently in each block, and
 610 migrations are performed
 611 asynchronously through the global memory. That is, after a given number of generations, selected individuals for migration from each island are copied to the GPU global memory part of the neighbor island and then to its shared memory to replace the worst individuals in the local population. The experiments are performed on three benchmark mathematical functions. During these experiments, the island sizes are varied from $2$ to $256$ individuals and island numbers from $1$ to $1024$. The maximum performance is achieved for high number of islands and increasing population sizes.\\
 612
 613 The same strategy is also adopted by Tsutsui and Fujimoto
 614 in~\cite{tsutsuiGAQAP}  to implement a coarse-grained genetic
 615 algorithm\index{Metaheuristics!genetic~algorithm} on GPU to solve
 616 the QAP. Initially, several subpopulations are created on CPU and
 617 transferred to the global memory. The subpopulations are organized
 618 in the global memory into blocks of $8$ individuals in such a way
 619 to allow coalesced memory access by the threads of the same thread
 620 block. Each sub-population is allocated to a single thread block
 621 in the GPU and transfered to the shared memory to start evolution.
 622 Population evaluation and transformation are done in parallel by
 623 the different threads of a given block. Migration is also done
 624 through the global memory. Experiments are performed on standard
 625 QAP benchmarks from the QAPLIB~\cite{burkard1991qaplib}. The GPU
 626 implementation reached speedups of $2.9\times$ to $12.6 \times$
 627 compared to a single core implementation of a coarse-grained
 628 genetic algorithm\index{Metaheuristics!genetic~algorithm} on a
 629 Intel i7 processor.\\
 630
 631 Nowotniak and Kucharski~\cite{nowotniak} proposed a GPU-based
 632 implementation of a Quantum Inspired Genetic Algorithm\index{Metaheuristics!genetic~algorithm} (QIGA). The
 633 parallel model used is a hierarchical model based on two levels: each
 634 thread in a block transforms a unique individual and a different
 635 population is assigned to each block. The algorithm is run
 636 entirely on GPU. A different instance of the QIGA is run on each
 637 block and the computations have been shared between 8 GPUs. This
 638 approach is very convenient to speed up the experimental process
 639 on metaheuristics that require several independent runs of the
 640 same algorithm in order to assess statistical efficiency. The
 641 populations are stored in the shared memory, the data matrix used
 642 for fitness evaluation is placed in read only constant memory, and
 643 finally seeds for random numbers generated on the GPU are stored
 644 in the global memory.\\
 645
 646 In coarse-grained parallelism, the use of the per-block approach
 647 to implement the islands (one subpopulation per thread block) is
 648 almost natural  and it allows the use of shared memory to store
 649 the subpopulations. Fine-grained parallel models for EAs (cellular
 650 EAs) have been used by many authors to implement parallel EAs on
 651 GPUs. Indeed, the fine-grained parallelism of EAs fits perfectly
 652 to the SIMD architecture of the GPU. \\
 653
 654 Pinel et al. in~\cite{pinel2012JPDC} developed a highly
 655 parallel synchronous cellular genetic
 656 algorithm\index{Metaheuristics!genetic~algorithm} (CGA), called
 657 GraphCell, to solve the independent task scheduling problem on GPU
 658 architectures. In CGAs, the population is arranged into a
 659 two-dimensional toroidal grid where only neighboring solutions are
 660 allowed to interact with each other  during the recombination
 661 step. In GraphCell, two recombination operators for CGA are
 662 especially designed to run efficiently on GPU. Indeed, instead of
 663 assigning a single thread to each solution of the population to
 664 perform the recombination, in GraphCell, a single thread is
 665 assigned to each task  of each solution. Offsprings are created by
 666 independently modifying the assignment of some tasks in the
 667 current solution. Mainly, each thread chooses one neighboring
 668 solution in the grid as second parent using different selection
 669 strategies and assigns one  task of  the solution (first parent)
 670 to the machine for which the same task is assigned in the second
 671 parent.  This way, the number of threads used for the
 672 recombination step is equal to population size $\times$  size of
 673 the solution (number of tasks). This leads to a high number of
 674 threads used to accelerate the recombination operators especially
 675 when dealing with large instances of the problem. In addition to
 676 the recombination operators, the rest of the CGA steps are  also
 677 parallelized on GPU (fitness evaluation, mutation, and
 678 replacement).\\
 679
 680 A similar work is proposed by Vidal and Alba  in~\cite{albaCGAGPU}
 681 where a  CGA using a toroidal grid is completely implemented on
 682 GPU. A direct mapping between the population and the GPU threads
 683 is done. At each step, several threads execute the same operations
 684 on each individual independently: initialization, computing the
 685 neighborhood, selection, crossover, mutation, and evaluation.  A
 686 synchronization is done for all threads to perform the replacement
 687 stage and form the next generation. All the data of the algorithm
 688 is placed on the global memory. Several experiments have been
 689 performed on a set of standard benchmark functions with different
 690 grid sizes ranging from $32^2$ to $512^2$. The speedups reached by
 691 the GPU implementation against the CPU version range from
 692 $5\times$ to $24\times$ and increase as the size of the population
 693 increases. However, the CPU implementation runs faster than the GPU
 694 version for all the tested  benchmarks when the size of the
 695 population is set to $32^2$. When the size of the population is
 696 too small, there are not enough computations to cover the overhead
 697 created by the call of kernel functions, CPU/GPU
 698 communication\index{GPU Challenges!CPU/GPU~communication}s,
 699 synchronization, and access to global memory. Finally, an
 700 interesting review on GPU parallel computation in bio-inspired
 701 algorithms is proposed by Arenas et al.
 702 in~\cite{arenas2011}. \\
 703
 704 \subsubsection*{Ant colony optimization}
 705
 706 Ant colony optimization
 707 (ACO\index{Metaheuristics!ant~colony~optimization}) is another
 708 p-metaheuristic subject to parallelization on GPUs.
 709 State-of-the-art works on parallelizing
 710 ACO\index{Metaheuristics!ant~colony~optimization} focus on
 711 accelerating the tour construction step performed by each ant by
 712 taking a task-based parallelism approach, with pheromone
 713 deposition on the CPU.\\
 714
 715 In~\cite{cecilia}, Cecilia et al.  present a GPU-based
 716 implementation of
 717 ACO\index{Metaheuristics!ant~colony~optimization} for TSP where
 718 the two steps (tour construction and pheromone update) are
 719 parallelized on the GPU. A data parallelism approach is used to
 720 enhance the performance of the tour  construction step. The
 721 authors use two categories of artificial ants: queen ants
 722 associated with CUDA thread-blocks and worker ants associated with
 723 CUDA threads. A queen ant represents a simulated ant and worker
 724 ants collaborate with the queen ant to accelerate the decision
 725 about the  next city to visit. The tour construction step of each
 726 queen ant is accelerated. Each worker ant  maintains a history of
 727 the search in a tabu list containing a chronological ordering of
 728 the already visited cities. This memory is used to determine the
 729 feasible neighborhoods. After all queen ants have constructed
 730 their tours, the pheromone trails are updated. For pheromone
 731 update, several GPU techniques are also used to increase the data
 732 bandwidth of the application mainly by the use of precalculated
 733 matrices that are easily updated by several threads (one thread
 734 per matrix entry). The achieved speedups are $21 \times$ for tour
 735 construction and  $20 \times$ for pheromone updates.\\
 736
 737 In another work, Tsutsui and Fujimoto~\cite{tsutsui} propose  a
 738 hybrid algorithm combining
 739 ACO\index{Metaheuristics!ant~colony~optimization} metaheuristic
 740 and Tabu Search (TS)\index{Metaheuristics!tabu~search} implemented
 741 on GPU to solve the QAP. A solution of QAP is represented as a
 742 permutation of ${1,2,..,n}$ with $n$ being the size of the
 743 problem. The TS algorithm is based on the 2-opt neighborhood
 744 (swapping of two elements $(i,j)$ in the permutation). The authors
 745 point out that the move cost of each neighbor depends on the
 746 couple $(i,j)$. Two groups of moves are formed according to the
 747 move cost. In order to avoid thread divergence\index{GPU
 748 Challenges!thread~divergence} within the same warp, the
 749 neighborhood evaluation is parallelized in such a way to assign
 750 only moves of the same cost to each thread warp. This strategy is
 751 called MATA for Move-cost Adjusted Thread Assignment. Concerning
 752 the memory management\index{GPU Challenges!memory~management}, all
 753 the data of the ACO\index{Metaheuristics! ant~colony~optimization}
 754 (population, pheromone matrix), QAP matrices, and tabu list are
 755 placed on the global memory of the GPU. Shared memory is used only
 756 for working data common to all threads in a given block.
 757  All the
 758 steps of the hybrid algorithm
 759 ACO\index{Metaheuristics!ant~colony~optimization}-TS
 760 (ACO\index{Metaheuristics!ant~colony~optimization} initialization,
 761 pheromone update, construct solutions, applying TS) are
 762 implemented as kernel functions on the GPU. The GPU/CPU
 763 communications are only used to transfer the best-so-far solution
 764 in order to verify termination conditions. The experiments
 765 performed on standard QAP benchmarks showed that the GPU
 766 implementation using MATA obtained a  speedup of $19 \times$
 767 compared to the CPU implementation,  compared with a speedup of
 768 only $5 \times$
 769 when MATA is not used. \\
 770
 771 \subsubsection*{Particle swarm optimization}
 772 In~\cite{zhou2009} Zhou and Tan propose a full GPU implementation
 773 of a standard PSO algorithm. All the data is stored in global
 774 memory (velocities, positions, swarm population, etc). Only
 775 working data is copied to shared memory by each thread. The four
 776 steps of the PSO have been parallelized on GPU: fitness evaluation
 777 of the swarm, update of local best and global best  of each
 778 particle, and update of velocity and position of each particle.
 779 The same strategy is used to parallelize the first and last
 780 steps: the evaluation of fitness functions is performed in
 781 parallel for each dimension  by several threads. It is the case
 782 for the update of position and velocities of each particle: one
 783 dimension at a time is updated for the whole swarm by several
 784 threads. Random numbers needed for  updating the velocities and
 785 positions for the whole PSO processes are generated on CPU at the
 786 starting of the algorithm and transferred to the GPU global
 787 memory. For the steps 2 and 3 (update of local best and global
 788 best of each particle), the GPU threads are mapped to the \emph{N}
 789 particles of the swarm one to one. Experiments done on four
 790 benchmark functions show speedups ranging from $3.7 \times$ to
 791 $9.0 \times$ for swarm sizes
 792 ranging from $400$ to $2800$.\\
 793
 794 \subsection{Synthesis of the implementation strategies}
 795 \label{ch8:sec:synthesis} After reviewing some works dealing with
 796 GPU-based implementation of metaheuristics, in this section we
 797 will try to come up with a classification of the different
 798 strategies used in the literature for the implementation of
 799 metaheuristics on GPUs. One may distinguish between the parallel
 800 models adopted in each metaheuristic (design level)  and the way
 801 they are exploited on GPU architectures (implementation level).
 802 Indeed, even though the parallelization models used in most works
 803 for GPUs are derived from the traditional parallel models of each
 804 metaheuristic (on CPU), their implementation could take a
 805 different way and sometimes it may result in new parallel models
 806 customized for GPUs.\\
 807
 808 Traditional parallel models for metaheuristics are based on an
 809 intuitive task parallelism: the independent tasks of the
 810 algorithms are simply parallelized. For example, in the case of
 811 s-metaheuristics, the evaluation of large neighborhoods could be
 812 done in parallel since there is no synchronization at this step of
 813 the algorithm. That is the case of EAs  when it comes to applying
 814 the evaluation step. Nevertheless, because of the particularity of
 815 the GPU architecture, some authors have used new implementation
 816 techniques to enhance the data parallelism in the sequential
 817 algorithms in order to increase the data throughput of the
 818 application.\\
 819
 820 From this observation we propose the following classification
 821 based on 2 levels: design level and implementation level as
 822 illustrated in Figure~\ref{ch8:fig:classification}. The design
 823 level regroups the three classes of parallel models used in
 824 metaheuristics
 825 (solution-level\index{Metaheuristics!solution-level~parallelism},
 826 iteration-level\index{Metaheuristics!iteration-level parallelism},
 827 algorithmic-level\index{Metaheuristics!algorithmic-level
 828 parallelism}) with examples for s-metaheuristics, EAs,
 829 ACO\index{Metaheuristics!ant~colony~optimization} and PCO. This
 830 classification  is principally built from the reviewed
 831 state-of-the-art works  in the previous section. The
 832 implementation level refers to the way these parallel design
 833 models are implemented on GPU. This classification focuses only on
 834 the mapping strategies between the GPU threads and the
 835 parallelized tasks (neighborhood evaluation, solution
 836 construction, and so on). The different implementation strategies
 837 are explained in the following sections.\\
 838
 839 \begin{figure}[h]
 840 \centerline{\includegraphics[width=1\textwidth]{Chapters/chapter9/figures/classification.pdf}}
 841 \caption{A two level classification of state-of-the-art GPU-based parallel metaheuristics.}
 842 %[Comparison of number of cores in a CPU and in a GPU]
 843 \label{ch8:fig:classification}
 844 \end{figure}
 845
 846
 847 \subsubsection*{GPU thread mapping for solution-level parallelism}
 848 \index{GPU-based Metaheuristics!GPU-thread mapping} Parallel
 849 models at solution level consist of parallelizing a time intensive
 850 atomic task of the algorithm. Generally, it consists of the
 851 fitness evaluation~\cite{kannan}. Nevertheless, crossover
 852 operators have also been parallelized by some
 853 authors~\cite{pinel2012JPDC}. These kinds of models are not always
 854 possible as they are problem-dependent. The GPU implementation of
 855 solution-level models uses the perblock mapping: each solution is
 856 associated to a block of threads. A second level of parallelism is
 857 used inside each block to parallelize the fitness evaluation of a
 858 single solution. This kind of mapping allows the use of shared
 859 memory to store the data structures of the solution and does not
 860 require the use of very large neighborhoods or populations.\\
 861 %data parallelism in SA-matrix to parallelize
 862
 863 \subsubsection*{GPU thread mapping for iteration-level parallelism}
 864 \index{GPU-based Metaheuristics!GPU-thread mapping}
 865 Iteration-level parallelism consists of parallelizing the tasks
 866 performed independently on different solutions. Different mapping
 867 strategies are adopted in the reviewed works to implement these
 868 models.\\
 869
 870 In Figure \ref{ch8:fig:classification}, the first example of
 871 iteration-level\index{Metaheuristics!iteration-level parallelism}
 872 parallelism is the parallel evaluation of neighborhoods in
 873 s-metaheuristics. In most of the reviewed works,  a per-thread
 874 mapping approach is used: each solution of the neighborhood is
 875 mapped to a unique thread in the GPU for
 876 evaluation~\cite{luong2010large, audreyANT}. The same mapping is
 877 used for master-slave parallel EAs to accelerate the evaluation of
 878 large populations. This kind of mapping is only efficient for very
 879 large neighborhoods and very large populations (to saturate the
 880 GPU). Many authors have pointed out that the use of such large
 881 populations (or neighborhoods) may lead to an overhead due to the
 882 communication costs between the CPU and the GPU (if the sequential
 883 part of the algorithm is placed on CPU). In order to circumvent
 884 this issue, many authors have implemented the entire algorithm on
 885 GPU~\cite{pospichal10}. On the other hand, as several solutions
 886 may be mapped with the same thread block in the GPU, shared memory
 887 could not be used and all the data should be placed on global
 888 memory.
 889 %pheromone update data parallelism matrices
 890
 891 \subsubsection*{GPU thread mapping  for algorithmic-level parallelism}
 892 \index{GPU-based Metaheuristics!GPU-thread mapping}
 893 Algorithmic-level parallelism consists of launching several self-contained algorithms in parallel. In the previously reviewed works two algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} models have been used: the multistart model and the island model (parallel EAs).\\
 894
 895 The implementation of the multistart model is based on two
 896 different mapping strategies~\cite{luongMultiStart, audreyANT}:
 897 (1) each Local Search (LS) is associated to a unique thread and
 898 (2) each solution (from multiple neighborhoods) is associated to a
 899 unique thread. The first mapping strategy (one thread per LS)
 900 presents a big drawback: the number of LS to use should be very
 901 large to saturate the GPU and cover the memory access latency. On
 902 the other hand, the evaluation of the neighborhood of a single LS
 903 by one thread is time intensive. Furthermore, shared memory could
 904 not be used to store the huge data generated by the different
 905 neighborhoods. In the second mapping strategy, the LS algorithms
 906 are placed on CPU and the neighborhood evaluation of each LS is
 907 parallelized on GPU using per-thread mapping strategy (one thread
 908 per solution). This consists of a hierarchical parallel model
 909 combining algorithmic-level\index{Metaheuristics!algorithmic-level
 910 parallelism} parallelism (multistart) with
 911 iteration-level\index{Metaheuristics!iteration-level parallelism}
 912 parallelism (master-worker).\\
 913
 914 In the island model, the same mapping is used in all the reviewed
 915 works~\cite{tsutsuiGAQAP, nowotniak, maitre2009}: each
 916 subpopulation (island) is associated to one thread block in the
 917 GPU. A second level of parallelism is used inside the block to
 918 parallelize the evaluation step of the local population.
 919 Migrations are always performed through the global memory as
 920 interblock communications  are impossible in CUDA. The first
 921 advantage of this hierarchical implementation is that it allows
 922 the occupation of a large number of threads even for medium
 923 population sizes. The second advantage consists of using shared
 924 memory to store subpopulations to benefit from the low latency of
 925 shared memory.\\
 926
 927 \section{Frameworks for metaheuristics on GPUs}
 928 \label{ch8:sec:frameworks}
 929
 930 After the first pioneering works of metaheuristics on graphics
 931 processing units, the next challenge is to provide easy-to-use
 932 frameworks and libraries for rapid development of metaheuristics
 933 on GPUs. Although the works on this subject are not yet mature and
 934 do not cover the main metaheuristic algorithms, we will present
 935 the only three works to our knowledge, which propose open source
 936 frameworks for developing
 937 metaheuristics on GPUs.\\
 938
 939 The three frameworks reviewed in this section are
 940 PUGACE\index{GPU-based frameworks!PUGACE}~\cite{pugace},
 941 ParadisEO-MO-GPU\index{GPU-based
 942 frameworks!ParadisEO-MO-GPU}~\cite{paradiseoGPU}, and
 943 libCUDAOptimize\index{GPU-based
 944 frameworks!libCUDAOptimize}~\cite{libcuda}. PUGACE\index{GPU-based
 945 frameworks!PUGACE} is a framework for implementing EAs on GPUs.
 946 ParadisEO-MO-GPU is an extension of the framework
 947 ParadisEO~\cite{paradiseo} for implementing s-metaheuristics on
 948 GPU. Finally, libCUDAOptimize\index{GPU-based
 949 frameworks!libCUDAOptimize} is a  library intended for the
 950 implementation of p-metaheuristics on GPU. The three frameworks
 951 are presented in more detail in the following.
 952
 953 \subsection{PUGACE\index{GPU-based frameworks!PUGACE}: framework for implementing evolutionary computation on GPUs}
 954 PUGACE\index{GPU-based frameworks!PUGACE} is a generic  framework
 955 for easy implementation of cellular evolutionary algorithms on
 956 GPUs implemented using C and CUDA. It is based on the frameworks
 957 MALLBA and JCell (a framework for cellular algorithms). The
 958 authors justified the choice of cellular evolutionary
 959 algorithm\index{Metaheuristics!evolutionary~algorithm} by the good
 960 feedback found in the literature concerning its efficient
 961 implementation on GPUs compared to other parallel models for EAs
 962 (island, master-slave). The main standard evolutionary operators
 963 are already implemented in PUGACE\index{GPU-based
 964 frameworks!PUGACE}: different selection strategies, standard
 965 crossover, and mutation operators (PMX, swap, 2-exchange,
 966 etc.). Different problem encoding is also supported. The framework
 967 is organized as a set of modules in which the different
 968 functionalities  are separated as much as possible in order to
 969 facilitate the extension of the framework. All
 970 the functions and procedures that execute on GPU are implemented in the same file kernel.cu. \\
 971
 972 The implementation strategy adopted on the GPU is as follows.
 973 Population initialization is done on the CPU side and the
 974 population is transferred to the GPU. On the GPU side, each
 975 individual is associated to a unique CUDA thread. The function
 976 evaluation and mutation are done on the GPU while selection and
 977 replacement are maintained on the CPU. In order to avoid thread
 978 divergence\index{GPU Challenges!thread~divergence} appearing in
 979 the same CUDA thread block at the crossover step (because of the
 980 probability of application which may give different results from
 981 one thread to the other), the decision of whether to apply a
 982 crossover is taken at the block level and applied to all the
 983 individuals within the block. It is the decision on the choose
 984 of the cutting point for the crossover.\\
 985
 986 The framework is validated on standard benchmarks of the QAP. Speedups of $15.44 \times$  to $18.41
 987 \times$ are achieved compared to a CPU implementation of a cEA
 988 using population sizes rising from 512 to 1024 and from 1024 to
 989 2048.
 990
 991 % The authors noticed an
 992 % improvement on the speedup when the size of the population increases. This is best explained
 993 % by the linear increasing of execution time on CPU side while a sublinear increasing is
 994 % occuring on GPU side (due to a better exploitation of the GPU threads). Neverethless,
 995 % dispite the positive impact of large populations on th parallel efficieny of the implementation
 996 % on GPU, a study on the impact on the final solution quality when using such large populations
 997 % is to be done. Indeed, too large populations on EAs may lead to high diversification of the
 998 % search and poor intensification capabilities, which leads to poor results.
 999 % For the moment, plateforms whith more than one CPU and one GPU are not yet supported in PUGACE\index{GPU-based frameworks!PUGACE}.
1000 % No details on how to get PUGACE\index{GPU-based frameworks!PUGACE} code source are given in the paper.
1001
1002 \subsection{ParadisEO-MO-GPU}
1003
1004 Melab et al.~\cite{paradiseoGPU} developed a reusable
1005 framework ParadisEO-MO-GPU\index{GPU-based
1006 frameworks!ParadisEO-MO-GPU} for parallel local search
1007 metaheuristics (s-metaheuristics) on GPUs. It focuses on the
1008 iteration-level\index{Metaheuristics!iteration-level parallelism}
1009 parallel model of s-metaheuristics which consists of exploring in
1010 parallel the neighborhood of a problem solution on GPU. The
1011 framework, implemented using C++ and CUDA, is an extension of the
1012 ParadisEO~\cite{paradiseo} framework previously developed by the
1013 same team for parallel and distributed metaheuristics on both
1014 dedicated parallel hardware platforms and computational grids. The
1015 objective of this framework is to facilitate the development of
1016 GPU-based metaheuristics providing a transparent use of GPUs to
1017 users who are unfamiliar with advanced features of all
1018 parallelization techniques and deployment on GPUs. The framework
1019 allows one to efficiently manage the hierarchical organization of
1020 the memories (latencies and sizes) of the GPU and its
1021 communication with the CPU as well as the minimizing of the user
1022 involvement in its management.\\
1023
1024 \begin{figure}[h]
1025 \centerline{\includegraphics[width=0.8\textwidth]{Chapters/chapter9/figures/paradiseoGPU.pdf}}
1026 \caption{The skeleton of the ParadisEO-MO-GPU\index{GPU-based
1027 frameworks!ParadisEO-MO-GPU}.}
1028 %[Comparison of number of cores in a CPU and in a GPU]
1029 \label{ch1:fig:paradiseoGPU}
1030 \end{figure}
1031
1032 The framework is based on a master-worker model where the master
1033 is the CPU and the workers are threads executed by the processing
1034 cores of the GPU. The CPU executes the serial part of the
1035 metaheuristic and sends only the current solution to the GPU to
1036 minimize the transfer cost. The GPU, on its side, generates the
1037 neighboring of the received solution and evaluates them at each
1038 iteration. All the threads execute the same kernel and according
1039 to a static mapping table between the threads and the neighbors
1040 where each thread is associated with exactly one neighbor
1041 evaluation. After all the neighborhood is generated and evaluated,
1042 it is sent back to the CPU which selects the best solution (See
1043 Figure~\ref{ch1:fig:paradiseoGPU}).
1044
1045 \subsection{libCUDAOptimize: an open source library of GPU-based metaheuristics}
1046 \index{GPU-based frameworks!libCudaOptimize}
1047 LibCUDAOptimize~\cite{libcuda} is a C++/CUDA open source library
1048 for the design and implementation of metaheuristics on GPUs. Until
1049 now, the metaheuristics supported by LibCUDAOptimize are: scatter
1050 search, differential evolution, and particle swarm
1051 optimization\index{Metaheuristics!particle swarm optimization}.
1052 Nevertheless, the library is designed in such a way to allow
1053 further extensions for other metaheuristics and it is still in
1054 development phase by the authors. The parallelization strategy
1055 adopted on GPU is principally based on fitness evaluation. The
1056 sequential structure of the optimization algorithms is maintained
1057 on CPU.
1058
1059 \section{Case study: accelerating large neighborhood LS method on GPUs for solving the Q3AP}
1060 \label{ch8:sec:case}
1061
1062 In this case study, a large neighborhood GPU-based local
1063 search\index{GPU-based Metaheuristics!GPU-based~local~search}
1064 method for solving the Quadratic 3-dimensional Assignment Problem
1065 (Q3AP) will be presented. The local search method is an Iterated
1066 Local Search\index{Metaheuristics!iterated local search}
1067 (ILS)~\cite{stutzle2006ILSforQAP} using an embedded
1068 TS\index{Metaheuristics!tabu~search} algorithm. The ILS principle
1069 consists of executing iteratively the embedded local search, each
1070 iteration which starts from a disrupted local optima reached by
1071 the previous local search process. The disruption heuristic is a
1072 performance parameter of an ILS algorithm and should be
1073 judiciously defined. A template of an
1074 ILS algorithm is given by the Algorithm~\ref{ch8:ils_algorithm_template}.\\
1075
1076 \begin{algorithm}[H]
1077     \SetAlgoLined
1078     %\KwData{this text}
1079     %\KwResult{how to write algorithm with \LaTeX2e }
1080    $s_{0}$=GenerateInitSol()\;
1081    $s^*$=LocalSearch($s_0$)\;
1082     \Repeat{a maximum number of iterations is reached}{
1083      $s'$=Perturbate($s^*,history$)\;
1084      $s^{*'}$=LocalSearch($s'$)\;
1085      $s^*$=AcceptationCriteria($s^*,s^{*'},history$)\;
1086
1087     }
1088 \caption{iterated local search template}
1089 \label{ch8:ils_algorithm_template}
1090 \end{algorithm}
1091
1092 %
1093 %   \begin{algorithm}
1094 %  % \label{ch8:ils_algorithm_template}
1095 %   \caption{iterated local search\index{Metaheuristics!iterated local search} template}
1096 %   \begin{algorithmic}[1]
1097 %
1098 %   \STATE $s_{0}$=\texttt{GenerateInitSol}();
1099 %
1100 %   \STATE $s^*$=\texttt{LocalSearch}($s_0$);
1101 %
1102 %   \REPEAT
1103 %     \STATE $s'$=\texttt{Perturbate}($s^*,history$);
1104 %     \STATE $s^{*'}$=\texttt{LocalSearch}($s'$);
1105 %     \STATE $s^*$=\texttt{AcceptationCriteria}($s^*,s^{*'},history$);
1106 %
1107 %   \UNTIL {a maximum number of iterations is reached}
1108 %
1109 %   \end{algorithmic}
1110 %
1111 %   \end{algorithm}
1112
1113 \subsection{The quadratic 3-dimensional assignment problem}
1114
1115 The Q3AP is an extension of the well-known NP-hard QAP. The latter
1116 is one of the most studied problems by the combinatorial
1117 optimization community due to its wide range of practical
1118 applications (site planning, schedule problem, computer-aided
1119 design, etc.) and to its computational challenges since it is
1120 considered as one of the most computationally difficult
1121 optimization problems.\\
1122
1123 The Q3AP was first introduced by William P. Pierskalla in
1124 1967~\cite{Pierskalla1967Q3AP} and, unlike the QAP, the Q3AP is a
1125 less studied problem. Indeed, the Q3AP was revisited only this
1126 past year and has recently been used to model some advanced
1127 assignment problems such as the symbol-mapping problem encountered
1128 in wireless communication systems and described
1129 in~\cite{Hahn2008q3ap}. The Q3AP optimization problem can be
1130 mathematically expressed as follows:
1131
1132 % debut -----------------------------------------------------------------------------
1133 \begin{equation*} min \left \{
1134 \sum_{i=0}^{n-1}\sum_{j=0}^{n-1}\sum_{l=0}^{n-1}\sum_{k=0}^{n-1}\sum_{s=0}^{n-1}\sum_{r=0}^{n-1}c_{ijlksr}x_{ijl}x_{ksr}
1135 \quad+ \right . \end{equation*}
1136
1137 \begin{equation} \label{Q3APFormulation} \left .
1138 \sum_{i=0}^{n-1}\sum_{j=0}^{n-1}\sum_{l=0}^{n-1}b_{ijl}x_{ijl}
1139 \right \} \end{equation}
1140
1141 where
1142
1143 \begin{eqnarray}
1144   X=(x_{ijl})\in I \cap J \cap L, \label{Q3APConstraints1}\\
1145   x_{ijl}\in \{0,1\}, \quad i,j,l=0,1,...,n-1 \label{Q3APConstraints2}
1146 \end{eqnarray} $I$, $J$, and $L$ sets are defined as follows:
1147 \begin{displaymath} I=\lbrace X=(x_{ijl}):
1148 \sum_{j=0}^{n-1}\sum_{l=0}^{n-1}x_{ijl}=1, \quad
1149 i=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
1150 \begin{displaymath} J=\lbrace X=(x_{ijl}):
1151 \sum_{i=0}^{n-1}\sum_{l=0}^{n-1}x_{ijl}=1, \quad
1152 j=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
1153 \begin{displaymath} L=\lbrace X=(x_{ijl}):
1154 \sum_{i=0}^{n-1}\sum_{j=0}^{n-1}x_{ijl}=1, \quad
1155 l=0,1,...,n-1\rbrace  \end{displaymath}
1156 % fin------------------------------------------------------------------------------
1157
1158 Other equivalent formulations of the Q3AP can be found in the literature. A particularly useful one is the
1159 \textit{permutation-based formulation} wherein the Q3AP can be expressed as follows:
1160
1161  \begin{equation*}
1162  min\left\{ f(\pi_1,\pi_2) =
1163  \sum_{i=0}^{n-1}\sum_{j=0}^{n-1}c_{i\pi_{1(i)}\pi_{2(i)}j\pi_{1(j)}\pi_{2(j)}} \quad
1164  + \right.
1165 \end{equation*}
1166
1167 \begin{equation}
1168  \left. \sum_{i=0}^{n-1}b_{i\pi_{1(i)}\pi_{2(i)}}\right
1169  \} \label{Q3APPermutationBasedFormulation}
1170 \end{equation}
1171 where $\pi_1$ and $\pi_2$ are permutations over the set
1172 $\left\lbrace 0,1,\ldots,n-1\right\rbrace$. According to this
1173 formulation, minimizing the Q3AP consists of finding a double
1174 permutation $(\pi_1,\pi_2)$ which minimizes the objective function
1175 (\ref{Q3APPermutationBasedFormulation}).\\
1176
1177 The Q3AP is proven to be an NP-hard problem since it is an
1178 extension of the quadratic assignment problem and of the axial
1179  3-dimensional assignment problem which are both NP-hard problems. It is particularly computationally difficult since the
1180 number of feasible solutions of an instance of size $n$ is $n! \times n!$.
1181
1182 \subsection{Iterated tabu search algorithm for the Q3AP}
1183
1184 To tackle large-sized instances of the Q3AP and speed up the
1185 search process, a parallel ILS\index{Metaheuristics!iterated local
1186 search} algorithm has been designed. The local search embedded in
1187 the ILS is a TS\index{Metaheuristics!tabu~search}. A TS
1188 procedure~\cite{Glover1989TS} starts from an initial feasible
1189 solution and tries, at each step, to move to a neighboring
1190 solution that minimizes the fitness (for a minimization case). If
1191 no such move exists, the neighbor solution that less degrades the
1192 fitness is chosen as a next move. This enables the TS process to
1193 escape local optima. However, this strategy may generate cycles,
1194 i.e., previous moves can be selected again. To avoid cycles, the
1195 TS manages a short-term memory that contains the moves that have
1196 been recently performed. A TS\index{Metaheuristics!tabu~search}
1197 template is given by Algorithm \ref{TS_pseudo_code}.\\
1198 %
1199 %  \begin{algorithm}
1200 %  \label{TS_pseudo_code}
1201 %  \caption{Tabu search template}
1202 %  \begin{algorithmic}[1]
1203 %  \STATE GenerateInitSol($s_0$);
1204 %  \STATE TabuList=$\phi$;
1205 %  \STATE $t=0$;
1206 %  \REPEAT
1207 %          \STATE $m(t)$ = SelectBestMove($s(t)$);
1208 %          \STATE $s(t+1)$ = ApplyMove($m(t),s(t)$);
1209 %          \STATE TabuList = TabuList $\bigcup \{m(t)\}$;
1210 %          \STATE $t = t + 1$;
1211 %  \UNTIL{a maximum number of iterations is reached}
1212 %
1213 %  \end{algorithmic}
1214 %  \end{algorithm}
1215
1216 \begin{algorithm}[H]
1217     \SetAlgoLined
1218     %\KwData{this text}
1219     %\KwResult{how to write algorithm with \LaTeX2e }
1220     GenerateInitSol($s_0$)\;
1221     TabuList=$\phi$\;
1222     $t=0$\;
1223     \Repeat{a maximum number of iterations is reached}{
1224           $m(t)$ = SelectBestMove($s(t)$)\;
1225           $s(t+1)$ = ApplyMove($m(t),s(t)$)\;
1226           TabuList = TabuList $\bigcup \{m(t)\}$\;
1227           $t = t + 1$\;
1228
1229     }
1230 \caption{tabu search template}
1231 \label{TS_pseudo_code}
1232 \end{algorithm}
1233
1234
1235 \subsection{Neighborhood functions for the Q3AP}
1236
1237 The neighborhood function is a crucial parameter in any local
1238 search algorithm. Indeed, if the neighborhood function is not
1239 adequate to the problem and/or does not consider the targeted
1240 computing framework, any local search algorithm will fail to reach
1241 good quality solutions of the search space.\\
1242
1243 Regarding the Q3AP, many neighborhood structures can be
1244 considered. A basic neighborhood was proposed and investigated in
1245 different works of the
1246 literature~\cite{Hahn2008q3ap,DBLP:journals/ijfcs/LoukilMMTB12}
1247 and consists of the set of all solutions (double-permutations)
1248 generated from the current one by an exchange of two positions in
1249 either the first ($\pi_1$) or the second ($\pi_2$) permutation.
1250 This neighborhood that we denote by $N_b$ can be formalized as
1251 follows:
1252
1253 \begin{equation}
1254 \begin{array}{rl}
1255 N_b(\pi_1,\pi_2)=\{&(\pi_1^{'},\pi_2^{'}): \pi_1^{'}[k]=\pi_1[l], \pi_1^{'}[l]=\pi_1[k]\\
1256                        & 0\leq k \neq l < n;\\
1257                 & \pi_1^{'}[i]=\pi_1[i], \quad 0\leq i\neq k,l < n;\\
1258                 & \pi_2^{'}[j]=\pi_2[j], \quad 0\leq j<n\\ \} \\
1259 \bigcup \\
1260 \{& (\pi_1^{'},\pi_2^{'}): \pi_2^{'}[k]=\pi_2[l], \pi_2^{'}[l]=\pi_2[k]\\
1261                 & 0\leq k \neq l < n;\\
1262                 & \pi_2^{'}[i]=\pi_2[i], \quad 0\leq i < n;\\
1263                 & \pi_1^{'}[j]=\pi_1[j], \quad 0\leq j\neq k,l < n\\
1264 \}& \end{array}
1265 \end{equation}
1266 where $(\pi_1,\pi_2)$ is the current solution and $n$ is the size
1267 of the Q3AP instance. The size $|N_b|$ of such neighborhood is
1268 equal to
1269
1270 \begin{equation}
1271  |N_b| = n \times (n-1)
1272 \end{equation}
1273
1274
1275 In our GPU-based implementations, a large-sized neighborhood
1276 structure has been used for experimentation. In fact, theoretical
1277 and experimental studies have shown that the use of large
1278 neighborhood structures may improve the effectiveness of LS
1279 algorithms~\cite{Ahuja:2007:VLN:1528422.1528438}. However, for
1280 such a neighborhood, the generation/evaluation step of an LS
1281 becomes a time-consuming task and may dramatically increase the
1282 computational time of the LS process. This
1283 justifies the use of intensive data-parallelism provided by GPUs where all neighboring solutions may be concurrently evaluated. \\
1284
1285 The considered large-sized neighborhood consists of swapping two
1286 positions in both permutations $\pi_1$ and $\pi_2$. This
1287 neighborhood structure, $N(\pi_1,\pi_2)$, can be formalized as
1288 follows:
1289
1290 \begin{equation} \begin{array}{rl}
1291 N(\pi_1,\pi_2)=\{&(\pi_1^{'},\pi_2^{'}): \pi_1^{'}[k]=\pi_1[l], \pi_1^{'}[l]=\pi_1[k],\\
1292                 & \pi_2^{'}[r]=\pi_2[s], \pi_2^{'}[s]=\pi_2[r]\\
1293                 & 0\leq k \neq l < n,0\leq r \neq s < n;\\
1294                 & \pi_1^{'}[i]=\pi_1[i], \quad 0\leq i\neq k,l < n;\\
1295                 & \pi_2^{'}[j]=\pi_2^{'}[j], \quad 0\leq j\neq r,s < n\\
1296
1297 \}& \end{array}
1298 \end{equation}
1299
1300 So, for a given Q3AP instance of size $n$, the size $|N|$ of the advanced neighborhood set can be expressed by the following formula:
1301
1302 \begin{equation}
1303   |N|=\left(\frac{n \times (n-1)}{2}\right )^2
1304 \end{equation}
1305
1306 \subsection{Design and implementation of a GPU-based iterated tabu search algorithm for the Q3AP} \label{ch8:ITS-Q3APSection}
1307
1308 The use of GPU devices to speed up the search process of local
1309 search methods is not a straightforward task. It requires one to
1310 consider, at the same time, the characteristics and underlying
1311 issues of the GPU architecture and the parallel models of LS
1312 methods. The main challenges that must be faced when designing a
1313 local search algorithm are the efficient distribution of the
1314 search process between the CPU and the GPU minimizing the data
1315 transfer between them, the hierarchical memory
1316 management\index{GPU Challenges!memory~management} and the
1317 capacity constraints of GPU memories, and the thread
1318 synchronization. All these issues must be regarded when designing
1319 parallel LS models to allow
1320 solving of large scale optimization problems on GPU architectures.\\
1321
1322 To go back to our problem (i.e., Q3AP), we propose in
1323 Algorithm~\ref{ch8:algoITS} an iterated tabu
1324 search\index{Metaheuristics!tabu~search} on GPU (GPU-ITS). The
1325 parallel model is in agreement with the
1326 iteration-level\index{Metaheuristics!iteration-level parallelism}
1327 parallel model of LS methods presented in Section
1328 \ref{ch8:sec:paraMeta} (Fig. \ref{ch8:fig:paraMeta}). This
1329 algorithm can be seen as a cooperative model between the CPU and
1330 the GPU. Indeed, the GPU is used as a coprocessor in a synchronous
1331 manner. The resource-consuming part, i.e., the generation and
1332 evaluation kernel, is calculated by the GPU and the rest of the LS
1333 process is handled by the CPU. First of all, at initialization
1334 stage, memory allocations on GPU are made; the input matrices
1335 (distance and flow matrices) and the initial candidate solution of
1336 the Q3AP must be allocated (lines 4 and 5). Since GPUs require
1337 massive computations with predictable memory accesses, a structure
1338 has to be allocated for storing all the neighborhood fitnesses at
1339 different addresses (line 6). Second, the matrices and the initial
1340 candidate solution have to be copied on the GPU (lines 7 and 8).
1341 We notice that the input matrices are read-only structures and do
1342 not change during all the execution of the LS algorithm.
1343 Therefore, their associated memory is copied only once during all
1344 the execution. Third, comes the parallel
1345 iteration-level\index{Metaheuristics!iteration-level parallelism},
1346 in which each neighboring solution is generated, evaluated, and
1347 copied into the neighborhood fitnesses structure (from lines 10 to
1348 14). Fourth, since the order in which candidate neighbors are
1349 evaluated is undefined, the neighborhood fitnesses structure has
1350 to be copied to the host CPU (line 15). Then the selection
1351 strategy is applied to this structure (line 16): the exploration
1352 of the neighborhood fitnesses structure is done by the CPU.
1353 Finally, after a new candidate has been selected, this information
1354 is copied to the GPU (line 18). The process is repeated until a
1355 given number of iterations has been reached.
1356
1357 % \begin{algorithm}
1358 % \caption{Template of an iterated tabu search\index{Metaheuristics!tabu~search} on GPU for solving the Q3AP} \label{LSGPU}
1359 %  \begin{algorithmic}[1]
1360 %  \STATE Choose an initial solution
1361 %  \STATE Evaluate the solution
1362 %  \STATE Initialize the tabu list
1363 %  \STATE Allocate the Q3AP input data on GPU device memory
1364 %  \STATE Allocate a solution on GPU device memory
1365 %  \STATE Allocate a neighborhood fitnesses structure on GPU device memory
1366 %  \STATE Copy the Q3AP input data on GPU device memory
1367 %  \STATE Copy the solution on GPU device memory
1368 %  \REPEAT
1369 %      \FOR{each generated neighbor on GPU}
1370 %          \STATE Incremental evaluation of the candidate solution
1371 %          \STATE Insert the resulting fitness into the neighborhood fitnesses structure
1372 %      \ENDFOR
1373 %      \STATE Copy the neighborhood fitnesses structure on CPU host memory
1374 %      \STATE Select the best admissible neighboring solution
1375 %      \STATE Update the tabu list
1376 %      \STATE Copy the chosen solution on GPU device memory
1377 %  \UNTIL{a maximum number of iterations reached}
1378 %  \end{algorithmic}
1379 %  \end{algorithm}
1380
1381 \begin{algorithm}[H]
1382     \SetAlgoLined
1383     %\KwData{this text}
1384     %\KwResult{how to write algorithm with \LaTeX2e }
1385     Choose an initial solution\;
1386     Evaluate the solution\;
1387     Initialize the tabu list\;
1388     Allocate the Q3AP input data on GPU device memory\;
1389     Allocate a solution on GPU device memory\;
1390     Allocate a neighborhood fitnesses structure on GPU device memory\;
1391     Copy the Q3AP input data on GPU device memory\;
1392     Copy the solution on GPU device memory\;
1393     $t=0$\;
1394     \Repeat{a maximum number of iterations is reached}{
1395         \For{each generated neighbor on GPU}{
1396           Incremental evaluation of the candidate solution\;
1397           Insert the resulting fitness into the neighborhood fitnesses structure\;
1398         }
1399         Copy the neighborhood fitnesses structure on CPU host memory\;
1400         Select the best admissible neighboring solution\;
1401         Update the tabu list\;
1402         Copy the chosen solution on GPU device memory\;
1403     }
1404 \caption{template of an iterated tabu search on GPU for solving the Q3AP}
1405 \label{ch8:algoITS}
1406 \end{algorithm}
1407
1408
1409 \subsection {Experimental results}
1410
1411 In this section, some experimental results related to the approach
1412 presented in Section \ref{ch8:ITS-Q3APSection} are reported. We
1413 recall that the approach is a GPU-based iterated tabu
1414 search\index{Metaheuristics!tabu~search} (GPU-ITS) method
1415 consisting in an iterated local search (ILS) embedding a tabu
1416 search\index{Metaheuristics!tabu~search} (TS) and where the
1417 generation/evaluation step of the TS process is executed on GPU.
1418 The ILS is used to improve the quality of successive local optima
1419 provided by TS methods. This is achieved by perturbing the local
1420 optima reached by the current TS process and reconsidering it as
1421 initial solution of the following TS process. Regarding our
1422 algorithm, the applied perturbation is a random number $\mu $ of
1423 swaps in either the first or the second permutation where $\mu \in
1424 [2:n]$ ($n$ is the instance size).\\
1425
1426 Experiments have been carried out on a node of the Chirloute
1427 cluster of the Lille site. This is one of the 10 sites that
1428 currently make up Grid5000~\cite{grid5000}, the French
1429 experimental computational grid. A Chirloute cluster node consists
1430 of an Intel Xeon E5620 CPU and a NVIDIA Tesla Fermi M2050 (448
1431 cores) GPU type. The number of ILS iterations and the number of TS
1432 iterations were set respectively to 20 and 500. The tabu list size
1433 has been initalized to $\frac{m}{4}$, $m$ being the size of the
1434 neighborhood set.\\
1435
1436 Table \ref{ch8:ITSQ3APResults} reports the obtained results for
1437 the GPU-ITS using our large-sized neighborhood structure. The
1438 method has been tested on 5 Q3AP instances derived from the QAP
1439 Nugent instances in QAPLIB~\cite{burkard1991qaplib}. The average
1440 time measurement for 20 executions is reported in seconds and
1441 acceleration factors compared to a standalone CPU are also
1442 considered. The algorithm is stopped when a maximum number of
1443 iterations has been reached or when the optimal (or best known)
1444 value has been discovered. Average and max values of the
1445 evaluation function of the parallel GPU version have been
1446 measured. The number of successful tries (hits) and the average
1447 number of ILS iterations to converge to the optimal/best known
1448 value are also represented. The associated standard deviation for
1449 each average measurement is shown in small type characters.\\
1450
1451 \begin{table}
1452 %\tiny
1453 %\footnotesize
1454 \small
1455 \begin{tabular}{|l|r|r|r|r|r|r|r|r|} \hline
1456 \multicolumn{1}{|c|}{Instance }& \multicolumn{1}{|c|}{Optimal}& \multicolumn{1}{|c|}{Average}  & \multicolumn{1}{|c|}{Maximal}  & \multicolumn{1}{|c|}{Hits} & \multicolumn{1}{|c|}{CPU} & \multicolumn{1}{|c|}{GPU} & \multicolumn{1}{|c|}{Speed-} & \multicolumn{1}{|c|}{ITS} \\
1457 & \multicolumn{1}{|c|}{/BKV\footnotemark }& \multicolumn{1}{|c|}{value }& \multicolumn{1}{|c|}{value }& \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{time }& \multicolumn{1}{|c|}{time }&  \multicolumn{1}{|c|}{up}& \multicolumn{1}{|c|}{iter.}\\ \hline
1458 Nug12-d & $580$   & $615.4$    & $744$   & $35$\% & $87.7$ & $2.5$ & $34.7 \times$& $16$ \\
1459     &   & \tiny{$41.7$}    & && \tiny{$40.9$} & \tiny{$0.9$} & & \tiny{$6.3$} \\ \hline
1460 Nug13-d & $1912$  & $1985.4$   & $2100$  & $20$\% & $209.2$ & $3.3$ & $63.5 \times$ & $17$ \\
1461 &  & \tiny{$51.0$}   &  & & \tiny{$1.3$} & \tiny{$1.0$} &  & \tiny{$5.6$} \\ \hline
1462 Nug15-d & $2230$  & $2418.1$  & $2580$  & $30$\% & $305.5$ & $5.2$ & $58.8 \times$ & $17$ \\
1463  &  & \tiny{$135.3$}  & & & \tiny{$164.5$} & \tiny{$1.3$} & & \tiny{$5.0$} \\ \hline
1464 Nug18-d & $17836$ & $18110.2$ & $18506$ & $10$\% & $1375.9$ & $12.8$ & $107.4 \times$ & $19$ \\
1465 & & \tiny{$157.8$} & & & \tiny{$123.5$} & \tiny{$2.6$} & & \tiny{$4.2$} \\ \hline
1466 Nug22-d & $42476$ & $43282.1$ & $44140$ & $25$\% & $4506.5$ & $32.7$ & $137.8 \times$ & $18$ \\
1467 & & \tiny{$529.6$} & & & \tiny{$341.1$} & \tiny{$6.6$} & & \tiny{$4.0$} \\ \hline
1468 \end{tabular}
1469 \caption{Results of the GPU-based iterated tabu search for
1470 different Q3AP instances.} \label{ch8:ITSQ3APResults} \center
1471 \end{table}
1472
1473 %\begin{table}
1474 %\tiny
1475 %%\footnotesize
1476 %\caption{Results of the GPU-based iterated tabu search for different Q3AP instances}
1477 %\label{ch8:ITSQ3APResults} \center
1478 %\begin{tabular}{|c|c|c|c|c|c|c|c|c|} \hline
1479 %Instance & Opt./BKV\footnotemark & Average val. & Max val. & Hits & CPU time & GPU time & Accel. & ITS iter.\\ \hline
1480 %Nug12-d & $580$   & $615.4_{41.7}$    & $744$   & $35$\% & $87.7_{40.9}$ & $2.5_{0.9}$ & $\times 34.7 $& $16_{6.3}$ \\ \hline
1481 %Nug13-d & $1912$  & $1985.4_{51.0}$   & $2100$  & $20$\% & $209.2_{1.3}$ & $3.3_{1.0}$ & $\times 63.5$ & $17_{5.6}$ \\ \hline
1482 %Nug15-d & $2230$  & $2418.1_{135.3}$  & $2580$  & $30$\% & $305.5_{164.5}$ & $5.2_{1.3}$ & $\times 58.8 $ & $17_{5.0}$ \\ \hline
1483 %Nug18-d & $17836$ & $18110.2_{157.8}$ & $18506$ & $10$\% & $1375.9_{123.5}$ & $12.8_{2.6}$ & $\times 107.4$ & $19_{4.2}$ \\ \hline
1484 %Nug22-d & $42476$ & $43282.1_{529.6}$ & $44140$ & $25$\% & $4506.5_{341.1}$ & $32.7_{6.6}$ & $\times 137.8$ & $18_{4.0}$ \\ \hline
1485 %\end{tabular}
1486 %\end{table}
1487
1488 \footnotetext{Best known value.}
1489
1490 Regarding the execution time, the generation and evaluation of the
1491 neighborhood in parallel on GPU provides an efficient way to
1492 speedup the search process in comparison with a single CPU. In
1493 fact, for the smaller instance Nug12-d, the GPU version starts to
1494 be faster than the CPU one (acceleration factor of $34.7~\times$).
1495 As long as the problem size increases, the speedup grows
1496 significantly (up to $137.8 \times$ for the Nug22-d instance).
1497 This means that the use of GPU provides an efficient way to deal
1498 with large neighborhoods. Indeed, the proposed neighborhoods were
1499 unpractical in terms of single CPU computational resources for
1500 large Q3AP instances such as Nug22-d (estimated to around $2$
1501 hours per run). So, implementing this algorithm on GPU has allowed
1502 the exploitation of  parallelism in such neighborhood and improved
1503 the robustness/quality of provided solutions.
1504
1505 \section{Conclusion}
1506 \label{ch8:conclusion}
1507
1508 This chapter has presented state-of-the-art GPU-based parallel
1509 metaheuristics\index{Metaheuristics!parallel~metaheuristics} and
1510 a case study on implementing large neighborhood local search
1511 methods on GPUs for solving large benchmarks of the quadratic
1512 three-dimensional assignment problem (Q3AP). \\
1513
1514 Implementing parallel
1515 metaheuristics\index{Metaheuristics!parallel~metaheuristics} on
1516 GPU architectures poses new issues and challenges such as memory
1517 management\index{GPU Challenges!memory~management};  finding
1518 efficient mapping strategies between tasks to parallelize; and the
1519 GPU threads, thread divergence\index{GPU
1520 Challenges!thread~divergence}, and synchronization. Actually, most
1521 of metaheuristics have been implemented on GPU using different
1522 implementation strategies. In this chapter, a two-level
1523 classification of the reviewed works has been proposed: design
1524 level and implementation level. Design level regroups traditional
1525 parallel models used for metaheuristics while implementation level
1526 refers to the way those models are mapped to the GPU architecture.
1527 This classification focuses mainly on  the mapping between the
1528 metaheuristic tasks to parallelize and the GPU threads. Indeed,
1529 the choice of a given mapping strategy strongly influences the
1530 other challenges (memory usage, communication, thread
1531 divergence\index{GPU Challenges!thread~divergence}).
1532
1533 \putbib[Chapters/chapter9/biblio9]