-A very basic local search algorithm starts with an initial solution generated either at random
-or by the mean of a specific heuristic and is based on two elementary components: the generation
-of neighborhood structures using an elementary move function and a selection function which
-determines which solution in the current neighborhood
-will replace the actual search point. The search continues as long as there is
-improvement and stops when there is no better solution in the current neighborhood.
-The exploration (or evaluation) of the different moves of a given neighborhood is done
-independently for each move. Thus, the easiest way to accelerate a local search algorithm
-is to parallelize the evaluation of the neighborhood (instruction-level parallelism). This
-is by far the most used scheme in the literature for parallelizing local search algorithms
-on GPUs. Nevertheless, small neighborhoods may lead to non optimal occupation of the CUDA
-threads which may lead in its turn to an overhead due to the communication and memory latencies.
-Therefore, large neighborhoods are necessary for efficient implementation of
-local searches on GPUs.\\
-
-Luong \emph{et al.}~\cite{luong2010large} proposed efficient mappings for large neighborhood
-structures on GPUs. In this work, three different neighborhoods are studied and mapped to the
-hierarchical GPU for binary problems. The three neighborhoods are based on the \emph{Hamming} distance. The move operators used in the three neighborhoods consider \emph{Hamming} distances of 1, 2 and 3 respectively (it consists on flipping the binary value of one, two or three positions at a time in the candidate binary vector, respectively). In~\cite{luong2010large}, each thread is associated to a unique solution in the neighborhood. The addressed issue is how to efficiently map the different neighborhoods on the device memory. More explicitly, how to calculate the memory index of each solution associated to each CUDA thread's \textit{id}.
-%For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes in the candidate vector.
-The three neighborhoods are implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{Metaheuristics!tabu~search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes.\\ % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
-
-In the same context, Deevacq \emph{et al.}~\cite{audreyANT} proposed two parallelization strategies inspired by the multi-walk parallelization strategy, of a 3-opt iterated local search\index{Metaheuristics!iterated local search} algorithm (ILS) over a CPU/GPU architecture. In the first strategy, each LS is associated to a unique CUDA thread and improves a unique solution by generating
-its neighborhood. On the contrary, the second strategy associates each solution to a
-CUDA block and the neighborhood exploration is parallelized on the block threads. In the first strategy, since several LS are executed on different solutions on each MP, the data structures
-should be stored on the global memory while the exploration of a single solution at a
-time in the second strategy allows the use of the shared memory to store the related
-data structures. The two strategies have been experimented on standard benchmarks of
-the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cities. The results indicate that increasing the number of solutions to be explored simultaneously improves the speedup in the two strategies but on the same time it decreases final solution quality. The greater speedup factor reached by the second strategy is $6 \times$.\\
-
-The same strategy is used by Luong \emph{et al.} in~\cite{luongMultiStart} to implement multi-start parallel local search algorithms (a special case of the algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} parallel model
-where several homogeneous LS algorithms are used). The multi-start model is combined with iteration-level\index{Metaheuristics!iteration-level parallelism} parallelism: several LS algorithms are managed by the CPU and the neighborhood evaluation step of each algorithm is parallelized on the GPU (each GPU thread is
-associated with one neighbor and executes the same evaluation function kernel).
-The advantage of such a model is that it allows a high occupancy of the GPU
-threads. Nevertheless, memory management\index{GPU Challenges!memory~management} causes new issues due to the quantity of
-data to store and to communicate between CPU and GPU. A second proposition for
-implementing the same model on GPU consists in implementing the whole LS processes
-on GPU and each GPU thread is associated to a unique LS algorithm. This solves the
-communication issue encountered in the first model. In addition, a memory management\index{GPU Challenges!memory~management}
-strategy is proposed to improve the efficiency of the algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} model:
-texture memory is used to avoid memory latency due to uncoalesced memory accesses.
-The proposed approaches are implemented on the quadratic assignment problem (QAP) using
-CUDA. The acceleration rates obtained for the
-algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} with usage of texture memory rise from $7.8\times$ to $12\times$ (for different
-QAP benchmark sizes). \\
-
-Janiak \emph{et al.}~\cite{Janiak_et_al_2008} implemented two algorithms for
-TSP and the flow shop scheduling problem (FSP). These algorithms are based on a
-multi-start tabu search\index{Metaheuristics!tabu~search} model. Both of the two algorithms exploit multi-core CPU and GPU.
-A full parallelization on GPU is adopted using shader libraries where each thread is
-mapped with one tabu search\index{Metaheuristics!tabu~search}. However, even though their experiments report that the
-use of GPU speedups the serial execution almost $16 \times$, the mapping of one
-thread with one tabu search\index{Metaheuristics!tabu~search} requires a large number of local search algorithms to
-cover the memory access latency. The same mapping policy is adopted by Zhu \emph{et al.} in~\cite{zhu_et_al_2008} (one thread is associated to one local search) solving the quadratic assignment problem but using the CUDA toolkit instead of shader libraries.\\
-
-Luong \emph{et al.}~\cite{luong2012ppsn} proposed a GPU-based implementation of hybrid metaheuristics on heterogeneous parallel architectures (multi core CPU coupled to one GPU). The challenges of using such a heterogeneous architecture consist on the task distribution between the CPU cores and the GPU in such a way to have optimal performances. Among the three traditional parallel models (solution-level\index{Metaheuristics!solution-level~parallelism}, iteration-level\index{Metaheuristics!iteration-level parallelism} and algorithmic-level\index{Metaheuristics!algorithmic-level parallelism}), the authors pointed out that the most convenient model for the
-considered heterogeneous architecture is a hybrid model combining iteration-level\index{Metaheuristics!iteration-level parallelism}
-and algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} models. Several CPU threads execute several instances of the same
-S-metaheuristic in parallel while the GPU device is associated to one CPU core and used
-to accelerate the neighborhood calculation of several S-metaheuristics in the same time.
- In order to efficiently exploit the remaining CPU cores, a load
-balancing heuristic is also proposed in order to decide on the number of additional
-S-metaheuristics to launch on the remaining CPU cores relatively to the efficiency of the GPU calculations. The proposed approach is applied to the QAP using several instances of FANT algorithm~\cite{taillardFant}. \\
-
-All the previous works exploit the same parallel models used on CPUs based on the
-task parallelism. A different implementation approach is proposed by Paul in~\cite{gerald2012}
-to implement a simulated annealing\index{Metaheuristics!simulated~annealing} (SA) algorithm for the QAP on GPUs. Indeed, the author used
-a preinitialized matrix \emph{delta} in which the incremental evaluation of simple swap moves are
-calculated and stored relatively to the initial permutation $p$. For the GPU implementation,
-the authors used the parallel implementation of neighborhood exploration. The time consuming
-tasks in the SA-matrix are identified by the authors as: updating the matrix and passing through it to select the next accepted move. To initialize the delta matrix, several threads from different blocks explore different segments of the matrix (different moves) at the same time. In order to select the next accepted swap several threads are also used. Starting from the last move, a group of threads explore different subsets of the delta matrix. The shared memory is used to pre-load all the necessary elements for a given group of threads responsible of the updating of the delta matrix. The main difference in this work compared to the previous works resids in the introduction of a data parallelism using the precalculated delta matrix. The use of this matrix allows to increase the number of threads involved in the evaluation of a single move. Experimentations are done on standard QAP instances from the QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are achieved by the GPU implementation compared
-to the same sequential implementation on CPU using SA-matrix.\\
-
-\subsection{Implementing population-based metaheuristics on GPUs}
-
-State-of-the-art works dealing with the implementation of p-metaheuristics on GPUs generally rely
-on parallel models and research efforts done for parallelizing different classes of p-metaheuristics over different types of parallel architectures: supercomputers, clusters and computational grids. Three main classes of p-metaheuristics are considered in this section: evolutionary algorithm\index{Metaheuristics!evolutionary~algorithm}s (EAs), ant colony\index{Metaheuristics!ant~colony~optimization} optimization (ACO\index{Metaheuristics!ant~colony~optimization}) and particle swarm optimization\index{Metaheuristics!particle swarm optimization} (PSO).
-
-\subsubsection*{Evolutionary Algorithms}
-
-Traditional parallel models for EAs are classified in three main classes: coarse grain models using several sub-populations (islands), master-slave models used for the parallelization of CPU intensive steps (evaluation and transformation) and cellular models based on the use of one population disposed (generally) on a toroidal grid.
-
-The three traditional models have been implemented on GPUs by several researchers for different
-optimization problems. The main chalenges to be raised when implementing the traditional models on GPUs concern (1) the saturation of the GPU in order to cover memory latency by calculations, and (2) efficent usage of the hierarchical GPU memories.\\
-
-In~\cite{kannan}, Kannan and Ganji presented a CUDA implementation of the drug discovery application
-Autodock (molecular docking application). Autodock uses a genetic algorithm\index{Metaheuristics!genetic~algorithm} to find optimal docking
-positions of a ligand to a protein. The most time consuming task in Autodock is the fitness function
-evaluation. The fitness function used for a docking problem consists in calculating the energy of the ligand-protein complex (sum of intermolecular energies). The authors explored two different approaches to evaluate the fitness function on GPU. In the first approach, each GPU thread calculates the energy function of a single individual. This approach requires the use of large size populations to saturate the GPU (thousands of individuals). In the second approach each individual is associated to one thread block. The evaluation of the energy function is performed by the threads of the same block. This allows the use of medium population sizes (hundreds of individuals) and the acceleration of a single fitness
-evaluation. Another great advantage of the per block approach resides in the use of shared memory instead of global memory to store all the information related to each individual. The obtained speedups range from $10 \times$ to $47 \times$ for population sizes
-ranging from $50$ to $10000$. \\
-
-Maitre \emph{et al.}~\cite{maitre2009} also exploited the master-slave parallelism of EAs on GPUs using EASEA. EASEA is a C-like metalanguage for easy development of EAs. The user writes a description of the problem specific components (fitness function, problem representation, etc) in EASEA. The code is then compiled to obtain a ready to use evolutionary algorithm\index{Metaheuristics!evolutionary~algorithm}. The EASEA compiler uses genetic algorithm\index{Metaheuristics!genetic~algorithm}LIB and EO Libraries to produce C++ or JAVA written EA codes. In~\cite{maitre2009}, the authors proposed an extension of EASEA to produce CUDA code from the EASEA files. This extension has been used to generate a master-worker parallel EA in which the sequential algorithm is maintained on CPU and the population is sent to GPU for evaluation. Two problems have been considered during the experiments: a benchmark mathematical function and a real problem (molecular structure prediction). In order to maximize the GPU
-occupation, very large populations are used (from $2000$ to $20000$). Even though transferring such large
-populations from the CPU to the GPU device memory at every generation is very costly, the authors reported important speedups on the two problems on a GTX260 card: $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management\index{GPU Challenges!memory~management} of the populations on GPU.\\
-
-The master-slave model is efficient when the fitness function is highly time intensive.
-Neverethless, it requires the use of large size populations in order to saturate the GPU unless the per-block is used (one individual per-block) when the acceleration of the fitness function itself is possible. The use of many sub-populations of medium sizes is another way to obtain a maximum occupation of the GPU. This is coarse grained prallelism (island model).\\
-
-The coarse grained model is used by Pospichal \emph{et al.} in~\cite{pospichal10} to implement a parallel genetic algorithm\index{Metaheuristics!genetic~algorithm} on GPU. In this work the entire genetic algorithm\index{Metaheuristics!genetic~algorithm} is implemented on GPU. This choice is motivated by the overhead induced by the CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication} when only population evaluation is performed on GPU. Each population island is affected to a CUDA thread block and each thread is responsible of a unique individual. Sub-populations are stored on shared memory of each block. Nevertheless, because inter-block communications are not possible on the CUDA architecture, the islands evolve independently in each block and migrations are performed
-asynchronously through the global memory. That is, after a given number of generations, selected individuals for migration from each island are copied to the GPU global memory part of the neighbor island and then to its shared memory to replace the worst individuals in the local population. The experiments are performed on three benchmark mathematical functions. During these experiments, the island sizes are varied from $2$ to $256$ individuals and island numbers from $1$ to $1024$. The maximum performance is achieved for high number of islands and increasing population sizes.\\
-
-The same strategy is also adopted by Tsuitsui and Fujimoto in~\cite{tsutsuiGAQAP} to implement a coarse grained genetic algorithm\index{Metaheuristics!genetic~algorithm} on GPU to solve the quadratic assignment problem (QAP). Initially, several sub-populations are created on CPU and transfered to the global memory. The sub-populations are organized in the global memory into blocks of $8$ individuals in such a way to allow coalesced memory access by the threads of the same thread block. Each sub-population is allocated to a single thread block in the GPU and transfered to the shared memory to start evolution. Population evaluation and transformation are done in parallel by the different threads of a given block. Migration is also done through the global memory.
-Experiments are performed on standard QAP benchmarks from the QAPLIB~\cite{burkard1991qaplib}. The GPU implementation reached speedups of $2.9\times$ to $12.6 \times$ compared to a single core implementation of a coarse grained genetic algorithm\index{Metaheuristics!genetic~algorithm} on a Intel i7 processor.\\
-
-Nowotniak \emph{et al.}~\cite{nowotniak} proposed a GPU-based implementation of quantum inspired genetic algorithm\index{Metaheuristics!genetic~algorithm} called QIGA. The used parallel model is a hierarchical model based on two levels: each thread in a block transforms a unique individual and a different population is assigned to each block. The algorithm is entirely run on GPU. A different instance of the QIGA is run on each block and the computations have been shared between 8 GPUs. This approach is very convenient to speedup the experimental process on metaheuristics
-that require several independent runs of the same algorithm in order to asses statistical efficiency. The populations are stored in the shared memory, the data matrix used for fitness evaluation is placed in read only constant memory and finally seeds for random numbers generated on the GPU are stored in the global memory.\\
-
-In coarse grained parallelism, the use of the per-block approach to implement
-the islands (one sub-population per thread block) is almost natural and it
-allows the use of shared memory to store the sub-populations. Fine grained
-parallel models for EAs (cellular EAs) have been used by many authors to implement
-parallel EAs on GPUs. Indeed, the fine grained parallelism of cEAs fits perfectly to
-the SIMD architecture of the GPU. \\
-
-Pinel \emph{et al.} in~\cite{pinel2012JPDC} developed a highly parallel synchronous cellular genetic algorithm\index{Metaheuristics!genetic~algorithm} (CGA), called GraphCell, to solve the independent task scheduling problem on GPU architectures. In CGAs, the population is arranged into a two dimensional toroidal grid where only neighboring solutions are allowed to interact with each other during the recombination step. In GraphCell, two recombination operators for CGA are especially designed to run efficiently on GPU. Indeed, instead of assigning a single thread to each solution of the population to perform the recombination, in GraphCell, a single thread is assigned to each task of each solution. Offsprings are created by independently modifying the assignment of some tasks in the current solution. Mainly, each thread chooses
-one neighboring solution in the grid as second parent using different selection strategies and assigns one task of the solution (first parent) to the machine for which the same task is assigned in the second parent. This
-way, the number of threads used for the recombination step is equal to: population size $\times$ size of the solution (number of tasks). This leads to a high number of threads used to accelerate the recombination operators especially when dealing with large instances of the problem. In addition to the recombination operators, the rest of the CGA steps are also parallelized on GPU (fitness evaluation, mutation and replacement).\\
-
-A similar work is proposed by Vidal and Alba in~\cite{albaCGAGPU} where a CGA using a toroidal grid is completely implemented on GPU. A direct mapping between the population and the GPU threads is done. At each step, several threads execute the same operations on each individual independently: initialization, computing the neighborhood, selection, crossover, mutation and evaluation. A synchronization is done for all threads to perform the replacement stage and form the next generation.
-All the data of the algorithm is placed on the global memory. Several experiments have been performed on a set of standard benchmark functions with different grid sizes ranging from $32^2$ to $512^2$.
-The speedups reached by the GPU implementation against the CPU version range from $5\times$ to $24\times$ and increases as the size of the population increases. However, the CPU implementation run faster than the GPU version for all the tested benchmarks when the size of the population is set to $32^2$. When the size of the population is too small, there is not enough computations to cover the overhead created by the call of kernel functions, CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication}s, synchronisation and acces to global memory. Finally, an interesting review on GPU parallel computation in bio-inspired algorithms is proposed by Arenas \emph{et al.} in~\cite{arenas2011}. \\
-
-\subsubsection*{Ant Colony Optimization}
-
-Ant colony optimization (ACO\index{Metaheuristics!ant~colony~optimization}) is another p-metaheuristic subject to parallelization on GPUs. State-of-the-art works on parallelizing ACO\index{Metaheuristics!ant~colony~optimization} focuses on accelerating the tour construction step performed by each ant by taking a task-based parallelism approach, with pheromone deposition on the CPU.
-
-In~\cite{cecilia}, Cecilia \emph{et al.} presented a GPU-based implementation of ACO\index{Metaheuristics!ant~colony~optimization} for TSP where the two steps (tour construction and pheromone update) are parallelized on the GPU. A data parallelism approach is used to enhance the performance of the tour construction step. The authors used two categories of artificial ants: queen ants associated with CUDA thread-blocks and worker ants associated with CUDA threads. A queen ant represents a simulated ant and worker ants collaborate with the queen ant to accelerate the decision about the next city to visit. The tour construction step of each queen ant is accelerated. Each worker ant maintains a history of the search in a tabu list containing a chronological
-ordering of the already visited cities. This memory is used to determine the feasible neighborhood. After all queen ants have constructed their tours, the pheromone trails are updated. For pheromone update, several GPU techniques are also used to increase the data bandwidth of the application mainly by the use of precalculated matrices that are easily updated by several threads (one thread per matrix entry). The achieved speedups are $21 \times$ for tour construction and $20 \times$ for pheromone updates.\\
-
-In another work, Tsuitsui \emph{et al.}~\cite{tsutsui} proposed a hybrid algorithm combining ACO\index{Metaheuristics!ant~colony~optimization} metaheuristic and Tabu search (TS)\index{Metaheuristics!tabu~search} implemented on GPU to solve the QAP. A solution of QAP is represented as a permutation of ${1,2,..,n}$ with $n$ being the size of the problem. The TS algorithm is based on the 2-opt neighborhood (swapping of two elements $(i,j)$ in the permutation). The authors pointed out that the move cost of each neighbor depends on the couple $(i,j)$. Two groups of moves are formed according to the move cost. In order to avoid thread divergence\index{GPU Challenges!thread~divergence} within the same warp, the neighborhood evaluation is parallelized in such a way to assign only moves of the same cost to each thread warp. This strategy is called MATA for Move-cost Adjusted Thread Assignment. Concerning the memory management\index{GPU Challenges!memory~management}, all the data of the ACO\index{Metaheuristics!
-ant~colony~optimization} (population, pheromone matrix), QAP matrices, TS tabu list, are placed on the global memory of the GPU. Shared memory is used only for working data common to all threads in a given block.
- All the
-steps of the hybrid algorithm ACO\index{Metaheuristics!ant~colony~optimization}-TS (ACO\index{Metaheuristics!ant~colony~optimization} initialization, pheromone update, construct solutions, applying TS) are implemented as kernel functions on the GPU. The GPU/CPU communication are only used to transfer the best so far solution in order to verify termination conditions. The experiments performed on standard QAP benchmarks showed that the GPU implementation using MATA obtained a speedup of $19 \times$ compared to the CPU implementation, against a speedup of only $5 \times$ when MATA is not used. \\
-
-\subsubsection*{Particle Swarm Optimization}
-In~\cite{zhou2009} Zhou and Tan proposed a full GPU implementation of a standard PSO algorithm. All the data is stored in global memory (velocities, positions, swarm population, etc). Only working data is copied to shared memory by each thread. The four steps of the PSO have been parallelized on GPU: Fitness evaluation of the swarm, update of local best and global best of each particle and update of velocity and position of each particle. The same strategy is used to parallelize the first and last steps: the evaluation of fitness functions is performed in parallel for each dimension by several threads. It is the case for the update of position and velocities of each particle: one dimension at a time is updated for the whole swarm by several threads. Random numbers needed for updating the velocities and positions for the whole PSO processes are generated on CPU at the starting of the algorithm and transferred to the GPU global memory. For the steps 2 and 3 (update of local
-best and global best of each particle), the GPU threads are mapped to the N particles of the swarm one to one. Experiments done on four benchmark functions show speedups ranging from $3.7 \times$ to $9.0 \times$ for swarm sizes ranging from $400$ to $2800$.\\
+A very basic local search algorithm starts with an initial
+solution generated either at random or by the mean of a specific
+heuristic and is based on two elementary components: the
+generation of neighborhood structures using an elementary move
+function and a selection function that determines which solution
+in the current neighborhood will replace the actual search point.
+The search continues as long as there is improvement and stops
+when there is no better solution in the current neighborhood. The
+exploration (or evaluation) of the different moves of a given
+neighborhood is done independently for each move. Thus, the
+easiest way to accelerate a local search algorithm is to
+parallelize the evaluation of the neighborhood (instruction-level
+parallelism). This is by far the most used scheme in the
+literature for parallelizing local search algorithms on GPUs.
+Nevertheless, small neighborhoods may lead to nonoptimal
+occupation of the CUDA threads which may lead in turn to an
+overhead due to the communication and memory latencies. Therefore,
+large neighborhoods are necessary for efficient implementation of
+local searches on GPUs.
+
+Luong et al.~\cite{luong2010large} proposed efficient
+mappings for large neighborhood structures on GPUs. In this work,
+three different neighborhoods are studied and mapped to the
+hierarchical GPU for binary problems. The three neighborhoods are
+based on the Hamming distance. The move operators used in
+the three neighborhoods consider Hamming distances of 1,
+2, and 3 (this consists on flipping the binary value of one, two,
+or three positions at a time in the candidate binary vector).
+In~\cite{luong2010large}, each thread is associated to a unique
+solution in the neighborhood. The addressed issue is how to
+efficiently map the different neighborhoods on the device memory,
+more explicitly, how to calculate the memory index of each
+solution associated to each CUDA thread's \textit{id}.
+%For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes in the candidate vector.
+The three neighborhoods are implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{metaheuristics!tabu search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes. % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
+
+In the same context, Deevacq et al.~\cite{audreyANT}
+proposed two parallelization strategies inspired by the multiwalk
+parallelization strategy, of a 3-opt iterated local
+search algorithm (ILS)
+over a CPU/GPU architecture. In the first strategy, each Local
+Search (LS) is associated to a unique CUDA thread and improves a
+unique solution by generating its neighborhood. The second
+strategy associates each solution to a CUDA block and the
+neighborhood exploration is parallelized on the block threads. In
+the first strategy, since several LS are executed on different
+solutions on each Multi Processor (MP), the data structures should
+be stored on the global memory, while the exploration of a single
+solution at a time in the second strategy allows the use of the
+shared memory to store the related data structures. The two
+strategies have been experimented on standard benchmarks of
+the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cities. The results indicate that increasing the number of solutions to be explored simultaneously improves the speedup in the two strategies, but at the same time it decreases final solution quality. The greater speedup factor reached by the second strategy is $6 \times$.
+
+The same strategy is used by Luong et al.
+in~\cite{luongMultiStart} to implement multistart parallel local
+search algorithms (a special case of the
+algorithmic-level parallel model where several homogeneous LS
+algorithms are used). The multistart model is combined with
+iteration-level
+parallelism: several LS algorithms are managed by the CPU and the
+neighborhood evaluation step of each algorithm is parallelized on
+the GPU (each GPU thread is associated with one neighbor and
+executes the same evaluation function kernel). The advantage of
+such a model is that it allows a high occupancy of the GPU
+threads. Nevertheless, memory management causes new issues due to the
+quantity of data to store and to communicate between CPU and
+GPU. A second proposition for implementing the same model on GPU
+consists of implementing the whole LS processes on GPU with each
+GPU thread being associated to a unique LS algorithm. This solves
+the communication issue encountered in the first model. In
+addition, a memory management strategy is proposed to improve the
+efficiency of the
+algorithmic-level model: texture memory is used to avoid memory latency
+due to uncoalesced memory accesses. The proposed approaches are
+implemented on the quadratic assignment problem (QAP) using CUDA.
+The acceleration rates obtained for the
+algorithmic-level with usage of texture memory rise from $7.8\times$ to
+$12\times$ (for different
+QAP benchmark sizes).
+
+Janiak et al.~\cite{Janiak_et_al_2008} implemented two
+algorithms for TSP and the flow-shop scheduling problem (FSP).
+These algorithms are based on a multistart tabu
+search model. Both of the
+algorithms exploit multicore CPU and GPU. A full parallelization
+on GPU is adopted using shader libraries where each thread is
+mapped with one tabu search.
+However, even though their experiments report that the use of GPU
+speedups the serial execution almost $16 \times$, the mapping of
+one thread with one tabu search
+requires a large number of local search algorithms to
+cover the memory access latency. The same mapping policy is adopted by Zhu et al. in~\cite{zhu_et_al_2008} (one thread is associated to one local search) solving the quadratic assignment problem but using the CUDA toolkit instead of shader libraries.
+
+Luong et al.~\cite{luong2012ppsn} proposed a GPU-based
+implementation of hybrid metaheuristics on heterogeneous parallel
+architectures (multicore CPU coupled to one GPU). The challenge
+of using such a heterogeneous architecture is how to distribute
+tasks between the CPU cores and the GPU in such a way to have
+optimal performances. Among the three traditional parallel models
+(solution-level,
+iteration-level
+and algorithmic-level), the authors point out that the most convenient
+model for the considered heterogeneous architecture is a hybrid
+model combining
+iteration-level
+and algorithmic-level models. Several CPU threads execute several instances
+of the same S-metaheuristic in parallel while the GPU device is
+associated to one CPU core and used to accelerate the
+neighborhood calculation of several S-metaheuristics at the same
+time.
+ In order to efficiently exploit the remaining CPU cores, a load-balancing heuristic is also proposed in order to decide on the number of additional
+S-metaheuristics to launch on the remaining CPU cores relative to the efficiency of the GPU calculations. The proposed approach is applied to the QAP using several instances of the Fast Ant Colony Algorithm (FANT)~\cite{taillardFant}.
+
+All the previously noted works exploit the same parallel models
+used on CPUs based on the task parallelism. A different
+implementation approach is proposed by Paul in~\cite{gerald2012}
+to implement a simulated
+annealing (SA) algorithm
+for the QAP on GPUs. Indeed, the author used a preinitialized
+matrix \emph{delta} in which the incremental evaluation of simple
+swap moves are calculated and stored relative to the initial
+permutation $p$. For the GPU implementation, the authors used the
+parallel implementation of neighborhood exploration. The
+time-consuming tasks in the SA-matrix are identified by the
+authors as updating the matrix and passing through it to select
+the next accepted move. To initialize the delta matrix, several
+threads from different blocks explore different segments of the
+matrix (different moves) at the same time. In order to select the
+next accepted swap, several threads are also used. Starting from
+the last move, a group of threads explores different subsets of
+the delta matrix. The shared memory is used to preload all the
+necessary elements for a given group of threads responsible for
+the updating of the delta matrix. The main difference in this work
+compared to the previous works resides in the introduction of a
+data parallelism using the precalculated delta matrix. The use of
+this matrix allows the increase in the number of threads
+involved in the evaluation of a single move. Experimentations are
+done on standard QAP instances from the
+QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are
+achieved by the GPU implementation compared
+to the same sequential implementation on CPU using SA-matrix.
+
+\subsection[Implementing population-based metaheuristics\hfill\break on GPUs]{Implementing population-based metaheuristics on GPUs}
+
+State-of-the-art works dealing with the implementation of
+p-metaheuristics on GPUs generally rely on parallel models and
+research efforts done for parallelizing different classes of
+p-metaheuristics over different types of parallel architectures:
+supercomputers, clusters, and computational grids. Three main
+classes of p-metaheuristics are considered in this section:
+evolutionary
+algorithms (EAs), ant
+colony optimization
+(ACO), and particle
+swarm optimization\index{metaheuristics!particle swarm
+optimization} (PSO).
+\clearpage
+\subsubsection*{Evolutionary algorithms}
+
+Traditional parallel models for EAs are classified in three main
+classes: coarse-grain models using several subpopulations
+(islands), master-slave models used for the parallelization of CPU
+intensive steps (evaluation and transformation), and cellular
+models based on the use of one population disposed (generally) on
+a toroidal grid.
+
+The three traditional models have been implemented on GPUs by several researchers for different
+optimization problems. The main chalenges to be raised when implementing the traditional models on GPUs concern (1) the saturation of the GPU in order to cover memory latency by calculations, and (2) efficent usage of the hierarchical GPU memories.
+
+In~\cite{kannan}, Kannan and Ganji present a CUDA implementation
+of the drug discovery application Autodock (molecular docking
+application). Autodock uses a genetic
+algorithm to find optimal
+docking positions of a ligand to a protein. The most
+time-consuming task in Autodock is the fitness function
+evaluation. The fitness function used for a docking problem
+consists of calculating the energy of the ligand-protein complex
+(sum of intermolecular energies). The authors explore two
+different approaches to evaluate the fitness function on GPU. In
+the first approach, each GPU thread calculates the energy function
+of a single individual. This approach requires the use of
+large-sized populations to saturate the GPU (thousands of
+individuals). In the second approach each individual is associated
+with one thread block. The evaluation of the energy function is
+performed by the threads of the same block. This allows the use of
+medium population sizes (hundreds of individuals) and the
+acceleration of a single fitness evaluation. Another great
+advantage of the per block approach resides in the use of shared
+memory instead of global memory to store all the information
+related to each individual. The obtained speedups range from $10
+\times$ to $47 \times$ for population sizes
+ranging from $50$ to $10000$.
+
+Maitre et al.~\cite{maitre2009} also exploited the
+master-slave parallelism of EAs on GPUs using EASEA. EASEA is a
+C-like metalanguage for easy development of EAs. The user writes a
+description of the problem-specific components (fitness function,
+problem representation, etc) in EASEA. The code is then compiled
+to obtain a ready-to-use evolutionary
+algorithm. The EASEA
+compiler uses genetic
+algorithm LIB and EO
+Libraries to produce C++ or JAVA written EA codes.
+In~\cite{maitre2009}, the authors proposed an extension of EASEA
+to produce CUDA code from the EASEA files. This extension has been
+used to generate a master-slave parallel EA in which the
+sequential algorithm is maintained on CPU and the population is
+sent to GPU for evaluation. Two problems have been considered
+during the experiments: a benchmark mathematical function and a
+real problem (molecular structure prediction). In order to
+maximize the GPU occupation, very large populations are used (from
+$2000$ to $20000$). Even though transferring such large
+populations from the CPU to the GPU device memory at every generation is very costly, the authors report important speedups on the two problems on a GTX260 card: $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management of the populations on GPU.
+
+The master-slave model is efficient when the fitness function is
+highly time intensive. Nevertheless, it requires the use of
+large-sized populations in order to saturate the GPU unless the
+per-block is used (one individual perblock) when the acceleration
+of the fitness function itself is possible. The use of many
+subpopulations of medium sizes is another way to obtain a maximum
+occupation of the GPU. This is coarse-grained parallelism (island
+model).
+
+The coarse-grained model is used by Pospichal et al.
+in~\cite{pospichal10} to implement a parallel genetic
+algorithm on GPU. In this
+work the entire genetic
+algorithm is implemented
+on GPU. This choice is motivated by the overhead engendered by the
+CPU/GPU communication
+when only population evaluation is performed on GPU. Each
+population island is mapped with a CUDA thread block and each
+thread is responsible for a unique individual. Subpopulations are
+stored on shared memory of each block. Nevertheless, because
+interblock communications are not possible on the CUDA
+architecture, the islands evolve independently in each block, and
+migrations are performed
+asynchronously through the global memory. That is, after a given number of generations, selected individuals for migration from each island are copied to the GPU global memory part of the neighbor island and then to its shared memory to replace the worst individuals in the local population. The experiments are performed on three benchmark mathematical functions. During these experiments, the island sizes are varied from $2$ to $256$ individuals and island numbers from $1$ to $1024$. The maximum performance is achieved for high number of islands and increasing population sizes.
+
+The same strategy is also adopted by Tsutsui and Fujimoto
+in~\cite{tsutsuiGAQAP} to implement a coarse-grained genetic
+algorithm on GPU to solve
+the QAP. Initially, several subpopulations are created on CPU and
+transferred to the global memory. The subpopulations are organized
+in the global memory into blocks of $8$ individuals in such a way
+to allow coalesced memory access by the threads of the same thread
+block. Each sub-population is allocated to a single thread block
+in the GPU and transfered to the shared memory to start evolution.
+Population evaluation and transformation are done in parallel by
+the different threads of a given block. Migration is also done
+through the global memory. Experiments are performed on standard
+QAP benchmarks from the QAPLIB~\cite{burkard1991qaplib}. The GPU
+implementation reached speedups of $2.9\times$ to $12.6 \times$
+compared to a single core implementation of a coarse-grained
+genetic algorithm on a
+Intel i7 processor.
+
+Nowotniak and Kucharski~\cite{nowotniak} proposed a GPU-based
+implementation of a Quantum Inspired Genetic Algorithm (QIGA). The
+parallel model used is a hierarchical model based on two levels: each
+thread in a block transforms a unique individual and a different
+population is assigned to each block. The algorithm is run
+entirely on GPU. A different instance of the QIGA is run on each
+block and the computations have been shared between 8 GPUs. This
+approach is very convenient to speed up the experimental process
+on metaheuristics that require several independent runs of the
+same algorithm in order to assess statistical efficiency. The
+populations are stored in the shared memory, the data matrix used
+for fitness evaluation is placed in read only constant memory, and
+finally seeds for random numbers generated on the GPU are stored
+in the global memory.
+
+In coarse-grained parallelism, the use of the per-block approach
+to implement the islands (one subpopulation per thread block) is
+almost natural and it allows the use of shared memory to store
+the subpopulations. Fine-grained parallel models for EAs (cellular
+EAs) have been used by many authors to implement parallel EAs on
+GPUs. Indeed, the fine-grained parallelism of EAs fits perfectly
+to the SIMD architecture of the GPU.
+
+Pinel et al. in~\cite{pinel2012JPDC} developed a highly
+parallel synchronous cellular genetic
+algorithm (CGA), called
+GraphCell, to solve the independent task scheduling problem on GPU
+architectures. In CGAs, the population is arranged into a
+two-dimensional toroidal grid where only neighboring solutions are
+allowed to interact with each other during the recombination
+step. In GraphCell, two recombination operators for CGA are
+especially designed to run efficiently on GPU. Indeed, instead of
+assigning a single thread to each solution of the population to
+perform the recombination, in GraphCell, a single thread is
+assigned to each task of each solution. Offsprings are created by
+independently modifying the assignment of some tasks in the
+current solution. Mainly, each thread chooses one neighboring
+solution in the grid as second parent using different selection
+strategies and assigns one task of the solution (first parent)
+to the machine for which the same task is assigned in the second
+parent. This way, the number of threads used for the
+recombination step is equal to population size $\times$ size of
+the solution (number of tasks). This leads to a high number of
+threads used to accelerate the recombination operators especially
+when dealing with large instances of the problem. In addition to
+the recombination operators, the rest of the CGA steps are also
+parallelized on GPU (fitness evaluation, mutation, and
+replacement).
+
+A similar work is proposed by Vidal and Alba in~\cite{albaCGAGPU}
+where a CGA using a toroidal grid is completely implemented on
+GPU. A direct mapping between the population and the GPU threads
+is done. At each step, several threads execute the same operations
+on each individual independently: initialization, computing the
+neighborhood, selection, crossover, mutation, and evaluation. A
+synchronization is done for all threads to perform the replacement
+stage and form the next generation. All the data of the algorithm
+is placed on the global memory. Several experiments have been
+performed on a set of standard benchmark functions with different
+grid sizes ranging from $32^2$ to $512^2$. The speedups reached by
+the GPU implementation against the CPU version range from
+$5\times$ to $24\times$ and increase as the size of the population
+increases. However, the CPU implementation runs faster than the GPU
+version for all the tested benchmarks when the size of the
+population is set to $32^2$. When the size of the population is
+too small, there are not enough computations to cover the overhead
+created by the call of kernel functions, CPU/GPU
+communications,
+synchronization, and access to global memory. Finally, an
+interesting review on GPU parallel computation in bio-inspired
+algorithms is proposed by Arenas et al.
+in~\cite{arenas2011}.
+
+\subsubsection*{Ant colony optimization}
+
+Ant colony optimization
+(ACO) is another
+p-metaheuristic subject to parallelization on GPUs.
+State-of-the-art works on parallelizing
+ACO focus on
+accelerating the tour construction step performed by each ant by
+taking a task-based parallelism approach, with pheromone
+deposition on the CPU.
+
+In~\cite{cecilia}, Cecilia et al. present a GPU-based
+implementation of
+ACO for TSP where
+the two steps (tour construction and pheromone update) are
+parallelized on the GPU. A data parallelism approach is used to
+enhance the performance of the tour construction step. The
+authors use two categories of artificial ants: queen ants
+associated with CUDA thread-blocks and worker ants associated with
+CUDA threads. A queen ant represents a simulated ant and worker
+ants collaborate with the queen ant to accelerate the decision
+about the next city to visit. The tour construction step of each
+queen ant is accelerated. Each worker ant maintains a history of
+the search in a tabu list containing a chronological ordering of
+the already visited cities. This memory is used to determine the
+feasible neighborhoods. After all queen ants have constructed
+their tours, the pheromone trails are updated. For pheromone
+update, several GPU techniques are also used to increase the data
+bandwidth of the application mainly by the use of precalculated
+matrices that are easily updated by several threads (one thread
+per matrix entry). The achieved speedups are $21 \times$ for tour
+construction and $20 \times$ for pheromone updates.
+
+In another work, Tsutsui and Fujimoto~\cite{tsutsui} propose a
+hybrid algorithm combining
+ACO metaheuristic
+and Tabu Search (TS) implemented
+on GPU to solve the QAP. A solution of QAP is represented as a
+permutation of ${1,2,..,n}$ with $n$ being the size of the
+problem. The TS algorithm is based on the 2-opt neighborhood
+(swapping of two elements $(i,j)$ in the permutation). The authors
+point out that the move cost of each neighbor depends on the
+couple $(i,j)$. Two groups of moves are formed according to the
+move cost. In order to avoid thread divergence\index{GPU!thread divergence} within the same warp, the
+neighborhood evaluation is parallelized in such a way to assign
+only moves of the same cost to each thread warp. This strategy is
+called MATA for Move-cost Adjusted Thread Assignment. Concerning
+the memory management\index{GPU!memory management}, all
+the data of the ACO\index{metaheuristics!ant colony optimization}
+(population, pheromone matrix), QAP matrices, and tabu list are
+placed on the global memory of the GPU. Shared memory is used only
+for working data common to all threads in a given block.
+ All the
+steps of the hybrid algorithm
+ACO-TS
+(ACO initialization,
+pheromone update, construct solutions, applying TS) are
+implemented as kernel functions on the GPU. The GPU/CPU
+communications are only used to transfer the best-so-far solution
+in order to verify termination conditions. The experiments
+performed on standard QAP benchmarks showed that the GPU
+implementation using MATA obtained a speedup of $19 \times$
+compared to the CPU implementation, compared with a speedup of
+only $5 \times$
+when MATA is not used.
+
+\subsubsection*{Particle swarm optimization}
+In~\cite{zhou2009} Zhou and Tan propose a full GPU implementation
+of a standard PSO algorithm. All the data is stored in global
+memory (velocities, positions, swarm population, etc). Only
+working data is copied to shared memory by each thread. The four
+steps of the PSO have been parallelized on GPU: fitness evaluation
+of the swarm, update of local best and global best of each
+particle, and update of velocity and position of each particle.
+The same strategy is used to parallelize the first and last
+steps: the evaluation of fitness functions is performed in
+parallel for each dimension by several threads. It is the case
+for the update of position and velocities of each particle: one
+dimension at a time is updated for the whole swarm by several
+threads. Random numbers needed for updating the velocities and
+positions for the whole PSO processes are generated on CPU at the
+starting of the algorithm and transferred to the GPU global
+memory. For the steps 2 and 3 (update of local best and global
+best of each particle), the GPU threads are mapped to the \emph{N}
+particles of the swarm one to one. Experiments done on four
+benchmark functions show speedups ranging from $3.7 \times$ to
+$9.0 \times$ for swarm sizes
+ranging from $400$ to $2800$.