new

[book_gpu.git] / BookGPU / Chapters / chapter9 / ch9.tex
diff --git a/BookGPU/Chapters/chapter9/ch9.tex b/BookGPU/Chapters/chapter9/ch9.tex

index d10daa6a78e92c9934a28e485fbbc88a527735a2..0ebc4464899aa36c9f29dc45822ff385b20a2adb 100644 (file)
--- a/BookGPU/Chapters/chapter9/ch9.tex
+++ b/BookGPU/Chapters/chapter9/ch9.tex
@@ -36,7 +36,7 @@ remarks are given in Section~\ref{ch8:conclusion}
  
  Combinatorial optimization (CO) is a branch of applied and discrete mathematics.
  It consists in finding optimal configuration(s) among a finite set of possible configurations
  
  Combinatorial optimization (CO) is a branch of applied and discrete mathematics.
  It consists in finding optimal configuration(s) among a finite set of possible configurations
-(or solutions) of a given combinatorial optimization problem (COP).  The set of all possible solutions noted $S$ is called solution space or search space. Each solution  in $S$ is defined by its real cost calculated by an objective function. COPs are generally defined as follows~\cite{blumMeta}:\\  %(Definition~\ref{def:cops})
+(or solutions) of a given combinatorial optimization problem (COP).  The set of all possible solutions noted $S$ is called solution space or search space. Each solution  in $S$ is defined by its real cost calculated by an objective function. COPs are generally defined as follows~\cite{blumMeta}:  %(Definition~\ref{def:cops})
  
  
  %\begin{minipage}{0.5\linewidth}
  
  
  %\begin{minipage}{0.5\linewidth}
@@ -44,7 +44,7 @@ A combinatorial problem $P=(S,f)$ can be defined by
  \begin{itemize}
  \item a set of decision variables $X$,
  \item an objective function $f$ to optimize (minimize or maximize) over the set $S$,
  \begin{itemize}
  \item a set of decision variables $X$,
  \item an objective function $f$ to optimize (minimize or maximize) over the set $S$,
-\item subject to constraints on the decision variables.\\
+\item subject to constraints on the decision variables.
  \end{itemize}
  %\end{minipage}
  
  \end{itemize}
  %\end{minipage}
  
@@ -73,7 +73,7 @@ quality solutions in reasonable computation time compared to exact
  methods but with no guarantee to find optimal or even bounded
  solutions. This is due to the nature of the search process adopted
  by these approaches which consists of performing a
  methods but with no guarantee to find optimal or even bounded
  solutions. This is due to the nature of the search process adopted
  by these approaches which consists of performing a
-search in a subset of the whole search space.\\
+search in a subset of the whole search space.
  
  Regarding the number of solutions considered at each iteration in
  the search process, two classes of metaheuristics can be
  
  Regarding the number of solutions considered at each iteration in
  the search process, two classes of metaheuristics can be
@@ -90,7 +90,7 @@ annealing}\index{metaheuristics!simulated annealing}~\cite{Kirkpatrick1983SA},
  \emph{tabu search}~\cite{Glover1989TS}, \emph{iterated local
  search\index{metaheuristics!iterated local
  search}}~\cite{stutzle2006ILSforQAP}, and \emph{variable
  \emph{tabu search}~\cite{Glover1989TS}, \emph{iterated local
  search\index{metaheuristics!iterated local
  search}}~\cite{stutzle2006ILSforQAP}, and \emph{variable
-neighborhood search}~\cite{HansenMladenovic1997VNS}.\\
+neighborhood search}~\cite{HansenMladenovic1997VNS}.
  
  Unlike s-metaheuristics, p-metaheuristics start with a population
  of solutions and implement an iterative process that evolves the
  
  Unlike s-metaheuristics, p-metaheuristics start with a population
  of solutions and implement an iterative process that evolves the
@@ -105,7 +105,7 @@ Optimization problems, whether real-life or academic, are more
  often NP-hard and CPU time and/or memory consuming. Metaheuristics
  allow the significant reduction of the computational time of the search
  process but remain time-consuming particularly when it comes
  often NP-hard and CPU time and/or memory consuming. Metaheuristics
  allow the significant reduction of the computational time of the search
  process but remain time-consuming particularly when it comes
-dealing with large-sized problems. \\
+dealing with large-sized problems. 
  
  The use of parallelism in the design of metaheuristics is a relevant
  approach that is widely adopted by the combinatorial optimization
  
  The use of parallelism in the design of metaheuristics is a relevant
  approach that is widely adopted by the combinatorial optimization
@@ -130,7 +130,7 @@ high-performance computing.
  
  From the granularity of parallelism point of view, three major parallel
  models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{metaheuristics!algorithmic-level parallelism},
  
  From the granularity of parallelism point of view, three major parallel
  models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{metaheuristics!algorithmic-level parallelism},
-\emph{iteration-level}\index{metaheuristics!iteration-level parallelism}, and \emph{solution-level} as illustrated in Figure~\ref{ch8:fig:paraMeta}. \\
+\emph{iteration-level}\index{metaheuristics!iteration-level parallelism}, and \emph{solution-level} as illustrated in Figure~\ref{ch8:fig:paraMeta}. 
  
  \begin{figure}[h!]
  \centerline{\includegraphics[width=0.6\textwidth]{Chapters/chapter9/figures/paraMeta.pdf}}
  
  \begin{figure}[h!]
  \centerline{\includegraphics[width=0.6\textwidth]{Chapters/chapter9/figures/paraMeta.pdf}}
@@ -187,7 +187,7 @@ consuming. Unlike the two previous parallel models, the
  solution-level\index{metaheuristics!solution-level parallelism}
  parallel model is problem-dependent.}
  \end{itemize}
  solution-level\index{metaheuristics!solution-level parallelism}
  parallel model is problem-dependent.}
  \end{itemize}
-
+\clearpage
  \section{Challenges for the design of GPU-based metaheuristics}
  \label{ch8:sec:challenges}
  
  \section{Challenges for the design of GPU-based metaheuristics}
  \label{ch8:sec:challenges}
  
@@ -226,12 +226,12 @@ threads and the corresponding metaheuristic elements (one neighbor
  per thread, one individual per thread, single population per
  threads block, one ant per thread, etc.) must be defined to ensure
  a maximum occupancy of the GPU and
  per thread, one individual per thread, single population per
  threads block, one ant per thread, etc.) must be defined to ensure
  a maximum occupancy of the GPU and
-to cover CPU/GPU communication\index{GPU!CPU/GPU communication} and memory access times.\\
+to cover CPU/GPU communication\index{GPU!CPU/GPU communication} and memory access times.
  
  According to the used metaheuristic and to the handled problem, the data
  values may have different types and different ranges of their values. The data
  types should then be chosen carefully and the ranges of the data values should
  
  According to the used metaheuristic and to the handled problem, the data
  values may have different types and different ranges of their values. The data
  types should then be chosen carefully and the ranges of the data values should
-be analyzed to minimize the amount of occupied memory space.\\
+be analyzed to minimize the amount of occupied memory space.
  
  In addition to the size and latency of GPU memory issues, the
  memory access pattern is another important issue to be dealt with
  
  In addition to the size and latency of GPU memory issues, the
  memory access pattern is another important issue to be dealt with
@@ -294,7 +294,7 @@ the different threads degrading the performance of the application
  in terms of execution time. The challenge here is then to revisit
  the traditional irregular metaheuristic codes to eliminate these
  divergences.
  in terms of execution time. The challenge here is then to revisit
  the traditional irregular metaheuristic codes to eliminate these
  divergences.
-
+\clearpage
  \subsection{Task distribution and CPU/GPU communication}
  
  The performance of GPU-based metaheuristics in terms of execution
  \subsection{Task distribution and CPU/GPU communication}
  
  The performance of GPU-based metaheuristics in terms of execution
@@ -313,7 +313,7 @@ perform a careful analysis on the serial code of the
  metaheuristic, identify the compute-intensive tasks (e.g.,
  exploration of the neighborhood, evaluation of individuals), and
  then offload them to the GPU, while the remaining
  metaheuristic, identify the compute-intensive tasks (e.g.,
  exploration of the neighborhood, evaluation of individuals), and
  then offload them to the GPU, while the remaining
-tasks still run on the CPU in a serial way. \\
+tasks still run on the CPU in a serial way. 
  
  The CPU/GPU communication is done through the global
  memory which is a slow memory making the memory transfer between
  
  The CPU/GPU communication is done through the global
  memory which is a slow memory making the memory transfer between
@@ -328,7 +328,7 @@ therefore, the whole execution time of the metaheuristic.
  After more than two decades of research by the combinatorial optimisation
  community devoted to developing adequate parallel metaheuristics for different types of
  parallel architectures (clusters, supercomputers and grids), the actual developement
  After more than two decades of research by the combinatorial optimisation
  community devoted to developing adequate parallel metaheuristics for different types of
  parallel architectures (clusters, supercomputers and grids), the actual developement
-of General Perpose GPU (GPGPU) brings new challenges for parallel metaheuristics on SIMD architectures.\\
+of General Perpose GPU (GPGPU) brings new challenges for parallel metaheuristics on SIMD architectures.
  
  The first works on metaheuristic algorithms implemented on GPUs
  started on old graphics cards before the appearance of modern GPUs
  
  The first works on metaheuristic algorithms implemented on GPUs
  started on old graphics cards before the appearance of modern GPUs
@@ -339,7 +339,7 @@ of EAs on graphics processing cards and the work by Catala et al. in~\cite{catal
  is implemented on old GPU architectures. Yu et al.~\cite{yu2005} and
  Li et al.~\cite{li2007} proposed a full parallelization of genetic
  algorithms  on old GPU architectures using
  is implemented on old GPU architectures. Yu et al.~\cite{yu2005} and
  Li et al.~\cite{li2007} proposed a full parallelization of genetic
  algorithms  on old GPU architectures using
-shader libraries based on Direct3D and OpenGL.\\
+shader libraries based on Direct3D and OpenGL.
  
  Such architectures are based on preconfigured pipelined stages
  used to accelerate the transformation of 3D geometric primitives
  
  Such architectures are based on preconfigured pipelined stages
  used to accelerate the transformation of 3D geometric primitives
@@ -377,7 +377,7 @@ Nevertheless, small neighborhoods may lead to  nonoptimal
  occupation of the CUDA threads which may lead in turn to an
  overhead due to the communication and memory latencies. Therefore,
  large neighborhoods are necessary for efficient implementation of
  occupation of the CUDA threads which may lead in turn to an
  overhead due to the communication and memory latencies. Therefore,
  large neighborhoods are necessary for efficient implementation of
-local searches on GPUs.\\
+local searches on GPUs.
  
  Luong et al.~\cite{luong2010large} proposed efficient
  mappings for large neighborhood structures on GPUs. In this work,
  
  Luong et al.~\cite{luong2010large} proposed efficient
  mappings for large neighborhood structures on GPUs. In this work,
@@ -393,10 +393,10 @@ efficiently map the different neighborhoods on the device memory,
  more explicitly, how to calculate the memory index of each
  solution associated to each CUDA thread's \textit{id}.
  %For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes  in the candidate vector.
  more explicitly, how to calculate the memory index of each
  solution associated to each CUDA thread's \textit{id}.
  %For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes  in the candidate vector.
-The three neighborhoods are  implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{metaheuristics!tabu search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes.\\ % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
+The three neighborhoods are  implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{metaheuristics!tabu search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes. % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
  
  In the same context, Deevacq et al.~\cite{audreyANT}
  
  In the same context, Deevacq et al.~\cite{audreyANT}
-proposed two parallelization strategies inspired by the multi-walk
+proposed two parallelization strategies inspired by the multiwalk
  parallelization strategy, of a 3-opt iterated local
  search algorithm (ILS)
  over a CPU/GPU architecture. In the first strategy, each Local
  parallelization strategy, of a 3-opt iterated local
  search algorithm (ILS)
  over a CPU/GPU architecture. In the first strategy, each Local
@@ -410,7 +410,7 @@ be stored on the global memory, while the exploration of a single
  solution at a time in the second strategy allows the use of the
  shared memory to store the related data structures. The two
  strategies have been experimented on standard benchmarks of
  solution at a time in the second strategy allows the use of the
  shared memory to store the related data structures. The two
  strategies have been experimented on standard benchmarks of
-the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cities. The results indicate that increasing the number of solutions to be explored simultaneously improves the speedup in the two strategies, but at the same time it decreases final solution quality. The greater speedup factor reached by the second strategy is $6 \times$.\\
+the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cities. The results indicate that increasing the number of solutions to be explored simultaneously improves the speedup in the two strategies, but at the same time it decreases final solution quality. The greater speedup factor reached by the second strategy is $6 \times$.
  
  The same strategy is used by Luong et al.
  in~\cite{luongMultiStart} to implement multistart parallel local
  
  The same strategy is used by Luong et al.
  in~\cite{luongMultiStart} to implement multistart parallel local
@@ -437,7 +437,7 @@ implemented on the quadratic assignment problem (QAP) using CUDA.
  The acceleration rates obtained for the
  algorithmic-level with usage of texture memory rise from $7.8\times$ to
  $12\times$ (for different
  The acceleration rates obtained for the
  algorithmic-level with usage of texture memory rise from $7.8\times$ to
  $12\times$ (for different
-QAP benchmark sizes). \\
+QAP benchmark sizes). 
  
  Janiak et al.~\cite{Janiak_et_al_2008} implemented two
  algorithms for TSP and the flow-shop scheduling problem (FSP).
  
  Janiak et al.~\cite{Janiak_et_al_2008} implemented two
  algorithms for TSP and the flow-shop scheduling problem (FSP).
@@ -450,7 +450,7 @@ However, even though their experiments report that the use of GPU
  speedups the serial execution almost $16 \times$, the mapping of
  one thread with one tabu search
  requires a large number of local search algorithms to
  speedups the serial execution almost $16 \times$, the mapping of
  one thread with one tabu search
  requires a large number of local search algorithms to
-cover the memory access latency. The same mapping policy is adopted by Zhu et al. in~\cite{zhu_et_al_2008} (one thread is associated to one local search) solving the quadratic assignment problem but using the CUDA toolkit instead of shader libraries.\\
+cover the memory access latency. The same mapping policy is adopted by Zhu et al. in~\cite{zhu_et_al_2008} (one thread is associated to one local search) solving the quadratic assignment problem but using the CUDA toolkit instead of shader libraries.
  
  Luong et al.~\cite{luong2012ppsn} proposed a GPU-based
  implementation of hybrid metaheuristics on heterogeneous parallel
  
  Luong et al.~\cite{luong2012ppsn} proposed a GPU-based
  implementation of hybrid metaheuristics on heterogeneous parallel
@@ -470,7 +470,7 @@ associated to one CPU core and  used to accelerate the
  neighborhood calculation of several S-metaheuristics at the same
  time.
   In order to efficiently exploit the remaining CPU cores, a load-balancing heuristic is also proposed in order to decide on the number of additional
  neighborhood calculation of several S-metaheuristics at the same
  time.
   In order to efficiently exploit the remaining CPU cores, a load-balancing heuristic is also proposed in order to decide on the number of additional
-S-metaheuristics to launch on the remaining CPU cores relative to the efficiency of the GPU calculations. The proposed approach is applied to the QAP using several instances of the Fast Ant Colony Algorithm (FANT)~\cite{taillardFant}. \\
+S-metaheuristics to launch on the remaining CPU cores relative to the efficiency of the GPU calculations. The proposed approach is applied to the QAP using several instances of the Fast Ant Colony Algorithm (FANT)~\cite{taillardFant}. 
  
  All the previously noted works  exploit the same parallel models
  used on CPUs based on the task parallelism. A different
  
  All the previously noted works  exploit the same parallel models
  used on CPUs based on the task parallelism. A different
@@ -499,9 +499,9 @@ involved in the evaluation of a single move. Experimentations are
  done on standard QAP instances from the
  QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are
  achieved by the GPU implementation compared
  done on standard QAP instances from the
  QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are
  achieved by the GPU implementation compared
-to the same sequential implementation on CPU using SA-matrix.\\
+to the same sequential implementation on CPU using SA-matrix.
  
  
-\subsection{Implementing population-based metaheuristics on GPUs}
+\subsection[Implementing population-based metaheuristics\hfill\break on GPUs]{Implementing population-based metaheuristics on GPUs}
  
  State-of-the-art works dealing with the implementation of
  p-metaheuristics on GPUs generally rely on parallel models and
  
  State-of-the-art works dealing with the implementation of
  p-metaheuristics on GPUs generally rely on parallel models and
@@ -515,7 +515,7 @@ colony optimization
  (ACO), and particle
  swarm optimization\index{metaheuristics!particle swarm
  optimization} (PSO).
  (ACO), and particle
  swarm optimization\index{metaheuristics!particle swarm
  optimization} (PSO).
-
+\clearpage
  \subsubsection*{Evolutionary algorithms}
  
  Traditional parallel models for EAs are classified in three main
  \subsubsection*{Evolutionary algorithms}
  
  Traditional parallel models for EAs are classified in three main
@@ -526,7 +526,7 @@ models based on the use of one population disposed (generally) on
  a toroidal grid.
  
  The three traditional models have been implemented on GPUs by several researchers for different
  a toroidal grid.
  
  The three traditional models have been implemented on GPUs by several researchers for different
-optimization problems. The main chalenges to be raised when implementing the traditional models on GPUs concern (1) the saturation of the GPU in order to cover memory latency by calculations, and (2) efficent usage of the hierarchical GPU  memories.\\
+optimization problems. The main chalenges to be raised when implementing the traditional models on GPUs concern (1) the saturation of the GPU in order to cover memory latency by calculations, and (2) efficent usage of the hierarchical GPU  memories.
  
  In~\cite{kannan}, Kannan and Ganji present a CUDA implementation
  of the drug discovery application Autodock (molecular docking
  
  In~\cite{kannan}, Kannan and Ganji present a CUDA implementation
  of the drug discovery application Autodock (molecular docking
@@ -550,7 +550,7 @@ advantage of the per block approach resides in the use of shared
  memory instead of global memory to store all the information
  related to each individual. The obtained speedups range from  $10
  \times$ to $47 \times$ for population sizes
  memory instead of global memory to store all the information
  related to each individual. The obtained speedups range from  $10
  \times$ to $47 \times$ for population sizes
-ranging from $50$ to $10000$. \\
+ranging from $50$ to $10000$. 
  
  Maitre et al.~\cite{maitre2009} also exploited the
  master-slave parallelism of EAs on GPUs using EASEA. EASEA is a
  
  Maitre et al.~\cite{maitre2009} also exploited the
  master-slave parallelism of EAs on GPUs using EASEA. EASEA is a
@@ -571,7 +571,7 @@ during the experiments: a benchmark mathematical function and a
  real problem (molecular structure prediction). In order to
  maximize the GPU occupation, very large populations are used (from
  $2000$ to $20000$). Even though transferring such large
  real problem (molecular structure prediction). In order to
  maximize the GPU occupation, very large populations are used (from
  $2000$ to $20000$). Even though transferring such large
-populations from the CPU to the GPU device memory at every generation is very costly, the authors report important speedups on the two problems on a GTX260 card:  $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management of the populations on GPU.\\
+populations from the CPU to the GPU device memory at every generation is very costly, the authors report important speedups on the two problems on a GTX260 card:  $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management of the populations on GPU.
  
  The master-slave  model is efficient when the fitness function is
  highly time intensive. Nevertheless, it requires the use of
  
  The master-slave  model is efficient when the fitness function is
  highly time intensive. Nevertheless, it requires the use of
@@ -580,7 +580,7 @@ per-block is used (one individual perblock) when the acceleration
  of the fitness function itself is possible. The use of many
  subpopulations of medium sizes is another way to obtain a maximum
  occupation of the GPU. This is coarse-grained parallelism (island
  of the fitness function itself is possible. The use of many
  subpopulations of medium sizes is another way to obtain a maximum
  occupation of the GPU. This is coarse-grained parallelism (island
-model).\\
+model).
  
  The coarse-grained model is  used by Pospichal et al.
  in~\cite{pospichal10} to implement a parallel genetic
  
  The coarse-grained model is  used by Pospichal et al.
  in~\cite{pospichal10} to implement a parallel genetic
@@ -596,7 +596,7 @@ stored on shared memory of each block. Nevertheless, because
  interblock communications are not possible on the CUDA
  architecture, the islands evolve independently in each block, and
  migrations are performed
  interblock communications are not possible on the CUDA
  architecture, the islands evolve independently in each block, and
  migrations are performed
-asynchronously through the global memory. That is, after a given number of generations, selected individuals for migration from each island are copied to the GPU global memory part of the neighbor island and then to its shared memory to replace the worst individuals in the local population. The experiments are performed on three benchmark mathematical functions. During these experiments, the island sizes are varied from $2$ to $256$ individuals and island numbers from $1$ to $1024$. The maximum performance is achieved for high number of islands and increasing population sizes.\\
+asynchronously through the global memory. That is, after a given number of generations, selected individuals for migration from each island are copied to the GPU global memory part of the neighbor island and then to its shared memory to replace the worst individuals in the local population. The experiments are performed on three benchmark mathematical functions. During these experiments, the island sizes are varied from $2$ to $256$ individuals and island numbers from $1$ to $1024$. The maximum performance is achieved for high number of islands and increasing population sizes.
  
  The same strategy is also adopted by Tsutsui and Fujimoto
  in~\cite{tsutsuiGAQAP}  to implement a coarse-grained genetic
  
  The same strategy is also adopted by Tsutsui and Fujimoto
  in~\cite{tsutsuiGAQAP}  to implement a coarse-grained genetic
@@ -614,7 +614,7 @@ QAP benchmarks from the QAPLIB~\cite{burkard1991qaplib}. The GPU
  implementation reached speedups of $2.9\times$ to $12.6 \times$
  compared to a single core implementation of a coarse-grained
  genetic algorithm on a
  implementation reached speedups of $2.9\times$ to $12.6 \times$
  compared to a single core implementation of a coarse-grained
  genetic algorithm on a
-Intel i7 processor.\\
+Intel i7 processor.
  
  Nowotniak and Kucharski~\cite{nowotniak} proposed a GPU-based
  implementation of a Quantum Inspired Genetic Algorithm (QIGA). The 
  
  Nowotniak and Kucharski~\cite{nowotniak} proposed a GPU-based
  implementation of a Quantum Inspired Genetic Algorithm (QIGA). The 
@@ -629,7 +629,7 @@ same algorithm in order to assess statistical efficiency. The
  populations are stored in the shared memory, the data matrix used
  for fitness evaluation is placed in read only constant memory, and
  finally seeds for random numbers generated on the GPU are stored
  populations are stored in the shared memory, the data matrix used
  for fitness evaluation is placed in read only constant memory, and
  finally seeds for random numbers generated on the GPU are stored
-in the global memory.\\
+in the global memory.
  
  In coarse-grained parallelism, the use of the per-block approach
  to implement the islands (one subpopulation per thread block) is
  
  In coarse-grained parallelism, the use of the per-block approach
  to implement the islands (one subpopulation per thread block) is
@@ -637,7 +637,7 @@ almost natural  and it allows the use of shared memory to store
  the subpopulations. Fine-grained parallel models for EAs (cellular
  EAs) have been used by many authors to implement parallel EAs on
  GPUs. Indeed, the fine-grained parallelism of EAs fits perfectly
  the subpopulations. Fine-grained parallel models for EAs (cellular
  EAs) have been used by many authors to implement parallel EAs on
  GPUs. Indeed, the fine-grained parallelism of EAs fits perfectly
-to the SIMD architecture of the GPU. \\
+to the SIMD architecture of the GPU. 
  
  Pinel et al. in~\cite{pinel2012JPDC} developed a highly
  parallel synchronous cellular genetic
  
  Pinel et al. in~\cite{pinel2012JPDC} developed a highly
  parallel synchronous cellular genetic
@@ -663,7 +663,7 @@ threads used to accelerate the recombination operators especially
  when dealing with large instances of the problem. In addition to
  the recombination operators, the rest of the CGA steps are  also
  parallelized on GPU (fitness evaluation, mutation, and
  when dealing with large instances of the problem. In addition to
  the recombination operators, the rest of the CGA steps are  also
  parallelized on GPU (fitness evaluation, mutation, and
-replacement).\\
+replacement).
  
  A similar work is proposed by Vidal and Alba  in~\cite{albaCGAGPU}
  where a  CGA using a toroidal grid is completely implemented on
  
  A similar work is proposed by Vidal and Alba  in~\cite{albaCGAGPU}
  where a  CGA using a toroidal grid is completely implemented on
@@ -687,7 +687,7 @@ communications,
  synchronization, and access to global memory. Finally, an
  interesting review on GPU parallel computation in bio-inspired
  algorithms is proposed by Arenas et al.
  synchronization, and access to global memory. Finally, an
  interesting review on GPU parallel computation in bio-inspired
  algorithms is proposed by Arenas et al.
-in~\cite{arenas2011}. \\
+in~\cite{arenas2011}. 
  
  \subsubsection*{Ant colony optimization}
  
  
  \subsubsection*{Ant colony optimization}
  
@@ -698,7 +698,7 @@ State-of-the-art works on parallelizing
  ACO focus on
  accelerating the tour construction step performed by each ant by
  taking a task-based parallelism approach, with pheromone
  ACO focus on
  accelerating the tour construction step performed by each ant by
  taking a task-based parallelism approach, with pheromone
-deposition on the CPU.\\
+deposition on the CPU.
  
  In~\cite{cecilia}, Cecilia et al.  present a GPU-based
  implementation of
  
  In~\cite{cecilia}, Cecilia et al.  present a GPU-based
  implementation of
@@ -720,7 +720,7 @@ update, several GPU techniques are also used to increase the data
  bandwidth of the application mainly by the use of precalculated
  matrices that are easily updated by several threads (one thread
  per matrix entry). The achieved speedups are $21 \times$ for tour
  bandwidth of the application mainly by the use of precalculated
  matrices that are easily updated by several threads (one thread
  per matrix entry). The achieved speedups are $21 \times$ for tour
-construction and  $20 \times$ for pheromone updates.\\
+construction and  $20 \times$ for pheromone updates.
  
  In another work, Tsutsui and Fujimoto~\cite{tsutsui} propose  a
  hybrid algorithm combining
  
  In another work, Tsutsui and Fujimoto~\cite{tsutsui} propose  a
  hybrid algorithm combining
@@ -753,7 +753,7 @@ performed on standard QAP benchmarks showed that the GPU
  implementation using MATA obtained a  speedup of $19 \times$
  compared to the CPU implementation,  compared with a speedup of
  only $5 \times$
  implementation using MATA obtained a  speedup of $19 \times$
  compared to the CPU implementation,  compared with a speedup of
  only $5 \times$
-when MATA is not used. \\
+when MATA is not used. 
  
  \subsubsection*{Particle swarm optimization}
  In~\cite{zhou2009} Zhou and Tan propose a full GPU implementation
  
  \subsubsection*{Particle swarm optimization}
  In~\cite{zhou2009} Zhou and Tan propose a full GPU implementation
@@ -776,7 +776,7 @@ best of each particle), the GPU threads are mapped to the \emph{N}
  particles of the swarm one to one. Experiments done on four
  benchmark functions show speedups ranging from $3.7 \times$ to
  $9.0 \times$ for swarm sizes
  particles of the swarm one to one. Experiments done on four
  benchmark functions show speedups ranging from $3.7 \times$ to
  $9.0 \times$ for swarm sizes
-ranging from $400$ to $2800$.\\
+ranging from $400$ to $2800$.
  
  \subsection{Synthesis of the implementation strategies}
  \label{ch8:sec:synthesis} After reviewing some works dealing with
  
  \subsection{Synthesis of the implementation strategies}
  \label{ch8:sec:synthesis} After reviewing some works dealing with
@@ -790,7 +790,7 @@ Indeed, even though the parallelization models used in most works
  for GPUs are derived from the traditional parallel models of each
  metaheuristic (on CPU), their implementation could take a
  different way and sometimes it may result in new parallel models
  for GPUs are derived from the traditional parallel models of each
  metaheuristic (on CPU), their implementation could take a
  different way and sometimes it may result in new parallel models
-customized for GPUs.\\
+customized for GPUs.
  
  Traditional parallel models for metaheuristics are based on an
  intuitive task parallelism: the independent tasks of the
  
  Traditional parallel models for metaheuristics are based on an
  intuitive task parallelism: the independent tasks of the
@@ -802,7 +802,7 @@ the evaluation step. Nevertheless, because of the particularity of
  the GPU architecture, some authors have used new implementation
  techniques to enhance the data parallelism in the sequential
  algorithms in order to increase the data throughput of the
  the GPU architecture, some authors have used new implementation
  techniques to enhance the data parallelism in the sequential
  algorithms in order to increase the data throughput of the
-application.\\
+application.
  
  From this observation we propose the following classification
  based on 2 levels: design level and implementation level as
  
  From this observation we propose the following classification
  based on 2 levels: design level and implementation level as
@@ -820,7 +820,7 @@ models are implemented on GPU. This classification focuses only on
  the mapping strategies between the GPU threads and the
  parallelized tasks (neighborhood evaluation, solution
  construction, and so on). The different implementation strategies
  the mapping strategies between the GPU threads and the
  parallelized tasks (neighborhood evaluation, solution
  construction, and so on). The different implementation strategies
-are explained in the following sections.\\
+are explained in the following sections.
  
  \begin{figure}[h]
  \centerline{\includegraphics[width=1\textwidth]{Chapters/chapter9/figures/classification.pdf}}
  
  \begin{figure}[h]
  \centerline{\includegraphics[width=1\textwidth]{Chapters/chapter9/figures/classification.pdf}}
@@ -843,7 +843,7 @@ associated to a block of threads. A second level of parallelism is
  used inside each block to parallelize the fitness evaluation of a
  single solution. This kind of mapping allows the use of shared
  memory to store the data structures of the solution and does not
  used inside each block to parallelize the fitness evaluation of a
  single solution. This kind of mapping allows the use of shared
  memory to store the data structures of the solution and does not
-require the use of very large neighborhoods or populations.\\
+require the use of very large neighborhoods or populations.
  %data parallelism in SA-matrix to parallelize
  
  \subsubsection*{GPU thread mapping for iteration-level parallelism}
  %data parallelism in SA-matrix to parallelize
  
  \subsubsection*{GPU thread mapping for iteration-level parallelism}
@@ -851,7 +851,7 @@ require the use of very large neighborhoods or populations.\\
  Iteration-level parallelism consists of parallelizing the tasks
  performed independently on different solutions. Different mapping
  strategies are adopted in the reviewed works to implement these
  Iteration-level parallelism consists of parallelizing the tasks
  performed independently on different solutions. Different mapping
  strategies are adopted in the reviewed works to implement these
-models.\\
+models.
  
  In Figure \ref{ch8:fig:classification}, the first example of
  iteration-level
  
  In Figure \ref{ch8:fig:classification}, the first example of
  iteration-level
@@ -875,7 +875,7 @@ memory.
  %pheromone update data parallelism matrices
  
  \subsubsection*{GPU thread mapping  for algorithmic-level parallelism}
  %pheromone update data parallelism matrices
  
  \subsubsection*{GPU thread mapping  for algorithmic-level parallelism}
-Algorithmic-level parallelism consists of launching several self-contained algorithms in parallel. In the previously reviewed works two algorithmic-level models have been used: the multistart model and the island model (parallel EAs).\\
+Algorithmic-level parallelism consists of launching several self-contained algorithms in parallel. In the previously reviewed works two algorithmic-level models have been used: the multistart model and the island model (parallel EAs).
  
  The implementation of the multistart model is based on two
  different mapping strategies~\cite{luongMultiStart, audreyANT}:
  
  The implementation of the multistart model is based on two
  different mapping strategies~\cite{luongMultiStart, audreyANT}:
@@ -893,7 +893,7 @@ parallelized on GPU using per-thread mapping strategy (one thread
  per solution). This consists of a hierarchical parallel model
  combining algorithmic-level parallelism (multistart) with
  iteration-level
  per solution). This consists of a hierarchical parallel model
  combining algorithmic-level parallelism (multistart) with
  iteration-level
-parallelism (master-worker).\\
+parallelism (master-worker).
  
  In the island model, the same mapping is used in all the reviewed
  works~\cite{tsutsuiGAQAP, nowotniak, maitre2009}: each
  
  In the island model, the same mapping is used in all the reviewed
  works~\cite{tsutsuiGAQAP, nowotniak, maitre2009}: each
@@ -906,7 +906,7 @@ advantage of this hierarchical implementation is that it allows
  the occupation of a large number of threads even for medium
  population sizes. The second advantage consists of using shared
  memory to store subpopulations to benefit from the low latency of
  the occupation of a large number of threads even for medium
  population sizes. The second advantage consists of using shared
  memory to store subpopulations to benefit from the low latency of
-shared memory.\\
+shared memory.
  
  \section{Frameworks for metaheuristics on GPUs}
  \label{ch8:sec:frameworks}
  
  \section{Frameworks for metaheuristics on GPUs}
  \label{ch8:sec:frameworks}
@@ -918,7 +918,7 @@ on GPUs. Although the works on this subject are not yet mature and
  do not cover the main metaheuristic algorithms, we will present
  the only three works to our knowledge, which propose open source
  frameworks for developing
  do not cover the main metaheuristic algorithms, we will present
  the only three works to our knowledge, which propose open source
  frameworks for developing
-metaheuristics on GPUs.\\
+metaheuristics on GPUs.
  
  The three frameworks reviewed in this section are
  PUGACE\index{GPU-based frameworks!PUGACE}~\cite{pugace},
  
  The three frameworks reviewed in this section are
  PUGACE\index{GPU-based frameworks!PUGACE}~\cite{pugace},
@@ -950,7 +950,7 @@ etc.). Different problem encoding is also supported. The framework
  is organized as a set of modules in which the different
  functionalities  are separated as much as possible in order to
  facilitate the extension of the framework. All
  is organized as a set of modules in which the different
  functionalities  are separated as much as possible in order to
  facilitate the extension of the framework. All
-the functions and procedures that execute on GPU are implemented in the same file kernel.cu. \\
+the functions and procedures that execute on GPU are implemented in the same file kernel.cu. 
  
  The implementation strategy adopted on the GPU is as follows.
  Population initialization is done on the CPU side and the
  
  The implementation strategy adopted on the GPU is as follows.
  Population initialization is done on the CPU side and the
@@ -964,7 +964,7 @@ probability of application which may give different results from
  one thread to the other), the decision of whether to apply a
  crossover is taken at the block level and applied to all the
  individuals within the block. It is the decision on the choose
  one thread to the other), the decision of whether to apply a
  crossover is taken at the block level and applied to all the
  individuals within the block. It is the decision on the choose
-of the cutting point for the crossover.\\
+of the cutting point for the crossover.
  
  The framework is validated on standard benchmarks of the QAP. Speedups of $15.44 \times$  to $18.41
  \times$ are achieved compared to a CPU implementation of a cEA
  
  The framework is validated on standard benchmarks of the QAP. Speedups of $15.44 \times$  to $18.41
  \times$ are achieved compared to a CPU implementation of a cEA
@@ -1002,7 +1002,7 @@ parallelization techniques and deployment on GPUs. The framework
  allows one to efficiently manage the hierarchical organization of
  the memories (latencies and sizes) of the GPU and its
  communication with the CPU as well as the minimizing of the user
  allows one to efficiently manage the hierarchical organization of
  the memories (latencies and sizes) of the GPU and its
  communication with the CPU as well as the minimizing of the user
-involvement in its management.\\
+involvement in its management.
  
  \begin{figure}[h]
  \centerline{\includegraphics[width=0.8\textwidth]{Chapters/chapter9/figures/paradiseoGPU.pdf}}
  
  \begin{figure}[h]
  \centerline{\includegraphics[width=0.8\textwidth]{Chapters/chapter9/figures/paradiseoGPU.pdf}}
@@ -1026,7 +1026,7 @@ it is sent back to the CPU which selects the best solution (See
  Figure~\ref{ch1:fig:paradiseoGPU}).
  
  \subsection{libCUDAOptimize: an open source library of GPU-based metaheuristics}
  Figure~\ref{ch1:fig:paradiseoGPU}).
  
  \subsection{libCUDAOptimize: an open source library of GPU-based metaheuristics}
-\index{GPU-based frameworks!libCudaOptimize}
+\index{GPU-based frameworks!libCUDAOptimize}
  LibCUDAOptimize~\cite{libcuda} is a C++/CUDA open source library
  for the design and implementation of metaheuristics on GPUs. Until
  now, the metaheuristics supported by LibCUDAOptimize are: scatter
  LibCUDAOptimize~\cite{libcuda} is a C++/CUDA open source library
  for the design and implementation of metaheuristics on GPUs. Until
  now, the metaheuristics supported by LibCUDAOptimize are: scatter
@@ -1054,7 +1054,7 @@ iteration which starts from a disrupted local optima reached by
  the previous local search process. The disruption heuristic is a
  performance parameter of an ILS algorithm and should be
  judiciously defined. A template of an
  the previous local search process. The disruption heuristic is a
  performance parameter of an ILS algorithm and should be
  judiciously defined. A template of an
-ILS algorithm is given by the Algorithm~\ref{ch8:ils_algorithm_template}.\\
+ILS algorithm is given by the Algorithm~\ref{ch8:ils_algorithm_template}.
  
  \begin{algorithm}[H]
      \SetAlgoLined
  
  \begin{algorithm}[H]
      \SetAlgoLined
@@ -1101,7 +1101,7 @@ optimization community due to its wide range of practical
  applications (site planning, schedule problem, computer-aided
  design, etc.) and to its computational challenges since it is
  considered as one of the most computationally difficult
  applications (site planning, schedule problem, computer-aided
  design, etc.) and to its computational challenges since it is
  considered as one of the most computationally difficult
-optimization problems.\\
+optimization problems.
  
  The Q3AP was first introduced by William P. Pierskalla in
  1967~\cite{Pierskalla1967Q3AP} and, unlike the QAP, the Q3AP is a
  
  The Q3AP was first introduced by William P. Pierskalla in
  1967~\cite{Pierskalla1967Q3AP} and, unlike the QAP, the Q3AP is a
@@ -1155,7 +1155,7 @@ where $\pi_1$ and $\pi_2$ are permutations over the set
  $\left\lbrace 0,1,\ldots,n-1\right\rbrace$. According to this
  formulation, minimizing the Q3AP consists of finding a double
  permutation $(\pi_1,\pi_2)$ which minimizes the objective function
  $\left\lbrace 0,1,\ldots,n-1\right\rbrace$. According to this
  formulation, minimizing the Q3AP consists of finding a double
  permutation $(\pi_1,\pi_2)$ which minimizes the objective function
-(\ref{Q3APPermutationBasedFormulation}).\\
+(\ref{Q3APPermutationBasedFormulation}).
  
  The Q3AP is proven to be an NP-hard problem since it is an
  extension of the quadratic assignment problem and of the axial
  
  The Q3AP is proven to be an NP-hard problem since it is an
  extension of the quadratic assignment problem and of the axial
@@ -1176,7 +1176,7 @@ escape local optima. However, this strategy may generate cycles,
  i.e., previous moves can be selected again. To avoid cycles, the
  TS manages a short-term memory that contains the moves that have
  been recently performed. A TS
  i.e., previous moves can be selected again. To avoid cycles, the
  TS manages a short-term memory that contains the moves that have
  been recently performed. A TS
-template is given by Algorithm \ref{TS_pseudo_code}.\\
+template is given by Algorithm \ref{TS_pseudo_code}.
  %
  %  \begin{algorithm}
  %  \label{TS_pseudo_code}
  %
  %  \begin{algorithm}
  %  \label{TS_pseudo_code}
@@ -1220,7 +1220,7 @@ The neighborhood function is a crucial parameter in any local
  search algorithm. Indeed, if the neighborhood function is not
  adequate to the problem and/or does not consider the targeted
  computing framework, any local search algorithm will fail to reach
  search algorithm. Indeed, if the neighborhood function is not
  adequate to the problem and/or does not consider the targeted
  computing framework, any local search algorithm will fail to reach
-good quality solutions of the search space.\\
+good quality solutions of the search space.
  
  Regarding the Q3AP, many neighborhood structures can be
  considered. A basic neighborhood was proposed and investigated in
  
  Regarding the Q3AP, many neighborhood structures can be
  considered. A basic neighborhood was proposed and investigated in
@@ -1262,7 +1262,7 @@ algorithms~\cite{Ahuja:2007:VLN:1528422.1528438}. However, for
  such a neighborhood, the generation/evaluation step of an LS
  becomes a time-consuming task and may dramatically increase the
  computational time of the LS process. This
  such a neighborhood, the generation/evaluation step of an LS
  becomes a time-consuming task and may dramatically increase the
  computational time of the LS process. This
-justifies the use of intensive data-parallelism provided by GPUs where all neighboring solutions may be concurrently evaluated. \\
+justifies the use of intensive data-parallelism provided by GPUs where all neighboring solutions may be concurrently evaluated. 
  
  The considered large-sized neighborhood consists of swapping two
  positions in both permutations $\pi_1$ and $\pi_2$. This
  
  The considered large-sized neighborhood consists of swapping two
  positions in both permutations $\pi_1$ and $\pi_2$. This
@@ -1299,7 +1299,7 @@ management\index{GPU!memory management} and the
  capacity constraints of GPU memories, and the thread
  synchronization. All these issues must be regarded when designing
  parallel LS models to allow
  capacity constraints of GPU memories, and the thread
  synchronization. All these issues must be regarded when designing
  parallel LS models to allow
-solving of large scale optimization problems on GPU architectures.\\
+solving of large scale optimization problems on GPU architectures.
  
  To go back to our problem (i.e., Q3AP), we propose in
  Algorithm~\ref{ch8:algoITS} an iterated tabu
  
  To go back to our problem (i.e., Q3AP), we propose in
  Algorithm~\ref{ch8:algoITS} an iterated tabu
@@ -1403,7 +1403,7 @@ optima reached by the current TS process and reconsidering it as
  initial solution of the following TS process. Regarding our
  algorithm, the applied perturbation is a random number $\mu $ of
  swaps in either the first or the second permutation where $\mu \in
  initial solution of the following TS process. Regarding our
  algorithm, the applied perturbation is a random number $\mu $ of
  swaps in either the first or the second permutation where $\mu \in
-[2:n]$ ($n$ is the instance size).\\
+[2:n]$ ($n$ is the instance size).
  
  Experiments have been carried out on a node of the Chirloute
  cluster of the Lille site. This is one of the 10 sites that
  
  Experiments have been carried out on a node of the Chirloute
  cluster of the Lille site. This is one of the 10 sites that
@@ -1413,7 +1413,7 @@ of an Intel Xeon E5620 CPU and a NVIDIA Tesla Fermi M2050 (448
  cores) GPU type. The number of ILS iterations and the number of TS
  iterations were set respectively to 20 and 500. The tabu list size
  has been initalized to $\frac{m}{4}$, $m$ being the size of the
  cores) GPU type. The number of ILS iterations and the number of TS
  iterations were set respectively to 20 and 500. The tabu list size
  has been initalized to $\frac{m}{4}$, $m$ being the size of the
-neighborhood set.\\
+neighborhood set.
  
  Table \ref{ch8:ITSQ3APResults} reports the obtained results for
  the GPU-ITS using our large-sized neighborhood structure. The
  
  Table \ref{ch8:ITSQ3APResults} reports the obtained results for
  the GPU-ITS using our large-sized neighborhood structure. The
@@ -1428,7 +1428,7 @@ evaluation function of the parallel GPU version have been
  measured. The number of successful tries (hits) and the average
  number of ILS iterations to converge to the optimal/best known
  value are also represented. The associated standard deviation for
  measured. The number of successful tries (hits) and the average
  number of ILS iterations to converge to the optimal/best known
  value are also represented. The associated standard deviation for
-each average measurement is shown in small type characters.\\
+each average measurement is shown in small type characters.
  
  \begin{table}
  %\tiny
  
  \begin{table}
  %\tiny
@@ -1449,7 +1449,7 @@ Nug22-d & $42476$ & $43282.1$ & $44140$ & $25$\% & $4506.5$ & $32.7$ & $137.8 \t
  & & \tiny{$529.6$} & & & \tiny{$341.1$} & \tiny{$6.6$} & & \tiny{$4.0$} \\ \hline
  \end{tabular}
  \caption{Results of the GPU-based iterated tabu search for
  & & \tiny{$529.6$} & & & \tiny{$341.1$} & \tiny{$6.6$} & & \tiny{$4.0$} \\ \hline
  \end{tabular}
  \caption{Results of the GPU-based iterated tabu search for
-different Q3AP instances.} \label{ch8:ITSQ3APResults} \center
+different Q3AP instances.} \label{ch8:ITSQ3APResults} % \center %Shashi
  \end{table}
  
  %\begin{table}
  \end{table}
  
  %\begin{table}
@@ -1491,7 +1491,7 @@ This chapter has presented state-of-the-art GPU-based parallel
  metaheuristics and
  a case study on implementing large neighborhood local search
  methods on GPUs for solving large benchmarks of the quadratic
  metaheuristics and
  a case study on implementing large neighborhood local search
  methods on GPUs for solving large benchmarks of the quadratic
-three-dimensional assignment problem (Q3AP). \\
+three-dimensional assignment problem (Q3AP). 
  
  Implementing parallel
  metaheuristics on
  
  Implementing parallel
  metaheuristics on