From 17bff40b83bcdcc39769f9e59c70ffae1c525b72 Mon Sep 17 00:00:00 2001 From: couturie Date: Wed, 7 Aug 2013 10:06:11 +0200 Subject: [PATCH] new --- BookGPU/Chapters/chapter10/ch10.tex | 22 +- BookGPU/Chapters/chapter11/ch11.tex | 2 +- BookGPU/Chapters/chapter12/ch12.tex | 6 +- BookGPU/Chapters/chapter13/ch13.tex | 14 +- BookGPU/Chapters/chapter14/ch14.tex | 2 +- BookGPU/Chapters/chapter15/ch15.tex | 2 +- BookGPU/Chapters/chapter16/bdf.tex | 4 +- BookGPU/Chapters/chapter16/ef.tex | 8 +- BookGPU/Chapters/chapter16/gpu.tex | 8 +- BookGPU/Chapters/chapter16/intro.tex | 6 +- BookGPU/Chapters/chapter3/ch3.tex | 14 +- BookGPU/Chapters/chapter4/ch4.tex | 2 +- BookGPU/Chapters/chapter5/ch5.tex | 2 +- BookGPU/Chapters/chapter6/PartieAsync.tex | 2 +- BookGPU/Chapters/chapter6/PartieSync.tex | 4 +- BookGPU/Chapters/chapter8/ch8.tex | 10 +- BookGPU/Chapters/chapter9/ch9.tex | 245 ++++++++++------------ 17 files changed, 167 insertions(+), 186 deletions(-) diff --git a/BookGPU/Chapters/chapter10/ch10.tex b/BookGPU/Chapters/chapter10/ch10.tex index 2664ff0..7a39dca 100644 --- a/BookGPU/Chapters/chapter10/ch10.tex +++ b/BookGPU/Chapters/chapter10/ch10.tex @@ -30,7 +30,7 @@ Finally, we summarize the results obtained and consider new perspectives. \section{Simplex algorithm} \label{chXXX:sec:simplex-algo} -\subsection{Linear programming model\index{Linear programming}} +\subsection{Linear programming model\index{linear programming}} \label{chXXX:subsec:lp-model} %% Mathematics of LP An LP model in its canonical form can be expressed as the following optimization problem: @@ -53,7 +53,7 @@ An optimal solution to the LP problem will therefore reside on a vertex of this \subsection{Standard simplex algorithm} \label{chXXX:subsec:simplex-std} % Simple explanation of the simple algorithm : 3 operations -The simplex method\index{Simplex!Standard method}~\cite{VCLP} is an algorithm for solving LP models. +The simplex method\index{simplex!standard method}~\cite{VCLP} is an algorithm for solving LP models. It proceeds by iteratively visiting vertices on the boundary of the feasible region. This amounts to performing algebraic manipulations on the system of linear equations. @@ -193,7 +193,7 @@ Update $\mbf{A_\N}$, $\mbf{c_\N}$ and $\mbf{b}$\; \subsection{Revised simplex method} \label{chXXX:subsec:simplex-rev} % Global revised simplex idea -The operation that takes the most time in the standard method is the pivoting operation, and more specifically, the update of the constraints matrix~$\mbf{A}$. The revised method\index{Simplex!Revised method} tries to avoid this costly operation by updating only a smaller part of this matrix. +The operation that takes the most time in the standard method is the pivoting operation, and more specifically, the update of the constraints matrix~$\mbf{A}$. The revised method\index{simplex!revised method} tries to avoid this costly operation by updating only a smaller part of this matrix. %% Revised simplex method operations, useful for the algorithm The revised simplex method uses the same operations as in the standard method to choose the entering and leaving variables. @@ -298,9 +298,9 @@ and $\bs\beta = \mbf{N^T} \mbf{(B^{-1})^T} \mbf{d}$ \subsubsection*{Choice of the leaving variable} The stability and robustness of the algorithm depend considerably on the choice of the leaving variable. With respect to this, the \textit{expand} method~\cite{GILL1989} proves to be very useful in the sense that it helps to avoid cycles and reduces the risks of encountering numerical instabilities. This method consists of two steps of complexity $\mathcal{O}(m)$. In the first step, a small perturbation is applied to the bounds of the variables to prevent stalling of the objective value, thus avoiding cycles. These perturbed bounds are then used to determine the greatest gain on the entering variable imposed by the most constraining basic variable. The second phase uses the original bounds to define the basic variable offering the gain closest to the one of the first phase. This variable will then be selected for leaving the basis. -\section{Branch-and-bound\index{Branch-and-bound} algorithm} +\section{Branch-and-bound\index{branch-and-bound} algorithm} \label{chXXX:sec:bnb} -\subsection{Integer linear programming\index{Integer linear programming}} +\subsection{Integer linear programming\index{integer linear programming}} \label{chXXX:subsec:ilp} % Why and what are integer linear programming problem In some problems, variables are integer-valued. For example, in a vehicle routing problem, one cannot assign one third of a vehicule to a specific route. ILP problems restrict LP problems by imposing an integrality condition on the variables. While this change may seem to have little impact on the model, the aftermaths on the resolution method are quite important. @@ -369,7 +369,7 @@ whose objective value $z$ is inferior to the best one yet found. \subsection{Branching strategy} \label{chXXX:subsec:bnb-branching} % What it is and why it is important -The branching strategy\index{Branch-and-bound!Branching} defines the method used to select the variable on which branching will occur. +The branching strategy\index{branch-and-bound!branching} defines the method used to select the variable on which branching will occur. The objective value of the child node depends greatly on the choice of this variable. Branching on a variable may lead to a drop on the upper bound and thus speed up the exploration, while branching on other variables could leave this bound unchanged. @@ -384,7 +384,7 @@ Several branching strategies exist~\cite{Achterberg05}. Let us briefly comment o \subsection{Node selection strategy} \label{chXXX:subsec:bnb-node-select} -The node selection strategy\index{Branch-and-bound!Node selection} defines the methodology used to explore the tree. +The node selection strategy\index{branch-and-bound!node selection} defines the methodology used to explore the tree. While the usual depth-first search and breadth-first search are considered and used, some remarks about the tree exploration must be made. First let us mention a few facts: @@ -419,7 +419,7 @@ thus improving the upper and lower bounds at the same time. \subsection{Cutting-plane methods} \label{chXXX:subsec:cuts} -\textit{Cutting-planes}\index{Branch-and-bound!Cutting-plane}~\cite{WOLTER06} (also simply \textit{cuts}) are new constraints whose role is to cut off parts of the search space. +\textit{Cutting-planes}\index{branch-and-bound!cutting-plane}~\cite{WOLTER06} (also simply \textit{cuts}) are new constraints whose role is to cut off parts of the search space. They may be generated from the first LP solution (\textit{cut-and-branch}) or periodically during the B\&B (\textit{branch-and-cut}). On the one hand cutting-planes may considerably reduce the size of the solution space, but on the other hand they increase the problem size. Moreover, generating cutting-planes is costly since it requires a thorough analysis of the current state of the problem. @@ -442,7 +442,7 @@ In the following section, we first quickly describe the CUDA reduction operation % Reduce operation \subsection{Parallel reduction} \label{chXXX:subsec:reduction} -A parallel reduction\index{Parallel reduction} operation is performed in an efficient manner inside a GPU block as shown in Figure~\ref{chXXX:fig:reduc}. Shared memory is used for a fast and reliable way to communicate between threads. However, at the grid level, reduction cannot be easily implemented due to the lack of direct communication between blocks. The usual way of dealing with this type of limitation is to apply the reduction in two separate steps. The first one involves a GPU kernel reducing the data over multiple blocks, the local result of each block being stored on completion. The second step finishes the reduction on a single block or on the CPU. +A parallel reduction\index{parallel reduction} operation is performed in an efficient manner inside a GPU block as shown in Figure~\ref{chXXX:fig:reduc}. Shared memory is used for a fast and reliable way to communicate between threads. However, at the grid level, reduction cannot be easily implemented due to the lack of direct communication between blocks. The usual way of dealing with this type of limitation is to apply the reduction in two separate steps. The first one involves a GPU kernel reducing the data over multiple blocks, the local result of each block being stored on completion. The second step finishes the reduction on a single block or on the CPU. An optimized way of doing the reduction can be found in the examples\footnote{Available at http://docs.nvidia.com/cuda/cuda-samples/index.html\#advanced} provided by NVIDIA. %In order to keep code listings compact hereafter, the reduction of values among a block will be referred to as \textit{reduceOperation(value)} (per extension \textit{reduceArgMax(maxVal)}). @@ -752,12 +752,12 @@ The first critical choice is to select an efficient tree traversal strategy that There are two cases where starting from a fresh problem is required or beneficial. The first one is imposed by the numeric inaccuracy appearing after several iterations of the LP solver. The second is upon the request of a new subtree. To avoid the extensive communication costs of a full restart, the GPU keeps in memory an intermediate stable state of the problem, a \textit{warmstart}. This state could, for example, be the one found after solving the root node of the tree. % Global flow -\subsubsection*{Multi-GPU\index{Multi-GPU} exploration} +\subsubsection*{Multi-GPU\index{multi-GPU} exploration} Having the CPU act as a decision maker and the GPU as an explorer, allows for the possibility of using multiple GPUs to explore the tree. The global knowledge is maintained by the CPU task. The CPU assigns to the GPUs the task of exploring subtrees of promising nodes. Since each plunging is done locally, no communications are required between the GPUs. Moreover, the amount of nodes processed during a plunging can be used to tune the load of the CPU task. \section{Performance model} \label{chXXX:sec:MODEL} -Performance models\index{Performance model} are useful to predict the behaviour of implementations as a function of various parameters. Since the standard simplex algorithm is the easiest to understand, we will focus in this section on its behaviour as a function of the problem size. +Performance models\index{performance model} are useful to predict the behaviour of implementations as a function of various parameters. Since the standard simplex algorithm is the easiest to understand, we will focus in this section on its behaviour as a function of the problem size. CUDA kernels require a different kind of modeling than usually encountered in parallelism. The key idea is to capture in the model the decomposition into threads, warps, and blocks. One must also pay a particular attention to global memory accesses and to how the pipelines reduce the associated latency. diff --git a/BookGPU/Chapters/chapter11/ch11.tex b/BookGPU/Chapters/chapter11/ch11.tex index 57b3328..a374345 100644 --- a/BookGPU/Chapters/chapter11/ch11.tex +++ b/BookGPU/Chapters/chapter11/ch11.tex @@ -346,7 +346,7 @@ A popular PAV algorithm (PAVA) is one method that provides efficient numerical s Various serial implementations of the PAVA exist. It is noted \cite{Kearsley_2006} that in PAVA, which is based on the ideas from convex analysis, a decomposition theorem holds, namely, performing PAVA separately on two contiguous subsets of data and then performing PAVA on the result produces isotonic regression on the whole data set. Thus, isotonic regression is parallelizable, and the divide-and-conquer approach, decomposing the original problem into two smaller subproblems, can be implemented on multiple processors. However, to our knowledge, no parallel PAVA for many-core systems such as GPUs exist. -\index{MLS algorithm} \index{Minimum Lower Sets} +\index{MLS (minimum lower sets) algorithm} Another approach to isotonic regression is called the MLS algorithm \cite{Best1990, Robertson_book}. It provides the same solution as the PAVA, but works differently. For each datum (or block), MLS selects the largest contiguous block of subsequent data with the smallest average. If this average is smaller than that of the preceding block, the blocks are merged, and the data in the block are replaced with their average. MLS is also an active set method \cite{Best1990}, but its complexity is $O(n^2)$ as opposed to $O(n)$ of the PAVA, and of another active set algorithm proposed in \cite{Best1990} by the name of Algorithm A. diff --git a/BookGPU/Chapters/chapter12/ch12.tex b/BookGPU/Chapters/chapter12/ch12.tex index a514438..0269fd2 100755 --- a/BookGPU/Chapters/chapter12/ch12.tex +++ b/BookGPU/Chapters/chapter12/ch12.tex @@ -217,7 +217,7 @@ r_k \bot A \mathcal{K}_k(A, v_1). \end{array} \label{ch12:eq:13} \end{equation} -GMRES uses the Arnoldi process~\cite{ch12:ref5}\index{iterative method!Arnoldi process} to construct an +GMRES uses the Arnoldi iterations~\cite{ch12:ref5}\index{iterative method!Arnoldi iterations} to construct an orthonormal basis $V_k$ for the Krylov subspace $\mathcal{K}_k$ and an upper Hessenberg matrix\index{Hessenberg matrix} $\bar{H}_k$ of order $(k+1)\times k$: \begin{equation} @@ -313,7 +313,7 @@ $V$ to $m$ orthogonal vectors. Algorithm~\ref{ch12:alg:02} shows the key points of the GMRES method with restarts. It solves the left-preconditioned\index{sparse linear system!preconditioned} sparse linear system~(\ref{ch12:eq:11}), such that $M$ is the preconditioning matrix. At each iteration -$k$, GMRES uses the Arnoldi process\index{iterative method!Arnoldi process} (defined from +$k$, GMRES uses the Arnoldi iterations\index{iterative method!Arnoldi iterations} (defined from line~$7$ to line~$17$) to construct a basis $V_m$ of $m$ orthogonal vectors and an upper Hessenberg matrix\index{Hessenberg matrix} $\bar{H}_m$ of size $(m+1)\times m$. Then, it solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$ @@ -526,7 +526,7 @@ is managed by one MPI process and is composed of one CPU core and one GPU card. All tests are made on double-precision floating point operations. The parameters of both linear solvers are initialized as follows: the residual tolerance threshold $\varepsilon=10^{-12}$, the maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$, and the -initial guess $x_0$ is filled with $0.0$. In addition, we limited the Arnoldi process\index{iterative method!Arnoldi process} +initial guess $x_0$ is filled with $0.0$. In addition, we limited the Arnoldi iterations\index{iterative method!Arnoldi iterations} used in the GMRES method to $16$ iterations ($m=16$). For the sake of simplicity, we have chosen the preconditioner $M$ as the main diagonal of the sparse matrix $A$. Indeed, it allows us to easily compute the required inverse matrix $M^{-1}$, and it provides a relatively good preconditioning for diff --git a/BookGPU/Chapters/chapter13/ch13.tex b/BookGPU/Chapters/chapter13/ch13.tex index 89e2ce7..90dd946 100755 --- a/BookGPU/Chapters/chapter13/ch13.tex +++ b/BookGPU/Chapters/chapter13/ch13.tex @@ -299,12 +299,12 @@ and $\forall j\in\{1,\ldots,\alpha\}$, \label{ch13:eq:15} \end{equation} -The previous asynchronous scheme\index{asynchronous} of the projected Richardson +The previous asynchronous scheme\index{asynchronous iterations} of the projected Richardson method models computations that are carried out in parallel without order or synchronization (according to the behavior of the parallel iterative method) and describes a subdomain method without overlapping. It is a general model that takes into account all possible situations of parallel computations and nonblocking message -passing. So, the synchronous iterative scheme\index{synchronous} is defined by +passing. So, the synchronous iterative scheme\index{synchronous iterations} is defined by \begin{equation} \forall j\in\{1,\ldots,\alpha\} \mbox{,~} \forall p\in\mathbb{N} \mbox{,~} \rho_j(p)=p. \label{ch13:eq:16} @@ -477,7 +477,7 @@ The first operation of this function is implemented as kernels to be performed b \end{itemize} As mentioned previously, we develop the \emph{synchronous} and \emph{asynchronous} algorithms of the projected Richardson method. Obviously, in this scope, the -synchronous\index{synchronous} or asynchronous\index{asynchronous} communications +synchronous\index{synchronous iterations} or asynchronous\index{asynchronous iterations} communications refer to the communications between the CPU cores (MPI processes) on the GPU cluster, in order to exchange the vector elements associated to subdomain boundaries. For the memory copies between a CPU core and its GPU, we use the synchronous communication @@ -587,7 +587,7 @@ of relaxations is possible in both synchronous and asynchronous cases. On the other hand, an iteration is the update of at least all vector components with $F_i$. -In the synchronous\index{synchronous} algorithm, the global convergence is detected +In the synchronous\index{synchronous iterations} algorithm, the global convergence is detected when the maximal value of the absolute error, $error$, is sufficiently small and/or the maximum number of relaxations, $MaxRelax$, is reached, as follows: $$ @@ -602,7 +602,7 @@ where the function $AllReduce()$ uses the MPI global reduction subroutine\index{ \verb+MPI_Allreduce()+ to compute the maximal value, $maxerror$, among the local absolute errors, $error$, of all computing nodes, and $p$ (in Algorithm~\ref{ch13:alg:02}) is used as a counter of the local relaxations carried out by a computing node. In -the asynchronous\index{asynchronous} algorithms, the global convergence is detected +the asynchronous\index{asynchronous iterations} algorithms, the global convergence is detected when all computing nodes locally converge. For this, we use a token ring architecture around which a boolean token travels, in one direction, from a computing node to another. Starting from node $0$, the boolean token is set to $true$ by node $i$ if the local @@ -860,9 +860,9 @@ number of computing nodes on the cluster. This is due to the fact that the ratio between the time of the computation over that of the communication is reduced when the computations are performed on GPUs. Indeed, GPUs compute faster than CPUs and communications are more time-consuming. In this context, asynchronous algorithms -are more scalable than synchronous ones. So, with large scale GPU clusters, synchronous\index{synchronous} +are more scalable than synchronous ones. So, with large scale GPU clusters, synchronous\index{synchronous iterations} algorithms might be more penalized by communications, as can be deduced from Figure~\ref{ch13:fig:07}. -That is why we think that asynchronous\index{asynchronous} iterative algorithms +That is why we think that asynchronous\index{asynchronous iterations} iterative algorithms are all the more interesting in this case. diff --git a/BookGPU/Chapters/chapter14/ch14.tex b/BookGPU/Chapters/chapter14/ch14.tex index 2ce9331..e5c2daa 100755 --- a/BookGPU/Chapters/chapter14/ch14.tex +++ b/BookGPU/Chapters/chapter14/ch14.tex @@ -8,7 +8,7 @@ application} \section{Introduction} -The lattice Boltzmann (LB) method \index{Lattice Boltzmann method} (for an overview see, e.g., +The lattice Boltzmann (LB) method \index{lattice Boltzmann method} (for an overview see, e.g., \cite{succi-book}) has become a popular approach to a variety of fluid dynamics problems. It provides a way to solve the incompressible isothermal Navier-Stokes equations and has the attractive features of diff --git a/BookGPU/Chapters/chapter15/ch15.tex b/BookGPU/Chapters/chapter15/ch15.tex index 7e25220..64eb771 100644 --- a/BookGPU/Chapters/chapter15/ch15.tex +++ b/BookGPU/Chapters/chapter15/ch15.tex @@ -670,7 +670,7 @@ Fig.~\ref{offdiagonal} for an off-diagonal sector. These copies, along with possible scalings or transpositions, are implemented as CUDA kernels which can be applied to two matrices of any size starting at any offset. - Memory accesses are coalesced\index{coalesced memory accesses} \cite{CUDA_ProgGuide} in order to + Memory accesses are coalesced\index{GPU!coalesced memory accesses} \cite{CUDA_ProgGuide} in order to provide the best performance for such memory-bound kernels. \item[Step 2] (``Local copies''):~data are copied from local $R$-matrices to temporary arrays ($U$, $V$) and to $\Re^{O}$. diff --git a/BookGPU/Chapters/chapter16/bdf.tex b/BookGPU/Chapters/chapter16/bdf.tex index bd4e5d5..83abbc3 100644 --- a/BookGPU/Chapters/chapter16/bdf.tex +++ b/BookGPU/Chapters/chapter16/bdf.tex @@ -34,7 +34,7 @@ the index $i$ in $h_i$, $\alpha_1^i$ and $\alpha_2^i$ can be omitted. For the first step $t_1$, only the initial condition at $t_0$ is available. -Therefore backward-Euler is used, i.e., +Therefore backward Euler is used, i.e., \begin{equation}\label{eq:BE} \tfrac{1}{h_1}[q(x_1)-q(x_0)] + f(x_1) = b_1. \end{equation} @@ -118,7 +118,7 @@ J\! =\! \frac{\ud x_M}{\ud x_0} \begin{algorithm} \caption{The matrix-free\index{matrix-free} method for - Krylov subspace\index{Krylov subspace} construction.} + Krylov subspace\index{iterative method!Krylov subspace} construction.} \label{alg:mf_Gear} \KwIn{ current Krylov subspace basis vector $v$, time step lengths $h_i$, diff --git a/BookGPU/Chapters/chapter16/ef.tex b/BookGPU/Chapters/chapter16/ef.tex index b0967a9..400b467 100644 --- a/BookGPU/Chapters/chapter16/ef.tex +++ b/BookGPU/Chapters/chapter16/ef.tex @@ -52,7 +52,7 @@ significant savings in simulation time. % One simple way to estimate $x((m+k)T)$ is to use the information at % $mT$ and $(m-1)T$, which leads to the forward-Euler method as To estimate $x((m+k)T)$, -a forward-Euler\index{forward-Euler} style jumping relies only on $x(mT)$ and $x((m-1)T)$, +a forward Euler\index{forward Euler} style jumping relies only on $x(mT)$ and $x((m-1)T)$, i.e., \[ x((m+k)T) @@ -61,14 +61,14 @@ i.e., \] However, this approach is inefficient due to its restriction on envelope step $k$, since larger $k$ usually causes instability. -Instead, backward-Euler\index{backward-Euler} jumping, +Instead, backward Euler\index{backward Euler} jumping, %and the equation becomes \[ x((m+k)T)-x(mT) = k\left[x((m+k)T)-x((m+k-1)T)\right], \] allows larger envelope steps. Here $x((m+k-1)T)$ is the unknown variable to be solved -by Newton iteration\index{Newton iteration}, +by Newton iteration\index{iterative method!Newton iteration}, and $x((m+k)T)$ is dependent on $x((m+k-1)T)$ in each iteration. % Forward-Euler may be used to generate the initial guess. @@ -107,7 +107,7 @@ Different integration rules can be applied to the computation of sensitivity matrix. It can be easily derived that, %for one signal period with $M$ time steps, -if the DAE is integrated with backward-Euler rule, +if the DAE is integrated with backward Euler rule, the sensitivity is \[ J = \frac{\ud x_M}{\ud x_0} diff --git a/BookGPU/Chapters/chapter16/gpu.tex b/BookGPU/Chapters/chapter16/gpu.tex index f4f62f1..26bd536 100644 --- a/BookGPU/Chapters/chapter16/gpu.tex +++ b/BookGPU/Chapters/chapter16/gpu.tex @@ -15,13 +15,13 @@ Finally, the Gear-2 integration is briefly introduced. \underline{G}eneralized \underline{M}inimum \underline{Res}idual, or GMRES method is an iterative method for solving systems of linear equations ($A x=b$) with dense matrix $A$. -The standard GMRES\index{GMRES} is given in Algorithm~\ref{alg:GMRES}. -It constructs a Krylov subspace\index{Krylov subspace} with order $m$, +The standard GMRES\index{iterative method!GMRES} is given in Algorithm~\ref{alg:GMRES}. +It constructs a Krylov subspace\index{iterative method!Krylov subspace} with order $m$, \[ \mathcal{K}_m = \mathrm{span}( b, A^{} b, A^2 b,\ldots, A^{m-1} b ),\] where the approximate solution $x_m$ resides. In practice, an orthonormal basis $V_m$ that spans the subspace $\mathcal{K}_{m}$ can be generated by -the Arnoldi iteration\index{Arnoldi iterations}. +the Arnoldi iterations\index{iterative method!Arnoldi iterations}. The goal of GMRES is to search for an optimal coefficient $y$ such that the linear combination $x_m = V_m y$ will minimize its residual $\| b-Ax_m \|_2$. @@ -78,7 +78,7 @@ a preset tolerance~\cite{Golub:Book'96}. %% \end{algorithm} \begin{algorithm} -\caption{Standard GMRES\index{GMRES} algorithm.} \label{alg:GMRES} +\caption{Standard GMRES\index{iterative method!GMRES} algorithm.} \label{alg:GMRES} \KwIn{ $ A \in \mathbb{R}^{N \times N}$, $b \in \mathbb{R}^N$, and initial guess $x_0 \in \mathbb{R}^N$} \KwOut{ $x \in \mathbb{R}^N$: $\| b - A x\|_2 < tol$} diff --git a/BookGPU/Chapters/chapter16/intro.tex b/BookGPU/Chapters/chapter16/intro.tex index cbed054..4d368b3 100644 --- a/BookGPU/Chapters/chapter16/intro.tex +++ b/BookGPU/Chapters/chapter16/intro.tex @@ -11,7 +11,7 @@ switching power converters. The new method first exploits the parallelisim in the envelope-following method and parallelize the Newton update solving part, which is the most computational expensive, in GPU platforms to boost the simulation performance. To further -speed up the iterative GMRES\index{GMRES} solving for Newton update equation in the +speed up the iterative GMRES\index{iterative method!GMRES} solving for Newton update equation in the envelope-following method, we apply the matrix-free\index{matrix-free} Krylov subspace\index{Krylov subspace} basis generation technique, which was previously used for RF simulation. @@ -76,13 +76,13 @@ next envelope step. {\resizebox{.9\textwidth}{!}{\input{./Chapters/chapter16/figures/envelope.pdf_t}} \label{fig:ef2} } \caption{Transient envelope-following\index{envelope-following} analysis. - (Both two figures reflect backward-Euler\index{backward-Euler} style envelope-following.)} + (Both two figures reflect backward Euler\index{backward Euler} style envelope-following.)} \label{fig:ef_intro} \end{figure} %\IEEEpubidadjcol -Also, iterative GMRES\index{GMRES} solver is typically used in the +Also, iterative GMRES\index{iterative method!GMRES} solver is typically used in the envelope-following method to compute the solution of Newton update due to its efficiency compared to direct LU\index{LU} method. However, as the Jacobian matrix\index{Jacobian matrix} diff --git a/BookGPU/Chapters/chapter3/ch3.tex b/BookGPU/Chapters/chapter3/ch3.tex index 95bce29..1b2e263 100755 --- a/BookGPU/Chapters/chapter3/ch3.tex +++ b/BookGPU/Chapters/chapter3/ch3.tex @@ -18,10 +18,10 @@ However, so as to propose concise and more readable code, we will assume the fol \section{Data transfers, memory management.} This section deals with the following issues: \begin{enumerate} -\item Data transfer from CPU memory to GPU global memory: several GPU memory areas are available as destination memory but the 2D caching mechanism of texture memory, \index{memory~hierarchy!texture~memory} specifically designed for fetching neighboring pixels, is currently the fastest way to fetch gray-level pixel values inside a kernel computation. This has led us to choose \textbf{texture memory} as primary GPU memory area for input images. -\item Data fetching from GPU global memory to kernel local memory: as said above, we use texture memory. \index{memory~hierarchy!texture~memory} Depending on which process is run, texture data is used either by direct fetching in kernel local memory or through a prefetching \index{prefetching} in shared memory. \index{memory~hierarchy!shared~memory} +\item Data transfer from CPU memory to GPU global memory: several GPU memory areas are available as destination memory but the 2D caching mechanism of texture memory, \index{memory hierarchy!texture memory} specifically designed for fetching neighboring pixels, is currently the fastest way to fetch gray-level pixel values inside a kernel computation. This has led us to choose \textbf{texture memory} as primary GPU memory area for input images. +\item Data fetching from GPU global memory to kernel local memory: as said above, we use texture memory. \index{memory hierarchy!texture memory} Depending on which process is run, texture data is used either by direct fetching in kernel local memory or through a prefetching \index{prefetching} in shared memory. \index{memory hierarchy!shared memory} \item Data outputting from kernels to GPU memory: there is actually no alternative to global memory, as kernels cannot directly write into texture memory and as copying from texture to CPU memory would not be faster than from simple global memory. -\item Data transfer from GPU global memory to CPU memory: it can be drastically accelerated by use of \textbf{pinned memory}, \index{memory~hierarchy!pinned~memory} keeping in mind it has to be used sparingly. +\item Data transfer from GPU global memory to CPU memory: it can be drastically accelerated by use of \textbf{pinned memory}, \index{memory hierarchy!pinned memory} keeping in mind it has to be used sparingly. \end{enumerate} Algorithm \ref{algo:memcopy} summarizes all the above considerations and describes how data are handled in our examples. For more information on how to handle the different types of GPU memory, we suggest referring to the CUDA programmer's guide. @@ -127,7 +127,7 @@ copy data from GPU global memory to CPU memory\label{algoMedianGeneric:memcpyD2H As mentioned earlier, the selection of the median value can be performed by more than one technique, using either histogram-based or sorting methods, each having its own benefits and drawbacks as will be discussed further down. \subsection{A naive implementation} -As a reference, Listing \ref{lst:medianGeneric} gives a simple, not to say simplistic, implementation of a CUDA kernel (\texttt{kernel\_medianR}) achieving generic $n\times n$ histogram-based median filtering. Its runtime has a very low data dependency, but this implementation does not suit GPU architecture very well. Each pixel loads the whole of its $n\times n$ neighborhood, meaning that one pixel is loaded multiple times inside one single thread block, and even more time-consuming, the use of a local vector (histogram[]) considerably downgrades performance, as the compiler automatically stores such vectors in local memory (slow) \index{memory~hierarchy!local~memory}. +As a reference, Listing \ref{lst:medianGeneric} gives a simple, not to say simplistic, implementation of a CUDA kernel (\texttt{kernel\_medianR}) achieving generic $n\times n$ histogram-based median filtering. Its runtime has a very low data dependency, but this implementation does not suit GPU architecture very well. Each pixel loads the whole of its $n\times n$ neighborhood, meaning that one pixel is loaded multiple times inside one single thread block, and even more time-consuming, the use of a local vector (histogram[]) considerably downgrades performance, as the compiler automatically stores such vectors in local memory (slow) \index{memory hierarchy!local memory}. Table \ref{tab:medianHisto1} displays measured runtimes of \texttt{kernel\_medianR} and pixel throughputs for each GPU version (C2070 and GTX480 targets) and for both CPU and GPU implementations. Usual window sizes of $3\times 3$, $5\times 5$, and $7\times 7$ are shown. Though some specific applications require larger window sizes and dedicated algorithms, such small square window sizes are most widely used in general purpose image processing. GPU runtimes have been obtained with a grid of 64-thread blocks. @@ -192,9 +192,9 @@ On the GPU's side, we note high dependence on window size due to the redundancy \section{NVIDIA GPU tuning recipes} When designing GPU code, besides thinking of the actual data computing process, one must choose the memory type in which to store temporary data. Three types of GPU memory are available: \begin{enumerate} -\item \textbf{Global memory, the most versatile:} \index{memory~hierarchy!global~memory}\\Offers the largest storing space and global scope but is the slowest (400 to 800 clock cycles latency). \textbf{Texture memory} is physically included into it, but allows access through an efficient 2D caching mechanism. -\item \textbf{Registers, the fastest:} \index{memory~hierarchy!registers}\\Allow access without latency, but only 63 registers are available per thread (thread scope), with a maximum of 32K per Streaming Multiprocessor (SM). \index{register count} -\item \textbf{Shared memory, a complex compromise:} \index{memory~hierarchy!shared~memory}\\All threads in one block can access $48~KBytes$ of shared memory, which is faster than global memory (20 clock cycles latency) but slower than registers. +\item \textbf{Global memory, the most versatile:} \index{memory hierarchy!global memory}\\Offers the largest storing space and global scope but is the slowest (400 to 800 clock cycles latency). \textbf{Texture memory} is physically included into it, but allows access through an efficient 2D caching mechanism. +\item \textbf{Registers, the fastest:} \index{memory hierarchy!registers}\\Allow access without latency, but only 63 registers are available per thread (thread scope), with a maximum of 32K per Streaming Multiprocessor (SM). \index{register count} +\item \textbf{Shared memory, a complex compromise:} \index{memory hierarchy!shared memory}\\All threads in one block can access $48~KBytes$ of shared memory, which is faster than global memory (20 clock cycles latency) but slower than registers. However, bank conflicts can occur if two threads of a warp try to access data stored in one single memory bank. In such cases, the parallel process is serialized which may cause significant performance decrease. One easy way to avoid this is to ensure that two consecutive threads in one block always access 32-bit data at two consecutive addresses. \end{enumerate} diff --git a/BookGPU/Chapters/chapter4/ch4.tex b/BookGPU/Chapters/chapter4/ch4.tex index 805de25..bd66882 100644 --- a/BookGPU/Chapters/chapter4/ch4.tex +++ b/BookGPU/Chapters/chapter4/ch4.tex @@ -240,7 +240,7 @@ However, our technique requires writing one kernel per mask size, which can be s \lstinputlisting[label={lst:convoGene8x8pL3},caption=CUDA kernel achieving a $3\times 3$ convolution operation with the mask in symbol memory and direct data fetches in texture memory]{Chapters/chapter4/code/convoGene8x8pL3.cu} \subsection{Using shared memory to store prefetched data\index{prefetching}.} - \index{memory~hierarchy!shared~memory} + \index{memory hierarchy!shared memory} A more convenient way of coding a convolution kernel is to use shared memory to perform a prefetching stage of the whole halo before computing the convolution sums. This proves to be quite efficient and more versatile, but it obviously generates some overhead because \begin{itemize} diff --git a/BookGPU/Chapters/chapter5/ch5.tex b/BookGPU/Chapters/chapter5/ch5.tex index fed9a80..f780577 100644 --- a/BookGPU/Chapters/chapter5/ch5.tex +++ b/BookGPU/Chapters/chapter5/ch5.tex @@ -295,7 +295,7 @@ The spatial derivative in \eqref{ch5:eq:poissoneq} is again approximated with fi \begin{align} \mymat{A}\myvec{u}=\myvec{f}, \qquad \myvec{u},\myvec{f} \in \mathbb{R}^{N}, \quad \mymat{A} \in \mathbb{R}^{N\times N}, \label{ch5:eq:poissonsystem} \end{align} -where $\mymat{A}$ is the sparse matrix formed by finite difference coefficients, $N$ is the number of unknowns, and $\myvec{f}$ is given by \eqref{ch5:eq:poissonrhs}. Equation \eqref{ch5:eq:poissonsystem} can be solved in numerous ways, but a few observations may help do it more efficiently. Direct solvers based on Gaussian elimination are accurate and use a finite number of operations for a constant problem size. However, the arithmetic complexity grows with the problem size by as much as $\mathcal{O}(N^3)$ and does not exploit the sparsity of $\mymat{A}$. Direct solvers are therefore mostly feasible for dense systems of limited sizes. Sparse direct solvers exist, but they are often difficult to parallelize, or applicable for only certain types of matrices. Regardless of the discretization technique, the discretization of an elliptic PDE into a linear system as in \eqref{ch5:eq:poissonsystem} yields a very sparse matrix $\mymat{A}$ when $N$ is large. Iterative methods\index{iterative methods} for solving large sparse linear systems find broad use in scientific applications, because they require only an implementation of the matrix-vector product, and they often use a limited amount of additional memory. Comprehensive introductions to iterative methods may be found in any of~\cite{ch5:Saad2003,ch5:Kelley1995,ch5:Barrett1994}. +where $\mymat{A}$ is the sparse matrix formed by finite difference coefficients, $N$ is the number of unknowns, and $\myvec{f}$ is given by \eqref{ch5:eq:poissonrhs}. Equation \eqref{ch5:eq:poissonsystem} can be solved in numerous ways, but a few observations may help do it more efficiently. Direct solvers based on Gaussian elimination are accurate and use a finite number of operations for a constant problem size. However, the arithmetic complexity grows with the problem size by as much as $\mathcal{O}(N^3)$ and does not exploit the sparsity of $\mymat{A}$. Direct solvers are therefore mostly feasible for dense systems of limited sizes. Sparse direct solvers exist, but they are often difficult to parallelize, or applicable for only certain types of matrices. Regardless of the discretization technique, the discretization of an elliptic PDE into a linear system as in \eqref{ch5:eq:poissonsystem} yields a very sparse matrix $\mymat{A}$ when $N$ is large. Iterative methods\index{iterative method} for solving large sparse linear systems find broad use in scientific applications, because they require only an implementation of the matrix-vector product, and they often use a limited amount of additional memory. Comprehensive introductions to iterative methods may be found in any of~\cite{ch5:Saad2003,ch5:Kelley1995,ch5:Barrett1994}. One benefit of the high abstraction level and the template-based library design is to allow developers to implement their own components, such as iterative methods for solving sparse linear systems. The library includes three popular iterative methods: conjugate gradient, defect correction\index{defect correction}, and geometric multigrid. The conjugate gradient method is applicable only to systems with symmetric positive definite matrices. This is true for the two-dimensional Poisson problem, when it is discretized with a five-point finite difference stencil, because then there will be no off-centered approximations near the boundary. For high-order approximations ($\alpha>1$), we use the defect correction method with multigrid preconditioning. See e.g., \cite{ch5:Trottenberg2001} for details on multigrid methods. diff --git a/BookGPU/Chapters/chapter6/PartieAsync.tex b/BookGPU/Chapters/chapter6/PartieAsync.tex index 0253c9c..9edbe8d 100644 --- a/BookGPU/Chapters/chapter6/PartieAsync.tex +++ b/BookGPU/Chapters/chapter6/PartieAsync.tex @@ -6,7 +6,7 @@ In the previous section, we have seen how to efficiently implement overlap of computations (CPU and GPU) with communications (GPU transfers and internode communications). However, we have previously shown that for some parallel iterative algorithms, it is sometimes even more efficient to use an asynchronous -scheme of iterations\index{iterations asynchronous} \cite{HPCS2002,ParCo05,Para10}. In that case, the nodes do +scheme of iterations\index{asynchronous iterations} \cite{HPCS2002,ParCo05,Para10}. In that case, the nodes do not wait for each other but they perform their iterations using the last external data they have received from the other nodes, even if this data was produced \emph{before} the previous iteration on the other nodes. diff --git a/BookGPU/Chapters/chapter6/PartieSync.tex b/BookGPU/Chapters/chapter6/PartieSync.tex index bc08557..d8d281c 100755 --- a/BookGPU/Chapters/chapter6/PartieSync.tex +++ b/BookGPU/Chapters/chapter6/PartieSync.tex @@ -210,7 +210,7 @@ achieved serially and not overlapped. When CPU/GPU data transfers are not negligible compared to GPU computations, it can be interesting to overlap internode CPU computations with a \emph{GPU - sequence}\index{GPU sequence} including CPU/GPU data transfers and GPU computations (see + sequence}\index{GPU!sequence} including CPU/GPU data transfers and GPU computations (see \Fig{fig:ch6p1overlapseqsequence}). Algorithmic issues of this approach are basic, but their implementation requires explicit CPU multithreading and synchronization, and CPU data buffer duplication. We need to implement two @@ -367,7 +367,7 @@ of the code. \Lst{algo:ch6p1overlapstreamsequence} introduces the generic MPI+OpenMP+CUDA code, explicitly overlapping MPI communications with -streamed GPU sequences\index{GPU sequence!streamed}. +streamed GPU sequences\index{GPU!streamed sequence}. %\begin{algorithm} % \caption{Generic scheme explicitly overlapping MPI communications with streamed sequences of CUDA diff --git a/BookGPU/Chapters/chapter8/ch8.tex b/BookGPU/Chapters/chapter8/ch8.tex index e0b7ddd..604ce40 100644 --- a/BookGPU/Chapters/chapter8/ch8.tex +++ b/BookGPU/Chapters/chapter8/ch8.tex @@ -11,7 +11,7 @@ In practice, a wide range of problems can be modeled as NP-hard combinatorial op Although this bounding mechanism allows the considerable reduction of the exploration time, often only small or moderatelysized instances of COPs can be practically solved. For this reason, over the last decades, parallel computing has been revealed as an attractive way to deal with larger instances of COPs. However, while many contributions have been proposed for parallel B\&B methods using massively parallel processors \cite{ch8:Allen_1997}, networks or clusters of workstations \cite{ch8:Quinn_1990}, and Shared Memory Multiprocessors (SMP) machines \cite{ch8:Casadoa_2008}, very few contributions have been proposed for redesigning B\&B algorithms on Graphical Processing Units (GPUs) \cite{ch8:Carneiro_2011}. For years, the use of GPU accelerators was limited to graphics and video applications. Driven by the demand for high-definition 3D graphics on personal computers, GPUs have evolved into a highly parallel, multithreaded and many-core environment. Their utilization has recently been extended to other application domains such as scientific computing \cite{ch8:Kurzak_2010}.\\ -In this chapter, we rethink the design and implementation of irregular tree-based algorithms such as the B\&B algorithm on top of GPUs. During the execution of the B\&B algorithm, the number of newly generated nodes and the number of not yet explored but promising nodes are variable and depend on the level of the tree being explored and on the best solution found so far. Therefore, due to such unstructured and unpredictable nature of its search tree, designing efficient B\&B on top of GPUs is not straightforward. We investigate two different approaches for designing GPU-based B\&B starting from the parallel models for B\&B identified in \cite{ch8:MelabHDR_2005}. The first one is based on the ``parallel tree exploration'' paradigm. This approach consists of exploring in parallel different subspaces of the tree. The second approach is based on the ``parallel evaluation of bounds'' approach. The two approaches have been applied to the permutation Flowshop Scheduling Problem \index{Flowshop Scheduling Problem} (FSP; see Section~\ref{ch8:BB-FSP}) which is an NP-hard combinatorial optimization problem. The lower bound function used in this work for FSP is the one proposed in~\cite{ch8:Johnson_1954} for two machines and generalized in~\cite{ch8:Lenstra_1978} to more than two machines.\\ +In this chapter, we rethink the design and implementation of irregular tree-based algorithms such as the B\&B algorithm on top of GPUs. During the execution of the B\&B algorithm, the number of newly generated nodes and the number of not yet explored but promising nodes are variable and depend on the level of the tree being explored and on the best solution found so far. Therefore, due to such unstructured and unpredictable nature of its search tree, designing efficient B\&B on top of GPUs is not straightforward. We investigate two different approaches for designing GPU-based B\&B starting from the parallel models for B\&B identified in \cite{ch8:MelabHDR_2005}. The first one is based on the ``parallel tree exploration'' paradigm. This approach consists of exploring in parallel different subspaces of the tree. The second approach is based on the ``parallel evaluation of bounds'' approach. The two approaches have been applied to the permutation Flowshop Scheduling Problem \index{flowshop scheduling problem} (FSP; see Section~\ref{ch8:BB-FSP}) which is an NP-hard combinatorial optimization problem. The lower bound function used in this work for FSP is the one proposed in~\cite{ch8:Johnson_1954} for two machines and generalized in~\cite{ch8:Lenstra_1978} to more than two machines.\\ When rethinking those two parallel models for GPU's architectures, our main focus was on the lower bound function. Indeed, preliminary experiments we carried out on some of Taillard's problem instances \cite{ch8:Taillard_1993} show that computing the lower bounds takes on average between 98\% and 99\% of the total execution time of the B\&B. The GPU-based lower bound's implementation raises mainly two challenges. On the one hand, having in mind that the execution model of GPUs is Single Instruction Multiple Data (SIMD), irregular computations (containing loops and conditional instructions) contained in the lower bound function may lead to a very challenging issue: the thread or branch divergence. This problem drops down the performance and arises when threads of a same warp (the smallest executable unit of parallelism on the GPU) execute different data-dependent instructions. On the other hand, the lower bound computation usually uses large and frequently accessed data structures. Since GPU is a many-core coprocessor device that provides a hierarchy of memories having different sizes and access latencies, the placement and sharing of these data sets become challenging.\\ @@ -22,7 +22,7 @@ The scope of this chapter is to design parallel B\&B algorithms on GPU accelerat The chapter is organized into seven main sections. Section \ref{ch8:BB} presents the B\&B algorithm. Section \ref{ch8:Parallel-BB} introduces the different models used to parallelize B\&B algorithms. Section \ref{ch8:BB-FSP} briefly describes the Flowshop Scheduling permutation Problem. In Section~\ref{ch8:approach1}, we describe the GPU-accelerated B\&B based on the parallel tree exploration. In Section~\ref{ch8:approach2}, details about the second approach, the GPU-accelerated B\&B based on the parallel evaluation of lower bounds, are given. In Section \ref{ch8:ThreadDivergence}, the thread divergence issue related to the location of nodes in the B\&B tree and to the control flow instructions within the bounding operator is described. In Section \ref{ch8:DataAccessOpt}, the memory access optimization challenge is addressed and an overview of the GPU memory hierarchy and the used memory access pattern is given. In Section~\ref{ch8:Experiments}, we report experimental results showing the performances of each of two studied approaches compared to a sequential CPU-based execution of the B\&B and demonstrating the efficiency of the proposed optimizations. -\section{Branch-and-bound \index{Branch-and-bound} algorithm} +\section{Branch-and-bound \index{branch-and-bound} algorithm} \label{ch8:BB} Branch-and-bound algorithms are by far the most widely used methods for exactly solving large scale NP-hard combinatorial optimization problems. Indeed, they allow the finding of the optimal solution of a problem with proof of optimality. \\ @@ -151,7 +151,7 @@ Figure~\ref{flow-shop} illustrates a solution of a flow-shop problem instance de \end{figure} -\subsection{Lower bound \index{Lower bound} for the flowshop scheduling problem} +\subsection{Lower bound \index{lower bound} for the flowshop scheduling problem} The lower bounding technique provides a lower bound (LB) for each subproblem generated by the branching operator. The more the bound is accurate, the more it allows the elimination from the search tree that are not promising. Therefore, the efficiency of a B\&B algorithm depends strongly on the quality of its lower bound function. In this chapter, we use the lower bound proposed by Lenstra et al.~\cite{ch8:Lenstra_1978} for FSP, based on the Johnson's algorithm~\cite{ch8:Johnson_1954}.\\ @@ -233,7 +233,7 @@ In the following, we present how we dealt with the thread/branch divergence issu \subsection{The thread divergence issue} -During the execution of an application on GPU, one or more thread block(s) are assigned to each GPU multiprocessor to execute. Those threads are partitioned into warps that get scheduled for execution. For each instruction of the flow, the multiprocessor selects a warp that is ready to be run. A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. In this chapter, the G80 model, in which a warp is a pool of 32 threads, is used. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken. Threads that are not on the taken path are disabled, and when all paths are complete, the threads converge back to the same execution path. This phenomenon is called thread/branch divergence\index{Thread divergence} and often causes serious performance degradations. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.\\ +During the execution of an application on GPU, one or more thread block(s) are assigned to each GPU multiprocessor to execute. Those threads are partitioned into warps that get scheduled for execution. For each instruction of the flow, the multiprocessor selects a warp that is ready to be run. A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. In this chapter, the G80 model, in which a warp is a pool of 32 threads, is used. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken. Threads that are not on the taken path are disabled, and when all paths are complete, the threads converge back to the same execution path. This phenomenon is called thread/branch divergence\index{GPU!thread divergence} and often causes serious performance degradations. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.\\ This section discusses thread divergence issues encountered when computing the bounds by GPU. The thread divergence occurs for two main reasons, namely, the locations of nodes in the search tree and the control flow instructions within the bounding operator. \\ @@ -402,7 +402,7 @@ The same transformations as those applied for the first scenario are applied her \section{Memory access optimization} \label{ch8:DataAccessOpt} -Memory access optimizations \index{Memory access optimizations} are by far the most studied area for improving GPU-based application performances. Indeed, adjusting the pattern of accesses to the GPU device memory allows programmers to further improve the throughput of many high-performance CUDA applications. The goal of memory access optimizations is generally to use as much fast-access memory and as little slow-access memory as possible. This section discusses how best to set up data LB items on the various kinds of memory on the device. \\ +Memory access optimizations \index{memory access optimizations} are by far the most studied area for improving GPU-based application performances. Indeed, adjusting the pattern of accesses to the GPU device memory allows programmers to further improve the throughput of many high-performance CUDA applications. The goal of memory access optimizations is generally to use as much fast-access memory and as little slow-access memory as possible. This section discusses how best to set up data LB items on the various kinds of memory on the device. \\ CUDA-enabled devices use several memory spaces, which have different characteristics in term of sizes and access latencies. These memory spaces include global memory, local memory , shared memory, texture memory , and registers. Devices of compute capability 2.0 also have an L1/L2 cache hierarchy that is used to cache local and global memory accesses. diff --git a/BookGPU/Chapters/chapter9/ch9.tex b/BookGPU/Chapters/chapter9/ch9.tex index caf17c4..d10daa6 100644 --- a/BookGPU/Chapters/chapter9/ch9.tex +++ b/BookGPU/Chapters/chapter9/ch9.tex @@ -10,22 +10,22 @@ \label{chapter9} \section{Introduction} This chapter presents GPU-based parallel -metaheuristics\index{Metaheuristics!parallel~metaheuristics}, +metaheuristics\index{metaheuristics!parallel metaheuristics}, challenges, and issues related to the particularities of the GPU architecture and a synthesis on the different implementation strategies used in the literature. The implementation of parallel -metaheuristics\index{Metaheuristics!parallel~metaheuristics} on +metaheuristics on GPUs is not straightforward. The traditional models used in CPUs must be rethought to meet the new requirements of GPU architectures. This chapter is organized as follows. Combinatorial -optimization\index{Combinatorial~optimization} and resolution +optimization\index{combinatorial optimization} and resolution methods are introduced in Section~\ref{ch8:sec:optim}. The main traditional parallel models used for metaheuristics are recalled in Section~\ref{ch8:sec:paraMeta}. Section~\ref{ch8:sec:challenges} highlights the main challenges related to the GPU implementation of metaheuristics. A state-of-the-art of GPU-based parallel -metaheuristics\index{Metaheuristics!parallel~metaheuristics} is +metaheuristics is summarized in Section~\ref{ch8:sec:state}. In Section~\ref{ch8:sec:frameworks}, the main developed GPU-based frameworks for metaheuristics are described. Finally, a case study is presented in Section~\ref{ch8:sec:case} and some concluding @@ -34,7 +34,7 @@ remarks are given in Section~\ref{ch8:conclusion} \section{Combinatorial optimization} \label{ch8:sec:optim} -Combinatorial optimization\index{Combinatorial~optimization} (CO) is a branch of applied and discrete mathematics. +Combinatorial optimization (CO) is a branch of applied and discrete mathematics. It consists in finding optimal configuration(s) among a finite set of possible configurations (or solutions) of a given combinatorial optimization problem (COP). The set of all possible solutions noted $S$ is called solution space or search space. Each solution in $S$ is defined by its real cost calculated by an objective function. COPs are generally defined as follows~\cite{blumMeta}:\\ %(Definition~\ref{def:cops}) @@ -86,9 +86,9 @@ process starts with a single solution (generally set at random) and iteratively improves it by exploring its neighborhood in the search space. The most known methods in this class are local search methods that include \emph{simulated -annealing}\index{Metaheuristics!simulated~annealing}~\cite{Kirkpatrick1983SA}, +annealing}\index{metaheuristics!simulated annealing}~\cite{Kirkpatrick1983SA}, \emph{tabu search}~\cite{Glover1989TS}, \emph{iterated local -search\index{Metaheuristics!iterated local +search\index{metaheuristics!iterated local search}}~\cite{stutzle2006ILSforQAP}, and \emph{variable neighborhood search}~\cite{HansenMladenovic1997VNS}.\\ @@ -129,8 +129,8 @@ high-performance computing. \end{itemize} From the granularity of parallelism point of view, three major parallel -models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{Metaheuristics!algorithmic-level parallelism}, -\emph{iteration-level}\index{Metaheuristics!iteration-level parallelism}, and \emph{solution-level} as illustrated in Figure~\ref{ch8:fig:paraMeta}. \\ +models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{metaheuristics!algorithmic-level parallelism}, +\emph{iteration-level}\index{metaheuristics!iteration-level parallelism}, and \emph{solution-level} as illustrated in Figure~\ref{ch8:fig:paraMeta}. \\ \begin{figure}[h!] \centerline{\includegraphics[width=0.6\textwidth]{Chapters/chapter9/figures/paraMeta.pdf}} @@ -141,53 +141,50 @@ models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algo \begin{itemize} \item{In the -algorithmic-level\index{Metaheuristics!algorithmic-level -parallelism} parallel model, several self-contained metaheuristics +algorithmic-level parallel model, several self-contained metaheuristics are launched in parallel. The parallel -metaheuristics\index{Metaheuristics!parallel~metaheuristics} may +metaheuristics\index{metaheuristics!parallel metaheuristics} may start with identical or different solutions (s-metaheuristics case) or populations (p-metaheuristics case). Their parameter settings such as the size of tabu list for tabu -search\index{Metaheuristics!tabu~search}, transition probabilities +search\index{metaheuristics!tabu search}, transition probabilities for ant colonies, mutation and crossover probabilities for evolutionary -algorithm\index{Metaheuristics!evolutionary~algorithm}s may be the +algorithm\index{metaheuristics!evolutionary algorithm}s may be the same or different. The parallel processes may evolve independently or in a cooperative manner. In cooperative parallel models, the algorithms exchange information related to the search during evolution in order to find better and more robust solutions.} -\item{In the iteration-level\index{Metaheuristics!iteration-level -parallelism} parallel model, the focus is on the parallelization +\item{In the iteration-level parallel model, the focus is on the parallelization of each iteration of the metaheuristic. Indeed, metaheuristics are generally iterative search processes. Moreover, the most resource-consuming part of a metaheuristic is the evaluation of the generated solutions at each iteration. For s-metaheuristics -(e.g., tabu search\index{Metaheuristics!tabu~search}, simulated +(e.g., tabu search\index{metaheuristics!tabu search}, simulated annealing, variable neighborhood search), the evaluation and generation of the neighborhood is the most time-consuming step of the algorithm particularly when it comes to dealing with large neighborhood sets. In this parallel model, the neighborhood is decomposed into partitions, and each partition is evaluated in a parallel way. For p-metaheuristics (e.g., evolutionary -algorithm\index{Metaheuristics!evolutionary~algorithm}s, ant +algorithms, ant colonies, swarm optimization), the -iteration-level\index{Metaheuristics!iteration-level parallelism} +iteration-level parallel model arises naturally since these metaheuristics deal with a population of independent solutions. In evolutionary -algorithm\index{Metaheuristics!evolutionary~algorithm}s, for -instance, the iteration-level\index{Metaheuristics!iteration-level -parallelism} model consists of decomposing the whole population +algorithms, for +instance, the iteration-level model consists of decomposing the whole population into several partitions where each partition is evaluated in parallel.} \item{In the -solution-level\index{Metaheuristics!solution-level~parallelism} +solution-level parallel model, the focus is on the parallelization of the evaluation of a single solution. This model is useful when the objective function and/or the constraints are time and/or memory consuming. Unlike the two previous parallel models, the -solution-level\index{Metaheuristics!solution-level~parallelism} +solution-level\index{metaheuristics!solution-level parallelism} parallel model is problem-dependent.} \end{itemize} @@ -195,7 +192,7 @@ parallel model is problem-dependent.} \label{ch8:sec:challenges} Developing GPU-based parallel -metaheuristics\index{Metaheuristics!parallel~metaheuristics} is +metaheuristics\index{metaheuristics!parallel metaheuristics} is not straightforward. The parallel models have to be rethought to meet the new requirements of the GPU architecture. Several major issues have to be taken into account both at design and @@ -206,7 +203,7 @@ of tasks, and data transfer between the CPU and GPU~\cite{luongMultiStart}. \subsection{Data placement on a hierarchical memory} -\index{GPU Challenges!data~placement} During the execution of +\index{GPU!data placement} During the execution of metaheuristics on GPU, the different threads may access multiple data structures from multiple memory spaces. These memories have different sizes and access latencies. Nevertheless, faster @@ -229,7 +226,7 @@ threads and the corresponding metaheuristic elements (one neighbor per thread, one individual per thread, single population per threads block, one ant per thread, etc.) must be defined to ensure a maximum occupancy of the GPU and -to cover CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication} and memory access times.\\ +to cover CPU/GPU communication\index{GPU!CPU/GPU communication} and memory access times.\\ According to the used metaheuristic and to the handled problem, the data values may have different types and different ranges of their values. The data @@ -256,7 +253,7 @@ data structures (e.g., population of individuals for a CUDA thread block) on the shared memory. \subsection{Threads synchronization} -\index{GPU Challenges!threads~synchronization} The thread +\index{GPU!threads synchronization} The thread synchronization issue is caused by both the GPU architecture and the synchronization requirements of the implemented method. Indeed, GPUs are based on a multicore architecture organized into @@ -277,7 +274,7 @@ some requirements related to data-dependent synchronizations. \subsection{Thread divergence} -Thread divergence\index{GPU Challenges!thread~divergence} is +Thread divergence\index{GPU!thread divergence} is another challenging issue in GPU-based metaheuristics~\cite{cecilia, pugace, audreyANT}. Generally, metaheuristics contain irregular loops and conditional @@ -285,9 +282,9 @@ instructions when generating and evaluating the neighborhood (s-metaheuristics), and the population (p-metaheuristics) in the same block. In addition, the decision to apply a crossover or a mutation on an individual in a genetic -algorithm\index{Metaheuristics!genetic~algorithm} and the +algorithm and the exploration of different paths using an ant -colony\index{Metaheuristics!ant~colony~optimization} are random +colony\index{metaheuristics!ant colony optimization} are random operations. Threads of the same warp have to execute instructions simultaneously leading to different branches whereas in an SIMD model the threads of a same warp execute the same @@ -302,15 +299,14 @@ divergences. The performance of GPU-based metaheuristics in terms of execution time could be improved by choosing the most appropriate parallel -model (algorithmic-level\index{Metaheuristics!algorithmic-level -parallelism}, instruction-level, -solution-level\index{Metaheuristics!solution-level~parallelism}). +model (algorithmic-level, instruction-level, +solution-level). Moreover, an efficient decomposition of the metaheuristic and an efficient assignment of code portions between the CPU and GPU should be adopted. The objective is to take benefit from the GPU computing power without affecting the efficiency and the behavior of the metaheuristic and without losing performance in CPU/GPU -communication\index{GPU Challenges!CPU/GPU~communication} and +communication\index{GPU!CPU/GPU communication} and memory accesses. In order to decide which part of the metaheuristic will be executed on which component, one should perform a careful analysis on the serial code of the @@ -319,8 +315,7 @@ exploration of the neighborhood, evaluation of individuals), and then offload them to the GPU, while the remaining tasks still run on the CPU in a serial way. \\ -The CPU/GPU communication\index{GPU -Challenges!CPU/GPU~communication} is done through the global +The CPU/GPU communication is done through the global memory which is a slow memory making the memory transfer between the CPU and GPU time-consuming which can significantly degrade the performance of the application. Accesses to this memory should be @@ -331,19 +326,19 @@ therefore, the whole execution time of the metaheuristic. \section{State-of-the-art parallel metaheuristics on GPUs} \label{ch8:sec:state} After more than two decades of research by the combinatorial optimisation -community devoted to developing adequate parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} for different types of +community devoted to developing adequate parallel metaheuristics for different types of parallel architectures (clusters, supercomputers and grids), the actual developement -of General Perpose GPU (GPGPU) brings new challenges for parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} on SIMD architectures.\\ +of General Perpose GPU (GPGPU) brings new challenges for parallel metaheuristics on SIMD architectures.\\ The first works on metaheuristic algorithms implemented on GPUs started on old graphics cards before the appearance of modern GPUs equipped with high-level programming interfaces such as CUDA and OpenCL. Among these pioneering works we cite the work of Wong et al.~\cite{wongOldGPU2006} dealing with the implementation -of EAs on graphics processing cards and the work by Catala et al. in~\cite{catala2007} where the ACO\index{Metaheuristics!ant~colony~optimization} algorithm +of EAs on graphics processing cards and the work by Catala et al. in~\cite{catala2007} where the ACO\index{metaheuristics!ant colony optimization} algorithm is implemented on old GPU architectures. Yu et al.~\cite{yu2005} and Li et al.~\cite{li2007} proposed a full parallelization of genetic -algorithm\index{Metaheuristics!genetic~algorithm}s on old GPU architectures using +algorithms on old GPU architectures using shader libraries based on Direct3D and OpenGL.\\ Such architectures are based on preconfigured pipelined stages @@ -398,12 +393,12 @@ efficiently map the different neighborhoods on the device memory, more explicitly, how to calculate the memory index of each solution associated to each CUDA thread's \textit{id}. %For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes in the candidate vector. -The three neighborhoods are implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{Metaheuristics!tabu~search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes.\\ % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\ +The three neighborhoods are implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{metaheuristics!tabu search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes.\\ % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\ In the same context, Deevacq et al.~\cite{audreyANT} proposed two parallelization strategies inspired by the multi-walk parallelization strategy, of a 3-opt iterated local -search\index{Metaheuristics!iterated local search} algorithm (ILS) +search algorithm (ILS) over a CPU/GPU architecture. In the first strategy, each Local Search (LS) is associated to a unique CUDA thread and improves a unique solution by generating its neighborhood. The second @@ -420,45 +415,40 @@ the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cit The same strategy is used by Luong et al. in~\cite{luongMultiStart} to implement multistart parallel local search algorithms (a special case of the -algorithmic-level\index{Metaheuristics!algorithmic-level -parallelism} parallel model where several homogeneous LS +algorithmic-level parallel model where several homogeneous LS algorithms are used). The multistart model is combined with -iteration-level\index{Metaheuristics!iteration-level parallelism} +iteration-level parallelism: several LS algorithms are managed by the CPU and the neighborhood evaluation step of each algorithm is parallelized on the GPU (each GPU thread is associated with one neighbor and executes the same evaluation function kernel). The advantage of such a model is that it allows a high occupancy of the GPU -threads. Nevertheless, memory management\index{GPU -Challenges!memory~management} causes new issues due to the +threads. Nevertheless, memory management causes new issues due to the quantity of data to store and to communicate between CPU and GPU. A second proposition for implementing the same model on GPU consists of implementing the whole LS processes on GPU with each GPU thread being associated to a unique LS algorithm. This solves the communication issue encountered in the first model. In -addition, a memory management\index{GPU -Challenges!memory~management} strategy is proposed to improve the +addition, a memory management strategy is proposed to improve the efficiency of the -algorithmic-level\index{Metaheuristics!algorithmic-level -parallelism} model: texture memory is used to avoid memory latency +algorithmic-level model: texture memory is used to avoid memory latency due to uncoalesced memory accesses. The proposed approaches are implemented on the quadratic assignment problem (QAP) using CUDA. The acceleration rates obtained for the -algorithmic-level\index{Metaheuristics!algorithmic-level -parallelism} with usage of texture memory rise from $7.8\times$ to +algorithmic-level with usage of texture memory rise from $7.8\times$ to $12\times$ (for different QAP benchmark sizes). \\ Janiak et al.~\cite{Janiak_et_al_2008} implemented two algorithms for TSP and the flow-shop scheduling problem (FSP). These algorithms are based on a multistart tabu -search\index{Metaheuristics!tabu~search} model. Both of the +search model. Both of the algorithms exploit multicore CPU and GPU. A full parallelization on GPU is adopted using shader libraries where each thread is -mapped with one tabu search\index{Metaheuristics!tabu~search}. +mapped with one tabu search. However, even though their experiments report that the use of GPU speedups the serial execution almost $16 \times$, the mapping of -one thread with one tabu search\index{Metaheuristics!tabu~search} +one thread with one tabu search requires a large number of local search algorithms to cover the memory access latency. The same mapping policy is adopted by Zhu et al. in~\cite{zhu_et_al_2008} (one thread is associated to one local search) solving the quadratic assignment problem but using the CUDA toolkit instead of shader libraries.\\ @@ -468,15 +458,13 @@ architectures (multicore CPU coupled to one GPU). The challenge of using such a heterogeneous architecture is how to distribute tasks between the CPU cores and the GPU in such a way to have optimal performances. Among the three traditional parallel models -(solution-level\index{Metaheuristics!solution-level~parallelism}, -iteration-level\index{Metaheuristics!iteration-level parallelism}, -and algorithmic-level\index{Metaheuristics!algorithmic-level -parallelism}), the authors point out that the most convenient +(solution-level, +iteration-level +and algorithmic-level), the authors point out that the most convenient model for the considered heterogeneous architecture is a hybrid model combining -iteration-level\index{Metaheuristics!iteration-level parallelism} -and algorithmic-level\index{Metaheuristics!algorithmic-level -parallelism} models. Several CPU threads execute several instances +iteration-level +and algorithmic-level models. Several CPU threads execute several instances of the same S-metaheuristic in parallel while the GPU device is associated to one CPU core and used to accelerate the neighborhood calculation of several S-metaheuristics at the same @@ -488,7 +476,7 @@ All the previously noted works exploit the same parallel models used on CPUs based on the task parallelism. A different implementation approach is proposed by Paul in~\cite{gerald2012} to implement a simulated -annealing\index{Metaheuristics!simulated~annealing} (SA) algorithm +annealing (SA) algorithm for the QAP on GPUs. Indeed, the author used a preinitialized matrix \emph{delta} in which the incremental evaluation of simple swap moves are calculated and stored relative to the initial @@ -522,10 +510,10 @@ p-metaheuristics over different types of parallel architectures: supercomputers, clusters, and computational grids. Three main classes of p-metaheuristics are considered in this section: evolutionary -algorithm\index{Metaheuristics!evolutionary~algorithm}s (EAs), ant -colony\index{Metaheuristics!ant~colony~optimization} optimization -(ACO\index{Metaheuristics!ant~colony~optimization}), and particle -swarm optimization\index{Metaheuristics!particle swarm +algorithms (EAs), ant +colony optimization +(ACO), and particle +swarm optimization\index{metaheuristics!particle swarm optimization} (PSO). \subsubsection*{Evolutionary algorithms} @@ -543,7 +531,7 @@ optimization problems. The main chalenges to be raised when implementing the tra In~\cite{kannan}, Kannan and Ganji present a CUDA implementation of the drug discovery application Autodock (molecular docking application). Autodock uses a genetic -algorithm\index{Metaheuristics!genetic~algorithm} to find optimal +algorithm to find optimal docking positions of a ligand to a protein. The most time-consuming task in Autodock is the fitness function evaluation. The fitness function used for a docking problem @@ -570,9 +558,9 @@ C-like metalanguage for easy development of EAs. The user writes a description of the problem-specific components (fitness function, problem representation, etc) in EASEA. The code is then compiled to obtain a ready-to-use evolutionary -algorithm\index{Metaheuristics!evolutionary~algorithm}. The EASEA +algorithm. The EASEA compiler uses genetic -algorithm\index{Metaheuristics!genetic~algorithm} LIB and EO +algorithm LIB and EO Libraries to produce C++ or JAVA written EA codes. In~\cite{maitre2009}, the authors proposed an extension of EASEA to produce CUDA code from the EASEA files. This extension has been @@ -583,7 +571,7 @@ during the experiments: a benchmark mathematical function and a real problem (molecular structure prediction). In order to maximize the GPU occupation, very large populations are used (from $2000$ to $20000$). Even though transferring such large -populations from the CPU to the GPU device memory at every generation is very costly, the authors report important speedups on the two problems on a GTX260 card: $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management\index{GPU Challenges!memory~management} of the populations on GPU.\\ +populations from the CPU to the GPU device memory at every generation is very costly, the authors report important speedups on the two problems on a GTX260 card: $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management of the populations on GPU.\\ The master-slave model is efficient when the fitness function is highly time intensive. Nevertheless, it requires the use of @@ -596,11 +584,11 @@ model).\\ The coarse-grained model is used by Pospichal et al. in~\cite{pospichal10} to implement a parallel genetic -algorithm\index{Metaheuristics!genetic~algorithm} on GPU. In this +algorithm on GPU. In this work the entire genetic -algorithm\index{Metaheuristics!genetic~algorithm} is implemented +algorithm is implemented on GPU. This choice is motivated by the overhead engendered by the -CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication} +CPU/GPU communication when only population evaluation is performed on GPU. Each population island is mapped with a CUDA thread block and each thread is responsible for a unique individual. Subpopulations are @@ -612,7 +600,7 @@ asynchronously through the global memory. That is, after a given number of gener The same strategy is also adopted by Tsutsui and Fujimoto in~\cite{tsutsuiGAQAP} to implement a coarse-grained genetic -algorithm\index{Metaheuristics!genetic~algorithm} on GPU to solve +algorithm on GPU to solve the QAP. Initially, several subpopulations are created on CPU and transferred to the global memory. The subpopulations are organized in the global memory into blocks of $8$ individuals in such a way @@ -625,11 +613,11 @@ through the global memory. Experiments are performed on standard QAP benchmarks from the QAPLIB~\cite{burkard1991qaplib}. The GPU implementation reached speedups of $2.9\times$ to $12.6 \times$ compared to a single core implementation of a coarse-grained -genetic algorithm\index{Metaheuristics!genetic~algorithm} on a +genetic algorithm on a Intel i7 processor.\\ Nowotniak and Kucharski~\cite{nowotniak} proposed a GPU-based -implementation of a Quantum Inspired Genetic Algorithm\index{Metaheuristics!genetic~algorithm} (QIGA). The +implementation of a Quantum Inspired Genetic Algorithm (QIGA). The parallel model used is a hierarchical model based on two levels: each thread in a block transforms a unique individual and a different population is assigned to each block. The algorithm is run @@ -653,7 +641,7 @@ to the SIMD architecture of the GPU. \\ Pinel et al. in~\cite{pinel2012JPDC} developed a highly parallel synchronous cellular genetic -algorithm\index{Metaheuristics!genetic~algorithm} (CGA), called +algorithm (CGA), called GraphCell, to solve the independent task scheduling problem on GPU architectures. In CGAs, the population is arranged into a two-dimensional toroidal grid where only neighboring solutions are @@ -695,7 +683,7 @@ version for all the tested benchmarks when the size of the population is set to $32^2$. When the size of the population is too small, there are not enough computations to cover the overhead created by the call of kernel functions, CPU/GPU -communication\index{GPU Challenges!CPU/GPU~communication}s, +communications, synchronization, and access to global memory. Finally, an interesting review on GPU parallel computation in bio-inspired algorithms is proposed by Arenas et al. @@ -704,17 +692,17 @@ in~\cite{arenas2011}. \\ \subsubsection*{Ant colony optimization} Ant colony optimization -(ACO\index{Metaheuristics!ant~colony~optimization}) is another +(ACO) is another p-metaheuristic subject to parallelization on GPUs. State-of-the-art works on parallelizing -ACO\index{Metaheuristics!ant~colony~optimization} focus on +ACO focus on accelerating the tour construction step performed by each ant by taking a task-based parallelism approach, with pheromone deposition on the CPU.\\ In~\cite{cecilia}, Cecilia et al. present a GPU-based implementation of -ACO\index{Metaheuristics!ant~colony~optimization} for TSP where +ACO for TSP where the two steps (tour construction and pheromone update) are parallelized on the GPU. A data parallelism approach is used to enhance the performance of the tour construction step. The @@ -736,28 +724,27 @@ construction and $20 \times$ for pheromone updates.\\ In another work, Tsutsui and Fujimoto~\cite{tsutsui} propose a hybrid algorithm combining -ACO\index{Metaheuristics!ant~colony~optimization} metaheuristic -and Tabu Search (TS)\index{Metaheuristics!tabu~search} implemented +ACO metaheuristic +and Tabu Search (TS) implemented on GPU to solve the QAP. A solution of QAP is represented as a permutation of ${1,2,..,n}$ with $n$ being the size of the problem. The TS algorithm is based on the 2-opt neighborhood (swapping of two elements $(i,j)$ in the permutation). The authors point out that the move cost of each neighbor depends on the couple $(i,j)$. Two groups of moves are formed according to the -move cost. In order to avoid thread divergence\index{GPU -Challenges!thread~divergence} within the same warp, the +move cost. In order to avoid thread divergence\index{GPU!thread divergence} within the same warp, the neighborhood evaluation is parallelized in such a way to assign only moves of the same cost to each thread warp. This strategy is called MATA for Move-cost Adjusted Thread Assignment. Concerning -the memory management\index{GPU Challenges!memory~management}, all -the data of the ACO\index{Metaheuristics! ant~colony~optimization} +the memory management\index{GPU!memory management}, all +the data of the ACO\index{metaheuristics!ant colony optimization} (population, pheromone matrix), QAP matrices, and tabu list are placed on the global memory of the GPU. Shared memory is used only for working data common to all threads in a given block. All the steps of the hybrid algorithm -ACO\index{Metaheuristics!ant~colony~optimization}-TS -(ACO\index{Metaheuristics!ant~colony~optimization} initialization, +ACO-TS +(ACO initialization, pheromone update, construct solutions, applying TS) are implemented as kernel functions on the GPU. The GPU/CPU communications are only used to transfer the best-so-far solution @@ -822,11 +809,10 @@ based on 2 levels: design level and implementation level as illustrated in Figure~\ref{ch8:fig:classification}. The design level regroups the three classes of parallel models used in metaheuristics -(solution-level\index{Metaheuristics!solution-level~parallelism}, -iteration-level\index{Metaheuristics!iteration-level parallelism}, -algorithmic-level\index{Metaheuristics!algorithmic-level -parallelism}) with examples for s-metaheuristics, EAs, -ACO\index{Metaheuristics!ant~colony~optimization} and PCO. This +(solution-level, +iteration-level, +algorithmic-level) with examples for s-metaheuristics, EAs, +ACO and PCO. This classification is principally built from the reviewed state-of-the-art works in the previous section. The implementation level refers to the way these parallel design @@ -845,7 +831,7 @@ are explained in the following sections.\\ \subsubsection*{GPU thread mapping for solution-level parallelism} -\index{GPU-based Metaheuristics!GPU-thread mapping} Parallel +\index{GPU!thread mapping} Parallel models at solution level consist of parallelizing a time intensive atomic task of the algorithm. Generally, it consists of the fitness evaluation~\cite{kannan}. Nevertheless, crossover @@ -861,14 +847,14 @@ require the use of very large neighborhoods or populations.\\ %data parallelism in SA-matrix to parallelize \subsubsection*{GPU thread mapping for iteration-level parallelism} -\index{GPU-based Metaheuristics!GPU-thread mapping} +\index{GPU!thread mapping} Iteration-level parallelism consists of parallelizing the tasks performed independently on different solutions. Different mapping strategies are adopted in the reviewed works to implement these models.\\ In Figure \ref{ch8:fig:classification}, the first example of -iteration-level\index{Metaheuristics!iteration-level parallelism} +iteration-level parallelism is the parallel evaluation of neighborhoods in s-metaheuristics. In most of the reviewed works, a per-thread mapping approach is used: each solution of the neighborhood is @@ -889,8 +875,7 @@ memory. %pheromone update data parallelism matrices \subsubsection*{GPU thread mapping for algorithmic-level parallelism} -\index{GPU-based Metaheuristics!GPU-thread mapping} -Algorithmic-level parallelism consists of launching several self-contained algorithms in parallel. In the previously reviewed works two algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} models have been used: the multistart model and the island model (parallel EAs).\\ +Algorithmic-level parallelism consists of launching several self-contained algorithms in parallel. In the previously reviewed works two algorithmic-level models have been used: the multistart model and the island model (parallel EAs).\\ The implementation of the multistart model is based on two different mapping strategies~\cite{luongMultiStart, audreyANT}: @@ -906,9 +891,8 @@ neighborhoods. In the second mapping strategy, the LS algorithms are placed on CPU and the neighborhood evaluation of each LS is parallelized on GPU using per-thread mapping strategy (one thread per solution). This consists of a hierarchical parallel model -combining algorithmic-level\index{Metaheuristics!algorithmic-level -parallelism} parallelism (multistart) with -iteration-level\index{Metaheuristics!iteration-level parallelism} +combining algorithmic-level parallelism (multistart) with +iteration-level parallelism (master-worker).\\ In the island model, the same mapping is used in all the reviewed @@ -950,18 +934,17 @@ frameworks!libCUDAOptimize} is a library intended for the implementation of p-metaheuristics on GPU. The three frameworks are presented in more detail in the following. -\subsection{PUGACE\index{GPU-based frameworks!PUGACE}: framework for implementing evolutionary computation on GPUs} -PUGACE\index{GPU-based frameworks!PUGACE} is a generic framework +\subsection{PUGACE: framework for implementing evolutionary computation on GPUs} +PUGACE is a generic framework for easy implementation of cellular evolutionary algorithms on GPUs implemented using C and CUDA. It is based on the frameworks MALLBA and JCell (a framework for cellular algorithms). The authors justified the choice of cellular evolutionary -algorithm\index{Metaheuristics!evolutionary~algorithm} by the good +algorithm by the good feedback found in the literature concerning its efficient implementation on GPUs compared to other parallel models for EAs (island, master-slave). The main standard evolutionary operators -are already implemented in PUGACE\index{GPU-based -frameworks!PUGACE}: different selection strategies, standard +are already implemented in PUGACE: different selection strategies, standard crossover, and mutation operators (PMX, swap, 2-exchange, etc.). Different problem encoding is also supported. The framework is organized as a set of modules in which the different @@ -975,7 +958,7 @@ population is transferred to the GPU. On the GPU side, each individual is associated to a unique CUDA thread. The function evaluation and mutation are done on the GPU while selection and replacement are maintained on the CPU. In order to avoid thread -divergence\index{GPU Challenges!thread~divergence} appearing in +divergence\index{GPU!thread divergence} appearing in the same CUDA thread block at the crossover step (because of the probability of application which may give different results from one thread to the other), the decision of whether to apply a @@ -1005,7 +988,7 @@ Melab et al.~\cite{paradiseoGPU} developed a reusable framework ParadisEO-MO-GPU\index{GPU-based frameworks!ParadisEO-MO-GPU} for parallel local search metaheuristics (s-metaheuristics) on GPUs. It focuses on the -iteration-level\index{Metaheuristics!iteration-level parallelism} +iteration-level parallel model of s-metaheuristics which consists of exploring in parallel the neighborhood of a problem solution on GPU. The framework, implemented using C++ and CUDA, is an extension of the @@ -1048,7 +1031,7 @@ LibCUDAOptimize~\cite{libcuda} is a C++/CUDA open source library for the design and implementation of metaheuristics on GPUs. Until now, the metaheuristics supported by LibCUDAOptimize are: scatter search, differential evolution, and particle swarm -optimization\index{Metaheuristics!particle swarm optimization}. +optimization\index{metaheuristics!particle swarm optimization}. Nevertheless, the library is designed in such a way to allow further extensions for other metaheuristics and it is still in development phase by the authors. The parallelization strategy @@ -1060,12 +1043,12 @@ on CPU. \label{ch8:sec:case} In this case study, a large neighborhood GPU-based local -search\index{GPU-based Metaheuristics!GPU-based~local~search} +search\index{GPU!based local search} method for solving the Quadratic 3-dimensional Assignment Problem (Q3AP) will be presented. The local search method is an Iterated -Local Search\index{Metaheuristics!iterated local search} +Local Search\index{metaheuristics!iterated local search} (ILS)~\cite{stutzle2006ILSforQAP} using an embedded -TS\index{Metaheuristics!tabu~search} algorithm. The ILS principle +TS algorithm. The ILS principle consists of executing iteratively the embedded local search, each iteration which starts from a disrupted local optima reached by the previous local search process. The disruption heuristic is a @@ -1182,9 +1165,8 @@ number of feasible solutions of an instance of size $n$ is $n! \times n!$. \subsection{Iterated tabu search algorithm for the Q3AP} To tackle large-sized instances of the Q3AP and speed up the -search process, a parallel ILS\index{Metaheuristics!iterated local -search} algorithm has been designed. The local search embedded in -the ILS is a TS\index{Metaheuristics!tabu~search}. A TS +search process, a parallel ILS algorithm has been designed. The local search embedded in +the ILS is a TS. A TS procedure~\cite{Glover1989TS} starts from an initial feasible solution and tries, at each step, to move to a neighboring solution that minimizes the fitness (for a minimization case). If @@ -1193,7 +1175,7 @@ fitness is chosen as a next move. This enables the TS process to escape local optima. However, this strategy may generate cycles, i.e., previous moves can be selected again. To avoid cycles, the TS manages a short-term memory that contains the moves that have -been recently performed. A TS\index{Metaheuristics!tabu~search} +been recently performed. A TS template is given by Algorithm \ref{TS_pseudo_code}.\\ % % \begin{algorithm} @@ -1313,7 +1295,7 @@ methods. The main challenges that must be faced when designing a local search algorithm are the efficient distribution of the search process between the CPU and the GPU minimizing the data transfer between them, the hierarchical memory -management\index{GPU Challenges!memory~management} and the +management\index{GPU!memory management} and the capacity constraints of GPU memories, and the thread synchronization. All these issues must be regarded when designing parallel LS models to allow @@ -1321,9 +1303,9 @@ solving of large scale optimization problems on GPU architectures.\\ To go back to our problem (i.e., Q3AP), we propose in Algorithm~\ref{ch8:algoITS} an iterated tabu -search\index{Metaheuristics!tabu~search} on GPU (GPU-ITS). The +search on GPU (GPU-ITS). The parallel model is in agreement with the -iteration-level\index{Metaheuristics!iteration-level parallelism} +iteration-level parallel model of LS methods presented in Section \ref{ch8:sec:paraMeta} (Fig. \ref{ch8:fig:paraMeta}). This algorithm can be seen as a cooperative model between the CPU and @@ -1342,7 +1324,7 @@ We notice that the input matrices are read-only structures and do not change during all the execution of the LS algorithm. Therefore, their associated memory is copied only once during all the execution. Third, comes the parallel -iteration-level\index{Metaheuristics!iteration-level parallelism}, +iteration-level, in which each neighboring solution is generated, evaluated, and copied into the neighborhood fitnesses structure (from lines 10 to 14). Fourth, since the order in which candidate neighbors are @@ -1411,9 +1393,9 @@ given number of iterations has been reached. In this section, some experimental results related to the approach presented in Section \ref{ch8:ITS-Q3APSection} are reported. We recall that the approach is a GPU-based iterated tabu -search\index{Metaheuristics!tabu~search} (GPU-ITS) method +search (GPU-ITS) method consisting in an iterated local search (ILS) embedding a tabu -search\index{Metaheuristics!tabu~search} (TS) and where the +search (TS) and where the generation/evaluation step of the TS process is executed on GPU. The ILS is used to improve the quality of successive local optima provided by TS methods. This is achieved by perturbing the local @@ -1506,18 +1488,17 @@ the robustness/quality of provided solutions. \label{ch8:conclusion} This chapter has presented state-of-the-art GPU-based parallel -metaheuristics\index{Metaheuristics!parallel~metaheuristics} and +metaheuristics and a case study on implementing large neighborhood local search methods on GPUs for solving large benchmarks of the quadratic three-dimensional assignment problem (Q3AP). \\ Implementing parallel -metaheuristics\index{Metaheuristics!parallel~metaheuristics} on +metaheuristics on GPU architectures poses new issues and challenges such as memory -management\index{GPU Challenges!memory~management}; finding +management; finding efficient mapping strategies between tasks to parallelize; and the -GPU threads, thread divergence\index{GPU -Challenges!thread~divergence}, and synchronization. Actually, most +GPU threads, thread divergence, and synchronization. Actually, most of metaheuristics have been implemented on GPU using different implementation strategies. In this chapter, a two-level classification of the reviewed works has been proposed: design @@ -1528,6 +1509,6 @@ This classification focuses mainly on the mapping between the metaheuristic tasks to parallelize and the GPU threads. Indeed, the choice of a given mapping strategy strongly influences the other challenges (memory usage, communication, thread -divergence\index{GPU Challenges!thread~divergence}). +divergence). \putbib[Chapters/chapter9/biblio9] -- 2.39.5