From: couturie
Date: Wed, 7 Aug 2013 09:02:53 +0000 (+0200)
Subject: new
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/commitdiff_plain/b4a21f0b9226126a2c50f54a5518be5ef7c60749?hp=17bff40b83bcdcc39769f9e59c70ffae1c525b72

new
---

diff --git a/BookGPU/Chapters/chapter10/ch10.tex b/BookGPU/Chapters/chapter10/ch10.tex
index 7a39dca..0dd8d21 100644
--- a/BookGPU/Chapters/chapter10/ch10.tex
+++ b/BookGPU/Chapters/chapter10/ch10.tex
@@ -442,7 +442,7 @@ In the following section, we first quickly describe the CUDA reduction operation
 % Reduce operation
 \subsection{Parallel reduction}
 \label{chXXX:subsec:reduction}
-A parallel reduction\index{parallel reduction} operation is performed in an efficient manner inside a GPU block as shown in Figure~\ref{chXXX:fig:reduc}. Shared memory is used for a fast and reliable way to communicate between threads. However, at the grid level, reduction cannot be easily implemented due to the lack of direct communication between blocks. The usual way of dealing with this type of limitation is to apply the reduction in two separate steps. The first one involves a GPU kernel reducing the data over multiple blocks, the local result of each block being stored on completion. The second step finishes the reduction on a single block or on the CPU.
+A parallel reduction\index{parallel!reduction} operation is performed in an efficient manner inside a GPU block as shown in Figure~\ref{chXXX:fig:reduc}. Shared memory is used for a fast and reliable way to communicate between threads. However, at the grid level, reduction cannot be easily implemented due to the lack of direct communication between blocks. The usual way of dealing with this type of limitation is to apply the reduction in two separate steps. The first one involves a GPU kernel reducing the data over multiple blocks, the local result of each block being stored on completion. The second step finishes the reduction on a single block or on the CPU.
 An optimized way of doing the reduction can be found in the examples\footnote{Available at http://docs.nvidia.com/cuda/cuda-samples/index.html\#advanced} provided by NVIDIA.

 %In order to keep code listings compact hereafter, the reduction of values among a block will be referred to as \textit{reduceOperation(value)} (per extension \textit{reduceArgMax(maxVal)}).
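The two-step scheme this hunk describes can be made concrete with a short kernel for the first step. The sketch below is illustrative only, not the chapter's code: the kernel name, the float element type, and the power-of-two block size are assumptions made here.

// Step one of the two-step reduction: each block reduces its slice of g_in
// into a single partial sum stored in g_out[blockIdx.x]; a second kernel
// launch (or the CPU) then finishes the reduction over the per-block results.
// Launch with blockDim.x * sizeof(float) bytes of dynamic shared memory;
// assumes blockDim.x is a power of two.
__global__ void blockReduce(const float *g_in, float *g_out, int n)
{
    extern __shared__ float sdata[];  // shared memory: fast intra-block communication
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? g_in[i] : 0.0f;   // one element per thread
    __syncthreads();

    // Tree-shaped reduction in shared memory, halving the stride each pass.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)                            // store the block's local result
        g_out[blockIdx.x] = sdata[0];
}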
diff --git a/BookGPU/Chapters/chapter12/ch12.tex b/BookGPU/Chapters/chapter12/ch12.tex
index 0269fd2..4fe0eb9 100755
--- a/BookGPU/Chapters/chapter12/ch12.tex
+++ b/BookGPU/Chapters/chapter12/ch12.tex
@@ -457,8 +457,8 @@ nodes\index{neighboring node} over the GPU cluster must exchange between them th
 elements necessary to compute this multiplication. First, each computing node determines, in its
 local subvector, the vector elements needed by other nodes. Then, the neighboring nodes exchange
 between them these shared vector elements. The data exchanges are implemented by using the MPI
-point-to-point communication routines: blocking\index{MPI subroutines!blocking} sends with \verb+MPI_Send()+
-and nonblocking\index{MPI subroutines!nonblocking} receives with \verb+MPI_Irecv()+. Figure~\ref{ch12:fig:02}
+point-to-point communication routines: blocking\index{MPI!blocking} sends with \verb+MPI_Send()+
+and nonblocking\index{MPI!nonblocking} receives with \verb+MPI_Irecv()+. Figure~\ref{ch12:fig:02}
 shows an example of data exchanges between \textit{Node 1} and its neighbors \textit{Node 0},
 \textit{Node 2}, and \textit{Node 3}. In this example, the iterate matrix $A$ split between these
 four computing nodes is that presented in Figure~\ref{ch12:fig:01}.
@@ -491,7 +491,7 @@ cluster. Consequently, the vector elements to be exchanged must be copied from t
 and vice versa before and after the synchronization operation between CPUs. We have used the
 CUBLAS\index{CUBLAS} communication subroutines to perform the data transfers between a CPU core
 and its GPU: \verb+cublasGetVector()+ and \verb+cublasSetVector()+. Finally, in addition to the
 data exchanges, GPU nodes perform reduction operations
-to compute in parallel the dot products and Euclidean norms. This is implemented by using the MPI global communication\index{MPI subroutines!global}
+to compute in parallel the dot products and Euclidean norms. This is implemented by using the MPI global communication\index{MPI!global}
 \verb+MPI_Allreduce()+.
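The exchange-and-reduce pattern of these two hunks can be sketched as follows. This is not the chapter's code: all identifiers (nb_neigh, neigh[], scnt[], rcnt[], the buffers, MAX_NEIGH) are assumed names, float is an assumed element type, and the shared elements are assumed to be packed contiguously in the device buffers.

/* Sketch: nonblocking receives posted first, GPU-to-CPU copy of the shared
 * elements with cublasGetVector(), blocking sends, then CPU-to-GPU copy of
 * the received elements with cublasSetVector(). */
#include <mpi.h>
#include <cublas.h>

#define MAX_NEIGH 8   /* assumed upper bound on the number of neighbors */

void exchange_shared_elements(int nb_neigh, const int neigh[],
                              const int scnt[], const int rcnt[],
                              const float *d_send, float *h_send,
                              float *h_recv, float *d_recv)
{
    MPI_Request req[MAX_NEIGH];
    int i, soff = 0, roff = 0, stotal = 0, rtotal = 0;

    for (i = 0; i < nb_neigh; i++) { stotal += scnt[i]; rtotal += rcnt[i]; }

    /* Nonblocking receives for the elements owned by the neighbors. */
    for (i = 0; i < nb_neigh; i++) {
        MPI_Irecv(h_recv + roff, rcnt[i], MPI_FLOAT, neigh[i], 0,
                  MPI_COMM_WORLD, &req[i]);
        roff += rcnt[i];
    }

    /* Copy the locally owned shared elements to the CPU, then send them. */
    cublasGetVector(stotal, sizeof(float), d_send, 1, h_send, 1);
    for (i = 0; i < nb_neigh; i++) {
        MPI_Send(h_send + soff, scnt[i], MPI_FLOAT, neigh[i], 0, MPI_COMM_WORLD);
        soff += scnt[i];
    }

    /* Wait for the receives, then copy the new elements onto the GPU. */
    MPI_Waitall(nb_neigh, req, MPI_STATUSES_IGNORE);
    cublasSetVector(rtotal, sizeof(float), h_recv, 1, d_recv, 1);
}

/* Dot product: local part on the GPU, then a global sum over all nodes. */
float distributed_dot(int n_local, const float *d_p, const float *d_q)
{
    float local = cublasSdot(n_local, d_p, 1, d_q, 1), global;
    MPI_Allreduce(&local, &global, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    return global;
}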
diff --git a/BookGPU/Chapters/chapter13/ch13.tex b/BookGPU/Chapters/chapter13/ch13.tex
index 90dd946..b60105d 100755
--- a/BookGPU/Chapters/chapter13/ch13.tex
+++ b/BookGPU/Chapters/chapter13/ch13.tex
@@ -322,7 +322,7 @@ each component of the vector must be relaxed an infinite number of times. The ch
 relaxed components to be used in the computational process may be guided by any
 criterion, and in particular, a natural criterion is to pick up the most recently
 available values of the components computed by the processors. Furthermore, the asynchronous
-iterations are implemented by means of nonblocking MPI communication subroutines\index{MPI subroutines!nonblocking}
+iterations are implemented by means of nonblocking MPI communication subroutines\index{MPI!nonblocking}
 (asynchronous communications).

 The important property ensuring the convergence of the parallel projected Richardson
@@ -486,13 +486,13 @@ in the synchronous algorithm and the asynchronous ones: \verb+cublasSetVectorAsy
 and \verb+cublasGetVectorAsync()+ in the asynchronous algorithm. Moreover, we use the
 communication routines of the MPI library to carry out the data exchanges between
 the neighboring nodes. We use the following communication routines: \verb+MPI_Isend()+
-and \verb+MPI_Irecv()+ to perform nonblocking\index{MPI subroutines!nonblocking}
+and \verb+MPI_Irecv()+ to perform nonblocking\index{MPI!nonblocking}
 sends and receives, respectively. For the synchronous algorithm, we use the MPI routine
 \verb+MPI_Waitall()+ which puts the MPI process of a computing node in blocking status
 until all data exchanges with neighboring nodes (sends and receives) are completed.
 In contrast, for the asynchronous algorithms, we use the MPI routine \verb+MPI_Test()+
 which tests the completion of a data exchange (send or receive)
-without putting the MPI process in blocking status\index{MPI subroutines!blocking}.
+without putting the MPI process in blocking status\index{MPI!blocking}.

 The function $Compute\_New\_Vector\_Elements()$ (line~$6$ in Algorithm~\ref{ch13:alg:02})
 computes, at each iteration, the new elements of the iterate vector $U$. Its general code
@@ -598,7 +598,7 @@ AllReduce(error,\hspace{0.1cm}maxerror,\hspace{0.1cm}MAX); \\
 conv \leftarrow true;
 \end{array}
 $$
-where the function $AllReduce()$ uses the MPI global reduction subroutine\index{MPI subroutines!global}
+where the function $AllReduce()$ uses the MPI global reduction subroutine\index{MPI!global}
 \verb+MPI_Allreduce()+ to compute the maximal value, $maxerror$, among the local absolute
 errors, $error$, of all computing nodes, and $p$ (in Algorithm~\ref{ch13:alg:02}) is used
 as a counter of the local relaxations carried out by a computing node. In
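The synchronous and asynchronous completion styles and the convergence test described in these hunks can be summarized in a few lines of MPI. The sketch is illustrative only; the function names and the double error type are assumptions made here, not the chapter's code.

#include <mpi.h>

/* Synchronous algorithm: block until every pending exchange with the
 * neighboring nodes (sends and receives) has completed. */
void wait_exchanges_sync(int nb_req, MPI_Request req[])
{
    MPI_Waitall(nb_req, req, MPI_STATUSES_IGNORE);
}

/* Asynchronous algorithms: test the completion of one exchange without
 * ever putting the MPI process in blocking status. */
int exchange_done_async(MPI_Request *req)
{
    int done = 0;
    MPI_Test(req, &done, MPI_STATUS_IGNORE);
    return done;   /* 1 if the send or receive has completed, 0 otherwise */
}

/* AllReduce(error, maxerror, MAX): the maximal value among the local
 * absolute errors of all computing nodes. */
double all_reduce_max_error(double error)
{
    double maxerror;
    MPI_Allreduce(&error, &maxerror, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    return maxerror;
}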
diff --git a/BookGPU/Chapters/chapter15/ch15.tex b/BookGPU/Chapters/chapter15/ch15.tex
index 64eb771..2d507e4 100644
--- a/BookGPU/Chapters/chapter15/ch15.tex
+++ b/BookGPU/Chapters/chapter15/ch15.tex
@@ -917,7 +917,7 @@ one C2050 (Fermi) GPU, located at UPMC (Universit\'e Pierre et Marie Curie,
 Paris, France). As a remark, the execution
 times measured on the C2050 would be the same on the C2070 and on the C2075,
 the only difference between these GPUs
-being their memory size and their TDP (Thermal Design Power)\index{TDP (Thermal Design Power)}.
+being their memory size and their TDP (Thermal Design Power)\index{TDP (thermal design power)}.
 We emphasize that the execution times correspond to the complete
 propagation for all six energies of the large case (see Table~\ref{data-sets}),
 that is to say to the complete execution of
diff --git a/BookGPU/Chapters/chapter16/ef.tex b/BookGPU/Chapters/chapter16/ef.tex
index 400b467..1cc747b 100644
--- a/BookGPU/Chapters/chapter16/ef.tex
+++ b/BookGPU/Chapters/chapter16/ef.tex
@@ -52,7 +52,7 @@ significant savings in simulation time.
 % One simple way to estimate $x((m+k)T)$ is to use the information at
 % $mT$ and $(m-1)T$, which leads to the forward-Euler method as
 To estimate $x((m+k)T)$,
-a forward Euler\index{forward Euler} style jumping relies only on $x(mT)$ and $x((m-1)T)$,
+a forward Euler\index{Euler!forward Euler} style jumping relies only on $x(mT)$ and $x((m-1)T)$,
 i.e.,
 \[
 x((m+k)T)
@@ -61,7 +61,7 @@ i.e.,
 \]
 However, this approach is inefficient due to its restriction on envelope
 step $k$, since larger $k$ usually causes instability.
-Instead, backward Euler\index{backward Euler} jumping,
+Instead, backward Euler\index{Euler!backward Euler} jumping,
 %and the equation becomes
 \[
 x((m+k)T)-x(mT) = k\left[x((m+k)T)-x((m+k-1)T)\right],
diff --git a/BookGPU/Chapters/chapter16/gpu.tex b/BookGPU/Chapters/chapter16/gpu.tex
index 26bd536..623ac81 100644
--- a/BookGPU/Chapters/chapter16/gpu.tex
+++ b/BookGPU/Chapters/chapter16/gpu.tex
@@ -160,7 +160,7 @@ period in order to solve a Newton update.
 At each time step, SPICE\index{SPICE} has to linearize device models,
 stamp matrix elements into MNA (short for modified nodal
 analysis\index{modified nodal analysis, or MNA}) matrices,
-and solve circuit equations in its inner Newton iteration\index{Newton iteration}.
+and solve circuit equations in its inner Newton iteration\index{iterative method!Newton iteration}.
 When convergence is attained, circuit states are saved and then the next time
 step begins.
 This is also the time when we store the needed matrices
diff --git a/BookGPU/Chapters/chapter16/intro.tex b/BookGPU/Chapters/chapter16/intro.tex
index 4d368b3..dda1808 100644
--- a/BookGPU/Chapters/chapter16/intro.tex
+++ b/BookGPU/Chapters/chapter16/intro.tex
@@ -76,7 +76,7 @@ next envelope step.
 {\resizebox{.9\textwidth}{!}{\input{./Chapters/chapter16/figures/envelope.pdf_t}}
 \label{fig:ef2} }
 \caption{Transient envelope-following\index{envelope-following} analysis.
-  (Both two figures reflect backward Euler\index{backward Euler} style envelope-following.)}
+  (Both figures reflect backward Euler\index{Euler!backward Euler} style envelope-following.)}
 \label{fig:ef_intro}
 \end{figure}
diff --git a/BookGPU/Chapters/chapter5/ch5.tex b/BookGPU/Chapters/chapter5/ch5.tex
index f780577..bdfad6a 100644
--- a/BookGPU/Chapters/chapter5/ch5.tex
+++ b/BookGPU/Chapters/chapter5/ch5.tex
@@ -188,7 +188,7 @@ We use a Method of Lines (MoL)\index{method of lines} approach to solve \eqref{c
 \begin{align}\label{ch5:eq:discreteheateq}
 \frac{\partial u}{\partial t} = \mymat{A}\myvec{u}, \qquad \mymat{A} \in \mathbb{R}^{N\times N}, \quad \myvec{u} \in \mathbb{R}^{N},
 \end{align}
-where $\mymat{A}$ is the sparse finite difference matrix and $N$ is the number of unknowns in the discrete system. The temporal derivative is now free to be approximated by any suitable choice of a time-integration method\index{time integration}. The most simple integration scheme would be the first-order accurate explicit forward Euler method\index{forward Euler},
+where $\mymat{A}$ is the sparse finite difference matrix and $N$ is the number of unknowns in the discrete system. The temporal derivative is now free to be approximated by any suitable choice of a time-integration method\index{time integration}. The simplest integration scheme would be the first-order accurate explicit forward Euler method\index{Euler!forward Euler},
 \begin{align}\label{ch5:eq:forwardeuler}
 \myvec{u}^{n+1} = \myvec{u}^n + \delta t\,\mymat{A}\myvec{u}^n,
 \end{align}
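The forward Euler update in the last hunk maps directly onto a GPU kernel. The sketch below is illustrative only: it realizes the sparse operator matrix-free as the 1D three-point finite difference Laplacian, and the kernel name, float type, and grid spacing h are assumptions made here, not the chapter's code.

// One explicit forward Euler step u^{n+1} = u^n + dt * (A u^n), with A the
// three-point Laplacian stencil; interior points only, the boundary points
// are left to the problem's boundary conditions.
__global__ void forwardEulerStep(const float *u, float *u_new,
                                 float dt, float h, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < N - 1) {
        float Au = (u[i - 1] - 2.0f * u[i] + u[i + 1]) / (h * h);  // (A u)_i
        u_new[i] = u[i] + dt * Au;   // first-order accurate explicit step
    }
}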
diff --git a/BookGPU/Chapters/chapter6/PartieAsync.tex b/BookGPU/Chapters/chapter6/PartieAsync.tex
index 9edbe8d..3f4a539 100644
--- a/BookGPU/Chapters/chapter6/PartieAsync.tex
+++ b/BookGPU/Chapters/chapter6/PartieAsync.tex
@@ -139,7 +139,7 @@ communication libraries such as MPI are not systematically performed in parallel
 the computations~\cite{ChVCV13,Hoefler08a}. So, the logical and classical way
 to implement such an overlap is to use three threads: one for computing, one
 for sending, and one for receiving. Moreover, since
-the communication is performed by threads, blocking synchronous communications\index{MPI!communication!blocking}\index{MPI!communication!synchronous}
+the communication is performed by threads, blocking synchronous communications\index{MPI!blocking}\index{MPI!synchronous}
 can be used without deteriorating the overall performance.

 In this basic version, the termination\index{termination} of the global process is performed
diff --git a/BookGPU/Chapters/chapter7/ch7.tex b/BookGPU/Chapters/chapter7/ch7.tex
index f9f7fd4..53cbae2 100644
--- a/BookGPU/Chapters/chapter7/ch7.tex
+++ b/BookGPU/Chapters/chapter7/ch7.tex
@@ -359,7 +359,7 @@ Subtracting $g^n$ in \eqref{ch7:eq:discreteupdate} and dividing by a pseudo time
 \frac{g^{*,n+1}-g^n}{\tau} =\frac{(1-\Gamma)}{\tau} (g_e^n-g^n).
 \end{align}
 %
-The first term is similar to a first-order accurate Forward Euler\index{forward Euler} approximation of a rate of change term. This motivates an {\em embedded penalty forcing technique} based on adding a correction term of the form
+The first term is similar to a first-order accurate Forward Euler\index{Euler!forward Euler} approximation of a rate of change term. This motivates an {\em embedded penalty forcing technique} based on adding a correction term of the form
 %
 \begin{align}\label{ch7:eq:penalty}
 \partial_t g = \mathcal{N}(g) + \frac{1-\Gamma(x)}{\tau} (g_e(t,x)-g(t,x)), \quad {\bf x}\in\Omega_\Gamma,
diff --git a/BookGPU/Chapters/chapter8/ch8.tex b/BookGPU/Chapters/chapter8/ch8.tex
index 604ce40..7570035 100644
--- a/BookGPU/Chapters/chapter8/ch8.tex
+++ b/BookGPU/Chapters/chapter8/ch8.tex
@@ -96,7 +96,7 @@ This taxonomy based on the classification proposed in \cite{ch8:Gendron_1994} id
 Tree-based strategies consist of building and/or exploring the solution tree in
 parallel by performing operations on several subproblems simultaneously. This
 coarse-grained type of parallelism affects the general structure of the B\&B algorithm and makes it highly irregular.\\
-The parallel tree exploration \index{parallel tree exploration} model, illustrated in Figure \ref{ch8:parallel_tree}, consists of visiting in parallel different paths of the same tree. The search tree is explored in parallel by performing the branching, selection, bounding, and elimination operators on several subproblems simultaneously.\\
+The parallel tree exploration \index{parallel!tree exploration} model, illustrated in Figure \ref{ch8:parallel_tree}, consists of visiting in parallel different paths of the same tree. The search tree is explored in parallel by performing the branching, selection, bounding, and elimination operators on several subproblems simultaneously.\\

 \begin{figure}
 \begin{center}
@@ -112,7 +112,7 @@ The parallel tree exploration \index{parallel tree exploration} model, illustrat
 Node-based strategies introduce parallelism when performing the operations on a
 single problem. For instance, they consist of executing the bounding operation in
 parallel for each subproblem to accelerate the execution. This type of parallelism has no influence on the general structure of the B\&B algorithm and is particular to the problem being solved.\\
-The parallel evaluation of bounds \index{parallel evaluation of bounds} model, as shown in Figure \ref{ch8:bounds_parallel}, allows the parallelization of the bounding of subproblems generated by the branching operator. This model is used in the case where the bounding operator is performed several times after the branching operator. Compared to the sequential B\&B, the model does not change the order and the number of explored subproblems in the parallel B\&B algorithm.
+The parallel evaluation of bounds \index{parallel!evaluation of bounds} model, as shown in Figure \ref{ch8:bounds_parallel}, allows the parallelization of the bounding of subproblems generated by the branching operator. This model is used in the case where the bounding operator is performed several times after the branching operator. Compared to the sequential B\&B, the model does not change the order or the number of explored subproblems in the parallel B\&B algorithm.

 \begin{figure}
 \begin{center}
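The parallel evaluation of bounds model in the last hunk is naturally expressed as a one-thread-per-subproblem kernel. The sketch below is purely illustrative: computeLowerBound() is a hypothetical, problem-specific stand-in, and every name and data layout here is an assumption, not the chapter's implementation.

// Each GPU thread evaluates the bound of one subproblem produced by the
// branching operator; the order and number of explored subproblems is
// unchanged with respect to the sequential B&B.
__device__ int computeLowerBound(const int *subproblem, int size)
{
    // Hypothetical stand-in: a real B&B would compute a problem-specific
    // bound here (e.g., a makespan lower bound). Trivial placeholder:
    int b = 0;
    for (int j = 0; j < size; j++)
        b += subproblem[j];
    return b;
}

__global__ void evaluateBounds(const int *subproblems, int subproblemSize,
                               int nbSubproblems, int *bounds)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < nbSubproblems)
        bounds[k] = computeLowerBound(subproblems + k * subproblemSize,
                                      subproblemSize);
}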