X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/1874c46934f4ba7e8c2013d3829f65309456d292..6318153555fcb28c475d77850cce474032d79f5a:/BookGPU/Chapters/chapter12/ch12.tex?ds=inline

diff --git a/BookGPU/Chapters/chapter12/ch12.tex b/BookGPU/Chapters/chapter12/ch12.tex
index a514438..4fe0eb9 100755
--- a/BookGPU/Chapters/chapter12/ch12.tex
+++ b/BookGPU/Chapters/chapter12/ch12.tex
@@ -217,7 +217,7 @@ r_k \bot A \mathcal{K}_k(A, v_1).
 \end{array}
 \label{ch12:eq:13}
 \end{equation}
-GMRES uses the Arnoldi process~\cite{ch12:ref5}\index{iterative method!Arnoldi process} to construct an
+GMRES uses the Arnoldi iterations~\cite{ch12:ref5}\index{iterative method!Arnoldi iterations} to construct an
 orthonormal basis $V_k$ for the Krylov subspace $\mathcal{K}_k$ and an upper Hessenberg matrix\index{Hessenberg matrix} $\bar{H}_k$ of order $(k+1)\times k$:
 \begin{equation}
@@ -313,7 +313,7 @@ $V$ to $m$ orthogonal vectors.
 Algorithm~\ref{ch12:alg:02} shows the key points of the GMRES method with restarts. It solves the left-preconditioned\index{sparse linear system!preconditioned} sparse linear system~(\ref{ch12:eq:11}), such that $M$ is the preconditioning matrix. At each iteration
-$k$, GMRES uses the Arnoldi process\index{iterative method!Arnoldi process} (defined from
+$k$, GMRES uses the Arnoldi iterations\index{iterative method!Arnoldi iterations} (defined from
 line~$7$ to line~$17$) to construct a basis $V_m$ of $m$ orthogonal vectors and an upper Hessenberg matrix\index{Hessenberg matrix} $\bar{H}_m$ of size $(m+1)\times m$. Then, it solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$
@@ -457,8 +457,8 @@ nodes\index{neighboring node} over the GPU cluster must exchange between them th
 elements necessary to compute this multiplication. First, each computing node determines, in its local subvector, the vector elements needed by other nodes. Then, the neighboring nodes exchange between them these shared vector elements. The data exchanges are implemented by using the MPI
-point-to-point communication routines: blocking\index{MPI subroutines!blocking} sends with \verb+MPI_Send()+
-and nonblocking\index{MPI subroutines!nonblocking} receives with \verb+MPI_Irecv()+. Figure~\ref{ch12:fig:02}
+point-to-point communication routines: blocking\index{MPI!blocking} sends with \verb+MPI_Send()+
+and nonblocking\index{MPI!nonblocking} receives with \verb+MPI_Irecv()+. Figure~\ref{ch12:fig:02}
 shows an example of data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2}, and \textit{Node 3}. In this example, the iterate matrix $A$ split between these four computing nodes is that presented in Figure~\ref{ch12:fig:01}.
@@ -491,7 +491,7 @@ cluster. Consequently, the vector elements to be exchanged must be copied from t
 and vice versa before and after the synchronization operation between CPUs. We have used the CUBLAS\index{CUBLAS} communication subroutines to perform the data transfers between a CPU core and its GPU: \verb+cublasGetVector()+ and \verb+cublasSetVector()+. Finally, in addition to the data exchanges, GPU nodes perform reduction operations
-to compute in parallel the dot products and Euclidean norms. This is implemented by using the MPI global communication\index{MPI subroutines!global}
+to compute in parallel the dot products and Euclidean norms. This is implemented by using the MPI global communication\index{MPI!global}
 \verb+MPI_Allreduce()+.
@@ -526,7 +526,7 @@ is managed by one MPI process and is composed of one CPU core and one GPU card.
 All tests are made on double-precision floating point operations. The parameters of both linear solvers are initialized as follows: the residual tolerance threshold $\varepsilon=10^{-12}$, the maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$, and the
-initial guess $x_0$ is filled with $0.0$. In addition, we limited the Arnoldi process\index{iterative method!Arnoldi process}
+initial guess $x_0$ is filled with $0.0$. In addition, we limited the Arnoldi iterations\index{iterative method!Arnoldi iterations}
 used in the GMRES method to $16$ iterations ($m=16$). For the sake of simplicity, we have chosen the preconditioner $M$ as the main diagonal of the sparse matrix $A$. Indeed, it allows us to easily compute the required inverse matrix $M^{-1}$, and it provides a relatively good preconditioning for