X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/GMRES2stage.git/blobdiff_plain/fdf2e308fad59acb0f3401f7f3d59a8131ee35ae..eacf4c2eeca7315f0a4a7bc9dec99dc6778843d1:/paper.tex diff --git a/paper.tex b/paper.tex index dd80756..4d59b93 100644 --- a/paper.tex +++ b/paper.tex @@ -241,7 +241,7 @@ % quality. -%\usepackage{eqparbox} +\usepackage{eqparbox} % Also of notable interest is Scott Pakin's eqparbox package for creating % (automatically sized) equal width boxes - aka "natural width parboxes". % Available at: @@ -348,7 +348,8 @@ \hyphenation{op-tical net-works semi-conduc-tor} - +\usepackage[utf8]{inputenc} +\usepackage[T1]{fontenc} \usepackage{algorithm} \usepackage{algpseudocode} \usepackage{amsmath} @@ -362,17 +363,14 @@ \algnewcommand\algorithmicoutput{\textbf{Output:}} \algnewcommand\Output{\item[\algorithmicoutput]} - +\newtheorem{proposition}{Proposition} \begin{document} % % paper title % can use linebreaks \\ within to get better formatting as desired -\title{TSARM: A Two-Stage Algorithm with least-square Residual Minimization to solve large sparse linear systems} -%où -%\title{A two-stage algorithm with error minimization to solve large sparse linear systems} -%où -%\title{???} +\title{TSIRM: A Two-Stage Iteration with least-squares Residual Minimization algorithm to solve large sparse linear systems} + @@ -382,8 +380,8 @@ % use a multiple column layout for up to two different % affiliations -\author{\IEEEauthorblockN{Rapha\"el Couturier\IEEEauthorrefmark{1}, Lilia Ziane Khodja \IEEEauthorrefmark{2} and Christophe Guyeux\IEEEauthorrefmark{1}} -\IEEEauthorblockA{\IEEEauthorrefmark{1} Femto-ST Institute, University of Franche Comte, France\\ +\author{\IEEEauthorblockN{Rapha\"el Couturier\IEEEauthorrefmark{1}, Lilia Ziane Khodja\IEEEauthorrefmark{2}, and Christophe Guyeux\IEEEauthorrefmark{1}} +\IEEEauthorblockA{\IEEEauthorrefmark{1} Femto-ST Institute, University of Franche-Comt\'e, France\\ Email: \{raphael.couturier,christophe.guyeux\}@univ-fcomte.fr} \IEEEauthorblockA{\IEEEauthorrefmark{2} INRIA Bordeaux Sud-Ouest, France\\ Email: lilia.ziane@inria.fr} @@ -427,20 +425,21 @@ Email: lilia.ziane@inria.fr} \begin{abstract} -In this paper we propose a two stage iterative method which increases the -convergence of Krylov iterative methods, typically those of GMRES variants. The -principle of our approach is to build an external iteration over the Krylov -method and to save the current residual frequently (for example, for each -restart of GMRES). Then after a given number of outer iterations, a minimization -step is applied on the matrix composed of the saved residuals in order to -compute a better solution and make a new iteration if necessary. We prove that -our method has the same convergence property than the inner method used. Some -experiments using up to 16,394 cores show that compared to GMRES our algorithm -can be around 7 times faster. +In this article, a two-stage iterative algorithm is proposed to improve the +convergence of Krylov based iterative methods, typically those of GMRES +variants. The principle of the proposed approach is to build an external +iteration over the Krylov method, and to frequently store its current residual +(at each GMRES restart for instance). After a given number of outer iterations, +a least-squares minimization step is applied on the matrix composed by the saved +residuals, in order to compute a better solution and to make new iterations if +required. It is proven that the proposal has the same convergence properties +than the inner embedded method itself. Experiments using up to 16,394 cores +also show that the proposed algorithm runs around 5 or 7 times faster than +GMRES. \end{abstract} \begin{IEEEkeywords} -Iterative Krylov methods; sparse linear systems; residual minimization; PETSc; %à voir... +Iterative Krylov methods; sparse linear systems; two stage iteration; least-squares residual minimization; PETSc \end{IEEEkeywords} @@ -548,45 +547,51 @@ Iterative Krylov methods; sparse linear systems; residual minimization; PETSc; % % You must have at least 2 lines in the paragraph with the drop letter % (should never be an issue) -Iterative methods became more attractive than direct ones to solve very large -sparse linear systems. Iterative methods are more effecient in a parallel -context, with thousands of cores, and require less memory and arithmetic -operations than direct methods. A number of iterative methods are proposed and -adapted by many researchers and the increased need for solving very large sparse -linear systems triggered the development of efficient iterative techniques -suitable for the parallel processing. - -Most of the successful iterative methods currently available are based on Krylov -subspaces which consist in forming a basis of a sequence of successive matrix -powers times an initial vector for example the residual. These methods are based -on orthogonality of vectors of the Krylov subspace basis to solve linear -systems. The most well-known iterative Krylov subspace methods are Conjugate -Gradient method and GMRES method (generalized minimal residual). - -However, iterative methods suffer from scalability problems on parallel -computing platforms with many processors due to their need for reduction -operations and collective communications to perform matrix-vector +Iterative methods have recently become more attractive than direct ones to solve +very large sparse linear systems\cite{Saad2003}. They are more efficient in a +parallel context, supporting thousands of cores, and they require less memory +and arithmetic operations than direct methods~\cite{bahicontascoutu}. This is +why new iterative methods are frequently proposed or adapted by researchers, and +the increasing need to solve very large sparse linear systems has triggered the +development of such efficient iterative techniques suitable for parallel +processing. + +Most of the successful iterative methods currently available are based on +so-called ``Krylov subspaces''. They consist in forming a basis of successive +matrix powers multiplied by an initial vector, which can be for instance the +residual. These methods use vectors orthogonality of the Krylov subspace basis +in order to solve linear systems. The most known iterative Krylov subspace +methods are conjugate gradient and GMRES ones (Generalized Minimal RESidual). + + +However, iterative methods suffer from scalability problems on parallel +computing platforms with many processors, due to their need of reduction +operations, and to collective communications to achieve matrix-vector multiplications. The communications on large clusters with thousands of cores -and large sizes of messages can significantly affect the performances of -iterative methods. In practice, Krylov subspace iteration methods are often used -with preconditioners in order to increase their convergence and accelerate their -performances. However, most of the good preconditioners are not scalable on -large clusters. - -In this paper we propose a two-stage algorithm based on two nested iterations -called inner-outer iterations. This algorithm consists in solving the sparse -linear system iteratively with a small number of inner iterations and restarts -the outer step with a new solution minimizing some error functions over some -previous residuals. This algorithm is iterative and easy to parallelize on large -clusters and the minimization technique improves its convergence and -performances. - -The present paper is organized as follows. In Section~\ref{sec:02} some related -works are presented. Section~\ref{sec:03} presents our two-stage algorithm using -a least-square residual minimization. Section~\ref{sec:04} describes some -convergence results on this method. Section~\ref{sec:05} shows some experimental -results obtained on large clusters of our algorithm using routines of PETSc -toolkit. Finally Section~\ref{sec:06} concludes and gives some perspectives. +and large sizes of messages can significantly affect the performances of these +iterative methods. As a consequence, Krylov subspace iteration methods are often +used with preconditioners in practice, to increase their convergence and +accelerate their performances. However, most of the good preconditioners are +not scalable on large clusters. + +In this research work, a two-stage algorithm based on two nested iterations +called inner-outer iterations is proposed. This algorithm consists in solving +the sparse linear system iteratively with a small number of inner iterations, +and restarting the outer step with a new solution minimizing some error +functions over some previous residuals. For further information on two-stage +iteration methods, interested readers are invited to +consult~\cite{Nichols:1973:CTS}. Two-stage algorithms are easy to parallelize on +large clusters. Furthermore, the least-squares minimization technique improves +its convergence and performances. + +The present article is organized as follows. Related works are presented in +Section~\ref{sec:02}. Section~\ref{sec:03} details the two-stage algorithm using +a least-squares residual minimization, while Section~\ref{sec:04} provides +convergence results regarding this method. Section~\ref{sec:05} shows some +experimental results obtained on large clusters using routines of PETSc +toolkit. This research work ends by a conclusion section, in which the proposal +is summarized while intended perspectives are provided. + %%%********************************************************* %%%********************************************************* @@ -604,113 +609,123 @@ toolkit. Finally Section~\ref{sec:06} concludes and gives some perspectives. %%%********************************************************* %%%********************************************************* -\section{Two-stage algorithm with least-square residuals minimization} +\section{Two-stage iteration with least-squares residuals minimization algorithm} \label{sec:03} A two-stage algorithm is proposed to solve large sparse linear systems of the form $Ax=b$, where $A\in\mathbb{R}^{n\times n}$ is a sparse and square -nonsingular matrix, $x\in\mathbb{R}^n$ is the solution vector and -$b\in\mathbb{R}^n$ is the right-hand side. The algorithm is implemented as an -inner-outer iteration solver based on iterative Krylov methods. The main key -points of our solver are given in Algorithm~\ref{algo:01}. - -In order to accelerate the convergence, the outer iteration periodically applies -a least-square minimization on the residuals computed by the inner solver. The -inner solver is based on a Krylov method which does not require to be changed. - -At each outer iteration, the sparse linear system $Ax=b$ is solved, only for $m$ -iterations, using an iterative method restarting with the previous solution. For -example, the GMRES method~\cite{Saad86} or some of its variants can be used as a -inner solver. The current solution of the Krylov method is saved inside a matrix -$S$ composed of successive solutions computed by the inner iteration. - -Periodically, every $s$ iterations, the minimization step is applied in order to -compute a new solution $x$. For that, the previous residuals are computed with -$(b-AS)$. The minimization of the residuals is obtained by +nonsingular matrix, $x\in\mathbb{R}^n$ is the solution vector, and +$b\in\mathbb{R}^n$ is the right-hand side. As explained previously, +the algorithm is implemented as an +inner-outer iteration solver based on iterative Krylov methods. The main +key-points of the proposed solver are given in Algorithm~\ref{algo:01}. +It can be summarized as follows: the +inner solver is a Krylov based one. In order to accelerate its convergence, the +outer solver periodically applies a least-squares minimization on the residuals computed by the inner one. %Tsolver which does not required to be changed. + +At each outer iteration, the sparse linear system $Ax=b$ is partially solved +using only $m$ iterations of an iterative method, this latter being initialized +with the last obtained approximation. GMRES method~\cite{Saad86}, or any of its +variants, can potentially be used as inner solver. The current approximation of +the Krylov method is then stored inside a $n \times s$ matrix $S$, which is +composed by the $s$ last solutions that have been computed during the inner +iterations phase. In the remainder, the $i$-th column vector of $S$ will be +denoted by $S_i$. + +At each $s$ iterations, another kind of minimization step is applied in order to +compute a new solution $x$. For that, the previous residuals of $Ax=b$ are computed by +the inner iterations with $(b-AS)$. The minimization of the residuals is obtained by \begin{equation} \underset{\alpha\in\mathbb{R}^{s}}{min}\|b-R\alpha\|_2 \label{eq:01} \end{equation} -with $R=AS$. Then the new solution $x$ is computed with $x=S\alpha$. +with $R=AS$. The new solution $x$ is then computed with $x=S\alpha$. + + +In practice, $R$ is a dense rectangular matrix belonging in $\mathbb{R}^{n\times s}$, +with $s\ll n$. In order to minimize~\eqref{eq:01}, a least-squares method such as +CGLS ~\cite{Hestenes52} or LSQR~\cite{Paige82} is used. Remark that these methods are more +appropriate than a single direct method in a parallel context. -In practice, $R$ is a dense rectangular matrix in $\mathbb{R}^{n\times s}$, -$s\ll n$. In order to minimize~(\ref{eq:01}), a least-square method such as -CGLS ~\cite{Hestenes52} or LSQR~\cite{Paige82} is used. Those methods are more -appropriate than a direct method in a parallel context. \begin{algorithm}[t] -\caption{TSARM} +\caption{TSIRM} \begin{algorithmic}[1] \Input $A$ (sparse matrix), $b$ (right-hand side) \Output $x$ (solution vector)\vspace{0.2cm} - \State Set the initial guess $x^0$ - \For {$k=1,2,3,\ldots$ until convergence (error$<\epsilon_{tsarm}$)} \label{algo:conv} - \State $x^k=Solve(A,b,x^{k-1},max\_iter_{kryl})$ \label{algo:solve} - \State retrieve error - \State $S_{k~mod~s}=x^k$ \label{algo:store} - \If {$k$ mod $s=0$ {\bf and} error$>\epsilon_{tsarm}$} + \State Set the initial guess $x_0$ + \For {$k=1,2,3,\ldots$ until convergence (error$<\epsilon_{tsirm}$)} \label{algo:conv} + \State $[x_k,error]=Solve(A,b,x_{k-1},max\_iter_{kryl})$ \label{algo:solve} + \State $S_{k \mod s}=x_k$ \label{algo:store} \Comment{update column (k mod s) of S} + \If {$k \mod s=0$ {\bf and} error$>\epsilon_{kryl}$} \State $R=AS$ \Comment{compute dense matrix} \label{algo:matrix_mul} - \State Solve least-squares problem $\underset{\alpha\in\mathbb{R}^{s}}{min}\|b-R\alpha\|_2$ \label{algo:} - \State $x^k=S\alpha$ \Comment{compute new solution} + \State $\alpha=Least\_Squares(R,b,max\_iter_{ls})$ \label{algo:} + \State $x_k=S\alpha$ \Comment{compute new solution} \EndIf \EndFor \end{algorithmic} \label{algo:01} \end{algorithm} -Algorithm~\ref{algo:01} summarizes the principle of our method. The outer -iteration is inside the for loop. Line~\ref{algo:solve}, the Krylov method is -called for a maximum of $max\_iter_{kryl}$ iterations. In practice, we suggest to set this parameter -equals to the restart number of the GMRES-like method. Moreover, a tolerance -threshold must be specified for the solver. In practice, this threshold must be -much smaller than the convergence threshold of the TSARM algorithm (i.e. -$\epsilon_{tsarm}$). Line~\ref{algo:store}, $S_{k~ mod~ s}=x^k$ consists in copying the -solution $x_k$ into the column $k~ mod~ s$ of the matrix $S$. After the -minimization, the matrix $S$ is reused with the new values of the residuals. To -solve the minimization problem, an iterative method is used. Two parameters are -required for that: the maximum number of iteration and the threshold to stop the -method. - -To summarize, the important parameters of TSARM are: +Algorithm~\ref{algo:01} summarizes the principle of the proposed method. The +outer iteration is inside the \emph{for} loop. Line~\ref{algo:solve}, the Krylov +method is called for a maximum of $max\_iter_{kryl}$ iterations. In practice, +we suggest to set this parameter equal to the restart number in the GMRES-like +method. Moreover, a tolerance threshold must be specified for the solver. In +practice, this threshold must be much smaller than the convergence threshold of +the TSIRM algorithm (\emph{i.e.}, $\epsilon_{tsirm}$). We also consider that +after the call of the $Solve$ function, we obtain the vector $x_k$ and the error +which is defined by $||Ax^k-b||_2$. + + Line~\ref{algo:store}, +$S_{k \mod s}=x^k$ consists in copying the solution $x_k$ into the column $k +\mod s$ of $S$. After the minimization, the matrix $S$ is reused with the new +values of the residuals. To solve the minimization problem, an iterative method +is used. Two parameters are required for that: the maximum number of iterations +and the threshold to stop the method. + +Let us summarize the most important parameters of TSIRM: \begin{itemize} -\item $\epsilon_{tsarm}$ the threshold to stop the TSARM method -\item $max\_iter_{kryl}$ the maximum number of iterations for the krylov method -\item $s$ the number of outer iterations before applying the minimization step -\item $max\_iter_{ls}$ the maximum number of iterations for the iterative least-square method -\item $\epsilon_{ls}$ the threshold to stop the least-square method +\item $\epsilon_{tsirm}$: the threshold to stop the TSIRM method; +\item $max\_iter_{kryl}$: the maximum number of iterations for the Krylov method; +\item $s$: the number of outer iterations before applying the minimization step; +\item $max\_iter_{ls}$: the maximum number of iterations for the iterative least-squares method; +\item $\epsilon_{ls}$: the threshold used to stop the least-squares method. \end{itemize} -The parallelisation of TSARM relies on the parallelization of all its -parts. More precisely, except the least-square step, all the other parts are +The parallelization of TSIRM relies on the parallelization of all its +parts. More precisely, except the least-squares step, all the other parts are obvious to achieve out in parallel. In order to develop a parallel version of our code, we have chosen to use PETSc~\cite{petsc-web-page}. For line~\ref{algo:matrix_mul} the matrix-matrix multiplication is implemented and efficient since the matrix $A$ is sparse and since the matrix $S$ contains few -colums in practice. As explained previously, at least two methods seem to be -interesting to solve the least-square minimization, CGLS and LSQR. +columns in practice. As explained previously, at least two methods seem to be +interesting to solve the least-squares minimization, CGLS and LSQR. In the following we remind the CGLS algorithm. The LSQR method follows more or -less the same principle but it take more place, so we briefly explain the parallelization of CGLS which is similar to LSQR. +less the same principle but it takes more place, so we briefly explain the parallelization of CGLS which is similar to LSQR. \begin{algorithm}[t] \caption{CGLS} \begin{algorithmic}[1] \Input $A$ (matrix), $b$ (right-hand side) \Output $x$ (solution vector)\vspace{0.2cm} - \State $r=b-Ax$ - \State $p=A'r$ - \State $s=p$ - \State $g=||s||^2_2$ - \For {$k=1,2,3,\ldots$ until convergence (g$<\epsilon_{ls}$)} \label{algo2:conv} - \State $q=Ap$ - \State $\alpha=g/||q||^2_2$ - \State $x=x+alpha*p$ - \State $r=r-alpha*q$ - \State $s=A'*r$ - \State $g_{old}=g$ - \State $g=||s||^2_2$ - \State $\beta=g/g_{old}$ + \State Let $x_0$ be an initial approximation + \State $r_0=b-Ax_0$ + \State $p_1=A^Tr_0$ + \State $s_0=p_1$ + \State $\gamma=||s_0||^2_2$ + \For {$k=1,2,3,\ldots$ until convergence ($\gamma<\epsilon_{ls}$)} \label{algo2:conv} + \State $q_k=Ap_k$ + \State $\alpha_k=\gamma/||q_k||^2_2$ + \State $x_k=x_{k-1}+\alpha_kp_k$ + \State $r_k=r_{k-1}-\alpha_kq_k$ + \State $s_k=A^Tr_k$ + \State $\gamma_{old}=\gamma$ + \State $\gamma=||s_k||^2_2$ + \State $\beta_k=\gamma/\gamma_{old}$ + \State $p_{k+1}=s_k+\beta_kp_k$ \EndFor \end{algorithmic} \label{algo:02} @@ -718,7 +733,7 @@ less the same principle but it take more place, so we briefly explain the parall In each iteration of CGLS, there is two matrix-vector multiplications and some -classical operations: dots, norm, multiplication and addition on vectors. All +classical operations: dot product, norm, multiplication and addition on vectors. All these operations are easy to implement in PETSc or similar environment. @@ -730,30 +745,102 @@ these operations are easy to implement in PETSc or similar environment. \label{sec:04} +We can now claim that, +\begin{proposition} +\label{prop:saad} +If $A$ is either a definite positive or a positive matrix and GMRES($m$) is used as solver, then the TSIRM algorithm is convergent. + +Furthermore, let $r_k$ be the +$k$-th residue of TSIRM, then +we have the following boundaries: +\begin{itemize} +\item when $A$ is positive: +\begin{equation} +||r_k|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{km}{2}} ||r_0|| , +\end{equation} +where $M$ is the symmetric part of $A$, $\alpha = \lambda_{min}(M)^2$ and $\beta = \lambda_{max}(A^T A)$; +\item when $A$ is positive definite: +\begin{equation} +\|r_k\| \leq \left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{km/2} \|r_0\|. +\end{equation} +\end{itemize} +%In the general case, where A is not positive definite, we have +%$\|r_n\| \le \inf_{p \in P_n} \|p(A)\| \le \kappa_2(V) \inf_{p \in P_n} \max_{\lambda \in \sigma(A)} |p(\lambda)| \|r_0\|, .$ +\end{proposition} + +\begin{proof} +Let us first recall that the residue is under control when considering the GMRES algorithm on a positive definite matrix, and it is bounded as follows: +\begin{equation*} +\|r_k\| \leq \left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{k/2} \|r_0\| . +\end{equation*} +Additionally, when $A$ is a positive real matrix with symmetric part $M$, then the residual norm provided at the $m$-th step of GMRES satisfies: +\begin{equation*} +||r_m|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{m}{2}} ||r_0|| , +\end{equation*} +where $\alpha$ and $\beta$ are defined as in Proposition~\ref{prop:saad}, which proves +the convergence of GMRES($m$) for all $m$ under such assumptions regarding $A$. +These well-known results can be found, \emph{e.g.}, in~\cite{Saad86}. + +We will now prove by a mathematical induction that, for each $k \in \mathbb{N}^\ast$, +$||r_k|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{mk}{2}} ||r_0||$ when $A$ is positive, and $\|r_k\| \leq \left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{km/2} \|r_0\|$ when $A$ is positive definite. + +The base case is obvious, as for $k=1$, the TSIRM algorithm simply consists in applying GMRES($m$) once, leading to a new residual $r_1$ that follows the inductive hypothesis due, to the results recalled above. + +Suppose now that the claim holds for all $m=1, 2, \hdots, k-1$, that is, $\forall m \in \{1,2,\hdots, k-1\}$, $||r_m|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{km}{2}} ||r_0||$ in the positive case, and $\|r_k\| \leq \left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{km/2} \|r_0\|$ in the definite positive one. +We will show that the statement holds too for $r_k$. Two situations can occur: +\begin{itemize} +\item If $k \not\equiv 0 ~(\textrm{mod}\ m)$, then the TSIRM algorithm consists in executing GMRES once. In that case and by using the inductive hypothesis, we obtain either $||r_k|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{m}{2}} ||r_{k-1}||\leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{km}{2}} ||r_0||$ if $A$ is positive, or $\|r_k\| \leqslant \left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{m/2} \|r_{k-1}\|$ $\leqslant$ $\left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{km/2} \|r_{0}\|$ in the positive definite case. +\item Else, the TSIRM algorithm consists in two stages: a first GMRES($m$) execution leads to a temporary $x_k$ whose residue satisfies: +\begin{itemize} +\item $||r_k|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{m}{2}} ||r_{k-1}||\leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{km}{2}} ||r_0||$ in the positive case, +\item $\|r_k\| \leqslant \left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{m/2} \|r_{k-1}\|$ $\leqslant$ $\left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{km/2} \|r_{0}\|$ in the positive definite one, +\end{itemize} +and a least squares resolution. +Let $\operatorname{span}(S) = \left \{ {\sum_{i=1}^k \lambda_i v_i \Big| k \in \mathbb{N}, v_i \in S, \lambda _i \in \mathbb{R}} \right \}$ be the linear span of a set of real vectors $S$. So,\\ +$\min_{\alpha \in \mathbb{R}^s} ||b-R\alpha ||_2 = \min_{\alpha \in \mathbb{R}^s} ||b-AS\alpha ||_2$ + +$\begin{array}{ll} +& = \min_{x \in span\left(S_{k-s+1}, S_{k-s+2}, \hdots, S_{k} \right)} ||b-AS\alpha ||_2\\ +& = \min_{x \in span\left(x_{k-s+1}, x_{k-s}+2, \hdots, x_{k} \right)} ||b-AS\alpha ||_2\\ +& \leqslant \min_{x \in span\left( x_{k} \right)} ||b-Ax ||_2\\ +& \leqslant \min_{\lambda \in \mathbb{R}} ||b-\lambda Ax_{k} ||_2\\ +& \leqslant ||b-Ax_{k}||_2\\ +& = ||r_k||_2\\ +& \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{km}{2}} ||r_0||, \textrm{ if $A$ is positive,}\\ +& \leqslant \left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{km/2} \|r_{0}\|, \textrm{ if $A$ is}\\ +& \textrm{positive definite,} +\end{array}$ +\end{itemize} +which concludes the induction and the proof. +\end{proof} +%We can remark that, at each iterate, the residue of the TSIRM algorithm is lower +%than the one of the GMRES method. %%%********************************************************* %%%********************************************************* -\section{Experiments using petsc} +\section{Experiments using PETSc} \label{sec:05} -In order to see the influence of our algorithm with only one processor, we first -show a comparison with the standard version of GMRES and our algorithm. In -table~\ref{tab:01}, we show the matrices we have used and some of them -characteristics. For all the matrices, the name, the field, the number of rows -and the number of nonzero elements is given. +In order to see the behavior of the proposal when considering only one processor, a first +comparison with GMRES or FGMRES and the new algorithm detailed previously has been experimented. +Matrices that have been used with their characteristics (names, fields, rows, and nonzero coefficients) are detailed in +Table~\ref{tab:01}. These latter, which are real-world applications matrices, +have been extracted + from the Davis collection, University of +Florida~\cite{Dav97}. -\begin{table*} +\begin{table}[htbp] \begin{center} \begin{tabular}{|c|c|r|r|r|} \hline Matrix name & Field &\# Rows & \# Nonzeros \\\hline \hline crashbasis & Optimization & 160,000 & 1,750,416 \\ -parabolic\_fem & Computational fluid dynamics & 525,825 & 2,100,225 \\ +parabolic\_fem & Comput. fluid dynamics & 525,825 & 2,100,225 \\ epb3 & Thermal problem & 84,617 & 463,625 \\ -atmosmodj & Computational fluid dynamics & 1,270,432 & 8,814,880 \\ -bfwa398 & Electromagnetics problem & 398 & 3,678 \\ +atmosmodj & Comput. fluid dynamics & 1,270,432 & 8,814,880 \\ +bfwa398 & Electromagnetics pb & 398 & 3,678 \\ torso3 & 2D/3D problem & 259,156 & 4,429,042 \\ \hline @@ -761,35 +848,37 @@ torso3 & 2D/3D problem & 259,156 & 4,429,042 \\ \caption{Main characteristics of the sparse matrices chosen from the Davis collection} \label{tab:01} \end{center} -\end{table*} - -The following parameters have been chosen for our experiments. As by default +\end{table} +Chosen parameters are detailed below. +%The following parameters have been chosen for our experiments. +As by default the restart of GMRES is performed every 30 iterations, we have chosen to stop -the GMRES every 30 iterations, $max\_iter_{kryl}=30$). $s$ is set to 8. CGLS is +the GMRES every 30 iterations (\emph{i.e.} $max\_iter_{kryl}=30$). $s$ is set to 8. CGLS is chosen to minimize the least-squares problem with the following parameters: $\epsilon_{ls}=1e-40$ and $max\_iter_{ls}=20$. The external precision is set to -$\epsilon_{tsarm}=1e-10$. Those experiments have been performed on a Intel(R) +$\epsilon_{tsirm}=1e-10$. Those experiments have been performed on a Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz with the version 3.5.1 of PETSc. In Table~\ref{tab:02}, some experiments comparing the solving of the linear systems obtained with the previous matrices with a GMRES variant and with out 2 stage algorithm are given. In the second column, it can be noticed that either -gmres or fgmres is used to solve the linear system. According to the matrices, -different preconditioner is used. With TSARM, the same solver and the same -preconditionner is used. This Table shows that TSARM can drastically reduce the -number of iterations to reach the convergence when the number of iterations for -the normal GMRES is more or less greater than 500. In fact this also depends on -tow parameters: the number of iterations to stop GMRES and the number of -iterations to perform the minimization. +GRMES or FGMRES (Flexible GMRES)~\cite{Saad:1993} is used to solve the linear +system. According to the matrices, different preconditioner is used. With +TSIRM, the same solver and the same preconditionner are used. This Table shows +that TSIRM can drastically reduce the number of iterations to reach the +convergence when the number of iterations for the normal GMRES is more or less +greater than 500. In fact this also depends on tow parameters: the number of +iterations to stop GMRES and the number of iterations to perform the +minimization. -\begin{table} +\begin{table}[htbp] \begin{center} \begin{tabular}{|c|c|r|r|r|r|} \hline - \multirow{2}{*}{Matrix name} & Solver / & \multicolumn{2}{c|}{GMRES} & \multicolumn{2}{c|}{TSARM CGLS} \\ + \multirow{2}{*}{Matrix name} & Solver / & \multicolumn{2}{c|}{GMRES} & \multicolumn{2}{c|}{TSIRM CGLS} \\ \cline{3-6} & precond & Time & \# Iter. & Time & \# Iter. \\\hline \hline @@ -802,7 +891,7 @@ torso3 & fgmres / sor & 37.70 & 565 & 34.97 & 510 \\ \hline \end{tabular} -\caption{Comparison of (F)GMRES and 2 stage (F)GMRES algorithms in sequential with some matrices, time is expressed in seconds.} +\caption{Comparison of (F)GMRES and TSIRM with (F)GMRES in sequential with some matrices, time is expressed in seconds.} \label{tab:02} \end{center} \end{table} @@ -811,33 +900,54 @@ torso3 & fgmres / sor & 37.70 & 565 & 34.97 & 510 \\ -In order to perform larger experiments, we have tested some example application +In order to perform larger experiments, we have tested some example applications of PETSc. Those applications are available in the ksp part which is suited for scalable linear equations solvers: \begin{itemize} \item ex15 is an example which solves in parallel an operator using a finite - difference scheme. The diagonal is equals to 4 and 4 extra-diagonals - representing the neighbors in each directions is equal to -1. This example is + difference scheme. The diagonal is equal to 4 and 4 extra-diagonals + representing the neighbors in each directions are equal to -1. This example is used in many physical phenomena, for example, heat and fluid flow, wave - propagation... + propagation, etc. \item ex54 is another example based on 2D problem discretized with quadrilateral finite elements. For this example, the user can define the scaling of material - coefficient in embedded circle, it is called $\alpha$. + coefficient in embedded circle called $\alpha$. \end{itemize} -For more technical details on these applications, interested reader are invited -to read the codes available in the PETSc sources. Those problem have been -chosen because they are scalable with many cores. We have tested other problem -but they are not scalable with many cores. - - - - -\begin{table*} +For more technical details on these applications, interested readers are invited +to read the codes available in the PETSc sources. Those problems have been +chosen because they are scalable with many cores which is not the case of other +problems that we have tested. + +In the following larger experiments are described on two large scale +architectures: Curie and Juqeen. Both these architectures are supercomputer +composed of 80,640 cores for Curie and 458,752 cores for Juqueen. Those machines +are respectively hosted by GENCI in France and Jülich Supercomputing Centre in +Germany. They belongs with other similar architectures of the PRACE initiative ( +Partnership for Advanced Computing in Europe) which aims at proposing high +performance supercomputing architecture to enhance research in Europe. The Curie +architecture is composed of Intel E5-2680 processors at 2.7 GHz with 2Gb memory +by core. The Juqueen architecture is composed of IBM PowerPC A2 at 1.6 GHz with +1Gb memory per core. Both those architecture are equiped with a dedicated high +speed network. + + +In many situations, using preconditioners is essential in order to find the +solution of a linear system. There are many preconditioners available in PETSc. +For parallel applications all the preconditioners based on matrix factorization +are not available. In our experiments, we have tested different kinds of +preconditioners, however as it is not the subject of this paper, we will not +present results with many preconditioners. In practise, we have chosen to use a +multigrid (mg) and successive over-relaxation (sor). For more details on the +preconditioner in PETSc please consult~\cite{petsc-web-page}. + + + +\begin{table*}[htbp] \begin{center} \begin{tabular}{|r|r|r|r|r|r|r|r|r|} \hline - nb. cores & precond & \multicolumn{2}{c|}{GMRES} & \multicolumn{2}{c|}{TSARM CGLS} & \multicolumn{2}{c|}{TSARM LSQR} & best gain \\ + nb. cores & precond & \multicolumn{2}{c|}{FGMRES} & \multicolumn{2}{c|}{TSIRM CGLS} & \multicolumn{2}{c|}{TSIRM LSQR} & best gain \\ \cline{3-8} & & Time & \# Iter. & Time & \# Iter. & Time & \# Iter. & \\\hline \hline 2,048 & mg & 403.49 & 18,210 & 73.89 & 3,060 & 77.84 & 3,270 & 5.46 \\ @@ -851,29 +961,62 @@ but they are not scalable with many cores. \hline \end{tabular} -\caption{Comparison of FGMRES and 2 stage FGMRES algorithms for ex15 of Petsc with 25000 components per core on Juqueen (threshold 1e-3, restart=30, s=12), time is expressed in seconds.} +\caption{Comparison of FGMRES and TSIRM with FGMRES for example ex15 of PETSc with two preconditioners (mg and sor) with 25,000 components per core on Juqueen (threshold 1e-3, restart=30, s=12), time is expressed in seconds.} \label{tab:03} \end{center} \end{table*} - -\begin{figure} +Table~\ref{tab:03} shows the execution times and the number of iterations of +example ex15 of PETSc on the Juqueen architecture. Different numbers of cores +are studied ranging from 2,048 up-to 16,383 with the two preconditioners {\it mg} and {\it sor}. For those experiments, the number of components (or unknowns of the +problems) per core is fixed to 25,000, also called weak scaling. This +number can seem relatively small. In fact, for some applications that need a lot +of memory, the number of components per processor requires sometimes to be +small. + + + +In Table~\ref{tab:03}, we can notice that TSIRM is always faster than FGMRES. The last +column shows the ratio between FGMRES and the best version of TSIRM according to +the minimization procedure: CGLS or LSQR. Even if we have computed the worst +case between CGLS and LSQR, it is clear that TSIRM is always faster than +FGMRES. For this example, the multigrid preconditioner is faster than SOR. The +gain between TSIRM and FGMRES is more or less similar for the two +preconditioners. Looking at the number of iterations to reach the convergence, +it is obvious that TSIRM allows the reduction of the number of iterations. It +should be noticed that for TSIRM, in those experiments, only the iterations of +the Krylov solver are taken into account. Iterations of CGLS or LSQR were not +recorded but they are time-consuming. In general each $max\_iter_{kryl}*s$ which +corresponds to 30*12, there are $max\_iter_{ls}$ which corresponds to 15. + +\begin{figure}[htbp] \centering \includegraphics[width=0.45\textwidth]{nb_iter_sec_ex15_juqueen} -\caption{Number of iterations per second with ex15 and the same parameters than in Table~\ref{tab:03}} +\caption{Number of iterations per second with ex15 and the same parameters than in Table~\ref{tab:03} (weak scaling)} \label{fig:01} \end{figure} +In Figure~\ref{fig:01}, the number of iterations per second corresponding to +Table~\ref{tab:03} is displayed. It can be noticed that the number of +iterations per second of FMGRES is constant whereas it decreases with TSIRM with +both preconditioners. This can be explained by the fact that when the number of +cores increases the time for the least-squares minimization step also increases but, generally, +when the number of cores increases, the number of iterations to reach the +threshold also increases, and, in that case, TSIRM is more efficient to reduce +the number of iterations. So, the overall benefit of using TSIRM is interesting. + + -\begin{table*} + +\begin{table*}[htbp] \begin{center} \begin{tabular}{|r|r|r|r|r|r|r|r|r|} \hline - nb. cores & threshold & \multicolumn{2}{c|}{GMRES} & \multicolumn{2}{c|}{TSARM CGLS} & \multicolumn{2}{c|}{TSARM LSQR} & best gain \\ + nb. cores & threshold & \multicolumn{2}{c|}{FGMRES} & \multicolumn{2}{c|}{TSIRM CGLS} & \multicolumn{2}{c|}{TSIRM LSQR} & best gain \\ \cline{3-8} & & Time & \# Iter. & Time & \# Iter. & Time & \# Iter. & \\\hline \hline 2,048 & 8e-5 & 108.88 & 16,560 & 23.06 & 3,630 & 22.79 & 3,630 & 4.77 \\ @@ -886,23 +1029,23 @@ but they are not scalable with many cores. \hline \end{tabular} -\caption{Comparison of FGMRES and 2 stage FGMRES algorithms for ex54 of Petsc (both with the MG preconditioner) with 25000 components per core on Curie (restart=30, s=12), time is expressed in seconds.} +\caption{Comparison of FGMRES and TSIRM with FGMRES algorithms for ex54 of Petsc (both with the MG preconditioner) with 25,000 components per core on Curie (restart=30, s=12), time is expressed in seconds.} \label{tab:04} \end{center} \end{table*} +In Table~\ref{tab:04}, some experiments with example ex54 on the Curie architecture are reported. - -\begin{table*} +\begin{table*}[htbp] \begin{center} \begin{tabular}{|r|r|r|r|r|r|r|r|r|r|r|} \hline - nb. cores & \multicolumn{2}{c|}{GMRES} & \multicolumn{2}{c|}{TSARM CGLS} & \multicolumn{2}{c|}{TSARM LSQR} & best gain & \multicolumn{3}{c|}{efficiency} \\ + nb. cores & \multicolumn{2}{c|}{FGMRES} & \multicolumn{2}{c|}{TSIRM CGLS} & \multicolumn{2}{c|}{TSIRM LSQR} & best gain & \multicolumn{3}{c|}{efficiency} \\ \cline{2-7} \cline{9-11} - & Time & \# Iter. & Time & \# Iter. & Time & \# Iter. & & GMRES & TS CGLS & TS LSQR\\\hline \hline + & Time & \# Iter. & Time & \# Iter. & Time & \# Iter. & & FGMRES & TS CGLS & TS LSQR\\\hline \hline 512 & 3,969.69 & 33,120 & 709.57 & 5,790 & 622.76 & 5,070 & 6.37 & 1 & 1 & 1 \\ 1024 & 1,530.06 & 25,860 & 290.95 & 4,830 & 307.71 & 5,070 & 5.25 & 1.30 & 1.21 & 1.01 \\ 2048 & 919.62 & 31,470 & 237.52 & 8,040 & 194.22 & 6,510 & 4.73 & 1.08 & .75 & .80\\ @@ -912,11 +1055,18 @@ but they are not scalable with many cores. \hline \end{tabular} -\caption{Comparison of FGMRES and 2 stage FGMRES algorithms for ex54 of Petsc (both with the MG preconditioner) with 204,919,225 components on Curie with different number of cores (restart=30, s=12, threshol 5e-5), time is expressed in seconds.} +\caption{Comparison of FGMRES and TSIRM with FGMRES for ex54 of Petsc (both with the MG preconditioner) with 204,919,225 components on Curie with different number of cores (restart=30, s=12, threshold 5e-5), time is expressed in seconds.} \label{tab:05} \end{center} \end{table*} +\begin{figure}[htbp] +\centering + \includegraphics[width=0.45\textwidth]{nb_iter_sec_ex54_curie} +\caption{Number of iterations per second with ex54 and the same parameters than in Table~\ref{tab:05} (strong scaling)} +\label{fig:02} +\end{figure} + %%%********************************************************* %%%********************************************************* @@ -930,13 +1080,22 @@ but they are not scalable with many cores. %%%********************************************************* %%%********************************************************* +A novel two-stage iterative algorithm has been proposed in this article, +in order to accelerate the convergence Krylov iterative methods. +Our TSIRM proposal acts as a merger between Krylov based solvers and +a least-squares minimization step. +The convergence of the method has been proven in some situations, while +experiments up to 16,394 cores have been led to verify that TSIRM runs +5 or 7 times faster than GMRES. + -future plan : \\ -- study other kinds of matrices, problems, inner solvers\\ -- test the influence of all the parameters\\ -- adaptative number of outer iterations to minimize\\ -- other methods to minimize the residuals?\\ -- implement our solver inside PETSc +For future work, the authors' intention is to investigate +other kinds of matrices, problems, and inner solvers. The +influence of all parameters must be tested too, while +other methods to minimize the residuals must be regarded. +The number of outer iterations to minimize should become +adaptative to improve the overall performances of the proposal. +Finally, this solver will be implemented inside PETSc. % conference papers do not normally have an appendix @@ -948,7 +1107,7 @@ future plan : \\ %%%********************************************************* \section*{Acknowledgment} This paper is partially funded by the Labex ACTION program (contract -ANR-11-LABX-01-01). We acknowledge PRACE for awarding us access to resource +ANR-11-LABX-01-01). We acknowledge PRACE for awarding us access to resources Curie and Juqueen respectively based in France and Germany. @@ -991,5 +1150,3 @@ Curie and Juqueen respectively based in France and Germany. % that's all folks \end{document} - -