X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/GMRES2stage.git/blobdiff_plain/7ff6342cfd0d2fa0e5df79322a88117ece68880e..dbb7065c4ccb3cdeb06911258b62798d3caa624d:/paper.tex diff --git a/paper.tex b/paper.tex index 36298aa..ba4d0b0 100644 --- a/paper.tex +++ b/paper.tex @@ -367,31 +367,29 @@ % % paper title % can use linebreaks \\ within to get better formatting as desired -\title{A Krylov two-stage algorithm to solve large sparse linear systems} +\title{TSARM: A Two-Stage Algorithm with least-square Residual Minimization to solve large sparse linear systems} %où %\title{A two-stage algorithm with error minimization to solve large sparse linear systems} %où %\title{???} + + + % author names and affiliations % use a multiple column layout for up to two different % affiliations -\author{\IEEEauthorblockN{Rapha\"el Couturier} -\IEEEauthorblockA{Femto-ST Institute - DISC Department\\ -Universit\'e de Franche-Comt\'e, IUT de Belfort-Montb\'eliard\\ -19 avenue de Mar\'echal Juin, BP 527 \\ -90016 Belfort Cedex, France\\ -Email: raphael.couturier@univ-fcomte.fr} -\and -\IEEEauthorblockN{Lilia Ziane Khodja} -\IEEEauthorblockA{Centre de Recherche INRIA Bordeaux Sud-Ouest\\ -200 avenue de la Vieille Tour\\ -33405 Talence Cedex, France\\ +\author{\IEEEauthorblockN{Rapha\"el Couturier\IEEEauthorrefmark{1}, Lilia Ziane Khodja \IEEEauthorrefmark{2} and Christophe Guyeux\IEEEauthorrefmark{1}} +\IEEEauthorblockA{\IEEEauthorrefmark{1} Femto-ST Institute, University of Franche Comte, France\\ +Email: \{raphael.couturier,christophe.guyeux\}@univ-fcomte.fr} +\IEEEauthorblockA{\IEEEauthorrefmark{2} INRIA Bordeaux Sud-Ouest, France\\ Email: lilia.ziane@inria.fr} } + + % conference papers do not typically use \thanks and this command % is locked out in conference mode. If really needed, such as for % the acknowledgment of grants, issue a \IEEEoverridecommandlockouts @@ -428,11 +426,20 @@ Email: lilia.ziane@inria.fr} \begin{abstract} -%The abstract goes here. DO NOT USE SPECIAL CHARACTERS, SYMBOLS, OR MATH IN YOUR TITLE OR ABSTRACT. +In this paper we propose a two stage iterative method which increases the +convergence of Krylov iterative methods, typically those of GMRES variants. The +principle of our approach is to build an external iteration over the Krylov +method and to save the current residual frequently (for example, for each +restart of GMRES). Then after a given number of outer iterations, a minimization +step is applied on the matrix composed of the save residuals in order to compute +a better solution and make a new iteration if necessary. We prove that our +method has the same convergence property than the inner method used. Some +experiments using up to 16,394 cores show that compared to GMRES our algorithm +can be around 7 times faster. \end{abstract} \begin{IEEEkeywords} -Iterative Krylov methods; sparse linear systems; error minimization; PETSc; %à voir... +Iterative Krylov methods; sparse linear systems; residual minimization; PETSc; %à voir... \end{IEEEkeywords} @@ -539,12 +546,14 @@ Iterative Krylov methods; sparse linear systems; error minimization; PETSc; %à % no \IEEEPARstart % You must have at least 2 lines in the paragraph with the drop letter % (should never be an issue) -Iterative methods are become more attractive than direct ones to solve very -large sparse linear systems. They are more effective in a parallel context and -require less memory and arithmetic operations than direct methods. A number of -iterative methods are proposed and adapted by many researchers and the increased -need for solving very large sparse linear systems triggered the development of -efficient iterative techniques suitable for the parallel processing. + +Iterative methods became more attractive than direct ones to solve very large +sparse linear systems. Iterative methods are more effecient in a parallel +context, with thousands of cores, and require less memory and arithmetic +operations than direct methods. A number of iterative methods are proposed and +adapted by many researchers and the increased need for solving very large sparse +linear systems triggered the development of efficient iterative techniques +suitable for the parallel processing. Most of the successful iterative methods currently available are based on Krylov subspaces which consist in forming a basis of a sequence of successive matrix @@ -566,15 +575,18 @@ large clusters. In this paper we propose a two-stage algorithm based on two nested iterations called inner-outer iterations. This algorithm consists in solving the sparse linear system iteratively with a small number of inner iterations and restarts -the outer step with a new solution minimizing some error functions over a Krylov -subspace. This algorithm is iterative and easy to parallelize on large clusters -and the minimization technique improves its convergence and performances. +the outer step with a new solution minimizing some error functions over some +previous residuals. This algorithm is iterative and easy to parallelize on large +clusters and the minimization technique improves its convergence and +performances. The present paper is organized as follows. In Section~\ref{sec:02} some related -works are presented. Section~\ref{sec:03} presents our two-stage algorithm based -on Krylov subspace iteration methods. Section~\ref{sec:04} shows some -experimental results obtained on large clusters of our algorithm using routines -of PETSc toolkit. +works are presented. Section~\ref{sec:03} presents our two-stage algorithm using +a least-square residual minimization. Section~\ref{sec:04} describes some +convergence results on this method. In Section~\ref{sec:05}, parallization +details of TSARM are given. Section~\ref{sec:06} shows some experimental +results obtained on large clusters of our algorithm using routines of PETSc +toolkit. Finally Section~\ref{sec:06} concludes and gives some perspectives. %%%********************************************************* %%%********************************************************* @@ -592,7 +604,7 @@ of PETSc toolkit. %%%********************************************************* %%%********************************************************* -\section{A Krylov two-stage algorithm} +\section{Two-stage algorithm with least-square residuals minimization} \label{sec:03} A two-stage algorithm is proposed to solve large sparse linear systems of the form $Ax=b$, where $A\in\mathbb{R}^{n\times n}$ is a sparse and square @@ -601,76 +613,139 @@ $b\in\mathbb{R}^n$ is the right-hand side. The algorithm is implemented as an inner-outer iteration solver based on iterative Krylov methods. The main key points of our solver are given in Algorithm~\ref{algo:01}. -In order to accelerate the convergence, the outer iteration is implemented as an -iterative Krylov method which minimizes some error functions over a Krylov -subspace~\cite{saad96}. At each iteration, the sparse linear system $Ax=b$ is -solved iteratively with an iterative method, for example GMRES -method~\cite{saad86} or some of its variants, and the Krylov subspace that we -used is spanned by a basis $S$ composed of successive solutions issued from the -inner iteration -\begin{equation} - S = \{x^1, x^2, \ldots, x^s\} \text{,~} s\leq n. -\end{equation} -The advantage of such a Krylov subspace is that we neither need an orthogonal -basis nor any synchronization between processors to generate this basis. The -algorithm is periodically restarted every $s$ iterations with a new initial -guess $x=S\alpha$ which minimizes the residual norm $\|b-Ax\|_2$ over the Krylov -subspace spanned by vectors of $S$, where $\alpha$ is a solution of the normal -equations -\begin{equation} - R^TR\alpha = R^Tb, -\end{equation} -which is associated with the least-squares problem +In order to accelerate the convergence, the outer iteration periodically applies +a least-square minimization on the residuals computed by the inner solver. The +inner solver is a Krylov based solver which does not required to be changed. + +At each outer iteration, the sparse linear system $Ax=b$ is solved, only for $m$ +iterations, using an iterative method restarting with the previous solution. For +example, the GMRES method~\cite{Saad86} or some of its variants can be used as a +inner solver. The current solution of the Krylov method is saved inside a matrix +$S$ composed of successive solutions computed by the inner iteration. + +Periodically, every $s$ iterations, the minimization step is applied in order to +compute a new solution $x$. For that, the previous residuals are computed with +$(b-AS)$. The minimization of the residuals is obtained by \begin{equation} \underset{\alpha\in\mathbb{R}^{s}}{min}\|b-R\alpha\|_2 \label{eq:01} \end{equation} -such that $R=AS$ is a dense rectangular matrix in $\mathbb{R}^{n\times s}$, -$s\ll n$, and $R^T$ denotes the transpose of matrix $R$. We use an iterative -method to solve the least-squares problem~(\ref{eq:01}) such as CGLS -~\cite{hestenes52} or LSQR~\cite{paige82} which are more appropriate than a -direct method in the parallel context. +with $R=AS$. Then the new solution $x$ is computed with $x=S\alpha$. + + +In practice, $R$ is a dense rectangular matrix in $\mathbb{R}^{n\times s}$, +$s\ll n$. In order to minimize~(\ref{eq:01}), a least-square method such as +CGLS ~\cite{Hestenes52} or LSQR~\cite{Paige82} is used. Those methods are more +appropriate than a direct method in a parallel context. \begin{algorithm}[t] -\caption{A Krylov two-stage algorithm} +\caption{TSARM} \begin{algorithmic}[1] \Input $A$ (sparse matrix), $b$ (right-hand side) \Output $x$ (solution vector)\vspace{0.2cm} \State Set the initial guess $x^0$ - \For {$k=1,2,3,\ldots$ until convergence} - \State Solve iteratively $Ax^k=b$ - \State $S_{k~mod~s}=x^k$ - \If {$k$ mod $s=0$ {\bf and} not convergence} - \State Compute dense matrix $R=AS$ - \State Solve least-squares problem $\underset{\alpha\in\mathbb{R}^{s}}{min}\|b-R\alpha\|_2$ - \State Compute minimizer $x^k=S\alpha$ + \For {$k=1,2,3,\ldots$ until convergence (error$<\epsilon_{kryl}$)} \label{algo:conv} + \State $x^k=Solve(A,b,x^{k-1},max\_iter_{kryl})$ \label{algo:solve} + \State retrieve error + \State $S_{k~mod~s}=x^k$ \label{algo:store} + \If {$k$ mod $s=0$ {\bf and} error$>\epsilon_{kryl}$} + \State $R=AS$ \Comment{compute dense matrix} \label{algo:matrix_mul} + \State Solve least-squares problem $\underset{\alpha\in\mathbb{R}^{s}}{min}\|b-R\alpha\|_2$ \label{algo:} + \State $x^k=S\alpha$ \Comment{compute new solution} \EndIf \EndFor \end{algorithmic} \label{algo:01} \end{algorithm} -Operation $S_{k~ mod~ s}=x^k$ consists in copying the residual $x_k$ into the -column $k~ mod~ s$ of the matrix $S$. After the minimization, the matrix $S$ is -reused with the new values of the residuals. +Algorithm~\ref{algo:01} summarizes the principle of our method. The outer +iteration is inside the for loop. Line~\ref{algo:solve}, the Krylov method is +called for a maximum of $max\_iter_{kryl}$ iterations. In practice, we suggest to set this parameter +equals to the restart number of the GMRES-like method. Moreover, a tolerance +threshold must be specified for the solver. In practice, this threshold must be +much smaller than the convergence threshold of the TSARM algorithm (i.e. +$\epsilon$). Line~\ref{algo:store}, $S_{k~ mod~ s}=x^k$ consists in copying the +solution $x_k$ into the column $k~ mod~ s$ of the matrix $S$. After the +minimization, the matrix $S$ is reused with the new values of the residuals. To +solve the minimization problem, an iterative method is used. Two parameters are +required for that: the maximum number of iteration and the threshold to stop the +method. + +To summarize, the important parameters of TSARM are: +\begin{itemize} +\item $\epsilon_{kryl}$ the threshold to stop the method of the krylov method +\item $max\_iter_{kryl}$ the maximum number of iterations for the krylov method +\item $s$ the number of outer iterations before applying the minimization step +\item $max\_iter_{ls}$ the maximum number of iterations for the iterative least-square method +\item $\epsilon_{ls}$ the threshold to stop the least-square method +\end{itemize} %%%********************************************************* %%%********************************************************* +\section{Convergence results} +\label{sec:04} + +%%%********************************************************* +%%%********************************************************* +\section{Parallelization} +\label{sec:05} + +The parallelisation of TSARM relies on the parallelization of all its +parts. More precisely, except the least-square step, all the other parts are +obvious to achieve out in parallel. In order to develop a parallel version of +our code, we have chosen to use PETSc~\cite{petsc-web-page}. For +line~\ref{algo:matrix_mul} the matrix-matrix multiplication is implemented and +efficient since the matrix $A$ is sparse and since the matrix $S$ contains few +colums in practice. As explained previously, at least two methods seem to be +interesting to solve the least-square minimization, CGLS and LSQR. + +In the following we remind the CGLS algorithm. The LSQR method follows more or +less the same principle but it take more place, so we briefly explain the parallelization of CGLS which is similar to LSQR. + +\begin{algorithm}[t] +\caption{CGLS} +\begin{algorithmic}[1] + \Input $A$ (matrix), $b$ (right-hand side) + \Output $x$ (solution vector)\vspace{0.2cm} + \State $r=b-Ax$ + \State $p=A'r$ + \State $s=p$ + \State $g=||s||^2_2$ + \For {$k=1,2,3,\ldots$ until convergence (g$<\epsilon_{ls}$)} \label{algo2:conv} + \State $q=Ap$ + \State $\alpha=g/||q||^2_2$ + \State $x=x+alpha*p$ + \State $r=r-alpha*q$ + \State $s=A'*r$ + \State $g_{old}=g$ + \State $g=||s||^2_2$ + \State $\beta=g/g_{old}$ + \EndFor +\end{algorithmic} +\label{algo:02} +\end{algorithm} + + +In each iteration of CGLS, there is two matrix-vector multiplications and some +classical operations: dots, norm, multiplication and addition on vectors. All +these operations are easy to implement in PETSc or similar environment. + %%%********************************************************* %%%********************************************************* \section{Experiments using petsc} -\label{sec:04} +\label{sec:06} In order to see the influence of our algorithm with only one processor, we first show a comparison with the standard version of GMRES and our algorithm. In table~\ref{tab:01}, we show the matrices we have used and some of them -characteristics. +characteristics. For all the matrices, the name, the field, the number of rows +and the number of nonzero elements is given. -\begin{table} +\begin{table*} \begin{center} \begin{tabular}{|c|c|r|r|r|} \hline @@ -687,9 +762,30 @@ torso3 & 2D/3D problem & 259,156 & 4,429,042 \\ \caption{Main characteristics of the sparse matrices chosen from the Davis collection} \label{tab:01} \end{center} -\end{table} - - +\end{table*} + +The following parameters have been chosen for our experiments. As by default +the restart of GMRES is performed every 30 iterations, we have chosen to stop +the GMRES every 30 iterations (line \ref{algo:solve} in +Algorithm~\ref{algo:01}). $s$ is set to 8. CGLS is chosen to minimize the +least-squares problem. Two conditions are used to stop CGLS, either the +precision is under $1e-40$ or the number of iterations is greater to $20$. The +external precision is set to $1e-10$ (line \ref{algo:conv} in +Algorithm~\ref{algo:01}). Those experiments have been performed on a Intel(R) +Core(TM) i7-3630QM CPU @ 2.40GHz with the version 3.5.1 of PETSc. + + +In Table~\ref{tab:02}, some experiments comparing the solving of the linear +systems obtained with the previous matrices with a GMRES variant and with out 2 +stage algorithm are given. In the second column, it can be noticed that either +gmres or fgmres is used to solve the linear system. According to the matrices, +different preconditioner is used. With the 2 stage algorithm, the same solver +and the same preconditionner is used. This Table shows that the 2 stage +algorithm can drastically reduce the number of iterations to reach the +convergence when the number of iterations for the normal GMRES is more or less +greater than 500. In fact this also depends on tow parameters: the number of +iterations to stop GMRES and the number of iterations to perform the +minimization. \begin{table} @@ -697,32 +793,76 @@ torso3 & 2D/3D problem & 259,156 & 4,429,042 \\ \begin{tabular}{|c|c|r|r|r|r|} \hline - \multirow{2}{*}{Matrix name} & Solver / & \multicolumn{2}{c|}{gmres variant} & \multicolumn{2}{c|}{2 stage} \\ + \multirow{2}{*}{Matrix name} & Solver / & \multicolumn{2}{c|}{gmres variant} & \multicolumn{2}{c|}{2 stage CGLS} \\ +\cline{3-6} & precond & Time & \# Iter. & Time & \# Iter. \\\hline \hline crashbasis & gmres / none & 15.65 & 518 & 14.12 & 450 \\ -parabolic\_fem & gmres / ilu & 2152 & ?? & 724 & ?? \\ +parabolic\_fem & gmres / ilu & 1009.94 & 7573 & 401.52 & 2970 \\ epb3 & fgmres / sor & 8.67 & 600 & 8.21 & 540 \\ atmosmodj & fgmres / sor & 104.23 & 451 & 88.97 & 366 \\ bfwa398 & gmres / none & 1.42 & 9612 & 0.28 & 1650 \\ -torso3 & fgmres/sor & 565 & 37.70 & 34.97 & 510 \\ +torso3 & fgmres / sor & 37.70 & 565 & 34.97 & 510 \\ \hline \end{tabular} -\caption{Comparison of GMRES and 2 stage GMRES algorithms in sequential with some matrices, time is expressed in seconds.} +\caption{Comparison of (F)GMRES and 2 stage (F)GMRES algorithms in sequential with some matrices, time is expressed in seconds.} \label{tab:02} \end{center} \end{table} -Param : retart 30 iters -cols = 8 -iter cgls = 20 -cgls prec = 1e-40 -prec = 1e-10 -Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz +Larger experiments .... + +\begin{table*} +\begin{center} +\begin{tabular}{|r|r|r|r|r|r|r|r|r|} +\hline + + nb. cores & precond & \multicolumn{2}{c|}{gmres variant} & \multicolumn{2}{c|}{2 stage CGLS} & \multicolumn{2}{c|}{2 stage LSQR} & best gain \\ +\cline{3-8} + & & Time & \# Iter. & Time & \# Iter. & Time & \# Iter. & \\\hline \hline + 2,048 & mg & 403.49 & 18,210 & 73.89 & 3,060 & 77.84 & 3,270 & 5.46 \\ + 2,048 & sor & 745.37 & 57,060 & 87.31 & 6,150 & 104.21 & 7,230 & 8.53 \\ + 4,096 & mg & 562.25 & 25,170 & 97.23 & 3,990 & 89.71 & 3,630 & 6.27 \\ + 4,096 & sor & 912.12 & 70,194 & 145.57 & 9,750 & 168.97 & 10,980 & 6.26 \\ + 8,192 & mg & 917.02 & 40,290 & 148.81 & 5,730 & 143.03 & 5,280 & 6.41 \\ + 8,192 & sor & 1,404.53 & 106,530 & 212.55 & 12,990 & 180.97 & 10,470 & 7.76 \\ + 16,384 & mg & 1,430.56 & 63,930 & 237.17 & 8,310 & 244.26 & 7,950 & 6.03 \\ + 16,384 & sor & 2,852.14 & 216,240 & 418.46 & 21,690 & 505.26 & 23,970 & 6.82 \\ +\hline + +\end{tabular} +\caption{Comparison of FGMRES and 2 stage FGMRES algorithms for ex15 of Petsc with 25000 components per core on Juqueen (threshold 1e-3, restart=30, s=12), time is expressed in seconds.} +\label{tab:03} +\end{center} +\end{table*} + + +\begin{table*} +\begin{center} +\begin{tabular}{|r|r|r|r|r|r|r|r|r|} +\hline + + nb. cores & threshold & \multicolumn{2}{c|}{gmres variant} & \multicolumn{2}{c|}{2 stage CGLS} & \multicolumn{2}{c|}{2 stage LSQR} & best gain \\ +\cline{3-8} + & & Time & \# Iter. & Time & \# Iter. & Time & \# Iter. & \\\hline \hline + 2,048 & 8e-5 & 108.88 & 16,560 & 23.06 & 3,630 & 22.79 & 3,630 & 4.77 \\ + 2,048 & 6e-5 & 194.01 & 30,270 & 35.50 & 5,430 & 27.74 & 4,350 & 6.99 \\ + 4,096 & 7e-5 & 160.59 & 22,530 & 35.15 & 5,130 & 29.21 & 4,350 & 5.49 \\ + 4,096 & 6e-5 & 249.27 & 35,520 & 52.13 & 7,950 & 39.24 & 5,790 & 6.35 \\ + 8,192 & 6e-5 & 149.54 & 17,280 & 28.68 & 3,810 & 29.05 & 3,990 & 5.21 \\ + 8,192 & 5e-5 & 792.11 & 109,590 & 76.83 & 10,470 & 65.20 & 9,030 & 12.14 \\ + 16,384 & 4e-5 & 718.61 & 86,400 & 98.98 & 10,830 & 131.86 & 14,790 & 7.26 \\ +\hline + +\end{tabular} +\caption{Comparison of FGMRES and 2 stage FGMRES algorithms for ex54 of Petsc (both with the MG preconditioner) with 25000 components per core on Curie (restart=30, s=12), time is expressed in seconds.} +\label{tab:04} +\end{center} +\end{table*} %%%********************************************************* %%%********************************************************* @@ -731,12 +871,19 @@ Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz %%%********************************************************* %%%********************************************************* \section{Conclusion} -\label{sec:05} +\label{sec:07} %The conclusion goes here. this is more of the conclusion %%%********************************************************* %%%********************************************************* +future plan : \\ +- study other kinds of matrices, problems, inner solvers\\ +- test the influence of all the parameters\\ +- adaptative number of outer iterations to minimize\\ +- other methods to minimize the residuals?\\ +- implement our solver inside PETSc + % conference papers do not normally have an appendix @@ -746,10 +893,10 @@ Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz %%%********************************************************* %%%********************************************************* \section*{Acknowledgment} -%The authors would like to thank... -%more thanks here -%%%********************************************************* -%%%********************************************************* +This paper is partially funded by the Labex ACTION program (contract +ANR-11-LABX-01-01). We acknowledge PRACE for awarding us access to resource +Curie and Juqueen respectively based in France and Germany. + % trigger a \newpage just before the given reference @@ -767,23 +914,23 @@ Intel(R) Core(TM) i7-3630QM CPU @ 2.40GHz % http://www.ctan.org/tex-archive/biblio/bibtex/contrib/doc/ % The IEEEtran BibTeX style support page is at: % http://www.michaelshell.org/tex/ieeetran/bibtex/ -%\bibliographystyle{IEEEtran} +\bibliographystyle{IEEEtran} % argument is your BibTeX string definitions and bibliography database(s) -%\bibliography{IEEEabrv,../bib/paper} +\bibliography{biblio} % % manually copy in the resultant .bbl file % set second argument of \begin to the number of references % (used to reserve space for the reference number labels box) -\begin{thebibliography}{1} +%% \begin{thebibliography}{1} -\bibitem{saad86} Y.~Saad and M.~H.~Schultz, \emph{GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems}, SIAM Journal on Scientific and Statistical Computing, 7(3):856--869, 1986. +%% \bibitem{saad86} Y.~Saad and M.~H.~Schultz, \emph{GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems}, SIAM Journal on Scientific and Statistical Computing, 7(3):856--869, 1986. -\bibitem{saad96} Y.~Saad, \emph{Iterative Methods for Sparse Linear Systems}, PWS Publishing, New York, 1996. +%% \bibitem{saad96} Y.~Saad, \emph{Iterative Methods for Sparse Linear Systems}, PWS Publishing, New York, 1996. -\bibitem{hestenes52} M.~R.~Hestenes and E.~Stiefel, \emph{Methods of conjugate gradients for solving linear system}, Journal of Research of National Bureau of Standards, B49:409--436, 1952. +%% \bibitem{hestenes52} M.~R.~Hestenes and E.~Stiefel, \emph{Methods of conjugate gradients for solving linear system}, Journal of Research of National Bureau of Standards, B49:409--436, 1952. -\bibitem{paige82} C.~C.~Paige and A.~M.~Saunders, \emph{LSQR: An Algorithm for Sparse Linear Equations and Sparse Least Squares}, ACM Trans. Math. Softw. 8(1):43--71, 1982. -\end{thebibliography} +%% \bibitem{paige82} C.~C.~Paige and A.~M.~Saunders, \emph{LSQR: An Algorithm for Sparse Linear Equations and Sparse Least Squares}, ACM Trans. Math. Softw. 8(1):43--71, 1982. +%% \end{thebibliography}