% quality.
-%\usepackage{eqparbox}
+\usepackage{eqparbox}
% Also of notable interest is Scott Pakin's eqparbox package for creating
% (automatically sized) equal width boxes - aka "natural width parboxes".
% Available at:
\hyphenation{op-tical net-works semi-conduc-tor}
-
+\usepackage[utf8]{inputenc}
+\usepackage[T1]{fontenc}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{amsmath}
\usepackage{amssymb}
+\usepackage{multirow}
+\usepackage{graphicx}
\algnewcommand\algorithmicinput{\textbf{Input:}}
\algnewcommand\Input{\item[\algorithmicinput]}
\algnewcommand\algorithmicoutput{\textbf{Output:}}
\algnewcommand\Output{\item[\algorithmicoutput]}
-
+\newtheorem{proposition}{Proposition}
\begin{document}
%
% paper title
% can use linebreaks \\ within to get better formatting as desired
-\title{A Krylov two-stage algorithm to solve large sparse linear systems}
-%où
-%\title{A two-stage algorithm with error minimization to solve large sparse linear systems}
-%où
-%\title{???}
+\title{TSIRM: A Two-Stage Iteration with least-squares Residual Minimization algorithm to solve large sparse linear systems}
+
+
+
+
% author names and affiliations
% use a multiple column layout for up to two different
% affiliations
-\author{\IEEEauthorblockN{Rapha\"el Couturier}
-\IEEEauthorblockA{Femto-ST Institute - DISC Department\\
-Universit\'e de Franche-Comt\'e, IUT de Belfort-Montb\'eliard\\
-19 avenue de Mar\'echal Juin, BP 527 \\
-90016 Belfort Cedex, France\\
-Email: raphael.couturier@univ-fcomte.fr}
-\and
-\IEEEauthorblockN{Lilia Ziane Khodja}
-\IEEEauthorblockA{Centre de Recherche INRIA Bordeaux Sud-Ouest\\
-200 avenue de la Vieille Tour\\
-33405 Talence Cedex, France\\
+\author{\IEEEauthorblockN{Rapha\"el Couturier\IEEEauthorrefmark{1}, Lilia Ziane Khodja\IEEEauthorrefmark{2}, and Christophe Guyeux\IEEEauthorrefmark{1}}
+\IEEEauthorblockA{\IEEEauthorrefmark{1} Femto-ST Institute, University of Franche-Comt\'e, France\\
+Email: \{raphael.couturier,christophe.guyeux\}@univ-fcomte.fr}
+\IEEEauthorblockA{\IEEEauthorrefmark{2} INRIA Bordeaux Sud-Ouest, France\\
Email: lilia.ziane@inria.fr}
}
+
+
% conference papers do not typically use \thanks and this command
% is locked out in conference mode. If really needed, such as for
% the acknowledgment of grants, issue a \IEEEoverridecommandlockouts
\begin{abstract}
-%The abstract goes here. DO NOT USE SPECIAL CHARACTERS, SYMBOLS, OR MATH IN YOUR TITLE OR ABSTRACT.
+In this article, a two-stage iterative algorithm is proposed to improve the
+convergence of Krylov based iterative methods, typically those of GMRES
+variants. The principle of the proposed approach is to build an external
+iteration over the Krylov method, and to frequently store its current residual
+(at each GMRES restart for instance). After a given number of outer iterations,
+a least-squares minimization step is applied on the matrix composed by the saved
+residuals, in order to compute a better solution and to make new iterations if
+required. It is proven that the proposal has the same convergence properties
+than the inner embedded method itself. Experiments using up to 16,394 cores
+also show that the proposed algorithm runs around 5 or 7 times faster than
+GMRES.
\end{abstract}
\begin{IEEEkeywords}
-Iterative Krylov methods; sparse linear systems; error minimization; PETSc; %à voir...
+Iterative Krylov methods; sparse linear systems; two stage iteration; least-squares residual minimization; PETSc
\end{IEEEkeywords}
% no \IEEEPARstart
% You must have at least 2 lines in the paragraph with the drop letter
% (should never be an issue)
-Iterative methods are become more attractive than direct ones to solve very
-large sparse linear systems. They are more effective in a parallel context and
-require less memory and arithmetic operations than direct methods. A number of
-iterative methods are proposed and adapted by many researchers and the increased
-need for solving very large sparse linear systems triggered the development of
-efficient iterative techniques suitable for the parallel processing.
-
-Most of the successful iterative methods currently available are based on Krylov
-subspaces which consist in forming a basis of a sequence of successive matrix
-powers times an initial vector for example the residual. These methods are based
-on orthogonality of vectors of the Krylov subspace basis to solve linear
-systems. The most well-known iterative Krylov subspace methods are Conjugate
-Gradient method and GMRES method (generalized minimal residual).
-
-However, iterative methods suffer from scalability problems on parallel
-computing platforms with many processors due to their need for reduction
-operations and collective communications to perform matrix-vector
+
+Iterative methods have recently become more attractive than direct ones to solve
+very large sparse linear systems\cite{Saad2003}. They are more efficient in a
+parallel context, supporting thousands of cores, and they require less memory
+and arithmetic operations than direct methods~\cite{bahicontascoutu}. This is
+why new iterative methods are frequently proposed or adapted by researchers, and
+the increasing need to solve very large sparse linear systems has triggered the
+development of such efficient iterative techniques suitable for parallel
+processing.
+
+Most of the successful iterative methods currently available are based on
+so-called ``Krylov subspaces''. They consist in forming a basis of successive
+matrix powers multiplied by an initial vector, which can be for instance the
+residual. These methods use vectors orthogonality of the Krylov subspace basis
+in order to solve linear systems. The most known iterative Krylov subspace
+methods are conjugate gradient and GMRES ones (Generalized Minimal RESidual).
+
+
+However, iterative methods suffer from scalability problems on parallel
+computing platforms with many processors, due to their need of reduction
+operations, and to collective communications to achieve matrix-vector
multiplications. The communications on large clusters with thousands of cores
-and large sizes of messages can significantly affect the performances of
-iterative methods. In practice, Krylov subspace iteration methods are often used
-with preconditioners in order to increase their convergence and accelerate their
-performances. However, most of the good preconditioners are not scalable on
-large clusters.
-
-In this paper we propose a two-stage algorithm based on two nested iterations
-called inner-outer iterations. This algorithm consists in solving the sparse
-linear system iteratively with a small number of inner iterations and restarts
-the outer step with a new solution minimizing some error functions over a Krylov
-subspace. This algorithm is iterative and easy to parallelize on large clusters
-and the minimization technique improves its convergence and performances.
-
-The present paper is organized as follows. In Section~\ref{sec:02} some related
-works are presented. Section~\ref{sec:03} presents our two-stage algorithm based
-on Krylov subspace iteration methods. Section~\ref{sec:04} shows some
-experimental results obtained on large clusters of our algorithm using routines
-of PETSc toolkit.
+and large sizes of messages can significantly affect the performances of these
+iterative methods. As a consequence, Krylov subspace iteration methods are often
+used with preconditioners in practice, to increase their convergence and
+accelerate their performances. However, most of the good preconditioners are
+not scalable on large clusters.
+
+In this research work, a two-stage algorithm based on two nested iterations
+called inner-outer iterations is proposed. This algorithm consists in solving
+the sparse linear system iteratively with a small number of inner iterations,
+and restarting the outer step with a new solution minimizing some error
+functions over some previous residuals. For further information on two-stage
+iteration methods, interested readers are invited to
+consult~\cite{Nichols:1973:CTS}. Two-stage algorithms are easy to parallelize on
+large clusters. Furthermore, the least-squares minimization technique improves
+its convergence and performances.
+
+The present article is organized as follows. Related works are presented in
+Section~\ref{sec:02}. Section~\ref{sec:03} details the two-stage algorithm using
+a least-squares residual minimization, while Section~\ref{sec:04} provides
+convergence results regarding this method. Section~\ref{sec:05} shows some
+experimental results obtained on large clusters using routines of PETSc
+toolkit. This research work ends by a conclusion section, in which the proposal
+is summarized while intended perspectives are provided.
+
%%%*********************************************************
%%%*********************************************************
%%%*********************************************************
%%%*********************************************************
-\section{A Krylov two-stage algorithm}
+\section{Two-stage iteration with least-squares residuals minimization algorithm}
\label{sec:03}
A two-stage algorithm is proposed to solve large sparse linear systems of the
form $Ax=b$, where $A\in\mathbb{R}^{n\times n}$ is a sparse and square
-nonsingular matrix, $x\in\mathbb{R}^n$ is the solution vector and
-$b\in\mathbb{R}^n$ is the right-hand side. The algorithm is implemented as an
-inner-outer iteration solver based on iterative Krylov methods. The main key
-points of our solver are given in Algorithm~\ref{algo:01}.
-
-In order to accelerate the convergence, the outer iteration is implemented as an
-iterative Krylov method which minimizes some error functions over a Krylov
-subspace~\cite{saad96}. At each iteration, the sparse linear system $Ax=b$ is
-solved iteratively with an iterative method, for example GMRES
-method~\cite{saad86} or some of its variants, and the Krylov subspace that we
-used is spanned by a basis $S$ composed of successive solutions issued from the
-inner iteration
-\begin{equation}
- S = \{x^1, x^2, \ldots, x^s\} \text{,~} s\leq n.
-\end{equation}
-The advantage of such a Krylov subspace is that we neither need an orthogonal
-basis nor any synchronization between processors to generate this basis. The
-algorithm is periodically restarted every $s$ iterations with a new initial
-guess $x=S\alpha$ which minimizes the residual norm $\|b-Ax\|_2$ over the Krylov
-subspace spanned by vectors of $S$, where $\alpha$ is a solution of the normal
-equations
-\begin{equation}
- R^TR\alpha = R^Tb,
-\end{equation}
-which is associated with the least-squares problem
+nonsingular matrix, $x\in\mathbb{R}^n$ is the solution vector, and
+$b\in\mathbb{R}^n$ is the right-hand side. As explained previously,
+the algorithm is implemented as an
+inner-outer iteration solver based on iterative Krylov methods. The main
+key-points of the proposed solver are given in Algorithm~\ref{algo:01}.
+It can be summarized as follows: the
+inner solver is a Krylov based one. In order to accelerate its convergence, the
+outer solver periodically applies a least-squares minimization on the residuals computed by the inner one. %Tsolver which does not required to be changed.
+
+At each outer iteration, the sparse linear system $Ax=b$ is partially solved
+using only $m$ iterations of an iterative method, this latter being initialized
+with the last obtained approximation. GMRES method~\cite{Saad86}, or any of its
+variants, can potentially be used as inner solver. The current approximation of
+the Krylov method is then stored inside a $n \times s$ matrix $S$, which is
+composed by the $s$ last solutions that have been computed during the inner
+iterations phase. In the remainder, the $i$-th column vector of $S$ will be
+denoted by $S_i$.
+
+At each $s$ iterations, another kind of minimization step is applied in order to
+compute a new solution $x$. For that, the previous residuals of $Ax=b$ are computed by
+the inner iterations with $(b-AS)$. The minimization of the residuals is obtained by
\begin{equation}
\underset{\alpha\in\mathbb{R}^{s}}{min}\|b-R\alpha\|_2
\label{eq:01}
\end{equation}
-such that $R=AS$ is a dense rectangular matrix in $\mathbb{R}^{n\times s}$,
-$s\ll n$, and $R^T$ denotes the transpose of matrix $R$. We use an iterative
-method to solve the least-squares problem~(\ref{eq:01}) such as CGLS
-~\cite{hestenes52} or LSQR~\cite{paige82} which are more appropriate than a
-direct method in the parallel context.
+with $R=AS$. The new solution $x$ is then computed with $x=S\alpha$.
+
+
+In practice, $R$ is a dense rectangular matrix belonging in $\mathbb{R}^{n\times s}$,
+with $s\ll n$. In order to minimize~\eqref{eq:01}, a least-squares method such as
+CGLS ~\cite{Hestenes52} or LSQR~\cite{Paige82} is used. Remark that these methods are more
+appropriate than a single direct method in a parallel context.
+
+
\begin{algorithm}[t]
-\caption{A Krylov two-stage algorithm}
+\caption{TSIRM}
\begin{algorithmic}[1]
\Input $A$ (sparse matrix), $b$ (right-hand side)
\Output $x$ (solution vector)\vspace{0.2cm}
- \State Set the initial guess $x^0$
- \For {$k=1,2,3,\ldots$ until convergence}
- \State Solve iteratively $Ax^k=b$
- \State $S_{k~mod~s}=x^k$
- \If {$k$ mod $s=0$ {\bf and} not convergence}
- \State Compute dense matrix $R=AS$
- \State Solve least-squares problem $\underset{\alpha\in\mathbb{R}^{s}}{min}\|b-R\alpha\|_2$
- \State Compute minimizer $x^k=S\alpha$
+ \State Set the initial guess $x_0$
+ \For {$k=1,2,3,\ldots$ until convergence (error$<\epsilon_{tsirm}$)} \label{algo:conv}
+ \State $[x_k,error]=Solve(A,b,x_{k-1},max\_iter_{kryl})$ \label{algo:solve}
+ \State $S_{k \mod s}=x_k$ \label{algo:store} \Comment{update column (k mod s) of S}
+ \If {$k \mod s=0$ {\bf and} error$>\epsilon_{kryl}$}
+ \State $R=AS$ \Comment{compute dense matrix} \label{algo:matrix_mul}
+ \State $\alpha=Least\_Squares(R,b,max\_iter_{ls})$ \label{algo:}
+ \State $x_k=S\alpha$ \Comment{compute new solution}
\EndIf
\EndFor
\end{algorithmic}
\label{algo:01}
\end{algorithm}
-Operation $S_{k~ mod~ s}=x^k$ consists in copying the residual $x_k$ into the
-column $k~ mod~ s$ of the matrix $S$. After the minimization, the matrix $S$ is
-reused with the new values of the residuals.
+Algorithm~\ref{algo:01} summarizes the principle of the proposed method. The
+outer iteration is inside the \emph{for} loop. Line~\ref{algo:solve}, the Krylov
+method is called for a maximum of $max\_iter_{kryl}$ iterations. In practice,
+we suggest to set this parameter equal to the restart number in the GMRES-like
+method. Moreover, a tolerance threshold must be specified for the solver. In
+practice, this threshold must be much smaller than the convergence threshold of
+the TSIRM algorithm (\emph{i.e.}, $\epsilon_{tsirm}$). We also consider that
+after the call of the $Solve$ function, we obtain the vector $x_k$ and the error
+which is defined by $||Ax^k-b||_2$.
+
+ Line~\ref{algo:store},
+$S_{k \mod s}=x^k$ consists in copying the solution $x_k$ into the column $k
+\mod s$ of $S$. After the minimization, the matrix $S$ is reused with the new
+values of the residuals. To solve the minimization problem, an iterative method
+is used. Two parameters are required for that: the maximum number of iterations
+and the threshold to stop the method.
+
+Let us summarize the most important parameters of TSIRM:
+\begin{itemize}
+\item $\epsilon_{tsirm}$: the threshold to stop the TSIRM method;
+\item $max\_iter_{kryl}$: the maximum number of iterations for the Krylov method;
+\item $s$: the number of outer iterations before applying the minimization step;
+\item $max\_iter_{ls}$: the maximum number of iterations for the iterative least-squares method;
+\item $\epsilon_{ls}$: the threshold used to stop the least-squares method.
+\end{itemize}
+
+
+The parallelization of TSIRM relies on the parallelization of all its
+parts. More precisely, except the least-squares step, all the other parts are
+obvious to achieve out in parallel. In order to develop a parallel version of
+our code, we have chosen to use PETSc~\cite{petsc-web-page}. For
+line~\ref{algo:matrix_mul} the matrix-matrix multiplication is implemented and
+efficient since the matrix $A$ is sparse and since the matrix $S$ contains few
+columns in practice. As explained previously, at least two methods seem to be
+interesting to solve the least-squares minimization, CGLS and LSQR.
+
+In the following we remind the CGLS algorithm. The LSQR method follows more or
+less the same principle but it takes more place, so we briefly explain the parallelization of CGLS which is similar to LSQR.
+
+\begin{algorithm}[t]
+\caption{CGLS}
+\begin{algorithmic}[1]
+ \Input $A$ (matrix), $b$ (right-hand side)
+ \Output $x$ (solution vector)\vspace{0.2cm}
+ \State Let $x_0$ be an initial approximation
+ \State $r_0=b-Ax_0$
+ \State $p_1=A^Tr_0$
+ \State $s_0=p_1$
+ \State $\gamma=||s_0||^2_2$
+ \For {$k=1,2,3,\ldots$ until convergence ($\gamma<\epsilon_{ls}$)} \label{algo2:conv}
+ \State $q_k=Ap_k$
+ \State $\alpha_k=\gamma/||q_k||^2_2$
+ \State $x_k=x_{k-1}+\alpha_kp_k$
+ \State $r_k=r_{k-1}-\alpha_kq_k$
+ \State $s_k=A^Tr_k$
+ \State $\gamma_{old}=\gamma$
+ \State $\gamma=||s_k||^2_2$
+ \State $\beta_k=\gamma/\gamma_{old}$
+ \State $p_{k+1}=s_k+\beta_kp_k$
+ \EndFor
+\end{algorithmic}
+\label{algo:02}
+\end{algorithm}
+
+
+In each iteration of CGLS, there is two matrix-vector multiplications and some
+classical operations: dot product, norm, multiplication and addition on vectors. All
+these operations are easy to implement in PETSc or similar environment.
+
+
%%%*********************************************************
%%%*********************************************************
+\section{Convergence results}
+\label{sec:04}
+
+We can now claim that,
+\begin{proposition}
+\label{prop:saad}
+If $A$ is either a definite positive or a positive matrix and GMRES($m$) is used as solver, then the TSIRM algorithm is convergent.
+
+Furthermore, let $r_k$ be the
+$k$-th residue of TSIRM, then
+we have the following boundaries:
+\begin{itemize}
+\item when $A$ is positive:
+\begin{equation}
+||r_k|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{km}{2}} ||r_0|| ,
+\end{equation}
+where $M$ is the symmetric part of $A$, $\alpha = \lambda_{min}(M)^2$ and $\beta = \lambda_{max}(A^T A)$;
+\item when $A$ is positive definite:
+\begin{equation}
+\|r_k\| \leq \left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{km/2} \|r_0\|.
+\end{equation}
+\end{itemize}
+%In the general case, where A is not positive definite, we have
+%$\|r_n\| \le \inf_{p \in P_n} \|p(A)\| \le \kappa_2(V) \inf_{p \in P_n} \max_{\lambda \in \sigma(A)} |p(\lambda)| \|r_0\|, .$
+\end{proposition}
+
+\begin{proof}
+Let us first recall that the residue is under control when considering the GMRES algorithm on a positive definite matrix, and it is bounded as follows:
+\begin{equation*}
+\|r_k\| \leq \left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{k/2} \|r_0\| .
+\end{equation*}
+Additionally, when $A$ is a positive real matrix with symmetric part $M$, then the residual norm provided at the $m$-th step of GMRES satisfies:
+\begin{equation*}
+||r_m|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{m}{2}} ||r_0|| ,
+\end{equation*}
+where $\alpha$ and $\beta$ are defined as in Proposition~\ref{prop:saad}, which proves
+the convergence of GMRES($m$) for all $m$ under such assumptions regarding $A$.
+These well-known results can be found, \emph{e.g.}, in~\cite{Saad86}.
+
+We will now prove by a mathematical induction that, for each $k \in \mathbb{N}^\ast$,
+$||r_k|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{mk}{2}} ||r_0||$ when $A$ is positive, and $\|r_k\| \leq \left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{km/2} \|r_0\|$ when $A$ is positive definite.
+
+The base case is obvious, as for $k=1$, the TSIRM algorithm simply consists in applying GMRES($m$) once, leading to a new residual $r_1$ that follows the inductive hypothesis due, to the results recalled above.
+
+Suppose now that the claim holds for all $m=1, 2, \hdots, k-1$, that is, $\forall m \in \{1,2,\hdots, k-1\}$, $||r_m|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{km}{2}} ||r_0||$ in the positive case, and $\|r_k\| \leq \left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{km/2} \|r_0\|$ in the definite positive one.
+We will show that the statement holds too for $r_k$. Two situations can occur:
+\begin{itemize}
+\item If $k \not\equiv 0 ~(\textrm{mod}\ m)$, then the TSIRM algorithm consists in executing GMRES once. In that case and by using the inductive hypothesis, we obtain either $||r_k|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{m}{2}} ||r_{k-1}||\leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{km}{2}} ||r_0||$ if $A$ is positive, or $\|r_k\| \leqslant \left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{m/2} \|r_{k-1}\|$ $\leqslant$ $\left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{km/2} \|r_{0}\|$ in the positive definite case.
+\item Else, the TSIRM algorithm consists in two stages: a first GMRES($m$) execution leads to a temporary $x_k$ whose residue satisfies:
+\begin{itemize}
+\item $||r_k|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{m}{2}} ||r_{k-1}||\leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{km}{2}} ||r_0||$ in the positive case,
+\item $\|r_k\| \leqslant \left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{m/2} \|r_{k-1}\|$ $\leqslant$ $\left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{km/2} \|r_{0}\|$ in the positive definite one,
+\end{itemize}
+and a least squares resolution.
+Let $\operatorname{span}(S) = \left \{ {\sum_{i=1}^k \lambda_i v_i \Big| k \in \mathbb{N}, v_i \in S, \lambda _i \in \mathbb{R}} \right \}$ be the linear span of a set of real vectors $S$. So,\\
+$\min_{\alpha \in \mathbb{R}^s} ||b-R\alpha ||_2 = \min_{\alpha \in \mathbb{R}^s} ||b-AS\alpha ||_2$
+
+$\begin{array}{ll}
+& = \min_{x \in span\left(S_{k-s+1}, S_{k-s+2}, \hdots, S_{k} \right)} ||b-AS\alpha ||_2\\
+& = \min_{x \in span\left(x_{k-s+1}, x_{k-s}+2, \hdots, x_{k} \right)} ||b-AS\alpha ||_2\\
+& \leqslant \min_{x \in span\left( x_{k} \right)} ||b-Ax ||_2\\
+& \leqslant \min_{\lambda \in \mathbb{R}} ||b-\lambda Ax_{k} ||_2\\
+& \leqslant ||b-Ax_{k}||_2\\
+& = ||r_k||_2\\
+& \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{km}{2}} ||r_0||, \textrm{ if $A$ is positive,}\\
+& \leqslant \left( 1-\frac{\lambda_{\mathrm{min}}^2(1/2(A^T + A))}{ \lambda_{\mathrm{max}}(A^T A)} \right)^{km/2} \|r_{0}\|, \textrm{ if $A$ is}\\
+& \textrm{positive definite,}
+\end{array}$
+\end{itemize}
+which concludes the induction and the proof.
+\end{proof}
+
+%We can remark that, at each iterate, the residue of the TSIRM algorithm is lower
+%than the one of the GMRES method.
%%%*********************************************************
%%%*********************************************************
-\section{Experiments using petsc}
-\label{sec:04}
+\section{Experiments using PETSc}
+\label{sec:05}
+
+
+In order to see the behavior of the proposal when considering only one processor, a first
+comparison with GMRES or FGMRES and the new algorithm detailed previously has been experimented.
+Matrices that have been used with their characteristics (names, fields, rows, and nonzero coefficients) are detailed in
+Table~\ref{tab:01}. These latter, which are real-world applications matrices,
+have been extracted
+ from the Davis collection, University of
+Florida~\cite{Dav97}.
+
+\begin{table}[htbp]
+\begin{center}
+\begin{tabular}{|c|c|r|r|r|}
+\hline
+Matrix name & Field &\# Rows & \# Nonzeros \\\hline \hline
+crashbasis & Optimization & 160,000 & 1,750,416 \\
+parabolic\_fem & Comput. fluid dynamics & 525,825 & 2,100,225 \\
+epb3 & Thermal problem & 84,617 & 463,625 \\
+atmosmodj & Comput. fluid dynamics & 1,270,432 & 8,814,880 \\
+bfwa398 & Electromagnetics pb & 398 & 3,678 \\
+torso3 & 2D/3D problem & 259,156 & 4,429,042 \\
+\hline
+
+\end{tabular}
+\caption{Main characteristics of the sparse matrices chosen from the Davis collection}
+\label{tab:01}
+\end{center}
+\end{table}
+Chosen parameters are detailed below.
+%The following parameters have been chosen for our experiments.
+As by default
+the restart of GMRES is performed every 30 iterations, we have chosen to stop
+the GMRES every 30 iterations (\emph{i.e.} $max\_iter_{kryl}=30$). $s$ is set to 8. CGLS is
+chosen to minimize the least-squares problem with the following parameters:
+$\epsilon_{ls}=1e-40$ and $max\_iter_{ls}=20$. The external precision is set to
+$\epsilon_{tsirm}=1e-10$. Those experiments have been performed on a Intel(R)
+Core(TM) i7-3630QM CPU @ 2.40GHz with the version 3.5.1 of PETSc.
+
+
+In Table~\ref{tab:02}, some experiments comparing the solving of the linear
+systems obtained with the previous matrices with a GMRES variant and with out 2
+stage algorithm are given. In the second column, it can be noticed that either
+GRMES or FGMRES (Flexible GMRES)~\cite{Saad:1993} is used to solve the linear
+system. According to the matrices, different preconditioner is used. With
+TSIRM, the same solver and the same preconditionner are used. This Table shows
+that TSIRM can drastically reduce the number of iterations to reach the
+convergence when the number of iterations for the normal GMRES is more or less
+greater than 500. In fact this also depends on tow parameters: the number of
+iterations to stop GMRES and the number of iterations to perform the
+minimization.
+
+
+\begin{table}[htbp]
+\begin{center}
+\begin{tabular}{|c|c|r|r|r|r|}
+\hline
+
+ \multirow{2}{*}{Matrix name} & Solver / & \multicolumn{2}{c|}{GMRES} & \multicolumn{2}{c|}{TSIRM CGLS} \\
+\cline{3-6}
+ & precond & Time & \# Iter. & Time & \# Iter. \\\hline \hline
+
+crashbasis & gmres / none & 15.65 & 518 & 14.12 & 450 \\
+parabolic\_fem & gmres / ilu & 1009.94 & 7573 & 401.52 & 2970 \\
+epb3 & fgmres / sor & 8.67 & 600 & 8.21 & 540 \\
+atmosmodj & fgmres / sor & 104.23 & 451 & 88.97 & 366 \\
+bfwa398 & gmres / none & 1.42 & 9612 & 0.28 & 1650 \\
+torso3 & fgmres / sor & 37.70 & 565 & 34.97 & 510 \\
+\hline
+
+\end{tabular}
+\caption{Comparison of (F)GMRES and TSIRM with (F)GMRES in sequential with some matrices, time is expressed in seconds.}
+\label{tab:02}
+\end{center}
+\end{table}
+
+
+
+
+
+In order to perform larger experiments, we have tested some example applications
+of PETSc. Those applications are available in the ksp part which is suited for
+scalable linear equations solvers:
+\begin{itemize}
+\item ex15 is an example which solves in parallel an operator using a finite
+ difference scheme. The diagonal is equal to 4 and 4 extra-diagonals
+ representing the neighbors in each directions are equal to -1. This example is
+ used in many physical phenomena, for example, heat and fluid flow, wave
+ propagation, etc.
+\item ex54 is another example based on 2D problem discretized with quadrilateral
+ finite elements. For this example, the user can define the scaling of material
+ coefficient in embedded circle called $\alpha$.
+\end{itemize}
+For more technical details on these applications, interested readers are invited
+to read the codes available in the PETSc sources. Those problems have been
+chosen because they are scalable with many cores which is not the case of other
+problems that we have tested.
+
+In the following larger experiments are described on two large scale
+architectures: Curie and Juqeen. Both these architectures are supercomputer
+composed of 80,640 cores for Curie and 458,752 cores for Juqueen. Those machines
+are respectively hosted by GENCI in France and Jülich Supercomputing Centre in
+Germany. They belongs with other similar architectures of the PRACE initiative (
+Partnership for Advanced Computing in Europe) which aims at proposing high
+performance supercomputing architecture to enhance research in Europe. The Curie
+architecture is composed of Intel E5-2680 processors at 2.7 GHz with 2Gb memory
+by core. The Juqueen architecture is composed of IBM PowerPC A2 at 1.6 GHz with
+1Gb memory per core. Both those architecture are equiped with a dedicated high
+speed network.
+
+
+In many situations, using preconditioners is essential in order to find the
+solution of a linear system. There are many preconditioners available in PETSc.
+For parallel applications all the preconditioners based on matrix factorization
+are not available. In our experiments, we have tested different kinds of
+preconditioners, however as it is not the subject of this paper, we will not
+present results with many preconditioners. In practise, we have chosen to use a
+multigrid (mg) and successive over-relaxation (sor). For more details on the
+preconditioner in PETSc please consult~\cite{petsc-web-page}.
+
+
+
+\begin{table*}[htbp]
+\begin{center}
+\begin{tabular}{|r|r|r|r|r|r|r|r|r|}
+\hline
+
+ nb. cores & precond & \multicolumn{2}{c|}{FGMRES} & \multicolumn{2}{c|}{TSIRM CGLS} & \multicolumn{2}{c|}{TSIRM LSQR} & best gain \\
+\cline{3-8}
+ & & Time & \# Iter. & Time & \# Iter. & Time & \# Iter. & \\\hline \hline
+ 2,048 & mg & 403.49 & 18,210 & 73.89 & 3,060 & 77.84 & 3,270 & 5.46 \\
+ 2,048 & sor & 745.37 & 57,060 & 87.31 & 6,150 & 104.21 & 7,230 & 8.53 \\
+ 4,096 & mg & 562.25 & 25,170 & 97.23 & 3,990 & 89.71 & 3,630 & 6.27 \\
+ 4,096 & sor & 912.12 & 70,194 & 145.57 & 9,750 & 168.97 & 10,980 & 6.26 \\
+ 8,192 & mg & 917.02 & 40,290 & 148.81 & 5,730 & 143.03 & 5,280 & 6.41 \\
+ 8,192 & sor & 1,404.53 & 106,530 & 212.55 & 12,990 & 180.97 & 10,470 & 7.76 \\
+ 16,384 & mg & 1,430.56 & 63,930 & 237.17 & 8,310 & 244.26 & 7,950 & 6.03 \\
+ 16,384 & sor & 2,852.14 & 216,240 & 418.46 & 21,690 & 505.26 & 23,970 & 6.82 \\
+\hline
+
+\end{tabular}
+\caption{Comparison of FGMRES and TSIRM with FGMRES for example ex15 of PETSc with two preconditioners (mg and sor) with 25,000 components per core on Juqueen (threshold 1e-3, restart=30, s=12), time is expressed in seconds.}
+\label{tab:03}
+\end{center}
+\end{table*}
+
+Table~\ref{tab:03} shows the execution times and the number of iterations of
+example ex15 of PETSc on the Juqueen architecture. Different numbers of cores
+are studied ranging from 2,048 up-to 16,383 with the two preconditioners {\it mg} and {\it sor}. For those experiments, the number of components (or unknowns of the
+problems) per core is fixed to 25,000, also called weak scaling. This
+number can seem relatively small. In fact, for some applications that need a lot
+of memory, the number of components per processor requires sometimes to be
+small.
+
+
+
+In Table~\ref{tab:03}, we can notice that TSIRM is always faster than FGMRES. The last
+column shows the ratio between FGMRES and the best version of TSIRM according to
+the minimization procedure: CGLS or LSQR. Even if we have computed the worst
+case between CGLS and LSQR, it is clear that TSIRM is always faster than
+FGMRES. For this example, the multigrid preconditioner is faster than SOR. The
+gain between TSIRM and FGMRES is more or less similar for the two
+preconditioners. Looking at the number of iterations to reach the convergence,
+it is obvious that TSIRM allows the reduction of the number of iterations. It
+should be noticed that for TSIRM, in those experiments, only the iterations of
+the Krylov solver are taken into account. Iterations of CGLS or LSQR were not
+recorded but they are time-consuming. In general each $max\_iter_{kryl}*s$ which
+corresponds to 30*12, there are $max\_iter_{ls}$ which corresponds to 15.
+
+\begin{figure}[htbp]
+\centering
+ \includegraphics[width=0.45\textwidth]{nb_iter_sec_ex15_juqueen}
+\caption{Number of iterations per second with ex15 and the same parameters than in Table~\ref{tab:03} (weak scaling)}
+\label{fig:01}
+\end{figure}
+
+
+In Figure~\ref{fig:01}, the number of iterations per second corresponding to
+Table~\ref{tab:03} is displayed. It can be noticed that the number of
+iterations per second of FMGRES is constant whereas it decreases with TSIRM with
+both preconditioners. This can be explained by the fact that when the number of
+cores increases the time for the least-squares minimization step also increases but, generally,
+when the number of cores increases, the number of iterations to reach the
+threshold also increases, and, in that case, TSIRM is more efficient to reduce
+the number of iterations. So, the overall benefit of using TSIRM is interesting.
+
+
+
+
+
+
+\begin{table*}[htbp]
+\begin{center}
+\begin{tabular}{|r|r|r|r|r|r|r|r|r|}
+\hline
+
+ nb. cores & threshold & \multicolumn{2}{c|}{FGMRES} & \multicolumn{2}{c|}{TSIRM CGLS} & \multicolumn{2}{c|}{TSIRM LSQR} & best gain \\
+\cline{3-8}
+ & & Time & \# Iter. & Time & \# Iter. & Time & \# Iter. & \\\hline \hline
+ 2,048 & 8e-5 & 108.88 & 16,560 & 23.06 & 3,630 & 22.79 & 3,630 & 4.77 \\
+ 2,048 & 6e-5 & 194.01 & 30,270 & 35.50 & 5,430 & 27.74 & 4,350 & 6.99 \\
+ 4,096 & 7e-5 & 160.59 & 22,530 & 35.15 & 5,130 & 29.21 & 4,350 & 5.49 \\
+ 4,096 & 6e-5 & 249.27 & 35,520 & 52.13 & 7,950 & 39.24 & 5,790 & 6.35 \\
+ 8,192 & 6e-5 & 149.54 & 17,280 & 28.68 & 3,810 & 29.05 & 3,990 & 5.21 \\
+ 8,192 & 5e-5 & 785.04 & 109,590 & 76.07 & 10,470 & 69.42 & 9,030 & 11.30 \\
+ 16,384 & 4e-5 & 718.61 & 86,400 & 98.98 & 10,830 & 131.86 & 14,790 & 7.26 \\
+\hline
+
+\end{tabular}
+\caption{Comparison of FGMRES and TSIRM with FGMRES algorithms for ex54 of Petsc (both with the MG preconditioner) with 25,000 components per core on Curie (restart=30, s=12), time is expressed in seconds.}
+\label{tab:04}
+\end{center}
+\end{table*}
+
+
+In Table~\ref{tab:04}, some experiments with example ex54 on the Curie
+architecture are reported. For this application, we fixed $\alpha=0.6$. As it
+can be seen in that Table, the size of the problem has a strong influence on the
+number of iterations to reach the convergence. That is why we have preferred to
+change the threshold. If we set it to $1e-3$ as with the previous application,
+only one iteration is necessray to reach the convergence. So Table~\ref{tab:04}
+shows the results of differents executions with differents number of cores and
+differents thresholds. As with the previous example, we can observe that TSIRM
+is faster than FGMRES. The ratio greatly depends on the number of iterations for
+FMGRES to reach the threshold. The greater the number of iterations to reach the
+convergence is, the better the ratio between our algorithm and FMGRES is. This
+experiment is also a weak scaling with approximately $25,000$ components per
+core. It can also be observed that the difference between CGLS and LSQR is not
+significant. Both can be good but it seems not possible to know in advance which
+one will be the best.
+
+
+\begin{table*}[htbp]
+\begin{center}
+\begin{tabular}{|r|r|r|r|r|r|r|r|r|r|r|}
+\hline
+
+ nb. cores & \multicolumn{2}{c|}{FGMRES} & \multicolumn{2}{c|}{TSIRM CGLS} & \multicolumn{2}{c|}{TSIRM LSQR} & best gain & \multicolumn{3}{c|}{efficiency} \\
+\cline{2-7} \cline{9-11}
+ & Time & \# Iter. & Time & \# Iter. & Time & \# Iter. & & FGMRES & TS CGLS & TS LSQR\\\hline \hline
+ 512 & 3,969.69 & 33,120 & 709.57 & 5,790 & 622.76 & 5,070 & 6.37 & 1 & 1 & 1 \\
+ 1024 & 1,530.06 & 25,860 & 290.95 & 4,830 & 307.71 & 5,070 & 5.25 & 1.30 & 1.21 & 1.01 \\
+ 2048 & 919.62 & 31,470 & 237.52 & 8,040 & 194.22 & 6,510 & 4.73 & 1.08 & .75 & .80\\
+ 4096 & 405.60 & 28,380 & 111.67 & 7,590 & 91.72 & 6,510 & 4.42 & 1.22 & .79 & .84 \\
+ 8192 & 785.04 & 109,590 & 76.07 & 10,470 & 69.42 & 9,030 & 11.30 & .32 & .58 & .56 \\
+
+\hline
+
+\end{tabular}
+\caption{Comparison of FGMRES and TSIRM with FGMRES for ex54 of Petsc (both with the MG preconditioner) with 204,919,225 components on Curie with different number of cores (restart=30, s=12, threshold 5e-5), time is expressed in seconds.}
+\label{tab:05}
+\end{center}
+\end{table*}
+
+\begin{figure}[htbp]
+\centering
+ \includegraphics[width=0.45\textwidth]{nb_iter_sec_ex54_curie}
+\caption{Number of iterations per second with ex54 and the same parameters than in Table~\ref{tab:05} (strong scaling)}
+\label{fig:02}
+\end{figure}
%%%*********************************************************
%%%*********************************************************
%%%*********************************************************
%%%*********************************************************
\section{Conclusion}
-\label{sec:05}
+\label{sec:06}
%The conclusion goes here. this is more of the conclusion
%%%*********************************************************
%%%*********************************************************
+A novel two-stage iterative algorithm has been proposed in this article,
+in order to accelerate the convergence Krylov iterative methods.
+Our TSIRM proposal acts as a merger between Krylov based solvers and
+a least-squares minimization step.
+The convergence of the method has been proven in some situations, while
+experiments up to 16,394 cores have been led to verify that TSIRM runs
+5 or 7 times faster than GMRES.
+
+
+For future work, the authors' intention is to investigate
+other kinds of matrices, problems, and inner solvers. The
+influence of all parameters must be tested too, while
+other methods to minimize the residuals must be regarded.
+The number of outer iterations to minimize should become
+adaptative to improve the overall performances of the proposal.
+Finally, this solver will be implemented inside PETSc.
% conference papers do not normally have an appendix
%%%*********************************************************
%%%*********************************************************
\section*{Acknowledgment}
-%The authors would like to thank...
-%more thanks here
-%%%*********************************************************
-%%%*********************************************************
+This paper is partially funded by the Labex ACTION program (contract
+ANR-11-LABX-01-01). We acknowledge PRACE for awarding us access to resources
+Curie and Juqueen respectively based in France and Germany.
+
% trigger a \newpage just before the given reference
% http://www.ctan.org/tex-archive/biblio/bibtex/contrib/doc/
% The IEEEtran BibTeX style support page is at:
% http://www.michaelshell.org/tex/ieeetran/bibtex/
-%\bibliographystyle{IEEEtran}
+\bibliographystyle{IEEEtran}
% argument is your BibTeX string definitions and bibliography database(s)
-%\bibliography{IEEEabrv,../bib/paper}
+\bibliography{biblio}
%
% <OR> manually copy in the resultant .bbl file
% set second argument of \begin to the number of references
% (used to reserve space for the reference number labels box)
-\begin{thebibliography}{1}
+%% \begin{thebibliography}{1}
-\bibitem{saad86} Y.~Saad and M.~H.~Schultz, \emph{GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems}, SIAM Journal on Scientific and Statistical Computing, 7(3):856--869, 1986.
+%% \bibitem{saad86} Y.~Saad and M.~H.~Schultz, \emph{GMRES: A Generalized Minimal Residual Algorithm for Solving Nonsymmetric Linear Systems}, SIAM Journal on Scientific and Statistical Computing, 7(3):856--869, 1986.
-\bibitem{saad96} Y.~Saad, \emph{Iterative Methods for Sparse Linear Systems}, PWS Publishing, New York, 1996.
+%% \bibitem{saad96} Y.~Saad, \emph{Iterative Methods for Sparse Linear Systems}, PWS Publishing, New York, 1996.
-\bibitem{hestenes52} M.~R.~Hestenes and E.~Stiefel, \emph{Methods of conjugate gradients for solving linear system}, Journal of Research of National Bureau of Standards, B49:409--436, 1952.
+%% \bibitem{hestenes52} M.~R.~Hestenes and E.~Stiefel, \emph{Methods of conjugate gradients for solving linear system}, Journal of Research of National Bureau of Standards, B49:409--436, 1952.
-\bibitem{paige82} C.~C.~Paige and A.~M.~Saunders, \emph{LSQR: An Algorithm for Sparse Linear Equations and Sparse Least Squares}, ACM Trans. Math. Softw. 8(1):43--71, 1982.
-\end{thebibliography}
+%% \bibitem{paige82} C.~C.~Paige and A.~M.~Saunders, \emph{LSQR: An Algorithm for Sparse Linear Equations and Sparse Least Squares}, ACM Trans. Math. Softw. 8(1):43--71, 1982.
+%% \end{thebibliography}
% that's all folks
\end{document}
-
-