Relecture

[GMRES2stage.git] / paper.tex
diff --git a/paper.tex b/paper.tex

index ceffa3d7903def8c6c51d0395e382a9487e67e2a..896ac71233815d6aa5c7b3f00399c26d575d4248 100644 (file)
--- a/paper.tex
+++ b/paper.tex
@@ -380,8 +380,8 @@
  % use a multiple column layout for up to two different
  % affiliations
  
-\author{\IEEEauthorblockN{Rapha\"el Couturier\IEEEauthorrefmark{1}, Lilia Ziane Khodja \IEEEauthorrefmark{2}, and Christophe Guyeux\IEEEauthorrefmark{1}}
-\IEEEauthorblockA{\IEEEauthorrefmark{1} Femto-ST Institute, University of Franche Comte, France\\
+\author{\IEEEauthorblockN{Rapha\"el Couturier\IEEEauthorrefmark{1}, Lilia Ziane Khodja\IEEEauthorrefmark{2}, and Christophe Guyeux\IEEEauthorrefmark{1}}
+\IEEEauthorblockA{\IEEEauthorrefmark{1} Femto-ST Institute, University of Franche-Comt\'e, France\\
  Email: \{raphael.couturier,christophe.guyeux\}@univ-fcomte.fr}
  \IEEEauthorblockA{\IEEEauthorrefmark{2} INRIA Bordeaux Sud-Ouest, France\\
  Email: lilia.ziane@inria.fr}
@@ -564,7 +564,7 @@ gradient and GMRES ones (Generalized Minimal RESidual).
  
  However,  iterative  methods suffer  from scalability  problems  on parallel
  computing  platforms  with many  processors, due  to  their need  of  reduction
-operations, and to  collective    communications   to  achive   matrix-vector
+operations, and to  collective    communications   to  achieve   matrix-vector
  multiplications. The  communications on large  clusters with thousands  of cores
  and  large  sizes of  messages  can  significantly  affect the  performances  of these
  iterative methods. As a consequence, Krylov subspace iteration methods are often used
@@ -621,19 +621,21 @@ outer solver periodically applies a least-squares minimization  on the residuals
  At each outer iteration, the sparse linear system $Ax=b$ is partially 
  solved using only $m$
  iterations of an iterative method, this latter being initialized with the 
-best known approximation previously obtained. 
-GMRES method~\cite{Saad86}, or any of its variants, can be used for instance as an
-inner solver. The current approximation of the Krylov method is then stored inside a matrix
-$S$ composed by the successive solutions that are computed during inner iterations.
-
-At each $s$ iterations, the minimization step is applied in order to
+last obtained approximation. 
+GMRES method~\cite{Saad86}, or any of its variants, can potentially be used as
+inner solver. The current approximation of the Krylov method is then stored inside a $n \times s$ matrix
+$S$, which is composed by the $s$ last solutions that have been computed during 
+the inner iterations phase.
+In the remainder, the $i$-th column vector of $S$ will be denoted by $S_i$. 
+
+At each $s$ iterations, another kind of minimization step is applied in order to
  compute a new  solution $x$. For that, the previous  residuals of $Ax=b$ are computed by
  the inner iterations with $(b-AS)$. The minimization of the residuals is obtained by  
  \begin{equation}
     \underset{\alpha\in\mathbb{R}^{s}}{min}\|b-R\alpha\|_2
  \label{eq:01}
  \end{equation}
-with $R=AS$. Then the new solution $x$ is computed with $x=S\alpha$.
+with $R=AS$. The new solution $x$ is then computed with $x=S\alpha$.
  
  
  In  practice, $R$  is a  dense rectangular  matrix belonging in  $\mathbb{R}^{n\times s}$,
@@ -663,14 +665,15 @@ appropriate than a single direct method in a parallel context.
  \label{algo:01}
  \end{algorithm}
  
-Algorithm~\ref{algo:01}  summarizes  the principle  of  our  method.  The  outer
-iteration is  inside the for  loop. Line~\ref{algo:solve}, the Krylov  method is
+Algorithm~\ref{algo:01}  summarizes  the principle  of  the proposed  method.  The  outer
+iteration is  inside the \emph{for}  loop. Line~\ref{algo:solve}, the Krylov  method is
  called for a  maximum of $max\_iter_{kryl}$ iterations.  In practice, we  suggest to set this parameter
-equals to  the restart  number of the  GMRES-like method. Moreover,  a tolerance
+equal to  the restart  number in the  GMRES-like method. Moreover,  a tolerance
  threshold must be specified for the  solver. In practice, this threshold must be
-much  smaller  than the  convergence  threshold  of  the TSIRM  algorithm  (\emph{i.e.}
-$\epsilon_{tsirm}$).  Line~\ref{algo:store}, $S_{k~ mod~ s}=x^k$ consists in copying the
-solution  $x_k$  into the  column  $k~ mod~ s$ of  the  matrix  $S$. After  the
+much  smaller  than the  convergence  threshold  of  the TSIRM  algorithm  (\emph{i.e.},
+$\epsilon_{tsirm}$).  Line~\ref{algo:store}, $S_{k \mod s}=x^k$ consists in copying the
+solution  $x_k$  into the  column  $k \mod s$ of $S$.
+After  the
  minimization, the matrix $S$ is reused with the new values of the residuals.  To
  solve the minimization problem, an  iterative method is used. Two parameters are
  required for that: the maximum number of iterations and the threshold to stop the
@@ -686,13 +689,13 @@ Let us summarize the most important parameters of TSIRM:
  \end{itemize}
  
  
-The  parallelisation  of  TSIRM  relies   on  the  parallelization  of  all  its
+The  parallelization  of  TSIRM  relies   on  the  parallelization  of  all  its
  parts. More  precisely, except  the least-squares step,  all the other  parts are
  obvious to  achieve out in parallel. In  order to develop a  parallel version of
  our   code,   we   have   chosen  to   use   PETSc~\cite{petsc-web-page}.    For
  line~\ref{algo:matrix_mul} the  matrix-matrix multiplication is  implemented and
  efficient since the  matrix $A$ is sparse and since the  matrix $S$ contains few
-colums in  practice. As explained  previously, at least  two methods seem  to be
+columns in  practice. As explained  previously, at least  two methods seem  to be
  interesting to solve the least-squares minimization, CGLS and LSQR.
  
  In the following  we remind the CGLS algorithm. The LSQR  method follows more or
@@ -735,37 +738,59 @@ these operations are easy to implement in PETSc or similar environment.
  
  \section{Convergence results}
  \label{sec:04}
-Let us recall the following result, see~\cite{Saad86}.
+Let us recall the following result, see~\cite{Saad86} for further readings.
  \begin{proposition}
+\label{prop:saad}
  Suppose that $A$ is a positive real matrix with symmetric part $M$. Then the residual norm provided at the $m$-th step of GMRES satisfies:
  \begin{equation}
  ||r_m|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{m}{2}} ||r_0|| ,
  \end{equation}
-where $\alpha = \lambda_min(M)^2$ and $\beta = \lambda_max(A^T A)$, which proves 
+where $\alpha = \lambda_{min}(M)^2$ and $\beta = \lambda_{max}(A^T A)$, which proves 
  the convergence of GMRES($m$) for all $m$ under that assumption regarding $A$.
  \end{proposition}
  
  
  We can now claim that,
  \begin{proposition}
-If $A$ is a positive real matrix and GMRES($m$) is used as solver, then the TSIRM algorithm is convergent.
+If $A$ is a positive real matrix and GMRES($m$) is used as solver, then the TSIRM algorithm is convergent. Furthermore, 
+let $r_k$ be the
+$k$-th residue of TSIRM, then
+we still have:
+\begin{equation}
+||r_k|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{km}{2}} ||r_0|| ,
+\end{equation}
+where $\alpha$ and $\beta$ are defined as in Proposition~\ref{prop:saad}.
  \end{proposition}
  
  \begin{proof}
-Let $r_k = b-Ax_k$, where $x_k$ is the approximation of the solution after the
-$k$-th iterate of TSIRM.
-We will prove that $r_k \rightarrow 0$ when $k \rightarrow +\infty$.
+We will prove by a mathematical induction that, for each $k \in \mathbb{N}^\ast$, 
+$||r_k|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{mk}{2}} ||r_0||.$
+
+The base case is obvious, as for $k=1$, the TSIRM algorithm simply consists in applying GMRES($m$) once, leading to a new residual $r_1$ which follows the inductive hypothesis due to Proposition~\ref{prop:saad}.
  
-Each step of the TSIRM algorithm \\
+Suppose now that the claim holds for all $m=1, 2, \hdots, k-1$, that is, $\forall m \in \{1,2,\hdots, k-1\}$, $||r_m|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{km}{2}} ||r_0||$.
+We will show that the statement holds too for $r_k$. Two situations can occur:
+\begin{itemize}
+\item If $k \mod m \neq 0$, then the TSIRM algorithm consists in executing GMRES once. In that case, we obtain $||r_k|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{m}{2}} ||r_{k-1}||\leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{km}{2}} ||r_0||$ by the inductive hypothesis.
+\item Else, the TSIRM algorithm consists in two stages: a first GMRES($m$) execution leads to a temporary $x_k$ whose residue satisfies $||r_k|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{m}{2}} ||r_{k-1}||\leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{km}{2}} ||r_0||$, and a least squares resolution.
+Let $\operatorname{span}(S) = \left \{ {\sum_{i=1}^k \lambda_i v_i \Big| k \in \mathbb{N}, v_i \in S, \lambda _i \in \mathbb{R}} \right \}$ be the linear span of a set of real vectors $S$. So,\\
  $\min_{\alpha \in \mathbb{R}^s} ||b-R\alpha ||_2 = \min_{\alpha \in \mathbb{R}^s} ||b-AS\alpha ||_2$
  
  $\begin{array}{ll}
-& = \min_{x \in Vect\left(x_0, x_1, \hdots, x_{k-1} \right)} ||b-AS\alpha ||_2\\
-& \leqslant \min_{x \in Vect\left( S_{k-1} \right)} ||b-Ax ||_2\\
-& \leqslant ||b-Ax_{k-1}||
+& = \min_{x \in span\left(S_{k-s+1}, S_{k-s+2}, \hdots, S_{k} \right)} ||b-AS\alpha ||_2\\
+& = \min_{x \in span\left(x_{k-s+1}, x_{k-s}+2, \hdots, x_{k} \right)} ||b-AS\alpha ||_2\\
+& \leqslant \min_{x \in span\left( x_{k} \right)} ||b-Ax ||_2\\
+& \leqslant \min_{\lambda \in \mathbb{R}} ||b-\lambda Ax_{k} ||_2\\
+& \leqslant ||b-Ax_{k}||_2\\
+& = ||r_k||_2\\
+& \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{km}{2}} ||r_0||,
  \end{array}$
+\end{itemize}
+which concludes the induction and the proof.
  \end{proof}
  
+We can remark that, at each iterate, the residue of the TSIRM algorithm is lower 
+than the one of the GMRES method.
  
  %%%*********************************************************
  %%%*********************************************************
@@ -837,7 +862,7 @@ torso3             & fgmres / sor  & 37.70 & 565 & 34.97 & 510 \\
  \hline
  
  \end{tabular}
-\caption{Comparison of (F)GMRES and 2 stage (F)GMRES algorithms in sequential with some matrices, time is expressed in seconds.}
+\caption{Comparison of (F)GMRES and TSIRM with (F)GMRES in sequential with some matrices, time is expressed in seconds.}
  \label{tab:02}
  \end{center}
  \end{table}
@@ -896,7 +921,7 @@ Table~\ref{tab:03} shows  the execution  times and the  number of  iterations of
  example ex15  of PETSc on the  Juqueen architecture. Different  numbers of cores
  are  studied ranging  from  2,048  up-to 16,383.   Two  preconditioners have  been
  tested: {\it mg} and {\it sor}.   For those experiments,  the number  of components  (or unknowns  of the
-problems)  per processor  is fixed  to 25,000,  also called  weak  scaling. This
+problems)  per core  is fixed  to 25,000,  also called  weak  scaling. This
  number can seem relatively small. In fact, for some applications that need a lot
  of  memory, the  number of  components per  processor requires  sometimes  to be
  small.
@@ -943,7 +968,7 @@ the number of iterations. So, the overall benefit of using TSIRM is interesting.
  \begin{tabular}{|r|r|r|r|r|r|r|r|r|} 
  \hline
  
-  nb. cores & threshold   & \multicolumn{2}{c|}{GMRES} & \multicolumn{2}{c|}{TSIRM CGLS} &  \multicolumn{2}{c|}{TSIRM LSQR} & best gain \\ 
+  nb. cores & threshold   & \multicolumn{2}{c|}{FGMRES} & \multicolumn{2}{c|}{TSIRM CGLS} &  \multicolumn{2}{c|}{TSIRM LSQR} & best gain \\ 
  \cline{3-8}
               &                       & Time  & \# Iter.  & Time  & \# Iter. & Time  & \# Iter. & \\\hline \hline
    2,048      & 8e-5                  & 108.88 & 16,560  & 23.06  &  3,630  & 22.79  & 3,630   & 4.77 \\
@@ -956,7 +981,7 @@ the number of iterations. So, the overall benefit of using TSIRM is interesting.
  \hline
  
  \end{tabular}
-\caption{Comparison of FGMRES  and 2 stage FGMRES algorithms for ex54 of Petsc (both with the MG preconditioner) with 25000 components per core on Curie (restart=30, s=12),  time is expressed in seconds.}
+\caption{Comparison of FGMRES  and TSIRM with FGMRES algorithms for ex54 of Petsc (both with the MG preconditioner) with 25,000 components per core on Curie (restart=30, s=12),  time is expressed in seconds.}
  \label{tab:04}
  \end{center}
  \end{table*}
@@ -970,9 +995,9 @@ In Table~\ref{tab:04}, some experiments with example ex54 on the Curie architect
  \begin{tabular}{|r|r|r|r|r|r|r|r|r|r|r|} 
  \hline
  
-  nb. cores   & \multicolumn{2}{c|}{GMRES} & \multicolumn{2}{c|}{TSIRM CGLS} &  \multicolumn{2}{c|}{TSIRM LSQR} & best gain & \multicolumn{3}{c|}{efficiency} \\ 
+  nb. cores   & \multicolumn{2}{c|}{FGMRES} & \multicolumn{2}{c|}{TSIRM CGLS} &  \multicolumn{2}{c|}{TSIRM LSQR} & best gain & \multicolumn{3}{c|}{efficiency} \\ 
  \cline{2-7} \cline{9-11}
-                    & Time  & \# Iter.  & Time  & \# Iter. & Time  & \# Iter. &   & GMRES & TS CGLS & TS LSQR\\\hline \hline
+                    & Time  & \# Iter.  & Time  & \# Iter. & Time  & \# Iter. &   & FGMRES & TS CGLS & TS LSQR\\\hline \hline
     512              & 3,969.69 & 33,120 & 709.57 & 5,790  & 622.76 & 5,070  & 6.37  &   1    &    1    &     1     \\
     1024             & 1,530.06  & 25,860 & 290.95 & 4,830  & 307.71 & 5,070 & 5.25  &  1.30  &    1.21  &   1.01     \\
     2048             & 919.62    & 31,470 & 237.52 & 8,040  & 194.22 & 6,510 & 4.73  & 1.08   &    .75   &   .80\\
@@ -982,7 +1007,7 @@ In Table~\ref{tab:04}, some experiments with example ex54 on the Curie architect
  \hline
  
  \end{tabular}
-\caption{Comparison of FGMRES  and 2 stage FGMRES algorithms for ex54 of Petsc (both with the MG preconditioner) with 204,919,225 components on Curie with different number of cores (restart=30, s=12, threshol 5e-5),  time is expressed in seconds.}
+\caption{Comparison of FGMRES  and TSIRM with FGMRES for ex54 of Petsc (both with the MG preconditioner) with 204,919,225 components on Curie with different number of cores (restart=30, s=12, threshold 5e-5),  time is expressed in seconds.}
  \label{tab:05}
  \end{center}
  \end{table*}
@@ -1007,13 +1032,22 @@ In Table~\ref{tab:04}, some experiments with example ex54 on the Curie architect
  %%%*********************************************************
  %%%*********************************************************
  
+A novel two-stage iterative  algorithm has been proposed in this article,
+in order to accelerate the convergence Krylov iterative  methods.
+Our TSIRM proposal acts as a merger between Krylov based solvers and
+a least-squares minimization step.
+The convergence of the method has been proven in some situations, while 
+experiments up to 16,394 cores have been led to verify that TSIRM runs
+5 or  7 times  faster than GMRES.
  
-future plan : \\
-- study other kinds of matrices, problems, inner solvers\\
-- test the influence of all the parameters\\
-- adaptative number of outer iterations to minimize\\
-- other methods to minimize the residuals?\\
-- implement our solver inside PETSc
+
+For future work, the authors' intention is to investigate 
+other kinds of matrices, problems, and inner solvers. The 
+influence of all parameters must be tested too, while 
+other methods to minimize the residuals must be regarded.
+The number of outer iterations to minimize should become 
+adaptative to improve the overall performances of the proposal.
+Finally, this solver will be implemented inside PETSc.
  
  
  % conference papers do not normally have an appendix
@@ -1025,7 +1059,7 @@ future plan : \\
  %%%*********************************************************
  \section*{Acknowledgment}
  This  paper  is   partially  funded  by  the  Labex   ACTION  program  (contract
-ANR-11-LABX-01-01).   We acknowledge PRACE  for awarding  us access  to resource
+ANR-11-LABX-01-01).   We acknowledge PRACE  for awarding  us access  to resources
  Curie and Juqueen respectively based in France and Germany.
  
  
@@ -1068,4 +1102,3 @@ Curie and Juqueen respectively based in France and Germany.
  
  % that's all folks
  \end{document}
-