new

[GMRES2stage.git] / paper.tex
diff --git a/paper.tex b/paper.tex

index fc64064454a2836fe4a5c7d7b90eb851ed760484..b1c6f598dc2f166a99d06aad3f3edc09f45fc6ff 100644 (file)
--- a/paper.tex
+++ b/paper.tex
@@ -439,7 +439,7 @@ can be around 7 times faster.
  \end{abstract}
  
  \begin{IEEEkeywords}
  \end{abstract}
  
  \begin{IEEEkeywords}
-Iterative Krylov methods; sparse linear systems; error minimization; PETSc; %à voir... 
+Iterative Krylov methods; sparse linear systems; residual minimization; PETSc; %à voir... 
  \end{IEEEkeywords}
  
  
  \end{IEEEkeywords}
  
  
@@ -583,10 +583,10 @@ performances.
  The present paper is organized  as follows. In Section~\ref{sec:02} some related
  works are presented. Section~\ref{sec:03} presents our two-stage algorithm using
  a  least-square  residual  minimization.   Section~\ref{sec:04}  describes  some
  The present paper is organized  as follows. In Section~\ref{sec:02} some related
  works are presented. Section~\ref{sec:03} presents our two-stage algorithm using
  a  least-square  residual  minimization.   Section~\ref{sec:04}  describes  some
-convergence   results   on  this   method.    Section~\ref{sec:05}  shows   some
-experimental results obtained on large  clusters of our algorithm using routines
-of  PETSc  toolkit.  Finally   Section~\ref{sec:06}  concludes  and  gives  some
-perspectives.
+convergence  results  on this  method.   In Section~\ref{sec:05},  parallization
+details  of  TSARM  are  given.  Section~\ref{sec:06}  shows  some  experimental
+results  obtained on large  clusters of  our algorithm  using routines  of PETSc
+toolkit.  Finally Section~\ref{sec:06} concludes and gives some perspectives.
  %%%*********************************************************
  %%%*********************************************************
  
  %%%*********************************************************
  %%%*********************************************************
  
@@ -604,7 +604,7 @@ perspectives.
  
  %%%*********************************************************
  %%%*********************************************************
  
  %%%*********************************************************
  %%%*********************************************************
-\section{A Krylov two-stage algorithm}
+\section{Two-stage algorithm with least-square residuals minimization}
  \label{sec:03}
  A two-stage algorithm is proposed  to solve large  sparse linear systems  of the
  form  $Ax=b$,  where  $A\in\mathbb{R}^{n\times   n}$  is  a  sparse  and  square
  \label{sec:03}
  A two-stage algorithm is proposed  to solve large  sparse linear systems  of the
  form  $Ax=b$,  where  $A\in\mathbb{R}^{n\times   n}$  is  a  sparse  and  square
@@ -613,57 +613,72 @@ $b\in\mathbb{R}^n$ is  the right-hand side.  The algorithm is implemented  as an
  inner-outer iteration  solver based  on iterative Krylov  methods. The  main key
  points of our solver are given in Algorithm~\ref{algo:01}.
  
  inner-outer iteration  solver based  on iterative Krylov  methods. The  main key
  points of our solver are given in Algorithm~\ref{algo:01}.
  
-In order to accelerate the convergence, the outer iteration applies a least-square minimization on the residuals computed by the inner   some  error functions  over a  Krylov
-subspace~\cite{Saad2003}. At  each iteration, the  sparse linear system  $Ax=b$ is
-solved   iteratively    with   an   iterative   method,    for   example   GMRES
-method~\cite{Saad86} or  some of its variants,  and the Krylov  subspace that we
-used is spanned by a basis  $S$ composed of successive solutions issued from the
-inner iteration
-\begin{equation}
-  S = \{x^1, x^2, \ldots, x^s\} \text{,~} s\leq n.
-\end{equation} 
-The advantage  of such a Krylov subspace  is that we neither  need an orthogonal
-basis nor  any synchronization  between processors to  generate this  basis. The
-algorithm  is periodically  restarted every  $s$ iterations  with a  new initial
-guess $x=S\alpha$ which minimizes the residual norm $\|b-Ax\|_2$ over the Krylov
-subspace spanned by  vectors of $S$, where $\alpha$ is a  solution of the normal
-equations
-\begin{equation}
-  R^TR\alpha = R^Tb,
-\end{equation}
-which is associated with the least-squares problem
+In order to accelerate the convergence, the outer iteration periodically applies
+a least-square minimization  on the residuals computed by  the inner solver. The
+inner solver is a Krylov based solver which does not required to be changed.
+
+At each outer iteration, the sparse linear system $Ax=b$ is solved, only for $m$
+iterations, using an iterative method restarting with the previous solution. For
+example, the GMRES method~\cite{Saad86} or some of its variants can be used as a
+inner solver. The current solution of the Krylov method is saved inside a matrix
+$S$ composed of successive solutions computed by the inner iteration.
+
+Periodically, every $s$ iterations, the minimization step is applied in order to
+compute a new  solution $x$. For that, the previous  residuals are computed with
+$(b-AS)$. The minimization of the residuals is obtained by 
  \begin{equation}
     \underset{\alpha\in\mathbb{R}^{s}}{min}\|b-R\alpha\|_2
  \label{eq:01}
  \end{equation}
  \begin{equation}
     \underset{\alpha\in\mathbb{R}^{s}}{min}\|b-R\alpha\|_2
  \label{eq:01}
  \end{equation}
-such  that $R=AS$  is a  dense rectangular  matrix in  $\mathbb{R}^{n\times s}$,
-$s\ll n$,  and $R^T$ denotes  the transpose of  matrix $R$. We use  an iterative
-method   to  solve   the  least-squares   problem~(\ref{eq:01})  such   as  CGLS
-~\cite{Hestenes52}  or LSQR~\cite{Paige82}  which  are more  appropriate than  a
-direct method in the parallel context.
+with $R=AS$. Then the new solution $x$ is computed with $x=S\alpha$.
+
+
+In  practice, $R$  is a  dense rectangular  matrix in  $\mathbb{R}^{n\times s}$,
+$s\ll n$.   In order  to minimize~(\ref{eq:01}), a  least-square method  such as
+CGLS ~\cite{Hestenes52}  or LSQR~\cite{Paige82} is used. Those  methods are more
+appropriate than a direct method in a parallel context.
  
  \begin{algorithm}[t]
  
  \begin{algorithm}[t]
-\caption{A Krylov two-stage algorithm}
+\caption{TSARM}
  \begin{algorithmic}[1]
    \Input $A$ (sparse matrix), $b$ (right-hand side)
    \Output $x$ (solution vector)\vspace{0.2cm}
    \State Set the initial guess $x^0$
  \begin{algorithmic}[1]
    \Input $A$ (sparse matrix), $b$ (right-hand side)
    \Output $x$ (solution vector)\vspace{0.2cm}
    \State Set the initial guess $x^0$
-  \For {$k=1,2,3,\ldots$ until convergence} \label{algo:conv}
-    \State Solve iteratively $Ax^k=b$  \label{algo:solve}
-    \State $S_{k~mod~s}=x^k$ 
-    \If {$k$ mod $s=0$ {\bf and} not convergence}
-      \State Compute dense matrix $R=AS$
-      \State Solve least-squares problem $\underset{\alpha\in\mathbb{R}^{s}}{min}\|b-R\alpha\|_2$
-      \State Compute minimizer $x^k=S\alpha$
+  \For {$k=1,2,3,\ldots$ until convergence (error$<\epsilon_{kryl}$)} \label{algo:conv}
+    \State  $x^k=Solve(A,b,x^{k-1},max\_iter_{kryl})$   \label{algo:solve}
+    \State retrieve error
+    \State $S_{k~mod~s}=x^k$ \label{algo:store}
+    \If {$k$ mod $s=0$ {\bf and} error$>\epsilon_{kryl}$}
+      \State $R=AS$ \Comment{compute dense matrix} \label{algo:matrix_mul}
+      \State Solve least-squares problem $\underset{\alpha\in\mathbb{R}^{s}}{min}\|b-R\alpha\|_2$ \label{algo:}
+      \State $x^k=S\alpha$  \Comment{compute new solution}
      \EndIf
    \EndFor
  \end{algorithmic}
  \label{algo:01}
  \end{algorithm}
  
      \EndIf
    \EndFor
  \end{algorithmic}
  \label{algo:01}
  \end{algorithm}
  
-Operation $S_{k~  mod~ s}=x^k$ consists in  copying the residual  $x_k$ into the
-column $k~ mod~ s$ of the matrix  $S$. After the minimization, the matrix $S$ is
-reused with the new values of the residuals.
+Algorithm~\ref{algo:01}  summarizes  the principle  of  our  method.  The  outer
+iteration is  inside the for  loop. Line~\ref{algo:solve}, the Krylov  method is
+called for a  maximum of $max\_iter_{kryl}$ iterations.  In practice, we  suggest to set this parameter
+equals to  the restart  number of the  GMRES-like method. Moreover,  a tolerance
+threshold must be specified for the  solver. In practice, this threshold must be
+much  smaller  than the  convergence  threshold  of  the TSARM  algorithm  (i.e.
+$\epsilon$).  Line~\ref{algo:store}, $S_{k~ mod~ s}=x^k$ consists in copying the
+solution  $x_k$  into the  column  $k~  mod~ s$ of  the  matrix  $S$. After  the
+minimization, the matrix $S$ is reused with the new values of the residuals.  To
+solve the minimization problem, an  iterative method is used. Two parameters are
+required for that: the maximum number of iteration and the threshold to stop the
+method.
+
+To summarize, the important parameters of TSARM are:
+\begin{itemize}
+\item $\epsilon_{kryl}$ the threshold to stop the method of the krylov method
+\item $max\_iter_{kryl}$ the maximum number of iterations for the krylov method
+\item $s$ the number of outer iterations before applying the minimization step
+\item $max\_iter_{ls}$ the maximum number of iterations for the iterative least-square method
+\item $\epsilon_{ls}$ the threshold to stop the least-square method
+\end{itemize}
  
  %%%*********************************************************
  %%%*********************************************************
  
  %%%*********************************************************
  %%%*********************************************************
@@ -671,11 +686,58 @@ reused with the new values of the residuals.
  \section{Convergence results}
  \label{sec:04}
  
  \section{Convergence results}
  \label{sec:04}
  
+
+
  %%%*********************************************************
  %%%*********************************************************
  %%%*********************************************************
  %%%*********************************************************
-\section{Experiments using petsc}
+\section{Parallelization}
  \label{sec:05}
  
  \label{sec:05}
  
+The  parallelisation  of  TSARM  relies   on  the  parallelization  of  all  its
+parts. More  precisely, except  the least-square step,  all the other  parts are
+obvious to  achieve out in parallel. In  order to develop a  parallel version of
+our   code,   we   have   chosen  to   use   PETSc~\cite{petsc-web-page}.    For
+line~\ref{algo:matrix_mul} the  matrix-matrix multiplication is  implemented and
+efficient since the  matrix $A$ is sparse and since the  matrix $S$ contains few
+colums in  practice. As explained  previously, at least  two methods seem  to be
+interesting to solve the least-square minimization, CGLS and LSQR.
+
+In the following  we remind the CGLS algorithm. The LSQR  method follows more or
+less the same principle but it take more place, so we briefly explain the parallelization of CGLS which is similar to LSQR.
+
+\begin{algorithm}[t]
+\caption{CGLS}
+\begin{algorithmic}[1]
+  \Input $A$ (matrix), $b$ (right-hand side)
+  \Output $x$ (solution vector)\vspace{0.2cm}
+  \State $r=b-Ax$
+  \State $p=A'r$
+  \State $s=p$
+  \State $g=||s||^2_2$
+  \For {$k=1,2,3,\ldots$ until convergence (g$<\epsilon_{ls}$)} \label{algo2:conv}
+    \State $q=Ap$
+    \State $\alpha=g/||q||^2_2$
+    \State $x=x+alpha*p$
+    \State $r=r-alpha*q$
+    \State $s=A'*r$
+    \State $g_{old}=g$
+    \State $g=||s||^2_2$
+    \State $\beta=g/g_{old}$
+  \EndFor
+\end{algorithmic}
+\label{algo:02}
+\end{algorithm}
+
+
+In each iteration  of CGLS, there is two  matrix-vector multiplications and some
+classical operations:  dots, norm, multiplication  and addition on  vectors. All
+these operations are easy to implement in PETSc or similar environment.
+
+%%%*********************************************************
+%%%*********************************************************
+\section{Experiments using petsc}
+\label{sec:06}
+
  
  In order to see the influence of our algorithm with only one processor, we first
  show  a comparison  with the  standard version  of GMRES  and our  algorithm. In
  
  In order to see the influence of our algorithm with only one processor, we first
  show  a comparison  with the  standard version  of GMRES  and our  algorithm. In
@@ -752,7 +814,9 @@ torso3             & fgmres / sor  & 37.70 & 565 & 34.97 & 510 \\
  
  
  
  
  
  
-Larger experiments ....
+Larger experiments ....\\
+
+Describe the problems ex15 and ex54
  
  \begin{table*}
  \begin{center}
  
  \begin{table*}
  \begin{center}
@@ -801,6 +865,33 @@ Larger experiments ....
  \label{tab:04}
  \end{center}
  \end{table*}
  \label{tab:04}
  \end{center}
  \end{table*}
+
+
+
+
+
+\begin{table*}
+\begin{center}
+\begin{tabular}{|r|r|r|r|r|r|r|r|} 
+\hline
+
+  nb. cores   & \multicolumn{2}{c|}{gmres variant} & \multicolumn{2}{c|}{2 stage CGLS} &  \multicolumn{2}{c|}{2 stage LSQR} & best gain \\ 
+\cline{2-7}
+                    & Time  & \# Iter.  & Time  & \# Iter. & Time  & \# Iter. & \\\hline \hline
+   512              & 3,969.69 & 33,120 & 709.57 & 5,790  & 622.76 & 5,070 &   \\
+   1024             & 1,530.06  & 25,860 & 290.95 & 4,830  & 307.71 & 5,070 &   \\
+   2048             & 919.62    & 31,470 & 237,52 & 8,040  & 194.22 & 6,510 &  \\
+   4096             & 405.60    & 28,380 & 111.67 & 7,590  & 91.72  & 6,510 &   \\
+   8192             & 785.04   & 109,590 & 76.07  & 10,470 & 69,42 & 9,030  & \\
+
+\hline
+
+\end{tabular}
+\caption{Comparison of FGMRES  and 2 stage FGMRES algorithms for ex54 of Petsc (both with the MG preconditioner) with 204,919,225 components on Curie with different number of cores (restart=30, s=12, threshol 5e-5),  time is expressed in seconds.}
+\label{tab:05}
+\end{center}
+\end{table*}
+
  %%%*********************************************************
  %%%*********************************************************
  
  %%%*********************************************************
  %%%*********************************************************
  
@@ -809,7 +900,7 @@ Larger experiments ....
  %%%*********************************************************
  %%%*********************************************************
  \section{Conclusion}
  %%%*********************************************************
  %%%*********************************************************
  \section{Conclusion}
-\label{sec:06}
+\label{sec:07}
  %The conclusion goes here. this is more of the conclusion
  %%%*********************************************************
  %%%*********************************************************
  %The conclusion goes here. this is more of the conclusion
  %%%*********************************************************
  %%%*********************************************************
@@ -817,6 +908,7 @@ Larger experiments ....
  
  future plan : \\
  - study other kinds of matrices, problems, inner solvers\\
  
  future plan : \\
  - study other kinds of matrices, problems, inner solvers\\
+- test the influence of all the parameters\\
  - adaptative number of outer iterations to minimize\\
  - other methods to minimize the residuals?\\
  - implement our solver inside PETSc
  - adaptative number of outer iterations to minimize\\
  - other methods to minimize the residuals?\\
  - implement our solver inside PETSc