[GMRES2stage.git] / paper.tex
diff --git a/paper.tex b/paper.tex

index f4dba976679275da150445d517e24ea7004d578a..e93737c8fa89c10db7125f76e1777a1264e938d2 100644
--- a/paper.tex
+++ b/paper.tex
@@ -601,7 +601,19 @@ is summarized while intended perspectives are provided.
  %%%*********************************************************
  \section{Related works}
  \label{sec:02} 
-%Wherever Times is specified, Times Roman or Times New Roman may be used. If neither is available on your system, please use the font closest in appearance to Times. Avoid using bit-mapped fonts if possible. True-Type 1 or Open Type fonts are preferred. Please embed symbol fonts, as well, for math, etc.
+Krylov subspace iteration methods have increasingly become useful and successful techniques for solving linear and nonlinear systems and eigenvalue problems, especially with the increasing development of preconditioners~\cite{Saad2003,Meijerink77}. One reason for the popularity of these methods is their generality, simplicity, and efficiency in solving systems of equations arising from very large and complex problems. %A Krylov method is based on a projection process onto a Krylov subspace spanned by vectors and it forms a sequence of approximations by minimizing the residual over the subspace formed~\cite{}.
+
+GMRES is one of the most widely used Krylov iterative methods for solving large sparse linear systems. It was developed by Saad and Schultz~\cite{Saad86} as a generalized method to deal with nonsymmetric and non-Hermitian problems, as well as indefinite symmetric problems. In its original version, called full GMRES, it minimizes the residual over the current Krylov subspace until convergence in at most $n$ iterations, where $n$ is the size of the sparse matrix. It should be noted that full GMRES is too expensive for large matrices, since the cost of the orthogonalization process per iteration grows linearly with the number of iterations, so that the total cost grows quadratically. For that reason, in practice GMRES is restarted every $m\ll n$ iterations to avoid the storage of a large orthonormal basis. However, the convergence behavior of the restarted GMRES, called GMRES($m$), in many cases depends quite critically on the value of $m$~\cite{Huang89}. Therefore, in most cases, a preconditioning technique is applied to the restarted GMRES method in order to improve its convergence.
+
+In order to enhance the robustness of Krylov iterative solvers, some techniques have been proposed that allow the use of different preconditioners, if necessary, within the iteration instead of restarting. Those techniques may lead to considerable savings in CPU time and memory requirements. Van der Vorst in~\cite{Vorst94} has proposed variants of the GMRES algorithm in which a different preconditioner is applied in each iteration, the so-called GMRESR family of nested methods. In fact, the GMRES method is effectively preconditioned with other iterative schemes (or GMRES itself), where the iterations of the GMRES method are called outer iterations while the iterations of the preconditioning process are referred to as inner iterations. Saad in~\cite{Saad:1993} has proposed FGMRES, which is another variant of the GMRES algorithm using a variable preconditioner. In FGMRES the search directions are preconditioned, whereas in GMRESR the residuals are preconditioned. However, in practice, good preconditioners are those based on direct methods, such as ILU preconditioners, which are not easy to parallelize and suffer from scalability problems on large clusters of thousands of cores.
+
+Recently, communication-avoiding methods have been developed to reduce the communication overheads in Krylov subspace iterative solvers. On modern computer architectures, communications between processors are much slower than floating-point arithmetic operations on a given processor. Communication-avoiding techniques reduce either communications between processors or data movements between levels of the memory hierarchy, by reformulating the communication-bound kernels (most frequently SpMV kernels) and the orthogonalization operations within the Krylov iterative solver. Several works have studied communication-avoiding techniques for the GMRES method, known as CA-GMRES, on multicore processors and multi-GPU machines~\cite{Mohiyuddin2009,Hoemmen2010,Yamazaki2014}.
+
+Compared to all these works and to all the other works on Krylov iterative
+methods, the originality of our work is to build a second iteration over a Krylov
+iterative method and to minimize the residuals with a least-squares method after
+a given number of outer iterations.
+
  %%%*********************************************************
  %%%*********************************************************
  
@@ -654,10 +666,10 @@ appropriate than a single direct method in a parallel context.
    \Input $A$ (sparse matrix), $b$ (right-hand side)
    \Output $x$ (solution vector)\vspace{0.2cm}
    \State Set the initial guess $x_0$
-  \For {$k=1,2,3,\ldots$ until convergence (error$<\epsilon_{tsirm}$)} \label{algo:conv}
+  \For {$k=1,2,3,\ldots$ until convergence ($error<\epsilon_{tsirm}$)} \label{algo:conv}
      \State  $[x_k,error]=Solve(A,b,x_{k-1},max\_iter_{kryl})$   \label{algo:solve}
-    \State $S_{k \mod s}=x_k$ \label{algo:store} \Comment{update column (k mod s) of S}
-    \If {$k \mod s=0$ {\bf and} error$>\epsilon_{kryl}$}
+    \State $S_{k \mod s}=x_k$ \label{algo:store} \Comment{update column ($k \mod s$) of $S$}
+    \If {$k \mod s=0$ {\bf and} $error>\epsilon_{kryl}$}
        \State $R=AS$ \Comment{compute dense matrix} \label{algo:matrix_mul}
              \State $\alpha=Least\_Squares(R,b,max\_iter_{ls})$ \label{algo:}
        \State $x_k=S\alpha$  \Comment{compute new solution}
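The loop shown in the algorithm hunk above can be sketched in Python with NumPy. This is only an illustrative sketch under stated assumptions, not the authors' PETSc implementation: the inner solver `gmres_cycle` is a hand-written, unpreconditioned GMRES cycle standing in for $Solve$; the least-squares step uses the dense `numpy.linalg.lstsq` rather than an iterative method bounded by $max\_iter_{ls}$; the error is recomputed after the minimization; and the function names, default values, and test matrix are hypothetical choices made for the example.

```python
import numpy as np


def gmres_cycle(A, b, x0, m):
    """One unpreconditioned GMRES(m) cycle; stands in for Solve(A, b, x, max_iter_kryl)."""
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)
    if beta == 0.0:
        return x0, 0.0
    n = b.shape[0]
    V = np.zeros((n, m + 1))   # orthonormal Krylov basis
    H = np.zeros((m + 1, m))   # upper Hessenberg matrix
    V[:, 0] = r0 / beta
    used = m
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):           # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w = w - H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-14:          # breakdown: Krylov subspace is invariant
            used = j + 1
            break
        V[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(used + 1)
    e1[0] = beta
    # Small least-squares problem min ||beta*e1 - H y||_2
    y, *_ = np.linalg.lstsq(H[:used + 1, :used], e1, rcond=None)
    x = x0 + V[:, :used] @ y
    return x, np.linalg.norm(A @ x - b)


def tsirm(A, b, x0, eps_tsirm=1e-10, eps_kryl=1e-12,
          max_iter_kryl=10, s=4, max_outer=500):
    """Outer loop: store s successive iterates, then minimize ||b - A S alpha||_2."""
    x = x0.copy()
    S = np.zeros((b.shape[0], s))
    error = np.linalg.norm(A @ x - b)
    k = 0
    for k in range(1, max_outer + 1):
        x, error = gmres_cycle(A, b, x, max_iter_kryl)
        S[:, k % s] = x                   # update column (k mod s) of S
        if k % s == 0 and error > eps_kryl:
            R = A @ S                     # dense n-by-s matrix
            # Assumption: dense solver instead of an iterative least-squares method
            alpha, *_ = np.linalg.lstsq(R, b, rcond=None)
            x = S @ alpha                 # new solution
            error = np.linalg.norm(A @ x - b)
        if error < eps_tsirm:
            break
    return x, error, k


# Small nonsymmetric, diagonally dominant test system (hypothetical example data)
n = 50
A = 4.0 * np.eye(n) - np.eye(n, k=-1) - 2.0 * np.eye(n, k=1)
b = A @ np.ones(n)
x, error, k = tsirm(A, b, np.zeros(n))
print("outer iterations:", k, "residual norm:", error)
```

Because the latest iterate is always one of the columns of $S$, the minimization step cannot, up to rounding, return a residual larger than the one it started from.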
@@ -675,10 +687,10 @@ method. Moreover,  a tolerance  threshold must be  specified for the  solver. In
  practice, this threshold must be  much smaller than the convergence threshold of
  the  TSIRM algorithm  (\emph{i.e.}, $\epsilon_{tsirm}$).  We also  consider that
  after the call of the $Solve$ function, we obtain the vector $x_k$ and the error
-which is defined by $||Ax^k-b||_2$.
+which is defined by $||Ax_k-b||_2$.
  
    Line~\ref{algo:store},
-$S_{k \mod  s}=x^k$ consists in  copying the solution  $x_k$ into the  column $k
+$S_{k \mod  s}=x_k$ consists in  copying the solution  $x_k$ into the  column $k
  \mod s$ of $S$.   After the minimization, the matrix $S$ is  reused with the new
  values of the residuals.  To solve the minimization problem, an iterative method
  is used. Two parameters are required  for that: the maximum number of iterations
@@ -861,16 +873,15 @@ Core(TM) i7-3630QM CPU @ 2.40GHz with the version 3.5.1 of PETSc.
  
  
  In  Table~\ref{tab:02}, some  experiments comparing  the solving  of  the linear
-systems obtained with the previous matrices  with a GMRES variant and with out 2
-stage algorithm are  given. In the second column, it can  be noticed that either
-GRMES or  FGMRES (Flexible  GMRES)~\cite{Saad:1993} is used  to solve  the linear
-system.   According to  the matrices,  different preconditioner  is  used.  With
-TSIRM, the same solver and the  same preconditionner are used.  This Table shows
-that  TSIRM  can  drastically reduce  the  number  of  iterations to  reach  the
-convergence when the  number of iterations for the normal GMRES  is more or less
-greater than  500. In fact  this also depends  on tow parameters: the  number of
-iterations  to  stop  GMRES  and   the  number  of  iterations  to  perform  the
-minimization.
+systems obtained with the previous matrices with a GMRES variant and with TSIRM
+are given. In the second column, it can be noticed that either GMRES or FGMRES
+(Flexible GMRES)~\cite{Saad:1993} is used to solve the linear system. Depending
+on the matrix, a different preconditioner is used. With TSIRM, the same solver
+and the same preconditioner are used. This table shows that TSIRM can
+drastically reduce the number of iterations needed to reach convergence when the
+number of iterations for the normal GMRES is greater than roughly 500. In
+fact, this also depends on two parameters: the number of iterations after which
+GMRES is stopped and the number of iterations performed for the minimization.
  
  
  \begin{table}[htbp]
@@ -961,7 +972,7 @@ preconditioner in PETSc please consult~\cite{petsc-web-page}.
  \hline
  
  \end{tabular}
-\caption{Comparison of FGMRES and TSIRM with FGMRES for example ex15 of PETSc with two preconditioners (mg and sor) with 25,000 components per core on Juqueen (threshold 1e-3, restart=30, s=12),  time is expressed in seconds.}
+\caption{Comparison of FGMRES and TSIRM with FGMRES for example ex15 of PETSc with two preconditioners (mg and sor) with 25,000 components per core on Juqueen ($\epsilon_{tsirm}=10^{-3}$, $max\_iter_{kryl}=30$, $s=12$, $max\_iter_{ls}=15$, $\epsilon_{ls}=10^{-40}$); time is expressed in seconds.}
  \label{tab:03}
  \end{center}
  \end{table*}
@@ -1016,7 +1027,7 @@ the number of iterations. So, the overall benefit of using TSIRM is interesting.
  \begin{tabular}{|r|r|r|r|r|r|r|r|r|} 
  \hline
  
-  nb. cores & threshold   & \multicolumn{2}{c|}{FGMRES} & \multicolumn{2}{c|}{TSIRM CGLS} &  \multicolumn{2}{c|}{TSIRM LSQR} & best gain \\ 
+  nb. cores & $\epsilon_{tsirm}$  & \multicolumn{2}{c|}{FGMRES} & \multicolumn{2}{c|}{TSIRM CGLS} &  \multicolumn{2}{c|}{TSIRM LSQR} & best gain \\ 
  \cline{3-8}
               &                       & Time  & \# Iter.  & Time  & \# Iter. & Time  & \# Iter. & \\\hline \hline
    2,048      & 8e-5                  & 108.88 & 16,560  & 23.06  &  3,630  & 22.79  & 3,630   & 4.77 \\
@@ -1029,7 +1040,7 @@ the number of iterations. So, the overall benefit of using TSIRM is interesting.
  \hline
  
  \end{tabular}
-\caption{Comparison of FGMRES  and TSIRM with FGMRES algorithms for ex54 of Petsc (both with the MG preconditioner) with 25,000 components per core on Curie (restart=30, s=12),  time is expressed in seconds.}
+\caption{Comparison of FGMRES and TSIRM with FGMRES algorithms for ex54 of PETSc (both with the MG preconditioner) with 25,000 components per core on Curie ($max\_iter_{kryl}=30$, $s=12$, $max\_iter_{ls}=15$, $\epsilon_{ls}=10^{-40}$); time is expressed in seconds.}
  \label{tab:04}
  \end{center}
  \end{table*}
@@ -1087,7 +1098,7 @@ taken into account with TSIRM.
  \hline
  
  \end{tabular}
-\caption{Comparison of FGMRES  and TSIRM with FGMRES for ex54 of Petsc (both with the MG preconditioner) with 204,919,225 components on Curie with different number of cores (restart=30, s=12, threshold 5e-5),  time is expressed in seconds.}
+\caption{Comparison of FGMRES and TSIRM with FGMRES for ex54 of PETSc (both with the MG preconditioner) with 204,919,225 components on Curie with different numbers of cores ($\epsilon_{tsirm}=5\times 10^{-5}$, $max\_iter_{kryl}=30$, $s=12$, $max\_iter_{ls}=15$, $\epsilon_{ls}=10^{-40}$); time is expressed in seconds.}
  \label{tab:05}
  \end{center}
  \end{table*}
@@ -1099,6 +1110,26 @@ taken into account with TSIRM.
  \label{fig:02}
  \end{figure}
  
+
+Concerning the experiments, some other remarks are worth making.
+\begin{itemize}
+\item We have tested other examples of PETSc (ex29, ex45, ex49). For all these
+  examples, we also obtained similar gains between GMRES and TSIRM, but those
+  examples do not scale to many cores. In general, we had some problems
+  with more than 4,096 cores.
+\item We have tested many iterative solvers available in PETSc. In fact, it is
+  possible to use most of them with TSIRM. From our point of view, the condition
+  for using a solver inside TSIRM is that the solver must have a restart
+  feature. More precisely, the solver must support being stopped and restarted
+  without degrading its convergence. That is why, with GMRES, we stop it when it
+  is naturally restarted (\emph{i.e.}, with $m$ the restart parameter). The
+  Conjugate Gradient (CG) and all its variants do not have a ``restarted''
+  version in PETSc, so they are not efficient inside TSIRM. They will converge
+  with TSIRM, but not quickly: if we compare a normal CG with a CG that is
+  stopped every 16 iterations, for example, the normal CG will be far more
+  efficient. Some restarted CG or CG variant versions exist and may be
+  interesting to study in future work.
+\end{itemize}
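The remark on CG in the list above can be illustrated with a small experiment. The following is a minimal sketch, assuming a textbook CG written directly in NumPy (not PETSc's CG) and a hypothetical 1D Laplacian test matrix; the function name `cg_iterations` and all parameter values are choices made for this example. It counts the iterations needed by a normal CG and by a CG whose search direction is discarded every 16 iterations.

```python
import numpy as np


def cg_iterations(A, b, restart=None, tol=1e-8, max_iter=20000):
    """Return the number of CG iterations needed to reach ||r|| <= tol * ||b||.

    If `restart` is given, the search direction is thrown away every
    `restart` iterations, which mimics stopping CG and calling it again
    from the current iterate.
    """
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    b_norm = np.linalg.norm(b)
    for it in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * b_norm:
            return it
        if restart is not None and it % restart == 0:
            r = b - A @ x            # restart from the current iterate
            rs_new = r @ r
            p = r.copy()             # lose the conjugate direction history
        else:
            p = r + (rs_new / rs) * p
        rs = rs_new
    return max_iter


# Symmetric positive definite test matrix: 1D Laplacian (hypothetical example data)
n = 100
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)

plain = cg_iterations(A, b)
restarted = cg_iterations(A, b, restart=16)
print("normal CG iterations:", plain)
print("CG stopped every 16 iterations:", restarted)
```

Each restart discards the conjugacy built up so far, so the restarted variant behaves like a sequence of short, independent CG runs.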
  %%%*********************************************************
  %%%*********************************************************
  
@@ -1121,13 +1152,14 @@ experiments up to 16,394 cores have been led to verify that TSIRM runs
  5 or  7 times  faster than GMRES.
  
  
-For future work, the authors' intention is to investigate 
-other kinds of matrices, problems, and inner solvers. The 
-influence of all parameters must be tested too, while 
-other methods to minimize the residuals must be regarded.
-The number of outer iterations to minimize should become 
-adaptative to improve the overall performances of the proposal.
-Finally, this solver will be implemented inside PETSc.
+For future work, the authors' intention is to investigate other kinds of
+matrices, problems, and inner solvers. The influence of all parameters must be
+tested too, and other methods to minimize the residuals must be considered. The
+number of outer iterations before each minimization should become adaptive to
+improve the overall performance of the proposal. Finally, this solver will be
+implemented inside PETSc. This would be very interesting because it would allow
+us to test all the nonlinear examples and compare our algorithm with the other
+algorithms implemented in PETSc.
  
  
  % conference papers do not normally have an appendix
  
  
  % conference papers do not normally have an appendix