12-10-2014 09

[GMRES2stage.git] / paper.tex
diff --git a/paper.tex b/paper.tex

index a4c9b268faf6197c47d7c288188adc6bf299b3de..59c5600176bf59910b53c6887d1eba5854677d7e 100644
--- a/paper.tex
+++ b/paper.tex
@@ -601,7 +601,15 @@ is summarized while intended perspectives are provided.
  %%%*********************************************************
  \section{Related works}
  \label{sec:02} 
-%Wherever Times is specified, Times Roman or Times New Roman may be used. If neither is available on your system, please use the font closest in appearance to Times. Avoid using bit-mapped fonts if possible. True-Type 1 or Open Type fonts are preferred. Please embed symbol fonts, as well, for math, etc.
+Krylov subspace iteration methods have increasingly become useful and successful techniques for solving linear and nonlinear systems and eigenvalue problems, especially since the increasing development of preconditioners~\cite{Saad2003,Meijerink77}. One reason for the popularity of these methods is their generality, simplicity, and efficiency in solving systems of equations arising from very large and complex problems. %A Krylov method is based on a projection process onto a Krylov subspace spanned by vectors and it forms a sequence of approximations by minimizing the residual over the subspace formed~\cite{}.
+
+GMRES is one of the most widely used Krylov iterative methods for solving large sparse linear systems. It was developed by Saad et al.~\cite{Saad86} as a generalized method to deal with unsymmetric and non-Hermitian problems, as well as indefinite symmetric problems. In its original version, called full GMRES, it minimizes the residual over the current Krylov subspace until convergence in at most $n$ iterations, where $n$ is the size of the sparse matrix. It should be noted that full GMRES is too expensive in the case of large matrices, since the cost of the required orthogonalization process grows quadratically with the number of iterations. For that reason, in practice GMRES is restarted every $m\ll n$ iterations to avoid the storage of a large orthonormal basis. However, the convergence behavior of the restarted GMRES, called GMRES($m$), in many cases depends quite critically on the value of $m$~\cite{Huang89}. Therefore, in most cases, a preconditioning technique is applied to the restarted GMRES method in order to improve its convergence.
+
+In order to enhance the robustness of Krylov iterative solvers, some techniques have been proposed that allow the use of different preconditioners, if necessary, within the iteration instead of restarting. Those techniques may lead to considerable savings in CPU time and memory requirements. Van der Vorst in~\cite{Vorst94} has proposed variants of the GMRES algorithm in which a different preconditioner is applied in each iteration, the so-called GMRESR family of nested methods. In fact, the GMRES method is effectively preconditioned with other iterative schemes (or GMRES itself), where the iterations of the GMRES method are called outer iterations while the iterations of the preconditioning process are referred to as inner iterations. Saad in~\cite{Saad:1993} has proposed FGMRES, which is another variant of the GMRES algorithm using a variable preconditioner. In FGMRES the search directions are preconditioned, whereas in GMRESR the residuals are preconditioned. However, in practice the good preconditioners are those based on direct methods, such as ILU preconditioners, which are not easy to parallelize and suffer from scalability problems on large clusters of thousands of cores.
+
+Recently, communication-avoiding methods have been developed to reduce the communication overheads in Krylov subspace iterative solvers. On modern computer architectures, communications between processors are much slower than floating-point arithmetic operations on a given processor. Communication-avoiding techniques reduce either communications between processors or data movements between levels of the memory hierarchy, by reformulating the communication-bound kernels (most frequently SpMV kernels) and the orthogonalization operations within the Krylov iterative solver. Different works have studied communication-avoiding methods for multicore processors and multi-GPU machines~\cite{}.
+
+
  %%%*********************************************************
  %%%*********************************************************
  
@@ -654,10 +662,10 @@ appropriate than a single direct method in a parallel context.
    \Input $A$ (sparse matrix), $b$ (right-hand side)
    \Output $x$ (solution vector)\vspace{0.2cm}
    \State Set the initial guess $x_0$
-  \For {$k=1,2,3,\ldots$ until convergence (error$<\epsilon_{tsirm}$)} \label{algo:conv}
+  \For {$k=1,2,3,\ldots$ until convergence ($error<\epsilon_{tsirm}$)} \label{algo:conv}
      \State  $[x_k,error]=Solve(A,b,x_{k-1},max\_iter_{kryl})$   \label{algo:solve}
-    \State $S_{k \mod s}=x_k$ \label{algo:store} \Comment{update column (k mod s) of S}
-    \If {$k \mod s=0$ {\bf and} error$>\epsilon_{kryl}$}
+    \State $S_{k \mod s}=x_k$ \label{algo:store} \Comment{update column ($k \mod s$) of $S$}
+    \If {$k \mod s=0$ {\bf and} $error>\epsilon_{kryl}$}
        \State $R=AS$ \Comment{compute dense matrix} \label{algo:matrix_mul}
              \State $\alpha=Least\_Squares(R,b,max\_iter_{ls})$ \label{algo:}
        \State $x_k=S\alpha$  \Comment{compute new solution}
@@ -675,10 +683,10 @@ method. Moreover,  a tolerance  threshold must be  specified for the  solver. In
  practice, this threshold must be  much smaller than the convergence threshold of
  the  TSIRM algorithm  (\emph{i.e.}, $\epsilon_{tsirm}$).  We also  consider that
  after the call of the $Solve$ function, we obtain the vector $x_k$ and the error
-which is defined by $||Ax^k-b||_2$.
+which is defined by $||Ax_k-b||_2$.
  
    Line~\ref{algo:store},
-$S_{k \mod  s}=x^k$ consists in  copying the solution  $x_k$ into the  column $k
+$S_{k \mod  s}=x_k$ consists in  copying the solution  $x_k$ into the  column $k
  \mod s$ of $S$.   After the minimization, the matrix $S$ is  reused with the new
  values of the residuals.  To solve the minimization problem, an iterative method
  is used. Two parameters are required  for that: the maximum number of iterations
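
The outer loop touched by the two hunks above can be illustrated with a minimal NumPy sketch. It is only a sketch under stated assumptions, not the authors' PETSc implementation: the names `gmres_cycle` and `tsirm` are hypothetical, the inner $Solve$ is replaced by a single unpreconditioned GMRES($m$) cycle, the iterative least-squares solver (bounded by $max\_iter_{ls}$ in the paper) is replaced by a dense `np.linalg.lstsq` call, and recomputing the error after the minimization is an assumption because the end of the algorithm is not shown in this diff. The test matrix is an arbitrary diagonally dominant example.

```python
import numpy as np

def gmres_cycle(A, b, x0, m):
    """One restart cycle of GMRES(m): at most m Arnoldi steps, then a small least-squares solve."""
    r0 = b - A @ x0
    beta = np.linalg.norm(r0)
    if beta == 0.0:
        return x0
    n = b.shape[0]
    V = np.zeros((n, m + 1))   # orthonormal Krylov basis
    H = np.zeros((m + 1, m))   # upper Hessenberg matrix
    V[:, 0] = r0 / beta
    k = m
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):             # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-14:            # breakdown: the Krylov subspace is exhausted
            k = j + 1
            break
        V[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(k + 1)
    e1[0] = beta
    y, *_ = np.linalg.lstsq(H[:k + 1, :k], e1, rcond=None)
    return x0 + V[:, :k] @ y

def tsirm(A, b, s=3, m=3, eps_tsirm=1e-8, eps_kryl=1e-12, max_outer=200):
    """Sketch of the TSIRM outer iteration (assumed simplifications are listed in the lead-in)."""
    x = np.zeros(b.shape[0])
    S = np.zeros((b.shape[0], s))
    error = np.linalg.norm(A @ x - b)
    for k in range(1, max_outer + 1):
        x = gmres_cycle(A, b, x, m)                 # [x_k, error] = Solve(A, b, x_{k-1}, ...)
        error = np.linalg.norm(A @ x - b)           # error = ||A x_k - b||_2
        S[:, k % s] = x                             # update column (k mod s) of S
        if k % s == 0 and error > eps_kryl:
            R = A @ S                               # dense n-by-s matrix
            alpha, *_ = np.linalg.lstsq(R, b, rcond=None)   # minimize ||b - R alpha||_2
            x = S @ alpha                           # new solution
            error = np.linalg.norm(A @ x - b)       # assumption: error refreshed here
        if error < eps_tsirm:
            return x, k, error
    return x, max_outer, error

# Hypothetical test problem: nonsymmetric, strictly diagonally dominant tridiagonal matrix.
n = 100
A = 10.0 * np.eye(n) - np.eye(n, k=-1) - 2.0 * np.eye(n, k=1)
b = np.ones(n)
x, outer_iters, error = tsirm(A, b)
print("outer iterations:", outer_iters, "residual norm:", error)
```

A dense direct least-squares solve is used here only to keep the sketch short. Near convergence the columns of $S$ become almost linearly dependent, which is one reason an iterative method with a bounded number of iterations may be preferable in practice.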
@@ -1051,6 +1059,24 @@ core. It can also  be observed that the difference between CGLS  and LSQR is not
  significant. Both can be good but it seems not possible to know in advance which
  one will be the best.
  
+Table~\ref{tab:05} shows a strong scaling experiment with the example ex54 on the
+Curie architecture. In this case, the number of unknowns is fixed to
+$204,919,225$ and the number of cores ranges from $512$ to $8,192$ in powers
+of two.  The threshold is fixed to $5\times 10^{-5}$ and only the $mg$
+preconditioner has been tested. Here again we can see that TSIRM is faster than
+FGMRES. The efficiency of each algorithm is reported. It can be noticed that
+FGMRES is more efficient than TSIRM except with $8,192$ cores and that its
+efficiency is greater than one whereas the efficiency of TSIRM is lower than
+one. Nevertheless, TSIRM with any version of the least-squares method is always
+faster. With $8,192$ cores, the number of iterations is far larger for FGMRES,
+whereas it is only slightly larger for TSIRM.
+
+In  Figure~\ref{fig:02}  we report  the  number  of  iterations per  second  for
+the experiments reported in Table~\ref{tab:05}.  This figure highlights that the
+number of iterations per second is more or less the same for FGMRES and TSIRM,
+with a small advantage for FGMRES. This can be explained by the fact that, as
+previously mentioned, the iterations of the least-squares steps are not taken
+into account with TSIRM.
  
  \begin{table*}[htbp]
  \begin{center}
@@ -1081,6 +1107,26 @@ one will be the best.
  \label{fig:02}
  \end{figure}
  
+
+Some other remarks about the experiments are worth making.
+\begin{itemize}
+\item We have tested other examples of PETSc (ex29, ex45, ex49).  For all these
+  examples, we also obtained similar gains between GMRES and TSIRM, but those
+  examples are not scalable with many cores. In general, we had some problems
+  with more than $4,096$ cores.
+\item We have tested many iterative solvers available in PETSc.  In fact, it is
+  possible to use most of them with TSIRM. From our point of view, the condition
+  to use a solver inside TSIRM is that the solver must have a restart
+  feature. More precisely, the solver must support being stopped and restarted
+  without degrading its convergence. That is why, with GMRES, we stop it when it
+  is naturally restarted (\emph{i.e.}, every $m$ iterations, $m$ being the
+  restart parameter).  The Conjugate Gradient (CG) and all its variants do not
+  have a ``restarted'' version in PETSc, so they are not efficient.  They will
+  converge with TSIRM but not quickly, because if we compare a normal CG with a
+  CG that is stopped every 16 iterations, for example, the normal CG will be far
+  more efficient.  Some restarted CG or CG variant versions exist and may be
+  interesting to study in future works.
+\end{itemize}
  %%%*********************************************************
  %%%*********************************************************
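
The remark about CG in the list above can be checked on a toy problem. The following NumPy sketch is an illustration under assumptions, not an experiment from the paper: the helper names `cg`, `restarted_cg` and `a_norm_error`, the 1D Laplacian test matrix, and the iteration budget are all hypothetical. It runs the same total number of CG iterations once without interruption and once with a stop and restart every 16 iterations, then compares the errors in the $A$-norm, which is the quantity CG minimizes over the Krylov subspace.

```python
import numpy as np

def cg(A, b, x0, n_iter):
    """Run at most n_iter iterations of plain conjugate gradient starting from x0."""
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(n_iter):
        if rs == 0.0:          # exact solution reached
            break
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def restarted_cg(A, b, x0, n_iter, restart):
    """Same iteration budget, but CG is stopped and restarted every `restart` iterations."""
    x = x0.copy()
    done = 0
    while done < n_iter:
        step = min(restart, n_iter - done)
        x = cg(A, b, x, step)  # the search direction history is lost at each restart
        done += step
    return x

# Hypothetical SPD test problem: 1D Laplacian.
n = 200
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.ones(n)
x0 = np.zeros(n)
x_star = np.linalg.solve(A, b)

def a_norm_error(x):
    """Error measured in the A-norm."""
    e = x - x_star
    return np.sqrt(e @ (A @ e))

total_iters = 64
err_plain = a_norm_error(cg(A, b, x0, total_iters))
err_restart = a_norm_error(restarted_cg(A, b, x0, total_iters, restart=16))
print("plain CG:", err_plain, "CG restarted every 16 iterations:", err_restart)
```

Both runs build their iterate in the same Krylov subspace, but only the uninterrupted run keeps the conjugate directions that make the iterate optimal in that subspace, so its error cannot be larger in exact arithmetic.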
  
@@ -1103,13 +1149,14 @@ experiments up to 16,394 cores have been led to verify that TSIRM runs
  5 or  7 times  faster than GMRES.
  
  
-For future work, the authors' intention is to investigate 
-other kinds of matrices, problems, and inner solvers. The 
-influence of all parameters must be tested too, while 
-other methods to minimize the residuals must be regarded.
-The number of outer iterations to minimize should become 
-adaptative to improve the overall performances of the proposal.
-Finally, this solver will be implemented inside PETSc.
+For  future  work, the  authors'  intention is  to  investigate  other kinds  of
+matrices, problems, and  inner solvers. The influence of  all parameters must be
+tested too, while other methods to minimize the residuals must be considered. The
+number of outer  iterations to minimize should become  adaptive to improve the
+overall performance of the proposal.   Finally, this solver will be implemented
+inside PETSc. This  would be very interesting because it would  allow us to test
+all the non-linear  examples and compare our algorithm with the other algorithms
+implemented in PETSc.
  
  
  % conference papers do not normally have an appendix