update

[GMRES2stage.git] / paper.tex
diff --git a/paper.tex b/paper.tex

index deff6f33cbf2c9afed920cdf992094cfc9a4e149..112b322324c88803dcc327629623965b330b3276 100644 (file)
--- a/paper.tex
+++ b/paper.tex
@@ -354,6 +354,7 @@
  \usepackage{amsmath}
  \usepackage{amssymb}
  \usepackage{multirow}
+\usepackage{graphicx}
  
  \algnewcommand\algorithmicinput{\textbf{Input:}}
  \algnewcommand\Input{\item[\algorithmicinput]}
@@ -583,8 +584,7 @@ performances.
  The present paper is organized  as follows. In Section~\ref{sec:02} some related
  works are presented. Section~\ref{sec:03} presents our two-stage algorithm using
  a  least-square  residual  minimization.   Section~\ref{sec:04}  describes  some
-convergence  results  on this  method.   In Section~\ref{sec:05},  parallization
-details  of  TSARM  are  given.  Section~\ref{sec:06}  shows  some  experimental
+convergence  results  on this  method.   Section~\ref{sec:05}  shows  some  experimental
  results  obtained on large  clusters of  our algorithm  using routines  of PETSc
  toolkit.  Finally Section~\ref{sec:06} concludes and gives some perspectives.
  %%%*********************************************************
@@ -615,7 +615,7 @@ points of our solver are given in Algorithm~\ref{algo:01}.
  
  In order to accelerate the convergence, the outer iteration periodically applies
  a least-square minimization  on the residuals computed by  the inner solver. The
-inner solver is a Krylov based solver which does not required to be changed.
+inner solver is based on a Krylov method which does not require to be changed.
  
  At each outer iteration, the sparse linear system $Ax=b$ is solved, only for $m$
  iterations, using an iterative method restarting with the previous solution. For
@@ -644,11 +644,11 @@ appropriate than a direct method in a parallel context.
    \Input $A$ (sparse matrix), $b$ (right-hand side)
    \Output $x$ (solution vector)\vspace{0.2cm}
    \State Set the initial guess $x^0$
-  \For {$k=1,2,3,\ldots$ until convergence (error$<\epsilon_{kryl}$)} \label{algo:conv}
+  \For {$k=1,2,3,\ldots$ until convergence (error$<\epsilon_{tsarm}$)} \label{algo:conv}
      \State  $x^k=Solve(A,b,x^{k-1},max\_iter_{kryl})$   \label{algo:solve}
      \State retrieve error
      \State $S_{k~mod~s}=x^k$ \label{algo:store}
-    \If {$k$ mod $s=0$ {\bf and} error$>\epsilon_{kryl}$}
+    \If {$k$ mod $s=0$ {\bf and} error$>\epsilon_{tsarm}$}
        \State $R=AS$ \Comment{compute dense matrix} \label{algo:matrix_mul}
        \State Solve least-squares problem $\underset{\alpha\in\mathbb{R}^{s}}{min}\|b-R\alpha\|_2$ \label{algo:}
        \State $x^k=S\alpha$  \Comment{compute new solution}
@@ -664,7 +664,7 @@ called for a  maximum of $max\_iter_{kryl}$ iterations.  In practice, we  sugges
  equals to  the restart  number of the  GMRES-like method. Moreover,  a tolerance
  threshold must be specified for the  solver. In practice, this threshold must be
  much  smaller  than the  convergence  threshold  of  the TSARM  algorithm  (i.e.
-$\epsilon$).  Line~\ref{algo:store}, $S_{k~ mod~ s}=x^k$ consists in copying the
+$\epsilon_{tsarm}$).  Line~\ref{algo:store}, $S_{k~ mod~ s}=x^k$ consists in copying the
  solution  $x_k$  into the  column  $k~  mod~ s$ of  the  matrix  $S$. After  the
  minimization, the matrix $S$ is reused with the new values of the residuals.  To
  solve the minimization problem, an  iterative method is used. Two parameters are
@@ -673,25 +673,13 @@ method.
  
  To summarize, the important parameters of TSARM are:
  \begin{itemize}
-\item $\epsilon_{kryl}$ the threshold to stop the method of the krylov method
+\item $\epsilon_{tsarm}$ the threshold to stop the TSARM method
  \item $max\_iter_{kryl}$ the maximum number of iterations for the krylov method
  \item $s$ the number of outer iterations before applying the minimization step
  \item $max\_iter_{ls}$ the maximum number of iterations for the iterative least-square method
  \item $\epsilon_{ls}$ the threshold to stop the least-square method
  \end{itemize}
  
-%%%*********************************************************
-%%%*********************************************************
-
-\section{Convergence results}
-\label{sec:04}
-
-
-
-%%%*********************************************************
-%%%*********************************************************
-\section{Parallelization}
-\label{sec:05}
  
  The  parallelisation  of  TSARM  relies   on  the  parallelization  of  all  its
  parts. More  precisely, except  the least-square step,  all the other  parts are
@@ -733,10 +721,21 @@ In each iteration  of CGLS, there is two  matrix-vector multiplications and some
  classical operations:  dots, norm, multiplication  and addition on  vectors. All
  these operations are easy to implement in PETSc or similar environment.
  
+
+
+%%%*********************************************************
+%%%*********************************************************
+
+\section{Convergence results}
+\label{sec:04}
+
+
+
+
  %%%*********************************************************
  %%%*********************************************************
  \section{Experiments using petsc}
-\label{sec:06}
+\label{sec:05}
  
  
  In order to see the influence of our algorithm with only one processor, we first
@@ -766,12 +765,10 @@ torso3             & 2D/3D problem & 259,156 & 4,429,042 \\
  
  The following  parameters have been chosen  for our experiments.   As by default
  the restart  of GMRES is performed every  30 iterations, we have  chosen to stop
-the     GMRES    every     30    iterations     (line     \ref{algo:solve}    in
-Algorithm~\ref{algo:01}).   $s$ is  set to  8. CGLS  is chosen  to  minimize the
-least-squares  problem.  Two  conditions  are  used to  stop  CGLS,  either  the
-precision is under $1e-40$ or the  number of iterations is greater to $20$.  The
-external   precision    is   set    to   $1e-10$   (line    \ref{algo:conv}   in
-Algorithm~\ref{algo:01}).  Those  experiments have been performed  on a Intel(R)
+the GMRES every 30 iterations, $max\_iter_{kryl}=30$).  $s$ is set to 8. CGLS is
+chosen  to minimize  the least-squares  problem with  the  following parameters:
+$\epsilon_{ls}=1e-40$ and $max\_iter_{ls}=20$.  The external precision is set to
+$\epsilon_{tsarm}=1e-10$.  Those  experiments have been performed  on a Intel(R)
  Core(TM) i7-3630QM CPU @ 2.40GHz with the version 3.5.1 of PETSc.
  
  
@@ -779,13 +776,12 @@ In  Table~\ref{tab:02}, some  experiments comparing  the solving  of  the linear
  systems obtained with the previous matrices  with a GMRES variant and with out 2
  stage algorithm are  given. In the second column, it can  be noticed that either
  gmres or fgmres is used to  solve the linear system.  According to the matrices,
-different preconditioner is  used.  With the 2 stage  algorithm, the same solver
-and  the same  preconditionner  is used.   This  Table shows  that  the 2  stage
-algorithm  can  drastically  reduce  the  number  of  iterations  to  reach  the
-convergence when the  number of iterations for the normal GMRES  is more or less
-greater than  500. In fact  this also depends  on tow parameters: the  number of
-iterations  to  stop  GMRES  and   the  number  of  iterations  to  perform  the
-minimization.
+different  preconditioner is used.   With TSARM,  the same  solver and  the same
+preconditionner is used.  This Table shows that TSARM can drastically reduce the
+number of iterations to reach the  convergence when the number of iterations for
+the normal GMRES is more or less  greater than 500. In fact this also depends on
+tow  parameters: the  number  of iterations  to  stop GMRES  and  the number  of
+iterations to perform the minimization.
  
  
  \begin{table}
@@ -814,16 +810,36 @@ torso3             & fgmres / sor  & 37.70 & 565 & 34.97 & 510 \\
  
  
  
-Larger experiments ....\\
  
-Describe the problems ex15 and ex54
+In order to perform larger  experiments, we have tested some example application
+of PETSc. Those  applications are available in the ksp part  which is suited for
+scalable linear equations solvers:
+\begin{itemize}
+\item ex15  is an example  which solves in  parallel an operator using  a finite
+  difference  scheme.   The  diagonal  is  equals to  4  and  4  extra-diagonals
+  representing the neighbors in each directions  is equal to -1. This example is
+  used  in many  physical phenomena, for  example, heat  and fluid  flow, wave
+  propagation...
+\item ex54 is another example based on 2D problem discretized with quadrilateral
+  finite elements. For this example, the user can define the scaling of material
+  coefficient in embedded circle, it is called $\alpha$.
+\end{itemize}
+For more technical details on  these applications, interested reader are invited
+to  read the  codes available  in the  PETSc sources.   Those problem  have been
+chosen because they  are scalable with many cores. We  have tested other problem
+but they are not scalable with many cores.
+
+In the following larger experiments are described on two large scale architectures: Curie and Juqeen... {\bf description...}\\
+
+
+{\bf Description of preconditioners}
  
  \begin{table*}
  \begin{center}
  \begin{tabular}{|r|r|r|r|r|r|r|r|r|} 
  \hline
  
-  nb. cores & precond   & \multicolumn{2}{c|}{GMRES} & \multicolumn{2}{c|}{TSARM CGLS} &  \multicolumn{2}{c|}{TSARM LSQR} & best gain \\ 
+  nb. cores & precond   & \multicolumn{2}{c|}{FGMRES} & \multicolumn{2}{c|}{TSARM CGLS} &  \multicolumn{2}{c|}{TSARM LSQR} & best gain \\ 
  \cline{3-8}
               &                       & Time  & \# Iter.  & Time  & \# Iter. & Time  & \# Iter. & \\\hline \hline
    2,048      & mg                    & 403.49   & 18,210    & 73.89  & 3,060   & 77.84  & 3,270  & 5.46 \\
@@ -837,11 +853,34 @@ Describe the problems ex15 and ex54
  \hline
  
  \end{tabular}
-\caption{Comparison of FGMRES and 2 stage FGMRES algorithms for ex15 of Petsc with 25000 components per core on Juqueen (threshold 1e-3, restart=30, s=12),  time is expressed in seconds.}
+\caption{Comparison of FGMRES and TSARM with FGMRES for example ex15 of PETSc with two preconditioner (mg and sor) with 25,000 components per core on Juqueen (threshold 1e-3, restart=30, s=12),  time is expressed in seconds.}
  \label{tab:03}
  \end{center}
  \end{table*}
  
+Table~\ref{tab:03} shows  the execution  times and the  number of  iterations of
+example ex15  of PETSc on the  Juqueen architecture. Differents  number of cores
+are  studied rangin  from  2,048  upto 16,383.   Two  preconditioners have  been
+tested.   For those experiments,  the number  of components  (or unknown  of the
+problems)  per processor is  fixed to  25,000. This  number can  seem relatively
+small. In fact, for  some applications that need a lot of  memory, the number of
+components per processor requires sometimes to be small.
+
+In this Table, we  can notice that TSARM is always faster  than FGMRES. The last
+column shows the ratio between FGMRES and the best version of TSARM according to
+the minimization procedure: CGLS or LSQR.
+
+
+\begin{figure}
+\centering
+  \includegraphics[width=0.45\textwidth]{nb_iter_sec_ex15_juqueen}
+\caption{Number of iterations per second with ex15 and the same parameters than in Table~\ref{tab:03}}
+\label{fig:01}
+\end{figure}
+
+
+
+
  
  \begin{table*}
  \begin{center}
@@ -900,7 +939,7 @@ Describe the problems ex15 and ex54
  %%%*********************************************************
  %%%*********************************************************
  \section{Conclusion}
-\label{sec:07}
+\label{sec:06}
  %The conclusion goes here. this is more of the conclusion
  %%%*********************************************************
  %%%*********************************************************