23-04-2013

[Krylov_multi.git] / krylov_multi.tex
diff --git a/krylov_multi.tex b/krylov_multi.tex

index b96a5b9aec0e9a27f59c84477910225b2fb1ae37..1affc17cfc6d8dd6d3a3c3552109c1a4c7a2ebcd 100644 (file)
--- a/krylov_multi.tex
+++ b/krylov_multi.tex
@@ -3,8 +3,27 @@
  \usepackage{amsfonts,amssymb}
  \usepackage{amsmath}
  \usepackage{graphicx}
  \usepackage{amsfonts,amssymb}
  \usepackage{amsmath}
  \usepackage{graphicx}
+\usepackage{algorithm}
+\usepackage{algpseudocode}
+\usepackage{multirow}
+
+\algnewcommand\algorithmicinput{\textbf{Input:}}
+\algnewcommand\Input{\item[\algorithmicinput]}
+
+\algnewcommand\algorithmicoutput{\textbf{Output:}}
+\algnewcommand\Output{\item[\algorithmicoutput]}
+
+\newcommand{\Time}[1]{\mathit{Time}_\mathit{#1}}
+\newcommand{\Prec}{\mathit{prec}}
+\newcommand{\Ratio}{\mathit{Ratio}}
+
+\usepackage{xspace}
+\usepackage[textsize=footnotesize]{todonotes}
+\newcommand{\LZK}[2][inline]{%
+\todo[color=green!40,#1]{\sffamily\textbf{LZK:} #2}\xspace}
  
  \title{A scalable multisplitting algorithm for solving large sparse linear systems} 
  
  \title{A scalable multisplitting algorithm for solving large sparse linear systems} 
+\date{}
  
  
  
  
  
  
@@ -17,42 +36,307 @@
  %%%%%%%%%%%%%%%%%%%%%%%%
  %%%%%%%%%%%%%%%%%%%%%%%%
  
  %%%%%%%%%%%%%%%%%%%%%%%%
  %%%%%%%%%%%%%%%%%%%%%%%%
  
-
  \begin{abstract}
  \begin{abstract}
-In  this  paper we  revist  the  krylov  multisplitting algorithm  presented  in
+In  this paper  we  revisit  the krylov  multisplitting  algorithm presented  in
  \cite{huang1993krylov}  which  uses  a  scalar  method to  minimize  the  krylov
  \cite{huang1993krylov}  which  uses  a  scalar  method to  minimize  the  krylov
-iterations computed by a multisplitting algorithm. Our new algorithm is simply a
-parallel multisplitting algorithm  with few blocks of large  size and a parallel
-krylov  minimization  is used  to  improve  the  convergence. Some  large  scale
-experiments  with a 3D  Poisson problem  are presented.  They show  the obtained
-improvements compared to a classical GMRES both in terms of number of iterations
-and execution times.
+iterations computed by a multisplitting algorithm. Our new algorithm is based on
+a  parallel multisplitting  algorithm  with few  blocks  of large  size using  a
+parallel GMRES method inside each block and on a parallel krylov minimization in
+order to improve the convergence. Some large scale experiments with a 3D Poisson
+problem  are presented.   They  show  the obtained  improvements  compared to  a
+classical GMRES both in terms of number of iterations and execution times.
  \end{abstract}
  
  \end{abstract}
  
-
  %%%%%%%%%%%%%%%%%%%%%%%%
  %%%%%%%%%%%%%%%%%%%%%%%%
  
  %%%%%%%%%%%%%%%%%%%%%%%%
  %%%%%%%%%%%%%%%%%%%%%%%%
  
-
  \section{Introduction}
  \section{Introduction}
-
  Iterative methods are used to solve  large sparse linear systems of equations of
  the form  $Ax=b$ because they are  easier to parallelize than  direct ones. Many
  Iterative methods are used to solve  large sparse linear systems of equations of
  the form  $Ax=b$ because they are  easier to parallelize than  direct ones. Many
-iterative  methods have  been proposed  and adpated  by many  researchers.  When
-solving large  linear systems  with many cores,  iterative methods  often suffer
-from  scalability  problems.    This  is  due  to  their   need  for  collective
-communications to perform matrix-vector products and reduction operations.
+iterative  methods have  been proposed  and  adapted by  many researchers.   For
+example, the GMRES method and the  Conjugate Gradient method are very well known
+and  used by  many researchers~\cite{S96}. Both  the method  are based  on the
+Krylov subspace which consists in forming  a basis of the sequence of successive
+matrix powers times the initial residual.
+
+When  solving large  linear systems  with  many cores,  iterative methods  often
+suffer  from scalability problems.   This is  due to  their need  for collective
+communications  to  perform  matrix-vector  products and  reduction  operations.
+Preconditionners can be  used in order to increase  the convergence of iterative
+solvers.   However, most  of the  good preconditionners  are not  sclalable when
+thousands of cores are used.
  
  
-%%%%% Lilia
-% doit-on définir le principe et les préliminaires du multisplitting dans l'intro ou dans l'autre section? 
-% valides-tu le titre de la 2eme section? celle que je voudrai rédiger.
+Traditional iterative  solvers have  global synchronizations that  penalize the
+scalability.   Two  possible solutions  consists  either  in using  asynchronous
+iterative  methods~\cite{ref18} or  to  use multisplitting  algorithms. In  this
+paper, we will  reconsider the use of a multisplitting  method. In opposition to
+traditionnal  multisplitting  method  that  suffer  from  slow  convergence,  as
+proposed  in~\cite{huang1993krylov},  the  use  of a  minimization  process  can
+drastically improve the convergence.
  
  
+\LZK[]{Suite\dots}
  
  %%%%%%%%%%%%%%%%%%%%%%%%
  %%%%%%%%%%%%%%%%%%%%%%%%
  
  
  %%%%%%%%%%%%%%%%%%%%%%%%
  %%%%%%%%%%%%%%%%%%%%%%%%
  
+\section{Related works}
+A general framework  for studying parallel multisplitting has  been presented in~\cite{o1985multi}
+by O'Leary and White. Convergence conditions are given for the
+most general case.  Many authors improved multisplitting algorithms by proposing
+for  example  an  asynchronous  version~\cite{bru1995parallel}  and  convergence
+conditions~\cite{bai1999block,bahi2000asynchronous}   in  this  case   or  other
+two-stage algorithms~\cite{frommer1992h,bru1995parallel}.
+
+In~\cite{huang1993krylov},  the  authors  proposed  a  parallel  multisplitting
+algorithm in which all the tasks except  one are devoted to solve a sub-block of
+the splitting  and to send their  local solution to  the first task which  is in
+charge to  combine the vectors at  each iteration.  These vectors  form a Krylov
+basis for  which the first task minimizes  the error function over  the basis to
+increase the convergence, then the other tasks receive the updated solution until
+convergence of the global system. 
+
+In~\cite{couturier2008gremlins}, the  authors proposed practical implementations
+of multisplitting algorithms that take benefit from multisplitting algorithms\LZK[]{répétition ???} to
+solve large scale linear systems. Inner  solvers could be based on scalar direct
+method with the LU method or scalar iterative one with GMRES.\LZK[]{lu et gmres par exemple}
+
+In~\cite{prace-multi},  the  authors have  proposed a  parallel  multisplitting
+algorithm in which large blocks are solved using a GMRES solver. The authors have
+performed large scale experiments up-to  32,768 cores and they conclude that
+asynchronous  multisplitting algorithm  could be more  efficient  than traditional
+solvers on exascale architecture with hundreds of thousands of cores.
+
+\LZK[]{Peut-être autres related works\ldots}\\
+
+The key idea of a multisplitting method to solve a large system of linear equations $Ax=b$ is defined as follows. The first step consists in partitioning the matrix $A$ in $L$ several ways 
+\begin{equation}
+A = M_l - N_l,
+\label{eq01}
+\end{equation}
+where for all $l\in\{1,\ldots,L\}$ $M_l$ are non-singular matrices. Then the linear system is solved by iteration based on the obtained splittings as follows
+\begin{equation}
+x^{k+1}=\displaystyle\sum^L_{l=1} E_l M^{-1}_l (N_l x^k + b),~k=1,2,3,\ldots
+\label{eq02}
+\end{equation}
+where $E_l$ are non-negative and diagonal weighting matrices and their sum is an identity matrix $I$. The convergence of such a method is dependent on the condition
+\begin{equation}
+\rho(\displaystyle\sum^L_{l=1}E_l M^{-1}_l N_l)<1.
+\label{eq03}
+\end{equation}
+where $\rho$ is the spectral radius of the square matrix.
+
+The advantage of the multisplitting method is that at each iteration $k$ there are $L$ different linear sub-systems
+\begin{equation}
+v_l^k=M^{-1}_l N_l x_l^{k-1} + M^{-1}_l b,~l\in\{1,\ldots,L\},
+\label{eq04}
+\end{equation}
+to be solved independently by a direct or an iterative method, where $v_l^k$ is the solution of the local sub-system. Thus the computations of $\{v_l\}_{1\leq l\leq L}$ may be performed in parallel by a set of processors. A multisplitting method using an iterative method as an inner solver is called an inner-outer iterative method or a two-stage method. The results $v_l$ obtained from the different splittings~(\ref{eq04}) are combined to compute solution $x$ of the linear system by using the diagonal weighting matrices
+\begin{equation}
+x^k = \displaystyle\sum^L_{l=1} E_l v_l^k,
+\label{eq05}
+\end{equation}    
+In the case where the diagonal weighting matrices $E_l$ have only zero and one factors (i.e. $v_l$ are disjoint vectors), the multisplitting method is non-overlapping and corresponds to the block Jacobi method.
+
+%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%
  
  \section{A two-stage method with a minimization}
  
  \section{A two-stage method with a minimization}
+Let $Ax=b$ be a given large and sparse linear system of $n$ equations to solve in parallel on $L$ clusters of processors, physically adjacent or geographically distant, where $A\in\mathbb{R}^{n\times n}$ is a square and  non-singular matrix, $x\in\mathbb{R}^{n}$ is the solution vector and $b\in\mathbb{R}^{n}$ is the right-hand side vector. The multisplitting of this linear system is defined as follows
+\begin{equation}
+\left\{
+\begin{array}{lll}
+A & = & [A_{1}, \ldots, A_{L}]\\
+x & = & [X_{1}, \ldots, X_{L}]\\
+b & = & [B_{1}, \ldots, B_{L}]
+\end{array}
+\right.
+\label{sec03:eq01}
+\end{equation}  
+where for $l\in\{1,\ldots,L\}$, $A_l$ is a rectangular block of size $n_l\times n$ and $X_l$ and $B_l$ are sub-vectors of size $n_l$ each, such that $\sum_ln_l=n$. In this work, we use a row-by-row splitting without overlapping in such a way that successive rows of sparse matrix $A$ and both vectors $x$ and $b$ are assigned to one cluster. So, the multisplitting format of the linear system is defined as follows
+\begin{equation}
+\forall l\in\{1,\ldots,L\} \mbox{,~} \displaystyle\sum_{m=1}^{l-1}A_{lm}X_m + A_{ll}X_l + \displaystyle\sum_{m=l+1}^{L}A_{lm}X_m = B_l, 
+\label{sec03:eq02}
+\end{equation} 
+where $A_{lm}$ is a sub-block of size $n_l\times  n_m$ of the rectangular matrix $A_l$, $X_m\neq  X_l$ is a sub-vector of size $n_m$ of the solution vector $x$ and $\sum_{m\neq l}n_m+n_l=n$, for all $m\in\{1,\ldots,L\}$.
+
+Our multisplitting method proceeds by iteration for solving the linear system in such a way each sub-system
+\begin{equation}
+\left\{
+\begin{array}{l}
+A_{ll}X_l = Y_l \mbox{,~such that}\\
+Y_l = B_l - \displaystyle\sum_{\substack{m=1\\m\neq l}}^{L}A_{lm}X_m,
+\end{array}
+\right.
+\label{sec03:eq03}
+\end{equation}
+is solved independently by a {\it cluster of processors} and communication are required to update the right-hand side vectors $Y_l$, such that the vectors $X_m$ represent the data dependencies between the clusters. In this work, we use the parallel GMRES method~\cite{ref34} as an inner iteration method to solve sub-systems~(\ref{sec03:eq03}). GMRES is one of the most used Krylov iterative methods to solve sparse linear systems in parallel on clusters of processors. In practice, GMRES is used with a preconditioner to improve its convergence. In this work, we used a preconditioning matrix equivalent to the main diagonal of sparse sub-matrix $A_{ll}$. This preconditioner is straightforward to implement in parallel and gives good performances in many situations.  
+
+It should be noted that the convergence of the inner iterative solver for the different sub-systems~(\ref{sec03:eq03}) does not necessarily involve the convergence of the multisplitting method. It strongly depends on the properties of the global sparse linear system to be solved and the computing environment~\cite{o1985multi,ref18}. Furthermore, the multisplitting
+of the linear system among several clusters of processors increases the spectral radius of the iteration matrix, thereby slowing the convergence. In this work, we based on the work presented in~\cite{huang1993krylov} to increase the convergence and improve the scalability of the multisplitting methods.
+
+In order to accelerate the convergence, we implemented the outer iteration of the multisplitting solver as a Krylov subspace method which minimizes some error function over a Krylov subspace~\cite{S96}. The Krylov subspace that we used is spanned by a basis composed of successive solutions issued from solving the $L$ splittings~(\ref{sec03:eq03})
+\begin{equation}
+S=\{x^1,x^2,\ldots,x^s\},~s\leq n,
+\label{sec03:eq04}
+\end{equation}
+where for $j\in\{1,\ldots,s\}$, $x^j=[X_1^j,\ldots,X_L^j]$ is a solution of the global linear system. The advantage of such a Krylov subspace is that we need neither an orthogonal basis nor synchronizations between clusters to generate this basis.
+
+The multisplitting method is periodically restarted every $s$ iterations with a new initial guess $\tilde{x}=S\alpha$ which minimizes the error function $\|b-Ax\|_2$ over the Krylov subspace spanned by vectors of  $S$. So $\alpha$ is defined as the solution of the large overdetermined linear system
+\begin{equation}
+R\alpha=b,
+\label{sec03:eq05}
+\end{equation}
+where $R=AS$ is a dense rectangular matrix of size $n\times s$ and $s\ll n$. This leads us to solve a system of normal equations
+\begin{equation}
+R^TR\alpha=R^Tb,
+\label{sec03:eq06}
+\end{equation}
+which is associated with the least squares problem
+\begin{equation}
+\text{minimize}~\|b-R\alpha\|_2,
+\label{sec03:eq07}
+\end{equation}  
+where $R^T$ denotes the transpose of the matrix $R$. Since $R$ (i.e. $AS$) and $b$ are split among $L$ clusters, the symmetric positive definite system~(\ref{sec03:eq06}) is solved in  parallel. Thus an iterative method would be more  appropriate than a direct one to solve this system. We use the parallel conjugate gradient method for the normal equations CGNR~\cite{S96,refCGNR}.
+
+\begin{algorithm}[!t]
+\caption{A two-stage linear solver with inner iteration GMRES method}
+\begin{algorithmic}[1]
+\Input $A_l$ (sparse sub-matrix), $B_l$ (right-hand side sub-vector)
+\Output $X_l$ (solution sub-vector)\vspace{0.2cm}
+\State Load $A_l$, $B_l$
+\State Initialize the initial guess $x^0$
+\State Set the minimizer $\tilde{x}^0=x^0$
+\For {$k=1,2,3,\ldots$ until the global convergence}
+\State Restart with $x^0=\tilde{x}^{k-1}$:
+\For {$j=1,2,\ldots,s$}
+\State Inner iteration solver: \Call{InnerSolver}{$x^0$, $j$}
+\State Construct basis $S$: add column vector $X_l^j$ to the matrix $S_l^k$
+\State Exchange local values of $X_l^j$ with the neighboring clusters
+\State Compute dense matrix $R$: $R_l^{k,j}=\sum^L_{i=1}A_{li}X_i^j$ 
+\EndFor 
+\State Minimization $\|b-R\alpha\|_2$: \Call{UpdateMinimizer}{$S_l$, $R$, $b$, $k$}
+\State Local solution of the linear system $Ax=b$: $X_l^k=\tilde{X}_l^k$
+\State Exchange the local minimizer $\tilde{X}_l^k$ with the neighboring clusters
+\EndFor
+
+\Statex
+
+\Function {InnerSolver}{$x^0$, $j$}
+\State Compute local right-hand side $Y_l = B_l - \sum^L_{\substack{m=1\\m\neq l}}A_{lm}X_m^0$
+\State Solving local splitting $A_{ll}X_l^j=Y_l$ using parallel GMRES method, such that $X_l^0$ is the initial guess
+\State \Return $X_l^j$
+\EndFunction
+
+\Statex
+
+\Function {UpdateMinimizer}{$S_l$, $R$, $b$, $k$}
+\State Solving normal equations $(R^k)^TR^k\alpha^k=(R^k)^Tb$ in parallel by $L$ clusters using parallel CGNR method
+\State Compute local minimizer $\tilde{X}_l^k=S_l^k\alpha^k$
+\State \Return $\tilde{X}_l^k$
+\EndFunction
+\end{algorithmic}
+\label{algo:01}
+\end{algorithm}
+
+The main key points of our multisplitting method to solve a large sparse linear system are given in Algorithm~\ref{algo:01}. This algorithm is based on a two-stage method with a minimization using GMRES iterative method as an inner solver. It is executed in parallel by each cluster of processors. Matrices and vectors with the subscript $l$ represent the local data for cluster  $l$, where $l\in\{1,\ldots,L\}$. The two-stage solver uses two different parallel iterative algorithms: GMRES method to solve each splitting~(\ref{sec03:eq03}) on a cluster of processors, and CGNR method executed in parallel by all clusters to minimize the function error~(\ref{sec03:eq07}) over the Krylov subspace spanned by $S$. The algorithm requires two global synchronizations between $L$ clusters. The first one is performed at line~$12$ in Algorithm~\ref{algo:01} to exchange local values of vector solution $x$ (i.e. the minimizer $\tilde{x}$) required to restart the multisplitting solver. The second one is needed to construct the matrix $R$ of the Krylov subspace. We chose to perform this latter synchronization $s$ times in every outer iteration $k$ (line~$7$ in Algorithm~\ref{algo:01}). This is a straightforward way to compute the sparse matrix-dense matrix multiplication $R=AS$. We implemented all synchronizations by using message passing collective communications of MPI library.
+
+%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%
+
+\section{Experiments}
+In order to illustrate  the interest  of our algorithm. We have  compared our
+algorithm  with  the  GMRES  method  which  is a very  well  used  method  in  many
+situations.  We have chosen to focus on only one problem which is very simple to
+implement: a 3 dimension Poisson problem.
+
+\begin{equation}
+\left\{
+                \begin{array}{ll}
+                  \nabla u&=f \mbox{~in~} \omega\\
+                  u &=0 \mbox{~on~}  \Gamma=\partial \omega
+                \end{array}
+              \right.
+\end{equation}
+
+After discretization, with a finite  difference scheme, a seven point stencil is
+used. It  is well-known that the  spectral radius of  matrices representing such
+problems are very close to 1.  Moreover, the larger the number of discretization
+points is,  the closer to 1  the spectral radius  is.  Hence, to solve  a matrix
+obtained for  a 3D Poisson  problem, the number  of iterations is high.  Using a
+preconditioner  it  is   possible  to  reduce  the  number   of  iterations  but
+preconditioners are not scalable when using many cores.
+
+Doing many experiments  with many cores is  not easy and requires to  access to a
+supercomputer  with several  hours for  developing  a code  and then  improving
+it. In the following we presented  some experiments we could achieved out on the
+Hector architecture,  the previous UK's  high-end computing resource,  funded by
+the UK Research Councils, which has been stopped in the early 2014.
+
+In the experiments  we report the size of the 3D  poisson considered\LZK[]{Suite\dots ?}
+
+
+The first column  shows the size of the  problem The size is chosen  in order to
+have approximately 50,000 components per core.  The second column represents the
+number of  cores used. In parenthesis,  there is the decomposition  used for the
+Krylov multisplitting. The  third column and the sixth  column respectively show
+the execution time for the GMRES  and the Kyrlow multisplitting code. The fourth
+and  the   seventh  column   describes  the  number   of  iterations.   For  the
+multisplitting  code, the  total number  of inner  iterations is  represented in
+parenthesis.
+
+ We  also give  the other parameters:  the restart  for the GRMES method....
+
+\begin{table}[p]
+\begin{center}
+\begin{tabular}{|c|c||c|c|c||c|c|c||c|} 
+\hline
+\multirow{2}{*}{Pb size}&\multirow{2}{*}{Nb. cores} &  \multicolumn{3}{c||}{GMRES} &  \multicolumn{3}{c||}{Krylov Multisplitting} & \multirow{2}{*}{Ratio}\\
+ \cline{3-8}
+           &                   &  Time (s) & nb Iter. & $\Delta$  &   Time (s)& nb Iter. & $\Delta$ & \\
+\hline
+
+$590^3$ & 4096 (2x2048)        &  433.1    & 55,494    & 4.92e-7  &  74.1    & 1,101(8,211) & 6.62e-08  & 5.84   \\
+\hline
+$743^3$ & 8192 (2x4096)        & 704.4     & 87,822    & 4.80e-07 &  151.2   & 3,061(14,914) & 5.87e-08 & 4.65    \\
+\hline
+$743^3$ & 8192 (4x2048)        & 704.4     & 87,822    & 4.80e-07 &  110.3   & 1,531(12,721) & 1.47e-07& 6.39  \\
+\hline
+
+\end{tabular}
+\caption{Results without preconditioner}
+\label{tab1}
+\end{center}
+\end{table}
+
+
+\begin{table}[p]
+\begin{center}
+\begin{tabular}{|c|c||c|c|c||c|c|c||c|} 
+\hline
+\multirow{2}{*}{Pb size}&\multirow{2}{*}{Nb. cores} &  \multicolumn{3}{c||}{GMRES} &  \multicolumn{3}{c||}{Krylov Multisplitting} & \multirow{2}{*}{Ratio}\\
+ \cline{3-8}
+           &                   &  Time (s) & nb Iter. & $\Delta$  &   Time (s)& nb Iter. & $\Delta$ & \\
+\hline
+
+$590^3$ & 4096 (2x2048)        &  433.0    & 55,494    & 4.92e-7  &  80.4    & 1,091(9,545) & 7.64e-08  & 5.39   \\
+\hline
+$743^3$ & 8192 (2x4096)        & 704.4     & 87,822    & 4.80e-07 &  110.2   & 1,401(12,379) & 1.11e-07 & 6.39    \\
+\hline
+$743^3$ & 8192 (4x2048)        & 704.4     & 87,822    & 4.80e-07 &  139.8   & 1,891(15,960) & 1.60e-07& 5.03  \\
+\hline
+
+\end{tabular}
+\caption{Results with preconditioner}
+\label{tab2}
+\end{center}
+\end{table}
+
+\section{Conclusion and perspectives}
+
+Other applications (=> other matrices)\\
+Larger experiments\\
+Async\\
+Overlapping
  
  
  %%%%%%%%%%%%%%%%%%%%%%%%
  
  
  %%%%%%%%%%%%%%%%%%%%%%%%