13-01-2014 V1

[Krylov_multi.git] / krylov_multi.tex
diff --git a/krylov_multi.tex b/krylov_multi.tex

index 91e47458d85749235dbe5d7e5106fecee13efcf2..ea6cfe92e3e4d0dc013ccbc80055215e27b7ea5c 100644 (file)
--- a/krylov_multi.tex
+++ b/krylov_multi.tex
@@ -3,6 +3,15 @@
  \usepackage{amsfonts,amssymb}
  \usepackage{amsmath}
  \usepackage{graphicx}
+\usepackage{algorithm}
+\usepackage{algpseudocode}
+
+\algnewcommand\algorithmicinput{\textbf{Input:}}
+\algnewcommand\Input{\item[\algorithmicinput]}
+
+\algnewcommand\algorithmicoutput{\textbf{Output:}}
+\algnewcommand\Output{\item[\algorithmicoutput]}
+
  
  \title{A scalable multisplitting algorithm for solving large sparse linear systems} 
  
@@ -89,12 +98,13 @@ v_l^k=M^{-1}_l N_l x_l^{k-1} + M^{-1}_l b,~l\in\{1,\ldots,L\},
  \label{eq04}
  \end{equation}
  to be solved  independently by a direct or  an iterative method, where
-$v_l^k$  is the solution  of the  local sub-system.   A multisplitting
-method  using   an  iterative  method  for  solving   the  $L$  linear
-sub-systems is  called an inner-outer iterative method  or a two-stage
-method.   The   results    $v_l^k$   obtained   from   the   different
-splittings~(\ref{eq04}) are combined to  compute the solution $x^k$ of
-the linear system by using the diagonal weighting matrices
+$v_l^k$  is   the  solution  of   the  local  sub-system.   Thus,  the
+calculations  of $v_l^k$  may be  performed in  parallel by  a  set of
+processors.   A multisplitting  method using  an iterative  method for
+solving the $L$ linear  sub-systems is called an inner-outer iterative
+method or a  two-stage method.  The results $v_l^k$  obtained from the
+different splittings~(\ref{eq04}) are combined to compute the solution
+$x^k$ of the linear system by using the diagonal weighting matrices
  \begin{equation}
  x^k = \displaystyle\sum^L_{l=1} E_l v_l^k,
  \label{eq05}
@@ -144,12 +154,12 @@ solvers on exascale architecture with hunders of thousands of cores.
  
  
  \section{A two-stage method with a minimization}
-Let $Ax=b$ be a given sparse and large linear system of $n$ equations
-to solve in parallel on $L$ clusters, physically adjacent or geographically
-distant, where $A\in\mathbb{R}^{n\times n}$ is a square and nonsingular
-matrix, $x\in\mathbb{R}^{n}$ is the solution vector and $b\in\mathbb{R}^{n}$
-is the right-hand side vector. The multisplitting of this linear system 
-is defined as follows:
+Let $Ax=b$ be a given sparse  and large linear system of $n$ equations
+to  solve  in  parallel   on  $L$  clusters,  physically  adjacent  or
+geographically distant, where $A\in\mathbb{R}^{n\times n}$ is a square
+and  nonsingular matrix, $x\in\mathbb{R}^{n}$  is the  solution vector
+and   $b\in\mathbb{R}^{n}$  is   the  right-hand   side   vector.  The
+multisplitting of this linear system is defined as follows:
  \begin{equation}
  \left\{
  \begin{array}{lll}
@@ -160,21 +170,24 @@ b & = & [B_{1}, \ldots, B_{L}]
  \right.
  \label{sec03:eq01}
  \end{equation}  
-where for all $l\in\{1,\ldots,L\}$ $A_l$ is a rectangular block of size $n_l\times n$
-and $X_l$ and $B_l$ are sub-vectors of size $n_l$, such that $\sum_ln_l=n$. In this
-case, we use a row-by-row splitting without overlapping in such a way that successive
-rows of the sparse matrix $A$ and both vectors $x$ and $b$ are assigned to one cluster.
-So, the multisplitting format of the linear system is defined as follows:
+where for  $l\in\{1,\ldots,L\}$, $A_l$ is a rectangular  block of size
+$n_l\times n$ and $X_l$ and  $B_l$ are sub-vectors of size $n_l$, such
+that  $\sum_ln_l=n$.  In this  case,  we  use  a row-by-row  splitting
+without overlapping in  such a way that successive  rows of the sparse
+matrix $A$ and  both vectors $x$ and $b$ are  assigned to one cluster.
+So,  the multisplitting  format of  the  linear system  is defined  as
+follows:
  \begin{equation}
  \forall l\in\{1,\ldots,L\} \mbox{,~} \displaystyle\sum_{i=1}^{l-1}A_{li}X_i + A_{ll}X_l + \displaystyle\sum_{i=l+1}^{L}A_{li}X_i = B_l, 
  \label{sec03:eq02}
  \end{equation} 
-where $A_{li}$ is a block of size $n_l\times n_i$ of the rectangular matrix $A_l$, $X_i\neq X_l$
-is a sub-vector of size $n_i$ of the solution vector $x$ and $\sum_{i<l}n_i+\sum_{i>l}n_i+n_l=n$,
-for all $i\in\{1,\ldots,l-1,l+1,\ldots,L\}$. 
+where $A_{li}$ is  a block of size $n_l\times  n_i$ of the rectangular
+matrix  $A_l$, $X_i\neq  X_l$ is  a sub-vector  of size  $n_i$  of the
+solution vector  $x$ and $\sum_{i<l}n_i+\sum_{i>l}n_i+n_l=n$,  for all
+$i\in\{1,\ldots,l-1,l+1,\ldots,L\}$.
  
-The multisplitting method proceeds by iteration for solving the linear system in such a
-way each sub-system
+The multisplitting method proceeds by iteration for solving the linear
+system in such a way each sub-system
  \begin{equation}
  \left\{
  \begin{array}{l}
@@ -184,14 +197,126 @@ Y_l = B_l - \displaystyle\sum_{i=1,i\neq l}^{L}A_{li}X_i,
  \right.
  \label{sec03:eq03}
  \end{equation}
-is solved independently by a cluster of processors and communication are required to
-update the right-hand side vectors $Y_l$, such that the vectors $X_i$ represent the data
-dependencies between the clusters. In this case, the parallel GMRES method is used
-as an inner iteration method for solving the linear sub-systems~(\ref{sec03:eq03}).  
-
-
-
-
+is solved  independently by a cluster of  processors and communication
+are required  to update the  right-hand side vectors $Y_l$,  such that
+the  vectors  $X_i$  represent   the  data  dependencies  between  the
+clusters. In this work,  we use the parallel GMRES method~\cite{ref34}
+as     an     inner    iteration     method     for    solving     the
+sub-systems~(\ref{sec03:eq03}).  It  is a well-known  iterative method
+which  gives good performances  for solving  sparse linear  systems in
+parallel on a cluster of processors.
+
+It should be noted that  the convergence of the inner iterative solver
+for  the  different  linear  sub-systems~(\ref{sec03:eq03})  does  not
+necessarily involve  the convergence of the  multisplitting method. It
+strongly depends on  the properties of the sparse  linear system to be
+solved                 and                the                computing
+environment~\cite{o1985multi,ref18}.  Furthermore,  the multisplitting
+of the  linear system among  several clusters of  processors increases
+the  spectral radius  of  the iteration  matrix,  thereby slowing  the
+convergence.  In   this  paper,  we   based  on  the   work  presented
+in~\cite{huang1993krylov} to increase  the convergence and improve the
+scalability of the multisplitting methods.
+
+In  order  to  accelerate  the  convergence, we  implement  the  outer
+iteration  of the multisplitting  solver as  a Krylov  subspace method
+which minimizes some error function over a Krylov subspace~\cite{S96}.
+The Krylov  space of  the method that  we used  is spanned by  a basis
+composed  of   successive  solutions  issued  from   solving  the  $L$
+splittings~(\ref{sec03:eq03})
+\begin{equation}
+S=\{x^1,x^2,\ldots,x^s\},~s\leq n,
+\label{sec03:eq04}
+\end{equation}
+where   for  $j\in\{1,\ldots,s\}$,  $x^j=[X_1^j,\ldots,X_L^j]$   is  a
+solution of the  global linear system. The advantage  of such a  Krylov subspace is  that we need  neither an orthogonal basis  nor synchronizations between  the different clusters to generate this basis.
+
+The  multisplitting   method  is  periodically   restarted  every  $s$
+iterations  with   a  new  initial   guess  $\tilde{x}=S\alpha$  which
+minimizes  the error  function $\|b-Ax\|_2$  over the  Krylov subspace
+spanned  by  the vectors  of  $S$.  So,  $\alpha$  is  defined as  the
+solution of the large overdetermined linear system
+\begin{equation}
+R\alpha=b,
+\label{sec03:eq05}
+\end{equation}
+where $R=AS$  is a  dense rectangular matrix  of size $n\times  s$ and
+$s\ll n$. This leads us to solve the system of normal equations
+\begin{equation}
+R^TR\alpha=R^Tb,
+\label{sec03:eq06}
+\end{equation}
+which is associated with the least squares problem
+\begin{equation}
+\text{minimize}~\|b-R\alpha\|_2,
+\label{sec03:eq07}
+\end{equation}  
+where  $R^T$  denotes the  transpose  of  the  matrix $R$.  Since  $R$
+(i.e.  $AS$) and  $b$  are  split among  $L$  clusters, the  symmetric
+positive    definite    system~(\ref{sec03:eq06})    is   solved    in
+parallel. Thus, an  iterative method would be more  appropriate than a
+direct  one for  solving this  system. We  use the  parallel conjugate
+gradient method for the normal equations CGNR~\cite{S96,refCGNR}.
+
+\begin{algorithm}[!t]
+\caption{A two-stage linear solver with inner iteration GMRES method}
+\begin{algorithmic}[1]
+\Input $A_l$ (local sparse matrix), $B_l$ (local right-hand side), $x^0$ (initial guess)
+\Output $X_l$ (local solution vector)\vspace{0.2cm}
+\State Load $A_l$, $B_l$, $x^0$
+\State Initialize the minimizer $\tilde{x}^0=x^0$
+\For {$k=1,2,3,\ldots$ until the global convergence}
+\State Restart with $x^0=\tilde{x}^{k-1}$: \textbf{for} $j=1,2,\ldots,s$ \textbf{do}
+\State\hspace{0.5cm} Inner iteration solver: \Call{InnerSolver}{$x^0$, $j$}
+\State\hspace{0.5cm} Construct the basis $S$: add the column vector $X_l^j$ to the matrix $S_l^k$
+\State\hspace{0.5cm} Exchange the local solution vector $X_l^j$ with the neighboring clusters
+\State\hspace{0.5cm} Compute the dense matrix $R$: $R_l^{k,j}=\sum^L_{i=1}A_{li}X_i^j$ 
+\State\textbf{end for} 
+\State Minimization $\|b-R\alpha\|_2$: \Call{UpdateMinimizer}{$S_l$, $R$, $b$, $k$}
+\State Local solution of the linear system $Ax=b$: $X_l^k=\tilde{X}_l^k$
+\State Exchange the local minimizer $\tilde{X}_l^k$ with the neighboring clusters
+\EndFor
+
+\Statex
+
+\Function {InnerSolver}{$x^0$, $j$}
+\State Compute the local right-hand side: $Y_l = B_l - \sum^L_{i=1,i\neq l}A_{li}X_i^0$
+\State Solving the local splitting $A_{ll}X_l^j=Y_l$ using the parallel GMRES method, such that $X_l^0$ is the initial guess
+\State \Return $X_l^j$
+\EndFunction
+
+\Statex
+
+\Function {UpdateMinimizer}{$S_l$, $R$, $b$, $k$}
+\State Solving the normal equations $(R^k)^TR^k\alpha^k=(R^k)^Tb$ in parallel by $L$ clusters using the parallel CGNR method
+\State Compute the local minimizer: $\tilde{X}_l^k=S_l^k\alpha^k$
+\State \Return $\tilde{X}_l^k$
+\EndFunction
+\end{algorithmic}
+\label{algo:01}
+\end{algorithm}
+
+The main  key points  of the multisplitting  method for  solving large
+sparse  linear  systems are  given  in Algorithm~\ref{algo:01}.   This
+algorithm is based on a two-stage method with a minimization using the
+GMRES iterative method as an  inner solver. It is executed in parallel
+by  each cluster  of processors.   The matrices  and vectors  with the
+subscript  $l$ represent  the local  data for  the cluster  $l$, where
+$l\in\{1,\ldots,L\}$. The two-stage solver uses two different parallel
+iterative algorithms: the GMRES method for solving each splitting on a
+cluster of processors, and the CGNR method executed in parallel by all
+clusters for  minimizing the function  error over the  Krylov subspace
+spanned by  $S$.  The  algorithm requires two  global synchronizations
+between the $L$  clusters. The first one is  performed at line~$12$ in
+Algorithm~\ref{algo:01}  to exchange  the local  values of  the vector
+solution $x$ (i.e. the  minimizer $\tilde{x}$) required to restart the
+multisplitting  solver. The  second  one is  needed  to construct  the
+matrix $R$ of  the Krylov subspace.  We choose  to perform this latter
+synchronization $s$  times in every  outer iteration $k$  (line~$7$ in
+Algorithm~\ref{algo:01}). This is a straightforward way to compute the
+matrix-matrix    multiplication     $R=AS$.     We    implement    all
+synchronizations   by   using   the   MPI   collective   communication
+subroutines.