\usepackage{amsfonts,amssymb}
\usepackage{amsmath}
\usepackage{graphicx}
+%\usepackage{algorithmic}
+%\usepackage[ruled,english,boxed,linesnumbered]{algorithm2e}
+%\usepackage[english]{algorithme}
+\usepackage{algorithm}
+\usepackage{algpseudocode}
+
\title{A scalable multisplitting algorithm for solving large sparse linear systems}
\label{eq04}
\end{equation}
to be solved independently by a direct or an iterative method, where
-$v_l^k$ is the solution of the local sub-system. A multisplitting
-method using an iterative method for solving the $L$ linear
-sub-systems is called an inner-outer iterative method or a two-stage
-method. The results $v_l^k$ obtained from the different
-splittings~(\ref{eq04}) are combined to compute the solution $x^k$ of
-the linear system by using the diagonal weighting matrices
+$v_l^k$ is the solution of the local sub-system. Thus, the
+calculations of $v_l^k$ may be performed in parallel by a set of
+processors. A multisplitting method using an iterative method for
+solving the $L$ linear sub-systems is called an inner-outer iterative
+method or a two-stage method. The results $v_l^k$ obtained from the
+different splittings~(\ref{eq04}) are combined to compute the solution
+$x^k$ of the linear system by using the diagonal weighting matrices
\begin{equation}
x^k = \displaystyle\sum^L_{l=1} E_l v_l^k,
\label{eq05}
\section{A two-stage method with a minimization}
-Let $Ax=b$ be a given sparse and large linear system of $n$ equations
-to solve in parallel on $L$ clusters, physically adjacent or geographically
-distant, where $A\in\mathbb{R}^{n\times n}$ is a square and nonsingular
-matrix, $x\in\mathbb{R}^{n}$ is the solution vector and $b\in\mathbb{R}^{n}$
-is the right-hand side vector. The multisplitting of this linear system
-is defined as follows:
+Let $Ax=b$ be a given sparse and large linear system of $n$ equations
+to solve in parallel on $L$ clusters, physically adjacent or
+geographically distant, where $A\in\mathbb{R}^{n\times n}$ is a square
+and nonsingular matrix, $x\in\mathbb{R}^{n}$ is the solution vector
+and $b\in\mathbb{R}^{n}$ is the right-hand side vector. The
+multisplitting of this linear system is defined as follows:
\begin{equation}
\left\{
\begin{array}{lll}
\right.
\label{sec03:eq01}
\end{equation}
-where for all $l\in\{1,\ldots,L\}$ $A_l$ is a rectangular block of size $n_l\times n$
-and $X_l$ and $B_l$ are sub-vectors of size $n_l$, such that $\sum_ln_l=n$. In this
-case, we use a row-by-row splitting without overlapping in such a way that successive
-rows of the sparse matrix $A$ and both vectors $x$ and $b$ are assigned to one cluster.
-So, the multisplitting format of the linear system is defined as follows:
+where for $l\in\{1,\ldots,L\}$, $A_l$ is a rectangular block of size
+$n_l\times n$ and $X_l$ and $B_l$ are sub-vectors of size $n_l$, such
+that $\sum_ln_l=n$. In this case, we use a row-by-row splitting
+without overlapping in such a way that successive rows of the sparse
+matrix $A$ and both vectors $x$ and $b$ are assigned to one cluster.
+So, the multisplitting format of the linear system is defined as
+follows:
\begin{equation}
\forall l\in\{1,\ldots,L\} \mbox{,~} \displaystyle\sum_{i=1}^{l-1}A_{li}X_i + A_{ll}X_l + \displaystyle\sum_{i=l+1}^{L}A_{li}X_i = B_l,
\label{sec03:eq02}
\end{equation}
-where $A_{li}$ is a block of size $n_l\times n_i$ of the rectangular matrix $A_l$, $X_i\neq X_l$
-is a sub-vector of size $n_i$ of the solution vector $x$ and $\sum_{i<l}n_i+\sum_{i>l}n_i+n_l=n$,
-for all $i\in\{1,\ldots,l-1,l+1,\ldots,L\}$.
+where $A_{li}$ is a block of size $n_l\times n_i$ of the rectangular
+matrix $A_l$, $X_i\neq X_l$ is a sub-vector of size $n_i$ of the
+solution vector $x$ and $\sum_{i<l}n_i+\sum_{i>l}n_i+n_l=n$, for all
+$i\in\{1,\ldots,l-1,l+1,\ldots,L\}$.
-The multisplitting method proceeds by iteration for solving the linear system in such a
-way each sub-system
+The multisplitting method proceeds by iteration for solving the linear
+system in such a way each sub-system
\begin{equation}
\left\{
\begin{array}{l}
\right.
\label{sec03:eq03}
\end{equation}
-is solved independently by a cluster of processors and communication are required to
-update the right-hand side vectors $Y_l$, such that the vectors $X_i$ represent the data
-dependencies between the clusters. In this case, the parallel GMRES method is used
-as an inner iteration method for solving the linear sub-systems~(\ref{sec03:eq03}).
-
-
-
-
+is solved independently by a cluster of processors and communication
+are required to update the right-hand side vectors $Y_l$, such that
+the vectors $X_i$ represent the data dependencies between the
+clusters. In this work, we use the parallel GMRES method~\cite{ref34}
+as an inner iteration method for solving the
+sub-systems~(\ref{sec03:eq03}). It is a well-known iterative method
+which gives good performances for solving sparse linear systems in
+parallel on a cluster of processors.
+
+It should be noted that the convergence of the inner iterative solver
+for the different linear sub-systems~(\ref{sec03:eq03}) does not
+necessarily involve the convergence of the multisplitting method. It
+strongly depends on the properties of the sparse linear system to be
+solved and the computing
+environment~\cite{o1985multi,ref18}. Furthermore, the multisplitting
+of the linear system among several clusters of processors increases
+the spectral radius of the iteration matrix, thereby slowing the
+convergence. In this paper, we based on the work presented
+in~\cite{huang1993krylov} to increase the convergence and improve the
+scalability of the multisplitting methods.
+
+In order to accelerate the convergence, we implement the outer
+iteration of the multisplitting solver as a Krylov subspace method
+which minimizes some error function over a Krylov subspace~\cite{S96}.
+The Krylov space of the method that we used is spanned by a basis
+composed of successive solutions issued from solving the $L$
+splittings~(\ref{sec03:eq03})
+\begin{equation}
+S=\{x^1,x^2,\ldots,x^s\},~s\leq n,
+\label{sec03:eq04}
+\end{equation}
+where for $j\in\{1,\ldots,s\}$, $x^j=[X_1^j,\ldots,X_L^j]$ is a
+solution of the global linear system.%The advantage such a method is that the Krylov subspace does not need to be spanned by an orthogonal basis.
+The advantage of such a Krylov subspace is that we need neither an
+orthogonal basis nor synchronizations between the different clusters
+to generate this basis.
+
+The multisplitting method is periodically restarted every $s$
+iterations with a new initial guess $\tilde{x}=S\alpha$ which
+minimizes the error function $\|b-Ax\|_2$ over the Krylov subspace
+spanned by the vectors of $S$. So, $\alpha$ is defined as the
+solution of the large overdetermined linear system
+\begin{equation}
+R\alpha=b,
+\label{sec03:eq05}
+\end{equation}
+where $R=AS$ is a dense rectangular matrix of size $n\times s$ and
+$s\ll n$. This leads us to solve the system of normal equations
+\begin{equation}
+R^TR\alpha=R^Tb,
+\label{sec03:eq06}
+\end{equation}
+which is associated with the least squares problem
+\begin{equation}
+\text{minimize}~\|b-R\alpha\|_2,
+\label{sec03:eq07}
+\end{equation}
+where $R^T$ denotes the transpose of the matrix $R$. Since $R$
+(i.e. $AS$) and $b$ are split among $L$ clusters, the symmetric
+positive definite system~(\ref{sec03:eq06}) is solved in
+parallel. Thus, an iterative method would be more appropriate than a
+direct one for solving this system. We use the parallel conjugate
+gradient method for the normal equations CGNR~\cite{S96,refCGNR}.
+
+\begin{algorithm}[!t]
+\caption{A two-stage linear solver with inner iteration GMRES method}
+\begin{algorithmic}[1]
+\State Load $A_l$, $B_l$, initial guess $x^0$
+\State Initialize the minimizer $\tilde{x}^0=x^0$
+\For {$k=1,2,3,\ldots$ until the global convergence}
+\State Restart with $x^0=\tilde{x}^{k-1}$: \textbf{for} $j=1,2,\ldots,s$ \textbf{do}
+\State\hspace{0.5cm} Inner iteration solver: \Call{InnerSolver}{$x^0$, $j$}
+\State\hspace{0.5cm} Construct the basis $S$: add the column vector $X_l^j$ to the matrix $S_l$
+\State\hspace{0.5cm} Exchange the local solution vector $X_l^j$ with the neighboring clusters
+\State\hspace{0.5cm} Compute the dense matrix $R$: $R_l^j=\sum^L_{i=1}A_{li}X_i^j$
+\State\textbf{end for}
+\State Minimization $\|b-R\alpha\|_2$: \Call{UpdateMinimizer}{$S_l$, $R$, $b$}
+\State Local solution of the linear system $Ax=b$: $X_l^k=\tilde{X}_l^k$
+\State Exchange the local minimizer $\tilde{X}_l^k$ with the neighboring clusters
+\EndFor
+
+\Statex
+
+\Function {InnerSolver}{$x^0$, $j$}
+\State Compute the local right-hand side: $Y_l = B_l - \sum^L_{i=1,i\neq l}A_{li}X_i^0$
+\State Solving the local splitting $A_{ll}X_l^j=Y_l$ with the parallel GMRES method, such that $X_l^0$ is the initial guess.
+\State \Return $X_l^j$
+\EndFunction
+
+\Statex
+
+\Function {UpdateMinimizer}{$S_l$, $R$, $b$, $k$}
+\State Solving the normal equations $R^TR\alpha=R^Tb$ in parallel by $L$ clusters using the parallel CGNR method
+\State Compute the local minimizer: $\tilde{X}_l^k=S_l\alpha$
+\State \Return $\tilde{X}_l^k$
+\EndFunction
+\end{algorithmic}
+\label{algo:01}
+\end{algorithm}
+
+The main key points of the multisplitting method for solving large
+sparse linear systems are given in Algorithm~\ref{algo:01}. This
+algorithm is based on a two-stage method with a minimization using the
+GMRES iterative method as an inner solver. It is executed in parallel
+by each cluster of processors. The matrices and vectors with the
+subscript $l$ represent the local data for the cluster $l$, where
+$l\in\{1,\ldots,L\}$. The two-stage solver uses two different parallel
+iterative algorithms: the GMRES method for solving each splitting on a
+cluster of processors, and the CGNR method executed in parallel by all
+clusters for minimizing the function error over the Krylov subspace
+spanned by $S$. The algorithm requires two global synchronizations
+between the $L$ clusters. The first one is performed at line~$12$ in
+Algorithm~\ref{algo:01} to exchange the local values of the vector
+solution $x$ (i.e. the minimizer $\tilde{x}$) required to restart the
+multisplitting solver. The second one is needed to construct the
+matrix $R$ of the Krylov subspace. We choose to perform this latter
+synchronization $s$ times in every outer iteration $k$ (line~$7$ in
+Algorithm~\ref{algo:01}). This is a straightforward way to compute the
+matrix-matrix multiplication $R=AS$. We implement all
+synchronizations by using the MPI collective communication
+subroutines.