+\section{A two-stage method with a minimization}
+Let $Ax=b$ be a given large and sparse linear system of $n$ equations to solve in parallel on $L$ clusters of processors, physically adjacent or geographically distant, where $A\in\mathbb{R}^{n\times n}$ is a square and non-singular matrix, $x\in\mathbb{R}^{n}$ is the solution vector and $b\in\mathbb{R}^{n}$ is the right-hand side vector. The multisplitting of this linear system is defined as follows
+\begin{equation}
+\left\{
+\begin{array}{lll}
+A & = & [A_{1}, \ldots, A_{L}]\\
+x & = & [X_{1}, \ldots, X_{L}]\\
+b & = & [B_{1}, \ldots, B_{L}]
+\end{array}
+\right.
+\label{sec03:eq01}
+\end{equation}
+where for $l\in\{1,\ldots,L\}$, $A_l$ is a rectangular block of size $n_l\times n$ and $X_l$ and $B_l$ are sub-vectors of size $n_l$ each, such that $\sum_ln_l=n$. In this work, we use a row-by-row splitting without overlapping in such a way that successive rows of sparse matrix $A$ and both vectors $x$ and $b$ are assigned to one cluster. So, the multisplitting format of the linear system is defined as follows
+\begin{equation}
+\forall l\in\{1,\ldots,L\} \mbox{,~} \displaystyle\sum_{m=1}^{l-1}A_{lm}X_m + A_{ll}X_l + \displaystyle\sum_{m=l+1}^{L}A_{lm}X_m = B_l,
+\label{sec03:eq02}
+\end{equation}
+where $A_{lm}$ is a sub-block of size $n_l\times n_m$ of the rectangular matrix $A_l$, $X_m\neq X_l$ is a sub-vector of size $n_m$ of the solution vector $x$ and $\sum_{m\neq l}n_m+n_l=n$, for all $m\in\{1,\ldots,L\}$.
+
+Our multisplitting method proceeds by iteration for solving the linear system in such a way each sub-system
+\begin{equation}
+\left\{
+\begin{array}{l}
+A_{ll}X_l = Y_l \mbox{,~such that}\\
+Y_l = B_l - \displaystyle\sum_{\substack{m=1\\m\neq l}}^{L}A_{lm}X_m,
+\end{array}
+\right.
+\label{sec03:eq03}
+\end{equation}
+is solved independently by a {\it cluster of processors} and communication are required to update the right-hand side vectors $Y_l$, such that the vectors $X_m$ represent the data dependencies between the clusters. In this work, we use the parallel GMRES method~\cite{ref34} as an inner iteration method to solve sub-systems~(\ref{sec03:eq03}). GMRES is one of the most used Krylov iterative methods to solve sparse linear systems in parallel on clusters of processors. In practice, GMRES is used with a preconditioner to improve its convergence. In this work, we used a preconditioning matrix equivalent to the main diagonal of sparse sub-matrix $A_{ll}$. This preconditioner is straightforward to implement in parallel and gives good performances in many situations.
+
+It should be noted that the convergence of the inner iterative solver for the different sub-systems~(\ref{sec03:eq03}) does not necessarily involve the convergence of the multisplitting method. It strongly depends on the properties of the global sparse linear system to be solved and the computing environment~\cite{o1985multi,ref18}. Furthermore, the multisplitting
+of the linear system among several clusters of processors increases the spectral radius of the iteration matrix, thereby slowing the convergence. In this work, we based on the work presented in~\cite{huang1993krylov} to increase the convergence and improve the scalability of the multisplitting methods.
+
+In order to accelerate the convergence, we implemented the outer iteration of the multisplitting solver as a Krylov subspace method which minimizes some error function over a Krylov subspace~\cite{S96}. The Krylov subspace that we used is spanned by a basis composed of successive solutions issued from solving the $L$ splittings~(\ref{sec03:eq03})
+\begin{equation}
+S=\{x^1,x^2,\ldots,x^s\},~s\leq n,
+\label{sec03:eq04}
+\end{equation}
+where for $j\in\{1,\ldots,s\}$, $x^j=[X_1^j,\ldots,X_L^j]$ is a solution of the global linear system. The advantage of such a Krylov subspace is that we need neither an orthogonal basis nor synchronizations between clusters to generate this basis.
+
+The multisplitting method is periodically restarted every $s$ iterations with a new initial guess $\tilde{x}=S\alpha$ which minimizes the error function $\|b-Ax\|_2$ over the Krylov subspace spanned by vectors of $S$. So $\alpha$ is defined as the solution of the large overdetermined linear system
+\begin{equation}
+R\alpha=b,
+\label{sec03:eq05}
+\end{equation}
+where $R=AS$ is a dense rectangular matrix of size $n\times s$ and $s\ll n$. This leads us to solve a system of normal equations
+\begin{equation}
+R^TR\alpha=R^Tb,
+\label{sec03:eq06}
+\end{equation}
+which is associated with the least squares problem
+\begin{equation}
+\text{minimize}~\|b-R\alpha\|_2,
+\label{sec03:eq07}
+\end{equation}
+where $R^T$ denotes the transpose of the matrix $R$. Since $R$ (i.e. $AS$) and $b$ are split among $L$ clusters, the symmetric positive definite system~(\ref{sec03:eq06}) is solved in parallel. Thus an iterative method would be more appropriate than a direct one to solve this system. We use the parallel conjugate gradient method for the normal equations CGNR~\cite{S96,refCGNR}.
+
+\begin{algorithm}[!t]
+\caption{A two-stage linear solver with inner iteration GMRES method}
+\begin{algorithmic}[1]
+\Input $A_l$ (sparse sub-matrix), $B_l$ (right-hand side sub-vector)
+\Output $X_l$ (solution sub-vector)\vspace{0.2cm}
+\State Load $A_l$, $B_l$
+\State Initialize the initial guess $x^0$
+\State Set the minimizer $\tilde{x}^0=x^0$
+\For {$k=1,2,3,\ldots$ until the global convergence}
+\State Restart with $x^0=\tilde{x}^{k-1}$:
+\For {$j=1,2,\ldots,s$}
+\State Inner iteration solver: \Call{InnerSolver}{$x^0$, $j$}
+\State Construct basis $S$: add column vector $X_l^j$ to the matrix $S_l^k$
+\State Exchange local values of $X_l^j$ with the neighboring clusters
+\State Compute dense matrix $R$: $R_l^{k,j}=\sum^L_{i=1}A_{li}X_i^j$
+\EndFor
+\State Minimization $\|b-R\alpha\|_2$: \Call{UpdateMinimizer}{$S_l$, $R$, $b$, $k$}
+\State Local solution of the linear system $Ax=b$: $X_l^k=\tilde{X}_l^k$
+\State Exchange the local minimizer $\tilde{X}_l^k$ with the neighboring clusters
+\EndFor
+
+\Statex
+
+\Function {InnerSolver}{$x^0$, $j$}
+\State Compute local right-hand side $Y_l = B_l - \sum^L_{\substack{m=1\\m\neq l}}A_{lm}X_m^0$
+\State Solving local splitting $A_{ll}X_l^j=Y_l$ using parallel GMRES method, such that $X_l^0$ is the initial guess
+\State \Return $X_l^j$
+\EndFunction
+
+\Statex
+
+\Function {UpdateMinimizer}{$S_l$, $R$, $b$, $k$}
+\State Solving normal equations $(R^k)^TR^k\alpha^k=(R^k)^Tb$ in parallel by $L$ clusters using parallel CGNR method
+\State Compute local minimizer $\tilde{X}_l^k=S_l^k\alpha^k$
+\State \Return $\tilde{X}_l^k$
+\EndFunction
+\end{algorithmic}
+\label{algo:01}
+\end{algorithm}
+
+The main key points of our multisplitting method to solve a large sparse linear system are given in Algorithm~\ref{algo:01}. This algorithm is based on a two-stage method with a minimization using GMRES iterative method as an inner solver. It is executed in parallel by each cluster of processors. Matrices and vectors with the subscript $l$ represent the local data for cluster $l$, where $l\in\{1,\ldots,L\}$. The two-stage solver uses two different parallel iterative algorithms: GMRES method to solve each splitting~(\ref{sec03:eq03}) on a cluster of processors, and CGNR method executed in parallel by all clusters to minimize the function error~(\ref{sec03:eq07}) over the Krylov subspace spanned by $S$. The algorithm requires two global synchronizations between $L$ clusters. The first one is performed at line~$12$ in Algorithm~\ref{algo:01} to exchange local values of vector solution $x$ (i.e. the minimizer $\tilde{x}$) required to restart the multisplitting solver. The second one is needed to construct the matrix $R$ of the Krylov subspace. We chose to perform this latter synchronization $s$ times in every outer iteration $k$ (line~$7$ in Algorithm~\ref{algo:01}). This is a straightforward way to compute the sparse matrix-dense matrix multiplication $R=AS$. We implemented all synchronizations by using message passing collective communications of MPI library.