+\section{A two-stage method with a minimization}
+Let $Ax=b$ be a given large and sparse linear system of $n$ equations to solve in parallel on $L$ clusters of processors, physically adjacent or geographically distant, where $A\in\mathbb{R}^{n\times n}$ is a square and non-singular matrix, $x\in\mathbb{R}^{n}$ is the solution vector and $b\in\mathbb{R}^{n}$ is the right-hand side vector. The multisplitting of this linear system is defined as follows
+\begin{equation}
+\left\{
+\begin{array}{lll}
+A & = & [A_{1}, \ldots, A_{L}]\\
+x & = & [X_{1}, \ldots, X_{L}]\\
+b & = & [B_{1}, \ldots, B_{L}]
+\end{array}
+\right.
+\label{sec03:eq01}
+\end{equation}
+where for $\ell\in\{1,\ldots,L\}$, $A_\ell$ is a rectangular block of size $n_\ell\times n$ and $X_\ell$ and $B_\ell$ are sub-vectors of size $n_\ell$ each, such that $\sum_\ell n_\ell=n$. In this work, we use a row-by-row splitting without overlapping in such a way that successive rows of sparse matrix $A$ and both vectors $x$ and $b$ are assigned to one cluster. So, the multisplitting format of the linear system is defined as follows
+\begin{equation}
+\forall \ell\in\{1,\ldots,L\} \mbox{,~} A_{\ell \ell}X_\ell + \displaystyle\sum_{\substack{m=1\\m\neq\ell}}^L A_{\ell m}X_m = B_\ell,
+\label{sec03:eq02}
+\end{equation}
+where $A_{\ell m}$ is a sub-block of size $n_\ell\times n_m$ of the rectangular matrix $A_\ell$, $X_m\neq X_\ell$ is a sub-vector of size $n_m$ of the solution vector $x$ and $\sum_{m\neq \ell}n_m+n_\ell=n$, for all $m\in\{1,\ldots,L\}$.
+
+Our multisplitting method proceeds by iteration for solving the linear system in such a way each sub-system
+\begin{equation}
+\left\{
+\begin{array}{l}
+A_{\ell \ell}X_\ell = Y_\ell \mbox{,~such that}\\
+Y_\ell = B_\ell - \displaystyle\sum_{\substack{m=1\\m\neq \ell}}^{L}A_{\ell m}X_m,
+\end{array}
+\right.
+\label{sec03:eq03}
+\end{equation}
+is solved independently by a {\it cluster of processors} and communications are required to update the right-hand side vectors $Y_\ell$, such that the vectors $X_m$ represent the data dependencies between the clusters. In this work, we use the parallel restarted GMRES method~\cite{ref34} as an inner iteration method to solve sub-systems~(\ref{sec03:eq03}). GMRES is one of the most used Krylov iterative methods to solve sparse linear systems. %In practice, GMRES is used with a preconditioner to improve its convergence. In this work, we used a preconditioning matrix equivalent to the main diagonal of sparse sub-matrix $A_{ll}$. This preconditioner is straightforward to implement in parallel and gives good performances in many situations.
+
+It should be noted that the convergence of the inner iterative solver for the different sub-systems~(\ref{sec03:eq03}) does not necessarily involve the convergence of the multisplitting method. It strongly depends on the properties of the global sparse linear system to be solved~\cite{o1985multi,ref18}. Furthermore, the multisplitting of the linear system among several clusters of processors increases the spectral radius of the iteration matrix, thereby slowing the convergence. In this paper, we based on the work presented in~\cite{huang1993krylov} to increase the convergence and improve the scalability of the multisplitting methods.
+
+In order to accelerate the convergence, we implemented the outer iteration of the multisplitting solver as a Krylov iterative method which minimizes some error function over a Krylov subspace~\cite{S96}. The Krylov subspace that we used is spanned by a basis composed of successive solutions issued from solving the $L$ splittings~(\ref{sec03:eq03})
+\begin{equation}
+S=\{x^1,x^2,\ldots,x^s\},~s\leq n,
+\label{sec03:eq04}
+\end{equation}
+where for $j\in\{1,\ldots,s\}$, $x^j=[X_1^j,\ldots,X_L^j]$ is a solution of the global linear system. The advantage of such a Krylov subspace is that we need neither an orthogonal basis nor synchronizations between clusters to generate this basis.
+
+The multisplitting method is periodically restarted every $s$ iterations with a new initial guess $\tilde{x}=S\alpha$ which minimizes the error function $\|b-Ax\|_2$ over the Krylov subspace spanned by vectors of $S$. So $\alpha$ is defined as the solution of the large overdetermined linear system
+\begin{equation}
+R\alpha=b,
+\label{sec03:eq05}
+\end{equation}
+where $R=AS$ is a dense rectangular matrix of size $n\times s$ and $s\ll n$. This leads us to solve a system of normal equations
+\begin{equation}
+R^TR\alpha=R^Tb,
+\label{sec03:eq06}
+\end{equation}
+which is associated with the least squares problem
+\begin{equation}
+\text{minimize}~\|b-R\alpha\|_2,
+\label{sec03:eq07}
+\end{equation}
+where $R^T$ denotes the transpose of matrix $R$. Since $R$ (i.e. $AS$) and $b$ are split among $L$ clusters, the symmetric positive definite system~(\ref{sec03:eq06}) is solved in parallel. Thus an iterative method would be more appropriate than a direct one to solve this system. We use the parallel Conjugate Gradient method for the normal equations CGNR~\cite{S96,refCGNR}.
+
+\begin{algorithm}[!t]
+\caption{A two-stage linear solver with inner iteration GMRES method}
+\begin{algorithmic}[1]
+\Input $A_\ell$ (sparse sub-matrix), $B_\ell$ (right-hand side sub-vector)
+\Output $X_\ell$ (solution sub-vector)\vspace{0.2cm}
+\State Load $A_\ell$, $B_\ell$
+\State Set the initial guess $x^0$
+\State Set the minimizer $\tilde{x}^0=x^0$
+\For {$k=1,2,3,\ldots$ until the global convergence}
+\State Restart with $x^0=\tilde{x}^{k-1}$:
+\For {$j=1,2,\ldots,s$}
+\State \label{line7}Inner iteration solver: \Call{InnerSolver}{$x^0$, $j$}
+\State Construct basis $S$: add column vector $X_\ell^j$ to the matrix $S_\ell^k$
+\State Exchange local values of $X_\ell^j$ with the neighboring clusters
+\State Compute dense matrix $R$: $R_\ell^{k,j}=\sum^L_{i=1}A_{\ell i}X_i^j$
+\EndFor
+\State \label{line12}Minimization $\|b-R\alpha\|_2$: \Call{UpdateMinimizer}{$S_\ell$, $R$, $b$, $k$}
+\State Local solution of linear system $Ax=b$: $X_\ell^k=\tilde{X}_\ell^k$
+\State Exchange the local minimizer $\tilde{X}_\ell^k$ with the neighboring clusters
+\EndFor
+
+\Statex
+
+\Function {InnerSolver}{$x^0$, $j$}
+\State Compute local right-hand side $Y_\ell = B_\ell - \sum^L_{\substack{m=1\\m\neq \ell}}A_{\ell m}X_m^0$
+\State Solving local splitting $A_{\ell \ell}X_\ell^j=Y_\ell$ using parallel GMRES method, such that $X_\ell^0$ is the initial guess
+\State \Return $X_\ell^j$
+\EndFunction
+
+\Statex
+
+\Function {UpdateMinimizer}{$S_\ell$, $R$, $b$, $k$}
+\State Solving normal equations $(R^k)^TR^k\alpha^k=(R^k)^Tb$ in parallel by $L$ clusters using parallel CGNR method
+\State Compute local minimizer $\tilde{X}_\ell^k=S_\ell^k\alpha^k$
+\State \Return $\tilde{X}_\ell^k$
+\EndFunction
+\end{algorithmic}
+\label{algo:01}
+\end{algorithm}
+
+The main key points of our Krylov multisplitting method to solve a large sparse linear system are given in Algorithm~\ref{algo:01}. This algorithm is based on a two-stage method with a minimization using restarted GMRES iterative method as an inner solver. It is executed in parallel by each cluster of processors. Matrices and vectors with the subscript $\ell$ represent the local data for cluster $\ell$, where $\ell\in\{1,\ldots,L\}$. The two-stage solver uses two different parallel iterative algorithms: GMRES method to solve each splitting~(\ref{sec03:eq03}) on a cluster of processors, and CGNR method executed in parallel by all clusters to minimize the function error~(\ref{sec03:eq07}) over the Krylov subspace spanned by $S$. The algorithm requires two global synchronizations between $L$ clusters. The first one is performed at line~\ref{line12} in Algorithm~\ref{algo:01} to exchange local values of vector solution $x$ (i.e. the minimizer $\tilde{x}$) required to restart the multisplitting solver. The second one is needed to construct the matrix $R$. We chose to perform this latter synchronization $s$ times in every outer iteration $k$ (line~\ref{line7} in Algorithm~\ref{algo:01}). This is a straightforward way to compute the sparse matrix-dense matrix multiplication $R=AS$. We implemented all synchronizations by using message passing collective communications of MPI library.