1 \documentclass{article}
2 \usepackage[utf8]{inputenc}
3 \usepackage{amsfonts,amssymb}
7 \usepackage{algpseudocode}
9 \algnewcommand\algorithmicinput{\textbf{Input:}}
10 \algnewcommand\Input{\item[\algorithmicinput]}
12 \algnewcommand\algorithmicoutput{\textbf{Output:}}
13 \algnewcommand\Output{\item[\algorithmicoutput]}
16 \title{A scalable multisplitting algorithm for solving large sparse linear systems}
21 \author{Raphaël Couturier \and Lilia Ziane Khodja}
26 %%%%%%%%%%%%%%%%%%%%%%%%
27 %%%%%%%%%%%%%%%%%%%%%%%%
31 In this paper we revist the krylov multisplitting algorithm presented in
32 \cite{huang1993krylov} which uses a scalar method to minimize the krylov
33 iterations computed by a multisplitting algorithm. Our new algorithm is based on
34 a parallel multisplitting algorithm with few blocks of large size using a
35 parallel GMRES method inside each block and on a parallel krylov minimization in
36 order to improve the convergence. Some large scale experiments with a 3D Poisson
37 problem are presented. They show the obtained improvements compared to a
38 classical GMRES both in terms of number of iterations and execution times.
42 %%%%%%%%%%%%%%%%%%%%%%%%
43 %%%%%%%%%%%%%%%%%%%%%%%%
46 \section{Introduction}
48 Iterative methods are used to solve large sparse linear systems of equations of
49 the form $Ax=b$ because they are easier to parallelize than direct ones. Many
50 iterative methods have been proposed and adapted by many researchers. For
51 example, the GMRES method and the Conjugate Gradient method are very well known
52 and used by many researchers ~\cite{S96}. Both the method are based on the
53 Krylov subspace which consists in forming a basis of the sequence of successive
54 matrix powers times the initial residual.
56 When solving large linear systems with many cores, iterative methods often
57 suffer from scalability problems. This is due to their need for collective
58 communications to perform matrix-vector products and reduction operations.
59 Preconditionners can be used in order to increase the convergence of iterative
60 solvers. However, most of the good preconditionners are not sclalable when
61 thousands of cores are used.
64 Traditionnal iterative solvers have global synchronizations that penalize the
65 scalability. Two possible solutions consists either in using asynchronous
66 iterative methods~\cite{ref18} or to use multisplitting algorithms. In this
67 paper, we will reconsider the use of a multisplitting method. In opposition to
68 traditionnal multisplitting method that suffer from slow convergence, as
69 proposed in~\cite{huang1993krylov}, the use of a minimization process can
70 drastically improve the convergence.
75 %%%%%%%%%%%%%%%%%%%%%%%
77 %%%%%%%%%%%%%%%%%%%%%%%
78 The key idea of the multisplitting method for solving a large system
79 of linear equations $Ax=b$ consists in partitioning the matrix $A$ in
82 A = M_l - N_l,~l\in\{1,\ldots,L\},
85 where $M_l$ are nonsingular matrices. Then the linear system is solved
86 by iteration based on the multisplittings as follows
88 x^{k+1}=\displaystyle\sum^L_{l=1} E_l M^{-1}_l (N_l x^k + b),~k=1,2,3,\ldots
91 where $E_l$ are non-negative and diagonal weighting matrices such that
92 $\sum^L_{l=1}E_l=I$ ($I$ is an identity matrix). Thus the convergence
93 of such a method is dependent on the condition
95 \rho(\displaystyle\sum^L_{l=1}E_l M^{-1}_l N_l)<1.
99 The advantage of the multisplitting method is that at each iteration
100 $k$ there are $L$ different linear sub-systems
102 v_l^k=M^{-1}_l N_l x_l^{k-1} + M^{-1}_l b,~l\in\{1,\ldots,L\},
105 to be solved independently by a direct or an iterative method, where
106 $v_l^k$ is the solution of the local sub-system. Thus, the
107 calculations of $v_l^k$ may be performed in parallel by a set of
108 processors. A multisplitting method using an iterative method for
109 solving the $L$ linear sub-systems is called an inner-outer iterative
110 method or a two-stage method. The results $v_l^k$ obtained from the
111 different splittings~(\ref{eq04}) are combined to compute the solution
112 $x^k$ of the linear system by using the diagonal weighting matrices
114 x^k = \displaystyle\sum^L_{l=1} E_l v_l^k,
117 In the case where the diagonal weighting matrices $E_l$ have only zero
118 and one factors (i.e. $v_l^k$ are disjoint vectors), the
119 multisplitting method is non-overlapping and corresponds to the block
121 %%%%%%%%%%%%%%%%%%%%%%%
123 %%%%%%%%%%%%%%%%%%%%%%%
125 \section{Related works}
128 A general framework for studying parallel multisplitting has been presented in
129 \cite{o1985multi} by O'Leary and White. Convergence conditions are given for the
130 most general case. Many authors improved multisplitting algorithms by proposing
131 for example an asynchronous version \cite{bru1995parallel} and convergence
132 conditions \cite{bai1999block,bahi2000asynchronous} in this case or other
133 two-stage algorithms~\cite{frommer1992h,bru1995parallel}.
135 In \cite{huang1993krylov}, the authors proposed a parallel multisplitting
136 algorithm in which all the tasks except one are devoted to solve a sub-block of
137 the splitting and to send their local solution to the first task which is in
138 charge to combine the vectors at each iteration. These vectors form a Krylov
139 basis for which the first task minimizes the error function over the basis to
140 increase the convergence, then the other tasks receive the update solution until
141 convergence of the global system.
145 In \cite{couturier2008gremlins}, the authors proposed practical implementations
146 of multisplitting algorithms that take benefit from multisplitting algorithms to
147 solve large scale linear systems. Inner solvers could be based on scalar direct
148 method with the LU method or scalar iterative one with GMRES.
150 In~\cite{prace-multi}, the authors have proposed a parallel multisplitting
151 algorithm in which large block are solved using a GMRES solver. The authors have
152 performed large scale experimentations upto 32.768 cores and they conclude that
153 asynchronous multisplitting algorithm could more efficient than traditionnal
154 solvers on exascale architecture with hunders of thousands of cores.
157 %%%%%%%%%%%%%%%%%%%%%%%%
158 %%%%%%%%%%%%%%%%%%%%%%%%
161 \section{A two-stage method with a minimization}
162 Let $Ax=b$ be a given sparse and large linear system of $n$ equations
163 to solve in parallel on $L$ clusters, physically adjacent or
164 geographically distant, where $A\in\mathbb{R}^{n\times n}$ is a square
165 and nonsingular matrix, $x\in\mathbb{R}^{n}$ is the solution vector
166 and $b\in\mathbb{R}^{n}$ is the right-hand side vector. The
167 multisplitting of this linear system is defined as follows:
171 A & = & [A_{1}, \ldots, A_{L}]\\
172 x & = & [X_{1}, \ldots, X_{L}]\\
173 b & = & [B_{1}, \ldots, B_{L}]
178 where for $l\in\{1,\ldots,L\}$, $A_l$ is a rectangular block of size
179 $n_l\times n$ and $X_l$ and $B_l$ are sub-vectors of size $n_l$, such
180 that $\sum_ln_l=n$. In this case, we use a row-by-row splitting
181 without overlapping in such a way that successive rows of the sparse
182 matrix $A$ and both vectors $x$ and $b$ are assigned to one cluster.
183 So, the multisplitting format of the linear system is defined as
186 \forall l\in\{1,\ldots,L\} \mbox{,~} \displaystyle\sum_{i=1}^{l-1}A_{li}X_i + A_{ll}X_l + \displaystyle\sum_{i=l+1}^{L}A_{li}X_i = B_l,
189 where $A_{li}$ is a block of size $n_l\times n_i$ of the rectangular
190 matrix $A_l$, $X_i\neq X_l$ is a sub-vector of size $n_i$ of the
191 solution vector $x$ and $\sum_{i<l}n_i+\sum_{i>l}n_i+n_l=n$, for all
192 $i\in\{1,\ldots,l-1,l+1,\ldots,L\}$.
194 The multisplitting method proceeds by iteration for solving the linear
195 system in such a way each sub-system
199 A_{ll}X_l = Y_l \mbox{,~such that}\\
200 Y_l = B_l - \displaystyle\sum_{i=1,i\neq l}^{L}A_{li}X_i,
205 is solved independently by a cluster of processors and communication
206 are required to update the right-hand side vectors $Y_l$, such that
207 the vectors $X_i$ represent the data dependencies between the
208 clusters. In this work, we use the parallel GMRES method~\cite{ref34}
209 as an inner iteration method to solve the
210 sub-systems~(\ref{sec03:eq03}). It is a well-known iterative method
211 which gives good performances for solving sparse linear systems in
212 parallel on a cluster of processors.
214 It should be noted that the convergence of the inner iterative solver
215 for the different linear sub-systems~(\ref{sec03:eq03}) does not
216 necessarily involve the convergence of the multisplitting method. It
217 strongly depends on the properties of the sparse linear system to be
218 solved and the computing
219 environment~\cite{o1985multi,ref18}. Furthermore, the multisplitting
220 of the linear system among several clusters of processors increases
221 the spectral radius of the iteration matrix, thereby slowing the
222 convergence. In this paper, we based on the work presented
223 in~\cite{huang1993krylov} to increase the convergence and improve the
224 scalability of the multisplitting methods.
226 In order to accelerate the convergence, we implement the outer
227 iteration of the multisplitting solver as a Krylov subspace method
228 which minimizes some error function over a Krylov subspace~\cite{S96}.
229 The Krylov space of the method that we used is spanned by a basis
230 composed of successive solutions issued from solving the $L$
231 splittings~(\ref{sec03:eq03})
233 S=\{x^1,x^2,\ldots,x^s\},~s\leq n,
236 where for $j\in\{1,\ldots,s\}$, $x^j=[X_1^j,\ldots,X_L^j]$ is a
237 solution of the global linear system. The advantage of such a Krylov
238 subspace is that we need neither an orthogonal basis nor
239 synchronizations between the different clusters to generate this
242 The multisplitting method is periodically restarted every $s$
243 iterations with a new initial guess $\tilde{x}=S\alpha$ which
244 minimizes the error function $\|b-Ax\|_2$ over the Krylov subspace
245 spanned by the vectors of $S$. So, $\alpha$ is defined as the
246 solution of the large overdetermined linear system
251 where $R=AS$ is a dense rectangular matrix of size $n\times s$ and
252 $s\ll n$. This leads us to solve the system of normal equations
257 which is associated with the least squares problem
259 \text{minimize}~\|b-R\alpha\|_2,
262 where $R^T$ denotes the transpose of the matrix $R$. Since $R$ (i.e.
263 $AS$) and $b$ are split among $L$ clusters, the symmetric positive
264 definite system~(\ref{sec03:eq06}) is solved in parallel. Thus, an
265 iterative method would be more appropriate than a direct one to solve
266 this system. We use the parallel conjugate gradient method for the
267 normal equations CGNR~\cite{S96,refCGNR}.
269 \begin{algorithm}[!t]
270 \caption{A two-stage linear solver with inner iteration GMRES method}
271 \begin{algorithmic}[1]
272 \Input $A_l$ (local sparse matrix), $B_l$ (local right-hand side), $x^0$ (initial guess)
273 \Output $X_l$ (local solution vector)\vspace{0.2cm}
274 \State Load $A_l$, $B_l$, $x^0$
275 \State Initialize the minimizer $\tilde{x}^0=x^0$
276 \For {$k=1,2,3,\ldots$ until the global convergence}
277 \State Restart with $x^0=\tilde{x}^{k-1}$: \textbf{for} $j=1,2,\ldots,s$ \textbf{do}
278 \State\hspace{0.5cm} Inner iteration solver: \Call{InnerSolver}{$x^0$, $j$}
279 \State\hspace{0.5cm} Construct the basis $S$: add the column vector $X_l^j$ to the matrix $S_l^k$
280 \State\hspace{0.5cm} Exchange the local solution vector $X_l^j$ with the neighboring clusters
281 \State\hspace{0.5cm} Compute the dense matrix $R$: $R_l^{k,j}=\sum^L_{i=1}A_{li}X_i^j$
282 \State\textbf{end for}
283 \State Minimization $\|b-R\alpha\|_2$: \Call{UpdateMinimizer}{$S_l$, $R$, $b$, $k$}
284 \State Local solution of the linear system $Ax=b$: $X_l^k=\tilde{X}_l^k$
285 \State Exchange the local minimizer $\tilde{X}_l^k$ with the neighboring clusters
290 \Function {InnerSolver}{$x^0$, $j$}
291 \State Compute the local right-hand side: $Y_l = B_l - \sum^L_{i=1,i\neq l}A_{li}X_i^0$
292 \State Solving the local splitting $A_{ll}X_l^j=Y_l$ using the parallel GMRES method, such that $X_l^0$ is the initial guess
293 \State \Return $X_l^j$
298 \Function {UpdateMinimizer}{$S_l$, $R$, $b$, $k$}
299 \State Solving the normal equations $(R^k)^TR^k\alpha^k=(R^k)^Tb$ in parallel by $L$ clusters using the parallel CGNR method
300 \State Compute the local minimizer: $\tilde{X}_l^k=S_l^k\alpha^k$
301 \State \Return $\tilde{X}_l^k$
307 The main key points of the multisplitting method to solve a large
308 sparse linear system are given in Algorithm~\ref{algo:01}. This
309 algorithm is based on a two-stage method with a minimization using the
310 GMRES iterative method as an inner solver. It is executed in parallel
311 by each cluster of processors. The matrices and vectors with the
312 subscript $l$ represent the local data for the cluster $l$, where
313 $l\in\{1,\ldots,L\}$. The two-stage solver uses two different parallel
314 iterative algorithms: the GMRES method to solve each splitting on a
315 cluster of processors, and the CGNR method executed in parallel by all
316 clusters to minimize the function error over the Krylov subspace
317 spanned by $S$. The algorithm requires two global synchronizations
318 between the $L$ clusters. The first one is performed at line~$12$ in
319 Algorithm~\ref{algo:01} to exchange the local values of the vector
320 solution $x$ (i.e. the minimizer $\tilde{x}$) required to restart the
321 multisplitting solver. The second one is needed to construct the
322 matrix $R$ of the Krylov subspace. We choose to perform this latter
323 synchronization $s$ times in every outer iteration $k$ (line~$7$ in
324 Algorithm~\ref{algo:01}). This is a straightforward way to compute the
325 matrix-matrix multiplication $R=AS$. We implement all
326 synchronizations by using the MPI collective communication
332 %%%%%%%%%%%%%%%%%%%%%%%%
333 %%%%%%%%%%%%%%%%%%%%%%%%
335 \bibliographystyle{plain}
336 \bibliography{biblio}