1 \documentclass{article}
2 \usepackage[utf8]{inputenc}
3 \usepackage{amsfonts,amssymb}
7 \usepackage{algpseudocode}
10 \title{A scalable multisplitting algorithm for solving large sparse linear systems}
15 \author{Raphaël Couturier \and Lilia Ziane Khodja}
20 %%%%%%%%%%%%%%%%%%%%%%%%
21 %%%%%%%%%%%%%%%%%%%%%%%%
25 In this paper we revist the krylov multisplitting algorithm presented in
26 \cite{huang1993krylov} which uses a scalar method to minimize the krylov
27 iterations computed by a multisplitting algorithm. Our new algorithm is based on
28 a parallel multisplitting algorithm with few blocks of large size using a
29 parallel GMRES method inside each block and on a parallel krylov minimization in
30 order to improve the convergence. Some large scale experiments with a 3D Poisson
31 problem are presented. They show the obtained improvements compared to a
32 classical GMRES both in terms of number of iterations and execution times.
36 %%%%%%%%%%%%%%%%%%%%%%%%
37 %%%%%%%%%%%%%%%%%%%%%%%%
40 \section{Introduction}
42 Iterative methods are used to solve large sparse linear systems of equations of
43 the form $Ax=b$ because they are easier to parallelize than direct ones. Many
44 iterative methods have been proposed and adapted by many researchers. For
45 example, the GMRES method and the Conjugate Gradient method are very well known
46 and used by many researchers ~\cite{S96}. Both the method are based on the
47 Krylov subspace which consists in forming a basis of the sequence of successive
48 matrix powers times the initial residual.
50 When solving large linear systems with many cores, iterative methods often
51 suffer from scalability problems. This is due to their need for collective
52 communications to perform matrix-vector products and reduction operations.
53 Preconditionners can be used in order to increase the convergence of iterative
54 solvers. However, most of the good preconditionners are not sclalable when
55 thousands of cores are used.
59 On ne peut pas parler de tout...\\
64 %%%%%%%%%%%%%%%%%%%%%%%
66 %%%%%%%%%%%%%%%%%%%%%%%
67 The key idea of the multisplitting method for solving a large system
68 of linear equations $Ax=b$ consists in partitioning the matrix $A$ in
71 A = M_l - N_l,~l\in\{1,\ldots,L\},
74 where $M_l$ are nonsingular matrices. Then the linear system is solved
75 by iteration based on the multisplittings as follows
77 x^{k+1}=\displaystyle\sum^L_{l=1} E_l M^{-1}_l (N_l x^k + b),~k=1,2,3,\ldots
80 where $E_l$ are non-negative and diagonal weighting matrices such that
81 $\sum^L_{l=1}E_l=I$ ($I$ is an identity matrix). Thus the convergence
82 of such a method is dependent on the condition
84 \rho(\displaystyle\sum^L_{l=1}E_l M^{-1}_l N_l)<1.
88 The advantage of the multisplitting method is that at each iteration
89 $k$ there are $L$ different linear sub-systems
91 v_l^k=M^{-1}_l N_l x_l^{k-1} + M^{-1}_l b,~l\in\{1,\ldots,L\},
94 to be solved independently by a direct or an iterative method, where
95 $v_l^k$ is the solution of the local sub-system. Thus, the
96 calculations of $v_l^k$ may be performed in parallel by a set of
97 processors. A multisplitting method using an iterative method for
98 solving the $L$ linear sub-systems is called an inner-outer iterative
99 method or a two-stage method. The results $v_l^k$ obtained from the
100 different splittings~(\ref{eq04}) are combined to compute the solution
101 $x^k$ of the linear system by using the diagonal weighting matrices
103 x^k = \displaystyle\sum^L_{l=1} E_l v_l^k,
106 In the case where the diagonal weighting matrices $E_l$ have only zero
107 and one factors (i.e. $v_l^k$ are disjoint vectors), the
108 multisplitting method is non-overlapping and corresponds to the block
110 %%%%%%%%%%%%%%%%%%%%%%%
112 %%%%%%%%%%%%%%%%%%%%%%%
114 \section{Related works}
117 A general framework for studying parallel multisplitting has been presented in
118 \cite{o1985multi} by O'Leary and White. Convergence conditions are given for the
119 most general case. Many authors improved multisplitting algorithms by proposing
120 for example an asynchronous version \cite{bru1995parallel} and convergence
121 conditions \cite{bai1999block,bahi2000asynchronous} in this case or other
122 two-stage algorithms~\cite{frommer1992h,bru1995parallel}.
124 In \cite{huang1993krylov}, the authors proposed a parallel multisplitting
125 algorithm in which all the tasks except one are devoted to solve a sub-block of
126 the splitting and to send their local solution to the first task which is in
127 charge to combine the vectors at each iteration. These vectors form a Krylov
128 basis for which the first task minimizes the error function over the basis to
129 increase the convergence, then the other tasks receive the update solution until
130 convergence of the global system.
134 In \cite{couturier2008gremlins}, the authors proposed practical implementations
135 of multisplitting algorithms that take benefit from multisplitting algorithms to
136 solve large scale linear systems. Inner solvers could be based on scalar direct
137 method with the LU method or scalar iterative one with GMRES.
139 In~\cite{prace-multi}, the authors have proposed a parallel multisplitting
140 algorithm in which large block are solved using a GMRES solver. The authors have
141 performed large scale experimentations upto 32.768 cores and they conclude that
142 asynchronous multisplitting algorithm could more efficient than traditionnal
143 solvers on exascale architecture with hunders of thousands of cores.
146 %%%%%%%%%%%%%%%%%%%%%%%%
147 %%%%%%%%%%%%%%%%%%%%%%%%
150 \section{A two-stage method with a minimization}
151 Let $Ax=b$ be a given sparse and large linear system of $n$ equations
152 to solve in parallel on $L$ clusters, physically adjacent or
153 geographically distant, where $A\in\mathbb{R}^{n\times n}$ is a square
154 and nonsingular matrix, $x\in\mathbb{R}^{n}$ is the solution vector
155 and $b\in\mathbb{R}^{n}$ is the right-hand side vector. The
156 multisplitting of this linear system is defined as follows:
160 A & = & [A_{1}, \ldots, A_{L}]\\
161 x & = & [X_{1}, \ldots, X_{L}]\\
162 b & = & [B_{1}, \ldots, B_{L}]
167 where for $l\in\{1,\ldots,L\}$, $A_l$ is a rectangular block of size
168 $n_l\times n$ and $X_l$ and $B_l$ are sub-vectors of size $n_l$, such
169 that $\sum_ln_l=n$. In this case, we use a row-by-row splitting
170 without overlapping in such a way that successive rows of the sparse
171 matrix $A$ and both vectors $x$ and $b$ are assigned to one cluster.
172 So, the multisplitting format of the linear system is defined as
175 \forall l\in\{1,\ldots,L\} \mbox{,~} \displaystyle\sum_{i=1}^{l-1}A_{li}X_i + A_{ll}X_l + \displaystyle\sum_{i=l+1}^{L}A_{li}X_i = B_l,
178 where $A_{li}$ is a block of size $n_l\times n_i$ of the rectangular
179 matrix $A_l$, $X_i\neq X_l$ is a sub-vector of size $n_i$ of the
180 solution vector $x$ and $\sum_{i<l}n_i+\sum_{i>l}n_i+n_l=n$, for all
181 $i\in\{1,\ldots,l-1,l+1,\ldots,L\}$.
183 The multisplitting method proceeds by iteration for solving the linear
184 system in such a way each sub-system
188 A_{ll}X_l = Y_l \mbox{,~such that}\\
189 Y_l = B_l - \displaystyle\sum_{i=1,i\neq l}^{L}A_{li}X_i,
194 is solved independently by a cluster of processors and communication
195 are required to update the right-hand side vectors $Y_l$, such that
196 the vectors $X_i$ represent the data dependencies between the
197 clusters. In this work, we use the parallel GMRES method~\cite{ref34}
198 as an inner iteration method for solving the
199 sub-systems~(\ref{sec03:eq03}). It is a well-known iterative method
200 which gives good performances for solving sparse linear systems in
201 parallel on a cluster of processors.
203 It should be noted that the convergence of the inner iterative solver
204 for the different linear sub-systems~(\ref{sec03:eq03}) does not
205 necessarily involve the convergence of the multisplitting method. It
206 strongly depends on the properties of the sparse linear system to be
207 solved and the computing
208 environment~\cite{o1985multi,ref18}. Furthermore, the multisplitting
209 of the linear system among several clusters of processors increases
210 the spectral radius of the iteration matrix, thereby slowing the
211 convergence. In this paper, we based on the work presented
212 in~\cite{huang1993krylov} to increase the convergence and improve the
213 scalability of the multisplitting methods.
215 In order to accelerate the convergence, we implement the outer
216 iteration of the multisplitting solver as a Krylov subspace method
217 which minimizes some error function over a Krylov subspace~\cite{S96}.
218 The Krylov space of the method that we used is spanned by a basis
219 composed of successive solutions issued from solving the $L$
220 splittings~(\ref{sec03:eq03})
222 S=\{x^1,x^2,\ldots,x^s\},~s\leq n,
225 where for $j\in\{1,\ldots,s\}$, $x^j=[X_1^j,\ldots,X_L^j]$ is a
226 solution of the global linear system.%The advantage such a method is that the Krylov subspace does not need to be spanned by an orthogonal basis.
227 The advantage of such a Krylov subspace is that we need neither an
228 orthogonal basis nor synchronizations between the different clusters
229 to generate this basis.
231 The multisplitting method is periodically restarted every $s$
232 iterations with a new initial guess $\tilde{x}=S\alpha$ which
233 minimizes the error function $\|b-Ax\|_2$ over the Krylov subspace
234 spanned by the vectors of $S$. So, $\alpha$ is defined as the
235 solution of the large overdetermined linear system
240 where $R=AS$ is a dense rectangular matrix of size $n\times s$ and
241 $s\ll n$. This leads us to solve the system of normal equations
246 which is associated with the least squares problem
248 \text{minimize}~\|b-R\alpha\|_2,
251 where $R^T$ denotes the transpose of the matrix $R$. Since $R$
252 (i.e. $AS$) and $b$ are split among $L$ clusters, the symmetric
253 positive definite system~(\ref{sec03:eq06}) is solved in
254 parallel. Thus, an iterative method would be more appropriate than a
255 direct one for solving this system. We use the parallel conjugate
256 gradient method for the normal equations CGNR~\cite{S96,refCGNR}.
258 \begin{algorithm}[!t]
259 \caption{A two-stage linear solver with inner iteration GMRES method}
260 \begin{algorithmic}[1]
261 \State Load $A_l$, $B_l$, initial guess $x^0$
262 \State Initialize the minimizer $\tilde{x}^0=x^0$
263 \For {$k=1,2,3,\ldots$ until the global convergence}
264 \State Restart with $x^0=\tilde{x}^{k-1}$: \textbf{for} $j=1,2,\ldots,s$ \textbf{do}
265 \State\hspace{0.5cm} Inner iteration solver: \Call{InnerSolver}{$x^0$, $j$}
266 \State\hspace{0.5cm} Construct the basis $S$: add the column vector $X_l^j$ to the matrix $S_l$
267 \State\hspace{0.5cm} Exchange the local solution vector $X_l^j$ with the neighboring clusters
268 \State\hspace{0.5cm} Compute the dense matrix $R$: $R_l^j=\sum^L_{i=1}A_{li}X_i^j$
269 \State\textbf{end for}
270 \State Minimization $\|b-R\alpha\|_2$: \Call{UpdateMinimizer}{$S_l$, $R$, $b$}
271 \State Local solution of the linear system $Ax=b$: $X_l^k=\tilde{X}_l^k$
272 \State Exchange the local minimizer $\tilde{X}_l^k$ with the neighboring clusters
277 \Function {InnerSolver}{$x^0$, $j$}
278 \State Compute the local right-hand side: $Y_l = B_l - \sum^L_{i=1,i\neq l}A_{li}X_i^0$
279 \State Solving the local splitting $A_{ll}X_l^j=Y_l$ with the parallel GMRES method, such that $X_l^0$ is the initial guess.
280 \State \Return $X_l^j$
285 \Function {UpdateMinimizer}{$S_l$, $R$, $b$, $k$}
286 \State Solving the normal equations $R^TR\alpha=R^Tb$ in parallel by $L$ clusters using the parallel CGNR method
287 \State Compute the local minimizer: $\tilde{X}_l^k=S_l\alpha$
288 \State \Return $\tilde{X}_l^k$
294 The main key points of the multisplitting method for solving large
295 sparse linear systems are given in Algorithm~\ref{algo:01}. This
296 algorithm is based on a two-stage method with a minimization using the
297 GMRES iterative method as an inner solver. It is executed in parallel
298 by each cluster of processors. The matrices and vectors with the
299 subscript $l$ represent the local data for the cluster $l$, where
300 $l\in\{1,\ldots,L\}$. The two-stage solver uses two different parallel
301 iterative algorithms: the GMRES method for solving each splitting on a
302 cluster of processors, and the CGNR method executed in parallel by all
303 clusters for minimizing the function error over the Krylov subspace
304 spanned by $S$. The algorithm requires two global synchronizations
305 between the $L$ clusters. The first one is performed at line~$12$ in
306 Algorithm~\ref{algo:01} to exchange the local values of the vector
307 solution $x$ (i.e. the minimizer $\tilde{x}$) required to restart the
308 multisplitting solver. The second one is needed to construct the
309 matrix $R$ of the Krylov subspace. We choose to perform this latter
310 synchronization $s$ times in every outer iteration $k$ (line~$7$ in
311 Algorithm~\ref{algo:01}). This is a straightforward way to compute the
312 matrix-matrix multiplication $R=AS$. We implement all
313 synchronizations by using the MPI collective communication
319 %%%%%%%%%%%%%%%%%%%%%%%%
320 %%%%%%%%%%%%%%%%%%%%%%%%
322 \bibliographystyle{plain}
323 \bibliography{biblio}