X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/Krylov_multi.git/blobdiff_plain/37ae33a711e93b7d43faf6a64064a0576e7f90b5..b326ff727ce4fef24bfa1adef99d0239aa636ea1:/krylov_multi.tex diff --git a/krylov_multi.tex b/krylov_multi.tex index c3f9e1c..1affc17 100644 --- a/krylov_multi.tex +++ b/krylov_multi.tex @@ -1,7 +1,348 @@ \documentclass{article} +\usepackage[utf8]{inputenc} +\usepackage{amsfonts,amssymb} +\usepackage{amsmath} +\usepackage{graphicx} +\usepackage{algorithm} +\usepackage{algpseudocode} +\usepackage{multirow} + +\algnewcommand\algorithmicinput{\textbf{Input:}} +\algnewcommand\Input{\item[\algorithmicinput]} + +\algnewcommand\algorithmicoutput{\textbf{Output:}} +\algnewcommand\Output{\item[\algorithmicoutput]} + +\newcommand{\Time}[1]{\mathit{Time}_\mathit{#1}} +\newcommand{\Prec}{\mathit{prec}} +\newcommand{\Ratio}{\mathit{Ratio}} + +\usepackage{xspace} +\usepackage[textsize=footnotesize]{todonotes} +\newcommand{\LZK}[2][inline]{% +\todo[color=green!40,#1]{\sffamily\textbf{LZK:} #2}\xspace} + +\title{A scalable multisplitting algorithm for solving large sparse linear systems} +\date{} + + \begin{document} +\author{Raphaël Couturier \and Lilia Ziane Khodja} + +\maketitle + + +%%%%%%%%%%%%%%%%%%%%%%%% +%%%%%%%%%%%%%%%%%%%%%%%% + +\begin{abstract} +In this paper we revisit the krylov multisplitting algorithm presented in +\cite{huang1993krylov} which uses a scalar method to minimize the krylov +iterations computed by a multisplitting algorithm. Our new algorithm is based on +a parallel multisplitting algorithm with few blocks of large size using a +parallel GMRES method inside each block and on a parallel krylov minimization in +order to improve the convergence. Some large scale experiments with a 3D Poisson +problem are presented. They show the obtained improvements compared to a +classical GMRES both in terms of number of iterations and execution times. +\end{abstract} + +%%%%%%%%%%%%%%%%%%%%%%%% +%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Introduction} +Iterative methods are used to solve large sparse linear systems of equations of +the form $Ax=b$ because they are easier to parallelize than direct ones. Many +iterative methods have been proposed and adapted by many researchers. For +example, the GMRES method and the Conjugate Gradient method are very well known +and used by many researchers~\cite{S96}. Both the method are based on the +Krylov subspace which consists in forming a basis of the sequence of successive +matrix powers times the initial residual. + +When solving large linear systems with many cores, iterative methods often +suffer from scalability problems. This is due to their need for collective +communications to perform matrix-vector products and reduction operations. +Preconditionners can be used in order to increase the convergence of iterative +solvers. However, most of the good preconditionners are not sclalable when +thousands of cores are used. + +Traditional iterative solvers have global synchronizations that penalize the +scalability. Two possible solutions consists either in using asynchronous +iterative methods~\cite{ref18} or to use multisplitting algorithms. In this +paper, we will reconsider the use of a multisplitting method. In opposition to +traditionnal multisplitting method that suffer from slow convergence, as +proposed in~\cite{huang1993krylov}, the use of a minimization process can +drastically improve the convergence. + +\LZK[]{Suite\dots} + +%%%%%%%%%%%%%%%%%%%%%%%% +%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Related works} +A general framework for studying parallel multisplitting has been presented in~\cite{o1985multi} +by O'Leary and White. Convergence conditions are given for the +most general case. Many authors improved multisplitting algorithms by proposing +for example an asynchronous version~\cite{bru1995parallel} and convergence +conditions~\cite{bai1999block,bahi2000asynchronous} in this case or other +two-stage algorithms~\cite{frommer1992h,bru1995parallel}. + +In~\cite{huang1993krylov}, the authors proposed a parallel multisplitting +algorithm in which all the tasks except one are devoted to solve a sub-block of +the splitting and to send their local solution to the first task which is in +charge to combine the vectors at each iteration. These vectors form a Krylov +basis for which the first task minimizes the error function over the basis to +increase the convergence, then the other tasks receive the updated solution until +convergence of the global system. + +In~\cite{couturier2008gremlins}, the authors proposed practical implementations +of multisplitting algorithms that take benefit from multisplitting algorithms\LZK[]{répétition ???} to +solve large scale linear systems. Inner solvers could be based on scalar direct +method with the LU method or scalar iterative one with GMRES.\LZK[]{lu et gmres par exemple} + +In~\cite{prace-multi}, the authors have proposed a parallel multisplitting +algorithm in which large blocks are solved using a GMRES solver. The authors have +performed large scale experiments up-to 32,768 cores and they conclude that +asynchronous multisplitting algorithm could be more efficient than traditional +solvers on exascale architecture with hundreds of thousands of cores. + +\LZK[]{Peut-être autres related works\ldots}\\ + +The key idea of a multisplitting method to solve a large system of linear equations $Ax=b$ is defined as follows. The first step consists in partitioning the matrix $A$ in $L$ several ways +\begin{equation} +A = M_l - N_l, +\label{eq01} +\end{equation} +where for all $l\in\{1,\ldots,L\}$ $M_l$ are non-singular matrices. Then the linear system is solved by iteration based on the obtained splittings as follows +\begin{equation} +x^{k+1}=\displaystyle\sum^L_{l=1} E_l M^{-1}_l (N_l x^k + b),~k=1,2,3,\ldots +\label{eq02} +\end{equation} +where $E_l$ are non-negative and diagonal weighting matrices and their sum is an identity matrix $I$. The convergence of such a method is dependent on the condition +\begin{equation} +\rho(\displaystyle\sum^L_{l=1}E_l M^{-1}_l N_l)<1. +\label{eq03} +\end{equation} +where $\rho$ is the spectral radius of the square matrix. + +The advantage of the multisplitting method is that at each iteration $k$ there are $L$ different linear sub-systems +\begin{equation} +v_l^k=M^{-1}_l N_l x_l^{k-1} + M^{-1}_l b,~l\in\{1,\ldots,L\}, +\label{eq04} +\end{equation} +to be solved independently by a direct or an iterative method, where $v_l^k$ is the solution of the local sub-system. Thus the computations of $\{v_l\}_{1\leq l\leq L}$ may be performed in parallel by a set of processors. A multisplitting method using an iterative method as an inner solver is called an inner-outer iterative method or a two-stage method. The results $v_l$ obtained from the different splittings~(\ref{eq04}) are combined to compute solution $x$ of the linear system by using the diagonal weighting matrices +\begin{equation} +x^k = \displaystyle\sum^L_{l=1} E_l v_l^k, +\label{eq05} +\end{equation} +In the case where the diagonal weighting matrices $E_l$ have only zero and one factors (i.e. $v_l$ are disjoint vectors), the multisplitting method is non-overlapping and corresponds to the block Jacobi method. + +%%%%%%%%%%%%%%%%%%%%%%%% +%%%%%%%%%%%%%%%%%%%%%%%% + +\section{A two-stage method with a minimization} +Let $Ax=b$ be a given large and sparse linear system of $n$ equations to solve in parallel on $L$ clusters of processors, physically adjacent or geographically distant, where $A\in\mathbb{R}^{n\times n}$ is a square and non-singular matrix, $x\in\mathbb{R}^{n}$ is the solution vector and $b\in\mathbb{R}^{n}$ is the right-hand side vector. The multisplitting of this linear system is defined as follows +\begin{equation} +\left\{ +\begin{array}{lll} +A & = & [A_{1}, \ldots, A_{L}]\\ +x & = & [X_{1}, \ldots, X_{L}]\\ +b & = & [B_{1}, \ldots, B_{L}] +\end{array} +\right. +\label{sec03:eq01} +\end{equation} +where for $l\in\{1,\ldots,L\}$, $A_l$ is a rectangular block of size $n_l\times n$ and $X_l$ and $B_l$ are sub-vectors of size $n_l$ each, such that $\sum_ln_l=n$. In this work, we use a row-by-row splitting without overlapping in such a way that successive rows of sparse matrix $A$ and both vectors $x$ and $b$ are assigned to one cluster. So, the multisplitting format of the linear system is defined as follows +\begin{equation} +\forall l\in\{1,\ldots,L\} \mbox{,~} \displaystyle\sum_{m=1}^{l-1}A_{lm}X_m + A_{ll}X_l + \displaystyle\sum_{m=l+1}^{L}A_{lm}X_m = B_l, +\label{sec03:eq02} +\end{equation} +where $A_{lm}$ is a sub-block of size $n_l\times n_m$ of the rectangular matrix $A_l$, $X_m\neq X_l$ is a sub-vector of size $n_m$ of the solution vector $x$ and $\sum_{m\neq l}n_m+n_l=n$, for all $m\in\{1,\ldots,L\}$. + +Our multisplitting method proceeds by iteration for solving the linear system in such a way each sub-system +\begin{equation} +\left\{ +\begin{array}{l} +A_{ll}X_l = Y_l \mbox{,~such that}\\ +Y_l = B_l - \displaystyle\sum_{\substack{m=1\\m\neq l}}^{L}A_{lm}X_m, +\end{array} +\right. +\label{sec03:eq03} +\end{equation} +is solved independently by a {\it cluster of processors} and communication are required to update the right-hand side vectors $Y_l$, such that the vectors $X_m$ represent the data dependencies between the clusters. In this work, we use the parallel GMRES method~\cite{ref34} as an inner iteration method to solve sub-systems~(\ref{sec03:eq03}). GMRES is one of the most used Krylov iterative methods to solve sparse linear systems in parallel on clusters of processors. In practice, GMRES is used with a preconditioner to improve its convergence. In this work, we used a preconditioning matrix equivalent to the main diagonal of sparse sub-matrix $A_{ll}$. This preconditioner is straightforward to implement in parallel and gives good performances in many situations. + +It should be noted that the convergence of the inner iterative solver for the different sub-systems~(\ref{sec03:eq03}) does not necessarily involve the convergence of the multisplitting method. It strongly depends on the properties of the global sparse linear system to be solved and the computing environment~\cite{o1985multi,ref18}. Furthermore, the multisplitting +of the linear system among several clusters of processors increases the spectral radius of the iteration matrix, thereby slowing the convergence. In this work, we based on the work presented in~\cite{huang1993krylov} to increase the convergence and improve the scalability of the multisplitting methods. + +In order to accelerate the convergence, we implemented the outer iteration of the multisplitting solver as a Krylov subspace method which minimizes some error function over a Krylov subspace~\cite{S96}. The Krylov subspace that we used is spanned by a basis composed of successive solutions issued from solving the $L$ splittings~(\ref{sec03:eq03}) +\begin{equation} +S=\{x^1,x^2,\ldots,x^s\},~s\leq n, +\label{sec03:eq04} +\end{equation} +where for $j\in\{1,\ldots,s\}$, $x^j=[X_1^j,\ldots,X_L^j]$ is a solution of the global linear system. The advantage of such a Krylov subspace is that we need neither an orthogonal basis nor synchronizations between clusters to generate this basis. + +The multisplitting method is periodically restarted every $s$ iterations with a new initial guess $\tilde{x}=S\alpha$ which minimizes the error function $\|b-Ax\|_2$ over the Krylov subspace spanned by vectors of $S$. So $\alpha$ is defined as the solution of the large overdetermined linear system +\begin{equation} +R\alpha=b, +\label{sec03:eq05} +\end{equation} +where $R=AS$ is a dense rectangular matrix of size $n\times s$ and $s\ll n$. This leads us to solve a system of normal equations +\begin{equation} +R^TR\alpha=R^Tb, +\label{sec03:eq06} +\end{equation} +which is associated with the least squares problem +\begin{equation} +\text{minimize}~\|b-R\alpha\|_2, +\label{sec03:eq07} +\end{equation} +where $R^T$ denotes the transpose of the matrix $R$. Since $R$ (i.e. $AS$) and $b$ are split among $L$ clusters, the symmetric positive definite system~(\ref{sec03:eq06}) is solved in parallel. Thus an iterative method would be more appropriate than a direct one to solve this system. We use the parallel conjugate gradient method for the normal equations CGNR~\cite{S96,refCGNR}. + +\begin{algorithm}[!t] +\caption{A two-stage linear solver with inner iteration GMRES method} +\begin{algorithmic}[1] +\Input $A_l$ (sparse sub-matrix), $B_l$ (right-hand side sub-vector) +\Output $X_l$ (solution sub-vector)\vspace{0.2cm} +\State Load $A_l$, $B_l$ +\State Initialize the initial guess $x^0$ +\State Set the minimizer $\tilde{x}^0=x^0$ +\For {$k=1,2,3,\ldots$ until the global convergence} +\State Restart with $x^0=\tilde{x}^{k-1}$: +\For {$j=1,2,\ldots,s$} +\State Inner iteration solver: \Call{InnerSolver}{$x^0$, $j$} +\State Construct basis $S$: add column vector $X_l^j$ to the matrix $S_l^k$ +\State Exchange local values of $X_l^j$ with the neighboring clusters +\State Compute dense matrix $R$: $R_l^{k,j}=\sum^L_{i=1}A_{li}X_i^j$ +\EndFor +\State Minimization $\|b-R\alpha\|_2$: \Call{UpdateMinimizer}{$S_l$, $R$, $b$, $k$} +\State Local solution of the linear system $Ax=b$: $X_l^k=\tilde{X}_l^k$ +\State Exchange the local minimizer $\tilde{X}_l^k$ with the neighboring clusters +\EndFor + +\Statex + +\Function {InnerSolver}{$x^0$, $j$} +\State Compute local right-hand side $Y_l = B_l - \sum^L_{\substack{m=1\\m\neq l}}A_{lm}X_m^0$ +\State Solving local splitting $A_{ll}X_l^j=Y_l$ using parallel GMRES method, such that $X_l^0$ is the initial guess +\State \Return $X_l^j$ +\EndFunction + +\Statex + +\Function {UpdateMinimizer}{$S_l$, $R$, $b$, $k$} +\State Solving normal equations $(R^k)^TR^k\alpha^k=(R^k)^Tb$ in parallel by $L$ clusters using parallel CGNR method +\State Compute local minimizer $\tilde{X}_l^k=S_l^k\alpha^k$ +\State \Return $\tilde{X}_l^k$ +\EndFunction +\end{algorithmic} +\label{algo:01} +\end{algorithm} + +The main key points of our multisplitting method to solve a large sparse linear system are given in Algorithm~\ref{algo:01}. This algorithm is based on a two-stage method with a minimization using GMRES iterative method as an inner solver. It is executed in parallel by each cluster of processors. Matrices and vectors with the subscript $l$ represent the local data for cluster $l$, where $l\in\{1,\ldots,L\}$. The two-stage solver uses two different parallel iterative algorithms: GMRES method to solve each splitting~(\ref{sec03:eq03}) on a cluster of processors, and CGNR method executed in parallel by all clusters to minimize the function error~(\ref{sec03:eq07}) over the Krylov subspace spanned by $S$. The algorithm requires two global synchronizations between $L$ clusters. The first one is performed at line~$12$ in Algorithm~\ref{algo:01} to exchange local values of vector solution $x$ (i.e. the minimizer $\tilde{x}$) required to restart the multisplitting solver. The second one is needed to construct the matrix $R$ of the Krylov subspace. We chose to perform this latter synchronization $s$ times in every outer iteration $k$ (line~$7$ in Algorithm~\ref{algo:01}). This is a straightforward way to compute the sparse matrix-dense matrix multiplication $R=AS$. We implemented all synchronizations by using message passing collective communications of MPI library. + +%%%%%%%%%%%%%%%%%%%%%%%% +%%%%%%%%%%%%%%%%%%%%%%%% + +\section{Experiments} +In order to illustrate the interest of our algorithm. We have compared our +algorithm with the GMRES method which is a very well used method in many +situations. We have chosen to focus on only one problem which is very simple to +implement: a 3 dimension Poisson problem. + +\begin{equation} +\left\{ + \begin{array}{ll} + \nabla u&=f \mbox{~in~} \omega\\ + u &=0 \mbox{~on~} \Gamma=\partial \omega + \end{array} + \right. +\end{equation} + +After discretization, with a finite difference scheme, a seven point stencil is +used. It is well-known that the spectral radius of matrices representing such +problems are very close to 1. Moreover, the larger the number of discretization +points is, the closer to 1 the spectral radius is. Hence, to solve a matrix +obtained for a 3D Poisson problem, the number of iterations is high. Using a +preconditioner it is possible to reduce the number of iterations but +preconditioners are not scalable when using many cores. + +Doing many experiments with many cores is not easy and requires to access to a +supercomputer with several hours for developing a code and then improving +it. In the following we presented some experiments we could achieved out on the +Hector architecture, the previous UK's high-end computing resource, funded by +the UK Research Councils, which has been stopped in the early 2014. + +In the experiments we report the size of the 3D poisson considered\LZK[]{Suite\dots ?} + + +The first column shows the size of the problem The size is chosen in order to +have approximately 50,000 components per core. The second column represents the +number of cores used. In parenthesis, there is the decomposition used for the +Krylov multisplitting. The third column and the sixth column respectively show +the execution time for the GMRES and the Kyrlow multisplitting code. The fourth +and the seventh column describes the number of iterations. For the +multisplitting code, the total number of inner iterations is represented in +parenthesis. + + We also give the other parameters: the restart for the GRMES method.... + +\begin{table}[p] +\begin{center} +\begin{tabular}{|c|c||c|c|c||c|c|c||c|} +\hline +\multirow{2}{*}{Pb size}&\multirow{2}{*}{Nb. cores} & \multicolumn{3}{c||}{GMRES} & \multicolumn{3}{c||}{Krylov Multisplitting} & \multirow{2}{*}{Ratio}\\ + \cline{3-8} + & & Time (s) & nb Iter. & $\Delta$ & Time (s)& nb Iter. & $\Delta$ & \\ +\hline + +$590^3$ & 4096 (2x2048) & 433.1 & 55,494 & 4.92e-7 & 74.1 & 1,101(8,211) & 6.62e-08 & 5.84 \\ +\hline +$743^3$ & 8192 (2x4096) & 704.4 & 87,822 & 4.80e-07 & 151.2 & 3,061(14,914) & 5.87e-08 & 4.65 \\ +\hline +$743^3$ & 8192 (4x2048) & 704.4 & 87,822 & 4.80e-07 & 110.3 & 1,531(12,721) & 1.47e-07& 6.39 \\ +\hline + +\end{tabular} +\caption{Results without preconditioner} +\label{tab1} +\end{center} +\end{table} + + +\begin{table}[p] +\begin{center} +\begin{tabular}{|c|c||c|c|c||c|c|c||c|} +\hline +\multirow{2}{*}{Pb size}&\multirow{2}{*}{Nb. cores} & \multicolumn{3}{c||}{GMRES} & \multicolumn{3}{c||}{Krylov Multisplitting} & \multirow{2}{*}{Ratio}\\ + \cline{3-8} + & & Time (s) & nb Iter. & $\Delta$ & Time (s)& nb Iter. & $\Delta$ & \\ +\hline + +$590^3$ & 4096 (2x2048) & 433.0 & 55,494 & 4.92e-7 & 80.4 & 1,091(9,545) & 7.64e-08 & 5.39 \\ +\hline +$743^3$ & 8192 (2x4096) & 704.4 & 87,822 & 4.80e-07 & 110.2 & 1,401(12,379) & 1.11e-07 & 6.39 \\ +\hline +$743^3$ & 8192 (4x2048) & 704.4 & 87,822 & 4.80e-07 & 139.8 & 1,891(15,960) & 1.60e-07& 5.03 \\ +\hline + +\end{tabular} +\caption{Results with preconditioner} +\label{tab2} +\end{center} +\end{table} + +\section{Conclusion and perspectives} + +Other applications (=> other matrices)\\ +Larger experiments\\ +Async\\ +Overlapping + + +%%%%%%%%%%%%%%%%%%%%%%%% +%%%%%%%%%%%%%%%%%%%%%%%% -This paper presents .... +\bibliographystyle{plain} +\bibliography{biblio} \end{document}