X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/Krylov_multi.git/blobdiff_plain/ee6ff2cef0b50051cbcf657aa4f3fc61bc050472..0dcb5e02d747971e35e25b6b2f31d13ac692cc11:/krylov_multi.tex?ds=inline diff --git a/krylov_multi.tex b/krylov_multi.tex index 963ce3e..275700c 100644 --- a/krylov_multi.tex +++ b/krylov_multi.tex @@ -3,8 +3,25 @@ \usepackage{amsfonts,amssymb} \usepackage{amsmath} \usepackage{graphicx} +\usepackage{algorithm} +\usepackage{algpseudocode} +\usepackage{multirow} + +\algnewcommand\algorithmicinput{\textbf{Input:}} +\algnewcommand\Input{\item[\algorithmicinput]} + +\algnewcommand\algorithmicoutput{\textbf{Output:}} +\algnewcommand\Output{\item[\algorithmicoutput]} + +\newcommand{\Time}[1]{\mathit{Time}_\mathit{#1}} +\newcommand{\Prec}{\mathit{prec}} +\newcommand{\Ratio}{\mathit{Ratio}} + +\def\changemargin#1#2{\list{}{\rightmargin#2\leftmargin#1}\item[]} +\let\endchangemargin=\endlist \title{A scalable multisplitting algorithm for solving large sparse linear systems} +\date{} @@ -17,130 +34,136 @@ %%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%% - \begin{abstract} -In this paper we revist the krylov multisplitting algorithm presented in -\cite{huang1993krylov} which uses a scalar method to minimize the krylov +In this paper we revisit the Krylov multisplitting algorithm presented in +\cite{huang1993krylov} which uses a sequential method to minimize the Krylov iterations computed by a multisplitting algorithm. Our new algorithm is based on a parallel multisplitting algorithm with few blocks of large size using a -parallel GMRES method inside each block and on a parallel krylov minimization in +parallel GMRES method inside each block and on a parallel Krylov minimization in order to improve the convergence. Some large scale experiments with a 3D Poisson -problem are presented. They show the obtained improvements compared to a -classical GMRES both in terms of number of iterations and execution times. +problem are presented with up to 8,192 cores. They show the obtained +improvements compared to a classical GMRES both in terms of number of iterations +and execution times. \end{abstract} - %%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%% - \section{Introduction} - Iterative methods are used to solve large sparse linear systems of equations of the form $Ax=b$ because they are easier to parallelize than direct ones. Many -iterative methods have been proposed and adapted by many researchers. When -solving large linear systems with many cores, iterative methods often suffer -from scalability problems. This is due to their need for collective +iterative methods have been proposed and adapted by many researchers. For +example, the GMRES method and the Conjugate Gradient method are very well known +and used by many researchers~\cite{S96}. Both methods are based on the +Krylov subspace which consists in forming a basis of a sequence of successive +matrix powers times the initial residual. + +When solving large linear systems with many cores, iterative methods often +suffer from scalability problems. This is due to their need for collective communications to perform matrix-vector products and reduction operations. -Preconditionners can be used in order to increase the convergence of iterative -solvers. However, most of the good preconditionners are not sclalable when +Preconditioners can be used in order to increase the convergence of iterative +solvers. However, most of the good preconditioners are not scalable when thousands of cores are used. +%Traditional iterative solvers have global synchronizations that penalize the +%scalability. Two possible solutions consists either in using asynchronous +%iterative methods~\cite{ref18} or to use multisplitting algorithms. In this +%paper, we will reconsider the use of a multisplitting method. In opposition to +%traditional multisplitting method that suffer from slow convergence, as +%proposed in~\cite{huang1993krylov}, the use of a minimization process can +%drastically improve the convergence. + +Traditional parallel iterative solvers are based on fine-grain computations that +frequently require data exchanges between computing nodes and have global +synchronizations that penalize the scalability. Particularly, they are more +penalized on large scale architectures or on distributed platforms composed of +distant clusters interconnected by a high-latency network. It is therefore +imperative to develop coarse-grain based algorithms to reduce the communications +in the parallel iterative solvers. Two possible solutions consists either in +using asynchronous iterative methods~\cite{ref18} or to use multisplitting +algorithms. In this paper, we will reconsider the use of a multisplitting +method. In opposition to traditional multisplitting method that suffer from slow +convergence, as proposed in~\cite{huang1993krylov}, the use of a minimization +process can drastically improve the convergence. + +The present paper is organized as follows. First, Section~\ref{sec:02} presents +some related works and the principle of multisplitting methods. Then, in +Section~\ref{sec:03} is presented the algorithm of our Krylov multisplitting +method based on inner-outer iterations. Finally, in Section~\ref{sec:04}, the +parallel experiments on Hector architecture show the performances of the Krylov +multisplitting algorithm compared to the classical GMRES algorithm to solve a 3D +Poisson problem. -A completer... -On ne peut pas parler de tout...\\ +%%%%%%%%%%%%%%%%%%%%%%%% +%%%%%%%%%%%%%%%%%%%%%%%% +\section{Related works and presentation of the multisplitting method} +\label{sec:02} +A general framework for studying parallel multisplitting has been presented in~\cite{o1985multi} +by O'Leary and White. Convergence conditions are given for the +most general case. Many authors improved multisplitting algorithms by proposing +for example an asynchronous version~\cite{bru1995parallel} and convergence +conditions~\cite{bai1999block,bahi2000asynchronous} in this case or other +two-stage algorithms~\cite{frommer1992h,bru1995parallel}. +In~\cite{huang1993krylov}, the authors proposed a parallel multisplitting +algorithm in which all the tasks except one are devoted to solve a sub-block of +the splitting and to send their local solutions to the first task which is in +charge to combine the vectors at each iteration. These vectors form a Krylov +basis for which the first task minimizes the error function over the basis to +increase the convergence, then the other tasks receive the updated solution until +convergence of the global system. + +In~\cite{couturier2008gremlins}, the authors proposed practical implementations +of multisplitting algorithms to solve large scale linear systems. Inner solvers +could be based on sequential direct method with the LU method or sequential iterative +one with GMRES. -%%%%%%%%%%%%%%%%%%%%%%% -%% BEGIN -%%%%%%%%%%%%%%%%%%%%%%% -The key idea of the multisplitting method for solving a large system -of linear equations $Ax=b$ consists in partitioning the matrix $A$ in -$L$ several ways +In~\cite{prace-multi}, the authors have proposed a parallel multisplitting +algorithm in which large blocks are solved using a GMRES solver. The authors have +performed large scale experiments up-to 32,768 cores and they conclude that +asynchronous multisplitting algorithm could be more efficient than traditional +solvers on an exascale architecture with hundreds of thousands of cores. + +So compared to these works, we propose in this paper a practical multisplitting method based on parallel iterative blocks and gives better results than classical GMRES method for the 3D Poisson problem we considered. +\\ + +The key idea of a multisplitting method to solve a large system of linear equations $Ax=b$ is defined as follows. The first step consists in partitioning the matrix $A$ in $L$ several ways \begin{equation} -A = M_l - N_l,~l\in\{1,\ldots,L\}, +A = M_\ell - N_\ell, \label{eq01} \end{equation} -where $M_l$ are nonsingular matrices. Then the linear system is solved -by iteration based on the multisplittings as follows +where for all $\ell\in\{1,\ldots,L\}$ $M_\ell$ are non-singular matrices. Then the linear system is solved by iteration based on the obtained splittings as follows \begin{equation} -x^{k+1}=\displaystyle\sum^L_{l=1} E_l M^{-1}_l (N_l x^k + b),~k=1,2,3,\ldots +x^{k+1}=\displaystyle\sum^L_{\ell=1} E_\ell M^{-1}_\ell (N_\ell x^k + b),~k=1,2,3,\ldots \label{eq02} \end{equation} -where $E_l$ are non-negative and diagonal weighting matrices such that -$\sum^L_{l=1}E_l=I$ ($I$ is an identity matrix). Thus the convergence -of such a method is dependent on the condition +where $E_\ell$ are non-negative and diagonal weighting matrices and their sum is an identity matrix $I$. The convergence of such a method is dependent on the condition \begin{equation} -\rho(\displaystyle\sum^L_{l=1}E_l M^{-1}_l N_l)<1. +\rho(\displaystyle\sum^L_{\ell=1}E_\ell M^{-1}_\ell N_\ell)<1. \label{eq03} \end{equation} +where $\rho$ is the spectral radius of the square matrix. -The advantage of the multisplitting method is that at each iteration -$k$ there are $L$ different linear sub-systems +The advantage of the multisplitting method is that at each iteration $k$ there are $L$ different linear sub-systems \begin{equation} -v_l^k=M^{-1}_l N_l x_l^{k-1} + M^{-1}_l b,~l\in\{1,\ldots,L\}, +v_\ell^k=M^{-1}_\ell N_\ell x_\ell^{k-1} + M^{-1}_\ell b,~\ell\in\{1,\ldots,L\}, \label{eq04} \end{equation} -to be solved independently by a direct or an iterative method, where -$v_l^k$ is the solution of the local sub-system. A multisplitting -method using an iterative method for solving the $L$ linear -sub-systems is called an inner-outer iterative method or a two-stage -method. The results $v_l^k$ obtained from the different -splittings~(\ref{eq04}) are combined to compute the solution $x^k$ of -the linear system by using the diagonal weighting matrices +to be solved independently by a direct or an iterative method, where $v_\ell$ is the solution of the local sub-system. Thus the computations of $\{v_\ell\}_{1\leq \ell\leq L}$ may be performed in parallel by a set of processors. A multisplitting method using an iterative method as an inner solver is called an inner-outer iterative method or a two-stage method. The results $v_\ell$ obtained from the different splittings~(\ref{eq04}) are combined to compute solution $x$ of the linear system by using the diagonal weighting matrices \begin{equation} -x^k = \displaystyle\sum^L_{l=1} E_l v_l^k, +x^k = \displaystyle\sum^L_{\ell=1} E_\ell v_\ell^k, \label{eq05} \end{equation} -In the case where the diagonal weighting matrices $E_l$ have only zero -and one factors (i.e. $v_l^k$ are disjoint vectors), the -multisplitting method is non-overlapping and corresponds to the block -Jacobi method. -%%%%%%%%%%%%%%%%%%%%%%% -%% END -%%%%%%%%%%%%%%%%%%%%%%% - -\section{Related works} - - -A general framework for studying parallel multisplitting has been presented in -\cite{o1985multi} by O'Leary and White. Convergence conditions are given for the -most general case. Many authors improved multisplitting algorithms by proposing -for example a asynchronous version \cite{bru1995parallel} and convergence -condition \cite{bai1999block,bahi2000asynchronous} in this case or other -two-stage algorithms~\cite{frommer1992h,bru1995parallel} - -In \cite{huang1993krylov}, the authors proposed a parallel multisplitting -algorithm in which all the tasks except one are devoted to solve a sub-block of -the splitting and to send their local solution to the first task which is in -charge to combine the vectors at each iteration. These vectors form a Krylov -basis for which the first tasks minimize the error function over the basis to -increase the convergence, then the other tasks receive the update solution until -convergence of the global system. - - - -In \cite{couturier2008gremlins}, the authors proposed practical implementations -of multisplitting algorithms that take benefit from multisplitting algorithms to -solve large scale linear systems. Inner solvers could be based on scalar direct -method with the LU method or scalar iterative one with GMRES. - - - +In the case where the diagonal weighting matrices $E_\ell$ have only zero and one factors (i.e. $v_\ell$ are disjoint vectors), the multisplitting method is non-overlapping and corresponds to the block Jacobi method. %%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%%%%%%% - \section{A two-stage method with a minimization} -Let $Ax=b$ be a given sparse and large linear system of $n$ equations -to solve in parallel on $L$ clusters, physically adjacent or geographically -distant, where $A\in\mathbb{R}^{n\times n}$ is a square and nonsingular -matrix, $x\in\mathbb{R}^{n}$ is the solution vector and $b\in\mathbb{R}^{n}$ -is the right-hand side vector. The multisplitting of this linear system -is defined as follows: +\label{sec:03} +Let $Ax=b$ be a given large and sparse linear system of $n$ equations to solve in parallel on $L$ clusters of processors, physically adjacent or geographically distant, where $A\in\mathbb{R}^{n\times n}$ is a square and non-singular matrix, $x\in\mathbb{R}^{n}$ is the solution vector and $b\in\mathbb{R}^{n}$ is the right-hand side vector. The multisplitting of this linear system is defined as follows \begin{equation} \left\{ \begin{array}{lll} @@ -151,40 +174,232 @@ b & = & [B_{1}, \ldots, B_{L}] \right. \label{sec03:eq01} \end{equation} -where for all $l\in\{1,\ldots,L\}$ $A_l$ is a rectangular block of size $n_l\times n$ -and $X_l$ and $B_l$ are sub-vectors of size $n_l$, such that $\sum_ln_l=n$. In this -case, we use a row-by-row splitting without overlapping in such a way that successive -rows of the sparse matrix $A$ and both vectors $x$ and $b$ are assigned to one cluster. -So, the multisplitting format of the linear system is defined as follows: +where for $\ell\in\{1,\ldots,L\}$, $A_\ell$ is a rectangular block of size $n_\ell\times n$ and $X_\ell$ and $B_\ell$ are sub-vectors of size $n_\ell$ each, such that $\sum_\ell n_\ell=n$. In this work, we use a row-by-row splitting without overlapping in such a way that successive rows of sparse matrix $A$ and both vectors $x$ and $b$ are assigned to one cluster. So, the multisplitting format of the linear system is defined as follows \begin{equation} -\forall l\in\{1,\ldots,L\} \mbox{,~} \displaystyle\sum_{i=1}^{l-1}A_{li}X_i + A_{ll}X_l + \displaystyle\sum_{i=l+1}^{L}A_{li}X_i = B_l, +\forall \ell\in\{1,\ldots,L\} \mbox{,~} A_{\ell \ell}X_\ell + \displaystyle\sum_{\substack{m=1\\m\neq\ell}}^L A_{\ell m}X_m = B_\ell, \label{sec03:eq02} \end{equation} -where $A_{li}$ is a block of size $n_l\times n_i$ of the rectangular matrix $A_l$, $X_i\neq X_l$ -is a sub-vector of size $n_i$ of the solution vector $x$ and $\sum_{il}n_i+n_l=n$, -for all $i\in\{1,\ldots,l-1,l+1,\ldots,L\}$. +where $A_{\ell m}$ is a sub-block of size $n_\ell\times n_m$ of the rectangular matrix $A_\ell$, $X_m\neq X_\ell$ is a sub-vector of size $n_m$ of the solution vector $x$ and $\sum_{m\neq \ell}n_m+n_\ell=n$, for all $m\in\{1,\ldots,L\}$. -The multisplitting method proceeds by iteration for solving the linear system in such a -way each sub-system +Our multisplitting method proceeds by iteration to solve the linear system in such a way that each sub-system \begin{equation} \left\{ \begin{array}{l} -A_{ll}X_l = Y_l \mbox{,~such that}\\ -Y_l = B_l - \displaystyle\sum_{i=1,i\neq l}^{L}A_{li}X_i, +A_{\ell \ell}X_\ell = Y_\ell \mbox{,~such that}\\ +Y_\ell = B_\ell - \displaystyle\sum_{\substack{m=1\\m\neq \ell}}^{L}A_{\ell m}X_m, \end{array} \right. \label{sec03:eq03} \end{equation} -is solved independently by a cluster of processors and communication are required to -update the right-hand side vectors $Y_l$, such that the vectors $X_i$ represent the data -dependencies between the clusters. In this case, the parallel GMRES method is used -as an inner iteration method for solving the linear sub-systems~(\ref{sec03:eq03}). - +is solved independently by a {\it cluster of processors} and communications are required to update the right-hand side vectors $Y_\ell$, such that the vectors $X_m$ represent the data dependencies between the clusters. In this work, we use the parallel restarted GMRES method~\cite{ref34} as an inner iteration method to solve sub-systems~(\ref{sec03:eq03}). GMRES is one of the most used Krylov iterative methods to solve sparse linear systems. %In practice, GMRES is used with a preconditioner to improve its convergence. In this work, we used a preconditioning matrix equivalent to the main diagonal of sparse sub-matrix $A_{ll}$. This preconditioner is straightforward to implement in parallel and gives good performances in many situations. + +It should be noted that the convergence of the inner iterative solver for the +different sub-systems~(\ref{sec03:eq03}) does not necessarily involve the +convergence of the multisplitting method. It strongly depends on the properties +of the global sparse linear system to be +solved~\cite{o1985multi,ref18}. Furthermore, the splitting of the linear system +among several clusters of processors increases the spectral radius of the +iteration matrix, thereby slowing the convergence. In fact, the larger the +number of splitting is, the larger the spectral radius is. In this paper, we +based on the work presented in~\cite{huang1993krylov} to increase the +convergence and improve the scalability of the multisplitting methods. + +In order to accelerate the convergence, we implemented the outer iteration of the multisplitting solver as a Krylov iterative method which minimizes some error function over a Krylov subspace~\cite{S96}. The Krylov subspace that we used is spanned by a basis composed of successive solutions issued from solving the $L$ splittings~(\ref{sec03:eq03}) +\begin{equation} +S=\{x^1,x^2,\ldots,x^s\},~s\leq n, +\label{sec03:eq04} +\end{equation} +where for $j\in\{1,\ldots,s\}$, $x^j=[X_1^j,\ldots,X_L^j]$ is a solution of the global linear system. The advantage of such a Krylov subspace is that we need neither an orthogonal basis nor synchronizations between clusters to generate this basis. +The multisplitting method is periodically restarted every $s$ iterations with a new initial guess $\tilde{x}=S\alpha$ which minimizes the error function $\|b-Ax\|_2$ over the Krylov subspace spanned by vectors of $S$. So $\alpha$ is defined as the solution of the large overdetermined linear system +\begin{equation} +R\alpha=b, +\label{sec03:eq05} +\end{equation} +where $R=AS$ is a dense rectangular matrix of size $n\times s$ and $s\ll n$. This leads us to solve a system of normal equations +\begin{equation} +R^TR\alpha=R^Tb, +\label{sec03:eq06} +\end{equation} +which is associated with the least squares problem +\begin{equation} +\text{minimize}~\|b-R\alpha\|_2, +\label{sec03:eq07} +\end{equation} +where $R^T$ denotes the transpose of matrix $R$. Since $R$ (i.e. $AS$) and $b$ are split among $L$ clusters, the symmetric positive definite system~(\ref{sec03:eq06}) is solved in parallel. Thus an iterative method would be more appropriate than a direct one to solve this system. We use the parallel Conjugate Gradient method for the normal equations CGNR~\cite{S96,refCGNR}. + +\begin{algorithm}[!t] +\caption{A two-stage linear solver with inner iteration GMRES method} +\begin{algorithmic}[1] +\Input $A_\ell$ (sparse sub-matrix), $B_\ell$ (right-hand side sub-vector) +\Output $X_\ell$ (solution sub-vector)\vspace{0.2cm} +\State Load $A_\ell$, $B_\ell$ +\State Set the initial guess $x^0$ +\State Set the minimizer $\tilde{x}^0=x^0$ +\For {$k=1,2,3,\ldots$ until the global convergence} +\State Restart with $x^0=\tilde{x}^{k-1}$: +\For {$j=1,2,\ldots,s$} +\State \label{line7}Inner iteration solver: \Call{InnerSolver}{$x^0$, $j$} +\State Construct basis $S$: add column vector $X_\ell^j$ to the matrix $S_\ell^k$ +\State Exchange local values of $X_\ell^j$ with the neighboring clusters +\State Compute dense matrix $R$: $R_\ell^{k,j}=\sum^L_{i=1}A_{\ell i}X_i^j$ +\EndFor +\State \label{line12}Minimization $\|b-R\alpha\|_2$: \Call{UpdateMinimizer}{$S_\ell$, $R$, $b$, $k$} +\State Local solution of linear system $Ax=b$: $X_\ell^k=\tilde{X}_\ell^k$ +\State Exchange the local minimizer $\tilde{X}_\ell^k$ with the neighboring clusters +\EndFor + +\Statex + +\Function {InnerSolver}{$x^0$, $j$} +\State Compute local right-hand side $Y_\ell = B_\ell - \sum^L_{\substack{m=1\\m\neq \ell}}A_{\ell m}X_m^0$ +\State Solving local splitting $A_{\ell \ell}X_\ell^j=Y_\ell$ using parallel GMRES method, such that $X_\ell^0$ is the initial guess +\State \Return $X_\ell^j$ +\EndFunction + +\Statex + +\Function {UpdateMinimizer}{$S_\ell$, $R$, $b$, $k$} +\State Solving normal equations $(R^k)^TR^k\alpha^k=(R^k)^Tb$ in parallel by $L$ clusters using parallel CGNR method +\State Compute local minimizer $\tilde{X}_\ell^k=S_\ell^k\alpha^k$ +\State \Return $\tilde{X}_\ell^k$ +\EndFunction +\end{algorithmic} +\label{algo:01} +\end{algorithm} + +The main key points of our Krylov multisplitting method to solve a large sparse linear system are given in Algorithm~\ref{algo:01}. This algorithm is based on a two-stage method with a minimization using restarted GMRES iterative method as an inner solver. It is executed in parallel by each cluster of processors. Matrices and vectors with the subscript $\ell$ represent the local data for cluster $\ell$, where $\ell\in\{1,\ldots,L\}$. The two-stage solver uses two different parallel iterative algorithms: GMRES method to solve each splitting~(\ref{sec03:eq03}) on a cluster of processors, and CGNR method executed in parallel by all clusters to minimize the function error~(\ref{sec03:eq07}) over the Krylov subspace spanned by $S$. The algorithm requires two global synchronizations between $L$ clusters. The first one is performed at line~\ref{line12} in Algorithm~\ref{algo:01} to exchange local values of vector solution $x$ (i.e. the minimizer $\tilde{x}$) required to restart the multisplitting solver. The second one is needed to construct the matrix $R$. We chose to perform this latter synchronization $s$ times in every outer iteration $k$ (line~\ref{line7} in Algorithm~\ref{algo:01}). This is a straightforward way to compute the sparse matrix-dense matrix multiplication $R=AS$. We implemented all synchronizations by using message passing collective communications of MPI library. +%%%%%%%%%%%%%%%%%%%%%%%% +%%%%%%%%%%%%%%%%%%%%%%%% +\section{Experiments} +\label{sec:04} +In order to illustrate the interest of our algorithm. We have compared our +algorithm with the GMRES method which is a very well used method in many +situations. We have chosen to focus on only one problem which is very simple to +implement: a 3 dimension Poisson problem. +\begin{equation} +\left\{ + \begin{array}{ll} + \nabla u&=f \mbox{~in~} \omega\\ + u &=0 \mbox{~on~} \Gamma=\partial \omega + \end{array} + \right. +\end{equation} +After discretization, with a finite difference scheme, a seven point stencil is +used. It is well-known that the spectral radius of matrices representing such +problems are very close to 1. Moreover, the larger the number of discretization +points is, the closer to 1 the spectral radius is. Hence, to solve a matrix +obtained for a 3D Poisson problem, the number of iterations is high. Using a +preconditioner it is possible to reduce the number of iterations but +preconditioners are not scalable when using many cores. + +%Doing many experiments with many cores is not easy and requires to access to a supercomputer with several hours for developing a code and then improving it. +In the following we present some experiments we could achieved out on the Hector +architecture, a UK's high-end computing resource, funded by the UK Research +Councils~\cite{hector}. This is a Cray XE6 supercomputer, equipped with two +16-core AMD Opteron 2.3 Ghz and 32 GB of memory. Machines are interconnected +with a 3D torus. + +Table~\ref{tab1} shows the result of the experiments. The first column shows +the size of the 3D Poisson problem. The size is chosen in order to have +approximately 50,000 components per core. The second column represents the +number of cores used. In parenthesis, there is the decomposition used for the +Krylov multisplitting. The third column and the sixth column respectively show +the execution time for the GMRES and the Krylov multisplitting codes. The fourth +and the seventh column describes the number of iterations. For the +multisplitting code, the total number of inner iterations is represented in +parenthesis. For the GMRES code (alone and in the multisplitting version) the +restart parameter is fixed to 16. The precision of the GMRES version is fixed to +1e-6. For the multisplitting, there are two precisions, one for the external +solver which is fixed to 1e-6 and another one for the inner solver (GMRES) which +is fixed to 1e-10. It should be noted that a high precision is used but we also +fixed a maximum number of iterations for each internal step. In practice, we +limit the number of iterations in the internal step to 10. So an internal iteration is finished +when the precision is reached or when the maximum internal number of iterations +is reached. The precision and the maximum number of iterations of CGNR method are fixed to 1e-25 and 20 respectively. The size of the Krylov subspace basis $S$ is fixed to 10 vectors. + +\begin{table}[htbp] +\begin{center} +\begin{changemargin}{-1.8cm}{0cm} +\begin{small} +\begin{tabular}{|c|c||c|c|c||c|c|c||c|} +\hline +\multirow{2}{*}{Pb size}&\multirow{2}{*}{Nb. cores} & \multicolumn{3}{c||}{GMRES} & \multicolumn{3}{c||}{Krylov Multisplitting} & \multirow{2}{*}{Ratio}\\ + \cline{3-8} + & & Time (s) & nb Iter. & $\Delta$ & Time (s)& nb Iter. & $\Delta$ & \\ +\hline +$468^3$ & 2,048 (2x1,024) & 299.7 & 41,028 & 5.02e-8 & 48.4 & 691(6,146) & 8.24e-08 & 6.19 \\ +\hline +$590^3$ & 4,096 (2x2,048) & 433.1 & 55,494 & 4.92e-7 & 74.1 & 1,101(8,211) & 6.62e-08 & 5.84 \\ +\hline +$743^3$ & 8,192 (2x4,096) & 704.4 & 87,822 & 4.80e-07 & 151.2 & 3,061(14,914) & 5.87e-08 & 4.65 \\ +\hline +$743^3$ & 8,192 (4x2,048) & 704.4 & 87,822 & 4.80e-07 & 110.3 & 1,531(12,721) & 1.47e-07& 6.39 \\ +\hline + +\end{tabular} +\caption{Results} +\label{tab1} +\end{small} +\end{changemargin} +\end{center} +\end{table} + + +From these experiments, it can be observed that the multisplitting version is +always faster than the GMRES version. The acceleration gain of the +multisplitting version is between 4 and 6. It can be noticed that the number of +iterations is drastically reduced with the multisplitting version even it is not +neglectable. Moreover, with 8,192 cores, we can see that using 4 clusters gives +better performance than simply using 2 clusters. In fact, we can remark that the +precision with 2 clusters is slightly better but in both cases the precision is +under the specified threshold. + +\section{Conclusion and perspectives} +We have implemented a Krylov multisplitting method to solve sparse linear +systems on large-scale computing platforms. We have developed a synchronous +two-stage method based on the block Jacobi multisaplitting which uses GMRES +iterative method as an inner iteration. Our contribution in this paper is +twofold. First we provide a multi cluster decomposition that allows us to choose +the appropriate size of the clusters according to the architecures of the +supercomputer. Second, we have implemented the outer iteration of the +multisplitting method as a Krylov subspace method which minimizes some error +function. This increases the convergence and improves the scalability of the +multisplitting method. + +We have tested our multisplitting method to solve the sparse linear system +issued from the discretization of a 3D Poisson problem. We have compared its +performances to the classical GMRES method on a supercomputer composed of 2,048 +to 8,192 cores. The experimental results showed that the multisplitting method is +about 4 to 6 times faster than the GMRES method for different sizes of the +problem split into 2 or 4 blocks when using multisplitting method. Indeed, the +GMRES method has difficulties to scale with many cores while the Krylov +multisplitting method allows to hide latency and reduce the inter-cluster +communications. + +In future works, we plan to conduct experiments on larger number of cores and +test the scalability of our Krylov multisplitting method. It would be +interesting to validate its performances to solve other linear/nonlinear and +symmetric/nonsymmetric problems. Moreover, we intend to develop multisplitting +methods based on asynchronous iteration in which communications are overlapped +by computations. These methods would be interesting for platforms composed of +distant clusters interconnected by a high-latency network. In addition, we +intend to investigate the convergence improvements of our method by using +preconditioning techniques for Krylov iterative methods and multisplitting +methods with overlapping blocks. + +\section{Acknowledgement} +The authors would like to thank Mark Bull of the EPCC his fruitful remarks and the facilities of HECToR. + +%Other applications (=> other matrices)\\ +%Larger experiments\\ +%Async\\ +%Overlapping\\ +%preconditioning %%%%%%%%%%%%%%%%%%%%%%%%