krylov_multi_reviewed.tex

   1 \documentclass{article}
   2 \usepackage[utf8]{inputenc}
   3 \usepackage{amsfonts,amssymb}
   4 \usepackage{amsmath}
   5 \usepackage{graphicx}
   6 \usepackage{algorithm}
   7 \usepackage{algpseudocode}
   8 \usepackage{multirow}
   9 \usepackage{authblk}
  10
  11
  12 \algnewcommand\algorithmicinput{\textbf{I1nput:}}
  13 \algnewcommand\Input{\item[\algorithmicinput]}
  14
  15 \algnewcommand\algorithmicoutput{\textbf{Output:}}
  16 \algnewcommand\Output{\item[\algorithmicoutput]}
  17
  18 \newcommand{\Time}[1]{\mathit{Time}_\mathit{#1}}
  19 \newcommand{\Prec}{\mathit{prec}}
  20 \newcommand{\Ratio}{\mathit{Ratio}}
  21
  22 \def\changemargin#1#2{\list{}{\rightmargin#2\leftmargin#1}\item[]}
  23 \let\endchangemargin=\endlist
  24
  25 \title{A scalable multisplitting algorithm to solve large sparse linear systems}
  26 \date{}
  27
  28 \author[1]{Raphaël Couturier}
  29 \author[2]{ Lilia Ziane Khodja}
  30 \affil[1]{ Femto-ST Institute\\
  31     University of Franche Comte\\
  32     France\\
  33     email: raphael.couturier@univ-fcomte.fr}
  34 \affil[2]{Inria Bordeaux Sud-Ouest\\
  35     France\\
  36     email: lilia.ziane@inria.fr}
  37 \begin{document}
  38
  39
  40 \maketitle
  41
  42
  43 %%%%%%%%%%%%%%%%%%%%%%%%
  44 %%%%%%%%%%%%%%%%%%%%%%%%
  45
  46 \begin{abstract}
  47 In  this paper  we  revisit  the Krylov  multisplitting  algorithm presented  in
  48 \cite{huang1993krylov}  which  uses  a  sequential  method to  minimize  the  Krylov
  49 iterations computed by a multisplitting algorithm. Our new algorithm is based on
  50 a  parallel multisplitting  algorithm  with few  blocks  of large  size using  a
  51 parallel GMRES method inside each block and on a parallel Krylov minimization in
  52 order to improve the convergence. Some large scale experiments with a 3D Poisson
  53 problem  are  presented  with  up   to  8,192  cores.   They  show  the  obtained
  54 improvements compared to a classical GMRES both in terms of number of iterations
  55 and in terms of execution times.
  56 \end{abstract}
  57
  58 %%%%%%%%%%%%%%%%%%%%%%%%
  59 %%%%%%%%%%%%%%%%%%%%%%%%
  60
  61 \section{Introduction}
  62 Iterative methods are used to solve  large sparse linear systems of equations of
  63 the form  $Ax=b$ because they are  easier to parallelize than  direct ones. Many
  64 iterative  methods have  been proposed  and  adapted by  different researchers.   For
  65 example, the GMRES method and the  Conjugate Gradient method are very well known
  66 and  used~\cite{S96}. Both methods  are based  on the
  67 Krylov subspace which consists in forming  a basis of a sequence of successive
  68 matrix powers times the initial residual.
  69
  70 When  solving large  linear systems  with  many cores,  iterative methods  often
  71 suffer  from scalability problems.   This is  due to  their need  for collective
  72 communications  to  perform  matrix-vector  products and  reduction  operations.
  73 Preconditioners can be  used in order to increase  the convergence of iterative
  74 solvers.   However, most  of the  good preconditioners  are not  scalable when
  75 thousands of cores are used.
  76
  77 %Traditional iterative  solvers have  global synchronizations that  penalize the
  78 %scalability.   Two  possible solutions  consists  either  in using  asynchronous
  79 %iterative  methods~\cite{ref18} or  to  use multisplitting  algorithms. In  this
  80 %paper, we will  reconsider the use of a multisplitting  method. In opposition to
  81 %traditional  multisplitting  method  that  suffer  from  slow  convergence,  as
  82 %proposed  in~\cite{huang1993krylov},  the  use  of a  minimization  process  can
  83 %drastically improve the convergence.
  84
  85 Traditional parallel iterative solvers are based on fine-grain computations that
  86 frequently  require  data exchanges  between  computing  nodes  and have  global
  87 synchronizations that penalize the scalability~\cite{zkcgb+14:ij}. Particularly,
  88 they are more penalized on large scale architectures or on distributed platforms
  89 composed of  distant clusters interconnected  by a high-latency network.   It is
  90 therefore  imperative to  develop coarse-grain  based algorithms  to  reduce the
  91 communications  in  the  parallel  iterative  solvers.  Two  possible  solutions
  92 consists either in using asynchronous iterative methods~\cite{ref18} or in using
  93 multisplitting  algorithms.  In  this paper,  we will  reconsider the  use  of a
  94 multisplitting method.  In opposition to traditional  multisplitting method that
  95 suffer from slow convergence,  as proposed in~\cite{huang1993krylov}, the use of
  96 a minimization process can drastically improve the convergence.\\
  97
  98
  99 %%% AJOUTE************************
 100 %%%*******************************
 101 \noindent  {\bf  Contributions:}\\  In  this  work we  develop  a  new  parallel
 102 two-stage algorithm for large-scale clusters.   Our objective is to create a mix
 103 between Krylov based iterative methods  and the multisplitting method to improve
 104 scalability.   In fact  Krylov subspace  methods are  well-known for  their good
 105 convergence compared to  other iterative methods.  So, our  main contribution is
 106 to  use  the  multisplitting method  which  splits  the  problem to  solve  into
 107 different blocks in  order to reduce the large amount  of communications and, to
 108 implement both inner and outer iterations as Krylov subspace iterations in order
 109 to improve the convergence of the multisplitting algorithm.\\
 110 %%%*******************************
 111 %%%*******************************
 112
 113 The present paper is  organized as follows. First, Section~\ref{sec:02} presents
 114 some  related  works and  the  principle  of  multisplitting methods.  Then,  in
 115 Section~\ref{sec:03}  the algorithm  of our  Krylov multisplitting
 116 method, based  on inner-outer  iterations, is presented. Finally, in  Section~\ref{sec:04}, the
 117 parallel experiments on Hector architecture  show the performances of the Krylov
 118 multisplitting algorithm compared to the classical GMRES algorithm to solve a 3D
 119 Poisson problem.
 120
 121
 122 %%%%%%%%%%%%%%%%%%%%%%%%
 123 %%%%%%%%%%%%%%%%%%%%%%%%
 124
 125 \section{Related works and presentation of the multisplitting method}
 126 \label{sec:02}
 127 A general framework  to study parallel multisplitting methods has  been presented in~\cite{o1985multi}
 128 by O'Leary and White. Convergence conditions are given for the
 129 most general cases.  Many authors have improved multisplitting algorithms by proposing,
 130 for  example,  an  asynchronous  version~\cite{bru1995parallel} or  convergence
 131 conditions~\cite{bai1999block,bahi2000asynchronous}     or  other
 132 two-stage algorithms~\cite{frommer1992h,bru1995parallel}.
 133
 134 In~\cite{huang1993krylov},  the  authors  have proposed  a  parallel  multisplitting
 135 algorithm in which all the tasks except  one are devoted to solve a sub-block of
 136 the splitting  and to send their  local solutions to  the first task which  is in
 137 charge of  combining the vectors at  each iteration.  These vectors  form a Krylov
 138 basis for  which the first task minimizes  the error function over  the basis to
 139 increase the convergence, then the other tasks receive the updated solution until the
 140 convergence of the global system.
 141
 142 In~\cite{couturier2008gremlins}, the  authors have developed practical implementations
 143 of multisplitting algorithms to solve  large scale linear systems. Inner solvers
 144 could be  based on sequential direct method  with the LU method  or sequential iterative
 145 one with GMRES.
 146
 147 In~\cite{prace-multi},  the  authors have  designed a  parallel  multisplitting
 148 algorithm in which large blocks are solved using a GMRES solver. The authors have
 149 performed large scale experiments up-to  32,768 cores and they conclude that
 150 an asynchronous  multisplitting algorithm  could be more  efficient  than traditional
 151 solvers on an exascale architecture with hundreds of thousands of cores.
 152
 153 So, compared to these works, we propose in this paper a practical multisplitting method based on parallel iterative blocks which gives better results than classical GMRES method for the 3D Poisson problem we considered.
 154 \\
 155
 156 The key idea of a multisplitting method to solve a large system of linear equations $Ax=b$ is defined as follows. The first step consists in partitioning the matrix $A$ in $L$ several ways
 157 \begin{equation}
 158 A = M_\ell - N_\ell,
 159 \label{eq01}
 160 \end{equation}
 161 where for all $\ell\in\{1,\ldots,L\}$ $M_\ell$ are non-singular matrices. Then the linear system is solved by an iteration based on the obtained splittings as follows
 162 \begin{equation}
 163 x^{k+1}=\displaystyle\sum^L_{\ell=1} E_\ell M^{-1}_\ell (N_\ell x^k + b),~k=1,2,3,\ldots
 164 \label{eq02}
 165 \end{equation}
 166 where $E_\ell$ are non-negative and diagonal weighting matrices and their sum is an identity matrix $I$. The convergence of such a method is dependent on the condition
 167 \begin{equation}
 168 \rho(\displaystyle\sum^L_{\ell=1}E_\ell M^{-1}_\ell N_\ell)<1.
 169 \label{eq03}
 170 \end{equation}
 171 where $\rho$ is the spectral radius of the square matrix.
 172
 173 The advantage of the multisplitting method is that at each iteration $k$ there are $L$ different linear sub-systems
 174 \begin{equation}
 175 v_\ell^k=M^{-1}_\ell N_\ell x_\ell^{k-1} + M^{-1}_\ell b,~\ell\in\{1,\ldots,L\},
 176 \label{eq04}
 177 \end{equation}
 178 to be solved independently by a direct or an iterative method, where $v_\ell$ is the solution of the local sub-system. Thus the computations of $\{v_\ell\}_{1\leq \ell\leq L}$ may be performed in parallel by a set of processors. A multisplitting method using an iterative method as an inner solver is called an inner-outer iterative method or a two-stage method. The results $v_\ell$ obtained from the different splittings~(\ref{eq04}) are combined to compute solution $x$ of the linear system by using the diagonal weighting matrices
 179 \begin{equation}
 180 x^k = \displaystyle\sum^L_{\ell=1} E_\ell v_\ell^k,
 181 \label{eq05}
 182 \end{equation}
 183 In the case where the diagonal weighting matrices $E_\ell$ have only zero and one factors (i.e. $v_\ell$ are disjoint vectors), the multisplitting method is non-overlapping and corresponds to the block Jacobi method.
 184
 185 %%%%%%%%%%%%%%%%%%%%%%%%
 186 %%%%%%%%%%%%%%%%%%%%%%%%
 187
 188 \section{A two-stage method with a minimization}
 189 \label{sec:03}
 190
 191 %%% MODIFIE ************************
 192 %%%*********************************
 193 Let $Ax=b$ be a given large and sparse linear system of $n$ equations where $A\in\mathbb{R}^{n\times n}$ is a sparse square and non-singular matrix, $x\in\mathbb{R}^{n}$ is the solution vector and $b\in\mathbb{R}^{n}$ is the right-hand side vector. We use a multisplitting method to solve the linear system on a large computing platform in order to reduce communications. Let the computing platform be composed of $L$ blocks of processors physically adjacent or geographically distant. In this work we apply the block Jacobi multisplitting method to the linear system as follows
 194 %%%*********************************
 195 %%%*********************************
 196
 197
 198 \begin{equation}
 199 \left\{
 200 \begin{array}{lll}
 201 A & = & [A_{1}, \ldots, A_{L}]\\
 202 x & = & [X_{1}, \ldots, X_{L}]\\
 203 b & = & [B_{1}, \ldots, B_{L}]
 204 \end{array}
 205 \right.
 206 \label{sec03:eq01}
 207 \end{equation}
 208 where for $\ell\in\{1,\ldots,L\}$, $A_\ell$ is a rectangular block of size $n_\ell\times n$ and $X_\ell$ and $B_\ell$ are sub-vectors of size $n_\ell$ each, such that $\sum_\ell n_\ell=n$.
 209 %%% MODIFIE ***********************
 210 %%%********************************
 211 The splitting is performed row-by-row without overlapping in such a way that successive rows of sparse matrix $A$ and both vectors $x$ and $b$ are assigned to a block of processors.
 212 %%%********************************
 213 %%%********************************
 214 So, the multisplitting format of the linear system is defined as follows
 215 \begin{equation}
 216 \forall \ell\in\{1,\ldots,L\} \mbox{,~} A_{\ell \ell}X_\ell + \displaystyle\sum_{\substack{m=1\\m\neq\ell}}^L A_{\ell m}X_m = B_\ell,
 217 \label{sec03:eq02}
 218 \end{equation}
 219 where $A_{\ell m}$ is a sub-block of size $n_\ell\times  n_m$ of the rectangular matrix $A_\ell$, $X_m\neq  X_\ell$ is a sub-vector of size $n_m$ of the solution vector $x$ and $\sum_{m\neq \ell}n_m+n_\ell=n$, for all $m\in\{1,\ldots,L\}$.
 220
 221 Our multisplitting method proceeds by iteration to solve the linear system in such a way that each sub-system
 222 \begin{equation}
 223 \left\{
 224 \begin{array}{l}
 225 A_{\ell \ell}X_\ell = Y_\ell \mbox{,~such that}\\
 226 Y_\ell = B_\ell - \displaystyle\sum_{\substack{m=1\\m\neq \ell}}^{L}A_{\ell m}X_m,
 227 \end{array}
 228 \right.
 229 \label{sec03:eq03}
 230 \end{equation}
 231 is solved independently by a {\it block of processors} and communications are required to update the right-hand side vectors $Y_\ell$, such that the vectors $X_m$ represent the data dependencies between the blocks. In this work, we use the parallel restarted GMRES method~\cite{ref34} as an inner iteration method to solve sub-systems~(\ref{sec03:eq03}).
 232 %%% MODIFIE ***********************
 233 %%%********************************
 234 GMRES is one of the most used Krylov iterative methods to solve sparse linear systems by minimizing the residuals over an orthonormal basis of a Krylov subspace.
 235 %%%********************************
 236 %%%********************************
 237 %In practice, GMRES is used with a preconditioner to improve its convergence. In this work, we used a preconditioning matrix equivalent to the main diagonal of sparse sub-matrix $A_{ll}$. This preconditioner is straightforward to implement in parallel and gives good performances in many situations.
 238
 239 It should  be noted that the convergence  of the inner iterative  solver for the
 240 different  sub-systems~(\ref{sec03:eq03})  does   not  necessarily  involve  the
 241 convergence of the multisplitting algorithm.  It strongly depends on the properties
 242 of       the       global      sparse       linear       system      to       be
 243 solved~\cite{o1985multi,ref18}. Furthermore, the  splitting of the linear system
 244 among  several blocks  of  processors  increases the  spectral  radius of  the
 245 iteration  matrix, thereby  slowing the  convergence.  In  fact, the  larger the
 246 number of  splittings is, the larger the  spectral radius is.  In  this paper, our
 247 work is based  on   the  work   presented  in~\cite{huang1993krylov}  to   increase  the
 248 convergence and improve the scalability of the multisplitting methods.
 249
 250 %%% AJOUTE ************************
 251 %%%********************************
 252 Krylov subspace methods are well-known for their good convergence compared to other iterative methods.
 253 %%%********************************
 254 %%%********************************
 255 In order to accelerate the convergence, we implemented the outer iteration of our multisplitting solver as a Krylov iterative method which minimizes some error function over a Krylov subspace~\cite{S96}. The Krylov subspace that we used is spanned by a basis composed of successive solutions issued from solving the $L$ splittings~(\ref{sec03:eq03})
 256 \begin{equation}
 257 S=\{x^1,x^2,\ldots,x^s\},~s\leq n,
 258 \label{sec03:eq04}
 259 \end{equation}
 260 where for $j\in\{1,\ldots,s\}$, $x^j=[X_1^j,\ldots,X_L^j]$ is a solution of the global linear system.
 261 %%% MODIFIE ***********************
 262 %%%********************************
 263 The advantage of such a Krylov subspace is that we neither need an orthonormal basis nor any synchronization between the different blocks to orthogonalize the generated basis. This avoids to perform other synchronizations between the blocks of processors.
 264
 265 The multisplitting method is periodically restarted every $s$ iterations with a new initial guess $\tilde{x}=S\alpha$ which minimizes an error function, in our case it minimizes the residuals $\|b-Ax\|_2$ over the Krylov subspace spanned by vectors of $S$. So $\alpha$ is defined as the solution of the large overdetermined linear system.
 266 %%%********************************
 267 %%%********************************
 268
 269 \begin{equation}
 270 R\alpha=b,
 271 \label{sec03:eq05}
 272 \end{equation}
 273 where $R=AS$ is a dense rectangular matrix of size $n\times s$ and $s\ll n$. This leads us to solve a system of normal equations
 274 \begin{equation}
 275 R^TR\alpha=R^Tb,
 276 \label{sec03:eq06}
 277 \end{equation}
 278 which is associated with the least squares problem
 279 \begin{equation}
 280 \text{minimize}~\|b-R\alpha\|_2,
 281 \label{sec03:eq07}
 282 \end{equation}
 283 where $R^T$ denotes the transpose of matrix $R$. Since $R$ (i.e. $AS$) and $b$ are split among $L$ blocks, the symmetric positive definite system~(\ref{sec03:eq06}) is solved in parallel. Thus an iterative method would be more appropriate than a direct one to solve this system. We use the parallel Conjugate Gradient method for the normal equations CGNR~\cite{S96,refCGNR}.
 284
 285 \begin{algorithm}[!t]
 286 \caption{A two-stage linear solver with inner iteration GMRES method}
 287 \begin{algorithmic}[1]
 288 \Input $A_\ell$ (sparse sub-matrix), $B_\ell$ (right-hand side sub-vector)
 289 \Output $X_\ell$ (solution sub-vector)\vspace{0.2cm}
 290 \State Load $A_\ell$, $B_\ell$
 291 \State Set the initial guess $x^0$
 292 \State Set the minimizer $\tilde{x}^0=x^0$
 293 \For {$k=1,2,3,\ldots$ until the global convergence}
 294 \State Restart with $x^0=\tilde{x}^{k-1}$:
 295 \For {$j=1,2,\ldots,s$}
 296 \State \label{line7}Inner iteration solver: \Call{InnerSolver}{$x^0$, $j$}
 297 \State Construct basis $S$: add column vector $X_\ell^j$ to the matrix $S_\ell^k$
 298 \State Exchange local values of $X_\ell^j$ with the neighboring blocks
 299 \State Compute dense matrix $R$: $R_\ell^{k,j}=\sum^L_{i=1}A_{\ell i}X_i^j$
 300 \EndFor
 301 \State \label{line12}Minimization $\|b-R\alpha\|_2$: \Call{UpdateMinimizer}{$S_\ell$, $R$, $b$, $k$}
 302 \State Local solution of linear system $Ax=b$: $X_\ell^k=\tilde{X}_\ell^k$
 303 \State Exchange the local minimizer $\tilde{X}_\ell^k$ with the neighboring blocks
 304 \EndFor
 305
 306 \Statex
 307
 308 \Function {InnerSolver}{$x^0$, $j$}
 309 \State Compute local right-hand side $Y_\ell = B_\ell - \sum^L_{\substack{m=1\\m\neq \ell}}A_{\ell m}X_m^0$
 310 \State Solving local splitting $A_{\ell \ell}X_\ell^j=Y_\ell$ using parallel GMRES method, such that $X_\ell^0$ is the initial guess
 311 \State \Return $X_\ell^j$
 312 \EndFunction
 313
 314 \Statex
 315
 316 \Function {UpdateMinimizer}{$S_\ell$, $R$, $b$, $k$}
 317 \State Solving normal equations $(R^k)^TR^k\alpha^k=(R^k)^Tb$ in parallel by $L$ blocks using parallel CGNR method
 318 \State Compute local minimizer $\tilde{X}_\ell^k=S_\ell^k\alpha^k$
 319 \State \Return $\tilde{X}_\ell^k$
 320 \EndFunction
 321 \end{algorithmic}
 322 \label{algo:01}
 323 \end{algorithm}
 324
 325 The main key points of our Krylov multisplitting method to solve a large sparse linear system are given in Algorithm~\ref{algo:01}. This algorithm is based on a two-stage method with a minimization using restarted GMRES iterative method as an inner solver. It is executed in parallel by each block of processors. Matrices and vectors with the subscript $\ell$ represent the local data for block $\ell$, where $\ell\in\{1,\ldots,L\}$. The two-stage solver uses two different parallel iterative algorithms: the GMRES method to solve each splitting~(\ref{sec03:eq03}) on a block of processors, and the CGNR method, executed periodically in parallel by all blocks to minimize the function error~(\ref{sec03:eq07}) over the Krylov subspace spanned by $S$. The algorithm requires two global synchronizations between $L$ blocks. The first one is performed  line~\ref{line12} in Algorithm~\ref{algo:01} to exchange local values of vector solution $x$ (i.e. the minimizer $\tilde{x}$) required to restart the multisplitting solver. The second one is needed to construct the matrix $R$. We chose to perform this latter synchronization $s$ times in every outer iteration $k$ (line~\ref{line7} in Algorithm~\ref{algo:01}). This is a straightforward way to compute the sparse matrix-dense matrix multiplication $R=AS$. We implemented all synchronizations by using message passing collective communications of MPI library.
 326
 327 %%%%%%%%%%%%%%%%%%%%%%%%
 328 %%%%%%%%%%%%%%%%%%%%%%%%
 329
 330 \section{Experiments}
 331 \label{sec:04}
 332 %%% MODIFIE ***********************
 333 %%%********************************
 334 In order to illustrate the interest of our Krylov multisplitting algorithm, we have compared its performances with those of a classical block Jacobi multisplitting method and those of the GMRES method which is a commonly used method in many situations.
 335 %%%********************************
 336 %%%********************************
 337  We have chosen to focus on only one problem which is very simple to implement: a 3 dimension Poisson problem.
 338
 339 \begin{equation}
 340 \left\{
 341                 \begin{array}{ll}
 342                   \nabla u&=f \mbox{~in~} \omega\\
 343                   u &=0 \mbox{~on~}  \Gamma=\partial \omega
 344                 \end{array}
 345               \right.
 346 \end{equation}
 347
 348 After discretization, with a finite  difference scheme, a seven point stencil is
 349 used. It  is well-known that the  spectral radius of  matrices representing such
 350 problems are very close to 1.  Moreover, the larger the number of discretization
 351 points is,  the closer to 1  the spectral radius  is.  Hence, to solve  a matrix
 352 obtained for  a 3D Poisson  problem, the number  of iterations is high.  Using a
 353 preconditioner  it  is   possible  to  reduce  the  number   of  iterations  but
 354 preconditioners are not scalable when using many cores.
 355
 356
 357 %%% AJOUTE ************************
 358 %%%********************************
 359 We have performed some experiments on  an infiniband cluster of three Intel Xeon
 360 quad-core E5620 CPUs of 2.40 GHz and  12 GB of memory. For the GMRES code (alone
 361 and in both  multisplitting versions) the restart parameter is  fixed to 16. The
 362 precision  of  the  GMRES version  is  fixed  to  1e-6. For  the  multisplitting
 363 versions, there are  two precisions, one for the external  solver which is fixed
 364 to 1e-6 and another one for the inner solver (GMRES) which is fixed to 1e-10. It
 365 should be noted that a high precision is used but we also fixed a maximum number
 366 of  iterations for  each internal  step.  In practice,  we limit  the number  of
 367 iterations in the internal step to 10. So an internal iteration is finished when
 368 the precision  is reached or when  the maximum internal number  of iterations is
 369 reached. The precision and the maximum  number of iterations of CGNR method used
 370 by   our  Krylov   multisplitting  algorithm   are   fixed  to   1e-25  and   20
 371 respectively. The size of the Krylov subspace basis $S$ is fixed to 10 vectors.
 372
 373 \begin{figure}[htbp]
 374 \centering
 375   \includegraphics[width=0.8\textwidth]{strong_scaling_150x150x150}
 376 \caption{Strong scaling with 3 blocks of 4 cores each to solve a 3D Poisson problem of size $150^3$ components}
 377 \label{fig:001}
 378 \end{figure}
 379
 380 \begin{figure}[htbp]
 381 \centering
 382 \begin{tabular}{c}
 383 \includegraphics[width=0.8\textwidth]{weak_scaling_280k} \\ \includegraphics[width=0.8\textwidth]{weak_scaling_280K2}\\
 384 \end{tabular}
 385 \caption{Weak scaling with 3 blocks of 4 cores each to solve a 3D Poisson problem with approximately 280K components per core}
 386 \label{fig:002}
 387 \end{figure}
 388
 389 %%The experiments are performed on 3 different clusters of cores interconnected by an infiniband network (each cluster is a quad-core CPU).
 390 Figures~\ref{fig:001}  and~\ref{fig:002} show  the  scalability performances  of
 391 GMRES, classical  multisplitting and  Krylov multisplitting methods:  strong and
 392 weak scaling are  presented respectively. We can remark  from these figures that
 393 the performances  of our Krylov multisplitting  method are better  than those of
 394 GMRES and classical multisplitting methods. In the experiments conducted in this
 395 work, our method is approximately twice faster than the GMRES method and about 9
 396 times faster than the classical multisplitting method. Our multisplitting method
 397 uses a  minimization step  over a  Krylov subspace which  reduces the  number of
 398 iterations  and  accelerates the  convergence.   We  can  also remark  that  the
 399 performances of the  classical block Jacobi multisplitting method  are the worst
 400 compared with  those of  the other two  methods.  This  is why in  the following
 401 experiments we compare the performances of our Krylov multisplitting method with
 402 only those of the GMRES method.
 403 %%%********************************
 404 %%%********************************
 405
 406
 407 %%% MODIFIE ************************
 408 %%%*********************************
 409 %Doing many experiments  with many cores is  not easy and requires to  access to a supercomputer  with several  hours for  developing  a code  and then  improving it.
 410 In the following we present some experiments we could achieve out on the Hector
 411 architecture,  a UK  high-end computing  resource, funded  by the  UK Research
 412 Councils~\cite{hector}.  This is  a Cray  XE6 supercomputer,  equipped  with two
 413 16-core AMD  Opteron 2.3 GHz  and 32 GB  of memory. Machines  are interconnected
 414 with a 3D torus. The different parameters used by the GMRES and the Krylov multisplitting codes are as those previously mentioned.
 415
 416 Table~\ref{tab1} shows  the result of  the experiments.  The first  column shows
 417 the  size of  the 3D  Poisson problem.   The  size is  chosen in  order to  have
 418 approximately  50,000 components  per core.   The second  column  represents the
 419 number of cores used. Between brackets,  one can find the decomposition used for
 420 the Krylov  multisplitting. The third  column and the sixth  column respectively
 421 show the execution  time for the GMRES and the  Krylov multisplitting codes. The
 422 fourth  and the  seventh  column describe  the  number of  iterations.  For  the
 423 multisplitting code, the total number of inner iterations is represented between
 424 brackets.
 425 %%%********************************
 426 %%%********************************
 427
 428 \begin{table}[htbp]
 429 \begin{center}
 430 \begin{changemargin}{-1.8cm}{0cm}
 431 \begin{small}
 432 \begin{tabular}{|c|c||c|c|c||c|c|c||c|}
 433 \hline
 434 \multirow{2}{*}{Pb size}&\multirow{2}{*}{Nb. cores} &  \multicolumn{3}{c||}{GMRES} &  \multicolumn{3}{c||}{Krylov Multisplitting} & \multirow{2}{*}{Ratio}\\
 435  \cline{3-8}
 436            &                   &  Time (s) & nb Iter. & $\Delta$  &   Time (s)& nb Iter. & $\Delta$ & \\
 437 \hline
 438 $468^3$ & 2,048 (2x1,024)        &  299.7    & 41,028    & 5.02e-8  &  48.4    & 691(6,146) & 8.24e-08  & 6.19   \\
 439 \hline
 440 $590^3$ & 4,096 (2x2,048)        &  433.1    & 55,494    & 4.92e-7  &  74.1    & 1,101(8,211) & 6.62e-08  & 5.84   \\
 441 \hline
 442 $743^3$ & 8,192 (2x4,096)        & 704.4     & 87,822    & 4.80e-07 &  151.2   & 3,061(14,914) & 5.87e-08 & 4.65    \\
 443 \hline
 444 $743^3$ & 8,192 (4x2,048)        & 704.4     & 87,822    & 4.80e-07 &  110.3   & 1,531(12,721) & 1.47e-07& 6.39  \\
 445 \hline
 446
 447 \end{tabular}
 448 \caption{Results}
 449 \label{tab1}
 450 \end{small}
 451 \end{changemargin}
 452 \end{center}
 453 \end{table}
 454
 455 From these  experiments, it can be  observed that the  multisplitting version is
 456 always  faster   than  the  GMRES   version.   The  acceleration  gain   of  the
 457 multisplitting version ranges between 4 and 6.  It can be noticed that the number of
 458 iterations is drastically reduced with the multisplitting version even it is not
 459 negligible. Moreover, with 8,192 cores, we  can see that using 4 blocks of cores gives a
 460 better performance than simply using 2 blocks. In fact, we can notice that the
 461 precision with 2 blocks is slightly  better but in both cases the precision is
 462 under the specified threshold.
 463
 464
 465 %%% AJOUTE************************
 466 %%%*******************************
 467 In Figure~\ref{fig:01}, the number of iterations per second is reported for both
 468 GMRES and the  multisplitting methods. It should be noted that  we took only the
 469 inner number of  iterations (i.e.  the GMRES iterations)  for the multisplitting
 470 method. Iterations of CGNR are not  taken into account. From this figure, it can
 471 be seen that the number of iterations  per second is higher with GMRES but it is
 472 not  so different with  the multisplitting  method.  For  the case  with $8,192$
 473 cores, the number of iterations per second with 4 blocks is approximately equal
 474 to 115. So it is not different from GMRES.
 475
 476 \begin{figure}[htbp]
 477 \centering
 478   \includegraphics[width=0.7\textwidth]{nb_iter_sec}
 479 \caption{Number of iterations per second  with the same parameters as in Table~\ref{tab1} (weak scaling) with only  blocks of cores}
 480 \label{fig:01}
 481 \end{figure}
 482
 483 \noindent {\bf Final  remarks:}\\ It should be noted, on the  one hand, that the
 484 development of a complete  new code usable with any kind of  problem is a really
 485 long and  fastidious task  if one is  working from  scratch. On the  other hand,
 486 using an existing  tool for the inner solver is also  quite difficult because it
 487 requires to  establish a link  between the inner  solver and the outer  one.  We
 488 plan  to  do that  later  with engineers  working  specifically  on that  point.
 489 Moreover, we think that it is  very important to analyze the convergence of this
 490 method  compared  to  other methods.  In  this  work,  we  have focused  on  the
 491 description of this method and its performances with a typical application.  Many
 492 other  investigations are  required for  this method  as explained  in  the next
 493 section.
 494 %%%*******************************
 495 %%%*******************************
 496
 497 \section{Conclusion and perspectives}
 498 We  have implemented  a  Krylov  multisplitting method  to  solve sparse  linear
 499 systems  on large-scale computing  platforms.  We  have developed  a synchronous
 500 two-stage  method based  on the  block Jacobi  multisplitting which  uses GMRES
 501 iterative  method as  an inner  iteration.  Our  contribution in  this  paper is
 502 twofold. First we provide a multi block decomposition that allows us to choose
 503 the  appropriate size  of  the blocks  according  to the  architecures of  the
 504 supercomputer.  Second,   we  have  implemented  the  outer   iteration  of  the
 505 multisplitting method  as a  Krylov subspace method  which minimizes  some error
 506 function.  This  increases the convergence  and improves the scalability  of the
 507 multisplitting method.
 508
 509 We  have tested  our multisplitting  method to  solve the  sparse  linear system
 510 issued from  the discretization of  a 3D Poisson  problem. We have  compared its
 511 performances to the  classical GMRES method on a  supercomputer composed of 2,048
 512 up-to 8,192 cores. The experimental results showed that the multisplitting method is
 513 about 4  to 6  times faster  than the GMRES  method for  different sizes  of the
 514 problem split into  2 or 4 blocks when using the  multisplitting method. Indeed, the
 515 GMRES  method  has  difficulties to  scale  with  many  cores while  the  Krylov
 516 multisplitting  method  allows to  hide  latency  and  reduce the  inter-block
 517 communications.
 518
 519 In future  works, we plan to conduct  experiments on larger numbers  of cores and
 520 test  the  scalability  of  our   Krylov  multisplitting  method.  It  would  be
 521 interesting  to validate its  performances to  solve other  linear/nonlinear and
 522 symmetric/nonsymmetric problems.  Moreover, we intend  to develop multisplitting
 523 methods based  on asynchronous iterations in which  communications are overlapped
 524 by computations.  These methods would  be interesting for platforms  composed of
 525 distant  clusters interconnected  by  a high-latency  network.  In addition,  we
 526 intend  to investigate  the  convergence  improvements of  our  method by  using
 527 preconditioning  techniques  for  Krylov  iterative methods  and  multisplitting
 528 methods with overlapping blocks.
 529
 530 \section{Acknowledgement}
 531 The authors would like to thank Mark Bull of the EPCC his fruitful remarks and the facilities of HECToR. This work has been partially supported by the Labex
 532 ACTION project (contract “ANR-11-LABX-01-01”).
 533
 534
 535 %Other applications (=> other matrices)\\
 536 %Larger experiments\\
 537 %Async\\
 538 %Overlapping\\
 539 %preconditioning
 540
 541
 542 %%%%%%%%%%%%%%%%%%%%%%%%
 543 %%%%%%%%%%%%%%%%%%%%%%%%
 544
 545 \bibliographystyle{plain}
 546 \bibliography{biblio}
 547
 548 \end{document}