01-05-2014bb

[Krylov_multi.git] / krylov_multi.tex
diff --git a/krylov_multi.tex b/krylov_multi.tex

index 5a736703be2507a04a0fa4cc485d8d7423b9c473..275700cd936214c32e50f9593f8d08c56611053d 100644 (file)
--- a/krylov_multi.tex
+++ b/krylov_multi.tex
@@ -17,10 +17,8 @@
  \newcommand{\Prec}{\mathit{prec}}
  \newcommand{\Ratio}{\mathit{Ratio}}
  
-%\usepackage{xspace}
-%\usepackage[textsize=footnotesize]{todonotes}
-%\newcommand{\LZK}[2][inline]{%
-%\todo[color=green!40,#1]{\sffamily\textbf{LZK:} #2}\xspace}
+\def\changemargin#1#2{\list{}{\rightmargin#2\leftmargin#1}\item[]}
+\let\endchangemargin=\endlist
  
  \title{A scalable multisplitting algorithm for solving large sparse linear systems} 
  \date{}
@@ -38,7 +36,7 @@
  
  \begin{abstract}
  In  this paper  we  revisit  the Krylov  multisplitting  algorithm presented  in
-\cite{huang1993krylov}  which  uses  a  scalar  method to  minimize  the  Krylov
+\cite{huang1993krylov}  which  uses  a  sequential  method to  minimize  the  Krylov
  iterations computed by a multisplitting algorithm. Our new algorithm is based on
  a  parallel multisplitting  algorithm  with few  blocks  of large  size using  a
  parallel GMRES method inside each block and on a parallel Krylov minimization in
@@ -67,19 +65,41 @@ Preconditioners can be  used in order to increase  the convergence of iterative
  solvers.   However, most  of the  good preconditioners  are not  scalable when
  thousands of cores are used.
  
-Traditional iterative  solvers have  global synchronizations that  penalize the
-scalability.   Two  possible solutions  consists  either  in using  asynchronous
-iterative  methods~\cite{ref18} or  to  use multisplitting  algorithms. In  this
-paper, we will  reconsider the use of a multisplitting  method. In opposition to
-traditional  multisplitting  method  that  suffer  from  slow  convergence,  as
-proposed  in~\cite{huang1993krylov},  the  use  of a  minimization  process  can
-drastically improve the convergence.
+%Traditional iterative  solvers have  global synchronizations that  penalize the
+%scalability.   Two  possible solutions  consists  either  in using  asynchronous
+%iterative  methods~\cite{ref18} or  to  use multisplitting  algorithms. In  this
+%paper, we will  reconsider the use of a multisplitting  method. In opposition to
+%traditional  multisplitting  method  that  suffer  from  slow  convergence,  as
+%proposed  in~\cite{huang1993krylov},  the  use  of a  minimization  process  can
+%drastically improve the convergence.
+
+Traditional parallel iterative solvers are based on fine-grain computations that
+frequently  require  data exchanges  between  computing  nodes  and have  global
+synchronizations  that penalize  the  scalability. Particularly,  they are  more
+penalized on large  scale architectures or on distributed  platforms composed of
+distant  clusters interconnected  by  a high-latency  network.  It is  therefore
+imperative to develop coarse-grain based algorithms to reduce the communications
+in the  parallel iterative  solvers. Two possible  solutions consists  either in
+using  asynchronous  iterative  methods~\cite{ref18}  or to  use  multisplitting
+algorithms.  In this  paper,  we will  reconsider  the use  of a  multisplitting
+method. In opposition to traditional multisplitting method that suffer from slow
+convergence, as  proposed in~\cite{huang1993krylov},  the use of  a minimization
+process can drastically improve the convergence.
+
+The present paper is  organized as follows. First, Section~\ref{sec:02} presents
+some  related  works and  the  principle  of  multisplitting methods.  Then,  in
+Section~\ref{sec:03}  is presented  the algorithm  of our  Krylov multisplitting
+method based  on inner-outer  iterations. Finally, in  Section~\ref{sec:04}, the
+parallel experiments on Hector architecture  show the performances of the Krylov
+multisplitting algorithm compared to the classical GMRES algorithm to solve a 3D
+Poisson problem.
  
  
  %%%%%%%%%%%%%%%%%%%%%%%%
  %%%%%%%%%%%%%%%%%%%%%%%%
  
-\section{Related works and presention of the multisplitting method}
+\section{Related works and presentation of the multisplitting method}
+\label{sec:02}
  A general framework  for studying parallel multisplitting has  been presented in~\cite{o1985multi}
  by O'Leary and White. Convergence conditions are given for the
  most general case.  Many authors improved multisplitting algorithms by proposing
@@ -97,7 +117,7 @@ convergence of the global system.
  
  In~\cite{couturier2008gremlins}, the  authors proposed practical implementations
  of multisplitting algorithms to solve  large scale linear systems. Inner solvers
-could be  based on scalar direct method  with the LU method  or scalar iterative
+could be  based on sequential direct method  with the LU method  or sequential iterative
  one with GMRES.
  
  In~\cite{prace-multi},  the  authors have  proposed a  parallel  multisplitting
@@ -142,6 +162,7 @@ In the case where the diagonal weighting matrices $E_\ell$ have only zero and on
  %%%%%%%%%%%%%%%%%%%%%%%%
  
  \section{A two-stage method with a minimization}
+\label{sec:03}
  Let $Ax=b$ be a given large and sparse linear system of $n$ equations to solve in parallel on $L$ clusters of processors, physically adjacent or geographically distant, where $A\in\mathbb{R}^{n\times n}$ is a square and non-singular matrix, $x\in\mathbb{R}^{n}$ is the solution vector and $b\in\mathbb{R}^{n}$ is the right-hand side vector. The multisplitting of this linear system is defined as follows
  \begin{equation}
  \left\{
@@ -253,6 +274,7 @@ The main key points of our Krylov multisplitting method to solve a large sparse
  %%%%%%%%%%%%%%%%%%%%%%%%
  
  \section{Experiments}
+\label{sec:04}
  In order to illustrate  the interest  of our algorithm. We have  compared our
  algorithm  with  the  GMRES  method  which  is a very  well  used  method  in  many
  situations.  We have chosen to focus on only one problem which is very simple to
@@ -276,9 +298,11 @@ preconditioner  it  is   possible  to  reduce  the  number   of  iterations  but
  preconditioners are not scalable when using many cores.
  
  %Doing many experiments  with many cores is  not easy and requires to  access to a supercomputer  with several  hours for  developing  a code  and then  improving it. 
-In the following we present  some experiments we could achieved out on the
-Hector architecture,  the previous UK's  high-end computing resource,  funded by
-the UK Research Councils, which has been stopped in the early 2014.
+In the following we present some experiments we could achieved out on the Hector
+architecture,  a UK's  high-end computing  resource, funded  by the  UK Research
+Councils~\cite{hector}.  This is  a Cray  XE6 supercomputer,  equipped  with two
+16-core AMD  Opteron 2.3 Ghz  and 32 GB  of memory. Machines  are interconnected
+with a 3D torus.
  
  Table~\ref{tab1} shows  the result of  the experiments.  The first  column shows
  the  size of  the  3D Poisson  problem.  The size  is chosen  in  order to  have
@@ -300,24 +324,28 @@ is reached. The precision and the maximum number of iterations of CGNR method ar
  
  \begin{table}[htbp]
  \begin{center}
+\begin{changemargin}{-1.8cm}{0cm}
+\begin{small}
  \begin{tabular}{|c|c||c|c|c||c|c|c||c|} 
  \hline
  \multirow{2}{*}{Pb size}&\multirow{2}{*}{Nb. cores} &  \multicolumn{3}{c||}{GMRES} &  \multicolumn{3}{c||}{Krylov Multisplitting} & \multirow{2}{*}{Ratio}\\
   \cline{3-8}
             &                   &  Time (s) & nb Iter. & $\Delta$  &   Time (s)& nb Iter. & $\Delta$ & \\
  \hline
-$468^3$ & 2048 (2x1024)        &  299.7    & 41,028    & 5.02e-8  &  48.4    & 691(6,146) & 8.24e-08  & 6.19   \\
+$468^3$ & 2,048 (2x1,024)        &  299.7    & 41,028    & 5.02e-8  &  48.4    & 691(6,146) & 8.24e-08  & 6.19   \\
  \hline
-$590^3$ & 4096 (2x2048)        &  433.1    & 55,494    & 4.92e-7  &  74.1    & 1,101(8,211) & 6.62e-08  & 5.84   \\
+$590^3$ & 4,096 (2x2,048)        &  433.1    & 55,494    & 4.92e-7  &  74.1    & 1,101(8,211) & 6.62e-08  & 5.84   \\
  \hline
-$743^3$ & 8192 (2x4096)        & 704.4     & 87,822    & 4.80e-07 &  151.2   & 3,061(14,914) & 5.87e-08 & 4.65    \\
+$743^3$ & 8,192 (2x4,096)        & 704.4     & 87,822    & 4.80e-07 &  151.2   & 3,061(14,914) & 5.87e-08 & 4.65    \\
  \hline
-$743^3$ & 8192 (4x2048)        & 704.4     & 87,822    & 4.80e-07 &  110.3   & 1,531(12,721) & 1.47e-07& 6.39  \\
+$743^3$ & 8,192 (4x2,048)        & 704.4     & 87,822    & 4.80e-07 &  110.3   & 1,531(12,721) & 1.47e-07& 6.39  \\
  \hline
  
  \end{tabular}
  \caption{Results}
  \label{tab1}
+\end{small}
+\end{changemargin}
  \end{center}
  \end{table}
  
@@ -325,16 +353,47 @@ $743^3$ & 8192 (4x2048)        & 704.4     & 87,822    & 4.80e-07 &  110.3   & 1
  From these  experiments, it can be  observed that the  multisplitting version is
  always  faster   than  the  GMRES   version.   The  acceleration  gain   of  the
  multisplitting version is between 4 and 6.  It can be noticed that the number of
-iterations is drastically reduced with  the multisplitting version even it is not
-neglectable.
+iterations is drastically reduced with the multisplitting version even it is not
+neglectable. Moreover, with 8,192 cores, we  can see that using 4 clusters gives
+better performance than simply using 2 clusters. In fact, we can remark that the
+precision with 2 clusters is slightly  better but in both cases the precision is
+under the specified threshold.
  
  \section{Conclusion and perspectives}
-We have implemented a Krylov multisplitting method to solve sparse linear systems on large-scale computing platforms. We have developed a synchronous two-stage method based on the block Jacobi multisplitting and uses GMRES iterative method as an inner iteration. Our contribution in this paper is twofold. First we have constituted a virtual multi-cluster environment based on processors of the computing platform on which each linear sub-system issued from the splitting is solved in parallel by a cluster of processors. Second, we have implemented the outer iteration of the multisplitting method as a Krylov subspace method which minimizes some error function. This increases the convergence and improves the scalability of the multisplitting method.
-
-We have tested our multisplitting method to solve the sparse linear system issued from the discretization of a 3D Poisson problem. We have compared its performances to the classical GMRES method on a supercomputer composed of 2048 to 8192 cores. The experimental results showed that the multisplitting method is about 4 to 6 times faster than the GMRES method for different sizes of the problem split into 2 or 4 blocks when using multisplitting method. Indeed, the GMRES method has difficulties to scale with many cores while the Krylov multisplitting method allows to hide latency and reduce the inter-cluster communications.
-
-In future works, we plan to conduct experiments on larger number of cores and test the scalability of our Krylov multisplitting method. It would be interesting to validate its performances to solve other linear/nonlinear and symmetric/nonsymmetric problems. Moreover, we intend to develop multisplitting methods based on asynchronous iteration in which communications are overlapped by computations. These methods would be interesting for platforms composed of distant clusters interconnected by a high-latency network. In addition, we intend to investigate the convergence improvements of our method by using preconditioning techniques for Krylov iterative methods and multisplitting methods with overlapping blocks.    
-
+We  have implemented  a  Krylov  multisplitting method  to  solve sparse  linear
+systems  on large-scale computing  platforms.  We  have developed  a synchronous
+two-stage  method based  on the  block Jacobi  multisaplitting which  uses GMRES
+iterative  method as  an inner  iteration.  Our  contribution in  this  paper is
+twofold. First we provide a multi cluster decomposition that allows us to choose
+the  appropriate size  of  the clusters  according  to the  architecures of  the
+supercomputer.  Second,   we  have  implemented  the  outer   iteration  of  the
+multisplitting method  as a  Krylov subspace method  which minimizes  some error
+function.  This  increases the convergence  and improves the scalability  of the
+multisplitting method.
+
+We  have tested  our multisplitting  method to  solve the  sparse  linear system
+issued from  the discretization of  a 3D Poisson  problem. We have  compared its
+performances to the  classical GMRES method on a  supercomputer composed of 2,048
+to 8,192 cores. The experimental results showed that the multisplitting method is
+about 4  to 6  times faster  than the GMRES  method for  different sizes  of the
+problem split into  2 or 4 blocks when using  multisplitting method. Indeed, the
+GMRES  method  has  difficulties to  scale  with  many  cores while  the  Krylov
+multisplitting  method  allows to  hide  latency  and  reduce the  inter-cluster
+communications.
+
+In future  works, we plan to conduct  experiments on larger number  of cores and
+test  the  scalability  of  our   Krylov  multisplitting  method.  It  would  be
+interesting  to validate its  performances to  solve other  linear/nonlinear and
+symmetric/nonsymmetric problems.  Moreover, we intend  to develop multisplitting
+methods based  on asynchronous iteration in which  communications are overlapped
+by computations.  These methods would  be interesting for platforms  composed of
+distant  clusters interconnected  by  a high-latency  network.  In addition,  we
+intend  to investigate  the  convergence  improvements of  our  method by  using
+preconditioning  techniques  for  Krylov  iterative methods  and  multisplitting
+methods with overlapping blocks.
+
+\section{Acknowledgement}
+The authors would like to thank Mark Bull of the EPCC his fruitful remarks and the facilities of HECToR.
  
  %Other applications (=> other matrices)\\
  %Larger experiments\\