+After discretization with a finite difference scheme, a seven point stencil is
+obtained. It is well known that the spectral radius of the matrices representing
+such problems is very close to 1. Moreover, the larger the number of
+discretization points, the closer the spectral radius is to 1. Hence, solving a
+linear system arising from a 3D Poisson problem requires a large number of
+iterations. A preconditioner can reduce the number of iterations, but
+preconditioners are not scalable when using many cores.
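+
+As an illustration, assume the unit cube with homogeneous Dirichlet boundary
+conditions and a uniform grid with $n$ interior points per direction and mesh
+size $h=1/(n+1)$ (notation introduced here only for this sketch). The seven
+point stencil for $-\nabla^2 u = f$ then reads
+\[
+\frac{6u_{i,j,k}-u_{i-1,j,k}-u_{i+1,j,k}-u_{i,j-1,k}-u_{i,j+1,k}-u_{i,j,k-1}-u_{i,j,k+1}}{h^2}=f_{i,j,k}.
+\]
+If the spectral radius is understood as that of the associated Jacobi iteration
+matrix $J$ (an assumption made for this example), the classical result is
+\[
+\rho(J)=\cos(\pi h)\approx 1-\frac{\pi^2h^2}{2},
+\]
+which tends to 1 as $n$ grows.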
+
+%Doing many experiments with many cores is not easy and requires to access to a supercomputer with several hours for developing a code and then improving it.
+In the following, we present some experiments that we were able to carry out on
+the HECToR architecture, the UK's high-end computing resource, funded by the UK
+Research Councils~\cite{hector}. This is a Cray XE6 supercomputer in which each
+node is equipped with two 16-core AMD Opteron 2.3 GHz processors and 32 GB of
+memory. Nodes are interconnected by a 3D torus network.
+
+Table~\ref{tab1} shows the results of the experiments. The first column shows
+the size of the 3D Poisson problem. The size is chosen in order to have
+approximately 50,000 components per core. The second column gives the
+number of cores used. In parentheses is the decomposition used for the
+Krylov multisplitting. The third and sixth columns show the execution times of
+the GMRES and the Krylov multisplitting codes, respectively. The fourth
+and seventh columns give the number of iterations. For the
+multisplitting code, the total number of inner iterations is given in
+parentheses. For the GMRES code (alone and in the multisplitting version), the
+restart parameter is fixed to 16. The precision of the GMRES version is fixed to
+1e-6. For the multisplitting, there are two precisions: one for the external
+solver, which is fixed to 1e-6, and another one for the inner solver (GMRES),
+which is fixed to 1e-10. It should be noted that a high precision is used, but
+we also fix a maximum number of iterations for each internal step. In practice,
+we limit the number of iterations in the internal step to 10. So an internal
+iteration finishes when the precision is reached or when the maximum number of
+internal iterations is reached. The precision and the maximum number of
+iterations of the CGNR method are fixed to 1e-25 and 20, respectively. The size
+of the Krylov subspace basis $S$ is fixed to 10 vectors.
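+
+As a quick check of the ``approximately 50,000 components per core'' rule, the
+number of unknowns divided by the number of cores can be computed for the
+configurations reported in Table~\ref{tab1}:
+\[
+\frac{468^3}{2{,}048}=\frac{102{,}503{,}232}{2{,}048}\approx 50{,}050,\qquad
+\frac{590^3}{4{,}096}=\frac{205{,}379{,}000}{4{,}096}\approx 50{,}141,\qquad
+\frac{743^3}{8{,}192}=\frac{410{,}172{,}407}{8{,}192}\approx 50{,}070.
+\]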
+
+\begin{table}[htbp]
+\begin{center}
+\begin{tabular}{|c|c||c|c|c||c|c|c||c|}
+\hline
+\multirow{2}{*}{Pb size}&\multirow{2}{*}{Nb. cores} & \multicolumn{3}{c||}{GMRES} & \multicolumn{3}{c||}{Krylov Multisplitting} & \multirow{2}{*}{Ratio}\\
+ \cline{3-8}
+ & & Time (s) & Nb. iter. & $\Delta$ & Time (s) & Nb. iter. & $\Delta$ & \\
+\hline
+$468^3$ & 2,048 (2x1,024) & 299.7 & 41,028 & 5.02e-08 & 48.4 & 691 (6,146) & 8.24e-08 & 6.19 \\
+\hline
+$590^3$ & 4,096 (2x2,048) & 433.1 & 55,494 & 4.92e-07 & 74.1 & 1,101 (8,211) & 6.62e-08 & 5.84 \\
+\hline
+$743^3$ & 8,192 (2x4,096) & 704.4 & 87,822 & 4.80e-07 & 151.2 & 3,061 (14,914) & 5.87e-08 & 4.65 \\
+\hline
+$743^3$ & 8,192 (4x2,048) & 704.4 & 87,822 & 4.80e-07 & 110.3 & 1,531 (12,721) & 1.47e-07 & 6.39 \\
+\hline
+
+\end{tabular}
+\caption{Results of GMRES and Krylov multisplitting on the 3D Poisson problem}
+\label{tab1}
+\end{center}
+\end{table}
+
+
+From these experiments, it can be observed that the multisplitting version is
+always faster than the GMRES version. The speedup of the multisplitting version
+over GMRES ranges from about 4.6 to 6.4. The number of iterations is
+drastically reduced with the multisplitting version, even though it remains
+non-negligible. Moreover, with 8,192 cores, using 4 clusters gives
+better performance than using only 2 clusters. The precision obtained with 2
+clusters is slightly better, but in both cases the precision is
+below the specified threshold.
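+
+Assuming that the last column of Table~\ref{tab1} is the ratio of the GMRES
+execution time to the Krylov multisplitting execution time (an interpretation
+consistent with the reported values), the first and last rows give, for
+example,
+\[
+\frac{299.7}{48.4}\approx 6.19
+\qquad\text{and}\qquad
+\frac{704.4}{110.3}\approx 6.39 .
+\]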
+
+\section{Conclusion and perspectives}
+We have implemented a Krylov multisplitting method to solve sparse linear
+systems on large-scale computing platforms. We have developed a synchronous
+two-stage method based on the block Jacobi multisplitting, which uses the GMRES
+iterative method as an inner iteration. Our contribution in this paper is
+twofold. First, we provide a multi-cluster decomposition that allows us to
+choose the appropriate size of the clusters according to the architecture of
+the supercomputer. Second, we have implemented the outer iteration of the
+multisplitting method as a Krylov subspace method which minimizes some error
+function. This accelerates the convergence and improves the scalability of the
+multisplitting method.
+
+We have tested our multisplitting method on the sparse linear system
+arising from the discretization of a 3D Poisson problem. We have compared its
+performance to that of the classical GMRES method on a supercomputer using 2,048
+to 8,192 cores. The experimental results show that the multisplitting method is
+about 4 to 6 times faster than the GMRES method for different problem sizes,
+with the problem split into 2 or 4 blocks in the multisplitting method. Indeed,
+the GMRES method has difficulty scaling to many cores, while the Krylov
+multisplitting method makes it possible to hide latency and reduce the
+inter-cluster communications.
+
+In future work, we plan to conduct experiments on a larger number of cores and
+to test the scalability of our Krylov multisplitting method. It would be
+interesting to validate its performance on other linear/nonlinear and
+symmetric/nonsymmetric problems. Moreover, we intend to develop multisplitting
+methods based on asynchronous iterations, in which communications are
+overlapped with computations. These methods would be interesting for platforms
+composed of distant clusters interconnected by a high-latency network. In
+addition, we intend to investigate how the convergence of our method can be
+improved by using preconditioning techniques for Krylov iterative methods and
+multisplitting methods with overlapping blocks.
+
+\section{Acknowledgement}
+The authors would like to thank Mark Bull of the EPCC for his fruitful remarks and for access to the facilities of HECToR.
+
+%Other applications (=> other matrices)\\
+%Larger experiments\\
+%Async\\
+%Overlapping\\
+%preconditioning