% quality.
-%\usepackage{eqparbox}
+\usepackage{eqparbox}
% Also of notable interest is Scott Pakin's eqparbox package for creating
% (automatically sized) equal width boxes - aka "natural width parboxes".
% Available at:
\hyphenation{op-tical net-works semi-conduc-tor}
-
+\usepackage[utf8]{inputenc}
+\usepackage[T1]{fontenc}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{multirow}
+\usepackage{graphicx}
\algnewcommand\algorithmicinput{\textbf{Input:}}
\algnewcommand\Input{\item[\algorithmicinput]}
\algnewcommand\algorithmicoutput{\textbf{Output:}}
\algnewcommand\Output{\item[\algorithmicoutput]}
-
+\newtheorem{proposition}{Proposition}
\begin{document}
%
% paper title
% can use linebreaks \\ within to get better formatting as desired
-\title{TSARM: A Two-Stage Algorithm with least-square Residual Minimization to solve large sparse linear systems}
-%où
-%\title{A two-stage algorithm with error minimization to solve large sparse linear systems}
-%où
-%\title{???}
+\title{TSIRM: A Two-Stage Iteration with least-square Residual Minimization algorithm to solve large sparse linear systems}
+
% use a multiple column layout for up to two different
% affiliations
-\author{\IEEEauthorblockN{Rapha\"el Couturier\IEEEauthorrefmark{1}, Lilia Ziane Khodja \IEEEauthorrefmark{2} and Christophe Guyeux\IEEEauthorrefmark{1}}
+\author{\IEEEauthorblockN{Rapha\"el Couturier\IEEEauthorrefmark{1}, Lilia Ziane Khodja \IEEEauthorrefmark{2}, and Christophe Guyeux\IEEEauthorrefmark{1}}
\IEEEauthorblockA{\IEEEauthorrefmark{1} Femto-ST Institute, University of Franche Comte, France\\
Email: \{raphael.couturier,christophe.guyeux\}@univ-fcomte.fr}
\IEEEauthorblockA{\IEEEauthorrefmark{2} INRIA Bordeaux Sud-Ouest, France\\
\begin{abstract}
-In this paper we propose a two stage iterative method which increases the
-convergence of Krylov iterative methods, typically those of GMRES variants. The
-principle of our approach is to build an external iteration over the Krylov
-method and to save the current residual frequently (for example, for each
-restart of GMRES). Then after a given number of outer iterations, a minimization
-step is applied on the matrix composed of the save residuals in order to compute
-a better solution and make a new iteration if necessary. We prove that our
-method has the same convergence property than the inner method used. Some
-experiments using up to 16,394 cores show that compared to GMRES our algorithm
-can be around 7 times faster.
+In this article, a two-stage iterative algorithm is proposed to improve the
+convergence of Krylov based iterative methods, typically those of GMRES variants. The
+principle of the proposed approach is to build an external iteration over the Krylov
+method, and to frequently store its current residual (at each
+GMRES restart for instance). After a given number of outer iterations, a minimization
+step is applied on the matrix composed by the saved residuals, in order to
+compute a better solution and to make new iterations if required. It is proven that
+the proposal has the same convergence properties than the inner embedded method itself.
+Experiments using up to 16,394 cores also show that the proposed algorithm
+runs around 5 or 7 times faster than GMRES.
\end{abstract}
\begin{IEEEkeywords}
% You must have at least 2 lines in the paragraph with the drop letter
% (should never be an issue)
-Iterative methods became more attractive than direct ones to solve very large
-sparse linear systems. Iterative methods are more effecient in a parallel
-context, with thousands of cores, and require less memory and arithmetic
-operations than direct methods. A number of iterative methods are proposed and
-adapted by many researchers and the increased need for solving very large sparse
-linear systems triggered the development of efficient iterative techniques
-suitable for the parallel processing.
-
-Most of the successful iterative methods currently available are based on Krylov
-subspaces which consist in forming a basis of a sequence of successive matrix
-powers times an initial vector for example the residual. These methods are based
-on orthogonality of vectors of the Krylov subspace basis to solve linear
-systems. The most well-known iterative Krylov subspace methods are Conjugate
-Gradient method and GMRES method (generalized minimal residual).
+Iterative methods have recently become more attractive than direct ones to solve very large
+sparse linear systems. They are more efficient in a parallel
+context, supporting thousands of cores, and they require less memory and arithmetic
+operations than direct methods. This is why new iterative methods are frequently
+proposed or adapted by researchers, and the increasing need to solve very large sparse
+linear systems has triggered the development of such efficient iterative techniques
+suitable for parallel processing.
+
+Most of the successful iterative methods currently available are based on so-called ``Krylov
+subspaces''. They consist in forming a basis of successive matrix
+powers multiplied by an initial vector, which can be for instance the residual. These methods use vectors orthogonality of the Krylov subspace basis in order to solve linear
+systems. The most known iterative Krylov subspace methods are conjugate
+gradient and GMRES ones (Generalized Minimal RESidual).
+
However, iterative methods suffer from scalability problems on parallel
-computing platforms with many processors due to their need for reduction
-operations and collective communications to perform matrix-vector
+computing platforms with many processors, due to their need of reduction
+operations, and to collective communications to achive matrix-vector
multiplications. The communications on large clusters with thousands of cores
-and large sizes of messages can significantly affect the performances of
-iterative methods. In practice, Krylov subspace iteration methods are often used
-with preconditioners in order to increase their convergence and accelerate their
+and large sizes of messages can significantly affect the performances of these
+iterative methods. As a consequence, Krylov subspace iteration methods are often used
+with preconditioners in practice, to increase their convergence and accelerate their
performances. However, most of the good preconditioners are not scalable on
large clusters.
-In this paper we propose a two-stage algorithm based on two nested iterations
-called inner-outer iterations. This algorithm consists in solving the sparse
-linear system iteratively with a small number of inner iterations and restarts
+In this research work, a two-stage algorithm based on two nested iterations
+called inner-outer iterations is proposed. This algorithm consists in solving the sparse
+linear system iteratively with a small number of inner iterations, and restarting
the outer step with a new solution minimizing some error functions over some
previous residuals. This algorithm is iterative and easy to parallelize on large
-clusters and the minimization technique improves its convergence and
+clusters. Furthermore, the minimization technique improves its convergence and
performances.
-The present paper is organized as follows. In Section~\ref{sec:02} some related
-works are presented. Section~\ref{sec:03} presents our two-stage algorithm using
-a least-square residual minimization. Section~\ref{sec:04} describes some
-convergence results on this method. In Section~\ref{sec:05}, parallization
-details of TSARM are given. Section~\ref{sec:06} shows some experimental
-results obtained on large clusters of our algorithm using routines of PETSc
-toolkit. Finally Section~\ref{sec:06} concludes and gives some perspectives.
+The present article is organized as follows. Related works are presented in
+Section~\ref{sec:02}. Section~\ref{sec:03} details the two-stage algorithm using
+a least-square residual minimization, while Section~\ref{sec:04} provides
+convergence results regarding this method. Section~\ref{sec:05} shows some
+experimental results obtained on large clusters using routines of PETSc
+toolkit. This research work ends by a conclusion section, in which the proposal
+is summarized while intended perspectives are provided.
+
%%%*********************************************************
%%%*********************************************************
%%%*********************************************************
%%%*********************************************************
-\section{Two-stage algorithm with least-square residuals minimization}
+\section{Two-stage iteration with least-square residuals minimization algorithm}
\label{sec:03}
A two-stage algorithm is proposed to solve large sparse linear systems of the
form $Ax=b$, where $A\in\mathbb{R}^{n\times n}$ is a sparse and square
-nonsingular matrix, $x\in\mathbb{R}^n$ is the solution vector and
-$b\in\mathbb{R}^n$ is the right-hand side. The algorithm is implemented as an
-inner-outer iteration solver based on iterative Krylov methods. The main key
-points of our solver are given in Algorithm~\ref{algo:01}.
-
-In order to accelerate the convergence, the outer iteration periodically applies
-a least-square minimization on the residuals computed by the inner solver. The
-inner solver is a Krylov based solver which does not required to be changed.
-
-At each outer iteration, the sparse linear system $Ax=b$ is solved, only for $m$
-iterations, using an iterative method restarting with the previous solution. For
-example, the GMRES method~\cite{Saad86} or some of its variants can be used as a
-inner solver. The current solution of the Krylov method is saved inside a matrix
-$S$ composed of successive solutions computed by the inner iteration.
-
-Periodically, every $s$ iterations, the minimization step is applied in order to
-compute a new solution $x$. For that, the previous residuals are computed with
-$(b-AS)$. The minimization of the residuals is obtained by
+nonsingular matrix, $x\in\mathbb{R}^n$ is the solution vector, and
+$b\in\mathbb{R}^n$ is the right-hand side. As explained previously,
+the algorithm is implemented as an
+inner-outer iteration solver based on iterative Krylov methods. The main
+key-points of the proposed solver are given in Algorithm~\ref{algo:01}.
+It can be summarized as follows: the
+inner solver is a Krylov based one. In order to accelerate its convergence, the
+outer solver periodically applies a least-square minimization on the residuals computed by the inner one. %Tsolver which does not required to be changed.
+
+At each outer iteration, the sparse linear system $Ax=b$ is partially
+solved using only $m$
+iterations of an iterative method, this latter being initialized with the
+best known approximation previously obtained.
+GMRES method~\cite{Saad86}, or any of its variants, can be used for instance as an
+inner solver. The current approximation of the Krylov method is then stored inside a matrix
+$S$ composed by the successive solutions that are computed during inner iterations.
+
+At each $s$ iterations, the minimization step is applied in order to
+compute a new solution $x$. For that, the previous residuals of $Ax=b$ are computed by
+the inner iterations with $(b-AS)$. The minimization of the residuals is obtained by
\begin{equation}
\underset{\alpha\in\mathbb{R}^{s}}{min}\|b-R\alpha\|_2
\label{eq:01}
with $R=AS$. Then the new solution $x$ is computed with $x=S\alpha$.
-In practice, $R$ is a dense rectangular matrix in $\mathbb{R}^{n\times s}$,
-$s\ll n$. In order to minimize~(\ref{eq:01}), a least-square method such as
-CGLS ~\cite{Hestenes52} or LSQR~\cite{Paige82} is used. Those methods are more
-appropriate than a direct method in a parallel context.
+In practice, $R$ is a dense rectangular matrix belonging in $\mathbb{R}^{n\times s}$,
+with $s\ll n$. In order to minimize~\eqref{eq:01}, a least-square method such as
+CGLS ~\cite{Hestenes52} or LSQR~\cite{Paige82} is used. Remark that these methods are more
+appropriate than a single direct method in a parallel context.
+
+
\begin{algorithm}[t]
-\caption{TSARM}
+\caption{TSIRM}
\begin{algorithmic}[1]
\Input $A$ (sparse matrix), $b$ (right-hand side)
\Output $x$ (solution vector)\vspace{0.2cm}
\State Set the initial guess $x^0$
- \For {$k=1,2,3,\ldots$ until convergence (error$<\epsilon_{kryl}$)} \label{algo:conv}
+ \For {$k=1,2,3,\ldots$ until convergence (error$<\epsilon_{tsirm}$)} \label{algo:conv}
\State $x^k=Solve(A,b,x^{k-1},max\_iter_{kryl})$ \label{algo:solve}
\State retrieve error
- \State $S_{k~mod~s}=x^k$ \label{algo:store}
- \If {$k$ mod $s=0$ {\bf and} error$>\epsilon_{kryl}$}
+ \State $S_{k \mod s}=x^k$ \label{algo:store}
+ \If {$k \mod s=0$ {\bf and} error$>\epsilon_{kryl}$}
\State $R=AS$ \Comment{compute dense matrix} \label{algo:matrix_mul}
- \State Solve least-squares problem $\underset{\alpha\in\mathbb{R}^{s}}{min}\|b-R\alpha\|_2$ \label{algo:}
+ \State $\alpha=Solve\_Least\_Squares(R,b,max\_iter_{ls})$ \label{algo:}
\State $x^k=S\alpha$ \Comment{compute new solution}
\EndIf
\EndFor
called for a maximum of $max\_iter_{kryl}$ iterations. In practice, we suggest to set this parameter
equals to the restart number of the GMRES-like method. Moreover, a tolerance
threshold must be specified for the solver. In practice, this threshold must be
-much smaller than the convergence threshold of the TSARM algorithm (i.e.
-$\epsilon$). Line~\ref{algo:store}, $S_{k~ mod~ s}=x^k$ consists in copying the
-solution $x_k$ into the column $k~ mod~ s$ of the matrix $S$. After the
+much smaller than the convergence threshold of the TSIRM algorithm (\emph{i.e.}
+$\epsilon_{tsirm}$). Line~\ref{algo:store}, $S_{k~ mod~ s}=x^k$ consists in copying the
+solution $x_k$ into the column $k~ mod~ s$ of the matrix $S$. After the
minimization, the matrix $S$ is reused with the new values of the residuals. To
solve the minimization problem, an iterative method is used. Two parameters are
-required for that: the maximum number of iteration and the threshold to stop the
+required for that: the maximum number of iterations and the threshold to stop the
method.
-To summarize, the important parameters of TSARM are:
+Let us summarize the most important parameters of TSIRM:
\begin{itemize}
-\item $\epsilon_{kryl}$ the threshold to stop the method of the krylov method
-\item $max\_iter_{kryl}$ the maximum number of iterations for the krylov method
-\item $s$ the number of outer iterations before applying the minimization step
-\item $max\_iter_{ls}$ the maximum number of iterations for the iterative least-square method
-\item $\epsilon_{ls}$ the threshold to stop the least-square method
+\item $\epsilon_{tsirm}$: the threshold to stop the TSIRM method;
+\item $max\_iter_{kryl}$: the maximum number of iterations for the Krylov method;
+\item $s$: the number of outer iterations before applying the minimization step;
+\item $max\_iter_{ls}$: the maximum number of iterations for the iterative least-square method;
+\item $\epsilon_{ls}$: the threshold used to stop the least-square method.
\end{itemize}
-%%%*********************************************************
-%%%*********************************************************
-
-\section{Convergence results}
-\label{sec:04}
-
-
-%%%*********************************************************
-%%%*********************************************************
-\section{Parallelization}
-\label{sec:05}
-
-The parallelisation of TSARM relies on the parallelization of all its
+The parallelisation of TSIRM relies on the parallelization of all its
parts. More precisely, except the least-square step, all the other parts are
obvious to achieve out in parallel. In order to develop a parallel version of
our code, we have chosen to use PETSc~\cite{petsc-web-page}. For
interesting to solve the least-square minimization, CGLS and LSQR.
In the following we remind the CGLS algorithm. The LSQR method follows more or
-less the same principle but it take more place, so we briefly explain the parallelization of CGLS which is similar to LSQR.
+less the same principle but it takes more place, so we briefly explain the parallelization of CGLS which is similar to LSQR.
\begin{algorithm}[t]
\caption{CGLS}
In each iteration of CGLS, there is two matrix-vector multiplications and some
-classical operations: dots, norm, multiplication and addition on vectors. All
+classical operations: dot product, norm, multiplication and addition on vectors. All
these operations are easy to implement in PETSc or similar environment.
+
+
%%%*********************************************************
%%%*********************************************************
-\section{Experiments using petsc}
-\label{sec:06}
+
+\section{Convergence results}
+\label{sec:04}
+Let us recall the following result, see~\cite{Saad86}.
+\begin{proposition}
+Suppose that $A$ is a positive real matrix with symmetric part $M$. Then the residual norm provided at the $m$-th step of GMRES satisfies:
+\begin{equation}
+||r_m|| \leqslant \left(1-\dfrac{\alpha}{\beta}\right)^{\frac{m}{2}} ||r_0|| ,
+\end{equation}
+where $\alpha = \lambda_min(M)^2$ and $\beta = \lambda_max(A^T A)$, which proves
+the convergence of GMRES($m$) for all $m$ under that assumption regarding $A$.
+\end{proposition}
+
+
+
+%%%*********************************************************
+%%%*********************************************************
+\section{Experiments using PETSc}
+\label{sec:05}
In order to see the influence of our algorithm with only one processor, we first
show a comparison with the standard version of GMRES and our algorithm. In
-table~\ref{tab:01}, we show the matrices we have used and some of them
+Table~\ref{tab:01}, we show the matrices we have used and some of them
characteristics. For all the matrices, the name, the field, the number of rows
-and the number of nonzero elements is given.
+and the number of nonzero elements are given.
-\begin{table*}
+\begin{table}[htbp]
\begin{center}
\begin{tabular}{|c|c|r|r|r|}
\hline
Matrix name & Field &\# Rows & \# Nonzeros \\\hline \hline
crashbasis & Optimization & 160,000 & 1,750,416 \\
-parabolic\_fem & Computational fluid dynamics & 525,825 & 2,100,225 \\
+parabolic\_fem & Comput. fluid dynamics & 525,825 & 2,100,225 \\
epb3 & Thermal problem & 84,617 & 463,625 \\
-atmosmodj & Computational fluid dynamics & 1,270,432 & 8,814,880 \\
-bfwa398 & Electromagnetics problem & 398 & 3,678 \\
+atmosmodj & Comput. fluid dynamics & 1,270,432 & 8,814,880 \\
+bfwa398 & Electromagnetics pb & 398 & 3,678 \\
torso3 & 2D/3D problem & 259,156 & 4,429,042 \\
\hline
\caption{Main characteristics of the sparse matrices chosen from the Davis collection}
\label{tab:01}
\end{center}
-\end{table*}
+\end{table}
The following parameters have been chosen for our experiments. As by default
the restart of GMRES is performed every 30 iterations, we have chosen to stop
-the GMRES every 30 iterations (line \ref{algo:solve} in
-Algorithm~\ref{algo:01}). $s$ is set to 8. CGLS is chosen to minimize the
-least-squares problem. Two conditions are used to stop CGLS, either the
-precision is under $1e-40$ or the number of iterations is greater to $20$. The
-external precision is set to $1e-10$ (line \ref{algo:conv} in
-Algorithm~\ref{algo:01}). Those experiments have been performed on a Intel(R)
+the GMRES every 30 iterations (\emph{i.e.} $max\_iter_{kryl}=30$). $s$ is set to 8. CGLS is
+chosen to minimize the least-squares problem with the following parameters:
+$\epsilon_{ls}=1e-40$ and $max\_iter_{ls}=20$. The external precision is set to
+$\epsilon_{tsirm}=1e-10$. Those experiments have been performed on a Intel(R)
Core(TM) i7-3630QM CPU @ 2.40GHz with the version 3.5.1 of PETSc.
systems obtained with the previous matrices with a GMRES variant and with out 2
stage algorithm are given. In the second column, it can be noticed that either
gmres or fgmres is used to solve the linear system. According to the matrices,
-different preconditioner is used. With the 2 stage algorithm, the same solver
-and the same preconditionner is used. This Table shows that the 2 stage
-algorithm can drastically reduce the number of iterations to reach the
-convergence when the number of iterations for the normal GMRES is more or less
-greater than 500. In fact this also depends on tow parameters: the number of
-iterations to stop GMRES and the number of iterations to perform the
-minimization.
+different preconditioner is used. With TSIRM, the same solver and the same
+preconditionner are used. This Table shows that TSIRM can drastically reduce the
+number of iterations to reach the convergence when the number of iterations for
+the normal GMRES is more or less greater than 500. In fact this also depends on
+tow parameters: the number of iterations to stop GMRES and the number of
+iterations to perform the minimization.
-\begin{table}
+\begin{table}[htbp]
\begin{center}
\begin{tabular}{|c|c|r|r|r|r|}
\hline
- \multirow{2}{*}{Matrix name} & Solver / & \multicolumn{2}{c|}{GMRES} & \multicolumn{2}{c|}{TSARM CGLS} \\
+ \multirow{2}{*}{Matrix name} & Solver / & \multicolumn{2}{c|}{GMRES} & \multicolumn{2}{c|}{TSIRM CGLS} \\
\cline{3-6}
& precond & Time & \# Iter. & Time & \# Iter. \\\hline \hline
-Larger experiments ....\\
-Describe the problems ex15 and ex54
+In order to perform larger experiments, we have tested some example applications
+of PETSc. Those applications are available in the ksp part which is suited for
+scalable linear equations solvers:
+\begin{itemize}
+\item ex15 is an example which solves in parallel an operator using a finite
+ difference scheme. The diagonal is equal to 4 and 4 extra-diagonals
+ representing the neighbors in each directions is equal to -1. This example is
+ used in many physical phenomena, for example, heat and fluid flow, wave
+ propagation...
+\item ex54 is another example based on 2D problem discretized with quadrilateral
+ finite elements. For this example, the user can define the scaling of material
+ coefficient in embedded circle, it is called $\alpha$.
+\end{itemize}
+For more technical details on these applications, interested reader are invited
+to read the codes available in the PETSc sources. Those problem have been
+chosen because they are scalable with many cores. We have tested other problem
+but they are not scalable with many cores.
+
+In the following larger experiments are described on two large scale architectures: Curie and Juqeen... {\bf description...}\\
+
+
+{\bf Description of preconditioners}
-\begin{table*}
+\begin{table*}[htbp]
\begin{center}
\begin{tabular}{|r|r|r|r|r|r|r|r|r|}
\hline
- nb. cores & precond & \multicolumn{2}{c|}{GMRES} & \multicolumn{2}{c|}{TSARM CGLS} & \multicolumn{2}{c|}{TSARM LSQR} & best gain \\
+ nb. cores & precond & \multicolumn{2}{c|}{FGMRES} & \multicolumn{2}{c|}{TSIRM CGLS} & \multicolumn{2}{c|}{TSIRM LSQR} & best gain \\
\cline{3-8}
& & Time & \# Iter. & Time & \# Iter. & Time & \# Iter. & \\\hline \hline
2,048 & mg & 403.49 & 18,210 & 73.89 & 3,060 & 77.84 & 3,270 & 5.46 \\
\hline
\end{tabular}
-\caption{Comparison of FGMRES and 2 stage FGMRES algorithms for ex15 of Petsc with 25000 components per core on Juqueen (threshold 1e-3, restart=30, s=12), time is expressed in seconds.}
+\caption{Comparison of FGMRES and TSIRM with FGMRES for example ex15 of PETSc with two preconditioner (mg and sor) with 25,000 components per core on Juqueen (threshold 1e-3, restart=30, s=12), time is expressed in seconds.}
\label{tab:03}
\end{center}
\end{table*}
+Table~\ref{tab:03} shows the execution times and the number of iterations of
+example ex15 of PETSc on the Juqueen architecture. Differents number of cores
+are studied rangin from 2,048 upto 16,383. Two preconditioners have been
+tested. For those experiments, the number of components (or unknown of the
+problems) per processor is fixed to 25,000, also called weak scaling. This
+number can seem relatively small. In fact, for some applications that need a lot
+of memory, the number of components per processor requires sometimes to be
+small.
-\begin{table*}
+
+
+In this Table, we can notice that TSIRM is always faster than FGMRES. The last
+column shows the ratio between FGMRES and the best version of TSIRM according to
+the minimization procedure: CGLS or LSQR. Even if we have computed the worst
+case between CGLS and LSQR, it is clear that TSIRM is alsways faster than
+FGMRES. For this example, the multigrid preconditionner is faster than SOR. The
+gain between TSIRM and FGMRES is more or less similar for the two
+preconditioners. Looking at the number of iterations to reach the convergence,
+it is obvious that TSIRM allows the reduction of the number of iterations. It
+should be noticed that for TSIRM, in those experiments, only the iterations of
+the Krylov solver are taken into account. Iterations of CGLS or LSQR were not
+recorded but they are time-consuming. In general each $max\_iter_{kryl}*s$ which
+corresponds to 30*12, there are $max\_iter_{ls}$ which corresponds to 15.
+
+\begin{figure}[htbp]
+\centering
+ \includegraphics[width=0.45\textwidth]{nb_iter_sec_ex15_juqueen}
+\caption{Number of iterations per second with ex15 and the same parameters than in Table~\ref{tab:03}}
+\label{fig:01}
+\end{figure}
+
+
+In Figure~\ref{fig:01}, the number of iterations per second corresponding to
+Table~\ref{tab:01} is displayed. It can be noticed that the number of
+iterations per second of FMGRES is constant whereas it decrease with TSIRM with
+both preconditioner. This can be explained by the fact that when the number of
+core increases the time for the minimization step also increases but, generally,
+when the number of cores increases, the number of iterations to reach the
+threshold also increases, and, in that case, TSIRM is more efficient to reduce
+the number of iterations. So, the overall benefit of using TSIRM is interesting.
+
+
+
+
+
+
+\begin{table*}[htbp]
\begin{center}
\begin{tabular}{|r|r|r|r|r|r|r|r|r|}
\hline
- nb. cores & threshold & \multicolumn{2}{c|}{GMRES} & \multicolumn{2}{c|}{TSARM CGLS} & \multicolumn{2}{c|}{TSARM LSQR} & best gain \\
+ nb. cores & threshold & \multicolumn{2}{c|}{GMRES} & \multicolumn{2}{c|}{TSIRM CGLS} & \multicolumn{2}{c|}{TSIRM LSQR} & best gain \\
\cline{3-8}
& & Time & \# Iter. & Time & \# Iter. & Time & \# Iter. & \\\hline \hline
2,048 & 8e-5 & 108.88 & 16,560 & 23.06 & 3,630 & 22.79 & 3,630 & 4.77 \\
4,096 & 7e-5 & 160.59 & 22,530 & 35.15 & 5,130 & 29.21 & 4,350 & 5.49 \\
4,096 & 6e-5 & 249.27 & 35,520 & 52.13 & 7,950 & 39.24 & 5,790 & 6.35 \\
8,192 & 6e-5 & 149.54 & 17,280 & 28.68 & 3,810 & 29.05 & 3,990 & 5.21 \\
- 8,192 & 5e-5 & 792.11 & 109,590 & 76.83 & 10,470 & 65.20 & 9,030 & 12.14 \\
+ 8,192 & 5e-5 & 785.04 & 109,590 & 76.07 & 10,470 & 69.42 & 9,030 & 11.30 \\
16,384 & 4e-5 & 718.61 & 86,400 & 98.98 & 10,830 & 131.86 & 14,790 & 7.26 \\
\hline
\end{table*}
+In Table~\ref{tab:04}, some experiments with example ex54 on the Curie architecture are reported.
-
-\begin{table*}
+\begin{table*}[htbp]
\begin{center}
-\begin{tabular}{|r|r|r|r|r|r|r|r|r|r|}
+\begin{tabular}{|r|r|r|r|r|r|r|r|r|r|r|}
\hline
- nb. cores & \multicolumn{2}{c|}{GMRES} & \multicolumn{2}{c|}{TSARM CGLS} & \multicolumn{2}{c|}{TSARM LSQR} & \multicolumn{3}{c|}{efficiency} \\
-\cline{2-10}
- & Time & \# Iter. & Time & \# Iter. & Time & \# Iter. & GMRES & TS CGLS & TS LSQR\\\hline \hline
- 512 & 3,969.69 & 33,120 & 709.57 & 5,790 & 622.76 & 5,070 & 1 & 1 & 1 \\
- 1024 & 1,530.06 & 25,860 & 290.95 & 4,830 & 307.71 & 5,070 & 1.30 & 1.21 & 1.01 \\
- 2048 & 919.62 & 31,470 & 237.52 & 8,040 & 194.22 & 6,510 & 1.08 & .75 & .80\\
- 4096 & 405.60 & 28,380 & 111.67 & 7,590 & 91.72 & 6,510 & 1.22 & .79 & .84 \\
- 8192 & 785.04 & 109,590 & 76.07 & 10,470 & 69.42 & 9,030 & .32 & .58 & .56 \\
+ nb. cores & \multicolumn{2}{c|}{GMRES} & \multicolumn{2}{c|}{TSIRM CGLS} & \multicolumn{2}{c|}{TSIRM LSQR} & best gain & \multicolumn{3}{c|}{efficiency} \\
+\cline{2-7} \cline{9-11}
+ & Time & \# Iter. & Time & \# Iter. & Time & \# Iter. & & GMRES & TS CGLS & TS LSQR\\\hline \hline
+ 512 & 3,969.69 & 33,120 & 709.57 & 5,790 & 622.76 & 5,070 & 6.37 & 1 & 1 & 1 \\
+ 1024 & 1,530.06 & 25,860 & 290.95 & 4,830 & 307.71 & 5,070 & 5.25 & 1.30 & 1.21 & 1.01 \\
+ 2048 & 919.62 & 31,470 & 237.52 & 8,040 & 194.22 & 6,510 & 4.73 & 1.08 & .75 & .80\\
+ 4096 & 405.60 & 28,380 & 111.67 & 7,590 & 91.72 & 6,510 & 4.42 & 1.22 & .79 & .84 \\
+ 8192 & 785.04 & 109,590 & 76.07 & 10,470 & 69.42 & 9,030 & 11.30 & .32 & .58 & .56 \\
\hline
%%%*********************************************************
%%%*********************************************************
\section{Conclusion}
-\label{sec:07}
+\label{sec:06}
%The conclusion goes here. this is more of the conclusion
%%%*********************************************************
%%%*********************************************************