+\documentclass[times]{cpeauth}
+
+\usepackage{moreverb}
+
+%\usepackage[dvips,colorlinks,bookmarksopen,bookmarksnumbered,citecolor=red,urlcolor=red]{hyperref}
+
+%\newcommand\BibTeX{{\rmfamily B\kern-.05em \textsc{i\kern-.025em b}\kern-.08em
+%T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
+
+\def\volumeyear{2015}
+
+\usepackage{graphicx}
+\usepackage{wrapfig}
+\usepackage{grffile}
+
+\usepackage[T1]{fontenc}
+\usepackage[utf8]{inputenc}
+\usepackage{amsfonts,amssymb}
+\usepackage{amsmath}
+\usepackage{algorithm}
+\usepackage{algpseudocode}
+%\usepackage{amsthm}
+\usepackage{graphicx}
+\usepackage[american]{babel}
+% Extension pour les liens intra-documents (tagged PDF)
+% et l'affichage correct des URL (commande \url{http://example.com})
+%\usepackage{hyperref}
+
+\usepackage{url}
+\DeclareUrlCommand\email{\urlstyle{same}}
+
+\usepackage[autolanguage,np]{numprint}
+\AtBeginDocument{%
+ \renewcommand*\npunitcommand[1]{\text{#1}}
+ \npthousandthpartsep{}}
+
+\usepackage{xspace}
+\usepackage[textsize=footnotesize]{todonotes}
+
+\newcommand{\AG}[2][inline]{%
+ \todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace}
+\newcommand{\RC}[2][inline]{%
+ \todo[color=red!10,#1]{\sffamily\textbf{RC:} #2}\xspace}
+\newcommand{\LZK}[2][inline]{%
+ \todo[color=blue!10,#1]{\sffamily\textbf{LZK:} #2}\xspace}
+\newcommand{\RCE}[2][inline]{%
+ \todo[color=yellow!10,#1]{\sffamily\textbf{RCE:} #2}\xspace}
+
+\algnewcommand\algorithmicinput{\textbf{Input:}}
+\algnewcommand\Input{\item[\algorithmicinput]}
+
+\algnewcommand\algorithmicoutput{\textbf{Output:}}
+\algnewcommand\Output{\item[\algorithmicoutput]}
+
+\newcommand{\MI}{\mathit{MaxIter}}
+
+\usepackage{array}
+\usepackage{color, colortbl}
+\newcolumntype{M}[1]{>{\centering\arraybackslash}m{#1}}
+\newcolumntype{Z}[1]{>{\raggedleft}m{#1}}
+
+\newcolumntype{g}{>{\columncolor{Gray}}c}
+\definecolor{Gray}{gray}{0.9}
+
+
+
+\begin{document}
+\RCE{Titre a confirmer.}
+\title{Comparative performance analysis of simulated grid-enabled numerical iterative algorithms}
+%\itshape{\journalnamelc}\footnotemark[2]}
+
+\author{ Charles Emile Ramamonjisoa and
+ David Laiymani and
+ Arnaud Giersch and
+ Lilia Ziane Khodja and
+ Raphaël Couturier
+}
+
+\address{
+ \centering
+ Femto-ST Institute - DISC Department\\
+ Université de Franche-Comté\\
+ Belfort\\
+ Email: \email{{raphael.couturier,arnaud.giersch,david.laiymani,charles.ramamonjisoa}@univ-fcomte.fr}
+}
+
+%% Lilia Ziane Khodja: Department of Aerospace \& Mechanical Engineering\\ Non Linear Computational Mechanics\\ University of Liege\\ Liege, Belgium. Email: l.zianekhodja@ulg.ac.be
+
+\begin{abstract}
+ABSTRACT
+\end{abstract}
+
+\keywords{Algorithm; distributed; iterative; asynchronous; simulation; simgrid; performance}
+
+\maketitle
+
+\section{Introduction}
+
+\section{The asynchronous iteration model}
+
+\section{SimGrid}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\section{Two-stage splitting methods}
+\label{sec:04}
+\subsection{Multisplitting methods for sparse linear systems}
+\label{sec:04.01}
+Let us consider the following sparse linear system of $n$ equations in $\mathbb{R}$
+\begin{equation}
+Ax=b,
+\label{eq:01}
+\end{equation}
+where $A$ is a sparse square and nonsingular matrix, $b$ is the right-hand side and $x$ is the solution of the system. The multisplitting methods solve the linear system~(\ref{eq:01}) iteratively as follows
+\begin{equation}
+x^{k+1}=\displaystyle\sum^L_{\ell=1} E_\ell M^{-1}_\ell (N_\ell x^k + b),~k=1,2,3,\ldots
+\label{eq:02}
+\end{equation}
+where a collection of $L$ triplets $(M_\ell, N_\ell, E_\ell)$ defines the multisplitting of matrix $A$, such that: the different splittings are defined as $A=M_\ell-N_\ell$ where $M_\ell$ are nonsingular matrices, and $\sum_\ell{E_\ell=I}$ are diagonal nonnegative weighting matrices and $I$ is the identity matrix. The iterations of the multisplitting methods can naturally be computed in parallel such that each processor or a group of processors is responsible for solving one splitting as a linear sub-system
+\begin{equation}
+M_\ell y_\ell^{k+1} = R_\ell^k,\mbox{~such that~} R_\ell^k = N_\ell x^k_\ell + b,
+\label{eq:03}
+\end{equation}
+then the weighting matrices $E_\ell$ are used to compute the solution of the global system~(\ref{eq:01})
+\begin{equation}
+x^{k+1}=\displaystyle\sum^L_{\ell=1} E_\ell y^{k+1}_\ell.
+\label{eq:04}
+\end{equation}
+The convergence of the multisplitting methods, based on synchronous or asynchronous iterations, is studied by many authors. It is dependent on the condition
+\begin{equation}
+\rho(\displaystyle\sum_{\ell=1}^L E_\ell M^{-1}_\ell N_\ell) < 1,
+\label{eq:05}
+\end{equation}
+where $\rho$ is the spectral radius of the square matrix. The different linear splittings~(\ref{eq:03}) arising from the multisplitting of matrix $A$can be solved exactly with a direct method or approximated with an iterative method. When the inner method used to solve the linear sub-systems is iterative, the multisplitting method is called {\it inner-outer iterative method} or {\it two-stage multisplitting method}.
+
+In this paper we are focused on two-stage multisplitting methods where the well-known iterative method GMRES is used as an inner iteration.
+
+\subsection{Simulation of two-stage methods using SimGrid framework}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\section{Experimental, Results and Comments}
+
+
+\textbf{V.1. Setup study and Methodology}
+
+To conduct our study, we have put in place the following methodology
+which can be reused with any grid-enabled applications.
+
+\textbf{Step 1} : Choose with the end users the class of algorithms or
+the application to be tested. Numerical parallel iterative algorithms
+have been chosen for the study in the paper.
+
+\textbf{Step 2} : Collect the software materials needed for the
+experimentation. In our case, we have three variants algorithms for the
+resolution of three 3D-Poisson problem: (1) using the classical GMRES
+\textit{(Generalized Minimal RESidual Method)} alias Algo-1 in this
+paper, (2) using the multisplitting method alias Algo-2 and (3) an
+enhanced version of the multisplitting method as Algo-3. In addition,
+SIMGRID simulator has been chosen to simulate the behaviors of the
+distributed applications. SIMGRID is running on the Mesocentre
+datacenter in Franche-Comte University $[$10$]$ but also in a virtual
+machine on a laptop.
+
+\textbf{Step 3} : Fix the criteria which will be used for the future
+results comparison and analysis. In the scope of this study, we retain
+in one hand the algorithm execution mode (synchronous and asynchronous)
+and in the other hand the execution time and the number of iterations of
+the application before obtaining the convergence.
+
+\textbf{Step 4 }: Setup up the different grid testbeds environment
+which will be simulated in the simulator tool to run the program. The
+following architecture has been configured in Simgrid : 2x16 - that is a
+grid containing 2 clusters with 16 hosts (processors/cores) each -, 4x8,
+4x16, 8x8 and 2x50. The network has been designed to operate with a
+bandwidth equals to 10Gbits (resp. 1Gbits/s) and a latency of 8E-6
+microseconds (resp. 5E-5) for the intra-clusters links (resp.
+inter-clusters backbone links).
+
+\textbf{Step 5}: Process an extensive and comprehensive testings
+within these configurations in varying the key parameters, especially
+the CPU power capacity, the network parameters and also the size of the
+input matrix. Note that some parameters should be invariant to allow the
+comparison like some program input arguments.
+
+\textbf{Step 6} : Collect and analyze the output results.
+
+\textbf{ V.2. Factors impacting distributed applications performance in
+a grid environment}
+
+From our previous experience on running distributed application in a
+computational grid, many factors are identified to have an impact on the
+program behavior and performance on this specific environment. Mainly,
+first of all, the architecture of the grid itself can obviously
+influence the performance results of the program. The performance gain
+might be important theoretically when the number of clusters and/or the
+number of nodes (processors/cores) in each individual cluster increase.
+
+Another important factor impacting the overall performance of the
+application is the network configuration. Two main network parameters
+can modify drastically the program output results : (i) the network
+bandwidth (bw=bits/s) also known as "the data-carrying capacity"
+$[$13$]$ of the network is defined as the maximum of data that can pass
+from one point to another in a unit of time. (ii) the network latency
+(lat : microsecond) defined as the delay from the start time to send the
+data from a source and the final time the destination have finished to
+receive it. Upon the network characteristics, another impacting factor
+is the application dependent volume of data exchanged between the nodes
+in the cluster and between distant clusters. Large volume of data can be
+transferred in transit between the clusters and nodes during the code
+execution.
+
+ In a grid environment, it is common to distinguish in one hand, the
+"\,intra-network" which refers to the links between nodes within a
+cluster and in the other hand, the "\,inter-network" which is the
+backbone link between clusters. By design, these two networks perform
+with different speed. The intra-network generally works like a high
+speed local network with a high bandwith and very low latency. In
+opposite, the inter-network connects clusters sometime via heterogeneous
+networks components thru internet with a lower speed. The network
+between distant clusters might be a bottleneck for the global
+performance of the application.
+
+\textbf{V.3 Comparing GMRES and Multisplitting algorithms in
+synchronous mode}
+
+In the scope of this paper, our first objective is to demonstrate the
+Algo-2 (Multisplitting method) shows a better performance in grid
+architecture compared with Algo-1 (Classical GMRES) both running in
+\textbf{\textit{synchronous mode}}. Better algorithm performance
+should mean a less number of iterations output and a less execution time
+before reaching the convergence. For a systematic study, the experiments
+should figure out that, for various grid parameters values, the
+simulator will confirm the targeted outcomes, particularly for poor and
+slow networks, focusing on the impact on the communication performance
+on the chosen class of algorithm $[$12$]$.
+
+The following paragraphs present the test conditions, the output results
+and our comments.
+
+
+\textit{3.a Executing the algorithms on various computational grid
+architecture scaling up the input matrix size}
+\\
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x16, 4x8, 4x16 and 8x8\\ %\hline
+ Network & N2 : bw=1Gbs-lat=5E-05 \\ %\hline
+ Input matrix size & N$_{x}$ =150 x 150 x 150 and\\ %\hline
+ - & N$_{x}$ =170 x 170 x 170 \\ \hline
+ \end{tabular}
+\end{footnotesize}
+
+
+ Table 1 : Clusters x Nodes with NX=150 or NX=170
+
+\RCE{J'ai voulu mettre les tableaux des données mais je pense que c'est inutile et ça va surcharger}
+
+
+The results in figure 1 show the non-variation of the number of
+iterations of classical GMRES for a given input matrix size; it is not
+the case for the multisplitting method.
+
+%\begin{wrapfigure}{l}{60mm}
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{cluster_x_nodes_nx_150_and_nx_170.pdf}
+\caption{Cluster x Nodes NX=150 and NX=170}
+%\label{overflow}}
+\end{figure}
+%\end{wrapfigure}
+
+Unless the 8x8 cluster, the time
+execution difference between the two algorithms is important when
+comparing between different grid architectures, even with the same number of
+processors (like 2x16 and 4x8 = 32 processors for example). The
+experiment concludes the low sensitivity of the multisplitting method
+(compared with the classical GMRES) when scaling up to higher input
+matrix size.
+
+\textit{3.b Running on various computational grid architecture}
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x16, 4x8\\ %\hline
+ Network & N1 : bw=10Gbs-lat=8E-06 \\ %\hline
+ - & N2 : bw=1Gbs-lat=5E-05 \\
+ Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline \\
+ \end{tabular}
+\end{footnotesize}
+
+%Table 2 : Clusters x Nodes - Networks N1 x N2
+%\RCE{idem pour tous les tableaux de donnees}
+
+
+%\begin{wrapfigure}{l}{60mm}
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{cluster_x_nodes_n1_x_n2.pdf}
+\caption{Cluster x Nodes N1 x N2}
+%\label{overflow}}
+\end{figure}
+%\end{wrapfigure}
+
+The experiments compare the behavior of the algorithms running first on
+speed inter- cluster network (N1) and a less performant network (N2).
+The figure 2 shows that end users will gain to reduce the execution time
+for both algorithms in using a grid architecture like 4x16 or 8x8: the
+performance was increased in a factor of 2. The results depict also that
+when the network speed drops down, the difference between the execution
+times can reach more than 25\%.
+
+\textit{\\\\\\\\\\\\\\\\\\3.c Network latency impacts on performance}
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x16\\ %\hline
+ Network & N1 : bw=1Gbs \\ %\hline
+ Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline\\
+ \end{tabular}
+\end{footnotesize}
+
+Table 3 : Network latency impact
+
+
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{network_latency_impact_on_execution_time.pdf}
+\caption{Network latency impact on execution time}
+%\label{overflow}}
+\end{figure}
+
+
+According the results in table and figure 3, degradation of the network
+latency from 8.10$^{-6}$ to 6.10$^{-5}$ implies an absolute time
+increase more than 75\% (resp. 82\%) of the execution for the classical
+GMRES (resp. multisplitting) algorithm. In addition, it appears that the
+multisplitting method tolerates more the network latency variation with
+a less rate increase. Consequently, in the worst case (lat=6.10$^{-5
+}$), the execution time for GMRES is almost the double of the time for
+the multisplitting, even though, the performance was on the same order
+of magnitude with a latency of 8.10$^{-6}$.
+
+\textit{3.d Network bandwidth impacts on performance}
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x16\\ %\hline
+ Network & N1 : bw=1Gbs - lat=5E-05 \\ %\hline
+ Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline
+ \end{tabular}
+\end{footnotesize}
+
+Table 4 : Network bandwidth impact
+
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{network_bandwith_impact_on_execution_time.pdf}
+\caption{Network bandwith impact on execution time}
+%\label{overflow}
+\end{figure}
+
+
+
+The results of increasing the network bandwidth depict the improvement
+of the performance by reducing the execution time for both of the two
+algorithms. However, and again in this case, the multisplitting method
+presents a better performance in the considered bandwidth interval with
+a gain of 40\% which is only around 24\% for classical GMRES.
+
+\textit{3.e Input matrix size impacts on performance}
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 4x8\\ %\hline
+ Network & N2 : bw=1Gbs - lat=5E-05 \\ %\hline
+ Input matrix size & N$_{x}$ = From 40 to 200\\ \hline
+ \end{tabular}
+\end{footnotesize}
+
+Table 5 : Input matrix size impact
+
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{pb_size_impact_on_execution_time.pdf}
+\caption{Pb size impact on execution time}
+%\label{overflow}}
+\end{figure}
+
+In this experimentation, the input matrix size has been set from
+Nx=Ny=Nz=40 to 200 side elements that is from 40$^{3}$ = 64.000 to
+200$^{3}$ = 8.000.000 points. Obviously, as shown in the figure 5,
+the execution time for the algorithms convergence increases with the
+input matrix size. But the interesting result here direct on (i) the
+drastic increase (300 times) of the number of iterations needed before
+the convergence for the classical GMRES algorithm when the matrix size
+go beyond Nx=150; (ii) the classical GMRES execution time also almost
+the double from Nx=140 compared with the convergence time of the
+multisplitting method. These findings may help a lot end users to setup
+the best and the optimal targeted environment for the application
+deployment when focusing on the problem size scale up. Note that the
+same test has been done with the grid 2x16 getting the same conclusion.
+
+\textit{3.f CPU Power impact on performance}
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x16\\ %\hline
+ Network & N2 : bw=1Gbs - lat=5E-05 \\ %\hline
+ Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline
+ \end{tabular}
+\end{footnotesize}
+
+Table 6 : CPU Power impact
+
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{cpu_power_impact_on_execution_time.pdf}
+\caption{CPU Power impact on execution time}
+%\label{overflow}}
+\end{figure}
+
+Using the SIMGRID simulator flexibility, we have tried to determine the
+impact on the algorithms performance in varying the CPU power of the
+clusters nodes from 1 to 19 GFlops. The outputs depicted in the figure 6
+confirm the performance gain, around 95\% for both of the two methods,
+after adding more powerful CPU. Note that the execution time axis in the
+figure is in logarithmic scale.
+
+ \textbf{V.4 Comparing GMRES in native synchronous mode and
+Multisplitting algorithms in asynchronous mode}
+
+The previous paragraphs put in evidence the interests to simulate the
+behavior of the application before any deployment in a real environment.
+We have focused the study on analyzing the performance in varying the
+key factors impacting the results. In the same line, the study compares
+the performance of the two proposed methods in \textbf{synchronous mode
+}. In this section, with the same previous methodology, the goal is to
+demonstrate the efficiency of the multisplitting method in \textbf{
+asynchronous mode} compare with the classical GMRES staying in the
+synchronous mode.
+
+Note that the interest of using the asynchronous mode for data exchange
+is mainly, in opposite of the synchronous mode, the non-wait aspects of
+the current computation after a communication operation like sending
+some data between nodes. Each processor can continue their local
+calculation without waiting for the end of the communication. Thus, the
+asynchronous may theoretically reduce the overall execution time and can
+improve the algorithm performance.
+
+As stated supra, SIMGRID simulator tool has been used to prove the
+efficiency of the multisplitting in asynchronous mode and to find the
+best combination of the grid resources (CPU, Network, input matrix size,
+\ldots ) to get the highest "\,relative gain" in comparison with the
+classical GMRES time.
+
+
+The test conditions are summarized in the table below :
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x50 totaling 100 processors\\ %\hline
+ Processors & 1 GFlops to 1.5 GFlops\\
+ Intra-Network & bw=1.25 Gbits - lat=5E-05 \\ %\hline
+ Inter-Network & bw=5 Mbits - lat=2E-02\\
+ Input matrix size & N$_{x}$ = From 62 to 150\\ %\hline
+ Residual error precision: 10$^{-5}$ to 10$^{-9}$\\ \hline
+ \end{tabular}
+\end{footnotesize}
+
+Again, comprehensive and extensive tests have been conducted varying the
+CPU power and the network parameters (bandwidth and latency) in the
+simulator tool with different problem size. The relative gains greater
+than 1 between the two algorithms have been captured after each step of
+the test. Table I below has recorded the best grid configurations
+allowing a multiplitting method time more than 2.5 times lower than
+classical GMRES execution and convergence time. The finding thru this
+experimentation is the tolerance of the multisplitting method under a
+low speed network that we encounter usually with distant clusters thru the
+internet.
+
+% use the same column width for the following three tables
+\newlength{\mytablew}\settowidth{\mytablew}{\footnotesize\np{E-11}}
+\newenvironment{mytable}[1]{% #1: number of columns for data
+ \renewcommand{\arraystretch}{1.3}%
+ \begin{tabular}{|>{\bfseries}r%
+ |*{#1}{>{\centering\arraybackslash}p{\mytablew}|}}}{%
+ \end{tabular}}
+
+\begin{table}[!t]
+ \centering
+ \caption{Relative gain of the multisplitting algorithm compared with
+the classical GMRES}
+ \label{tab.cluster.2x50}
+
+ \begin{mytable}{6}
+ \hline
+ bw
+ & 5 & 5 & 5 & 5 & 5 & 50 \\
+ \hline
+ lat
+ & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 \\
+ \hline
+ power
+ & 1 & 1 & 1 & 1.5 & 1.5 & 1.5 \\
+ \hline
+ size
+ & 62 & 62 & 62 & 100 & 100 & 110 \\
+ \hline
+ Prec/Eprec
+ & \np{E-5} & \np{E-8} & \np{E-9} & \np{E-11} & \np{E-11} & \np{E-11} \\
+ \hline
+ speedup
+ & 0.396 & 0.392 & 0.396 & 0.391 & 0.393 & 0.395 \\
+ \hline
+ \end{mytable}
+
+ \smallskip
+
+ \begin{mytable}{6}
+ \hline
+ bw
+ & 50 & 50 & 50 & 50 & 10 & 10 \\
+ \hline
+ lat
+ & 0.02 & 0.02 & 0.02 & 0.02 & 0.03 & 0.01 \\
+ \hline
+ power
+ & 1.5 & 1.5 & 1.5 & 1.5 & 1 & 1.5 \\
+ \hline
+ size
+ & 120 & 130 & 140 & 150 & 171 & 171 \\
+ \hline
+ Prec/Eprec
+ & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-5} & \np{E-5} \\
+ \hline
+ speedup
+ & 0.398 & 0.388 & 0.393 & 0.394 & 0.63 & 0.778 \\
+ \hline
+ \end{mytable}
+\end{table}
+
+\section{Conclusion}
+CONCLUSION
+
+
+\section*{Acknowledgment}
+
+
+The authors would like to thank\dots{}
+
+
+% trigger a \newpage just before the given reference
+% number - used to balance the columns on the last page
+% adjust value as needed - may need to be readjusted if
+% the document is modified later
+\bibliographystyle{IEEEtran}
+\bibliography{hpccBib}
+
+\end{document}
+
+%%% Local Variables:
+%%% mode: latex
+%%% TeX-master: t
+%%% fill-column: 80
+%%% ispell-local-dictionary: "american"
+%%% End: