-\section{Introduction}
-The use of multi-core architectures for solving large scientific problems seems to become imperative in a lot of cases.
-Whatever the scale of these architectures (distributed clusters, computational grids, embedded multi-core \ldots) they are generally
-well adapted to execute complexe parallel applications operating on a large amount of data. Unfortunately, users (industrials or scientists),
-who need such computational resources, may not have an easy access to such efficient architectures. The cost of using the platform and/or the cost of
-testing and deploying an application are often very important. So, in this context it is difficult to optimize a given application for a given
-architecture. In this way and in order to reduce the access cost to these computing resources it seems very interesting to use a simulation environment.
-The advantages are numerous: development life cycle, code debugging, ability to obtain results quickly \ldots at the condition that the simulation results are in education with the real ones.
-
-In this paper we focus on a class of highly efficient parallel algorithms called \emph{iterative algorithms}. The
-parallel scheme of iterative methods is quite simple. It generally involves the division of the problem
-into several \emph{blocks} that will be solved in parallel on multiple
-processing units. Each processing unit has to
-compute an iteration, to send/receive some data dependencies to/from
-its neighbors and to iterate this process until the convergence of
-the method. Several well-known methods demonstrate the convergence of these algorithms~\cite{BT89,Bahi07}.
-In this processing mode a task cannot begin a new iteration while it
-has not received data dependencies from its neighbors. We say that the iteration computation follows a synchronous scheme.
-In the asynchronous scheme a task can compute a new iteration without having to
-wait for the data dependencies coming from its neighbors. Both
-communication and computations are asynchronous inducing that there is
-no more idle times, due to synchronizations, between two
-iterations~\cite{bcvc06:ij}. This model presents some advantages and drawbacks that we detail in section 2 but even if the number of iterations required to converge is
-generally greater than for the synchronous case, it appears that the asynchronous iterative scheme can significantly reduce overall execution
-times by suppressing idle times due to synchronizations~(see \cite{Bahi07} for more details).
-
-Nevertheless, in both cases (synchronous or asynchronous) it is very time consuming to find optimal configuration and deployment requirements
-for a given application on a given multi-core architecture. Finding good resource allocations policies under varying CPU power, network speeds and
-loads is very challenging and labor intensive.~\cite{Calheiros:2011:CTM:1951445.1951450}. This problematic is even more difficult for the asynchronous scheme
-where variations of the parameters of the execution platform can lead to very different number of iterations required to converge and so to very different execution times.
-In this challenging context we think that the use of a simulation tool can greatly leverage the possibility of testing various platform scenarios.
-
-The main contribution of this paper is to show that the use of a simulation tool (i.e. the SimGrid toolkit~\cite{SimGrid}) in the context of real
-parallel applications (i.e. large linear system solver) can help developers to better tune their application for a given multi-core architecture.
-To show the validity of this approach we first compare the simulated execution of the multisplitting algorithm with the GMRES (Generalized Minimal Residual) solver
-\cite{ref1} in synchronous mode. The obtained results on different simulated multi-core architectures confirm the real results previously obtained on non simulated architectures.
-We also confirm the efficiency of the asynchronous multisplitting algorithm comparing to the synchronous GMRES. In this way and with a simple computing architecture (a laptop)
-SimGrid allows us to run a test campaign of a real parallel iterative applications on different simulated multi-core architectures.
-To our knowledge, there is no related work on the large-scale multi-core simulation of a real synchronous and asynchronous iterative application.
-
-This paper is organized as follows. Section 1 \ref{sec:synchro} presents the iteration model we use and more particularly the asynchronous scheme.
-In section \ref{sec:simgrid} the SimGrid simulation toolkit is presented. Section \ref{sec:04} details the different solvers that we use.
-Finally our experimental results are presented in section \ref{\sec:expe} followed by some concluding remarks and perspectives.
-
-
-\section{The asynchronous iteration model}
+\section{Introduction}
+The use of multi-core architectures to solve large
+scientific problems is becoming imperative in many situations.
+Whatever the scale of these architectures (distributed clusters, computational
+grids, embedded multi-core,~\ldots), they are generally well suited to executing
+complex parallel applications operating on large amounts of data.
+Unfortunately, users (from industry or science) who need such computational
+resources may not have easy access to such efficient architectures. The cost
+of using the platform and/or the cost of testing and deploying an application
+is often very high. In this context it is therefore difficult to optimize a
+given application for a given architecture. To reduce
+the cost of accessing these computing resources, it is very attractive to use a
+simulation environment. The advantages are numerous: a simpler development life cycle,
+easier code debugging, and the ability to obtain results quickly. In return, the
+simulation results need to be consistent with the real ones.
+
+In this paper we focus on a class of highly efficient parallel algorithms called
+\emph{iterative algorithms}. The parallel scheme of iterative methods is quite
+simple. It generally involves the division of the problem into several
+\emph{blocks} that will be solved in parallel on multiple processing
+units. Each processing unit has to compute an iteration, send/receive some
+data dependencies to/from its neighbors, and repeat this process until the
+method converges. Several well-known studies demonstrate the
+convergence of these algorithms~\cite{BT89,bahi07}. In this processing mode a
+task cannot begin a new iteration until it has received the data dependencies
+from its neighbors. We say that the iteration computation follows a
+\textit{synchronous} scheme. In the asynchronous scheme a task can compute a new
+iteration without having to wait for the data dependencies coming from its
+neighbors. Both communications and computations are \textit{asynchronous},
+so there is no longer any idle time due to synchronizations between two
+iterations~\cite{bcvc06:ij}. This model presents some advantages and drawbacks
+that we detail in Section~\ref{sec:asynchro}. Even though the number of
+iterations required to converge is generally greater than in the synchronous
+case, the asynchronous iterative scheme can significantly
+reduce overall execution times by suppressing idle times due to
+synchronizations~(see~\cite{bahi07} for more details).
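+
+As a compact illustration of the two schemes described above, consider the
+following sketch. The notation is assumed here for illustration only and is a
+simplified form of the models in the cited references: $p$ is the number of
+blocks, $x_i^k$ is the value of block $i$ after iteration $k$, $F_i$ is the
+local update applied by processing unit $i$, and $s_j^i(k)$ is the iteration
+index of the most recent version of block $j$ available to unit $i$. In the
+synchronous scheme every unit uses the values produced at iteration $k$:
+\[
+  x_i^{k+1} = F_i\left(x_1^{k}, \ldots, x_p^{k}\right), \qquad i = 1, \ldots, p,
+\]
+whereas in the asynchronous scheme each unit uses whatever versions it has
+already received, which may be older:
+\[
+  x_i^{k+1} = F_i\left(x_1^{s_1^i(k)}, \ldots, x_p^{s_p^i(k)}\right),
+  \qquad 0 \le s_j^i(k) \le k.
+\]
+Under this notation the synchronous scheme is the special case $s_j^i(k) = k$,
+and the possible use of outdated data is one way to see why more iterations are
+generally needed in the asynchronous case.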
+
+Nevertheless, in both cases (synchronous or asynchronous) it is very
+time-consuming to find optimal configuration and deployment requirements for a given
+application on a given multi-core architecture. Finding good resource
+allocation policies under varying CPU power, network speeds and loads is very
+challenging and labor-intensive~\cite{Calheiros:2011:CTM:1951445.1951450}. This
+problem is even more difficult for the asynchronous scheme, where a small
+variation in the parameters of the execution platform or in the application data can
+lead to very different numbers of iterations to reach convergence and thus to
+very different execution times. In this challenging context we think that the
+use of a simulation tool can greatly ease the testing of various
+platform scenarios.
+
+The \textbf{main contribution of this paper} is to show that the use of a
+simulation tool (namely the SimGrid toolkit~\cite{SimGrid}) in the context of real
+parallel applications (namely large linear system solvers) can help developers to
+better tune their applications for a given multi-core architecture. To show the
+validity of this approach we first compare the simulated execution of the Krylov
+multisplitting algorithm with the GMRES (Generalized Minimal RESidual)
+solver~\cite{saad86} in synchronous mode. The simulation results allow us to
+determine which method to choose for a given multi-core architecture.
+Moreover, the results obtained on different simulated multi-core architectures
+confirm the real results previously obtained on non-simulated architectures.
+More precisely, the simulated results agree (i.e., within the same order
+of magnitude) with the work presented in~\cite{couturier15}, which shows that
+the synchronous Krylov multisplitting method is more efficient than GMRES on
+large-scale clusters. Simulated results also confirm the efficiency of the
+asynchronous multisplitting algorithm compared to synchronous GMRES,
+especially in the case of geographically distant clusters.
+
+In this way, using only a simple computing architecture (a laptop), SimGrid allows us
+to run a test campaign of a real parallel iterative application on
+different simulated multi-core architectures. To our knowledge, there is no
+related work on the large-scale multi-core simulation of a real synchronous and
+asynchronous iterative application.
+
+This paper is organized as follows. Section~\ref{sec:asynchro} presents the
+iteration model we use and more particularly the asynchronous scheme. In
+Section~\ref{sec:simgrid} the SimGrid simulation toolkit is presented.
+Section~\ref{sec:04} details the different solvers that we use. Finally our
+experimental results are presented in Section~\ref{sec:expe} followed by some
+concluding remarks and perspectives.
+
+
+\section{The asynchronous iteration model and the motivations of our work}