-\section{Introduction}
-The use of multi-core architectures for solving large scientific problems seems to become imperative in a lot of cases.
-Whatever the scale of these architectures (distributed clusters, computational grids, embedded multi-core \ldots) they are generally
-well adapted to execute complexe parallel applications operating on a large amount of data. Unfortunately, users (industrials or scientists),
-who need such computational resources may not have an easy access to such efficient architectures. The cost of using the platform and/or the cost of
-testing and deploying an application are often very important. So, in this context it is difficult to optimize a given application for a given
-architecture. In this way and in order to reduce the access cost to these computing resources it seems very interesting to use a simulation environment.
-The advantages are numerous: development life cycle, code debugging, ability to obtain results quickly \ldots
-
-In this paper we focus on a class of highly efficient parallel algorithms called \emph{iterative algorithms}. The
-parallel scheme of iterative methods is quite simple. It generally involves the division of the problem
-into several \emph{blocks} that will be solved in parallel on multiple
-processing units. Then each processing unit has to
-compute an iteration, to send/receive some data dependencies to/from
-its neighbors and to iterate this process until the convergence of
-the method. Several well-known methods demonstrate the convergence of these algorithms~\cite{BT89,Bahi07}.
-In this processing mode a task cannot begin a new iteration while it
-has not received data dependencies from its neighbors. We say that the iteration computation follows a synchronous scheme.
-In the asynchronous scheme a task can compute a new iteration without having to
-wait for the data dependencies coming from its neighbors. Both
-communication and computations are asynchronous inducing that there is
-no more idle times, due to synchronizations, between two
-iterations~\cite{bcvc06:ij}. This model presents some advantages and drawbacks that we detail in section 2 but even if the number of iterations required to converge is
-generally greater than for the synchronous case, it appears that the asynchronous iterative scheme can significantly reduce overall execution
-times by suppressing idle times due to synchronizations~\cite{Bahi07} for more details.
-
-Nevertheless, in both cases (synchronous or asynchronous) it is very time consuming to find optimal configuration and deployment requirements
-for a given application on a given multi-core architecture. Finding good resource allocations policies under varying CPU power, network speeds and
-loads is very challenging and labor intensive.~\cite{Calheiros:2011:CTM:1951445.1951450}. This problematic is even more difficult for the asynchronous scheme
-where variations of the parameters of the execution platform can lead to very different number of iterations required to converge and so to very different execution times.
-In this challenging context we think that the use of a simulation tool can leverage the possibility of testing various platform scenarios.
-
-The main contribution of this paper is to show that the use of a simulation tool (i.e. the SimGrid toolkit~\cite{SimGrid}) in the context of real
-parallel applications (i.e. large linear system solver) can help developers to better tune their application for a given multi-core architecture.
-To show the validity of this approach we first compare the simulated execution of the multisplitting algorithm with the GMRES (Generalized Minimal Residual) solver
-\cite{ref1} both in synchronous mode. The obtained results on different simulated multi-core architectures confirm the results previously obtained on non simulated architecture.
-We also confirm the efficiency of the asynchronous multisplitting algorithm comparing to the synchronous GMRES. In this way and with a simple computing architecture (a laptop)
-SimGrid allows us (with small modifications of the MPI code) to run a test campaign of a real parallel iterative applications on different simulated multi-core architectures.
-To our knowledge, there is no related work on the large-scale multi-core simulation of a real synchronous and asynchronous iterative application.
-
-This paper is organized as follows:
+\section{Introduction} The use of multi-core architectures to solve large
+scientific problems seems to become imperative in many situations.
+Whatever the scale of these architectures (distributed clusters, computational
+grids, embedded multi-core,~\ldots) they are generally well adapted to execute
+complex parallel applications operating on a large amount of data.
+Unfortunately, users (industrials or scientists), who need such computational
+resources, may not have an easy access to such efficient architectures. The cost
+of using the platform and/or the cost of testing and deploying an application
+are often very important. So, in this context it is difficult to optimize a
+given application for a given architecture. In this way and in order to reduce
+the access cost to these computing resources it seems very interesting to use a
+simulation environment. The advantages are numerous: development life cycle,
+code debugging, ability to obtain results quickly,~\ldots. In counterpart, the simulation results need to be consistent with the real ones.
+
+In this paper we focus on a class of highly efficient parallel algorithms called
+\emph{iterative algorithms}. The parallel scheme of iterative methods is quite
+simple. It generally involves the division of the problem into several
+\emph{blocks} that will be solved in parallel on multiple processing
+units. Each processing unit has to compute an iteration, to send/receive some
+data dependencies to/from its neighbors and to iterate this process until the
+convergence of the method. Several well-known methods demonstrate the
+convergence of these algorithms~\cite{BT89,bahi07}. In this processing mode a
+task cannot begin a new iteration while it has not received data dependencies
+from its neighbors. We say that the iteration computation follows a synchronous
+scheme. In the asynchronous scheme a task can compute a new iteration without
+having to wait for the data dependencies coming from its neighbors. Both
+communication and computations are asynchronous inducing that there is no more
+idle time, due to synchronizations, between two iterations~\cite{bcvc06:ij}.
+This model presents some advantages and drawbacks that we detail in
+section~\ref{sec:asynchro} but even if the number of iterations required to
+converge is generally greater than for the synchronous case, it appears that
+the asynchronous iterative scheme can significantly reduce overall execution
+times by suppressing idle times due to synchronizations~(see~\cite{bahi07}
+for more details).
+
+Nevertheless, in both cases (synchronous or asynchronous) it is very time
+consuming to find optimal configuration and deployment requirements for a given
+application on a given multi-core architecture. Finding good resource
+allocations policies under varying CPU power, network speeds and loads is very
+challenging and labor intensive~\cite{Calheiros:2011:CTM:1951445.1951450}. This
+problematic is even more difficult for the asynchronous scheme where a small
+parameter variation of the execution platform can lead to very different numbers
+of iterations to reach the converge and so to very different execution times. In
+this challenging context we think that the use of a simulation tool can greatly
+leverage the possibility of testing various platform scenarios.
+
+The main contribution of this paper is to show that the use of a simulation tool
+(i.e. the SimGrid toolkit~\cite{SimGrid}) in the context of real parallel
+applications (i.e. large linear system solvers) can help developers to better
+tune their application for a given multi-core architecture. To show the validity
+of this approach we first compare the simulated execution of the multisplitting
+algorithm with the GMRES (Generalized Minimal Residual)
+solver~\cite{saad86} in synchronous mode. The obtained results on different
+simulated multi-core architectures confirm the real results previously obtained
+on non simulated architectures. We also confirm the efficiency of the
+asynchronous multisplitting algorithm compared to the synchronous GMRES. In
+this way and with a simple computing architecture (a laptop) SimGrid allows us
+to run a test campaign of a real parallel iterative applications on
+different simulated multi-core architectures. To our knowledge, there is no
+related work on the large-scale multi-core simulation of a real synchronous and
+asynchronous iterative application.
+
+This paper is organized as follows. Section~\ref{sec:asynchro} presents the
+iteration model we use and more particularly the asynchronous scheme. In
+section~\ref{sec:simgrid} the SimGrid simulation toolkit is presented.
+Section~\ref{sec:04} details the different solvers that we use. Finally our
+experimental results are presented in section~\ref{sec:expe} followed by some
+concluding remarks and perspectives.