-Parallel computing and high performance computing (HPC) are becoming
-more and more imperative for solving various problems raised by
-researchers on various scientific disciplines but also by industrial in
-the field. Indeed, the increasing complexity of these requested
-applications combined with a continuous increase of their sizes lead to
-write distributed and parallel algorithms requiring significant hardware
-resources (grid computing, clusters, broadband network, etc\dots{}) but
-also a non-negligible CPU execution time. We consider in this paper a
-class of highly efficient parallel algorithms called iterative executed
-in a distributed environment. As their name suggests, these algorithm
-solves a given problem that might be NP- complete complex by successive
-iterations ($X_{n +1} = f(X_{n})$) from an initial value $X_{0}$ to find
-an approximate value $X^*$ of the solution with a very low
-residual error. Several well-known methods demonstrate the convergence
-of these algorithms. Generally, to reduce the complexity and the
-execution time, the problem is divided into several \emph{pieces} that will
-be solved in parallel on multiple processing units. The latter will
-communicate each intermediate results before a new iteration starts
-until the approximate solution is reached. These distributed parallel
-computations can be performed either in \emph{synchronous} communication mode
-where a new iteration begin only when all nodes communications are
-completed, either \emph{asynchronous} mode where processors can continue
-independently without or few synchronization points. Despite the
-effectiveness of iterative approach, a major drawback of the method is
-the requirement of huge resources in terms of computing capacity,
-storage and high speed communication network. Indeed, limited physical
-resources are blocking factors for large-scale deployment of parallel
-algorithms.
-
-In recent years, the use of a simulation environment to execute parallel
-iterative algorithms found some interests in reducing the highly cost of
-access to computing resources: (1) for the applications development life
-cycle and in code debugging (2) and in production to get results in a
-reasonable execution time with a simulated infrastructure not accessible
-with physical resources. Indeed, the launch of distributed iterative
-asynchronous algorithms to solve a given problem on a large-scale
-simulated environment challenges to find optimal configurations giving
-the best results with a lowest residual error and in the best of
-execution time. According our knowledge, no testing of large-scale
-simulation of the class of algorithm solving to achieve real results has
-been undertaken to date. We had in the scope of this work implemented a
-program for solving large non-symmetric linear system of equations by
-numerical method GMRES (Generalized Minimal Residual) in the simulation
-environment SimGrid. The simulated platform had allowed us to launch
-the application from a modest computing infrastructure by simulating
-different distributed architectures composed by clusters nodes
-interconnected by variable speed networks. In addition, it has been
-permitted to show the effectiveness of asynchronous mode algorithm by
-comparing its performance with the synchronous mode time. With selected
-parameters on the network platforms (bandwidth, latency of inter cluster
-network) and on the clusters architecture (number, capacity calculation
-power) in the simulated environment, the experimental results have
-demonstrated not only the algorithm convergence within a reasonable time
-compared with the physical environment performance, but also a time
-saving of up to \np[\%]{40} in asynchronous mode.
-
-This article is structured as follows: after this introduction, the next
-section will give a brief description of iterative asynchronous model.
-Then, the simulation framework SimGrid will be presented with the
-settings to create various distributed architectures. The algorithm of
-the multi -splitting method used by GMRES written with MPI primitives
-and its adaptation to SimGrid with SMPI (Simulated MPI) will be in the
-next section. At last, the experiments results carried out will be
-presented before the conclusion which we will announce the opening of
-our future work after the results.
+Parallel computing and high performance computing (HPC) are becoming more and more imperative for solving various
+problems raised by researchers on various scientific disciplines but also by industrial in the field. Indeed, the
+increasing complexity of these requested applications combined with a continuous increase of their sizes lead to write
+distributed and parallel algorithms requiring significant hardware resources (grid computing, clusters, broadband
+network, etc.) but also a non-negligible CPU execution time. We consider in this paper a class of highly efficient
+parallel algorithms called \emph{numerical iterative algorithms} executed in a distributed environment. As their name
+suggests, these algorithms solve a given problem by successive iterations ($X_{n +1} = f(X_{n})$) from an initial value
+$X_{0}$ to find an approximate value $X^*$ of the solution with a very low residual error. Several well-known methods
+demonstrate the convergence of these algorithms~\cite{BT89,Bahi07}.
+
+Parallelization of such algorithms generally involve the division of the problem into several \emph{blocks} that will
+be solved in parallel on multiple processing units. The latter will communicate each intermediate results before a new
+iteration starts and until the approximate solution is reached. These parallel computations can be performed either in
+\emph{synchronous} mode where a new iteration begins only when all nodes communications are completed,
+or in \emph{asynchronous} mode where processors can continue independently with few or no synchronization points. For
+instance in the \textit{Asynchronous Iterations~-- Asynchronous Communications (AIAC)} model~\cite{bcvc06:ij}, local
+computations do not need to wait for required data. Processors can then perform their iterations with the data present
+at that time. Even if the number of iterations required before the convergence is generally greater than for the
+synchronous case, AIAC algorithms can significantly reduce overall execution times by suppressing idle times due to
+synchronizations especially in a grid computing context (see~\cite{Bahi07} for more details).
+
+Parallel numerical applications (synchronous or asynchronous) may have different
+configuration and deployment requirements. Quantifying their resource
+allocation policies and application scheduling algorithms in grid computing
+environments under varying load, CPU power and network speeds is very costly,
+very labor intensive and very time
+consuming~\cite{Calheiros:2011:CTM:1951445.1951450}. The case of AIAC
+algorithms is even more problematic since they are very sensible to the
+execution environment context. For instance, variations in the network bandwidth
+(intra and inter-clusters), in the number and the power of nodes, in the number
+of clusters\dots{} can lead to very different number of iterations and so to
+very different execution times. Then, it appears that the use of simulation
+tools to explore various platform scenarios and to run large numbers of
+experiments quickly can be very promising. In this way, the use of a simulation
+environment to execute parallel iterative algorithms found some interests in
+reducing the highly cost of access to computing resources: (1) for the
+applications development life cycle and in code debugging (2) and in production
+to get results in a reasonable execution time with a simulated infrastructure
+not accessible with physical resources. Indeed, the launch of distributed
+iterative asynchronous algorithms to solve a given problem on a large-scale
+simulated environment challenges to find optimal configurations giving the best
+results with a lowest residual error and in the best of execution time.
+
+To our knowledge, there is no existing work on the large-scale simulation of a
+real AIAC application. The aim of this paper is twofold. First we give a first
+approach of the simulation of AIAC algorithms using a simulation tool (i.e. the
+SimGrid toolkit~\cite{SimGrid}). Second, we confirm the effectiveness of
+asynchronous mode algorithms by comparing their performance with the synchronous
+mode. More precisely, we had implemented a program for solving large
+linear system of equations by numerical method GMRES (Generalized
+Minimal Residual) \cite{ref1}. We show, that with minor modifications of the
+initial MPI code, the SimGrid toolkit allows us to perform a test campaign of a
+real AIAC application on different computing architectures. The simulated
+results we obtained are in line with real results exposed in ??\AG[]{ref?}.
+SimGrid had allowed us to launch the application from a modest computing
+infrastructure by simulating different distributed architectures composed by
+clusters nodes interconnected by variable speed networks. With selected
+parameters on the network platforms (bandwidth, latency of inter cluster
+network) and on the clusters architecture (number, capacity calculation power)
+in the simulated environment, the experimental results have demonstrated not
+only the algorithm convergence within a reasonable time compared with the
+physical environment performance, but also a time saving of up to \np[\%]{40} in
+asynchronous mode.
+\AG{Il faudrait revoir la phrase précédente (couper en deux?). Là, on peut
+ avoir l'impression que le gain de \np[\%]{40} est entre une exécution réelle
+ et une exécution simulée!}
+
+This article is structured as follows: after this introduction, the next section will give a brief description of
+iterative asynchronous model. Then, the simulation framework SimGrid is presented with the settings to create various
+distributed architectures. The algorithm of the multisplitting method used by GMRES \LZK{??? GMRES n'utilise pas la méthode de multisplitting! Sinon ne doit on pas expliquer le choix d'une méthode de multisplitting?} written with MPI primitives and
+its adaptation to SimGrid with SMPI (Simulated MPI) is detailed in the next section. At last, the experiments results
+carried out will be presented before some concluding remarks and future works.