\usepackage[utf8]{inputenc}
\usepackage{amsfonts,amssymb}
\usepackage{amsmath}
%\usepackage{algorithm}
\usepackage{algpseudocode}
%\usepackage{amsthm}
\usepackage{graphicx}
\section{Introduction}
Parallel and high performance computing (HPC) have become essential for solving problems raised by researchers in many
scientific disciplines as well as by industrial practitioners. Indeed, the increasing complexity of the targeted
applications, combined with the continuous growth of their sizes, leads to the design of distributed and parallel
algorithms requiring significant hardware resources (grid computing, clusters, broadband networks, etc.) but also
non-negligible CPU execution times. In this paper we consider a class of highly efficient parallel algorithms called
\texttt{numerical iterative algorithms} executed in a distributed environment. As their name suggests, these algorithms
solve a given problem by successive iterations ($X_{n+1} = f(X_{n})$) starting from an initial value $X_{0}$, in order
to find an approximate value $X^*$ of the solution with a very low residual error. Several well-known methods
demonstrate the convergence of these algorithms \cite{}.
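
A classical instance of such a fixed-point scheme, given here only as an illustration (the solver actually studied in
this work is GMRES, introduced below), is the Jacobi method for a linear system $AX=b$: writing $A=D+R$ with $D$ the
diagonal of $A$, each iteration computes
\begin{equation*}
  X_{n+1} = D^{-1}\left(b - R\,X_{n}\right),
\end{equation*}
which converges, for instance, when $A$ is strictly diagonally dominant.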

Parallelizing such algorithms generally involves dividing the problem into several \emph{pieces} that are solved in
parallel on multiple processing units. The latter communicate their intermediate results before a new iteration starts,
until the approximate solution is reached. These parallel computations can be performed either in \emph{synchronous}
communication mode, where a new iteration begins only when all node communications are completed, or in
\emph{asynchronous} mode, where processors can continue independently with no or few synchronization points.
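
To make the synchronous mode concrete, the following self-contained sketch in C with MPI shows the typical structure of
a synchronous parallel iteration, in which a blocking exchange and a collective reduction act as synchronization points
between iterations. The damped-averaging update on a ring of processes is only a toy problem chosen for illustration,
not the solver studied in this paper:
{\small
\begin{verbatim}
/* Synchronous iteration on a ring (toy example): every process refines
 * a local value, exchanges it with its left neighbour in blocking mode,
 * and all processes agree on the global residual before iterating.     */
#include <math.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, iter = 0, lnb, rnb;
    double x, left, local_res, global_res;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    lnb = (rank - 1 + size) % size;      /* left neighbour on the ring   */
    rnb = (rank + 1) % size;             /* right neighbour on the ring  */
    x = (double)rank;                    /* initial local value X_0      */

    do {
        /* blocking exchange: send to the right, receive from the left  */
        MPI_Sendrecv(&x, 1, MPI_DOUBLE, rnb, 0,
                     &left, 1, MPI_DOUBLE, lnb, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        double new_x = 0.5 * (x + left); /* local piece of "f(X_n)"      */
        local_res = fabs(new_x - x);
        x = new_x;

        /* synchronisation point: iteration n+1 starts only once every
         * process has contributed its local residual                    */
        MPI_Allreduce(&local_res, &global_res, 1, MPI_DOUBLE,
                      MPI_MAX, MPI_COMM_WORLD);
        iter++;
    } while (global_res > 1e-6 && iter < 10000);

    if (rank == 0)
        printf("converged after %d synchronous iterations\n", iter);
    MPI_Finalize();
    return 0;
}
\end{verbatim}
}
The \texttt{MPI\_Allreduce} call also provides the global residual used to detect convergence; in asynchronous mode
this detection becomes significantly more involved.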

% DL: resume corrections from here
Despite the effectiveness of the iterative approach, a major drawback of the method is its requirement of huge
resources in terms of computing capacity, storage and high speed communication networks. Indeed, limited physical
resources are blocking factors for the large-scale deployment of parallel algorithms.

In recent years, the use of a simulation environment to execute parallel iterative algorithms has attracted some
interest as a way to reduce the high cost of access to computing resources: (1) during the application development life
cycle and for code debugging, and (2) in production, to obtain results in a reasonable execution time on a simulated
infrastructure that is not available as physical resources. Indeed, running distributed asynchronous iterative
algorithms on a large-scale simulated environment raises the challenge of finding the optimal configurations that give
the best results, i.e. the lowest residual error in the shortest execution time. To our knowledge, no large-scale
simulation of this class of algorithms producing real results has been undertaken to date. Within the scope of this
work, we have implemented a program for solving large non-symmetric linear systems of equations with the GMRES
(Generalized Minimal Residual) numerical method in the SimGrid simulation environment. The simulated platform allowed
us to launch the application from a modest computing infrastructure, by simulating different distributed architectures
composed of cluster nodes interconnected by networks of variable speed. In addition, it allowed us to show the
effectiveness of the asynchronous mode by comparing its performance with that of the synchronous mode. With selected
parameters for the network platform (bandwidth and latency of the inter-cluster network) and for the cluster
architecture (number of nodes and their computing power) in the simulated environment, the experimental results
demonstrated not only the convergence of the algorithm within a reasonable time compared with the performance of a
physical environment, but also a time saving of up to \np[\%]{40} in asynchronous mode.
This article is structured as follows: after this introduction, the next section gives a brief description of the
asynchronous iteration model. Then, the SimGrid simulation framework is presented, together with the settings used to
create various distributed architectures. The algorithm of the multisplitting method, which uses GMRES as inner solver
and is written with MPI primitives, and its adaptation to SimGrid with SMPI (Simulated MPI) are described in the next
section. Finally, the experimental results are presented before the conclusion, in which we outline our future work.
\section{Motivations and scientific context}

As exposed in the introduction, parallel iterative methods are now widely used in many scientific domains. They can be
classified into three main classes depending on how iterations and communications are managed (for more details readers
can refer to \cite{bcvc02:ip}). In the \textit{Synchronous Iterations - Synchronous Communications (SISC)} model, data
are exchanged at the end of each iteration. All the processors must begin the same iteration at the same time, and
important idle times on processors are generated. The \textit{Synchronous Iterations - Asynchronous Communications
(SIAC)} model is similar to the previous one, except that data required on another processor are sent asynchronously,
i.e. without stopping current computations. This technique allows communications to be partially overlapped with
computations, but unfortunately the overlapping is only partial and important idle times remain. It is clear that, in a
grid computing context, where computational nodes are numerous, heterogeneous and widely distributed, the idle times
generated by synchronizations are very penalizing. One way to overcome this problem is to use the
\textit{Asynchronous Iterations - Asynchronous Communications (AIAC)} model. Here, local computations do not need to
wait for required data. Processors can then perform their iterations with the data present at that time. Figure
\ref{fig:aiac} illustrates this model, where the grey blocks represent the computation phases, the white spaces the idle
times and the arrows the communications. With this algorithmic model, the number of iterations required before
convergence is generally greater than for the two former classes. But, as detailed in \cite{bcvc06:ij}, AIAC algorithms
can significantly reduce overall execution times by suppressing idle times due to synchronizations, especially in a
grid computing context.

\begin{figure}[htbp]
  \centering
    \includegraphics[width=8cm]{AIAC.pdf}
  \caption{The Asynchronous Iterations - Asynchronous Communications model}
  \label{fig:aiac}
\end{figure}
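
As an illustration of the AIAC principle, the following self-contained sketch in C with MPI keeps iterating with the
most recent neighbour value already received instead of blocking until a fresh one arrives. The ring-based
damped-averaging update and the fixed iteration count are assumptions made only for this example, not elements of the
method studied in this paper:
{\small
\begin{verbatim}
/* AIAC-style iteration on a ring (toy example): each process uses the
 * latest neighbour value that has already arrived and never waits for a
 * fresh one.  A fixed iteration count replaces convergence detection,
 * which is non-trivial to implement in a truly asynchronous setting.    */
#include <mpi.h>
#include <stdio.h>

#define ITERS 200

int main(int argc, char *argv[])
{
    int rank, size, lnb, rnb, flag, consumed = 0;
    double x, left, inbox, outbox;
    MPI_Request rreq, sreq = MPI_REQUEST_NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    lnb = (rank - 1 + size) % size;     /* neighbours on the ring        */
    rnb = (rank + 1) % size;
    x = (double)rank;                   /* initial local value X_0       */
    left = x;                           /* last neighbour value so far   */

    /* post an initial non-blocking receive from the left neighbour      */
    MPI_Irecv(&inbox, 1, MPI_DOUBLE, lnb, 0, MPI_COMM_WORLD, &rreq);

    for (int it = 0; it < ITERS; it++) {
        /* send the current value without waiting for it to be consumed  */
        MPI_Wait(&sreq, MPI_STATUS_IGNORE); /* previous send buffer free */
        outbox = x;
        MPI_Isend(&outbox, 1, MPI_DOUBLE, rnb, 0, MPI_COMM_WORLD, &sreq);

        /* use a fresh neighbour value only if it has already arrived,
           otherwise keep computing with the old one: no idle time       */
        MPI_Test(&rreq, &flag, MPI_STATUS_IGNORE);
        if (flag) {
            left = inbox;
            consumed++;
            MPI_Irecv(&inbox, 1, MPI_DOUBLE, lnb, 0, MPI_COMM_WORLD, &rreq);
        }

        x = 0.5 * (x + left);           /* local iteration               */
    }
    MPI_Wait(&sreq, MPI_STATUS_IGNORE);

    /* each process sent exactly ITERS values, so receive the ones not
       consumed during the loop before calling MPI_Finalize              */
    while (consumed < ITERS) {
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
        left = inbox;
        consumed++;
        if (consumed < ITERS)
            MPI_Irecv(&inbox, 1, MPI_DOUBLE, lnb, 0, MPI_COMM_WORLD, &rreq);
    }

    printf("process %d: final value %g\n", rank, x);
    MPI_Finalize();
    return 0;
}
\end{verbatim}
}
Compared with the synchronous sketch of the introduction, the blocking exchange and the collective reduction have
disappeared: a process that receives nothing during an iteration simply reuses the last known value.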

It is very challenging to develop efficient applications for large-scale, heterogeneous and distributed platforms such
as computing grids. Researchers and engineers have to develop techniques for maximizing application performance on
these multi-cluster platforms, by redesigning the applications and/or by using novel algorithms that can account for
the composite and heterogeneous nature of the platform. Unfortunately, the deployment of such applications on these
very large scale systems is very costly, labor intensive and time consuming. In this context, it appears that the use
of simulation tools to explore various platform scenarios at will and to run enormous numbers of experiments quickly
can be very promising. Several works...

In the context of AIAC algorithms, the use of simulation tools is even more relevant. Indeed, this class of
applications is very sensitive to the execution environment. For instance, variations in the network bandwidth (intra-
and inter-cluster), in the number and computing power of nodes, in the number of clusters, etc. can lead to very
different numbers of iterations and hence to very different execution times.

\section{SimGrid}
SimGrid~\cite{casanova+legrand+quinson.2008.simgrid,SimGrid} is a simulation
framework to study the behavior of large-scale distributed systems. As its name
suggests, it originates from the grid computing community, but it is nowadays
used to study grids, clouds, HPC and peer-to-peer systems.
%- open source, developed since 1999, one of the major solutions in the field
%
SimGrid provides several programming interfaces: MSG to simulate Concurrent
Sequential Processes, SimDAG to simulate DAGs of (parallel) tasks, and SMPI to
run real applications written in MPI~\cite{MPI}. Apart from the native C
interface, SimGrid provides bindings for the C++, Java, Lua and Ruby programming
languages. The SMPI interface supports applications written in C or Fortran,
with little or no modification.
%- implements most of the MPI-2 \cite{ref} standard [CHECK]

%%% explain simulation
%- simulated processes folded in one real process
%- simulates interactions on the network, fluid model
%- able to skip long-lasting computations
%- traces + visu?
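
As a simple indication of how an existing MPI code is taken to simulation with SMPI, the usual workflow relies on the
\texttt{smpicc} and \texttt{smpirun} wrappers; the file names below are placeholders and the platform description is
assumed to exist:
{\small
\begin{verbatim}
smpicc -O3 -o solver solver.c
smpirun -np 16 -platform platform.xml \
        -hostfile hostfile.txt ./solver
\end{verbatim}
}
The platform file describes the simulated hosts, links and their properties in XML, and the hostfile lists the
simulated hosts onto which the MPI processes are mapped.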
%%% platforms
%- describe resources and their interconnection, with their properties
%- XML files

%%% validation + refs

\AG{Describe SimGrid~\cite{casanova+legrand+quinson.2008.simgrid,SimGrid} (Arnaud)}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Simulation of the multisplitting method}
\end{equation}
is solved independently by a cluster, and communications are required to update the right-hand side sub-vector $Y_l$, such that the sub-vectors $X_m$ represent the data dependencies between the clusters. As each sub-system (\ref{eq:4.1}) is solved in parallel by a cluster of processors, our multisplitting method uses an iterative method as an inner solver, which is easier to parallelize and more scalable than a direct method. In this work, we use the parallel GMRES algorithm~\cite{ref1}, which is one of the most widely used iterative methods.
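
For reference, the local sub-system handled by cluster $l$ in such a multisplitting scheme classically takes the block
form below; this display is given only as an illustration of the structure behind (\ref{eq:4.1}), with $A_{ll}$
denoting the diagonal block owned by cluster $l$:
\begin{equation*}
A_{ll} X_l = Y_l, \qquad Y_l = B_l - \sum_{m=1,\, m\neq l}^{L} A_{lm} X_m.
\end{equation*}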
\begin{figure}
  %%% IEEE instructions forbid to use an algorithm environment here, use figure
  %%% instead
\begin{algorithmic}[1]
\Input $A_l$ (sparse sub-matrix), $B_l$ (right-hand side sub-vector)
\Output $X_l$ (solution sub-vector)\vspace{0.2cm}
\State \Return $X_l^k$
\EndFunction
\end{algorithmic}
\caption{A multisplitting solver with GMRES method}
\label{algo:01}
\end{figure}
The algorithm in Figure~\ref{algo:01} shows the main key points of the
multisplitting method used to solve a large sparse linear system. This
algorithm is based on an outer-inner iteration method, where the parallel
synchronous GMRES method is used to solve the inner iteration. It is executed
in parallel by each cluster of processors. For all $l,m\in\{1,\ldots,L\}$, the
matrices and vectors with the subscript $l$ represent the local data for
cluster $l$, while $\{A_{lm}\}_{m\neq l}$ are the off-diagonal matrices of the
sparse matrix $A$ and $\{X_m\}_{m\neq l}$ contain the vector elements of the
solution $x$ shared with neighboring clusters. At every outer iteration $k$,
asynchronous communications are performed between the processors of the local
cluster and those of distant clusters (lines $6$ and $7$ in
Figure~\ref{algo:01}). The shared vector elements of the solution $x$ are
exchanged by message passing using MPI non-blocking communication routines.
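
The non-blocking exchanges of lines $6$ and $7$ could, for instance, follow the pattern sketched below. This is a
hedged illustration in C, not the authors' implementation: the neighbour lists, counts and offsets are placeholder
names for the data-dependency structure between clusters:
{\small
\begin{verbatim}
#include <mpi.h>

/* Post the non-blocking exchange of shared vector elements with the
   neighbouring clusters (placeholder names).  The caller later completes
   or merely tests the 2*nb_neigh requests, depending on the mode.       */
void post_shared_exchange(double *x_local, double *x_remote,
                          int nb_neigh, const int *neigh,
                          const int *send_count, const int *recv_count,
                          const int *send_off, const int *recv_off,
                          MPI_Comm comm, MPI_Request *reqs)
{
    for (int i = 0; i < nb_neigh; i++) {
        /* elements of X owned by the neighbouring cluster neigh[i]      */
        MPI_Irecv(x_remote + recv_off[i], recv_count[i], MPI_DOUBLE,
                  neigh[i], 0, comm, &reqs[2 * i]);
        /* locally owned elements that this neighbour depends on         */
        MPI_Isend(x_local + send_off[i], send_count[i], MPI_DOUBLE,
                  neigh[i], 0, comm, &reqs[2 * i + 1]);
    }
}
\end{verbatim}
}
In the synchronous variant the posted requests are completed with \texttt{MPI\_Waitall} before the next outer
iteration, whereas the asynchronous variant merely tests them and carries on with the data already available.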
\begin{figure}
\centering