synchronous case, AIAC algorithms can significantly reduce overall execution times by suppressing the idle times due to
synchronizations, especially in a grid computing context (see~\cite{Bahi07} for more details).
Parallel applications (synchronous or asynchronous) may have different
configuration and deployment requirements. Quantifying their resource
allocation policies and application scheduling algorithms in grid computing
environments under varying load, CPU power and network speeds is costly,
labor intensive and time
consuming~\cite{Calheiros:2011:CTM:1951445.1951450}. The case of AIAC
algorithms is even more problematic since they are very sensitive to the
execution environment context. For instance, variations in the network bandwidth
(intra- and inter-cluster), in the number and the power of nodes, or in the
number of clusters\dots{} can lead to very different numbers of iterations and
thus to very different execution times. It therefore appears very promising to
use simulation tools to explore various platform scenarios and to run large
numbers of experiments quickly. In this respect, executing parallel iterative
algorithms in a simulation environment reduces the high cost of access to
computing resources: (1) during the application development life cycle and for
code debugging, and (2) in production, to get results in a reasonable execution
time on a simulated infrastructure that is not accessible with physical
resources. Indeed, running distributed asynchronous iterative algorithms to
solve a given problem on a large-scale simulated environment raises the
challenge of finding the optimal configurations, i.e., those giving the best
results with the lowest residual error in the best execution time.
To our knowledge, there is no existing work on the large-scale simulation of a
real AIAC application. The contribution of the present paper can be summarized
in two main points. First, we give a first approach to the simulation of AIAC
algorithms using a simulation tool (i.e., the SimGrid
toolkit~\cite{SimGrid}). Second, we confirm the effectiveness of the
asynchronous multisplitting algorithm by comparing its performance with that of
the synchronous GMRES (Generalized Minimal Residual) method~\cite{ref1}. Both
codes can be used to solve large linear systems; in this paper, we focus on a
3D Poisson problem. We show that, with minor modifications of the initial MPI
code, the SimGrid toolkit allows us to perform a test campaign of a real AIAC
application on different computing architectures.
SimGrid allowed us to launch the application from a modest computing
infrastructure by simulating different distributed architectures composed of
cluster nodes interconnected by variable speed networks. The parameters of the
network platforms are the bandwidth and the latency of the inter-cluster
network. The parameters of the cluster architecture are the number of machines
and the computation power of a machine. Simulations show that the asynchronous
multisplitting algorithm can solve the 3D Poisson problem approximately twice
as fast as GMRES with two distant clusters.

This article is structured as follows. After this introduction, the next
section gives a brief description of the asynchronous iteration model. Then,
the simulation framework SimGrid is presented, together with the settings used
to create various distributed architectures. Next, the multisplitting method is
presented; it relies on GMRES to solve each block resulting from the
splitting. This code is written with MPI primitives, and its adaptation to
SimGrid with SMPI (Simulated MPI) is detailed in the following section.
Finally, the simulation results are presented before some concluding remarks
and future work.
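In practice, such network and cluster parameters are described in a SimGrid
platform file. The fragment below is only an illustrative sketch of a
two-cluster platform in SimGrid's XML platform description; all identifiers,
host powers, bandwidths and latencies are hypothetical placeholder values, not
the ones used in our experiments:

```xml
<?xml version="1.0"?>
<!DOCTYPE platform SYSTEM "http://simgrid.gforge.inria.fr/simgrid.dtd">
<platform version="3">
  <AS id="AS0" routing="Full">
    <!-- two clusters of 50 hosts each (placeholder power/bw/lat values) -->
    <cluster id="cluster0" prefix="c0-" suffix=".me" radical="0-49"
             power="1E9" bw="1.25E8" lat="5E-5"/>
    <cluster id="cluster1" prefix="c1-" suffix=".me" radical="0-49"
             power="1E9" bw="1.25E8" lat="5E-5"/>
    <!-- slower long-distance link between the two clusters -->
    <link id="backbone" bandwidth="1.25E7" latency="5E-4"/>
    <ASroute src="cluster0" dst="cluster1"
             gw_src="c0-cluster0_router.me" gw_dst="c1-cluster1_router.me">
      <link_ctn id="backbone"/>
    </ASroute>
  </AS>
</platform>
```

Degrading the bandwidth and latency attributes of the inter-cluster link is
then enough to emulate distant clusters.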
\section{Motivations and scientific context}
A priori, obtaining a relative gain greater than 1 would be difficult in a
local area network configuration, where the synchronous mode takes advantage of
the rapid exchange of information on such high-speed links. Thus, the
methodology adopted was to launch the application on a clustered network. In
this configuration, degrading the inter-cluster network performance penalizes
the synchronous mode, making it possible to obtain a relative gain greater
than 1. This setting simulates the case of distant clusters linked by a
long-distance network, as in a grid computing context.
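The relative gain discussed here can be made explicit, assuming (as is usual)
that it is defined as the ratio of the synchronous to the asynchronous
execution time:
\begin{equation*}
  \text{relative gain} = \frac{T_{\text{sync}}}{T_{\text{async}}},
\end{equation*}
so that a value greater than 1 means that the asynchronous mode is faster.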
In this paper, we solve the 3D Poisson problem whose mathematical model is
\begin{equation}
\left\{
\begin{array}{l}
\nabla^2 u = f \text{~in~} \Omega \\
u =0 \text{~on~} \Gamma =\partial\Omega
\end{array}
\right.
\label{eq:02}
\end{equation}
where $\nabla^2$ is the Laplace operator, $f$ and $u$ are real-valued
functions, and $\Omega=[0,1]^3$. The spatial discretization with a finite
difference scheme reduces problem~(\ref{eq:02}) to a system of sparse linear
equations. The general iteration scheme of our multisplitting method in a 3D
domain using a seven-point stencil can be written as
\begin{equation}
\begin{array}{ll}
u^{k+1}(x,y,z)= & \frac{1}{6}\times(u^k(x-1,y,z) + u^k(x+1,y,z) + \\
& u^k(x,y-1,z) + u^k(x,y+1,z) + \\
& u^k(x,y,z-1) + u^k(x,y,z+1) \\
& - h^2 f(x,y,z)),
\end{array}
\label{eq:03}
\end{equation}
where $h$ is the discretization step and the matrix $A$ of the discretized
linear system, of size $N_x\times N_y\times N_z$, is sparse, symmetric and
positive definite.

The parallel solving of the 3D Poisson problem with our multisplitting method
requires a data partitioning of the problem between clusters and between
processors within a cluster. We have chosen the 3D partitioning instead of the
row-by-row partitioning in order to reduce the data exchanges at sub-domain
boundaries. Figure~\ref{fig:4.2} shows an example of the data partitioning of
the 3D Poisson problem between two clusters of processors, where each
sub-problem is assigned to a processor. In this context, a processor has at
most six neighbors, within its own cluster or in distant clusters, with which
it shares data at sub-domain boundaries.

\begin{figure}[!t]
\centering
\includegraphics[width=80mm,keepaspectratio]{partition}
\caption{Example of the 3D data partitioning between two clusters of processors.}
\label{fig:4.2}
\end{figure}
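To make the scheme concrete, one relaxation sweep of the seven-point
discretization can be sketched as follows. This is a minimal serial kernel
given for illustration only; it is not the actual multisplitting/MPI code, and
the function name, grid sizes (nx, ny, nz), mesh step h and array layout are
hypothetical choices:

```c
#include <stddef.h>

/* One Jacobi sweep of the seven-point stencil for the 3D Poisson
 * problem: unew = (sum of the six neighbors - h^2 * f) / 6.
 * u, f and unew are (nx+2)*(ny+2)*(nz+2) arrays with a one-cell
 * ghost layer holding the Dirichlet boundary condition u = 0. */
#define IDX(i, j, k) (((size_t)(i) * (ny + 2) + (j)) * (nz + 2) + (k))

void jacobi_sweep(int nx, int ny, int nz, double h,
                  const double *u, const double *f, double *unew)
{
    for (int i = 1; i <= nx; i++)
        for (int j = 1; j <= ny; j++)
            for (int k = 1; k <= nz; k++)
                unew[IDX(i, j, k)] =
                    (u[IDX(i - 1, j, k)] + u[IDX(i + 1, j, k)] +
                     u[IDX(i, j - 1, k)] + u[IDX(i, j + 1, k)] +
                     u[IDX(i, j, k - 1)] + u[IDX(i, j, k + 1)] -
                     h * h * f[IDX(i, j, k)]) / 6.0;
}
```

In the parallel code, each processor applies such a sweep to its own sub-domain
and exchanges the ghost-layer values with its at most six neighbors.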
The algorithm was run on a network of two clusters with 50 hosts each,
totaling 100 hosts. Various combinations of the above factors provided the
results shown in Table~\ref{tab.cluster.2x50}. With a 3D matrix size ranging
from $N_x = N_y = N_z = 62$ to $150$ elements (that is, from $62^3 =
\np{238328}$ to $150^3 = \np{3375000}$ entries), convergence in asynchronous
mode is obtained on average 2.5 times faster than in synchronous mode.
% use the same column width for the following three tables
\newlength{\mytablew}\settowidth{\mytablew}{\footnotesize\np{E-11}}
\newenvironment{mytable}[1]{% #1: number of columns for data