From: RCE <cramamonjisoa@bilbo.iut-bm.univ-fcomte.fr>
Date: Sun, 27 Apr 2014 17:09:56 +0000 (+0200)
Subject: Merge
X-Git-Tag: hpcc2014_submission~57
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/hpcc2014.git/commitdiff_plain/e4b14590c5b948dac64c900d8d83e990a3f79122?ds=inline;hp=--cc

Merge
Merge branch 'master' of ssh://info.iut-bm.univ-fcomte.fr/hpcc2014

Conflicts:
	hpcc.tex
---

e4b14590c5b948dac64c900d8d83e990a3f79122
diff --cc hpcc.tex
index 6f7f7ca,49caa2f..640f3ae
--- a/hpcc.tex
+++ b/hpcc.tex
@@@ -127,55 -113,60 +113,90 @@@ at that time. Even if the number of ite
  synchronous case, AIAC algorithms can significantly reduce overall execution times by suppressing idle times due to
  synchronizations especially in a grid computing context (see~\cite{Bahi07} for more details).
  
- Parallel numerical applications (synchronous or asynchronous) may have different
- configuration and deployment requirements.  Quantifying their resource
- allocation policies and application scheduling algorithms in grid computing
- environments under varying load, CPU power and network speeds is very costly,
- very labor intensive and very time
- consuming~\cite{Calheiros:2011:CTM:1951445.1951450}.  The case of AIAC
- algorithms is even more problematic since they are very sensible to the
+ Parallel   (synchronous  or  asynchronous)   applications  may   have  different
+ configuration   and  deployment   requirements.    Quantifying  their   resource
+ allocation  policies and  application  scheduling algorithms  in grid  computing
+ environments under  varying load, CPU power  and network speeds  is very costly,
+ very          labor           intensive          and          very          time
+ consuming~\cite{Calheiros:2011:CTM:1951445.1951450}.     The   case    of   AIAC
+ algorithms  is  even  more problematic  since  they  are  very sensible  to  the
  execution environment context. For instance, variations in the network bandwidth
- (intra and inter-clusters), in the number and the power of nodes, in the number
- of clusters\dots{} can lead to very different number of iterations and so to
- very different execution times. Then, it appears that the use of simulation
- tools to explore various platform scenarios and to run large numbers of
- experiments quickly can be very promising. In this way, the use of a simulation
- environment to execute parallel iterative algorithms found some interests in
- reducing the highly cost of access to computing resources: (1) for the
- applications development life cycle and in code debugging (2) and in production
- to get results in a reasonable execution time with a simulated infrastructure
- not accessible with physical resources. Indeed, the launch of distributed
- iterative asynchronous algorithms to solve a given problem on a large-scale
- simulated environment challenges to find optimal configurations giving the best
+ (intra and inter-clusters), in the number  and the power of nodes, in the number
+ of clusters\dots{}  can lead to  very different number  of iterations and  so to
+ very  different execution times.  Then, it  appears that  the use  of simulation
+ tools  to  explore  various platform  scenarios  and  to  run large  numbers  of
+ experiments quickly can be very promising.  In this way, the use of a simulation
+ environment  to execute parallel  iterative algorithms  found some  interests in
+ reducing  the  highly  cost  of  access  to computing  resources:  (1)  for  the
+ applications development life cycle and  in code debugging (2) and in production
+ to get  results in a reasonable  execution time with  a simulated infrastructure
+ not  accessible  with physical  resources.  Indeed,  the  launch of  distributed
+ iterative  asynchronous algorithms  to solve  a given  problem on  a large-scale
+ simulated environment challenges to  find optimal configurations giving the best
  results with a lowest residual error and in the best of execution time.
  
++<<<<<<< HEAD
 +To our knowledge, there is no existing work on the large-scale simulation of a
 +real AIAC application. The aim of this paper is twofold. First we give a first
 +approach of the simulation of AIAC algorithms using a simulation tool (i.e. the
 +SimGrid toolkit~\cite{SimGrid}). Second, we confirm the effectiveness of
 +asynchronous mode algorithms by comparing their performance with the synchronous
 +mode. More precisely, we had implemented a program for solving large
 +linear system of equations by numerical method GMRES (Generalized
 +Minimal Residual) \cite{ref1}. We show, that with minor modifications of the
 +initial MPI code, the SimGrid toolkit allows us to perform a test campaign of a
 +real AIAC application on different computing architectures. The simulated
 +results we obtained are in line with real results exposed in ??\AG[]{ref?}.
 +SimGrid had allowed us to launch the application from a modest computing
 +infrastructure by simulating different distributed architectures composed by
 +clusters nodes interconnected by variable speed networks. In the simulated environment, after setting appropriate 
 +network and cluster parameters like the network bandwidth, latency or the processors power, 
 +the experimental results have demonstrated a asynchronous execution time saving up to \np[\%]{40} in
 +compared to the synchronous mode.
 +\AG{Il faudrait revoir la phrase prÃ©cÃ©dente (couper en deux?).  LÃ , on peut
 +  avoir l'impression que le gain de \np[\%]{40} est entre une exÃ©cution rÃ©elle
 +  et une exÃ©cution simulÃ©e!}
 +\CER{La phrase a Ã©tÃ© modifiÃ©e}
 +
 +This article is structured as follows: after this introduction, the next  section will give a brief description of
 +iterative asynchronous model.  Then, the simulation framework SimGrid is presented with the settings to create various
 +distributed architectures. The algorithm of  the multisplitting method based on GMRES \LZK{??? GMRES n'utilise pas la mÃ©thode de multisplitting! Sinon ne doit on pas expliquer le choix d'une mÃ©thode de multisplitting?} \CER{La phrase a Ã©tÃ© corrigÃ©e} written with MPI primitives and
 +its adaptation to SimGrid with SMPI (Simulated MPI) is detailed in the next section. At last, the experiments results
 +carried out will be presented before some concluding remarks and future works.
++=======
+ To our knowledge,  there is no existing work on the  large-scale simulation of a
+ real  AIAC application.   {\bf  The contribution  of  the present  paper can  be
+   summarised  in two  main  points}.  First  we  give a  first  approach of  the
+ simulation  of  AIAC algorithms  using  a  simulation  tool (i.e.   the  SimGrid
+ toolkit~\cite{SimGrid}).    Second,  we   confirm  the   effectiveness   of  the
+ asynchronous  multisplitting algorithm  by  comparing its  performance with  the
+ synchronous GMRES (Generalized Minimal  Residual) \cite{ref1}.  Both these codes
+ can be  used to  solve large linear  systems. In  this paper, we  focus on  a 3D
+ Poisson  problem.  We show,  that with  minor modifications  of the  initial MPI
+ code, the SimGrid  toolkit allows us to  perform a test campaign of  a real AIAC
+ application on different computing architectures.
+ % The  simulated results  we
+ %obtained are  in line with real  results exposed in  ??\AG[]{ref?}. 
+ SimGrid  had  allowed us  to  launch the  application  from  a modest  computing
+ infrastructure  by simulating  different distributed  architectures  composed by
+ clusters  nodes interconnected by  variable speed  networks.  Parameters  of the
+ network  platforms  are   the  bandwidth  and  the  latency   of  inter  cluster
+ network. Parameters on the cluster's architecture are the number of machines and
+ the  computation power  of a  machine.  Simulations show  that the  asynchronous
+ multisplitting algorithm  can solve the  3D Poisson problem  approximately twice
+ faster than GMRES with two distant clusters.
+ 
+ 
+ 
+ This article is structured as follows: after this introduction, the next section
+ will  give a  brief  description  of iterative  asynchronous  model.  Then,  the
+ simulation framework  SimGrid is presented  with the settings to  create various
+ distributed architectures.  Then, the  multisplitting method is presented, it is
+ based  on GMRES to  solve each  block obtained  of the  splitting. This  code is
+ written with MPI  primitives and its adaptation to  SimGrid with SMPI (Simulated
+ MPI) is  detailed in the next  section. At last, the  simulation results carried
+ out will be presented before some concluding remarks and future works.
++>>>>>>> 6785b9ef58de0db67c33ca901c7813f3dfdc76e0
   
  \section{Motivations and scientific context}
  
@@@ -438,54 -482,21 +513,20 @@@ study that the results depend on the fo
  A priori, obtaining a relative gain greater than 1 would be difficult in a local
  area network configuration where the synchronous mode will take advantage on the
  rapid exchange of information on such high-speed links. Thus, the methodology
 -adopted was to launch the application on clustered network. In this last
 +adopted was to launch the application on a clustered network. In this
  configuration, degrading the inter-cluster network performance will penalize the
  synchronous mode allowing to get a relative gain greater than 1.  This action
 -simulates the case of distant clusters linked with long distance network like
 -Internet.
 +simulates the case of distant clusters linked with long distance network as in grid computing context.
  
- \AG{Cette partie sur le poisson 3D
-   % on sait donc que ce n'est pas une plie ou une sole (/me fatiguÃ©)
-   n'est pas Ã  sa place.  Elle devrait Ãªtre placÃ©e plus tÃ´t.}
- In this paper, we solve the 3D Poisson problem whose the mathematical model is 
- \begin{equation}
- \left\{
- \begin{array}{l}
- \nabla^2 u = f \text{~in~} \Omega \\
- u =0 \text{~on~} \Gamma =\partial\Omega
- \end{array}
- \right.
- \label{eq:02}
- \end{equation}
- where $\nabla^2$ is the Laplace operator, $f$ and $u$ are real-valued functions, and $\Omega=[0,1]^3$. The spatial discretization with a finite difference scheme reduces problem~(\ref{eq:02}) to a system of sparse linear equations. The general iteration scheme of our multisplitting method in a 3D domain using a seven point stencil could be written as 
- \begin{equation}
- \begin{array}{ll}
- u^{k+1}(x,y,z)= & u^k(x,y,z) - \frac{1}{6}\times\\
-                & (u^k(x-1,y,z) + u^k(x+1,y,z) + \\
-                & u^k(x,y-1,z) + u^k(x,y+1,z) + \\
-                & u^k(x,y,z-1) + u^k(x,y,z+1)),
- \end{array}
- \label{eq:03}
- \end{equation} 
- where the iteration matrix $A$ of size $N_x\times N_y\times N_z$ of the discretized linear system is sparse, symmetric and positive definite. 
- 
- The parallel solving of the 3D Poisson problem with our multisplitting method requires a data partitioning of the problem between clusters and between processors within a cluster. We have chosen the 3D partitioning instead of the row-by-row partitioning in order to reduce the data exchanges at sub-domain boundaries. Figure~\ref{fig:4.2} shows an example of the data partitioning of the 3D Poisson problem between two clusters of processors, where each sub-problem is assigned to a processor. In this context, a processor has at most six neighbors within a cluster or in distant clusters with which it shares data at sub-domain boundaries. 
- 
- \begin{figure}[!t]
- \centering
-   \includegraphics[width=80mm,keepaspectratio]{partition}
- \caption{Example of the 3D data partitioning between two clusters of processors.}
- \label{fig:4.2}
- \end{figure}
- 
  
 -As a first step, the algorithm was run on a network consisting of two clusters
 -containing 50 hosts each, totaling 100 hosts. Various combinations of the above
 -factors have provided the results shown in Table~\ref{tab.cluster.2x50} with a
 -matrix size ranging from $N_x = N_y = N_z = \text{62}$ to 171 elements or from
 -$\text{62}^\text{3} = \text{\np{238328}}$ to $\text{171}^\text{3} =
 -\text{\np{5000211}}$ entries.
 +% As a first step, 
 +The algorithm was run on a two clusters based network with 50 hosts each, totaling 100 hosts. Various combinations of the above
 +factors have provided the results shown in Table~\ref{tab.cluster.2x50}. The algorithm convergence with a 3D
 +matrix size ranging from $N_x = N_y = N_z = \text{62}$ to 150 elements (that is from
 +$\text{62}^\text{3} = \text{\np{238328}}$ to $\text{150}^\text{3} =
 +\text{\np{3375000}}$ entries), is obtained in asynchronous in average 2.5 times speeder than the synchronous mode. 
  \AG{Expliquer comment lire les tableaux.}
 -
 +\CER{J'ai reformulÃ© la phrase par la lecture du tableau. Plus de dÃ©tails seront lus dans la partie InterprÃ©tations et commentaires}
  % use the same column width for the following three tables
  \newlength{\mytablew}\settowidth{\mytablew}{\footnotesize\np{E-11}}
  \newenvironment{mytable}[1]{% #1: number of columns for data