heteropar10/heteropar10.tex

   1 %% LNCS
   2 \documentclass[10pt,conference]{llncs}
   3
   4 \usepackage[T1]{fontenc}
   5 \usepackage{ucs}
   6 \usepackage[utf8x]{inputenc}
   7 \usepackage{lmodern}
   8 \usepackage{color}
   9 \usepackage{amsmath}
  10 \usepackage{amsfonts}
  11 \usepackage[english]{babel}
  12 \usepackage[pdftex,final]{graphicx}
  13 \usepackage[ruled,vlined]{algorithm2e}
  14 \usepackage[pdftex]{graphicx}
  15 \usepackage{multirow}
  16 \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
  17
  18 %\title{MAHEVE: A New Reliable AIAC Mapping Strategy for Heterogeneous
  19 %  Environments \thanks{This work was supported by the European
  20 %    Interreg IV From-P2P project.}}
  21
  22 \title{MAHEVE: An Efficient Reliable Mapping of Asynchronous
  23   Iterative Applications on Volatile and Heterogeneous
  24   Environments \thanks{This work was supported by the European
  25    Interreg IV From-P2P project.}}
  26
  27
  28
  29 \author{Raphaël Couturier, David Laiymani and Sébastien Miquée}
  30   \authorrunning{R. Couturier, D. Laiymani and S. Miquée}
  31   \institute{\vspace{-0.2cm}
  32      University of Franche-Comté \qquad  LIFC laboratory\\%[1mm]
  33      IUT Belfort-Montb\'eliard, 2 Rue Engel Gros \\ BP 27 90016 Belfort,
  34      France\\\{{\tt
  35       raphael.couturier,david.laiymani,sebastien.miquee}\}{\tt
  36       @univ-fcomte.fr} }
  37
  38
  39 \date{}
  40 \begin{document}
  41
  42
  43 \maketitle
  44
  45 \vspace{-0.2cm}
  46
  47 \begin{abstract}
  48   With the emergence of massive distributed computing resources, such
  49   as grids and distributed clusters architectures, parallel
  50   programming is used to benefit from them and execute problems of
  51   larger sizes. The asynchronous iteration model, called AIAC, has
  52   been proven to be an efficient solution for heterogeneous and
  53   distributed architectures. An efficient mapping of applications'
  54   tasks is essential to reduce their execution time. In this paper we
  55   present a new mapping algorithm, called MAHEVE (Mapping Algorithm
  56   for HEterogeneous and Volatile Environments) which is efficient on
  57   such architectures and integrates a fault tolerance mechanism to
  58   resist computing nodes failures. Our experiments show gains on a
  59   typical AIAC application execution time of about $55\%$, executed
  60   on distributed clusters architectures containing more than 400
  61   computing cores with the JaceP2P-V2 environment.
  62
  63 %\textup{\small \textbf{Keywords:} Mapping algorithm, Fault tolerance,
  64 %  Distributed clusters, Parallel iterative asynchronous algorithms,
  65 %  Heterogeneous architectures.}
  66 \end{abstract}
  67
  68 \section{Introduction}
  69 \label{sec:intro}
  70
  71 Nowadays, scientific applications require a great computation power to
  72 solve large problems. Though personal computers are becoming more
  73 powerful, in many cases they are not sufficient. One well adapted
  74 solution is to use computers clusters in order to combine the power of
  75 many machines. Distributed clusters form such an architecture,
  76 providing a great computing power, by aggregating the computation
  77 power of multiple clusters spread over multiple sites.  Such an
  78 architecture brings users heterogeneity in computing machines as well
  79 as network latency. In order to use such an architecture, parallel
  80 programming is required. In the parallel computing area, in order to
  81 execute very large applications on heterogeneous architectures,
  82 iterative methods are well adapted\cite{book_raph,bcvc06:ij}.
  83
  84
  85 These methods repeat the same instructions block until a convergence
  86 state and a desired approximation of the solution are reached. They
  87 constitute the only known approach to solving some kinds of problems
  88 and are relatively easy to parallelize. The Jacobi or the Conjugate
  89 Gradient\cite{cg} methods are examples of such methods. To parallelize
  90 them, one of the most used methods is the message passing paradigm
  91 which provides efficient mechanisms to exchange data between tasks. As
  92 such a method, we focus here on the asynchronous parallel iterative
  93 model, called AIAC\cite{book_raph} (for \textit{Asynchronous
  94   Iterations -- Asynchronous Communications}).
  95
  96 \begin{figure}[h!]
  97  \vspace{-0.4cm}
  98   \centering
  99   \includegraphics[width=7.4cm]{images/AIAC}
 100   \caption{Two processors computing in the Asynchronous Iterations -- Asynchronous Communications (AIAC) model}
 101   \label{fig:AIAC}
 102   \vspace{-0.90cm}
 103 \end{figure}
 104
 105
 106 In this model, as can be seen on Figure \ref{fig:AIAC}, after each
 107 iteration, a task sends its results to its neighbors and immediately
 108 starts the next iteration with the last received data. The receiving
 109 and sending mechanisms are asynchronous and tasks do not have to wait
 110 for the reception of dependency messages from their
 111 neighbors. Consequently, there is no idle time between two
 112 iterations. Furthermore, this model is tolerant to messages loss and
 113 even if a task is stopped the remaining tasks continue the
 114 computation, with the last available data. Several
 115 experiments\cite{bcvc06:ij} show the relevance of the AIAC algorithms
 116 in the context of distributed clusters with high latency between
 117 clusters. These works underline the good adaptability of AIAC
 118 algorithms to network and processor heterogeneity.
 119
 120
 121 In a previous study\cite{pdsec10} we proposed two static mapping
 122 algorithms of tasks to processors dedicated to the AIAC model on
 123 heterogeneous distributed clusters.  Both these two algorithms,
 124 AIAC-QM (for \textit{AIAC Quick-quality Map}) and F-EC (for
 125 \textit{Farhat Edges-Cuts}) showed an important performances
 126 improvement by reducing up to $50\%$ the application execution
 127 time. These experiments were performed by using the JaceP2P-V2
 128 environment. This Java based platform is an executing and developing
 129 environment dedicated to the AIAC model. By implementing a distributed
 130 backup/restore mechanism it is also fully fault
 131 tolerant\cite{jaceP2P-v2}. In our previous experiments we did not
 132 introduce computing nodes failures during the computation. As
 133 architecture heterogeneity continually evolves according to
 134 computing nodes volatility, we have to take care more precisely about
 135 the heterogeneity of the target platform. Thus in this paper our main
 136 contribution is to propose a new mapping algorithm called MAHEVE
 137 (\textit{Mapping Algorithm for HEterogeneous and Volatile
 138   Environments}).  This algorithm explicitly tackles the heterogeneity
 139 issue and introduces a level of dynamism in order to adapt itself to
 140 the fault tolerance mechanisms. Our experiments show gains of about
 141 $55\%$ on application execution time, which is about 10 points
 142 better than AIAC-QM and about 25 points better than F-EC.
 143
 144
 145 The rest of this paper is organized as
 146 follows. Section~\ref{sec:jacep2p} presents the JaceP2P-V2 middleware
 147 by describing its architecture and briefly presenting its fault
 148 tolerance mechanisms. Section~\ref{sec:pb} formalizes our mapping and
 149 fault tolerance problems and quotes existing issues to address
 150 them. Section~\ref{sec:maheve} describes the new mapping strategy we
 151 propose, MAHEVE. In Section~\ref{sec:expe} we present the experiments
 152 we conducted on the Grid'5000 testbed with more than 400 computing
 153 cores. Finally, we give some concluding remarks and plan our future
 154 work in Section~\ref{sec:conclu}.
 155
 156
 157 \section{JaceP2P-V2}
 158 \label{sec:jacep2p}
 159
 160 JaceP2P-V2\cite{jaceP2P-v2} is a distributed platform implemented in
 161 Java, dedicated to developing and executing parallel iterative
 162 asynchronous applications. It is fully fault tolerant allowing it to
 163 execute parallel applications over volatile environments. To our
 164 knowledge this is the only platform dedicated to designing and
 165 executing AIAC algorithms in such volatile environments.
 166
 167
 168 The JaceP2P-V2 platform part, which is based on the daemons and
 169 supervisors paradigm, is composed of three main entities:
 170
 171 \begin{itemize}
 172 \item The ``super-nodes'', which are in charge of supervising free
 173   computing nodes connected to the platform;
 174
 175 \item The ``spawner'', which is launched by a user wanting to execute
 176   a parallel application. It is in charge of a group of computing
 177   nodes and monitors them. If one of them fails, it requires a
 178   replacing one to a super-node;
 179
 180 \item The ``daemon'', when launched, connects to a super-node and
 181   waits for a task to execute. Each daemon can communicate directly
 182   with its computing neighbors.
 183
 184 \end{itemize}
 185
 186
 187 To be able to execute asynchronous iterative applications, JaceP2P-V2
 188 has an asynchronous messaging mechanism. In order to resist daemons
 189 failures, it implements a distributed backup mechanism called the
 190 \textit{uncoordinated distributed checkpointing}\cite{uncoord_cp}.
 191 This decentralized procedure allows the platform to be very scalable,
 192 with no weak points and does not require a secure nor a stable station
 193 for backups. When a daemon dies, it is replaced by another one. Here
 194 we suppose that we have enough available free nodes. Moreover, to
 195 resist supervisors failures and scalability, it reserves some extra
 196 nodes. For more details on the JaceP2P-V2 platform, interested readers
 197 can refer to \cite{jaceP2P-v2}.
 198
 199
 200 \section{Mapping and fault tolerance problems}
 201 \label{sec:pb}
 202
 203 \subsection{Model formalization}
 204 \label{sec:pbmodel}
 205
 206 \subsubsection{Application modeling}
 207 \label{sec:pbmodelapp}
 208
 209
 210 With the AIAC model, all tasks compute in parallel at the same time,
 211 without precedence nor synchronization. During an iteration, each task
 212 computes its job and sends its results to its neighbors, and
 213 immediately starts the next iteration. The TIG\cite{tig1}
 214 (\textit{Task Interaction Graph}) model is the most appropriate to our
 215 problem, as it only models relationships between tasks. In this model,
 216 all the tasks are considered simultaneously executable and
 217 communications can take place at any time during the computation, with
 218 no precedence nor synchronization.
 219
 220
 221 In the TIG model, a parallel application is represented by a graph
 222 $GT(V,E)$, where \mbox{$V = \{V_1,V_2,\dots V_v\}$} is the set of
 223 $|V|$ vertices and \mbox{$E \subset V \times V$} is the set of
 224 undirectional edges. The vertices represent tasks and the edges
 225 represent the mutual communication among tasks. A function \mbox{$EC :
 226   V \rightarrow \mathbb{R}^+$} gives the computation cost of tasks and
 227 \mbox{$CC : E \rightarrow \mathbb{R}^+$} gives the communication cost
 228 for message passing on edges. We define \mbox{$|V| = v$, $EC(V_i) =
 229   e_i$} and \mbox{$CC(V_i,V_j) = c_{ij}$}.  Another function
 230 \mbox{$D : V \rightarrow \mathbb{N}^+$} gives the amount of
 231 dependencies of a task, and we define \mbox{$D(V_i) = d_i$}.
 232
 233
 234 \subsubsection{Architecture modeling}
 235 \label{sec:pbmodelarchi}
 236
 237
 238 A distributed clusters architecture can be modeled by a
 239 three-level-graph. The levels are \textit{architecture} (a) (here the
 240 Grid'5000 grid), \textit{cluster} (c), and \textit{computing node} (n)
 241 levels. Let $GG(N,L)$ be a graph representing a distributed clusters
 242 architecture, where \mbox{$N = \{N_1,N_2,\dots N_n\}$} is the set of
 243 $|N|$ vertices and $L$ is the set of $|L|$ undirectional edges. The
 244 vertices represent the computing nodes and the edges represent the
 245 links between them. An edge \mbox{$L_i \in L$} is an unordered pair
 246 \mbox{$(N_x,N_y) \in N$}, representing a communication link
 247 between nodes $N_x$ and $N_y$. A function \mbox{$WN : N \rightarrow
 248   \mathbb{R}^+$} gives the computational power of nodes and another
 249 function \mbox{$WL : L \rightarrow \mathbb{R}^+$} gives the
 250 communication latency of links. We define \mbox{$WN(N_i) = wn_i$} and
 251 \mbox{$WL(L_i,L_j) = wl_{ij}$}. Let be $|C|$ the number of clusters
 252 contained in the architecture. A function \mbox{$CN : C \rightarrow
 253   \mathbb{N}^+$} gives the amount of computing nodes contained in a
 254 cluster, and another function \mbox{$CF : C \rightarrow \mathbb{N}^+$}
 255 gives the amount of available computing nodes (not involved in an
 256 application computation) of a cluster. We define \mbox{$CN(C_i) =
 257   C_{Ni}$} and \mbox{$CF(C_i) = C_{Fi}$}. We also define \mbox{$C_{Pi}
 258   = \sum_{j=1}^{C_{Ni}}{wn_j}$} as the whole computation power of
 259 cluster $C_i$, \mbox{$C_{\overline{P}i} = \frac{C_{Pi}}{C_{Ni}}$}
 260 as the average computation power of cluster $C_i$, and
 261 $C_{\overline{P}fi}$ the average power of its available resources.
 262
 263 We evaluate the \textit{heterogeneity degree} of the architecture,
 264 noted $hd$, by using the \textit{relative standard deviation} method,
 265 with $hd = \frac{\sigma_{PN}}{avg_{PN}}$ where $avg_{PN}$ is the
 266 average computing power of nodes and $\sigma_{PN}$ represents the
 267 standard deviation of computing nodes power. This measure provides us
 268 the coefficient of variation of the platform in percentage -- we only
 269 consider \mbox{$0 \leq hd \leq 1$} as considering values of \mbox{$hd
 270   > 1$} is not relevant, as \mbox{$hd = 1$} denotes a fully
 271 heterogeneous platform.
 272
 273
 274 \subsubsection{Mapping functions}
 275 \label{sec:pbmodelmapping}
 276
 277
 278 When a parallel application $App$, represented by a graph $GT$, is
 279 mapped on a distributed clusters architecture, represented by a graph
 280 $GG$, the execution time of the application, $ET(App)$, can be defined
 281 as the execution time of the slowest task. Indeed, an application ends
 282 when all the tasks have detected convergence and reached the desired
 283 approximation of the solution.  We define $ET(App) = \max_{i=1 \dots
 284   v} ( ET(V_i) )$, where the execution time of each task $i$
 285 \mbox{($i=1 \dots v$)}, $ET(V_i)$, is given by $ET(V_i) =
 286 \frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \times wl_{ij}$ where $e_i$
 287 is the computational cost of $V_i$, $wn_i$ is the computational power
 288 of the node $N_i$ on which $V_i$ is mapped, $J$ represents the
 289 neighbors set of $V_i$, $c_{ij}$ is the amount of communications
 290 between $V_i$ and $V_j$, and $wl_{ij}$ is the link latency between the
 291 computing nodes on which $V_i$ and $V_j$ are mapped. As described in
 292 this formula, the execution time of a task depends on the task weight
 293 and on the communications which may occur between this task and its
 294 neighbors. We underline here that in the AIAC model, it is impossible
 295 to predict the number of iterations of a task. So it is difficult to
 296 evaluate a priori its cost $e_i$.
 297
 298
 299 This tasks mapping problem is similar to the classical graph
 300 partitioning and task assignment problem, and is thus NP-complete.
 301
 302
 303 \subsection{Fault tolerance}
 304 \label{sec:pbft}
 305
 306 In volatile environments, computing nodes can disconnect at any time
 307 during the computation, and have thus to be efficiently replaced.
 308
 309
 310 The replacing nodes should be the best ones at the fault time,
 311 according to the chosen mapping algorithm, by searching them in
 312 available nodes. As executing environments can regularly evolve, due
 313 to computing nodes volatility, a mapping algorithm has to keep a
 314 correct overview of the architecture, in real time. Thus, criteria to
 315 assign tasks to nodes should evolve too.
 316
 317
 318 Another problem appears after multiple crashes: some tasks may have
 319 migrated over multiple computing nodes and clusters, and the initial
 320 mapping may be totally changed. So, after having suffered some nodes
 321 failures the tasks mapping could not always satisfy the mapping
 322 criteria (not on the more powerful available machine, too far away
 323 from its neighbors\dots{}). A good fault tolerance policy has to
 324 evolve dynamically with the executing
 325 environment.
 326
 327
 328 \subsection{Specificities of the AIAC mapping problem}
 329 \label{sec:specAIACmapping}
 330
 331 An important point to take into consideration is that we do not allow
 332 the execution of multiple tasks on the same computing node, as this
 333 provides a fall of performances when this one fails. Indeed we should
 334 redeploy all of the tasks from this node to another one, using last
 335 saves, which can be spread on multiple computing nodes. This may
 336 result in large communication overheads and in a waste of computation
 337 time. Nevertheless, to benefit from multi-cores processors, we use a
 338 task level parallelism by multi-threaded sequential solvers for
 339 example.
 340
 341
 342 Another important point in the AIAC model is that as the JaceP2P-V2
 343 environment is fault tolerant and tasks save checkpoints on their
 344 neighbors, it is more efficient to save on near nodes than on far ones
 345 in order to reduce the communication overhead during this operation,
 346 and to restart a task faster.
 347
 348
 349 \subsection{Related work}
 350 \label{sec:pbrw}
 351
 352 In the literature of the TIG mapping many algorithms exist, which can
 353 be broadly classified into two categories.  The first one is the
 354 \textit{Edge-cuts optimization} class, which minimizes the use of the
 355 penalizing links between clusters. As tasks are depending on
 356 neighbors, which are called dependencies, the goal is to choose nodes
 357 where distance, in term of network, is small to improve communications
 358 between tasks. Here we can cite Metis\cite{metis} and
 359 Chaco\cite{chaco} which are libraries containing such kind of
 360 algorithms. The second category is the \textit{Execution time
 361   optimization} class, which aims at minimizing the whole application
 362 execution time. These algorithms look for nodes which can provide the
 363 smallest execution time of tasks using their computational power. Here
 364 we can cite QM\cite{qm_these}, FastMap\cite{fastmap}, and
 365 MiniMax\cite{minimax} as such kind of algorithms.
 366
 367
 368 Both classes of algorithms may fit with our goals as in our model we
 369 have both the computational power of nodes and communication costs
 370 which may influence the applications performances.
 371
 372
 373 All mentioned algorithms do not tackle the computing nodes failures
 374 issue, or only basically by applying the same policy. As explained in
 375 Section \ref{sec:pbft}, a more efficient and dedicated replacement
 376 function is needed. Nevertheless, to the best of our knowledge, no
 377 tasks mapping algorithm, addressing explicitly both the executing
 378 platform heterogeneity and the computing nodes failures issues,
 379 exists.
 380
 381
 382 \vspace{-0.25cm}
 383 \section{MAHEVE}
 384 \label{sec:maheve}
 385
 386 Here we present our new tasks mapping strategy, called MAHEVE (for
 387 \textit{Mapping Algorithm for HEterogeneous and Volatile
 388   Environments}). This algorithm aims at taking the best part of each
 389 category mentioned in Section \ref{sec:pbrw}, the edge-cuts
 390 minimization and the application execution time optimization
 391 algorithms.
 392
 393
 394 This new algorithm can be divided into two parts. The first part aims
 395 at performing the initial mapping, and the second part is devoted to
 396 search replacing nodes when computing nodes failures occur.
 397
 398
 399 \vspace{-0.20cm}
 400 \subsection{Initial mapping}
 401 \label{sec:maheve_init}
 402
 403 In this section we will study the main mechanisms of the
 404 \textit{static mapping} done by MAHEVE, which is composed of three
 405 phases: sort of clusters, sort of tasks, and the effective mapping,
 406 which maps tasks (in their sort order) on nodes of clusters (also in
 407 their sort order) with a reservation of some nodes in each cluster.
 408
 409
 410 \vspace{-0.15cm}
 411 \subsubsection{Sorting clusters}
 412 \label{sec:maheve_sort_clusters}
 413
 414 The first step of the initial mapping is to sort clusters according to
 415 the executing platform heterogeneity degree $hd$. The main principles
 416 are that a cluster obtains a better mark when $hd < 0.5$ and it
 417 contains more computing nodes than other clusters ($C_{Fi}$, the
 418 number of available free nodes, is privileged), and when $hd \ge 0.5$
 419 and it contains more powerful computing nodes ($C_{\overline{P}fi}$,
 420 the average free computation power, is privileged). These choices come
 421 from several experiments with the AIAC model, which show that in such
 422 environments it is more efficient to privilege the computation power
 423 or the number of nodes. As the number of nodes, $C_{Fi}$, and the
 424 average free computing power, $C_{\overline{P}fi}$, are not in the
 425 same order of magnitude, we normalize them with two functions, $normN$
 426 and $normP$. They are defined with \mbox{$normN\left(C_{Fi}\right) =
 427   C_{Fi} \times 100 \div \sum_{j=1}^{|C|}C_{Fj}$}, which is the rate
 428 of computing nodes, and \mbox{$normP(C_{\overline{P}fi}) =
 429   C_{\overline{P}fi} \times 100 \div \sum_{j=1}^{|C|}
 430   C_{\overline{P}fj}$}, which is the rate of the average power, both
 431 representing the cluster in the architecture. We note
 432 $normN\left( C_{Fi} \right) = NC_{Fi}$ and $normP(C_{\overline{P}fi})
 433 =NC_{\overline{P}fi}$.
 434
 435
 436 The formula used to give a mark, $M_i$, to a cluster is
 437 \begin{equation}
 438   \label{eq:cluster}
 439   M_i = {NC_{\overline{P}fi}}^{hd} + {NC_{Fi}}^{1 - hd}.
 440 \end{equation}
 441
 442
 443 This compromise function allows us to privilege clusters following our
 444 criteria, as explained previously, according to the heterogeneity
 445 degree. If we study its limits for the $hd$'s extremities, $hd = 0$
 446 and $hd = 1$, we obtain $\lim_{hd \rightarrow 0} M_i = NC_{Fi}$ and
 447 $\lim_{hd \rightarrow 1} M_i = NC_{\overline{P}fi}$, which fit with
 448 our objectives.
 449
 450 Clusters are so sorted and placed in a list containing them, starting
 451 from the cluster which receives the better mark to the one which
 452 receives the lower mark.
 453
 454
 455 \subsubsection{Sorting tasks}
 456 \label{sec:maheve_sort_tasks}
 457
 458 Like clusters, tasks are also sorted according to the heterogeneity
 459 degree of the executing platform. This sorting is done in the same way as
 460 previously, as when $hd < 0.5$ tasks with higher dependencies will be
 461 privileged, and when $hd \ge 0.5$ tasks with higher computing cost are
 462 privileged.
 463 %, in order to be executed on the highest powered computing nodes.
 464
 465
 466 The main function used to classified tasks is
 467 \begin{equation}
 468   \label{eq:tasks}
 469   Q_i = {e_i}^{hd} \times {d_i}^{1 - hd}
 470 \end{equation}
 471
 472 where $Q_i$ is the evaluation of the task $i$ according to the
 473 heterogeneity degree $hd$ and $d_i$, the amount of dependencies of
 474 task $i$.
 475
 476
 477 Tasks are taken in the order of the first sort, determined with
 478 equation (\ref{eq:tasks}), and each task is placed in a new list (the
 479 final one) and some of its dependencies are added. We note $Nb_i =
 480 {d_i}^{1 - hd}$ this amount of dependencies as the lower the
 481 heterogeneity degree is the higher this number will be.  This final
 482 operation allows to control the necessary locality of tasks according
 483 to $hd$. %the heterogeneity degree of the platform.
 484
 485
 486 \subsubsection{Mapping method}
 487 \label{sec:maheve_mapping_method}
 488
 489 The third step of the initial mapping is to allocate tasks to
 490 nodes. As clusters and tasks have been sorted accordingly to the
 491 executing platform heterogeneity degree, ordered from the highest
 492 mark to the lowest, this function maps tasks on each available
 493 computing nodes of clusters, in their respective order in lists (for
 494 example a task classified first in the tasks list is mapped on an
 495 available node of the cluster classified first in the clusters list).
 496 The idea here is not to fulfill each cluster, but to preserve some
 497 computing nodes in each cluster. These conserved nodes will be used to
 498 replace failed nodes.
 499
 500
 501 Here we can mentioned that the whole mapping process (the three steps)
 502 has a complexity of $O( |V|^2 )$, where |V| is the number of tasks.
 503
 504
 505 \subsection{Replacing function}
 506 \label{sec:maheve_rep}
 507
 508 This function is essential in a volatile environment, as an efficient
 509 replacement should reduce the overhead on the application execution
 510 time due to the loss of computing steps and data.
 511
 512 As shown in the previous section, during the initial mapping some
 513 computing nodes in each cluster have been preserved. When a node fails
 514 this function replace it by a free node of the same cluster. If none
 515 is available this function sorts again clusters, to take into
 516 consideration platform modifications, and replace the failed node by
 517 one available in the new sorted clusters list. This mechanism allows
 518 to keep tasks locality and a real time overview of the executing
 519 platform.
 520
 521
 522 \section{Experimentation}
 523 \label{sec:expe}
 524
 525 \subsection{A typical AIAC application and the execution platform}
 526 \label{sec:cond}
 527
 528 We used the ``Kernel CG'' application of the NAS Parallel Benchmarks
 529 (NPB) \cite{nas} to evaluate the performances of our new mapping
 530 algorithm. This benchmark is designed to be used on large
 531 architectures, as it stresses communications%over latency networks
 532 , by processing unstructured matrix vector multiplication with a
 533 Conjugate Gradient method. As this method cannot be executed with the
 534 asynchronous iteration model we have replaced it by another method
 535 called the multisplitting method, which supports the asynchronous
 536 iterative model. More details about this method can be found in
 537 \cite{book_raph}. The chosen problem used a matrix of size $5,000,000$
 538 with a low bandwidth, fixed to $35,000$. This bandwidth size
 539 generates, according to the problem size, between 8 and 20 neighbors
 540 per tasks. This application was executed on 64 nodes.
 541
 542
 543 The platform used for our tests, called Grid’5000\cite{g5k}, is a
 544 French nationwide experimental set of clusters which provides us with
 545 distributed clusters architectures (28 heterogeneous clusters spread
 546 over 9 sites).  We used three distributed clusters architectures, each
 547 having a different heterogeneity degree. The first one was composed of
 548 four clusters spread over four sites, with a total of 106 computing
 549 nodes representing 424 computing cores with \mbox{$hd = 0.08$}; the
 550 second one was composed of four clusters spread over three sites, with
 551 a total of 110 computing nodes representing 440 computing cores with
 552 \mbox{$hd = 0.50$}; and finally the third one was composed of five
 553 clusters spread over four sites with 115 computing nodes representing
 554 620 computing cores with \mbox{$hd = 0.72$}.
 555
 556
 557 All computing nodes of these clusters have at least 4 computing cores
 558 (in the last used architecture, with \mbox{$hd = 0.72$}, two clusters
 559 are composed of 8 computing cores machines) with a minimum of 4GB of
 560 memory (in order to execute the application with a big problem
 561 size). All computing nodes can communicate with each other through an
 562 efficient network. Nevertheless, this latter is shared with many other
 563 users so high latencies appear during executions.
 564
 565
 566 \subsection{Experiments}
 567 \label{sec:experiments}
 568
 569 During executions, we introduced two failures in computing nodes
 570 involved in the computation every 20 seconds to simulate a volatile
 571 environment. Unfortunately, we did not have the opportunity to realize
 572 experiments with more computing nodes over more sites with problems of
 573 larger sizes, but we plan to extend our experiments in the future.
 574
 575
 576 Here we present the results of the evaluation of the MAHEVE algorithm,
 577 compared with FT-AIAC-QM (for \textit{Fault Tolerant AIAC-QM}) and
 578 FT-FEC (for \textit{Fault Tolerant F-EC}) which are respectively the
 579 fault tolerant versions of the AIAC-QM and F-EC mapping algorithms
 580 presented in \cite{pdsec10}. Table \ref{tab:results} shows the
 581 execution times of each mapping algorithm compared to the default
 582 mapping strategy of the JaceP2P-V2 platform, with the corresponding
 583 gains on application execution time, given in brackets. It presents
 584 both the executions with faults (WF) and the fault free (FF)
 585 executions.
 586
 587 \renewcommand{\arraystretch}{1.6}
 588 \begin{table}[h!]
 589   \centering
 590   \begin{tabular}{|c|c|c|c|c|c|c|c|c|}
 591     \hline
 592     \multirow{2}{*}{ ~$hd$~ }&\multicolumn{2}{c|}{ ~Default~ }&\multicolumn{2}{c|}{ ~FT-AIAC-QM~ }&\multicolumn{2}{c|}{ ~FT-FEC~ }&\multicolumn{2}{c|}{ ~MAHEVE~ }\\
 593     \cline{2-9}
 594
 595     & ~FF~ & ~WF~ & ~FF~ & ~WF~ & ~FF~ & ~WF~ & ~FF~ & ~WF~ \\
 596     \hline
 597     ~$0.08$~ & ~80~ & ~229~ & ~63 (21\%)~ & ~178 (22\%)~ & ~61 (23\%)~ & ~154
 598     (33\%)~ & ~60 (25\%)~ & ~113 (50\%)~ \\
 599
 600     ~$0.50$~ & ~67~ & ~242~ & ~61 (9\%)~ & ~118 (51\%)~ & ~63 (6\%)~ & ~133
 601     (45\%)~ & ~54 (20\%)~ & ~85 (65\%)~ \\
 602
 603     ~$0.72$~ & ~67~ & ~192~ & ~59 (12\%)~ & ~99 (45\%)~ & ~65 (3\%)~ & ~121
 604     (33\%)~ & ~52 (22\%)~ & ~86 (53\%)~\\
 605
 606     \hline
 607   \end{tabular}
 608   \vspace{0.15cm}
 609   \caption{Application execution time in seconds and corresponding gains on various
 610     platforms using different mapping algorithms, with fault free (FF) executions
 611     and with 2 node failures each 20 seconds (WF) executions.}
 612   \label{tab:results}
 613   \vspace{-0.7cm}
 614 \end{table}
 615
 616
 617 First of all, we can note that all mapping algorithms provide an
 618 enhancement of the application performances by considerably reducing
 619 its execution time, especially for executions with node failures, with
 620 an average gain of about $45\%$ in general in comparison to the
 621 default policy. If we focus on executions with node failures (WF),
 622 FT-FEC is efficient on architectures with a low heterogeneity degree
 623 (\mbox{$hd = 0.08$}) by providing gains of about $33\%$, and gains are
 624 roughly the same on heterogeneous architectures (\mbox{$hd =
 625   0.72$}). FT-AIAC-QM is efficient on architectures with a high
 626 heterogeneity degree (\mbox{$hd = 0.72$}) by providing gains of about
 627 $45\%$, whereas it is not so efficient on homogeneous architectures
 628 (\mbox{$hd = 0.08$}) by providing gains of about $22\%$. We can note
 629 here that on an architecture with a heterogeneity degree of $0.50$
 630 FT-AIAC-QM is more efficient than FT-FEC by providing gains up to
 631 $50\%$. Here we point out that in fault free executions (FF), both
 632 algorithms also provide gains on their respective favorite
 633 architectures, though gains are less great than in executions with
 634 faults (WF).
 635
 636
 637 Now if we focus on the performances of our new solution MAHEVE, we can
 638 see that it is all the time better than other algorithms. As can be
 639 seen in \mbox{Table \ref{tab:results}}, in executions with faults
 640 (WF), it reduces the application's execution time by about $50\%$ on
 641 homogeneous architectures (here of $0.08$ heterogeneity degree) which
 642 is more than 25 point better than FT-FEC and near 30 points better
 643 than FT-AIAC-QM. On heterogeneous architectures (here of $0.72$
 644 heterogeneity degree) it also outperforms other mapping algorithms by
 645 reducing the application execution time by about $53\%$ which is
 646 almost 10 points better than FT-AIAC-QM and 20 points better than
 647 FT-FEC. On middle heterogeneity degree architectures (here of $0.50$
 648 heterogeneity degree), MAHEVE is once again better than its two
 649 comparative mapping algorithms by reducing the application execution
 650 time by about $53\%$. These good performances come from the fact that
 651 it is designed to be efficient on both architectures, homogeneous and
 652 heterogeneous. Moreover, as it integrates a fault tolerance
 653 \textit{security} in the initial mapping, it is more efficient when
 654 computing nodes fail. Here we can point out that this algorithm allows
 655 in general gains on application execution time of about $55\%$. In fault free executions (FF), it outperforms once again
 656 the two other algorithms.
 657
 658
 659 \section{Conclusion and future works}
 660 \label{sec:conclu}
 661
 662 In this paper we have presented a new mapping algorithm, called
 663 MAHEVE, to address the AIAC mapping issue on heterogeneous and
 664 volatile environments. It aims at doing an efficient mapping of tasks
 665 on distributed clusters architectures by taking the best part of the
 666 two known approaches, application execution time optimization and
 667 edge-cuts minimization. Experiments show that it is the most efficient
 668 mapping algorithm on all kinds of architectures, as it takes into
 669 account their heterogeneity degree and adapt its sort methods to
 670 it. We have shown that it is all the time better than the two other
 671 comparative mapping algorithms, FT-AIAC-QM and FT-FEC. This can be
 672 explained by the fact that it not only takes care about computing
 673 nodes and clusters, but also about the tasks' properties, what refines
 674 the mapping solution.
 675
 676
 677 In our future works we plan to enhance the MAHEVE algorithm
 678 performances by modifying the notation of clusters, since their
 679 locality has not yet been taken into consideration. This would favor
 680 tasks locality, which would reduce communications delays and provide a
 681 much better convergence rate.
 682
 683
 684 %%%
 685
 686
 687 \bibliographystyle{unsrt}
 688
 689 \bibliography{biblio}
 690
 691
 692 \end{document}
 693