1 \documentclass[10pt,conference]{llncs}
3 \usepackage[T1]{fontenc}
5 \usepackage[utf8x]{inputenc}
10 \usepackage[english]{babel}
11 \usepackage[pdftex,final]{graphicx}
12 \usepackage[ruled,vlined]{algorithm2e}
15 \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
\title{MAHEVE: An Efficient Reliable Mapping of Asynchronous Iterative
Applications on Volatile and Heterogeneous Environments\thanks{This
work was supported by the European Interreg IV From-P2P project.}}
24 \author{Raphaël Couturier, David Laiymani and Sébastien Miquée}
25 \authorrunning{R. Couturier, D. Laiymani and S. Miquée}
26 \institute{\vspace{-0.2cm}
27 University of Franche-Comté \qquad LIFC laboratory\\
IUT Belfort-Montb\'eliard, 2 Rue Engel Gros, BP 27, 90016 Belfort, France\\
\email{\{raphael.couturier,david.laiymani,sebastien.miquee\}@univ-fcomte.fr}}
With the emergence of massive distributed computing resources, such
as grids and distributed clusters architectures, parallel
programming is used to benefit from them and to solve larger
problems. The asynchronous iteration model, called AIAC, has been
proven to be an efficient solution for heterogeneous and distributed
architectures. An efficient mapping of application tasks is
essential to reduce their execution time. In this paper we present a
new mapping algorithm, called MAHEVE (Mapping Algorithm for
HEterogeneous and Volatile Environments), which is efficient on such
architectures and integrates a fault tolerance mechanism to
withstand computing node failures. Our experiments, conducted with
the JaceP2P-V2 environment on distributed clusters architectures
containing more than 400 computing cores, show gains of about $55\%$
on the execution time of a typical AIAC application.
59 \section{Introduction}
Nowadays, scientific applications require great computing power to
solve large problems. Though personal computers are becoming more
powerful, in many cases they are not sufficient. One well adapted
solution is to use computer clusters in order to combine the power
of many machines. Distributed clusters form such an architecture,
providing great computing power by aggregating the computation power
of multiple clusters spread over multiple sites. Such an
architecture brings users heterogeneity in computing machines as
well as network latency. In order to use such an architecture,
parallel programming is required. In the parallel computing area, in
order to execute very large applications on heterogeneous
architectures, iterative methods are well
adapted\cite{book_raph,bcvc06:ij}.
These methods repeat the same block of instructions until a
convergence state and a desired approximation of the solution are
reached. They constitute the only known approach to solving some
kinds of problems and are relatively easy to parallelize. The Jacobi
or the Conjugate Gradient\cite{cg} methods are examples of such
methods. To parallelize them, one of the most widely used approaches
is the message passing paradigm, which provides efficient mechanisms
to exchange data between tasks. Within this paradigm, we focus here
on the asynchronous parallel iterative model, called
AIAC\cite{book_raph} (for \textit{Asynchronous Iterations --
Asynchronous Communications}).
\begin{figure}[ht]
\centering
\includegraphics[width=7.4cm]{images/AIAC}
\caption{Two processors computing in the Asynchronous Iterations -- Asynchronous Communications (AIAC) model}
\label{fig:AIAC}
\end{figure}
In this model, as can be seen in Figure \ref{fig:AIAC}, after each
iteration a task sends its results to its neighbors and immediately
starts the next iteration with the last received data. The receiving
and sending mechanisms are asynchronous, and tasks do not have to
wait for the reception of dependency messages from their
neighbors. Consequently, there is no idle time between two
iterations. Furthermore, this model is tolerant to message loss, and
even if a task is stopped the remaining tasks continue the
computation with the last available data. Several
experiments\cite{bcvc06:ij} show the relevance of AIAC algorithms in
the context of distributed clusters with high latency between
clusters. These works underline the good adaptability of AIAC
algorithms to network and processor heterogeneity.
In a previous study\cite{pdsec10} we proposed two static algorithms
for mapping tasks to processors, dedicated to the AIAC model on
heterogeneous distributed clusters. Both algorithms, AIAC-QM (for
\textit{AIAC Quick-quality Map}) and F-EC (for \textit{Farhat
Edges-Cuts}), showed significant performance improvements by
noticeably reducing application execution times. These experiments
were performed using the fully fault tolerant JaceP2P-V2
environment, described in the next section. In our previous
experiments we did not introduce computing node failures during the
computation. Since the architecture heterogeneity continually
evolves with node volatility, the heterogeneity of the target
platform must be considered more carefully. Thus, the main
contribution of this paper is a new mapping algorithm called MAHEVE
(\textit{Mapping Algorithm for HEterogeneous and Volatile
Environments}). This algorithm explicitly tackles the heterogeneity
issue and introduces a level of dynamism in order to adapt itself to
the fault tolerance mechanisms. Our experiments show gains of up to
$65\%$ on application execution time in executions with faults,
which is about 10 points better than AIAC-QM and about 25 points
better than F-EC; MAHEVE also outperforms them in fault free
executions.
The rest of this paper is organized as
follows. Section~\ref{sec:jacep2p} presents the JaceP2P-V2
middleware by describing its architecture and briefly presenting its
fault tolerance mechanisms. Section~\ref{sec:pb} formalizes our
mapping and fault tolerance problems and surveys existing approaches
to address them. Section~\ref{sec:maheve} describes the new mapping
strategy we propose, MAHEVE. In Section~\ref{sec:expe} we present
the experiments we conducted on the Grid'5000 testbed with more than
400 computing cores. Finally, we give some concluding remarks and
outline our future work in Section~\ref{sec:conclu}.
JaceP2P-V2\cite{jaceP2P-v2} is a distributed platform implemented in
Java, dedicated to developing and executing parallel iterative
asynchronous applications. It is fully fault tolerant, which allows
it to execute parallel applications over volatile environments. To
our knowledge this is the only platform dedicated to designing and
executing AIAC algorithms in such volatile environments.
The JaceP2P-V2 platform, which is based on the daemon/supervisor
paradigm, is composed of three main entities:
\begin{itemize}
\item The ``super-nodes'', which are in charge of supervising free
  computing nodes connected to the platform;
\item The ``spawner'', which is launched by a user wanting to execute
  a parallel application. It is in charge of a group of computing
  nodes and monitors them. If one of them fails, it requests a
  replacement from a super-node;
\item The ``daemon'', which first connects to a super-node and waits
  for a task to execute. Each daemon can communicate directly with
  its computing neighbors.
\end{itemize}
To be able to execute AIAC applications, JaceP2P-V2 provides an
asynchronous messaging mechanism and, to resist daemon failures, it
implements a checkpoint/restart mechanism based on a distributed
backup scheme called \textit{uncoordinated distributed
checkpointing}\cite{uncoord_cp}. This decentralized procedure makes
the platform very scalable, with no single point of failure, and it
requires neither a secure nor a stable station for backups. When a
daemon dies, it is replaced by another one, as we suppose that there
are enough available free nodes. Moreover, to resist supervisor
failures and for scalability, some extra nodes are reserved. For
more details on the JaceP2P-V2 platform, interested readers can
refer to \cite{jaceP2P-v2}.
\section{Mapping and fault tolerance problems}
\label{sec:pb}
192 \subsection{Model formalization}
195 \subsubsection{Application modeling}
196 \label{sec:pbmodelapp}
The TIG\cite{tig1} (\textit{Task Interaction Graph}) model is the
most appropriate to our problem, as it only models relationships
between tasks. In this model, all the tasks are considered
simultaneously executable and communications can take place at any
time during the computation, with no precedence or synchronization
constraints. As a reminder, during an iteration in the AIAC model,
each task computes its job and sends its results to its neighbors,
then immediately starts the next iteration with the last received
data.
In the TIG model, a parallel application is represented by a graph
$GT(V,E)$, where \mbox{$V = \{V_1,V_2,\dots V_v\}$} is the set of
$|V|$ vertices and \mbox{$E \subset V \times V$} is the set of
undirected edges. The vertices represent tasks and the edges
represent the mutual communications between tasks. A function
\mbox{$EC : V \rightarrow \mathbb{R}^+$} gives the computation cost
of tasks and \mbox{$CC : E \rightarrow \mathbb{R}^+$} gives the
communication cost for message passing on edges. We define
\mbox{$|V| = v$, $EC(V_i) = e_i$} and \mbox{$CC(V_i,V_j) =
c_{ij}$}. Another function \mbox{$D : V \rightarrow \mathbb{N}^+$}
gives the number of dependencies of a task, and we define
\mbox{$D(V_i) = d_i$}.
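To make the notation concrete, here is a minimal Python sketch of the TIG model (the task names, costs, and edges below are illustrative assumptions, not values from the paper):

```python
# Minimal sketch of the TIG model: tasks as vertices with computation
# costs (EC), undirected edges with communication costs (CC), and D
# giving the number of dependencies d_i. All values are illustrative.
tasks_ec = {"V1": 10.0, "V2": 8.0, "V3": 12.0}       # EC : V -> R+
edges_cc = {("V1", "V2"): 2.0, ("V2", "V3"): 1.5}    # CC : E -> R+ (undirected)

def dependencies(task):
    """D : V -> N, the number of dependencies d_i of a task."""
    return sum(1 for edge in edges_cc if task in edge)

print(dependencies("V2"))  # V2 communicates with V1 and V3: 2
```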
222 \subsubsection{Architecture modeling}
223 \label{sec:pbmodelarchi}
A distributed clusters architecture can be modeled by a three-level
graph. The levels are the \textit{architecture} (a) (here the
Grid'5000 grid), \textit{cluster} (c), and \textit{computing node}
(n) levels. Let $GG(N,L)$ be a graph representing a distributed
clusters architecture, where \mbox{$N = \{N_1,N_2,\dots N_n\}$} is
the set of $|N|$ vertices and $L$ is the set of $|L|$ undirected
edges. The vertices represent the computing nodes and the edges
represent the links between them. An edge \mbox{$L_i \in L$} is an
unordered pair \mbox{$(N_x,N_y) \in N$}, representing a
communication link between nodes $N_x$ and $N_y$. A function
\mbox{$WN : N \rightarrow \mathbb{R}^+$} gives the computational
power of nodes and another function \mbox{$WL : L \rightarrow
\mathbb{R}^+$} gives the communication latency of links. We define
\mbox{$WN(N_i) = wn_i$} and, for the link between nodes $N_i$ and
$N_j$, \mbox{$WL((N_i,N_j)) = wl_{ij}$}. Let $|C|$ be the number of
clusters contained in the architecture. A function \mbox{$CN : C
\rightarrow \mathbb{N}^+$} gives the number of computing nodes
contained in a cluster, and another function \mbox{$CF : C
\rightarrow \mathbb{N}^+$} gives the number of available computing
nodes (not involved in computation) of a cluster. We define
\mbox{$CN(C_i) = C_{Ni}$} and \mbox{$CF(C_i) = C_{Fi}$}. We also
define $C_{\overline{P}fi}$ as the average computing power of the
available resources of cluster $C_i$.
We evaluate the \textit{heterogeneity degree} of the architecture,
noted $hd$, by using the \textit{relative standard deviation}
method: $hd = \frac{\sigma_{PN}}{avg_{PN}}$, where $avg_{PN}$ is the
average computing power of nodes and $\sigma_{PN}$ is the standard
deviation of the computing node powers. This measure gives the
coefficient of variation of the platform. We only consider \mbox{$0
\leq hd \leq 1$}, since values of \mbox{$hd > 1$} are not relevant:
\mbox{$hd = 1$} already denotes a fully heterogeneous platform.
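The heterogeneity degree can be computed directly from the node powers. The following Python sketch mirrors the relative standard deviation formula above; the power values are illustrative assumptions:

```python
# Sketch of the heterogeneity degree hd = sigma_PN / avg_PN (relative
# standard deviation of the node computing powers), capped at 1 since
# values above 1 are not considered. Power values are illustrative.
import statistics

def heterogeneity_degree(node_powers):
    avg = statistics.mean(node_powers)
    sigma = statistics.pstdev(node_powers)  # population standard deviation
    return min(sigma / avg, 1.0)

print(heterogeneity_degree([2.0, 2.0, 2.0, 2.0]))  # homogeneous platform: 0.0
print(heterogeneity_degree([1.0, 2.0, 4.0, 8.0]))  # strongly heterogeneous
```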
264 \subsubsection{Mapping functions}
265 \label{sec:pbmodelmapping}
When a parallel application $App$, represented by a graph $GT$, is
mapped on a distributed clusters architecture, represented by a
graph $GG$, the execution time of the application, $ET(App)$, can be
defined as the execution time of the slowest task. Indeed, an
application ends when all the tasks have detected convergence and
reached the desired approximation of the solution. We define
$ET(App) = \max_{i=1 \dots v} ( ET(V_i) )$, where the execution time
of each task $i$ \mbox{($i=1 \dots v$)}, $ET(V_i)$, is given by
$ET(V_i) = \frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \times wl_{ij}$,
where $e_i$ is the computational cost of $V_i$, $wn_i$ is the
computational power of the node $N_i$ on which $V_i$ is mapped, $J$
is the set of neighbors of $V_i$, $c_{ij}$ is the amount of
communications between $V_i$ and $V_j$, and $wl_{ij}$ is the link
latency between the computing nodes on which $V_i$ and $V_j$ are
mapped. As this formula shows, the execution time of a task depends
on the task weight and on the communications which may occur between
this task and its neighbors. We underline here that in the AIAC
model it is impossible to predict the number of iterations of a
task, so it is difficult to evaluate its cost $e_i$ a priori.
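The execution time model above can be sketched as follows in Python; the task costs, node powers, and latencies are illustrative assumptions, not measured values:

```python
# Sketch of the execution time model: ET(V_i) = e_i / wn_i + sum_j c_ij * wl_ij
# and ET(App) = max_i ET(V_i), since the slowest task dominates.
# All numeric values below are illustrative assumptions.

def task_time(e_i, wn_i, comms):
    """comms: list of (c_ij, wl_ij) pairs, one per neighbor j of V_i."""
    return e_i / wn_i + sum(c * wl for c, wl in comms)

def app_time(tasks):
    """tasks: list of (e_i, wn_i, comms) triples."""
    return max(task_time(e, wn, comms) for e, wn, comms in tasks)

tasks = [
    (100.0, 2.0, [(5.0, 0.1), (3.0, 0.2)]),  # 100/2 + 0.5 + 0.6 = 51.1
    (80.0, 4.0, [(5.0, 0.1)]),               # 80/4 + 0.5 = 20.5
]
print(app_time(tasks))
```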
This task mapping problem is similar to the classical graph
partitioning and task assignment problem, and is thus NP-complete.
\subsection{Fault tolerance}
\label{sec:pbft}
In volatile environments, computing nodes can disconnect at any time
during the computation, and must therefore be efficiently
replaced. The replacing nodes should be the best available nodes at
the time of the fault. As executing environments can regularly
evolve, due to computing node volatility, a mapping algorithm has to
keep a correct overview of the architecture in real time. Thus, the
criteria used to assign tasks to nodes should evolve too.
Another problem appears after multiple crashes: some tasks may have
migrated over multiple computing nodes and clusters, and the initial
mapping may be totally changed. So, after several node failures, the
task mapping may no longer satisfy the mapping criteria (a task may
not be on the most powerful available machine, or may be too far
away from its neighbors\dots{}). A good fault tolerance policy has
to evolve dynamically with the executing environment.
317 \subsection{Specificities of the AIAC mapping problem}
318 \label{sec:specAIACmapping}
An important point to take into consideration is that we do not
allow the execution of multiple tasks on the same computing node, as
this causes a severe performance drop when that node fails. Indeed,
we would have to redeploy all the tasks from this node to other
ones, using their last checkpoints, which can be spread over
multiple computing nodes. This may result in large communication
overheads and in a waste of computation time. Nevertheless, to
benefit from multi-core processors, we use task-level parallelism
with multi-threaded sequential solvers inside each task.
Another important point in the AIAC model is that, as the JaceP2P-V2
environment is fault tolerant and tasks save their checkpoints on
their neighbors, it is more efficient to save on nearby nodes than
on distant ones, in order to reduce the communication overhead of
this operation and to restart a task faster.
\subsection{Related work}
\label{sec:pbrw}
Many TIG mapping algorithms exist in the literature; they can be
broadly classified into two categories. The first one is the
\textit{Edge-cuts optimization} class, which minimizes the use of
the penalizing links between clusters. As tasks depend on their
neighbors (their dependencies), the goal is to choose nodes whose
network distance is small, in order to improve communications
between tasks. Here we can cite Farhat's algorithm\cite{fec}, as
well as Metis\cite{metis} and Chaco\cite{chaco}, which are libraries
implementing such algorithms. The second category is the
\textit{Execution time optimization} class, which aims at minimizing
the whole application execution time. These algorithms look for
nodes which can provide the smallest task execution times using
their computational power. Here we can cite QM\cite{qm_these},
FastMap\cite{fastmap}, and MiniMax\cite{minimax} as examples of this
class.
Both classes of algorithms may fit our goals, since in our model
both the computational power of nodes and the communication costs
may influence application performance.
None of the mentioned algorithms tackles the computing node failure
issue, or they only do so basically, by reapplying the same
policy. As explained in Section \ref{sec:pbft}, a more efficient and
dedicated replacement function is needed. Nevertheless, to the best
of our knowledge, no task mapping algorithm explicitly addressing
both the executing platform heterogeneity and the computing node
failure issues has been proposed.
Here we present our new task mapping strategy, called MAHEVE (for
\textit{Mapping Algorithm for HEterogeneous and Volatile
Environments}). This algorithm aims at combining the advantages of
the two categories mentioned in Section \ref{sec:pbrw}: edge-cuts
minimization and application execution time optimization.
This new algorithm can be divided into two parts. The first part
performs the initial mapping, and the second part is devoted to
searching for replacement nodes when computing node failures occur.
389 \subsection{Initial mapping}
390 \label{sec:maheve_init}
In this section we describe the main mechanisms of the
\textit{static mapping} done by MAHEVE, which is composed of three
phases: sorting clusters, sorting tasks, and the effective mapping,
which assigns tasks (in their sorted order) to nodes of clusters
(also in their sorted order), reserving some nodes in each cluster.
399 %\begin{algorithm}[H]
401 % \KwIn{GT (app. graph), CL (clusters), hd (heterogeneity degree)}
402 % \KwOut{Mp (the mapping done)}
406 % nodes $\leftarrow$ sortingClusters(GG, hd)\;
407 %% nodes $\leftarrow$ searchNodes(lt, nbTasks)\;
411 % tasks $\leftarrow$ sortingTasks(GT, hd)\;
412 %% tasks $\leftarrow$ orderTasks(st, hd)\;
416 % \For{$i = 0$ \KwTo $size(tasks)$}
418 % Mp.add(tasks($i$), nodes($i$))\;
423 % \caption{General algorithm simplified}
428 \subsubsection{Sorting clusters}
429 \label{sec:maheve_sort_clusters}
The first step of the initial mapping is to sort clusters according
to the heterogeneity degree $hd$ of the executing platform. The main
principles are that a cluster obtains a better mark when $hd < 0.5$
and it contains more computing nodes than other clusters ($C_{Fi}$,
the number of available free nodes, is privileged), and when $hd \ge
0.5$ and it contains more powerful computing nodes
($C_{\overline{P}fi}$, the average free computation power, is
privileged). These choices come from several experiments with the
AIAC model, which show that in such environments it is more
efficient to privilege the number of nodes on homogeneous platforms
and the computing power on heterogeneous ones. As the number of
nodes, $C_{Fi}$, and the average free computing power,
$C_{\overline{P}fi}$, are not of the same order of magnitude, we
normalize them with two functions, $normN$ and $normP$. They are
defined as \mbox{$normN\left(C_{Fi}\right) = C_{Fi} \times 100 \div
\sum_{j=1}^{|C|}C_{Fj}$}, the rate of computing nodes, and
\mbox{$normP(C_{\overline{P}fi}) = C_{\overline{P}fi} \times 100
\div \sum_{j=1}^{|C|} C_{\overline{P}fj}$}, the rate of the average
power, both representing the cluster's share in the
architecture. We note $normN\left( C_{Fi} \right) = NC_{Fi}$ and
$normP(C_{\overline{P}fi}) = NC_{\overline{P}fi}$.
454 The formula used to give a mark, $M_i$, to a cluster is
\begin{equation}
M_i = {NC_{\overline{P}fi}}^{hd} + {NC_{Fi}}^{1 - hd}.
\end{equation}
This compromise function allows us to privilege clusters following
our criteria, as explained previously, according to the
heterogeneity degree. If we study its limits at the extremes of
$hd$, $hd = 0$ and $hd = 1$, we obtain $\lim_{hd \rightarrow 0} M_i
= NC_{Fi} + 1$ and $\lim_{hd \rightarrow 1} M_i =
NC_{\overline{P}fi} + 1$, which fit our objectives.

Clusters are thus sorted into a list, from the cluster receiving the
highest mark to the one receiving the lowest.
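The cluster marking and sorting described above can be sketched as follows; the cluster names, free-node counts, and power values are illustrative assumptions:

```python
# Sketch of the cluster sorting of MAHEVE: each cluster gets the mark
# M_i = NC_Pf_i^hd + NC_F_i^(1 - hd), where NC_F_i and NC_Pf_i are the
# normalized (percentage) free-node count and average free power.
# Cluster names, node counts, and powers are illustrative assumptions.

def sort_clusters(clusters, hd):
    """clusters: list of (name, free_nodes, avg_free_power) tuples."""
    total_f = sum(c[1] for c in clusters)
    total_p = sum(c[2] for c in clusters)

    def mark(c):
        nc_f = c[1] * 100.0 / total_f           # normN(C_Fi)
        nc_pf = c[2] * 100.0 / total_p          # normP(C_Pfi)
        return nc_pf ** hd + nc_f ** (1 - hd)   # M_i

    return sorted(clusters, key=mark, reverse=True)

clusters = [("A", 40, 2.0), ("B", 10, 8.0), ("C", 25, 4.0)]
print([c[0] for c in sort_clusters(clusters, hd=0.1)])  # node count dominates
print([c[0] for c in sort_clusters(clusters, hd=0.9)])  # node power dominates
```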
473 \subsubsection{Sorting tasks}
474 \label{sec:maheve_sort_tasks}
Like clusters, tasks are also sorted according to the heterogeneity
degree of the executing platform. The sorting is done in the same
spirit as previously: when $hd < 0.5$, tasks with more dependencies
are privileged, and when $hd \ge 0.5$, tasks with a higher computing
cost are privileged.

The main function used to classify tasks is
\begin{equation}
Q_i = {e_i}^{hd} \times {d_i}^{1 - hd}
\label{eq:tasks}
\end{equation}
where $Q_i$ is the mark of task $i$, computed from its computation
cost $e_i$, its number of dependencies $d_i$, and the heterogeneity
degree $hd$.
Tasks are taken in the order of the first sort, determined with
equation (\ref{eq:tasks}), and each task is placed in a new list
(the final one) together with some of its dependencies. We note
$Nb_i = {d_i}^{1 - hd}$ this number of added dependencies: the lower
the heterogeneity degree, the higher this number. This final
operation allows us to control the locality of tasks according to
$hd$.
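The task sorting phase can be sketched as below; the task costs, dependency lists, and the order in which the $Nb_i$ dependencies are pulled along are illustrative assumptions (the choice of which dependencies to take is not specified here):

```python
# Sketch of the task sorting of MAHEVE: tasks are ranked by
# Q_i = e_i^hd * d_i^(1 - hd), then placed in a final list together with
# Nb_i = floor(d_i^(1 - hd)) of their dependencies, to control locality.
# Task costs, dependency lists, and the dependency-picking order are
# illustrative assumptions.

def sort_tasks(tasks, deps, hd):
    """tasks: {name: (e_i, d_i)}; deps: {name: [neighbor names]}."""
    ranked = sorted(tasks,
                    key=lambda t: tasks[t][0] ** hd * tasks[t][1] ** (1 - hd),
                    reverse=True)
    final = []
    for t in ranked:
        if t not in final:
            final.append(t)
        nb = int(tasks[t][1] ** (1 - hd))   # Nb_i dependencies to pull along
        for n in deps[t][:nb]:
            if n not in final:
                final.append(n)
    return final

tasks = {"t0": (10.0, 4), "t1": (5.0, 2), "t2": (8.0, 3)}
deps = {"t0": ["t1", "t2"], "t1": ["t0"], "t2": ["t0"]}
print(sort_tasks(tasks, deps, hd=0.5))
```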
504 \subsubsection{Mapping method}
505 \label{sec:maheve_mapping_method}
The third step of the initial mapping is to allocate tasks to
nodes. As clusters and tasks have been sorted according to the
executing platform heterogeneity degree, ordered from the highest
mark to the lowest, this function maps tasks on almost all available
computing nodes of clusters, following their respective order in the
lists (for example, the task ranked first in the task list is mapped
on an available node of the cluster ranked first in the cluster
list). The idea here is not to fill each cluster completely, but to
preserve some computing nodes in each cluster. These reserved nodes
will be used to replace failed nodes.
We can mention here that the whole mapping process (the three steps)
has a complexity of $O(|V| \log(|V|))$, where $|V|$ is the number of
tasks.
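The effective mapping step can be sketched as follows; the 10\% reservation ratio (at least one node per cluster) is an illustrative assumption, as the exact amount of preserved nodes is not specified here:

```python
# Sketch of the effective mapping step: sorted tasks are assigned to nodes
# of the sorted clusters, while a few nodes per cluster are kept in reserve
# for the replacement function. The 10% reserve ratio (at least one node)
# is an illustrative assumption, not a value from the paper.

def initial_mapping(sorted_tasks, sorted_clusters, reserve_ratio=0.1):
    """sorted_clusters: list of (name, node_count) tuples, best cluster
    first; returns {task: (cluster, node index)}."""
    slots = []
    for cname, n_nodes in sorted_clusters:
        reserved = max(1, int(n_nodes * reserve_ratio))  # nodes kept as spares
        slots.extend((cname, i) for i in range(n_nodes - reserved))
    # Best-ranked tasks land on the best-ranked clusters first.
    return {task: slot for task, slot in zip(sorted_tasks, slots)}
```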
523 \subsection{Replacing function}
524 \label{sec:maheve_rep}
As shown in the previous section, some computing nodes in each
cluster have been preserved during the initial mapping. When a node
fails, this function replaces it with a free node of the same
cluster. If none is available, the function sorts the clusters
again, to take platform modifications into account, and replaces the
failed node with a node available in the newly sorted cluster
list. This mechanism preserves task locality and maintains a
real-time overview of the executing environment.
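The replacement function can be sketched as below; the data structures and the re-sorting callback are illustrative assumptions:

```python
# Sketch of the MAHEVE replacement function: a failed node is first
# replaced by a reserved node of its own cluster; otherwise clusters are
# re-sorted (to account for platform changes) and a node is taken from
# the best cluster that still has free nodes. The data structures are
# illustrative assumptions.

def replace_node(failed_cluster, free_nodes, resort_clusters):
    """free_nodes: {cluster: [node ids]}; resort_clusters: callable
    returning cluster names sorted by their current mark."""
    if free_nodes.get(failed_cluster):
        return failed_cluster, free_nodes[failed_cluster].pop(0)  # keep locality
    for cluster in resort_clusters():  # take platform modifications into account
        if free_nodes.get(cluster):
            return cluster, free_nodes[cluster].pop(0)
    return None  # no free node available anywhere
```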
\section{Experimentation}
\label{sec:expe}
543 \subsection{A typical AIAC application and the execution platform}
We used the ``Kernel CG'' application of the NAS Parallel Benchmarks
(NPB)\cite{nas} to evaluate the performance of our new mapping
algorithm. This benchmark is designed for large architectures, as it
stresses communications by processing an unstructured matrix-vector
multiplication with a Conjugate Gradient method. As this method
cannot be executed with the asynchronous iteration model, we
replaced it with another method, called the multisplitting method,
which supports the asynchronous iterative model. More details about
this method can be found in \cite{book_raph}. The chosen problem
used a matrix of size $5,000,000$ with a low bandwidth, fixed to
$35,000$. This bandwidth size generates, according to the problem
size, between 8 and 20 neighbors per task. The application was
executed on 64 nodes.
The platform used for our tests, called Grid'5000\cite{g5k}, is a
French nationwide experimental set of clusters which provides
distributed clusters architectures (28 heterogeneous clusters spread
over 9 sites). We used three distributed clusters architectures,
each having a different heterogeneity degree. The first one was
composed of four clusters spread over four sites, with a total of
106 computing nodes representing 424 computing cores, with
\mbox{$hd = 0.08$}; the second one was composed of four clusters
spread over three sites, with a total of 110 computing nodes
representing 440 computing cores, with \mbox{$hd = 0.50$}; and
finally the third one was composed of five clusters spread over four
sites, with 115 computing nodes representing 620 computing cores,
with \mbox{$hd = 0.72$}.
All computing nodes of these clusters have at least 4 computing
cores (in the last architecture, with \mbox{$hd = 0.72$}, two
clusters are composed of machines with 8 computing cores) and a
minimum of 4GB of memory (in order to execute the application with a
large problem size). All computing nodes can communicate with each
other through an efficient network. Nevertheless, this network is
shared with many other users, so high latencies appear during
executions.
584 \subsection{Experiments}
585 \label{sec:experiments}
During executions, we introduced two computing node failures every
20 seconds to simulate a volatile environment. Unfortunately, we did
not have the opportunity to run experiments with more computing
nodes over more sites with larger problem sizes, but we plan to
extend our experiments in the future.
Here we present the results of the evaluation of the MAHEVE
algorithm, compared with FT-AIAC-QM (for \textit{Fault Tolerant
AIAC-QM}) and FT-FEC (for \textit{Fault Tolerant F-EC}), which are
respectively the fault tolerant versions of the AIAC-QM and F-EC
mapping algorithms presented in \cite{pdsec10}. Table
\ref{tab:results} shows the execution times of each mapping
algorithm compared to the default mapping strategy of the JaceP2P-V2
platform, with the corresponding gains on application execution time
given in brackets. It presents both the executions with faults (WF)
and the fault free (FF) executions.
\begin{table}[ht]
\renewcommand{\arraystretch}{1.6}
\centering
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline
\multirow{2}{*}{ ~$hd$~ }&\multicolumn{2}{c|}{ ~Default~ }&\multicolumn{2}{c|}{ ~FT-AIAC-QM~ }&\multicolumn{2}{c|}{ ~FT-FEC~ }&\multicolumn{2}{c|}{ ~MAHEVE~ }\\
\cline{2-9}
& ~FF~ & ~WF~ & ~FF~ & ~WF~ & ~FF~ & ~WF~ & ~FF~ & ~WF~ \\
\hline
~$0.08$~ & ~80~ & ~229~ & ~63 (21\%)~ & ~178 (22\%)~ & ~61 (23\%)~ & ~154 (33\%)~ & ~60 (25\%)~ & ~113 (50\%)~ \\
\hline
~$0.50$~ & ~67~ & ~242~ & ~61 (9\%)~ & ~118 (51\%)~ & ~63 (6\%)~ & ~133 (45\%)~ & ~54 (20\%)~ & ~85 (65\%)~ \\
\hline
~$0.72$~ & ~67~ & ~192~ & ~59 (12\%)~ & ~99 (45\%)~ & ~65 (3\%)~ & ~121 (33\%)~ & ~52 (22\%)~ & ~86 (53\%)~ \\
\hline
\end{tabular}
\caption{Application execution times in seconds and corresponding gains on
various platforms using different mapping algorithms, for fault free (FF)
executions and executions with two node failures every 20 seconds (WF).}
\label{tab:results}
\end{table}
First of all, we can note that all mapping algorithms improve the
application performance by considerably reducing its execution time,
especially for executions with node failures, with an average gain
of about $45\%$ compared to the default policy. If we focus on
executions with node failures (WF), FT-FEC is efficient on
architectures with a low heterogeneity degree (\mbox{$hd = 0.08$}),
providing gains of about $33\%$, and gains are roughly the same on
heterogeneous architectures (\mbox{$hd = 0.72$}). FT-AIAC-QM is
efficient on architectures with a high heterogeneity degree
(\mbox{$hd = 0.72$}), providing gains of about $45\%$, whereas it is
not so efficient on homogeneous architectures (\mbox{$hd = 0.08$}),
providing gains of only about $22\%$. We can note here that on an
architecture with a heterogeneity degree of $0.50$, FT-AIAC-QM is
more efficient than FT-FEC, providing gains of up to $51\%$. We
point out that in fault free executions (FF), both algorithms also
provide gains on their respective favorite architectures, though
these gains are smaller than in executions with node failures.
Now if we focus on the performance of our new solution MAHEVE, we
can see that it always outperforms the other algorithms. As can be
seen in \mbox{Table \ref{tab:results}}, in executions with faults
(WF) it reduces the application execution time by about $50\%$ on
homogeneous architectures (here with a $0.08$ heterogeneity degree),
which is 17 points better than FT-FEC and almost 30 points better
than FT-AIAC-QM. On heterogeneous architectures (here with a $0.72$
heterogeneity degree) it also outperforms the other mapping
algorithms, reducing the application execution time by about $53\%$,
which is almost 10 points better than FT-AIAC-QM and 20 points
better than FT-FEC. On architectures with a middle heterogeneity
degree (here $0.50$), MAHEVE is once again better than the two other
mapping algorithms, reducing the application execution time by about
$65\%$. These good performances come from the fact that the
algorithm is designed to be efficient on both homogeneous and
heterogeneous architectures. Moreover, as it integrates a fault
tolerance \textit{security} in the initial mapping, it is more
efficient when computing nodes fail. We can point out that this
algorithm provides average gains on application execution time of
about $55\%$. In fault free executions (FF), it once again
outperforms the two other algorithms.
\section{Conclusion and future works}
\label{sec:conclu}
In this paper we have presented a new mapping algorithm, called
MAHEVE, to address the AIAC mapping issue on heterogeneous and
volatile environments. It aims at efficiently mapping tasks on
distributed clusters architectures by combining the advantages of
the two known approaches: application execution time optimization
and edge-cuts minimization. Experiments, though using a single
application, show that it is the most efficient mapping algorithm on
all kinds of architectures, as it takes their heterogeneity degree
into account and adapts its sorting methods to it. We have shown
that it consistently outperforms the two other comparative mapping
algorithms, FT-AIAC-QM and FT-FEC. This can be explained by the fact
that it takes into account not only the computing nodes and
clusters, but also the tasks' properties, which refines the mapping
solution.
In future work we plan to enhance the performance of the MAHEVE
algorithm by modifying the marking of clusters, since their locality
has not yet been taken into consideration. This would favor task
locality, which would reduce communication delays and provide a much
better convergence rate. We also have to validate the algorithm's
performance with other AIAC applications.
707 \section*{Acknowledgement}
710 Experiments presented in this paper were carried out using the
711 Grid'5000 experimental testbed\cite{g5k}, being developed under the
712 INRIA ALADDIN development action with support from CNRS, RENATER and
713 several Universities as well as other funding bodies.
718 \bibliographystyle{unsrt}
720 \bibliography{biblio}