heteropar10/heteropar10.tex

   1 %% LNCS
   2 \documentclass[10pt,conference]{llncs}
   3
   4 \usepackage[T1]{fontenc}
   5 \usepackage{ucs}
   6 \usepackage[utf8x]{inputenc}
   7 \usepackage{lmodern}
   8 \usepackage{color}
   9 \usepackage{amsmath}
  10 \usepackage{amsfonts}
  11 \usepackage[english]{babel}
  12 \usepackage[pdftex,final]{graphicx}
  13 \usepackage[ruled,vlined]{algorithm2e}
  14 \usepackage[pdftex]{graphicx}
  15 \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
  16
  17 \title{MAHEVE: A New Reliable AIAC Mapping Strategy for Heterogeneous
  18   Environments \thanks{This work was supported by the European
  19     Interreg IV From-P2P project.}}
  20
  21
  22
  23 \author{Raphaël Couturier, David Laiymani and Sébastien Miquée}
  24   \authorrunning{R. Couturier, D. Laiymani and S. Miquée}
  25   \institute{
  26      University of Franche Comté \qquad  LIFC laboratory\\[1mm]
  27      IUT Belfort-Montb\'eliard, 2 Rue Engel Gros \\ BP 27 90016 Belfort,
  28      France\\\{{\tt
  29       raphael.couturier,david.laiymani,sebastien.miquee}\}
  30     {\tt @univ-fcomte.fr} }
  31
  32
  33 \date{}
  34 \begin{document}
  35
  36
  37 \maketitle
  38
  39 \vspace{-0.2cm}
  40
  41 \begin{abstract}
  42   With the emergence of massive distributed computing resources, such
  43   as grids and distributed clusters architectures, parallel
  44   programming is used to benefit from them and execute problems of
  45   larger sizes. The asynchronous iteration model, called AIAC, has
  46   been proven to be an efficient solution for heterogeneous and
  47   distributed architectures. An efficient mapping of applications'
  48   tasks is essential to reduce their execution time. We present in
  49   this paper a new mapping algorithm, called MAHEVE (Mapping Algorithm
  50   for HEterogeneous and Volatile Environments) which is efficient on
  51   such architectures and integrates a fault tolerance mechanism to
  52   resist to computing nodes failures. Our experiments show gains on a
  53   typical AIAC application's execution time about $55\%$, executed on
  54   distributed clusters architectures containing more than 400
  55   computing cores with the JaceP2P-V2 environment.
  56
  57 %\textup{\small \textbf{Keywords:} Mapping algorithm, Fault tolerance,
  58 %  Distributed clusters, Parallel iterative asynchronous algorithms,
  59 %  Heterogeneous architectures.}
  60 \end{abstract}
  61
  62 \section{Introduction}
  63 \label{sec:intro}
  64
  65 Nowadays, scientific applications require a great computation power to
  66 solve their large problems. Though personal computers are more
  67 powerful, in many cases they are not sufficient. One well adapted
  68 solution is to use computers clusters in order to combine the power of
  69 many machines. Distributed clusters form such an architecture,
  70 providing a great computing power, by aggregating the computation
  71 power of multiple clusters spread over multiple sites.  Such an
  72 architecture brings users heterogeneity in computing machines as well
  73 as network latency. In order to use such an architecture, parallel
  74 programming is required. In the parallel computing area, in order to
  75 execute very large applications on heterogeneous architectures,
  76 iterative methods are well adapted\cite{book_raph}.
  77
  78
  79 These methods repeat the same instructions block until a convergence
  80 state and a desired approximation of the solution are reached. They
  81 constitute the only known approach to solving some kinds of problems
  82 and are relatively easy to parallelize. The Jacobi or Conjugate
  83 Gradient\cite{cg} methods are examples of such methods.
  84
  85
  86 To parallelize this kind of algorithm, one of the most used methods is
  87 the message passing paradigm which allows an efficient mechanism to
  88 exchange data between tasks. As such a method, we focus here on the
  89 asynchronous parallel iterative model, called AIAC\cite{book_raph}
  90 (for \textit{Asynchronous Iteration and Asynchronous Communication}).
  91
  92 \begin{figure}[h!]
  93  \vspace{-0.4cm}
  94   \centering
  95   \includegraphics[width=7.4cm]{images/IACA}
  96   \caption{Two processors computing in the Asynchronous Iteration - Asynchronous Communication (AIAC) model}
  97   \label{fig:AIAC}
  98   \vspace{-0.90cm}
  99 \end{figure}
 100
 101
 102 In this model, as can be seen on Figure \ref{fig:AIAC}, after each
 103 iteration, a task sends its results to its neighbors and immediately
 104 starts the next iteration with the last received data.  The receiving
 105 and sending mechanisms are asynchronous and tasks do not have to wait
 106 for the reception of dependency messages from their
 107 neighbors. Consequently, there is no idle time between two
 108 iterations. Furthermore, this model is tolerant to messages loss and
 109 even if a task is stopped the remaining tasks continue the
 110 computation, with the last available data.  Several
 111 experiments\cite{bcvc06:ij} show the relevance of the AIAC algorithms
 112 in the context of distributed clusters with high latency between
 113 clusters. These works underline the good adaptability of AIAC
 114 algorithms to network and processor heterogeneity.
 115
 116
 117 In a previous study\cite{pdsec10} we proposed two static mapping
 118 algorithms of tasks to processors dedicated to the AIAC model on
 119 heterogeneous distributed clusters.  Both these two algorithms,
 120 AIAC-QM (for \textit{AIAC Quick-quality Map}) and F-EC (for
 121 \textit{Farhat Edges-Cuts}) showed an important performances
 122 improvement by reducing up to $50\%$ the application's execution
 123 time. These experiments were performed by using the JaceP2P-V2
 124 environment. This Java based platform is an executing and developing
 125 environment dedicated to the AIAC model. By implementing a distributed
 126 backup/restore mechanism it is also fully fault
 127 tolerant\cite{jaceP2P-v2}. In our previous experiments we did not
 128 introduce computing nodes failures during the computation.  And as
 129 architecture's heterogeneity continually evolves according to
 130 computing nodes volatility, we have to more precisely takes care about
 131 the heterogeneity of the target platform. Thus in this paper our main
 132 contribution is to propose a new mapping algorithm called MAHEVE
 133 (\textit{Mapping Algorithm for HEterogeneous and Volatile
 134   Environments}).  This algorithm explicitly tackles the heterogeneity
 135 issue and introduces a level of dynamism in order to adapt itself to
 136 the fault tolerance mechanisms. Our experiments show gains about 55\%
 137 on application's execution time, with about 10 points better than
 138 AIAC-QM and about 25 points better than F-EC.
 139
 140
 141 The rest of this paper is organized as
 142 follows. Section~\ref{sec:jacep2p} reminds the JaceP2P-V2 middleware
 143 by describing its architecture and briefly presenting its fault
 144 tolerance mechanism. Section~\ref{sec:pb} formalizes our mapping and
 145 fault tolerance problems and quotes existing issues to address
 146 them. Section~\ref{sec:maheve} describes the new mapping strategy we
 147 propose, MAHEVE. In section~\ref{sec:expe} we present the experiments
 148 we have conducted on the Grid'5000 testbed with more than 400
 149 computing cores. Finally, we give some concluding remarks and plan our
 150 future work in section~\ref{sec:conclu}.
 151
 152
 153 \section{JaceP2P-V2}
 154 \label{sec:jacep2p}
 155
 156 JaceP2P-V2\cite{jaceP2P-v2} is a distributed platform implemented in
 157 Java, dedicated to developing and executing parallel iterative
 158 asynchronous applications. It is fully fault tolerant allowing it to
 159 execute parallel applications over volatile environments. To our
 160 knowledge this is the only platform dedicated to designing and
 161 executing AIAC algorithms in such volatile environments.
 162
 163
 164 The JaceP2P-V2 platform part, which is based on the daemons and
 165 supervisors paradigm, is composed of three main entities:
 166
 167 \begin{itemize}
 168 \item The ``super-nodes'', which is in charge of supervising free
 169   computing nodes connected to the platform.
 170
 171 \item The ``spawner'', which is launched by a user wanting to execute
 172   a parallel application. It is in charge of a group of computing
 173   nodes and monitor them. If one of them fails, it requires a
 174   replacing one to a super-node.
 175
 176 \item The ``daemon'', when launched, connects to a super-node and
 177   waits for a task to execute. Each daemon can communicate directly
 178   with its computing neighbors.
 179
 180 \end{itemize}
 181
 182
 183 To be able to execute asynchronous iterative applications, JaceP2P-V2
 184 has an asynchronous messaging mechanism. In order to resist daemons'
 185 failures, it implements a distributed backup mechanism called the
 186 \textit{uncoordinated distributed checkpointing}\cite{uncoord_cp}.
 187 This decentralized procedure allows the platform to be very scalable,
 188 with no weak points and does not require a secure nor a stable station
 189 for backups. When a daemon dies, it is replaced by another one. Here
 190 we suppose that we have enough available free nodes. Moreover, to
 191 resit to supervisors failures and to be scalable, it reserves some
 192 extra nodes. For more details on the JaceP2P-V2 platform, interested
 193 readers can refer to \cite{jaceP2P-v2}.
 194
 195
 196 \section{Mapping and fault tolerance problems}
 197 \label{sec:pb}
 198
 199 \subsection{Model formalization}
 200 \label{sec:pbmodel}
 201
 202 \subsubsection{Application modeling}
 203 \label{sec:pbmodelapp}
 204
 205
 206 With the AIAC model, all tasks compute in parallel at the same time,
 207 without precedence nor synchronization. During an iteration, each task
 208 computes its job and sends its results to its neighbors, and
 209 immediately starts the next iteration. The TIG\cite{tig1}
 210 (\textit{Task Interaction Graph}) is the most appropriate model to our
 211 problem, as it only models relationships between tasks. In this model,
 212 all the tasks are considered simultaneously executable and
 213 communications can take place at any time during the computation, with
 214 no precedence nor synchronization.
 215
 216
 217 In the TIG model, a parallel application is represented by a graph
 218 $GT(V,E)$, where \mbox{$V = \{V_1,V_2,\dots V_v\}$} is the set of
 219 $|V|$ vertices and \mbox{$E \subset V \times V$} is the set of
 220 undirectional edges. The vertices represent tasks and the edges
 221 represent the mutual communication among tasks. A function \mbox{$EC :
 222   V \rightarrow \mathbb{R}^+$} gives the computation cost of tasks and
 223 \mbox{$CC : E \rightarrow \mathbb{R}^+$} gives the communication cost
 224 for message passing on edges. We define \mbox{$|V| = v$, $EC(V_i) =
 225   e_i$} and \mbox{$CC(V_i,V_j) = c_{ij}$}.  Another function
 226 \mbox{$D : V \rightarrow \mathbb{R}^+$} gives the amount of
 227 dependencies of a task, and we define \mbox{$D(V_i) = d_i$}.
 228
 229
 230 \subsubsection{Architecture modeling}
 231 \label{sec:pbmodelarchi}
 232
 233
 234 A distributed clusters architecture can be modeled by a
 235 three-level-graph. The levels are \textit{architecture} (a) (here the
 236 Grid'5000 grid), \textit{cluster} (c), and \textit{computing node} (n)
 237 levels. Let $GG(N,L)$ be a graph representing a distributed clusters
 238 architecture, where \mbox{$N = \{N_1,N_2,\dots N_n\}$} is the set of
 239 $|N|$ vertices and $L$ is the set of $|L|$ undirectional edges. The
 240 vertices represent the computing nodes and the edges represent the
 241 links between them. An edge \mbox{$L_i \in L$} is an unordered pair
 242 \mbox{$(N_x,N_y) \in \mathbb{N}$}, representing a communication link
 243 between nodes $N_x$ and $N_y$. A function \mbox{$WN : N \rightarrow
 244   \mathbb{R}^+$} gives the computational power of nodes and another
 245 function \mbox{$WL : L \rightarrow \mathbb{R}^+$} gives the
 246 communication latency of links. We define \mbox{$WN(N_i) = wn_i$} and
 247 \mbox{$WL(L_i,L_j) = wl_{ij}$}. Let be $|C|$ the number of clusters
 248 contained in the architecture. A function \mbox{$CN : C \rightarrow
 249   \mathbb{N}^+$} gives the amount of computing nodes contained in a
 250 cluster, and another function \mbox{$CF : C \rightarrow \mathbb{R}^+$}
 251 gives the amount of available computing nodes (not involve in an
 252 application computation) of a cluster. We define \mbox{$CN(C_i) =
 253   C_{Ni}$} and \mbox{$CF(C_i) = C_{Fi}$}. We also define \mbox{$C_{Pi}
 254   = \sum_{j=1}^{C_{Ni}}{wn_j}$} as the whole computation power of
 255 cluster $C_i$ and \mbox{$C_{\overline{P}i} = \frac{C_{Pi}}{C_{Ni}}$}
 256 as the average computation power of cluster $C_i$, and
 257 $C_{\overline{P}fi}$ the average power of its available resources.
 258
 259 We evaluate the \textit{heterogeneity degree} of the architecture,
 260 noted $hd$, by using the \textit{relative standard deviation} method,
 261 with $hd = \frac{\sigma_{PN}}{avg_{PN}}$ where $avg_{PN}$ is the
 262 average computing power of nodes and $\sigma_{PN}$ represents the
 263 standard deviation of computing nodes power.  This measure provides us
 264 the coefficient of variation of the platform in percentage -- we only
 265 consider \mbox{$0 \leq hd \leq 1$} as considering values of \mbox{$hd
 266   > 1$} is not relevant, as \mbox{$hd = 1$} denotes a fully
 267 heterogeneous platform.
 268
 269
 270 \subsubsection{Mapping functions}
 271 \label{sec:pbmodelmapping}
 272
 273
 274 When a parallel application $App$, represented by a graph $GT$, is
 275 mapped on a distributed clusters architecture, represented by a graph
 276 $GG$, the execution time of the application, $ET(App)$, can be defined
 277 as the execution time of the slowest task. Indeed, an application ends
 278 when all the tasks have detected convergence and reached the desired
 279 approximation of the solution.  We define $ET(App) = \max_{i=1 \dots
 280   v} ( ET(V_i) )$ where the execution time of each task $i$
 281 \mbox{($i=1 \dots v$)}, $ET(V_i)$ is given by $ET(V_i) =
 282 \frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \times wl_{ij}$ where $e_i$
 283 is the computational cost of $V_i$, $wn_i$ is the computational power
 284 of the node $N_i$ on which $V_i$ is mapped, $J$ represents the
 285 neighbors set of $V_i$, $c_{ij}$ is the amount of communications
 286 between $V_i$ and $V_j$, and $wl_{ij}$ is the link latency between the
 287 computing nodes on which are mapped $V_i$ and $V_j$. As described in
 288 this formula, the execution time of a task depends on the task weight
 289 and communications it may occur with its neighbors. We underline here
 290 that in the AIAC model, it is impossible to predict the number of
 291 iterations of a task. So it is difficult to evaluate a priori its cost
 292 $e_i$.
 293
 294 This tasks mapping problem is similar to the classical graph
 295 partitioning and task assignment problem, and is thus NP-complete.
 296
 297
 298 \subsection{Fault tolerance}
 299 \label{sec:pbft}
 300
 301 In volatile environments, computing nodes can disconnect at any time
 302 during the computation, and these ones have to efficiently be replaced.
 303
 304
 305 The replacing nodes should be the best ones at the fault time,
 306 according to the chosen mapping algorithm, by searching them in
 307 available nodes. As executing environments can regularly evolve, due
 308 to computing nodes' volatility, a mapping algorithm has to keep a
 309 right overview of the architecture, in real time. Thus, criteria to
 310 assign tasks to nodes should evolve too.
 311
 312
 313 Another problem appears after multiple crashes: some tasks may have
 314 migrated over multiple computing nodes and clusters, and the initial
 315 mapping may be totally changed. So, after having suffered some nodes
 316 failures the tasks mapping could not always satisfy the mapping
 317 criteria (not on the more powerful available machine, too far away
 318 from its neighbors\dots{}). A good fault tolerance policy has to
 319 evolve dynamically with the executing
 320 environment.
 321
 322
 323 \subsection{Specificities of the AIAC mapping problem}
 324 \label{sec:specAIACmapping}
 325
 326 An important point to take into consideration is that we do not allow
 327 the execution of multiple tasks on the same computing node, as this
 328 provides a fall of performance when this one fails. Indeed we should
 329 redeploy all of the tasks from this node to another one, using last
 330 saves, which can be spread on multiple computing nodes. This may
 331 result in large communication overheads and a waste of computation
 332 time. Nevertheless, to benefit of multi-cores processors, we use a
 333 task level parallelism by running multi-threaded sequential solver for
 334 example.
 335
 336
 337 Another important point in the AIAC model is that as the JaceP2P-V2
 338 environment is fault tolerant and tasks save checkpoints on their
 339 neighbors, it is more efficient to save on near nodes than on far ones
 340 in order to reduce the communication overhead during this operation,
 341 and to faster restart a task.
 342
 343
 344 \subsection{Related work}
 345 \label{sec:pbrw}
 346
 347 In the literature of the TIG mapping many algorithms exist, which can
 348 be broadly classified into two categories.  The first one is the
 349 \textit{Edge-cuts optimization} class, which minimizes the use of the
 350 penalizing links between clusters. As tasks are depending on
 351 neighbors, which are called dependencies, the goal is to choose nodes
 352 where distance, in term of network, is small to improve communications
 353 between tasks. Here we can cite Metis\cite{metis} and
 354 Chaco\cite{chaco} which are libraries containing such kind of
 355 algorithms. The second category is the \textit{Execution time
 356   optimization} class, which aims to minimize the whole application's
 357 execution time. These algorithms look for nodes which can provide the
 358 smallest execution time of tasks using their computational power. Here
 359 we can cite QM\cite{qm_these}, FastMap\cite{fastmap}, and
 360 MiniMax\cite{minimax} as such kind of algorithms.
 361
 362
 363 Both classes of algorithms may fit with our goals as in our model we
 364 have both the computational power of nodes and communication costs
 365 which may influence the applications performances.
 366
 367
 368 All mentioned algorithms do not tackle the computing nodes failures
 369 issue, or basically by applying the same policy. As explain in section
 370 \ref{sec:pbft}, a more efficient and dedicated replacement function is
 371 needed. Nevertheless, to the best of our knowledge, no tasks mapping
 372 algorithm, addressing explicitly both the executing platform
 373 heterogeneity and the computing nodes failures issues, exists.
 374
 375
 376 \vspace{-0.25cm}
 377 \section{MAHEVE}
 378 \label{sec:maheve}
 379
 380 Here we present our new tasks mapping strategy, called MAHEVE (for
 381 \textit{Mapping Algorithm for HEterogeneous and Volatile
 382   Environments}). This algorithm aims to take the best part of each
 383 category mentioned in section \ref{sec:pbrw}, the edge-cuts
 384 minimization and the application's execution time optimization
 385 algorithms.
 386
 387
 388 This new algorithm can be divided into two parts. The first part aims
 389 to perform the initial mapping, and the second part is devoted to
 390 search replacing nodes when computing nodes' failures occur.
 391
 392
 393 \vspace{-0.20cm}
 394 \subsection{Initial mapping}
 395 \label{sec:maheve_init}
 396
 397 In this section we will study the main mechanisms of the
 398 \textit{static mapping} done by MAHEVE, which is composed of three
 399 phases: sort of clusters, sort of tasks, and the effective mapping,
 400 which maps tasks (in their sort order) on nodes of clusters (also in
 401 their sort order) with a reservation of some nodes in each cluster.
 402
 403
 404 \vspace{-0.15cm}
 405 \subsubsection{Sorting clusters}
 406 \label{sec:maheve_sort_clusters}
 407
 408 The first step of the initial mapping is to sort clusters accordingly
 409 to the executing platform's heterogeneity degree $hd$. The main
 410 principles are that a cluster obtain a better mark when $hd < 0.5$ and
 411 it contains more computing nodes than other clusters ($C_{Fi}$, the
 412 amount of available free nodes, is privileged), and when $hd \ge 0.5$
 413 and it contains more powerful computing nodes ($C_{\overline{P}fi}$,
 414 the average computation power, is privileged). These choices come from
 415 several experiments with the AIAC model, which show that in such
 416 environments it is more efficient to privilege the computation power
 417 or the amount of nodes. As the amount of nodes, $C_{Fi}$, and the
 418 average computing power, $C_{\overline{P}fi}$, are not in the order of
 419 magnitude, we normalize them with two functions, $normN$ and
 420 $normP$. They are defined as \mbox{$normN\left(C_{Fi}\right) = C_{Fi}
 421   \times 100 \div \sum_{j=1}^{|C|}C_{Fj}$}, which is the rate (in
 422 percent) of computing nodes, and \mbox{$normP(C_{\overline{P}fi}) =
 423   C_{\overline{P}fi} \times 100 \div \sum_{j=1}^{|C|}
 424   C_{\overline{P}fj}$}, which is the rate (in percent) of the average
 425 power, both representing the cluster in the architecture. We note
 426 $NC_{Fi} = normN\left( C_{Fi} \right)$ and $NC_{\overline{P}fi} =
 427 normP(C_{\overline{P}fi})$.
 428
 429
 430 The formula used to give a mark, $M_i$, to a cluster is
 431 \begin{equation}
 432   \label{eq:cluster}
 433   M_i = {NC_{\overline{P}fi}}^{hd} + {NC_{Fi}}^{1 - hd}.
 434 \end{equation}
 435
 436
 437 This compromise function allows us to privilege clusters following our
 438 criteria, as explained previously, according to the heterogeneity
 439 degree. If we study its limits for the $hd$'s extremities, $hd = 0$
 440 and $hd = 1$, we obtain $\lim_{hd \rightarrow 0} M_i = NC_{Fi}$ and
 441 $\lim_{hd \rightarrow 1} M_i = NC_{\overline{P}fi}$, which fit with
 442 our objectives.
 443
 444 Clusters are so sorted and placed in a list containing them, starting
 445 from the cluster which receives the better mark to the one which
 446 receives the lower mark.
 447
 448
 449 \subsubsection{Sorting tasks}
 450 \label{sec:maheve_sort_tasks}
 451
 452 Like clusters, tasks are also sorted accordingly to the heterogeneity
 453 degree of the executing platform. This sort is done in the same way as
 454 previously, as when $hd < 0.5$ tasks with higher dependencies will be
 455 privileged, and when $hd \ge 0.5$ tasks with higher computing cost are
 456 privileged, in order to be executed on highest powered computing
 457 nodes.
 458
 459
 460 The main function used to classified tasks is
 461 \begin{equation}
 462   \label{eq:tasks}
 463   Q_i = e_i^{hd} \times d_i^{1 - hd}
 464 \end{equation}
 465
 466 where $Q_i$ is the evaluation of the task $i$ according to the
 467 heterogeneity degree $hd$ and $d_i$, the amount of dependencies of
 468 task $i$.
 469
 470
 471 Tasks are taken in the order of the first sort, determined with
 472 equation (\ref{eq:tasks}), and each task is placed in a new list (the
 473 final one) and some of its dependencies are added. We note $Nb_i =
 474 d_i^{1 - hd}$ this amount of dependencies as the lower the
 475 heterogeneity degree is the higher this number will be.  This final
 476 operation allows to control the necessary locality of tasks according
 477 to the heterogeneity degree of the platform.
 478
 479
 480 \subsubsection{Mapping method}
 481 \label{sec:maheve_mapping_method}
 482
 483 The third step of the initial mapping is to allocate tasks to
 484 computing nodes. As clusters and tasks have been sorted accordingly to
 485 the executing platform's heterogeneity degree, ordered from the
 486 highest mark to the lower, this function maps tasks on each available
 487 computing nodes of clusters, in their respective order in lists (for
 488 example task classified first in the tasks list is mapped on an
 489 available node of the cluster classified first in clusters list).
 490
 491 The idea here is not to fulfill each cluster, but preserve some
 492 computing nodes in each cluster. These conserved nodes will be used to
 493 replace failed nodes. The fact of considering in the initial mapping
 494 the fault tolerance is a new approach in mapping algorithms.
 495
 496
 497 \subsection{Replacing function}
 498 \label{sec:maheve_rep}
 499
 500 This function is essential in a volatile environment, as an efficient
 501 replacement should reduce the overhead on the application execution
 502 time due to the loss of computing steps and data.
 503
 504 As we have shown in the previous section, during the initial mapping
 505 some computing nodes in each cluster have been preserved for fault
 506 tolerance. When a node fails this function replace it by a free node
 507 of the same cluster. If none is available this function sorts again
 508 clusters, to take into consideration platform's modifications, and
 509 replace the failed node by one available in the new sorted clusters
 510 list. This mechanism allows to keep tasks' locality and a real time
 511 overview of the executing platform.
 512
 513
 514 \section{Experimentation}
 515 \label{sec:expe}
 516
 517 \subsection{A typical AIAC application and the execution platform}
 518 \label{sec:cond}
 519
 520 We used the ``Kernel CG'' application of the NAS Parallel Benchmarks
 521 (NPB) \cite{nas} to evaluate the performances of our new mapping
 522 algorithm. This benchmark is designed to be used on large
 523 architectures, because it stresses communications over latency
 524 networks, by processing unstructured matrix vector multiplication with
 525 a Conjugate Gradient method. As this method cannot be executed with
 526 the asynchronous iteration model we have replaced it by another method
 527 called the multisplitting method. This latter supports the
 528 asynchronous iterative model. For more details about this method,
 529 interested readers are invited to see \cite{book_raph}. The chosen
 530 problem used a matrix of size $5,000,000$ with a low bandwidth, fixed
 531 to $35,000$. This bandwidth size generates, according to the problem's
 532 size, between 8 and 20 neighbors per tasks. This application was
 533 executed on 64 computing nodes.
 534
 535
 536 The platform used for our tests, called Grid’5000\cite{g5k}, is a
 537 French nationwide experimental set of clusters which provides us
 538 distributed clusters architectures (28 heterogeneous clusters spread
 539 over 9 sites).  We used three distributed clusters architectures on
 540 the Grid'5000 testbed, each having a different heterogeneity
 541 degree. The first one was composed of four clusters spread over four
 542 sites, with a total of 106 computing nodes representing 424 computing
 543 cores with \mbox{$hd = 0.08$}; the second one was composed of four
 544 clusters spread over three sites, with a total of 110 computing nodes
 545 representing 440 computing cores with \mbox{$hd = 0.50$}; and finally
 546 the third one was composed of five clusters spread over four sites
 547 with 115 computing nodes representing 620 computing cores with
 548 \mbox{$hd = 0.72$}.
 549
 550
 551 All computing nodes of these clusters have at least 4 computing cores
 552 (in the last used architecture, with \mbox{$hd = 0.72$}, two clusters
 553 are composed of 8 computing cores machines) with a minimum of 4GB of
 554 memory (in order to execute the application with a big problem's
 555 size). All computing nodes can communicate with each other through an
 556 efficient network. Nevertheless, this latter is shared with many other
 557 users so high latencies appear during executions.
 558
 559
 560 \subsection{Experiments}
 561 \label{sec:experiments}
 562
 563 During executions, we introduced two failures in computing nodes
 564 involved in the computation every 20 seconds to simulate a volatile
 565 environment. Unfortunately, we had not the opportunity to realize
 566 experiments with more computing nodes over more sites with problems of
 567 larger sizes, but we plan to extend our experiments in the future.
 568
 569
 570 Here we present the results of the evaluation of the MAHEVE algorithm,
 571 compared with FT-AIAC-QM (for \textit{Fault Tolerant AIAC-QM}) and
 572 FT-FEC (for \textit{Fault Tolerant F-EC}) which are respectively the
 573 fault tolerant versions of the AIAC-QM and F-EC mapping algorithms
 574 presented in \cite{pdsec10}. Table \ref{tab:results} shows the
 575 execution times of each mapping algorithm compared to the default
 576 mapping strategy of the JaceP2P-V2 platform, with the corresponding
 577 gains on application's execution time, in brackets.
 578
 579 \renewcommand{\arraystretch}{1.7}
 580 \begin{table}[h!]
 581   \centering
 582   \begin{tabular}{|c|c|c|c|c|}
 583     \hline
 584     ~$ hd$~ & ~Default~ & ~FT-AIAC-QM~ & ~FT-FEC~ & ~MAHEVE~ \\
 585     \hline
 586     ~$0.08$~&229s&178s (22\%)&154s (33\%)&113s (50\%)\\
 587     ~$0.50$~&242s&118s (51\%)&133s (45\%)&85s (65\%)\\
 588     ~$0.72$~&192s&99s (45\%)&121s (33\%)&86s (53\%)\\
 589     \hline
 590   \end{tabular}
 591   \vspace{0.15cm}
 592   \caption{Application's execution time in seconds and corresponding gains on various
 593     platforms using different mapping algorithms with 2 computing
 594     nodes' failures each 20 seconds}
 595   \label{tab:results}
 596   \vspace{-0.7cm}
 597 \end{table}
 598
 599
 600 First of all, we can note that all mapping algorithms provide an
 601 enhancement of the application's performances by considerably reducing
 602 its execution time, with an average gain about $45\%$ in general in
 603 comparison to the default policy. As shown in \cite{pdsec10}, FT-FEC
 604 is efficient on architectures with a low heterogeneity degree
 605 (\mbox{$hd = 0.08$} by providing gains about $33\%$, and gains are
 606 seemly the same on heterogeneous architectures (\mbox{$hd =
 607   0.72$}). FT-AIAC-QM is efficient on architectures with a high
 608 heterogeneity degree (\mbox{$hd = 0.72$}) by providing gains about
 609 $45\%$, whereas it is not so efficient on homogeneous architectures
 610 (\mbox{$hd = 0.08$}) by providing gains about $22\%$. We can note here
 611 that on an architecture with a heterogeneity degree of $0.50$
 612 FT-AIAC-QM is more efficient than FT-FEC by providing gains up to
 613 $50\%$.
 614
 615 Now if we look at the performances of our new solution, MAHEVE, we can
 616 see that it is all the time better than other algorithms. As can be
 617 seen in \mbox{Table \ref{tab:results}}, it reduces the application's
 618 execution time by about $50\%$ on homogeneous architectures (here of
 619 $0.08$ heterogeneity degree) what is more than 25 point better than
 620 FT-FEC and near 30 points better than FT-AIAC-QM. On heterogeneous
 621 architectures (here of $0.72$ heterogeneity degree) it also
 622 outperforms other mapping algorithms by reducing the application's
 623 execution time by about $53\%$ what is near about 10 points better
 624 than FT-AIAC-QM and 20 points better than FT-FEC. On middle
 625 heterogeneity degree architectures (here of $0.50$ heterogeneity
 626 degree), MAHEVE is another one time better than its two comparative
 627 mapping algorithms by reducing the application's execution time by
 628 about $53\%$. These good performances come from the fact that it is
 629 designed to be efficient on both architectures, homogeneous and
 630 heterogeneous. Moreover, as it integrates a fault tolerance
 631 \textit{security} in the initial mapping, it is more efficient when
 632 computing nodes fail. Here we can point out that this algorithm allows
 633 in general gains on application's execution time about $55\%$.
 634
 635
 636 \section{Conclusion and future works}
 637 \label{sec:conclu}
 638
 639 In this paper we have presented a novel mapping algorithm, called
 640 MAHEVE, to address the AIAC mapping issue on heterogeneous and
 641 volatile environments. It aims to do an efficient mapping of tasks on
 642 distributed clusters architectures by taking the best part of the two
 643 known approaches, application's execution time optimization and
 644 edge-cuts minimization. Experiments show that it is the most efficient
 645 mapping algorithm on all kinds of architectures, as it takes care
 646 about their heterogeneity degree and adapt its sort methods to it. We
 647 have shown that it is all the time better than the two other
 648 comparative mapping algorithms, FT-AIAC-QM and FT-FEC. This can be
 649 explained by the fact that it not only takes care about computing
 650 nodes and clusters, but also about the tasks' properties, what refines
 651 the mapping solution.
 652
 653 In our future works we plan to enhance the MAHEVE algorithm
 654 performances by modifying the notation of clusters as their locality
 655 is now not taken into consideration, in order to favor tasks locality,
 656 which will reduce communications delays providing a better convergence
 657 rate.
 658
 659
 660 %%%
 661
 662
 663 \bibliographystyle{unsrt}
 664
 665 \bibliography{biblio}
 666
 667
 668 \end{document}
 669