\documentclass[10pt,conference]{llncs}

\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[pdftex,final]{graphicx}
\usepackage[ruled,vlined]{algorithm2e}
\usepackage{multirow}

\DeclareGraphicsExtensions{.pdf,.jpeg,.png}
\title{MAHEVE: An Efficient Reliable Mapping of Asynchronous Iterative
  Applications on Volatile and Heterogeneous Environments\thanks{This
    work was supported by the European Interreg IV From-P2P project.}}

\author{Raphaël Couturier, David Laiymani and Sébastien Miquée}
\authorrunning{R. Couturier, D. Laiymani and S. Miquée}
\institute{\vspace{-0.2cm}
  University of Franche-Comté \qquad LIFC laboratory\\%[1mm]
  IUT Belfort-Montb\'eliard, 2 Rue Engel Gros, BP 27, 90016 Belfort, France\\
  {\tt \{raphael.couturier,david.laiymani,sebastien.miquee\}}}
\begin{document}

\maketitle

\begin{abstract}
  With the emergence of massive distributed computing resources, such
  as grids and distributed cluster architectures, parallel
  programming is used to benefit from them and to solve problems of
  larger sizes. The asynchronous iteration model, called AIAC, has
  been proven to be an efficient solution for heterogeneous and
  distributed architectures. An efficient mapping of application
  tasks is essential to reduce their execution time. In this paper we
  present a new mapping algorithm, called MAHEVE (Mapping Algorithm
  for HEterogeneous and Volatile Environments), which is efficient on
  such architectures and integrates a fault tolerance mechanism to
  resist computing node failures. Our experiments, performed with the
  JaceP2P-V2 environment on distributed cluster architectures
  containing more than 400 computing cores, show gains of about
  $55\%$ on the execution time of a typical AIAC application.
\end{abstract}
\section{Introduction}
Nowadays, scientific applications require great computing power to
solve large problems. Though personal computers are becoming more
powerful, in many cases they are not sufficient. One well adapted
solution is to use computer clusters in order to combine the power of
many machines. Distributed clusters form such an architecture,
providing great computing power by aggregating the computation power
of multiple clusters spread over multiple sites. Such an architecture
brings heterogeneity in computing machines as well as network
latencies. In order to use such an architecture, parallel programming
is required. In the parallel computing area, iterative methods are
well adapted to executing very large applications on heterogeneous
architectures\cite{book_raph,bcvc06:ij}.
These methods repeat the same instruction block until a convergence
state and a desired approximation of the solution are reached. They
constitute the only known approach to solving some kinds of problems
and are relatively easy to parallelize. The Jacobi or the Conjugate
Gradient\cite{cg} methods are examples of such methods. To parallelize
them, one of the most used approaches is the message passing paradigm,
which provides efficient mechanisms to exchange data between
tasks. Following this paradigm, we focus here on the asynchronous
parallel iterative model, called AIAC\cite{book_raph} (for
\textit{Asynchronous Iterations -- Asynchronous Communications}).
\begin{figure}[h!]
  \centering
  \includegraphics[width=7.4cm]{images/AIAC}
  \caption{Two processors computing in the Asynchronous Iterations --
    Asynchronous Communications (AIAC) model}
  \label{fig:AIAC}
\end{figure}
In this model, as can be seen in Figure~\ref{fig:AIAC}, after each
iteration, a task sends its results to its neighbors and immediately
starts the next iteration with the last received data. The receiving
and sending mechanisms are asynchronous and tasks do not have to wait
for the reception of dependency messages from their
neighbors. Consequently, there is no idle time between two
iterations. Furthermore, this model is tolerant to message loss, and
even if a task is stopped the remaining tasks continue the
computation with the last available data. Several
experiments\cite{bcvc06:ij} show the relevance of AIAC algorithms in
the context of distributed clusters with high latency between
clusters. These works underline the good adaptability of AIAC
algorithms to network and processor heterogeneity.
In a previous study\cite{pdsec10} we proposed the implementation of
two static algorithms for mapping tasks to processors, dedicated to
the AIAC model on heterogeneous distributed clusters. These two
algorithms, AIAC-QM (for \textit{AIAC Quick-quality Map}) and F-EC
(for \textit{Farhat Edges-Cuts}), showed important performance
improvements by significantly reducing the application execution
time. These experiments were performed using the fully fault tolerant
JaceP2P-V2 environment, described in the next section.
In our previous experiments we did not introduce computing node
failures during the computation. As the architecture heterogeneity
continually evolves with computing node volatility, we have to
consider more precisely the heterogeneity of the target
platform. Thus, in this paper our main contribution is to propose a
new mapping algorithm called MAHEVE (\textit{Mapping Algorithm for
  HEterogeneous and Volatile Environments}). This algorithm
explicitly tackles the heterogeneity issue and introduces a level of
dynamism in order to adapt itself to the fault tolerance
mechanisms. Our experiments show gains up to $65\%$ on application
execution time with faults during executions, which is about 10
points better than AIAC-QM and about 25 points better than F-EC, and
MAHEVE also outperforms them in experiments with no fault during
executions.
The rest of this paper is organized as
follows. Section~\ref{sec:jacep2p} presents the JaceP2P-V2 middleware
by describing its architecture and briefly presenting its fault
tolerance mechanisms. Section~\ref{sec:pb} formalizes our mapping and
fault tolerance problems and surveys existing approaches to address
them. Section~\ref{sec:maheve} describes the new mapping strategy we
propose, MAHEVE. In Section~\ref{sec:expe} we present the experiments
we conducted on the Grid'5000 testbed with more than 400 computing
cores. Finally, we give some concluding remarks and outline our
future work in Section~\ref{sec:conclu}.
\section{JaceP2P-V2}
\label{sec:jacep2p}

JaceP2P-V2\cite{jaceP2P-v2} is a distributed platform implemented in
Java, dedicated to developing and executing parallel iterative
asynchronous applications. It is fully fault tolerant, allowing it to
execute parallel applications over volatile environments. To our
knowledge this is the only platform dedicated to designing and
executing AIAC algorithms in such volatile environments.
The JaceP2P-V2 platform, which is based on the daemon/supervisor
paradigm, is composed of three main entities:

\begin{itemize}
\item The ``super-nodes'', which are in charge of supervising free
  computing nodes connected to the platform;

\item The ``spawner'', which is launched by a user wanting to execute
  a parallel application. It is in charge of a group of computing
  nodes and monitors them. If one of them fails, it requests a
  replacement node from a super-node;

\item The ``daemons'', which first connect to a super-node and wait
  for a task to execute. Each daemon can communicate directly with
  its computing neighbors.
\end{itemize}
To be able to execute AIAC applications, JaceP2P-V2 features an
asynchronous messaging mechanism, and to resist daemon failures, it
implements a checkpoint/restart mechanism using a distributed backup
scheme called \textit{uncoordinated distributed
  checkpointing}\cite{uncoord_cp}. This decentralized procedure makes
the platform very scalable, with no weak point, and does not require
a secure nor stable station for backups. When a daemon dies, it is
replaced by another one, as we suppose that there are enough
available free nodes. Moreover, to resist supervisor failures and for
scalability, some extra nodes are reserved. For more details on the
JaceP2P-V2 platform, interested readers can refer to
\cite{jaceP2P-v2}.
\section{Mapping and fault tolerance problems}
\label{sec:pb}

\subsection{Model formalization}

\subsubsection{Application modeling}
\label{sec:pbmodelapp}
The TIG\cite{tig1} (\textit{Task Interaction Graph}) model is the most
appropriate for our problem, as it only models the relationships
between tasks. In this model, all the tasks are considered
simultaneously executable and communications can take place at any
time during the computation, with no precedence nor synchronization.
As a reminder, during an iteration in the AIAC model, each task
computes its job and sends its results to its neighbors, and
immediately starts the next iteration.
In the TIG model, a parallel application is represented by a graph
$GT(V,E)$, where \mbox{$V = \{V_1,V_2,\dots V_v\}$} is the set of
$|V|$ vertices and \mbox{$E \subset V \times V$} is the set of
undirected edges. The vertices represent tasks and the edges
represent the mutual communications among tasks. A function \mbox{$EC :
  V \rightarrow \mathbb{R}^+$} gives the computation cost of tasks and
\mbox{$CC : E \rightarrow \mathbb{R}^+$} gives the communication cost
of message passing on edges. We define \mbox{$|V| = v$, $EC(V_i) =
  e_i$} and \mbox{$CC(V_i,V_j) = c_{ij}$}. Another function
\mbox{$D : V \rightarrow \mathbb{N}^+$} gives the amount of
dependencies of a task, and we define \mbox{$D(V_i) = d_i$}.
\subsubsection{Architecture modeling}
\label{sec:pbmodelarchi}
A distributed clusters architecture can be modeled by a three-level
graph. The levels are the \textit{architecture} (a) (here the
Grid'5000 grid), the \textit{cluster} (c), and the \textit{computing
  node} (n) levels. Let $GG(N,L)$ be a graph representing a
distributed clusters architecture, where \mbox{$N =
  \{N_1,N_2,\dots N_n\}$} is the set of $|N|$ vertices and $L$ is the
set of $|L|$ undirected edges. The vertices represent the computing
nodes and the edges represent the links between them. An edge
\mbox{$L_k \in L$} is an unordered pair \mbox{$(N_x,N_y)$} of nodes
\mbox{$N_x, N_y \in N$}, representing a communication link between
nodes $N_x$ and $N_y$. A function \mbox{$WN : N \rightarrow
  \mathbb{R}^+$} gives the computational power of nodes and another
function \mbox{$WL : L \rightarrow \mathbb{R}^+$} gives the
communication latency of links. We define \mbox{$WN(N_i) = wn_i$} and,
for a link \mbox{$L_k = (N_i,N_j)$}, \mbox{$WL(L_k) = wl_{ij}$}. Let
$|C|$ be the number of clusters contained in the architecture. A
function \mbox{$CN : C \rightarrow \mathbb{N}^+$} gives the amount of
computing nodes contained in a cluster, and another function \mbox{$CF
  : C \rightarrow \mathbb{N}^+$} gives the amount of available
computing nodes (not involved in a computation) of a cluster. We
define \mbox{$CN(C_i) = C_{Ni}$} and \mbox{$CF(C_i) = C_{Fi}$}. We
also define $C_{\overline{P}fi}$ as the average computation power of
the available resources of cluster $C_i$.
We evaluate the \textit{heterogeneity degree} of the architecture,
noted $hd$, by using the \textit{relative standard deviation} method,
with $hd = \frac{\sigma_{PN}}{avg_{PN}}$, where $avg_{PN}$ is the
average computing power of the nodes and $\sigma_{PN}$ is the
standard deviation of the computing node powers. This measure
provides the coefficient of variation of the platform. We only
consider \mbox{$0 \leq hd \leq 1$}, as values of \mbox{$hd > 1$} are
not relevant: \mbox{$hd = 1$} already denotes a fully heterogeneous
platform.
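As an illustration, the computation of $hd$ can be sketched in a few
lines of Python (a minimal sketch; the node power values below are
hypothetical, not measurements from the paper):

```python
import statistics

def heterogeneity_degree(node_powers):
    """Relative standard deviation of the computing node powers,
    clamped to [0, 1] as in the model."""
    avg = statistics.mean(node_powers)
    sigma = statistics.pstdev(node_powers)  # population standard deviation
    return min(sigma / avg, 1.0)

# A homogeneous platform yields hd = 0; a heterogeneous one a larger hd.
homogeneous = [10.0, 10.0, 10.0, 10.0]
heterogeneous = [1.0, 4.0, 9.0, 16.0]
print(heterogeneity_degree(homogeneous))    # 0.0
print(heterogeneity_degree(heterogeneous))  # larger, still <= 1
```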
\subsubsection{Mapping functions}
\label{sec:pbmodelmapping}
When a parallel application $App$, represented by a graph $GT$, is
mapped on a distributed clusters architecture, represented by a graph
$GG$, the execution time of the application, $ET(App)$, can be defined
as the execution time of the slowest task. Indeed, an application ends
when all the tasks have detected convergence and reached the desired
approximation of the solution. We define $ET(App) = \max_{i=1 \dots
  v} ( ET(V_i) )$, where the execution time of each task $i$
\mbox{($i=1 \dots v$)}, $ET(V_i)$, is given by $ET(V_i) =
\frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \times wl_{ij}$, where $e_i$
is the computational cost of $V_i$, $wn_i$ is the computational power
of the node $N_i$ on which $V_i$ is mapped, $J$ is the set of
neighbors of $V_i$, $c_{ij}$ is the amount of communications between
$V_i$ and $V_j$, and $wl_{ij}$ is the link latency between the
computing nodes on which $V_i$ and $V_j$ are mapped. As described by
this formula, the execution time of a task depends on the task weight
and on the communications which may occur between this task and its
neighbors. We underline here that in the AIAC model it is impossible
to predict the number of iterations of a task, so it is difficult to
evaluate a priori its cost $e_i$.
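The execution time estimate above can be sketched as follows (a
minimal sketch with toy values; the helper names are ours, not part of
JaceP2P-V2):

```python
def task_time(e_i, wn_i, comms):
    """ET(V_i) = e_i / wn_i + sum over neighbors j of c_ij * wl_ij.
    comms is a list of (c_ij, wl_ij) pairs for the neighbors of V_i."""
    return e_i / wn_i + sum(c * wl for c, wl in comms)

def app_time(tasks):
    """ET(App) = max over all tasks of ET(V_i)."""
    return max(task_time(e, wn, comms) for e, wn, comms in tasks)

# Two toy tasks: (computation cost e_i, node power wn_i, [(c_ij, wl_ij), ...])
tasks = [
    (100.0, 10.0, [(2.0, 0.5)]),   # ET = 100/10 + 2*0.5 = 11.0
    (100.0, 25.0, [(2.0, 0.5)]),   # ET = 100/25 + 2*0.5 = 5.0
]
print(app_time(tasks))  # 11.0 (the slowest task bounds the application)
```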
This task mapping problem is similar to the classical graph
partitioning and task assignment problems, and is thus NP-complete.
\subsection{Fault tolerance}
\label{sec:pbft}
In volatile environments, computing nodes can disconnect at any time
during the computation, and thus have to be efficiently replaced. The
replacing nodes should be the best ones available at the fault
time. As executing environments can evolve regularly, due to
computing node volatility, a mapping algorithm has to keep a correct
overview of the architecture in real time. Thus, the criteria used to
assign tasks to nodes should evolve too.
Another problem appears after multiple crashes: some tasks may have
migrated over multiple computing nodes and clusters, and the initial
mapping may be totally changed. So, after having suffered some node
failures, the task mapping may no longer satisfy the mapping criteria
(a task may not be on the most powerful available machine, or may be
too far away from its neighbors\dots{}). A good fault tolerance
policy has to evolve dynamically with the executing environment.
\subsection{Specificities of the AIAC mapping problem}
\label{sec:specAIACmapping}
An important point to take into consideration is that we do not allow
the execution of multiple tasks on the same computing node, as this
leads to a severe performance drop when the node fails. Indeed, we
would have to redeploy all the tasks from this node to other ones,
using their last checkpoints, which can be spread over multiple
computing nodes. This may result in large communication overheads and
in a waste of computation time. Nevertheless, to benefit from
multi-core processors, we use task-level parallelism with
multi-threaded sequential solvers inside each task.
Another important point in the AIAC model is that, as the JaceP2P-V2
environment is fault tolerant and tasks save their checkpoints on
their neighbors, it is more efficient to save on nearby nodes than on
distant ones, in order to reduce the communication overhead of this
operation and to restart a task faster.
\subsection{Related work}
\label{sec:pbrw}
Many TIG mapping algorithms exist in the literature; they can be
broadly classified into two categories. The first one is the
\textit{edge-cuts optimization} class, which minimizes the use of the
penalizing links between clusters. As tasks depend on their neighbors
(their dependencies), the goal is to choose nodes whose network
distance is small, to improve communications between tasks. Here we
can cite Farhat's algorithm\cite{fec}, as well as Metis\cite{metis}
and Chaco\cite{chaco}, which are libraries providing such
algorithms. The second category is the \textit{execution time
  optimization} class, which aims at minimizing the whole application
execution time. These algorithms look for nodes which can provide the
smallest task execution time according to their computational
power. Here we can cite QM\cite{qm_these}, FastMap\cite{fastmap}, and
MiniMax\cite{minimax} as examples of such algorithms.
Both classes of algorithms may fit our goals, as in our model both
the computational power of nodes and the communication costs may
influence application performances.
None of the mentioned algorithms tackles the computing node failure
issue, or they only do so basically, by applying the same mapping
policy again. As explained in Section \ref{sec:pbft}, a more
efficient and dedicated replacement function is needed. To the best
of our knowledge, no task mapping algorithm explicitly addresses both
the executing platform heterogeneity and the computing node failure
issues.
\section{MAHEVE}
\label{sec:maheve}

Here we present our new task mapping strategy, called MAHEVE (for
\textit{Mapping Algorithm for HEterogeneous and Volatile
  Environments}). This algorithm aims at taking the best part of each
category mentioned in Section \ref{sec:pbrw}: the edge-cuts
minimization and the application execution time optimization.
This new algorithm can be divided into two parts. The first part aims
at performing the initial mapping, and the second part is devoted to
searching for replacement nodes when computing node failures occur.
\subsection{Initial mapping}
\label{sec:maheve_init}
In this section we study the main mechanisms of the \textit{static
  mapping} done by MAHEVE, which is composed of three phases: the
sorting of clusters, the sorting of tasks, and the effective mapping,
which maps tasks (in their sorted order) onto nodes of clusters (also
in their sorted order), while reserving some nodes in each cluster.
\subsubsection{Sorting clusters}
\label{sec:maheve_sort_clusters}
The first step of the initial mapping is to sort clusters according
to the executing platform heterogeneity degree $hd$. The main
principles are that a cluster obtains a better mark when $hd < 0.5$
and it contains more computing nodes than other clusters ($C_{Fi}$,
the number of available free nodes, is privileged), and when $hd \ge
0.5$ and it contains more powerful computing nodes
($C_{\overline{P}fi}$, the average free computation power, is
privileged). These choices come from several experiments with the
AIAC model, which show that in such environments it is more efficient
to privilege, respectively, the number of nodes or the computation
power. As the number of free nodes, $C_{Fi}$, and the average free
computing power, $C_{\overline{P}fi}$, are not of the same order of
magnitude, we normalize them with two functions, $normN$ and
$normP$. They are defined as \mbox{$normN\left(C_{Fi}\right) = C_{Fi}
  \times 100 \div \sum_{j=1}^{|C|}C_{Fj}$}, which is the rate of free
computing nodes, and \mbox{$normP(C_{\overline{P}fi}) =
  C_{\overline{P}fi} \times 100 \div \sum_{j=1}^{|C|}
  C_{\overline{P}fj}$}, which is the rate of the average free power,
both representing the cluster within the architecture. We note
$normN\left( C_{Fi} \right) = NC_{Fi}$ and
$normP(C_{\overline{P}fi}) = NC_{\overline{P}fi}$.
The formula used to give a mark, $M_i$, to a cluster is
\begin{equation}
  M_i = {NC_{\overline{P}fi}}^{hd} + {NC_{Fi}}^{1 - hd}.
  \label{eq:clusters}
\end{equation}
This compromise function allows us to privilege clusters following
our criteria, as explained previously, according to the heterogeneity
degree. If we study its limits at the extremities of $hd$, $hd = 0$
and $hd = 1$, we obtain $\lim_{hd \rightarrow 0} M_i = NC_{Fi} + 1$
and $\lim_{hd \rightarrow 1} M_i = NC_{\overline{P}fi} + 1$, which
fits our criteria.
Clusters are then sorted into a list, starting from the cluster which
receives the highest mark down to the one which receives the lowest
mark.
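The cluster marking and sorting step can be sketched as follows (a
sketch under the normalization just described; the cluster tuples are
hypothetical):

```python
def sort_clusters(clusters, hd):
    """Sort clusters by mark M_i = NC_Pfi^hd + NC_Fi^(1-hd), best first.
    clusters: list of (free nodes C_Fi, average free power C_Pfi)."""
    total_free = sum(c[0] for c in clusters)
    total_power = sum(c[1] for c in clusters)

    def mark(c):
        ncf = 100.0 * c[0] / total_free    # normN: rate of free nodes
        ncp = 100.0 * c[1] / total_power   # normP: rate of average free power
        return ncp ** hd + ncf ** (1.0 - hd)

    return sorted(clusters, key=mark, reverse=True)

clusters = [(40, 1.0), (10, 4.0)]  # (free nodes, average free power)
print(sort_clusters(clusters, hd=0.1))  # low hd: larger cluster ranked first
print(sort_clusters(clusters, hd=0.9))  # high hd: more powerful cluster first
```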
\subsubsection{Sorting tasks}
\label{sec:maheve_sort_tasks}
Like clusters, tasks are also sorted according to the heterogeneity
degree of the executing platform. This sorting is done in the same
way as previously: when $hd < 0.5$, tasks with more dependencies are
privileged, and when $hd \ge 0.5$, tasks with a higher computing cost
are privileged.
The main function used to classify tasks is
\begin{equation}
  Q_i = {e_i}^{hd} \times {d_i}^{1 - hd}
  \label{eq:tasks}
\end{equation}
where $Q_i$ is the evaluation of task $i$ according to the
heterogeneity degree $hd$, with $e_i$ its computing cost and $d_i$
its amount of dependencies.
Tasks are taken in the order of this first sort, determined with
equation (\ref{eq:tasks}), and each task is placed in a new list (the
final one) together with some of its dependencies. We note $Nb_i =
{d_i}^{1 - hd}$ this amount of added dependencies: the lower the
heterogeneity degree, the higher this number. This final operation
allows us to control the necessary locality of tasks according to
$hd$.
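The task classification can be sketched in the same way (the task
tuples below are toy values, chosen only to illustrate how $hd$ shifts
the ranking):

```python
def sort_tasks(tasks, hd):
    """Sort tasks by Q_i = e_i^hd * d_i^(1-hd), best first.
    tasks: list of (computing cost e_i, amount of dependencies d_i)."""
    return sorted(tasks,
                  key=lambda t: (t[0] ** hd) * (t[1] ** (1.0 - hd)),
                  reverse=True)

tasks = [(100.0, 2), (10.0, 8)]  # (e_i, d_i)
print(sort_tasks(tasks, hd=0.2))  # low hd: dependencies dominate the ranking
print(sort_tasks(tasks, hd=0.8))  # high hd: computing cost dominates
```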
\subsubsection{Mapping method}
\label{sec:maheve_mapping_method}
The third step of the initial mapping is to allocate tasks to
nodes. As clusters and tasks have been sorted according to the
executing platform heterogeneity degree, ordered from the highest
mark to the lowest, this function maps tasks onto almost all the
available computing nodes of the clusters, following their respective
order in the lists (for example, the task ranked first in the task
list is mapped onto an available node of the cluster ranked first in
the cluster list). The idea here is not to fill each cluster
completely, but to preserve some computing nodes in each cluster;
these preserved nodes will be used to replace failed nodes.
We can mention here that the whole mapping process (the three steps)
has a complexity of $O(|V| \log(|V|))$, where $|V|$ is the number of
tasks.
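Putting the two sorted lists together, the effective mapping can be
sketched as follows (a simplified sketch; the paper does not fix the
reservation rate, so \texttt{reserve\_rate} is a hypothetical
parameter):

```python
def initial_mapping(sorted_tasks, sorted_clusters, reserve_rate=0.1):
    """Map tasks, in sorted order, onto the nodes of the sorted clusters,
    keeping at least one node free in each cluster for replacements."""
    mapping = {}
    slots = []  # usable (cluster, node index) pairs, best clusters first
    for cluster, free_nodes in sorted_clusters:
        usable = free_nodes - max(1, int(reserve_rate * free_nodes))
        slots.extend((cluster, i) for i in range(usable))
    for task, slot in zip(sorted_tasks, slots):
        mapping[task] = slot
    return mapping

mp = initial_mapping(["t0", "t1", "t2"], [("c0", 2), ("c1", 3)])
print(mp)  # t0 on c0, t1 and t2 on c1; one node kept free per cluster
```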
\subsection{Replacing function}
\label{sec:maheve_rep}
As shown in the previous section, during the initial mapping some
computing nodes in each cluster have been preserved. When a node
fails, this function replaces it with a free node of the same
cluster. If none is available, the function sorts the clusters again,
to take platform modifications into consideration, and replaces the
failed node with an available one following the new sorted cluster
list. This mechanism allows keeping both task locality and a real
time overview of the executing platform.
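The replacement policy can be sketched as follows (hypothetical data
structures; \texttt{resort\_clusters} stands for the cluster sorting
step of the initial mapping, re-run on the current platform state):

```python
def replace_node(failed_cluster, free_nodes_by_cluster, resort_clusters):
    """Replace a failed node: prefer a free node of the same cluster,
    otherwise re-sort the clusters and take the first free node found."""
    if free_nodes_by_cluster.get(failed_cluster):
        return failed_cluster, free_nodes_by_cluster[failed_cluster].pop()
    for cluster in resort_clusters():  # fresh overview of the platform
        if free_nodes_by_cluster.get(cluster):
            return cluster, free_nodes_by_cluster[cluster].pop()
    return None  # no free node left anywhere

free = {"c0": [], "c1": ["n5"]}
print(replace_node("c0", free, lambda: ["c1", "c0"]))  # falls back to c1
```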
\section{Experimentation}
\label{sec:expe}
\subsection{A typical AIAC application and the execution platform}
We used the ``Kernel CG'' application of the NAS Parallel Benchmarks
(NPB)\cite{nas} to evaluate the performance of our new mapping
algorithm. This benchmark is designed to be used on large
architectures, as it stresses communications by processing
unstructured matrix-vector multiplications with a Conjugate Gradient
method. As this method cannot be executed with the asynchronous
iteration model, we replaced it with another method, called the
multisplitting method, which supports the asynchronous iterative
model. More details about this method can be found in
\cite{book_raph}. The chosen problem used a matrix of size
$5,000,000$ with a low bandwidth, fixed to $35,000$. This bandwidth
size generates, according to the problem size, between 8 and 20
neighbors per task. This application was executed on 64 nodes.
The platform used for our tests, called Grid'5000\cite{g5k}, is a
French nationwide experimental set of clusters which provides
distributed clusters architectures (28 heterogeneous clusters spread
over 9 sites). We used three distributed clusters architectures, each
having a different heterogeneity degree. The first one was composed
of four clusters spread over four sites, with a total of 106
computing nodes representing 424 computing cores, with \mbox{$hd =
  0.08$}; the second one was composed of four clusters spread over
three sites, with a total of 110 computing nodes representing 440
computing cores, with \mbox{$hd = 0.50$}; and finally the third one
was composed of five clusters spread over four sites, with 115
computing nodes representing 620 computing cores, with \mbox{$hd =
  0.72$}.
All computing nodes of these clusters have at least 4 computing cores
(in the last architecture, with \mbox{$hd = 0.72$}, two clusters are
composed of machines with 8 computing cores) and a minimum of 4GB of
memory (in order to execute the application with a big problem
size). All computing nodes can communicate with each other through an
efficient network. Nevertheless, this network is shared with many
other users, so high latencies appear during executions.
\subsection{Experiments}
\label{sec:experiments}
During executions, we introduced failures of two computing nodes
involved in the computation every 20 seconds, to simulate a volatile
environment. Unfortunately, we did not have the opportunity to
perform experiments with more computing nodes over more sites and
with larger problems, but we plan to extend our experiments in the
future.
Here we present the results of the evaluation of the MAHEVE
algorithm, compared with FT-AIAC-QM (for \textit{Fault Tolerant
  AIAC-QM}) and FT-FEC (for \textit{Fault Tolerant F-EC}), which are
respectively the fault tolerant versions of the AIAC-QM and F-EC
mapping algorithms presented in \cite{pdsec10}. Table
\ref{tab:results} shows the execution times of each mapping algorithm
compared to the default mapping strategy of the JaceP2P-V2 platform,
with the corresponding gains on application execution time given in
brackets. It presents both the executions with faults (WF) and the
fault free (FF) executions.
\begin{table}[h!]
  \centering
  \renewcommand{\arraystretch}{1.6}
  \begin{tabular}{|c|c|c|c|c|c|c|c|c|}
    \hline
    \multirow{2}{*}{~$hd$~} & \multicolumn{2}{c|}{~Default~} &
    \multicolumn{2}{c|}{~FT-AIAC-QM~} & \multicolumn{2}{c|}{~FT-FEC~} &
    \multicolumn{2}{c|}{~MAHEVE~}\\
    \cline{2-9}
    & ~FF~ & ~WF~ & ~FF~ & ~WF~ & ~FF~ & ~WF~ & ~FF~ & ~WF~ \\
    \hline
    ~$0.08$~ & ~80~ & ~229~ & ~63 (21\%)~ & ~178 (22\%)~ & ~61 (23\%)~ &
    ~154 (33\%)~ & ~60 (25\%)~ & ~113 (50\%)~ \\
    \hline
    ~$0.50$~ & ~67~ & ~242~ & ~61 (9\%)~ & ~118 (51\%)~ & ~63 (6\%)~ &
    ~133 (45\%)~ & ~54 (20\%)~ & ~85 (65\%)~ \\
    \hline
    ~$0.72$~ & ~67~ & ~192~ & ~59 (12\%)~ & ~99 (45\%)~ & ~65 (3\%)~ &
    ~121 (33\%)~ & ~52 (22\%)~ & ~86 (53\%)~ \\
    \hline
  \end{tabular}
  \caption{Application execution times in seconds and corresponding
    gains on various platforms using different mapping algorithms,
    with fault free (FF) executions and executions with two node
    failures every 20 seconds (WF).}
  \label{tab:results}
\end{table}
First of all, we can note that all mapping algorithms enhance
application performance by considerably reducing the execution time,
especially for executions with node failures, with an average gain of
about $45\%$ in general compared to the default policy. If we focus
on executions with node failures (WF), FT-FEC is efficient on
architectures with a low heterogeneity degree (\mbox{$hd = 0.08$}),
providing gains of about $33\%$, and its gains are roughly the same
on heterogeneous architectures (\mbox{$hd = 0.72$}). FT-AIAC-QM is
efficient on architectures with a high heterogeneity degree
(\mbox{$hd = 0.72$}), providing gains of about $45\%$, whereas it is
not so efficient on homogeneous architectures (\mbox{$hd = 0.08$}),
where it provides gains of about $22\%$. We can note here that on an
architecture with a heterogeneity degree of $0.50$, FT-AIAC-QM is
more efficient than FT-FEC, providing gains up to $51\%$. We point
out that in fault free executions (FF), both algorithms also provide
gains on their respective favorite architectures, though the gains
are smaller than in executions with faults.
Now if we focus on the performance of our new solution, MAHEVE, we
can see that it always outperforms the other algorithms. As can be
seen in \mbox{Table \ref{tab:results}}, in executions with faults
(WF), it reduces the application execution time by about $50\%$ on
homogeneous architectures (here with a heterogeneity degree of
$0.08$), which is about 17 points better than FT-FEC and nearly 30
points better than FT-AIAC-QM. On heterogeneous architectures (here
with a heterogeneity degree of $0.72$) it also outperforms the other
mapping algorithms by reducing the application execution time by
about $53\%$, which is almost 10 points better than FT-AIAC-QM and 20
points better than FT-FEC. On middle heterogeneity degree
architectures (here of $0.50$), MAHEVE is once again better than the
two other mapping algorithms, reducing the application execution time
by about $65\%$. These good performances come from the fact that it
is designed to be efficient on both homogeneous and heterogeneous
architectures. Moreover, as it integrates a fault tolerance
\textit{security} in the initial mapping, it is more efficient when
computing nodes fail. We can point out that this algorithm provides
gains on application execution time of about $55\%$ in general. In
fault free executions (FF), it once again outperforms the two other
algorithms.
\section{Conclusion and future works}
\label{sec:conclu}
In this paper we have presented a new mapping algorithm, called
MAHEVE, to address the AIAC mapping issue on heterogeneous and
volatile environments. It aims at efficiently mapping tasks on
distributed clusters architectures by taking the best part of the two
known approaches: application execution time optimization and
edge-cuts minimization. Experiments, though using a single
application, show that it is the most efficient mapping algorithm on
all kinds of architectures, as it takes their heterogeneity degree
into account and adapts its sorting methods to it. We have shown that
it always performs better than the two other mapping algorithms used
for comparison, FT-AIAC-QM and FT-FEC. This can be explained by the
fact that it takes into account not only the computing nodes and
clusters, but also the tasks' properties, which refines the mapping
solution.
In our future work we plan to enhance the MAHEVE algorithm
performances by modifying the marking of clusters, since their
locality has not yet been taken into consideration. This would favor
task locality, which would reduce communication delays and provide a
much better convergence rate. We also have to validate the
algorithm's performance with other AIAC applications.
\section*{Acknowledgement}

Experiments presented in this paper were carried out using the
Grid'5000 experimental testbed\cite{g5k}, being developed under the
INRIA ALADDIN development action with support from CNRS, RENATER and
several Universities as well as other funding bodies.
\bibliographystyle{unsrt}
\bibliography{biblio}

\end{document}