1 \documentclass[10pt,conference]{llncs}
3 \usepackage[T1]{fontenc}
5 \usepackage[utf8x]{inputenc}
10 \usepackage[english]{babel}
11 \usepackage[pdftex,final]{graphicx}
12 \usepackage[ruled,vlined]{algorithm2e}
15 \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
\title{MAHEVE: An Efficient Reliable Mapping of Asynchronous Iterative
Applications on Volatile and Heterogeneous Environments\thanks{This
work was supported by the European Interreg IV From-P2P project.}}
24 \author{Raphaël Couturier, David Laiymani and Sébastien Miquée}
25 \authorrunning{R. Couturier, D. Laiymani and S. Miquée}
26 \institute{\vspace{-0.2cm}
27 University of Franche-Comté \qquad LIFC laboratory\\
IUT Belfort-Montb\'eliard, 2 Rue Engel Gros, BP 27, 90016 Belfort, France\\
\email{\{raphael.couturier,david.laiymani,sebastien.miquee\}@univ-fcomte.fr}}
With the emergence of massive distributed computing resources, such
as grids and distributed clusters architectures, parallel
programming is used to benefit from them and to solve larger
problems. The asynchronous iteration model, called AIAC, has been
proven to be an efficient solution for heterogeneous and distributed
architectures. An efficient mapping of application tasks is
essential to reduce their execution time. In this paper we present a
new mapping algorithm, called MAHEVE (Mapping Algorithm for
HEterogeneous and Volatile Environments), which is efficient on such
architectures and integrates a fault tolerance mechanism to
withstand computing node failures. Our experiments, conducted with
the JaceP2P-V2 environment on distributed clusters architectures
containing more than 400 computing cores, show gains of about $55\%$
on the execution time of a typical AIAC application.
59 \section{Introduction}
Nowadays, scientific applications require great computing power to
solve large problems. Though personal computers are becoming more
powerful, in many cases they are not sufficient. One well adapted
solution is to use computer clusters in order to combine the power
of many machines. Distributed clusters form such an architecture,
providing great computing power by aggregating the computation power
of multiple clusters spread over multiple sites. Such an
architecture brings users heterogeneity in computing machines as
well as network latency. In order to use such an architecture,
parallel programming is required. In the parallel computing area, in
order to execute very large applications on heterogeneous
architectures, iterative methods are well
adapted\cite{book_raph,bcvc06:ij}.
These methods repeat the same block of instructions until a
convergence state and a desired approximation of the solution are
reached. They constitute the only known approach to solving some
kinds of problems and are relatively easy to parallelize. The Jacobi
or the Conjugate Gradient\cite{cg} methods are examples of such
methods. To parallelize them, one of the most widely used approaches
is the message passing paradigm, which provides efficient mechanisms
to exchange data between tasks. Within this paradigm, we focus here
on the asynchronous parallel iterative model, called
AIAC\cite{book_raph} (for \textit{Asynchronous Iterations --
Asynchronous Communications}).
\begin{figure}[ht]
\centering
\includegraphics[width=7.4cm]{images/AIAC}
\caption{Two processors computing in the Asynchronous Iterations -- Asynchronous Communications (AIAC) model}
\label{fig:AIAC}
\end{figure}
In this model, as can be seen in Figure \ref{fig:AIAC}, after each
iteration a task sends its results to its neighbors and immediately
starts the next iteration with the last received data. The receiving
and sending mechanisms are asynchronous, and tasks do not have to
wait for the reception of dependency messages from their
neighbors. Consequently, there is no idle time between two
iterations. Furthermore, this model is tolerant to message loss, and
even if a task is stopped the remaining tasks continue the
computation with the last available data. Several
experiments\cite{bcvc06:ij} show the relevance of AIAC algorithms in
the context of distributed clusters with high latency between
clusters. These works underline the good adaptability of AIAC
algorithms to network and processor heterogeneity.
In a previous study\cite{pdsec10} we proposed two static algorithms
for mapping tasks to processors, dedicated to the AIAC model on
heterogeneous distributed clusters. Both algorithms, AIAC-QM (for
\textit{AIAC Quick-quality Map}) and F-EC (for \textit{Farhat
Edges-Cuts}), showed significant performance improvements by
noticeably reducing application execution times. These experiments
were performed using the fully fault tolerant JaceP2P-V2
environment, described in the next section. In our previous
experiments we did not introduce computing node failures during the
computation. Since the architecture heterogeneity continually
evolves with node volatility, the heterogeneity of the target
platform must be considered more carefully. Thus, the main
contribution of this paper is a new mapping algorithm called MAHEVE
(\textit{Mapping Algorithm for HEterogeneous and Volatile
Environments}). This algorithm explicitly tackles the heterogeneity
issue and introduces a level of dynamism in order to adapt itself to
the fault tolerance mechanisms. Our experiments show gains of up to
$65\%$ on application execution time in executions with faults,
which is about 10 points better than AIAC-QM and about 25 points
better than F-EC; MAHEVE also outperforms them in fault free
executions.
The rest of this paper is organized as
follows. Section~\ref{sec:jacep2p} presents the JaceP2P-V2
middleware by describing its architecture and briefly presenting its
fault tolerance mechanisms. Section~\ref{sec:pb} formalizes our
mapping and fault tolerance problems and surveys existing approaches
to address them. Section~\ref{sec:maheve} describes the new mapping
strategy we propose, MAHEVE. In Section~\ref{sec:expe} we present
the experiments we conducted on the Grid'5000 testbed with more than
400 computing cores. Finally, we give some concluding remarks and
outline our future work in Section~\ref{sec:conclu}.
JaceP2P-V2\cite{jaceP2P-v2} is a distributed platform implemented in
Java, dedicated to developing and executing parallel iterative
asynchronous applications. It is fully fault tolerant, which allows
it to execute parallel applications over volatile environments. To
our knowledge this is the only platform dedicated to designing and
executing AIAC algorithms in such volatile environments.
The JaceP2P-V2 platform, which is based on the daemon/supervisor
paradigm, is composed of three main entities:
\begin{itemize}
\item The ``super-nodes'', which are in charge of supervising free
  computing nodes connected to the platform;
\item The ``spawner'', which is launched by a user wanting to execute
  a parallel application. It is in charge of a group of computing
  nodes and monitors them. If one of them fails, it requests a
  replacement from a super-node;
\item The ``daemon'', which first connects to a super-node and waits
  for a task to execute. Each daemon can communicate directly with
  its computing neighbors.
\end{itemize}
To be able to execute AIAC applications, JaceP2P-V2 provides an
asynchronous messaging mechanism and, to resist daemon failures, it
implements a checkpoint/restart mechanism based on a distributed
backup scheme called \textit{uncoordinated distributed
checkpointing}\cite{uncoord_cp}. This decentralized procedure makes
the platform very scalable, with no single point of failure, and it
requires neither a secure nor a stable station for backups. When a
daemon dies, it is replaced by another one, as we suppose that there
are enough available free nodes. Moreover, to resist supervisor
failures and for scalability, some extra nodes are reserved. For
more details on the JaceP2P-V2 platform, interested readers can
refer to \cite{jaceP2P-v2}.
\section{Mapping and fault tolerance problems}
\label{sec:pb}
192 \subsection{Model formalization}
195 \subsubsection{Application modeling}
196 \label{sec:pbmodelapp}
The TIG\cite{tig1} (\textit{Task Interaction Graph}) model is the
most appropriate to our problem, as it only models relationships
between tasks. In this model, all the tasks are considered
simultaneously executable and communications can take place at any
time during the computation, with no precedence or synchronization
constraints. As a reminder, during an iteration in the AIAC model,
each task computes its job and sends its results to its neighbors,
then immediately starts the next iteration with the last received
data.
In the TIG model, a parallel application is represented by a graph
$GT(V,E)$, where \mbox{$V = \{V_1,V_2,\dots V_v\}$} is the set of
$|V|$ vertices and \mbox{$E \subset V \times V$} is the set of
undirected edges. The vertices represent tasks and the edges
represent the mutual communications between tasks. A function
\mbox{$EC : V \rightarrow \mathbb{R}^+$} gives the computation cost
of tasks and \mbox{$CC : E \rightarrow \mathbb{R}^+$} gives the
communication cost for message passing on edges. We define
\mbox{$|V| = v$, $EC(V_i) = e_i$} and \mbox{$CC(V_i,V_j) =
c_{ij}$}. Another function \mbox{$D : V \rightarrow \mathbb{N}^+$}
gives the number of dependencies of a task, and we define
\mbox{$D(V_i) = d_i$}.
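To make the notation concrete, here is a minimal Python sketch of the TIG model (the task names, costs, and edges below are illustrative assumptions, not values from the paper):

```python
# Minimal sketch of the TIG model: tasks as vertices with computation
# costs (EC), undirected edges with communication costs (CC), and D
# giving the number of dependencies d_i. All values are illustrative.
tasks_ec = {"V1": 10.0, "V2": 8.0, "V3": 12.0}       # EC : V -> R+
edges_cc = {("V1", "V2"): 2.0, ("V2", "V3"): 1.5}    # CC : E -> R+ (undirected)

def dependencies(task):
    """D : V -> N, the number of dependencies d_i of a task."""
    return sum(1 for edge in edges_cc if task in edge)

print(dependencies("V2"))  # V2 communicates with V1 and V3: 2
```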
222 \subsubsection{Architecture modeling}
223 \label{sec:pbmodelarchi}
A distributed clusters architecture can be modeled by a three-level
graph. The levels are the \textit{architecture} (a) (here the
Grid'5000 grid), \textit{cluster} (c), and \textit{computing node}
(n) levels. Let $GG(N,L)$ be a graph representing a distributed
clusters architecture, where \mbox{$N = \{N_1,N_2,\dots N_n\}$} is
the set of $|N|$ vertices and $L$ is the set of $|L|$ undirected
edges. The vertices represent the computing nodes and the edges
represent the links between them. An edge \mbox{$L_i \in L$} is an
unordered pair \mbox{$(N_x,N_y) \in N$}, representing a
communication link between nodes $N_x$ and $N_y$. A function
\mbox{$WN : N \rightarrow \mathbb{R}^+$} gives the computational
power of nodes and another function \mbox{$WL : L \rightarrow
\mathbb{R}^+$} gives the communication latency of links. We define
\mbox{$WN(N_i) = wn_i$} and, for the link between nodes $N_i$ and
$N_j$, \mbox{$WL((N_i,N_j)) = wl_{ij}$}. Let $|C|$ be the number of
clusters contained in the architecture. A function \mbox{$CN : C
\rightarrow \mathbb{N}^+$} gives the number of computing nodes
contained in a cluster, and another function \mbox{$CF : C
\rightarrow \mathbb{N}^+$} gives the number of available computing
nodes (not involved in computation) of a cluster. We define
\mbox{$CN(C_i) = C_{Ni}$} and \mbox{$CF(C_i) = C_{Fi}$}. We also
define $C_{\overline{P}fi}$ as the average computing power of the
available resources of cluster $C_i$.
We evaluate the \textit{heterogeneity degree} of the architecture,
noted $hd$, by using the \textit{relative standard deviation}
method: $hd = \frac{\sigma_{PN}}{avg_{PN}}$, where $avg_{PN}$ is the
average computing power of nodes and $\sigma_{PN}$ is the standard
deviation of the computing node powers. This measure gives the
coefficient of variation of the platform. We only consider \mbox{$0
\leq hd \leq 1$}, since values of \mbox{$hd > 1$} are not relevant:
\mbox{$hd = 1$} already denotes a fully heterogeneous platform.
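The heterogeneity degree can be computed directly from the node powers. The following Python sketch mirrors the relative standard deviation formula above; the power values are illustrative assumptions:

```python
# Sketch of the heterogeneity degree hd = sigma_PN / avg_PN (relative
# standard deviation of the node computing powers), capped at 1 since
# values above 1 are not considered. Power values are illustrative.
import statistics

def heterogeneity_degree(node_powers):
    avg = statistics.mean(node_powers)
    sigma = statistics.pstdev(node_powers)  # population standard deviation
    return min(sigma / avg, 1.0)

print(heterogeneity_degree([2.0, 2.0, 2.0, 2.0]))  # homogeneous platform: 0.0
print(heterogeneity_degree([1.0, 2.0, 4.0, 8.0]))  # strongly heterogeneous
```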
264 \subsubsection{Mapping functions}
265 \label{sec:pbmodelmapping}
When a parallel application $App$, represented by a graph $GT$, is
mapped on a distributed clusters architecture, represented by a
graph $GG$, the execution time of the application, $ET(App)$, can be
defined as the execution time of the slowest task. Indeed, an
application ends when all the tasks have detected convergence and
reached the desired approximation of the solution. We define
$ET(App) = \max_{i=1 \dots v} ( ET(V_i) )$, where the execution time
of each task $i$ \mbox{($i=1 \dots v$)}, $ET(V_i)$, is given by
$ET(V_i) = \frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \times wl_{ij}$,
where $e_i$ is the computational cost of $V_i$, $wn_i$ is the
computational power of the node $N_i$ on which $V_i$ is mapped, $J$
is the set of neighbors of $V_i$, $c_{ij}$ is the amount of
communications between $V_i$ and $V_j$, and $wl_{ij}$ is the link
latency between the computing nodes on which $V_i$ and $V_j$ are
mapped. As this formula shows, the execution time of a task depends
on the task weight and on the communications which may occur between
this task and its neighbors. We underline here that in the AIAC
model it is impossible to predict the number of iterations of a
task, so it is difficult to evaluate its cost $e_i$ a priori.
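The execution time model above can be sketched as follows in Python; the task costs, node powers, and latencies are illustrative assumptions, not measured values:

```python
# Sketch of the execution time model: ET(V_i) = e_i / wn_i + sum_j c_ij * wl_ij
# and ET(App) = max_i ET(V_i), since the slowest task dominates.
# All numeric values below are illustrative assumptions.

def task_time(e_i, wn_i, comms):
    """comms: list of (c_ij, wl_ij) pairs, one per neighbor j of V_i."""
    return e_i / wn_i + sum(c * wl for c, wl in comms)

def app_time(tasks):
    """tasks: list of (e_i, wn_i, comms) triples."""
    return max(task_time(e, wn, comms) for e, wn, comms in tasks)

tasks = [
    (100.0, 2.0, [(5.0, 0.1), (3.0, 0.2)]),  # 100/2 + 0.5 + 0.6 = 51.1
    (80.0, 4.0, [(5.0, 0.1)]),               # 80/4 + 0.5 = 20.5
]
print(app_time(tasks))
```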
This task mapping problem is similar to the classical graph
partitioning and task assignment problem, and is thus NP-complete.
\subsection{Fault tolerance}
\label{sec:pbft}
In volatile environments, computing nodes can disconnect at any time
during the computation, and must therefore be efficiently
replaced. The replacing nodes should be the best available nodes at
the time of the fault. As executing environments can regularly
evolve, due to computing node volatility, a mapping algorithm has to
keep a correct overview of the architecture in real time. Thus, the
criteria used to assign tasks to nodes should evolve too.
Another problem appears after multiple crashes: some tasks may have
migrated over multiple computing nodes and clusters, and the initial
mapping may be totally changed. So, after several node failures, the
task mapping may no longer satisfy the mapping criteria (a task may
not be on the most powerful available machine, or may be too far
away from its neighbors\dots{}). A good fault tolerance policy has
to evolve dynamically with the executing environment.
317 \subsection{Specificities of the AIAC mapping problem}
318 \label{sec:specAIACmapping}
An important point to take into consideration is that we do not
allow the execution of multiple tasks on the same computing node, as
this causes a severe performance drop when that node fails. Indeed,
we would have to redeploy all the tasks from this node to other
ones, using their last checkpoints, which can be spread over
multiple computing nodes. This may result in large communication
overheads and in a waste of computation time. Nevertheless, to
benefit from multi-core processors, we use task-level parallelism
with multi-threaded sequential solvers inside each task.
Another important point in the AIAC model is that, as the JaceP2P-V2
environment is fault tolerant and tasks save their checkpoints on
their neighbors, it is more efficient to save on nearby nodes than
on distant ones, in order to reduce the communication overhead of
this operation and to restart a task faster.
\subsection{Related work}
\label{sec:pbrw}
Many TIG mapping algorithms exist in the literature; they can be
broadly classified into two categories. The first one is the
\textit{Edge-cuts optimization} class, which minimizes the use of
the penalizing links between clusters. As tasks depend on their
neighbors (their dependencies), the goal is to choose nodes whose
network distance is small, in order to improve communications
between tasks. Here we can cite Farhat's algorithm\cite{fec}, as
well as Metis\cite{metis} and Chaco\cite{chaco}, which are libraries
implementing such algorithms. The second category is the
\textit{Execution time optimization} class, which aims at minimizing
the whole application execution time. These algorithms look for
nodes which can provide the smallest task execution times using
their computational power. Here we can cite QM\cite{qm_these},
FastMap\cite{fastmap}, and MiniMax\cite{minimax} as examples of this
class.
Both classes of algorithms may fit our goals, since in our model
both the computational power of nodes and the communication costs
may influence application performance.
None of the mentioned algorithms tackles the computing node failure
issue, or they only do so basically, by reapplying the same
policy. As explained in Section \ref{sec:pbft}, a more efficient and
dedicated replacement function is needed. Nevertheless, to the best
of our knowledge, no task mapping algorithm explicitly addressing
both the executing platform heterogeneity and the computing node
failure issues has been proposed.
Here we present our new task mapping strategy, called MAHEVE (for
\textit{Mapping Algorithm for HEterogeneous and Volatile
Environments}). This algorithm aims at combining the advantages of
the two categories mentioned in Section \ref{sec:pbrw}: edge-cuts
minimization and application execution time optimization.
This new algorithm can be divided into two parts. The first part
performs the initial mapping, and the second part is devoted to
searching for replacement nodes when computing node failures occur.
389 \subsection{Initial mapping}
390 \label{sec:maheve_init}
In this section we describe the main mechanisms of the
\textit{static mapping} done by MAHEVE, which is composed of three
phases: sorting clusters, sorting tasks, and the effective mapping,
which assigns tasks (in their sorted order) to nodes of clusters
(also in their sorted order), reserving some nodes in each cluster.
399 %\begin{algorithm}[H]
401 % \KwIn{GT (app. graph), CL (clusters), hd (heterogeneity degree)}
402 % \KwOut{Mp (the mapping done)}
406 % nodes $\leftarrow$ sortingClusters(GG, hd)\;
407 %% nodes $\leftarrow$ searchNodes(lt, nbTasks)\;
411 % tasks $\leftarrow$ sortingTasks(GT, hd)\;
412 %% tasks $\leftarrow$ orderTasks(st, hd)\;
416 % \For{$i = 0$ \KwTo $size(tasks)$}
418 % Mp.add(tasks($i$), nodes($i$))\;
423 % \caption{General algorithm simplified}
428 \subsubsection{Sorting clusters}
429 \label{sec:maheve_sort_clusters}
The first step of the initial mapping is to sort clusters according
to the heterogeneity degree $hd$ of the executing platform. The main
principles are that a cluster obtains a better mark when $hd < 0.5$
and it contains more computing nodes than other clusters ($C_{Fi}$,
the number of available free nodes, is privileged), and when $hd \ge
0.5$ and it contains more powerful computing nodes
($C_{\overline{P}fi}$, the average free computation power, is
privileged). These choices come from several experiments with the
AIAC model, which show that in such environments it is more
efficient to privilege the number of nodes on homogeneous platforms
and the computing power on heterogeneous ones. As the number of
nodes, $C_{Fi}$, and the average free computing power,
$C_{\overline{P}fi}$, are not of the same order of magnitude, we
normalize them with two functions, $normN$ and $normP$. They are
defined as \mbox{$normN\left(C_{Fi}\right) = C_{Fi} \times 100 \div
\sum_{j=1}^{|C|}C_{Fj}$}, the rate of computing nodes, and
\mbox{$normP(C_{\overline{P}fi}) = C_{\overline{P}fi} \times 100
\div \sum_{j=1}^{|C|} C_{\overline{P}fj}$}, the rate of the average
power, both representing the cluster's share in the
architecture. We note $normN\left( C_{Fi} \right) = NC_{Fi}$ and
$normP(C_{\overline{P}fi}) = NC_{\overline{P}fi}$.
454 The formula used to give a mark, $M_i$, to a cluster is
\begin{equation}
M_i = {NC_{\overline{P}fi}}^{hd} + {NC_{Fi}}^{1 - hd}.
\end{equation}
This compromise function allows us to privilege clusters following
our criteria, as explained previously, according to the
heterogeneity degree. If we study its limits at the extremes of
$hd$, $hd = 0$ and $hd = 1$, we obtain $\lim_{hd \rightarrow 0} M_i
= NC_{Fi} + 1$ and $\lim_{hd \rightarrow 1} M_i =
NC_{\overline{P}fi} + 1$, which fit our objectives.

Clusters are thus sorted into a list, from the cluster receiving the
highest mark to the one receiving the lowest.
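The cluster marking and sorting described above can be sketched as follows; the cluster names, free-node counts, and power values are illustrative assumptions:

```python
# Sketch of the cluster sorting of MAHEVE: each cluster gets the mark
# M_i = NC_Pf_i^hd + NC_F_i^(1 - hd), where NC_F_i and NC_Pf_i are the
# normalized (percentage) free-node count and average free power.
# Cluster names, node counts, and powers are illustrative assumptions.

def sort_clusters(clusters, hd):
    """clusters: list of (name, free_nodes, avg_free_power) tuples."""
    total_f = sum(c[1] for c in clusters)
    total_p = sum(c[2] for c in clusters)

    def mark(c):
        nc_f = c[1] * 100.0 / total_f           # normN(C_Fi)
        nc_pf = c[2] * 100.0 / total_p          # normP(C_Pfi)
        return nc_pf ** hd + nc_f ** (1 - hd)   # M_i

    return sorted(clusters, key=mark, reverse=True)

clusters = [("A", 40, 2.0), ("B", 10, 8.0), ("C", 25, 4.0)]
print([c[0] for c in sort_clusters(clusters, hd=0.1)])  # node count dominates
print([c[0] for c in sort_clusters(clusters, hd=0.9)])  # node power dominates
```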
473 \subsubsection{Sorting tasks}
474 \label{sec:maheve_sort_tasks}
Like clusters, tasks are also sorted according to the heterogeneity
degree of the executing platform. The sorting is done in the same
spirit as previously: when $hd < 0.5$, tasks with more dependencies
are privileged, and when $hd \ge 0.5$, tasks with a higher computing
cost are privileged.

The main function used to classify tasks is
\begin{equation}
Q_i = {e_i}^{hd} \times {d_i}^{1 - hd}
\label{eq:tasks}
\end{equation}
where $Q_i$ is the mark of task $i$, computed from its computation
cost $e_i$, its number of dependencies $d_i$, and the heterogeneity
degree $hd$.
Tasks are taken in the order of the first sort, determined with
equation (\ref{eq:tasks}), and each task is placed in a new list
(the final one) together with some of its dependencies. We note
$Nb_i = {d_i}^{1 - hd}$ this number of added dependencies: the lower
the heterogeneity degree, the higher this number. This final
operation allows us to control the locality of tasks according to
$hd$.
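The task sorting phase can be sketched as below; the task costs, dependency lists, and the order in which the $Nb_i$ dependencies are pulled along are illustrative assumptions (the choice of which dependencies to take is not specified here):

```python
# Sketch of the task sorting of MAHEVE: tasks are ranked by
# Q_i = e_i^hd * d_i^(1 - hd), then placed in a final list together with
# Nb_i = floor(d_i^(1 - hd)) of their dependencies, to control locality.
# Task costs, dependency lists, and the dependency-picking order are
# illustrative assumptions.

def sort_tasks(tasks, deps, hd):
    """tasks: {name: (e_i, d_i)}; deps: {name: [neighbor names]}."""
    ranked = sorted(tasks,
                    key=lambda t: tasks[t][0] ** hd * tasks[t][1] ** (1 - hd),
                    reverse=True)
    final = []
    for t in ranked:
        if t not in final:
            final.append(t)
        nb = int(tasks[t][1] ** (1 - hd))   # Nb_i dependencies to pull along
        for n in deps[t][:nb]:
            if n not in final:
                final.append(n)
    return final

tasks = {"t0": (10.0, 4), "t1": (5.0, 2), "t2": (8.0, 3)}
deps = {"t0": ["t1", "t2"], "t1": ["t0"], "t2": ["t0"]}
print(sort_tasks(tasks, deps, hd=0.5))
```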
504 \subsubsection{Mapping method}
505 \label{sec:maheve_mapping_method}
The third step of the initial mapping is to allocate tasks to
nodes. As clusters and tasks have been sorted according to the
executing platform heterogeneity degree, ordered from the highest
mark to the lowest, this function maps tasks on almost all available
computing nodes of clusters, following their respective order in the
lists (for example, the task ranked first in the task list is mapped
on an available node of the cluster ranked first in the cluster
list). The idea here is not to fill each cluster completely, but to
preserve some computing nodes in each cluster. These reserved nodes
will be used to replace failed nodes.
We can mention here that the whole mapping process (the three steps)
has a complexity of $O(|V| \log(|V|))$, where $|V|$ is the number of
tasks.
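The effective mapping step can be sketched as follows; the 10\% reservation ratio (at least one node per cluster) is an illustrative assumption, as the exact amount of preserved nodes is not specified here:

```python
# Sketch of the effective mapping step: sorted tasks are assigned to nodes
# of the sorted clusters, while a few nodes per cluster are kept in reserve
# for the replacement function. The 10% reserve ratio (at least one node)
# is an illustrative assumption, not a value from the paper.

def initial_mapping(sorted_tasks, sorted_clusters, reserve_ratio=0.1):
    """sorted_clusters: list of (name, node_count) tuples, best cluster
    first; returns {task: (cluster, node index)}."""
    slots = []
    for cname, n_nodes in sorted_clusters:
        reserved = max(1, int(n_nodes * reserve_ratio))  # nodes kept as spares
        slots.extend((cname, i) for i in range(n_nodes - reserved))
    # Best-ranked tasks land on the best-ranked clusters first.
    return {task: slot for task, slot in zip(sorted_tasks, slots)}
```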
523 \subsection{Replacing function}
524 \label{sec:maheve_rep}
As shown in the previous section, some computing nodes in each
cluster have been preserved during the initial mapping. When a node
fails, this function replaces it with a free node of the same
cluster. If none is available, the function sorts the clusters
again, to take platform modifications into account, and replaces the
failed node with a node available in the newly sorted cluster
list. This mechanism preserves task locality and maintains a
real-time overview of the executing environment.
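The replacement function can be sketched as below; the data structures and the re-sorting callback are illustrative assumptions:

```python
# Sketch of the MAHEVE replacement function: a failed node is first
# replaced by a reserved node of its own cluster; otherwise clusters are
# re-sorted (to account for platform changes) and a node is taken from
# the best cluster that still has free nodes. The data structures are
# illustrative assumptions.

def replace_node(failed_cluster, free_nodes, resort_clusters):
    """free_nodes: {cluster: [node ids]}; resort_clusters: callable
    returning cluster names sorted by their current mark."""
    if free_nodes.get(failed_cluster):
        return failed_cluster, free_nodes[failed_cluster].pop(0)  # keep locality
    for cluster in resort_clusters():  # take platform modifications into account
        if free_nodes.get(cluster):
            return cluster, free_nodes[cluster].pop(0)
    return None  # no free node available anywhere
```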
\section{Experimentation}
\label{sec:expe}
543 \subsection{A typical AIAC application and the execution platform}
We used the ``Kernel CG'' application of the NAS Parallel Benchmarks
(NPB)\cite{nas} to evaluate the performance of our new mapping
algorithm. This benchmark is designed for large architectures, as it
stresses communications by processing an unstructured matrix-vector
multiplication with a Conjugate Gradient method. As this method
cannot be executed with the asynchronous iteration model, we
replaced it with another method, called the multisplitting method,
which supports the asynchronous iterative model. More details about
this method can be found in \cite{book_raph}. The chosen problem
used a matrix of size $5,000,000$ with a low bandwidth, fixed to
$35,000$. This bandwidth size generates, according to the problem
size, between 8 and 20 neighbors per task. The application was
executed on 64 nodes.
The platform used for our tests, called Grid'5000\cite{g5k}, is a
French nationwide experimental set of clusters which provides
distributed clusters architectures (28 heterogeneous clusters spread
over 9 sites). We used three distributed clusters architectures,
each having a different heterogeneity degree. The first one was
composed of four clusters spread over four sites, with a total of
106 computing nodes representing 424 computing cores, with
\mbox{$hd = 0.08$}; the second one was composed of four clusters
spread over three sites, with a total of 110 computing nodes
representing 440 computing cores, with \mbox{$hd = 0.50$}; and
finally the third one was composed of five clusters spread over four
sites, with 115 computing nodes representing 620 computing cores,
with \mbox{$hd = 0.72$}.
All computing nodes of these clusters have at least 4 computing
cores (in the last architecture, with \mbox{$hd = 0.72$}, two
clusters are composed of machines with 8 computing cores) and a
minimum of 4GB of memory (in order to execute the application with a
large problem size). All computing nodes can communicate with each
other through an efficient network. Nevertheless, this network is
shared with many other users, so high latencies appear during
executions.
584 \subsection{Experiments}
585 \label{sec:experiments}
During executions, we introduced two computing node failures every
20 seconds to simulate a volatile environment. Unfortunately, we did
not have the opportunity to run experiments with more computing
nodes over more sites with larger problem sizes, but we plan to
extend our experiments in the future.
Here we present the results of the evaluation of the MAHEVE
algorithm, compared with FT-AIAC-QM (for \textit{Fault Tolerant
AIAC-QM}) and FT-FEC (for \textit{Fault Tolerant F-EC}), which are
respectively the fault tolerant versions of the AIAC-QM and F-EC
mapping algorithms presented in \cite{pdsec10}. Table
\ref{tab:results} shows the execution times of each mapping
algorithm compared to the default mapping strategy of the JaceP2P-V2
platform, with the corresponding gains on application execution time
given in brackets. It presents both the executions with faults (WF)
and the fault free (FF) executions.
\begin{table}[ht]
\renewcommand{\arraystretch}{1.6}
\centering
\begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline
\multirow{2}{*}{ ~$hd$~ }&\multicolumn{2}{c|}{ ~Default~ }&\multicolumn{2}{c|}{ ~FT-AIAC-QM~ }&\multicolumn{2}{c|}{ ~FT-FEC~ }&\multicolumn{2}{c|}{ ~MAHEVE~ }\\
\cline{2-9}
& ~FF~ & ~WF~ & ~FF~ & ~WF~ & ~FF~ & ~WF~ & ~FF~ & ~WF~ \\
\hline
~$0.08$~ & ~80~ & ~229~ & ~63 (21\%)~ & ~178 (22\%)~ & ~61 (23\%)~ & ~154 (33\%)~ & ~60 (25\%)~ & ~113 (50\%)~ \\
\hline
~$0.50$~ & ~67~ & ~242~ & ~61 (9\%)~ & ~118 (51\%)~ & ~63 (6\%)~ & ~133 (45\%)~ & ~54 (20\%)~ & ~85 (65\%)~ \\
\hline
~$0.72$~ & ~67~ & ~192~ & ~59 (12\%)~ & ~99 (45\%)~ & ~65 (3\%)~ & ~121 (33\%)~ & ~52 (22\%)~ & ~86 (53\%)~ \\
\hline
\end{tabular}
\caption{Application execution times in seconds and corresponding gains on
various platforms using different mapping algorithms, for fault free (FF)
executions and executions with two node failures every 20 seconds (WF).}
\label{tab:results}
\end{table}
First of all, we can note that all mapping algorithms improve the
application performance by considerably reducing its execution time,
especially for executions with node failures, with an average gain
of about $45\%$ compared to the default policy. If we focus on
executions with node failures (WF), FT-FEC is efficient on
architectures with a low heterogeneity degree (\mbox{$hd = 0.08$}),
providing gains of about $33\%$, and gains are roughly the same on
heterogeneous architectures (\mbox{$hd = 0.72$}). FT-AIAC-QM is
efficient on architectures with a high heterogeneity degree
(\mbox{$hd = 0.72$}), providing gains of about $45\%$, whereas it is
not so efficient on homogeneous architectures (\mbox{$hd = 0.08$}),
providing gains of only about $22\%$. We can note here that on an
architecture with a heterogeneity degree of $0.50$, FT-AIAC-QM is
more efficient than FT-FEC, providing gains of up to $51\%$. We
point out that in fault free executions (FF), both algorithms also
provide gains on their respective favorite architectures, though
these gains are smaller than in executions with node failures.
Now if we focus on the performance of our new solution MAHEVE, we
can see that it always outperforms the other algorithms. As can be
seen in \mbox{Table \ref{tab:results}}, in executions with faults
(WF) it reduces the application execution time by about $50\%$ on
homogeneous architectures (here with a $0.08$ heterogeneity degree),
which is 17 points better than FT-FEC and almost 30 points better
than FT-AIAC-QM. On heterogeneous architectures (here with a $0.72$
heterogeneity degree) it also outperforms the other mapping
algorithms, reducing the application execution time by about $53\%$,
which is almost 10 points better than FT-AIAC-QM and 20 points
better than FT-FEC. On architectures with a middle heterogeneity
degree (here $0.50$), MAHEVE is once again better than the two other
mapping algorithms, reducing the application execution time by about
$65\%$. These good performances come from the fact that the
algorithm is designed to be efficient on both homogeneous and
heterogeneous architectures. Moreover, as it integrates a fault
tolerance \textit{security} in the initial mapping, it is more
efficient when computing nodes fail. We can point out that this
algorithm provides average gains on application execution time of
about $55\%$. In fault free executions (FF), it once again
outperforms the two other algorithms.
\section{Conclusion and future works}
\label{sec:conclu}
In this paper we have presented a new mapping algorithm, called
MAHEVE, to address the AIAC mapping issue on heterogeneous and
volatile environments. It aims at efficiently mapping tasks on
distributed clusters architectures by combining the advantages of
the two known approaches: application execution time optimization
and edge-cuts minimization. Experiments, though using a single
application, show that it is the most efficient mapping algorithm on
all kinds of architectures, as it takes their heterogeneity degree
into account and adapts its sorting methods to it. We have shown
that it consistently outperforms the two other comparative mapping
algorithms, FT-AIAC-QM and FT-FEC. This can be explained by the fact
that it takes into account not only the computing nodes and
clusters, but also the tasks' properties, which refines the mapping
solution.
In future work we plan to enhance the performance of the MAHEVE
algorithm by modifying the marking of clusters, since their locality
has not yet been taken into consideration. This would favor task
locality, which would reduce communication delays and provide a much
better convergence rate. We also have to validate the algorithm's
performance with other AIAC applications.
707 \section*{Acknowledgement}
710 Experiments presented in this paper were carried out using the
711 Grid'5000 experimental testbed\cite{g5k}, being developed under the
712 INRIA ALADDIN development action with support from CNRS, RENATER and
713 several Universities as well as other funding bodies.
718 \bibliographystyle{unsrt}
720 \bibliography{biblio}