\documentclass[10pt,conference]{llncs}

\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage[pdftex,final]{graphicx}
\usepackage[ruled,vlined]{algorithm2e}
\usepackage{multirow}

\DeclareGraphicsExtensions{.pdf,.jpeg,.png}
\title{MAHEVE: An Efficient Reliable Mapping of Asynchronous Iterative
  Applications on Volatile and Heterogeneous Environments\thanks{This
    work was supported by the European Interreg IV From-P2P project.}}

\author{Raphaël Couturier, David Laiymani and Sébastien Miquée}
\authorrunning{R. Couturier, D. Laiymani and S. Miquée}
\institute{\vspace{-0.2cm}
  University of Franche-Comté \qquad LIFC laboratory\\%[1mm]
  IUT Belfort-Montb\'eliard, 2 Rue Engel Gros, BP 27, 90016 Belfort, France\\
  {\tt \{raphael.couturier,david.laiymani,sebastien.miquee\}}}
\begin{document}

\maketitle

\begin{abstract}
  With the emergence of massive distributed computing resources, such
  as grids and distributed cluster architectures, parallel
  programming is used to benefit from them and to solve problems of
  larger sizes. The asynchronous iteration model, called AIAC, has
  been proven to be an efficient solution for heterogeneous and
  distributed architectures. An efficient mapping of application
  tasks is essential to reduce their execution time. In this paper we
  present a new mapping algorithm, called MAHEVE (Mapping Algorithm
  for HEterogeneous and Volatile Environments), which is efficient on
  such architectures and integrates a fault tolerance mechanism to
  resist computing node failures. Our experiments, performed with the
  JaceP2P-V2 environment on distributed cluster architectures
  containing more than 400 computing cores, show gains of about
  $55\%$ on the execution time of a typical AIAC application.
\end{abstract}
\section{Introduction}
Nowadays, scientific applications require great computing power to
solve large problems. Though personal computers are becoming more
powerful, in many cases they are not sufficient. One well adapted
solution is to use computer clusters in order to combine the power of
many machines. Distributed clusters form such an architecture,
providing great computing power by aggregating the computation power
of multiple clusters spread over multiple sites. Such an architecture
brings heterogeneity in computing machines as well as network
latencies. In order to use such an architecture, parallel programming
is required. In the parallel computing area, iterative methods are
well adapted to executing very large applications on heterogeneous
architectures\cite{book_raph,bcvc06:ij}.
These methods repeat the same instruction block until a convergence
state and a desired approximation of the solution are reached. They
constitute the only known approach to solving some kinds of problems
and are relatively easy to parallelize. The Jacobi or the Conjugate
Gradient\cite{cg} methods are examples of such methods. To parallelize
them, one of the most used approaches is the message passing paradigm,
which provides efficient mechanisms to exchange data between
tasks. Following this paradigm, we focus here on the asynchronous
parallel iterative model, called AIAC\cite{book_raph} (for
\textit{Asynchronous Iterations -- Asynchronous Communications}).
\begin{figure}[h!]
  \centering
  \includegraphics[width=7.4cm]{images/AIAC}
  \caption{Two processors computing in the Asynchronous Iterations --
    Asynchronous Communications (AIAC) model}
  \label{fig:AIAC}
\end{figure}
In this model, as can be seen in Figure~\ref{fig:AIAC}, after each
iteration, a task sends its results to its neighbors and immediately
starts the next iteration with the last received data. The receiving
and sending mechanisms are asynchronous and tasks do not have to wait
for the reception of dependency messages from their
neighbors. Consequently, there is no idle time between two
iterations. Furthermore, this model is tolerant to message loss, and
even if a task is stopped the remaining tasks continue the
computation with the last available data. Several
experiments\cite{bcvc06:ij} show the relevance of AIAC algorithms in
the context of distributed clusters with high latency between
clusters. These works underline the good adaptability of AIAC
algorithms to network and processor heterogeneity.
In a previous study\cite{pdsec10} we proposed the implementation of
two static algorithms for mapping tasks to processors, dedicated to
the AIAC model on heterogeneous distributed clusters. These two
algorithms, AIAC-QM (for \textit{AIAC Quick-quality Map}) and F-EC
(for \textit{Farhat Edges-Cuts}), showed important performance
improvements by significantly reducing the application execution
time. These experiments were performed using the fully fault tolerant
JaceP2P-V2 environment, described in the next section.
In our previous experiments we did not introduce computing node
failures during the computation. As the architecture heterogeneity
continually evolves with computing node volatility, we have to
consider more precisely the heterogeneity of the target
platform. Thus, in this paper our main contribution is to propose a
new mapping algorithm called MAHEVE (\textit{Mapping Algorithm for
  HEterogeneous and Volatile Environments}). This algorithm
explicitly tackles the heterogeneity issue and introduces a level of
dynamism in order to adapt itself to the fault tolerance
mechanisms. Our experiments show gains up to $65\%$ on application
execution time with faults during executions, which is about 10
points better than AIAC-QM and about 25 points better than F-EC, and
MAHEVE also outperforms them in experiments with no fault during
executions.
The rest of this paper is organized as
follows. Section~\ref{sec:jacep2p} presents the JaceP2P-V2 middleware
by describing its architecture and briefly presenting its fault
tolerance mechanisms. Section~\ref{sec:pb} formalizes our mapping and
fault tolerance problems and surveys existing approaches to address
them. Section~\ref{sec:maheve} describes the new mapping strategy we
propose, MAHEVE. In Section~\ref{sec:expe} we present the experiments
we conducted on the Grid'5000 testbed with more than 400 computing
cores. Finally, we give some concluding remarks and outline our
future work in Section~\ref{sec:conclu}.
\section{JaceP2P-V2}
\label{sec:jacep2p}

JaceP2P-V2\cite{jaceP2P-v2} is a distributed platform implemented in
Java, dedicated to developing and executing parallel iterative
asynchronous applications. It is fully fault tolerant, allowing it to
execute parallel applications over volatile environments. To our
knowledge this is the only platform dedicated to designing and
executing AIAC algorithms in such volatile environments.
The JaceP2P-V2 platform, which is based on the daemon/supervisor
paradigm, is composed of three main entities:

\begin{itemize}
\item The ``super-nodes'', which are in charge of supervising free
  computing nodes connected to the platform;

\item The ``spawner'', which is launched by a user wanting to execute
  a parallel application. It is in charge of a group of computing
  nodes and monitors them. If one of them fails, it requests a
  replacement node from a super-node;

\item The ``daemons'', which first connect to a super-node and wait
  for a task to execute. Each daemon can communicate directly with
  its computing neighbors.
\end{itemize}
To be able to execute AIAC applications, JaceP2P-V2 features an
asynchronous messaging mechanism, and to resist daemon failures, it
implements a checkpoint/restart mechanism using a distributed backup
scheme called \textit{uncoordinated distributed
  checkpointing}\cite{uncoord_cp}. This decentralized procedure makes
the platform very scalable, with no weak point, and does not require
a secure nor stable station for backups. When a daemon dies, it is
replaced by another one, as we suppose that there are enough
available free nodes. Moreover, to resist supervisor failures and for
scalability, some extra nodes are reserved. For more details on the
JaceP2P-V2 platform, interested readers can refer to
\cite{jaceP2P-v2}.
\section{Mapping and fault tolerance problems}
\label{sec:pb}

\subsection{Model formalization}

\subsubsection{Application modeling}
\label{sec:pbmodelapp}
The TIG\cite{tig1} (\textit{Task Interaction Graph}) model is the most
appropriate for our problem, as it only models the relationships
between tasks. In this model, all the tasks are considered
simultaneously executable and communications can take place at any
time during the computation, with no precedence nor synchronization.
As a reminder, during an iteration in the AIAC model, each task
computes its job and sends its results to its neighbors, and
immediately starts the next iteration.
In the TIG model, a parallel application is represented by a graph
$GT(V,E)$, where \mbox{$V = \{V_1,V_2,\dots V_v\}$} is the set of
$|V|$ vertices and \mbox{$E \subset V \times V$} is the set of
undirected edges. The vertices represent tasks and the edges
represent the mutual communications among tasks. A function \mbox{$EC :
  V \rightarrow \mathbb{R}^+$} gives the computation cost of tasks and
\mbox{$CC : E \rightarrow \mathbb{R}^+$} gives the communication cost
of message passing on edges. We define \mbox{$|V| = v$, $EC(V_i) =
  e_i$} and \mbox{$CC(V_i,V_j) = c_{ij}$}. Another function
\mbox{$D : V \rightarrow \mathbb{N}^+$} gives the amount of
dependencies of a task, and we define \mbox{$D(V_i) = d_i$}.
\subsubsection{Architecture modeling}
\label{sec:pbmodelarchi}
A distributed clusters architecture can be modeled by a three-level
graph. The levels are the \textit{architecture} (a) (here the
Grid'5000 grid), the \textit{cluster} (c), and the \textit{computing
  node} (n) levels. Let $GG(N,L)$ be a graph representing a
distributed clusters architecture, where \mbox{$N =
  \{N_1,N_2,\dots N_n\}$} is the set of $|N|$ vertices and $L$ is the
set of $|L|$ undirected edges. The vertices represent the computing
nodes and the edges represent the links between them. An edge
\mbox{$L_k \in L$} is an unordered pair \mbox{$(N_x,N_y)$} of nodes
\mbox{$N_x, N_y \in N$}, representing a communication link between
nodes $N_x$ and $N_y$. A function \mbox{$WN : N \rightarrow
  \mathbb{R}^+$} gives the computational power of nodes and another
function \mbox{$WL : L \rightarrow \mathbb{R}^+$} gives the
communication latency of links. We define \mbox{$WN(N_i) = wn_i$} and,
for a link \mbox{$L_k = (N_i,N_j)$}, \mbox{$WL(L_k) = wl_{ij}$}. Let
$|C|$ be the number of clusters contained in the architecture. A
function \mbox{$CN : C \rightarrow \mathbb{N}^+$} gives the amount of
computing nodes contained in a cluster, and another function \mbox{$CF
  : C \rightarrow \mathbb{N}^+$} gives the amount of available
computing nodes (not involved in a computation) of a cluster. We
define \mbox{$CN(C_i) = C_{Ni}$} and \mbox{$CF(C_i) = C_{Fi}$}. We
also define $C_{\overline{P}fi}$ as the average computation power of
the available resources of cluster $C_i$.
We evaluate the \textit{heterogeneity degree} of the architecture,
noted $hd$, by using the \textit{relative standard deviation} method,
with $hd = \frac{\sigma_{PN}}{avg_{PN}}$, where $avg_{PN}$ is the
average computing power of the nodes and $\sigma_{PN}$ is the
standard deviation of the computing node powers. This measure
provides the coefficient of variation of the platform. We only
consider \mbox{$0 \leq hd \leq 1$}, as values of \mbox{$hd > 1$} are
not relevant: \mbox{$hd = 1$} already denotes a fully heterogeneous
platform.
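As an illustration, the computation of $hd$ can be sketched in a few
lines of Python (a minimal sketch; the node power values below are
hypothetical, not measurements from the paper):

```python
import statistics

def heterogeneity_degree(node_powers):
    """Relative standard deviation of the computing node powers,
    clamped to [0, 1] as in the model."""
    avg = statistics.mean(node_powers)
    sigma = statistics.pstdev(node_powers)  # population standard deviation
    return min(sigma / avg, 1.0)

# A homogeneous platform yields hd = 0; a heterogeneous one a larger hd.
homogeneous = [10.0, 10.0, 10.0, 10.0]
heterogeneous = [1.0, 4.0, 9.0, 16.0]
print(heterogeneity_degree(homogeneous))    # 0.0
print(heterogeneity_degree(heterogeneous))  # larger, still <= 1
```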
\subsubsection{Mapping functions}
\label{sec:pbmodelmapping}
When a parallel application $App$, represented by a graph $GT$, is
mapped on a distributed clusters architecture, represented by a graph
$GG$, the execution time of the application, $ET(App)$, can be defined
as the execution time of the slowest task. Indeed, an application ends
when all the tasks have detected convergence and reached the desired
approximation of the solution. We define $ET(App) = \max_{i=1 \dots
  v} ( ET(V_i) )$, where the execution time of each task $i$
\mbox{($i=1 \dots v$)}, $ET(V_i)$, is given by $ET(V_i) =
\frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \times wl_{ij}$, where $e_i$
is the computational cost of $V_i$, $wn_i$ is the computational power
of the node $N_i$ on which $V_i$ is mapped, $J$ is the set of
neighbors of $V_i$, $c_{ij}$ is the amount of communications between
$V_i$ and $V_j$, and $wl_{ij}$ is the link latency between the
computing nodes on which $V_i$ and $V_j$ are mapped. As described by
this formula, the execution time of a task depends on the task weight
and on the communications which may occur between this task and its
neighbors. We underline here that in the AIAC model it is impossible
to predict the number of iterations of a task, so it is difficult to
evaluate a priori its cost $e_i$.
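The execution time estimate above can be sketched as follows (a
minimal sketch with toy values; the helper names are ours, not part of
JaceP2P-V2):

```python
def task_time(e_i, wn_i, comms):
    """ET(V_i) = e_i / wn_i + sum over neighbors j of c_ij * wl_ij.
    comms is a list of (c_ij, wl_ij) pairs for the neighbors of V_i."""
    return e_i / wn_i + sum(c * wl for c, wl in comms)

def app_time(tasks):
    """ET(App) = max over all tasks of ET(V_i)."""
    return max(task_time(e, wn, comms) for e, wn, comms in tasks)

# Two toy tasks: (computation cost e_i, node power wn_i, [(c_ij, wl_ij), ...])
tasks = [
    (100.0, 10.0, [(2.0, 0.5)]),   # ET = 100/10 + 2*0.5 = 11.0
    (100.0, 25.0, [(2.0, 0.5)]),   # ET = 100/25 + 2*0.5 = 5.0
]
print(app_time(tasks))  # 11.0 (the slowest task bounds the application)
```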
This task mapping problem is similar to the classical graph
partitioning and task assignment problems, and is thus NP-complete.
\subsection{Fault tolerance}
\label{sec:pbft}
In volatile environments, computing nodes can disconnect at any time
during the computation, and thus have to be efficiently replaced. The
replacing nodes should be the best ones available at the fault
time. As executing environments can evolve regularly, due to
computing node volatility, a mapping algorithm has to keep a correct
overview of the architecture in real time. Thus, the criteria used to
assign tasks to nodes should evolve too.
Another problem appears after multiple crashes: some tasks may have
migrated over multiple computing nodes and clusters, and the initial
mapping may be totally changed. So, after having suffered some node
failures, the task mapping may no longer satisfy the mapping criteria
(a task may not be on the most powerful available machine, or may be
too far away from its neighbors\dots{}). A good fault tolerance
policy has to evolve dynamically with the executing environment.
\subsection{Specificities of the AIAC mapping problem}
\label{sec:specAIACmapping}
An important point to take into consideration is that we do not allow
the execution of multiple tasks on the same computing node, as this
leads to a severe performance drop when the node fails. Indeed, we
would have to redeploy all the tasks from this node to other ones,
using their last checkpoints, which can be spread over multiple
computing nodes. This may result in large communication overheads and
in a waste of computation time. Nevertheless, to benefit from
multi-core processors, we use task-level parallelism with
multi-threaded sequential solvers inside each task.
Another important point in the AIAC model is that, as the JaceP2P-V2
environment is fault tolerant and tasks save their checkpoints on
their neighbors, it is more efficient to save on nearby nodes than on
distant ones, in order to reduce the communication overhead of this
operation and to restart a task faster.
\subsection{Related work}
\label{sec:pbrw}
Many TIG mapping algorithms exist in the literature; they can be
broadly classified into two categories. The first one is the
\textit{edge-cuts optimization} class, which minimizes the use of the
penalizing links between clusters. As tasks depend on their neighbors
(their dependencies), the goal is to choose nodes whose network
distance is small, to improve communications between tasks. Here we
can cite Farhat's algorithm\cite{fec}, as well as Metis\cite{metis}
and Chaco\cite{chaco}, which are libraries providing such
algorithms. The second category is the \textit{execution time
  optimization} class, which aims at minimizing the whole application
execution time. These algorithms look for nodes which can provide the
smallest task execution time according to their computational
power. Here we can cite QM\cite{qm_these}, FastMap\cite{fastmap}, and
MiniMax\cite{minimax} as examples of such algorithms.
Both classes of algorithms may fit our goals, as in our model both
the computational power of nodes and the communication costs may
influence application performances.
None of the mentioned algorithms tackles the computing node failure
issue, or they only do so basically, by applying the same mapping
policy again. As explained in Section \ref{sec:pbft}, a more
efficient and dedicated replacement function is needed. To the best
of our knowledge, no task mapping algorithm explicitly addresses both
the executing platform heterogeneity and the computing node failure
issues.
\section{MAHEVE}
\label{sec:maheve}

Here we present our new task mapping strategy, called MAHEVE (for
\textit{Mapping Algorithm for HEterogeneous and Volatile
  Environments}). This algorithm aims at taking the best part of each
category mentioned in Section \ref{sec:pbrw}: the edge-cuts
minimization and the application execution time optimization.
This new algorithm can be divided into two parts. The first part aims
at performing the initial mapping, and the second part is devoted to
searching for replacement nodes when computing node failures occur.
\subsection{Initial mapping}
\label{sec:maheve_init}
In this section we study the main mechanisms of the \textit{static
  mapping} done by MAHEVE, which is composed of three phases: the
sorting of clusters, the sorting of tasks, and the effective mapping,
which maps tasks (in their sorted order) onto nodes of clusters (also
in their sorted order), while reserving some nodes in each cluster.
\subsubsection{Sorting clusters}
\label{sec:maheve_sort_clusters}
The first step of the initial mapping is to sort clusters according
to the executing platform heterogeneity degree $hd$. The main
principles are that a cluster obtains a better mark when $hd < 0.5$
and it contains more computing nodes than other clusters ($C_{Fi}$,
the number of available free nodes, is privileged), and when $hd \ge
0.5$ and it contains more powerful computing nodes
($C_{\overline{P}fi}$, the average free computation power, is
privileged). These choices come from several experiments with the
AIAC model, which show that in such environments it is more efficient
to privilege, respectively, the number of nodes or the computation
power. As the number of free nodes, $C_{Fi}$, and the average free
computing power, $C_{\overline{P}fi}$, are not of the same order of
magnitude, we normalize them with two functions, $normN$ and
$normP$. They are defined as \mbox{$normN\left(C_{Fi}\right) = C_{Fi}
  \times 100 \div \sum_{j=1}^{|C|}C_{Fj}$}, which is the rate of free
computing nodes, and \mbox{$normP(C_{\overline{P}fi}) =
  C_{\overline{P}fi} \times 100 \div \sum_{j=1}^{|C|}
  C_{\overline{P}fj}$}, which is the rate of the average free power,
both representing the cluster within the architecture. We note
$normN\left( C_{Fi} \right) = NC_{Fi}$ and
$normP(C_{\overline{P}fi}) = NC_{\overline{P}fi}$.
The formula used to give a mark, $M_i$, to a cluster is
\begin{equation}
  M_i = {NC_{\overline{P}fi}}^{hd} + {NC_{Fi}}^{1 - hd}.
  \label{eq:clusters}
\end{equation}
This compromise function allows us to privilege clusters following
our criteria, as explained previously, according to the heterogeneity
degree. If we study its limits at the extremities of $hd$, $hd = 0$
and $hd = 1$, we obtain $\lim_{hd \rightarrow 0} M_i = NC_{Fi} + 1$
and $\lim_{hd \rightarrow 1} M_i = NC_{\overline{P}fi} + 1$, which
fits our criteria.
Clusters are then sorted into a list, starting from the cluster which
receives the highest mark down to the one which receives the lowest
mark.
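The cluster marking and sorting step can be sketched as follows (a
sketch under the normalization just described; the cluster tuples are
hypothetical):

```python
def sort_clusters(clusters, hd):
    """Sort clusters by mark M_i = NC_Pfi^hd + NC_Fi^(1-hd), best first.
    clusters: list of (free nodes C_Fi, average free power C_Pfi)."""
    total_free = sum(c[0] for c in clusters)
    total_power = sum(c[1] for c in clusters)

    def mark(c):
        ncf = 100.0 * c[0] / total_free    # normN: rate of free nodes
        ncp = 100.0 * c[1] / total_power   # normP: rate of average free power
        return ncp ** hd + ncf ** (1.0 - hd)

    return sorted(clusters, key=mark, reverse=True)

clusters = [(40, 1.0), (10, 4.0)]  # (free nodes, average free power)
print(sort_clusters(clusters, hd=0.1))  # low hd: larger cluster ranked first
print(sort_clusters(clusters, hd=0.9))  # high hd: more powerful cluster first
```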
\subsubsection{Sorting tasks}
\label{sec:maheve_sort_tasks}
Like clusters, tasks are also sorted according to the heterogeneity
degree of the executing platform. This sorting is done in the same
way as previously: when $hd < 0.5$, tasks with more dependencies are
privileged, and when $hd \ge 0.5$, tasks with a higher computing cost
are privileged.
The main function used to classify tasks is
\begin{equation}
  Q_i = {e_i}^{hd} \times {d_i}^{1 - hd}
  \label{eq:tasks}
\end{equation}
where $Q_i$ is the evaluation of task $i$ according to the
heterogeneity degree $hd$, with $e_i$ its computing cost and $d_i$
its amount of dependencies.
Tasks are taken in the order of this first sort, determined with
equation (\ref{eq:tasks}), and each task is placed in a new list (the
final one) together with some of its dependencies. We note $Nb_i =
{d_i}^{1 - hd}$ this amount of added dependencies: the lower the
heterogeneity degree, the higher this number. This final operation
allows us to control the necessary locality of tasks according to
$hd$.
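The task classification can be sketched in the same way (the task
tuples below are toy values, chosen only to illustrate how $hd$ shifts
the ranking):

```python
def sort_tasks(tasks, hd):
    """Sort tasks by Q_i = e_i^hd * d_i^(1-hd), best first.
    tasks: list of (computing cost e_i, amount of dependencies d_i)."""
    return sorted(tasks,
                  key=lambda t: (t[0] ** hd) * (t[1] ** (1.0 - hd)),
                  reverse=True)

tasks = [(100.0, 2), (10.0, 8)]  # (e_i, d_i)
print(sort_tasks(tasks, hd=0.2))  # low hd: dependencies dominate the ranking
print(sort_tasks(tasks, hd=0.8))  # high hd: computing cost dominates
```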
\subsubsection{Mapping method}
\label{sec:maheve_mapping_method}
The third step of the initial mapping is to allocate tasks to
nodes. As clusters and tasks have been sorted according to the
executing platform heterogeneity degree, ordered from the highest
mark to the lowest, this function maps tasks onto almost all the
available computing nodes of the clusters, following their respective
order in the lists (for example, the task ranked first in the task
list is mapped onto an available node of the cluster ranked first in
the cluster list). The idea here is not to fill each cluster
completely, but to preserve some computing nodes in each cluster;
these preserved nodes will be used to replace failed nodes.
We can mention here that the whole mapping process (the three steps)
has a complexity of $O(|V| \log(|V|))$, where $|V|$ is the number of
tasks.
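Putting the two sorted lists together, the effective mapping can be
sketched as follows (a simplified sketch; the paper does not fix the
reservation rate, so \texttt{reserve\_rate} is a hypothetical
parameter):

```python
def initial_mapping(sorted_tasks, sorted_clusters, reserve_rate=0.1):
    """Map tasks, in sorted order, onto the nodes of the sorted clusters,
    keeping at least one node free in each cluster for replacements."""
    mapping = {}
    slots = []  # usable (cluster, node index) pairs, best clusters first
    for cluster, free_nodes in sorted_clusters:
        usable = free_nodes - max(1, int(reserve_rate * free_nodes))
        slots.extend((cluster, i) for i in range(usable))
    for task, slot in zip(sorted_tasks, slots):
        mapping[task] = slot
    return mapping

mp = initial_mapping(["t0", "t1", "t2"], [("c0", 2), ("c1", 3)])
print(mp)  # t0 on c0, t1 and t2 on c1; one node kept free per cluster
```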
\subsection{Replacing function}
\label{sec:maheve_rep}
As shown in the previous section, during the initial mapping some
computing nodes in each cluster have been preserved. When a node
fails, this function replaces it with a free node of the same
cluster. If none is available, the function sorts the clusters again,
to take platform modifications into consideration, and replaces the
failed node with an available one following the new sorted cluster
list. This mechanism allows keeping both task locality and a real
time overview of the executing platform.
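The replacement policy can be sketched as follows (hypothetical data
structures; \texttt{resort\_clusters} stands for the cluster sorting
step of the initial mapping, re-run on the current platform state):

```python
def replace_node(failed_cluster, free_nodes_by_cluster, resort_clusters):
    """Replace a failed node: prefer a free node of the same cluster,
    otherwise re-sort the clusters and take the first free node found."""
    if free_nodes_by_cluster.get(failed_cluster):
        return failed_cluster, free_nodes_by_cluster[failed_cluster].pop()
    for cluster in resort_clusters():  # fresh overview of the platform
        if free_nodes_by_cluster.get(cluster):
            return cluster, free_nodes_by_cluster[cluster].pop()
    return None  # no free node left anywhere

free = {"c0": [], "c1": ["n5"]}
print(replace_node("c0", free, lambda: ["c1", "c0"]))  # falls back to c1
```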
\section{Experimentation}
\label{sec:expe}
\subsection{A typical AIAC application and the execution platform}
We used the ``Kernel CG'' application of the NAS Parallel Benchmarks
(NPB)\cite{nas} to evaluate the performance of our new mapping
algorithm. This benchmark is designed to be used on large
architectures, as it stresses communications by processing
unstructured matrix-vector multiplications with a Conjugate Gradient
method. As this method cannot be executed with the asynchronous
iteration model, we replaced it with another method, called the
multisplitting method, which supports the asynchronous iterative
model. More details about this method can be found in
\cite{book_raph}. The chosen problem used a matrix of size
$5,000,000$ with a low bandwidth, fixed to $35,000$. This bandwidth
size generates, according to the problem size, between 8 and 20
neighbors per task. This application was executed on 64 nodes.
The platform used for our tests, called Grid'5000\cite{g5k}, is a
French nationwide experimental set of clusters which provides
distributed clusters architectures (28 heterogeneous clusters spread
over 9 sites). We used three distributed clusters architectures, each
having a different heterogeneity degree. The first one was composed
of four clusters spread over four sites, with a total of 106
computing nodes representing 424 computing cores, with \mbox{$hd =
  0.08$}; the second one was composed of four clusters spread over
three sites, with a total of 110 computing nodes representing 440
computing cores, with \mbox{$hd = 0.50$}; and finally the third one
was composed of five clusters spread over four sites, with 115
computing nodes representing 620 computing cores, with \mbox{$hd =
  0.72$}.
All computing nodes of these clusters have at least 4 computing cores
(in the last architecture, with \mbox{$hd = 0.72$}, two clusters are
composed of machines with 8 computing cores) and a minimum of 4GB of
memory (in order to execute the application with a big problem
size). All computing nodes can communicate with each other through an
efficient network. Nevertheless, this network is shared with many
other users, so high latencies appear during executions.
\subsection{Experiments}
\label{sec:experiments}
During executions, we introduced failures of two computing nodes
involved in the computation every 20 seconds, to simulate a volatile
environment. Unfortunately, we did not have the opportunity to
perform experiments with more computing nodes over more sites and
with larger problems, but we plan to extend our experiments in the
future.
Here we present the results of the evaluation of the MAHEVE
algorithm, compared with FT-AIAC-QM (for \textit{Fault Tolerant
  AIAC-QM}) and FT-FEC (for \textit{Fault Tolerant F-EC}), which are
respectively the fault tolerant versions of the AIAC-QM and F-EC
mapping algorithms presented in \cite{pdsec10}. Table
\ref{tab:results} shows the execution times of each mapping algorithm
compared to the default mapping strategy of the JaceP2P-V2 platform,
with the corresponding gains on application execution time given in
brackets. It presents both the executions with faults (WF) and the
fault free (FF) executions.
\begin{table}[h!]
  \centering
  \renewcommand{\arraystretch}{1.6}
  \begin{tabular}{|c|c|c|c|c|c|c|c|c|}
    \hline
    \multirow{2}{*}{~$hd$~} & \multicolumn{2}{c|}{~Default~} &
    \multicolumn{2}{c|}{~FT-AIAC-QM~} & \multicolumn{2}{c|}{~FT-FEC~} &
    \multicolumn{2}{c|}{~MAHEVE~}\\
    \cline{2-9}
    & ~FF~ & ~WF~ & ~FF~ & ~WF~ & ~FF~ & ~WF~ & ~FF~ & ~WF~ \\
    \hline
    ~$0.08$~ & ~80~ & ~229~ & ~63 (21\%)~ & ~178 (22\%)~ & ~61 (23\%)~ &
    ~154 (33\%)~ & ~60 (25\%)~ & ~113 (50\%)~ \\
    \hline
    ~$0.50$~ & ~67~ & ~242~ & ~61 (9\%)~ & ~118 (51\%)~ & ~63 (6\%)~ &
    ~133 (45\%)~ & ~54 (20\%)~ & ~85 (65\%)~ \\
    \hline
    ~$0.72$~ & ~67~ & ~192~ & ~59 (12\%)~ & ~99 (45\%)~ & ~65 (3\%)~ &
    ~121 (33\%)~ & ~52 (22\%)~ & ~86 (53\%)~ \\
    \hline
  \end{tabular}
  \caption{Application execution times in seconds and corresponding
    gains on various platforms using different mapping algorithms,
    with fault free (FF) executions and executions with two node
    failures every 20 seconds (WF).}
  \label{tab:results}
\end{table}
First of all, we can note that all mapping algorithms enhance
application performance by considerably reducing the execution time,
especially for executions with node failures, with an average gain of
about $45\%$ in general compared to the default policy. If we focus
on executions with node failures (WF), FT-FEC is efficient on
architectures with a low heterogeneity degree (\mbox{$hd = 0.08$}),
providing gains of about $33\%$, and its gains are roughly the same
on heterogeneous architectures (\mbox{$hd = 0.72$}). FT-AIAC-QM is
efficient on architectures with a high heterogeneity degree
(\mbox{$hd = 0.72$}), providing gains of about $45\%$, whereas it is
not so efficient on homogeneous architectures (\mbox{$hd = 0.08$}),
where it provides gains of about $22\%$. We can note here that on an
architecture with a heterogeneity degree of $0.50$, FT-AIAC-QM is
more efficient than FT-FEC, providing gains up to $51\%$. We point
out that in fault free executions (FF), both algorithms also provide
gains on their respective favorite architectures, though the gains
are smaller than in executions with faults.
Now if we focus on the performance of our new solution, MAHEVE, we
can see that it always outperforms the other algorithms. As can be
seen in \mbox{Table \ref{tab:results}}, in executions with faults
(WF), it reduces the application execution time by about $50\%$ on
homogeneous architectures (here with a heterogeneity degree of
$0.08$), which is about 17 points better than FT-FEC and nearly 30
points better than FT-AIAC-QM. On heterogeneous architectures (here
with a heterogeneity degree of $0.72$) it also outperforms the other
mapping algorithms by reducing the application execution time by
about $53\%$, which is almost 10 points better than FT-AIAC-QM and 20
points better than FT-FEC. On middle heterogeneity degree
architectures (here of $0.50$), MAHEVE is once again better than the
two other mapping algorithms, reducing the application execution time
by about $65\%$. These good performances come from the fact that it
is designed to be efficient on both homogeneous and heterogeneous
architectures. Moreover, as it integrates a fault tolerance
\textit{security} in the initial mapping, it is more efficient when
computing nodes fail. We can point out that this algorithm provides
gains on application execution time of about $55\%$ in general. In
fault free executions (FF), it once again outperforms the two other
algorithms.
\section{Conclusion and future works}
\label{sec:conclu}
In this paper we have presented a new mapping algorithm, called
MAHEVE, to address the AIAC mapping issue on heterogeneous and
volatile environments. It aims at efficiently mapping tasks on
distributed clusters architectures by taking the best part of the two
known approaches: application execution time optimization and
edge-cuts minimization. Experiments, though using a single
application, show that it is the most efficient mapping algorithm on
all kinds of architectures, as it takes their heterogeneity degree
into account and adapts its sorting methods to it. We have shown that
it always performs better than the two other mapping algorithms used
for comparison, FT-AIAC-QM and FT-FEC. This can be explained by the
fact that it takes into account not only the computing nodes and
clusters, but also the tasks' properties, which refines the mapping
solution.
In our future work we plan to enhance the MAHEVE algorithm
performances by modifying the marking of clusters, since their
locality has not yet been taken into consideration. This would favor
task locality, which would reduce communication delays and provide a
much better convergence rate. We also have to validate the
algorithm's performance with other AIAC applications.
\section*{Acknowledgement}

Experiments presented in this paper were carried out using the
Grid'5000 experimental testbed\cite{g5k}, being developed under the
INRIA ALADDIN development action with support from CNRS, RENATER and
several Universities as well as other funding bodies.
\bibliographystyle{unsrt}
\bibliography{biblio}

\end{document}