2 \documentclass[10pt,conference,compsocconf]{IEEEtran}
5 %\documentclass[conference,compsoc,a4paper,onecolumn]{IEEEtran}
6 %\documentclass[compsoc,a4paper,oneside]{IEEEtran}
8 \usepackage[T1]{fontenc}
10 \usepackage[utf8x]{inputenc}
14 %\usepackage[Glenn]{fncychap}
18 %\usepackage{graphicx}
% Margin settings
22 %\setpapersize[portrait]{A4}
24 \usepackage[english]{babel}
% Package for EPS graphics
26 %\usepackage[dvips]{graphicx}
27 \usepackage[pdftex,final]{graphicx}
% Package for intra-document links (tagged PDF)
% and correct display of URLs (\url{http://example.com} command)
% Package so that more headings appear in the table of contents
% (table of contents, bibliography, index).
33 %\usepackage{tocbibind}
34 %\newcommand{\tabledesmatieres}{
% \setlength{\parskip}{0ex} % removes the space between paragraphs
39 %\NoAutoSpaceBeforeFDP
40 %\author{Sébastien Miquée}
42 \usepackage[ruled,vlined]{algorithm2e}
45 \newcommand{\myitemize}[1]
%\usepackage[pdftex]{graphicx} % graphicx is already loaded above with [pdftex,final]
54 \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
58 % *** PDF, URL AND HYPERLINK PACKAGES ***
63 \title{Mapping Asynchronous Iterative Applications on Heterogeneous
64 Distributed Architectures}
67 \IEEEauthorblockN{Raphaël Couturier \hspace{10pt} David Laiymani
68 \hspace{10pt} Sébastien Miquée}
70 \IEEEauthorblockA{Laboratoire d'Informatique de Franche-Comté
72 %University of Franche-Comté\\
IUT de Belfort-Montbéliard, 2 Rue Engel Gros, BP 27, 90016 Belfort,
75 % Tel.: +33-3-84587782 \hspace{20pt} Fax: +33-3-84587781\\
77 \{raphael.couturier,david.laiymani,sebastien.miquee\}@univ-fcomte.fr}
79 \thanks{This work was supported by the European Interreg IV From-P2P project
80 and the region of Franche-Comté}
%% Re-enables the \thanks command
84 \IEEEoverridecommandlockouts
89 %% left, top, right, bottom
90 %\setmargrb{20mm}{15mm}{20mm}{15mm}
92 \IEEEcompsoctitleabstractindextext
To design parallel numerical algorithms on large-scale distributed
and heterogeneous platforms, the asynchronous iteration model (AIAC)
may be an efficient solution. This class of algorithms is very
suitable since it enables communication/computation overlapping and
suppresses all synchronizations between computing nodes. Since
target architectures are composed of more than one thousand
heterogeneous nodes connected through heterogeneous networks, the
need for mapping algorithms is crucial. In this paper, we propose a
new mapping algorithm dedicated to the AIAC model. To evaluate our
mapping algorithm we implemented it in the JaceP2P programming and
execution environment dedicated to AIAC applications and we
conducted a set of experiments on the Grid'5000 testbed. Results
are very encouraging and show that the use of our algorithm brings
an important gain in terms of execution time (about $40\%$).
114 % than 700 computing cores and with a real and typical AIAC
115 % application based on the NAS parallel benchmarks.
117 % To design parallel and distributed applications on heterogeneous
118 % distributed architectures, the asynchronous iteration model may be
119 % an efficient solution. Such architectures contain various computing
120 % nodes connected by heterogeneous links. Nowadays, these
121 % architectures offer users too many computing nodes as they need, so
122 % a choice should be done, which is called ``tasks mapping''. In this
123 % paper we propose a comparison between different mapping algorithms,
124 % in order to evaluate mapping needs of the asynchronous iteration
125 % model on this kind of architectures. To do our experiments we used a
126 % middleware, JaceP2P-V2, which allows users to design asynchronous
127 % iterative applications and execute them on distributed
128 % architectures, in which we added a mapping library.\\
132 Mapping algorithms; Distributed clusters; Parallel iterative
133 asynchronous algorithms; Heterogeneous distributed architectures
139 \IEEEpeerreviewmaketitle
142 % Mapping algorithms, Distributed clusters, Parallel iterative
143 % asynchronous algorithms, Heterogeneous architectures.
146 \section{Introduction}
Nowadays, scientists in many domains, such as climate simulation or
biological research, need large and powerful architectures to run
their large applications. Distributed cluster architectures, which
are part of the grid architecture, are among the best suited
architectures to solve such applications within an acceptable
execution time. These architectures provide many heterogeneous
computing nodes interconnected by a high performance network, but
even with the greatest efforts of their maintainers, there are
latency and computing capacity differences between the clusters of
each site.
In order to efficiently use this massive distributed computing
power, numerous numerical algorithms have been modified. These
algorithms can be broadly classified into two categories. First,
\textit{direct methods}, which give the exact solution of the
problem using a finite number of operations (e.g. Cholesky,
LU\ldots). These methods cannot be applied to all kinds of numerical
problems and, in general, they are not well adapted to very large
problems. Second, \textit{iterative methods}, which repeat the same
instructions until a desired approximation of the solution is
reached -- we say that the algorithm has converged. Iterative
algorithms
169 constitute the only known approach to solving some kinds of problems
170 and they are easier to parallelize than direct methods. The Jacobi or
171 Conjugate Gradient algorithms are examples of such iterative
178 % \includegraphics[width=7.4cm]{images/ISCA}
179 % \caption{Two processors computing in the Synchronous Iteration - Asynchronous Communication (SIAC) model}
In the rest of this paper we only focus on iterative methods. To
parallelize this kind of algorithm, two classes of parallel
iterative models can be described. In \textit{the synchronous
iteration model}, after each iteration a node sends its results to
its neighbors and waits for the reception of all dependency messages
from its neighbors before starting the next iteration. This results
in large idle times and is equivalent to a global synchronization of
nodes after each iteration. These synchronizations can strongly
penalize the overall performance of the application, particularly in
the case of large-scale platforms with high latency networks.
Furthermore, if a message is
193 lost, its receiver will wait forever for this message and the
194 application will be blocked. In the same way, if a machine fails, all
195 the computation will be blocked.
200 \includegraphics[width=7.4cm]{images/IACA}
201 \caption{Two processors computing in the Asynchronous Iteration - Asynchronous Communication (AIAC) model}
In \textit{the asynchronous iteration model}, a node sends its
results to its neighbors and immediately starts the next iteration
with the last received data. These data could come from previous
iterations, because the most recent data have not arrived in time or
the neighbors have not finished their current iteration. The
receiving and sending mechanisms are asynchronous and nodes do not
have to wait for the reception of dependency messages from their
neighbors. Consequently, there is no more idle time between two
iterations. Furthermore, this model is tolerant to message loss, and
even if a node dies, the remaining nodes continue the computation,
using the last data the failed node sent. Unfortunately, the
asynchronous iteration model generally requires more iterations than
the synchronous one to converge to the solution.
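To make the contrast with the synchronous model concrete, the
following sketch (in Java, with purely hypothetical method names; it
is not the actual JaceP2P-V2 code) shows the skeleton of one AIAC
worker: sends are non-blocking and each iteration simply reuses the
last data received from every neighbor.

\begin{verbatim}
// Illustrative sketch of one AIAC worker (hypothetical
// API, not JaceP2P-V2 code): receives never block and
// the last data received from each neighbor is reused.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class AiacWorkerSketch {
  // last data received from each neighbor, whatever
  // iteration it comes from
  private final Map<Integer, double[]> lastRecv =
      new ConcurrentHashMap<>();

  void asyncSend(int nb, double[] d) { }    // non-blocking
  double[] localIteration(Map<Integer, double[]> dep) {
    return new double[0];                   // placeholder
  }
  boolean converged(double[] x) { return true; }

  public void run(int[] neighbors) {
    double[] local;
    do {
      // compute with whatever dependency data is
      // available, possibly from older iterations
      local = localIteration(lastRecv);
      for (int n : neighbors) {
        asyncSend(n, local);                // never waits
      }
      // no wait for dependency messages here
    } while (!converged(local));
  }
}
\end{verbatim}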
This class of algorithms is very suitable in a distributed clusters
computing context because it suppresses all synchronizations between
computing nodes, tolerates message loss and enables the overlapping
of communications by computations. Interested readers might consult
\cite{book_raph} for a precise classification and comparison of
parallel iterative algorithms. Several experiments reported in
\cite{book_raph} show the relevance of AIAC algorithms in the
context of distributed clusters with high latency between
clusters. These works underline the good adaptability of AIAC
algorithms to network and processor heterogeneity.
231 As we aim to solve very large problems on heterogeneous distributed
232 architectures, in the rest of this study we only focus on the
233 asynchronous iteration model. In order to efficiently use such
234 algorithms on distributed clusters architectures, it is essential to
map the tasks of the application onto the best subsets of nodes of
the target architecture. This mapping procedure must take into
account parameters such as network heterogeneity, computing node
heterogeneity and task heterogeneity in order to minimize the
overall execution time of the application.
241 %sets of computing nodes, which can improve applications execution
242 %time. Indeed, as there are more available nodes as we need on such
243 %architectures, it is important to select appropriate computing nodes,
244 %due to their heterogeneity at computation power as well as relying
To the best of our knowledge, there exists no algorithm which
specifically addresses the mapping of AIAC applications on
distributed architectures. The aim of this paper is to propose a new
mapping algorithm dedicated to AIAC applications and to implement it
in a real large-scale computing platform, JaceP2P-V2. Experiments
conducted on the Grid'5000 testbed with more than 400 computing
cores show that this new algorithm enhances the performance of
JaceP2P-V2 by about $40\%$ for a real and typical AIAC application.
255 %this mapping problem. Our aim is to evaluate the
256 %more common approaches to solve this problem, by using several mapping
257 %algorithms implemented in a real large scale computing platform,
258 %JaceP2P-V2, on a true distributed clusters architecture.\\
The rest of this paper is organized as follows. Section
\ref{sec:jacep2p} presents the JaceP2P-V2 middleware. We focus here
on one of the main drawbacks of this platform: its lack of an
efficient mapping strategy. Section \ref{sec:pb} presents our
mapping problem and reviews existing approaches to address it.
Section \ref{sec:aiacmapping} describes the specificities of the
AIAC model and details the main solution we propose to address the
AIAC mapping problem. In Section \ref{sec:expe} we describe the
experiments we have conducted, with their different components and
results. These experiments were conducted on the Grid'5000 testbed
with more than 400 computing cores and show an important gain (about
$40\%$) in the overall execution time for a typical AIAC
application, i.e. one based on the NAS Parallel Benchmarks. Finally,
we give some concluding remarks and plan our future work in Section
\ref{sec:conclu}.
JaceP2P-V2\cite{jaceP2P-v2} is a distributed platform implemented
in the Java programming language and dedicated to developing and
executing parallel iterative asynchronous applications. JaceP2P-V2
executes parallel iterative asynchronous applications with
dependencies between computing nodes. In addition, JaceP2P-V2 is
fault tolerant, which allows it to execute parallel applications
over volatile environments; even on stable environments like local
clusters, it offers a safer and crash-free platform. To our knowledge
287 this is the only platform dedicated to designing and executing AIAC
290 %\subsection{Architecture}
291 %\label{sec:archijaceP2P}
293 %In this section we describe the JaceP2P-V2 environment.
295 %on Figure \ref{fig:jaceP2P-v2}, which shows
The JaceP2P-V2 architecture is composed of three main entities:
302 % \includegraphics[width=7.4cm]{images/JACEP2P-V2}
303 % \caption{The JaceP2P-V2 architecture}
304 % \label{fig:jaceP2P-v2}
\item The first entity is the ``super-node''. %(represented by a big
309 %circle in Figure \ref{fig:jaceP2P-v2}).
310 Super-nodes form a circular network and store, in registers, the
311 identifiers of all the computing nodes that are connected to the
312 platform and that are not executing any application.
313 % Each super-node has a status table containing the
314 % number of connected computing nodes to each super-node and all the
315 % super-nodes share a ``token'' that is passed successively from a
316 % super-node to the next one. Once a super-node has the token, it
317 % computes the average number of computing nodes connected to a
318 % super-node ($avg$) using the status table. If $avg$ is lower than
319 % the number of computing nodes connected to it, then it sends the
320 % identifiers of the extra computing nodes to the super-nodes that
321 % have the number of computing nodes connected to them less than
322 % $avg$. If the number of computing nodes connected to it has changed,
323 % it broadcasts the information to all the super-nodes in the
324 % platform. Finally, it passes the token to the next super node. This
325 % distribution reduces the load of the super-nodes.
326 A super-node regularly receives heartbeat messages
328 % doted lines in Figure \ref{fig:jaceP2P-v2})
329 from the computing nodes connected to it. If a super-node does not
330 receive a heartbeat message from a computing node for a given period
331 of time, it declares that this computing node is dead and deletes
332 its identifier from the register.
334 \item The second entity is the ``spawner''
335 %(represented by a square in
336 % Figure \ref{fig:jaceP2P-v2}).
337 When a user wants to execute a
338 parallel application that requires $N$ computing nodes, he or she
339 launches a spawner. The spawner contacts a super-node to reserve the
340 $N$ computing nodes plus some extra nodes. When the spawner receives
341 the list of nodes from the super-node, it transforms the extra
342 nodes into spawners (for fault tolerance and scalability reasons)
343 and stores the identifiers of the rest of the nodes in its own
345 % Once the extra nodes are transformed into spawners, they
346 % form a circular network and they receive the register containing the
347 % identifiers of the computing nodes.
348 Then each spawner becomes
349 responsible for a subgroup of computing nodes, starts the tasks on
350 the computing nodes under its command and sends a specific register
353 % computing node receives a specific register that only contains the
354 % identifiers of the daemons it interacts with and that depends on the
355 % application being executed. These specific registers reduce the
356 % number of messages sent by the spawners to update the register of
357 % the daemons after a daemon crashes because usually a small number of
358 % daemons is affected by this crash.
359 % If the spawner receives a message from a computing node informing
360 % that one of its neighbors has failed, it fetches a new one from the
361 % super-node in order to replace the dead one. The spawner initializes
362 % the new daemon, which retrieves the last backup (see next paragraph)
363 % of the dead node and continues the computing task from that
\item The third entity is the ``daemon'', or computing node.
%(represented in Figure \ref{fig:jaceP2P-v2} by a hashed small circle
%if it is idle and by a white small circle if it is executing an
%application).
Once launched, it connects to a super-node and waits for a task to
execute. Once daemons begin executing an application, they form a
circular network which is only used by the failure detection
372 mechanism. Each daemon can communicate directly with the daemons
373 whose identifiers are in its register. At the end of a task, the
374 daemons reconnect to a super-node.
To be able to execute asynchronous iterative applications, JaceP2P-V2
provides an asynchronous messaging mechanism, and to resist daemon
failures, it implements a distributed backup mechanism called
uncoordinated distributed checkpointing. For more details on the
381 JaceP2P-V2 platform, readers can refer to \cite{jaceP2P-v2}.
383 % This method allows daemons to
384 % save their data on neighboring daemons without any user
385 % intervention. The asynchronous nature of the application allows two
386 % daemons to execute two different iterations, thus each daemon saves
387 % its status without synchronizing with other daemons. This
388 % decentralized procedure allows the platform to be very scalable, with
389 % no weak points and does not require a secure and stable station for
390 % backups. Moreover, since the AIAC model is tolerant to messages loss,
391 % if a daemon dies, the other computing nodes continue their tasks and
392 % are not affected by this failure.
393 % The application convergence detection is done by daemons, using the
394 % decentralized global convergence detection algorithm presented in
395 % \cite{conv_dec}. It consists of two phases: the detection phase and
396 % the verification phase. This algorithm aims to detect efficiently
397 % the global convergence of asynchronous iterative parallel algorithms
398 % on distributed architectures.
402 \subsubsection*{Benefits of mapping}
405 In the JaceP2P-V2 environment, presented in the previous section,
406 there is no effective mapping solution. Indeed, when a user wants to
407 launch an application, the spawner emits a request to the super-node,
408 which is in charge of available daemons. Basically, the super-node
returns the requested number of computing nodes by choosing in its own
412 %if there are sufficient daemons connected on it, or in other
413 %super-nodes lists, in addition of its one.
In this method, the super-node only cares about the number of
requested nodes; in general it returns nodes in the order of their
connection to the platform -- there is no specific selection.
417 Distributed architectures such as distributed clusters,
419 %on Figure \ref{fig:pbdistclust},
420 are often composed of heterogeneous
421 clusters linked via heterogeneous networks with high latencies and
bandwidths. As an example, the Grid'5000\cite{g5k} testbed is composed
of 23 clusters spread over 9 sites. Those clusters are heterogeneous,
with computing powers ranging from dual-core nodes at 2~GHz to dual
quad-core nodes at 2.83~GHz, with 2~GB of memory for the former and
8~GB for the latter. Links connecting clusters are 10~Gb/s capable,
but as many researchers use this platform, high latencies appear in
links between
429 %In the targeted architectures, which are distributed clusters, each
430 %cluster provides a fixed number of homogeneous computing nodes --
431 %computing nodes are heterogeneous from a cluster to another. Figure
432 %\ref{fig:pbdistclust} represents such an architecture, in which we can
433 %see different sites containing several heterogeneous clusters. Each
434 %cluster is relied by high speed network with clusters in the same site
435 %and in others. We note that the architecture may not be represented by
436 %a full connected graph, as some sites are not directly connected to
437 %some others. The Grid'5000 testbed, described in section
438 %\ref{sec:g5k}, is an example of such a distributed clusters
439 %architecture. Though there are efficient links relying each site, a
440 %residual latency continues to exist, at local clusters (in the same
441 %site) as well as distant clusters (from two distinct sites), and can
442 %penalize performance.
444 % \begin{figure}[ht!]
446 % \includegraphics[width=7.8cm]{images/dist_clust}
447 % \caption{A distributed clusters architecture}
448 % \label{fig:pbdistclust}
With such an architecture, it could be efficient to assign tasks
that communicate with each other to the same cluster, in order to
improve communications. But, as we use very large problems, it is
quite impossible to find clusters containing as many computing nodes
as requested. So we have to dispatch tasks over several clusters.
That implies a need to deal with heterogeneity in cluster computing
power and heterogeneity in the network. We should make a trade-off
between both aspects in order to get the best of each one and
improve the overall performance.
462 %literature in high performance computing has broadly demonstrated the
463 %benefits of mapping solutions on the applications execution time.
In order to check whether a task mapping algorithm would provide a
performance improvement in the JaceP2P-V2 environment, we have
evaluated the contributions of a simple mapping algorithm, which is
described in section \ref{sec:sma}. These experiments used the NPB
Kernel CG application described in section \ref{sec:cg}, with two
problem sizes (the given problem sizes are the side sizes of the
square matrices used)
471 and using a distributed clusters architecture composed of 102
472 computing nodes, representing 320 computing cores, spread over 5
473 clusters in 5 sites. The results of these experiments are given in
474 Table \ref{tab:benef}.
476 \renewcommand{\arraystretch}{1.5}
480 \begin{tabular}{|c|c|c|}
482 Problem size&$550,000$&$5,000,000$\\
484 Execution Time (without mapping)&141s&129s\\
485 Execution Time (with mapping)&97s&81s\\
487 Gains&$31\%$&$37\%$\\
\caption{Effects of a simple task
  mapping algorithm on the application's execution time}
As can be seen in Table \ref{tab:benef}, the effects of a
simple task mapping algorithm are significant.
497 %Moreover, we can see
498 %that are scalable with the application's size, which demonstrates the
499 %real needs of the platform for a such algorithm.
500 This encouraged us to look further for better task mapping
501 algorithms. In the next section, we describe the specificities of our
502 model and issues which can be exploited.
504 \section{Problem description}
507 % In this section we describe the AIAC mapping problem. We first
508 % formalize the different elements we should take into consideration:
509 % the application, the targeted architecture and the objectives
510 % functions of the mapping. We also give a state of the art about
511 % considered kinds of mapping algorithms.
513 \subsection{Model formalization}
516 % In this section the models of the applications and architectures we
517 % used are given, with the objectives functions of the mapping
520 \subsubsection{Application modeling}
521 \label{sec:pbmodelapp}
In high performance computing, when we want to improve the global
execution time of parallel applications, we have to make an
efficient assignment of tasks to computing nodes. Usually, to assign
the tasks of parallel applications to computing nodes, scheduling
algorithms are used. These algorithms often represent the
application by a graph, called a DAG \cite{dag1,dag2,dag3,dag4}
(Directed Acyclic Graph). In this graph, each task is represented by
a vertex, which is connected to others by edges representing the
dependencies and communications between tasks. This means that some
tasks cannot start before others finish their computation and send
their results. As discussed in the introduction, in the AIAC model,
there is no precedence between
Indeed, with the AIAC model, all tasks compute in parallel at the
same time. As communications are asynchronous, there is no
synchronization and no precedence. During an iteration, each task
does its job, sends its results to its neighbors and continues with
the next iteration. If a task receives new data from its
dependencies, it includes them and the computation continues with
these new data. If some or all dependency data are not received
before the start of the next iteration, old data are used
instead. Tasks are never blocked on their
dependencies. Nevertheless, regularly receiving new data allows
tasks to converge more quickly. So, it appears that DAGs are not
appropriate for modeling AIAC applications. TIGs~\cite{tig1, tig2}
(Task Interaction Graphs) are more appropriate.
548 %That is why we use the
549 %TIG\cite{tig1, tig2} (Task Interaction Graph) model instead of DAG,
550 %which allows to modeling application using tasks and their
551 %communication dependencies. Figure \ref{fig:tig} gives an example of a
556 \includegraphics[width=4cm]{images/tig}
557 \caption{An example of a TIG of a nine tasks application}
In the TIG model, a parallel program is represented by a graph, as
can be seen in Figure \ref{fig:tig}. This graph is denoted $GT(V,E)$,
where $V = \{V_1,V_2,\dots V_v\}$ is the set of $|V|$ vertices and $E
\subset V \times V$ is the set of undirected edges. The vertices
represent tasks and the edges represent the mutual communication
among tasks. A function $ET : V \rightarrow R^+$ gives the
computation cost of tasks and $CT : E \rightarrow R^+$ gives the
communication cost for message passing on edges. We define $v =
|V|$, $ET(V_i) = e_i$ and $CT(V_i,V_j) = c_{ij}$. For example, in
Figure \ref{fig:tig}, \mbox{$e_0$ = 10} and $c_{01} = 2$, $c_{03} =
2$ and $c_{04} = 2$.
Tasks in a TIG exchange information during their execution and there
is no precedence relationship among tasks; each task cooperates with
its neighbors. This model is used to represent applications where tasks
574 are considered to be executed simultaneously. Temporal dependencies in
575 the execution of tasks are not explicitly addressed: all the tasks are
576 considered simultaneously executable and communications can take place
577 at any time during the computation. That is why vertices and edges are
579 describing computational and communication costs.\\
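For illustration purposes only, the following minimal Java sketch
(the naming is ours and is not part of JaceP2P-V2) stores a TIG with
the computation cost $e_i$ per vertex and the communication cost
$c_{ij}$ per undirected edge, using the values of Figure
\ref{fig:tig} as an example.

\begin{verbatim}
// Minimal TIG container: e_i per task and c_ij per
// undirected edge (illustrative sketch only).
import java.util.HashMap;
import java.util.Map;

public class TigSketch {
  final double[] e;                       // e_i per task
  final Map<Long, Double> c = new HashMap<>();

  TigSketch(int v) { e = new double[v]; }

  private long key(int i, int j) {        // unordered pair
    return i < j ? ((long) i << 32) | j
                 : ((long) j << 32) | i;
  }

  void setEdge(int i, int j, double cij) {
    c.put(key(i, j), cij);
  }

  double cost(int i, int j) {
    return c.getOrDefault(key(i, j), 0.0);
  }

  public static void main(String[] args) {
    // values taken from the nine-task TIG example
    TigSketch gt = new TigSketch(9);
    gt.e[0] = 10;
    gt.setEdge(0, 1, 2);
    gt.setEdge(0, 3, 2);
    gt.setEdge(0, 4, 2);
    System.out.println(gt.cost(0, 1));    // prints 2.0
  }
}
\end{verbatim}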
582 \subsubsection{Architecture modeling}
583 \label{sec:pbmodelarchi}
As the TIG models the application, we also have to model the
targeted architecture. A distributed clusters architecture can be
modeled by a three-level graph. The levels are \textit{architecture}
(a), which in our study is the Grid'5000 grid, \textit{cluster} (c)
and computing
590 %Figure \ref{fig:pbdistclust} in section
591 %\ref{sec:benef} shows such a model.
Let $GG(N,L)$ be a graph
representing a distributed clusters architecture, where $N =
\{N_1,N_2,\dots N_n\}$ is the set of $|N|$ vertices and $L$ is the
set of undirected edges. The vertices represent the computing nodes
and the edges represent the links between them. An edge $L_i \in L$
is an unordered pair $(N_x,N_y)$, with $N_x, N_y \in N$, representing
a communication link between nodes $x$ and $y$. Let $|C|$ be the
number of clusters in the architecture containing computing nodes. A
function $WN : N \rightarrow R^+$ gives the computational power of
nodes and $WL : L \rightarrow R^+$ gives the communication latency of
links. We define $WN(N_i) = wn_i$ and, for the link $(N_i,N_j)$,
$WL(N_i,N_j) = wl_{ij}$.
A three-level graph architecture is specified as follows. All
computing nodes belong to the node level. When computing nodes can
communicate with one another with the same communication latency,
they can be grouped into the same cluster. In addition, as in the
Grid'5000 testbed, if computing nodes have approximately the same
computational power with a low communication latency, a cluster of
these nodes can be defined. All participating clusters, including
their computing nodes, are in the same architecture level and
communicate through the architecture network.\\
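In the same spirit as the TIG sketch above, the grid graph $GG$ can
be kept, for illustration, as an array of node powers $wn_i$, a
cluster identifier per node for the cluster level, and a latency
$wl_{ij}$ per link (again an assumed data layout, not the actual
platform code).

\begin{verbatim}
// Minimal model of GG(N,L): node power wn_i, cluster id
// per node, per-link latency wl_ij (illustration only).
import java.util.HashMap;
import java.util.Map;

public class GridSketch {
  final double[] wn;       // computational power of nodes
  final int[] cluster;     // cluster id of each node
  final Map<Long, Double> wl = new HashMap<>();

  GridSketch(int n) {
    wn = new double[n];
    cluster = new int[n];
  }

  private long key(int x, int y) {
    return x < y ? ((long) x << 32) | y
                 : ((long) y << 32) | x;
  }

  void setLink(int x, int y, double latency) {
    wl.put(key(x, y), latency);
  }

  double latency(int x, int y) {
    return wl.getOrDefault(key(x, y), 0.0);
  }
}
\end{verbatim}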
615 \subsubsection{Mapping functions}
616 \label{sec:pbmodelmapping}
618 After having described the two graphs used to model the application
619 and the architecture, this section defines our objectives.
621 When a parallel application $App$, represented by a graph $GT$, is
622 mapped on a distributed clusters architecture, represented by a graph
623 $GG$, the execution time of the application, $ET(App)$, can be defined
624 as the execution time of the slowest task.
625 %In the AIAC model, tasks
626 %have, in general, seemly the same work and communication loads; so the
627 %difference on execution time depends on the executing machines.
Indeed, an application ends when all the tasks have detected
convergence and have reached the desired approximation of the
solution; that is why the execution time of the application depends on
632 %, this converges last.
We define $ET(App) = \max_{i=1 \dots v} ( ET(V_i) )$. The execution
time of each task $i$ \mbox{($i=1 \dots v$)}, $ET(V_i)$, is given by
$ET(V_i) = \frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \cdot wl_{ij}$,
648 where $e_i$ is the computational cost of $V_i$, $wn_i$ is the
649 computational power of the node $N_i$ on which $V_i$ is mapped, $J$
650 represents the neighbors set of $V_i$, $c_{ij}$ is the amount of
651 communications between $V_i$ and $V_j$, and $wl_{ij}$ is the link
652 latency between the computing nodes on which are mapped $V_i$ and
654 %Note that in the AIAC model, we only consider for $e_i$ the
655 %cost of an only one iteration, as we cannot determine in advance the
656 %amount of iterations a task should complete to converge.
657 We underline here that in the AIAC model, it is impossible to predict
658 the number of iterations of a task. So it is difficult to evaluate a
659 priori the cost $e_i$ of a task. In the remainder, we approximate
660 $e_i$ by the cost of one iteration.
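As an illustration, with the values of Figure \ref{fig:tig} and
arbitrary, hypothetical architecture values, suppose that $V_0$ is
mapped on a node of power $wn_0 = 2$ and that the latency towards
the nodes hosting its neighbors $V_1$, $V_3$ and $V_4$ is $0.5$;
then
\[
ET(V_0) = \frac{e_0}{wn_0} + \sum_{j \in \{1,3,4\}} c_{0j} \cdot
wl_{0j} = \frac{10}{2} + 3 \cdot (2 \cdot 0.5) = 8 .
\]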
662 The mapping problem is similar to the classical graph partitioning and
663 task assignment problem \cite{npcomp}, and is thus NP-complete.
665 %generalization of this problem by considering both heterogeneous tasks
666 %graph and architecture graph makes the problem more complex.
667 %The JaceP2P-V2 platform is design to execute parallel asynchronous
668 %iterative application on heterogeneous distributed architectures, so
669 %it has to launch applications on a variety of computing nodes relying
670 %by non-uniform links. As demonstrates in the previous section, it does
671 %not map efficiently tasks on nodes and the overall performance can be
672 %strongly penalize by this lack of mapping. Indeed, if the
673 %heterogeneity degree of the architecture used is high, it can be
674 %possible to have two neighbors tasks executing on two foreign
675 %computing nodes, that penalizes communications between these tasks,
676 %that increase the execution time. With the same mining, it is
677 %interesting to have similar computing nodes, in order to have seemly
678 %the same iteration execution time that allows to converge more
679 %quickly. We can conclude that JaceP2P-V2 needs an algorithm to
680 %efficiently map tasks on computing nodes.
682 %%As we use parallel distributed algorithms on distributed
683 %%architectures, we have to efficiently map tasks on computing nodes, in
684 %%order to obtain best performance and the slowest execution time of
686 %There exist many scheduling and mapping algorithms in the literature,
687 %and the research is active in this domain. A first approach is to use
688 %one the numerous scheduling algorithms, but our model does not fit
689 %with them. Indeed, a scheduling implies that we have tasks which are
690 %depending on others, more precisely that some tasks cannot start
691 %computation before having received data from other tasks, which are
692 %called precedences, or ``strong dependencies''. This class of
693 %algorithms uses a well known application's representation: the DAG
694 %(Direct Acyclic Graph). This model cannot be used with our
695 %problematic. Indeed, in asynchronous iterative applications, all the
696 %tasks are executed in the same time, in parallel; there is no
697 %precedence. So we have to look at another way, which is the ``tasks
700 %The second approach is to use a tasks mapping algorithm, which only
701 %aims to map tasks on nodes, following a metric, to minimize the
702 %application's execution time. To determine which class of tasks
703 %mapping algorithms we should consider, it is important to draw up a
704 %list of which information we have in hands. When an application using
705 %the asynchronous iteration model is run, the number of iterations to
706 %converge to the solution is unpredictable and change from a run to
707 %another. So, we cannot give a good approximation of the execution time
708 %of a task, but we can retrieve an estimation of the computational
709 %amount of a task (plus de détails ?). In addition, as the asynchronous
710 %iteration model based algorithms are able to provide to applications
711 %the message lost tolerance, we cannot exactly quantify the amount of
712 %communications in an application; some algorithms can converge with
713 %receiving only few dependencies messages. One application's
714 %representation which fit perfectly with our model is the TIG (Task
715 %Interaction Graph), which only considers relations between tasks --
716 %the only thing which is predictable in our model. The figure
717 %\ref{fig:tig} presents an example of a TIG.
721 %On Figure \ref{fig:tig} we can see an application with 6 tasks, in
722 %which each task is in relation with tasks of previous and next rank,
723 %with two exceptions, the first and the last tasks, which can
724 %eventually be in relation, depending of the application. This is a
725 %commonly scheme in asynchronous iteration model; in general, tasks are
726 %in relation with some others which previous and next ranks, but in
727 %some cases, dependencies are more complex.
729 \subsection{Related work}
732 %In the previous section we have determined that we need to use TIG
733 %model based mapping algorithms to address our problem.
In the literature on TIG mapping, we can find many algorithms, which
can be divided into two categories. In the \textit{Edge-cuts
  optimization} class of algorithms, the aim is to minimize the use
of the penalizing links between clusters. As tasks depend on their
neighbors, which are called dependencies, the goal is to choose
nodes whose distance, in terms of network, is small, in order to
improve communications between tasks. Here we can cite
Metis\cite{metis}, Chaco\cite{chaco} and PaGrid\cite{pagrid}, which
are libraries containing such algorithms. The main drawback of
edge-cuts algorithms is that they do not tackle computing node
heterogeneity issues; they only focus on communication overheads.
745 % The Figure \ref{fig:edge} shows
746 % that the optimization is on edges of communications, which are
751 % \includegraphics[width=3.5cm]{images/edgecut}
752 % \caption{The edge-cuts optimization}
In the \textit{Execution time optimization} class of algorithms, the
aim is to minimize the whole execution time of the application. These
algorithms look for the nodes which can provide the smallest
execution time of tasks, using their computational power. Here we can
cite FastMap\cite{fastmap} and MiniMax\cite{minimax} as examples of
such algorithms. QM\cite{qm_these} is also an algorithm of this
category, but it aims to find for each task the node which can
provide the best execution time. QM works at the task level, whereas
others work at the
770 % \includegraphics[width=3.5cm]{images/exectime}
771 % \caption{The execution time optimization}
Both classes of algorithms may fit our goals, because in our model
both the computational power of nodes and the communication costs
may influence the application performance. We can also cite
partitioning tools like Scotch \cite{scotch}, which aim at
privileging the load balancing of their partitioning
schemes. Nevertheless, to the best of our knowledge, none of the
existing algorithms takes into consideration the specificities of
the AIAC model (see next section).
782 %specifically address the AIAC mapping problem.
783 %As the existing mapping algorithms are not designed to fit
784 %with the AIAC mapping problem.
786 \section{AIAC mapping}
787 \label{sec:aiacmapping}
789 % In this section we present the specificities of the AIAC model, which
790 % are interesting in the mapping problem, and the solution we propose:
791 % the AIAC QM algorithm, which is an extended version of the QM
794 \subsection{Specificities of the AIAC mapping problem}
795 \label{sec:specAIACmapping}
An important point to take into consideration in the AIAC model is
that we do not allow the execution of multiple tasks on the same
computing node. This comes from the fact that the targeted
architectures are volatile distributed environments. Assigning
multiple tasks to a node causes a performance drop when this node
fails. Indeed, we would have to redeploy all of the tasks from this
node to another one, using the last saves, which implies searching
for a new available computing node, transferring the saves to it and
restarting the computation from that point (which could be far from
the point just before
Nevertheless, in order to benefit from multi-core architectures, we
use task-level parallelism by running a multi-threaded sequential solver
811 %In addition, it is more simple and efficient to parallelize at the
812 %task level using, as an example with the CG application, a
813 %multi-threaded linear solver, which benefits of the multi-cores
814 %architecture of computing nodes.
Another important point in the AIAC model is that we should
precisely take into account the locality issue. This comes from the
fact that in this model, the faster and more frequently a task
receives its dependencies, the faster it converges. Moreover, as the
JaceP2P-V2 environment is fault tolerant and tasks save checkpoints
on their neighbors, it is more efficient to save on near nodes than
on far
825 %As our problem is on task mapping, only considering the dependency
826 %relations between tasks, with a part of consideration on task weight,
827 %we should use TIG (\textit{task interaction graph}) mapping model.
829 %\section{Mapping algorithms}
832 \subsection{AIAC Quick-quality Map}
835 %After having describe the different kinds of mapping algorithms which
836 %can be found in the literature, we now present the three algorithms we
837 %use to do mapping in the JaceP2P-V2 environment.
We present here the solution we propose, called the \textit{AIAC QM
  algorithm}, to address the AIAC mapping problem. We decided to
improve the \textit{Quick-quality Map} (QM) algorithm since it is one
of the most accurate methods to address the TIG mapping problem.
844 %\subsection{Modified Quick-quality Map}
845 %\label{sec:modifiedqm}
847 %As the previous algorithm describe in section \ref{sec:sma} showed
848 %that mapping provides a significant increase of applications
849 %performance (which can be seen in the section \ref{sec:experiments}),
850 %we decide to try another mapping algorithm, which is a modified
851 %version of the \textit{Quick-quality Map} (QM) algorithm.
854 %As explained in section \ref{sec:pb}, the asynchronous iteration model
855 %is specific, as it is not a good solution to map many tasks on the
856 %same node. This is why QM has been modified to take into account these
858 % Indeed, originally QM tries to map many tasks on the
859 %same node to improve the execution time of tasks by decreasing
860 %communications costs. This solution can be good if communications
861 %between tasks are heavy and if we consider that computing nodes are
862 %stable and are not volatile. As
864 %, we have modified some parts
865 %of it to fit with our constraints. This was an opportunity to be taken
866 %to insert a little part of ``edge-cuts'' optimization, as in our model
867 %communications have to be taken into account.
In its original version, this algorithm aims at prioritizing the
computational power of nodes. Indeed, its aim is to find the most
powerful node to map a task on. Moreover, a part of this algorithm is
designed to map multiple tasks on the same node, in order to improve
local communications. This solution can be efficient if
communications between tasks are heavy and if we consider that
computing nodes are stable and not volatile. This last point is in
contradiction with our model, as we authorize only the execution of
one task on a single node -- this allows losing only the work of a
single task in case of a node fault, with a low cost on the restart
mechanism. Instead of assigning multiple tasks to the same computing
node, our mapping algorithm tries to keep tasks local, to improve
communications, by trying to assign tasks to computing nodes in the
neighborhood of the nodes on which their neighbors are mapped.
883 % The pseudo-code of AIAC QM is given in Algorithm \ref{alg:qmmodified}.
888 % \dontprintsemicolon
890 % \KwIn{Sets of tasks and computing nodes}
891 % \KwOut{Mapping of tasks to nodes}
895 % sort nodes by descending power\;
896 % map tasks in order on nodes\;
897 % set all tasks \textit{moveable}\;
902 % \While{one task is moveable}{
903 % \For{each task $t_{i}$ $\&\&$ $t_{i}$ is moveable }{
904 % $n_{c} \leftarrow$ current node of $t_{i}$\;
905 % $n_{n} \leftarrow t_{i}$\;
909 % \For{$k = 0 ; k < \frac{f \cdot n}{r} ; k++$}{
910 % select random node $n_{r}$ in $[0,\frac{n}{r}]$\;
911 % \If{ET($t_{i}$,$n_{r}$) $<$ ET($t_{i}$,$n_{n}$)}{
912 % $n_{n} \leftarrow n_{r} $\;
918 % \For{each node $n_{v}$ near $dep(t_{i})$}{
919 % \If{ET($t_{i}$,$n_{v}$) $<$ ET($t_{i}$,$n_{n}$)}{
920 % $n_{n} \leftarrow n_{v} $\;
926 % \If{$n_{n} \neq n_{c}$}{
927 % map $t_{i}$ on $n_{n}$\;
928 % update ET of $t_{i}$ and dep($t_{i}$)\;
934 % set $t_i$ not moveable\;
935 % $r \leftarrow r+1$ if all tasks have been considered\;
938 % \caption{The AIAC QM}
939 % \label{alg:qmmodified}
So, in this algorithm all nodes are first sorted in descending order
according to their computational power, and all tasks are mapped on
these nodes according to their identifier (they are also marked as
``moveable'', which means that each task can be moved from one node
to another). As in the original QM algorithm, AIAC QM keeps track of
the \textit{number of rounds} $r$ ($r > 0$), i.e. the number of times
all tasks have been examined for a better node. This allows reducing,
at each round, the number of considered nodes. While there is at
least one moveable task, the algorithm performs, for each moveable
task, the search for a better node. It chooses a set of $\frac{f
  \cdot n}{r}$ nodes, where $f$ is defined as the search factor and
$n$ is the number of nodes. $r$ and $f \in ]0,1]$ control the portion
of nodes that will be considered: the more rounds have been
performed, the fewer nodes are considered. Then the algorithm
estimates the execution time $ET(v)$ of the task on each of these
nodes. If it is smaller than on the current node on which the task is
mapped, this node becomes the new potential node for task $t_i$.
After having randomly searched for a new node, AIAC QM tries to map
the task on nodes that are neighbors of the nodes on which the
dependencies of $t_i$ are mapped. This is one of the major
modifications to the original QM algorithm. It introduces a small
part of ``edge-cuts'' optimization. In its original version, the
algorithm tries to map the task $t_i$ on the same node as one of its
dependencies. As explained in Section \ref{sec:specAIACmapping}, this
is not an acceptable solution in our case. Instead, the algorithm now
tries to map task $t_i$ on nodes which are near the ones its
dependencies are mapped on. This search requires a parameter which
indicates the maximum distance at which nodes should be from the
nodes of the dependencies of
At the end of the algorithm, if a new node is found, $t_i$ is mapped
on it, its execution time is updated and $t_i$ is set to ``not
moveable''. The execution time of each of its dependencies is also
updated, and if this new execution time is higher than the previous
one, the task is set back to ``moveable''. Finally, if all tasks have
been considered in this round, $r$ is incremented.
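The sketch below summarizes this loop in Java (it is our own
illustration, not the actual JaceP2P-V2 implementation);
estimateET() stands for the $ET(V_i)$ evaluation of the previous
section and nearDeps() for the bounded-distance search around the
nodes hosting the dependencies.

\begin{verbatim}
// Illustrative sketch of the AIAC QM loop; the helper
// methods are placeholders for the ET estimation and the
// neighborhood search described in the text.
import java.util.Arrays;
import java.util.Random;

public class AiacQmSketch {
  final int n;               // number of nodes
  final double f = 0.5;      // search factor, f in ]0,1]
  final int[] map;           // map[task] = node
  final boolean[] moveable;
  final Random rnd = new Random();

  AiacQmSketch(int nodes, int tasks) {
    n = nodes;
    map = new int[tasks];    // tasks mapped in id order
    moveable = new boolean[tasks];
    Arrays.fill(moveable, true);
  }

  double estimateET(int task, int node) { return 0; }
  int[] nearDeps(int task, int maxDist) { return new int[0]; }
  void updateDependenciesET(int task) { }  // may re-flag deps

  void run() {
    int r = 1;
    boolean oneMoveable = true;
    while (oneMoveable) {
      for (int t = 0; t < map.length; t++) {
        if (!moveable[t]) continue;
        int best = map[t];
        // random search among f*n/r candidate nodes
        for (int k = 0; k < f * n / r; k++) {
          int cand = rnd.nextInt(Math.max(1, n / r));
          if (estimateET(t, cand) < estimateET(t, best))
            best = cand;
        }
        // edge-cuts part: nodes near the dependencies of t
        for (int cand : nearDeps(t, 2)) {
          if (estimateET(t, cand) < estimateET(t, best))
            best = cand;
        }
        if (best != map[t]) {
          map[t] = best;
          updateDependenciesET(t);
        }
        moveable[t] = false;
      }
      r++;                   // all tasks considered
      oneMoveable = false;
      for (boolean m : moveable) oneMoveable |= m;
    }
  }
}
\end{verbatim}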
The complexity of the AIAC QM algorithm is about $O(n^2 \cdot t
\cdot \ln(r))$. This complexity is the same as that of the original
algorithm (details are given in \cite{qm_these}), with an increase by
a factor $n$, corresponding to the edge-cuts part.
985 \section{Experimentation}
988 % We now describe the experiments we have conducted and their
989 % components, to evaluate the effects of the AIAC QM algorithm on
990 % application execution time.
992 \subsection{The NAS Parallel Benchmark Kernel CG and the Grid'5000 platform }
We used the ``Kernel CG'' of the NAS Parallel Benchmarks (NPB)
\cite{nas} to evaluate the performance of the mapping
algorithm. This benchmark is designed to be used on large
architectures, because it stresses communications over high-latency
networks by processing unstructured matrix-vector multiplications. In
this benchmark, a Conjugate Gradient method is used to compute an
approximation of the smallest eigenvalue of a large, sparse and
symmetric positive definite matrix by the inverse power method. In
our tests, the whole matrix contains nonzero values, in order to
further stress communications. As the Conjugate Gradient method
cannot be executed with the asynchronous iteration model, we have
replaced it with another method called the multisplitting method,
which supports the asynchronous iteration model.
1010 With the multisplitting algorithm, the $A$ matrix is split into
1011 horizontal rectangle parts.
1012 %as Figure \ref{fig:multisplit} shows.
1014 of these parts is assigned to a processor -- so the size of data
1015 depends on the matrix size but also on the number of participating
nodes. In this way, a processor is in charge of computing its $XSub$
part by solving the following subsystem: $ASub \times XSub = BSub -
DepLeft \times XLeft - DepRight \times XRight$.
1020 After solving $XSub$, the result must be sent to other
1021 processors which depend on it.
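A rough sketch of one processor's loop under the asynchronous model
is given below (an illustration only; solveSub() stands for the
multi-threaded sequential solver and the exchange primitives for the
JaceP2P-V2 asynchronous messaging, and we assume the right-hand side
is rebuilt from the last received $XLeft$ and $XRight$ parts).

\begin{verbatim}
// Illustrative per-processor multisplitting loop; all
// names are placeholders, not the actual benchmark code.
public class MultisplittingSketch {
  double[] bSub, xSub, xLeft, xRight;

  double[] solveSub(double[] bLoc) { return bLoc; }
  void asyncSendToDependencies(double[] x) { }
  void refreshFromLastMessages() { }  // xLeft / xRight
  boolean globalConvergence() { return true; }

  void iterate() {
    do {
      refreshFromLastMessages();      // non-blocking
      // right-hand side rebuilt from the last data:
      // BLoc = BSub - DepLeft*XLeft - DepRight*XRight
      double[] bLoc = combine(bSub, xLeft, xRight);
      xSub = solveSub(bLoc);          // ASub * XSub = BLoc
      asyncSendToDependencies(xSub);  // never waits
    } while (!globalConvergence());
  }

  double[] combine(double[] b, double[] l, double[] r) {
    return b;                         // placeholder
  }
}
\end{verbatim}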
1023 % \begin{figure}[h!]
1025 % \includegraphics[width=7.4cm]{images/multisplit}
1026 % \caption{Data decomposition for the multisplitting method
1028 % \label{fig:multisplit}
1031 % The multisplitting method can be decomposed into four phases:
1034 % \item \textbf{Data decomposition}. In this phase, data are allocated
1035 % to each processor assuming the decomposition shown in Figure
1036 % \ref{fig:multisplit}. Then, each processor iterates until
1037 % convergence is reached.
1038 % \item \textbf{Computation}. At the beginning of the computation, each
1039 % processor computes
1040 % $BLoc = BSub - DepLeft \times XLeft - DepRight \times XRight$. Then,
1041 % it solves $ASub \times XSub = BLoc$ by using a
1042 % multi-threaded sequential version of the Conjugate Gradient method.
1043 % \item \textbf{Data exchange}. Each processor sends its $XSub$
1044 % part to its neighbors. Here, the neighborhood is closely related
1045 % to the density of the $A$ matrix. Clearly, a dense matrix implies an
1046 % \textit{all-to-all} communication scheme while a matrix with a low
1047 % bandwidth reduces the density of the communication scheme.
1048 % \item \textbf{Convergence detection} Each processor computes its local
1049 % convergence and sends it to a server node. When this one detects
1050 % that each processor has converged, it stops the whole computation
1054 %It can be pointed out here that it is possible to modify the data
1055 %decomposition in order to obtain non disjoint rectangle matrices. This
1056 %property of multisplitting methods, called \textit{overlapping}, can
1057 %bring significant improvements to convergence speed, but it is not the
1059 For more details about this method, interested readers are invited to
1060 see \cite{book_raph}. In our benchmark, the sequential solver part of
1061 the multisplitting method is the Conjugate Gradient, using the
1062 MTJ\cite{mtj} library. Its implementation is multi-threaded, so it
1063 benefits from multi-core processors.
1065 We point out here that this benchmark is a typical AIAC
1066 application. In our study, we consider that the computational costs of
tasks are approximately the same and that the communication costs
are also the same (this comes from the difficulty of evaluating real
costs in the AIAC model). For our experiments, the bandwidth of the
matrices has been reduced in order to limit the dependencies and we
fixed it to $35,000$. This bandwidth size generates, according to the
problem size, between 10 and 25 neighbors per task.
1074 % The general form of the TIG for this application is given
1075 % by Figure \ref{fig:tigcg}.
1077 % \begin{figure}[h!]
1079 % \includegraphics[width=8cm]{images/tigcg2}
1080 % \caption{Part of the form of the TIG representing an instance of the
1081 % NAS Kernel CG application}
1085 %This figure shows 6 tasks, which are represented by a circle in which
1086 %the identifier of the task is given.
1088 % The computational cost of a task is given by the number on the top
1089 % left-hand side of each circle (for example the cost of task 31 is
1090 % 1000). Communications between tasks are represented by edges on
1091 % which the amount of communication is given (for example, the
1092 % communication cost between tasks 29 and 30 is about 30).
1093 % Dotted lines represent communications with tasks which are not
1094 % represented on the figure. We can see here that each task has four
1095 % neighbors (the two previous and the two next). This amount of
1096 % neighbors is directly related to the bandwidth of the matrix (in this
1097 % example the bandwidth is very small). For more details about the
1098 % influence of the bandwidth on the amount of neighbors, interested
1099 % readers are invited to see \cite{largescale}.
1103 %\subsubsection*{The Grid'5000 platform}
The platform used for our tests, called Grid'5000\cite{g5k}, is a
French nationwide experimental set of clusters which provides a
configurable and controllable instrument. It gathers many clusters
composed of various kinds of computers, with different specifications
and software. Clusters are spread over 9 sites,
%as can be seen on Figure \ref{fig:g5ksite},
and the computing power represents more than 5000 computing cores
interconnected by the ``Renater'' network. This network is the French
national network for research and education; it provides a large
bandwidth with high latency. Intra-cluster networks present small
bandwidths and low latencies.
1119 % \begin{figure}[h!]
1121 % \includegraphics[height=6.5cm]{images/g5k-noms}
1122 % \caption{The Grid'5000 sites map}
1123 % \label{fig:g5ksite}
1128 \subsection{Other mapping algorithms}
1129 \label{sec:othermaping}
1131 % In this section we present the two other mapping algorithms we used
1132 % in our experiments to compare the performance of the AIAC QM
1133 % algorithm. The first one was used to evaluate the benefits of a
1134 % mapping solution in section \ref{sec:benef}. The second one was used
1135 % to show the differences between the two mapping class, the ``execution
1136 % time'' and the ``edge-cuts'' optimizations, as it is a fully edge-cut
1137 % optimization based mapping algorithm.
1139 \subsubsection{A Simple Mapping algorithm}
1142 %As mentioned in section \ref{sec:pb}, the asynchronous iteration model
1143 %has some specificities which distinguishes it from other classical model
1146 %The first thing we have done was to be sure that a mapping algorithm
1147 %would enhance applications performance.
1148 The \textit{Simple Mapping algorithm} (SMa) was designed to
1149 show the benefits of a mapping algorithm in the JaceP2P-V2 platform.
1150 %The text of the \textit{Simple Mapping} if given by
1151 %Algorithm \ref{alg:sma} gives the pseudo-code of the \textit{Simple
1152 % Mapping algorithm}.
1153 %, in which we can see that it is very simple, with a complexity in $O(n^
1154 %2)$ resulting from sort methods.
1160 % \dontprintsemicolon
1161 % \KwIn{Sets of tasks and computing nodes}
1162 % \KwOut{Mapping of tasks to nodes}
1166 % sort computing nodes by cluster\;
1167 % sort clusters by size, from higher to lower\;
1168 % map tasks in order on sorted clusters\;
1169 % \caption{The Simple Mapping algorithm}
1174 %The aim of this algorithm is to do similarly to a very simple
1175 %edge-cuts optimization algorithm. To do that
1176 The algorithm puts each node in a cluster entity.
1177 %, which is given for now by the convention name of nodes
1178 %on Grid'5000 (for example azur-3.sophia.grid500.fr is the node 3 of
1179 %the azur cluster in the sophia site).
Then it sorts the clusters by their size, from the largest to the smallest.
1181 %; this operation is the part which
1182 %is inspired by edge-cuts optimization based algorithms.
1183 Finally, all tasks are mapped in order on the sorted clusters; each
1184 task is assigned to a particular computing node of the chosen cluster.
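A sketch of SMa is given below (our illustration; the cluster of
each node is assumed to be known, e.g. from its name on Grid'5000).

\begin{verbatim}
// Illustrative Simple Mapping: group nodes by cluster,
// sort clusters by size, then map tasks in order.
import java.util.*;

public class SimpleMappingSketch {
  // clusterOf[node] = cluster id; returns map[task] = node
  static int[] map(int tasks, int[] clusterOf) {
    Map<Integer, List<Integer>> clusters = new HashMap<>();
    for (int node = 0; node < clusterOf.length; node++) {
      clusters.computeIfAbsent(clusterOf[node],
          k -> new ArrayList<>()).add(node);
    }
    // largest clusters first
    List<List<Integer>> sorted =
        new ArrayList<>(clusters.values());
    sorted.sort((a, b) -> b.size() - a.size());

    int[] assign = new int[tasks];
    int t = 0;
    for (List<Integer> c : sorted) {
      for (int node : c) {
        if (t >= tasks) return assign;
        assign[t++] = node;          // one task per node
      }
    }
    return assign;
  }

  public static void main(String[] args) {
    // 5 nodes in 2 clusters, 4 tasks to map
    System.out.println(Arrays.toString(
        map(4, new int[]{0, 0, 1, 1, 1})));
  }
}
\end{verbatim}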
1186 %Though this algorithm is trivial, it allows AIAC applications to run faster
1187 %on distributed clusters architectures, with a gain over $30\%$ on
1188 %execution, as experiments described in section \ref{sec:benef}.
1190 \subsubsection{Edge-cuts optimization}
1191 \label{sec:edgcutalgo}
As explained in Section \ref{sec:pb}, the asynchronous iteration
model is so specific and unpredictable that we also wanted to
evaluate the second kind of mapping algorithm, which aims to optimize the
1197 %Multilevel partitioning algorithms such as Metis and
1198 %Chaco fail to address the limitations imposed by heterogeneity in the
1199 %underlying targeted system. They assume that computing nodes and
1200 %network relying them are homogeneous. This is not corresponding to our
1201 %execution environment, which is fully heterogeneous. These methods are
1202 %based on ``graph growing'' (GGP) and/or ``greedy graph growing''
1203 %(GGGP) algorithms which aim to divide tasks into two, or for some
1204 %algorithm a power of two, partitions. In our case, we do not know in
1205 %advance the number of partitions we need. Indeed, it depends on the
1206 %amount of nodes in each cluster and on the number of tasks.
1208 %As GGP and GGGP algorithms seems to be efficient in specific cases, it
1209 %could be interesting to adapt one to our model, in order to evaluate a
1210 %real edge-cuts optimization algorithm.
We chose Farhat's algorithm\cite{farhat}, which has the ability
to divide the graph into any number of partitions, thereby avoiding
recursive bisection.
1214 %Therefore its running execution time is
1215 %independent of the desired number of subsections.
1217 % The adapted version of this algorithm, Farhat's Edge-Cut (F-EC),
1218 % evaluated in the JaceP2P-V2 environment is described in Algorithm
1219 % \ref{alg:edgecuts}.
1224 % \dontprintsemicolon
1226 % \KwIn{Sets of tasks and computing nodes}
1227 % \KwOut{Mapping of tasks to nodes}
1232 % sort nodes by cluster\;
1233 % $lTasks \leftarrow$ sort tasks by dep degree\;
1234 % $changeCluster \leftarrow$ true\;
1235 % $cTasks \leftarrow$ empty;
1239 % \While{one task is not mapped}{
1240 % \If{$changeCluster$}{
1241 % $curCluster \leftarrow$ nextCluster()\;
1242 % $places \leftarrow$ size($curCluster$)\;
1243 % $changeCluster \leftarrow$ false\;
1244 % $mTasks \leftarrow$ empty\;
1249 % \If{no task in cTasks}{
1250 % $cTasks \leftarrow$ first task from $lTasks$\;
1255 % $curTask \leftarrow$ first task in $cTasks$\;
1259 % \eIf{$( 1 + $dep(curTask)~$\cdot~\delta) <= places)$}{
1260 % remove $curTask$ from $cTasks$\;
1261 % add $curTask$ in $mTasks$\;
1262 % $places \leftarrow places - 1$\;
1263 % add dep(curTask) in cTasks\;
1265 % $changeCluster$ $\leftarrow$ true\;
1266 % associate $mTasks$ with $curCluster$\;
1270 % \caption{The Fahrat's Edge-Cut algorithm}
1271 % \label{alg:edgecuts}
This algorithm performs a ``clusterization'' of the tasks. First, it
groups the computing nodes into clusters, which are sorted by
decreasing number of nodes. Tasks are ordered by decreasing dependency
degree, so that the tasks at the top of the list have a higher
priority to be mapped. The algorithm then tries to map onto each
cluster the maximum number of tasks. To map a task on a cluster, it
checks whether there is enough room for the task and for some of its
dependencies; this amount of dependencies is fixed by a factor
$\delta$, which is a parameter of the algorithm. If there is, the task
is mapped on the current cluster and its dependencies become priority
tasks to be mapped next. This keeps communicating tasks close to each
other.
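The following minimal Python sketch reproduces this clusterization; the data structures and names are illustrative assumptions of ours and do not correspond to the actual JaceP2P-V2 implementation.
\begin{verbatim}
def fec_mapping(deps, clusters, delta=0.5):
    # deps: dict task -> list of neighbouring tasks (every task
    #       of the TIG appears as a key);
    # clusters: dict cluster name -> number of free nodes;
    # delta: local dependency factor of the F-EC algorithm.
    degree = lambda t: len(deps[t])
    cluster_order = sorted(clusters, key=clusters.get, reverse=True)
    l_tasks = sorted(deps, key=degree, reverse=True)
    mapping, c_tasks = {}, []
    for cluster in cluster_order:
        places = clusters[cluster]
        while places > 0:
            if not c_tasks:
                left = [t for t in l_tasks if t not in mapping]
                if not left:
                    return mapping
                c_tasks.append(left[0])
            task = c_tasks[0]
            if task in mapping:
                c_tasks.pop(0)
                continue
            # Map the task only if the cluster can also host a
            # fraction delta of its dependencies.
            if 1 + degree(task) * delta <= places:
                c_tasks.pop(0)
                mapping[task] = cluster
                places -= 1
                c_tasks.extend(sorted(
                    (d for d in deps[task] if d not in mapping),
                    key=degree, reverse=True))
            else:
                break  # not enough room: switch to the next cluster
    return mapping  # leftover tasks would need a fallback pass
\end{verbatim}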
1287 %All computing nodes are first grouped in clusters.
1288 %%, using their full
1289 %%qualified name -- for now, no criteria is used to sort clusters, such
1290 %%are their power or size.
1291 %Next, tasks are inserted in a list $lTasks$, sorted in descending
1292 %according to their dependency degree. This allows to map tasks with
1293 %their dependencies to improve communications. Next begins the main
1296 %While not all tasks have been mapped the algorithm do the
1297 %following. First, at the beginning, it chooses the first available
1298 %cluster to be the current cluster to map tasks on, and sets the
1299 %variable $places$ to its size, in number of computing nodes, and
1300 %create a new tasks' list $mTasks$. During the execution of the
1301 %algorithm, passing in this block means that there is no more node
1302 %available in the current cluster, so it has to choose the next
1303 %cluster. Next, it looks in $cTasks$, which is a tasks list containing
1304 %tasks which are in instance to be mapped. If $cTasks$ is empty it
1305 %takes the first available task in $lTasks$ -- at the beginning it
1306 %takes the first one, which has the higher dependency degree.
1308 %Next, it takes the first task in $cTasks$ to try to map it on the
1309 %current cluster. The task is mapped on if there is enough available
1310 %computing nodes on the current cluster. This amount of node
1311 %corresponds to $1 + dep( curTask ) \cdot \delta$, which represents the
1312 %task plus the amount of its dependencies multiplied by a ``local
1313 %dependency factor'' $\delta$, which indicates how many dependencies
1314 %must be a least with $curTask$ on the same cluster. If the task could
1315 %be mapped on this cluster, $curTask$ is added in $mTasks$, which is a
1316 %tasks' list containing tasks which are currently mapped on current
1317 %cluster, and removed it from $cTasks$. The number of available nodes,
1318 %$places$, is decremented, and the dependencies of $curTask$ are added
1319 %to $cTasks$. This list is sorted in the same way as $lTasks$, indeed
1320 %tasks are added according to their dependency degree, from the higher
1321 %to the lower. Otherwise, if there is not enough or no more nodes
1322 %available in the cluster, it has to change the current cluster, so it
1323 %associates the current cluster and list of tasks mapped on, $mTasks$.
1326 \subsection{Experiments}
1327 \label{sec:experiments}
Having described the different components of the experiments, we now
present the impact of the AIAC QM mapping on applications running with
JaceP2P-V2 on a heterogeneous architecture of distributed clusters. In
the following, we call ``heterogeneity degree'' of a set of
distributed clusters the ratio between the standard deviation and the
average of the computing nodes' power. This heterogeneity degree
varies from 0, where nodes are homogeneous, to 10, where nodes are
totally heterogeneous. In these experiments, we assume that no
computing node fails during the execution of the applications.
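To make this measure explicit, let $p_1,\dots,p_n$ denote the powers of the $n$ computing nodes. Our reading of the above definition (an assumption, since the exact scaling is not specified here) is that the heterogeneity degree is proportional to the coefficient of variation of the node powers:
\[
  \overline{p}=\frac{1}{n}\sum_{i=1}^{n}p_i,\qquad
  \sigma_p=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(p_i-\overline{p}\right)^{2}},\qquad
  h\propto\frac{\sigma_p}{\overline{p}},
\]
so that $h=0$ when all nodes are identical and $h$ grows as their powers spread out, the range from 0 to 10 mentioned above then corresponding to a scaling of this ratio.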
The application used in these experiments is the CG kernel of the NAS
parallel benchmarks, in its multi-splitting version. Two problem sizes
were used: one with a matrix of size $550,000$ (called ``class E''),
run on 64 computing nodes, and one with a matrix of size $5,000,000$
(called ``class F''), run on 128 nodes.
1347 %\ref{fig:tigcg} shows a part of a TIG of this application.
1357 \subsubsection{About heterogeneity}
1358 \label{sec:xphetero}
The first experiments study the impact of the heterogeneity of the
computing nodes on the mapping results. Heterogeneity is an important
factor in high performance computing on the grid, all the more so when
using the asynchronous iteration model.
Since the mapping algorithms take a parameter, the search factor $f$
for AIAC QM and the amount of local dependencies $\delta$ for F-EC, we
fixed both to $50\%$. For AIAC QM this means that at each round the
number of considered nodes is divided by two, and for F-EC that each
task requires half of its dependencies on the same local cluster.
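As a numerical illustration of these settings (an example of ours, not a measurement): with $\delta=50\%$, a task with four dependencies is mapped on a cluster only if $1+4\times 0.5=3$ nodes are still free there, so that at least half of its neighbours can be placed alongside it; and with $f=50\%$, an AIAC QM search that starts from $N$ candidate nodes examines about $N$, $N/2$, $N/4$, \dots\ nodes over the successive rounds.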
Four experiments were carried out on four architectures with different
heterogeneity degrees: in two of them the computing nodes are more
heterogeneous than in the others. In these experiments we did not
alter the network heterogeneity, because of the difficulty of
disturbing and controlling the network on Grid'5000; by default, the
networks are already quite heterogeneous. We needed more than 200
computing nodes to execute our application because some clusters are
too small to run the largest problems (they do not have enough
memory). All the nodes used have more than 2 GB of RAM and run a
64-bit Linux distribution.
The first architecture, Arc1.1, was composed of 113 computing nodes
representing 440 computing cores, spread over 5 clusters in 4
geographically distant sites. In Arc1.1 we used bi-core (2 clusters),
quad-core (2 clusters) and bi-quad-core (1 cluster) machines. Its
heterogeneity degree is 6.43. This architecture was used to run class
E of the CG application using 64 computing nodes. The second
architecture, Arc1.2, used to execute class F of the CG application on
128 computing nodes, was composed of 213 computing nodes representing
840 computing cores, with a heterogeneity degree of 6.49. This
architecture was spread over the same clusters and sites as Arc1.1.
The results of the experiments on Arc1.1 and Arc1.2 are given in Table
\ref{tab:exph1E} and Table \ref{tab:exph1F}, which give the gains in
execution time obtained in comparison to the version without mapping.
\begin{table}[h!]
  \renewcommand{\arraystretch}{1.5}
  \centering
  \begin{tabular}{|c||c|c|c|c|}
    \hline
    Algorithm & None & SMa & AIAC QM & F-EC \\
    \hline
    Execution time & 150s & 110s & 101s & 90s \\
    \hline
    Gains & -- & $27\%$ & $33\%$ & $40\%$ \\
    \hline
  \end{tabular}
  \caption{Gains in execution time for class E of the CG application
    on Arc1.1 using 64 nodes}
  \label{tab:exph1E}
\end{table}
\begin{table}[h!]
  \renewcommand{\arraystretch}{1.5}
  \centering
  \begin{tabular}{|c||c|c|c|c|}
    \hline
    Algorithm & None & SMa & AIAC QM & F-EC \\
    \hline
    Execution time & 403s & 265s & 250s & 218s \\
    \hline
    Gains & -- & $34\%$ & $38\%$ & $46\%$ \\
    \hline
  \end{tabular}
  \caption{Gains in execution time for class F of the CG application
    on Arc1.2 using 128 nodes}
  \label{tab:exph1F}
\end{table}
First, we can see that the Simple Mapping algorithm, despite its
simplicity, provides a significant improvement in application
execution time. This highlights that JaceP2P-V2 really needs a mapping
algorithm to be more efficient. We can also see that the F-EC and AIAC
QM algorithms provide a better mapping than the Simple Mapping
algorithm, with a significant difference between the two. This comes
from the homogeneity of the clusters: in this case the F-EC algorithm
is more efficient, since the minimization of communications becomes
more important than coping with the heterogeneity of the computational
power.
1453 %Indeed, it is more benefic for ta becomes more important than the
1454 %tacklebecomes more important than the
1455 %tackle sks to have locally their
1456 %dependencies, which allows to improve communications, in case of
1457 %computing nodes are more homogeneous -- communications are more
1458 %important than computing power (that is why the F-EC algorithm is more
The effect is that tasks perform fewer iterations, as they receive
updated data from their neighbors more frequently. In addition, since
tasks and their dependencies are on the same cluster, communications
are improved; and since computations take approximately the same time,
the number of iterations is reduced and the algorithm converges more
quickly.
1467 % Another important positive point is that gains are scalable, which allows
1468 % to foresee big improvements for very large applications.\\
The third architecture, Arc2.1, was composed of 112 computing nodes,
representing 394 computing cores, spread over 5 clusters in 5 sites.
In this architecture we used bi-core (3 clusters), quad-core (1
cluster) and bi-quad-core (1 cluster) machines. Its heterogeneity
degree is 8.41. This architecture was used to run class E of the CG
application, using 64 computing nodes. The fourth architecture,
Arc2.2, used to execute class F of the CG application on 128 computing
nodes, was composed of 212 computing nodes representing 754 computing
cores, with a heterogeneity degree of 8.44. This architecture was
spread over the same clusters and sites as Arc2.1. The results of the
experiments on Arc2.1 and Arc2.2 are given in Table \ref{tab:exph2E}
and Table \ref{tab:exph2F}, which give the gains in execution time
obtained in comparison to the version without mapping.
\begin{table}[h!]
  \renewcommand{\arraystretch}{1.5}
  \centering
  \begin{tabular}{|c||c|c|c|c|}
    \hline
    Algorithm & None & SMa & AIAC QM & F-EC \\
    \hline
    Execution time & 498s & 341s & 273s & 385s \\
    \hline
    Gains & -- & $32\%$ & $45\%$ & $23\%$ \\
    \hline
  \end{tabular}
  \caption{Gains in execution time for class E of the CG application
    on Arc2.1 using 64 nodes}
  \label{tab:exph2E}
\end{table}
\begin{table}[h!]
  \renewcommand{\arraystretch}{1.5}
  \centering
  \begin{tabular}{|c||c|c|c|c|}
    \hline
    Algorithm & None & SMa & AIAC QM & F-EC \\
    \hline
    Execution time & 943s & 594s & 453s & 660s \\
    \hline
    Gains & -- & $37\%$ & $52\%$ & $30\%$ \\
    \hline
  \end{tabular}
  \caption{Gains in execution time for class F of the CG application
    on Arc2.2 using 128 nodes}
  \label{tab:exph2F}
\end{table}
To begin with, these experiments confirm that a mapping algorithm is
needed and that the improvements remain scalable. We can then see that
the performance of the F-EC algorithm drops while that of the AIAC QM
algorithm improves. What is surprising is that the Simple Mapping
algorithm performs better than F-EC. This can be explained by the fact
that, as the computing nodes are quite heterogeneous, computations do
not take the same time, so mapping dependencies close to their tasks
is not decisive. In this case, what matters most is the power of the
computing nodes. So, on this kind of architecture, it is more
efficient to select the best computing nodes in order to compute
iterations more quickly and to improve the convergence detection
speed.
1541 % Here, it is important to note that the AIAC QM algorithm offers a gain
1542 % of about $50\%$ on the execution time, that is to say that the
1543 % application takes half of the execution time than without mapping.
1545 % \subsubsection{Parameters variation}
1546 % \label{sec:xpvariation}
1548 % After having evaluated mapping algorithms on the heterogeneity of
1549 % distributed clusters, we now propose to change the parameters of AIAC
1550 % QM and F-EC algorithms, in order to determine which values are the
1553 % To do these experiments, we used an architecture composed of 122
1554 % computing nodes representing 506 computing cores, spread over 5
1555 % clusters in 5 sites. In this architecture we used bi-cores (2
1556 % clusters), quad-cores (2 clusters) and bi-quad-cores (1 cluster)
1557 % machines. Its heterogeneity degree value is 4.98.
1558 % %, which means that
1559 % %computing nodes power is very heterogeneous.
1561 % The parameters of each algorithm, $f$ (the search factor) for
1562 % AIAC QM and $\delta$ (the amount of local dependencies) for F-EC,
1563 % varied both with values $10\%$, $50\%$ and $90\%$. We used the CG
1564 % multi-splitting application on 64 computing nodes. The results of
1565 % these experiments are given in Table \ref{tab:expparams}. Results
1566 % reported in this table represent the gains in execution time provided
1567 % by each algorithm with different parameters values.
1570 % \renewcommand{\arraystretch}{1.5}
1574 % \begin{tabular}[h!]{|c||c|c|c|}
1576 % Parameters& $10\%$ & $50\%$ & $90\%$ \\
1579 % % Simple & \multicolumn{3}{c|}{\textcolor{blue}{$39\%$}}\\
1580 % SMa & \multicolumn{3}{c|}{$30\%$}\\
1582 % % AIAC QM & $35\%$ & \textcolor{blue}{$41\%$} & $35\%$ \\
1583 % AIAC QM & $30\%$ & $32\%$ & $30\%$ \\
1585 % % F-EC & $39\%$ & $35\%$ & \textcolor{blue}{$45\%$} \\
1586 % F-EC & $40\%$ & $37\%$ & $45\%$ \\
1589 % % \caption{Parameters variations using a $500'000$ problem size on an
1590 % % architecture of 5.37 heterogeneity degree}
1591 % \caption{Gains in execution time with mapping algorithms parameters
1592 % variations using the class E of the CG application using 64
1594 % % \vspace*{-0.4cm}
1595 % \label{tab:expparams}
1596 % % \vspace*{-0.9cm}
1599 % %First of all, we can see that the Simple mapping provides the same
1600 % %order of performance, as shown in the precedent section, so it is
1601 % %not affected by the heterogeneity degree. Secondly,
1602 % For the AIAC QM algorithm, we can note that the best value for its
1603 % parameter $f$ is about $50\%$, but its impact is not big enough to
1604 % indicate a specific configuration.
1605 % % With a low heterogeneity degree, this mapping algorithm provides a
1606 % % good performance improvement.
1607 % Finally, and this is not surprising, the F-EC algorithm is more
1608 % efficient with a factor $\delta$ near $100\%$, which directly comes
1609 % from its aim. But we can see that it is more efficient to have a
1610 % factor around $10\%$ than having one around $50\%$.
1612 % We can note here, with a lower heterogeneity degree than in previous
1613 % experiments, gains are lower and the difference between AIAC QM and
1614 % F-EC (with parameters at $50\%$) is lower. It can be explained as the
1615 % fact that more the heterogeneity degree tends to 0 more computing
1616 % nodes are the same, so a mapping solution will not be efficient,
1617 % except one only optimizing network latency.
1618 %These experiments show that the impact of parameters values does not
1619 %well influence the AIAC QM, whereas it is very important for the F-EC
1622 %\section{Discussion}
1623 %\label{sec:discussion}
1625 %In this paper three algorithms for mapping asynchronous iterative
1626 %applications on heterogeneous distributed architectures are described.
1627 %This approach is relatively new, so we had to study related works. We
1628 %assume in our model that the amount of computation, the computational
1629 %cost of each task, cannot be known in advance, and it is the same
1630 %about communications. Two mains class of algorithms are mainly used,
1631 %the ``execution time'' and the ``edge-cuts'' optimizations, with a
1632 %newer preference fort the first one. Indeed, the efficiency of second
1633 %one is criticized for its objectives, invoking the fact that these
1634 %algorithms do not optimize the right metric. It is true in a broad
1635 %spectrum of mapping domains, but we have shown that in our case, it
1636 %could be an efficient solution, depending on the architecture
1639 %As each experiment takes a lot of time and the Grid'5000 platform is
1640 %shared by many researchers, we could not conducted as many experiments
1641 %as we wanted and as we need to purpose an exhaustive view of this part
1642 %of the mapping domain. We cannot design a more general state of the
1643 %mapping of asynchronous iterative applications on distributed
1644 %architectures, but we can draw the main lines of future works.
1646 %%giving the first stones at the building of this specific part of the
1649 %We have shown that the heterogeneity in computing nodes power takes an
1650 %important part in an efficient mapping algorithm. This parameter and
1651 %probably the heterogeneity in network should really be more taken into
1652 %consideration. Maybe a good solution consists in designing of mapping
1653 %algorithms giving more important priority to one or to the other
1654 %optimization objectives. This leads to design a novel algorithm, which
1655 %takes into account the different points discussed in this paper, which
1656 %would probably be an hybrid algorithm, efficient with our model on the
1657 %targeted architectures.
1659 \section{Conclusion and future works}
1662 In this paper we have presented
1663 %three algorithms to address the
1664 %mapping problem for asynchronous iterative applications in
1665 %heterogeneous distributed architectures. As the asynchronous iteration
1666 %model is very specific, it was not clear of which kind of mapping
1667 %algorithm should be efficient on such a problem. The two main
1668 %approaches given in the literature, the ``execution time'' and the
1669 %``edge-cuts'' optimization algorithms, have been evaluated on
1670 %different architectures, with different heterogeneity degrees.
1671 %% These architectures varied in their
1672 %%heterogeneity, to evaluate the algorithms.
1674 %%We proposed three mapping algorithms for the JaceP2P-V2
1675 %%environment. The first is a simple way mapping, the Simple Mapping
1676 %%algorithm, which always provides a good and stable improvement of
1677 %%performance on all kinds of architectures.
a specific mapping algorithm for the AIAC model, called AIAC QM. This
algorithm is based on execution time optimization, but it also
includes a small degree of edge-cuts optimization. Experiments show
that the AIAC QM mapping algorithm is efficient on architectures with
a high heterogeneity degree. This can be explained by the fact that,
in our example, the iteration computations differ widely in duration,
and convergence is detected more quickly as the more powerful
computing nodes progress through the computation.
1687 % The F-EC algorithm, which is based
1688 % on the ``edge-cuts'' optimization, is meanwhile efficient on
1689 % architectures with a low heterogeneity degree. This can be explained
1690 % by the fact that in such an environment, it is more accurate for a
1691 % task to have its dependencies locally on the same cluster in order to
1692 % have efficient communications and to allow iterations to be computed
1693 % together, which improves the convergence detection speed.
1694 % Experiments we conducted have shown gains in execution time up to
1695 % $50\%$, which denotes a division by 2 of this execution time, for a
1696 % typical AIAC application on more than 700 computing cores.
1697 % %Experiments have shown that
1698 % %the importance of the parameter of both algorithms, AIAC QM and F-EC,
1699 % %is not so essential for the first one, instead it is very critical for
1700 % %the second one, but we cannot be sure that it is true all the time on
1701 % %all kinds of architectures; it maybe depends on the heterogeneity
1702 % %degree of the network.
1703 % As we did not influence the network's heterogeneity,
1704 % % as we did for the computational power of nodes,
1705 % the evaluation of the network impact on the
1706 % application execution time would be one of our next work.
1708 %For now, these three mapping algorithms are implemented in an
1709 %additional library for the JaceP2P-V2 environment. The results
1710 %presented in this paper show that a mapping algorithm allows to
1711 %improve applications performance, but as the executing architectures
1712 %should have a variety of heterogeneity degree, we have to find a
1713 %compromise between the two approaches in order to have an efficient
1714 %mapping algorithm on all kinds of architectures. In the future, we
1715 %would like to design an efficient mapping algorithm to improve the
1716 %execution of asynchronous iteration applications on heterogeous
1717 %distributed architectures. As the algorithm should be integrated in
1718 %the JaceP2P-V2 environment, which is fully decentralized and fault
1719 %tolerant, the new mapping algorithm should have also these
1720 %characteristics, in order to retrieve a fully decentralized and fault
1721 %tolerant environment.
1722 % Our future works concern the amelioration of the AIAC QM algorithm, in
1723 % order to improve it on homogeneous distributed architectures. As the
1724 % F-EC mapping algorithm is efficient on such architectures, we will
1725 % give a more important consideration to the edge-cuts part of AIAC
In our future work we plan to take the fault tolerance problem into
consideration. In this study we ran our experiments without computing
node failures, which is not realistic. The AIAC QM algorithm has to be
adapted to this important parameter. First, we have to efficiently
choose new nodes to replace failed ones. Secondly, as we use
checkpointing to save the tasks' states, we have to choose the backup
nodes carefully: if states are saved on neighbors (which are generally
in the same cluster for communication efficiency reasons) and a whole
cluster fails, an important part of the application is lost and cannot
be restarted, so the whole application fails. A trade-off should be
made by placing some backup nodes in external clusters.
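As a very rough illustration of this trade-off (hypothetical Python code of ours, not part of JaceP2P-V2), a backup node could simply be drawn from a cluster different from the one hosting the task, falling back to the local cluster only when no external node is available:
\begin{verbatim}
import random

def choose_backup(task_cluster, clusters):
    # clusters: dict cluster name -> list of node identifiers.
    # Prefer a node outside the task's own cluster, so that a
    # whole-cluster failure does not also destroy the saved state.
    external = [n for c, members in clusters.items()
                if c != task_cluster for n in members]
    pool = external if external else clusters[task_cluster]
    return random.choice(pool)
\end{verbatim}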
1741 % \subsubsection*{Acknowledgements}
1745 % This work was supported by the European Interreg IV From-P2P project
1746 % and the region of Franche-Comté.
1748 % Experiments presented in this paper were carried out using the
1749 % Grid'5000\cite{g5k} experimental testbed, being developed under the
1750 % INRIA ALADDIN development action with support from CNRS, RENATER and
1751 % several Universities as well as other funding bodies.
1754 \bibliographystyle{unsrt}
1756 \bibliography{biblio}