pdsec2010/pdsec2010.tex

   1 %% Version PDSEC
   2 \documentclass[conference,compsoc,a4paper]{IEEEtran}
   3
   4 %% Pour Ingrid
   5 %\documentclass[conference,compsoc,a4paper,onecolumn]{IEEEtran}
   6 %\documentclass[compsoc,a4paper,oneside]{IEEEtran}
   7
   8 \usepackage[T1]{fontenc}
   9 \usepackage{ucs}
  10 \usepackage[utf8x]{inputenc}
  11 \usepackage{lmodern}
  12 \usepackage{color}
  13 %% Jolis entetes %%
  14 %\usepackage[Glenn]{fncychap}
  15 \usepackage{amsmath}
  16 %\usepackage{amsthm}
  17 \usepackage{amsfonts}
  18 %\usepackage{graphicx}
  19 %\usepackage{xspace}
  20 % Definition des marges
  21 \usepackage{vmargin}
  22 \setpapersize[portrait]{A4}
  23
  24 \usepackage[english]{babel}
  25 % Extension pour les graphiques EPS
  26 %\usepackage[dvips]{graphicx}
  27 \usepackage[pdftex,final]{graphicx}
  28 % Extension pour les liens intra-documents (tagged PDF)
  29 % et l'affichage correct des URL (commande \url{http://example.com})
  30 \usepackage{hyperref}
  31 % Extension pour que plus de titres apparaissent dans la table des matieres
  32 % (table des matieres, bibliographie, index).
  33 %\usepackage{tocbibind}
  34 %\newcommand{\tabledesmatieres}{
  35 %  \setlength{\parskip}{0ex} % suppression de l'espace entre les paragraphes
  36 %  \tableofcontents
  37 %  \espacementnormal
  38 %}
  39 %\NoAutoSpaceBeforeFDP
  40 %\author{Sébastien Miquée}
  41
  42 \usepackage[ruled,vlined]{algorithm2e}
  43 \usepackage{cite}
  44
  45 \newcommand{\myitemize}[1]
  46 {
  47 \begin{itemize}
  48 \itemsep=#1pt
  49 }
  50
  51 %
  52 \ifCLASSINFOpdf
  53    \usepackage[pdftex]{graphicx}
  54    \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
  55 \else
  56 \fi
  57
  58 % *** PDF, URL AND HYPERLINK PACKAGES ***
  59 %
  60 \usepackage{url}
  61
  62
  63 \title{Mapping Asynchronous Iterative Applications on Heterogeneous
  64   Distributed Architectures}
  65
  66 \author{
  67 \IEEEauthorblockN{Raphaël Couturier \hspace{10pt} David Laiymani
  68   \hspace{10pt} Sébastien Miquée}
  69
  70 \IEEEauthorblockA{Laboratoire d'Informatique de Franche-Comté
  71   (LIFC)\\
  72   University of Franche-Comté\\
  73   IUT de Belfort-Montbéliard,2 Rue Engel Gros, BP 27, 90016 Belfort,
  74   France\\
  75  % Tel.: +33-3-84587782 \hspace{20pt} Fax: +33-3-84587781\\
  76 Email:
  77 \{raphael.couturier,david.laiymani,sebastien.miquee\}@univ-fcomte.fr}
  78
  79 %\thanks{This work was supported by the European Interreg IV From-P2P project.}
  80 }
  81
  82 %% Permet de réactiver le \thanks
  83 \IEEEoverridecommandlockouts
  84
  85 \date{}
  86 \begin{document}
  87
  88 %% left, top, right, bottom
  89 \setmargrb{20mm}{15mm}{20mm}{15mm}
  90
  91 \IEEEcompsoctitleabstractindextext
  92
  93 \maketitle
  94
  95
  96 \begin{abstract}
  97   To design parallel numerical algorithms on large scale distributed
  98   and heterogeneous platforms, the asynchronous iteration model (AIAC)
  99   may be an efficient solution. This class of algorithm is very
 100   suitable since it enables communication/computation overlapping and
 101   it suppresses all synchronizations between computation nodes. Since
 102   target architectures are composed of more than one thousand
 103   heterogeneous nodes connected through heterogeneous networks, the
 104   need for mapping algorithms is crucial. In this paper, we propose a
 105   new mapping algorithm dedicated to the AIAC model. To evaluate our
 106   mapping algorithm we first implemented it in the JaceP2P programming
 107   and executing environment dedicated to AIAC applications. Then we
 108   conducted a set of experiments on the Grid'5000 testbed with more
 109   than 700 computing cores and with a real and typical AIAC
 110   application based on the NAS parallel benchmarks. Results are very
 111   encouraging and show that the use of our algorithm brings an
 112   important gain in term of execution time (about $40\%$).\\
 113
 114 %  To design parallel and distributed applications on heterogeneous
 115 %  distributed architectures, the asynchronous iteration model may be
 116 %  an efficient solution. Such architectures contain various computing
 117 %  nodes connected by heterogeneous links. Nowadays, these
 118 %  architectures offer users too many computing nodes as they need, so
 119 %  a choice should be done, which is called ``tasks mapping''. In this
 120 %  paper we propose a comparison between different mapping algorithms,
 121 %  in order to evaluate mapping needs of the asynchronous iteration
 122 %  model on this kind of architectures. To do our experiments we used a
 123 %  middleware, JaceP2P-V2, which allows users to design asynchronous
 124 %  iterative applications and execute them on distributed
 125 %  architectures, in which we added a mapping library.\\
 126
 127   \textup{\small \textbf{Keywords:} Mapping algorithms, Distributed
 128     clusters, Parallel iterative asynchronous algorithms, Heterogeneous
 129     architectures.}
 130 \end{abstract}
 131
 132
 133 \IEEEpeerreviewmaketitle
 134
 135 %\begin{keywords}
 136 %  Mapping algorithms, Distributed clusters, Parallel iterative
 137 %  asynchronous algorithms, Heterogeneous architectures.
 138 %\end{keywords}
 139
 140 \section{Introduction}
 141 \label{sec:intro}
 142
 143 Nowadays scientists of many domains, like climatic simulation or
 144 biological research, need great and powerful architectures to compute
 145 their large applications. Distributed clusters architectures, which
 146 are part of the grid architecture, are one of the best architectures
 147 used to solve such applications with an acceptable execution
 148 time. These architectures provide a lot of heterogeneous computing
 149 nodes interconnected by a high performance network, but even with the
 150 greatest efforts of their maintainers, there are latency and
 151 computation capacity differences between clusters of each site.
 152
 153 In order to efficiently use this massive
 154 distributed computation power, numerous numerical algorithms have been
 155 elaborated. These algorithms can be broadly classified into two
 156 categories:
 157
 158 \myitemize{5}
 159 \item \textbf{Direct methods}, which give the exact solution of the
 160   problem using a finite number of operations
 161   (e.g. Cholesky\cite{cholesky-cg}, LU\cite{lu},etc). However, these
 162   methods cannot be applied to all kinds of numerical problems. In
 163   general, they are not well adapted to very large problems.
 164 \item \textbf{Iterative methods}, that repeat the same instructions
 165   until a desired approximation of the solution is reached -- we say
 166   that the algorithm has converged. Iterative algorithms constitute
 167   the only known approach to solving some kinds of problems and they
 168   are easier to parallelize than direct methods. The Jacobi or
 169   Conjugate Gradient\cite{cg} algorithms are examples of such
 170   iterative methods.
 171 \end{itemize}
 172
 173
 174 In the rest of this paper we only focus on iterative methods. Now to
 175 parallelize this kind of algorithm, two classes of parallel iterative
 176 models can be described:
 177
 178 \myitemize{5}
 179
 180 \begin{figure}[h!]
 181   \vspace{0.1cm}
 182   \centering
 183   \includegraphics[width=7.4cm]{images/ISCA}
 184   \caption{Two processors computing in the Synchronous Iteration - Asynchronous Communication (SIAC) model}
 185   \label{fig:SIAC}
 186 \end{figure}
 187
 188 \item \textbf{The synchronous iteration model}. In this model, as can
 189   be seen on Figure \ref{fig:SIAC}, after each iteration (represented
 190   by a filled rectangle), a node sends its results to its neighbors
 191   and waits for the reception of all dependency messages from its
 192   neighbors to start the next iteration. This results in large idle
 193   times (represented by spaces between each iteration) and is
 194   equivalent to a global synchronization of nodes after each
 195   iteration. These synchronizations can strongly penalize the overall
 196   performances of the application particularly in case of large scale
 197   platforms with high latency network. Furthermore, if a message is
 198   lost, its receiver will wait forever for this message and the
 199   application will be blocked. In the same way, if a machine falls
 200   down, all the computation will be blocked.
 201
 202 \begin{figure}[h!]
 203   \vspace{0.1cm}
 204   \centering
 205   \includegraphics[width=7.4cm]{images/IACA}
 206   \caption{Two processors computing in the Asynchronous Iteration - Asynchronous Communication (AIAC) model}
 207   \label{fig:AIAC}
 208 \end{figure}
 209
 210
 211 \item \textbf{The asynchronous iteration model}. In this
 212   model\cite{book_raph}, as can be seen on Figure \ref{fig:AIAC}, after
 213   each iteration, a node sends its results to its neighbors and starts
 214   immediately the next iteration with the last received data. These
 215   data could be data from previous iterations, because last data are
 216   not arrived in time or neighbors have not finish their current
 217   iteration. The receiving and sending mechanisms are asynchronous and
 218   nodes do not have to wait for the reception of dependency messages
 219   from their neighbors. Consequently, there is no more idle time
 220   between two iterations. Furthermore, this model is tolerant to
 221   messages loss and even if a node dies, the remaining nodes continue
 222   the computation, with the last data the failed node
 223   sent. Unfortunately, the asynchronous iteration model generally
 224   requires more iterations than the synchronous one to converge to the
 225   solution.
 226
 227   This class of algorithms is very suitable in a distributed clusters
 228   computing context because it suppresses all synchronizations between
 229   computation nodes, tolerates messages loss and enables the
 230   overlapping of communications by computations. Interested readers
 231   might consult \cite{bcvc06:ij} for a precise classification and
 232   comparison of parallel iterative algorithms. In this way, several
 233   experiments \cite{bcvc06:ij} show the relevance of the AIAC
 234   algorithms in the context of distributed clusters with high latency
 235   between clusters. These works underline the good adaptability of
 236   AIAC algorithms to network and processor heterogeneity.
 237 \end{itemize}
 238
 239 As we aim to solve very large problems on heterogeneous distributed
 240 architectures, in the rest of this study we only focus on the
 241 asynchronous iteration model. In order to efficiently use such
 242 algorithms on distributed clusters architectures, it is essential to
 243 map the tasks of the application to the best sub-sets of nodes of the
 244 target architecture. This mapping procedure must take into account
 245 parameters such as network heterogeneity, computing nodes
 246 heterogeneity and tasks heterogeneity in order to minimize the overall
 247 execution time of the application.
 248 %we have to select
 249 %sets of computing nodes, which can improve applications execution
 250 %time. Indeed, as there are more available nodes as we need on such
 251 %architectures, it is important to select appropriate computing nodes,
 252 %due to their heterogeneity at computation power as well as relying
 253 %network.
 254 To the best of our knowledge, there exits no algorithm which
 255 specifically addresses the mapping of AIAC applications on distributed
 256 architectures. The aim of this paper is to propose a new mapping
 257 algorithm dedicated to AIAC applications and to implement it into a
 258 real large scale computing platform, JaceP2P-V2. Experiments conducted
 259 on the Grid'5000 testbed with more than 400 computing cores show that
 260 this new algorithm allows to enhance the performances of JaceP2P-V2 of
 261 about $40\%$ for a real and typical AIAC application.
 262
 263 %this mapping problem. Our aim is to evaluate the
 264 %more common approaches to solve this problem, by using several mapping
 265 %algorithms implemented in a real large scale computing platform,
 266 %JaceP2P-V2, on a true distributed clusters architecture.\\
 267
 268 The rest of this paper is organized as follows. Section
 269 \ref{sec:jacep2p} presents the JaceP2P-V2 middleware. We focus here on
 270 one of the main drawbacks of this platform: its lack of an efficient
 271 mapping strategy. Section \ref{sec:pb} presents our mapping problem
 272 and quotes existing issues to address it. Section
 273 \ref{sec:aiacmapping} describes the specificities of the AIAC model
 274 and details the main solution we propose to address the AIAC mapping
 275 problem. In section \ref{sec:expe} we describe the experiments we have
 276 conducted, with their different components and results. These results
 277 were conducted on the Grid'5000 testbed with more than 400 computing
 278 cores and show an important gain (about $40\%$) of the overall
 279 execution time for a typical AIAC application, i.e. based on the NAS
 280 Parallel Benchmark. Finally, we give some concluding remarks and plan
 281 our future work in section \ref{sec:conclu}.
 282
 283
 284 \section{JaceP2P-V2}
 285 \label{sec:jacep2p}
 286
 287 JaceP2P-V2\cite{jaceP2P-v2} is a distributed platform implemented
 288 using the Java programming language and dedicated to developing and
 289 executing parallel iterative asynchronous applications. JaceP2P-V2
 290 executes parallel iterative asynchronous applications with
 291 dependencies between computing nodes. In addition, JaceP2P is fault
 292 tolerant which allows it to execute parallel applications over
 293 volatile environments and even for stable environments like local
 294 clusters, it offers a safer and crash free platform. To our knowledge
 295 this is the only platform dedicated to designing and executing AIAC
 296 algorithms.
 297
 298 \subsection{Architecture}
 299 \label{sec:archijaceP2P}
 300
 301 In this section we describe the JaceP2P-V2 environment. As can be seen
 302 on Figure \ref{fig:jaceP2P-v2}, which shows its architecture, this
 303 platform is composed of three main entities:
 304
 305 \begin{figure}[h!]
 306   \vspace{0.1cm}
 307   \centering
 308   \includegraphics[width=7.4cm]{images/JACEP2P-V2}
 309   \caption{The JaceP2P-V2 architecture}
 310   \label{fig:jaceP2P-v2}
 311 \end{figure}
 312
 313 \myitemize{5}
 314 \item The first entity is the ``super-node'' (represented by a big
 315   circle in figure \ref{fig:jaceP2P-v2}).  Super-nodes form a circular
 316   network and store, in registers, the identifiers of all the
 317   computing nodes that are connected to the platform and that are not
 318   executing any application.
 319 %  Each super-node has a status table containing the
 320 %  number of connected computing nodes to each super-node and all the
 321 %  super-nodes share a ``token'' that is passed successively from a
 322 %  super-node to the next one. Once a super-node has the token, it
 323 %  computes the average number of computing nodes connected to a
 324 %  super-node ($avg$) using the status table. If $avg$ is lower than
 325 %  the number of computing nodes connected to it, then it sends the
 326 %  identifiers of the extra computing nodes to the super-nodes that
 327 %  have the number of computing nodes connected to them less than
 328 %  $avg$. If the number of computing nodes connected to it has changed,
 329 %  it broadcasts the information to all the super-nodes in the
 330 %  platform. Finally, it passes the token to the next super node. This
 331 %  distribution reduces the load of the super-nodes.
 332   A super-node regularly receives heartbeat messages (represented by
 333   doted lines in figure \ref{fig:jaceP2P-v2}) from the computing nodes
 334   connected to it. If a super-node does not receive a heartbeat
 335   message from a computing node for a given period of time, it
 336   declares that this computing node is dead and deletes its identifier
 337   from the register.
 338
 339 \item The second entity is the ``spawner'' (represented by a square in
 340   figure \ref{fig:jaceP2P-v2}).  When a user wants to execute a
 341   parallel application that requires $N$ computing nodes, he or she
 342   launches a spawner. The spawner contacts a super-node to reserve the
 343   $N$ computing nodes plus some extra nodes. When the spawner receives
 344   the list of nodes from the super-node, it transforms the extra
 345   nodes into spawners (for fault tolerance and scalability reasons)
 346   and stores the identifiers of the rest of the nodes in its own
 347   register. Once the extra nodes are transformed into spawners, they
 348   form a circular network and they receive the register containing the
 349   identifiers of the computing nodes. Then each spawner becomes
 350   responsible for a subgroup of computing nodes, starts the tasks on
 351   the computing nodes under its command and sends a specific register
 352   to them.
 353 %  So each
 354 %  computing node receives a specific register that only contains the
 355 %  identifiers of the daemons it interacts with and that depends on the
 356 %  application being executed. These specific registers reduce the
 357 %  number of messages sent by the spawners to update the register of
 358 %  the daemons after a daemon crashes because usually a small number of
 359 %  daemons is affected by this crash.
 360   If the spawner receives a message from a computing node informing
 361   that one of its neighbors is fallen, it fetches a new one from the
 362   super-node in order to replace the dead one. The spawner initializes
 363   the new daemon, which retrieves the last backup (see next paragraph)
 364   of the dead node and continues the computing task from that
 365   checkpoint.
 366
 367 \item The third entity is the ``daemon'', or the computing node,
 368   (represented in figure \ref{fig:jaceP2P-v2} by a hashed small circle
 369   if it is idle and by a white small circle if it is executing an
 370   application). Once launched, it connects to a super-node and waits
 371   for a task to execute. Once they begin executing an application they
 372   form a circular network which is only used in the failure detection
 373   mechanism. Each daemon can communicate directly with the daemons
 374   whose identifiers are in its register. At the end of a task, the
 375   daemons reconnect to a super-node.
 376 \end{itemize}
 377
 378 To be able to execute asynchronous iterative applications, JaceP2P-V2
 379 has an asynchronous messaging mechanism and to resist daemons'
 380 failures, it implements a distributed backup mechanism called the
 381 uncoordinated distributed checkpointing. This method allows daemons to
 382 save their data on neighboring daemons without any user
 383 intervention. The asynchronous nature of the application allows two
 384 daemons to execute two different iterations, thus each daemon saves
 385 its status without synchronizing with other daemons. This
 386 decentralized procedure allows the platform to be very scalable, with
 387 no weak points and does not require a secure and stable station for
 388 backups. Moreover, since the AIAC model is tolerant to messages loss,
 389 if a daemon dies, the other computing nodes continue their tasks and
 390 are not affected by this failure.
 391 % The application convergence detection is done by daemons, using the
 392 % decentralized global convergence detection algorithm presented in
 393 % \cite{conv_dec}. It consists of two phases: the detection phase and
 394 % the verification phase. This algorithm aims to detect efficiently
 395 % the global convergence of asynchronous iterative parallel algorithms
 396 % on distributed architectures.
 397 For more details on the JaceP2P-V2 platform, readers can refer to
 398 \cite{jaceP2P-v2}.
 399
 400 \subsection{Benefits of mapping}
 401 \label{sec:benef}
 402
 403 In the JaceP2P-V2 environment, presented in the previous section,
 404 there is no effective mapping solution. Indeed, when a user wants to
 405 launch an application, the spawner emits a request to the super-node,
 406 which is in charge of available daemons. Basically, the super-node
 407 returns the amount of requested computing nodes by choosing in its own
 408 list.
 409 %,
 410 %if there are sufficient daemons connected on it, or in other
 411 %super-nodes lists, in addition of its one.
 412 In this method, the super-node only cares about the amount of
 413 requested nodes, it returns in general nodes in the order of their
 414 connection to the platform -- there is no specific selection.
 415 Distributed architectures such as distributed clusters, as can be seen
 416 on Figure \ref{fig:pbdistclust}, are often composed of heterogeneous
 417 clusters linked via heterogeneous networks with high latencies and
 418 bandwidths. As an example the Grid'5000\cite{g5k} testbed is composed
 419 of 23 clusters spread over 9 sites. Those clusters are heterogeneous,
 420 with computing powers starting from bi-cores at 2GHz to
 421 bi-quadri-cores at 2.83GHz with 2Go of memory for the first one to 8Go
 422 for the second. Links relying clusters are 10Gb/s capable, but as many
 423 researchers use this platform, high latencies appear in links between
 424 sites.
 425 %In the targeted architectures, which are distributed clusters, each
 426 %cluster provides a fixed number of homogeneous computing nodes --
 427 %computing nodes are heterogeneous from a cluster to another. Figure
 428 %\ref{fig:pbdistclust} represents such an architecture, in which we can
 429 %see different sites containing several heterogeneous clusters. Each
 430 %cluster is relied by high speed network with clusters in the same site
 431 %and in others. We note that the architecture may not be represented by
 432 %a full connected graph, as some sites are not directly connected to
 433 %some others. The Grid'5000 testbed, described in section
 434 %\ref{sec:g5k}, is an example of such a distributed clusters
 435 %architecture.  Though there are efficient links relying each site, a
 436 %residual latency continues to exist, at local clusters (in the same
 437 %site) as well as distant clusters (from two distinct sites), and can
 438 %penalize performances.
 439
 440 \begin{figure}[ht!]
 441   \centering
 442   \includegraphics[width=7.8cm]{images/dist_clust}
 443   \caption{A distributed clusters architecture}
 444   \label{fig:pbdistclust}
 445 \end{figure}
 446
 447
 448 With such an architecture, it could be
 449 efficient to assign tasks communicating with each other on the same
 450 cluster, in order to improve communications. But, as we use very large
 451 problems, it is quite impossible to find clusters containing as many
 452 computing nodes as requested. So we have to dispatch tasks over
 453 several clusters. That implies to deal with heterogeneity in clusters
 454 computing power and heterogeneity in network. We should make a
 455 trade-off between both components in order to take the best part
 456 of each one to improve the overall performances.
 457 %The
 458 %literature in high performance computing has broadly demonstrated the
 459 %benefits of mapping solutions on the applications execution time.
 460
 461 In order to check if a tasks mapping algorithm would provide
 462 performances improvement in JaceP2P-V2 environment, we have evaluated
 463 the contributions of a simple mapping algorithm, which is described in
 464 section \ref{sec:sma}. These experiments used the NPB Kernel CG
 465 application described in section \ref{sec:cg}, with two problem sizes
 466 (the given problem sizes are the sides sizes of square matrices used)
 467 and using a distributed clusters architecture composed of 102
 468 computing nodes, representing 320 computing cores, spread over 5
 469 clusters in 5 sites. The results of these experiments are given in
 470 Table \ref{tab:benef}.
 471
 472 \renewcommand{\arraystretch}{1.5}
 473
 474 \begin{table}[h!]
 475   \centering
 476   \begin{tabular}{|c|c|c|}
 477     \hline
 478     Problem size&$550,000$&$5,000,000$\\
 479     \hline
 480     Execution Time (without mapping)&141s&129s\\
 481     Execution Time (with mapping)&97s&81s\\
 482     \hline
 483     Gains&$31\%$&$37\%$\\
 484     \hline
 485   \end{tabular}
 486   \caption{Effects of a simple tasks
 487     mapping algorithm on application's execution time}
 488   \label{tab:benef}
 489 \end{table}
 490
 491 As can be seen in Table \ref{tab:benef}, the effects of a
 492 simple tasks mapping algorithm are significant.
 493 %Moreover, we can see
 494 %that are scalable with the application's size, which demonstrates the
 495 %real needs of the platform for a such algorithm.
 496 This encouraged us to look further for better task mapping
 497 algorithms. In the next section, we describe the specificities of our
 498 model and issues which can be exploited.
 499
 500 \section{Problem description}
 501 \label{sec:pb}
 502
 503 In this section we describe the AIAC mapping problem. We first
 504 formalize the different elements we should take into consideration:
 505 the application, the targeted architecture and the objectives
 506 functions of the mapping. We also give a state of the art about
 507 considered kinds of mapping algorithms.
 508
 509 \subsection{Model formalization}
 510 \label{sec:pbmodel}
 511
 512 In this section the models of the applications and architectures we
 513 used are given, with the objectives functions of the mapping
 514 algorithms.
 515
 516 \subsubsection{Application modeling}
 517 \label{sec:pbmodelapp}
 518
 519 In high performance computing, when we want to improve the global
 520 execution time of parallel applications we have to make an efficient
 521 assignation of tasks to computing nodes. Usually, to assign tasks of
 522 parallel applications to computing nodes, scheduling algorithms are
 523 used. These algorithms often represent the application by a graph,
 524 called DAG \cite{dag1,dag2,dag3,dag4} (Directed Acyclic Graph). In
 525 this graph, each task is represented by a vertex which is relied to
 526 others by edges, which represent dependencies and communications
 527 between tasks. This means that some tasks could not start before other
 528 ones finish their computation and send their results. As exposed in
 529 the introduction, in the AIAC model, there is no precedence between
 530 tasks.
 531
 532 Indeed, with the AIAC model, all tasks compute in parallel at the same
 533 time. As communications are asynchronous, there is no synchronization
 534 and no precedence. During an iteration, each task does its job and
 535 sends results to its neighbors and continues with the next
 536 iteration. If a task receives new data from its dependencies, it
 537 includes them and the computation continues with these new data. If
 538 not all dependencies data, or none, are received before starting the
 539 computation of the next iteration, old data are used instead. Tasks
 540 are not blocked on dependencies. Nevertheless regularly receiving new
 541 data allows tasks to converge more quickly. So, it appears that DAG
 542 are not appropriate to modeling AIAC applications. TIG\cite{tig1,
 543   tig2} (Task Interaction Graph) are more appropriate.
 544 %That is why we use the
 545 %TIG\cite{tig1, tig2} (Task Interaction Graph) model instead of DAG,
 546 %which allows to modeling application using tasks and their
 547 %communication dependencies. Figure \ref{fig:tig} gives an example of a
 548 %TIG.
 549
 550 \begin{figure}[h!]
 551   \centering
 552   \includegraphics[width=5cm]{images/tig}
 553   \caption{An example of a TIG of a nine tasks application}
 554   \label{fig:tig}
 555 \end{figure}
 556
 557 In the TIG model, a parallel program is represented by a graph
 558 %, as can
 559 %be seen in Figure \ref{fig:tig}. This graph
 560 $GT(V,E)$, where $V = \{V_1,V_2,\dots V_v\}$ is the set of $|V|$
 561 vertices and $E \subset V \times V$ is the set of undirectional edges
 562 (see Figure \ref{fig:tig}). The vertices represent tasks and the edges
 563 represent the mutual communication among tasks. A function $ET : V
 564 \rightarrow R^+$ gives the computation cost of tasks and $CT : E
 565 \rightarrow R^+$ gives the communication cost for message passing on
 566 edges. We define $v = |V|$, $ET(V_i) = e_i$ and $CT(V_i,V_j) =
 567 c_{ij}$. For example, in Figure \ref{fig:tig}, \mbox{$e_0$ = 10} and $c_{01}
 568 = 2$, $c_{03} = 2$ and $c_{04} = 2$. Tasks in TIG exchange information
 569 during their execution and there is no precedence relationship among
 570 tasks; each task cooperates with its neighbors. This model is used to
 571 represent applications, where tasks are considered to be executed
 572 simultaneously. Temporal dependencies in the execution of tasks are
 573 not explicitly addressed: all the tasks are considered simultaneously
 574 executable and communications can take place at any time during the
 575 computation. That is why vertices and edges are labeled with weights
 576 describing computational and communication costs.\\
 577
 578
 579 \subsubsection{Architecture modeling}
 580 \label{sec:pbmodelarchi}
 581
 582 As TIG models the application, we have to model the targeted
 583 architecture. A distributed clusters architecture can be modeled by a
 584 three-level-graph. The levels are \textit{architecture} (a), in our
 585 study it is the Grid'5000 grid, \textit{cluster} (c) and computing
 586 node (n) levels. Figure \ref{fig:pbdistclust} in section
 587 \ref{sec:benef} shows such a model. Let $GG(N,L)$ be a graph
 588 representing a distributed clusters architecture, where $N =
 589 \{N_1,N_2,\dots N_n\}$ is the set of $|N|$ vertices and $L$ is the set
 590 of undirectional edges. The vertices represent the computing nodes and
 591 the edges represent the links between them. An edge $L_i \in L$ is an
 592 unordered pair $(N_x,N_y) \in N$, representing a communication link
 593 between nodes $x$ and $y$. Let be $|C|$ the number of clusters in the
 594 architecture containing computing nodes. A function $WN : N
 595 \rightarrow R^+$ gives the computational power of nodes and $WL : L
 596 \rightarrow R^+$ gives the communication latency of links. We define
 597 $WN(N_i) = wn_i$ and $WL(L_i,L_j) = wl_{ij}$.
 598
 599 An architecture with a three-level-graph is specified according as
 600 follows. All computing nodes are in the same node level. When
 601 computing nodes can communicate to one another with the same
 602 communication latency, they can be grouped into the same cluster. In
 603 addition, like in the Grid'5000 testbed, if computing nodes seemly
 604 have the same computational power with a low communication latency, a
 605 cluster of these nodes can be defined. All participating clusters,
 606 including computing nodes, are in the same architecture level and
 607 communicate through the architecture network.\\
 608
 609
 610 \subsubsection{Mapping functions}
 611 \label{sec:pbmodelmapping}
 612
 613 After having described the two graphs used to model the application
 614 and the architecture, this section defines our objectives.
 615
 616 When a parallel application $App$, represented by a graph $GT$, is
 617 mapped on a distributed clusters architecture, represented by a graph
 618 $GG$, the execution time of the application, $ET(App)$, can be defined
 619 as the execution time of the slowest task.
 620 %In the AIAC model, tasks
 621 %have, in general, seemly the same work and communication loads; so the
 622 %difference on execution time depends on the executing machines.
 623 Indeed an application ends when all the tasks have detected
 624 convergence and have reached the desired approximation of the
 625 solution, that is why the execution time of the application depends on
 626 the slowest task.
 627 %, this converges last.
 628 We define
 629 \begin{equation}
 630   \label{eq:et}
 631   ET(App) = \max_{i=1 \dots v} ( ET(V_i) )
 632 \end{equation}
 633
 634 where the
 635 %$ET(V_s)$ is the execution time of the slowest task $V_s$. The
 636 execution time of each task $i$ \mbox{($i=1 \dots v$)}, $ET(V_i)$ is
 637 given by
 638 \begin{equation}
 639   \label{eq:ettask}
 640   ET(V_i) = \frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \cdot wl_{ij}
 641 \end{equation}
 642
 643 where $e_i$ is the computational cost of $V_i$, $wn_i$ is the
 644 computational power of the node $N_i$ on which $V_i$ is mapped, $J$
 645 represents the neighbors set of $V_i$, $c_{ij}$ is the amount of
 646 communications between $V_i$ and $V_j$, and $wn_{ij}$ is the link
 647 latency between the computing nodes on which are mapped $V_i$ and
 648 $V_j$.
 649 %Note that in the AIAC model, we only consider for $e_i$ the
 650 %cost of an only one iteration, as we cannot determine in advance the
 651 %amount of iterations a task should complete to converge.
 652 We underline here that in the AIAC model, it is impossible to predict
 653 the number of iterations of a task. So it is difficult to evaluate a
 654 priori the cost $e_i$ of a task. In the remainder, we approximate
 655 $e_i$ by the cost of one iteration.
 656
 657 The mapping problem is similar to the classical graph partitioning and
 658 task assignment problem \cite{npcomp}, and is thus NP-complete.
 659 %The
 660 %generalization of this problem by considering both heterogeneous tasks
 661 %graph and architecture graph makes the problem more complex.
 662 %The JaceP2P-V2 platform is design to execute parallel asynchronous
 663 %iterative application on heterogeneous distributed architectures, so
 664 %it has to launch applications on a variety of computing nodes relying
 665 %by non-uniform links. As demonstrates in the previous section, it does
 666 %not map efficiently tasks on nodes and the overall performances can be
 667 %strongly penalize by this lack of mapping. Indeed, if the
 668 %heterogeneity degree of the architecture used is high, it can be
 669 %possible to have two neighbors tasks executing on two foreign
 670 %computing nodes, that penalizes communications between these tasks,
 671 %that increase the execution time. With the same mining, it is
 672 %interesting to have similar computing nodes, in order to have seemly
 673 %the same iteration execution time that allows to converge more
 674 %quickly. We can conclude that JaceP2P-V2 needs an algorithm to
 675 %efficiently map tasks on computing nodes.
 676 %
 677 %%As we use parallel distributed algorithms on distributed
 678 %%architectures, we have to efficiently map tasks on computing nodes, in
 679 %%order to obtain best performances and the slowest execution time of
 680 %%the application.
 681 %There exist many scheduling and mapping algorithms in the literature,
 682 %and the research is active in this domain. A first approach is to use
 683 %one the numerous scheduling algorithms, but our model does not fit
 684 %with them. Indeed, a scheduling implies that we have tasks which are
 685 %depending on others, more precisely that some tasks cannot start
 686 %computation before having received data from other tasks, which are
 687 %called precedences, or ``strong dependencies''. This class of
 688 %algorithms uses a well known application's representation: the DAG
 689 %(Direct Acyclic Graph). This model cannot be used with our
 690 %problematic. Indeed, in asynchronous iterative applications, all the
 691 %tasks are executed in the same time, in parallel; there is no
 692 %precedence. So we have to look at another way, which is the ``tasks
 693 %mapping''.
 694 %
 695 %The second approach is to use a tasks mapping algorithm, which only
 696 %aims to map tasks on nodes, following a metric, to minimize the
 697 %application's execution time. To determine which class of tasks
 698 %mapping algorithms we should consider, it is important to draw up a
 699 %list of which information we have in hands. When an application using
 700 %the asynchronous iteration model is run, the number of iterations to
 701 %converge to the solution is unpredictable and change from a run to
 702 %another. So, we cannot give a good approximation of the execution time
 703 %of a task, but we can retrieve an estimation of the computational
 704 %amount of a task (plus de détails ?). In addition, as the asynchronous
 705 %iteration model based algorithms are able to provide to applications
 706 %the message lost tolerance, we cannot exactly quantify the amount of
 707 %communications in an application; some algorithms can converge with
 708 %receiving only few dependencies messages. One application's
 709 %representation which fit perfectly with our model is the TIG (Task
 710 %Interaction Graph), which only considers relations between tasks --
 711 %the only thing which is predictable in our model. The figure
 712 %\ref{fig:tig} presents an example of a TIG.
 713 %
 714 %
 715 %
 716 %On figure \ref{fig:tig} we can see an application with 6 tasks, in
 717 %which each task is in relation with tasks of previous and next rank,
 718 %with two exceptions, the first and the last tasks, which can
 719 %eventually be in relation, depending of the application. This is a
 720 %commonly scheme in asynchronous iteration model; in general, tasks are
 721 %in relation with some others which previous and next ranks, but in
 722 %some cases, dependencies are more complex.
 723
 724 \subsection{Related work}
 725 \label{sec:pbrw}
 726
 727 %In the previous section we have determined that we need to use TIG
 728 %model based mapping algorithms to address our problem.
 729 In the literature of the TIG mapping, we can find many algorithms,
 730 which can be divided into two categories:
 731
 732 \myitemize{5}
 733 \item \textbf{Edge-cuts optimization}. The aim of this class of
 734   algorithms is to minimize the use of the penalizing links between
 735   clusters. As tasks are depending on neighbors, which are called here
 736   dependencies, the goal is to choose nodes which distance, in term of
 737   network, is small, to improve communications between tasks. Here we
 738   can cite Metis\cite{metis}, Chaco\cite{chaco} and
 739   PaGrid\cite{pagrid} which are libraries containing such kind of
 740   algorithms. The main drawback of edge-cuts algorithms is that they
 741   do not tackle the computing nodes heterogeneity issues. They only
 742   focus on communication overheads.
 743 %  The figure \ref{fig:edge} shows
 744 %  that the optimization is on edges of communications, which are
 745 %  circled in red.
 746 %
 747 %  \begin{figure}[h!]
 748 %    \centering
 749 %    \includegraphics[width=3.5cm]{images/edgecut}
 750 %    \caption{The edge-cuts optimization}
 751 %    \label{fig:edge}
 752 %  \end{figure}
 753 \item \textbf{Execution time optimization}. The aim of these
 754   algorithms is to minimize the whole execution time of the
 755   application. They look for nodes which can provide the small
 756   execution time of tasks using their computational power. Here we can
 757   cite FastMap\cite{fastmap} and MiniMax\cite{minimax} as such kind of
 758   algorithms. QM\cite{qm_these} is also an algorithm of this category,
 759   but it aims to find for each task the node which can provide the
 760   best execution time. QM works at the task level,
 761   whereas others work at the application level.\\
 762 %  The figure
 763 %  \ref{fig:et} shows that the optimization is on tasks, which are
 764 %  circled in red.\\
 765 %
 766 %  \begin{figure}[h!]
 767 %    \centering
 768 %    \includegraphics[width=3.5cm]{images/exectime}
 769 %    \caption{The execution time optimization}
 770 %    \label{fig:et}
 771 %  \end{figure}
 772 \end{itemize}
 773
 774 The two classes of algorithms may fit with our goals, because in our
 775 model we have both the computational power of nodes and communication
 776 costs may influence the applications performances.
 777
 778 Nevertheless, to the best of our knowledge, none of the existing
 779 algorithms take into consideration the specificities of the AIAC model
 780 (see next section).
 781 %specifically address the AIAC mapping problem.
 782 %As the existing mapping algorithms are not designed to fit
 783 %with the AIAC mapping problem.
 784
 785 \section{AIAC mapping}
 786 \label{sec:aiacmapping}
 787
 788 In this section we present the specificities of the AIAC model, which
 789 are interesting in the mapping problem, and the solution we propose:
 790 the AIAC QM algorithm, which is an extended version of the QM
 791 algorithm.
 792
 793 \subsection{Specificities of the AIAC mapping problem}
 794 \label{sec:specAIACmapping}
 795
 796 An important point to take into consideration in the AIAC model is
 797 that we do not allow the execution of multiple tasks on the same
 798 computing node. This comes from the fact that the targeted
 799 architectures are volatile distributed environments. Assigning
 800 multiple tasks to a node provides a fall of performances when this
 801 node fails. Indeed we should redeploy all of the tasks from this node
 802 to another one, using last saves, which implies to search a new
 803 available computing node, transfer saves to it and restart the
 804 computation from this point (which could be far from this just before
 805 the failure).
 806
 807 Nevertheless, in order to benefit of multi-cores architectures, we use
 808 a task level parallelism by running multi-threaded sequential solver
 809 for example.
 810 %In addition, it is more simple and efficient to parallelize at the
 811 %task level using, as an example with the CG application, a
 812 %multi-threaded linear solver, which benefits of the multi-cores
 813 %architecture of computing nodes.
 814
 815 Another important point in the AIAC model is that we should take into
 816 account precisely the locality issue. This comes from the fact that in
 817 this model, the faster and more frequently a task receives its
 818 dependencies, the faster it converges. Moreover, as the JaceP2P-V2
 819 environment is fault tolerant and tasks save checkpoints on their
 820 neighbors, it is more efficient to save on near nodes than on far
 821 ones.
 822
 823
 824 %As our problem is on task mapping, only considering the dependency
 825 %relations between tasks, with a part of consideration on task weight,
 826 %we should use TIG (\textit{task interaction graph}) mapping model.
 827
 828 %\section{Mapping algorithms}
 829 %\label{sec:mapalgo}
 830
 831 \subsection{AIAC Quick-quality Map}
 832 \label{sec:qmmodif}
 833
 834 %After having describe the different kinds of mapping algorithms which
 835 %can be found in the literature, we now present the three algorithms we
 836 %use to do mapping in the JaceP2P-V2 environment.
 837
 838 We present here the solution we propose, called \textit{AIAC QM
 839   algorithm}, to address the AIAC mapping problem. We decided to
 840 improve the \textit{Quick-quality Map} (QM) algorithm since it is one
 841 of the most accurate to address the TIG mapping problem.
 842 %
 843 %\subsection{Modified Quick-quality Map}
 844 %\label{sec:modifiedqm}
 845 %
 846 %As the previous algorithm describe in section \ref{sec:sma} showed
 847 %that mapping provides a significant increase of applications
 848 %performances (which can be seen in the section \ref{sec:experiments}),
 849 %we decide to try another mapping algorithm, which is a modified
 850 %version of the \textit{Quick-quality Map} (QM) algorithm.
 851 %
 852 %
 853 %As explained in section \ref{sec:pb}, the asynchronous iteration model
 854 %is specific, as it is not a good solution to map many tasks on the
 855 %same node. This is why QM has been modified to take into account these
 856 %characteristics.
 857 % Indeed, originally QM tries to map many tasks on the
 858 %same node to improve the execution time of tasks by decreasing
 859 %communications costs. This solution can be good if communications
 860 %between tasks are heavy and if we consider that computing nodes are
 861 %stable and are not volatile. As
 862
 863 %, we have modified some parts
 864 %of it to fit with our constraints. This was an opportunity to be taken
 865 %to insert a little part of ``edge-cuts'' optimization, as in our model
 866 %communications have to be taken into account.
 867 In its original version, this algorithm aims at privileging the
 868 computational power of nodes. Indeed, its aim is to find the more
 869 powerful node to map a task on. Moreover, a part of this algorithm is
 870 designed to map multiple tasks on the same node, in order to improve
 871 local communications. This solution can be efficient if communications
 872 between tasks are heavy and if we consider that computing nodes are
 873 stable and not volatile. This last point is in contradiction with
 874 our model, as we authorize only the execution of one task on a single
 875 node -- this allows to lose only the work of a single task in case of
 876 node's fault, with a low cost on restarting mechanism. Instead
 877 assigning multiple tasks on the same computing node, our mapping
 878 algorithm tries to keep tasks locality, to improve communications, by
 879 trying to assign tasks to computing nodes in the neighborhood
 880 of which their neighbors are mapped on.
 881
 882 The pseudo-code of AIAC QM is given in Algorithm \ref{alg:qmmodified}.
 883
 884 \SetAlgoSkip{}
 885 \begin{algorithm}
 886   \SetLine
 887   \dontprintsemicolon
 888
 889   \KwIn{Sets of tasks and computing nodes}
 890   \KwOut{Mapping of tasks to nodes}
 891
 892   \BlankLine
 893
 894   sort nodes by descending power\;
 895   map tasks in order on nodes\;
 896   set all tasks \textit{moveable}\;
 897   $r \leftarrow 1$\;
 898
 899   \BlankLine
 900
 901   \While{one task is moveable}{
 902     \For{each task $t_{i}$ $\&\&$ $t_{i}$ is moveable }{
 903       $n_{c} \leftarrow$ current node of $t_{i}$\;
 904       $n_{n} \leftarrow t_{i}$\;
 905
 906       \BlankLine
 907
 908       \For{$k = 0 ; k < \frac{f \cdot n}{r} ; k++$}{
 909         select random node $n_{r}$ in $[0,\frac{n}{r}]$\;
 910         \If{ET($t_{i}$,$n_{r}$) $<$ ET($t_{i}$,$n_{n}$)}{
 911           $n_{n} \leftarrow n_{r} $\;
 912         }
 913       }
 914
 915       \BlankLine
 916
 917       \For{each node $n_{v}$ near $dep(t_{i})$}{
 918         \If{ET($t_{i}$,$n_{v}$) $<$ ET($t_{i}$,$n_{n}$)}{
 919           $n_{n} \leftarrow n_{v} $\;
 920         }
 921       }
 922
 923       \BlankLine
 924
 925       \If{$n_{n} \neq n_{c}$}{
 926         map $t_{i}$ on $n_{n}$\;
 927         update ET of $t_{i}$ and dep($t_{i}$)\;
 928       }
 929     }
 930
 931     \BlankLine
 932
 933     set $t_i$ not moveable\;
 934     $r \leftarrow r+1$ if all tasks have been considered\;
 935   }
 936
 937   \caption{The AIAC QM}
 938   \label{alg:qmmodified}
 939 \end{algorithm}
 940 %\vspace{-0.5cm}
 941
 942 All nodes are first sorted in descending order according to their
 943 computation power, and all tasks are mapped on these nodes according
 944 to their identifier (they are also marked as ``moveable'', that means
 945 that each task can be moved from a node to another). As in the
 946 original QM algorithm, AIAC QM keeps track of the \textit{number of
 947   rounds} $r$ ($r > 0$), that all tasks have been searched for a
 948 better node. This allows to reduce at each round the number of
 949 considered nodes. While there is at least one moveable task, it
 950 performs for each moveable task the search for a better node. It
 951 chooses a set of nodes, $\frac{f \cdot n}{r}$, where $f$ is defined as
 952 the search factor and $n$ is the number of nodes. $r$ and $f
 953 \in ]0,1]$ control the portion of nodes that will be considered where
 954 more numerous the rounds are, the less the considered nodes will
 955 be. Then the algorithm estimates the execution time $ET(v)$ of the
 956 task on each node. If it is smaller than the current node on which the
 957 task is mapped on, this node becomes the new potential node for
 958 task $t_i$.
 959
 960 After having randomly searched for a new node, the AIAC QM tries to
 961 map the task on nodes that are neighbors of nodes of which the
 962 dependencies of $t_i$ are mapped on. This is one of the major
 963 modification to the original QM algorithm. It introduces a little part
 964 of ``edge-cuts'' optimization. In the original version, it tries to
 965 map the task $t_i$ on the same node of one as its dependencies. As
 966 explain in \ref{sec:specAIACmapping}, this is not an acceptable
 967 solution in our case. Instead, the algorithm now searches to map task
 968 $t_i$ on nodes which are near the ones its dependencies are mapped
 969 on. This search requires a parameter which indicates the maximum
 970 distance at which nodes should be from the node of dependencies of
 971 $t_i$.
 972
 973 At the end of the algorithm, if a new node is found, $t_i$ is mapped
 974 on and its execution time is updated and $t_i$ is set to ``not
 975 moveable''. The execution time of each of its dependencies is also
 976 updated, and if this new execution time is higher than the previous,
 977 the task is set to ``moveable''. And finally, if all tasks have been
 978 considered in this round, $r$ is incremented.
 979
 980 The complexity of the AIAC QM algorithm is about $O(n^2 \cdot
 981 t \cdot ln(r))$. This complexity is the same as the original
 982 (details are given in \cite{qm_these}, with a an increase of a factor
 983 $n$, corresponding to the edge-cuts part.
 984
 985 \section{Experimentation}
 986 \label{sec:expe}
 987
 988 We now describe the experiments we have conducted and their
 989 components, to evaluate the effects of the AIAC QM algorithm on
 990 application execution time.
 991
 992 \subsection{The NAS Parallel Benchmark Kernel CG}
 993 \label{sec:cg}
 994
 995 We used the ``Kernel CG'' of the NAS Parallel Benchmarks (NPB)
 996 \cite{nas} to evaluate the performances of the mapping
 997 algorithm. This benchmark is designed to be used on large
 998 architectures, because it tests communications over latency networks,
 999 by processing unstructured matrix vector multiplication. In this
1000 benchmark, a Conjugate Gradient is used to compute an approximation of
1001 the smallest eigenvalue of a large, sparse and symmetric positive
1002 definite matrix, by the inverse power method. In our tests, the whole
1003 matrix contains nonzero values, in order to stress more
1004 communications. As the Conjugate Gradient method cannot be executed
1005 with the asynchronous iteration model we have replaced it by another
1006 method called the multisplitting method. This latter supports the
1007 asynchronous iterative model.
1008
1009
1010 With the multisplitting algorithm, the $A$ matrix is split into
1011 horizontal rectangle parts, as Figure \ref{fig:multisplit} shows. Each
1012 of these parts is affected to a processor -- so the size of data
1013 depends on the matrix size but also on the number of participating
1014 nodes. In this way, a processor is in charge of computing its $X Sub$
1015 part by solving the following subsystem: $ASub \times XSub = BSub -
1016 DepLeft \times XLeft - DepRight \times XRight$
1017
1018 After solving $XSub$, the result must be sent to other
1019 processors which depend on it.
1020
1021 \begin{figure}[h!]
1022   \centering
1023   \includegraphics[width=7.4cm]{images/multisplit}
1024   \caption{Data decomposition for the multisplitting method
1025     implementation}
1026   \label{fig:multisplit}
1027 \end{figure}
1028
1029 The multisplitting method can be decomposed into four phases:
1030 \begin{enumerate}
1031 \itemsep=5pt
1032 \item \textbf{Data decomposition}. In this phase, data are allocated
1033   to each processor assuming the decomposition exposed on figure
1034   \ref{fig:multisplit}. Then, each processor iterates until converge
1035   on the following.
1036 \item \textbf{Computation}. At the beginning of the computation, each
1037   processor computes
1038     $BLoc = BSub - DepLeft \times XLeft - DepRight \times XRight$. Then,
1039   it solves $ASub \times XSub = BLoc$ by using a
1040   multi-threaded sequential version of the Conjugate Gradient method.
1041 \item \textbf{Data exchange}. Each processor sends its $XSub$
1042   part to its neighbors. Here, the neighborhood is closely related
1043   to the density of the $A$ matrix. Clearly, a dense matrix implies an
1044   \textit{all-to-all} communication scheme while a matrix with a low
1045   bandwidth reduces the density of the communication scheme.
1046 \item \textbf{Convergence detection} Each processor computes its local
1047   convergence and sends it to a server node. When this one detects
1048   that each processor has converged, it stops the whole computation
1049   process.
1050 \end{enumerate}
1051
1052 %It can be pointed out here that it is possible to modify the data
1053 %decomposition in order to obtain non disjoint rectangle matrices. This
1054 %property of multisplitting methods, called \textit{overlapping}, can
1055 %bring significant improvements to convergence speed, but it is not the
1056 %aim of this paper.
1057 For more details about this method, interested readers are invited to
1058 see \cite{book_raph}. In our benchmark, the sequential solver part of
1059 the multisplitting method is the Conjugate Gradient, using the
1060 MTJ\cite{mtj} library. Its implementation is multi-threaded, so it
1061 benefits from multi-core processors.
1062
1063 We point out here that this benchmark is a typical AIAC
1064 application. The general form of the TIG for this application is given
1065 by Figure \ref{fig:tigcg}.
1066
1067 \begin{figure}[h!]
1068   \centering
1069   \includegraphics[width=8cm]{images/tigcg2}
1070   \caption{Part of the form of the TIG representing an instance of the
1071     NAS Kernel CG application}
1072   \label{fig:tigcg}
1073 \end{figure}
1074
1075 This figure shows 6 tasks, which are represented by a circle in which
1076 the identifier of the task is given. In our study, we consider that
1077 the computational costs of tasks are approximately the same and that the
1078 communications costs also the same (this comes from the difficulty to
1079 evaluate real costs in the AIAC model).
1080 % The computational cost of a task is given by the number on the top
1081 % left-hand side of each circle (for example the cost of task 31 is
1082 % 1000). Communications between tasks are represented by edges on
1083 % which the amount of communication is given (for example, the
1084 % communication cost between tasks 29 and 30 is about 30).
1085 Doted lines represent communications with tasks which are not
1086 represented on the figure. We can see here that each task has four
1087 neighbors (the two previous and the two next). This amount of
1088 neighbors is directly related to the bandwidth of the matrix (in this
1089 example the bandwidth is very small). For more details about the
1090 influence of the bandwidth on the amount of neighbors, interested
1091 readers are invited to see \cite{largescale}.
1092
1093 For our experiments the bandwidth of matrices has been reduced in
1094 order to limit the dependencies and we fixed it to $35,000$. This
1095 bandwidth size generates, according to the problem's size, between 10
1096 and 25 neighbors per tasks.
1097
1098 \subsection{The Grid'5000 platform}
1099 \label{sec:g5k}
1100
1101 The platform used for our tests, called Grid’5000\cite{g5k}, is a
1102 French nationwide experimental set of clusters which provides a
1103 configurable and controllable instrument. We can find many clusters
1104 with different kinds of computers with various specifications and
1105 software.
1106
1107 \begin{figure}[h!]
1108   \centering
1109   \includegraphics[height=6.5cm]{images/g5k-noms}
1110   \caption{The Grid'5000 sites map}
1111   \label{fig:g5ksite}
1112 \end{figure}
1113
1114 Clusters are spread over 9 sites, as can be seen on Figure
1115 \ref{fig:g5ksite}, and the computing power represents more than 5000
1116 computing cores interconnected by the ``Renater'' network. This
1117 network is the national network for research and education; it
1118 provides a large bandwidth with high latency. Intra-clusters networks
1119 present small bandwidth and low latencies.
1120
1121
1122 \subsection{Other mapping algorithms}
1123 \label{sec:othermaping}
1124
1125 In this section we present the two other mapping algorithms we used
1126 in our experiments to compare the performances of the AIAC QM
1127 algorithm. The first one was used to evaluate the benefits of a
1128 mapping solution in section \ref{sec:benef}. The second one was used
1129 to show the differences between the two mapping class, the ``execution
1130 time'' and the ``edge-cuts'' optimizations, as it is a fully edge-cut
1131 optimization based mapping algorithm.
1132
1133 \subsubsection{A Simple Mapping algorithm}
1134 \label{sec:sma}
1135
1136 %As mentioned in section \ref{sec:pb}, the asynchronous iteration model
1137 %has some specificities which distinguishes it from other classical model
1138 %for the mapping.
1139 %
1140 %The first thing we have done was to be sure that a mapping algorithm
1141 %would enhance applications performances.
1142 The \textit{Simple Mapping algorithm} (SMa) was designed to
1143 show the benefits of a mapping algorithm in the JaceP2P-V2 platform.
1144 %The text of the \textit{Simple Mapping} if given by
1145 Algorithm \ref{alg:sma} gives the pseudo-code of the \textit{Simple
1146   Mapping algorithm}.
1147 %, in which we can see that it is very simple, with a complexity in $O(n^
1148 %2)$ resulting from sort methods.
1149
1150
1151 \SetAlgoSkip{}
1152 \begin{algorithm}
1153   \SetLine
1154   \dontprintsemicolon
1155   \KwIn{Sets of tasks and computing nodes}
1156   \KwOut{Mapping of tasks to nodes}
1157
1158   \BlankLine
1159
1160   sort computing nodes by cluster\;
1161   sort clusters by size, from higher to lower\;
1162   map tasks in order on sorted clusters\;
1163   \caption{The Simple Mapping algorithm}
1164   \label{alg:sma}
1165 \end{algorithm}
1166
1167
1168 %The aim of this algorithm is to do similarly to a very simple
1169 %edge-cuts optimization algorithm. To do that
1170 The algorithm puts each node in a cluster entity.
1171 %, which is given for now by the convention name of nodes
1172 %on Grid'5000 (for example azur-3.sophia.grid500.fr is the node 3 of
1173 %the azur cluster in the sophia site).
1174 Then it sorts clusters by their size, from the higher to the lower.
1175 %; this operation is the part which
1176 %is inspired by edge-cuts optimization based algorithms.
1177 Finally, all tasks are mapped in order on the sorted clusters; each
1178 task is assigned to a particular computing node of the chosen cluster.
1179
1180 %Though this algorithm is trivial, it allows AIAC applications to run faster
1181 %on distributed clusters architectures, with a gain over $30\%$ on
1182 %execution, as experiments described in section \ref{sec:benef}.
1183
1184 \subsubsection{Edge-cuts optimization}
1185 \label{sec:edgcutalgo}
1186
1187 As explained in section \ref{sec:pb}, the asynchronous iteration model
1188 is so specific and unpredictable that we would like to evaluate the
1189 second kind of mapping algorithm, which aims to optimize the
1190 ``edge-cuts''.
1191 %Multilevel partitioning algorithms such as Metis and
1192 %Chaco fail to address the limitations imposed by heterogeneity in the
1193 %underlying targeted system. They assume that computing nodes and
1194 %network relying them are homogeneous. This is not corresponding to our
1195 %execution environment, which is fully heterogeneous. These methods are
1196 %based on ``graph growing'' (GGP) and/or ``greedy graph growing''
1197 %(GGGP) algorithms which aim to divide tasks into two, or for some
1198 %algorithm a power of two, partitions. In our case, we do not know in
1199 %advance the number of partitions we need. Indeed, it depends on the
1200 %amount of nodes in each cluster and on the number of tasks.
1201 %
1202 %As GGP and GGGP algorithms seems to be efficient in specific cases, it
1203 %could be interesting to adapt one to our model, in order to evaluate a
1204 %real edge-cuts optimization algorithm.
1205 We choose the Farhat's algorithm\cite{farhat}, which has the ability
1206 to divide the graph into any number of partitions, thereby avoiding
1207 recursive bisection.
1208 %Therefore its running execution time is
1209 %independent of the desired number of subsections.
1210
1211 The adapted version of this algorithm, Farhat's Edge-Cut (F-EC),
1212 evaluated in the JaceP2P-V2 environment is described in Algorithm
1213 \ref{alg:edgecuts}.
1214
1215 \SetAlgoSkip{}
1216 \begin{algorithm}
1217   \SetLine
1218   \dontprintsemicolon
1219
1220   \KwIn{Sets of tasks and computing nodes}
1221   \KwOut{Mapping of tasks to nodes}
1222
1223   \BlankLine
1224
1225
1226   sort nodes by cluster\;
1227   $lTasks \leftarrow$ sort tasks by dep degree\;
1228   $changeCluster \leftarrow$ true\;
1229   $cTasks \leftarrow$ empty;
1230
1231   \BlankLine
1232
1233   \While{one task is not mapped}{
1234     \If{$changeCluster$}{
1235       $curCluster \leftarrow$ nextCluster()\;
1236       $places \leftarrow$ size($curCluster$)\;
1237       $changeCluster \leftarrow$ false\;
1238       $mTasks \leftarrow$ empty\;
1239     }
1240
1241     \BlankLine
1242
1243     \If{no task in cTasks}{
1244       $cTasks \leftarrow$ first task from $lTasks$\;
1245     }
1246
1247     \BlankLine
1248
1249     $curTask \leftarrow$ first task in $cTasks$\;
1250
1251     \BlankLine
1252
1253     \eIf{$( 1 + $dep(curTask)~$\cdot~\delta) <= places)$}{
1254       remove $curTask$ from $cTasks$\;
1255       add $curTask$ in $mTasks$\;
1256       $places \leftarrow places - 1$\;
1257       add dep(curTask) in cTasks\;
1258     }{
1259       $changeCluster$ $\leftarrow$ true\;
1260       associate $mTasks$ with $curCluster$\;
1261     }
1262   }
1263
1264   \caption{The Fahrat's Edge-Cut algorithm}
1265   \label{alg:edgecuts}
1266 \end{algorithm}
1267
1268 This algorithm aims to do a ``clusterization'' of the tasks. First, it
1269 groups computing nodes in clusters, which are sorted according to
1270 their number of nodes, from the higher to the lower. Tasks are ordered
1271 following their dependency degree, starting from the higher to the
1272 lower. Tasks in the top of the list have a higher priority to be
1273 mapped. Next, the algorithm tries to map on each cluster the maximum
1274 number of tasks. To map a task on a cluster, the algorithm evaluates
1275 if there is enough place to map the task and some of its
1276 dependencies. This amount of dependencies is fixed by a factor
1277 $\delta$, which is a parameter of the algorithm. In the positive case,
1278 the task is mapped on the current cluster and its dependencies become
1279 priority tasks to be mapped. This allows to keep the focus on the
1280 communicating tasks locality.
1281 %All computing nodes are first grouped in clusters.
1282 %%, using their full
1283 %%qualified name -- for now, no criteria is used to sort clusters, such
1284 %%are their power or size.
1285 %Next, tasks are inserted in a list $lTasks$, sorted in descending
1286 %according to their dependency degree. This allows to map tasks with
1287 %their dependencies to improve communications. Next begins the main
1288 %mapping loop.
1289 %
1290 %While not all tasks have been mapped the algorithm do the
1291 %following. First, at the beginning, it chooses the first available
1292 %cluster to be the current cluster to map tasks on, and sets the
1293 %variable $places$ to its size, in number of computing nodes, and
1294 %create a new tasks' list $mTasks$. During the execution of the
1295 %algorithm, passing in this block means that there is no more node
1296 %available in the current cluster, so it has to choose the next
1297 %cluster. Next, it looks in $cTasks$, which is a tasks list containing
1298 %tasks which are in instance to be mapped. If $cTasks$ is empty it
1299 %takes the first available task in $lTasks$ -- at the beginning it
1300 %takes the first one, which has the higher dependency degree.
1301 %
1302 %Next, it takes the first task in $cTasks$ to try to map it on the
1303 %current cluster. The task is mapped on if there is enough available
1304 %computing nodes on the current cluster. This amount of node
1305 %corresponds to $1 + dep( curTask ) \cdot \delta$, which represents the
1306 %task plus the amount of its dependencies multiplied by a ``local
1307 %dependency factor'' $\delta$, which indicates how many dependencies
1308 %must be a least with $curTask$ on the same cluster. If the task could
1309 %be mapped on this cluster, $curTask$ is added in $mTasks$, which is a
1310 %tasks' list containing tasks which are currently mapped on current
1311 %cluster, and removed it from $cTasks$. The number of available nodes,
1312 %$places$, is decremented, and the dependencies of $curTask$ are added
1313 %to $cTasks$. This list is sorted in the same way as $lTasks$, indeed
1314 %tasks are added according to their dependency degree, from the higher
1315 %to the lower. Otherwise, if there is not enough or no more nodes
1316 %available in the cluster, it has to change the current cluster, so it
1317 %associates the current cluster and list of tasks mapped on, $mTasks$.
1318
1319
1320 \subsection{Experiments}
1321 \label{sec:experiments}
1322
1323 After having described the different components of the experiments, we
1324 now present the impacts of the AIAC QM mapping on applications running
1325 with JaceP2P-V2 on a heterogeneous distributed clusters
1326 architecture. In the following, we note ``heterogeneity degree'' the
1327 degree of heterogeneity of distributed clusters; it is the ratio
1328 between the average and the standard deviation of the computing nodes
1329 power. This heterogeneity degree may vary from 0, nodes are
1330 homogeneous, to 10, nodes are totally heterogeneous. In these
1331 experiments, we consider that there is no computing nodes fault during
1332 applications execution.
1333
1334 The application used to realize these experiments is the KernelCG of
1335 the NAS parallel benchmark, in the multi-splitting version. Two
1336 problem sizes were used: one using a matrix of size $550,000$ (named
1337 ``class E'') using 64 computing nodes and the other using a matrix of
1338 size $5,000,000$ (named ``class F'') using 128 nodes.
1339 %
1340 %Figure
1341 %\ref{fig:tigcg} shows a part of a TIG of this application.
1342 %
1343 %\begin{figure}[h!]
1344 %  \centering
1345 %
1346 %  \label{fig:tigcg}
1347 %\end{figure}
1348
1349
1350
1351 \subsubsection{About heterogeneity}
1352 \label{sec:xphetero}
1353
1354 The first experiments concern the study of the impact of the
1355 heterogeneity of the computing nodes on the mapping
1356 results. Heterogeneity takes an important place in the high
1357 performance computing on grid, all the more so when using the
1358 asynchronous iteration model.
1359
1360 As mapping algorithms take in parameter a factor of research (for AIAC
1361 QM) and the amount of local dependencies (for F-EC), we fixed both to
1362 $50\%$. That means for AIAC QM that at each round the amount of
1363 considering nodes would be divided by two, and for F-EC that each task
1364 requires half of its dependencies on the same local cluster.
1365
1366 Four experiments were done using four architectures having different
1367 heterogeneity degrees -- in two architectures computing nodes are more
1368 heterogeneous than in the others. In these experiments, we did not
1369 affect the networks heterogeneity, because of the difficulty to
1370 disturb and control network on Grid'5000; by default, networks are
1371 already quite heterogeneous. We needed more than 200 computing nodes
1372 to execute our application because of the small capacity of some
1373 clusters to execute the largest problems (there is not enough memory).
1374
1375 The first architecture, Arc1.1, was composed of 113 computing nodes
1376 representing 440 computing cores, spread over 5 clusters in 4
1377 sites. In Arc1.1 we used bi-cores (2 clusters), quadri-cores (2
1378 clusters) and bi-quadri-cores (1 cluster) machines. Its heterogeneity
1379 degree value is 6.43. This architecture was used to run class E of the CG
1380 application using 64 computing nodes. The second architecture, Arc1.2,
1381 used to execute class F of the CG application, using 128 computing
1382 nodes, was composed of 213 computing nodes representing 840 computing
1383 cores, with a heterogeneity degree of 6.49. This architecture was
1384 spread on the same clusters and sites as Arc1.1. The results of the
1385 experiments on Arc1.1 and Arc1.2 are given in Table \ref{tab:exph1E}
1386 and Table \ref{tab:exph1F}, which give the gains in execution time obtained
1387 in comparison to the version without mapping.
1388
1389 %\vspace{0.2cm}
1390
1391 \renewcommand{\arraystretch}{1.5}
1392
1393 \begin{table}[h!]
1394   \centering
1395   \begin{tabular}[h!]{|c||c|c|c|c|}
1396     \hline
1397     Algorithm& None&SMa & AIAC QM & F-EC \\
1398     \hline
1399     \hline
1400     Execution time&150s&110s&101s&90s\\
1401     \hline
1402 %    Gains&--&$27\%$&$33\%$&\textcolor{blue}{$40\%$}\\
1403     Gains&--&$27\%$&$33\%$&$40\%$\\
1404     \hline
1405   \end{tabular}
1406   \caption{Gains in time of the execution of the class E of the CG
1407     application on Arc1.1 using 64 computing nodes, with mapping
1408     algorithms}
1409   \label{tab:exph1E}
1410   \vspace*{-0.3cm}
1411 \end{table}
1412
1413 %\vspace{-0.5cm}
1414
1415 \renewcommand{\arraystretch}{1.5}
1416
1417 \begin{table}[h!]
1418   \centering
1419   \begin{tabular}[h!]{|c||c|c|c|c|}
1420     \hline
1421     Algorithm& None &SMa & AIAC QM & F-EC \\
1422     \hline
1423     \hline
1424     Execution time&403s&265s&250s&218s\\
1425     \hline
1426 %    Gains&--&$34\%$&$38\%$&\textcolor{blue}{$46\%$}\\
1427     Gains&--&$34\%$&$38\%$&$46\%$\\
1428     \hline
1429   \end{tabular}
1430   \caption{Gains in time of the execution of the class F of the CG
1431     application on Arc1.2 using 128 computing nodes, with mapping
1432     algorithms}
1433   \label{tab:exph1F}
1434   \vspace*{-0.3cm}
1435 \end{table}
1436
1437 At first, we can see that the Simple Mapping algorithm, though it is
1438 simple, provides a significant improvement of application execution
1439 time. This highlights that JaceP2P-V2 really needs a mapping algorithm
1440 in order to be more efficient. Then, we can see that the F-EC and the
1441 AIAC QM algorithms provide a better mapping than the Simple Mapping
1442 algorithms, we can see a significant difference between both
1443 algorithms. This comes from the homogeneity of clusters. In this case,
1444 the F-EC algorithm is more efficient since the minimization of the
1445 communications becomes more important than the tackle of the
1446 computational power heterogeneity problem.
1447 %Indeed, it is more benefic for ta becomes more important than the
1448 %tacklebecomes more important than the
1449 %tackle sks to have locally their
1450 %dependencies, which allows to improve communications, in case of
1451 %computing nodes are more homogeneous -- communications are more
1452 %important than computing power (that is why the F-EC algorithm is more
1453 %efficient).
1454 The effect is that tasks do less iterations as they
1455 receive more frequently updated data from their neighbors. In
1456 addition, as tasks and their dependencies are on the same cluster,
1457 communications are improved, but also as computations take
1458 approximately the same time, the amount of iterations is reduce and
1459 the algorithm can converge more quickly.
1460
1461 Another important positive point is that gains are scalable, which allows
1462 to foresee big improvements for very large applications.\\
1463
1464 The third architecture, Arc2.1, was composed of 112 computing nodes,
1465 representing 394 computing cores, spread over 5 clusters in 5
1466 sites. In this architecture we used bi-cores (3 clusters),
1467 quadri-cores (1 cluster) and bi-quadri-cores (1 cluster) machines. Its
1468 heterogeneity degree's value is 8.41. This architecture was used to run
1469 class E of the CG application, using 64 computing nodes. The fourth
1470 architecture, Arc2.2, used to execute class F of the CG
1471 application, using 128 computing nodes, was composed of 212 computing
1472 nodes representing 754 computing cores, with a degree of heterogeneity
1473 of 8.44. This architecture was spread on the same clusters and sites
1474 as Arc2.1. The results of the experiments on Arc2.1 and Arc2.2 are
1475 given in Table \ref{tab:exph2E} and Table \ref{tab:exph2F}, which give the
1476 gains in execution time obtained in comparison to the version without
1477 mapping.
1478
1479 %\vspace{0.2cm}
1480
1481 \renewcommand{\arraystretch}{1.5}
1482
1483 \begin{table}[h!]
1484   \centering
1485   \begin{tabular}[h!]{|c||c|c|c|c|}
1486     \hline
1487     Algorithm&None& SMa & AIAC QM & F-EC \\
1488     \hline
1489     \hline
1490     Execution time&498s&341s&273s&385s\\
1491     \hline
1492 %    Gains&$32\%$&\textcolor{blue}{$45\%$}&\textcolor{red}{$23\%$}\\
1493     Gains&--&$32\%$&$45\%$&$23\%$\\
1494     \hline
1495   \end{tabular}
1496   \caption{Gains in time of the execution of the class E of the CG
1497     application on Arc2.1 using 64 computing nodes, with mapping
1498     algorithms}
1499   \label{tab:exph2E}
1500   \vspace*{-0.3cm}
1501 \end{table}
1502
1503 \renewcommand{\arraystretch}{1.5}
1504
1505 \begin{table}[h!]
1506   \centering
1507   \begin{tabular}[h!]{|c||c|c|c|c|}
1508     \hline
1509     Algorithm& None&SMa & AIAC QM & F-EC \\
1510     \hline
1511     \hline
1512     Execution time&943s&594s&453s&660s\\
1513     \hline
1514 %    Gains&$37\%$&\textcolor{blue}{$52\%$}&\textcolor{red}{$30\%$}\\
1515     Gains&--&$37\%$&$52\%$&$30\%$\\
1516     \hline
1517   \end{tabular}
1518   \caption{Gains in time of the execution of the class F of the CG
1519     application on Arc2.2 using 128 computing nodes, with mapping
1520     algorithms}
1521   \label{tab:exph2F}
1522   \vspace*{-0.3cm}
1523 \end{table}
1524
1525 To begin with, these experiments confirm that a mapping algorithm is
1526 needed and that improvements are always scalable. Then, we can see
1527 that the F-EC algorithm falls in performances and AIAC QM is
1528 improved. What is surprising is that the Simple Mapping algorithm is
1529 better than F-EC. This can be explained by the fact that as computing
1530 nodes are quite heterogeneous, computations are not the same, so it is
1531 not significant to map dependencies close to tasks. In this case, the
1532 most important is the power of computing nodes. So, in this kind of
1533 architecture, it is more efficient to choose the best computing nodes
1534 to compute iterations more quickly and to improve the convergence
1535 detection.
1536
1537 Here, it is important to note that the AIAC QM algorithm offers a gain
1538 of about $50\%$ on the execution time, that is to say that the
1539 application takes half of the execution time than without mapping.
1540
1541 \subsubsection{Parameters variation}
1542 \label{sec:xpvariation}
1543
1544 After having evaluated mapping algorithms on the heterogeneity of
1545 distributed clusters, we now propose to change the parameters of AIAC
1546 QM and F-EC algorithms, in order to determine which values are the
1547 most accurate.
1548
1549 To do these experiments, we used an architecture composed of 122
1550 computing nodes representing 506 computing cores, spread over 5
1551 clusters in 5 sites. In this architecture we used bi-cores (2
1552 clusters), quadri-cores (2 clusters) and bi-quadri-cores (1 cluster)
1553 machines. Its heterogeneity degree value is 4.98.
1554 %, which means that
1555 %computing nodes power is very heterogeneous.
1556
1557 The parameters of each algorithm, $f$ (the search factor) for
1558 AIAC QM and $\delta$ (the amount of local dependencies) for F-EC,
1559 varied both with values $10\%$, $50\%$ and $90\%$. We used the CG
1560 multi-splitting application on 64 computing nodes. The results of
1561 these experiments are given in Table \ref{tab:expparams}. Results
1562 exposed in this table represent the gains in execution time provided
1563 by each algorithm with different parameters values.
1564
1565 %\vspace{0.2cm}
1566 \renewcommand{\arraystretch}{1.5}
1567
1568 \begin{table}[h!]
1569   \centering
1570   \begin{tabular}[h!]{|c||c|c|c|}
1571     \hline
1572     Parameters& $10\%$ & $50\%$ & $90\%$ \\
1573     \hline
1574     \hline
1575 %    Simple & \multicolumn{3}{c|}{\textcolor{blue}{$39\%$}}\\
1576     SMa & \multicolumn{3}{c|}{$30\%$}\\
1577     \hline
1578 %    AIAC QM & $35\%$ & \textcolor{blue}{$41\%$} & $35\%$ \\
1579     AIAC QM & $30\%$ & $32\%$ & $30\%$ \\
1580     \hline
1581 %    F-EC & $39\%$ & $35\%$ & \textcolor{blue}{$45\%$} \\
1582     F-EC & $40\%$ & $37\%$ & $45\%$ \\
1583     \hline
1584   \end{tabular}
1585 %  \caption{Parameters variations using a $500'000$ problem size on an
1586 %    architecture of 5.37 heterogeneity degree}
1587   \caption{Gains in execution time with mapping algorithms parameters
1588     variations using the class E of the CG application using 64
1589     computing nodes}
1590 %  \vspace*{-0.4cm}
1591   \label{tab:expparams}
1592 %  \vspace*{-0.9cm}
1593 \end{table}
1594
1595 %First of all, we can see that the Simple mapping provides the same
1596 %order of performances, as shown in the precedent section, so it is
1597 %not affected by the heterogeneity degree. Secondly,
1598 For the AIAC QM algorithm, we can note that the best value for its
1599 parameter $f$ is about $50\%$, but its impact is not big enough to
1600 indicate a specific configuration.
1601 % With a low heterogeneity degree, this mapping algorithm provides a
1602 % good performances improvement.
1603 Finally, and this is not surprising, the F-EC algorithm is more
1604 efficient with a factor $\delta$ near $100\%$, which directly comes
1605 from its aim. But we can see that it is more efficient to have a
1606 factor around $10\%$ than having one around $50\%$.
1607
1608 We can note here, with a lower heterogeneity degree than in previous
1609 experiments, gains are lower and the difference between AIAC QM and
1610 F-EC (with parameters at $50\%$) is lower. It can be explained as the
1611 fact that more the heterogeneity degree tends to 0 more computing
1612 nodes are the same, so a mapping solution will not be efficient,
1613 except one only optimizing network latency.
1614 %These experiments show that the impact of parameters values does not
1615 %well influence the AIAC QM, whereas it is very important for the F-EC
1616 %algorithm.
1617
1618 %\section{Discussion}
1619 %\label{sec:discussion}
1620 %
1621 %In this paper three algorithms for mapping asynchronous iterative
1622 %applications on heterogeneous distributed architectures are described.
1623 %This approach is relatively new, so we had to study related works. We
1624 %assume in our model that the amount of computation, the computational
1625 %cost of each task, cannot be known in advance, and it is the same
1626 %about communications. Two mains class of algorithms are mainly used,
1627 %the ``execution time'' and the ``edge-cuts'' optimizations, with a
1628 %newer preference fort the first one. Indeed, the efficiency of second
1629 %one is criticized for its objectives, invoking the fact that these
1630 %algorithms do not optimize the right metric. It is true in a broad
1631 %spectrum of mapping domains, but we have shown that in our case, it
1632 %could be an efficient solution, depending on the architecture
1633 %heterogeneity.
1634 %
1635 %As each experiment takes a lot of time and the Grid'5000 platform is
1636 %shared by many researchers, we could not conducted as many experiments
1637 %as we wanted and as we need to purpose an exhaustive view of this part
1638 %of the mapping domain. We cannot design a more general state of the
1639 %mapping of asynchronous iterative applications on distributed
1640 %architectures, but we can draw the main lines of future works.
1641 %%, by
1642 %%giving the first stones at the building of this specific part of the
1643 %%mapping domain.
1644 %
1645 %We have shown that the heterogeneity in computing nodes power takes an
1646 %important part in an efficient mapping algorithm. This parameter and
1647 %probably the heterogeneity in network should really be more taken into
1648 %consideration. Maybe a good solution consists in designing of mapping
1649 %algorithms giving more important priority to one or to the other
1650 %optimization objectives. This leads to design a novel algorithm, which
1651 %takes into account the different points discussed in this paper, which
1652 %would probably be an hybrid algorithm, efficient with our model on the
1653 %targeted architectures.
1654
1655 \section{Conclusion and future works}
1656 \label{sec:conclu}
1657
1658 In this paper we have presented
1659 %three algorithms to address the
1660 %mapping problem for asynchronous iterative applications in
1661 %heterogeneous distributed architectures. As the asynchronous iteration
1662 %model is very specific, it was not clear of which kind of mapping
1663 %algorithm should be efficient on such a problem. The two main
1664 %approaches given in the literature, the ``execution time'' and the
1665 %``edge-cuts'' optimization algorithms, have been evaluated on
1666 %different architectures, with different heterogeneity degrees.
1667 %% These architectures varied in their
1668 %%heterogeneity, to evaluate the algorithms.
1669 %
1670 %%We proposed three mapping algorithms for the JaceP2P-V2
1671 %%environment. The first is a simple way mapping, the Simple Mapping
1672 %%algorithm, which always provides a good and stable improvement of
1673 %%performances on all kinds of architectures.
1674 %We propose
1675 a specific mapping algorithm for the AIAC model, called AIAC QM. This
1676 algorithm is based on the execution time optimization but it also
1677 includes a small degree of edge-cuts optimization. Experiments on a
1678 real large scale architecture of a typical AIAC application show that
1679 the AIAC QM mapping algorithm is efficient on architectures with a
1680 high heterogeneity degree. This can be explained by the fact that all
1681 iteration computations are quite different, for our example, and the
1682 convergence is more quickly detected as the more powerful computing
1683 nodes progress in the computation. The F-EC algorithm, which is based
1684 on the ``edge-cuts'' optimization, is meanwhile efficient on
1685 architectures with a low heterogeneity degree. This can be explained
1686 by the fact that in such an environment, it is more accurate for a
1687 task to have its dependencies locally on the same cluster in order to
1688 have efficient communications and to allow iterations to be computed
1689 together, which improves the convergence detection speed.
1690 Experiments we conducted have shown gains in execution time up to
1691 $50\%$, which denotes a division by 2 of this execution time, for a
1692 typical AIAC application on more than 700 computing cores.
1693 %Experiments have shown that
1694 %the importance of the parameter of both algorithms, AIAC QM and F-EC,
1695 %is not so essential for the first one, instead it is very critical for
1696 %the second one, but we cannot be sure that it is true all the time on
1697 %all kinds of architectures; it maybe depends on the heterogeneity
1698 %degree of the network.
1699 As we did not influence the network's heterogeneity,
1700 % as we did for the computational power of nodes,
1701 the evaluation of the network impact on the
1702 application execution time would be one of our next work.
1703
1704 %For now, these three mapping algorithms are implemented in an
1705 %additional library for the JaceP2P-V2 environment. The results
1706 %presented in this paper show that a mapping algorithm allows to
1707 %improve applications performances, but as the executing architectures
1708 %should have a variety of heterogeneity degree, we have to find a
1709 %compromise between the two approaches in order to have an efficient
1710 %mapping algorithm on all kinds of architectures. In the future, we
1711 %would like to design an efficient mapping algorithm to improve the
1712 %execution of asynchronous iteration applications on heterogeneous
1713 %distributed architectures. As the algorithm should be integrated in
1714 %the JaceP2P-V2 environment, which is fully decentralized and fault
1715 %tolerant, the new mapping algorithm should have also these
1716 %characteristics, in order to retrieve a fully decentralized and fault
1717 %tolerant environment.
1718 Our future works concern the amelioration of the AIAC QM algorithm, in
1719 order to improve it on homogeneous distributed architectures. As the
1720 F-EC mapping algorithm is efficient on such architectures, we will
1721 give a more important consideration to the edge-cuts part of AIAC
1722 QM. Another important point is to take into consideration the fault
1723 tolerance problem. In this study we have realized our experiments
1724 without computing node fault, which is not the real case. We have to
1725 take into account the AIAC QM algorithm about this important
1726 parameter. First we have to efficiently choose new nodes to replace
1727 failed ones. Secondly, as we do checkpointing to save tasks' states,
1728 we have to efficiently choose backup nodes not to fall in case
1729 a whole cluster fails, as we save on neighbors (which are
1730 in general on the same cluster for communication efficiency reasons),
1731 an important part of the application is lost and we cannot restart
1732 this part; so the whole application fails. A trade-off should be done
1733 by having some saving nodes in external clusters.
1734
1735 \section*{Acknowledgements}
1736
1737 \label{sec:merci}
1738
1739 This work was supported by the European Interreg IV From-P2P project
1740 and the region of Franche-Comté.
1741
1742 Experiments presented in this paper were carried out using the
1743 Grid'5000\cite{g5k} experimental testbed, being developed under the
1744 INRIA ALADDIN development action with support from CNRS, RENATER and
1745 several Universities as well as other funding bodies.
1746
1747
1748 \bibliographystyle{unsrt}
1749 %\nocite{*}
1750 \bibliography{biblio}
1751
1752
1753 \end{document}
1754
1755 %% %%