pdsec2010/pdsec2010.tex

   1 %% Version PDSEC
   2 \documentclass[10pt,conference,compsocconf]{IEEEtran}
   3
   4 %% Pour Ingrid
   5 %\documentclass[conference,compsoc,a4paper,onecolumn]{IEEEtran}
   6 %\documentclass[compsoc,a4paper,oneside]{IEEEtran}
   7
   8 \usepackage[T1]{fontenc}
   9 \usepackage{ucs}
  10 \usepackage[utf8x]{inputenc}
  11 \usepackage{lmodern}
  12 \usepackage{color}
  13 %% Jolis entetes %%
  14 %\usepackage[Glenn]{fncychap}
  15 \usepackage{amsmath}
  16 %\usepackage{amsthm}
  17 \usepackage{amsfonts}
  18 %\usepackage{graphicx}
  19 %\usepackage{xspace}
  20 % Definition des marges
  21 %\usepackage{vmargin}
  22 %\setpapersize[portrait]{A4}
  23
  24 \usepackage[english]{babel}
  25 % Extension pour les graphiques EPS
  26 %\usepackage[dvips]{graphicx}
  27 \usepackage[pdftex,final]{graphicx}
  28 % Extension pour les liens intra-documents (tagged PDF)
  29 % et l'affichage correct des URL (commande \url{http://example.com})
  30 \usepackage{hyperref}
  31 % Extension pour que plus de titres apparaissent dans la table des matieres
  32 % (table des matieres, bibliographie, index).
  33 %\usepackage{tocbibind}
  34 %\newcommand{\tabledesmatieres}{
  35 %  \setlength{\parskip}{0ex} % suppression de l'espace entre les paragraphes
  36 %  \tableofcontents
  37 %  \espacementnormal
  38 %}
  39 %\NoAutoSpaceBeforeFDP
  40 %\author{Sébastien Miquée}
  41
  42 \usepackage[ruled,vlined]{algorithm2e}
  43 \usepackage{cite}
  44
  45 \newcommand{\myitemize}[1]
  46 {
  47 \begin{itemize}
  48 \itemsep=#1pt
  49 }
  50
  51 %
  52 \ifCLASSINFOpdf
  53    \usepackage[pdftex]{graphicx}
  54    \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
  55 \else
  56 \fi
  57
  58 % *** PDF, URL AND HYPERLINK PACKAGES ***
  59 %
  60 \usepackage{url}
  61
  62
  63 \title{Mapping Asynchronous Iterative Applications on Heterogeneous
  64   Distributed Architectures}
  65
  66 \author{
  67 \IEEEauthorblockN{Raphaël Couturier \hspace{10pt} David Laiymani
  68   \hspace{10pt} Sébastien Miquée}
  69
  70 \IEEEauthorblockA{Laboratoire d'Informatique de Franche-Comté
  71   (LIFC)\\
  72   %University of Franche-Comté\\
  73   IUT de Belfort-Montbéliard,2 Rue Engel Gros, BP 27, 90016 Belfort,
  74   France\\
  75  % Tel.: +33-3-84587782 \hspace{20pt} Fax: +33-3-84587781\\
  76 Email:
  77 \{raphael.couturier,david.laiymani,sebastien.miquee\}@univ-fcomte.fr}
  78
  79 \thanks{This work was supported by the European Interreg IV From-P2P project
  80 and the region of Franche-Comté}
  81 }
  82
  83 %% Permet de réactiver le \thanks
  84 \IEEEoverridecommandlockouts
  85
  86 \date{}
  87 \begin{document}
  88
  89 %% left, top, right, bottom
  90 %\setmargrb{20mm}{15mm}{20mm}{15mm}
  91
  92 \IEEEcompsoctitleabstractindextext
  93
  94 \maketitle
  95
  96
  97 \begin{abstract}
  98   To design parallel numerical algorithms on large scale distributed
  99   and heterogeneous platforms, the asynchronous iteration model (AIAC)
 100   may be an efficient solution. This class of algorithm is very
 101   suitable since it enables communication/computation overlapping and
 102   it suppresses all synchronizations between computation nodes. Since
 103   target architectures are composed of more than one thousand
 104   heterogeneous nodes connected through heterogeneous networks, the
 105   need for mapping algorithms is crucial. In this paper, we propose a
 106   new mapping algorithm dedicated to the AIAC model. To evaluate our
 107   mapping algorithm we implemented it in the JaceP2P programming and
 108   executing environment dedicated to AIAC applications and we
 109   conducted a set of experiments on the Grid'5000 testbed.  Results
 110   are very encouraging and show that the use of our algorithm brings
 111   an important gain in term of execution time (about $40\%$).
 112
 113 %with more
 114 %  than 700 computing cores and with a real and typical AIAC
 115 %  application based on the NAS parallel benchmarks.
 116
 117 %  To design parallel and distributed applications on heterogeneous
 118 %  distributed architectures, the asynchronous iteration model may be
 119 %  an efficient solution. Such architectures contain various computing
 120 %  nodes connected by heterogeneous links. Nowadays, these
 121 %  architectures offer users too many computing nodes as they need, so
 122 %  a choice should be done, which is called ``tasks mapping''. In this
 123 %  paper we propose a comparison between different mapping algorithms,
 124 %  in order to evaluate mapping needs of the asynchronous iteration
 125 %  model on this kind of architectures. To do our experiments we used a
 126 %  middleware, JaceP2P-V2, which allows users to design asynchronous
 127 %  iterative applications and execute them on distributed
 128 %  architectures, in which we added a mapping library.\\
 129 \end{abstract}
 130
 131 \begin{IEEEkeywords}
 132 Mapping algorithms; Distributed clusters; Parallel iterative
 133 asynchronous algorithms; Heterogeneous distributed architectures
 134 \end{IEEEkeywords}
 135
 136
 137
 138
 139 \IEEEpeerreviewmaketitle
 140
 141 %\begin{keywords}
 142 %  Mapping algorithms, Distributed clusters, Parallel iterative
 143 %  asynchronous algorithms, Heterogeneous architectures.
 144 %\end{keywords}
 145
 146 \section{Introduction}
 147 \label{sec:intro}
 148
 149 Nowadays scientists of many domains, like climatic simulation or
 150 biological research, need large and powerful architectures to compute
 151 their large applications. Distributed clusters architectures, which
 152 are part of the grid architecture, are one of the best architectures
 153 used to solve such applications with an acceptable execution
 154 time. These architectures provide a lot of heterogeneous computing
 155 nodes interconnected by a high performance network, but even with the
 156 greatest efforts of their maintainers, there are latency and
 157 computation capacity differences between clusters of each site.
 158
 159 In order to efficiently use this massive distributed computation
 160 power, numerous numerical algorithms have been modified. These
 161 algorithms can be broadly classified into two categories. First,
 162 \textit{Direct methods}, which give the exact solution of the problem
 163 using a finite number of operations (e.g. Cholesky,
 164 LU\ldots. These methods cannot be applied to all kinds of
 165 numerical problems. In general, they are not well adapted to very
 166 large problems. Then \textit{iterative methods}, that repeat the same
 167 instructions until a desired approximation of the solution is reached
 168 -- we say that the algorithm has converged. Iterative algorithms
 169 constitute the only known approach to solving some kinds of problems
 170 and they are easier to parallelize than direct methods. The Jacobi or
 171 Conjugate Gradient algorithms are examples of such iterative
 172 methods.
 173 % \myitemize{5}
 174
 175 % \begin{figure}[h!]
 176 %   \vspace{0.1cm}
 177 %   \centering
 178 %   \includegraphics[width=7.4cm]{images/ISCA}
 179 %   \caption{Two processors computing in the Synchronous Iteration - Asynchronous Communication (SIAC) model}
 180 %   \label{fig:SIAC}
 181 % \end{figure}
 182
 183 In the rest of this paper we only focus on iterative methods. Now to
 184 parallelize this kind of algorithm, two classes of parallel iterative
 185 models can be described. In \textit{The synchronous iteration model}
 186 after each iteration a node sends its results to its neighbors and
 187 waits for the reception of all dependency messages from its neighbors
 188 to start the next iteration. This results in large idle times and is
 189 equivalent to a global synchronization of nodes after each
 190 iteration. These synchronizations can strongly penalize the overall
 191 performance of the application particularly in case of large scale
 192 platforms with high latency network. Furthermore, if a message is
 193 lost, its receiver will wait forever for this message and the
 194 application will be blocked. In the same way, if a machine fails, all
 195 the computation will be blocked.
 196
 197 \begin{figure}[h!]
 198   \vspace{0.1cm}
 199   \centering
 200   \includegraphics[width=7.4cm]{images/IACA}
 201   \caption{Two processors computing in the Asynchronous Iteration - Asynchronous Communication (AIAC) model}
 202   \label{fig:AIAC}
 203 \end{figure}
 204
 205
 206 In the\textit{The asynchronous iteration model} a node sends its
 207 results to its neighbors and starts immediately the next iteration
 208 with the last received data. These data could be data from previous
 209 iterations, because the most recent data has not arrived in time or
 210 neighbors have not finish their current iteration. The receiving and
 211 sending mechanisms are asynchronous and nodes do not have to wait for
 212 the reception of dependency messages from their
 213 neighbors. Consequently, there is no more idle time between two
 214 iterations. Furthermore, this model is tolerant to messages loss and
 215 even if a node dies, the remaining nodes continue the computation,
 216 with the last data the failed node sent. Unfortunately, the
 217 asynchronous iteration model generally requires more iterations than
 218 the synchronous one to converge to the solution.
 219
 220 This class of algorithms is very suitable in a distributed clusters
 221 computing context because it suppresses all synchronizations between
 222 computation nodes, tolerates messages loss and enables the overlapping
 223 of communications by computations. Interested readers might consult
 224 \cite{book_raph} for a precise classification and comparison of
 225 parallel iterative algorithms. In this way, several experiments
 226 \cite{book_raph} show the relevance of the AIAC algorithms in the
 227 context of distributed clusters with high latency between
 228 clusters. These works underline the good adaptability of AIAC
 229 algorithms to network and processor heterogeneity.
 230
 231 As we aim to solve very large problems on heterogeneous distributed
 232 architectures, in the rest of this study we only focus on the
 233 asynchronous iteration model. In order to efficiently use such
 234 algorithms on distributed clusters architectures, it is essential to
 235 map the tasks of the application to the best sub-sets of nodes of the
 236 target architecture. This mapping procedure must take into account
 237 parameters such as network heterogeneity, computing nodes
 238 heterogeneity and tasks heterogeneity in order to minimize the overall
 239 execution time of the application.
 240 %we have to select
 241 %sets of computing nodes, which can improve applications execution
 242 %time. Indeed, as there are more available nodes as we need on such
 243 %architectures, it is important to select appropriate computing nodes,
 244 %due to their heterogeneity at computation power as well as relying
 245 %network.
 246 To the best of our knowledge, there exits no algorithm which
 247 specifically addresses the mapping of AIAC applications on distributed
 248 architectures. The aim of this paper is to propose a new mapping
 249 algorithm dedicated to AIAC applications and to implement it into a
 250 real large scale computing platform, JaceP2P-V2. Experiments conducted
 251 on the Grid'5000 testbed with more than 400 computing cores show that
 252 this new algorithm enhances the performance of JaceP2P-V2 by
 253 about $40\%$ for a real and typical AIAC application.
 254
 255 %this mapping problem. Our aim is to evaluate the
 256 %more common approaches to solve this problem, by using several mapping
 257 %algorithms implemented in a real large scale computing platform,
 258 %JaceP2P-V2, on a true distributed clusters architecture.\\
 259
 260 The rest of this paper is organized as follows. Section
 261 \ref{sec:jacep2p} presents the JaceP2P-V2 middleware. We focus here on
 262 one of the main drawbacks of this platform: its lack of an efficient
 263 mapping strategy. Section \ref{sec:pb} presents our mapping problem
 264 and quotes existing issues to address it. Section
 265 \ref{sec:aiacmapping} describes the specificities of the AIAC model
 266 and details the main solution we propose to address the AIAC mapping
 267 problem. In section \ref{sec:expe} we describe the experiments we have
 268 conducted, with their different components and results. These results
 269 were conducted on the Grid'5000 testbed with more than 400 computing
 270 cores and show an important gain (about $40\%$) of the overall
 271 execution time for a typical AIAC application, i.e. based on the NAS
 272 Parallel Benchmark. Finally, we give some concluding remarks and plan
 273 our future work in section \ref{sec:conclu}.
 274
 275
 276 \section{JaceP2P-V2}
 277 \label{sec:jacep2p}
 278
 279 JaceP2P-V2\cite{jaceP2P-v2} is a distributed platform implemented
 280 using the Java programming language and dedicated to developing and
 281 executing parallel iterative asynchronous applications. JaceP2P-V2
 282 executes parallel iterative asynchronous applications with
 283 dependencies between computing nodes. In addition, JaceP2P is fault
 284 tolerant which allows it to execute parallel applications over
 285 volatile environments and even for stable environments like local
 286 clusters, it offers a safer and crash free platform. To our knowledge
 287 this is the only platform dedicated to designing and executing AIAC
 288 algorithms.
 289
 290 %\subsection{Architecture}
 291 %\label{sec:archijaceP2P}
 292
 293 %In this section we describe the JaceP2P-V2 environment.
 294 %As can be seen
 295 %on Figure \ref{fig:jaceP2P-v2}, which shows
 296
 297 The JaceP2P-V2 architecture, is composed of three main entities:
 298
 299  % \begin{figure}[h!]
 300 %    \vspace{0.1cm}
 301 %    \centering
 302 %    \includegraphics[width=7.4cm]{images/JACEP2P-V2}
 303 %    \caption{The JaceP2P-V2 architecture}
 304 %    \label{fig:jaceP2P-v2}
 305 %  \end{figure}
 306
 307 \myitemize{5}
 308 \item The first entity is the ``super-node'' %(represented by a big
 309   %circle in Figure \ref{fig:jaceP2P-v2}).
 310   Super-nodes form a circular network and store, in registers, the
 311   identifiers of all the computing nodes that are connected to the
 312   platform and that are not executing any application.
 313 %  Each super-node has a status table containing the
 314 %  number of connected computing nodes to each super-node and all the
 315 %  super-nodes share a ``token'' that is passed successively from a
 316 %  super-node to the next one. Once a super-node has the token, it
 317 %  computes the average number of computing nodes connected to a
 318 %  super-node ($avg$) using the status table. If $avg$ is lower than
 319 %  the number of computing nodes connected to it, then it sends the
 320 %  identifiers of the extra computing nodes to the super-nodes that
 321 %  have the number of computing nodes connected to them less than
 322 %  $avg$. If the number of computing nodes connected to it has changed,
 323 %  it broadcasts the information to all the super-nodes in the
 324 %  platform. Finally, it passes the token to the next super node. This
 325 %  distribution reduces the load of the super-nodes.
 326   A super-node regularly receives heartbeat messages
 327 %(represented by
 328 %  doted lines in Figure \ref{fig:jaceP2P-v2})
 329   from the computing nodes connected to it. If a super-node does not
 330   receive a heartbeat message from a computing node for a given period
 331   of time, it declares that this computing node is dead and deletes
 332   its identifier from the register.
 333
 334 \item The second entity is the ``spawner''
 335 %(represented by a square in
 336  % Figure \ref{fig:jaceP2P-v2}).
 337 When a user wants to execute a
 338   parallel application that requires $N$ computing nodes, he or she
 339   launches a spawner. The spawner contacts a super-node to reserve the
 340   $N$ computing nodes plus some extra nodes. When the spawner receives
 341   the list of nodes from the super-node, it transforms the extra
 342   nodes into spawners (for fault tolerance and scalability reasons)
 343   and stores the identifiers of the rest of the nodes in its own
 344   register.
 345 % Once the extra nodes are transformed into spawners, they
 346 %   form a circular network and they receive the register containing the
 347 %   identifiers of the computing nodes.
 348 Then each spawner becomes
 349   responsible for a subgroup of computing nodes, starts the tasks on
 350   the computing nodes under its command and sends a specific register
 351   to them.
 352 %  So each
 353 %  computing node receives a specific register that only contains the
 354 %  identifiers of the daemons it interacts with and that depends on the
 355 %  application being executed. These specific registers reduce the
 356 %  number of messages sent by the spawners to update the register of
 357 %  the daemons after a daemon crashes because usually a small number of
 358 %  daemons is affected by this crash.
 359  %  If the spawner receives a message from a computing node informing
 360 %   that one of its neighbors has failed, it fetches a new one from the
 361 %   super-node in order to replace the dead one. The spawner initializes
 362 %   the new daemon, which retrieves the last backup (see next paragraph)
 363 %   of the dead node and continues the computing task from that
 364 %   checkpoint.
 365
 366 \item The third entity is the ``daemon'', or the computing node,
 367   (represented in Figure \ref{fig:jaceP2P-v2} by a hashed small circle
 368   if it is idle and by a white small circle if it is executing an
 369   application). Once launched, it connects to a super-node and waits
 370   for a task to execute. Once they begin executing an application they
 371   form a circular network which is only used in the failure detection
 372   mechanism. Each daemon can communicate directly with the daemons
 373   whose identifiers are in its register. At the end of a task, the
 374   daemons reconnect to a super-node.
 375 \end{itemize}
 376
 377 To be able to execute asynchronous iterative applications, JaceP2P-V2
 378 has an asynchronous messaging mechanism and to resist daemons'
 379 failures, it implements a distributed backup mechanism called the
 380 uncoordinated distributed checkpointing. For more details on the
 381 JaceP2P-V2 platform, readers can refer to \cite{jaceP2P-v2}.
 382
 383 %  This method allows daemons to
 384 % save their data on neighboring daemons without any user
 385 % intervention. The asynchronous nature of the application allows two
 386 % daemons to execute two different iterations, thus each daemon saves
 387 % its status without synchronizing with other daemons. This
 388 % decentralized procedure allows the platform to be very scalable, with
 389 % no weak points and does not require a secure and stable station for
 390 % backups. Moreover, since the AIAC model is tolerant to messages loss,
 391 % if a daemon dies, the other computing nodes continue their tasks and
 392 % are not affected by this failure.
 393 % The application convergence detection is done by daemons, using the
 394 % decentralized global convergence detection algorithm presented in
 395 % \cite{conv_dec}. It consists of two phases: the detection phase and
 396 % the verification phase. This algorithm aims to detect efficiently
 397 % the global convergence of asynchronous iterative parallel algorithms
 398 % on distributed architectures.
 399
 400
 401
 402 \subsubsection*{Benefits of mapping}
 403 \label{sec:benef}
 404
 405 In the JaceP2P-V2 environment, presented in the previous section,
 406 there is no effective mapping solution. Indeed, when a user wants to
 407 launch an application, the spawner emits a request to the super-node,
 408 which is in charge of available daemons. Basically, the super-node
 409 returns the amount of requested computing nodes by choosing in its own
 410 list.
 411 %,
 412 %if there are sufficient daemons connected on it, or in other
 413 %super-nodes lists, in addition of its one.
 414 In this method, the super-node only cares about the amount of
 415 requested nodes, it returns in general nodes in the order of their
 416 connection to the platform -- there is no specific selection.
 417 Distributed architectures such as distributed clusters,
 418 %as can be seen
 419 %on Figure \ref{fig:pbdistclust},
 420 are often composed of heterogeneous
 421 clusters linked via heterogeneous networks with high latencies and
 422 bandwidths. As an example the Grid'5000\cite{g5k} testbed is composed
 423 of 23 clusters spread over 9 sites. Those clusters are heterogeneous,
 424 with computing powers starting from bi-cores at 2GHz to
 425 bi-quad-cores at 2.83GHz with 2Gb of memory for the first one to 8Gb
 426 for the second. Links relying clusters are 10Gb/s capable, but as many
 427 researchers use this platform, high latencies appear in links between
 428 sites.
 429 %In the targeted architectures, which are distributed clusters, each
 430 %cluster provides a fixed number of homogeneous computing nodes --
 431 %computing nodes are heterogeneous from a cluster to another. Figure
 432 %\ref{fig:pbdistclust} represents such an architecture, in which we can
 433 %see different sites containing several heterogeneous clusters. Each
 434 %cluster is relied by high speed network with clusters in the same site
 435 %and in others. We note that the architecture may not be represented by
 436 %a full connected graph, as some sites are not directly connected to
 437 %some others. The Grid'5000 testbed, described in section
 438 %\ref{sec:g5k}, is an example of such a distributed clusters
 439 %architecture.  Though there are efficient links relying each site, a
 440 %residual latency continues to exist, at local clusters (in the same
 441 %site) as well as distant clusters (from two distinct sites), and can
 442 %penalize performance.
 443
 444 % \begin{figure}[ht!]
 445 %   \centering
 446 %   \includegraphics[width=7.8cm]{images/dist_clust}
 447 %   \caption{A distributed clusters architecture}
 448 %   \label{fig:pbdistclust}
 449 % \end{figure}
 450
 451
 452 With such an architecture, it could be
 453 efficient to assign tasks communicating with each other on the same
 454 cluster, in order to improve communications. But, as we use very large
 455 problems, it is quite impossible to find clusters containing as many
 456 computing nodes as requested. So we have to dispatch tasks over
 457 several clusters. That implies a nedd to deal with heterogeneity in clusters
 458 computing power and heterogeneity in network. We should make a
 459 trade-off between both components in order to take the best part
 460 of each one to improve the overall performance.
 461 %The
 462 %literature in high performance computing has broadly demonstrated the
 463 %benefits of mapping solutions on the applications execution time.
 464
 465 In order to check if a tasks mapping algorithm would provide
 466 performance improvement in JaceP2P-V2 environment, we have evaluated
 467 the contributions of a simple mapping algorithm, which is described in
 468 section \ref{sec:sma}. These experiments used the NPB Kernel CG
 469 application described in section \ref{sec:cg}, with two problem sizes
 470 (the given problem sizes are the sides sizes of square matrices used)
 471 and using a distributed clusters architecture composed of 102
 472 computing nodes, representing 320 computing cores, spread over 5
 473 clusters in 5 sites. The results of these experiments are given in
 474 Table \ref{tab:benef}.
 475
 476 \renewcommand{\arraystretch}{1.5}
 477
 478 \begin{table}[h!]
 479   \centering
 480   \begin{tabular}{|c|c|c|}
 481     \hline
 482     Problem size&$550,000$&$5,000,000$\\
 483     \hline
 484     Execution Time (without mapping)&141s&129s\\
 485     Execution Time (with mapping)&97s&81s\\
 486     \hline
 487     Gains&$31\%$&$37\%$\\
 488     \hline
 489   \end{tabular}
 490   \caption{Effects of a simple tasks
 491     mapping algorithm on application's execution time}
 492   \label{tab:benef}
 493 \end{table}
 494
 495 As can be seen in Table \ref{tab:benef}, the effects of a
 496 simple tasks mapping algorithm are significant.
 497 %Moreover, we can see
 498 %that are scalable with the application's size, which demonstrates the
 499 %real needs of the platform for a such algorithm.
 500 This encouraged us to look further for better task mapping
 501 algorithms. In the next section, we describe the specificities of our
 502 model and issues which can be exploited.
 503
 504 \section{Problem description}
 505 \label{sec:pb}
 506
 507 % In this section we describe the AIAC mapping problem. We first
 508 % formalize the different elements we should take into consideration:
 509 % the application, the targeted architecture and the objectives
 510 % functions of the mapping. We also give a state of the art about
 511 % considered kinds of mapping algorithms.
 512
 513 \subsection{Model formalization}
 514 \label{sec:pbmodel}
 515
 516 % In this section the models of the applications and architectures we
 517 % used are given, with the objectives functions of the mapping
 518 % algorithms.
 519
 520 \subsubsection{Application modeling}
 521 \label{sec:pbmodelapp}
 522
 523 In high performance computing, when we want to improve the global
 524 execution time of parallel applications we have to make an efficient
 525 assignation of tasks to computing nodes. Usually, to assign tasks of
 526 parallel applications to computing nodes, scheduling algorithms are
 527 used. These algorithms often represent the application by a graph,
 528 called DAG \cite{dag1,dag2,dag3,dag4} (Directed Acyclic Graph). In
 529 this graph, each task is represented by a vertex which is relayed to
 530 others by edges, which represent dependencies and communications
 531 between tasks. This means that some tasks could not start before other
 532 ones finish their computation and send their results. As discussed in
 533 the introduction, in the AIAC model, there is no precedence between
 534 tasks.
 535
 536 Indeed, with the AIAC model, all tasks compute in parallel at the same
 537 time. As communications are asynchronous, there is no synchronization
 538 and no precedence. During an iteration, each task does its job and
 539 sends results to its neighbors and continues with the next
 540 iteration. If a task receives new data from its dependencies, it
 541 includes them and the computation continues with these new data. If
 542 not all dependencies data, or none, are received before starting the
 543 computation of the next iteration, old data are used instead. Tasks
 544 are not blocked on dependencies. Nevertheless regularly receiving new
 545 data allows tasks to converge more quickly. So, it appears that DAG
 546 are not appropriate to modeling AIAC applications. TIG\cite{tig1,
 547   tig2} (Task Interaction Graph) are more appropriate.
 548 %That is why we use the
 549 %TIG\cite{tig1, tig2} (Task Interaction Graph) model instead of DAG,
 550 %which allows to modeling application using tasks and their
 551 %communication dependencies. Figure \ref{fig:tig} gives an example of a
 552 %TIG.
 553
 554 \begin{figure}[h!]
 555   \centering
 556   \includegraphics[width=4cm]{images/tig}
 557   \caption{An example of a TIG of a nine tasks application}
 558   \label{fig:tig}
 559 \end{figure}
 560
 561 In the TIG model, a parallel program is represented by a graph , as
 562 can be seen in Figure \ref{fig:tig}.  This graph $GT(V,E)$, where $V =
 563 \{V_1,V_2,\dots V_v\}$ is the set of $|V|$ vertices and $E \subset V
 564 \times V$ is the set of undirectional edges. The vertices represent
 565 tasks and the edges represent the mutual communication among tasks. A
 566 function $ET : V \rightarrow R^+$ gives the computation cost of tasks
 567 and $CT : E \rightarrow R^+$ gives the communication cost for message
 568 passing on edges. We define $v = |V|$, $ET(V_i) = e_i$ and
 569 $CT(V_i,V_j) = c_{ij}$.  For example, in Figure \ref{fig:tig},
 570 \mbox{$e_0$ = 10} and $c_{01} = 2$, $c_{03} = 2$ and $c_{04} = 2$.
 571 Tasks in TIG exchange information during their execution and there is
 572 no precedence relationship among tasks; each task cooperates with its
 573 neighbors. This model is used to represent applications, where tasks
 574 are considered to be executed simultaneously. Temporal dependencies in
 575 the execution of tasks are not explicitly addressed: all the tasks are
 576 considered simultaneously executable and communications can take place
 577 at any time during the computation. That is why vertices and edges are
 578 labeled with weights
 579 describing computational and communication costs.\\
 580
 581
 582 \subsubsection{Architecture modeling}
 583 \label{sec:pbmodelarchi}
 584
 585 As TIG models the application, we have to model the targeted
 586 architecture. A distributed clusters architecture can be modeled by a
 587 three-level-graph. The levels are \textit{architecture} (a), in our
 588 study it is the Grid'5000 grid, \textit{cluster} (c) and computing
 589 node (n) levels.
 590 %Figure \ref{fig:pbdistclust} in section
 591 %\ref{sec:benef} shows such a model.
 592 Let $GG(N,L)$ be a graph
 593 representing a distributed clusters architecture, where $N =
 594 \{N_1,N_2,\dots N_n\}$ is the set of $|N|$ vertices and $L$ is the set
 595 of undirectional edges. The vertices represent the computing nodes and
 596 the edges represent the links between them. An edge $L_i \in L$ is an
 597 unordered pair $(N_x,N_y) \in N$, representing a communication link
 598 between nodes $x$ and $y$. Let be $|C|$ the number of clusters in the
 599 architecture containing computing nodes. A function $WN : N
 600 \rightarrow R^+$ gives the computational power of nodes and $WL : L
 601 \rightarrow R^+$ gives the communication latency of links. We define
 602 $WN(N_i) = wn_i$ and $WL(L_i,L_j) = wl_{ij}$.
 603
 604 An architecture with a three-level-graph is specified according as
 605 follows. All computing nodes are in the same node level. When
 606 computing nodes can communicate to one another with the same
 607 communication latency, they can be grouped into the same cluster. In
 608 addition, like in the Grid'5000 testbed, if computing nodes seemly
 609 have the same computational power with a low communication latency, a
 610 cluster of these nodes can be defined. All participating clusters,
 611 including computing nodes, are in the same architecture level and
 612 communicate through the architecture network.\\
 613
 614
 615 \subsubsection{Mapping functions}
 616 \label{sec:pbmodelmapping}
 617
 618 After having described the two graphs used to model the application
 619 and the architecture, this section defines our objectives.
 620
 621 When a parallel application $App$, represented by a graph $GT$, is
 622 mapped on a distributed clusters architecture, represented by a graph
 623 $GG$, the execution time of the application, $ET(App)$, can be defined
 624 as the execution time of the slowest task.
 625 %In the AIAC model, tasks
 626 %have, in general, seemly the same work and communication loads; so the
 627 %difference on execution time depends on the executing machines.
 628 Indeed an application ends when all the tasks have detected
 629 convergence and have reached the desired approximation of the
 630 solution, that is why the execution time of the application depends on
 631 the slowest task.
 632 %, this converges last.
 633 We define $ ET(App) = \max_{i=1 \dots v} ( ET(V_i) )$
 634 % \begin{equation}
 635 %   \label{eq:et}
 636 %   ET(App) = \max_{i=1 \dots v} ( ET(V_i) )
 637 % \End{equation}
 638 where the
 639 %$ET(V_s)$ is the execution time of the slowest task $V_s$. The
 640 execution time of each task $i$ \mbox{($i=1 \dots v$)}, $ET(V_i)$ is
 641 given by $ET(V_i) = \frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \cdot wl_{ij}$
 642
 643 % \begin{equation}
 644 %   \label{eq:ettask}
 645 %   ET(V_i) = \frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \cdot wl_{ij}
 646 % \End{equation}
 647
 648 where $e_i$ is the computational cost of $V_i$, $wn_i$ is the
 649 computational power of the node $N_i$ on which $V_i$ is mapped, $J$
 650 represents the neighbors set of $V_i$, $c_{ij}$ is the amount of
 651 communications between $V_i$ and $V_j$, and $wl_{ij}$ is the link
 652 latency between the computing nodes on which are mapped $V_i$ and
 653 $V_j$.
 654 %Note that in the AIAC model, we only consider for $e_i$ the
 655 %cost of an only one iteration, as we cannot determine in advance the
 656 %amount of iterations a task should complete to converge.
 657 We underline here that in the AIAC model, it is impossible to predict
 658 the number of iterations of a task. So it is difficult to evaluate a
 659 priori the cost $e_i$ of a task. In the remainder, we approximate
 660 $e_i$ by the cost of one iteration.
 661
 662 The mapping problem is similar to the classical graph partitioning and
 663 task assignment problem \cite{npcomp}, and is thus NP-complete.
 664 %The
 665 %generalization of this problem by considering both heterogeneous tasks
 666 %graph and architecture graph makes the problem more complex.
 667 %The JaceP2P-V2 platform is design to execute parallel asynchronous
 668 %iterative application on heterogeneous distributed architectures, so
 669 %it has to launch applications on a variety of computing nodes relying
 670 %by non-uniform links. As demonstrates in the previous section, it does
 671 %not map efficiently tasks on nodes and the overall performance can be
 672 %strongly penalize by this lack of mapping. Indeed, if the
 673 %heterogeneity degree of the architecture used is high, it can be
 674 %possible to have two neighbors tasks executing on two foreign
 675 %computing nodes, that penalizes communications between these tasks,
 676 %that increase the execution time. With the same mining, it is
 677 %interesting to have similar computing nodes, in order to have seemly
 678 %the same iteration execution time that allows to converge more
 679 %quickly. We can conclude that JaceP2P-V2 needs an algorithm to
 680 %efficiently map tasks on computing nodes.
 681 %
 682 %%As we use parallel distributed algorithms on distributed
 683 %%architectures, we have to efficiently map tasks on computing nodes, in
 684 %%order to obtain best performance and the slowest execution time of
 685 %%the application.
 686 %There exist many scheduling and mapping algorithms in the literature,
 687 %and the research is active in this domain. A first approach is to use
 688 %one the numerous scheduling algorithms, but our model does not fit
 689 %with them. Indeed, a scheduling implies that we have tasks which are
 690 %depending on others, more precisely that some tasks cannot start
 691 %computation before having received data from other tasks, which are
 692 %called precedences, or ``strong dependencies''. This class of
 693 %algorithms uses a well known application's representation: the DAG
 694 %(Direct Acyclic Graph). This model cannot be used with our
 695 %problematic. Indeed, in asynchronous iterative applications, all the
 696 %tasks are executed in the same time, in parallel; there is no
 697 %precedence. So we have to look at another way, which is the ``tasks
 698 %mapping''.
 699 %
 700 %The second approach is to use a tasks mapping algorithm, which only
 701 %aims to map tasks on nodes, following a metric, to minimize the
 702 %application's execution time. To determine which class of tasks
 703 %mapping algorithms we should consider, it is important to draw up a
 704 %list of which information we have in hands. When an application using
 705 %the asynchronous iteration model is run, the number of iterations to
 706 %converge to the solution is unpredictable and change from a run to
 707 %another. So, we cannot give a good approximation of the execution time
 708 %of a task, but we can retrieve an estimation of the computational
 709 %amount of a task (plus de détails ?). In addition, as the asynchronous
 710 %iteration model based algorithms are able to provide to applications
 711 %the message lost tolerance, we cannot exactly quantify the amount of
 712 %communications in an application; some algorithms can converge with
 713 %receiving only few dependencies messages. One application's
 714 %representation which fit perfectly with our model is the TIG (Task
 715 %Interaction Graph), which only considers relations between tasks --
 716 %the only thing which is predictable in our model. The figure
 717 %\ref{fig:tig} presents an example of a TIG.
 718 %
 719 %
 720 %
 721 %On Figure \ref{fig:tig} we can see an application with 6 tasks, in
 722 %which each task is in relation with tasks of previous and next rank,
 723 %with two exceptions, the first and the last tasks, which can
 724 %eventually be in relation, depending of the application. This is a
 725 %commonly scheme in asynchronous iteration model; in general, tasks are
 726 %in relation with some others which previous and next ranks, but in
 727 %some cases, dependencies are more complex.
 728
 729 \subsection{Related work}
 730 \label{sec:pbrw}
 731
 732 %In the previous section we have determined that we need to use TIG
 733 %model based mapping algorithms to address our problem.
 734 In the literature of the TIG mapping, we can find many algorithms,
 735 which can be divided into two categories. In the \textit{Edge-cuts
 736   optimization} class of algorithms, yhe aim is to minimize the use of the
 737 penalizing links between clusters. As tasks are depending on
 738 neighbors, which are called dependencies, the goal is to choose nodes
 739 where distance, in term of network, is small, to improve
 740 communications between tasks. Here we can cite Metis\cite{metis},
 741 Chaco\cite{chaco} and PaGrid\cite{pagrid} which are libraries
 742 containing such kind of algorithms. The main drawback of edge-cuts
 743 algorithms is that they do not tackle the computing nodes
 744 heterogeneity issues. They only focus on communication overheads.
 745 %  The Figure \ref{fig:edge} shows
 746 %  that the optimization is on edges of communications, which are
 747 %  circled in red.
 748 %
 749 %  \begin{figure}[h!]
 750 %    \centering
 751 %    \includegraphics[width=3.5cm]{images/edgecut}
 752 %    \caption{The edge-cuts optimization}
 753 %    \label{fig:edge}
 754 %  \end{figure}
 755 In the \textit{Execution time optimization} class of algorithms the
 756 aim is to minimize the whole execution time of the application. They
 757 look for nodes which can provide the small execution time of tasks
 758 using their computational power. Here we can cite
 759 FastMap\cite{fastmap} and MiniMax\cite{minimax} as such kind of
 760 algorithms. QM\cite{qm_these} is also an algorithm of this category,
 761 but it aims to find for each task the node which can provide the best
 762 execution time. QM works at the task level, whereas others work at the
 763 application level
 764 %  The Figure
 765 %  \ref{fig:et} shows that the optimization is on tasks, which are
 766 %  circled in red.\\
 767 %
 768 %  \begin{figure}[h!]
 769 %    \centering
 770 %    \includegraphics[width=3.5cm]{images/exectime}
 771 %    \caption{The execution time optimization}
 772 %    \label{fig:et}
 773 %  \end{figure}
 774
 775 The two classes of algorithms may fit with our goals, because in our
 776 model we have both the computational power of nodes and communication
 777 costs may influence the applications performance. We can also cite
 778 partitioning tools like Scotch \cite{scotch} which aims at privileging
 779 the load balancing of their partitioning schemes. Nevertheless, to the
 780 best of our knowledge, none of the existing algorithms take into
 781 consideration the specificities of the AIAC model (see next section).
 782 %specifically address the AIAC mapping problem.
 783 %As the existing mapping algorithms are not designed to fit
 784 %with the AIAC mapping problem.
 785
 786 \section{AIAC mapping}
 787 \label{sec:aiacmapping}
 788
 789 % In this section we present the specificities of the AIAC model, which
 790 % are interesting in the mapping problem, and the solution we propose:
 791 % the AIAC QM algorithm, which is an extended version of the QM
 792 % algorithm.
 793
 794 \subsection{Specificities of the AIAC mapping problem}
 795 \label{sec:specAIACmapping}
 796
 797 An important point to take into consideration in the AIAC model is
 798 that we do not allow the execution of multiple tasks on the same
 799 computing node. This comes from the fact that the targeted
 800 architectures are volatile distributed environments. Assigning
 801 multiple tasks to a node provides a fall of performance when this
 802 node fails. Indeed we should redeploy all of the tasks from this node
 803 to another one, using last saves, which implies to search a new
 804 available computing node, transfer saves to it and restart the
 805 computation from this point (which could be far from this just before
 806 the failure).
 807
 808 Nevertheless, in order to benefit of multi-cores architectures, we use
 809 a task level parallelism by running multi-threaded sequential solver
 810 for example.
 811 %In addition, it is more simple and efficient to parallelize at the
 812 %task level using, as an example with the CG application, a
 813 %multi-threaded linear solver, which benefits of the multi-cores
 814 %architecture of computing nodes.
 815
 816 Another important point in the AIAC model is that we should take into
 817 account precisely the locality issue. This comes from the fact that in
 818 this model, the faster and more frequently a task receives its
 819 dependencies, the faster it converges. Moreover, as the JaceP2P-V2
 820 environment is fault tolerant and tasks save checkpoints on their
 821 neighbors, it is more efficient to save on near nodes than on far
 822 ones.
 823
 824
 825 %As our problem is on task mapping, only considering the dependency
 826 %relations between tasks, with a part of consideration on task weight,
 827 %we should use TIG (\textit{task interaction graph}) mapping model.
 828
 829 %\section{Mapping algorithms}
 830 %\label{sec:mapalgo}
 831
 832 \subsection{AIAC Quick-quality Map}
 833 \label{sec:qmmodif}
 834
 835 %After having describe the different kinds of mapping algorithms which
 836 %can be found in the literature, we now present the three algorithms we
 837 %use to do mapping in the JaceP2P-V2 environment.
 838
 839 We present here the solution we propose, called \textit{AIAC QM
 840   algorithm}, to address the AIAC mapping problem. We decided to
 841 improve the \textit{Quick-quality Map} (QM) algorithm since it is one
 842 of the most accurate method to address the TIG mapping problem.
 843 %
 844 %\subsection{Modified Quick-quality Map}
 845 %\label{sec:modifiedqm}
 846 %
 847 %As the previous algorithm describe in section \ref{sec:sma} showed
 848 %that mapping provides a significant increase of applications
 849 %performance (which can be seen in the section \ref{sec:experiments}),
 850 %we decide to try another mapping algorithm, which is a modified
 851 %version of the \textit{Quick-quality Map} (QM) algorithm.
 852 %
 853 %
 854 %As explained in section \ref{sec:pb}, the asynchronous iteration model
 855 %is specific, as it is not a good solution to map many tasks on the
 856 %same node. This is why QM has been modified to take into account these
 857 %characteristics.
 858 % Indeed, originally QM tries to map many tasks on the
 859 %same node to improve the execution time of tasks by decreasing
 860 %communications costs. This solution can be good if communications
 861 %between tasks are heavy and if we consider that computing nodes are
 862 %stable and are not volatile. As
 863
 864 %, we have modified some parts
 865 %of it to fit with our constraints. This was an opportunity to be taken
 866 %to insert a little part of ``edge-cuts'' optimization, as in our model
 867 %communications have to be taken into account.
 868 In its original version, this algorithm aims at prioritizing the
 869 computational power of nodes. Indeed, its aim is to find the more
 870 powerful node to map a task on. Moreover, a part of this algorithm is
 871 designed to map multiple tasks on the same node, in order to improve
 872 local communications. This solution can be efficient if communications
 873 between tasks are heavy and if we consider that computing nodes are
 874 stable and not volatile. This last point is in contradiction with
 875 our model, as we authorize only the execution of one task on a single
 876 node -- this allows to lose only the work of a single task in case of
 877 node's fault, with a low cost on restarting mechanism. Instead
 878 assigning multiple tasks on the same computing node, our mapping
 879 algorithm tries to keep tasks locally, to improve communications, by
 880 trying to assign tasks to computing nodes in the neighborhood
 881 of which their neighbors are mapped on.
 882
 883 % The pseudo-code of AIAC QM is given in Algorithm \ref{alg:qmmodified}.
 884
 885 % \SetAlgoSkip{}
 886 % \begin{algorithm}
 887 %   \SetLine
 888 %   \dontprintsemicolon
 889
 890 %   \KwIn{Sets of tasks and computing nodes}
 891 %   \KwOut{Mapping of tasks to nodes}
 892
 893 %   \BlankLine
 894
 895 %   sort nodes by descending power\;
 896 %   map tasks in order on nodes\;
 897 %   set all tasks \textit{moveable}\;
 898 %   $r \leftarrow 1$\;
 899
 900 %   \BlankLine
 901
 902 %   \While{one task is moveable}{
 903 %     \For{each task $t_{i}$ $\&\&$ $t_{i}$ is moveable }{
 904 %       $n_{c} \leftarrow$ current node of $t_{i}$\;
 905 %       $n_{n} \leftarrow t_{i}$\;
 906
 907 %       \BlankLine
 908
 909 %       \For{$k = 0 ; k < \frac{f \cdot n}{r} ; k++$}{
 910 %         select random node $n_{r}$ in $[0,\frac{n}{r}]$\;
 911 %         \If{ET($t_{i}$,$n_{r}$) $<$ ET($t_{i}$,$n_{n}$)}{
 912 %           $n_{n} \leftarrow n_{r} $\;
 913 %         }
 914 %       }
 915
 916 %       \BlankLine
 917
 918 %       \For{each node $n_{v}$ near $dep(t_{i})$}{
 919 %         \If{ET($t_{i}$,$n_{v}$) $<$ ET($t_{i}$,$n_{n}$)}{
 920 %           $n_{n} \leftarrow n_{v} $\;
 921 %         }
 922 %       }
 923
 924 %       \BlankLine
 925
 926 %       \If{$n_{n} \neq n_{c}$}{
 927 %         map $t_{i}$ on $n_{n}$\;
 928 %         update ET of $t_{i}$ and dep($t_{i}$)\;
 929 %       }
 930 %     }
 931
 932 %     \BlankLine
 933
 934 %     set $t_i$ not moveable\;
 935 %     $r \leftarrow r+1$ if all tasks have been considered\;
 936 %   }
 937
 938 %   \caption{The AIAC QM}
 939 %   \label{alg:qmmodified}
 940 % \end{algorithm}
 941 % %\vspace{-0.5cm}
 942
 943 So, in this algorithm all nodes are first sorted in descending order
 944 according to their computation power, and all tasks are mapped on
 945 these nodes according to their identifier (they are also marked as
 946 ``moveable'', that means that each task can be moved from a node to
 947 another). As in the original QM algorithm, AIAC QM keeps track of the
 948 \textit{number of rounds} $r$ ($r > 0$), that all tasks have been
 949 searched for a better node. This allows to reduce at each round the
 950 number of considered nodes. While there is at least one moveable task,
 951 it performs for each moveable task the search for a better node. It
 952 chooses a set of nodes, $\frac{f \cdot n}{r}$, where $f$ is defined as
 953 the search factor and $n$ is the number of nodes. $r$ and $f \in
 954 ]0,1]$ control the portion of nodes that will be considered where more
 955 numerous the rounds are, the less the considered nodes will be. Then
 956 the algorithm estimates the execution time $ET(v)$ of the task on each
 957 node. If it is smaller than the current node on which the task is
 958 mapped on, this node becomes the new potential node for task $t_i$.
 959
 960 After having randomly searched for a new node, the AIAC QM tries to
 961 map the task on nodes that are neighbors of nodes of which the
 962 dependencies of $t_i$ are mapped on. This is one of the major
 963 modification to the original QM algorithm. It introduces a little part
 964 of ``edge-cuts'' optimization. In the original version, it tries to
 965 map the task $t_i$ on the same node of one as its dependencies. As
 966 explain in \ref{sec:specAIACmapping}, this is not an acceptable
 967 solution in our case. Instead, the algorithm now searches to map task
 968 $t_i$ on nodes which are near the ones its dependencies are mapped
 969 on. This search requires a parameter which indicates the maximum
 970 distance at which nodes should be from the node of dependencies of
 971 $t_i$.
 972
 973 At the end of the algorithm, if a new node is found, $t_i$ is mapped
 974 on and its execution time is updated and $t_i$ is set to ``not
 975 moveable''. The execution time of each of its dependencies is also
 976 updated, and if this new execution time is higher than the previous,
 977 the task is set to ``moveable''. And finally, if all tasks have been
 978 considered in this round, $r$ is incremented.
 979
 980 The complexity of the AIAC QM algorithm is about $O(n^2 \cdot
 981 t \cdot ln(r))$. This complexity is the same as the original algorithm
 982 (details are given in \cite{qm_these}, with an increase of a factor
 983 $n$, corresponding to the edge-cuts part.
 984
 985 \section{Experimentation}
 986 \label{sec:expe}
 987
 988 % We now describe the experiments we have conducted and their
 989 % components, to evaluate the effects of the AIAC QM algorithm on
 990 % application execution time.
 991
 992 \subsection{The NAS Parallel Benchmark Kernel CG and the Grid'5000 platform }
 993 \label{sec:cg}
 994
 995 We used the ``Kernel CG'' of the NAS Parallel Benchmarks (NPB)
 996 \cite{nas} to evaluate the performance of the mapping
 997 algorithm. This benchmark is designed to be used on large
 998 architectures, because it tests communications over latency networks,
 999 by processing unstructured matrix vector multiplication. In this
1000 benchmark, a Conjugate Gradient is used to compute an approximation of
1001 the smallest eigenvalue of a large, sparse and symmetric positive
1002 definite matrix, by the inverse power method. In our tests, the whole
1003 matrix contains nonzero values, in order to stress more
1004 communications. As the Conjugate Gradient method cannot be executed
1005 with the asynchronous iteration model we have replaced it by another
1006 method called the multisplitting method. This latter supports the
1007 asynchronous iterative model.
1008
1009
1010 With the multisplitting algorithm, the $A$ matrix is split into
1011 horizontal rectangle parts.
1012 %as Figure \ref{fig:multisplit} shows.
1013 Each
1014 of these parts is assigned to a processor -- so the size of data
1015 depends on the matrix size but also on the number of participating
1016 nodes. In this way, a processor is in charge of computing its $X Sub$
1017 part by solving the following subsystem: $ASub \times XSub = BSub -
1018 DepLeft \times XLeft - DepRight \times XRight$
1019
1020 After solving $XSub$, the result must be sent to other
1021 processors which depend on it.
1022
1023 % \begin{figure}[h!]
1024 %   \centering
1025 %   \includegraphics[width=7.4cm]{images/multisplit}
1026 %   \caption{Data decomposition for the multisplitting method
1027 %     implementation}
1028 %   \label{fig:multisplit}
1029 % \end{figure}
1030
1031 % The multisplitting method can be decomposed into four phases:
1032 % \begin{enumerate}
1033 % \itemsep=5pt
1034 % \item \textbf{Data decomposition}. In this phase, data are allocated
1035 %   to each processor assuming the decomposition shown in Figure
1036 %   \ref{fig:multisplit}. Then, each processor iterates until
1037 %   convergence is reached.
1038 % \item \textbf{Computation}. At the beginning of the computation, each
1039 %   processor computes
1040 %     $BLoc = BSub - DepLeft \times XLeft - DepRight \times XRight$. Then,
1041 %   it solves $ASub \times XSub = BLoc$ by using a
1042 %   multi-threaded sequential version of the Conjugate Gradient method.
1043 % \item \textbf{Data exchange}. Each processor sends its $XSub$
1044 %   part to its neighbors. Here, the neighborhood is closely related
1045 %   to the density of the $A$ matrix. Clearly, a dense matrix implies an
1046 %   \textit{all-to-all} communication scheme while a matrix with a low
1047 %   bandwidth reduces the density of the communication scheme.
1048 % \item \textbf{Convergence detection} Each processor computes its local
1049 %   convergence and sends it to a server node. When this one detects
1050 %   that each processor has converged, it stops the whole computation
1051 %   process.
1052 % \end{enumerate}
1053
1054 %It can be pointed out here that it is possible to modify the data
1055 %decomposition in order to obtain non disjoint rectangle matrices. This
1056 %property of multisplitting methods, called \textit{overlapping}, can
1057 %bring significant improvements to convergence speed, but it is not the
1058 %aim of this paper.
1059 For more details about this method, interested readers are invited to
1060 see \cite{book_raph}. In our benchmark, the sequential solver part of
1061 the multisplitting method is the Conjugate Gradient, using the
1062 MTJ\cite{mtj} library. Its implementation is multi-threaded, so it
1063 benefits from multi-core processors.
1064
1065 We point out here that this benchmark is a typical AIAC
1066 application. In our study, we consider that the computational costs of
1067 tasks are approximately the same and that the communications costs are
1068 also the same (this comes from the difficulty to evaluate real costs
1069 in the AIAC model). For our experiments the bandwidth of matrices has
1070 been reduced in order to limit the dependencies and we fixed it to
1071 $35,000$. This bandwidth size generates, according to the problem's
1072 size, between 10 and 25 neighbors per tasks.
1073
1074 % The general form of the TIG for this application is given
1075 % by Figure \ref{fig:tigcg}.
1076
1077 % \begin{figure}[h!]
1078 %   \centering
1079 %   \includegraphics[width=8cm]{images/tigcg2}
1080 %   \caption{Part of the form of the TIG representing an instance of the
1081 %     NAS Kernel CG application}
1082 %   \label{fig:tigcg}
1083 % \end{figure}
1084
1085 %This figure shows 6 tasks, which are represented by a circle in which
1086 %the identifier of the task is given.
1087
1088 % The computational cost of a task is given by the number on the top
1089 % left-hand side of each circle (for example the cost of task 31 is
1090 % 1000). Communications between tasks are represented by edges on
1091 % which the amount of communication is given (for example, the
1092 % communication cost between tasks 29 and 30 is about 30).
1093 % Dotted lines represent communications with tasks which are not
1094 % represented on the figure. We can see here that each task has four
1095 % neighbors (the two previous and the two next). This amount of
1096 % neighbors is directly related to the bandwidth of the matrix (in this
1097 % example the bandwidth is very small). For more details about the
1098 % influence of the bandwidth on the amount of neighbors, interested
1099 % readers are invited to see \cite{largescale}.
1100
1101
1102
1103 %\subsubsection*{The Grid'5000 platform}
1104 %\Label{sec:g5k}
1105
1106 The platform used for our tests, called Grid’5000\cite{g5k}, is a
1107 French nationwide experimental set of clusters which provides a
1108 configurable and controllable instrument. We can find many clusters
1109 with different kinds of computers with various specifications and
1110 software. Clusters are spread over 9 sites,
1111 %as can be seen on Figure \ref{fig:g5ksite},
1112 and the computing power represents more than 5000
1113 computing cores interconnected by the ``Renater'' network. This
1114 network is the national network for research and education; it
1115 provides a large bandwidth with high latency. Intra-clusters networks
1116 present small bandwidth and low latencies.
1117
1118
1119 % \begin{figure}[h!]
1120 %   \centering
1121 %   \includegraphics[height=6.5cm]{images/g5k-noms}
1122 %   \caption{The Grid'5000 sites map}
1123 %   \label{fig:g5ksite}
1124 % \end{figure}
1125
1126
1127
1128 \subsection{Other mapping algorithms}
1129 \label{sec:othermaping}
1130
1131 % In this section we present the two other mapping algorithms we used
1132 % in our experiments to compare the performance of the AIAC QM
1133 % algorithm. The first one was used to evaluate the benefits of a
1134 % mapping solution in section \ref{sec:benef}. The second one was used
1135 % to show the differences between the two mapping class, the ``execution
1136 % time'' and the ``edge-cuts'' optimizations, as it is a fully edge-cut
1137 % optimization based mapping algorithm.
1138
1139 \subsubsection{A Simple Mapping algorithm}
1140 \label{sec:sma}
1141
1142 %As mentioned in section \ref{sec:pb}, the asynchronous iteration model
1143 %has some specificities which distinguishes it from other classical model
1144 %for the mapping.
1145 %
1146 %The first thing we have done was to be sure that a mapping algorithm
1147 %would enhance applications performance.
1148 The \textit{Simple Mapping algorithm} (SMa) was designed to
1149 show the benefits of a mapping algorithm in the JaceP2P-V2 platform.
1150 %The text of the \textit{Simple Mapping} if given by
1151 %Algorithm \ref{alg:sma} gives the pseudo-code of the \textit{Simple
1152 %  Mapping algorithm}.
1153 %, in which we can see that it is very simple, with a complexity in $O(n^
1154 %2)$ resulting from sort methods.
1155
1156
1157 % \SetAlgoSkip{}
1158 % \begin{algorithm}
1159 %   \SetLine
1160 %   \dontprintsemicolon
1161 %   \KwIn{Sets of tasks and computing nodes}
1162 %   \KwOut{Mapping of tasks to nodes}
1163
1164 %   \BlankLine
1165
1166 %   sort computing nodes by cluster\;
1167 %   sort clusters by size, from higher to lower\;
1168 %   map tasks in order on sorted clusters\;
1169 %   \caption{The Simple Mapping algorithm}
1170 %   \label{alg:sma}
1171 % \end{algorithm}
1172
1173
1174 %The aim of this algorithm is to do similarly to a very simple
1175 %edge-cuts optimization algorithm. To do that
1176 The algorithm puts each node in a cluster entity.
1177 %, which is given for now by the convention name of nodes
1178 %on Grid'5000 (for example azur-3.sophia.grid500.fr is the node 3 of
1179 %the azur cluster in the sophia site).
1180 Then it sorts clusters by their size, from the higher to the lower.
1181 %; this operation is the part which
1182 %is inspired by edge-cuts optimization based algorithms.
1183 Finally, all tasks are mapped in order on the sorted clusters; each
1184 task is assigned to a particular computing node of the chosen cluster.
1185
1186 %Though this algorithm is trivial, it allows AIAC applications to run faster
1187 %on distributed clusters architectures, with a gain over $30\%$ on
1188 %execution, as experiments described in section \ref{sec:benef}.
1189
1190 \subsubsection{Edge-cuts optimization}
1191 \label{sec:edgcutalgo}
1192
1193 As explained in section \ref{sec:pb}, the asynchronous iteration model
1194 is so specific and unpredictable that we would like to evaluate the
1195 second kind of mapping algorithm, which aims to optimize the
1196 ``edge-cuts''.
1197 %Multilevel partitioning algorithms such as Metis and
1198 %Chaco fail to address the limitations imposed by heterogeneity in the
1199 %underlying targeted system. They assume that computing nodes and
1200 %network relying them are homogeneous. This is not corresponding to our
1201 %execution environment, which is fully heterogeneous. These methods are
1202 %based on ``graph growing'' (GGP) and/or ``greedy graph growing''
1203 %(GGGP) algorithms which aim to divide tasks into two, or for some
1204 %algorithm a power of two, partitions. In our case, we do not know in
1205 %advance the number of partitions we need. Indeed, it depends on the
1206 %amount of nodes in each cluster and on the number of tasks.
1207 %
1208 %As GGP and GGGP algorithms seems to be efficient in specific cases, it
1209 %could be interesting to adapt one to our model, in order to evaluate a
1210 %real edge-cuts optimization algorithm.
1211 We choose the Farhat's algorithm\cite{farhat}, which has the ability
1212 to divide the graph into any number of partitions, thereby avoiding
1213 recursive bisection.
1214 %Therefore its running execution time is
1215 %independent of the desired number of subsections.
1216
1217 % The adapted version of this algorithm, Farhat's Edge-Cut (F-EC),
1218 % evaluated in the JaceP2P-V2 environment is described in Algorithm
1219 % \ref{alg:edgecuts}.
1220
1221 % \SetAlgoSkip{}
1222 % \begin{algorithm}
1223 %   \SetLine
1224 %   \dontprintsemicolon
1225
1226 %   \KwIn{Sets of tasks and computing nodes}
1227 %   \KwOut{Mapping of tasks to nodes}
1228
1229 %   \BlankLine
1230
1231
1232 %   sort nodes by cluster\;
1233 %   $lTasks \leftarrow$ sort tasks by dep degree\;
1234 %   $changeCluster \leftarrow$ true\;
1235 %   $cTasks \leftarrow$ empty;
1236
1237 %   \BlankLine
1238
1239 %   \While{one task is not mapped}{
1240 %     \If{$changeCluster$}{
1241 %       $curCluster \leftarrow$ nextCluster()\;
1242 %       $places \leftarrow$ size($curCluster$)\;
1243 %       $changeCluster \leftarrow$ false\;
1244 %       $mTasks \leftarrow$ empty\;
1245 %     }
1246
1247 %     \BlankLine
1248
1249 %     \If{no task in cTasks}{
1250 %       $cTasks \leftarrow$ first task from $lTasks$\;
1251 %     }
1252
1253 %     \BlankLine
1254
1255 %     $curTask \leftarrow$ first task in $cTasks$\;
1256
1257 %     \BlankLine
1258
1259 %     \eIf{$( 1 + $dep(curTask)~$\cdot~\delta) <= places)$}{
1260 %       remove $curTask$ from $cTasks$\;
1261 %       add $curTask$ in $mTasks$\;
1262 %       $places \leftarrow places - 1$\;
1263 %       add dep(curTask) in cTasks\;
1264 %     }{
1265 %       $changeCluster$ $\leftarrow$ true\;
1266 %       associate $mTasks$ with $curCluster$\;
1267 %     }
1268 %   }
1269
1270 %   \caption{The Fahrat's Edge-Cut algorithm}
1271 %   \label{alg:edgecuts}
1272 % \end{algorithm}
1273
1274 This algorithm aims to do a ``clusterization'' of the tasks. First, it
1275 groups computing nodes in clusters, which are sorted according to
1276 their number of nodes, from the higher to the lower. Tasks are ordered
1277 following their dependency degree, starting from the higher to the
1278 lower. Tasks in the top of the list have a higher priority to be
1279 mapped. Next, the algorithm tries to map on each cluster the maximum
1280 number of tasks. To map a task on a cluster, the algorithm evaluates
1281 if there is enough space to map the task and some of its
1282 dependencies. This amount of dependencies is fixed by a factor
1283 $\delta$, which is a parameter of the algorithm. In the positive case,
1284 the task is mapped on the current cluster and its dependencies become
1285 priority tasks to be mapped. This allows to keep the focus on the
1286 communicating tasks locality.
1287 %All computing nodes are first grouped in clusters.
1288 %%, using their full
1289 %%qualified name -- for now, no criteria is used to sort clusters, such
1290 %%are their power or size.
1291 %Next, tasks are inserted in a list $lTasks$, sorted in descending
1292 %according to their dependency degree. This allows to map tasks with
1293 %their dependencies to improve communications. Next begins the main
1294 %mapping loop.
1295 %
1296 %While not all tasks have been mapped the algorithm do the
1297 %following. First, at the beginning, it chooses the first available
1298 %cluster to be the current cluster to map tasks on, and sets the
1299 %variable $places$ to its size, in number of computing nodes, and
1300 %create a new tasks' list $mTasks$. During the execution of the
1301 %algorithm, passing in this block means that there is no more node
1302 %available in the current cluster, so it has to choose the next
1303 %cluster. Next, it looks in $cTasks$, which is a tasks list containing
1304 %tasks which are in instance to be mapped. If $cTasks$ is empty it
1305 %takes the first available task in $lTasks$ -- at the beginning it
1306 %takes the first one, which has the higher dependency degree.
1307 %
1308 %Next, it takes the first task in $cTasks$ to try to map it on the
1309 %current cluster. The task is mapped on if there is enough available
1310 %computing nodes on the current cluster. This amount of node
1311 %corresponds to $1 + dep( curTask ) \cdot \delta$, which represents the
1312 %task plus the amount of its dependencies multiplied by a ``local
1313 %dependency factor'' $\delta$, which indicates how many dependencies
1314 %must be a least with $curTask$ on the same cluster. If the task could
1315 %be mapped on this cluster, $curTask$ is added in $mTasks$, which is a
1316 %tasks' list containing tasks which are currently mapped on current
1317 %cluster, and removed it from $cTasks$. The number of available nodes,
1318 %$places$, is decremented, and the dependencies of $curTask$ are added
1319 %to $cTasks$. This list is sorted in the same way as $lTasks$, indeed
1320 %tasks are added according to their dependency degree, from the higher
1321 %to the lower. Otherwise, if there is not enough or no more nodes
1322 %available in the cluster, it has to change the current cluster, so it
1323 %associates the current cluster and list of tasks mapped on, $mTasks$.
1324
1325
1326 \subsection{Experiments}
1327 \label{sec:experiments}
1328
1329 After having described the different components of the experiments, we
1330 now present the impacts of the AIAC QM mapping on applications running
1331 with JaceP2P-V2 on a heterogeneous distributed clusters
1332 architecture. In the following, we note ``heterogeneity degree'' the
1333 degree of heterogeneity of distributed clusters; it is the ratio
1334 between the average and the standard deviation of the computing nodes
1335 power. This heterogeneity degree may vary from 0, nodes are
1336 homogeneous, to 10, nodes are totally heterogeneous. In these
1337 experiments, we consider that there is no computing nodes failing during
1338 applications execution.
1339
1340 The application used to realize these experiments is the KernelCG of
1341 the NAS parallel benchmark, in the multi-splitting version. Two
1342 problem sizes were used: one using a matrix of size $550,000$ (named
1343 ``class E'') using 64 computing nodes and the other using a matrix of
1344 size $5,000,000$ (named ``class F'') using 128 nodes.
1345 %
1346 %Figure
1347 %\ref{fig:tigcg} shows a part of a TIG of this application.
1348 %
1349 %\begin{figure}[h!]
1350 %  \centering
1351 %
1352 %  \label{fig:tigcg}
1353 %\end{figure}
1354
1355
1356
1357 \subsubsection{About heterogeneity}
1358 \label{sec:xphetero}
1359
1360 The first experiments concern the study of the impact of the
1361 heterogeneity of the computing nodes on the mapping
1362 results. Heterogeneity is an important factor in high performance
1363 computing in the grid all the more so when using the asynchronous
1364 iteration model.
1365
1366 As mapping algorithms take in parameter a factor of research (for AIAC
1367 QM) and the amount of local dependencies (for F-EC), we fixed both to
1368 $50\%$. That means for AIAC QM that at each round the amount of
1369 considering nodes would be divided by two, and for F-EC that each task
1370 requires half of its dependencies on the same local cluster.
1371
1372 Four experiments were done using four architectures having different
1373 heterogeneity degrees -- in two architectures computing nodes are more
1374 heterogeneous than in the others. In these experiments, we did not
1375 affect the networks heterogeneity, because of the difficulty to
1376 disturb and control network on Grid'5000; by default, networks are
1377 already quite heterogeneous. We needed more than 200 computing nodes
1378 to execute our application because of the small capacity of some
1379 clusters to execute the largest problems (there is not enough
1380 memory). The nodes used have more than 2 GB of RAM and both execute a
1381 Linux 64 bits distribution.
1382
1383 The first architecture, Arc1.1, was composed of 113 computing nodes
1384 representing 440 computing cores, spread over 5 clusters in 4
1385 geographically distant sites. In Arc1.1 we used bi-cores (2 clusters), quad-cores (2
1386 clusters) and bi-quad-cores (1 cluster) machines. Its heterogeneity
1387 degree value is 6.43. This architecture was used to run class E of the CG
1388 application using 64 computing nodes. The second architecture, Arc1.2,
1389 used to execute class F of the CG application, using 128 computing
1390 nodes, was composed of 213 computing nodes representing 840 computing
1391 cores, with a heterogeneity degree of 6.49. This architecture was
1392 spread on the same clusters and sites as Arc1.1. The results of the
1393 experiments on Arc1.1 and Arc1.2 are given in Table \ref{tab:exph1E}
1394 and Table \ref{tab:exph1F}, which give the gains in execution time obtained
1395 in comparison to the version without mapping.
1396
1397 %\vspace{0.2cm}
1398
1399 \renewcommand{\arraystretch}{1.5}
1400
1401 \begin{table}[h!]
1402   \centering
1403   \begin{tabular}[h!]{|c||c|c|c|c|}
1404     \hline
1405     Algorithm& None&SMa & AIAC QM & F-EC \\
1406     \hline
1407     \hline
1408     Execution time&150s&110s&101s&90s\\
1409     \hline
1410 %    Gains&--&$27\%$&$33\%$&\textcolor{blue}{$40\%$}\\
1411     Gains&--&$27\%$&$33\%$&$40\%$\\
1412     \hline
1413   \end{tabular}
1414   \caption{Gains in time of the execution of the class E of the CG
1415     application on Arc1.1 using 64 nodes}
1416   \label{tab:exph1E}
1417   \vspace*{-0.3cm}
1418 \end{table}
1419
1420 %\vspace{-0.5cm}
1421
1422 \renewcommand{\arraystretch}{1.5}
1423
1424 \begin{table}[h!]
1425   \centering
1426   \begin{tabular}[h!]{|c||c|c|c|c|}
1427     \hline
1428     Algorithm& None &SMa & AIAC QM & F-EC \\
1429     \hline
1430     \hline
1431     Execution time&403s&265s&250s&218s\\
1432     \hline
1433 %    Gains&--&$34\%$&$38\%$&\textcolor{blue}{$46\%$}\\
1434     Gains&--&$34\%$&$38\%$&$46\%$\\
1435     \hline
1436   \end{tabular}
1437   \caption{Gains in time of the execution of the class F of the CG
1438     application on Arc1.2 using 128 nodes}
1439   \label{tab:exph1F}
1440   \vspace*{-0.3cm}
1441 \end{table}
1442
1443 At first, we can see that the Simple Mapping algorithm, though it is
1444 simple, provides a significant improvement of application execution
1445 time. This highlights that JaceP2P-V2 really needs a mapping algorithm
1446 in order to be more efficient. Then, we can see that the F-EC and the
1447 AIAC QM algorithms provide a better mapping than the Simple Mapping
1448 algorithms. We can see a significant difference between both
1449 algorithms. This comes from the homogeneity of clusters. In this case,
1450 the F-EC algorithm is more efficient since the minimization of the
1451 communications becomes more important than the tackle of the
1452 computational power heterogeneity problem.
1453 %Indeed, it is more benefic for ta becomes more important than the
1454 %tacklebecomes more important than the
1455 %tackle sks to have locally their
1456 %dependencies, which allows to improve communications, in case of
1457 %computing nodes are more homogeneous -- communications are more
1458 %important than computing power (that is why the F-EC algorithm is more
1459 %efficient).
1460 The effect is that tasks do less iterations as they
1461 receive more frequently updated data from their neighbors. In
1462 addition, as tasks and their dependencies are on the same cluster,
1463 communications are improved, but also as computations take
1464 approximately the same time, the amount of iterations is reduced and
1465 the algorithm can converge more quickly.
1466
1467 % Another important positive point is that gains are scalable, which allows
1468 % to foresee big improvements for very large applications.\\
1469
1470 The third architecture, Arc2.1, was composed of 112 computing nodes,
1471 representing 394 computing cores, spread over 5 clusters in 5
1472 sites. In this architecture we used bi-cores (3 clusters),
1473 quad-cores (1 cluster) and bi-quad-cores (1 cluster) machines. Its
1474 heterogeneity degree's value is 8.41. This architecture was used to run
1475 class E of the CG application, using 64 computing nodes. The fourth
1476 architecture, Arc2.2, used to execute class F of the CG
1477 application, using 128 computing nodes, was composed of 212 computing
1478 nodes representing 754 computing cores, with a degree of heterogeneity
1479 of 8.44. This architecture was spread on the same clusters and sites
1480 as Arc2.1. The results of the experiments on Arc2.1 and Arc2.2 are
1481 given in Table \ref{tab:exph2E} and Table \ref{tab:exph2F}, which give the
1482 gains in execution time obtained in comparison to the version without
1483 mapping.
1484
1485 %\vspace{0.2cm}
1486
1487 \renewcommand{\arraystretch}{1.5}
1488
1489 \begin{table}[h!]
1490   \centering
1491   \begin{tabular}[h!]{|c||c|c|c|c|}
1492     \hline
1493     Algorithm&None& SMa & AIAC QM & F-EC \\
1494     \hline
1495     \hline
1496     Execution time&498s&341s&273s&385s\\
1497     \hline
1498 %    Gains&$32\%$&\textcolor{blue}{$45\%$}&\textcolor{red}{$23\%$}\\
1499     Gains&--&$32\%$&$45\%$&$23\%$\\
1500     \hline
1501   \end{tabular}
1502   \caption{Gains in time of the execution of the class E of the CG
1503     application on Arc2.1 using 64 nodes}
1504   \label{tab:exph2E}
1505   \vspace*{-0.3cm}
1506 \end{table}
1507
1508 \renewcommand{\arraystretch}{1.5}
1509
1510 \begin{table}[h!]
1511   \centering
1512   \begin{tabular}[h!]{|c||c|c|c|c|}
1513     \hline
1514     Algorithm& None&SMa & AIAC QM & F-EC \\
1515     \hline
1516     \hline
1517     Execution time&943s&594s&453s&660s\\
1518     \hline
1519 %    Gains&$37\%$&\textcolor{blue}{$52\%$}&\textcolor{red}{$30\%$}\\
1520     Gains&--&$37\%$&$52\%$&$30\%$\\
1521     \hline
1522   \end{tabular}
1523   \caption{Gains in time of the execution of the class F of the CG
1524     application on Arc2.2 using 128 nodes}
1525   \label{tab:exph2F}
1526   \vspace*{-0.3cm}
1527 \end{table}
1528
1529 To begin with, these experiments confirm that a mapping algorithm is
1530 needed and that improvements are always scalable. Then, we can see
1531 that the F-EC algorithm falls in performance and AIAC QM is
1532 improved. What is surprising is that the Simple Mapping algorithm is
1533 better than F-EC. This can be explained by the fact that as computing
1534 nodes are quite heterogeneous, computations are not the same, so it is
1535 not significant to map dependencies close to tasks. In this case, the
1536 most important is the power of computing nodes. So, in this kind of
1537 architecture, it is more efficient to choose the best computing nodes
1538 to compute iterations more quickly and to improve the convergence
1539 detection.
1540
1541 % Here, it is important to note that the AIAC QM algorithm offers a gain
1542 % of about $50\%$ on the execution time, that is to say that the
1543 % application takes half of the execution time than without mapping.
1544
1545 % \subsubsection{Parameters variation}
1546 % \label{sec:xpvariation}
1547
1548 % After having evaluated mapping algorithms on the heterogeneity of
1549 % distributed clusters, we now propose to change the parameters of AIAC
1550 % QM and F-EC algorithms, in order to determine which values are the
1551 % most accurate.
1552
1553 % To do these experiments, we used an architecture composed of 122
1554 % computing nodes representing 506 computing cores, spread over 5
1555 % clusters in 5 sites. In this architecture we used bi-cores (2
1556 % clusters), quad-cores (2 clusters) and bi-quad-cores (1 cluster)
1557 % machines. Its heterogeneity degree value is 4.98.
1558 % %, which means that
1559 % %computing nodes power is very heterogeneous.
1560
1561 % The parameters of each algorithm, $f$ (the search factor) for
1562 % AIAC QM and $\delta$ (the amount of local dependencies) for F-EC,
1563 % varied both with values $10\%$, $50\%$ and $90\%$. We used the CG
1564 % multi-splitting application on 64 computing nodes. The results of
1565 % these experiments are given in Table \ref{tab:expparams}. Results
1566 % reported in this table represent the gains in execution time provided
1567 % by each algorithm with different parameters values.
1568
1569 % %\vspace{0.2cm}
1570 % \renewcommand{\arraystretch}{1.5}
1571
1572 % \begin{table}[h!]
1573 %   \centering
1574 %   \begin{tabular}[h!]{|c||c|c|c|}
1575 %     \hline
1576 %     Parameters& $10\%$ & $50\%$ & $90\%$ \\
1577 %     \hline
1578 %     \hline
1579 % %    Simple & \multicolumn{3}{c|}{\textcolor{blue}{$39\%$}}\\
1580 %     SMa & \multicolumn{3}{c|}{$30\%$}\\
1581 %     \hline
1582 % %    AIAC QM & $35\%$ & \textcolor{blue}{$41\%$} & $35\%$ \\
1583 %     AIAC QM & $30\%$ & $32\%$ & $30\%$ \\
1584 %     \hline
1585 % %    F-EC & $39\%$ & $35\%$ & \textcolor{blue}{$45\%$} \\
1586 %     F-EC & $40\%$ & $37\%$ & $45\%$ \\
1587 %     \hline
1588 %   \end{tabular}
1589 % %  \caption{Parameters variations using a $500'000$ problem size on an
1590 % %    architecture of 5.37 heterogeneity degree}
1591 %   \caption{Gains in execution time with mapping algorithms parameters
1592 %     variations using the class E of the CG application using 64
1593 %     computing nodes}
1594 % %  \vspace*{-0.4cm}
1595 %   \label{tab:expparams}
1596 % %  \vspace*{-0.9cm}
1597 % \end{table}
1598
1599 % %First of all, we can see that the Simple mapping provides the same
1600 % %order of performance, as shown in the precedent section, so it is
1601 % %not affected by the heterogeneity degree. Secondly,
1602 % For the AIAC QM algorithm, we can note that the best value for its
1603 % parameter $f$ is about $50\%$, but its impact is not big enough to
1604 % indicate a specific configuration.
1605 % % With a low heterogeneity degree, this mapping algorithm provides a
1606 % % good performance improvement.
1607 % Finally, and this is not surprising, the F-EC algorithm is more
1608 % efficient with a factor $\delta$ near $100\%$, which directly comes
1609 % from its aim. But we can see that it is more efficient to have a
1610 % factor around $10\%$ than having one around $50\%$.
1611
1612 % We can note here, with a lower heterogeneity degree than in previous
1613 % experiments, gains are lower and the difference between AIAC QM and
1614 % F-EC (with parameters at $50\%$) is lower. It can be explained as the
1615 % fact that more the heterogeneity degree tends to 0 more computing
1616 % nodes are the same, so a mapping solution will not be efficient,
1617 % except one only optimizing network latency.
1618 %These experiments show that the impact of parameters values does not
1619 %well influence the AIAC QM, whereas it is very important for the F-EC
1620 %algorithm.
1621
1622 %\section{Discussion}
1623 %\label{sec:discussion}
1624 %
1625 %In this paper three algorithms for mapping asynchronous iterative
1626 %applications on heterogeneous distributed architectures are described.
1627 %This approach is relatively new, so we had to study related works. We
1628 %assume in our model that the amount of computation, the computational
1629 %cost of each task, cannot be known in advance, and it is the same
1630 %about communications. Two mains class of algorithms are mainly used,
1631 %the ``execution time'' and the ``edge-cuts'' optimizations, with a
1632 %newer preference fort the first one. Indeed, the efficiency of second
1633 %one is criticized for its objectives, invoking the fact that these
1634 %algorithms do not optimize the right metric. It is true in a broad
1635 %spectrum of mapping domains, but we have shown that in our case, it
1636 %could be an efficient solution, depending on the architecture
1637 %heterogeneity.
1638 %
1639 %As each experiment takes a lot of time and the Grid'5000 platform is
1640 %shared by many researchers, we could not conducted as many experiments
1641 %as we wanted and as we need to purpose an exhaustive view of this part
1642 %of the mapping domain. We cannot design a more general state of the
1643 %mapping of asynchronous iterative applications on distributed
1644 %architectures, but we can draw the main lines of future works.
1645 %%, by
1646 %%giving the first stones at the building of this specific part of the
1647 %%mapping domain.
1648 %
1649 %We have shown that the heterogeneity in computing nodes power takes an
1650 %important part in an efficient mapping algorithm. This parameter and
1651 %probably the heterogeneity in network should really be more taken into
1652 %consideration. Maybe a good solution consists in designing of mapping
1653 %algorithms giving more important priority to one or to the other
1654 %optimization objectives. This leads to design a novel algorithm, which
1655 %takes into account the different points discussed in this paper, which
1656 %would probably be an hybrid algorithm, efficient with our model on the
1657 %targeted architectures.
1658
1659 \section{Conclusion and future works}
1660 \label{sec:conclu}
1661
1662 In this paper we have presented
1663 %three algorithms to address the
1664 %mapping problem for asynchronous iterative applications in
1665 %heterogeneous distributed architectures. As the asynchronous iteration
1666 %model is very specific, it was not clear of which kind of mapping
1667 %algorithm should be efficient on such a problem. The two main
1668 %approaches given in the literature, the ``execution time'' and the
1669 %``edge-cuts'' optimization algorithms, have been evaluated on
1670 %different architectures, with different heterogeneity degrees.
1671 %% These architectures varied in their
1672 %%heterogeneity, to evaluate the algorithms.
1673 %
1674 %%We proposed three mapping algorithms for the JaceP2P-V2
1675 %%environment. The first is a simple way mapping, the Simple Mapping
1676 %%algorithm, which always provides a good and stable improvement of
1677 %%performance on all kinds of architectures.
1678 %We propose
1679 a specific mapping algorithm for the AIAC model, called AIAC QM. This
1680 algorithm is based on the execution time optimization but it also
1681 includes a small degree of edge-cuts optimization. Experiments  show that
1682 the AIAC QM mapping algorithm is efficient on architectures with a
1683 high heterogeneity degree. This can be explained by the fact that all
1684 iteration computations are quite different, for our example, and the
1685 convergence is more quickly detected as the more powerful computing
1686 nodes progress in the computation.
1687 % The F-EC algorithm, which is based
1688 % on the ``edge-cuts'' optimization, is meanwhile efficient on
1689 % architectures with a low heterogeneity degree. This can be explained
1690 % by the fact that in such an environment, it is more accurate for a
1691 % task to have its dependencies locally on the same cluster in order to
1692 % have efficient communications and to allow iterations to be computed
1693 % together, which improves the convergence detection speed.
1694 % Experiments we conducted have shown gains in execution time up to
1695 % $50\%$, which denotes a division by 2 of this execution time, for a
1696 % typical AIAC application on more than 700 computing cores.
1697 % %Experiments have shown that
1698 % %the importance of the parameter of both algorithms, AIAC QM and F-EC,
1699 % %is not so essential for the first one, instead it is very critical for
1700 % %the second one, but we cannot be sure that it is true all the time on
1701 % %all kinds of architectures; it maybe depends on the heterogeneity
1702 % %degree of the network.
1703 % As we did not influence the network's heterogeneity,
1704 % % as we did for the computational power of nodes,
1705 % the evaluation of the network impact on the
1706 % application execution time would be one of our next work.
1707
1708 %For now, these three mapping algorithms are implemented in an
1709 %additional library for the JaceP2P-V2 environment. The results
1710 %presented in this paper show that a mapping algorithm allows to
1711 %improve applications performance, but as the executing architectures
1712 %should have a variety of heterogeneity degree, we have to find a
1713 %compromise between the two approaches in order to have an efficient
1714 %mapping algorithm on all kinds of architectures. In the future, we
1715 %would like to design an efficient mapping algorithm to improve the
1716 %execution of asynchronous iteration applications on heterogeous
1717 %distributed architectures. As the algorithm should be integrated in
1718 %the JaceP2P-V2 environment, which is fully decentralized and fault
1719 %tolerant, the new mapping algorithm should have also these
1720 %characteristics, in order to retrieve a fully decentralized and fault
1721 %tolerant environment.
1722 % Our future works concern the amelioration of the AIAC QM algorithm, in
1723 % order to improve it on homogeneous distributed architectures. As the
1724 % F-EC mapping algorithm is efficient on such architectures, we will
1725 % give a more important consideration to the edge-cuts part of AIAC
1726 % QM.
1727
1728 In our future work we plan to take into consideration the fault
1729 tolerance problem. In this study we have realized our experiments
1730 without computing node fault, which is not the real case. We have to
1731 take into account the AIAC QM algorithm about this important
1732 parameter. First we have to efficiently choose new nodes to replace
1733 failed ones. Secondly, as we do checkpointing to save tasks' states,
1734 we have to efficiently choose backup nodes not to fail in case
1735 a whole cluster fails, as we save on neighbors (which are
1736 in general on the same cluster for communication efficiency reasons),
1737 an important part of the application is lost and we cannot restart
1738 this part; so the whole application fails. A trade-off should be done
1739 by having some saving nodes in external clusters.
1740
1741 % \subsubsection*{Acknowledgements}
1742
1743 % \label{sec:merci}
1744
1745 % This work was supported by the European Interreg IV From-P2P project
1746 % and the region of Franche-Comté.
1747
1748 % Experiments presented in this paper were carried out using the
1749 % Grid'5000\cite{g5k} experimental testbed, being developed under the
1750 % INRIA ALADDIN development action with support from CNRS, RENATER and
1751 % several Universities as well as other funding bodies.
1752
1753
1754 \bibliographystyle{unsrt}
1755 %\nocite{*}
1756 \bibliography{biblio}
1757
1758
1759 \end{document}
1760
1761 %% %%