pdsec2010/pdsec2010.tex

   1 %% Version PDSEC
   2 \documentclass[10pt,conference,compsocconf]{IEEEtran}
   3
   4 %% Pour Ingrid
   5 %\documentclass[conference,compsoc,a4paper,onecolumn]{IEEEtran}
   6 %\documentclass[compsoc,a4paper,oneside]{IEEEtran}
   7
   8 \usepackage[T1]{fontenc}
   9 \usepackage{ucs}
  10 \usepackage[utf8x]{inputenc}
  11 \usepackage{lmodern}
  12 \usepackage{color}
  13 %% Jolis entetes %%
  14 %\usepackage[Glenn]{fncychap}
  15 \usepackage{amsmath}
  16 %\usepackage{amsthm}
  17 \usepackage{amsfonts}
  18 %\usepackage{graphicx}
  19 %\usepackage{xspace}
  20 % Definition des marges
  21 %\usepackage{vmargin}
  22 %\setpapersize[portrait]{A4}
  23
  24 \usepackage[english]{babel}
  25 % Extension pour les graphiques EPS
  26 %\usepackage[dvips]{graphicx}
  27 \usepackage[pdftex,final]{graphicx}
  28 % Extension pour les liens intra-documents (tagged PDF)
  29 % et l'affichage correct des URL (commande \url{http://example.com})
  30 %\usepackage{hyperref}
  31 % Extension pour que plus de titres apparaissent dans la table des matieres
  32 % (table des matieres, bibliographie, index).
  33 %\usepackage{tocbibind}
  34 %\newcommand{\tabledesmatieres}{
  35 %  \setlength{\parskip}{0ex} % suppression de l'espace entre les paragraphes
  36 %  \tableofcontents
  37 %  \espacementnormal
  38 %}
  39 %\NoAutoSpaceBeforeFDP
  40 %\author{Sébastien Miquée}
  41
  42 \usepackage[ruled,vlined]{algorithm2e}
  43 \usepackage{cite}
  44
  45 \newcommand{\myitemize}[1]
  46 {
  47 \begin{itemize}
  48 \itemsep=#1pt
  49 }
  50
  51 %
  52 \ifCLASSINFOpdf
  53    \usepackage[pdftex]{graphicx}
  54    \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
  55 \else
  56 \fi
  57
  58 % *** PDF, URL AND HYPERLINK PACKAGES ***
  59 %
  60 \usepackage{url}
  61
  62
  63 \title{Mapping Asynchronous Iterative Applications on Heterogeneous
  64   Distributed Architectures}
  65
  66 \author{
  67 \IEEEauthorblockN{Raphaël Couturier \hspace{10pt} David Laiymani
  68   \hspace{10pt} Sébastien Miquée}
  69
  70 \IEEEauthorblockA{Laboratoire d'Informatique de Franche-Comté
  71   (LIFC)\\
  72   %University of Franche-Comté\\
  73   IUT de Belfort-Montbéliard,2 Rue Engel Gros, BP 27, 90016 Belfort,
  74   France\\
  75  % Tel.: +33-3-84587782 \hspace{20pt} Fax: +33-3-84587781\\
  76 Email:
  77 \{raphael.couturier,david.laiymani,sebastien.miquee\}@univ-fcomte.fr}
  78
  79 \thanks{This work was supported by the European Interreg IV From-P2P project
  80 and the region of Franche-Comté}
  81 }
  82
  83 %% Permet de réactiver le \thanks
  84 \IEEEoverridecommandlockouts
  85
  86 \date{}
  87 \begin{document}
  88
  89 %% left, top, right, bottom
  90 %\setmargrb{20mm}{15mm}{20mm}{15mm}
  91
  92 \IEEEcompsoctitleabstractindextext
  93
  94 \maketitle
  95
  96
  97 \begin{abstract}
  98   To design parallel numerical algorithms on large scale distributed
  99   and heterogeneous platforms, the asynchronous iteration model (AIAC)
 100   may be an efficient solution. This class of algorithm is very
 101   suitable since it enables communication/computation overlapping and
 102   it suppresses all synchronizations between computation nodes. Since
 103   target architectures are composed of more than one thousand
 104   heterogeneous nodes connected through heterogeneous networks, the
 105   need for mapping algorithms is crucial. In this paper, we propose a
 106   new mapping algorithm dedicated to the AIAC model. To evaluate our
 107   mapping algorithm we implemented it in the JaceP2P programming and
 108   executing environment dedicated to AIAC applications and we
 109   conducted a set of experiments on the Grid'5000 testbed.  Results
 110   are very encouraging and show that the use of our algorithm brings
 111   an important gain in term of execution time (about $40\%$).
 112
 113 %with more
 114 %  than 700 computing cores and with a real and typical AIAC
 115 %  application based on the NAS parallel benchmarks.
 116
 117 %  To design parallel and distributed applications on heterogeneous
 118 %  distributed architectures, the asynchronous iteration model may be
 119 %  an efficient solution. Such architectures contain various computing
 120 %  nodes connected by heterogeneous links. Nowadays, these
 121 %  architectures offer users too many computing nodes as they need, so
 122 %  a choice should be done, which is called ``tasks mapping''. In this
 123 %  paper we propose a comparison between different mapping algorithms,
 124 %  in order to evaluate mapping needs of the asynchronous iteration
 125 %  model on this kind of architectures. To do our experiments we used a
 126 %  middleware, JaceP2P-V2, which allows users to design asynchronous
 127 %  iterative applications and execute them on distributed
 128 %  architectures, in which we added a mapping library.\\
 129 \end{abstract}
 130
 131 \begin{IEEEkeywords}
 132 Mapping algorithms; Distributed clusters; Parallel iterative
 133 asynchronous algorithms; Heterogeneous distributed architectures.
 134 \end{IEEEkeywords}
 135
 136
 137
 138
 139 \IEEEpeerreviewmaketitle
 140
 141 %\begin{keywords}
 142 %  Mapping algorithms, Distributed clusters, Parallel iterative
 143 %  asynchronous algorithms, Heterogeneous architectures.
 144 %\end{keywords}
 145
 146 \section{Introduction}
 147 \label{sec:intro}
 148
 149 Nowadays scientists of many domains, like climatic simulation or
 150 biological research, need large and powerful architectures to compute
 151 their large applications. Distributed clusters architectures, which
 152 are part of the grid architecture, are one of the best architectures
 153 used to solve such applications with an acceptable execution
 154 time. These architectures provide a lot of heterogeneous computing
 155 nodes interconnected by a high performance network, but even with the
 156 greatest efforts of their maintainers, there are latency and
 157 computation capacity differences between clusters of each site.
 158
 159 In order to efficiently use this massive distributed computation
 160 power, numerous numerical algorithms have been modified. These
 161 algorithms can be broadly classified into two categories. First,
 162 \textit{direct methods}, which give the exact solution of the problem
 163 using a finite number of operations (e.g. Cholesky,
 164 LU\dots). These methods cannot be applied to all kinds of
 165 numerical problems. Generally, they are not well adapted to very
 166 large problems. Then \textit{iterative methods}, that repeat the same
 167 instructions until a desired approximation of the solution is reached
 168 -- we say that the algorithm has converged. Iterative algorithms
 169 constitute the only known approach to solving some kinds of problems
 170 and are easier to parallelize than direct methods. The Jacobi or
 171 Conjugate Gradient algorithms are examples of such iterative
 172 methods.
 173 % \myitemize{5}
 174
 175 % \begin{figure}[h!]
 176 %   \vspace{0.1cm}
 177 %   \centering
 178 %   \includegraphics[width=7.4cm]{images/ISCA}
 179 %   \caption{Two processors computing in the Synchronous Iteration - Asynchronous Communication (SIAC) model}
 180 %   \label{fig:SIAC}
 181 % \end{figure}
 182
 183 In the rest of this paper we only focus on iterative methods. Now to
 184 parallelize this kind of algorithm, two classes of parallel iterative
 185 models can be described. In the \textit{synchronous iteration model}
 186 after each iteration a node sends its results to its neighbors and
 187 waits for the reception of all dependency messages from them
 188 to start the next iteration. This results in large idle times and is
 189 equivalent to a global synchronization of nodes after each
 190 iteration. These synchronizations can strongly penalize the overall
 191 performance of the application particularly in case of large scale
 192 platforms with high latency network. Furthermore, if a message is
 193 lost, its receiver will wait forever for this message and the
 194 application will be blocked. In the same way, if a machine fails, all
 195 the computation will be blocked.
 196
 197 \begin{figure}[h!]
 198   \centering
 199   \includegraphics[width=7.4cm]{images/IACA}
 200   \caption{Two processors computing in the Asynchronous Iteration - Asynchronous Communication (AIAC) model}
 201   \label{fig:AIAC}
 202 \end{figure}
 203 \vspace{-0.2cm}
 204
 205 In the \textit{asynchronous iteration model} a node sends its
 206 results to its neighbors and starts immediately the next iteration
 207 with the last received data. These data could be data from previous
 208 iterations, because the most recent data has not arrived in time or
 209 neighbors have not finish their current iteration. The receiving and
 210 sending mechanisms are asynchronous and nodes do not have to wait for
 211 the reception of dependency messages from their
 212 neighbors. Consequently, there is no more idle time between two
 213 iterations. Furthermore, this model is tolerant to messages loss and
 214 even if a node dies, the remaining nodes continue the computation,
 215 with the last data the failed node sent. Unfortunately, the
 216 asynchronous iteration model generally requires more iterations than
 217 the synchronous one to converge to the solution.
 218
 219 This class of algorithms is very suitable in a distributed clusters
 220 computing context because it suppresses all synchronizations between
 221 computation nodes, tolerates messages loss and enables the overlapping
 222 of communications by computations. Interested readers might consult
 223 \cite{book_raph} for a precise classification and comparison of
 224 parallel iterative algorithms. In this way, several experiments
 225 \cite{book_raph} show the relevance of the AIAC algorithms in the
 226 context of distributed clusters with high latency between
 227 clusters. These works underline the good adaptability of AIAC
 228 algorithms to network and processor heterogeneity.
 229
 230 As we aim to solve very large problems on heterogeneous distributed
 231 architectures, in the rest of this study we only focus on the
 232 asynchronous iteration model. In order to efficiently use such
 233 algorithms on distributed clusters architectures, it is essential to
 234 map the application's tasks to the best sub-sets of nodes of the
 235 target architecture. This mapping procedure must take into account
 236 parameters such as network heterogeneity, computing nodes
 237 heterogeneity and tasks heterogeneity in order to minimize the overall
 238 execution time of the application.
 239 %we have to select
 240 %sets of computing nodes, which can improve applications execution
 241 %time. Indeed, as there are more available nodes as we need on such
 242 %architectures, it is important to select appropriate computing nodes,
 243 %due to their heterogeneity at computation power as well as relying
 244 %network.
 245 To the best of our knowledge, there exits no algorithm which
 246 specifically addresses the mapping of AIAC applications on distributed
 247 architectures. The aim of this paper is to propose a new mapping
 248 algorithm dedicated to AIAC applications and to implement it into a
 249 real large scale computing platform, JaceP2P-V2. Experiments conducted
 250 on the Grid'5000 testbed with more than 400 computing cores show that
 251 this new algorithm enhances the performance of JaceP2P-V2 by
 252 about $40\%$ for a real and typical AIAC application.
 253
 254 %this mapping problem. Our aim is to evaluate the
 255 %more common approaches to solve this problem, by using several mapping
 256 %algorithms implemented in a real large scale computing platform,
 257 %JaceP2P-V2, on a true distributed clusters architecture.\\
 258
 259 The rest of this paper is organized as follows. Section
 260 \ref{sec:jacep2p} presents the JaceP2P-V2 middleware. We focus here on
 261 one of the main drawbacks of this platform: its lack of an efficient
 262 mapping strategy. Section \ref{sec:pb} presents our mapping problem
 263 and quotes existing issues to address it. Section
 264 \ref{sec:aiacmapping} describes the specificities of the AIAC model
 265 and details the main solution we propose to address the AIAC mapping
 266 problem. In section \ref{sec:expe} we describe the experiments we have
 267 conducted, with their different components and results. These results
 268 were conducted on the Grid'5000 testbed with more than 400 computing
 269 cores and show an important gain (about $40\%$) of the overall
 270 execution time for a typical AIAC application, i.e. based on the NAS
 271 Parallel Benchmark. Finally, we give some concluding remarks and plan
 272 our future work in section \ref{sec:conclu}.
 273
 274
 275 \section{JaceP2P-V2}
 276 \label{sec:jacep2p}
 277
 278 JaceP2P-V2\cite{jaceP2P-v2} is a distributed platform implemented
 279 using the Java programming language and dedicated to developing and
 280 executing parallel iterative asynchronous applications. JaceP2P-V2
 281 executes parallel iterative asynchronous applications with
 282 dependencies between computing nodes. In addition, JaceP2P is fault
 283 tolerant which allows it to execute parallel applications over
 284 volatile environments and even for stable environments like local
 285 clusters, it offers a safer and crash free platform. To our knowledge
 286 this is the only platform dedicated to designing and executing AIAC
 287 algorithms.
 288
 289 %\subsection{Architecture}
 290 %\label{sec:archijaceP2P}
 291
 292 %In this section we describe the JaceP2P-V2 environment.
 293 %As can be seen
 294 %on Figure \ref{fig:jaceP2P-v2}, which shows
 295
 296 The JaceP2P-V2 architecture, is composed of three main entities:
 297
 298  % \begin{figure}[h!]
 299 %    \vspace{0.1cm}
 300 %    \centering
 301 %    \includegraphics[width=7.4cm]{images/JACEP2P-V2}
 302 %    \caption{The JaceP2P-V2 architecture}
 303 %    \label{fig:jaceP2P-v2}
 304 %  \end{figure}
 305
 306 \myitemize{3}
 307 %\begin{itemize}
 308 \item The first entity is the ``super-node''. %(represented by a big
 309   %circle in Figure \ref{fig:jaceP2P-v2}).
 310   Super-nodes form a circular network and store, in registers, the
 311   identifiers of all the computing nodes that are connected to the
 312   platform and that are not executing any application.
 313 %  Each super-node has a status table containing the
 314 %  number of connected computing nodes to each super-node and all the
 315 %  super-nodes share a ``token'' that is passed successively from a
 316 %  super-node to the next one. Once a super-node has the token, it
 317 %  computes the average number of computing nodes connected to a
 318 %  super-node ($avg$) using the status table. If $avg$ is lower than
 319 %  the number of computing nodes connected to it, then it sends the
 320 %  identifiers of the extra computing nodes to the super-nodes that
 321 %  have the number of computing nodes connected to them less than
 322 %  $avg$. If the number of computing nodes connected to it has changed,
 323 %  it broadcasts the information to all the super-nodes in the
 324 %  platform. Finally, it passes the token to the next super node. This
 325 %  distribution reduces the load of the super-nodes.
 326   A super-node regularly receives heartbeat messages
 327 %(represented by
 328 %  doted lines in Figure \ref{fig:jaceP2P-v2})
 329   from the computing nodes connected to it. If a super-node does not
 330   receive a heartbeat message from a computing node for a given period
 331   of time, it declares that this computing node is dead and deletes
 332   its identifier from the register.
 333
 334 \item The second entity is the ``spawner''.
 335 %(represented by a square in
 336  % Figure \ref{fig:jaceP2P-v2}).
 337   When a user wants to execute a parallel application that requires
 338   $N$ computing nodes, he or she launches a spawner. This one contacts
 339   a super-node to reserve the $N$ computing nodes plus some extra
 340   nodes. When it receives the list of nodes from the super-node, it
 341   transforms the extra nodes into spawners (for fault tolerance and
 342   scalability reasons) and stores the identifiers of the rest of the
 343   nodes in its own register.
 344 % Once the extra nodes are transformed into spawners, they
 345 %   form a circular network and they receive the register containing the
 346 %   identifiers of the computing nodes.
 347 Then each spawner becomes
 348   responsible for a subgroup of computing nodes, starts the tasks on
 349   the computing nodes under its command and sends a specific register
 350   to them.
 351 %  So each
 352 %  computing node receives a specific register that only contains the
 353 %  identifiers of the daemons it interacts with and that depends on the
 354 %  application being executed. These specific registers reduce the
 355 %  number of messages sent by the spawners to update the register of
 356 %  the daemons after a daemon crashes because usually a small number of
 357 %  daemons is affected by this crash.
 358  %  If the spawner receives a message from a computing node informing
 359 %   that one of its neighbors has failed, it fetches a new one from the
 360 %   super-node in order to replace the dead one. The spawner initializes
 361 %   the new daemon, which retrieves the last backup (see next paragraph)
 362 %   of the dead node and continues the computing task from that
 363 %   checkpoint.
 364
 365 \item The third entity is the ``daemon'', or the computing node.
 366 %  (represented in Figure \ref{fig:jaceP2P-v2} by a hashed small circle
 367 %  if it is idle and by a white small circle if it is executing an
 368 %  application).
 369   Once launched, it connects to a super-node and waits for a task to
 370   execute. Once they begin executing an application daemons form a
 371   circular network which is only used in the failure detection
 372   mechanism. Each daemon can communicate directly with the daemons
 373   whose identifiers are in its register. At the end of a task, the
 374   daemons reconnect to a super-node.
 375 \end{itemize}
 376
 377 To be able to execute asynchronous iterative applications, JaceP2P-V2
 378 has an asynchronous messaging mechanism and to resist daemons'
 379 failures, it implements a distributed backup mechanism called the
 380 uncoordinated distributed checkpointing. For more details on the
 381 JaceP2P-V2 platform, interested readers can refer to \cite{jaceP2P-v2}.
 382
 383 %  This method allows daemons to
 384 % save their data on neighboring daemons without any user
 385 % intervention. The asynchronous nature of the application allows two
 386 % daemons to execute two different iterations, thus each daemon saves
 387 % its status without synchronizing with other daemons. This
 388 % decentralized procedure allows the platform to be very scalable, with
 389 % no weak points and does not require a secure and stable station for
 390 % backups. Moreover, since the AIAC model is tolerant to messages loss,
 391 % if a daemon dies, the other computing nodes continue their tasks and
 392 % are not affected by this failure.
 393 % The application convergence detection is done by daemons, using the
 394 % decentralized global convergence detection algorithm presented in
 395 % \cite{conv_dec}. It consists of two phases: the detection phase and
 396 % the verification phase. This algorithm aims to detect efficiently
 397 % the global convergence of asynchronous iterative parallel algorithms
 398 % on distributed architectures.
 399
 400
 401 \vspace*{0.2cm}
 402
 403 \subsubsection*{Benefits of mapping}
 404 \label{sec:benef}
 405
 406 In the previously described JaceP2P-V2 environment there is no
 407 effective mapping solution. Indeed, when a user wants to launch an
 408 application, the spawner emits a request to the super-node, which is
 409 in charge of available daemons. Basically, the super-node returns the
 410 amount of requested computing nodes by choosing in its own list.
 411 %,
 412 %if there are sufficient daemons connected on it, or in other
 413 %super-nodes lists, in addition of its one.
 414 In this method, the super-node only cares about the amount of
 415 requested nodes and it returns in general nodes in the order of their
 416 connection to the platform -- there is no specific selection.
 417 Distributed architectures such as distributed clusters,
 418 %as can be seen
 419 %on Figure \ref{fig:pbdistclust},
 420 are often composed of heterogeneous
 421 clusters linked via heterogeneous networks with high latencies and
 422 bandwidths. As an example the Grid'5000\cite{g5k} testbed is composed
 423 of 23 clusters spread over 9 sites. Those clusters are heterogeneous,
 424 with computing powers starting from bi-cores at 2GHz to
 425 bi-quad-cores at 2.83GHz with 2Gb of memory for the first one to 8Gb
 426 for the second. Links relying clusters are 10Gb/s capable, but as many
 427 researchers use this platform, high latencies appear in links between
 428 sites.
 429 %In the targeted architectures, which are distributed clusters, each
 430 %cluster provides a fixed number of homogeneous computing nodes --
 431 %computing nodes are heterogeneous from a cluster to another. Figure
 432 %\ref{fig:pbdistclust} represents such an architecture, in which we can
 433 %see different sites containing several heterogeneous clusters. Each
 434 %cluster is relied by high speed network with clusters in the same site
 435 %and in others. We note that the architecture may not be represented by
 436 %a full connected graph, as some sites are not directly connected to
 437 %some others. The Grid'5000 testbed, described in section
 438 %\ref{sec:g5k}, is an example of such a distributed clusters
 439 %architecture.  Though there are efficient links relying each site, a
 440 %residual latency continues to exist, at local clusters (in the same
 441 %site) as well as distant clusters (from two distinct sites), and can
 442 %penalize performance.
 443
 444 % \begin{figure}[ht!]
 445 %   \centering
 446 %   \includegraphics[width=7.8cm]{images/dist_clust}
 447 %   \caption{A distributed clusters architecture}
 448 %   \label{fig:pbdistclust}
 449 % \end{figure}
 450
 451
 452 With such an architecture, it could be
 453 efficient to assign tasks communicating with each other on the same
 454 cluster, in order to improve communications. But, as we use very large
 455 problems, it is quite impossible to find clusters containing as many
 456 computing nodes as requested. So we have to dispatch tasks over
 457 several clusters. That implies a need to deal with heterogeneity in clusters
 458 computing power and heterogeneity in network. We should make a
 459 trade-off between both components in order to take the best part
 460 of each one to improve the overall performance.
 461 %The
 462 %literature in high performance computing has broadly demonstrated the
 463 %benefits of mapping solutions on the applications execution time.
 464
 465 In order to check if a tasks mapping algorithm would provide
 466 performance improvement in the JaceP2P-V2 environment, we have evaluated
 467 the contributions of a simple mapping algorithm, which is described in
 468 section \ref{sec:sma}. These experiments used the NPB Kernel CG
 469 application described in section \ref{sec:cg}, with two problem sizes
 470 (the given problem sizes are the sides sizes of square matrices used)
 471 and using a distributed clusters architecture composed of 102
 472 computing nodes, representing 320 computing cores, spread over 5
 473 clusters in 5 sites. The results of these experiments are given in
 474 Table \ref{tab:benef}.
 475
 476 \renewcommand{\arraystretch}{1.5}
 477
 478 \begin{table}[h!]
 479   \centering
 480   \begin{tabular}{|c|c|c|}
 481     \hline
 482     Problem size&$550,000$&$5,000,000$\\
 483     \hline
 484     Execution Time (without mapping)&141s&129s\\
 485     Execution Time (with mapping)&97s&81s\\
 486     \hline
 487     Gains&$31\%$&$37\%$\\
 488     \hline
 489   \end{tabular}
 490   \caption{Effects of a simple tasks
 491     mapping algorithm on application's execution time}
 492   \label{tab:benef}
 493   \vspace{-0.5cm}
 494 \end{table}
 495
 496 As can be seen in Table \ref{tab:benef}, the effects of a
 497 simple tasks mapping algorithm are significant.
 498 %Moreover, we can see
 499 %that are scalable with the application's size, which demonstrates the
 500 %real needs of the platform for a such algorithm.
 501 This encouraged us to look further for better task mapping
 502 algorithms. In the next section, we describe the specificities of our
 503 model and issues which can be exploited.
 504
 505 \section{Problem description}
 506 \label{sec:pb}
 507
 508 % In this section we describe the AIAC mapping problem. We first
 509 % formalize the different elements we should take into consideration:
 510 % the application, the targeted architecture and the objectives
 511 % functions of the mapping. We also give a state of the art about
 512 % considered kinds of mapping algorithms.
 513
 514 \subsection{Model formalization}
 515 \label{sec:pbmodel}
 516
 517 % In this section the models of the applications and architectures we
 518 % used are given, with the objectives functions of the mapping
 519 % algorithms.
 520
 521 \subsubsection{Application modeling}
 522 \label{sec:pbmodelapp}
 523
 524 In high performance computing, when we want to improve the global
 525 execution time of parallel applications we have to make an efficient
 526 assignation of tasks to computing nodes. Usually, to assign tasks of
 527 parallel applications to computing nodes, scheduling algorithms are
 528 used. These algorithms often represent the application by a graph,
 529 called DAG \cite{dag1,dag2,dag3,dag4} (Directed Acyclic Graph). In
 530 this graph, each task is represented by a vertex which is relayed to
 531 others by edges, which represent dependencies and communications
 532 between tasks. This means that some tasks could not start before other
 533 ones finish their computation and send their results. As discussed in
 534 the introduction, in the AIAC model, there is no precedence between
 535 tasks.
 536
 537 Indeed, with the AIAC model, all tasks compute in parallel at the same
 538 time. As communications are asynchronous, there is no synchronization
 539 and no precedence. During an iteration, each task does its job and
 540 sends results to its neighbors and continues with the next
 541 iteration. If a task receives new data from its dependencies, it
 542 includes them and the computation continues with these new data. If
 543 not all dependencies data, or none, are received before starting the
 544 computation of the next iteration, old data are used instead. Tasks
 545 are not blocked on dependencies. Nevertheless regularly receiving new
 546 data allows tasks to converge more quickly. So, it appears that DAG
 547 are not appropriate to modeling AIAC applications. TIG\cite{tig1,
 548   tig2} (Task Interaction Graph) are more appropriate.
 549 %That is why we use the
 550 %TIG\cite{tig1, tig2} (Task Interaction Graph) model instead of DAG,
 551 %which allows to modeling application using tasks and their
 552 %communication dependencies. Figure \ref{fig:tig} gives an example of a
 553 %TIG.
 554
 555 \begin{figure}[h!]
 556   \centering
 557   \includegraphics[width=4cm]{images/tig}
 558   \caption{An example of a TIG of a nine tasks application}
 559   \label{fig:tig}
 560 \end{figure}
 561
 562 In the TIG model, a parallel program is represented by a graph , as
 563 can be seen in Figure \ref{fig:tig}.  This graph $GT(V,E)$, where $V =
 564 \{V_1,V_2,\dots V_v\}$ is the set of $|V|$ vertices and $E \subset V
 565 \times V$ is the set of undirectional edges. The vertices represent
 566 tasks and the edges represent the mutual communication among tasks. A
 567 function $ET : V \rightarrow R^+$ gives the computation cost of tasks
 568 and $CT : E \rightarrow R^+$ gives the communication cost for message
 569 passing on edges. We define $v = |V|$, $ET(V_i) = e_i$ and
 570 $CT(V_i,V_j) = c_{ij}$.  For example, in Figure \ref{fig:tig},
 571 \mbox{$e_0$ = 10} and $c_{01} = 2$, $c_{03} = 2$ and $c_{04} = 2$.
 572 Tasks in TIG exchange information during their execution and there is
 573 no precedence relationship among tasks; each task cooperates with its
 574 neighbors. This model is used to represent applications, where tasks
 575 are considered to be executed simultaneously. Temporal dependencies in
 576 the execution of tasks are not explicitly addressed: all the tasks are
 577 considered simultaneously executable and communications can take place
 578 at any time during the computation. That is why vertices and edges are
 579 labeled with weights
 580 describing computational and communication costs.
 581
 582 \vspace*{0.2cm}
 583
 584 \subsubsection{Architecture modeling}
 585 \label{sec:pbmodelarchi}
 586
 587 As TIG models the application, we have to model the targeted
 588 architecture. A distributed clusters architecture can be modeled by a
 589 three-level-graph. The levels are \textit{architecture} (a), in our
 590 study it is the Grid'5000 grid, \textit{cluster} (c) and computing
 591 node (n) levels.
 592 %Figure \ref{fig:pbdistclust} in section
 593 %\ref{sec:benef} shows such a model.
 594 Let $GG(N,L)$ be a graph
 595 representing a distributed clusters architecture, where $N =
 596 \{N_1,N_2,\dots N_n\}$ is the set of $|N|$ vertices and $L$ is the set
 597 of undirectional edges. The vertices represent the computing nodes and
 598 the edges represent the links between them. An edge $L_i \in L$ is an
 599 unordered pair $(N_x,N_y) \in N$, representing a communication link
 600 between nodes $N_x$ and $N_y$. Let be $|C|$ the number of clusters in the
 601 architecture containing computing nodes. A function $WN : N
 602 \rightarrow R^+$ gives the computational power of nodes and $WL : L
 603 \rightarrow R^+$ gives the communication latency of links. We define
 604 $WN(N_i) = wn_i$ and $WL(L_i,L_j) = wl_{ij}$.
 605
 606 An architecture with a three-level-graph is specified according as
 607 follows. All computing nodes are in the same node level. When
 608 computing nodes can communicate to one another with the same
 609 communication latency, they can be grouped into the same cluster. In
 610 addition, like in the Grid'5000 testbed, if computing nodes seemly
 611 have the same computational power with a low communication latency, a
 612 cluster of these nodes can be defined. All participating clusters,
 613 including computing nodes, are in the same architecture level and
 614 communicate through the architecture network.
 615
 616 \vspace*{0.2cm}
 617
 618 \subsubsection{Mapping functions}
 619 \label{sec:pbmodelmapping}
 620
 621 After having described the two graphs used to model the application
 622 and the architecture, this section defines our objectives.
 623
 624 When a parallel application $App$, represented by a graph $GT$, is
 625 mapped on a distributed clusters architecture, represented by a graph
 626 $GG$, the execution time of the application, $ET(App)$, can be defined
 627 as the execution time of the slowest task.
 628 %In the AIAC model, tasks
 629 %have, in general, seemly the same work and communication loads; so the
 630 %difference on execution time depends on the executing machines.
 631 Indeed an application ends when all the tasks have detected
 632 convergence and have reached the desired approximation of the
 633 solution, that is why the execution time of the application depends on
 634 the slowest task.
 635 %, this converges last.
 636 We define $ ET(App) = \max_{i=1 \dots v} ( ET(V_i) )$
 637 % \begin{equation}
 638 %   \label{eq:et}
 639 %   ET(App) = \max_{i=1 \dots v} ( ET(V_i) )
 640 % \End{equation}
 641 where the
 642 %$ET(V_s)$ is the execution time of the slowest task $V_s$. The
 643 execution time of each task $i$ \mbox{($i=1 \dots v$)}, $ET(V_i)$ is
 644 given by $ET(V_i) = \frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \cdot wl_{ij}$
 645
 646 % \begin{equation}
 647 %   \label{eq:ettask}
 648 %   ET(V_i) = \frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \cdot wl_{ij}
 649 % \End{equation}
 650
 651 where $e_i$ is the computational cost of $V_i$, $wn_i$ is the
 652 computational power of the node $N_i$ on which $V_i$ is mapped, $J$
 653 represents the neighbors set of $V_i$, $c_{ij}$ is the amount of
 654 communications between $V_i$ and $V_j$, and $wl_{ij}$ is the link
 655 latency between the computing nodes on which are mapped $V_i$ and
 656 $V_j$.
 657 %Note that in the AIAC model, we only consider for $e_i$ the
 658 %cost of an only one iteration, as we cannot determine in advance the
 659 %amount of iterations a task should complete to converge.
 660 We underline here that in the AIAC model, it is impossible to predict
 661 the number of iterations of a task. So it is difficult to evaluate a
 662 priori the cost $e_i$ of a task. In the remainder, we approximate
 663 $e_i$ by the cost of one iteration.
 664
 665 The mapping problem is similar to the classical graph partitioning and
 666 task assignment problem \cite{npcomp}, and is thus NP-complete.
 667 %The
 668 %generalization of this problem by considering both heterogeneous tasks
 669 %graph and architecture graph makes the problem more complex.
 670 %The JaceP2P-V2 platform is design to execute parallel asynchronous
 671 %iterative application on heterogeneous distributed architectures, so
 672 %it has to launch applications on a variety of computing nodes relying
 673 %by non-uniform links. As demonstrates in the previous section, it does
 674 %not map efficiently tasks on nodes and the overall performance can be
 675 %strongly penalize by this lack of mapping. Indeed, if the
 676 %heterogeneity degree of the architecture used is high, it can be
 677 %possible to have two neighbors tasks executing on two foreign
 678 %computing nodes, that penalizes communications between these tasks,
 679 %that increase the execution time. With the same mining, it is
 680 %interesting to have similar computing nodes, in order to have seemly
 681 %the same iteration execution time that allows to converge more
 682 %quickly. We can conclude that JaceP2P-V2 needs an algorithm to
 683 %efficiently map tasks on computing nodes.
 684 %
 685 %%As we use parallel distributed algorithms on distributed
 686 %%architectures, we have to efficiently map tasks on computing nodes, in
 687 %%order to obtain best performance and the slowest execution time of
 688 %%the application.
 689 %There exist many scheduling and mapping algorithms in the literature,
 690 %and the research is active in this domain. A first approach is to use
 691 %one the numerous scheduling algorithms, but our model does not fit
 692 %with them. Indeed, a scheduling implies that we have tasks which are
 693 %depending on others, more precisely that some tasks cannot start
 694 %computation before having received data from other tasks, which are
 695 %called precedences, or ``strong dependencies''. This class of
 696 %algorithms uses a well known application's representation: the DAG
 697 %(Direct Acyclic Graph). This model cannot be used with our
 698 %problematic. Indeed, in asynchronous iterative applications, all the
 699 %tasks are executed in the same time, in parallel; there is no
 700 %precedence. So we have to look at another way, which is the ``tasks
 701 %mapping''.
 702 %
 703 %The second approach is to use a tasks mapping algorithm, which only
 704 %aims to map tasks on nodes, following a metric, to minimize the
 705 %application's execution time. To determine which class of tasks
 706 %mapping algorithms we should consider, it is important to draw up a
 707 %list of which information we have in hands. When an application using
 708 %the asynchronous iteration model is run, the number of iterations to
 709 %converge to the solution is unpredictable and change from a run to
 710 %another. So, we cannot give a good approximation of the execution time
 711 %of a task, but we can retrieve an estimation of the computational
 712 %amount of a task (plus de détails ?). In addition, as the asynchronous
 713 %iteration model based algorithms are able to provide to applications
 714 %the message lost tolerance, we cannot exactly quantify the amount of
 715 %communications in an application; some algorithms can converge with
 716 %receiving only few dependencies messages. One application's
 717 %representation which fit perfectly with our model is the TIG (Task
 718 %Interaction Graph), which only considers relations between tasks --
 719 %the only thing which is predictable in our model. The figure
 720 %\ref{fig:tig} presents an example of a TIG.
 721 %
 722 %
 723 %
 724 %On Figure \ref{fig:tig} we can see an application with 6 tasks, in
 725 %which each task is in relation with tasks of previous and next rank,
 726 %with two exceptions, the first and the last tasks, which can
 727 %eventually be in relation, depending of the application. This is a
 728 %commonly scheme in asynchronous iteration model; in general, tasks are
 729 %in relation with some others which previous and next ranks, but in
 730 %some cases, dependencies are more complex.
 731
 732 \subsection{Related work}
 733 \label{sec:pbrw}
 734
 735 %In the previous section we have determined that we need to use TIG
 736 %model based mapping algorithms to address our problem.
 737 In the literature of the TIG mapping, we can find many algorithms,
 738 which can be divided into two categories. First, in the
 739 \textit{Edge-cuts optimization} class of algorithms, the aim is to
 740 minimize the use of the penalizing links between clusters. As tasks
 741 are depending on neighbors, which are called dependencies, the goal is
 742 to choose nodes where distance, in term of network, is small to
 743 improve communications between tasks. Here we can cite
 744 Metis\cite{metis}, Chaco\cite{chaco} and PaGrid\cite{pagrid} which are
 745 libraries containing such kind of algorithms. The main drawback of
 746 edge-cuts algorithms is that they do not tackle the computing nodes
 747 heterogeneity issues. They only focus on communication overheads.
 748 %  The Figure \ref{fig:edge} shows
 749 %  that the optimization is on edges of communications, which are
 750 %  circled in red.
 751 %
 752 %  \begin{figure}[h!]
 753 %    \centering
 754 %    \includegraphics[width=3.5cm]{images/edgecut}
 755 %    \caption{The edge-cuts optimization}
 756 %    \label{fig:edge}
 757 %  \end{figure}
 758 Then, in the \textit{Execution time optimization} class of algorithms
 759 the aim is to minimize the whole execution time of the
 760 application. They look for nodes which can provide the small execution
 761 time of tasks using their computational power. Here we can cite
 762 FastMap\cite{fastmap} and MiniMax\cite{minimax} as such kind of
 763 algorithms. QM\cite{qm_these} is also an algorithm of this category,
 764 but it aims to find for each task the node which can provide the best
 765 execution time. QM works at the task level, whereas others work at the
 766 application level
 767 %  The Figure
 768 %  \ref{fig:et} shows that the optimization is on tasks, which are
 769 %  circled in red.\\
 770 %
 771 %  \begin{figure}[h!]
 772 %    \centering
 773 %    \includegraphics[width=3.5cm]{images/exectime}
 774 %    \caption{The execution time optimization}
 775 %    \label{fig:et}
 776 %  \end{figure}
 777
 778 The two classes of algorithms may fit with our goals, because in our
 779 model we have both the computational power of nodes and communication
 780 costs which may influence the applications performance. We can also
 781 cite partitioning tools like Scotch \cite{scotch} which aims at
 782 privileging the load balancing of their partitioning
 783 schemes. Nevertheless, to the best of our knowledge, none of the
 784 existing algorithms take into consideration the specificities of the
 785 AIAC model (see next section).
 786 %specifically address the AIAC mapping problem.
 787 %As the existing mapping algorithms are not designed to fit
 788 %with the AIAC mapping problem.
 789
 790 \section{AIAC mapping}
 791 \label{sec:aiacmapping}
 792
 793 % In this section we present the specificities of the AIAC model, which
 794 % are interesting in the mapping problem, and the solution we propose:
 795 % the AIAC QM algorithm, which is an extended version of the QM
 796 % algorithm.
 797
 798 \subsection{Specificities of the AIAC mapping problem}
 799 \label{sec:specAIACmapping}
 800
 801 An important point to take into consideration in the AIAC model is
 802 that we do not allow the execution of multiple tasks on the same
 803 computing node. This comes from the fact that the targeted
 804 architectures are volatile distributed environments. Assigning
 805 multiple tasks to a node provides a fall of performance when this
 806 node fails. Indeed we should redeploy all of the tasks from this node
 807 to another one, using last saves, which implies to search a new
 808 available computing node, transfer saves to it and restart the
 809 computation from this point (which could be far from this just before
 810 the failure).
 811
 812 Nevertheless, in order to benefit of multi-cores architectures, we use
 813 a task level parallelism by running multi-threaded sequential solver
 814 for example.
 815 %In addition, it is more simple and efficient to parallelize at the
 816 %task level using, as an example with the CG application, a
 817 %multi-threaded linear solver, which benefits of the multi-cores
 818 %architecture of computing nodes.
 819
 820 Another important point in the AIAC model is that we should take into
 821 account precisely the locality issue. This comes from the fact that in
 822 this model, the faster and more frequently a task receives its
 823 dependencies, the faster it converges. Moreover, as the JaceP2P-V2
 824 environment is fault tolerant and tasks save checkpoints on their
 825 neighbors, it is more efficient to save on near nodes than on far
 826 ones. In the synchronous model, both heterogeneity and locality must
 827 be taken into account in a balanced way. In the asynchronous model,
 828 since no synchronization occurs, the heterogeneity issue is less
 829 important.
 830
 831
 832 %As our problem is on task mapping, only considering the dependency
 833 %relations between tasks, with a part of consideration on task weight,
 834 %we should use TIG (\textit{task interaction graph}) mapping model.
 835
 836 %\section{Mapping algorithms}
 837 %\label{sec:mapalgo}
 838
 839 \subsection{AIAC Quick-quality Map}
 840 \label{sec:qmmodif}
 841
 842 %After having describe the different kinds of mapping algorithms which
 843 %can be found in the literature, we now present the three algorithms we
 844 %use to do mapping in the JaceP2P-V2 environment.
 845
 846 We present here the solution we propose, called \textit{AIAC QM
 847   algorithm}, to address the AIAC mapping problem. We decided to
 848 improve the \textit{Quick-quality Map} (QM) algorithm since it is one
 849 of the most accurate method to address the TIG mapping problem.
 850 %
 851 %\subsection{Modified Quick-quality Map}
 852 %\label{sec:modifiedqm}
 853 %
 854 %As the previous algorithm describe in section \ref{sec:sma} showed
 855 %that mapping provides a significant increase of applications
 856 %performance (which can be seen in the section \ref{sec:experiments}),
 857 %we decide to try another mapping algorithm, which is a modified
 858 %version of the \textit{Quick-quality Map} (QM) algorithm.
 859 %
 860 %
 861 %As explained in section \ref{sec:pb}, the asynchronous iteration model
 862 %is specific, as it is not a good solution to map many tasks on the
 863 %same node. This is why QM has been modified to take into account these
 864 %characteristics.
 865 % Indeed, originally QM tries to map many tasks on the
 866 %same node to improve the execution time of tasks by decreasing
 867 %communications costs. This solution can be good if communications
 868 %between tasks are heavy and if we consider that computing nodes are
 869 %stable and are not volatile. As
 870
 871 %, we have modified some parts
 872 %of it to fit with our constraints. This was an opportunity to be taken
 873 %to insert a little part of ``edge-cuts'' optimization, as in our model
 874 %communications have to be taken into account.
 875 In its original version, this algorithm aims at prioritizing the
 876 computational power of nodes. Indeed, its aim is to find the more
 877 powerful node to map a task on. Moreover, a part of this algorithm is
 878 designed to map multiple tasks on the same node, in order to improve
 879 local communications. This solution can be efficient if communications
 880 between tasks are heavy and if we consider that computing nodes are
 881 stable and not volatile. This last point is in contradiction with
 882 our model, as we authorize only the execution of one task on a single
 883 node -- this allows to lose only the work of a single task in case of
 884 node's fault, with a low cost on restarting mechanism. Instead
 885 assigning multiple tasks on the same computing node, our mapping
 886 algorithm tries to keep tasks locally, to improve communications, by
 887 trying to assign tasks to computing nodes in the neighborhood
 888 of which their neighbors are mapped on.
 889
 890 % The pseudo-code of AIAC QM is given in Algorithm \ref{alg:qmmodified}.
 891
 892 % \SetAlgoSkip{}
 893 % \begin{algorithm}
 894 %   \SetLine
 895 %   \dontprintsemicolon
 896
 897 %   \KwIn{Sets of tasks and computing nodes}
 898 %   \KwOut{Mapping of tasks to nodes}
 899
 900 %   \BlankLine
 901
 902 %   sort nodes by descending power\;
 903 %   map tasks in order on nodes\;
 904 %   set all tasks \textit{moveable}\;
 905 %   $r \leftarrow 1$\;
 906
 907 %   \BlankLine
 908
 909 %   \While{one task is moveable}{
 910 %     \For{each task $t_{i}$ $\&\&$ $t_{i}$ is moveable }{
 911 %       $n_{c} \leftarrow$ current node of $t_{i}$\;
 912 %       $n_{n} \leftarrow t_{i}$\;
 913
 914 %       \BlankLine
 915
 916 %       \For{$k = 0 ; k < \frac{f \cdot n}{r} ; k++$}{
 917 %         select random node $n_{r}$ in $[0,\frac{n}{r}]$\;
 918 %         \If{ET($t_{i}$,$n_{r}$) $<$ ET($t_{i}$,$n_{n}$)}{
 919 %           $n_{n} \leftarrow n_{r} $\;
 920 %         }
 921 %       }
 922
 923 %       \BlankLine
 924
 925 %       \For{each node $n_{v}$ near $dep(t_{i})$}{
 926 %         \If{ET($t_{i}$,$n_{v}$) $<$ ET($t_{i}$,$n_{n}$)}{
 927 %           $n_{n} \leftarrow n_{v} $\;
 928 %         }
 929 %       }
 930
 931 %       \BlankLine
 932
 933 %       \If{$n_{n} \neq n_{c}$}{
 934 %         map $t_{i}$ on $n_{n}$\;
 935 %         update ET of $t_{i}$ and dep($t_{i}$)\;
 936 %       }
 937 %     }
 938
 939 %     \BlankLine
 940
 941 %     set $t_i$ not moveable\;
 942 %     $r \leftarrow r+1$ if all tasks have been considered\;
 943 %   }
 944
 945 %   \caption{The AIAC QM}
 946 %   \label{alg:qmmodified}
 947 % \end{algorithm}
 948 % %\vspace{-0.5cm}
 949
 950 So, in this algorithm all nodes are first sorted in descending order
 951 according to their computation power, and all tasks are mapped on
 952 these nodes according to their identifier (they are also marked as
 953 ``moveable''; it means that each task can be moved from a node to
 954 another). As in the original QM algorithm, AIAC QM keeps track of the
 955 \textit{number of rounds} $r$ ($r > 0$), that all tasks have been
 956 searched for a better node. This allows to reduce at each round the
 957 number of considered nodes. While there is at least one moveable task,
 958 it performs for each moveable task the search for a better node. It
 959 chooses a set of nodes, $\frac{f \cdot n}{r}$, where $f$ is defined as
 960 the search factor and $n$ is the number of nodes. $r$ and $f \in
 961 ]0,1]$ control the portion of nodes that will be considered where more
 962 numerous the rounds are, the less the considered nodes will be. Then
 963 the algorithm estimates the execution time $ET(v)$ of the task on each
 964 node. If it is smaller than the current node on which the task is
 965 mapped on, this node becomes the new potential node for task $t_i$.
 966
 967 After having randomly searched for a new node, the AIAC QM tries to
 968 map the task on nodes that are neighbors of nodes of which the
 969 dependencies of $t_i$ are mapped on. This is one of the major
 970 modification to the original QM algorithm. It introduces a little part
 971 of ``edge-cuts'' optimization. In the original version, it tries to
 972 map the task $t_i$ on the same node of one of its dependencies. As
 973 explain in \ref{sec:specAIACmapping}, this is not an acceptable
 974 solution in our case. Instead, the algorithm now searches to map task
 975 $t_i$ on nodes which are near the ones its dependencies are mapped
 976 on. This search requires a parameter which indicates the maximum
 977 distance at which nodes should be from the node of dependencies of
 978 $t_i$.
 979
 980 At the end of the algorithm, if a new node is found, $t_i$ is mapped
 981 on and its execution time is updated and $t_i$ is set to ``not
 982 moveable''. The execution time of each of its dependencies is also
 983 updated, and if this new execution time is higher than the previous,
 984 the task is set to ``moveable''. And finally, if all tasks have been
 985 considered in this round, $r$ is incremented.
 986
 987 The complexity of the AIAC QM algorithm is about $O(n^2 \cdot
 988 t \cdot ln(r))$. This complexity is the same as the original algorithm
 989 (details are given in \cite{qm_these}, with an increase of a factor
 990 $n$, corresponding to the edge-cuts part).
 991
 992 \section{Experimentation}
 993 \label{sec:expe}
 994
 995 % We now describe the experiments we have conducted and their
 996 % components, to evaluate the effects of the AIAC QM algorithm on
 997 % application execution time.
 998
 999 \subsection{The NAS Parallel Benchmark Kernel CG and the Grid'5000 platform }
1000 \label{sec:cg}
1001
1002 We used the ``Kernel CG'' of the NAS Parallel Benchmarks (NPB)
1003 \cite{nas} to evaluate the performance of the mapping
1004 algorithm. This benchmark is designed to be used on large
1005 architectures, because it tests communications over latency networks,
1006 by processing unstructured matrix vector multiplication. In this
1007 benchmark, a Conjugate Gradient is used to compute an approximation of
1008 the smallest eigenvalue of a large, sparse and symmetric positive
1009 definite matrix, by the inverse power method. In our tests, the whole
1010 matrix contains nonzero values, in order to stress more
1011 communications. As the Conjugate Gradient method cannot be executed
1012 with the asynchronous iteration model we have replaced it by another
1013 method called the multisplitting method. This latter supports the
1014 asynchronous iterative model.
1015
1016
1017 With the multisplitting algorithm, the $A$ matrix is split into
1018 horizontal rectangle parts.
1019 %as Figure \ref{fig:multisplit} shows.
1020 Each
1021 of these parts is assigned to a processor -- so the size of data
1022 depends on the matrix size but also on the number of participating
1023 nodes. In this way, a processor is in charge of computing its {\small $X Sub$}
1024 part by solving the following subsystem: {\small $ASub \times XSub = BSub -
1025 DepLeft \times XLeft - DepRight \times XRight$}.
1026
1027 After solving {\small $XSub$}, the result must be sent to other
1028 processors which depend on it.
1029
1030 % \begin{figure}[h!]
1031 %   \centering
1032 %   \includegraphics[width=7.4cm]{images/multisplit}
1033 %   \caption{Data decomposition for the multisplitting method
1034 %     implementation}
1035 %   \label{fig:multisplit}
1036 % \end{figure}
1037
1038 % The multisplitting method can be decomposed into four phases:
1039 % \begin{enumerate}
1040 % \itemsep=5pt
1041 % \item \textbf{Data decomposition}. In this phase, data are allocated
1042 %   to each processor assuming the decomposition shown in Figure
1043 %   \ref{fig:multisplit}. Then, each processor iterates until
1044 %   convergence is reached.
1045 % \item \textbf{Computation}. At the beginning of the computation, each
1046 %   processor computes
1047 %     $BLoc = BSub - DepLeft \times XLeft - DepRight \times XRight$. Then,
1048 %   it solves $ASub \times XSub = BLoc$ by using a
1049 %   multi-threaded sequential version of the Conjugate Gradient method.
1050 % \item \textbf{Data exchange}. Each processor sends its $XSub$
1051 %   part to its neighbors. Here, the neighborhood is closely related
1052 %   to the density of the $A$ matrix. Clearly, a dense matrix implies an
1053 %   \textit{all-to-all} communication scheme while a matrix with a low
1054 %   bandwidth reduces the density of the communication scheme.
1055 % \item \textbf{Convergence detection} Each processor computes its local
1056 %   convergence and sends it to a server node. When this one detects
1057 %   that each processor has converged, it stops the whole computation
1058 %   process.
1059 % \end{enumerate}
1060
1061 %It can be pointed out here that it is possible to modify the data
1062 %decomposition in order to obtain non disjoint rectangle matrices. This
1063 %property of multisplitting methods, called \textit{overlapping}, can
1064 %bring significant improvements to convergence speed, but it is not the
1065 %aim of this paper.
1066 For more details about this method, interested readers are invited to
1067 see \cite{book_raph}. In our benchmark, the sequential solver part of
1068 the multisplitting method is the Conjugate Gradient, using the
1069 MTJ\cite{mtj} library. Its implementation is multi-threaded, to
1070 benefit from multi-core processors.
1071
1072 We point out here that this benchmark is a typical AIAC
1073 application. In our study, we consider that the computational costs of
1074 tasks are approximately the same and that the communications costs are
1075 also the same (this comes from the difficulty to evaluate real costs
1076 in the AIAC model). For our experiments the bandwidth of matrices has
1077 been reduced in order to limit the dependencies and we fixed it to
1078 $35,000$. This bandwidth size generates, according to the problem's
1079 size, between 10 and 25 neighbors per tasks.
1080
1081 % The general form of the TIG for this application is given
1082 % by Figure \ref{fig:tigcg}.
1083
1084 % \begin{figure}[h!]
1085 %   \centering
1086 %   \includegraphics[width=8cm]{images/tigcg2}
1087 %   \caption{Part of the form of the TIG representing an instance of the
1088 %     NAS Kernel CG application}
1089 %   \label{fig:tigcg}
1090 % \end{figure}
1091
1092 %This figure shows 6 tasks, which are represented by a circle in which
1093 %the identifier of the task is given.
1094
1095 % The computational cost of a task is given by the number on the top
1096 % left-hand side of each circle (for example the cost of task 31 is
1097 % 1000). Communications between tasks are represented by edges on
1098 % which the amount of communication is given (for example, the
1099 % communication cost between tasks 29 and 30 is about 30).
1100 % Dotted lines represent communications with tasks which are not
1101 % represented on the figure. We can see here that each task has four
1102 % neighbors (the two previous and the two next). This amount of
1103 % neighbors is directly related to the bandwidth of the matrix (in this
1104 % example the bandwidth is very small). For more details about the
1105 % influence of the bandwidth on the amount of neighbors, interested
1106 % readers are invited to see \cite{largescale}.
1107
1108
1109
1110 %\subsubsection*{The Grid'5000 platform}
1111 %\Label{sec:g5k}
1112
1113 The platform used for our tests, called Grid’5000\cite{g5k}, is a
1114 French nationwide experimental set of clusters which provides a
1115 configurable and controllable instrument. We can find many clusters
1116 with different kinds of computers with various specifications and
1117 software. Clusters are spread over 9 sites,
1118 %as can be seen on Figure \ref{fig:g5ksite},
1119 and the computing power represents more than 5000
1120 computing cores interconnected by the ``Renater'' network. This
1121 network is the national network for research and education; it
1122 provides a large bandwidth with high latency. Intra-clusters networks
1123 present small bandwidth and low latencies.
1124
1125
1126 % \begin{figure}[h!]
1127 %   \centering
1128 %   \includegraphics[height=6.5cm]{images/g5k-noms}
1129 %   \caption{The Grid'5000 sites map}
1130 %   \label{fig:g5ksite}
1131 % \end{figure}
1132
1133
1134
1135 \subsection{Other mapping algorithms}
1136 \label{sec:othermaping}
1137
1138 % In this section we present the two other mapping algorithms we used
1139 % in our experiments to compare the performance of the AIAC QM
1140 % algorithm. The first one was used to evaluate the benefits of a
1141 % mapping solution in section \ref{sec:benef}. The second one was used
1142 % to show the differences between the two mapping class, the ``execution
1143 % time'' and the ``edge-cuts'' optimizations, as it is a fully edge-cut
1144 % optimization based mapping algorithm.
1145
1146 \subsubsection{A Simple Mapping algorithm}
1147 \label{sec:sma}
1148
1149 %As mentioned in section \ref{sec:pb}, the asynchronous iteration model
1150 %has some specificities which distinguishes it from other classical model
1151 %for the mapping.
1152 %
1153 %The first thing we have done was to be sure that a mapping algorithm
1154 %would enhance applications performance.
1155 The \textit{Simple Mapping algorithm} (SMa) was designed to
1156 show the benefits of a mapping algorithm in the JaceP2P-V2 platform.
1157 %The text of the \textit{Simple Mapping} if given by
1158 %Algorithm \ref{alg:sma} gives the pseudo-code of the \textit{Simple
1159 %  Mapping algorithm}.
1160 %, in which we can see that it is very simple, with a complexity in $O(n^
1161 %2)$ resulting from sort methods.
1162
1163
1164 % \SetAlgoSkip{}
1165 % \begin{algorithm}
1166 %   \SetLine
1167 %   \dontprintsemicolon
1168 %   \KwIn{Sets of tasks and computing nodes}
1169 %   \KwOut{Mapping of tasks to nodes}
1170
1171 %   \BlankLine
1172
1173 %   sort computing nodes by cluster\;
1174 %   sort clusters by size, from higher to lower\;
1175 %   map tasks in order on sorted clusters\;
1176 %   \caption{The Simple Mapping algorithm}
1177 %   \label{alg:sma}
1178 % \end{algorithm}
1179
1180
1181 %The aim of this algorithm is to do similarly to a very simple
1182 %edge-cuts optimization algorithm. To do that
1183 The algorithm puts each node in a cluster entity.
1184 %, which is given for now by the convention name of nodes
1185 %on Grid'5000 (for example azur-3.sophia.grid500.fr is the node 3 of
1186 %the azur cluster in the sophia site).
1187 Then it sorts clusters by their size, from the higher to the lower.
1188 %; this operation is the part which
1189 %is inspired by edge-cuts optimization based algorithms.
1190 Finally, all tasks are mapped in order on the sorted clusters; each
1191 task is assigned to a particular computing node of the chosen cluster.
1192
1193 %Though this algorithm is trivial, it allows AIAC applications to run faster
1194 %on distributed clusters architectures, with a gain over $30\%$ on
1195 %execution, as experiments described in section \ref{sec:benef}.
1196
1197 \vspace*{0.2cm}
1198
1199 \subsubsection{Edge-cuts optimization}
1200 \label{sec:edgcutalgo}
1201
1202 As explained in section \ref{sec:pb}, the asynchronous iteration model
1203 is so specific and unpredictable that we would like to evaluate the
1204 second kind of mapping algorithm, which aims to optimize the
1205 ``edge-cuts''.
1206 %Multilevel partitioning algorithms such as Metis and
1207 %Chaco fail to address the limitations imposed by heterogeneity in the
1208 %underlying targeted system. They assume that computing nodes and
1209 %network relying them are homogeneous. This is not corresponding to our
1210 %execution environment, which is fully heterogeneous. These methods are
1211 %based on ``graph growing'' (GGP) and/or ``greedy graph growing''
1212 %(GGGP) algorithms which aim to divide tasks into two, or for some
1213 %algorithm a power of two, partitions. In our case, we do not know in
1214 %advance the number of partitions we need. Indeed, it depends on the
1215 %amount of nodes in each cluster and on the number of tasks.
1216 %
1217 %As GGP and GGGP algorithms seems to be efficient in specific cases, it
1218 %could be interesting to adapt one to our model, in order to evaluate a
1219 %real edge-cuts optimization algorithm.
1220 We choose the Farhat's algorithm\cite{farhat}, which has the ability
1221 to divide the graph into any number of partitions, thereby avoiding
1222 recursive bisection.
1223 %Therefore its running execution time is
1224 %independent of the desired number of subsections.
1225
1226 % The adapted version of this algorithm, Farhat's Edge-Cut (F-EC),
1227 % evaluated in the JaceP2P-V2 environment is described in Algorithm
1228 % \ref{alg:edgecuts}.
1229
1230 % \SetAlgoSkip{}
1231 % \begin{algorithm}
1232 %   \SetLine
1233 %   \dontprintsemicolon
1234
1235 %   \KwIn{Sets of tasks and computing nodes}
1236 %   \KwOut{Mapping of tasks to nodes}
1237
1238 %   \BlankLine
1239
1240
1241 %   sort nodes by cluster\;
1242 %   $lTasks \leftarrow$ sort tasks by dep degree\;
1243 %   $changeCluster \leftarrow$ true\;
1244 %   $cTasks \leftarrow$ empty;
1245
1246 %   \BlankLine
1247
1248 %   \While{one task is not mapped}{
1249 %     \If{$changeCluster$}{
1250 %       $curCluster \leftarrow$ nextCluster()\;
1251 %       $places \leftarrow$ size($curCluster$)\;
1252 %       $changeCluster \leftarrow$ false\;
1253 %       $mTasks \leftarrow$ empty\;
1254 %     }
1255
1256 %     \BlankLine
1257
1258 %     \If{no task in cTasks}{
1259 %       $cTasks \leftarrow$ first task from $lTasks$\;
1260 %     }
1261
1262 %     \BlankLine
1263
1264 %     $curTask \leftarrow$ first task in $cTasks$\;
1265
1266 %     \BlankLine
1267
1268 %     \eIf{$( 1 + $dep(curTask)~$\cdot~\delta) <= places)$}{
1269 %       remove $curTask$ from $cTasks$\;
1270 %       add $curTask$ in $mTasks$\;
1271 %       $places \leftarrow places - 1$\;
1272 %       add dep(curTask) in cTasks\;
1273 %     }{
1274 %       $changeCluster$ $\leftarrow$ true\;
1275 %       associate $mTasks$ with $curCluster$\;
1276 %     }
1277 %   }
1278
1279 %   \caption{The Fahrat's Edge-Cut algorithm}
1280 %   \label{alg:edgecuts}
1281 % \end{algorithm}
1282
1283 This algorithm aims to do a ``clusterization'' of the tasks. First, it
1284 groups computing nodes in clusters, which are sorted according to
1285 their number of nodes, from the higher to the lower. Tasks are ordered
1286 following their dependency degree, starting from the higher to the
1287 lower. Tasks in the top of the list have a higher priority to be
1288 mapped. Next, the algorithm tries to map on each cluster the maximum
1289 number of tasks. To map a task on a cluster, the algorithm evaluates
1290 if there is enough space to map the task and some of its
1291 dependencies. This amount of dependencies is fixed by a factor
1292 $\delta$, which is a parameter of the algorithm. In the positive case,
1293 the task is mapped on the current cluster and its dependencies become
1294 priority tasks to be mapped. This allows to keep the focus on the
1295 communicating tasks locality.
1296 %All computing nodes are first grouped in clusters.
1297 %%, using their full
1298 %%qualified name -- for now, no criteria is used to sort clusters, such
1299 %%are their power or size.
1300 %Next, tasks are inserted in a list $lTasks$, sorted in descending
1301 %according to their dependency degree. This allows to map tasks with
1302 %their dependencies to improve communications. Next begins the main
1303 %mapping loop.
1304 %
1305 %While not all tasks have been mapped the algorithm do the
1306 %following. First, at the beginning, it chooses the first available
1307 %cluster to be the current cluster to map tasks on, and sets the
1308 %variable $places$ to its size, in number of computing nodes, and
1309 %create a new tasks' list $mTasks$. During the execution of the
1310 %algorithm, passing in this block means that there is no more node
1311 %available in the current cluster, so it has to choose the next
1312 %cluster. Next, it looks in $cTasks$, which is a tasks list containing
1313 %tasks which are in instance to be mapped. If $cTasks$ is empty it
1314 %takes the first available task in $lTasks$ -- at the beginning it
1315 %takes the first one, which has the higher dependency degree.
1316 %
1317 %Next, it takes the first task in $cTasks$ to try to map it on the
1318 %current cluster. The task is mapped on if there is enough available
1319 %computing nodes on the current cluster. This amount of node
1320 %corresponds to $1 + dep( curTask ) \cdot \delta$, which represents the
1321 %task plus the amount of its dependencies multiplied by a ``local
1322 %dependency factor'' $\delta$, which indicates how many dependencies
1323 %must be a least with $curTask$ on the same cluster. If the task could
1324 %be mapped on this cluster, $curTask$ is added in $mTasks$, which is a
1325 %tasks' list containing tasks which are currently mapped on current
1326 %cluster, and removed it from $cTasks$. The number of available nodes,
1327 %$places$, is decremented, and the dependencies of $curTask$ are added
1328 %to $cTasks$. This list is sorted in the same way as $lTasks$, indeed
1329 %tasks are added according to their dependency degree, from the higher
1330 %to the lower. Otherwise, if there is not enough or no more nodes
1331 %available in the cluster, it has to change the current cluster, so it
1332 %associates the current cluster and list of tasks mapped on, $mTasks$.
1333
1334
1335 \subsection{Experiments}
1336 \label{sec:experiments}
1337
1338 After having described the different components of the experiments, we
1339 now present the impacts of the AIAC QM mapping on applications running
1340 with JaceP2P-V2 on a heterogeneous distributed clusters
1341 architecture. In the following, we note ``heterogeneity degree'' the
1342 degree of heterogeneity of distributed clusters; it is the ratio
1343 between the average and the standard deviation of the computing nodes
1344 power. This heterogeneity degree may vary from 0, nodes are
1345 homogeneous, to 10, nodes are totally heterogeneous. In these
1346 experiments, we consider that there is no computing nodes failing during
1347 applications execution.
1348
1349 The application used to realize these experiments is the KernelCG of
1350 the NAS parallel benchmark, in the multi-splitting version. Two
1351 problem sizes were used: one using a matrix of size $550,000$ (named
1352 ``class E'') using 64 computing nodes and the other using a matrix of
1353 size $5,000,000$ (named ``class F'') using 128 nodes.
1354 %
1355 %Figure
1356 %\ref{fig:tigcg} shows a part of a TIG of this application.
1357 %
1358 %\begin{figure}[h!]
1359 %  \centering
1360 %
1361 %  \label{fig:tigcg}
1362 %\end{figure}
1363
1364
1365
1366 %\subsubsection{About heterogeneity}
1367 %\label{sec:xphetero}
1368
1369 %The first experiments concern the study of the impact of the
1370 Our experiments concern the study of the impact of the
1371 heterogeneity of the computing nodes on the mapping
1372 results. Heterogeneity is an important factor in high performance
1373 computing in the grid all the more so when using the asynchronous
1374 iteration model.
1375
1376 As mapping algorithms take in parameter a factor of research (for AIAC
1377 QM) and the amount of local dependencies (for F-EC), we fixed both to
1378 $50\%$. That means for AIAC QM that at each round the amount of
1379 considering nodes would be divided by two, and for F-EC that each task
1380 requires half of its dependencies on the same local cluster.
1381
1382 Four experiments were done using four architectures having different
1383 heterogeneity degrees -- in two architectures computing nodes are more
1384 heterogeneous than in the others. In these experiments, we did not
1385 affect the networks heterogeneity, because of the difficulty to
1386 disturb and control network on Grid'5000; by default, networks are
1387 already quite heterogeneous. We needed more than 200 computing nodes
1388 to execute our application because of the small capacity of some
1389 clusters to execute the largest problems (there is not enough
1390 memory). The nodes used have more than 2 GB of RAM and both execute a
1391 Linux 64 bits distribution.
1392
1393 The first architecture, Arc1.1, was composed of 113 computing nodes
1394 representing 440 computing cores, spread over 5 clusters in 4
1395 geographically distant sites. In Arc1.1 we used bi-cores (2 clusters), quad-cores (2
1396 clusters) and bi-quad-cores (1 cluster) machines. Its heterogeneity
1397 degree value is 6.43. This architecture was used to run class E of the CG
1398 application using 64 computing nodes. The second architecture, Arc1.2,
1399 used to execute class F of the CG application, using 128 computing
1400 nodes, was composed of 213 computing nodes representing 840 computing
1401 cores, with a heterogeneity degree of 6.49. This architecture was
1402 spread on the same clusters and sites as Arc1.1. The results of the
1403 experiments on Arc1.1 and Arc1.2 are given in Table \ref{tab:exph1E}
1404 and Table \ref{tab:exph1F}, which give the gains in execution time obtained
1405 in comparison to the version without mapping.
1406
1407 %\vspace{0.2cm}
1408
1409 \renewcommand{\arraystretch}{1.5}
1410
1411 \begin{table}[h!]
1412   \centering
1413   \begin{tabular}[h!]{|c||c|c|c|c|}
1414     \hline
1415     Algorithm& None&SMa & AIAC QM & F-EC \\
1416     \hline
1417     \hline
1418     Execution time&150s&110s&101s&90s\\
1419     \hline
1420 %    Gains&--&$27\%$&$33\%$&\textcolor{blue}{$40\%$}\\
1421     Gains&--&$27\%$&$33\%$&$40\%$\\
1422     \hline
1423   \end{tabular}
1424   \caption{Gains in time of the execution of the class E of the CG
1425     application on Arc1.1 using 64 nodes}
1426   \label{tab:exph1E}
1427   \vspace*{-0.4cm}
1428 \end{table}
1429
1430 %\vspace{-0.5cm}
1431
1432 \renewcommand{\arraystretch}{1.5}
1433
1434 \begin{table}[h!]
1435   \centering
1436   \begin{tabular}[h!]{|c||c|c|c|c|}
1437     \hline
1438     Algorithm& None &SMa & AIAC QM & F-EC \\
1439     \hline
1440     \hline
1441     Execution time&403s&265s&250s&218s\\
1442     \hline
1443 %    Gains&--&$34\%$&$38\%$&\textcolor{blue}{$46\%$}\\
1444     Gains&--&$34\%$&$38\%$&$46\%$\\
1445     \hline
1446   \end{tabular}
1447   \caption{Gains in time of the execution of the class F of the CG
1448     application on Arc1.2 using 128 nodes}
1449   \label{tab:exph1F}
1450   \vspace*{-0.4cm}
1451 \end{table}
1452
1453 At first, we can see that the Simple Mapping algorithm, though it is
1454 simple, provides a significant improvement of application execution
1455 time. This highlights that JaceP2P-V2 really needs a mapping algorithm
1456 in order to be more efficient. Then, we can see that the F-EC and the
1457 AIAC QM algorithms provide a better mapping than the Simple Mapping
1458 algorithms. We can see a significant difference between both
1459 algorithms. This comes from the homogeneity of clusters. In this case,
1460 the F-EC algorithm is more efficient since the minimization of the
1461 communications becomes more important than the tackle of the
1462 computational power heterogeneity problem.
1463 %Indeed, it is more benefic for ta becomes more important than the
1464 %tacklebecomes more important than the
1465 %tackle sks to have locally their
1466 %dependencies, which allows to improve communications, in case of
1467 %computing nodes are more homogeneous -- communications are more
1468 %important than computing power (that is why the F-EC algorithm is more
1469 %efficient).
1470 The effect is that tasks do less iterations as they
1471 receive more frequently updated data from their neighbors. In
1472 addition, as tasks and their dependencies are on the same cluster,
1473 communications are improved, but also as computations take
1474 approximately the same time, the amount of iterations is reduced and
1475 the algorithm can converge more quickly.
1476
1477 % Another important positive point is that gains are scalable, which allows
1478 % to foresee big improvements for very large applications.\\
1479
1480 The third architecture, Arc2.1, was composed of 112 computing nodes,
1481 representing 394 computing cores, spread over 5 clusters in 5
1482 sites. In this architecture we used bi-cores (3 clusters),
1483 quad-cores (1 cluster) and bi-quad-cores (1 cluster) machines. Its
1484 heterogeneity degree's value is 8.41. This architecture was used to run
1485 class E of the CG application, using 64 computing nodes. The fourth
1486 architecture, Arc2.2, used to execute class F of the CG
1487 application, using 128 computing nodes, was composed of 212 computing
1488 nodes representing 754 computing cores, with a degree of heterogeneity
1489 of 8.44. This architecture was spread on the same clusters and sites
1490 as Arc2.1. The results of the experiments on Arc2.1 and Arc2.2 are
1491 given in Table \ref{tab:exph2E} and Table \ref{tab:exph2F}, which give the
1492 gains in execution time obtained in comparison to the version without
1493 mapping.
1494
1495 %\vspace{0.2cm}
1496
1497 \renewcommand{\arraystretch}{1.5}
1498
1499 \begin{table}[h!]
1500   \centering
1501   \begin{tabular}[h!]{|c||c|c|c|c|}
1502     \hline
1503     Algorithm&None& SMa & AIAC QM & F-EC \\
1504     \hline
1505     \hline
1506     Execution time&498s&341s&273s&385s\\
1507     \hline
1508 %    Gains&$32\%$&\textcolor{blue}{$45\%$}&\textcolor{red}{$23\%$}\\
1509     Gains&--&$32\%$&$45\%$&$23\%$\\
1510     \hline
1511   \end{tabular}
1512   \caption{Gains in time of the execution of the class E of the CG
1513     application on Arc2.1 using 64 nodes}
1514   \label{tab:exph2E}
1515   \vspace*{-0.4cm}
1516 \end{table}
1517
1518 \renewcommand{\arraystretch}{1.5}
1519
1520 \begin{table}[h!]
1521   \centering
1522   \begin{tabular}[h!]{|c||c|c|c|c|}
1523     \hline
1524     Algorithm& None&SMa & AIAC QM & F-EC \\
1525     \hline
1526     \hline
1527     Execution time&943s&594s&453s&660s\\
1528     \hline
1529 %    Gains&$37\%$&\textcolor{blue}{$52\%$}&\textcolor{red}{$30\%$}\\
1530     Gains&--&$37\%$&$52\%$&$30\%$\\
1531     \hline
1532   \end{tabular}
1533   \caption{Gains in time of the execution of the class F of the CG
1534     application on Arc2.2 using 128 nodes}
1535   \label{tab:exph2F}
1536   \vspace*{-0.4cm}
1537 \end{table}
1538
1539 To begin with, these experiments confirm that a mapping algorithm is
1540 needed and that improvements are always scalable. Then, we can see
1541 that the F-EC algorithm falls in performance and AIAC QM is
1542 improved. What is surprising is that the Simple Mapping algorithm is
1543 better than F-EC. This can be explained by the fact that as computing
1544 nodes are quite heterogeneous, computations are not the same, so it is
1545 not significant to map dependencies close to tasks. In this case, the
1546 most important is the power of computing nodes. So, in this kind of
1547 architecture, it is more efficient to choose the best computing nodes
1548 to compute iterations more quickly and to improve the convergence
1549 detection.
1550
1551 % Here, it is important to note that the AIAC QM algorithm offers a gain
1552 % of about $50\%$ on the execution time, that is to say that the
1553 % application takes half of the execution time than without mapping.
1554
1555 % \subsubsection{Parameters variation}
1556 % \label{sec:xpvariation}
1557
1558 % After having evaluated mapping algorithms on the heterogeneity of
1559 % distributed clusters, we now propose to change the parameters of AIAC
1560 % QM and F-EC algorithms, in order to determine which values are the
1561 % most accurate.
1562
1563 % To do these experiments, we used an architecture composed of 122
1564 % computing nodes representing 506 computing cores, spread over 5
1565 % clusters in 5 sites. In this architecture we used bi-cores (2
1566 % clusters), quad-cores (2 clusters) and bi-quad-cores (1 cluster)
1567 % machines. Its heterogeneity degree value is 4.98.
1568 % %, which means that
1569 % %computing nodes power is very heterogeneous.
1570
1571 % The parameters of each algorithm, $f$ (the search factor) for
1572 % AIAC QM and $\delta$ (the amount of local dependencies) for F-EC,
1573 % varied both with values $10\%$, $50\%$ and $90\%$. We used the CG
1574 % multi-splitting application on 64 computing nodes. The results of
1575 % these experiments are given in Table \ref{tab:expparams}. Results
1576 % reported in this table represent the gains in execution time provided
1577 % by each algorithm with different parameters values.
1578
1579 % %\vspace{0.2cm}
1580 % \renewcommand{\arraystretch}{1.5}
1581
1582 % \begin{table}[h!]
1583 %   \centering
1584 %   \begin{tabular}[h!]{|c||c|c|c|}
1585 %     \hline
1586 %     Parameters& $10\%$ & $50\%$ & $90\%$ \\
1587 %     \hline
1588 %     \hline
1589 % %    Simple & \multicolumn{3}{c|}{\textcolor{blue}{$39\%$}}\\
1590 %     SMa & \multicolumn{3}{c|}{$30\%$}\\
1591 %     \hline
1592 % %    AIAC QM & $35\%$ & \textcolor{blue}{$41\%$} & $35\%$ \\
1593 %     AIAC QM & $30\%$ & $32\%$ & $30\%$ \\
1594 %     \hline
1595 % %    F-EC & $39\%$ & $35\%$ & \textcolor{blue}{$45\%$} \\
1596 %     F-EC & $40\%$ & $37\%$ & $45\%$ \\
1597 %     \hline
1598 %   \end{tabular}
1599 % %  \caption{Parameters variations using a $500'000$ problem size on an
1600 % %    architecture of 5.37 heterogeneity degree}
1601 %   \caption{Gains in execution time with mapping algorithms parameters
1602 %     variations using the class E of the CG application using 64
1603 %     computing nodes}
1604 % %  \vspace*{-0.4cm}
1605 %   \label{tab:expparams}
1606 % %  \vspace*{-0.9cm}
1607 % \end{table}
1608
1609 % %First of all, we can see that the Simple mapping provides the same
1610 % %order of performance, as shown in the precedent section, so it is
1611 % %not affected by the heterogeneity degree. Secondly,
1612 % For the AIAC QM algorithm, we can note that the best value for its
1613 % parameter $f$ is about $50\%$, but its impact is not big enough to
1614 % indicate a specific configuration.
1615 % % With a low heterogeneity degree, this mapping algorithm provides a
1616 % % good performance improvement.
1617 % Finally, and this is not surprising, the F-EC algorithm is more
1618 % efficient with a factor $\delta$ near $100\%$, which directly comes
1619 % from its aim. But we can see that it is more efficient to have a
1620 % factor around $10\%$ than having one around $50\%$.
1621
1622 % We can note here, with a lower heterogeneity degree than in previous
1623 % experiments, gains are lower and the difference between AIAC QM and
1624 % F-EC (with parameters at $50\%$) is lower. It can be explained as the
1625 % fact that more the heterogeneity degree tends to 0 more computing
1626 % nodes are the same, so a mapping solution will not be efficient,
1627 % except one only optimizing network latency.
1628 %These experiments show that the impact of parameters values does not
1629 %well influence the AIAC QM, whereas it is very important for the F-EC
1630 %algorithm.
1631
1632 %\section{Discussion}
1633 %\label{sec:discussion}
1634 %
1635 %In this paper three algorithms for mapping asynchronous iterative
1636 %applications on heterogeneous distributed architectures are described.
1637 %This approach is relatively new, so we had to study related works. We
1638 %assume in our model that the amount of computation, the computational
1639 %cost of each task, cannot be known in advance, and it is the same
1640 %about communications. Two mains class of algorithms are mainly used,
1641 %the ``execution time'' and the ``edge-cuts'' optimizations, with a
1642 %newer preference fort the first one. Indeed, the efficiency of second
1643 %one is criticized for its objectives, invoking the fact that these
1644 %algorithms do not optimize the right metric. It is true in a broad
1645 %spectrum of mapping domains, but we have shown that in our case, it
1646 %could be an efficient solution, depending on the architecture
1647 %heterogeneity.
1648 %
1649 %As each experiment takes a lot of time and the Grid'5000 platform is
1650 %shared by many researchers, we could not conducted as many experiments
1651 %as we wanted and as we need to purpose an exhaustive view of this part
1652 %of the mapping domain. We cannot design a more general state of the
1653 %mapping of asynchronous iterative applications on distributed
1654 %architectures, but we can draw the main lines of future works.
1655 %%, by
1656 %%giving the first stones at the building of this specific part of the
1657 %%mapping domain.
1658 %
1659 %We have shown that the heterogeneity in computing nodes power takes an
1660 %important part in an efficient mapping algorithm. This parameter and
1661 %probably the heterogeneity in network should really be more taken into
1662 %consideration. Maybe a good solution consists in designing of mapping
1663 %algorithms giving more important priority to one or to the other
1664 %optimization objectives. This leads to design a novel algorithm, which
1665 %takes into account the different points discussed in this paper, which
1666 %would probably be an hybrid algorithm, efficient with our model on the
1667 %targeted architectures.
1668
1669 \section{Conclusion and future works}
1670 \label{sec:conclu}
1671
1672 In this paper we have presented
1673 %three algorithms to address the
1674 %mapping problem for asynchronous iterative applications in
1675 %heterogeneous distributed architectures. As the asynchronous iteration
1676 %model is very specific, it was not clear of which kind of mapping
1677 %algorithm should be efficient on such a problem. The two main
1678 %approaches given in the literature, the ``execution time'' and the
1679 %``edge-cuts'' optimization algorithms, have been evaluated on
1680 %different architectures, with different heterogeneity degrees.
1681 %% These architectures varied in their
1682 %%heterogeneity, to evaluate the algorithms.
1683 %
1684 %%We proposed three mapping algorithms for the JaceP2P-V2
1685 %%environment. The first is a simple way mapping, the Simple Mapping
1686 %%algorithm, which always provides a good and stable improvement of
1687 %%performance on all kinds of architectures.
1688 %We propose
1689 a specific mapping algorithm for the AIAC model, called AIAC QM. This
1690 algorithm is based on the execution time optimization but it also
1691 includes a small degree of edge-cuts optimization. Experiments  show that
1692 the AIAC QM mapping algorithm is efficient on architectures with a
1693 high heterogeneity degree. This can be explained by the fact that all
1694 iteration computations are quite different, for our example, and the
1695 convergence is more quickly detected as the more powerful computing
1696 nodes progress in the computation.
1697 % The F-EC algorithm, which is based
1698 % on the ``edge-cuts'' optimization, is meanwhile efficient on
1699 % architectures with a low heterogeneity degree. This can be explained
1700 % by the fact that in such an environment, it is more accurate for a
1701 % task to have its dependencies locally on the same cluster in order to
1702 % have efficient communications and to allow iterations to be computed
1703 % together, which improves the convergence detection speed.
1704 % Experiments we conducted have shown gains in execution time up to
1705 % $50\%$, which denotes a division by 2 of this execution time, for a
1706 % typical AIAC application on more than 700 computing cores.
1707 % %Experiments have shown that
1708 % %the importance of the parameter of both algorithms, AIAC QM and F-EC,
1709 % %is not so essential for the first one, instead it is very critical for
1710 % %the second one, but we cannot be sure that it is true all the time on
1711 % %all kinds of architectures; it maybe depends on the heterogeneity
1712 % %degree of the network.
1713 % As we did not influence the network's heterogeneity,
1714 % % as we did for the computational power of nodes,
1715 % the evaluation of the network impact on the
1716 % application execution time would be one of our next work.
1717
1718 %For now, these three mapping algorithms are implemented in an
1719 %additional library for the JaceP2P-V2 environment. The results
1720 %presented in this paper show that a mapping algorithm allows to
1721 %improve applications performance, but as the executing architectures
1722 %should have a variety of heterogeneity degree, we have to find a
1723 %compromise between the two approaches in order to have an efficient
1724 %mapping algorithm on all kinds of architectures. In the future, we
1725 %would like to design an efficient mapping algorithm to improve the
1726 %execution of asynchronous iteration applications on heterogeous
1727 %distributed architectures. As the algorithm should be integrated in
1728 %the JaceP2P-V2 environment, which is fully decentralized and fault
1729 %tolerant, the new mapping algorithm should have also these
1730 %characteristics, in order to retrieve a fully decentralized and fault
1731 %tolerant environment.
1732 % Our future works concern the amelioration of the AIAC QM algorithm, in
1733 % order to improve it on homogeneous distributed architectures. As the
1734 % F-EC mapping algorithm is efficient on such architectures, we will
1735 % give a more important consideration to the edge-cuts part of AIAC
1736 % QM.
1737
1738 In our future work we plan to take into consideration the fault
1739 tolerance problem. In this study we have realized our experiments
1740 without computing node fault, which is not the real case. We have to
1741 take into account the AIAC QM algorithm about this important
1742 parameter. First we have to efficiently choose new nodes to replace
1743 failed ones. Secondly, as we do checkpointing to save tasks' states,
1744 we have to efficiently choose backup nodes not to fail in case
1745 a whole cluster fails, as we save on neighbors (which are
1746 in general on the same cluster for communication efficiency reasons),
1747 an important part of the application is lost and we cannot restart
1748 this part; so the whole application fails. A trade-off should be done
1749 by having some saving nodes in external clusters.\\
1750
1751 % \subsubsection*{Acknowledgements}
1752
1753 % \label{sec:merci}
1754
1755 % This work was supported by the European Interreg IV From-P2P project
1756 % and the region of Franche-Comté.
1757
1758 % Experiments presented in this paper were carried out using the
1759 % Grid'5000\cite{g5k} experimental testbed, being developed under the
1760 % INRIA ALADDIN development action with support from CNRS, RENATER and
1761 % several Universities as well as other funding bodies.
1762
1763
1764 \bibliographystyle{unsrt}
1765 %\nocite{*}
1766 \bibliography{biblio}
1767
1768
1769 \end{document}
1770
1771 %% %%