X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/interreg4.git/blobdiff_plain/0056f255c291d1a3dce163a646494d73fdffbb4f..013c93d104bec94042ad7c3675d0810f7013084c:/pdsec2010/pdsec2010.tex diff --git a/pdsec2010/pdsec2010.tex b/pdsec2010/pdsec2010.tex index f760a2d..6f76693 100644 --- a/pdsec2010/pdsec2010.tex +++ b/pdsec2010/pdsec2010.tex @@ -1,5 +1,5 @@ %% Version PDSEC -\documentclass[conference,compsoc,a4paper]{IEEEtran} +\documentclass[10pt,conference,compsocconf]{IEEEtran} %% Pour Ingrid %\documentclass[conference,compsoc,a4paper,onecolumn]{IEEEtran} @@ -18,8 +18,8 @@ %\usepackage{graphicx} %\usepackage{xspace} % Definition des marges -\usepackage{vmargin} -\setpapersize[portrait]{A4} +%\usepackage{vmargin} +%\setpapersize[portrait]{A4} \usepackage[english]{babel} % Extension pour les graphiques EPS @@ -27,7 +27,7 @@ \usepackage[pdftex,final]{graphicx} % Extension pour les liens intra-documents (tagged PDF) % et l'affichage correct des URL (commande \url{http://example.com}) -\usepackage{hyperref} +%\usepackage{hyperref} % Extension pour que plus de titres apparaissent dans la table des matieres % (table des matieres, bibliographie, index). %\usepackage{tocbibind} @@ -69,14 +69,15 @@ \IEEEauthorblockA{Laboratoire d'Informatique de Franche-Comté (LIFC)\\ - University of Franche-Comté\\ + %University of Franche-Comté\\ IUT de Belfort-Montbéliard,2 Rue Engel Gros, BP 27, 90016 Belfort, France\\ % Tel.: +33-3-84587782 \hspace{20pt} Fax: +33-3-84587781\\ Email: \{raphael.couturier,david.laiymani,sebastien.miquee\}@univ-fcomte.fr} -%\thanks{This work was supported by the European Interreg IV From-P2P project.} +\thanks{This work was supported by the European Interreg IV From-P2P project +and the region of Franche-Comté} } %% Permet de réactiver le \thanks @@ -86,7 +87,7 @@ Email: \begin{document} %% left, top, right, bottom -\setmargrb{20mm}{15mm}{20mm}{15mm} +%\setmargrb{20mm}{15mm}{20mm}{15mm} \IEEEcompsoctitleabstractindextext @@ -103,13 +104,15 @@ Email: heterogeneous nodes connected through heterogeneous networks, the need for mapping algorithms is crucial. In this paper, we propose a new mapping algorithm dedicated to the AIAC model. To evaluate our - mapping algorithm we first implemented it in the JaceP2P programming - and executing environment dedicated to AIAC applications. Then we - conducted a set of experiments on the Grid'5000 testbed with more - than 700 computing cores and with a real and typical AIAC - application based on the NAS parallel benchmarks. Results are very - encouraging and show that the use of our algorithm brings an - important gain in term of execution time (about $40\%$).\\ + mapping algorithm we implemented it in the JaceP2P programming and + executing environment dedicated to AIAC applications and we + conducted a set of experiments on the Grid'5000 testbed. Results + are very encouraging and show that the use of our algorithm brings + an important gain in term of execution time (about $40\%$). + +%with more +% than 700 computing cores and with a real and typical AIAC +% application based on the NAS parallel benchmarks. % To design parallel and distributed applications on heterogeneous % distributed architectures, the asynchronous iteration model may be @@ -123,11 +126,14 @@ Email: % middleware, JaceP2P-V2, which allows users to design asynchronous % iterative applications and execute them on distributed % architectures, in which we added a mapping library.\\ - - \textup{\small \textbf{Keywords:} Mapping algorithms, Distributed - clusters, Parallel iterative asynchronous algorithms, Heterogeneous - architectures.} \end{abstract} + +\begin{IEEEkeywords} +Mapping algorithms; Distributed clusters; Parallel iterative +asynchronous algorithms; Heterogeneous distributed architectures. +\end{IEEEkeywords} + + \IEEEpeerreviewmaketitle @@ -141,7 +147,7 @@ Email: \label{sec:intro} Nowadays scientists of many domains, like climatic simulation or -biological research, need great and powerful architectures to compute +biological research, need large and powerful architectures to compute their large applications. Distributed clusters architectures, which are part of the grid architecture, are one of the best architectures used to solve such applications with an acceptable execution @@ -150,97 +156,82 @@ nodes interconnected by a high performance network, but even with the greatest efforts of their maintainers, there are latency and computation capacity differences between clusters of each site. -In order to efficiently use this massive -distributed computation power, numerous numerical algorithms have been -elaborated. These algorithms can be broadly classified into two -categories: - -\myitemize{5} -\item \textbf{Direct methods}, which give the exact solution of the - problem using a finite number of operations - (e.g. Cholesky\cite{cholesky-cg}, LU\cite{lu},etc). However, these - methods cannot be applied to all kinds of numerical problems. In - general, they are not well adapted to very large problems. -\item \textbf{Iterative methods}, that repeat the same instructions - until a desired approximation of the solution is reached -- we say - that the algorithm has converged. Iterative algorithms constitute - the only known approach to solving some kinds of problems and they - are easier to parallelize than direct methods. The Jacobi or - Conjugate Gradient\cite{cg} algorithms are examples of such - iterative methods. -\end{itemize} - +In order to efficiently use this massive distributed computation +power, numerous numerical algorithms have been modified. These +algorithms can be broadly classified into two categories. First, +\textit{direct methods}, which give the exact solution of the problem +using a finite number of operations (e.g. Cholesky, +LU\dots). These methods cannot be applied to all kinds of +numerical problems. Generally, they are not well adapted to very +large problems. Then \textit{iterative methods}, that repeat the same +instructions until a desired approximation of the solution is reached +-- we say that the algorithm has converged. Iterative algorithms +constitute the only known approach to solving some kinds of problems +and are easier to parallelize than direct methods. The Jacobi or +Conjugate Gradient algorithms are examples of such iterative +methods. +% \myitemize{5} + +% \begin{figure}[h!] +% \vspace{0.1cm} +% \centering +% \includegraphics[width=7.4cm]{images/ISCA} +% \caption{Two processors computing in the Synchronous Iteration - Asynchronous Communication (SIAC) model} +% \label{fig:SIAC} +% \end{figure} In the rest of this paper we only focus on iterative methods. Now to parallelize this kind of algorithm, two classes of parallel iterative -models can be described: - -\myitemize{5} - -\begin{figure}[h!] - \vspace{0.1cm} - \centering - \includegraphics[width=7.4cm]{images/ISCA} - \caption{Two processors computing in the Synchronous Iteration - Asynchronous Communication (SIAC) model} - \label{fig:SIAC} -\end{figure} - -\item \textbf{The synchronous iteration model}. In this model, as can - be seen on Figure \ref{fig:SIAC}, after each iteration (represented - by a filled rectangle), a node sends its results to its neighbors - and waits for the reception of all dependency messages from its - neighbors to start the next iteration. This results in large idle - times (represented by spaces between each iteration) and is - equivalent to a global synchronization of nodes after each - iteration. These synchronizations can strongly penalize the overall - performances of the application particularly in case of large scale - platforms with high latency network. Furthermore, if a message is - lost, its receiver will wait forever for this message and the - application will be blocked. In the same way, if a machine falls - down, all the computation will be blocked. +models can be described. In the \textit{synchronous iteration model} +after each iteration a node sends its results to its neighbors and +waits for the reception of all dependency messages from them +to start the next iteration. This results in large idle times and is +equivalent to a global synchronization of nodes after each +iteration. These synchronizations can strongly penalize the overall +performance of the application particularly in case of large scale +platforms with high latency network. Furthermore, if a message is +lost, its receiver will wait forever for this message and the +application will be blocked. In the same way, if a machine fails, all +the computation will be blocked. \begin{figure}[h!] - \vspace{0.1cm} \centering \includegraphics[width=7.4cm]{images/IACA} \caption{Two processors computing in the Asynchronous Iteration - Asynchronous Communication (AIAC) model} \label{fig:AIAC} \end{figure} - - -\item \textbf{The asynchronous iteration model}. In this - model\cite{book_raph}, as can be seen on Figure \ref{fig:AIAC}, after - each iteration, a node sends its results to its neighbors and starts - immediately the next iteration with the last received data. These - data could be data from previous iterations, because last data are - not arrived in time or neighbors have not finish their current - iteration. The receiving and sending mechanisms are asynchronous and - nodes do not have to wait for the reception of dependency messages - from their neighbors. Consequently, there is no more idle time - between two iterations. Furthermore, this model is tolerant to - messages loss and even if a node dies, the remaining nodes continue - the computation, with the last data the failed node - sent. Unfortunately, the asynchronous iteration model generally - requires more iterations than the synchronous one to converge to the - solution. - - This class of algorithms is very suitable in a distributed clusters - computing context because it suppresses all synchronizations between - computation nodes, tolerates messages loss and enables the - overlapping of communications by computations. Interested readers - might consult \cite{bcvc06:ij} for a precise classification and - comparison of parallel iterative algorithms. In this way, several - experiments \cite{bcvc06:ij} show the relevance of the AIAC - algorithms in the context of distributed clusters with high latency - between clusters. These works underline the good adaptability of - AIAC algorithms to network and processor heterogeneity. -\end{itemize} +\vspace{-0.2cm} + +In the \textit{asynchronous iteration model} a node sends its +results to its neighbors and starts immediately the next iteration +with the last received data. These data could be data from previous +iterations, because the most recent data has not arrived in time or +neighbors have not finish their current iteration. The receiving and +sending mechanisms are asynchronous and nodes do not have to wait for +the reception of dependency messages from their +neighbors. Consequently, there is no more idle time between two +iterations. Furthermore, this model is tolerant to messages loss and +even if a node dies, the remaining nodes continue the computation, +with the last data the failed node sent. Unfortunately, the +asynchronous iteration model generally requires more iterations than +the synchronous one to converge to the solution. + +This class of algorithms is very suitable in a distributed clusters +computing context because it suppresses all synchronizations between +computation nodes, tolerates messages loss and enables the overlapping +of communications by computations. Interested readers might consult +\cite{book_raph} for a precise classification and comparison of +parallel iterative algorithms. In this way, several experiments +\cite{book_raph} show the relevance of the AIAC algorithms in the +context of distributed clusters with high latency between +clusters. These works underline the good adaptability of AIAC +algorithms to network and processor heterogeneity. As we aim to solve very large problems on heterogeneous distributed architectures, in the rest of this study we only focus on the asynchronous iteration model. In order to efficiently use such algorithms on distributed clusters architectures, it is essential to -map the tasks of the application to the best sub-sets of nodes of the +map the application's tasks to the best sub-sets of nodes of the target architecture. This mapping procedure must take into account parameters such as network heterogeneity, computing nodes heterogeneity and tasks heterogeneity in order to minimize the overall @@ -257,7 +248,7 @@ architectures. The aim of this paper is to propose a new mapping algorithm dedicated to AIAC applications and to implement it into a real large scale computing platform, JaceP2P-V2. Experiments conducted on the Grid'5000 testbed with more than 400 computing cores show that -this new algorithm allows to enhance the performances of JaceP2P-V2 of +this new algorithm enhances the performance of JaceP2P-V2 by about $40\%$ for a real and typical AIAC application. %this mapping problem. Our aim is to evaluate the @@ -295,27 +286,30 @@ clusters, it offers a safer and crash free platform. To our knowledge this is the only platform dedicated to designing and executing AIAC algorithms. -\subsection{Architecture} -\label{sec:archijaceP2P} +%\subsection{Architecture} +%\label{sec:archijaceP2P} -In this section we describe the JaceP2P-V2 environment. As can be seen -on Figure \ref{fig:jaceP2P-v2}, which shows its architecture, this -platform is composed of three main entities: +%In this section we describe the JaceP2P-V2 environment. +%As can be seen +%on Figure \ref{fig:jaceP2P-v2}, which shows -\begin{figure}[h!] - \vspace{0.1cm} - \centering - \includegraphics[width=7.4cm]{images/JACEP2P-V2} - \caption{The JaceP2P-V2 architecture} - \label{fig:jaceP2P-v2} -\end{figure} +The JaceP2P-V2 architecture, is composed of three main entities: -\myitemize{5} -\item The first entity is the ``super-node'' (represented by a big - circle in figure \ref{fig:jaceP2P-v2}). Super-nodes form a circular - network and store, in registers, the identifiers of all the - computing nodes that are connected to the platform and that are not - executing any application. + % \begin{figure}[h!] +% \vspace{0.1cm} +% \centering +% \includegraphics[width=7.4cm]{images/JACEP2P-V2} +% \caption{The JaceP2P-V2 architecture} +% \label{fig:jaceP2P-v2} +% \end{figure} + +\myitemize{3} +%\begin{itemize} +\item The first entity is the ``super-node''. %(represented by a big + %circle in Figure \ref{fig:jaceP2P-v2}). + Super-nodes form a circular network and store, in registers, the + identifiers of all the computing nodes that are connected to the + platform and that are not executing any application. % Each super-node has a status table containing the % number of connected computing nodes to each super-node and all the % super-nodes share a ``token'' that is passed successively from a @@ -329,24 +323,28 @@ platform is composed of three main entities: % it broadcasts the information to all the super-nodes in the % platform. Finally, it passes the token to the next super node. This % distribution reduces the load of the super-nodes. - A super-node regularly receives heartbeat messages (represented by - doted lines in figure \ref{fig:jaceP2P-v2}) from the computing nodes - connected to it. If a super-node does not receive a heartbeat - message from a computing node for a given period of time, it - declares that this computing node is dead and deletes its identifier - from the register. - -\item The second entity is the ``spawner'' (represented by a square in - figure \ref{fig:jaceP2P-v2}). When a user wants to execute a - parallel application that requires $N$ computing nodes, he or she - launches a spawner. The spawner contacts a super-node to reserve the - $N$ computing nodes plus some extra nodes. When the spawner receives - the list of nodes from the super-node, it transforms the extra - nodes into spawners (for fault tolerance and scalability reasons) - and stores the identifiers of the rest of the nodes in its own - register. Once the extra nodes are transformed into spawners, they - form a circular network and they receive the register containing the - identifiers of the computing nodes. Then each spawner becomes + A super-node regularly receives heartbeat messages +%(represented by +% doted lines in Figure \ref{fig:jaceP2P-v2}) + from the computing nodes connected to it. If a super-node does not + receive a heartbeat message from a computing node for a given period + of time, it declares that this computing node is dead and deletes + its identifier from the register. + +\item The second entity is the ``spawner''. +%(represented by a square in + % Figure \ref{fig:jaceP2P-v2}). + When a user wants to execute a parallel application that requires + $N$ computing nodes, he or she launches a spawner. This one contacts + a super-node to reserve the $N$ computing nodes plus some extra + nodes. When it receives the list of nodes from the super-node, it + transforms the extra nodes into spawners (for fault tolerance and + scalability reasons) and stores the identifiers of the rest of the + nodes in its own register. +% Once the extra nodes are transformed into spawners, they +% form a circular network and they receive the register containing the +% identifiers of the computing nodes. +Then each spawner becomes responsible for a subgroup of computing nodes, starts the tasks on the computing nodes under its command and sends a specific register to them. @@ -357,19 +355,20 @@ platform is composed of three main entities: % number of messages sent by the spawners to update the register of % the daemons after a daemon crashes because usually a small number of % daemons is affected by this crash. - If the spawner receives a message from a computing node informing - that one of its neighbors is fallen, it fetches a new one from the - super-node in order to replace the dead one. The spawner initializes - the new daemon, which retrieves the last backup (see next paragraph) - of the dead node and continues the computing task from that - checkpoint. - -\item The third entity is the ``daemon'', or the computing node, - (represented in figure \ref{fig:jaceP2P-v2} by a hashed small circle - if it is idle and by a white small circle if it is executing an - application). Once launched, it connects to a super-node and waits - for a task to execute. Once they begin executing an application they - form a circular network which is only used in the failure detection + % If the spawner receives a message from a computing node informing +% that one of its neighbors has failed, it fetches a new one from the +% super-node in order to replace the dead one. The spawner initializes +% the new daemon, which retrieves the last backup (see next paragraph) +% of the dead node and continues the computing task from that +% checkpoint. + +\item The third entity is the ``daemon'', or the computing node. +% (represented in Figure \ref{fig:jaceP2P-v2} by a hashed small circle +% if it is idle and by a white small circle if it is executing an +% application). + Once launched, it connects to a super-node and waits for a task to + execute. Once they begin executing an application daemons form a + circular network which is only used in the failure detection mechanism. Each daemon can communicate directly with the daemons whose identifiers are in its register. At the end of a task, the daemons reconnect to a super-node. @@ -378,47 +377,52 @@ platform is composed of three main entities: To be able to execute asynchronous iterative applications, JaceP2P-V2 has an asynchronous messaging mechanism and to resist daemons' failures, it implements a distributed backup mechanism called the -uncoordinated distributed checkpointing. This method allows daemons to -save their data on neighboring daemons without any user -intervention. The asynchronous nature of the application allows two -daemons to execute two different iterations, thus each daemon saves -its status without synchronizing with other daemons. This -decentralized procedure allows the platform to be very scalable, with -no weak points and does not require a secure and stable station for -backups. Moreover, since the AIAC model is tolerant to messages loss, -if a daemon dies, the other computing nodes continue their tasks and -are not affected by this failure. +uncoordinated distributed checkpointing. For more details on the +JaceP2P-V2 platform, interested readers can refer to \cite{jaceP2P-v2}. + +% This method allows daemons to +% save their data on neighboring daemons without any user +% intervention. The asynchronous nature of the application allows two +% daemons to execute two different iterations, thus each daemon saves +% its status without synchronizing with other daemons. This +% decentralized procedure allows the platform to be very scalable, with +% no weak points and does not require a secure and stable station for +% backups. Moreover, since the AIAC model is tolerant to messages loss, +% if a daemon dies, the other computing nodes continue their tasks and +% are not affected by this failure. % The application convergence detection is done by daemons, using the % decentralized global convergence detection algorithm presented in % \cite{conv_dec}. It consists of two phases: the detection phase and % the verification phase. This algorithm aims to detect efficiently % the global convergence of asynchronous iterative parallel algorithms % on distributed architectures. -For more details on the JaceP2P-V2 platform, readers can refer to -\cite{jaceP2P-v2}. -\subsection{Benefits of mapping} + +\vspace*{0.2cm} + +\subsubsection*{Benefits of mapping} \label{sec:benef} -In the JaceP2P-V2 environment, presented in the previous section, -there is no effective mapping solution. Indeed, when a user wants to -launch an application, the spawner emits a request to the super-node, -which is in charge of available daemons. Basically, the super-node -returns the amount of requested computing nodes by choosing in its own -list. +In the previously described JaceP2P-V2 environment there is no +effective mapping solution. Indeed, when a user wants to launch an +application, the spawner emits a request to the super-node, which is +in charge of available daemons. Basically, the super-node returns the +amount of requested computing nodes by choosing in its own list. %, %if there are sufficient daemons connected on it, or in other %super-nodes lists, in addition of its one. In this method, the super-node only cares about the amount of -requested nodes, it returns in general nodes in the order of their +requested nodes and it returns in general nodes in the order of their connection to the platform -- there is no specific selection. -Distributed architectures such as distributed clusters, as can be seen -on Figure \ref{fig:pbdistclust}, are often composed of heterogeneous +Distributed architectures such as distributed clusters, +%as can be seen +%on Figure \ref{fig:pbdistclust}, +are often composed of heterogeneous clusters linked via heterogeneous networks with high latencies and bandwidths. As an example the Grid'5000\cite{g5k} testbed is composed of 23 clusters spread over 9 sites. Those clusters are heterogeneous, with computing powers starting from bi-cores at 2GHz to -bi-quadri-cores at 2.83GHz with 2Go of memory for the first one to 8Go +bi-quad-cores at 2.83GHz with 2Gb of memory for the first one to 8Gb for the second. Links relying clusters are 10Gb/s capable, but as many researchers use this platform, high latencies appear in links between sites. @@ -435,14 +439,14 @@ sites. %architecture. Though there are efficient links relying each site, a %residual latency continues to exist, at local clusters (in the same %site) as well as distant clusters (from two distinct sites), and can -%penalize performances. +%penalize performance. -\begin{figure}[ht!] - \centering - \includegraphics[width=7.8cm]{images/dist_clust} - \caption{A distributed clusters architecture} - \label{fig:pbdistclust} -\end{figure} +% \begin{figure}[ht!] +% \centering +% \includegraphics[width=7.8cm]{images/dist_clust} +% \caption{A distributed clusters architecture} +% \label{fig:pbdistclust} +% \end{figure} With such an architecture, it could be @@ -450,16 +454,16 @@ efficient to assign tasks communicating with each other on the same cluster, in order to improve communications. But, as we use very large problems, it is quite impossible to find clusters containing as many computing nodes as requested. So we have to dispatch tasks over -several clusters. That implies to deal with heterogeneity in clusters +several clusters. That implies a need to deal with heterogeneity in clusters computing power and heterogeneity in network. We should make a trade-off between both components in order to take the best part -of each one to improve the overall performances. +of each one to improve the overall performance. %The %literature in high performance computing has broadly demonstrated the %benefits of mapping solutions on the applications execution time. In order to check if a tasks mapping algorithm would provide -performances improvement in JaceP2P-V2 environment, we have evaluated +performance improvement in the JaceP2P-V2 environment, we have evaluated the contributions of a simple mapping algorithm, which is described in section \ref{sec:sma}. These experiments used the NPB Kernel CG application described in section \ref{sec:cg}, with two problem sizes @@ -486,6 +490,7 @@ Table \ref{tab:benef}. \caption{Effects of a simple tasks mapping algorithm on application's execution time} \label{tab:benef} + \vspace{-0.5cm} \end{table} As can be seen in Table \ref{tab:benef}, the effects of a @@ -500,18 +505,18 @@ model and issues which can be exploited. \section{Problem description} \label{sec:pb} -In this section we describe the AIAC mapping problem. We first -formalize the different elements we should take into consideration: -the application, the targeted architecture and the objectives -functions of the mapping. We also give a state of the art about -considered kinds of mapping algorithms. +% In this section we describe the AIAC mapping problem. We first +% formalize the different elements we should take into consideration: +% the application, the targeted architecture and the objectives +% functions of the mapping. We also give a state of the art about +% considered kinds of mapping algorithms. \subsection{Model formalization} \label{sec:pbmodel} -In this section the models of the applications and architectures we -used are given, with the objectives functions of the mapping -algorithms. +% In this section the models of the applications and architectures we +% used are given, with the objectives functions of the mapping +% algorithms. \subsubsection{Application modeling} \label{sec:pbmodelapp} @@ -522,10 +527,10 @@ assignation of tasks to computing nodes. Usually, to assign tasks of parallel applications to computing nodes, scheduling algorithms are used. These algorithms often represent the application by a graph, called DAG \cite{dag1,dag2,dag3,dag4} (Directed Acyclic Graph). In -this graph, each task is represented by a vertex which is relied to +this graph, each task is represented by a vertex which is relayed to others by edges, which represent dependencies and communications between tasks. This means that some tasks could not start before other -ones finish their computation and send their results. As exposed in +ones finish their computation and send their results. As discussed in the introduction, in the AIAC model, there is no precedence between tasks. @@ -549,32 +554,32 @@ are not appropriate to modeling AIAC applications. TIG\cite{tig1, \begin{figure}[h!] \centering - \includegraphics[width=5cm]{images/tig} + \includegraphics[width=4cm]{images/tig} \caption{An example of a TIG of a nine tasks application} \label{fig:tig} \end{figure} -In the TIG model, a parallel program is represented by a graph -%, as can -%be seen in Figure \ref{fig:tig}. This graph -$GT(V,E)$, where $V = \{V_1,V_2,\dots V_v\}$ is the set of $|V|$ -vertices and $E \subset V \times V$ is the set of undirectional edges -(see Figure \ref{fig:tig}). The vertices represent tasks and the edges -represent the mutual communication among tasks. A function $ET : V -\rightarrow R^+$ gives the computation cost of tasks and $CT : E -\rightarrow R^+$ gives the communication cost for message passing on -edges. We define $v = |V|$, $ET(V_i) = e_i$ and $CT(V_i,V_j) = -c_{ij}$. For example, in Figure \ref{fig:tig}, \mbox{$e_0$ = 10} and $c_{01} -= 2$, $c_{03} = 2$ and $c_{04} = 2$. Tasks in TIG exchange information -during their execution and there is no precedence relationship among -tasks; each task cooperates with its neighbors. This model is used to -represent applications, where tasks are considered to be executed -simultaneously. Temporal dependencies in the execution of tasks are -not explicitly addressed: all the tasks are considered simultaneously -executable and communications can take place at any time during the -computation. That is why vertices and edges are labeled with weights -describing computational and communication costs.\\ - +In the TIG model, a parallel program is represented by a graph , as +can be seen in Figure \ref{fig:tig}. This graph $GT(V,E)$, where $V = +\{V_1,V_2,\dots V_v\}$ is the set of $|V|$ vertices and $E \subset V +\times V$ is the set of undirectional edges. The vertices represent +tasks and the edges represent the mutual communication among tasks. A +function $ET : V \rightarrow R^+$ gives the computation cost of tasks +and $CT : E \rightarrow R^+$ gives the communication cost for message +passing on edges. We define $v = |V|$, $ET(V_i) = e_i$ and +$CT(V_i,V_j) = c_{ij}$. For example, in Figure \ref{fig:tig}, +\mbox{$e_0$ = 10} and $c_{01} = 2$, $c_{03} = 2$ and $c_{04} = 2$. +Tasks in TIG exchange information during their execution and there is +no precedence relationship among tasks; each task cooperates with its +neighbors. This model is used to represent applications, where tasks +are considered to be executed simultaneously. Temporal dependencies in +the execution of tasks are not explicitly addressed: all the tasks are +considered simultaneously executable and communications can take place +at any time during the computation. That is why vertices and edges are +labeled with weights +describing computational and communication costs. + +\vspace*{0.2cm} \subsubsection{Architecture modeling} \label{sec:pbmodelarchi} @@ -583,14 +588,16 @@ As TIG models the application, we have to model the targeted architecture. A distributed clusters architecture can be modeled by a three-level-graph. The levels are \textit{architecture} (a), in our study it is the Grid'5000 grid, \textit{cluster} (c) and computing -node (n) levels. Figure \ref{fig:pbdistclust} in section -\ref{sec:benef} shows such a model. Let $GG(N,L)$ be a graph +node (n) levels. +%Figure \ref{fig:pbdistclust} in section +%\ref{sec:benef} shows such a model. +Let $GG(N,L)$ be a graph representing a distributed clusters architecture, where $N = \{N_1,N_2,\dots N_n\}$ is the set of $|N|$ vertices and $L$ is the set of undirectional edges. The vertices represent the computing nodes and the edges represent the links between them. An edge $L_i \in L$ is an unordered pair $(N_x,N_y) \in N$, representing a communication link -between nodes $x$ and $y$. Let be $|C|$ the number of clusters in the +between nodes $N_x$ and $N_y$. Let be $|C|$ the number of clusters in the architecture containing computing nodes. A function $WN : N \rightarrow R^+$ gives the computational power of nodes and $WL : L \rightarrow R^+$ gives the communication latency of links. We define @@ -604,8 +611,9 @@ addition, like in the Grid'5000 testbed, if computing nodes seemly have the same computational power with a low communication latency, a cluster of these nodes can be defined. All participating clusters, including computing nodes, are in the same architecture level and -communicate through the architecture network.\\ +communicate through the architecture network. +\vspace*{0.2cm} \subsubsection{Mapping functions} \label{sec:pbmodelmapping} @@ -625,25 +633,25 @@ convergence and have reached the desired approximation of the solution, that is why the execution time of the application depends on the slowest task. %, this converges last. -We define -\begin{equation} - \label{eq:et} - ET(App) = \max_{i=1 \dots v} ( ET(V_i) ) -\end{equation} - +We define $ ET(App) = \max_{i=1 \dots v} ( ET(V_i) )$ +% \begin{equation} +% \label{eq:et} +% ET(App) = \max_{i=1 \dots v} ( ET(V_i) ) +% \End{equation} where the %$ET(V_s)$ is the execution time of the slowest task $V_s$. The execution time of each task $i$ \mbox{($i=1 \dots v$)}, $ET(V_i)$ is -given by -\begin{equation} - \label{eq:ettask} - ET(V_i) = \frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \cdot wl_{ij} -\end{equation} +given by $ET(V_i) = \frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \cdot wl_{ij}$ + +% \begin{equation} +% \label{eq:ettask} +% ET(V_i) = \frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \cdot wl_{ij} +% \End{equation} where $e_i$ is the computational cost of $V_i$, $wn_i$ is the computational power of the node $N_i$ on which $V_i$ is mapped, $J$ represents the neighbors set of $V_i$, $c_{ij}$ is the amount of -communications between $V_i$ and $V_j$, and $wn_{ij}$ is the link +communications between $V_i$ and $V_j$, and $wl_{ij}$ is the link latency between the computing nodes on which are mapped $V_i$ and $V_j$. %Note that in the AIAC model, we only consider for $e_i$ the @@ -663,7 +671,7 @@ task assignment problem \cite{npcomp}, and is thus NP-complete. %iterative application on heterogeneous distributed architectures, so %it has to launch applications on a variety of computing nodes relying %by non-uniform links. As demonstrates in the previous section, it does -%not map efficiently tasks on nodes and the overall performances can be +%not map efficiently tasks on nodes and the overall performance can be %strongly penalize by this lack of mapping. Indeed, if the %heterogeneity degree of the architecture used is high, it can be %possible to have two neighbors tasks executing on two foreign @@ -676,7 +684,7 @@ task assignment problem \cite{npcomp}, and is thus NP-complete. % %%As we use parallel distributed algorithms on distributed %%architectures, we have to efficiently map tasks on computing nodes, in -%%order to obtain best performances and the slowest execution time of +%%order to obtain best performance and the slowest execution time of %%the application. %There exist many scheduling and mapping algorithms in the literature, %and the research is active in this domain. A first approach is to use @@ -713,7 +721,7 @@ task assignment problem \cite{npcomp}, and is thus NP-complete. % % % -%On figure \ref{fig:tig} we can see an application with 6 tasks, in +%On Figure \ref{fig:tig} we can see an application with 6 tasks, in %which each task is in relation with tasks of previous and next rank, %with two exceptions, the first and the last tasks, which can %eventually be in relation, depending of the application. This is a @@ -727,20 +735,17 @@ task assignment problem \cite{npcomp}, and is thus NP-complete. %In the previous section we have determined that we need to use TIG %model based mapping algorithms to address our problem. In the literature of the TIG mapping, we can find many algorithms, -which can be divided into two categories: - -\myitemize{5} -\item \textbf{Edge-cuts optimization}. The aim of this class of - algorithms is to minimize the use of the penalizing links between - clusters. As tasks are depending on neighbors, which are called here - dependencies, the goal is to choose nodes which distance, in term of - network, is small, to improve communications between tasks. Here we - can cite Metis\cite{metis}, Chaco\cite{chaco} and - PaGrid\cite{pagrid} which are libraries containing such kind of - algorithms. The main drawback of edge-cuts algorithms is that they - do not tackle the computing nodes heterogeneity issues. They only - focus on communication overheads. -% The figure \ref{fig:edge} shows +which can be divided into two categories. First, in the +\textit{Edge-cuts optimization} class of algorithms, the aim is to +minimize the use of the penalizing links between clusters. As tasks +are depending on neighbors, which are called dependencies, the goal is +to choose nodes where distance, in term of network, is small to +improve communications between tasks. Here we can cite +Metis\cite{metis}, Chaco\cite{chaco} and PaGrid\cite{pagrid} which are +libraries containing such kind of algorithms. The main drawback of +edge-cuts algorithms is that they do not tackle the computing nodes +heterogeneity issues. They only focus on communication overheads. +% The Figure \ref{fig:edge} shows % that the optimization is on edges of communications, which are % circled in red. % @@ -750,16 +755,16 @@ which can be divided into two categories: % \caption{The edge-cuts optimization} % \label{fig:edge} % \end{figure} -\item \textbf{Execution time optimization}. The aim of these - algorithms is to minimize the whole execution time of the - application. They look for nodes which can provide the small - execution time of tasks using their computational power. Here we can - cite FastMap\cite{fastmap} and MiniMax\cite{minimax} as such kind of - algorithms. QM\cite{qm_these} is also an algorithm of this category, - but it aims to find for each task the node which can provide the - best execution time. QM works at the task level, - whereas others work at the application level.\\ -% The figure +Then, in the \textit{Execution time optimization} class of algorithms +the aim is to minimize the whole execution time of the +application. They look for nodes which can provide the small execution +time of tasks using their computational power. Here we can cite +FastMap\cite{fastmap} and MiniMax\cite{minimax} as such kind of +algorithms. QM\cite{qm_these} is also an algorithm of this category, +but it aims to find for each task the node which can provide the best +execution time. QM works at the task level, whereas others work at the +application level +% The Figure % \ref{fig:et} shows that the optimization is on tasks, which are % circled in red.\\ % @@ -769,15 +774,15 @@ which can be divided into two categories: % \caption{The execution time optimization} % \label{fig:et} % \end{figure} -\end{itemize} The two classes of algorithms may fit with our goals, because in our model we have both the computational power of nodes and communication -costs may influence the applications performances. - -Nevertheless, to the best of our knowledge, none of the existing -algorithms take into consideration the specificities of the AIAC model -(see next section). +costs which may influence the applications performance. We can also +cite partitioning tools like Scotch \cite{scotch} which aims at +privileging the load balancing of their partitioning +schemes. Nevertheless, to the best of our knowledge, none of the +existing algorithms take into consideration the specificities of the +AIAC model (see next section). %specifically address the AIAC mapping problem. %As the existing mapping algorithms are not designed to fit %with the AIAC mapping problem. @@ -785,10 +790,10 @@ algorithms take into consideration the specificities of the AIAC model \section{AIAC mapping} \label{sec:aiacmapping} -In this section we present the specificities of the AIAC model, which -are interesting in the mapping problem, and the solution we propose: -the AIAC QM algorithm, which is an extended version of the QM -algorithm. +% In this section we present the specificities of the AIAC model, which +% are interesting in the mapping problem, and the solution we propose: +% the AIAC QM algorithm, which is an extended version of the QM +% algorithm. \subsection{Specificities of the AIAC mapping problem} \label{sec:specAIACmapping} @@ -797,7 +802,7 @@ An important point to take into consideration in the AIAC model is that we do not allow the execution of multiple tasks on the same computing node. This comes from the fact that the targeted architectures are volatile distributed environments. Assigning -multiple tasks to a node provides a fall of performances when this +multiple tasks to a node provides a fall of performance when this node fails. Indeed we should redeploy all of the tasks from this node to another one, using last saves, which implies to search a new available computing node, transfer saves to it and restart the @@ -818,7 +823,10 @@ this model, the faster and more frequently a task receives its dependencies, the faster it converges. Moreover, as the JaceP2P-V2 environment is fault tolerant and tasks save checkpoints on their neighbors, it is more efficient to save on near nodes than on far -ones. +ones. In the synchronous model, both heterogeneity and locality must +be taken into account in a balanced way. In the asynchronous model, +since no synchronization occurs, the heterogeneity issue is less +important. %As our problem is on task mapping, only considering the dependency @@ -838,14 +846,14 @@ ones. We present here the solution we propose, called \textit{AIAC QM algorithm}, to address the AIAC mapping problem. We decided to improve the \textit{Quick-quality Map} (QM) algorithm since it is one -of the most accurate to address the TIG mapping problem. +of the most accurate method to address the TIG mapping problem. % %\subsection{Modified Quick-quality Map} %\label{sec:modifiedqm} % %As the previous algorithm describe in section \ref{sec:sma} showed %that mapping provides a significant increase of applications -%performances (which can be seen in the section \ref{sec:experiments}), +%performance (which can be seen in the section \ref{sec:experiments}), %we decide to try another mapping algorithm, which is a modified %version of the \textit{Quick-quality Map} (QM) algorithm. % @@ -864,7 +872,7 @@ of the most accurate to address the TIG mapping problem. %of it to fit with our constraints. This was an opportunity to be taken %to insert a little part of ``edge-cuts'' optimization, as in our model %communications have to be taken into account. -In its original version, this algorithm aims at privileging the +In its original version, this algorithm aims at prioritizing the computational power of nodes. Indeed, its aim is to find the more powerful node to map a task on. Moreover, a part of this algorithm is designed to map multiple tasks on the same node, in order to improve @@ -875,94 +883,93 @@ our model, as we authorize only the execution of one task on a single node -- this allows to lose only the work of a single task in case of node's fault, with a low cost on restarting mechanism. Instead assigning multiple tasks on the same computing node, our mapping -algorithm tries to keep tasks locality, to improve communications, by +algorithm tries to keep tasks locally, to improve communications, by trying to assign tasks to computing nodes in the neighborhood of which their neighbors are mapped on. -The pseudo-code of AIAC QM is given in Algorithm \ref{alg:qmmodified}. +% The pseudo-code of AIAC QM is given in Algorithm \ref{alg:qmmodified}. -\SetAlgoSkip{} -\begin{algorithm} - \SetLine - \dontprintsemicolon +% \SetAlgoSkip{} +% \begin{algorithm} +% \SetLine +% \dontprintsemicolon - \KwIn{Sets of tasks and computing nodes} - \KwOut{Mapping of tasks to nodes} +% \KwIn{Sets of tasks and computing nodes} +% \KwOut{Mapping of tasks to nodes} - \BlankLine +% \BlankLine - sort nodes by descending power\; - map tasks in order on nodes\; - set all tasks \textit{moveable}\; - $r \leftarrow 1$\; +% sort nodes by descending power\; +% map tasks in order on nodes\; +% set all tasks \textit{moveable}\; +% $r \leftarrow 1$\; - \BlankLine +% \BlankLine - \While{one task is moveable}{ - \For{each task $t_{i}$ $\&\&$ $t_{i}$ is moveable }{ - $n_{c} \leftarrow$ current node of $t_{i}$\; - $n_{n} \leftarrow t_{i}$\; +% \While{one task is moveable}{ +% \For{each task $t_{i}$ $\&\&$ $t_{i}$ is moveable }{ +% $n_{c} \leftarrow$ current node of $t_{i}$\; +% $n_{n} \leftarrow t_{i}$\; - \BlankLine +% \BlankLine - \For{$k = 0 ; k < \frac{f \cdot n}{r} ; k++$}{ - select random node $n_{r}$ in $[0,\frac{n}{r}]$\; - \If{ET($t_{i}$,$n_{r}$) $<$ ET($t_{i}$,$n_{n}$)}{ - $n_{n} \leftarrow n_{r} $\; - } - } +% \For{$k = 0 ; k < \frac{f \cdot n}{r} ; k++$}{ +% select random node $n_{r}$ in $[0,\frac{n}{r}]$\; +% \If{ET($t_{i}$,$n_{r}$) $<$ ET($t_{i}$,$n_{n}$)}{ +% $n_{n} \leftarrow n_{r} $\; +% } +% } - \BlankLine +% \BlankLine - \For{each node $n_{v}$ near $dep(t_{i})$}{ - \If{ET($t_{i}$,$n_{v}$) $<$ ET($t_{i}$,$n_{n}$)}{ - $n_{n} \leftarrow n_{v} $\; - } - } +% \For{each node $n_{v}$ near $dep(t_{i})$}{ +% \If{ET($t_{i}$,$n_{v}$) $<$ ET($t_{i}$,$n_{n}$)}{ +% $n_{n} \leftarrow n_{v} $\; +% } +% } - \BlankLine +% \BlankLine - \If{$n_{n} \neq n_{c}$}{ - map $t_{i}$ on $n_{n}$\; - update ET of $t_{i}$ and dep($t_{i}$)\; - } - } +% \If{$n_{n} \neq n_{c}$}{ +% map $t_{i}$ on $n_{n}$\; +% update ET of $t_{i}$ and dep($t_{i}$)\; +% } +% } - \BlankLine +% \BlankLine - set $t_i$ not moveable\; - $r \leftarrow r+1$ if all tasks have been considered\; - } +% set $t_i$ not moveable\; +% $r \leftarrow r+1$ if all tasks have been considered\; +% } - \caption{The AIAC QM} - \label{alg:qmmodified} -\end{algorithm} -%\vspace{-0.5cm} - -All nodes are first sorted in descending order according to their -computation power, and all tasks are mapped on these nodes according -to their identifier (they are also marked as ``moveable'', that means -that each task can be moved from a node to another). As in the -original QM algorithm, AIAC QM keeps track of the \textit{number of - rounds} $r$ ($r > 0$), that all tasks have been searched for a -better node. This allows to reduce at each round the number of -considered nodes. While there is at least one moveable task, it -performs for each moveable task the search for a better node. It +% \caption{The AIAC QM} +% \label{alg:qmmodified} +% \end{algorithm} +% %\vspace{-0.5cm} + +So, in this algorithm all nodes are first sorted in descending order +according to their computation power, and all tasks are mapped on +these nodes according to their identifier (they are also marked as +``moveable''; it means that each task can be moved from a node to +another). As in the original QM algorithm, AIAC QM keeps track of the +\textit{number of rounds} $r$ ($r > 0$), that all tasks have been +searched for a better node. This allows to reduce at each round the +number of considered nodes. While there is at least one moveable task, +it performs for each moveable task the search for a better node. It chooses a set of nodes, $\frac{f \cdot n}{r}$, where $f$ is defined as -the search factor and $n$ is the number of nodes. $r$ and $f -\in ]0,1]$ control the portion of nodes that will be considered where -more numerous the rounds are, the less the considered nodes will -be. Then the algorithm estimates the execution time $ET(v)$ of the -task on each node. If it is smaller than the current node on which the -task is mapped on, this node becomes the new potential node for -task $t_i$. +the search factor and $n$ is the number of nodes. $r$ and $f \in +]0,1]$ control the portion of nodes that will be considered where more +numerous the rounds are, the less the considered nodes will be. Then +the algorithm estimates the execution time $ET(v)$ of the task on each +node. If it is smaller than the current node on which the task is +mapped on, this node becomes the new potential node for task $t_i$. After having randomly searched for a new node, the AIAC QM tries to map the task on nodes that are neighbors of nodes of which the dependencies of $t_i$ are mapped on. This is one of the major modification to the original QM algorithm. It introduces a little part of ``edge-cuts'' optimization. In the original version, it tries to -map the task $t_i$ on the same node of one as its dependencies. As +map the task $t_i$ on the same node of one of its dependencies. As explain in \ref{sec:specAIACmapping}, this is not an acceptable solution in our case. Instead, the algorithm now searches to map task $t_i$ on nodes which are near the ones its dependencies are mapped @@ -978,22 +985,22 @@ the task is set to ``moveable''. And finally, if all tasks have been considered in this round, $r$ is incremented. The complexity of the AIAC QM algorithm is about $O(n^2 \cdot -t \cdot ln(r))$. This complexity is the same as the original -(details are given in \cite{qm_these}, with a an increase of a factor -$n$, corresponding to the edge-cuts part. +t \cdot ln(r))$. This complexity is the same as the original algorithm +(details are given in \cite{qm_these}, with an increase of a factor +$n$, corresponding to the edge-cuts part). \section{Experimentation} \label{sec:expe} -We now describe the experiments we have conducted and their -components, to evaluate the effects of the AIAC QM algorithm on -application execution time. +% We now describe the experiments we have conducted and their +% components, to evaluate the effects of the AIAC QM algorithm on +% application execution time. -\subsection{The NAS Parallel Benchmark Kernel CG} +\subsection{The NAS Parallel Benchmark Kernel CG and the Grid'5000 platform } \label{sec:cg} We used the ``Kernel CG'' of the NAS Parallel Benchmarks (NPB) -\cite{nas} to evaluate the performances of the mapping +\cite{nas} to evaluate the performance of the mapping algorithm. This benchmark is designed to be used on large architectures, because it tests communications over latency networks, by processing unstructured matrix vector multiplication. In this @@ -1008,46 +1015,48 @@ asynchronous iterative model. With the multisplitting algorithm, the $A$ matrix is split into -horizontal rectangle parts, as Figure \ref{fig:multisplit} shows. Each -of these parts is affected to a processor -- so the size of data +horizontal rectangle parts. +%as Figure \ref{fig:multisplit} shows. +Each +of these parts is assigned to a processor -- so the size of data depends on the matrix size but also on the number of participating -nodes. In this way, a processor is in charge of computing its $X Sub$ -part by solving the following subsystem: $ASub \times XSub = BSub - -DepLeft \times XLeft - DepRight \times XRight$ +nodes. In this way, a processor is in charge of computing its {\small $X Sub$} +part by solving the following subsystem: {\small $ASub \times XSub = BSub - +DepLeft \times XLeft - DepRight \times XRight$}. -After solving $XSub$, the result must be sent to other +After solving {\small $XSub$}, the result must be sent to other processors which depend on it. -\begin{figure}[h!] - \centering - \includegraphics[width=7.4cm]{images/multisplit} - \caption{Data decomposition for the multisplitting method - implementation} - \label{fig:multisplit} -\end{figure} - -The multisplitting method can be decomposed into four phases: -\begin{enumerate} -\itemsep=5pt -\item \textbf{Data decomposition}. In this phase, data are allocated - to each processor assuming the decomposition exposed on figure - \ref{fig:multisplit}. Then, each processor iterates until converge - on the following. -\item \textbf{Computation}. At the beginning of the computation, each - processor computes - $BLoc = BSub - DepLeft \times XLeft - DepRight \times XRight$. Then, - it solves $ASub \times XSub = BLoc$ by using a - multi-threaded sequential version of the Conjugate Gradient method. -\item \textbf{Data exchange}. Each processor sends its $XSub$ - part to its neighbors. Here, the neighborhood is closely related - to the density of the $A$ matrix. Clearly, a dense matrix implies an - \textit{all-to-all} communication scheme while a matrix with a low - bandwidth reduces the density of the communication scheme. -\item \textbf{Convergence detection} Each processor computes its local - convergence and sends it to a server node. When this one detects - that each processor has converged, it stops the whole computation - process. -\end{enumerate} +% \begin{figure}[h!] +% \centering +% \includegraphics[width=7.4cm]{images/multisplit} +% \caption{Data decomposition for the multisplitting method +% implementation} +% \label{fig:multisplit} +% \end{figure} + +% The multisplitting method can be decomposed into four phases: +% \begin{enumerate} +% \itemsep=5pt +% \item \textbf{Data decomposition}. In this phase, data are allocated +% to each processor assuming the decomposition shown in Figure +% \ref{fig:multisplit}. Then, each processor iterates until +% convergence is reached. +% \item \textbf{Computation}. At the beginning of the computation, each +% processor computes +% $BLoc = BSub - DepLeft \times XLeft - DepRight \times XRight$. Then, +% it solves $ASub \times XSub = BLoc$ by using a +% multi-threaded sequential version of the Conjugate Gradient method. +% \item \textbf{Data exchange}. Each processor sends its $XSub$ +% part to its neighbors. Here, the neighborhood is closely related +% to the density of the $A$ matrix. Clearly, a dense matrix implies an +% \textit{all-to-all} communication scheme while a matrix with a low +% bandwidth reduces the density of the communication scheme. +% \item \textbf{Convergence detection} Each processor computes its local +% convergence and sends it to a server node. When this one detects +% that each processor has converged, it stops the whole computation +% process. +% \end{enumerate} %It can be pointed out here that it is possible to modify the data %decomposition in order to obtain non disjoint rectangle matrices. This @@ -1057,78 +1066,82 @@ The multisplitting method can be decomposed into four phases: For more details about this method, interested readers are invited to see \cite{book_raph}. In our benchmark, the sequential solver part of the multisplitting method is the Conjugate Gradient, using the -MTJ\cite{mtj} library. Its implementation is multi-threaded, so it -benefits from multi-core processors. +MTJ\cite{mtj} library. Its implementation is multi-threaded, to +benefit from multi-core processors. We point out here that this benchmark is a typical AIAC -application. The general form of the TIG for this application is given -by Figure \ref{fig:tigcg}. +application. In our study, we consider that the computational costs of +tasks are approximately the same and that the communications costs are +also the same (this comes from the difficulty to evaluate real costs +in the AIAC model). For our experiments the bandwidth of matrices has +been reduced in order to limit the dependencies and we fixed it to +$35,000$. This bandwidth size generates, according to the problem's +size, between 10 and 25 neighbors per tasks. + +% The general form of the TIG for this application is given +% by Figure \ref{fig:tigcg}. + +% \begin{figure}[h!] +% \centering +% \includegraphics[width=8cm]{images/tigcg2} +% \caption{Part of the form of the TIG representing an instance of the +% NAS Kernel CG application} +% \label{fig:tigcg} +% \end{figure} + +%This figure shows 6 tasks, which are represented by a circle in which +%the identifier of the task is given. -\begin{figure}[h!] - \centering - \includegraphics[width=8cm]{images/tigcg2} - \caption{Part of the form of the TIG representing an instance of the - NAS Kernel CG application} - \label{fig:tigcg} -\end{figure} - -This figure shows 6 tasks, which are represented by a circle in which -the identifier of the task is given. In our study, we consider that -the computational costs of tasks are approximately the same and that the -communications costs also the same (this comes from the difficulty to -evaluate real costs in the AIAC model). % The computational cost of a task is given by the number on the top % left-hand side of each circle (for example the cost of task 31 is % 1000). Communications between tasks are represented by edges on % which the amount of communication is given (for example, the % communication cost between tasks 29 and 30 is about 30). -Doted lines represent communications with tasks which are not -represented on the figure. We can see here that each task has four -neighbors (the two previous and the two next). This amount of -neighbors is directly related to the bandwidth of the matrix (in this -example the bandwidth is very small). For more details about the -influence of the bandwidth on the amount of neighbors, interested -readers are invited to see \cite{largescale}. - -For our experiments the bandwidth of matrices has been reduced in -order to limit the dependencies and we fixed it to $35,000$. This -bandwidth size generates, according to the problem's size, between 10 -and 25 neighbors per tasks. - -\subsection{The Grid'5000 platform} -\label{sec:g5k} +% Dotted lines represent communications with tasks which are not +% represented on the figure. We can see here that each task has four +% neighbors (the two previous and the two next). This amount of +% neighbors is directly related to the bandwidth of the matrix (in this +% example the bandwidth is very small). For more details about the +% influence of the bandwidth on the amount of neighbors, interested +% readers are invited to see \cite{largescale}. + + + +%\subsubsection*{The Grid'5000 platform} +%\Label{sec:g5k} The platform used for our tests, called Grid’5000\cite{g5k}, is a French nationwide experimental set of clusters which provides a configurable and controllable instrument. We can find many clusters with different kinds of computers with various specifications and -software. - -\begin{figure}[h!] - \centering - \includegraphics[height=6.5cm]{images/g5k-noms} - \caption{The Grid'5000 sites map} - \label{fig:g5ksite} -\end{figure} - -Clusters are spread over 9 sites, as can be seen on Figure -\ref{fig:g5ksite}, and the computing power represents more than 5000 +software. Clusters are spread over 9 sites, +%as can be seen on Figure \ref{fig:g5ksite}, +and the computing power represents more than 5000 computing cores interconnected by the ``Renater'' network. This network is the national network for research and education; it provides a large bandwidth with high latency. Intra-clusters networks present small bandwidth and low latencies. +% \begin{figure}[h!] +% \centering +% \includegraphics[height=6.5cm]{images/g5k-noms} +% \caption{The Grid'5000 sites map} +% \label{fig:g5ksite} +% \end{figure} + + + \subsection{Other mapping algorithms} \label{sec:othermaping} -In this section we present the two other mapping algorithms we used -in our experiments to compare the performances of the AIAC QM -algorithm. The first one was used to evaluate the benefits of a -mapping solution in section \ref{sec:benef}. The second one was used -to show the differences between the two mapping class, the ``execution -time'' and the ``edge-cuts'' optimizations, as it is a fully edge-cut -optimization based mapping algorithm. +% In this section we present the two other mapping algorithms we used +% in our experiments to compare the performance of the AIAC QM +% algorithm. The first one was used to evaluate the benefits of a +% mapping solution in section \ref{sec:benef}. The second one was used +% to show the differences between the two mapping class, the ``execution +% time'' and the ``edge-cuts'' optimizations, as it is a fully edge-cut +% optimization based mapping algorithm. \subsubsection{A Simple Mapping algorithm} \label{sec:sma} @@ -1138,31 +1151,31 @@ optimization based mapping algorithm. %for the mapping. % %The first thing we have done was to be sure that a mapping algorithm -%would enhance applications performances. +%would enhance applications performance. The \textit{Simple Mapping algorithm} (SMa) was designed to show the benefits of a mapping algorithm in the JaceP2P-V2 platform. %The text of the \textit{Simple Mapping} if given by -Algorithm \ref{alg:sma} gives the pseudo-code of the \textit{Simple - Mapping algorithm}. +%Algorithm \ref{alg:sma} gives the pseudo-code of the \textit{Simple +% Mapping algorithm}. %, in which we can see that it is very simple, with a complexity in $O(n^ %2)$ resulting from sort methods. -\SetAlgoSkip{} -\begin{algorithm} - \SetLine - \dontprintsemicolon - \KwIn{Sets of tasks and computing nodes} - \KwOut{Mapping of tasks to nodes} +% \SetAlgoSkip{} +% \begin{algorithm} +% \SetLine +% \dontprintsemicolon +% \KwIn{Sets of tasks and computing nodes} +% \KwOut{Mapping of tasks to nodes} - \BlankLine +% \BlankLine - sort computing nodes by cluster\; - sort clusters by size, from higher to lower\; - map tasks in order on sorted clusters\; - \caption{The Simple Mapping algorithm} - \label{alg:sma} -\end{algorithm} +% sort computing nodes by cluster\; +% sort clusters by size, from higher to lower\; +% map tasks in order on sorted clusters\; +% \caption{The Simple Mapping algorithm} +% \label{alg:sma} +% \end{algorithm} %The aim of this algorithm is to do similarly to a very simple @@ -1181,6 +1194,8 @@ task is assigned to a particular computing node of the chosen cluster. %on distributed clusters architectures, with a gain over $30\%$ on %execution, as experiments described in section \ref{sec:benef}. +\vspace*{0.2cm} + \subsubsection{Edge-cuts optimization} \label{sec:edgcutalgo} @@ -1208,62 +1223,62 @@ recursive bisection. %Therefore its running execution time is %independent of the desired number of subsections. -The adapted version of this algorithm, Farhat's Edge-Cut (F-EC), -evaluated in the JaceP2P-V2 environment is described in Algorithm -\ref{alg:edgecuts}. +% The adapted version of this algorithm, Farhat's Edge-Cut (F-EC), +% evaluated in the JaceP2P-V2 environment is described in Algorithm +% \ref{alg:edgecuts}. -\SetAlgoSkip{} -\begin{algorithm} - \SetLine - \dontprintsemicolon +% \SetAlgoSkip{} +% \begin{algorithm} +% \SetLine +% \dontprintsemicolon - \KwIn{Sets of tasks and computing nodes} - \KwOut{Mapping of tasks to nodes} +% \KwIn{Sets of tasks and computing nodes} +% \KwOut{Mapping of tasks to nodes} - \BlankLine +% \BlankLine - sort nodes by cluster\; - $lTasks \leftarrow$ sort tasks by dep degree\; - $changeCluster \leftarrow$ true\; - $cTasks \leftarrow$ empty; +% sort nodes by cluster\; +% $lTasks \leftarrow$ sort tasks by dep degree\; +% $changeCluster \leftarrow$ true\; +% $cTasks \leftarrow$ empty; - \BlankLine +% \BlankLine - \While{one task is not mapped}{ - \If{$changeCluster$}{ - $curCluster \leftarrow$ nextCluster()\; - $places \leftarrow$ size($curCluster$)\; - $changeCluster \leftarrow$ false\; - $mTasks \leftarrow$ empty\; - } +% \While{one task is not mapped}{ +% \If{$changeCluster$}{ +% $curCluster \leftarrow$ nextCluster()\; +% $places \leftarrow$ size($curCluster$)\; +% $changeCluster \leftarrow$ false\; +% $mTasks \leftarrow$ empty\; +% } - \BlankLine +% \BlankLine - \If{no task in cTasks}{ - $cTasks \leftarrow$ first task from $lTasks$\; - } +% \If{no task in cTasks}{ +% $cTasks \leftarrow$ first task from $lTasks$\; +% } - \BlankLine +% \BlankLine - $curTask \leftarrow$ first task in $cTasks$\; +% $curTask \leftarrow$ first task in $cTasks$\; - \BlankLine +% \BlankLine - \eIf{$( 1 + $dep(curTask)~$\cdot~\delta) <= places)$}{ - remove $curTask$ from $cTasks$\; - add $curTask$ in $mTasks$\; - $places \leftarrow places - 1$\; - add dep(curTask) in cTasks\; - }{ - $changeCluster$ $\leftarrow$ true\; - associate $mTasks$ with $curCluster$\; - } - } +% \eIf{$( 1 + $dep(curTask)~$\cdot~\delta) <= places)$}{ +% remove $curTask$ from $cTasks$\; +% add $curTask$ in $mTasks$\; +% $places \leftarrow places - 1$\; +% add dep(curTask) in cTasks\; +% }{ +% $changeCluster$ $\leftarrow$ true\; +% associate $mTasks$ with $curCluster$\; +% } +% } - \caption{The Fahrat's Edge-Cut algorithm} - \label{alg:edgecuts} -\end{algorithm} +% \caption{The Fahrat's Edge-Cut algorithm} +% \label{alg:edgecuts} +% \end{algorithm} This algorithm aims to do a ``clusterization'' of the tasks. First, it groups computing nodes in clusters, which are sorted according to @@ -1272,7 +1287,7 @@ following their dependency degree, starting from the higher to the lower. Tasks in the top of the list have a higher priority to be mapped. Next, the algorithm tries to map on each cluster the maximum number of tasks. To map a task on a cluster, the algorithm evaluates -if there is enough place to map the task and some of its +if there is enough space to map the task and some of its dependencies. This amount of dependencies is fixed by a factor $\delta$, which is a parameter of the algorithm. In the positive case, the task is mapped on the current cluster and its dependencies become @@ -1328,7 +1343,7 @@ degree of heterogeneity of distributed clusters; it is the ratio between the average and the standard deviation of the computing nodes power. This heterogeneity degree may vary from 0, nodes are homogeneous, to 10, nodes are totally heterogeneous. In these -experiments, we consider that there is no computing nodes fault during +experiments, we consider that there is no computing nodes failing during applications execution. The application used to realize these experiments is the KernelCG of @@ -1348,14 +1363,15 @@ size $5,000,000$ (named ``class F'') using 128 nodes. -\subsubsection{About heterogeneity} -\label{sec:xphetero} +%\subsubsection{About heterogeneity} +%\label{sec:xphetero} -The first experiments concern the study of the impact of the +%The first experiments concern the study of the impact of the +Our experiments concern the study of the impact of the heterogeneity of the computing nodes on the mapping -results. Heterogeneity takes an important place in the high -performance computing on grid, all the more so when using the -asynchronous iteration model. +results. Heterogeneity is an important factor in high performance +computing in the grid all the more so when using the asynchronous +iteration model. As mapping algorithms take in parameter a factor of research (for AIAC QM) and the amount of local dependencies (for F-EC), we fixed both to @@ -1370,12 +1386,14 @@ affect the networks heterogeneity, because of the difficulty to disturb and control network on Grid'5000; by default, networks are already quite heterogeneous. We needed more than 200 computing nodes to execute our application because of the small capacity of some -clusters to execute the largest problems (there is not enough memory). +clusters to execute the largest problems (there is not enough +memory). The nodes used have more than 2 GB of RAM and both execute a +Linux 64 bits distribution. The first architecture, Arc1.1, was composed of 113 computing nodes representing 440 computing cores, spread over 5 clusters in 4 -sites. In Arc1.1 we used bi-cores (2 clusters), quadri-cores (2 -clusters) and bi-quadri-cores (1 cluster) machines. Its heterogeneity +geographically distant sites. In Arc1.1 we used bi-cores (2 clusters), quad-cores (2 +clusters) and bi-quad-cores (1 cluster) machines. Its heterogeneity degree value is 6.43. This architecture was used to run class E of the CG application using 64 computing nodes. The second architecture, Arc1.2, used to execute class F of the CG application, using 128 computing @@ -1404,10 +1422,9 @@ in comparison to the version without mapping. \hline \end{tabular} \caption{Gains in time of the execution of the class E of the CG - application on Arc1.1 using 64 computing nodes, with mapping - algorithms} + application on Arc1.1 using 64 nodes} \label{tab:exph1E} - \vspace*{-0.3cm} + \vspace*{-0.4cm} \end{table} %\vspace{-0.5cm} @@ -1428,10 +1445,9 @@ in comparison to the version without mapping. \hline \end{tabular} \caption{Gains in time of the execution of the class F of the CG - application on Arc1.2 using 128 computing nodes, with mapping - algorithms} + application on Arc1.2 using 128 nodes} \label{tab:exph1F} - \vspace*{-0.3cm} + \vspace*{-0.4cm} \end{table} At first, we can see that the Simple Mapping algorithm, though it is @@ -1439,7 +1455,7 @@ simple, provides a significant improvement of application execution time. This highlights that JaceP2P-V2 really needs a mapping algorithm in order to be more efficient. Then, we can see that the F-EC and the AIAC QM algorithms provide a better mapping than the Simple Mapping -algorithms, we can see a significant difference between both +algorithms. We can see a significant difference between both algorithms. This comes from the homogeneity of clusters. In this case, the F-EC algorithm is more efficient since the minimization of the communications becomes more important than the tackle of the @@ -1455,16 +1471,16 @@ The effect is that tasks do less iterations as they receive more frequently updated data from their neighbors. In addition, as tasks and their dependencies are on the same cluster, communications are improved, but also as computations take -approximately the same time, the amount of iterations is reduce and +approximately the same time, the amount of iterations is reduced and the algorithm can converge more quickly. -Another important positive point is that gains are scalable, which allows -to foresee big improvements for very large applications.\\ +% Another important positive point is that gains are scalable, which allows +% to foresee big improvements for very large applications.\\ The third architecture, Arc2.1, was composed of 112 computing nodes, representing 394 computing cores, spread over 5 clusters in 5 sites. In this architecture we used bi-cores (3 clusters), -quadri-cores (1 cluster) and bi-quadri-cores (1 cluster) machines. Its +quad-cores (1 cluster) and bi-quad-cores (1 cluster) machines. Its heterogeneity degree's value is 8.41. This architecture was used to run class E of the CG application, using 64 computing nodes. The fourth architecture, Arc2.2, used to execute class F of the CG @@ -1494,10 +1510,9 @@ mapping. \hline \end{tabular} \caption{Gains in time of the execution of the class E of the CG - application on Arc2.1 using 64 computing nodes, with mapping - algorithms} + application on Arc2.1 using 64 nodes} \label{tab:exph2E} - \vspace*{-0.3cm} + \vspace*{-0.4cm} \end{table} \renewcommand{\arraystretch}{1.5} @@ -1516,15 +1531,14 @@ mapping. \hline \end{tabular} \caption{Gains in time of the execution of the class F of the CG - application on Arc2.2 using 128 computing nodes, with mapping - algorithms} + application on Arc2.2 using 128 nodes} \label{tab:exph2F} - \vspace*{-0.3cm} + \vspace*{-0.4cm} \end{table} To begin with, these experiments confirm that a mapping algorithm is needed and that improvements are always scalable. Then, we can see -that the F-EC algorithm falls in performances and AIAC QM is +that the F-EC algorithm falls in performance and AIAC QM is improved. What is surprising is that the Simple Mapping algorithm is better than F-EC. This can be explained by the fact that as computing nodes are quite heterogeneous, computations are not the same, so it is @@ -1534,83 +1548,83 @@ architecture, it is more efficient to choose the best computing nodes to compute iterations more quickly and to improve the convergence detection. -Here, it is important to note that the AIAC QM algorithm offers a gain -of about $50\%$ on the execution time, that is to say that the -application takes half of the execution time than without mapping. - -\subsubsection{Parameters variation} -\label{sec:xpvariation} - -After having evaluated mapping algorithms on the heterogeneity of -distributed clusters, we now propose to change the parameters of AIAC -QM and F-EC algorithms, in order to determine which values are the -most accurate. - -To do these experiments, we used an architecture composed of 122 -computing nodes representing 506 computing cores, spread over 5 -clusters in 5 sites. In this architecture we used bi-cores (2 -clusters), quadri-cores (2 clusters) and bi-quadri-cores (1 cluster) -machines. Its heterogeneity degree value is 4.98. -%, which means that -%computing nodes power is very heterogeneous. - -The parameters of each algorithm, $f$ (the search factor) for -AIAC QM and $\delta$ (the amount of local dependencies) for F-EC, -varied both with values $10\%$, $50\%$ and $90\%$. We used the CG -multi-splitting application on 64 computing nodes. The results of -these experiments are given in Table \ref{tab:expparams}. Results -exposed in this table represent the gains in execution time provided -by each algorithm with different parameters values. - -%\vspace{0.2cm} -\renewcommand{\arraystretch}{1.5} - -\begin{table}[h!] - \centering - \begin{tabular}[h!]{|c||c|c|c|} - \hline - Parameters& $10\%$ & $50\%$ & $90\%$ \\ - \hline - \hline -% Simple & \multicolumn{3}{c|}{\textcolor{blue}{$39\%$}}\\ - SMa & \multicolumn{3}{c|}{$30\%$}\\ - \hline -% AIAC QM & $35\%$ & \textcolor{blue}{$41\%$} & $35\%$ \\ - AIAC QM & $30\%$ & $32\%$ & $30\%$ \\ - \hline -% F-EC & $39\%$ & $35\%$ & \textcolor{blue}{$45\%$} \\ - F-EC & $40\%$ & $37\%$ & $45\%$ \\ - \hline - \end{tabular} -% \caption{Parameters variations using a $500'000$ problem size on an -% architecture of 5.37 heterogeneity degree} - \caption{Gains in execution time with mapping algorithms parameters - variations using the class E of the CG application using 64 - computing nodes} -% \vspace*{-0.4cm} - \label{tab:expparams} -% \vspace*{-0.9cm} -\end{table} - -%First of all, we can see that the Simple mapping provides the same -%order of performances, as shown in the precedent section, so it is -%not affected by the heterogeneity degree. Secondly, -For the AIAC QM algorithm, we can note that the best value for its -parameter $f$ is about $50\%$, but its impact is not big enough to -indicate a specific configuration. -% With a low heterogeneity degree, this mapping algorithm provides a -% good performances improvement. -Finally, and this is not surprising, the F-EC algorithm is more -efficient with a factor $\delta$ near $100\%$, which directly comes -from its aim. But we can see that it is more efficient to have a -factor around $10\%$ than having one around $50\%$. - -We can note here, with a lower heterogeneity degree than in previous -experiments, gains are lower and the difference between AIAC QM and -F-EC (with parameters at $50\%$) is lower. It can be explained as the -fact that more the heterogeneity degree tends to 0 more computing -nodes are the same, so a mapping solution will not be efficient, -except one only optimizing network latency. +% Here, it is important to note that the AIAC QM algorithm offers a gain +% of about $50\%$ on the execution time, that is to say that the +% application takes half of the execution time than without mapping. + +% \subsubsection{Parameters variation} +% \label{sec:xpvariation} + +% After having evaluated mapping algorithms on the heterogeneity of +% distributed clusters, we now propose to change the parameters of AIAC +% QM and F-EC algorithms, in order to determine which values are the +% most accurate. + +% To do these experiments, we used an architecture composed of 122 +% computing nodes representing 506 computing cores, spread over 5 +% clusters in 5 sites. In this architecture we used bi-cores (2 +% clusters), quad-cores (2 clusters) and bi-quad-cores (1 cluster) +% machines. Its heterogeneity degree value is 4.98. +% %, which means that +% %computing nodes power is very heterogeneous. + +% The parameters of each algorithm, $f$ (the search factor) for +% AIAC QM and $\delta$ (the amount of local dependencies) for F-EC, +% varied both with values $10\%$, $50\%$ and $90\%$. We used the CG +% multi-splitting application on 64 computing nodes. The results of +% these experiments are given in Table \ref{tab:expparams}. Results +% reported in this table represent the gains in execution time provided +% by each algorithm with different parameters values. + +% %\vspace{0.2cm} +% \renewcommand{\arraystretch}{1.5} + +% \begin{table}[h!] +% \centering +% \begin{tabular}[h!]{|c||c|c|c|} +% \hline +% Parameters& $10\%$ & $50\%$ & $90\%$ \\ +% \hline +% \hline +% % Simple & \multicolumn{3}{c|}{\textcolor{blue}{$39\%$}}\\ +% SMa & \multicolumn{3}{c|}{$30\%$}\\ +% \hline +% % AIAC QM & $35\%$ & \textcolor{blue}{$41\%$} & $35\%$ \\ +% AIAC QM & $30\%$ & $32\%$ & $30\%$ \\ +% \hline +% % F-EC & $39\%$ & $35\%$ & \textcolor{blue}{$45\%$} \\ +% F-EC & $40\%$ & $37\%$ & $45\%$ \\ +% \hline +% \end{tabular} +% % \caption{Parameters variations using a $500'000$ problem size on an +% % architecture of 5.37 heterogeneity degree} +% \caption{Gains in execution time with mapping algorithms parameters +% variations using the class E of the CG application using 64 +% computing nodes} +% % \vspace*{-0.4cm} +% \label{tab:expparams} +% % \vspace*{-0.9cm} +% \end{table} + +% %First of all, we can see that the Simple mapping provides the same +% %order of performance, as shown in the precedent section, so it is +% %not affected by the heterogeneity degree. Secondly, +% For the AIAC QM algorithm, we can note that the best value for its +% parameter $f$ is about $50\%$, but its impact is not big enough to +% indicate a specific configuration. +% % With a low heterogeneity degree, this mapping algorithm provides a +% % good performance improvement. +% Finally, and this is not surprising, the F-EC algorithm is more +% efficient with a factor $\delta$ near $100\%$, which directly comes +% from its aim. But we can see that it is more efficient to have a +% factor around $10\%$ than having one around $50\%$. + +% We can note here, with a lower heterogeneity degree than in previous +% experiments, gains are lower and the difference between AIAC QM and +% F-EC (with parameters at $50\%$) is lower. It can be explained as the +% fact that more the heterogeneity degree tends to 0 more computing +% nodes are the same, so a mapping solution will not be efficient, +% except one only optimizing network latency. %These experiments show that the impact of parameters values does not %well influence the AIAC QM, whereas it is very important for the F-EC %algorithm. @@ -1670,79 +1684,81 @@ In this paper we have presented %%We proposed three mapping algorithms for the JaceP2P-V2 %%environment. The first is a simple way mapping, the Simple Mapping %%algorithm, which always provides a good and stable improvement of -%%performances on all kinds of architectures. +%%performance on all kinds of architectures. %We propose a specific mapping algorithm for the AIAC model, called AIAC QM. This algorithm is based on the execution time optimization but it also -includes a small degree of edge-cuts optimization. Experiments on a -real large scale architecture of a typical AIAC application show that +includes a small degree of edge-cuts optimization. Experiments show that the AIAC QM mapping algorithm is efficient on architectures with a high heterogeneity degree. This can be explained by the fact that all iteration computations are quite different, for our example, and the convergence is more quickly detected as the more powerful computing -nodes progress in the computation. The F-EC algorithm, which is based -on the ``edge-cuts'' optimization, is meanwhile efficient on -architectures with a low heterogeneity degree. This can be explained -by the fact that in such an environment, it is more accurate for a -task to have its dependencies locally on the same cluster in order to -have efficient communications and to allow iterations to be computed -together, which improves the convergence detection speed. -Experiments we conducted have shown gains in execution time up to -$50\%$, which denotes a division by 2 of this execution time, for a -typical AIAC application on more than 700 computing cores. -%Experiments have shown that -%the importance of the parameter of both algorithms, AIAC QM and F-EC, -%is not so essential for the first one, instead it is very critical for -%the second one, but we cannot be sure that it is true all the time on -%all kinds of architectures; it maybe depends on the heterogeneity -%degree of the network. -As we did not influence the network's heterogeneity, -% as we did for the computational power of nodes, -the evaluation of the network impact on the -application execution time would be one of our next work. +nodes progress in the computation. +% The F-EC algorithm, which is based +% on the ``edge-cuts'' optimization, is meanwhile efficient on +% architectures with a low heterogeneity degree. This can be explained +% by the fact that in such an environment, it is more accurate for a +% task to have its dependencies locally on the same cluster in order to +% have efficient communications and to allow iterations to be computed +% together, which improves the convergence detection speed. +% Experiments we conducted have shown gains in execution time up to +% $50\%$, which denotes a division by 2 of this execution time, for a +% typical AIAC application on more than 700 computing cores. +% %Experiments have shown that +% %the importance of the parameter of both algorithms, AIAC QM and F-EC, +% %is not so essential for the first one, instead it is very critical for +% %the second one, but we cannot be sure that it is true all the time on +% %all kinds of architectures; it maybe depends on the heterogeneity +% %degree of the network. +% As we did not influence the network's heterogeneity, +% % as we did for the computational power of nodes, +% the evaluation of the network impact on the +% application execution time would be one of our next work. %For now, these three mapping algorithms are implemented in an %additional library for the JaceP2P-V2 environment. The results %presented in this paper show that a mapping algorithm allows to -%improve applications performances, but as the executing architectures +%improve applications performance, but as the executing architectures %should have a variety of heterogeneity degree, we have to find a %compromise between the two approaches in order to have an efficient %mapping algorithm on all kinds of architectures. In the future, we %would like to design an efficient mapping algorithm to improve the -%execution of asynchronous iteration applications on heterogeneous +%execution of asynchronous iteration applications on heterogeous %distributed architectures. As the algorithm should be integrated in %the JaceP2P-V2 environment, which is fully decentralized and fault %tolerant, the new mapping algorithm should have also these %characteristics, in order to retrieve a fully decentralized and fault %tolerant environment. -Our future works concern the amelioration of the AIAC QM algorithm, in -order to improve it on homogeneous distributed architectures. As the -F-EC mapping algorithm is efficient on such architectures, we will -give a more important consideration to the edge-cuts part of AIAC -QM. Another important point is to take into consideration the fault +% Our future works concern the amelioration of the AIAC QM algorithm, in +% order to improve it on homogeneous distributed architectures. As the +% F-EC mapping algorithm is efficient on such architectures, we will +% give a more important consideration to the edge-cuts part of AIAC +% QM. + +In our future work we plan to take into consideration the fault tolerance problem. In this study we have realized our experiments without computing node fault, which is not the real case. We have to take into account the AIAC QM algorithm about this important parameter. First we have to efficiently choose new nodes to replace failed ones. Secondly, as we do checkpointing to save tasks' states, -we have to efficiently choose backup nodes not to fall in case +we have to efficiently choose backup nodes not to fail in case a whole cluster fails, as we save on neighbors (which are in general on the same cluster for communication efficiency reasons), an important part of the application is lost and we cannot restart this part; so the whole application fails. A trade-off should be done -by having some saving nodes in external clusters. +by having some saving nodes in external clusters.\\ -\section*{Acknowledgements} +% \subsubsection*{Acknowledgements} -\label{sec:merci} +% \label{sec:merci} -This work was supported by the European Interreg IV From-P2P project -and the region of Franche-Comté. +% This work was supported by the European Interreg IV From-P2P project +% and the region of Franche-Comté. -Experiments presented in this paper were carried out using the -Grid'5000\cite{g5k} experimental testbed, being developed under the -INRIA ALADDIN development action with support from CNRS, RENATER and -several Universities as well as other funding bodies. +% Experiments presented in this paper were carried out using the +% Grid'5000\cite{g5k} experimental testbed, being developed under the +% INRIA ALADDIN development action with support from CNRS, RENATER and +% several Universities as well as other funding bodies. \bibliographystyle{unsrt}