X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/interreg4.git/blobdiff_plain/cd6cba274c30661ccf72fa4c75b772a34de59361..HEAD:/heteropar10/heteropar10.tex?ds=sidebyside diff --git a/heteropar10/heteropar10.tex b/heteropar10/heteropar10.tex index b0b366a..4102d2a 100644 --- a/heteropar10/heteropar10.tex +++ b/heteropar10/heteropar10.tex @@ -12,22 +12,28 @@ \usepackage[pdftex,final]{graphicx} \usepackage[ruled,vlined]{algorithm2e} \usepackage[pdftex]{graphicx} +\usepackage{multirow} \DeclareGraphicsExtensions{.pdf,.jpeg,.png} -\title{MAHEVE: A New Reliable AIAC Mapping Strategy for Heterogeneous +%\title{MAHEVE: A New Reliable AIAC Mapping Strategy for Heterogeneous +% Environments \thanks{This work was supported by the European +% Interreg IV From-P2P project.}} + +\title{MAHEVE: An Efficient Reliable Mapping of Asynchronous + Iterative Applications on Volatile and Heterogeneous Environments \thanks{This work was supported by the European - Interreg IV From-P2P project.}} + Interreg IV From-P2P project.}} \author{Raphaël Couturier, David Laiymani and Sébastien Miquée} \authorrunning{R. Couturier, D. Laiymani and S. Miquée} - \institute{ - University of Franche Comté \qquad LIFC laboratory\\[1mm] + \institute{\vspace{-0.2cm} + University of Franche-Comté \qquad LIFC laboratory\\%[1mm] IUT Belfort-Montb\'eliard, 2 Rue Engel Gros \\ BP 27 90016 Belfort, France\\\{{\tt - raphael.couturier,david.laiymani,sebastien.miquee}\} - {\tt @univ-fcomte.fr} } + raphael.couturier,david.laiymani,sebastien.miquee}\}{\tt + @univ-fcomte.fr} } \date{} @@ -45,13 +51,13 @@ larger sizes. The asynchronous iteration model, called AIAC, has been proven to be an efficient solution for heterogeneous and distributed architectures. An efficient mapping of applications' - tasks is essential to reduce their execution time. We present in - this paper a new mapping algorithm, called MAHEVE (Mapping Algorithm + tasks is essential to reduce their execution time. In this paper we + present a new mapping algorithm, called MAHEVE (Mapping Algorithm for HEterogeneous and Volatile Environments) which is efficient on such architectures and integrates a fault tolerance mechanism to - resist to computing nodes failures. Our experiments show gains on a - typical AIAC application's execution time about $55\%$, executed on - distributed clusters architectures containing more than 400 + resist computing nodes failures. Our experiments show gains on a + typical AIAC application execution time of about $55\%$, executed + on distributed clusters architectures containing more than 400 computing cores with the JaceP2P-V2 environment. %\textup{\small \textbf{Keywords:} Mapping algorithm, Fault tolerance, @@ -63,7 +69,7 @@ \label{sec:intro} Nowadays, scientific applications require a great computation power to -solve their large problems. Though personal computers are more +solve large problems. Though personal computers are becoming more powerful, in many cases they are not sufficient. One well adapted solution is to use computers clusters in order to combine the power of many machines. Distributed clusters form such an architecture, @@ -73,27 +79,25 @@ architecture brings users heterogeneity in computing machines as well as network latency. In order to use such an architecture, parallel programming is required. In the parallel computing area, in order to execute very large applications on heterogeneous architectures, -iterative methods are well adapted\cite{book_raph}. +iterative methods are well adapted\cite{book_raph,bcvc06:ij}. These methods repeat the same instructions block until a convergence state and a desired approximation of the solution are reached. They constitute the only known approach to solving some kinds of problems -and are relatively easy to parallelize. The Jacobi or Conjugate -Gradient\cite{cg} methods are examples of such methods. - - -To parallelize this kind of algorithm, one of the most used methods is -the message passing paradigm which allows an efficient mechanism to -exchange data between tasks. As such a method, we focus here on the -asynchronous parallel iterative model, called AIAC\cite{book_raph} -(for \textit{Asynchronous Iteration and Asynchronous Communication}). +and are relatively easy to parallelize. The Jacobi or the Conjugate +Gradient\cite{cg} methods are examples of such methods. To parallelize +them, one of the most used methods is the message passing paradigm +which provides efficient mechanisms to exchange data between tasks. As +such a method, we focus here on the asynchronous parallel iterative +model, called AIAC\cite{book_raph} (for \textit{Asynchronous + Iterations -- Asynchronous Communications}). \begin{figure}[h!] \vspace{-0.4cm} \centering - \includegraphics[width=7.4cm]{images/IACA} - \caption{Two processors computing in the Asynchronous Iteration - Asynchronous Communication (AIAC) model} + \includegraphics[width=7.4cm]{images/AIAC} + \caption{Two processors computing in the Asynchronous Iterations -- Asynchronous Communications (AIAC) model} \label{fig:AIAC} \vspace{-0.90cm} \end{figure} @@ -101,13 +105,13 @@ asynchronous parallel iterative model, called AIAC\cite{book_raph} In this model, as can be seen on Figure \ref{fig:AIAC}, after each iteration, a task sends its results to its neighbors and immediately -starts the next iteration with the last received data. The receiving +starts the next iteration with the last received data. The receiving and sending mechanisms are asynchronous and tasks do not have to wait for the reception of dependency messages from their neighbors. Consequently, there is no idle time between two iterations. Furthermore, this model is tolerant to messages loss and even if a task is stopped the remaining tasks continue the -computation, with the last available data. Several +computation, with the last available data. Several experiments\cite{bcvc06:ij} show the relevance of the AIAC algorithms in the context of distributed clusters with high latency between clusters. These works underline the good adaptability of AIAC @@ -119,35 +123,35 @@ algorithms of tasks to processors dedicated to the AIAC model on heterogeneous distributed clusters. Both these two algorithms, AIAC-QM (for \textit{AIAC Quick-quality Map}) and F-EC (for \textit{Farhat Edges-Cuts}) showed an important performances -improvement by reducing up to $50\%$ the application's execution +improvement by reducing up to $50\%$ the application execution time. These experiments were performed by using the JaceP2P-V2 environment. This Java based platform is an executing and developing environment dedicated to the AIAC model. By implementing a distributed backup/restore mechanism it is also fully fault tolerant\cite{jaceP2P-v2}. In our previous experiments we did not -introduce computing nodes failures during the computation. And as -architecture's heterogeneity continually evolves according to -computing nodes volatility, we have to more precisely takes care about +introduce computing nodes failures during the computation. As +architecture heterogeneity continually evolves according to +computing nodes volatility, we have to take care more precisely about the heterogeneity of the target platform. Thus in this paper our main contribution is to propose a new mapping algorithm called MAHEVE (\textit{Mapping Algorithm for HEterogeneous and Volatile Environments}). This algorithm explicitly tackles the heterogeneity issue and introduces a level of dynamism in order to adapt itself to -the fault tolerance mechanisms. Our experiments show gains about 55\% -on application's execution time, with about 10 points better than -AIAC-QM and about 25 points better than F-EC. +the fault tolerance mechanisms. Our experiments show gains of about +$55\%$ on application execution time, which is about 10 points +better than AIAC-QM and about 25 points better than F-EC. The rest of this paper is organized as -follows. Section~\ref{sec:jacep2p} reminds the JaceP2P-V2 middleware +follows. Section~\ref{sec:jacep2p} presents the JaceP2P-V2 middleware by describing its architecture and briefly presenting its fault -tolerance mechanism. Section~\ref{sec:pb} formalizes our mapping and +tolerance mechanisms. Section~\ref{sec:pb} formalizes our mapping and fault tolerance problems and quotes existing issues to address them. Section~\ref{sec:maheve} describes the new mapping strategy we -propose, MAHEVE. In section~\ref{sec:expe} we present the experiments -we have conducted on the Grid'5000 testbed with more than 400 -computing cores. Finally, we give some concluding remarks and plan our -future work in section~\ref{sec:conclu}. +propose, MAHEVE. In Section~\ref{sec:expe} we present the experiments +we conducted on the Grid'5000 testbed with more than 400 computing +cores. Finally, we give some concluding remarks and plan our future +work in Section~\ref{sec:conclu}. \section{JaceP2P-V2} @@ -165,13 +169,13 @@ The JaceP2P-V2 platform part, which is based on the daemons and supervisors paradigm, is composed of three main entities: \begin{itemize} -\item The ``super-nodes'', which is in charge of supervising free - computing nodes connected to the platform. +\item The ``super-nodes'', which are in charge of supervising free + computing nodes connected to the platform; \item The ``spawner'', which is launched by a user wanting to execute a parallel application. It is in charge of a group of computing - nodes and monitor them. If one of them fails, it requires a - replacing one to a super-node. + nodes and monitors them. If one of them fails, it requires a + replacing one to a super-node; \item The ``daemon'', when launched, connects to a super-node and waits for a task to execute. Each daemon can communicate directly @@ -181,16 +185,16 @@ supervisors paradigm, is composed of three main entities: To be able to execute asynchronous iterative applications, JaceP2P-V2 -has an asynchronous messaging mechanism. In order to resist daemons' +has an asynchronous messaging mechanism. In order to resist daemons failures, it implements a distributed backup mechanism called the \textit{uncoordinated distributed checkpointing}\cite{uncoord_cp}. This decentralized procedure allows the platform to be very scalable, with no weak points and does not require a secure nor a stable station for backups. When a daemon dies, it is replaced by another one. Here we suppose that we have enough available free nodes. Moreover, to -resit to supervisors failures and to be scalable, it reserves some -extra nodes. For more details on the JaceP2P-V2 platform, interested -readers can refer to \cite{jaceP2P-v2}. +resist supervisors failures and scalability, it reserves some extra +nodes. For more details on the JaceP2P-V2 platform, interested readers +can refer to \cite{jaceP2P-v2}. \section{Mapping and fault tolerance problems} @@ -207,7 +211,7 @@ With the AIAC model, all tasks compute in parallel at the same time, without precedence nor synchronization. During an iteration, each task computes its job and sends its results to its neighbors, and immediately starts the next iteration. The TIG\cite{tig1} -(\textit{Task Interaction Graph}) is the most appropriate model to our +(\textit{Task Interaction Graph}) model is the most appropriate to our problem, as it only models relationships between tasks. In this model, all the tasks are considered simultaneously executable and communications can take place at any time during the computation, with @@ -223,7 +227,7 @@ represent the mutual communication among tasks. A function \mbox{$EC : \mbox{$CC : E \rightarrow \mathbb{R}^+$} gives the communication cost for message passing on edges. We define \mbox{$|V| = v$, $EC(V_i) = e_i$} and \mbox{$CC(V_i,V_j) = c_{ij}$}. Another function -\mbox{$D : V \rightarrow \mathbb{R}^+$} gives the amount of +\mbox{$D : V \rightarrow \mathbb{N}^+$} gives the amount of dependencies of a task, and we define \mbox{$D(V_i) = d_i$}. @@ -239,7 +243,7 @@ architecture, where \mbox{$N = \{N_1,N_2,\dots N_n\}$} is the set of $|N|$ vertices and $L$ is the set of $|L|$ undirectional edges. The vertices represent the computing nodes and the edges represent the links between them. An edge \mbox{$L_i \in L$} is an unordered pair -\mbox{$(N_x,N_y) \in \mathbb{N}$}, representing a communication link +\mbox{$(N_x,N_y) \in N$}, representing a communication link between nodes $N_x$ and $N_y$. A function \mbox{$WN : N \rightarrow \mathbb{R}^+$} gives the computational power of nodes and another function \mbox{$WL : L \rightarrow \mathbb{R}^+$} gives the @@ -247,12 +251,12 @@ communication latency of links. We define \mbox{$WN(N_i) = wn_i$} and \mbox{$WL(L_i,L_j) = wl_{ij}$}. Let be $|C|$ the number of clusters contained in the architecture. A function \mbox{$CN : C \rightarrow \mathbb{N}^+$} gives the amount of computing nodes contained in a -cluster, and another function \mbox{$CF : C \rightarrow \mathbb{R}^+$} -gives the amount of available computing nodes (not involve in an +cluster, and another function \mbox{$CF : C \rightarrow \mathbb{N}^+$} +gives the amount of available computing nodes (not involved in an application computation) of a cluster. We define \mbox{$CN(C_i) = C_{Ni}$} and \mbox{$CF(C_i) = C_{Fi}$}. We also define \mbox{$C_{Pi} = \sum_{j=1}^{C_{Ni}}{wn_j}$} as the whole computation power of -cluster $C_i$ and \mbox{$C_{\overline{P}i} = \frac{C_{Pi}}{C_{Ni}}$} +cluster $C_i$, \mbox{$C_{\overline{P}i} = \frac{C_{Pi}}{C_{Ni}}$} as the average computation power of cluster $C_i$, and $C_{\overline{P}fi}$ the average power of its available resources. @@ -260,7 +264,7 @@ We evaluate the \textit{heterogeneity degree} of the architecture, noted $hd$, by using the \textit{relative standard deviation} method, with $hd = \frac{\sigma_{PN}}{avg_{PN}}$ where $avg_{PN}$ is the average computing power of nodes and $\sigma_{PN}$ represents the -standard deviation of computing nodes power. This measure provides us +standard deviation of computing nodes power. This measure provides us the coefficient of variation of the platform in percentage -- we only consider \mbox{$0 \leq hd \leq 1$} as considering values of \mbox{$hd > 1$} is not relevant, as \mbox{$hd = 1$} denotes a fully @@ -277,19 +281,20 @@ $GG$, the execution time of the application, $ET(App)$, can be defined as the execution time of the slowest task. Indeed, an application ends when all the tasks have detected convergence and reached the desired approximation of the solution. We define $ET(App) = \max_{i=1 \dots - v} ( ET(V_i) )$ where the execution time of each task $i$ -\mbox{($i=1 \dots v$)}, $ET(V_i)$ is given by $ET(V_i) = + v} ( ET(V_i) )$, where the execution time of each task $i$ +\mbox{($i=1 \dots v$)}, $ET(V_i)$, is given by $ET(V_i) = \frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \times wl_{ij}$ where $e_i$ is the computational cost of $V_i$, $wn_i$ is the computational power of the node $N_i$ on which $V_i$ is mapped, $J$ represents the neighbors set of $V_i$, $c_{ij}$ is the amount of communications between $V_i$ and $V_j$, and $wl_{ij}$ is the link latency between the -computing nodes on which are mapped $V_i$ and $V_j$. As described in +computing nodes on which $V_i$ and $V_j$ are mapped. As described in this formula, the execution time of a task depends on the task weight -and communications it may occur with its neighbors. We underline here -that in the AIAC model, it is impossible to predict the number of -iterations of a task. So it is difficult to evaluate a priori its cost -$e_i$. +and on the communications which may occur between this task and its +neighbors. We underline here that in the AIAC model, it is impossible +to predict the number of iterations of a task. So it is difficult to +evaluate a priori its cost $e_i$. + This tasks mapping problem is similar to the classical graph partitioning and task assignment problem, and is thus NP-complete. @@ -299,14 +304,14 @@ partitioning and task assignment problem, and is thus NP-complete. \label{sec:pbft} In volatile environments, computing nodes can disconnect at any time -during the computation, and these ones have to efficiently be replaced. +during the computation, and have thus to be efficiently replaced. The replacing nodes should be the best ones at the fault time, according to the chosen mapping algorithm, by searching them in available nodes. As executing environments can regularly evolve, due -to computing nodes' volatility, a mapping algorithm has to keep a -right overview of the architecture, in real time. Thus, criteria to +to computing nodes volatility, a mapping algorithm has to keep a +correct overview of the architecture, in real time. Thus, criteria to assign tasks to nodes should evolve too. @@ -325,12 +330,12 @@ environment. An important point to take into consideration is that we do not allow the execution of multiple tasks on the same computing node, as this -provides a fall of performance when this one fails. Indeed we should +provides a fall of performances when this one fails. Indeed we should redeploy all of the tasks from this node to another one, using last saves, which can be spread on multiple computing nodes. This may -result in large communication overheads and a waste of computation -time. Nevertheless, to benefit of multi-cores processors, we use a -task level parallelism by running multi-threaded sequential solver for +result in large communication overheads and in a waste of computation +time. Nevertheless, to benefit from multi-cores processors, we use a +task level parallelism by multi-threaded sequential solvers for example. @@ -338,7 +343,7 @@ Another important point in the AIAC model is that as the JaceP2P-V2 environment is fault tolerant and tasks save checkpoints on their neighbors, it is more efficient to save on near nodes than on far ones in order to reduce the communication overhead during this operation, -and to faster restart a task. +and to restart a task faster. \subsection{Related work} @@ -353,7 +358,7 @@ where distance, in term of network, is small to improve communications between tasks. Here we can cite Metis\cite{metis} and Chaco\cite{chaco} which are libraries containing such kind of algorithms. The second category is the \textit{Execution time - optimization} class, which aims to minimize the whole application's + optimization} class, which aims at minimizing the whole application execution time. These algorithms look for nodes which can provide the smallest execution time of tasks using their computational power. Here we can cite QM\cite{qm_these}, FastMap\cite{fastmap}, and @@ -366,11 +371,12 @@ which may influence the applications performances. All mentioned algorithms do not tackle the computing nodes failures -issue, or basically by applying the same policy. As explain in section -\ref{sec:pbft}, a more efficient and dedicated replacement function is -needed. Nevertheless, to the best of our knowledge, no tasks mapping -algorithm, addressing explicitly both the executing platform -heterogeneity and the computing nodes failures issues, exists. +issue, or only basically by applying the same policy. As explained in +Section \ref{sec:pbft}, a more efficient and dedicated replacement +function is needed. Nevertheless, to the best of our knowledge, no +tasks mapping algorithm, addressing explicitly both the executing +platform heterogeneity and the computing nodes failures issues, +exists. \vspace{-0.25cm} @@ -379,15 +385,15 @@ heterogeneity and the computing nodes failures issues, exists. Here we present our new tasks mapping strategy, called MAHEVE (for \textit{Mapping Algorithm for HEterogeneous and Volatile - Environments}). This algorithm aims to take the best part of each -category mentioned in section \ref{sec:pbrw}, the edge-cuts -minimization and the application's execution time optimization + Environments}). This algorithm aims at taking the best part of each +category mentioned in Section \ref{sec:pbrw}, the edge-cuts +minimization and the application execution time optimization algorithms. This new algorithm can be divided into two parts. The first part aims -to perform the initial mapping, and the second part is devoted to -search replacing nodes when computing nodes' failures occur. +at performing the initial mapping, and the second part is devoted to +search replacing nodes when computing nodes failures occur. \vspace{-0.20cm} @@ -405,26 +411,26 @@ their sort order) with a reservation of some nodes in each cluster. \subsubsection{Sorting clusters} \label{sec:maheve_sort_clusters} -The first step of the initial mapping is to sort clusters accordingly -to the executing platform's heterogeneity degree $hd$. The main -principles are that a cluster obtain a better mark when $hd < 0.5$ and -it contains more computing nodes than other clusters ($C_{Fi}$, the -amount of available free nodes, is privileged), and when $hd \ge 0.5$ +The first step of the initial mapping is to sort clusters according to +the executing platform heterogeneity degree $hd$. The main principles +are that a cluster obtains a better mark when $hd < 0.5$ and it +contains more computing nodes than other clusters ($C_{Fi}$, the +number of available free nodes, is privileged), and when $hd \ge 0.5$ and it contains more powerful computing nodes ($C_{\overline{P}fi}$, -the average computation power, is privileged). These choices come from -several experiments with the AIAC model, which show that in such +the average free computation power, is privileged). These choices come +from several experiments with the AIAC model, which show that in such environments it is more efficient to privilege the computation power -or the amount of nodes. As the amount of nodes, $C_{Fi}$, and the -average computing power, $C_{\overline{P}fi}$, are not in the order of -magnitude, we normalize them with two functions, $normN$ and -$normP$. They are defined as \mbox{$normN\left(C_{Fi}\right) = C_{Fi} - \times 100 \div \sum_{j=1}^{|C|}C_{Fj}$}, which is the rate (in -percent) of computing nodes, and \mbox{$normP(C_{\overline{P}fi}) = +or the number of nodes. As the number of nodes, $C_{Fi}$, and the +average free computing power, $C_{\overline{P}fi}$, are not in the +same order of magnitude, we normalize them with two functions, $normN$ +and $normP$. They are defined with \mbox{$normN\left(C_{Fi}\right) = + C_{Fi} \times 100 \div \sum_{j=1}^{|C|}C_{Fj}$}, which is the rate +of computing nodes, and \mbox{$normP(C_{\overline{P}fi}) = C_{\overline{P}fi} \times 100 \div \sum_{j=1}^{|C|} - C_{\overline{P}fj}$}, which is the rate (in percent) of the average -power, both representing the cluster in the architecture. We note -$NC_{Fi} = normN\left( C_{Fi} \right)$ and $NC_{\overline{P}fi} = -normP(C_{\overline{P}fi})$. + C_{\overline{P}fj}$}, which is the rate of the average power, both +representing the cluster in the architecture. We note +$normN\left( C_{Fi} \right) = NC_{Fi}$ and $normP(C_{\overline{P}fi}) +=NC_{\overline{P}fi}$. The formula used to give a mark, $M_i$, to a cluster is @@ -449,18 +455,18 @@ receives the lower mark. \subsubsection{Sorting tasks} \label{sec:maheve_sort_tasks} -Like clusters, tasks are also sorted accordingly to the heterogeneity -degree of the executing platform. This sort is done in the same way as +Like clusters, tasks are also sorted according to the heterogeneity +degree of the executing platform. This sorting is done in the same way as previously, as when $hd < 0.5$ tasks with higher dependencies will be privileged, and when $hd \ge 0.5$ tasks with higher computing cost are -privileged, in order to be executed on highest powered computing -nodes. +privileged. +%, in order to be executed on the highest powered computing nodes. The main function used to classified tasks is \begin{equation} \label{eq:tasks} - Q_i = e_i^{hd} \times d_i^{1 - hd} + Q_i = {e_i}^{hd} \times {d_i}^{1 - hd} \end{equation} where $Q_i$ is the evaluation of the task $i$ according to the @@ -471,27 +477,29 @@ task $i$. Tasks are taken in the order of the first sort, determined with equation (\ref{eq:tasks}), and each task is placed in a new list (the final one) and some of its dependencies are added. We note $Nb_i = -d_i^{1 - hd}$ this amount of dependencies as the lower the +{d_i}^{1 - hd}$ this amount of dependencies as the lower the heterogeneity degree is the higher this number will be. This final operation allows to control the necessary locality of tasks according -to the heterogeneity degree of the platform. +to $hd$. %the heterogeneity degree of the platform. \subsubsection{Mapping method} \label{sec:maheve_mapping_method} The third step of the initial mapping is to allocate tasks to -computing nodes. As clusters and tasks have been sorted accordingly to -the executing platform's heterogeneity degree, ordered from the -highest mark to the lower, this function maps tasks on each available +nodes. As clusters and tasks have been sorted accordingly to the +executing platform heterogeneity degree, ordered from the highest +mark to the lowest, this function maps tasks on each available computing nodes of clusters, in their respective order in lists (for -example task classified first in the tasks list is mapped on an -available node of the cluster classified first in clusters list). - -The idea here is not to fulfill each cluster, but preserve some +example a task classified first in the tasks list is mapped on an +available node of the cluster classified first in the clusters list). +The idea here is not to fulfill each cluster, but to preserve some computing nodes in each cluster. These conserved nodes will be used to -replace failed nodes. The fact of considering in the initial mapping -the fault tolerance is a new approach in mapping algorithms. +replace failed nodes. + + +Here we can mentioned that the whole mapping process (the three steps) +has a complexity of $O( |V|^2 )$, where |V| is the number of tasks. \subsection{Replacing function} @@ -501,14 +509,14 @@ This function is essential in a volatile environment, as an efficient replacement should reduce the overhead on the application execution time due to the loss of computing steps and data. -As we have shown in the previous section, during the initial mapping -some computing nodes in each cluster have been preserved for fault -tolerance. When a node fails this function replace it by a free node -of the same cluster. If none is available this function sorts again -clusters, to take into consideration platform's modifications, and -replace the failed node by one available in the new sorted clusters -list. This mechanism allows to keep tasks' locality and a real time -overview of the executing platform. +As shown in the previous section, during the initial mapping some +computing nodes in each cluster have been preserved. When a node fails +this function replace it by a free node of the same cluster. If none +is available this function sorts again clusters, to take into +consideration platform modifications, and replace the failed node by +one available in the new sorted clusters list. This mechanism allows +to keep tasks locality and a real time overview of the executing +platform. \section{Experimentation} @@ -520,38 +528,36 @@ overview of the executing platform. We used the ``Kernel CG'' application of the NAS Parallel Benchmarks (NPB) \cite{nas} to evaluate the performances of our new mapping algorithm. This benchmark is designed to be used on large -architectures, because it stresses communications over latency -networks, by processing unstructured matrix vector multiplication with -a Conjugate Gradient method. As this method cannot be executed with -the asynchronous iteration model we have replaced it by another method -called the multisplitting method. This latter supports the -asynchronous iterative model. For more details about this method, -interested readers are invited to see \cite{book_raph}. The chosen -problem used a matrix of size $5,000,000$ with a low bandwidth, fixed -to $35,000$. This bandwidth size generates, according to the problem's -size, between 8 and 20 neighbors per tasks. This application was -executed on 64 computing nodes. +architectures, as it stresses communications%over latency networks +, by processing unstructured matrix vector multiplication with a +Conjugate Gradient method. As this method cannot be executed with the +asynchronous iteration model we have replaced it by another method +called the multisplitting method, which supports the asynchronous +iterative model. More details about this method can be found in +\cite{book_raph}. The chosen problem used a matrix of size $5,000,000$ +with a low bandwidth, fixed to $35,000$. This bandwidth size +generates, according to the problem size, between 8 and 20 neighbors +per tasks. This application was executed on 64 nodes. The platform used for our tests, called Grid’5000\cite{g5k}, is a -French nationwide experimental set of clusters which provides us +French nationwide experimental set of clusters which provides us with distributed clusters architectures (28 heterogeneous clusters spread -over 9 sites). We used three distributed clusters architectures on -the Grid'5000 testbed, each having a different heterogeneity -degree. The first one was composed of four clusters spread over four -sites, with a total of 106 computing nodes representing 424 computing -cores with \mbox{$hd = 0.08$}; the second one was composed of four -clusters spread over three sites, with a total of 110 computing nodes -representing 440 computing cores with \mbox{$hd = 0.50$}; and finally -the third one was composed of five clusters spread over four sites -with 115 computing nodes representing 620 computing cores with -\mbox{$hd = 0.72$}. +over 9 sites). We used three distributed clusters architectures, each +having a different heterogeneity degree. The first one was composed of +four clusters spread over four sites, with a total of 106 computing +nodes representing 424 computing cores with \mbox{$hd = 0.08$}; the +second one was composed of four clusters spread over three sites, with +a total of 110 computing nodes representing 440 computing cores with +\mbox{$hd = 0.50$}; and finally the third one was composed of five +clusters spread over four sites with 115 computing nodes representing +620 computing cores with \mbox{$hd = 0.72$}. All computing nodes of these clusters have at least 4 computing cores (in the last used architecture, with \mbox{$hd = 0.72$}, two clusters are composed of 8 computing cores machines) with a minimum of 4GB of -memory (in order to execute the application with a big problem's +memory (in order to execute the application with a big problem size). All computing nodes can communicate with each other through an efficient network. Nevertheless, this latter is shared with many other users so high latencies appear during executions. @@ -562,7 +568,7 @@ users so high latencies appear during executions. During executions, we introduced two failures in computing nodes involved in the computation every 20 seconds to simulate a volatile -environment. Unfortunately, we had not the opportunity to realize +environment. Unfortunately, we did not have the opportunity to realize experiments with more computing nodes over more sites with problems of larger sizes, but we plan to extend our experiments in the future. @@ -574,87 +580,105 @@ fault tolerant versions of the AIAC-QM and F-EC mapping algorithms presented in \cite{pdsec10}. Table \ref{tab:results} shows the execution times of each mapping algorithm compared to the default mapping strategy of the JaceP2P-V2 platform, with the corresponding -gains on application's execution time, in brackets. +gains on application execution time, given in brackets. It presents +both the executions with faults (WF) and the fault free (FF) +executions. -\renewcommand{\arraystretch}{1.7} +\renewcommand{\arraystretch}{1.6} \begin{table}[h!] \centering - \begin{tabular}{|c|c|c|c|c|} + \begin{tabular}{|c|c|c|c|c|c|c|c|c|} \hline - ~$ hd$~ & ~Default~ & ~FT-AIAC-QM~ & ~FT-FEC~ & ~MAHEVE~ \\ + \multirow{2}{*}{ ~$hd$~ }&\multicolumn{2}{c|}{ ~Default~ }&\multicolumn{2}{c|}{ ~FT-AIAC-QM~ }&\multicolumn{2}{c|}{ ~FT-FEC~ }&\multicolumn{2}{c|}{ ~MAHEVE~ }\\ + \cline{2-9} + + & ~FF~ & ~WF~ & ~FF~ & ~WF~ & ~FF~ & ~WF~ & ~FF~ & ~WF~ \\ \hline - ~$0.08$~&229s&178s (22\%)&154s (33\%)&113s (50\%)\\ - ~$0.50$~&242s&118s (51\%)&133s (45\%)&85s (65\%)\\ - ~$0.72$~&192s&99s (45\%)&121s (33\%)&86s (53\%)\\ + ~$0.08$~ & ~80~ & ~229~ & ~63 (21\%)~ & ~178 (22\%)~ & ~61 (23\%)~ & ~154 + (33\%)~ & ~60 (25\%)~ & ~113 (50\%)~ \\ + + ~$0.50$~ & ~67~ & ~242~ & ~61 (9\%)~ & ~118 (51\%)~ & ~63 (6\%)~ & ~133 + (45\%)~ & ~54 (20\%)~ & ~85 (65\%)~ \\ + + ~$0.72$~ & ~67~ & ~192~ & ~59 (12\%)~ & ~99 (45\%)~ & ~65 (3\%)~ & ~121 + (33\%)~ & ~52 (22\%)~ & ~86 (53\%)~\\ + \hline \end{tabular} \vspace{0.15cm} - \caption{Application's execution time in seconds and corresponding gains on various - platforms using different mapping algorithms with 2 computing - nodes' failures each 20 seconds} + \caption{Application execution time in seconds and corresponding gains on various + platforms using different mapping algorithms, with fault free (FF) executions + and with 2 node failures each 20 seconds (WF) executions.} \label{tab:results} \vspace{-0.7cm} \end{table} First of all, we can note that all mapping algorithms provide an -enhancement of the application's performances by considerably reducing -its execution time, with an average gain about $45\%$ in general in -comparison to the default policy. As shown in \cite{pdsec10}, FT-FEC -is efficient on architectures with a low heterogeneity degree -(\mbox{$hd = 0.08$} by providing gains about $33\%$, and gains are -seemly the same on heterogeneous architectures (\mbox{$hd = +enhancement of the application performances by considerably reducing +its execution time, especially for executions with node failures, with +an average gain of about $45\%$ in general in comparison to the +default policy. If we focus on executions with node failures (WF), +FT-FEC is efficient on architectures with a low heterogeneity degree +(\mbox{$hd = 0.08$}) by providing gains of about $33\%$, and gains are +roughly the same on heterogeneous architectures (\mbox{$hd = 0.72$}). FT-AIAC-QM is efficient on architectures with a high -heterogeneity degree (\mbox{$hd = 0.72$}) by providing gains about +heterogeneity degree (\mbox{$hd = 0.72$}) by providing gains of about $45\%$, whereas it is not so efficient on homogeneous architectures -(\mbox{$hd = 0.08$}) by providing gains about $22\%$. We can note here -that on an architecture with a heterogeneity degree of $0.50$ +(\mbox{$hd = 0.08$}) by providing gains of about $22\%$. We can note +here that on an architecture with a heterogeneity degree of $0.50$ FT-AIAC-QM is more efficient than FT-FEC by providing gains up to -$50\%$. +$50\%$. Here we point out that in fault free executions (FF), both +algorithms also provide gains on their respective favorite +architectures, though gains are less great than in executions with +faults (WF). + -Now if we look at the performances of our new solution, MAHEVE, we can +Now if we focus on the performances of our new solution MAHEVE, we can see that it is all the time better than other algorithms. As can be -seen in \mbox{Table \ref{tab:results}}, it reduces the application's -execution time by about $50\%$ on homogeneous architectures (here of -$0.08$ heterogeneity degree) what is more than 25 point better than -FT-FEC and near 30 points better than FT-AIAC-QM. On heterogeneous -architectures (here of $0.72$ heterogeneity degree) it also -outperforms other mapping algorithms by reducing the application's -execution time by about $53\%$ what is near about 10 points better -than FT-AIAC-QM and 20 points better than FT-FEC. On middle -heterogeneity degree architectures (here of $0.50$ heterogeneity -degree), MAHEVE is another one time better than its two comparative -mapping algorithms by reducing the application's execution time by -about $53\%$. These good performances come from the fact that it is -designed to be efficient on both architectures, homogeneous and +seen in \mbox{Table \ref{tab:results}}, in executions with faults +(WF), it reduces the application's execution time by about $50\%$ on +homogeneous architectures (here of $0.08$ heterogeneity degree) which +is more than 25 point better than FT-FEC and near 30 points better +than FT-AIAC-QM. On heterogeneous architectures (here of $0.72$ +heterogeneity degree) it also outperforms other mapping algorithms by +reducing the application execution time by about $53\%$ which is +almost 10 points better than FT-AIAC-QM and 20 points better than +FT-FEC. On middle heterogeneity degree architectures (here of $0.50$ +heterogeneity degree), MAHEVE is once again better than its two +comparative mapping algorithms by reducing the application execution +time by about $53\%$. These good performances come from the fact that +it is designed to be efficient on both architectures, homogeneous and heterogeneous. Moreover, as it integrates a fault tolerance \textit{security} in the initial mapping, it is more efficient when computing nodes fail. Here we can point out that this algorithm allows -in general gains on application's execution time about $55\%$. +in general gains on application execution time of about $55\%$. In fault free executions (FF), it outperforms once again +the two other algorithms. \section{Conclusion and future works} \label{sec:conclu} -In this paper we have presented a novel mapping algorithm, called +In this paper we have presented a new mapping algorithm, called MAHEVE, to address the AIAC mapping issue on heterogeneous and -volatile environments. It aims to do an efficient mapping of tasks on -distributed clusters architectures by taking the best part of the two -known approaches, application's execution time optimization and +volatile environments. It aims at doing an efficient mapping of tasks +on distributed clusters architectures by taking the best part of the +two known approaches, application execution time optimization and edge-cuts minimization. Experiments show that it is the most efficient -mapping algorithm on all kinds of architectures, as it takes care -about their heterogeneity degree and adapt its sort methods to it. We -have shown that it is all the time better than the two other +mapping algorithm on all kinds of architectures, as it takes into +account their heterogeneity degree and adapt its sort methods to +it. We have shown that it is all the time better than the two other comparative mapping algorithms, FT-AIAC-QM and FT-FEC. This can be explained by the fact that it not only takes care about computing nodes and clusters, but also about the tasks' properties, what refines the mapping solution. + In our future works we plan to enhance the MAHEVE algorithm -performances by modifying the notation of clusters as their locality -is now not taken into consideration, in order to favor tasks locality, -which will reduce communications delays providing a better convergence -rate. +performances by modifying the notation of clusters, since their +locality has not yet been taken into consideration. This would favor +tasks locality, which would reduce communications delays and provide a +much better convergence rate. %%%