\usepackage[pdftex,final]{graphicx}
\usepackage[ruled,vlined]{algorithm2e}
\usepackage[pdftex]{graphicx}
+\usepackage{multirow}
\DeclareGraphicsExtensions{.pdf,.jpeg,.png}
-\title{MAHEVE: A New Reliable AIAC Mapping Strategy for Heterogeneous
+%\title{MAHEVE: A New Reliable AIAC Mapping Strategy for Heterogeneous
+% Environments \thanks{This work was supported by the European
+% Interreg IV From-P2P project.}}
+
+\title{MAHEVE: An Efficient Reliable Mapping of Asynchronous
+ Iterative Applications on Volatile and Heterogeneous
Environments \thanks{This work was supported by the European
- Interreg IV From-P2P project.}}
+ Interreg IV From-P2P project.}}
\author{Raphaël Couturier, David Laiymani and Sébastien Miquée}
\authorrunning{R. Couturier, D. Laiymani and S. Miquée}
- \institute{
- University of Franche Comté \qquad LIFC laboratory\\[1mm]
+ \institute{\vspace{-0.2cm}
+ University of Franche-Comté \qquad LIFC laboratory\\%[1mm]
IUT Belfort-Montb\'eliard, 2 Rue Engel Gros \\ BP 27 90016 Belfort,
France\\\{{\tt
- raphael.couturier,david.laiymani,sebastien.miquee}\}
- {\tt @univ-fcomte.fr} }
+ raphael.couturier,david.laiymani,sebastien.miquee}\}{\tt
+ @univ-fcomte.fr} }
\date{}
larger sizes. The asynchronous iteration model, called AIAC, has
been proven to be an efficient solution for heterogeneous and
distributed architectures. An efficient mapping of applications'
- tasks is essential to reduce their execution time. We present in
- this paper a new mapping algorithm, called MAHEVE (Mapping Algorithm
+ tasks is essential to reduce their execution time. In this paper we
+ present a new mapping algorithm, called MAHEVE (Mapping Algorithm
for HEterogeneous and Volatile Environments) which is efficient on
such architectures and integrates a fault tolerance mechanism to
- resist to computing nodes failures. Our experiments show gains on a
- typical AIAC application's execution time about $55\%$, executed on
- distributed clusters architectures containing more than 400
+ resist computing nodes failures. Our experiments show gains on a
+ typical AIAC application execution time of about $55\%$, executed
+ on distributed clusters architectures containing more than 400
computing cores with the JaceP2P-V2 environment.
%\textup{\small \textbf{Keywords:} Mapping algorithm, Fault tolerance,
\label{sec:intro}
Nowadays, scientific applications require a great computation power to
-solve their large problems. Though personal computers are more
+solve large problems. Though personal computers are becoming more
powerful, in many cases they are not sufficient. One well adapted
solution is to use computers clusters in order to combine the power of
many machines. Distributed clusters form such an architecture,
as network latency. In order to use such an architecture, parallel
programming is required. In the parallel computing area, in order to
execute very large applications on heterogeneous architectures,
-iterative methods are well adapted\cite{book_raph}.
+iterative methods are well adapted\cite{book_raph,bcvc06:ij}.
These methods repeat the same instructions block until a convergence
state and a desired approximation of the solution are reached. They
constitute the only known approach to solving some kinds of problems
-and are relatively easy to parallelize. The Jacobi or Conjugate
-Gradient\cite{cg} methods are examples of such methods.
-
-
-To parallelize this kind of algorithm, one of the most used methods is
-the message passing paradigm which allows an efficient mechanism to
-exchange data between tasks. As such a method, we focus here on the
-asynchronous parallel iterative model, called AIAC\cite{book_raph}
-(for \textit{Asynchronous Iteration and Asynchronous Communication}).
+and are relatively easy to parallelize. The Jacobi or the Conjugate
+Gradient\cite{cg} methods are examples of such methods. To parallelize
+them, one of the most used methods is the message passing paradigm
+which provides efficient mechanisms to exchange data between tasks. As
+such a method, we focus here on the asynchronous parallel iterative
+model, called AIAC\cite{book_raph} (for \textit{Asynchronous
+ Iterations -- Asynchronous Communications}).
\begin{figure}[h!]
\vspace{-0.4cm}
\centering
- \includegraphics[width=7.4cm]{images/IACA}
- \caption{Two processors computing in the Asynchronous Iteration - Asynchronous Communication (AIAC) model}
+ \includegraphics[width=7.4cm]{images/AIAC}
+ \caption{Two processors computing in the Asynchronous Iterations -- Asynchronous Communications (AIAC) model}
\label{fig:AIAC}
\vspace{-0.90cm}
\end{figure}
In this model, as can be seen on Figure \ref{fig:AIAC}, after each
iteration, a task sends its results to its neighbors and immediately
-starts the next iteration with the last received data. The receiving
+starts the next iteration with the last received data. The receiving
and sending mechanisms are asynchronous and tasks do not have to wait
for the reception of dependency messages from their
neighbors. Consequently, there is no idle time between two
iterations. Furthermore, this model is tolerant to messages loss and
even if a task is stopped the remaining tasks continue the
-computation, with the last available data. Several
+computation, with the last available data. Several
experiments\cite{bcvc06:ij} show the relevance of the AIAC algorithms
in the context of distributed clusters with high latency between
clusters. These works underline the good adaptability of AIAC
heterogeneous distributed clusters. Both these two algorithms,
AIAC-QM (for \textit{AIAC Quick-quality Map}) and F-EC (for
\textit{Farhat Edges-Cuts}) showed an important performances
-improvement by reducing up to $50\%$ the application's execution
+improvement by reducing up to $50\%$ the application execution
time. These experiments were performed by using the JaceP2P-V2
environment. This Java based platform is an executing and developing
environment dedicated to the AIAC model. By implementing a distributed
backup/restore mechanism it is also fully fault
tolerant\cite{jaceP2P-v2}. In our previous experiments we did not
-introduce computing nodes failures during the computation. And as
-architecture's heterogeneity continually evolves according to
-computing nodes volatility, we have to more precisely takes care about
+introduce computing nodes failures during the computation. As
+architecture heterogeneity continually evolves according to
+computing nodes volatility, we have to take care more precisely about
the heterogeneity of the target platform. Thus in this paper our main
contribution is to propose a new mapping algorithm called MAHEVE
(\textit{Mapping Algorithm for HEterogeneous and Volatile
Environments}). This algorithm explicitly tackles the heterogeneity
issue and introduces a level of dynamism in order to adapt itself to
-the fault tolerance mechanisms. Our experiments show gains about 55\%
-on application's execution time, with about 10 points better than
-AIAC-QM and about 25 points better than F-EC.
+the fault tolerance mechanisms. Our experiments show gains of about
+$55\%$ on application execution time, which is about 10 points
+better than AIAC-QM and about 25 points better than F-EC.
The rest of this paper is organized as
-follows. Section~\ref{sec:jacep2p} reminds the JaceP2P-V2 middleware
+follows. Section~\ref{sec:jacep2p} presents the JaceP2P-V2 middleware
by describing its architecture and briefly presenting its fault
-tolerance mechanism. Section~\ref{sec:pb} formalizes our mapping and
+tolerance mechanisms. Section~\ref{sec:pb} formalizes our mapping and
fault tolerance problems and quotes existing issues to address
them. Section~\ref{sec:maheve} describes the new mapping strategy we
-propose, MAHEVE. In section~\ref{sec:expe} we present the experiments
-we have conducted on the Grid'5000 testbed with more than 400
-computing cores. Finally, we give some concluding remarks and plan our
-future work in section~\ref{sec:conclu}.
+propose, MAHEVE. In Section~\ref{sec:expe} we present the experiments
+we conducted on the Grid'5000 testbed with more than 400 computing
+cores. Finally, we give some concluding remarks and plan our future
+work in Section~\ref{sec:conclu}.
\section{JaceP2P-V2}
supervisors paradigm, is composed of three main entities:
\begin{itemize}
-\item The ``super-nodes'', which is in charge of supervising free
- computing nodes connected to the platform.
+\item The ``super-nodes'', which are in charge of supervising free
+ computing nodes connected to the platform;
\item The ``spawner'', which is launched by a user wanting to execute
a parallel application. It is in charge of a group of computing
- nodes and monitor them. If one of them fails, it requires a
- replacing one to a super-node.
+ nodes and monitors them. If one of them fails, it requires a
+ replacing one to a super-node;
\item The ``daemon'', when launched, connects to a super-node and
waits for a task to execute. Each daemon can communicate directly
To be able to execute asynchronous iterative applications, JaceP2P-V2
-has an asynchronous messaging mechanism. In order to resist daemons'
+has an asynchronous messaging mechanism. In order to resist daemons
failures, it implements a distributed backup mechanism called the
\textit{uncoordinated distributed checkpointing}\cite{uncoord_cp}.
This decentralized procedure allows the platform to be very scalable,
with no weak points and does not require a secure nor a stable station
for backups. When a daemon dies, it is replaced by another one. Here
we suppose that we have enough available free nodes. Moreover, to
-resit to supervisors failures and to be scalable, it reserves some
-extra nodes. For more details on the JaceP2P-V2 platform, interested
-readers can refer to \cite{jaceP2P-v2}.
+resist supervisors failures and scalability, it reserves some extra
+nodes. For more details on the JaceP2P-V2 platform, interested readers
+can refer to \cite{jaceP2P-v2}.
\section{Mapping and fault tolerance problems}
without precedence nor synchronization. During an iteration, each task
computes its job and sends its results to its neighbors, and
immediately starts the next iteration. The TIG\cite{tig1}
-(\textit{Task Interaction Graph}) is the most appropriate model to our
+(\textit{Task Interaction Graph}) model is the most appropriate to our
problem, as it only models relationships between tasks. In this model,
all the tasks are considered simultaneously executable and
communications can take place at any time during the computation, with
\mbox{$CC : E \rightarrow \mathbb{R}^+$} gives the communication cost
for message passing on edges. We define \mbox{$|V| = v$, $EC(V_i) =
e_i$} and \mbox{$CC(V_i,V_j) = c_{ij}$}. Another function
-\mbox{$D : V \rightarrow \mathbb{R}^+$} gives the amount of
+\mbox{$D : V \rightarrow \mathbb{N}^+$} gives the amount of
dependencies of a task, and we define \mbox{$D(V_i) = d_i$}.
$|N|$ vertices and $L$ is the set of $|L|$ undirectional edges. The
vertices represent the computing nodes and the edges represent the
links between them. An edge \mbox{$L_i \in L$} is an unordered pair
-\mbox{$(N_x,N_y) \in \mathbb{N}$}, representing a communication link
+\mbox{$(N_x,N_y) \in N$}, representing a communication link
between nodes $N_x$ and $N_y$. A function \mbox{$WN : N \rightarrow
\mathbb{R}^+$} gives the computational power of nodes and another
function \mbox{$WL : L \rightarrow \mathbb{R}^+$} gives the
\mbox{$WL(L_i,L_j) = wl_{ij}$}. Let be $|C|$ the number of clusters
contained in the architecture. A function \mbox{$CN : C \rightarrow
\mathbb{N}^+$} gives the amount of computing nodes contained in a
-cluster, and another function \mbox{$CF : C \rightarrow \mathbb{R}^+$}
-gives the amount of available computing nodes (not involve in an
+cluster, and another function \mbox{$CF : C \rightarrow \mathbb{N}^+$}
+gives the amount of available computing nodes (not involved in an
application computation) of a cluster. We define \mbox{$CN(C_i) =
C_{Ni}$} and \mbox{$CF(C_i) = C_{Fi}$}. We also define \mbox{$C_{Pi}
= \sum_{j=1}^{C_{Ni}}{wn_j}$} as the whole computation power of
-cluster $C_i$ and \mbox{$C_{\overline{P}i} = \frac{C_{Pi}}{C_{Ni}}$}
+cluster $C_i$, \mbox{$C_{\overline{P}i} = \frac{C_{Pi}}{C_{Ni}}$}
as the average computation power of cluster $C_i$, and
$C_{\overline{P}fi}$ the average power of its available resources.
noted $hd$, by using the \textit{relative standard deviation} method,
with $hd = \frac{\sigma_{PN}}{avg_{PN}}$ where $avg_{PN}$ is the
average computing power of nodes and $\sigma_{PN}$ represents the
-standard deviation of computing nodes power. This measure provides us
+standard deviation of computing nodes power. This measure provides us
the coefficient of variation of the platform in percentage -- we only
consider \mbox{$0 \leq hd \leq 1$} as considering values of \mbox{$hd
> 1$} is not relevant, as \mbox{$hd = 1$} denotes a fully
as the execution time of the slowest task. Indeed, an application ends
when all the tasks have detected convergence and reached the desired
approximation of the solution. We define $ET(App) = \max_{i=1 \dots
- v} ( ET(V_i) )$ where the execution time of each task $i$
-\mbox{($i=1 \dots v$)}, $ET(V_i)$ is given by $ET(V_i) =
+ v} ( ET(V_i) )$, where the execution time of each task $i$
+\mbox{($i=1 \dots v$)}, $ET(V_i)$, is given by $ET(V_i) =
\frac{e_i}{wn_i} + \sum_{j \in J} c_{ij} \times wl_{ij}$ where $e_i$
is the computational cost of $V_i$, $wn_i$ is the computational power
of the node $N_i$ on which $V_i$ is mapped, $J$ represents the
neighbors set of $V_i$, $c_{ij}$ is the amount of communications
between $V_i$ and $V_j$, and $wl_{ij}$ is the link latency between the
-computing nodes on which are mapped $V_i$ and $V_j$. As described in
+computing nodes on which $V_i$ and $V_j$ are mapped. As described in
this formula, the execution time of a task depends on the task weight
-and communications it may occur with its neighbors. We underline here
-that in the AIAC model, it is impossible to predict the number of
-iterations of a task. So it is difficult to evaluate a priori its cost
-$e_i$.
+and on the communications which may occur between this task and its
+neighbors. We underline here that in the AIAC model, it is impossible
+to predict the number of iterations of a task. So it is difficult to
+evaluate a priori its cost $e_i$.
+
This tasks mapping problem is similar to the classical graph
partitioning and task assignment problem, and is thus NP-complete.
\label{sec:pbft}
In volatile environments, computing nodes can disconnect at any time
-during the computation, and these ones have to efficiently be replaced.
+during the computation, and have thus to be efficiently replaced.
The replacing nodes should be the best ones at the fault time,
according to the chosen mapping algorithm, by searching them in
available nodes. As executing environments can regularly evolve, due
-to computing nodes' volatility, a mapping algorithm has to keep a
-right overview of the architecture, in real time. Thus, criteria to
+to computing nodes volatility, a mapping algorithm has to keep a
+correct overview of the architecture, in real time. Thus, criteria to
assign tasks to nodes should evolve too.
An important point to take into consideration is that we do not allow
the execution of multiple tasks on the same computing node, as this
-provides a fall of performance when this one fails. Indeed we should
+provides a fall of performances when this one fails. Indeed we should
redeploy all of the tasks from this node to another one, using last
saves, which can be spread on multiple computing nodes. This may
-result in large communication overheads and a waste of computation
-time. Nevertheless, to benefit of multi-cores processors, we use a
-task level parallelism by running multi-threaded sequential solver for
+result in large communication overheads and in a waste of computation
+time. Nevertheless, to benefit from multi-cores processors, we use a
+task level parallelism by multi-threaded sequential solvers for
example.
environment is fault tolerant and tasks save checkpoints on their
neighbors, it is more efficient to save on near nodes than on far ones
in order to reduce the communication overhead during this operation,
-and to faster restart a task.
+and to restart a task faster.
\subsection{Related work}
between tasks. Here we can cite Metis\cite{metis} and
Chaco\cite{chaco} which are libraries containing such kind of
algorithms. The second category is the \textit{Execution time
- optimization} class, which aims to minimize the whole application's
+ optimization} class, which aims at minimizing the whole application
execution time. These algorithms look for nodes which can provide the
smallest execution time of tasks using their computational power. Here
we can cite QM\cite{qm_these}, FastMap\cite{fastmap}, and
All mentioned algorithms do not tackle the computing nodes failures
-issue, or basically by applying the same policy. As explain in section
-\ref{sec:pbft}, a more efficient and dedicated replacement function is
-needed. Nevertheless, to the best of our knowledge, no tasks mapping
-algorithm, addressing explicitly both the executing platform
-heterogeneity and the computing nodes failures issues, exists.
+issue, or only basically by applying the same policy. As explained in
+Section \ref{sec:pbft}, a more efficient and dedicated replacement
+function is needed. Nevertheless, to the best of our knowledge, no
+tasks mapping algorithm, addressing explicitly both the executing
+platform heterogeneity and the computing nodes failures issues,
+exists.
\vspace{-0.25cm}
Here we present our new tasks mapping strategy, called MAHEVE (for
\textit{Mapping Algorithm for HEterogeneous and Volatile
- Environments}). This algorithm aims to take the best part of each
-category mentioned in section \ref{sec:pbrw}, the edge-cuts
-minimization and the application's execution time optimization
+ Environments}). This algorithm aims at taking the best part of each
+category mentioned in Section \ref{sec:pbrw}, the edge-cuts
+minimization and the application execution time optimization
algorithms.
This new algorithm can be divided into two parts. The first part aims
-to perform the initial mapping, and the second part is devoted to
-search replacing nodes when computing nodes' failures occur.
+at performing the initial mapping, and the second part is devoted to
+search replacing nodes when computing nodes failures occur.
\vspace{-0.20cm}
\subsubsection{Sorting clusters}
\label{sec:maheve_sort_clusters}
-The first step of the initial mapping is to sort clusters accordingly
-to the executing platform's heterogeneity degree $hd$. The main
-principles are that a cluster obtain a better mark when $hd < 0.5$ and
-it contains more computing nodes than other clusters ($C_{Fi}$, the
-amount of available free nodes, is privileged), and when $hd \ge 0.5$
+The first step of the initial mapping is to sort clusters according to
+the executing platform heterogeneity degree $hd$. The main principles
+are that a cluster obtains a better mark when $hd < 0.5$ and it
+contains more computing nodes than other clusters ($C_{Fi}$, the
+number of available free nodes, is privileged), and when $hd \ge 0.5$
and it contains more powerful computing nodes ($C_{\overline{P}fi}$,
-the average computation power, is privileged). These choices come from
-several experiments with the AIAC model, which show that in such
+the average free computation power, is privileged). These choices come
+from several experiments with the AIAC model, which show that in such
environments it is more efficient to privilege the computation power
-or the amount of nodes. As the amount of nodes, $C_{Fi}$, and the
-average computing power, $C_{\overline{P}fi}$, are not in the order of
-magnitude, we normalize them with two functions, $normN$ and
-$normP$. They are defined as \mbox{$normN\left(C_{Fi}\right) = C_{Fi}
- \times 100 \div \sum_{j=1}^{|C|}C_{Fj}$}, which is the rate (in
-percent) of computing nodes, and \mbox{$normP(C_{\overline{P}fi}) =
+or the number of nodes. As the number of nodes, $C_{Fi}$, and the
+average free computing power, $C_{\overline{P}fi}$, are not in the
+same order of magnitude, we normalize them with two functions, $normN$
+and $normP$. They are defined with \mbox{$normN\left(C_{Fi}\right) =
+ C_{Fi} \times 100 \div \sum_{j=1}^{|C|}C_{Fj}$}, which is the rate
+of computing nodes, and \mbox{$normP(C_{\overline{P}fi}) =
C_{\overline{P}fi} \times 100 \div \sum_{j=1}^{|C|}
- C_{\overline{P}fj}$}, which is the rate (in percent) of the average
-power, both representing the cluster in the architecture. We note
-$NC_{Fi} = normN\left( C_{Fi} \right)$ and $NC_{\overline{P}fi} =
-normP(C_{\overline{P}fi})$.
+ C_{\overline{P}fj}$}, which is the rate of the average power, both
+representing the cluster in the architecture. We note
+$normN\left( C_{Fi} \right) = NC_{Fi}$ and $normP(C_{\overline{P}fi})
+=NC_{\overline{P}fi}$.
The formula used to give a mark, $M_i$, to a cluster is
\subsubsection{Sorting tasks}
\label{sec:maheve_sort_tasks}
-Like clusters, tasks are also sorted accordingly to the heterogeneity
-degree of the executing platform. This sort is done in the same way as
+Like clusters, tasks are also sorted according to the heterogeneity
+degree of the executing platform. This sorting is done in the same way as
previously, as when $hd < 0.5$ tasks with higher dependencies will be
privileged, and when $hd \ge 0.5$ tasks with higher computing cost are
-privileged, in order to be executed on highest powered computing
-nodes.
+privileged.
+%, in order to be executed on the highest powered computing nodes.
The main function used to classified tasks is
\begin{equation}
\label{eq:tasks}
- Q_i = e_i^{hd} \times d_i^{1 - hd}
+ Q_i = {e_i}^{hd} \times {d_i}^{1 - hd}
\end{equation}
where $Q_i$ is the evaluation of the task $i$ according to the
Tasks are taken in the order of the first sort, determined with
equation (\ref{eq:tasks}), and each task is placed in a new list (the
final one) and some of its dependencies are added. We note $Nb_i =
-d_i^{1 - hd}$ this amount of dependencies as the lower the
+{d_i}^{1 - hd}$ this amount of dependencies as the lower the
heterogeneity degree is the higher this number will be. This final
operation allows to control the necessary locality of tasks according
-to the heterogeneity degree of the platform.
+to $hd$. %the heterogeneity degree of the platform.
\subsubsection{Mapping method}
\label{sec:maheve_mapping_method}
The third step of the initial mapping is to allocate tasks to
-computing nodes. As clusters and tasks have been sorted accordingly to
-the executing platform's heterogeneity degree, ordered from the
-highest mark to the lower, this function maps tasks on each available
+nodes. As clusters and tasks have been sorted accordingly to the
+executing platform heterogeneity degree, ordered from the highest
+mark to the lowest, this function maps tasks on each available
computing nodes of clusters, in their respective order in lists (for
-example task classified first in the tasks list is mapped on an
-available node of the cluster classified first in clusters list).
-
-The idea here is not to fulfill each cluster, but preserve some
+example a task classified first in the tasks list is mapped on an
+available node of the cluster classified first in the clusters list).
+The idea here is not to fulfill each cluster, but to preserve some
computing nodes in each cluster. These conserved nodes will be used to
-replace failed nodes. The fact of considering in the initial mapping
-the fault tolerance is a new approach in mapping algorithms.
+replace failed nodes.
+
+
+Here we can mentioned that the whole mapping process (the three steps)
+has a complexity of $O( |V|^2 )$, where |V| is the number of tasks.
\subsection{Replacing function}
replacement should reduce the overhead on the application execution
time due to the loss of computing steps and data.
-As we have shown in the previous section, during the initial mapping
-some computing nodes in each cluster have been preserved for fault
-tolerance. When a node fails this function replace it by a free node
-of the same cluster. If none is available this function sorts again
-clusters, to take into consideration platform's modifications, and
-replace the failed node by one available in the new sorted clusters
-list. This mechanism allows to keep tasks' locality and a real time
-overview of the executing platform.
+As shown in the previous section, during the initial mapping some
+computing nodes in each cluster have been preserved. When a node fails
+this function replace it by a free node of the same cluster. If none
+is available this function sorts again clusters, to take into
+consideration platform modifications, and replace the failed node by
+one available in the new sorted clusters list. This mechanism allows
+to keep tasks locality and a real time overview of the executing
+platform.
\section{Experimentation}
We used the ``Kernel CG'' application of the NAS Parallel Benchmarks
(NPB) \cite{nas} to evaluate the performances of our new mapping
algorithm. This benchmark is designed to be used on large
-architectures, because it stresses communications over latency
-networks, by processing unstructured matrix vector multiplication with
-a Conjugate Gradient method. As this method cannot be executed with
-the asynchronous iteration model we have replaced it by another method
-called the multisplitting method. This latter supports the
-asynchronous iterative model. For more details about this method,
-interested readers are invited to see \cite{book_raph}. The chosen
-problem used a matrix of size $5,000,000$ with a low bandwidth, fixed
-to $35,000$. This bandwidth size generates, according to the problem's
-size, between 8 and 20 neighbors per tasks. This application was
-executed on 64 computing nodes.
+architectures, as it stresses communications%over latency networks
+, by processing unstructured matrix vector multiplication with a
+Conjugate Gradient method. As this method cannot be executed with the
+asynchronous iteration model we have replaced it by another method
+called the multisplitting method, which supports the asynchronous
+iterative model. More details about this method can be found in
+\cite{book_raph}. The chosen problem used a matrix of size $5,000,000$
+with a low bandwidth, fixed to $35,000$. This bandwidth size
+generates, according to the problem size, between 8 and 20 neighbors
+per tasks. This application was executed on 64 nodes.
The platform used for our tests, called Grid’5000\cite{g5k}, is a
-French nationwide experimental set of clusters which provides us
+French nationwide experimental set of clusters which provides us with
distributed clusters architectures (28 heterogeneous clusters spread
-over 9 sites). We used three distributed clusters architectures on
-the Grid'5000 testbed, each having a different heterogeneity
-degree. The first one was composed of four clusters spread over four
-sites, with a total of 106 computing nodes representing 424 computing
-cores with \mbox{$hd = 0.08$}; the second one was composed of four
-clusters spread over three sites, with a total of 110 computing nodes
-representing 440 computing cores with \mbox{$hd = 0.50$}; and finally
-the third one was composed of five clusters spread over four sites
-with 115 computing nodes representing 620 computing cores with
-\mbox{$hd = 0.72$}.
+over 9 sites). We used three distributed clusters architectures, each
+having a different heterogeneity degree. The first one was composed of
+four clusters spread over four sites, with a total of 106 computing
+nodes representing 424 computing cores with \mbox{$hd = 0.08$}; the
+second one was composed of four clusters spread over three sites, with
+a total of 110 computing nodes representing 440 computing cores with
+\mbox{$hd = 0.50$}; and finally the third one was composed of five
+clusters spread over four sites with 115 computing nodes representing
+620 computing cores with \mbox{$hd = 0.72$}.
All computing nodes of these clusters have at least 4 computing cores
(in the last used architecture, with \mbox{$hd = 0.72$}, two clusters
are composed of 8 computing cores machines) with a minimum of 4GB of
-memory (in order to execute the application with a big problem's
+memory (in order to execute the application with a big problem
size). All computing nodes can communicate with each other through an
efficient network. Nevertheless, this latter is shared with many other
users so high latencies appear during executions.
During executions, we introduced two failures in computing nodes
involved in the computation every 20 seconds to simulate a volatile
-environment. Unfortunately, we had not the opportunity to realize
+environment. Unfortunately, we did not have the opportunity to realize
experiments with more computing nodes over more sites with problems of
larger sizes, but we plan to extend our experiments in the future.
presented in \cite{pdsec10}. Table \ref{tab:results} shows the
execution times of each mapping algorithm compared to the default
mapping strategy of the JaceP2P-V2 platform, with the corresponding
-gains on application's execution time, in brackets.
+gains on application execution time, given in brackets. It presents
+both the executions with faults (WF) and the fault free (FF)
+executions.
-\renewcommand{\arraystretch}{1.7}
+\renewcommand{\arraystretch}{1.6}
\begin{table}[h!]
\centering
- \begin{tabular}{|c|c|c|c|c|}
+ \begin{tabular}{|c|c|c|c|c|c|c|c|c|}
\hline
- ~$ hd$~ & ~Default~ & ~FT-AIAC-QM~ & ~FT-FEC~ & ~MAHEVE~ \\
+ \multirow{2}{*}{ ~$hd$~ }&\multicolumn{2}{c|}{ ~Default~ }&\multicolumn{2}{c|}{ ~FT-AIAC-QM~ }&\multicolumn{2}{c|}{ ~FT-FEC~ }&\multicolumn{2}{c|}{ ~MAHEVE~ }\\
+ \cline{2-9}
+
+ & ~FF~ & ~WF~ & ~FF~ & ~WF~ & ~FF~ & ~WF~ & ~FF~ & ~WF~ \\
\hline
- ~$0.08$~&229s&178s (22\%)&154s (33\%)&113s (50\%)\\
- ~$0.50$~&242s&118s (51\%)&133s (45\%)&85s (65\%)\\
- ~$0.72$~&192s&99s (45\%)&121s (33\%)&86s (53\%)\\
+ ~$0.08$~ & ~80~ & ~229~ & ~63 (21\%)~ & ~178 (22\%)~ & ~61 (23\%)~ & ~154
+ (33\%)~ & ~60 (25\%)~ & ~113 (50\%)~ \\
+
+ ~$0.50$~ & ~67~ & ~242~ & ~61 (9\%)~ & ~118 (51\%)~ & ~63 (6\%)~ & ~133
+ (45\%)~ & ~54 (20\%)~ & ~85 (65\%)~ \\
+
+ ~$0.72$~ & ~67~ & ~192~ & ~59 (12\%)~ & ~99 (45\%)~ & ~65 (3\%)~ & ~121
+ (33\%)~ & ~52 (22\%)~ & ~86 (53\%)~\\
+
\hline
\end{tabular}
\vspace{0.15cm}
- \caption{Application's execution time in seconds and corresponding gains on various
- platforms using different mapping algorithms with 2 computing
- nodes' failures each 20 seconds}
+ \caption{Application execution time in seconds and corresponding gains on various
+ platforms using different mapping algorithms, with fault free (FF) executions
+ and with 2 node failures each 20 seconds (WF) executions.}
\label{tab:results}
\vspace{-0.7cm}
\end{table}
First of all, we can note that all mapping algorithms provide an
-enhancement of the application's performances by considerably reducing
-its execution time, with an average gain about $45\%$ in general in
-comparison to the default policy. As shown in \cite{pdsec10}, FT-FEC
-is efficient on architectures with a low heterogeneity degree
-(\mbox{$hd = 0.08$} by providing gains about $33\%$, and gains are
-seemly the same on heterogeneous architectures (\mbox{$hd =
+enhancement of the application performances by considerably reducing
+its execution time, especially for executions with node failures, with
+an average gain of about $45\%$ in general in comparison to the
+default policy. If we focus on executions with node failures (WF),
+FT-FEC is efficient on architectures with a low heterogeneity degree
+(\mbox{$hd = 0.08$}) by providing gains of about $33\%$, and gains are
+roughly the same on heterogeneous architectures (\mbox{$hd =
0.72$}). FT-AIAC-QM is efficient on architectures with a high
-heterogeneity degree (\mbox{$hd = 0.72$}) by providing gains about
+heterogeneity degree (\mbox{$hd = 0.72$}) by providing gains of about
$45\%$, whereas it is not so efficient on homogeneous architectures
-(\mbox{$hd = 0.08$}) by providing gains about $22\%$. We can note here
-that on an architecture with a heterogeneity degree of $0.50$
+(\mbox{$hd = 0.08$}) by providing gains of about $22\%$. We can note
+here that on an architecture with a heterogeneity degree of $0.50$
FT-AIAC-QM is more efficient than FT-FEC by providing gains up to
-$50\%$.
+$50\%$. Here we point out that in fault free executions (FF), both
+algorithms also provide gains on their respective favorite
+architectures, though gains are less great than in executions with
+faults (WF).
+
-Now if we look at the performances of our new solution, MAHEVE, we can
+Now if we focus on the performances of our new solution MAHEVE, we can
see that it is all the time better than other algorithms. As can be
-seen in \mbox{Table \ref{tab:results}}, it reduces the application's
-execution time by about $50\%$ on homogeneous architectures (here of
-$0.08$ heterogeneity degree) what is more than 25 point better than
-FT-FEC and near 30 points better than FT-AIAC-QM. On heterogeneous
-architectures (here of $0.72$ heterogeneity degree) it also
-outperforms other mapping algorithms by reducing the application's
-execution time by about $53\%$ what is near about 10 points better
-than FT-AIAC-QM and 20 points better than FT-FEC. On middle
-heterogeneity degree architectures (here of $0.50$ heterogeneity
-degree), MAHEVE is another one time better than its two comparative
-mapping algorithms by reducing the application's execution time by
-about $53\%$. These good performances come from the fact that it is
-designed to be efficient on both architectures, homogeneous and
+seen in \mbox{Table \ref{tab:results}}, in executions with faults
+(WF), it reduces the application's execution time by about $50\%$ on
+homogeneous architectures (here of $0.08$ heterogeneity degree) which
+is more than 25 point better than FT-FEC and near 30 points better
+than FT-AIAC-QM. On heterogeneous architectures (here of $0.72$
+heterogeneity degree) it also outperforms other mapping algorithms by
+reducing the application execution time by about $53\%$ which is
+almost 10 points better than FT-AIAC-QM and 20 points better than
+FT-FEC. On middle heterogeneity degree architectures (here of $0.50$
+heterogeneity degree), MAHEVE is once again better than its two
+comparative mapping algorithms by reducing the application execution
+time by about $53\%$. These good performances come from the fact that
+it is designed to be efficient on both architectures, homogeneous and
heterogeneous. Moreover, as it integrates a fault tolerance
\textit{security} in the initial mapping, it is more efficient when
computing nodes fail. Here we can point out that this algorithm allows
-in general gains on application's execution time about $55\%$.
+in general gains on application execution time of about $55\%$. In fault free executions (FF), it outperforms once again
+the two other algorithms.
\section{Conclusion and future works}
\label{sec:conclu}
-In this paper we have presented a novel mapping algorithm, called
+In this paper we have presented a new mapping algorithm, called
MAHEVE, to address the AIAC mapping issue on heterogeneous and
-volatile environments. It aims to do an efficient mapping of tasks on
-distributed clusters architectures by taking the best part of the two
-known approaches, application's execution time optimization and
+volatile environments. It aims at doing an efficient mapping of tasks
+on distributed clusters architectures by taking the best part of the
+two known approaches, application execution time optimization and
edge-cuts minimization. Experiments show that it is the most efficient
-mapping algorithm on all kinds of architectures, as it takes care
-about their heterogeneity degree and adapt its sort methods to it. We
-have shown that it is all the time better than the two other
+mapping algorithm on all kinds of architectures, as it takes into
+account their heterogeneity degree and adapt its sort methods to
+it. We have shown that it is all the time better than the two other
comparative mapping algorithms, FT-AIAC-QM and FT-FEC. This can be
explained by the fact that it not only takes care about computing
nodes and clusters, but also about the tasks' properties, what refines
the mapping solution.
+
In our future works we plan to enhance the MAHEVE algorithm
-performances by modifying the notation of clusters as their locality
-is now not taken into consideration, in order to favor tasks locality,
-which will reduce communications delays providing a better convergence
-rate.
+performances by modifying the notation of clusters, since their
+locality has not yet been taken into consideration. This would favor
+tasks locality, which would reduce communications delays and provide a
+much better convergence rate.
%%%