X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/loba-papers.git/blobdiff_plain/e3080a36717b6b9710daaec8c4d1e529b19c3176..9ab6d3b1551a28b58f5df7ea6346f08eafb814e9:/supercomp11/supercomp11.tex?ds=sidebyside diff --git a/supercomp11/supercomp11.tex b/supercomp11/supercomp11.tex index 7daaa52..b321468 100644 --- a/supercomp11/supercomp11.tex +++ b/supercomp11/supercomp11.tex @@ -1,25 +1,43 @@ - \documentclass[smallextended]{svjour3} +\usepackage[utf8]{inputenc} +\usepackage[T1]{fontenc} +\usepackage{mathptmx} +\usepackage{amsmath} +\usepackage{courier} \usepackage{graphicx} +\usepackage{url} +\usepackage[ruled,lined]{algorithm2e} + +\newcommand{\abs}[1]{\lvert#1\rvert} % \abs{x} -> |x| + +\newenvironment{algodata}{% + \begin{tabular}[t]{@{}l@{:~}l@{}}}{% + \end{tabular}} + +\newcommand{\FIXMEmargin}[1]{% + \marginpar{\textbf{[FIXME]} {\footnotesize #1}}} +\newcommand{\FIXME}[2][]{% + \ifx #2\relax\relax \FIXMEmargin{#1}% + \else \textbf{$\triangleright$\FIXMEmargin{#1}~#2}\fi} + +\newcommand{\VAR}[1]{\textit{#1}} \begin{document} -\title{Best effort strategy and virtual load for asynchronous iterative load balancing} +\title{Best effort strategy and virtual load + for asynchronous iterative load balancing} \author{Raphaël Couturier \and - Arnaud Giersch \and - Abderrahmane Sider + Arnaud Giersch } -\institute{F. Author \at - first address \\ - Tel.: +123-45-678910\\ - Fax: +123-45-678910\\ - \email{fauthor@example.com} % \\ -% \emph{Present address:} of F. Author % if needed - \and - S. Author \at - second address +\institute{R. Couturier \and A. Giersch \at + FEMTO-ST, University of Franche-Comté, Belfort, France \\ + % Tel.: +123-45-678910\\ + % Fax: +123-45-678910\\ + \email{% + raphael.couturier@femto-st.fr, + arnaud.giersch@femto-st.fr} } \maketitle @@ -28,32 +46,669 @@ \begin{abstract} Most of the time, asynchronous load balancing algorithms have extensively been -studied in a theoretical point of view. The Bertsekas' algorithm is certainly -the most well known algorithm for which the convergence proof is given. From a -practical point of view, when a node wants to balance a part of its load to some -of its neighbors, the strategy is not described. In this paper, we propose a -strategy called \texttt{best effort} which tries to balance the load of a node -to all its less loaded neighbors while ensuring that all the nodes concerned by -the load balancing phase have the same amount of load. Moreover, asynchronous -iterative algorithms in which an asynchronous load balancing algorithm is -implemented most of the time can dissociate messages concerning load transfers -and message concerning load information. In order to increase the converge of a -load balancing algorithm, we propose a simple heuristic called \texttt{virtual - load} which allows a node that receives an load information message to -integrate the load that it will receive latter in its load (virtually) and -consequently sends a (real) part of its load to some of its neighbors. In order -to validate our approaches, we have defined a simulator based on SimGrid which -allowed us to conduct many experiments. +studied in a theoretical point of view. The Bertsekas and Tsitsiklis' +algorithm~\cite[section~7.4]{bertsekas+tsitsiklis.1997.parallel} +is certainly the most well known algorithm for which the convergence proof is +given. From a practical point of view, when a node wants to balance a part of +its load to some of its neighbors, the strategy is not described. In this +paper, we propose a strategy called \emph{best effort} which tries to balance +the load of a node to all its less loaded neighbors while ensuring that all the +nodes concerned by the load balancing phase have the same amount of load. +Moreover, asynchronous iterative algorithms in which an asynchronous load +balancing algorithm is implemented most of the time can dissociate messages +concerning load transfers and message concerning load information. In order to +increase the converge of a load balancing algorithm, we propose a simple +heuristic called \emph{virtual load} which allows a node that receives a load +information message to integrate the load that it will receive later in its +load (virtually) and consequently sends a (real) part of its load to some of its +neighbors. In order to validate our approaches, we have defined a simulator +based on SimGrid which allowed us to conduct many experiments. \end{abstract} +\section{Introduction} + +Load balancing algorithms are extensively used in parallel and distributed +applications in order to reduce the execution times. They can be applied in +different scientific fields from high performance computation to micro sensor +networks. They are iterative by nature. In literature many kinds of load +balancing algorithms have been studied. They can be classified according +different criteria: centralized or decentralized, in static or dynamic +environment, with homogeneous or heterogeneous load, using synchronous or +asynchronous iterations, with a static topology or a dynamic one which evolves +during time. In this work, we focus on asynchronous load balancing algorithms +where computer nodes are considered homogeneous and with homogeneous load with +no external load. In this context, Bertsekas and Tsitsiklis have proposed an +algorithm which is definitively a reference for many works. In their work, they +proved that under classical hypotheses of asynchronous iterative algorithms and +a special constraint avoiding \emph{ping-pong} effect, an asynchronous +iterative algorithm converge to the uniform load distribution. This work has +been extended by many authors. For example, Cortés et al., with +DASUD~\cite{cortes+ripoll+cedo+al.2002.asynchronous}, propose a +version working with integer load. This work was later generalized by +the same authors in \cite{cedo+cortes+ripoll+al.2007.convergence}. +\FIXME{Rajouter des choses ici. Lesquelles ?} + +Although the Bertsekas and Tsitsiklis' algorithm describes the condition to +ensure the convergence, there is no indication or strategy to really implement +the load distribution. In other word, a node can send a part of its load to one +or many of its neighbors while all the convergence conditions are +followed. Consequently, we propose a new strategy called \emph{best effort} +that tries to balance the load of a node to all its less loaded neighbors while +ensuring that all the nodes concerned by the load balancing phase have the same +amount of load. Moreover, when real asynchronous applications are considered, +using asynchronous load balancing algorithms can reduce the execution +times. Most of the times, it is simpler to distinguish load information messages +from data migration messages. Former ones allows a node to inform its +neighbors of its current load. These messages are very small, they can be sent +quite often. For example, if an computing iteration takes a significant times +(ranging from seconds to minutes), it is possible to send a new load information +message at each neighbor at each iteration. Latter messages contains data that +migrates from one node to another one. Depending on the application, it may have +sense or not that nodes try to balance a part of their load at each computing +iteration. But the time to transfer a load message from a node to another one is +often much more longer that to time to transfer a load information message. So, +when a node receives the information that later it will receive a data message, +it can take this information into account and it can consider that its new load +is larger. Consequently, it can send a part of it real load to some of its +neighbors if required. We call this trick the \emph{virtual load} mechanism. + + + +So, in this work, we propose a new strategy for improving the distribution of +the load and a simple but efficient trick that also improves the load +balancing. Moreover, we have conducted many simulations with SimGrid in order to +validate our improvements are really efficient. Our simulations consider that in +order to send a message, a latency delays the sending and according to the +network performance and the message size, the time of the reception of the +message also varies. + +In the following of this paper, Section~\ref{BT algo} describes the Bertsekas +and Tsitsiklis' asynchronous load balancing algorithm. Moreover, we present a +possible problem in the convergence conditions. Section~\ref{Best-effort} +presents the best effort strategy which provides an efficient way to reduce the +execution times. This strategy will be compared with other ones, presented in +Section~\ref{Other}. In Section~\ref{Virtual load}, the virtual load mechanism +is proposed. Simulations allowed to show that both our approaches are valid +using a quite realistic model detailed in Section~\ref{Simulations}. Finally we +give a conclusion and some perspectives to this work. + + + +\section{Bertsekas and Tsitsiklis' asynchronous load balancing algorithm} +\label{BT algo} + +In order prove the convergence of asynchronous iterative load balancing +Bertsekas and Tsitsiklis proposed a model +in~\cite{bertsekas+tsitsiklis.1997.parallel}. Here we recall some notations. +Consider that $N={1,...,n}$ processors are connected through a network. +Communication links are represented by a connected undirected graph $G=(N,V)$ +where $V$ is the set of links connecting different processors. In this work, we +consider that processors are homogeneous for sake of simplicity. It is quite +easy to tackle the heterogeneous case~\cite{ElsMonPre02}. Load of processor $i$ +at time $t$ is represented by $x_i(t)\geq 0$. Let $V(i)$ be the set of +neighbors of processor $i$. Each processor $i$ has an estimate of the load of +each of its neighbors $j \in V(i)$ represented by $x_j^i(t)$. According to +asynchronism and communication delays, this estimate may be outdated. We also +consider that the load is described by a continuous variable. + +When a processor send a part of its load to one or some of its neighbors, the +transfer takes time to be completed. Let $s_{ij}(t)$ be the amount of load that +processor $i$ has transferred to processor $j$ at time $t$ and let $r_{ij}(t)$ be the +amount of load received by processor $j$ from processor $i$ at time $t$. Then +the amount of load of processor $i$ at time $t+1$ is given by: +\begin{equation} +x_i(t+1)=x_i(t)-\sum_{j\in V(i)} s_{ij}(t) + \sum_{j\in V(i)} r_{ji}(t) +\label{eq:ping-pong} +\end{equation} + + +Some conditions are required to ensure the convergence. One of them can be +called the \emph{ping-pong} condition which specifies that: +\begin{equation} +x_i(t)-\sum _{k\in V(i)} s_{ik}(t) \geq x_j^i(t)+s_{ij}(t) +\end{equation} +for any processor $i$ and any $j \in V(i)$ such that $x_i(t)>x_j^i(t)$. This +condition aims at avoiding a processor to send a part of its load and being +less loaded after that. + +Nevertheless, we think that this condition may lead to deadlocks in some +cases. For example, if we consider only three processors and that processor $1$ +is linked to processor $2$ which is also linked to processor $3$ (i.e. a simple +chain which 3 processors). Now consider we have the following values at time $t$: +\begin{eqnarray*} +x_1(t)=10 \\ +x_2(t)=100 \\ +x_3(t)=99.99\\ + x_3^2(t)=99.99\\ +\end{eqnarray*} +In this case, processor $2$ can either sends load to processor $1$ or processor +$3$. If it sends load to processor $1$ it will not satisfy condition +(\ref{eq:ping-pong}) because after the sending it will be less loaded that +$x_3^2(t)$. So we consider that the \emph{ping-pong} condition is probably to +strong. Currently, we did not try to make another convergence proof without this +condition or with a weaker condition. + +Nevertheless, we conjecture that such a weaker condition exists. In fact, we +have never seen any scenario that is not leading to convergence, even with +load-balancing strategies that are not exactly fulfilling these two conditions. + +It may be the subject of future work to express weaker conditions, and to prove +that they are sufficient to ensure the convergence of the load-balancing +algorithm. + +\section{Best effort strategy} +\label{Best-effort} + +In this section we describe a new load-balancing strategy that we call +\emph{best effort}. First, we explain the general idea behind this strategy, +and then we describe some variants of this basic strategy. + +\subsection{Basic strategy} + +The general idea behind the \emph{best effort} strategy is that each processor, +that detects it has more load than some of its neighbors, sends some load to the +most of its less loaded neighbors, doing its best to reach the equilibrium +between those neighbors and himself. + +More precisely, when a processor $i$ is in its load-balancing phase, +he proceeds as following. +\begin{enumerate} +\item First, the neighbors are sorted in non-decreasing order of their + known loads $x^i_j(t)$. + +\item Then, this sorted list is traversed in order to find its largest + prefix such as the load of each selected neighbor is lesser than: + \begin{itemize} + \item the processor's own load, and + \item the mean of the loads of the selected neighbors and of the + processor's load. + \end{itemize} + Let's call $S_i(t)$ the set of the selected neighbors, and + $\bar{x}(t)$ the mean of the loads of the selected neighbors and of + the processor load: + \begin{equation*} + \bar{x}(t) = \frac{1}{\abs{S_i(t)} + 1} + \left( x_i(t) + \sum_{j\in S_i(t)} x^i_j(t) \right) + \end{equation*} + The following properties hold: + \begin{equation*} + \begin{cases} + S_i(t) \subset V(i) \\ + x^i_j(t) < x_i(t) & \forall j \in S_i(t) \\ + x^i_j(t) < \bar{x} & \forall j \in S_i(t) \\ + x^i_j(t) \leq x^i_k(t) & \forall j \in S_i(t), \forall k \in V(i) \setminus S_i(t) \\ + \bar{x} \leq x_i(t) + \end{cases} + \end{equation*} + +\item Once this selection is completed, processor $i$ sends to each of + the selected neighbor $j\in S_i(t)$ an amount of load $s_{ij}(t) = + \bar{x} - x^i_j(t)$. + + From the above equations, and notably from the definition of + $\bar{x}$, it can easily be verified that: + \begin{equation*} + \begin{cases} + x_i(t) - \sum_{j\in S_i(t)} s_{ij}(t) = \bar{x} \\ + x^i_j(t) + s_{ij}(t) = \bar{x} & \forall j \in S_i(t) + \end{cases} + \end{equation*} +\end{enumerate} + +\subsection{Leveling the amount to send} + +With the aforementioned basic strategy, each node does its best to reach the +equilibrium with its neighbors. Since each node may be taking the same kind of +decision at the same moment, there is the risk that a node receives load from +several of its neighbors, and then is temporary going off the equilibrium state. +This is particularly true with strongly connected applications. + +In order to reduce this effect, we add the ability to level the amount to send. +The idea, here, is to make smaller steps toward the equilibrium, such that a +potentially wrong decision has a lower impact. + +Concretely, once $s_{ij}$ has been evaluated as before, it is simply divided by +some configurable factor. That's what we named the ``parameter $k$'' in +Section~\ref{Results}. The amount of data to send is then $s_{ij}(t) = (\bar{x} +- x^i_j(t))/k$. +\FIXME[check that it's still named $k$ in Sec.~\ref{Results}]{} + +\section{Other strategies} +\label{Other} + +Another load balancing strategy, working under the same conditions, was +previously developed by Bahi, Giersch, and Makhoul in +\cite{bahi+giersch+makhoul.2008.scalable}. In order to assess the performances +of the new \emph{best effort}, we naturally chose to compare it to this anterior +work. More precisely, we will use the algorithm~2 from +\cite{bahi+giersch+makhoul.2008.scalable} and, in the following, we will +reference it under the name of Makhoul's. + +Here is an outline of the Makhoul's algorithm. When a given node needs to take +a load balancing decision, it starts by sorting its neighbors by increasing +order of their load. Then, it computes the difference between its own load, and +the load of each of its neighbors. Finally, taking the neighbors following the +order defined before, the amount of load to send $s_{ij}$ is computed as +$1/(N+1)$ of the load difference, with $N$ being the number of neighbors. This +process continues as long as the node is more loaded than the considered +neighbor. + + +\section{Virtual load} +\label{Virtual load} + +In this section, we present the concept of \texttt{virtual load}. In order to +use this concept, load balancing messages must be sent using two different kinds +of messages: load information messages and load balancing messages. More +precisely, a node wanting to send a part of its load to one of its neighbors, +can first send a load information message containing the load it will send and +then it can send the load balancing message containing data to be transferred. +Load information message are really short, consequently they will be received +very quickly. In opposition, load balancing messages are often bigger and thus +require more time to be transferred. + +The concept of \texttt{virtual load} allows a node that received a load +information message to integrate the load that it will receive later in its load +(virtually) and consequently send a (real) part of its load to some of its +neighbors. In fact, a node that receives a load information message knows that +later it will receive the corresponding load balancing message containing the +corresponding data. So if this node detects it is too loaded compared to some +of its neighbors and if it has enough load (real load), then it can send more +load to some of its neighbors without waiting the reception of the load +balancing message. + +Doing this, we can expect a faster convergence since nodes have a faster +information of the load they will receive, so they can take in into account. + +\FIXME{Est ce qu'on donne l'algo avec virtual load?} + +\FIXME{describe integer mode} + +\section{Simulations} +\label{Simulations} + +In order to test and validate our approaches, we wrote a simulator +using the SimGrid +framework~\cite{casanova+legrand+quinson.2008.simgrid}. This +simulator, which consists of about 2,700 lines of C++, allows to run +the different load-balancing strategies under various parameters, such +as the initial distribution of load, the interconnection topology, the +characteristics of the running platform, etc. Then several metrics +are issued that permit to compare the strategies. + +The simulation model is detailed in the next section (\ref{Sim + model}), and the experimental contexts are described in +section~\ref{Contexts}. Then the results of the simulations are +presented in section~\ref{Results}. + +\subsection{Simulation model} +\label{Sim model} + +In the simulation model the processors exchange messages which are of +two kinds. First, there are \emph{control messages} which only carry +information that is exchanged between the processors, such as the +current load, or the virtual load transfers if this option is +selected. These messages are rather small, and their size is +constant. Then, there are \emph{data messages} that carry the real +load transferred between the processors. The size of a data message +is a function of the amount of load that it carries, and it can be +pretty large. In order to receive the messages, each processor has +two receiving channels, one for each kind of messages. Finally, when +a message is sent or received, this is done by using the non-blocking +primitives of SimGrid\footnote{That are \texttt{MSG\_task\_isend()}, + and \texttt{MSG\_task\_irecv()}.}. + +During the simulation, each processor concurrently runs three threads: +a \emph{receiving thread}, a \emph{computing thread}, and a +\emph{load-balancing thread}, which we will briefly describe now. + +\paragraph{Receiving thread} The receiving thread is in charge of +waiting for messages to come, either on the control channel, or on the +data channel. Its behavior is sketched by Algorithm~\ref{algo.recv}. +When a message is received, it is pushed in a buffer of +received message, to be later consumed by one of the other threads. +There are two such buffers, one for the control messages, and one for +the data messages. The buffers are implemented with a lock-free FIFO +\cite{sutter.2008.writing} to avoid contention between the threads. + +\begin{algorithm} + \caption{Receiving thread} + \label{algo.recv} + \KwData{ + \begin{algodata} + \VAR{ctrl\_chan}, \VAR{data\_chan} + & communication channels (control and data) \\ + \VAR{ctrl\_fifo}, \VAR{data\_fifo} + & buffers of received messages (control and data) \\ + \end{algodata}} + \While{true}{% + wait for a message to be available on either \VAR{ctrl\_chan}, + or \VAR{data\_chan}\; + \If{a message is available on \VAR{ctrl\_chan}}{% + get the message from \VAR{ctrl\_chan}, and push it into \VAR{ctrl\_fifo}\; + } + \If{a message is available on \VAR{data\_chan}}{% + get the message from \VAR{data\_chan}, and push it into \VAR{data\_fifo}\; + } + } +\end{algorithm} + +\paragraph{Computing thread} The computing thread is in charge of the +real load management. As exposed in Algorithm~\ref{algo.comp}, it +iteratively runs the following operations: +\begin{itemize} +\item if some load was received from the neighbors, get it; +\item if there is some load to send to the neighbors, send it; +\item run some computation, whose duration is function of the current + load of the processor. +\end{itemize} +Practically, after the computation, the computing thread waits for a +small amount of time if the iterations are looping too fast (for +example, when the current load is near zero). + +\begin{algorithm} + \caption{Computing thread} + \label{algo.comp} + \KwData{ + \begin{algodata} + \VAR{data\_fifo} & buffer of received data messages \\ + \VAR{real\_load} & current load \\ + \end{algodata}} + \While{true}{% + \If{\VAR{data\_fifo} is empty and $\VAR{real\_load} = 0$}{% + wait until a message is pushed into \VAR{data\_fifo}\; + } + \While{\VAR{data\_fifo} is not empty}{% + pop a message from \VAR{data\_fifo}\; + get the load embedded in the message, and add it to \VAR{real\_load}\; + } + \ForEach{neighbor $n$}{% + \If{there is some amount of load $a$ to send to $n$}{% + send $a$ units of load to $n$, and subtract it from \VAR{real\_load}\; + } + } + \If{$\VAR{real\_load} > 0.0$}{ + simulate some computation, whose duration is function of \VAR{real\_load}\; + ensure that the main loop does not iterate too fast\; + } + } +\end{algorithm} + +\paragraph{Load-balancing thread} The load-balancing thread is in +charge of running the load-balancing algorithm, and exchange the +control messages. As shown in Algorithm~\ref{algo.lb}, it iteratively +runs the following operations: +\begin{itemize} +\item get the control messages that were received from the neighbors; +\item run the load-balancing algorithm; +\item send control messages to the neighbors, to inform them of the + processor's current load, and possibly of virtual load transfers; +\item wait a minimum (configurable) amount of time, to avoid to + iterate too fast. +\end{itemize} + +\begin{algorithm} + \caption{Load-balancing} + \label{algo.lb} + \While{true}{% + \While{\VAR{ctrl\_fifo} is not empty}{% + pop a message from \VAR{ctrl\_fifo}\; + identify the sender of the message, + and update the current knowledge of its load\; + } + run the load-balancing algorithm to make the decision about load transfers\; + \ForEach{neighbor $n$}{% + send a control messages to $n$\; + } + ensure that the main loop does not iterate too fast\; + } +\end{algorithm} +\paragraph{} +For the sake of simplicity, a few details were voluntary omitted from +these descriptions. For an exhaustive presentation, we refer to the +actual source code that was used for the experiments% +\footnote{As mentioned before, our simulator relies on the SimGrid + framework~\cite{casanova+legrand+quinson.2008.simgrid}. For the + experiments, we used a pre-release of SimGrid 3.7 (Git commit + 67d62fca5bdee96f590c942b50021cdde5ce0c07, available from + \url{https://gforge.inria.fr/scm/?group_id=12})}, and which is +available at +\url{http://info.iut-bm.univ-fcomte.fr/staff/giersch/software/loba.tar.gz}. +\FIXME{ajouter des détails sur la gestion de la charge virtuelle ? +par ex, donner l'idée générale de l'implémentation. l'idée générale est déja décrite en section~\ref{Virtual load}} -qsdqsd +\subsection{Experimental contexts} +\label{Contexts} +In order to assess the performances of our algorithms, we ran our +simulator with various parameters, and extracted several metrics, that +we will describe in this section. +\paragraph{Load balancing strategies} +Several load balancing strategies were compared. We ran the experiments with +the \emph{Best effort}, and with the \emph{Makhoul} strategies. \emph{Best + effort} was tested with parameter $k = 1$, $k = 2$, and $k = 4$. Secondly, +each strategy was run in its two variants: with, and without the management of +\emph{virtual load}. Finally, we tested each configuration with \emph{real}, +and with \emph{integer} load. + +To summarize the different load balancing strategies, we have: +\begin{description} +\item[\textbf{strategies:}] \emph{Makhoul}, or \emph{Best effort} with $k\in + \{1,2,4\}$ +\item[\textbf{variants:}] with, or without virtual load +\item[\textbf{domain:}] real load, or integer load +\end{description} +% +This gives us as many as $4\times 2\times 2 = 16$ different strategies. + +\paragraph{End of the simulation} + +The simulations were run until the load was nearly balanced among the +participating nodes. More precisely the simulation stops when each node holds +an amount of load at less than 1\% of the load average, during an arbitrary +number of computing iterations (2000 in our case). + +Note that this convergence detection was implemented in a centralized manner. +This is easy to do within the simulator, but it's obviously not realistic. In a +real application we would have chosen a decentralized convergence detection +algorithm, like the one described by Bahi, Contassot-Vivier, Couturier, and +Vernier in \cite{10.1109/TPDS.2005.2}. + +\paragraph{Platforms} + +In order to show the behavior of the different strategies in different +settings, we simulated the executions on two sorts of platforms. These two +sorts of platforms differ by their underlaid network topology. On the one hand, +we have homogeneous platforms, modeled as a cluster. On the other hand, we have +heterogeneous platforms, modeled as the interconnection of a number of clusters. + +The clusters were modeled by a fixed number of computing nodes interconnected +through a backbone link. Each computing node has a computing power of +1~GFlop/s, and is connected to the backbone by a network link whose bandwidth is +of 125~MB/s, with a latency of 50~$\mu$s. The backbone has a network bandwidth +of 2.25~GB/s, with a latency of 500~$\mu$s. + +The heterogeneous platform descriptions were created by taking a subset of the +Grid'5000 infrastructure\footnote{Grid'5000 is a French large scale experimental + Grid (see \url{https://www.grid5000.fr/}).}, as described in the platform file +\texttt{g5k.xml} distributed with SimGrid. Note that the heterogeneity of the +platform here only comes from the network topology. Indeed, since our +algorithms currently do not handle heterogeneous computing resources, the +processor speeds were normalized, and we arbitrarily chose to fix them to +1~GFlop/s. + +Then we derived each sort of platform with four different number of computing +nodes: 16, 64, 256, and 1024 nodes. + +\paragraph{Configurations} + +The distributed processes of the application were then logically organized along +three possible topologies: a line, a torus or an hypercube. We ran tests where +the total load was initially on an only node (at one end for the line topology), +and other tests where the load was initially randomly distributed across all the +participating nodes. The total amount of load was fixed to a number of load +units equal to 1000 times the number of node. The average load is then of 1000 +load units. + +For each of the preceding configuration, we finally had to choose the +computation and communication costs of a load unit. We chose them, such as to +have three different computation over communication cost ratios, and hence model +three different kinds of applications: +\begin{itemize} +\item mainly communicating, with a computation/communication cost ratio of $1/10$; +\item mainly computing, with a computation/communication cost ratio of $10/1$ ; +\item balanced, with a computation/communication cost ratio of $1/1$. +\end{itemize} + +To summarize the various configurations, we have: +\begin{description} +\item[\textbf{platforms:}] homogeneous (cluster), or heterogeneous (subset of + Grid'5000) +\item[\textbf{platform sizes:}] platforms with 16, 64, 256, or 1024 nodes +\item[\textbf{process topologies:}] line, torus, or hypercube +\item[\textbf{initial load distribution:}] initially on a only node, or + initially randomly distributed over all nodes +\item[\textbf{computation/communication ratio:}] $10/1$, $1/1$, or $1/10$ +\end{description} +% +This gives us as many as $2\times 4\times 3\times 2\times 3 = 144$ different +configurations. +% +Combined with the various load balancing strategies, we had $16\times 144 = +2304$ distinct settings to evaluate. In fact, as it will be shown later, we +didn't run all the strategies, nor all the configurations for the bigger +platforms with 1024 nodes, since to simulations would have run for a too long +time. + +Anyway, all these the experiments represent more than 240 hours of computing +time. + +\paragraph{Metrics} + +In order to evaluate and compare the different load balancing strategies we had +to define several metrics. Our goal, when choosing these metrics, was to have +something tending to a constant value, i.e. to have a measure which is not +changing anymore once the convergence state is reached. Moreover, we wanted to +have some normalized value, in order to be able to compare them across different +settings. + +With these constraints in mind, we defined the following metrics: +% +\begin{description} +\item[\textbf{average idle time:}] that's the total time spent, when the nodes + don't hold any share of load, and thus have nothing to compute. This total + time is divided by the number of participating nodes, such as to have a number + that can be compared between simulations of different sizes. + + This metric is expected to give an idea of the ability of the strategy to + diffuse the load quickly. A smaller value is better. + +\item[\textbf{average convergence date:}] that's the average of the dates when + all nodes reached the convergence state. The dates are measured as a number + of (simulated) seconds since the beginning of the simulation. + +\item[\textbf{maximum convergence date:}] that's the date when the last node + reached the convergence state. + + These two dates give an idea of the time needed by the strategy to reach the + equilibrium state. A smaller value is better. + +\item[\textbf{data transfer amount:}] that's the sum of the amount of all data + transfers during the simulation. This sum is then normalized by dividing it + by the total amount of data present in the system. + + This metric is expected to give an idea of the efficiency of the strategy in + terms of data movements, i.e. its ability to reach the equilibrium with fewer + transfers. Again, a smaller value is better. + +\end{description} + + +\subsection{Validation of our approaches} +\label{Results} + + +On constate quoi (vérifier avec les chiffres)? +\begin{itemize} +\item cluster ou grid, entier ou réel, ne font pas de grosses différences + +\item bookkeeping? améliore souvent les choses, parfois au prix d'un retard au démarrage + +\item makhoul? se fait battre sur les grosses plateformes + +\item taille de plateforme? + +\item ratio comp/comm? + +\item option $k$? peut-être intéressant sur des plateformes fortement interconnectées (hypercube) + +\item volume de comm? souvent, besteffort/plain en fait plus. pourquoi? + +\item répartition initiale de la charge ? + +\item integer mode sur topo. line n'a jamais fini en plain? vérifier si ce n'est + pas à cause de l'effet d'escalier que bk est capable de gommer. + +\end{itemize} + +\begin{itshape} +On veut montrer quoi ? : +\FIXME{remove that part} + +1) best plus rapide que les autres (simple, makhoul) +2) avantage virtual load + +Est ce qu'on peut trouver des contre exemple? +Topologies variées + + +Simulation avec temps définies assez long et on mesure la qualité avec : volume de calcul effectué, volume de données échangées +Mais aussi simulation avec temps court qui montre que seul best converge + +Expés avec ratio calcul/comm rapide et lent + +Quelques expés avec charge initiale aléatoire plutot que sur le premier proc + +Cadre processeurs homogènes + +Topologies statiques + +On ne tient pas compte de la vitesse des liens donc on la considère homogène + +Prendre un réseau hétérogène et rendre processeur homogène + +Taille : 10 100 très gros +\end{itshape} + +\section{Conclusion and perspectives} + +\FIXME{conclude!} + +\begin{acknowledgements} + Computations have been performed on the supercomputer facilities of + the Mésocentre de calcul de Franche-Comté. +\end{acknowledgements} + +\FIXME{find and add more references} +\bibliographystyle{spmpsci} +\bibliography{biblio} \end{document} + +%%% Local Variables: +%%% mode: latex +%%% TeX-master: t +%%% fill-column: 80 +%%% ispell-local-dictionary: "american" +%%% End: + +% LocalWords: Raphaël Couturier Arnaud Giersch Abderrahmane Sider Franche ij +% LocalWords: Bertsekas Tsitsiklis SimGrid DASUD Comté Béjaïa asynchronism ji +% LocalWords: ik isend irecv Cortés et al chan ctrl fifo Makhoul GFlop xml pre +% LocalWords: FEMTO Makhoul's fca bdee cdde Contassot Vivier underlaid