Fix spacing.

[loba-papers.git] / supercomp11 / supercomp11.tex
diff --git a/supercomp11/supercomp11.tex b/supercomp11/supercomp11.tex

index 7daaa52bed489911e6845de60a1e20e27e21105b..b32146822cb5608d73e5ccb853e13af167297665 100644 (file)
--- a/supercomp11/supercomp11.tex
+++ b/supercomp11/supercomp11.tex
@@ -1,25 +1,43 @@
-
  \documentclass[smallextended]{svjour3}
+\usepackage[utf8]{inputenc}
+\usepackage[T1]{fontenc}
+\usepackage{mathptmx}
+\usepackage{amsmath}
+\usepackage{courier}
  \usepackage{graphicx}
+\usepackage{url}
+\usepackage[ruled,lined]{algorithm2e}
+
+\newcommand{\abs}[1]{\lvert#1\rvert} % \abs{x} -> |x|
+
+\newenvironment{algodata}{%
+  \begin{tabular}[t]{@{}l@{:~}l@{}}}{%
+  \end{tabular}}
+
+\newcommand{\FIXMEmargin}[1]{%
+  \marginpar{\textbf{[FIXME]} {\footnotesize #1}}}
+\newcommand{\FIXME}[2][]{%
+  \ifx #2\relax\relax \FIXMEmargin{#1}%
+  \else \textbf{$\triangleright$\FIXMEmargin{#1}~#2}\fi}
+
+\newcommand{\VAR}[1]{\textit{#1}}
  
  \begin{document}
  
-\title{Best effort strategy and virtual load for asynchronous iterative load balancing}
+\title{Best effort strategy and virtual load
+  for asynchronous iterative load balancing}
  
  \author{Raphaël Couturier \and
-        Arnaud Giersch \and
-        Abderrahmane Sider
+        Arnaud Giersch
  }
  
-\institute{F. Author \at
-              first address \\
-              Tel.: +123-45-678910\\
-              Fax: +123-45-678910\\
-              \email{fauthor@example.com}           %  \\
-%             \emph{Present address:} of F. Author  %  if needed
-           \and
-           S. Author \at
-              second address
+\institute{R. Couturier \and A. Giersch \at
+              FEMTO-ST, University of Franche-Comté, Belfort, France \\
+              % Tel.: +123-45-678910\\
+              % Fax: +123-45-678910\\
+              \email{%
+                raphael.couturier@femto-st.fr,
+                arnaud.giersch@femto-st.fr}
  }
  
  \maketitle
@@ -28,32 +46,669 @@
  \begin{abstract}
  
  Most of the  time, asynchronous load balancing algorithms  have extensively been
-studied in  a theoretical point of  view. The Bertsekas'  algorithm is certainly
-the most well  known algorithm for which the convergence proof  is given. From a
-practical point of view, when a node wants to balance a part of its load to some
-of its  neighbors, the strategy is not  described.  In this paper,  we propose a
-strategy called \texttt{best  effort} which tries to balance the  load of a node
-to all its less loaded neighbors  while ensuring that all the nodes concerned by
-the load balancing  phase have the same amount  of load.  Moreover, asynchronous
-iterative  algorithms  in which  an  asynchronous  load  balancing algorithm  is
-implemented most of  the time can dissociate messages  concerning load transfers
-and message concerning load information.  In order to increase the converge of a
-load balancing  algorithm, we propose a simple  heuristic called \texttt{virtual
-  load}  which  allows a  node  that receives  an  load  information message  to
-integrate  the load  that it  will receive  latter in  its load  (virtually) and
-consequently sends a (real) part of its  load to some of its neighbors. In order
-to validate our  approaches, we have defined a simulator  based on SimGrid which
-allowed us to conduct many experiments.
+studied in a theoretical point  of view. The Bertsekas and Tsitsiklis'
+algorithm~\cite[section~7.4]{bertsekas+tsitsiklis.1997.parallel}
+is certainly  the most well known  algorithm for which the  convergence proof is
+given. From a  practical point of view, when  a node wants to balance  a part of
+its  load to some  of its  neighbors, the  strategy is  not described.   In this
+paper, we propose a strategy  called \emph{best effort} which tries to balance
+the load of a node to all  its less loaded neighbors while ensuring that all the
+nodes  concerned by  the load  balancing  phase have  the same  amount of  load.
+Moreover,  asynchronous  iterative  algorithms  in which  an  asynchronous  load
+balancing  algorithm is  implemented most  of the  time can  dissociate messages
+concerning load transfers and message  concerning load information.  In order to
+increase  the  converge of  a  load balancing  algorithm,  we  propose a  simple
+heuristic called \emph{virtual load} which allows a node that receives a load
+information message  to integrate the  load that it  will receive later  in its
+load (virtually) and consequently sends a (real) part of its load to some of its
+neighbors.  In order to  validate our  approaches, we  have defined  a simulator
+based on SimGrid which allowed us to conduct many experiments.
  
  
  \end{abstract}
  
+\section{Introduction}
+
+Load  balancing algorithms  are  extensively used  in  parallel and  distributed
+applications in  order to  reduce the  execution times. They  can be  applied in
+different scientific  fields from high  performance computation to  micro sensor
+networks.   They are  iterative by  nature.  In  literature many  kinds  of load
+balancing  algorithms  have been  studied.   They  can  be classified  according
+different  criteria:   centralized  or  decentralized,  in   static  or  dynamic
+environment,  with  homogeneous  or  heterogeneous load,  using  synchronous  or
+asynchronous iterations, with  a static topology or a  dynamic one which evolves
+during time.  In  this work, we focus on  asynchronous load balancing algorithms
+where computer nodes  are considered homogeneous and with  homogeneous load with
+no external  load. In  this context, Bertsekas  and Tsitsiklis have  proposed an
+algorithm which is definitively a reference  for many works. In their work, they
+proved that under classical  hypotheses of asynchronous iterative algorithms and
+a  special  constraint   avoiding  \emph{ping-pong}  effect,  an  asynchronous
+iterative algorithm  converge to  the uniform load  distribution. This  work has
+been extended by many authors. For example, Cortés et al., with
+DASUD~\cite{cortes+ripoll+cedo+al.2002.asynchronous}, propose a
+version working with integer load.  This work was later generalized by
+the same authors in \cite{cedo+cortes+ripoll+al.2007.convergence}.
+\FIXME{Rajouter des choses ici.  Lesquelles ?}
+
+Although  the Bertsekas  and Tsitsiklis'  algorithm describes  the  condition to
+ensure the convergence,  there is no indication or  strategy to really implement
+the load distribution. In other word, a node  can send a part of its load to one
+or   many  of   its  neighbors   while  all   the  convergence   conditions  are
+followed. Consequently,  we propose a  new strategy called  \emph{best effort}
+that tries to balance the load of  a node to all its less loaded neighbors while
+ensuring that all the nodes concerned  by the load balancing phase have the same
+amount of  load.  Moreover, when real asynchronous  applications are considered,
+using  asynchronous   load  balancing   algorithms  can  reduce   the  execution
+times. Most of the times, it is simpler to distinguish load information messages
+from  data  migration  messages.  Former  ones  allows  a  node to  inform  its
+neighbors of its  current load. These messages are very small,  they can be sent
+quite often.  For example, if an  computing iteration takes  a significant times
+(ranging from seconds to minutes), it is possible to send a new load information
+message at each  neighbor at each iteration. Latter  messages contains data that
+migrates from one node to another one. Depending on the application, it may have
+sense or not  that nodes try to balance  a part of their load  at each computing
+iteration. But the time to transfer a load message from a node to another one is
+often much more longer that to  time to transfer a load information message. So,
+when a node receives the information  that later it will receive a data message,
+it can take this information into account  and it can consider that its new load
+is larger.   Consequently, it can  send a part  of it real  load to some  of its
+neighbors if required. We call this trick the \emph{virtual load} mechanism.
+
+
+
+So, in  this work, we propose a  new strategy for improving  the distribution of
+the  load  and  a  simple  but  efficient trick  that  also  improves  the  load
+balancing. Moreover, we have conducted  many simulations with SimGrid in order to
+validate our improvements are really efficient. Our simulations consider that in
+order  to send a  message, a  latency delays  the sending  and according  to the
+network  performance and  the message  size, the  time of  the reception  of the
+message also varies.
+
+In the following of this paper, Section~\ref{BT algo} describes the Bertsekas
+and Tsitsiklis' asynchronous load balancing algorithm. Moreover, we present a
+possible problem in the convergence conditions.  Section~\ref{Best-effort}
+presents the best effort strategy which provides an efficient way to reduce the
+execution times.  This strategy will be compared with other ones, presented in
+Section~\ref{Other}.  In Section~\ref{Virtual load}, the virtual load mechanism
+is proposed.  Simulations allowed to show that both our approaches are valid
+using a quite realistic model detailed in Section~\ref{Simulations}.  Finally we
+give a conclusion and some perspectives to this work.
+
+
+
+\section{Bertsekas  and Tsitsiklis' asynchronous load balancing algorithm}
+\label{BT algo}
+
+In  order  prove  the  convergence  of  asynchronous  iterative  load  balancing
+Bertsekas         and        Tsitsiklis         proposed         a        model
+in~\cite{bertsekas+tsitsiklis.1997.parallel}.   Here we  recall  some notations.
+Consider  that  $N={1,...,n}$  processors   are  connected  through  a  network.
+Communication links  are represented by  a connected undirected  graph $G=(N,V)$
+where $V$ is the set of links connecting different processors. In this work, we
+consider that  processors are  homogeneous for sake  of simplicity. It  is quite
+easy to tackle the  heterogeneous case~\cite{ElsMonPre02}. Load of processor $i$
+at  time $t$  is  represented  by $x_i(t)\geq  0$.   Let $V(i)$  be  the set  of
+neighbors of processor  $i$.  Each processor $i$ has an estimate  of the load of
+each  of its  neighbors $j  \in V(i)$  represented by  $x_j^i(t)$.  According to
+asynchronism and communication  delays, this estimate may be  outdated.  We also
+consider that the load is described by a continuous variable.
+
+When a processor  send a part of its  load to one or some of  its neighbors, the
+transfer takes time to be completed.  Let $s_{ij}(t)$ be the amount of load that
+processor $i$ has transferred to processor $j$ at time $t$ and let $r_{ij}(t)$ be the
+amount of  load received by processor $j$  from processor $i$ at  time $t$. Then
+the amount of load of processor $i$ at time $t+1$ is given by:
+\begin{equation}
+x_i(t+1)=x_i(t)-\sum_{j\in V(i)} s_{ij}(t) + \sum_{j\in V(i)} r_{ji}(t)
+\label{eq:ping-pong}
+\end{equation}
+
+
+Some  conditions are  required to  ensure the  convergence. One  of them  can be
+called the \emph{ping-pong} condition which specifies that:
+\begin{equation}
+x_i(t)-\sum _{k\in V(i)} s_{ik}(t) \geq x_j^i(t)+s_{ij}(t)
+\end{equation}
+for any  processor $i$ and any  $j \in V(i)$ such  that $x_i(t)>x_j^i(t)$.  This
+condition aims  at avoiding a processor  to send a  part of its load  and being
+less loaded after that.
+
+Nevertheless,  we  think that  this  condition may  lead  to  deadlocks in  some
+cases. For example, if we consider  only three processors and that processor $1$
+is linked to processor $2$ which is  also linked to processor $3$ (i.e. a simple
+chain which 3 processors). Now consider we have the following values at time $t$:
+\begin{eqnarray*}
+x_1(t)=10   \\
+x_2(t)=100   \\
+x_3(t)=99.99\\
+ x_3^2(t)=99.99\\
+\end{eqnarray*}
+In this case, processor $2$ can  either sends load to processor $1$ or processor
+$3$.   If  it  sends  load  to  processor $1$  it  will  not  satisfy  condition
+(\ref{eq:ping-pong})  because  after the  sending  it  will  be less  loaded  that
+$x_3^2(t)$.  So we consider that the \emph{ping-pong} condition is probably to
+strong. Currently, we did not try to make another convergence proof without this
+condition or with a weaker condition.
+
+Nevertheless, we conjecture that such a weaker condition exists.  In fact, we
+have never seen any scenario that is not leading to convergence, even with
+load-balancing strategies that are not exactly fulfilling these two conditions.
+
+It may be the subject of future work to express weaker conditions, and to prove
+that they are sufficient to ensure the convergence of the load-balancing
+algorithm.
+
+\section{Best effort strategy}
+\label{Best-effort}
+
+In this section we describe a new load-balancing strategy that we call
+\emph{best effort}.  First, we explain the general idea behind this strategy,
+and then we describe some variants of this basic strategy.
+
+\subsection{Basic strategy}
+
+The general idea behind the \emph{best effort} strategy is that each processor,
+that detects it has more load than some of its neighbors, sends some load to the
+most of its less loaded neighbors, doing its best to reach the equilibrium
+between those neighbors and himself.
+
+More precisely, when a processor $i$ is in its load-balancing phase,
+he proceeds as following.
+\begin{enumerate}
+\item First, the neighbors are sorted in non-decreasing order of their
+  known loads $x^i_j(t)$.
+
+\item Then, this sorted list is traversed in order to find its largest
+  prefix such as the load of each selected neighbor is lesser than:
+  \begin{itemize}
+  \item the processor's own load, and
+  \item the mean of the loads of the selected neighbors and of the
+    processor's load.
+  \end{itemize}
+  Let's call $S_i(t)$ the set of the selected neighbors, and
+  $\bar{x}(t)$ the mean of the loads of the selected neighbors and of
+  the processor load:
+  \begin{equation*}
+    \bar{x}(t) = \frac{1}{\abs{S_i(t)} + 1}
+      \left( x_i(t) + \sum_{j\in S_i(t)} x^i_j(t) \right)
+  \end{equation*}
+  The following properties hold:
+  \begin{equation*}
+    \begin{cases}
+      S_i(t) \subset V(i) \\
+      x^i_j(t) < x_i(t) & \forall j \in S_i(t) \\
+      x^i_j(t) < \bar{x} & \forall j \in S_i(t) \\
+      x^i_j(t) \leq x^i_k(t) & \forall j \in S_i(t), \forall k \in V(i) \setminus S_i(t) \\
+      \bar{x} \leq x_i(t)
+    \end{cases}
+  \end{equation*}
+
+\item Once this selection is completed, processor $i$ sends to each of
+  the selected neighbor $j\in S_i(t)$ an amount of load $s_{ij}(t) =
+  \bar{x} - x^i_j(t)$.
+
+  From the above equations, and notably from the definition of
+  $\bar{x}$, it can easily be verified that:
+  \begin{equation*}
+    \begin{cases}
+      x_i(t) - \sum_{j\in S_i(t)} s_{ij}(t) = \bar{x} \\
+      x^i_j(t) + s_{ij}(t) = \bar{x} & \forall j \in S_i(t)
+    \end{cases}
+  \end{equation*}
+\end{enumerate}
+
+\subsection{Leveling the amount to send}
+
+With the aforementioned basic strategy, each node does its best to reach the
+equilibrium with its neighbors.  Since each node may be taking the same kind of
+decision at the same moment, there is the risk that a node receives load from
+several of its neighbors, and then is temporary going off the equilibrium state.
+This is particularly true with strongly connected applications.
+
+In order to reduce this effect, we add the ability to level the amount to send.
+The idea, here, is to make smaller steps toward the equilibrium, such that a
+potentially wrong decision has a lower impact.
+
+Concretely, once $s_{ij}$ has been evaluated as before, it is simply divided by
+some configurable factor.  That's what we named the ``parameter $k$'' in
+Section~\ref{Results}.  The amount of data to send is then $s_{ij}(t) = (\bar{x}
+- x^i_j(t))/k$.
+\FIXME[check that it's still named $k$ in Sec.~\ref{Results}]{}
+
+\section{Other strategies}
+\label{Other}
+
+Another load balancing strategy, working under the same conditions, was
+previously developed by Bahi, Giersch, and Makhoul in
+\cite{bahi+giersch+makhoul.2008.scalable}.  In order to assess the performances
+of the new \emph{best effort}, we naturally chose to compare it to this anterior
+work.  More precisely, we will use the algorithm~2 from
+\cite{bahi+giersch+makhoul.2008.scalable} and, in the following, we will
+reference it under the name of Makhoul's.
+
+Here is an outline of the Makhoul's algorithm.  When a given node needs to take
+a load balancing decision, it starts by sorting its neighbors by increasing
+order of their load.  Then, it computes the difference between its own load, and
+the load of each of its neighbors.  Finally, taking the neighbors following the
+order defined before, the amount of load to send $s_{ij}$ is computed as
+$1/(N+1)$ of the load difference, with $N$ being the number of neighbors.  This
+process continues as long as the node is more loaded than the considered
+neighbor.
+
+
+\section{Virtual load}
+\label{Virtual load}
+
+In this section,  we present the concept of \texttt{virtual  load}.  In order to
+use this concept, load balancing messages must be sent using two different kinds
+of  messages:  load information  messages  and  load  balancing messages.   More
+precisely, a node  wanting to send a part  of its load to one  of its neighbors,
+can first send  a load information message containing the load  it will send and
+then it can send the load  balancing message containing data  to be transferred.
+Load information  message are really  short, consequently they will  be received
+very quickly.  In opposition, load  balancing messages are often bigger and thus
+require more time to be transferred.
+
+The  concept  of  \texttt{virtual load}  allows  a  node  that received  a  load
+information message to integrate the load that it will receive later in its load
+(virtually)  and consequently send  a (real)  part of  its load  to some  of its
+neighbors. In fact,  a node that receives a load  information message knows that
+later it  will receive the  corresponding load balancing message  containing the
+corresponding data.  So  if this node detects it is too  loaded compared to some
+of its neighbors  and if it has enough  load (real load), then it  can send more
+load  to  some of  its  neighbors  without waiting  the  reception  of the  load
+balancing message.
+
+Doing  this, we  can  expect a  faster  convergence since  nodes  have a  faster
+information of the load they will receive, so they can take in into account.
+
+\FIXME{Est ce qu'on donne l'algo avec virtual load?}
+
+\FIXME{describe integer mode}
+
+\section{Simulations}
+\label{Simulations}
+
+In order to test and validate our approaches, we wrote a simulator
+using the SimGrid
+framework~\cite{casanova+legrand+quinson.2008.simgrid}.  This
+simulator, which consists of about 2,700 lines of C++, allows to run
+the different load-balancing strategies under various parameters, such
+as the initial distribution of load, the interconnection topology, the
+characteristics of the running platform, etc.  Then several metrics
+are issued that permit to compare the strategies.
+
+The simulation model is detailed in the next section (\ref{Sim
+  model}), and the experimental contexts are described in
+section~\ref{Contexts}.  Then the results of the simulations are
+presented in section~\ref{Results}.
+
+\subsection{Simulation model}
+\label{Sim model}
+
+In the simulation model the processors exchange messages which are of
+two kinds.  First, there are \emph{control messages} which only carry
+information that is exchanged between the processors, such as the
+current load, or the virtual load transfers if this option is
+selected.  These messages are rather small, and their size is
+constant.  Then, there are \emph{data messages} that carry the real
+load transferred between the processors.  The size of a data message
+is a function of the amount of load that it carries, and it can be
+pretty large.  In order to receive the messages, each processor has
+two receiving channels, one for each kind of messages.  Finally, when
+a message is sent or received, this is done by using the non-blocking
+primitives of SimGrid\footnote{That are \texttt{MSG\_task\_isend()},
+  and \texttt{MSG\_task\_irecv()}.}.
+
+During the simulation, each processor concurrently runs three threads:
+a \emph{receiving thread}, a \emph{computing thread}, and a
+\emph{load-balancing thread}, which we will briefly describe now.
+
+\paragraph{Receiving thread} The receiving thread is in charge of
+waiting for messages to come, either on the control channel, or on the
+data channel.  Its behavior is sketched by Algorithm~\ref{algo.recv}.
+When a message is received, it is pushed in a buffer of
+received message, to be later consumed by one of the other threads.
+There are two such buffers, one for the control messages, and one for
+the data messages.  The buffers are implemented with a lock-free FIFO
+\cite{sutter.2008.writing} to avoid contention between the threads.
+
+\begin{algorithm}
+  \caption{Receiving thread}
+  \label{algo.recv}
+  \KwData{
+    \begin{algodata}
+      \VAR{ctrl\_chan}, \VAR{data\_chan}
+      & communication channels (control and data) \\
+      \VAR{ctrl\_fifo}, \VAR{data\_fifo}
+      & buffers of received messages (control and data) \\
+    \end{algodata}}
+  \While{true}{%
+    wait for a message to be available on either \VAR{ctrl\_chan},
+    or \VAR{data\_chan}\;
+    \If{a message is available on \VAR{ctrl\_chan}}{%
+      get the message from \VAR{ctrl\_chan}, and push it into \VAR{ctrl\_fifo}\;
+    }
+    \If{a message is available on \VAR{data\_chan}}{%
+      get the message from \VAR{data\_chan}, and push it into \VAR{data\_fifo}\;
+    }
+  }
+\end{algorithm}
+
+\paragraph{Computing thread} The computing thread is in charge of the
+real load management.  As exposed in Algorithm~\ref{algo.comp}, it
+iteratively runs the following operations:
+\begin{itemize}
+\item if some load was received from the neighbors, get it;
+\item if there is some load to send to the neighbors, send it;
+\item run some computation, whose duration is function of the current
+  load of the processor.
+\end{itemize}
+Practically, after the computation, the computing thread waits for a
+small amount of time if the iterations are looping too fast (for
+example, when the current load is near zero).
+
+\begin{algorithm}
+  \caption{Computing thread}
+  \label{algo.comp}
+  \KwData{
+    \begin{algodata}
+      \VAR{data\_fifo} & buffer of received data messages \\
+      \VAR{real\_load} & current load \\
+    \end{algodata}}
+  \While{true}{%
+    \If{\VAR{data\_fifo} is empty and $\VAR{real\_load} = 0$}{%
+      wait until a message is pushed into \VAR{data\_fifo}\;
+    }
+    \While{\VAR{data\_fifo} is not empty}{%
+      pop a message from \VAR{data\_fifo}\;
+      get the load embedded in the message, and add it to \VAR{real\_load}\;
+    }
+    \ForEach{neighbor $n$}{%
+      \If{there is some amount of load $a$ to send to $n$}{%
+        send $a$ units of load to $n$, and subtract it from \VAR{real\_load}\;
+      }
+    }
+    \If{$\VAR{real\_load} > 0.0$}{
+      simulate some computation, whose duration is function of \VAR{real\_load}\;
+      ensure that the main loop does not iterate too fast\;
+    }
+  }
+\end{algorithm}
+
+\paragraph{Load-balancing thread} The load-balancing thread is in
+charge of running the load-balancing algorithm, and exchange the
+control messages.  As shown in Algorithm~\ref{algo.lb}, it iteratively
+runs the following operations:
+\begin{itemize}
+\item get the control messages that were received from the neighbors;
+\item run the load-balancing algorithm;
+\item send control messages to the neighbors, to inform them of the
+  processor's current load, and possibly of virtual load transfers;
+\item wait a minimum (configurable) amount of time, to avoid to
+  iterate too fast.
+\end{itemize}
+
+\begin{algorithm}
+  \caption{Load-balancing}
+  \label{algo.lb}
+  \While{true}{%
+    \While{\VAR{ctrl\_fifo} is not empty}{%
+      pop a message from \VAR{ctrl\_fifo}\;
+      identify the sender of the message,
+      and update the current knowledge of its load\;
+    }
+    run the load-balancing algorithm to make the decision about load transfers\;
+    \ForEach{neighbor $n$}{%
+      send a control messages to $n$\;
+    }
+    ensure that the main loop does not iterate too fast\;
+  }
+\end{algorithm}
  
+\paragraph{}
+For the sake of simplicity, a few details were voluntary omitted from
+these descriptions.  For an exhaustive presentation, we refer to the
+actual source code that was used for the experiments%
+\footnote{As mentioned before, our simulator relies on the SimGrid
+  framework~\cite{casanova+legrand+quinson.2008.simgrid}.  For the
+  experiments, we used a pre-release of SimGrid 3.7 (Git commit
+  67d62fca5bdee96f590c942b50021cdde5ce0c07, available from
+  \url{https://gforge.inria.fr/scm/?group_id=12})}, and which is
+available at
+\url{http://info.iut-bm.univ-fcomte.fr/staff/giersch/software/loba.tar.gz}.
  
+\FIXME{ajouter des détails sur la gestion de la charge virtuelle ?
+par ex, donner l'idée générale de l'implémentation.  l'idée générale est déja décrite en section~\ref{Virtual load}}
  
-qsdqsd
+\subsection{Experimental contexts}
+\label{Contexts}
  
+In order to assess the performances of our algorithms, we ran our
+simulator with various parameters, and extracted several metrics, that
+we will describe in this section.
  
+\paragraph{Load balancing strategies}
  
+Several load balancing strategies were compared.  We ran the experiments with
+the \emph{Best effort}, and with the \emph{Makhoul} strategies.  \emph{Best
+  effort} was tested with parameter $k = 1$, $k = 2$, and $k = 4$.  Secondly,
+each strategy was run in its two variants: with, and without the management of
+\emph{virtual load}.  Finally, we tested each configuration with \emph{real},
+and with \emph{integer} load.
+
+To summarize the different load balancing strategies, we have:
+\begin{description}
+\item[\textbf{strategies:}] \emph{Makhoul}, or \emph{Best effort} with $k\in
+  \{1,2,4\}$
+\item[\textbf{variants:}] with, or without virtual load
+\item[\textbf{domain:}] real load, or integer load
+\end{description}
+%
+This gives us as many as $4\times 2\times 2 = 16$ different strategies.
+
+\paragraph{End of the simulation}
+
+The simulations were run until the load was nearly balanced among the
+participating nodes.  More precisely the simulation stops when each node holds
+an amount of load at less than 1\% of the load average, during an arbitrary
+number of computing iterations (2000 in our case).
+
+Note that this convergence detection was implemented in a centralized manner.
+This is easy to do within the simulator, but it's obviously not realistic.  In a
+real application we would have chosen a decentralized convergence detection
+algorithm, like the one described by Bahi, Contassot-Vivier, Couturier, and
+Vernier in \cite{10.1109/TPDS.2005.2}.
+
+\paragraph{Platforms}
+
+In order to show the behavior of the different strategies in different
+settings, we simulated the executions on two sorts of platforms.  These two
+sorts of platforms differ by their underlaid network topology.  On the one hand,
+we have homogeneous platforms, modeled as a cluster.  On the other hand, we have
+heterogeneous platforms, modeled as the interconnection of a number of clusters.
+
+The clusters were modeled by a fixed number of computing nodes interconnected
+through a backbone link.  Each computing node has a computing power of
+1~GFlop/s, and is connected to the backbone by a network link whose bandwidth is
+of 125~MB/s, with a latency of 50~$\mu$s.  The backbone has a network bandwidth
+of 2.25~GB/s, with a latency of 500~$\mu$s.
+
+The heterogeneous platform descriptions were created by taking a subset of the
+Grid'5000 infrastructure\footnote{Grid'5000 is a French large scale experimental
+  Grid (see \url{https://www.grid5000.fr/}).}, as described in the platform file
+\texttt{g5k.xml} distributed with SimGrid.  Note that the heterogeneity of the
+platform here only comes from the network topology.  Indeed, since our
+algorithms currently do not handle heterogeneous computing resources, the
+processor speeds were normalized, and we arbitrarily chose to fix them to
+1~GFlop/s.
+
+Then we derived each sort of platform with four different number of computing
+nodes: 16, 64, 256, and 1024 nodes.
+
+\paragraph{Configurations}
+
+The distributed processes of the application were then logically organized along
+three possible topologies: a line, a torus or an hypercube.  We ran tests where
+the total load was initially on an only node (at one end for the line topology),
+and other tests where the load was initially randomly distributed across all the
+participating nodes.  The total amount of load was fixed to a number of load
+units equal to 1000 times the number of node.  The average load is then of 1000
+load units.
+
+For each of the preceding configuration, we finally had to choose the
+computation and communication costs of a load unit.  We chose them, such as to
+have three different computation over communication cost ratios, and hence model
+three different kinds of applications:
+\begin{itemize}
+\item mainly communicating, with a computation/communication cost ratio of $1/10$;
+\item mainly computing, with a computation/communication cost ratio of $10/1$ ;
+\item balanced, with a computation/communication cost ratio of $1/1$.
+\end{itemize}
+
+To summarize the various configurations, we have:
+\begin{description}
+\item[\textbf{platforms:}] homogeneous (cluster), or heterogeneous (subset of
+  Grid'5000)
+\item[\textbf{platform sizes:}] platforms with 16, 64, 256, or 1024 nodes
+\item[\textbf{process topologies:}] line, torus, or hypercube
+\item[\textbf{initial load distribution:}] initially on a only node, or
+  initially randomly distributed over all nodes
+\item[\textbf{computation/communication ratio:}] $10/1$, $1/1$, or $1/10$
+\end{description}
+%
+This gives us as many as $2\times 4\times 3\times 2\times 3 = 144$ different
+configurations.
+%
+Combined with the various load balancing strategies, we had $16\times 144 =
+2304$ distinct settings to evaluate.  In fact, as it will be shown later, we
+didn't run all the strategies, nor all the configurations for the bigger
+platforms with 1024 nodes, since to simulations would have run for a too long
+time.
+
+Anyway, all these the experiments represent more than 240 hours of computing
+time.
+
+\paragraph{Metrics}
+
+In order to evaluate and compare the different load balancing strategies we had
+to define several metrics.  Our goal, when choosing these metrics, was to have
+something tending to a constant value, i.e. to have a measure which is not
+changing anymore once the convergence state is reached.  Moreover, we wanted to
+have some normalized value, in order to be able to compare them across different
+settings.
+
+With these constraints in mind, we defined the following metrics:
+%
+\begin{description}
+\item[\textbf{average idle time:}] that's the total time spent, when the nodes
+  don't hold any share of load, and thus have nothing to compute.  This total
+  time is divided by the number of participating nodes, such as to have a number
+  that can be compared between simulations of different sizes.
+
+  This metric is expected to give an idea of the ability of the strategy to
+  diffuse the load quickly.  A smaller value is better.
+
+\item[\textbf{average convergence date:}] that's the average of the dates when
+  all nodes reached the convergence state.  The dates are measured as a number
+  of (simulated) seconds since the beginning of the simulation.
+
+\item[\textbf{maximum convergence date:}] that's the date when the last node
+  reached the convergence state.
+
+  These two dates give an idea of the time needed by the strategy to reach the
+  equilibrium state.  A smaller value is better.
+
+\item[\textbf{data transfer amount:}] that's the sum of the amount of all data
+  transfers during the simulation.  This sum is then normalized by dividing it
+  by the total amount of data present in the system.
+
+  This metric is expected to give an idea of the efficiency of the strategy in
+  terms of data movements, i.e. its ability to reach the equilibrium with fewer
+  transfers.  Again, a smaller value is better.
+
+\end{description}
+
+
+\subsection{Validation of our approaches}
+\label{Results}
+
+
+On constate quoi (vérifier avec les chiffres)?
+\begin{itemize}
+\item cluster ou grid, entier ou réel, ne font pas de grosses différences
+
+\item bookkeeping? améliore souvent les choses, parfois au prix d'un retard au démarrage
+
+\item makhoul? se fait battre sur les grosses plateformes
+
+\item taille de plateforme?
+
+\item ratio comp/comm?
+
+\item option $k$? peut-être intéressant sur des plateformes fortement interconnectées (hypercube)
+
+\item volume de comm? souvent, besteffort/plain en fait plus. pourquoi?
+
+\item répartition initiale de la charge ?
+
+\item integer mode sur topo. line n'a jamais fini en plain? vérifier si ce n'est
+  pas à cause de l'effet d'escalier que bk est capable de gommer.
+
+\end{itemize}
+
+\begin{itshape}
+On veut montrer quoi ? :
+\FIXME{remove that part}
+
+1) best plus rapide que les autres (simple, makhoul)
+2) avantage virtual load
+
+Est ce qu'on peut trouver des contre exemple?
+Topologies variées
+
+
+Simulation avec temps définies assez long et on mesure la qualité avec : volume de calcul effectué, volume de données échangées
+Mais aussi simulation avec temps court qui montre que seul best converge
+
+Expés avec ratio calcul/comm rapide et lent
+
+Quelques expés avec charge initiale aléatoire plutot que sur le premier proc
+
+Cadre processeurs homogènes
+
+Topologies statiques
+
+On ne tient pas compte de la vitesse des liens donc on la considère homogène
+
+Prendre un réseau hétérogène et rendre processeur homogène
+
+Taille : 10 100 très gros
+\end{itshape}
+
+\section{Conclusion and perspectives}
+
+\FIXME{conclude!}
+
+\begin{acknowledgements}
+  Computations have been performed on the supercomputer facilities of
+  the Mésocentre de calcul de Franche-Comté.
+\end{acknowledgements}
+
+\FIXME{find and add more references}
+\bibliographystyle{spmpsci}
+\bibliography{biblio}
  
  \end{document}
+
+%%% Local Variables:
+%%% mode: latex
+%%% TeX-master: t
+%%% fill-column: 80
+%%% ispell-local-dictionary: "american"
+%%% End:
+
+% LocalWords:  Raphaël Couturier Arnaud Giersch Abderrahmane Sider Franche ij
+% LocalWords:  Bertsekas Tsitsiklis SimGrid DASUD Comté Béjaïa asynchronism ji
+% LocalWords:  ik isend irecv Cortés et al chan ctrl fifo Makhoul GFlop xml pre
+% LocalWords:  FEMTO Makhoul's fca bdee cdde Contassot Vivier underlaid