+been extended by many authors. For example, Cortés et al., with
+DASUD~\cite{cortes+ripoll+cedo+al.2002.asynchronous}, propose a
+version working with integer load. This work was later generalized by
+the same authors in \cite{cedo+cortes+ripoll+al.2007.convergence}.
+\FIXME{Rajouter des choses ici. Lesquelles ?}
+
+Although the Bertsekas and Tsitsiklis' algorithm describes the condition to
+ensure the convergence, there is no indication or strategy to really implement
+the load distribution. In other word, a node can send a part of its load to one
+or many of its neighbors while all the convergence conditions are
+followed. Consequently, we propose a new strategy called \emph{best effort}
+that tries to balance the load of a node to all its less loaded neighbors while
+ensuring that all the nodes concerned by the load balancing phase have the same
+amount of load. Moreover, when real asynchronous applications are considered,
+using asynchronous load balancing algorithms can reduce the execution
+times. Most of the times, it is simpler to distinguish load information messages
+from data migration messages. Former ones allows a node to inform its
+neighbors of its current load. These messages are very small, they can be sent
+quite often. For example, if an computing iteration takes a significant times
+(ranging from seconds to minutes), it is possible to send a new load information
+message at each neighbor at each iteration. Latter messages contains data that
+migrates from one node to another one. Depending on the application, it may have
+sense or not that nodes try to balance a part of their load at each computing
+iteration. But the time to transfer a load message from a node to another one is
+often much more longer that to time to transfer a load information message. So,
+when a node receives the information that later it will receive a data message,
+it can take this information into account and it can consider that its new load
+is larger. Consequently, it can send a part of it real load to some of its
+neighbors if required. We call this trick the \emph{virtual load} mechanism.
+
+
+
+So, in this work, we propose a new strategy for improving the distribution of
+the load and a simple but efficient trick that also improves the load
+balancing. Moreover, we have conducted many simulations with SimGrid in order to
+validate our improvements are really efficient. Our simulations consider that in
+order to send a message, a latency delays the sending and according to the
+network performance and the message size, the time of the reception of the
+message also varies.
+
+In the following of this paper, Section~\ref{BT algo} describes the Bertsekas
+and Tsitsiklis' asynchronous load balancing algorithm. Moreover, we present a
+possible problem in the convergence conditions. Section~\ref{Best-effort}
+presents the best effort strategy which provides an efficient way to reduce the
+execution times. This strategy will be compared with other ones, presented in
+Section~\ref{Other}. In Section~\ref{Virtual load}, the virtual load mechanism
+is proposed. Simulations allowed to show that both our approaches are valid
+using a quite realistic model detailed in Section~\ref{Simulations}. Finally we
+give a conclusion and some perspectives to this work.
+
+
+
+\section{Bertsekas and Tsitsiklis' asynchronous load balancing algorithm}
+\label{BT algo}
+
+In order prove the convergence of asynchronous iterative load balancing
+Bertsekas and Tsitsiklis proposed a model
+in~\cite{bertsekas+tsitsiklis.1997.parallel}. Here we recall some notations.
+Consider that $N={1,...,n}$ processors are connected through a network.
+Communication links are represented by a connected undirected graph $G=(N,V)$
+where $V$ is the set of links connecting different processors. In this work, we
+consider that processors are homogeneous for sake of simplicity. It is quite
+easy to tackle the heterogeneous case~\cite{ElsMonPre02}. Load of processor $i$
+at time $t$ is represented by $x_i(t)\geq 0$. Let $V(i)$ be the set of
+neighbors of processor $i$. Each processor $i$ has an estimate of the load of
+each of its neighbors $j \in V(i)$ represented by $x_j^i(t)$. According to
+asynchronism and communication delays, this estimate may be outdated. We also
+consider that the load is described by a continuous variable.
+
+When a processor send a part of its load to one or some of its neighbors, the
+transfer takes time to be completed. Let $s_{ij}(t)$ be the amount of load that
+processor $i$ has transferred to processor $j$ at time $t$ and let $r_{ij}(t)$ be the
+amount of load received by processor $j$ from processor $i$ at time $t$. Then
+the amount of load of processor $i$ at time $t+1$ is given by:
+\begin{equation}
+x_i(t+1)=x_i(t)-\sum_{j\in V(i)} s_{ij}(t) + \sum_{j\in V(i)} r_{ji}(t)
+\label{eq:ping-pong}
+\end{equation}
+
+
+Some conditions are required to ensure the convergence. One of them can be
+called the \emph{ping-pong} condition which specifies that:
+\begin{equation}
+x_i(t)-\sum _{k\in V(i)} s_{ik}(t) \geq x_j^i(t)+s_{ij}(t)
+\end{equation}
+for any processor $i$ and any $j \in V(i)$ such that $x_i(t)>x_j^i(t)$. This
+condition aims at avoiding a processor to send a part of its load and being
+less loaded after that.
+
+Nevertheless, we think that this condition may lead to deadlocks in some
+cases. For example, if we consider only three processors and that processor $1$
+is linked to processor $2$ which is also linked to processor $3$ (i.e. a simple
+chain which 3 processors). Now consider we have the following values at time $t$:
+\begin{eqnarray*}
+x_1(t)=10 \\
+x_2(t)=100 \\
+x_3(t)=99.99\\
+ x_3^2(t)=99.99\\
+\end{eqnarray*}
+In this case, processor $2$ can either sends load to processor $1$ or processor
+$3$. If it sends load to processor $1$ it will not satisfy condition
+(\ref{eq:ping-pong}) because after the sending it will be less loaded that
+$x_3^2(t)$. So we consider that the \emph{ping-pong} condition is probably to
+strong. Currently, we did not try to make another convergence proof without this
+condition or with a weaker condition.
+
+Nevertheless, we conjecture that such a weaker condition exists. In fact, we
+have never seen any scenario that is not leading to convergence, even with
+load-balancing strategies that are not exactly fulfilling these two conditions.
+
+It may be the subject of future work to express weaker conditions, and to prove
+that they are sufficient to ensure the convergence of the load-balancing
+algorithm.
+
+\section{Best effort strategy}
+\label{Best-effort}
+
+In this section we describe a new load-balancing strategy that we call
+\emph{best effort}. First, we explain the general idea behind this strategy,
+and then we describe some variants of this basic strategy.
+
+\subsection{Basic strategy}
+
+The general idea behind the \emph{best effort} strategy is that each processor,
+that detects it has more load than some of its neighbors, sends some load to the
+most of its less loaded neighbors, doing its best to reach the equilibrium
+between those neighbors and himself.
+
+More precisely, when a processor $i$ is in its load-balancing phase,
+he proceeds as following.
+\begin{enumerate}
+\item First, the neighbors are sorted in non-decreasing order of their
+ known loads $x^i_j(t)$.
+
+\item Then, this sorted list is traversed in order to find its largest
+ prefix such as the load of each selected neighbor is lesser than:
+ \begin{itemize}
+ \item the processor's own load, and
+ \item the mean of the loads of the selected neighbors and of the
+ processor's load.
+ \end{itemize}
+ Let's call $S_i(t)$ the set of the selected neighbors, and
+ $\bar{x}(t)$ the mean of the loads of the selected neighbors and of
+ the processor load:
+ \begin{equation*}
+ \bar{x}(t) = \frac{1}{\abs{S_i(t)} + 1}
+ \left( x_i(t) + \sum_{j\in S_i(t)} x^i_j(t) \right)
+ \end{equation*}
+ The following properties hold:
+ \begin{equation*}
+ \begin{cases}
+ S_i(t) \subset V(i) \\
+ x^i_j(t) < x_i(t) & \forall j \in S_i(t) \\
+ x^i_j(t) < \bar{x} & \forall j \in S_i(t) \\
+ x^i_j(t) \leq x^i_k(t) & \forall j \in S_i(t), \forall k \in V(i) \setminus S_i(t) \\
+ \bar{x} \leq x_i(t)
+ \end{cases}
+ \end{equation*}
+
+\item Once this selection is completed, processor $i$ sends to each of
+ the selected neighbor $j\in S_i(t)$ an amount of load $s_{ij}(t) =
+ \bar{x} - x^i_j(t)$.
+
+ From the above equations, and notably from the definition of
+ $\bar{x}$, it can easily be verified that:
+ \begin{equation*}
+ \begin{cases}
+ x_i(t) - \sum_{j\in S_i(t)} s_{ij}(t) = \bar{x} \\
+ x^i_j(t) + s_{ij}(t) = \bar{x} & \forall j \in S_i(t)
+ \end{cases}
+ \end{equation*}
+\end{enumerate}
+
+\subsection{Leveling the amount to send}
+
+With the aforementioned basic strategy, each node does its best to reach the
+equilibrium with its neighbors. Since each node may be taking the same kind of
+decision at the same moment, there is the risk that a node receives load from
+several of its neighbors, and then is temporary going off the equilibrium state.
+This is particularly true with strongly connected applications.
+
+In order to reduce this effect, we add the ability to level the amount to send.
+The idea, here, is to make smaller steps toward the equilibrium, such that a
+potentially wrong decision has a lower impact.
+
+Concretely, once $s_{ij}$ has been evaluated as before, it is simply divided by
+some configurable factor. That's what we named the ``parameter $k$'' in
+Section~\ref{Results}. The amount of data to send is then $s_{ij}(t) = (\bar{x}
+- x^i_j(t))/k$.
+\FIXME[check that it's still named $k$ in Sec.~\ref{Results}]{}
+
+\section{Other strategies}
+\label{Other}
+
+Another load balancing strategy, working under the same conditions, was
+previously developed by Bahi, Giersch, and Makhoul in
+\cite{bahi+giersch+makhoul.2008.scalable}. In order to assess the performances
+of the new \emph{best effort}, we naturally chose to compare it to this anterior
+work. More precisely, we will use the algorithm~2 from
+\cite{bahi+giersch+makhoul.2008.scalable} and, in the following, we will
+reference it under the name of Makhoul's.
+
+Here is an outline of the Makhoul's algorithm. When a given node needs to take
+a load balancing decision, it starts by sorting its neighbors by increasing
+order of their load. Then, it computes the difference between its own load, and
+the load of each of its neighbors. Finally, taking the neighbors following the
+order defined before, the amount of load to send $s_{ij}$ is computed as
+$1/(N+1)$ of the load difference, with $N$ being the number of neighbors. This
+process continues as long as the node is more loaded than the considered
+neighbor.
+
+
+\section{Virtual load}
+\label{Virtual load}
+
+In this section, we present the concept of \texttt{virtual load}. In order to
+use this concept, load balancing messages must be sent using two different kinds
+of messages: load information messages and load balancing messages. More
+precisely, a node wanting to send a part of its load to one of its neighbors,
+can first send a load information message containing the load it will send and
+then it can send the load balancing message containing data to be transferred.
+Load information message are really short, consequently they will be received
+very quickly. In opposition, load balancing messages are often bigger and thus
+require more time to be transferred.
+
+The concept of \texttt{virtual load} allows a node that received a load
+information message to integrate the load that it will receive later in its load
+(virtually) and consequently send a (real) part of its load to some of its
+neighbors. In fact, a node that receives a load information message knows that
+later it will receive the corresponding load balancing message containing the
+corresponding data. So if this node detects it is too loaded compared to some
+of its neighbors and if it has enough load (real load), then it can send more
+load to some of its neighbors without waiting the reception of the load
+balancing message.
+
+Doing this, we can expect a faster convergence since nodes have a faster
+information of the load they will receive, so they can take in into account.
+
+\FIXME{Est ce qu'on donne l'algo avec virtual load?}
+
+\FIXME{describe integer mode}
+
+\section{Simulations}
+\label{Simulations}
+
+In order to test and validate our approaches, we wrote a simulator
+using the SimGrid
+framework~\cite{casanova+legrand+quinson.2008.simgrid}. This
+simulator, which consists of about 2,700 lines of C++, allows to run
+the different load-balancing strategies under various parameters, such
+as the initial distribution of load, the interconnection topology, the
+characteristics of the running platform, etc. Then several metrics
+are issued that permit to compare the strategies.
+
+The simulation model is detailed in the next section (\ref{Sim
+ model}), and the experimental contexts are described in
+section~\ref{Contexts}. Then the results of the simulations are
+presented in section~\ref{Results}.
+
+\subsection{Simulation model}
+\label{Sim model}
+
+In the simulation model the processors exchange messages which are of
+two kinds. First, there are \emph{control messages} which only carry
+information that is exchanged between the processors, such as the
+current load, or the virtual load transfers if this option is
+selected. These messages are rather small, and their size is
+constant. Then, there are \emph{data messages} that carry the real
+load transferred between the processors. The size of a data message
+is a function of the amount of load that it carries, and it can be
+pretty large. In order to receive the messages, each processor has
+two receiving channels, one for each kind of messages. Finally, when
+a message is sent or received, this is done by using the non-blocking
+primitives of SimGrid\footnote{That are \texttt{MSG\_task\_isend()},
+ and \texttt{MSG\_task\_irecv()}.}.
+
+During the simulation, each processor concurrently runs three threads:
+a \emph{receiving thread}, a \emph{computing thread}, and a
+\emph{load-balancing thread}, which we will briefly describe now.
+
+For the sake of simplicity, a few details were voluntary omitted from
+these descriptions. For an exhaustive presentation, we refer to the
+actual source code that was used for the experiments%
+\footnote{As mentioned before, our simulator relies on the SimGrid
+ framework~\cite{casanova+legrand+quinson.2008.simgrid}. For the
+ experiments, we used a pre-release of SimGrid 3.7 (Git commit
+ 67d62fca5bdee96f590c942b50021cdde5ce0c07, available from
+ \url{https://gforge.inria.fr/scm/?group_id=12})}, and which is
+available at
+\url{http://info.iut-bm.univ-fcomte.fr/staff/giersch/software/loba.tar.gz}.
+
+\subsubsection{Receiving thread}
+
+The receiving thread is in charge of waiting for messages to come, either on the
+control channel, or on the data channel. Its behavior is sketched by
+Algorithm~\ref{algo.recv}. When a message is received, it is pushed in a buffer
+of received message, to be later consumed by one of the other threads. There
+are two such buffers, one for the control messages, and one for the data
+messages. The buffers are implemented with a lock-free FIFO
+\cite{sutter.2008.writing} to avoid contention between the threads.
+
+\begin{algorithm}
+ \caption{Receiving thread}
+ \label{algo.recv}
+ \KwData{
+ \begin{algodata}
+ \VAR{ctrl\_chan}, \VAR{data\_chan}
+ & communication channels (control and data) \\
+ \VAR{ctrl\_fifo}, \VAR{data\_fifo}
+ & buffers of received messages (control and data) \\
+ \end{algodata}}
+ \While{true}{%
+ wait for a message to be available on either \VAR{ctrl\_chan},
+ or \VAR{data\_chan}\;
+ \If{a message is available on \VAR{ctrl\_chan}}{%
+ get the message from \VAR{ctrl\_chan}, and push it into \VAR{ctrl\_fifo}\;
+ }
+ \If{a message is available on \VAR{data\_chan}}{%
+ get the message from \VAR{data\_chan}, and push it into \VAR{data\_fifo}\;
+ }
+ }
+\end{algorithm}
+
+\subsubsection{Computing thread}
+
+The computing thread is in charge of the real load management. As exposed in
+Algorithm~\ref{algo.comp}, it iteratively runs the following operations:
+\begin{itemize}
+\item if some load was received from the neighbors, get it;
+\item if there is some load to send to the neighbors, send it;
+\item run some computation, whose duration is function of the current
+ load of the processor.
+\end{itemize}
+Practically, after the computation, the computing thread waits for a
+small amount of time if the iterations are looping too fast (for
+example, when the current load is near zero).
+
+\begin{algorithm}
+ \caption{Computing thread}
+ \label{algo.comp}
+ \KwData{
+ \begin{algodata}
+ \VAR{data\_fifo} & buffer of received data messages \\
+ \VAR{real\_load} & current load \\
+ \end{algodata}}
+ \While{true}{%
+ \If{\VAR{data\_fifo} is empty and $\VAR{real\_load} = 0$}{%
+ wait until a message is pushed into \VAR{data\_fifo}\;
+ }
+ \While{\VAR{data\_fifo} is not empty}{%
+ pop a message from \VAR{data\_fifo}\;
+ get the load embedded in the message, and add it to \VAR{real\_load}\;
+ }
+ \ForEach{neighbor $n$}{%
+ \If{there is some amount of load $a$ to send to $n$}{%
+ send $a$ units of load to $n$, and subtract it from \VAR{real\_load}\;
+ }
+ }
+ \If{$\VAR{real\_load} > 0.0$}{
+ simulate some computation, whose duration is function of \VAR{real\_load}\;
+ ensure that the main loop does not iterate too fast\;
+ }
+ }
+\end{algorithm}
+
+\subsubsection{Load-balancing thread}
+
+The load-balancing thread is in charge of running the load-balancing algorithm,
+and exchange the control messages. As shown in Algorithm~\ref{algo.lb}, it
+iteratively runs the following operations:
+\begin{itemize}
+\item get the control messages that were received from the neighbors;
+\item run the load-balancing algorithm;
+\item send control messages to the neighbors, to inform them of the
+ processor's current load, and possibly of virtual load transfers;
+\item wait a minimum (configurable) amount of time, to avoid to
+ iterate too fast.
+\end{itemize}
+
+\begin{algorithm}
+ \caption{Load-balancing}
+ \label{algo.lb}
+ \While{true}{%
+ \While{\VAR{ctrl\_fifo} is not empty}{%
+ pop a message from \VAR{ctrl\_fifo}\;
+ identify the sender of the message,
+ and update the current knowledge of its load\;
+ }
+ run the load-balancing algorithm to make the decision about load transfers\;
+ \ForEach{neighbor $n$}{%
+ send a control messages to $n$\;
+ }
+ ensure that the main loop does not iterate too fast\;
+ }
+\end{algorithm}
+
+\paragraph{}\FIXME{ajouter des détails sur la gestion de la charge virtuelle ?
+par ex, donner l'idée générale de l'implémentation. l'idée générale est déja décrite en section~\ref{Virtual load}}
+
+\subsection{Experimental contexts}
+\label{Contexts}
+
+In order to assess the performances of our algorithms, we ran our
+simulator with various parameters, and extracted several metrics, that
+we will describe in this section.
+
+\subsubsection{Load balancing strategies}
+
+Several load balancing strategies were compared. We ran the experiments with
+the \emph{Best effort}, and with the \emph{Makhoul} strategies. \emph{Best
+ effort} was tested with parameter $k = 1$, $k = 2$, and $k = 4$. Secondly,
+each strategy was run in its two variants: with, and without the management of
+\emph{virtual load}. Finally, we tested each configuration with \emph{real},
+and with \emph{integer} load.
+
+To summarize the different load balancing strategies, we have:
+\begin{description}
+\item[\textbf{strategies:}] \emph{Makhoul}, or \emph{Best effort} with $k\in
+ \{1,2,4\}$
+\item[\textbf{variants:}] with, or without virtual load
+\item[\textbf{domain:}] real load, or integer load
+\end{description}
+%
+This gives us as many as $4\times 2\times 2 = 16$ different strategies.
+
+\subsubsection{End of the simulation}
+
+The simulations were run until the load was nearly balanced among the
+participating nodes. More precisely the simulation stops when each node holds
+an amount of load at less than 1\% of the load average, during an arbitrary
+number of computing iterations (2000 in our case).