+In the simulation model the processors exchange messages which are of
+two kinds. First, there are \emph{control messages} which only carry
+information that is exchanged between the processors, such as the
+current load, or the virtual load transfers if this option is
+selected. These messages are rather small, and their size is
+constant. Then, there are \emph{data messages} that carry the real
+load transferred between the processors. The size of a data message
+is a function of the amount of load that it carries, and it can be
+pretty large. In order to receive the messages, each processor has
+two receiving channels, one for each kind of messages. Finally, when
+a message is sent or received, this is done by using the non-blocking
+primitives of SimGrid\footnote{That are \texttt{MSG\_task\_isend()},
+ and \texttt{MSG\_task\_irecv()}.}.
+
+During the simulation, each processor concurrently runs three threads:
+a \emph{receiving thread}, a \emph{computing thread}, and a
+\emph{load-balancing thread}, which we will briefly describe now.
+
+\paragraph{Receiving thread} The receiving thread is in charge of
+waiting for messages to come, either on the control channel, or on the
+data channel. Its behavior is sketched by Algorithm~\ref{algo.recv}.
+When a message is received, it is pushed in a buffer of
+received message, to be later consumed by one of the other threads.
+There are two such buffers, one for the control messages, and one for
+the data messages. The buffers are implemented with a lock-free FIFO
+\cite{sutter.2008.writing} to avoid contention between the threads.
+
+\begin{algorithm}
+ \caption{Receiving thread}
+ \label{algo.recv}
+ \KwData{
+ \begin{algodata}
+ \VAR{ctrl\_chan}, \VAR{data\_chan}
+ & communication channels (control and data) \\
+ \VAR{ctrl\_fifo}, \VAR{data\_fifo}
+ & buffers of received messages (control and data) \\
+ \end{algodata}}
+ \While{true}{%
+ wait for a message to be available on either \VAR{ctrl\_chan},
+ or \VAR{data\_chan}\;
+ \If{a message is available on \VAR{ctrl\_chan}}{%
+ get the message from \VAR{ctrl\_chan}, and push it into \VAR{ctrl\_fifo}\;
+ }
+ \If{a message is available on \VAR{data\_chan}}{%
+ get the message from \VAR{data\_chan}, and push it into \VAR{data\_fifo}\;
+ }
+ }
+\end{algorithm}
+
+\paragraph{Computing thread} The computing thread is in charge of the
+real load management. As exposed in Algorithm~\ref{algo.comp}, it
+iteratively runs the following operations:
+\begin{itemize}
+\item if some load was received from the neighbors, get it;
+\item if there is some load to send to the neighbors, send it;
+\item run some computation, whose duration is function of the current
+ load of the processor.
+\end{itemize}
+Practically, after the computation, the computing thread waits for a
+small amount of time if the iterations are looping too fast (for
+example, when the current load is near zero).
+
+\begin{algorithm}
+ \caption{Computing thread}
+ \label{algo.comp}
+ \KwData{
+ \begin{algodata}
+ \VAR{data\_fifo} & buffer of received data messages \\
+ \VAR{real\_load} & current load \\
+ \end{algodata}}
+ \While{true}{%
+ \If{\VAR{data\_fifo} is empty and $\VAR{real\_load} = 0$}{%
+ wait until a message is pushed into \VAR{data\_fifo}\;
+ }
+ \While{\VAR{data\_fifo} is not empty}{%
+ pop a message from \VAR{data\_fifo}\;
+ get the load embedded in the message, and add it to \VAR{real\_load}\;
+ }
+ \ForEach{neighbor $n$}{%
+ \If{there is some amount of load $a$ to send to $n$}{%
+ send $a$ units of load to $n$, and subtract it from \VAR{real\_load}\;
+ }
+ }
+ \If{$\VAR{real\_load} > 0.0$}{
+ simulate some computation, whose duration is function of \VAR{real\_load}\;
+ ensure that the main loop does not iterate too fast\;
+ }
+ }
+\end{algorithm}
+
+\paragraph{Load-balancing thread} The load-balancing thread is in
+charge of running the load-balancing algorithm, and exchange the
+control messages. As shown in Algorithm~\ref{algo.lb}, it iteratively
+runs the following operations:
+\begin{itemize}
+\item get the control messages that were received from the neighbors;
+\item run the load-balancing algorithm;
+\item send control messages to the neighbors, to inform them of the
+ processor's current load, and possibly of virtual load transfers;
+\item wait a minimum (configurable) amount of time, to avoid to
+ iterate too fast.
+\end{itemize}
+
+\begin{algorithm}
+ \caption{Load-balancing}
+ \label{algo.lb}
+ \While{true}{%
+ \While{\VAR{ctrl\_fifo} is not empty}{%
+ pop a message from \VAR{ctrl\_fifo}\;
+ identify the sender of the message,
+ and update the current knowledge of its load\;
+ }
+ run the load-balancing algorithm to make the decision about load transfers\;
+ \ForEach{neighbor $n$}{%
+ send a control messages to $n$\;
+ }
+ ensure that the main loop does not iterate too fast\;
+ }
+\end{algorithm}
+
+\paragraph{}
+For the sake of simplicity, a few details were voluntary omitted from
+these descriptions. For an exhaustive presentation, we refer to the
+actual source code that was used for the experiments%
+\footnote{As mentioned before, our simulator relies on the SimGrid
+ framework~\cite{casanova+legrand+quinson.2008.simgrid}. For the
+ experiments, we used a pre-release of SimGrid 3.7 (Git commit
+ 67d62fca5bdee96f590c942b50021cdde5ce0c07, available from
+ \url{https://gforge.inria.fr/scm/?group_id=12})}, and which is
+available at
+\url{http://info.iut-bm.univ-fcomte.fr/staff/giersch/software/loba.tar.gz}.
+
+\FIXME{ajouter des détails sur la gestion de la charge virtuelle ?}
+
+\subsection{Experimental contexts}
+\label{Contexts}
+
+In order to assess the performances of our algorithms, we ran our
+simulator with various parameters, and extracted several metrics, that
+we will describe in this section.
+
+\paragraph{Load balancing strategies}
+
+Several load balancing strategies were compared. We ran the experiments with
+the \emph{Best effort}, and with the \emph{Makhoul} strategies. \emph{Best
+ effort} was tested with parameter $k = 1$, $k = 2$, and $k = 4$. Secondly,
+each strategy was run in its two variants: with, and without the management of
+\emph{virtual load}. Finally, we tested each configuration with \emph{real},
+and with \emph{integer} load.
+
+To summarize the different load balancing strategies, we have:
+\begin{description}
+\item[\textbf{strategies:}] \emph{Makhoul}, or \emph{Best effort} with $k\in
+ \{1,2,4\}$
+\item[\textbf{variants:}] with, or without virtual load
+\item[\textbf{domain:}] real load, or integer load
+\end{description}
+%
+This gives us as many as $4\times 2\times 2 = 16$ different strategies.
+
+
+\paragraph{Configurations}
+
+In order to show the behavior of the different strategies in different
+settings, we simulated the executions on two sorts of platforms. These two
+sorts of platforms differ by their underlaid network topology. On the one hand,
+we have homogeneous platforms, modeled as a cluster. On the other hand, we have
+heterogeneous platforms, modeled as the interconnection of a number of clusters.
+The heterogeneous platform descriptions were created by taking a subset of the
+Grid'5000 infrastructure\footnote{Grid'5000 is a French large scale experimental
+ Grid (see \url{https://www.grid5000.fr/}).}, as described in the platform file
+\texttt{g5k.xml} distributed with SimGrid. Note that the heterogeneity of the
+platform only comes from the network topology. The processor speeds, and
+network bandwidths were normalized since our algorithms currently are not aware
+of such heterogeneity. We arbitrarily chose to fix the processor speed to
+1~GFlop/s, and the network bandwidth to 125~MB/s, with a latency of 50~$\mu$s,
+except for the links between geographically distant sites, where the network
+bandwidth was fixed to 2.25~GB/s, with a latency of 500~$\mu$s.
+
+Then we derived each sort of platform with four different number of computing
+nodes: 16, 64, 256, and 1024 nodes.
+
+The distributed processes of the application were then logically organized along
+three possible topologies: a line, a torus or an hypercube. We ran tests where
+the total load was initially on an only node (at one end for the line topology),
+and other tests where the load was initially randomly distributed across all
+the participating nodes.
+
+For each of the preceding configuration, we finally had to choose the
+computation and communication costs of a load unit. We chose them, such as to
+have three different computation over communication cost ratios, and hence model
+three different kinds of applications:
+\begin{itemize}
+\item mainly communicating, with a computation/communication cost ratio of $1/10$;
+\item mainly computing, with a computation/communication cost ratio of $10/1$ ;
+\item balanced, with a computation/communication cost ratio of $1/1$.
+\end{itemize}
+
+To summarize the various configurations, we have:
+\begin{description}
+\item[\textbf{platforms:}] homogeneous (cluster), or heterogeneous (subset of
+ Grid'5000)
+\item[\textbf{platform sizes:}] platforms with 16, 64, 256, or 1024 nodes
+\item[\textbf{process topologies:}] line, torus, or hypercube
+\item[\textbf{initial load distribution:}] initially on a only node, or
+ initially randomly distributed over all nodes
+\item[\textbf{computation/communication ratio:}] $10/1$, $1/1$, or $1/10$
+\end{description}
+%
+This gives us as many as $2\times 4\times 3\times 2\times 3 = 144$ different
+configurations.
+%
+Combined with the various load balancing strategies, we had $16\times 144 =
+2304$ distinct settings to evaluate. In fact, as it will be shown later, we
+didn't run all the strategies, nor all the configurations for the bigger
+platforms with 1024 nodes, since to simulations would have run for a too long
+time.
+
+Anyway, all these the experiments represent more than 240 hours of computing
+time.
+
+\paragraph{Metrics}
+
+\begin{description}
+\item[\textbf{average idle time}]
+\item[\textbf{average convergence date}]
+\item[\textbf{maximum convergence date}]
+\item[\textbf{data transfer amount}] relative to the total data amount
+\end{description}
+