\begin{tabular}[t]{@{}l@{:~}l@{}}}{%
\end{tabular}}
-\newcommand{\FIXME}[1]{%
- \textbf{$\triangleright$\marginpar{\textbf{[FIXME]}}~#1}}
+\newcommand{\FIXMEmargin}[1]{%
+ \marginpar{\textbf{[FIXME]} {\footnotesize #1}}}
+\newcommand{\FIXME}[2][]{%
+ \ifx #2\relax\relax \FIXMEmargin{#1}%
+ \else \textbf{$\triangleright$\FIXMEmargin{#1}~#2}\fi}
\newcommand{\VAR}[1]{\textit{#1}}
DASUD~\cite{cortes+ripoll+cedo+al.2002.asynchronous}, propose a
version working with integer load. This work was later generalized by
the same authors in \cite{cedo+cortes+ripoll+al.2007.convergence}.
-\FIXME{Rajouter des choses ici.}
+\FIXME{Rajouter des choses ici. Lesquelles ?}
Although the Bertsekas and Tsitsiklis' algorithm describes the condition to
ensure the convergence, there is no indication or strategy to really implement
$x_3^2(t)$. So we consider that the \emph{ping-pong} condition is probably to
strong. Currently, we did not try to make another convergence proof without this
condition or with a weaker condition.
-%
-\FIXME{Develop: We have the feeling that such a weaker condition
- exists, because (it's not a proof, but) we have never seen any
- scenario that is not leading to convergence, even with LB-strategies
- that are not fulfilling these two conditions.}
+
+Nevertheless, we conjecture that such a weaker condition exists. In fact, we
+have never seen any scenario that is not leading to convergence, even with
+load-balancing strategies that are not exactly fulfilling these two conditions.
+
+It may be the subject of future work to express weaker conditions, and to prove
+that they are sufficient to ensure the convergence of the load-balancing
+algorithm.
\section{Best effort strategy}
\label{Best-effort}
-In this section we describe a new load-balancing strategy that we call
-\emph{best effort}. The general idea behind this strategy is that each
-processor, that detects it has more load than some of its neighbors,
-sends some load to the most of its less loaded neighbors, doing its
-best to reach the equilibrium between those neighbors and himself.
+In this section we describe a new load-balancing strategy that we call
+\emph{best effort}. First, we explain the general idea behind this strategy,
+and then we describe some variants of this basic strategy.
+
+\subsection{Basic strategy}
+
+The general idea behind the \emph{best effort} strategy is that each processor,
+that detects it has more load than some of its neighbors, sends some load to the
+most of its less loaded neighbors, doing its best to reach the equilibrium
+between those neighbors and himself.
More precisely, when a processor $i$ is in its load-balancing phase,
he proceeds as following.
\end{equation*}
\end{enumerate}
-\FIXME{describe parameter $k$}
+\subsection{Leveling the amount to send}
-\section{Other strategies}
-\label{Other}
+With the aforementioned basic strategy, each node does its best to reach the
+equilibrium with its neighbors. Since each node may be taking the same kind of
+decision at the same moment, there is the risk that a node receives load from
+several of its neighbors, and then is temporary going off the equilibrium state.
+This is particularly true with strongly connected applications.
-\FIXME{Réécrire en angliche.}
+In order to reduce this effect, we add the ability to level the amount to send.
+The idea, here, is to make smaller steps toward the equilibrium, such that a
+potentially wrong decision has a lower impact.
-% \FIXME{faut-il décrire les stratégies makhoul et simple ?}
+Concretely, once $s_{ij}$ has been evaluated as before, it is simply divided by
+some configurable factor. That's what we named the ``parameter $k$'' in
+Section~\ref{Results}. The amount of data to send is then $s_{ij}(t) = (\bar{x}
+- x^i_j(t))/k$.
+\FIXME[check that it's still named $k$ in Sec.~\ref{Results}]{}
-% \paragraph{simple} Tentative de respecter simplement les conditions de Bertsekas.
-% Parmi les voisins moins chargés que soi, on sélectionne :
-% \begin{itemize}
-% \item un des moins chargés (vmin) ;
-% \item un des plus chargés (vmax),
-% \end{itemize}
-% puis on équilibre avec vmin en s'assurant que notre charge reste
-% toujours supérieure à celle de vmin et à celle de vmax.
-
-% On envoie donc (avec "self" pour soi-même) :
-% \[
-% \min\left(\frac{load(self) - load(vmin)}{2}, load(self) - load(vmax)\right)
-% \]
+\section{Other strategies}
+\label{Other}
-\paragraph{makhoul} Ordonne les voisins du moins chargé au plus chargé
-puis calcule les différences de charge entre soi-même et chacun des
-voisins.
+Another load balancing strategy, working under the same conditions, was
+previously developed by Bahi, Giersch, and Makhoul in
+\cite{bahi+giersch+makhoul.2008.scalable}. In order to assess the performances
+of the new \emph{best effort}, we naturally chose to compare it to this anterior
+work. More precisely, we will use the algorithm~2 from
+\cite{bahi+giersch+makhoul.2008.scalable} and, in the following, we will
+reference it under the name of Makhoul's.
-Ensuite, pour chaque voisin, dans l'ordre, et tant qu'on reste plus
-chargé que le voisin en question, on lui envoie 1/(N+1) de la
-différence calculée au départ, avec N le nombre de voisins.
+Here is an outline of the Makhoul's algorithm. When a given node needs to take
+a load balancing decision, it starts by sorting its neighbors by increasing
+order of their load. Then, it computes the difference between its own load, and
+the load of each of its neighbors. Finally, taking the neighbors following the
+order defined before, the amount of load to send $s_{ij}$ is computed as
+$1/(N+1)$ of the load difference, with $N$ being the number of neighbors. This
+process continues as long as the node is more loaded than the considered
+neighbor.
-C'est l'algorithme~2 dans~\cite{bahi+giersch+makhoul.2008.scalable}.
\section{Virtual load}
\label{Virtual load}
available at
\url{http://info.iut-bm.univ-fcomte.fr/staff/giersch/software/loba.tar.gz}.
-\FIXME{ajouter des détails sur la gestion de la charge virtuelle ?}
+\FIXME{ajouter des détails sur la gestion de la charge virtuelle ?
+par ex, donner l'idée générale de l'implémentation. l'idée générale est déja décrite en section~\ref{Virtual load}}
\subsection{Experimental contexts}
\label{Contexts}
number of computing iterations (2000 in our case).
Note that this convergence detection was implemented in a centralized manner.
-This is easy to do within the simulator, but it's obviously not realistic. In
-a real application we would have chosen a decentralized convergence detection algorithm, like the one described in \cite{10.1109/TPDS.2005.2}.
+This is easy to do within the simulator, but it's obviously not realistic. In a
+real application we would have chosen a decentralized convergence detection
+algorithm, like the one described by Bahi, Contassot-Vivier, Couturier, and
+Vernier in \cite{10.1109/TPDS.2005.2}.
\paragraph{Platforms}
sorts of platforms differ by their underlaid network topology. On the one hand,
we have homogeneous platforms, modeled as a cluster. On the other hand, we have
heterogeneous platforms, modeled as the interconnection of a number of clusters.
+
+The clusters were modeled by a fixed number of computing nodes interconnected
+through a backbone link. Each computing node has a computing power of
+1~GFlop/s, and is connected to the backbone by a network link whose bandwidth is
+of 125~MB/s, with a latency of 50~$\mu$s. The backbone has a network bandwidth
+of 2.25~GB/s, with a latency of 500~$\mu$s.
+
The heterogeneous platform descriptions were created by taking a subset of the
Grid'5000 infrastructure\footnote{Grid'5000 is a French large scale experimental
Grid (see \url{https://www.grid5000.fr/}).}, as described in the platform file
\texttt{g5k.xml} distributed with SimGrid. Note that the heterogeneity of the
-platform only comes from the network topology. The processor speeds, and
-network bandwidths were normalized since our algorithms currently are not aware
-of such heterogeneity. We arbitrarily chose to fix the processor speed to
-1~GFlop/s, and the network bandwidth to 125~MB/s, with a latency of 50~$\mu$s,
-except for the links between geographically distant sites, where the network
-bandwidth was fixed to 2.25~GB/s, with a latency of 500~$\mu$s.
+platform here only comes from the network topology. Indeed, since our
+algorithms currently do not handle heterogeneous computing resources, the
+processor speeds were normalized, and we arbitrarily chose to fix them to
+1~GFlop/s.
Then we derived each sort of platform with four different number of computing
nodes: 16, 64, 256, and 1024 nodes.
The distributed processes of the application were then logically organized along
three possible topologies: a line, a torus or an hypercube. We ran tests where
the total load was initially on an only node (at one end for the line topology),
-and other tests where the load was initially randomly distributed across all
-the participating nodes.
+and other tests where the load was initially randomly distributed across all the
+participating nodes. The total amount of load was fixed to a number of load
+units equal to 1000 times the number of node. The average load is then of 1000
+load units.
For each of the preceding configuration, we finally had to choose the
computation and communication costs of a load unit. We chose them, such as to
\paragraph{Metrics}
-In order to evaluate and compare the different load balancing strategies, we
-choose to measure the following metrics:
+In order to evaluate and compare the different load balancing strategies we had
+to define several metrics. Our goal, when choosing these metrics, was to have
+something tending to a constant value, i.e. to have a measure which is not
+changing anymore once the convergence state is reached. Moreover, we wanted to
+have some normalized value, in order to be able to compare them across different
+settings.
+With these constraints in mind, we defined the following metrics:
+%
\begin{description}
-\item[\textbf{average idle time:}]
-\item[\textbf{average convergence date:}]
-\item[\textbf{maximum convergence date:}]
-\item[\textbf{data transfer amount:}] relative to the total data amount
+\item[\textbf{average idle time:}] that's the total time spent, when the nodes
+ don't hold any share of load, and thus have nothing to compute. This total
+ time is divided by the number of participating nodes, such as to have a number
+ that can be compared between simulations of different sizes.
+
+ This metric is expected to give an idea of the ability of the strategy to
+ diffuse the load quickly. A smaller value is better.
+
+\item[\textbf{average convergence date:}] that's the average of the dates when
+ all nodes reached the convergence state. The dates are measured as a number
+ of (simulated) seconds since the beginning of the simulation.
+
+\item[\textbf{maximum convergence date:}] that's the date when the last node
+ reached the convergence state.
+
+ These two dates give an idea of the time needed by the strategy to reach the
+ equilibrium state. A smaller value is better.
+
+\item[\textbf{data transfer amount:}] that's the sum of the amount of all data
+ transfers during the simulation. This sum is then normalized by dividing it
+ by the total amount of data present in the system.
+
+ This metric is expected to give an idea of the efficiency of the strategy in
+ terms of data movements, i.e. its ability to reach the equilibrium with fewer
+ transfers. Again, a smaller value is better.
+
\end{description}
-\FIXME{dire à chaque fois ce que ça représente, et motiver le choix}
+
\subsection{Validation of our approaches}
\label{Results}
+Dans cet ordre:
+...
+- comparer be/makhoul -> be tient la route
+ -> en réel uniquement
+- valider l'extension virtual load -> c'est 'achement bien
+- proposer le -k -> ça peut aider dans certains cas
+- conclure avec la version entière -> on n'a pas l'effet d'escalier !
+Q: comment inclure les types/tailles de platesformes ?
+Q: comment faire des moyennes ?
+Q: comment introduire les distrib 1/N ?
+...
+
+On constate quoi (vérifier avec les chiffres)?
+\begin{itemize}
+\item cluster ou grid, entier ou réel, ne font pas de grosses différences
+
+\item bookkeeping? améliore souvent les choses, parfois au prix d'un retard au démarrage
+
+\item makhoul? se fait battre sur les grosses plateformes
+
+\item taille de plateforme?
+
+\item ratio comp/comm?
+\item option $k$? peut-être intéressant sur des plateformes fortement interconnectées (hypercube)
+
+\item volume de comm? souvent, besteffort/plain en fait plus. pourquoi?
+
+\item répartition initiale de la charge ?
+
+\item integer mode sur topo. line n'a jamais fini en plain? vérifier si ce n'est
+ pas à cause de l'effet d'escalier que bk est capable de gommer.
+
+\end{itemize}
+
+\begin{itshape}
On veut montrer quoi ? :
+\FIXME{remove that part}
1) best plus rapide que les autres (simple, makhoul)
2) avantage virtual load
Simulation avec temps définies assez long et on mesure la qualité avec : volume de calcul effectué, volume de données échangées
Mais aussi simulation avec temps court qui montre que seul best converge
-
Expés avec ratio calcul/comm rapide et lent
Quelques expés avec charge initiale aléatoire plutot que sur le premier proc
Prendre un réseau hétérogène et rendre processeur homogène
Taille : 10 100 très gros
+\end{itshape}
\section{Conclusion and perspectives}
+\FIXME{conclude!}
+
\begin{acknowledgements}
Computations have been performed on the supercomputer facilities of
the Mésocentre de calcul de Franche-Comté.
\end{acknowledgements}
+\FIXME{find and add more references}
\bibliographystyle{spmpsci}
\bibliography{biblio}
% LocalWords: Raphaël Couturier Arnaud Giersch Abderrahmane Sider Franche ij
% LocalWords: Bertsekas Tsitsiklis SimGrid DASUD Comté Béjaïa asynchronism ji
-% LocalWords: ik isend irecv Cortés et al chan ctrl fifo Makhoul
+% LocalWords: ik isend irecv Cortés et al chan ctrl fifo Makhoul GFlop xml pre
+% LocalWords: FEMTO Makhoul's fca bdee cdde Contassot Vivier underlaid