X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/loba-papers.git/blobdiff_plain/f622eef95fe3973773527fa3260b7563a498ec60..1e4ddc1ac6f6e145312650aa924c3a98871a0dbc:/supercomp11/supercomp11.tex?ds=sidebyside

diff --git a/supercomp11/supercomp11.tex b/supercomp11/supercomp11.tex
index db3e809..2c1cb7c 100644
--- a/supercomp11/supercomp11.tex
+++ b/supercomp11/supercomp11.tex
@@ -14,8 +14,11 @@
   \begin{tabular}[t]{@{}l@{:~}l@{}}}{%
   \end{tabular}}
 
-\newcommand{\FIXME}[1]{%
-  \textbf{$\triangleright$\marginpar{\textbf{[FIXME]}}~#1}}
+\newcommand{\FIXMEmargin}[1]{%
+  \marginpar{\textbf{[FIXME]} {\footnotesize #1}}}
+\newcommand{\FIXME}[2][]{%
+  \ifx #2\relax\relax \FIXMEmargin{#1}%
+  \else \textbf{$\triangleright$\FIXMEmargin{#1}~#2}\fi}
 
 \newcommand{\VAR}[1]{\textit{#1}}
 
@@ -85,7 +88,7 @@ been extended by many authors. For example, CortÃ©s et al., with
 DASUD~\cite{cortes+ripoll+cedo+al.2002.asynchronous}, propose a
 version working with integer load.  This work was later generalized by
 the same authors in \cite{cedo+cortes+ripoll+al.2007.convergence}.
-\FIXME{Rajouter des choses ici.}
+\FIXME{Rajouter des choses ici.  Lesquelles ?}
 
 Although  the Bertsekas  and Tsitsiklis'  algorithm describes  the  condition to
 ensure the convergence,  there is no indication or  strategy to really implement
@@ -186,20 +189,28 @@ $3$.   If  it  sends  load  to  processor $1$  it  will  not  satisfy  condition
 $x_3^2(t)$.  So we consider that the \emph{ping-pong} condition is probably to
 strong. Currently, we did not try to make another convergence proof without this
 condition or with a weaker condition.
-%
-\FIXME{Develop: We have the feeling that such a weaker condition
-  exists, because (it's not a proof, but) we have never seen any
-  scenario that is not leading to convergence, even with LB-strategies
-  that are not fulfilling these two conditions.}
+
+Nevertheless, we conjecture that such a weaker condition exists.  In fact, we
+have never seen any scenario that is not leading to convergence, even with
+load-balancing strategies that are not exactly fulfilling these two conditions.
+
+It may be the subject of future work to express weaker conditions, and to prove
+that they are sufficient to ensure the convergence of the load-balancing
+algorithm.
 
 \section{Best effort strategy}
 \label{Best-effort}
 
-In this section we  describe  a new load-balancing strategy that we call
-\emph{best effort}.  The general idea behind this strategy is that each
-processor, that detects it has more load than some of its neighbors, 
-sends some load to the most of its less loaded neighbors, doing its
-best to reach the equilibrium between those neighbors and himself.
+In this section we describe a new load-balancing strategy that we call
+\emph{best effort}.  First, we explain the general idea behind this strategy,
+and then we describe some variants of this basic strategy.
+
+\subsection{Basic strategy}
+
+The general idea behind the \emph{best effort} strategy is that each processor,
+that detects it has more load than some of its neighbors, sends some load to the
+most of its less loaded neighbors, doing its best to reach the equilibrium
+between those neighbors and himself.
 
 More precisely, when a processor $i$ is in its load-balancing phase,
 he proceeds as following.
@@ -246,38 +257,44 @@ he proceeds as following.
   \end{equation*}
 \end{enumerate}
 
-\FIXME{describe parameter $k$}
-
-\section{Other strategies}
-\label{Other}
+\subsection{Leveling the amount to send}
 
-\FIXME{RÃ©Ã©crire en anglais.}
+With the aforementioned basic strategy, each node does its best to reach the
+equilibrium with its neighbors.  Since each node may be taking the same kind of
+decision at the same moment, there is the risk that a node receives load from
+several of its neighbors, and then is temporary going off the equilibrium state.
+This is particularly true with strongly connected applications.
 
-% \FIXME{faut-il dÃ©crire les stratÃ©gies makhoul et simple ?}
+In order to reduce this effect, we add the ability to level the amount to send.
+The idea, here, is to make smaller steps toward the equilibrium, such that a
+potentially wrong decision has a lower impact.
 
-% \paragraph{simple} Tentative de respecter simplement les conditions de Bertsekas.
-% Parmi les voisins moins chargÃ©s que soi, on sÃ©lectionne :
-% \begin{itemize}
-% \item un des moins chargÃ©s (vmin) ;
-% \item un des plus chargÃ©s (vmax),
-% \end{itemize}
-% puis on Ã©quilibre avec vmin en s'assurant que notre charge reste
-% toujours supÃ©rieure Ã  celle de vmin et Ã  celle de vmax.
+Concretely, once $s_{ij}$ has been evaluated as before, it is simply divided by
+some configurable factor.  That's what we named the ``parameter $k$'' in
+Section~\ref{Results}.  The amount of data to send is then $s_{ij}(t) = (\bar{x}
+- x^i_j(t))/k$.
+\FIXME[check that it's still named $k$ in Sec.~\ref{Results}]{}
 
-% On envoie donc (avec "self" pour soi-mÃªme) :
-% \[
-%     \min\left(\frac{load(self) - load(vmin)}{2}, load(self) - load(vmax)\right)
-% \]
+\section{Other strategies}
+\label{Other}
 
-\paragraph{makhoul} Ordonne les voisins du moins chargÃ© au plus chargÃ©
-puis calcule les diffÃ©rences de charge entre soi-mÃªme et chacun des
-voisins.
+Another load balancing strategy, working under the same conditions, was
+previously developed by Bahi, Giersch, and Makhoul in
+\cite{bahi+giersch+makhoul.2008.scalable}.  In order to assess the performances
+of the new \emph{best effort}, we naturally chose to compare it to this anterior
+work.  More precisely, we will use the algorithm~2 from
+\cite{bahi+giersch+makhoul.2008.scalable} and, in the following, we will
+reference it under the name of Makhoul's.
 
-Ensuite, pour chaque voisin, dans l'ordre, et tant qu'on reste plus
-chargÃ© que le voisin en question, on lui envoie 1/(N+1) de la
-diffÃ©rence calculÃ©e au dÃ©part, avec N le nombre de voisins.
+Here is an outline of the Makhoul's algorithm.  When a given node needs to take
+a load balancing decision, it starts by sorting its neighbors by increasing
+order of their load.  Then, it computes the difference between its own load, and
+the load of each of its neighbors.  Finally, taking the neighbors following the
+order defined before, the amount of load to send $s_{ij}$ is computed as
+$1/(N+1)$ of the load difference, with $N$ being the number of neighbors.  This
+process continues as long as the node is more loaded than the considered
+neighbor.
 
-C'est l'algorithme~2 dans~\cite{bahi+giersch+makhoul.2008.scalable}.
 
 \section{Virtual load}
 \label{Virtual load}
@@ -347,13 +364,25 @@ During the simulation, each processor concurrently runs three threads:
 a \emph{receiving thread}, a \emph{computing thread}, and a
 \emph{load-balancing thread}, which we will briefly describe now.
 
-\paragraph{Receiving thread} The receiving thread is in charge of
-waiting for messages to come, either on the control channel, or on the
-data channel.  Its behavior is sketched by Algorithm~\ref{algo.recv}.
-When a message is received, it is pushed in a buffer of
-received message, to be later consumed by one of the other threads.
-There are two such buffers, one for the control messages, and one for
-the data messages.  The buffers are implemented with a lock-free FIFO
+For the sake of simplicity, a few details were voluntary omitted from
+these descriptions.  For an exhaustive presentation, we refer to the
+actual source code that was used for the experiments%
+\footnote{As mentioned before, our simulator relies on the SimGrid
+  framework~\cite{casanova+legrand+quinson.2008.simgrid}.  For the
+  experiments, we used a pre-release of SimGrid 3.7 (Git commit
+  67d62fca5bdee96f590c942b50021cdde5ce0c07, available from
+  \url{https://gforge.inria.fr/scm/?group_id=12})}, and which is
+available at
+\url{http://info.iut-bm.univ-fcomte.fr/staff/giersch/software/loba.tar.gz}.
+
+\subsubsection{Receiving thread}
+
+The receiving thread is in charge of waiting for messages to come, either on the
+control channel, or on the data channel.  Its behavior is sketched by
+Algorithm~\ref{algo.recv}.  When a message is received, it is pushed in a buffer
+of received message, to be later consumed by one of the other threads.  There
+are two such buffers, one for the control messages, and one for the data
+messages.  The buffers are implemented with a lock-free FIFO
 \cite{sutter.2008.writing} to avoid contention between the threads.
 
 \begin{algorithm}
@@ -378,9 +407,10 @@ the data messages.  The buffers are implemented with a lock-free FIFO
   }
 \end{algorithm}
 
-\paragraph{Computing thread} The computing thread is in charge of the
-real load management.  As exposed in Algorithm~\ref{algo.comp}, it
-iteratively runs the following operations:
+\subsubsection{Computing thread}
+
+The computing thread is in charge of the real load management.  As exposed in
+Algorithm~\ref{algo.comp}, it iteratively runs the following operations:
 \begin{itemize}
 \item if some load was received from the neighbors, get it;
 \item if there is some load to send to the neighbors, send it;
@@ -419,10 +449,11 @@ example, when the current load is near zero).
   }
 \end{algorithm}
 
-\paragraph{Load-balancing thread} The load-balancing thread is in
-charge of running the load-balancing algorithm, and exchange the
-control messages.  As shown in Algorithm~\ref{algo.lb}, it iteratively
-runs the following operations:
+\subsubsection{Load-balancing thread}
+
+The load-balancing thread is in charge of running the load-balancing algorithm,
+and exchange the control messages.  As shown in Algorithm~\ref{algo.lb}, it
+iteratively runs the following operations:
 \begin{itemize}
 \item get the control messages that were received from the neighbors;
 \item run the load-balancing algorithm;
@@ -449,19 +480,8 @@ runs the following operations:
   }
 \end{algorithm}
 
-\paragraph{}
-For the sake of simplicity, a few details were voluntary omitted from
-these descriptions.  For an exhaustive presentation, we refer to the
-actual source code that was used for the experiments%
-\footnote{As mentioned before, our simulator relies on the SimGrid
-  framework~\cite{casanova+legrand+quinson.2008.simgrid}.  For the
-  experiments, we used a pre-release of SimGrid 3.7 (Git commit
-  67d62fca5bdee96f590c942b50021cdde5ce0c07, available from
-  \url{https://gforge.inria.fr/scm/?group_id=12})}, and which is
-available at
-\url{http://info.iut-bm.univ-fcomte.fr/staff/giersch/software/loba.tar.gz}.
-
-\FIXME{ajouter des dÃ©tails sur la gestion de la charge virtuelle ?}
+\paragraph{}\FIXME{ajouter des dÃ©tails sur la gestion de la charge virtuelle ?
+par ex, donner l'idÃ©e gÃ©nÃ©rale de l'implÃ©mentation.  l'idÃ©e gÃ©nÃ©rale est dÃ©ja dÃ©crite en section~\ref{Virtual load}}
 
 \subsection{Experimental contexts}
 \label{Contexts}
@@ -470,7 +490,7 @@ In order to assess the performances of our algorithms, we ran our
 simulator with various parameters, and extracted several metrics, that
 we will describe in this section.
 
-\paragraph{Load balancing strategies}
+\subsubsection{Load balancing strategies}
 
 Several load balancing strategies were compared.  We ran the experiments with
 the \emph{Best effort}, and with the \emph{Makhoul} strategies.  \emph{Best
@@ -489,7 +509,7 @@ To summarize the different load balancing strategies, we have:
 %
 This gives us as many as $4\times 2\times 2 = 16$ different strategies.
 
-\paragraph{End of the simulation}
+\subsubsection{End of the simulation}
 
 The simulations were run until the load was nearly balanced among the
 participating nodes.  More precisely the simulation stops when each node holds
@@ -499,9 +519,10 @@ number of computing iterations (2000 in our case).
 Note that this convergence detection was implemented in a centralized manner.
 This is easy to do within the simulator, but it's obviously not realistic.  In a
 real application we would have chosen a decentralized convergence detection
-algorithm, like the one described in \cite{10.1109/TPDS.2005.2}.
+algorithm, like the one described by Bahi, Contassot-Vivier, Couturier, and
+Vernier in \cite{10.1109/TPDS.2005.2}.
 
-\paragraph{Platforms}
+\subsubsection{Platforms}
 
 In order to show the behavior of the different strategies in different
 settings, we simulated the executions on two sorts of platforms.  These two
@@ -527,7 +548,7 @@ processor speeds were normalized, and we arbitrarily chose to fix them to
 Then we derived each sort of platform with four different number of computing
 nodes: 16, 64, 256, and 1024 nodes.
 
-\paragraph{Configurations}
+\subsubsection{Configurations}
 
 The distributed processes of the application were then logically organized along
 three possible topologies: a line, a torus or an hypercube.  We ran tests where
@@ -570,7 +591,7 @@ time.
 Anyway, all these the experiments represent more than 240 hours of computing
 time.
 
-\paragraph{Metrics}
+\subsubsection{Metrics}
 
 In order to evaluate and compare the different load balancing strategies we had
 to define several metrics.  Our goal, when choosing these metrics, was to have
@@ -611,44 +632,145 @@ With these constraints in mind, we defined the following metrics:
 \end{description}
 
 
-\subsection{Validation of our approaches}
+\subsection{Experimental results}
 \label{Results}
 
+In this section, the results for the different simulations will be presented,
+and we'll try to explain our observations.
+
+\subsubsection{Cluster vs grid platforms}
+
+As mentioned earlier, we simulated the different algorithms on two kinds of
+physical platforms: clusters and grids.  A first observation that we can make,
+is that the graphs we draw from the data have a similar aspect for the two kinds
+of platforms.  The only noticeable difference is that the algorithms need a bit
+more time to achieve the convergence on the grid platforms, than on clusters.
+Nevertheless their relative performances remain generally identical.
+
+This suggests that the relative performances of the different strategies are not
+influenced by the characteristics of the physical platform.  The differences in
+the convergence times can be explained by the fact that on the grid platforms,
+distant sites are interconnected by links of smaller bandwith.
+
+Therefore, in the following, we'll only discuss the results for the grid
+platforms.  The different results are presented on the
+figures~\ref{fig.results1} and~\ref{fig.resultsN}.
+
+\FIXME{explain how to read the graphs}
+ratio 1:1 not given here
+
+\begin{figure*}[p]
+  \centering
+  \includegraphics[width=.5\linewidth]{data/graphs/R1-1:10-grid-line}%
+  \includegraphics[width=.5\linewidth]{data/graphs/R1-10:1-grid-line}
+  \includegraphics[width=.5\linewidth]{data/graphs/R1-1:10-grid-torus}%
+  \includegraphics[width=.5\linewidth]{data/graphs/R1-10:1-grid-torus}
+  \includegraphics[width=.5\linewidth]{data/graphs/R1-1:10-grid-hcube}%
+  \includegraphics[width=.5\linewidth]{data/graphs/R1-10:1-grid-hcube}
+  \caption{Real mode, initially on an only mode, comp/comm ratio = 1/10 (left), or 10/1 (right).}
+  \label{fig.results1}
+\end{figure*}
+
+\begin{figure*}[p]
+  \centering
+  \includegraphics[width=.5\linewidth]{data/graphs/RN-1:10-grid-line}%
+  \includegraphics[width=.5\linewidth]{data/graphs/RN-10:1-grid-line}
+  \includegraphics[width=.5\linewidth]{data/graphs/RN-1:10-grid-torus}%
+  \includegraphics[width=.5\linewidth]{data/graphs/RN-10:1-grid-torus}
+  \includegraphics[width=.5\linewidth]{data/graphs/RN-1:10-grid-hcube}%
+  \includegraphics[width=.5\linewidth]{data/graphs/RN-10:1-grid-hcube}
+  \caption{Real mode, random initial distribution, comp/comm ratio = 1/10 (left), or 10/1 (right).}
+  \label{fig.resultsN}
+\end{figure*}
+
+\subsubsection{Main results}
+
+On fig.~\ref{fig.results1}, \dots
+
+\subsubsection{With the virtual load extension}
+
+\subsubsection{The $k$ parameter}
+
+\subsubsection{With an initial random repartition,  and larger platforms}
+
+\subsubsection{With integer load}
+
+\FIXME{what about the amount of data?}
+
+\begin{itshape}
+\FIXME{remove that part}
+Dans cet ordre:
+...
+- comparer be/makhoul -> be tient la route
+        -> en rÃ©el uniquement
+- valider l'extension virtual load -> c'est 'achement bien
+- proposer le -k -> Ã§a peut aider dans certains cas
+- conclure avec la version entiÃ¨re -> on n'a pas l'effet d'escalier !
+Q: comment inclure les types/tailles de platesformes ?
+Q: comment faire des moyennes ?
+Q: comment introduire les distrib 1/N ?
+...
+
+On constate quoi (vÃ©rifier avec les chiffres)?
+\begin{itemize}
+\item cluster ou grid, entier ou rÃ©el, ne font pas de grosses diffÃ©rences
+
+\item bookkeeping? amÃ©liore souvent les choses, parfois au prix d'un retard au dÃ©marrage
+
+\item makhoul? se fait battre sur les grosses plateformes
+
+\item taille de plateforme?
+
+\item ratio comp/comm?
 
-On veut montrer quoi ? :
+\item option $k$? peut-Ãªtre intÃ©ressant sur des plateformes fortement interconnectÃ©es (hypercube)
 
-1) best plus rapide que les autres (simple, makhoul)
-2) avantage virtual load
+\item volume de comm? souvent, besteffort/plain en fait plus. pourquoi?
 
-Est ce qu'on peut trouver des contre exemple?
-Topologies variÃ©es
+\item rÃ©partition initiale de la charge ?
 
+\item integer mode sur topo. line n'a jamais fini en plain? vÃ©rifier si ce n'est
+  pas Ã  cause de l'effet d'escalier que bk est capable de gommer.
 
-Simulation avec temps dÃ©finies assez long et on mesure la qualitÃ© avec : volume de calcul effectuÃ©, volume de donnÃ©es Ã©changÃ©es
-Mais aussi simulation avec temps court qui montre que seul best converge
+\end{itemize}
+
+% On veut montrer quoi ? :
+
+% 1) best plus rapide que les autres (simple, makhoul)
+% 2) avantage virtual load
 
+% Est ce qu'on peut trouver des contre exemple?
+% Topologies variÃ©es
 
-ExpÃ©s avec ratio calcul/comm rapide et lent
 
-Quelques expÃ©s avec charge initiale alÃ©atoire plutot que sur le premier proc
+% Simulation avec temps dÃ©finies assez long et on mesure la qualitÃ© avec : volume de calcul effectuÃ©, volume de donnÃ©es Ã©changÃ©es
+% Mais aussi simulation avec temps court qui montre que seul best converge
 
-Cadre processeurs homogÃ¨nes
+% ExpÃ©s avec ratio calcul/comm rapide et lent
 
-Topologies statiques
+% Quelques expÃ©s avec charge initiale alÃ©atoire plutot que sur le premier proc
 
-On ne tient pas compte de la vitesse des liens donc on la considÃ¨re homogÃ¨ne
+% Cadre processeurs homogÃ¨nes
 
-Prendre un rÃ©seau hÃ©tÃ©rogÃ¨ne et rendre processeur homogÃ¨ne
+% Topologies statiques
 
-Taille : 10 100 trÃ¨s gros
+% On ne tient pas compte de la vitesse des liens donc on la considÃ¨re homogÃ¨ne
+
+% Prendre un rÃ©seau hÃ©tÃ©rogÃ¨ne et rendre processeur homogÃ¨ne
+
+% Taille : 10 100 trÃ¨s gros
+\end{itshape}
 
 \section{Conclusion and perspectives}
 
+\FIXME{conclude!}
+
 \begin{acknowledgements}
   Computations have been performed on the supercomputer facilities of
   the MÃ©socentre de calcul de Franche-ComtÃ©.
 \end{acknowledgements}
 
+\FIXME{find and add more references}
 \bibliographystyle{spmpsci}
 \bibliography{biblio}
 
@@ -663,4 +785,5 @@ Taille : 10 100 trÃ¨s gros
 
 % LocalWords:  RaphaÃ«l Couturier Arnaud Giersch Abderrahmane Sider Franche ij
 % LocalWords:  Bertsekas Tsitsiklis SimGrid DASUD ComtÃ© BÃ©jaÃ¯a asynchronism ji
-% LocalWords:  ik isend irecv CortÃ©s et al chan ctrl fifo Makhoul GFlop xml
+% LocalWords:  ik isend irecv CortÃ©s et al chan ctrl fifo Makhoul GFlop xml pre
+% LocalWords:  FEMTO Makhoul's fca bdee cdde Contassot Vivier underlaid