X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/loba-papers.git/blobdiff_plain/6f6ebdac1613681f5093eb2303424044b5f60f12..3c2f33b63a5a1a5dd563955b54c361efaa0d2690:/supercomp11/supercomp11.tex diff --git a/supercomp11/supercomp11.tex b/supercomp11/supercomp11.tex index 3a1ec31..a7bc09a 100644 --- a/supercomp11/supercomp11.tex +++ b/supercomp11/supercomp11.tex @@ -14,8 +14,11 @@ \begin{tabular}[t]{@{}l@{:~}l@{}}}{% \end{tabular}} -\newcommand{\FIXME}[1]{% - \textbf{$\triangleright$\marginpar{\textbf{[FIXME]}}~#1}} +\newcommand{\FIXMEmargin}[1]{% + \marginpar{\textbf{[FIXME]} {\footnotesize #1}}} +\newcommand{\FIXME}[2][]{% + \ifx #2\relax\relax \FIXMEmargin{#1}% + \else \textbf{$\triangleright$\FIXMEmargin{#1}~#2}\fi} \newcommand{\VAR}[1]{\textit{#1}} @@ -29,12 +32,12 @@ } \institute{R. Couturier \and A. Giersch \at - LIFC, University of Franche-Comté, Belfort, France \\ + FEMTO-ST, University of Franche-Comté, Belfort, France \\ % Tel.: +123-45-678910\\ % Fax: +123-45-678910\\ \email{% - raphael.couturier@univ-fcomte.fr, - arnaud.giersch@univ-fcomte.fr} + raphael.couturier@femto-st.fr, + arnaud.giersch@femto-st.fr} } \maketitle @@ -85,7 +88,7 @@ been extended by many authors. For example, Cortés et al., with DASUD~\cite{cortes+ripoll+cedo+al.2002.asynchronous}, propose a version working with integer load. This work was later generalized by the same authors in \cite{cedo+cortes+ripoll+al.2007.convergence}. -\FIXME{Rajouter des choses ici.} +\FIXME{Rajouter des choses ici. Lesquelles ?} Although the Bertsekas and Tsitsiklis' algorithm describes the condition to ensure the convergence, there is no indication or strategy to really implement @@ -186,20 +189,28 @@ $3$. If it sends load to processor $1$ it will not satisfy condition $x_3^2(t)$. So we consider that the \emph{ping-pong} condition is probably to strong. Currently, we did not try to make another convergence proof without this condition or with a weaker condition. -% -\FIXME{Develop: We have the feeling that such a weaker condition - exists, because (it's not a proof, but) we have never seen any - scenario that is not leading to convergence, even with LB-strategies - that are not fulfilling these two conditions.} + +Nevertheless, we conjecture that such a weaker condition exists. In fact, we +have never seen any scenario that is not leading to convergence, even with +load-balancing strategies that are not exactly fulfilling these two conditions. + +It may be the subject of future work to express weaker conditions, and to prove +that they are sufficient to ensure the convergence of the load-balancing +algorithm. \section{Best effort strategy} \label{Best-effort} -In this section we describe a new load-balancing strategy that we call -\emph{best effort}. The general idea behind this strategy is that each -processor, that detects it has more load than some of its neighbors, -sends some load to the most of its less loaded neighbors, doing its -best to reach the equilibrium between those neighbors and himself. +In this section we describe a new load-balancing strategy that we call +\emph{best effort}. First, we explain the general idea behind this strategy, +and then we describe some variants of this basic strategy. + +\subsection{Basic strategy} + +The general idea behind the \emph{best effort} strategy is that each processor, +that detects it has more load than some of its neighbors, sends some load to the +most of its less loaded neighbors, doing its best to reach the equilibrium +between those neighbors and himself. More precisely, when a processor $i$ is in its load-balancing phase, he proceeds as following. @@ -246,38 +257,44 @@ he proceeds as following. \end{equation*} \end{enumerate} -\FIXME{describe parameter $k$} +\subsection{Leveling the amount to send} -\section{Other strategies} -\label{Other} +With the aforementioned basic strategy, each node does its best to reach the +equilibrium with its neighbors. Since each node may be taking the same kind of +decision at the same moment, there is the risk that a node receives load from +several of its neighbors, and then is temporary going off the equilibrium state. +This is particularly true with strongly connected applications. -\FIXME{Réécrire en angliche.} +In order to reduce this effect, we add the ability to level the amount to send. +The idea, here, is to make smaller steps toward the equilibrium, such that a +potentially wrong decision has a lower impact. -% \FIXME{faut-il décrire les stratégies makhoul et simple ?} +Concretely, once $s_{ij}$ has been evaluated as before, it is simply divided by +some configurable factor. That's what we named the ``parameter $k$'' in +Section~\ref{Results}. The amount of data to send is then $s_{ij}(t) = (\bar{x} +- x^i_j(t))/k$. +\FIXME[check that it's still named $k$ in Sec.~\ref{Results}]{} -% \paragraph{simple} Tentative de respecter simplement les conditions de Bertsekas. -% Parmi les voisins moins chargés que soi, on sélectionne : -% \begin{itemize} -% \item un des moins chargés (vmin) ; -% \item un des plus chargés (vmax), -% \end{itemize} -% puis on équilibre avec vmin en s'assurant que notre charge reste -% toujours supérieure à celle de vmin et à celle de vmax. - -% On envoie donc (avec "self" pour soi-même) : -% \[ -% \min\left(\frac{load(self) - load(vmin)}{2}, load(self) - load(vmax)\right) -% \] +\section{Other strategies} +\label{Other} -\paragraph{makhoul} Ordonne les voisins du moins chargé au plus chargé -puis calcule les différences de charge entre soi-même et chacun des -voisins. +Another load balancing strategy, working under the same conditions, was +previously developed by Bahi, Giersch, and Makhoul in +\cite{bahi+giersch+makhoul.2008.scalable}. In order to assess the performances +of the new \emph{best effort}, we naturally chose to compare it to this anterior +work. More precisely, we will use the algorithm~2 from +\cite{bahi+giersch+makhoul.2008.scalable} and, in the following, we will +reference it under the name of Makhoul's. -Ensuite, pour chaque voisin, dans l'ordre, et tant qu'on reste plus -chargé que le voisin en question, on lui envoie 1/(N+1) de la -différence calculée au départ, avec N le nombre de voisins. +Here is an outline of the Makhoul's algorithm. When a given node needs to take +a load balancing decision, it starts by sorting its neighbors by increasing +order of their load. Then, it computes the difference between its own load, and +the load of each of its neighbors. Finally, taking the neighbors following the +order defined before, the amount of load to send $s_{ij}$ is computed as +$1/(N+1)$ of the load difference, with $N$ being the number of neighbors. This +process continues as long as the node is more loaded than the considered +neighbor. -C'est l'algorithme~2 dans~\cite{bahi+giersch+makhoul.2008.scalable}. \section{Virtual load} \label{Virtual load} @@ -347,13 +364,25 @@ During the simulation, each processor concurrently runs three threads: a \emph{receiving thread}, a \emph{computing thread}, and a \emph{load-balancing thread}, which we will briefly describe now. -\paragraph{Receiving thread} The receiving thread is in charge of -waiting for messages to come, either on the control channel, or on the -data channel. Its behavior is sketched by Algorithm~\ref{algo.recv}. -When a message is received, it is pushed in a buffer of -received message, to be later consumed by one of the other threads. -There are two such buffers, one for the control messages, and one for -the data messages. The buffers are implemented with a lock-free FIFO +For the sake of simplicity, a few details were voluntary omitted from +these descriptions. For an exhaustive presentation, we refer to the +actual source code that was used for the experiments% +\footnote{As mentioned before, our simulator relies on the SimGrid + framework~\cite{casanova+legrand+quinson.2008.simgrid}. For the + experiments, we used a pre-release of SimGrid 3.7 (Git commit + 67d62fca5bdee96f590c942b50021cdde5ce0c07, available from + \url{https://gforge.inria.fr/scm/?group_id=12})}, and which is +available at +\url{http://info.iut-bm.univ-fcomte.fr/staff/giersch/software/loba.tar.gz}. + +\subsubsection{Receiving thread} + +The receiving thread is in charge of waiting for messages to come, either on the +control channel, or on the data channel. Its behavior is sketched by +Algorithm~\ref{algo.recv}. When a message is received, it is pushed in a buffer +of received message, to be later consumed by one of the other threads. There +are two such buffers, one for the control messages, and one for the data +messages. The buffers are implemented with a lock-free FIFO \cite{sutter.2008.writing} to avoid contention between the threads. \begin{algorithm} @@ -378,9 +407,10 @@ the data messages. The buffers are implemented with a lock-free FIFO } \end{algorithm} -\paragraph{Computing thread} The computing thread is in charge of the -real load management. As exposed in Algorithm~\ref{algo.comp}, it -iteratively runs the following operations: +\subsubsection{Computing thread} + +The computing thread is in charge of the real load management. As exposed in +Algorithm~\ref{algo.comp}, it iteratively runs the following operations: \begin{itemize} \item if some load was received from the neighbors, get it; \item if there is some load to send to the neighbors, send it; @@ -419,10 +449,11 @@ example, when the current load is near zero). } \end{algorithm} -\paragraph{Load-balancing thread} The load-balancing thread is in -charge of running the load-balancing algorithm, and exchange the -control messages. As shown in Algorithm~\ref{algo.lb}, it iteratively -runs the following operations: +\subsubsection{Load-balancing thread} + +The load-balancing thread is in charge of running the load-balancing algorithm, +and exchange the control messages. As shown in Algorithm~\ref{algo.lb}, it +iteratively runs the following operations: \begin{itemize} \item get the control messages that were received from the neighbors; \item run the load-balancing algorithm; @@ -449,95 +480,280 @@ runs the following operations: } \end{algorithm} -\paragraph{} -For the sake of simplicity, a few details were voluntary omitted from -these descriptions. For an exhaustive presentation, we refer to the -actual source code that was used for the experiments% -\footnote{As mentioned before, our simulator relies on the SimGrid - framework~\cite{casanova+legrand+quinson.2008.simgrid}. For the - experiments, we used a pre-release of SimGrid 3.7 (Git commit - 67d62fca5bdee96f590c942b50021cdde5ce0c07, available from - \url{https://gforge.inria.fr/scm/?group_id=12})}, and which is -available at -\url{http://info.iut-bm.univ-fcomte.fr/staff/giersch/software/loba.tar.gz}. - -\FIXME{ajouter des détails sur la gestion de la charge virtuelle ?} +\paragraph{}\FIXME{ajouter des détails sur la gestion de la charge virtuelle ? +par ex, donner l'idée générale de l'implémentation. l'idée générale est déja décrite en section~\ref{Virtual load}} \subsection{Experimental contexts} \label{Contexts} In order to assess the performances of our algorithms, we ran our simulator with various parameters, and extracted several metrics, that -we will describe in this section. Overall, the experiments represent -more than 240 hours of computing time. +we will describe in this section. -\paragraph{Load balancing strategies} +\subsubsection{Load balancing strategies} -We ran the experiments with the \emph{Best effort}, and with the \emph{Makhoul} -strategies. \emph{Best effort} was tested with parameter $k = 1$, $k = 2$, and -$k = 4$. Secondly, each strategy was run in its two variants: with, and without -the management of \emph{virtual load}. Finally, we tested each configuration -with \emph{real}, and with \emph{integer} load. -This gives us as many as 32 different strategies. +Several load balancing strategies were compared. We ran the experiments with +the \emph{Best effort}, and with the \emph{Makhoul} strategies. \emph{Best + effort} was tested with parameter $k = 1$, $k = 2$, and $k = 4$. Secondly, +each strategy was run in its two variants: with, and without the management of +\emph{virtual load}. Finally, we tested each configuration with \emph{real}, +and with \emph{integer} load. -\paragraph{Configurations} +To summarize the different load balancing strategies, we have: \begin{description} -\item[\textbf{platforms}] homogeneous (cluster); heterogeneous (subset - of Grid5000) -\item[\textbf{platform size}] platforms with 16, 64, 256, and 1024 nodes -\item[\textbf{topologies}] line; torus; hypercube -\item[\textbf{initial load distribution}] initially on a only node; - initially on all nodes -\item[\textbf{comp/comm ratio}] $10/1$, $1/1$, $1/10$ +\item[\textbf{strategies:}] \emph{Makhoul}, or \emph{Best effort} with $k\in + \{1,2,4\}$ +\item[\textbf{variants:}] with, or without virtual load +\item[\textbf{domain:}] real load, or integer load \end{description} +% +This gives us as many as $4\times 2\times 2 = 16$ different strategies. + +\subsubsection{End of the simulation} + +The simulations were run until the load was nearly balanced among the +participating nodes. More precisely the simulation stops when each node holds +an amount of load at less than 1\% of the load average, during an arbitrary +number of computing iterations (2000 in our case). + +Note that this convergence detection was implemented in a centralized manner. +This is easy to do within the simulator, but it's obviously not realistic. In a +real application we would have chosen a decentralized convergence detection +algorithm, like the one described by Bahi, Contassot-Vivier, Couturier, and +Vernier in \cite{10.1109/TPDS.2005.2}. + +\subsubsection{Platforms} + +In order to show the behavior of the different strategies in different +settings, we simulated the executions on two sorts of platforms. These two +sorts of platforms differ by their underlaid network topology. On the one hand, +we have homogeneous platforms, modeled as a cluster. On the other hand, we have +heterogeneous platforms, modeled as the interconnection of a number of clusters. + +The clusters were modeled by a fixed number of computing nodes interconnected +through a backbone link. Each computing node has a computing power of +1~GFlop/s, and is connected to the backbone by a network link whose bandwidth is +of 125~MB/s, with a latency of 50~$\mu$s. The backbone has a network bandwidth +of 2.25~GB/s, with a latency of 500~$\mu$s. + +The heterogeneous platform descriptions were created by taking a subset of the +Grid'5000 infrastructure\footnote{Grid'5000 is a French large scale experimental + Grid (see \url{https://www.grid5000.fr/}).}, as described in the platform file +\texttt{g5k.xml} distributed with SimGrid. Note that the heterogeneity of the +platform here only comes from the network topology. Indeed, since our +algorithms currently do not handle heterogeneous computing resources, the +processor speeds were normalized, and we arbitrarily chose to fix them to +1~GFlop/s. + +Then we derived each sort of platform with four different number of computing +nodes: 16, 64, 256, and 1024 nodes. + +\subsubsection{Configurations} + +The distributed processes of the application were then logically organized along +three possible topologies: a line, a torus or an hypercube. We ran tests where +the total load was initially on an only node (at one end for the line topology), +and other tests where the load was initially randomly distributed across all the +participating nodes. The total amount of load was fixed to a number of load +units equal to 1000 times the number of node. The average load is then of 1000 +load units. + +For each of the preceding configuration, we finally had to choose the +computation and communication costs of a load unit. We chose them, such as to +have three different computation over communication cost ratios, and hence model +three different kinds of applications: +\begin{itemize} +\item mainly communicating, with a computation/communication cost ratio of $1/10$; +\item mainly computing, with a computation/communication cost ratio of $10/1$ ; +\item balanced, with a computation/communication cost ratio of $1/1$. +\end{itemize} -\paragraph{Metrics} +To summarize the various configurations, we have: +\begin{description} +\item[\textbf{platforms:}] homogeneous (cluster), or heterogeneous (subset of + Grid'5000) +\item[\textbf{platform sizes:}] platforms with 16, 64, 256, or 1024 nodes +\item[\textbf{process topologies:}] line, torus, or hypercube +\item[\textbf{initial load distribution:}] initially on a only node, or + initially randomly distributed over all nodes +\item[\textbf{computation/communication ratio:}] $10/1$, $1/1$, or $1/10$ +\end{description} +% +This gives us as many as $2\times 4\times 3\times 2\times 3 = 144$ different +configurations. +% +Combined with the various load balancing strategies, we had $16\times 144 = +2304$ distinct settings to evaluate. In fact, as it will be shown later, we +didn't run all the strategies, nor all the configurations for the bigger +platforms with 1024 nodes, since to simulations would have run for a too long +time. + +Anyway, all these the experiments represent more than 240 hours of computing +time. + +\subsubsection{Metrics} + +In order to evaluate and compare the different load balancing strategies we had +to define several metrics. Our goal, when choosing these metrics, was to have +something tending to a constant value, i.e. to have a measure which is not +changing anymore once the convergence state is reached. Moreover, we wanted to +have some normalized value, in order to be able to compare them across different +settings. +With these constraints in mind, we defined the following metrics: +% \begin{description} -\item[\textbf{average idle time}] -\item[\textbf{average convergence date}] -\item[\textbf{maximum convergence date}] -\item[\textbf{data transfer amount}] relative to the total data amount +\item[\textbf{average idle time:}] that's the total time spent, when the nodes + don't hold any share of load, and thus have nothing to compute. This total + time is divided by the number of participating nodes, such as to have a number + that can be compared between simulations of different sizes. + + This metric is expected to give an idea of the ability of the strategy to + diffuse the load quickly. A smaller value is better. + +\item[\textbf{average convergence date:}] that's the average of the dates when + all nodes reached the convergence state. The dates are measured as a number + of (simulated) seconds since the beginning of the simulation. + +\item[\textbf{maximum convergence date:}] that's the date when the last node + reached the convergence state. + + These two dates give an idea of the time needed by the strategy to reach the + equilibrium state. A smaller value is better. + +\item[\textbf{data transfer amount:}] that's the sum of the amount of all data + transfers during the simulation. This sum is then normalized by dividing it + by the total amount of data present in the system. + + This metric is expected to give an idea of the efficiency of the strategy in + terms of data movements, i.e. its ability to reach the equilibrium with fewer + transfers. Again, a smaller value is better. + \end{description} -\subsection{Validation of our approaches} + +\subsection{Experimental results} \label{Results} +In this section, the results for the different simulations will be presented, +and we'll try to explain our observations. + +\FIXME{cluster vs. grid : cluster légèrement plus rapide, mais c'est tout -> chose to give results for grid only.} + +\FIXME{explain how to read the graphs} + +\subsubsection{Main results} + +\begin{figure} + \centering + \includegraphics[width=.5\linewidth]{data/graphs/R1-1:10-grid-line}% + \includegraphics[width=.5\linewidth]{data/graphs/R1-10:1-grid-line} + \includegraphics[width=.5\linewidth]{data/graphs/R1-1:10-grid-torus}% + \includegraphics[width=.5\linewidth]{data/graphs/R1-10:1-grid-torus} + \includegraphics[width=.5\linewidth]{data/graphs/R1-1:10-grid-hcube}% + \includegraphics[width=.5\linewidth]{data/graphs/R1-10:1-grid-hcube} + \caption{Real mode, initially on an only mode, comp/comm ratio = 1/10 (left), or 10/1 (right).} + \label{fig.results1} +\end{figure} + +On fig.~\ref{fig.results1}, \dots + +\subsubsection{With the virtual load extension} + +\subsubsection{The $k$ parameter} + +\subsubsection{With an initial random repartition, and larger platforms} + +\begin{figure} + \centering + \includegraphics[width=.5\linewidth]{data/graphs/RN-1:10-grid-line}% + \includegraphics[width=.5\linewidth]{data/graphs/RN-10:1-grid-line} + \includegraphics[width=.5\linewidth]{data/graphs/RN-1:10-grid-torus}% + \includegraphics[width=.5\linewidth]{data/graphs/RN-10:1-grid-torus} + \includegraphics[width=.5\linewidth]{data/graphs/RN-1:10-grid-hcube}% + \includegraphics[width=.5\linewidth]{data/graphs/RN-10:1-grid-hcube} + \caption{Real mode, random initial distribution, comp/comm ratio = 1/10 (left), or 10/1 (right).} + \label{fig.resultsN} +\end{figure} + +\subsubsection{With integer load} + +\FIXME{what about the amount of data?} + +\begin{itshape} +\FIXME{remove that part} +Dans cet ordre: +... +- comparer be/makhoul -> be tient la route + -> en réel uniquement +- valider l'extension virtual load -> c'est 'achement bien +- proposer le -k -> ça peut aider dans certains cas +- conclure avec la version entière -> on n'a pas l'effet d'escalier ! +Q: comment inclure les types/tailles de platesformes ? +Q: comment faire des moyennes ? +Q: comment introduire les distrib 1/N ? +... + +On constate quoi (vérifier avec les chiffres)? +\begin{itemize} +\item cluster ou grid, entier ou réel, ne font pas de grosses différences + +\item bookkeeping? améliore souvent les choses, parfois au prix d'un retard au démarrage + +\item makhoul? se fait battre sur les grosses plateformes -On veut montrer quoi ? : +\item taille de plateforme? -1) best plus rapide que les autres (simple, makhoul) -2) avantage virtual load +\item ratio comp/comm? -Est ce qu'on peut trouver des contre exemple? -Topologies variées +\item option $k$? peut-être intéressant sur des plateformes fortement interconnectées (hypercube) +\item volume de comm? souvent, besteffort/plain en fait plus. pourquoi? -Simulation avec temps définies assez long et on mesure la qualité avec : volume de calcul effectué, volume de données échangées -Mais aussi simulation avec temps court qui montre que seul best converge +\item répartition initiale de la charge ? +\item integer mode sur topo. line n'a jamais fini en plain? vérifier si ce n'est + pas à cause de l'effet d'escalier que bk est capable de gommer. -Expés avec ratio calcul/comm rapide et lent +\end{itemize} + +% On veut montrer quoi ? : + +% 1) best plus rapide que les autres (simple, makhoul) +% 2) avantage virtual load + +% Est ce qu'on peut trouver des contre exemple? +% Topologies variées -Quelques expés avec charge initiale aléatoire plutot que sur le premier proc -Cadre processeurs homogènes +% Simulation avec temps définies assez long et on mesure la qualité avec : volume de calcul effectué, volume de données échangées +% Mais aussi simulation avec temps court qui montre que seul best converge -Topologies statiques +% Expés avec ratio calcul/comm rapide et lent -On ne tient pas compte de la vitesse des liens donc on la considère homogène +% Quelques expés avec charge initiale aléatoire plutot que sur le premier proc -Prendre un réseau hétérogène et rendre processeur homogène +% Cadre processeurs homogènes -Taille : 10 100 très gros +% Topologies statiques + +% On ne tient pas compte de la vitesse des liens donc on la considère homogène + +% Prendre un réseau hétérogène et rendre processeur homogène + +% Taille : 10 100 très gros +\end{itshape} \section{Conclusion and perspectives} +\FIXME{conclude!} + \begin{acknowledgements} Computations have been performed on the supercomputer facilities of the Mésocentre de calcul de Franche-Comté. \end{acknowledgements} +\FIXME{find and add more references} \bibliographystyle{spmpsci} \bibliography{biblio} @@ -552,4 +768,5 @@ Taille : 10 100 très gros % LocalWords: Raphaël Couturier Arnaud Giersch Abderrahmane Sider Franche ij % LocalWords: Bertsekas Tsitsiklis SimGrid DASUD Comté Béjaïa asynchronism ji -% LocalWords: ik isend irecv Cortés et al chan ctrl fifo +% LocalWords: ik isend irecv Cortés et al chan ctrl fifo Makhoul GFlop xml pre +% LocalWords: FEMTO Makhoul's fca bdee cdde Contassot Vivier underlaid