1 \documentclass[smallextended]{svjour3}
2 \usepackage[utf8]{inputenc}
3 \usepackage[T1]{fontenc}
10 \title{Best effort strategy and virtual load
11 for asynchronous iterative load balancing}
13 \author{Raphaël Couturier \and
18 \institute{R. Couturier \and A. Giersch \at
19 LIFC, University of Franche-Comté, Belfort, France \\
20 % Tel.: +123-45-678910\\
21 % Fax: +123-45-678910\\
23 raphael.couturier@univ-fcomte.fr,
24 arnaud.giersch@univ-fcomte.fr}
27 University of Béjaïa, Béjaïa, Algeria \\
28 \email{ar.sider@univ-bejaia.dz}
36 Most of the time, asynchronous load balancing algorithms have extensively been
37 studied in a theoretical point of view. The Bertsekas and Tsitsiklis'
38 algorithm~\cite[section~7.4]{bertsekas+tsitsiklis.1997.parallel}
39 is certainly the most well known algorithm for which the convergence proof is
40 given. From a practical point of view, when a node wants to balance a part of
41 its load to some of its neighbors, the strategy is not described. In this
42 paper, we propose a strategy called \texttt{best effort} which tries to balance
43 the load of a node to all its less loaded neighbors while ensuring that all the
44 nodes concerned by the load balancing phase have the same amount of load.
45 Moreover, asynchronous iterative algorithms in which an asynchronous load
46 balancing algorithm is implemented most of the time can dissociate messages
47 concerning load transfers and message concerning load information. In order to
48 increase the converge of a load balancing algorithm, we propose a simple
49 heuristic called \texttt{virtual load} which allows a node that receives an load
50 information message to integrate the load that it will receive later in its
51 load (virtually) and consequently sends a (real) part of its load to some of its
52 neighbors. In order to validate our approaches, we have defined a simulator
53 based on SimGrid which allowed us to conduct many experiments.
58 \section{Introduction}
60 Load balancing algorithms are extensively used in parallel and distributed
61 applications in order to reduce the execution times. They can be applied in
62 different scientific fields from high performance computation to micro sensor
63 networks. They are iterative by nature. In literature many kinds of load
64 balancing algorithms have been studied. They can be classified according
65 different criteria: centralized or decentralized, in static or dynamic
66 environment, with homogeneous or heterogeneous load, using synchronous or
67 asynchronous iterations, with a static topology or a dynamic one which evolves
68 during time. In this work, we focus on asynchronous load balancing algorithms
69 where computer nodes are considered homogeneous and with homogeneous load with
70 no external load. In this context, Bertsekas and Tsitsiklis have proposed an
71 algorithm which is definitively a reference for many works. In their work, they
72 proved that under classical hypotheses of asynchronous iterative algorithms and
73 a special constraint avoiding \texttt{ping-pong} effect, an asynchronous
74 iterative algorithm converge to the uniform load distribution. This work has
75 been extended by many authors. For example,
76 DASUD~\cite{cortes+ripoll+cedo+al.2002.asynchronous} propose a version working
77 with integer load. {\bf Rajouter des choses ici}.
79 Although the Bertsekas and Tsitsiklis' algorithm describes the condition to
80 ensure the convergence, there is no indication or strategy to really implement
81 the load distribution. In other word, a node can send a part of its load to one
82 or many of its neighbors while all the convergence conditions are
83 followed. Consequently, we propose a new strategy called \texttt{best effort}
84 that tries to balance the load of a node to all its less loaded neighbors while
85 ensuring that all the nodes concerned by the load balancing phase have the same
86 amount of load. Moreover, when real asynchronous applications are considered,
87 using asynchronous load balancing algorithms can reduce the execution
88 times. Most of the times, it is simpler to distinguish load information messages
89 from data migration messages. Formers ones allows a node to inform its
90 neighbors of its current load. These messages are very small, they can be sent
91 quite often. For example, if an computing iteration takes a significant times
92 (ranging from seconds to minutes), it is possible to send a new load information
93 message at each neighbor at each iteration. Latter messages contains data that
94 migrates from one node to another one. Depending on the application, it may have
95 sense or not that nodes try to balance a part of their load at each computing
96 iteration. But the time to transfer a load message from a node to another one is
97 often much nore longer that to time to transfer a load information message. So,
98 when a node receives the information that later it will receive a data message,
99 it can take this information into account and it can consider that its new load
100 is larger. Consequently, it can send a part of it real load to some of its
101 neighbors if required. We call this trick the \texttt{virtual load} mecanism.
105 So, in this work, we propose a new strategy for improving the distribution of
106 the load and a simple but efficient trick that also improves the load
107 balacing. Moreover, we have conducted many simulations with simgrid in order to
108 validate our improvements are really efficient. Our simulations consider that in
109 order to send a message, a latency delays the sending and according to the
110 network performance and the message size, the time of the reception of the
113 In the following of this paper, Section~\ref{BT algo} describes the Bertsekas
114 and Tsitsiklis' asynchronous load balancing algorithm. Moreover, we present a
115 possible problem in the convergence conditions. Section~\ref{Best-effort}
116 presents the best effort strategy which provides an efficient way to reduce the
117 execution times. In Section~\ref{Virtual load}, the virtual load mecanism is
118 proposed. Simulations allowed to show that both our approaches are valid using a
119 quite realistic model detailed in Section~\ref{Simulations}. Finally we give a
120 conclusion and some perspectives to this work.
125 \section{Bertsekas and Tsitsiklis' asynchronous load balancing algorithm}
128 In order prove the convergence of asynchronous iterative load balancing
129 Bertesekas and Tsitsiklis proposed a model
130 in~\cite{bertsekas+tsitsiklis.1997.parallel}. Here we recall some notations.
131 Consider that $N={1,...,n}$ processors are connected through a network.
132 Communication links are represented by a connected undirected graph $G=(N,V)$
133 where $V$ is the set of links connecting differents processors. In this work, we
134 consider that processors are homogeneous for sake of simplicity. It is quite
135 easy to tackle the heterogeneous case~\cite{ElsMonPre02}. Load of processor $i$
136 at time $t$ is represented by $x_i(t)\geq 0$. Let $V(i)$ be the set of
137 neighbors of processor $i$. Each processor $i$ has an estimate of the load of
138 each of its neighbors $j \in V(i)$ represented by $x_j^i(t)$. According to
139 asynchronism and communication delays, this estimate may be outdated. We also
140 consider that the load is described by a continuous variable.
142 When a processor send a part of its load to one or some of its neighbors, the
143 transfer takes time to be completed. Let $s_{ij}(t)$ be the amount of load that
144 processor $i$ has transfered to processor $j$ at time $t$ and let $r_{ij}(t)$ be the
145 amount of load received by processor $j$ from processor $i$ at time $t$. Then
146 the amount of load of processor $i$ at time $t+1$ is given by:
148 x_i(t+1)=x_i(t)-\sum_{j\in V(i)} s_{ij}(t) + \sum_{j\in V(i)} r_{ji}(t)
153 Some conditions are required to ensure the convergence. One of them can be
154 called the \texttt{ping-pong} condition which specifies that:
156 x_i(t)-\sum _{k\in V(i)} s_{ik}(t) \geq x_j^i(t)+s_{ij}(t)
158 for any processor $i$ and any $j \in V(i)$ such that $x_i(t)>x_j^i(t)$. This
159 condition aims at avoiding a processor to send a part of its load and beeing
160 less loaded after that.
162 Nevertheless, we think that this condition may lead to deadlocks in some
163 cases. For example, if we consider only three processors and that processor $1$
164 is linked to processor $2$ which is also linked to processor $3$ (i.e. a simple
165 chain wich 3 processors). Now consider we have the following values at time $t$:
172 In this case, processor $2$ can either sends load to processor $1$ or processor
173 $3$. If it sends load to processor $1$ it will not satisfy condition
174 (\ref{eq:ping-pong}) because after the sending it will be less loaded that
175 $x_3^2(t)$. So we consider that the \texttt{ping-pong} condition is probably to
176 strong. Currently, we did not try to make another convergence proof without this
177 condition or with a weaker condition.
180 \section{Best effort strategy}
183 \textbf{À traduire} Ordonne les voisins du moins chargé au plus chargé.
184 Trouve ensuite, en les prenant dans ce ordre, le nombre maximal de
185 voisins tels que tous ont une charge inférieure à la moyenne des
186 charges des voisins sélectionnés, et de soi-même.
188 Les transferts de charge sont ensuite fait en visant cette moyenne pour
189 tous les voisins sélectionnés. On envoie une quantité de charge égale
190 à (moyenne - charge\_du\_voisin).
192 ~\\\textbf{Question} faut-il décrire les stratégies makhoul et simple ?
194 \paragraph{simple} Tentative de respecter simplement les conditions de Bertsekas.
195 Parmi les voisins moins chargés que soi, on sélectionne :
197 \item un des moins chargés (vmin) ;
198 \item un des plus chargés (vmax),
200 puis on équilibre avec vmin en s'assurant que notre charge reste
201 toujours supérieure à celle de vmin et à celle de vmax.
203 On envoie donc (avec "self" pour soi-même) :
205 \min\left(\frac{load(self) - load(vmin)}{2}, load(self) - load(vmax)\right)
208 \paragraph{makhoul} Ordonne les voisins du moins chargé au plus chargé
209 puis calcule les différences de charge entre soi-même et chacun des
212 Ensuite, pour chaque voisin, dans l'ordre, et tant qu'on reste plus
213 chargé que le voisin en question, on lui envoie 1/(N+1) de la
214 différence calculée au départ, avec N le nombre de voisins.
216 C'est l'algorithme~2 dans~\cite{bahi+giersch+makhoul.2008.scalable}.
218 \section{Virtual load}
221 \section{Simulations}
224 In order to test and validate our approaches, we wrote a simulator
226 framework~\cite{casanova+legrand+quinson.2008.simgrid}. The process
227 model is detailed in the next section (\ref{Sim model}), then the
228 results of the simulations are presented in section~\ref{Results}.
230 \subsection{Simulation model}
237 There are two receiving channels per host: control for information
238 messages, and data for load transfers.
243 Each process is made of 3 threads: a receiver thread, a computing
244 thread, and a load-balancer thread.
250 | wait for a message to come, either on data channel, or on ctrl channel
251 | push received message in a buffer of received messages
252 | -> ctrl messages on the one side
253 | -> data messages on the other side
256 The loop terminates when a "finalize" message is received on each
263 | if we received some real load, get it (data messages)
264 | if there is some real load to send, send it
265 | if we own some load, simulate some computing on it
266 | sleep a bit if we are looping too fast
268 send CLOSE on data for all neighbors
269 wait for CLOSE on data from all neighbors
271 The loop terminates when process::still_running() returns false.
272 (read the source for full details...)
274 * Load-balancing thread
275 ---------------------
278 | call load-balancing algorithm
280 | sleep (min_lb_iter_duration)
281 | receive ctrl messages
283 send CLOSE on ctrl for all neighbors
284 wait for CLOSE on ctrl from all neighbors
286 The loop terminates when process::still_running() returns false.
287 (read the source for full details...)
290 \subsection{Validation of our approaches}
294 On veut montrer quoi ? :
296 1) best plus rapide que les autres (simple, makhoul)
297 2) avantage virtual load
299 Est ce qu'on peut trouver des contre exemple?
303 Simulation avec temps définies assez long et on mesure la qualité avec : volume de calcul effectué, volume de données échangées
304 Mais aussi simulation avec temps court qui montre que seul best converge
307 Expés avec ratio calcul/comm rapide et lent
309 Quelques expés avec charge initiale aléatoire plutot que sur le premier proc
311 Cadre processeurs homogènes
315 On ne tient pas compte de la vitesse des liens donc on la considère homogène
317 Prendre un réseau hétérogène et rendre processeur homogène
319 Taille : 10 100 très gros
321 \section{Conclusion and perspectives}
324 \bibliographystyle{spmpsci}
325 \bibliography{biblio}
332 %%% ispell-local-dictionary: "american"
335 % LocalWords: Raphaël Couturier Arnaud Giersch Abderrahmane Sider
336 % LocalWords: Bertsekas Tsitsiklis SimGrid DASUD