2 \documentclass[smallextended]{svjour3}
3 \usepackage[utf8]{inputenc}
4 \usepackage[T1]{fontenc}
11 \title{Best effort strategy and virtual load for asynchronous iterative load balancing}
13 \author{Raphaël Couturier \and
18 \institute{F. Author \at
20 Tel.: +123-45-678910\\
22 \email{fauthor@example.com} % \\
23 % \emph{Present address:} of F. Author % if needed
Most of the time, asynchronous load balancing algorithms have been studied
from a theoretical point of view. The Bertsekas and Tsitsiklis
algorithm~\cite[section~7.4]{bertsekas+tsitsiklis.1997.parallel}
is certainly the best known algorithm for which a convergence proof is
given. From a practical point of view, however, when a node wants to balance a
part of its load with some of its neighbors, the strategy to use is not
described. In this paper, we propose a strategy called \texttt{best effort},
which tries to balance the load of a node with all its less loaded neighbors
while ensuring that all the nodes involved in the load balancing phase end up
with the same amount of load. Moreover, asynchronous iterative algorithms, in
which an asynchronous load balancing algorithm is implemented, can most of the
time dissociate messages concerning load transfers from messages concerning
load information. In order to speed up the convergence of a load balancing
algorithm, we propose a simple heuristic called \texttt{virtual load} which
allows a node that receives a load information message to (virtually)
integrate the load it will receive later into its own load, and consequently
to send a (real) part of its load to some of its neighbors. In order to
validate our approaches, we have written a simulator based on SimGrid which
allowed us to conduct many experiments.
56 \section{Introduction}
Load balancing algorithms are extensively used in parallel and distributed
applications in order to reduce their execution times. They can be applied in
different scientific fields, from high performance computing to micro sensor
networks. They are iterative by nature. In the literature, many kinds of load
balancing algorithms have been studied. They can be classified according to
different criteria: centralized or decentralized, in a static or dynamic
environment, with homogeneous or heterogeneous load, using synchronous or
asynchronous iterations, with a static topology or a dynamic one which evolves
over time. In this work, we focus on asynchronous load balancing algorithms
where computing nodes are considered homogeneous, with a homogeneous load and
no external load. In this context, Bertsekas and Tsitsiklis have proposed an
algorithm which is definitively a reference for many works. In their work,
they proved that, under the classical hypotheses of asynchronous iterative
algorithms and a special constraint avoiding the \texttt{ping-pong} effect, an
asynchronous iterative algorithm converges to the uniform load
distribution. This work has been extended by many authors. For example, the
DASUD algorithm~\cite{cortes+ripoll+cedo+al.2002.asynchronous} proposes a
version working with integer loads.
% TODO: add more related work here.
Although the Bertsekas and Tsitsiklis algorithm describes the conditions that
ensure the convergence, there is no indication or strategy to really implement
the load distribution. In other words, a node can send a part of its load to
one or many of its neighbors while all the convergence conditions are still
satisfied. Consequently, we propose a new strategy called \texttt{best effort}
that tries to balance the load of a node with all its less loaded neighbors
while ensuring that all the nodes involved in the load balancing phase have
the same amount of load. Moreover, when real asynchronous applications are
considered, using asynchronous load balancing algorithms can reduce the
execution times. Most of the time, it is simpler to distinguish load
information messages from data migration messages. The former allow a node to
inform its neighbors about its current load. These messages are very small, so
they can be sent quite often. For example, if a computing iteration takes a
significant time (ranging from seconds to minutes), it is possible to send a
new load information message to each neighbor at each iteration. The latter
messages contain the data that migrate from one node to another one. Depending
on the application, it may or may not make sense for nodes to try to balance a
part of their load at each computing iteration. But the time to transfer a
load message from one node to another is often much longer than the time to
transfer a load information message. So, when a node receives the information
that it will later receive a data message, it can take this information into
account and consider that its new load will be larger. Consequently, it can
send a part of its real load to some of its neighbors if required. We call
this trick the \texttt{virtual load} mechanism.
So, in this work, we propose a new strategy to improve the distribution of the
load, and a simple but efficient trick that also improves the load
balancing. Moreover, we have conducted many simulations with SimGrid in order
to validate that our improvements are really efficient. Our simulations
consider that, in order to send a message, a latency delays the sending, and
that the time of reception of the message depends on the network performance
and on the message size.
In the remainder of this paper, Section~\ref{BT algo} describes the Bertsekas
and Tsitsiklis asynchronous load balancing algorithm. Moreover, we present a
possible problem in the convergence conditions. Section~\ref{Best-effort}
presents the best effort strategy, which provides an efficient way to reduce
the execution times. In Section~\ref{Virtual load}, the virtual load mechanism
is proposed. Simulations, based on a quite realistic model detailed in
Section~\ref{Simulations}, allowed us to show that both our approaches are
valid. Finally, we give a conclusion and some perspectives for this work.
123 \section{Bertsekas and Tsitsiklis' asynchronous load balancing algorithm}
In order to prove the convergence of asynchronous iterative load balancing
algorithms, Bertsekas and Tsitsiklis proposed a model
in~\cite{bertsekas+tsitsiklis.1997.parallel}. Here we recall some notations.
Consider that $N=\{1,...,n\}$ processors are connected through a network.
Communication links are represented by a connected undirected graph $G=(N,V)$
where $V$ is the set of links connecting different processors. In this work,
we consider that processors are homogeneous for the sake of simplicity. It is
quite easy to tackle the heterogeneous case~\cite{ElsMonPre02}. The load of
processor $i$ at time $t$ is represented by $x_i(t)\geq 0$. Let $V(i)$ be the
set of neighbors of processor $i$. Each processor $i$ has an estimate of the
load of each of its neighbors $j \in V(i)$, represented by $x_j^i(t)$. Due to
asynchronism and communication delays, this estimate may be outdated. We also
consider that the load is described by a continuous variable.
When a processor sends a part of its load to one or some of its neighbors, the
transfer takes time to be completed. Let $s_{ij}(t)$ be the amount of load
that processor $i$ has transferred to processor $j$ at time $t$ and let
$r_{ij}(t)$ be the amount of load received by processor $j$ from processor $i$
at time $t$. Then the amount of load of processor $i$ at time $t+1$ is given
by:
\begin{equation}
  x_i(t+1) = x_i(t) - \sum_{j\in V(i)} s_{ij}(t) + \sum_{j\in V(i)} r_{ji}(t)
\end{equation}
Some conditions are required to ensure the convergence. One of them can be
called the \texttt{ping-pong} condition, which specifies that:
\begin{equation}
  \label{eq:ping-pong}
  x_i(t) - \sum_{k\in V(i)} s_{ik}(t) \geq x_j^i(t) + s_{ij}(t)
\end{equation}
for any processor $i$ and any $j \in V(i)$ such that $x_i(t)>x_j^i(t)$. This
condition aims at preventing a processor from sending a part of its load and
being less loaded afterwards.
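
To illustrate the condition, here is a minimal sketch, in Python and with
illustrative names (it is not part of the original algorithm description), of
how a processor could check that a candidate set of transfers satisfies
condition~(\ref{eq:ping-pong}) for all its less loaded neighbors:

\begin{verbatim}
# Minimal sketch (illustrative names): check the ping-pong condition
# for a candidate set of transfers.
def satisfies_ping_pong(x_i, estimates, transfers):
    # x_i: own load x_i(t); estimates: {j: x_j^i(t)};
    # transfers: candidate amounts {j: s_ij(t)}
    remaining = x_i - sum(transfers.values())
    for j, x_j in estimates.items():
        if x_i > x_j:  # the condition applies to less loaded neighbors
            if remaining < x_j + transfers.get(j, 0.0):
                return False
    return True
\end{verbatim}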
Nevertheless, we think that this condition may lead to deadlocks in some
cases. For example, consider only three processors, where processor $1$ is
linked to processor $2$, which is also linked to processor $3$ (i.e., a simple
chain of 3 processors). Now consider that we have the following values at time
$t$ (illustrative values; the estimate $x_3^2(t)$ is assumed to be up to
date):
\begin{equation*}
  x_1(t)=10, \qquad x_2(t)=100, \qquad x_3(t)=x_3^2(t)=99.99.
\end{equation*}
In this case, processor $2$ can send load either to processor $1$ or to
processor $3$. If it sends load to processor $1$, it will not satisfy
condition~(\ref{eq:ping-pong}) because, after the sending, it will be less
loaded than $x_3^2(t)$: as soon as it sends more than $0.01$ of load to
processor $1$, its load becomes lower than $x_3^2(t)$ and the condition is
violated. So we consider that the \texttt{ping-pong} condition is probably too
strong. Currently, we have not tried to devise another convergence proof
without this condition or with a weaker one.
178 \section{Best effort strategy}
The neighbors are first sorted from the least loaded to the most loaded.
Taking them in this order, we then find the maximal number of neighbors such
that they all have a load lower than the average of the loads of the selected
neighbors and of the node itself.

The load transfers are then made so as to reach this average on all the
selected neighbors: the node sends to each selected neighbor an amount of load
equal to (average $-$ load of the neighbor).
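
As a minimal sketch, assuming the load estimates of the neighbors are
available in a dictionary (all names below are illustrative, not the actual
simulator code), this selection could be written in Python as:

\begin{verbatim}
# Minimal sketch (illustrative names) of the best-effort selection.
def best_effort(x_self, neighbor_loads):
    # neighbor_loads: {neighbor: estimated load}
    ordered = sorted(neighbor_loads.items(), key=lambda kv: kv[1])
    selected, mean = [], x_self
    for k in range(1, len(ordered) + 1):
        group = ordered[:k]
        avg = (x_self + sum(l for _, l in group)) / (k + 1)
        if all(l < avg for _, l in group):
            selected, mean = group, avg  # keep the largest valid prefix
    # send (average - load of the neighbor) to each selected neighbor
    return {j: mean - l for j, l in selected}
\end{verbatim}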
For comparison, we also describe two other strategies, called \texttt{simple}
and \texttt{makhoul}.
\paragraph{simple} This strategy simply tries to satisfy Bertsekas and
Tsitsiklis' conditions. Among the neighbors that are less loaded than the node
itself, we select:
\begin{itemize}
\item one of the least loaded ones (vmin);
\item one of the most loaded ones (vmax),
\end{itemize}
then we balance with vmin while ensuring that our load always remains greater
than the load of vmin and than the load of vmax.

We thus send (where ``self'' denotes the node itself):
\begin{equation*}
\min\left(\frac{load(self) - load(vmin)}{2},\ load(self) - load(vmax)\right)
\end{equation*}
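
As a minimal sketch (Python, illustrative names), the amount sent by this
strategy could be computed as follows:

\begin{verbatim}
# Minimal sketch (illustrative names) of the "simple" strategy.
def simple(x_self, neighbor_loads):
    less = {j: l for j, l in neighbor_loads.items() if l < x_self}
    if not less:
        return {}
    vmin = min(less, key=less.get)  # one of the least loaded neighbors
    vmax = max(less, key=less.get)  # one of the most loaded ones
    amount = min((x_self - less[vmin]) / 2, x_self - less[vmax])
    return {vmin: amount} if amount > 0 else {}
\end{verbatim}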
\paragraph{makhoul} This strategy sorts the neighbors from the least loaded to
the most loaded, then computes the load differences between the node itself
and each of its neighbors.

Then, for each neighbor taken in this order, and as long as the node remains
more loaded than the considered neighbor, it sends $1/(N+1)$ of the difference
computed at the beginning, where $N$ is the number of neighbors.
This is Algorithm~2 in~\cite{bahi+giersch+makhoul.2008.scalable}.
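
A minimal sketch of this strategy in Python (illustrative names; see the cited
paper for the authoritative description, in particular for the exact stopping
condition) could be:

\begin{verbatim}
# Minimal sketch (illustrative names) of the "makhoul" strategy.
def makhoul(x_self, neighbor_loads):
    ordered = sorted(neighbor_loads.items(), key=lambda kv: kv[1])
    n = len(ordered)                      # N: the number of neighbors
    diffs = {j: x_self - l for j, l in ordered}
    transfers, remaining = {}, x_self
    for j, l in ordered:
        amount = diffs[j] / (n + 1)
        # only send while we remain more loaded than this neighbor
        if amount > 0 and remaining - amount > l:
            transfers[j] = amount
            remaining -= amount
    return transfers
\end{verbatim}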
216 \section{Virtual load}
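
As explained in the introduction, the \texttt{virtual load} mechanism allows a
node that has been informed of an incoming load transfer to integrate,
virtually, the load it will receive later into its own load, and consequently
to send a (real) part of its load to some of its neighbors without waiting for
the data message to arrive. A minimal sketch of the corresponding bookkeeping
(Python, illustrative names) could be:

\begin{verbatim}
# Minimal sketch (illustrative names) of the virtual load bookkeeping.
class Node:
    def __init__(self, real_load):
        self.real_load = real_load
        self.virtual_load = 0.0   # announced but not yet received

    def on_transfer_announced(self, amount):  # ctrl message
        self.virtual_load += amount

    def on_transfer_received(self, amount):   # data message
        self.virtual_load -= amount
        self.real_load += amount

    def load_for_balancing(self):
        # balancing decisions use real + virtual load...
        return self.real_load + self.virtual_load

    def sendable_load(self):
        # ...but only the real load can actually be sent
        return self.real_load
\end{verbatim}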
219 \section{Simulations}
In order to test and validate our approaches, we wrote a simulator using the
SimGrid framework~\cite{casanova+legrand+quinson.2008.simgrid}. The process
model is detailed in the next section (\ref{Sim model}), then the results of
the simulations are presented in Section~\ref{Results}.
228 \subsection{Simulation model}
235 There are two receiving channels per host: control for information
236 messages, and data for load transfers.
241 Each process is made of 3 threads: a receiver thread, a computing
242 thread, and a load-balancer thread.
* Receiver thread
-----------------

Loop
| wait for a message to come, either on the data channel or on the ctrl channel
| push the received message into a buffer of received messages
|   -> ctrl messages on the one side
|   -> data messages on the other side
The loop terminates when a "finalize" message has been received on each
channel.
* Computing thread
------------------

Loop
| if we received some real load, get it (data messages)
| if there is some real load to send, send it
| if we own some load, simulate some computing on it
| sleep a bit if we are looping too fast
266 send CLOSE on data for all neighbors
267 wait for CLOSE on data from all neighbors
269 The loop terminates when process::still_running() returns false.
270 (read the source for full details...)
272 * Load-balancing thread
273 ---------------------
Loop
| call the load-balancing algorithm
| sleep (min_lb_iter_duration)
| receive ctrl messages
281 send CLOSE on ctrl for all neighbors
282 wait for CLOSE on ctrl from all neighbors
284 The loop terminates when process::still_running() returns false.
285 (read the source for full details...)
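
As an illustration of this structure, here is a minimal sketch in Python
(illustrative names; the actual simulator is built on SimGrid processes, and
the two channels are modeled here as a single tagged queue) of how the
receiver thread could dispatch incoming messages to the two buffers read by
the other threads:

\begin{verbatim}
# Minimal sketch (illustrative names): the receiver thread pushes each
# incoming message into the buffer matching its kind.
import queue

def receiver(incoming, ctrl_msgs, data_msgs):
    # incoming, ctrl_msgs, data_msgs: queue.Queue instances
    finalized = {"ctrl": False, "data": False}
    while not all(finalized.values()):
        kind, payload = incoming.get()  # blocking receive
        if payload == "finalize":
            finalized[kind] = True
        elif kind == "ctrl":
            ctrl_msgs.put(payload)
        else:
            data_msgs.put(payload)
\end{verbatim}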
288 \subsection{Validation of our approaches}
What do we want to show?

1) best is faster than the other strategies (simple, makhoul)
2) the advantage of virtual load

Can we find counter-examples?

Simulations with rather long times, where we measure the quality with: the
volume of computation performed, and the volume of data exchanged.
But also simulations with short times, showing that only best converges.

Experiments with fast and with slow computation/communication ratios.

A few experiments with a random initial load, rather than with all the load on
the first processor.

Homogeneous processors setting.

We do not take the speed of the links into account, so we consider it
homogeneous.

Take a heterogeneous network and make the processors homogeneous.

Sizes: 10, 100, very large.
319 \section{Conclusion and perspectives}
322 \bibliographystyle{spmpsci}
323 \bibliography{biblio}