supercomp11/supercomp11.tex

   1 \documentclass[smallextended]{svjour3}
   2 \usepackage[utf8]{inputenc}
   3 \usepackage[T1]{fontenc}
   4 \usepackage{mathptmx}
   5 \usepackage{courier}
   6 \usepackage{graphicx}
   7
   8 \begin{document}
   9
  10 \title{Best effort strategy and virtual load
  11   for asynchronous iterative load balancing}
  12
  13 \author{Raphaël Couturier \and
  14         Arnaud Giersch \and
  15         Abderrahmane Sider
  16 }
  17
  18 \institute{R. Couturier \and A. Giersch \at
  19               LIFC, University of Franche-Comté, Belfort, France \\
  20               % Tel.: +123-45-678910\\
  21               % Fax: +123-45-678910\\
  22               \email{%
  23                 raphael.couturier@univ-fcomte.fr,
  24                 arnaud.giersch@univ-fcomte.fr}
  25            \and
  26            A. Sider \at
  27               University of Béjaïa, Béjaïa, Algeria \\
  28               \email{ar.sider@univ-bejaia.dz}
  29 }
  30
  31 \maketitle
  32
  33
  34 \begin{abstract}
  35
  36 Most of the  time, asynchronous load balancing algorithms  have extensively been
  37 studied in a theoretical point  of view. The Bertsekas and Tsitsiklis'
  38 algorithm~\cite[section~7.4]{bertsekas+tsitsiklis.1997.parallel}
  39 is certainly  the most well known  algorithm for which the  convergence proof is
  40 given. From a  practical point of view, when  a node wants to balance  a part of
  41 its  load to some  of its  neighbors, the  strategy is  not described.   In this
  42 paper, we propose a strategy  called \texttt{best effort} which tries to balance
  43 the load of a node to all  its less loaded neighbors while ensuring that all the
  44 nodes  concerned by  the load  balancing  phase have  the same  amount of  load.
  45 Moreover,  asynchronous  iterative  algorithms  in which  an  asynchronous  load
  46 balancing  algorithm is  implemented most  of the  time can  dissociate messages
  47 concerning load transfers and message  concerning load information.  In order to
  48 increase  the  converge of  a  load balancing  algorithm,  we  propose a  simple
  49 heuristic called \texttt{virtual load} which allows a node that receives an load
  50 information message  to integrate the  load that it  will receive later  in its
  51 load (virtually) and consequently sends a (real) part of its load to some of its
  52 neighbors.  In order to  validate our  approaches, we  have defined  a simulator
  53 based on SimGrid which allowed us to conduct many experiments.
  54
  55
  56 \end{abstract}
  57
  58 \section{Introduction}
  59
  60 Load  balancing algorithms  are  extensively used  in  parallel and  distributed
  61 applications in  order to  reduce the  execution times. They  can be  applied in
  62 different scientific  fields from high  performance computation to  micro sensor
  63 networks.   They are  iterative by  nature.  In  literature many  kinds  of load
  64 balancing  algorithms  have been  studied.   They  can  be classified  according
  65 different  criteria:   centralized  or  decentralized,  in   static  or  dynamic
  66 environment,  with  homogeneous  or  heterogeneous load,  using  synchronous  or
  67 asynchronous iterations, with  a static topology or a  dynamic one which evolves
  68 during time.  In  this work, we focus on  asynchronous load balancing algorithms
  69 where computer nodes  are considered homogeneous and with  homogeneous load with
  70 no external  load. In  this context, Bertsekas  and Tsitsiklis have  proposed an
  71 algorithm which is definitively a reference  for many works. In their work, they
  72 proved that under classical  hypotheses of asynchronous iterative algorithms and
  73 a  special  constraint   avoiding  \texttt{ping-pong}  effect,  an  asynchronous
  74 iterative algorithm  converge to  the uniform load  distribution. This  work has
  75 been extended by many authors. For example,
  76 DASUD~\cite{cortes+ripoll+cedo+al.2002.asynchronous} propose a version working
  77 with integer load. {\bf Rajouter des choses ici}.
  78
  79 Although  the Bertsekas  and Tsitsiklis'  algorithm describes  the  condition to
  80 ensure the convergence,  there is no indication or  strategy to really implement
  81 the load distribution. In other word, a node  can send a part of its load to one
  82 or   many  of   its  neighbors   while  all   the  convergence   conditions  are
  83 followed. Consequently,  we propose a  new strategy called  \texttt{best effort}
  84 that tries to balance the load of  a node to all its less loaded neighbors while
  85 ensuring that all the nodes concerned  by the load balancing phase have the same
  86 amount of  load.  Moreover, when real asynchronous  applications are considered,
  87 using  asynchronous   load  balancing   algorithms  can  reduce   the  execution
  88 times. Most of the times, it is simpler to distinguish load information messages
  89 from  data  migration  messages.  Formers  ones  allows  a  node to  inform  its
  90 neighbors of its  current load. These messages are very small,  they can be sent
  91 quite often.  For example, if an  computing iteration takes  a significant times
  92 (ranging from seconds to minutes), it is possible to send a new load information
  93 message at each  neighbor at each iteration. Latter  messages contains data that
  94 migrates from one node to another one. Depending on the application, it may have
  95 sense or not  that nodes try to balance  a part of their load  at each computing
  96 iteration. But the time to transfer a load message from a node to another one is
  97 often much nore longer that to  time to transfer a load information message. So,
  98 when a node receives the information  that later it will receive a data message,
  99 it can take this information into account  and it can consider that its new load
 100 is larger.   Consequently, it can  send a part  of it real  load to some  of its
 101 neighbors if required. We call this trick the \texttt{virtual load} mecanism.
 102
 103
 104
 105 So, in  this work, we propose a  new strategy for improving  the distribution of
 106 the  load  and  a  simple  but  efficient trick  that  also  improves  the  load
 107 balacing. Moreover, we have conducted  many simulations with simgrid in order to
 108 validate our improvements are really efficient. Our simulations consider that in
 109 order  to send a  message, a  latency delays  the sending  and according  to the
 110 network  performance and  the message  size, the  time of  the reception  of the
 111 message also varies.
 112
 113 In the  following of this  paper, Section~\ref{BT algo} describes  the Bertsekas
 114 and Tsitsiklis'  asynchronous load balancing  algorithm. Moreover, we  present a
 115 possible  problem  in  the  convergence  conditions.   Section~\ref{Best-effort}
 116 presents the best effort strategy which  provides an efficient way to reduce the
 117 execution  times. In Section~\ref{Virtual  load}, the  virtual load  mecanism is
 118 proposed. Simulations allowed to show that both our approaches are valid using a
 119 quite realistic  model detailed in  Section~\ref{Simulations}. Finally we  give a
 120 conclusion and some perspectives to this work.
 121
 122
 123
 124
 125 \section{Bertsekas  and Tsitsiklis' asynchronous load balancing algorithm}
 126 \label{BT algo}
 127
 128 In  order  prove  the  convergence  of  asynchronous  iterative  load  balancing
 129 Bertesekas         and        Tsitsiklis         proposed         a        model
 130 in~\cite{bertsekas+tsitsiklis.1997.parallel}.   Here we  recall  some notations.
 131 Consider  that  $N={1,...,n}$  processors   are  connected  through  a  network.
 132 Communication links  are represented by  a connected undirected  graph $G=(N,V)$
 133 where $V$ is the set of links connecting differents processors. In this work, we
 134 consider that  processors are  homogeneous for sake  of simplicity. It  is quite
 135 easy to tackle the  heterogeneous case~\cite{ElsMonPre02}. Load of processor $i$
 136 at  time $t$  is  represented  by $x_i(t)\geq  0$.   Let $V(i)$  be  the set  of
 137 neighbors of processor  $i$.  Each processor $i$ has an estimate  of the load of
 138 each  of its  neighbors $j  \in V(i)$  represented by  $x_j^i(t)$.  According to
 139 asynchronism and communication  delays, this estimate may be  outdated.  We also
 140 consider that the load is described by a continuous variable.
 141
 142 When a processor  send a part of its  load to one or some of  its neighbors, the
 143 transfer takes time to be completed.  Let $s_{ij}(t)$ be the amount of load that
 144 processor $i$ has transfered to processor $j$ at time $t$ and let $r_{ij}(t)$ be the
 145 amount of  load received by processor $j$  from processor $i$ at  time $t$. Then
 146 the amount of load of processor $i$ at time $t+1$ is given by:
 147 \begin{equation}
 148 x_i(t+1)=x_i(t)-\sum_{j\in V(i)} s_{ij}(t) + \sum_{j\in V(i)} r_{ji}(t)
 149 \label{eq:ping-pong}
 150 \end{equation}
 151
 152
 153 Some  conditions are  required to  ensure the  convergence. One  of them  can be
 154 called the \texttt{ping-pong} condition which specifies that:
 155 \begin{equation}
 156 x_i(t)-\sum _{k\in V(i)} s_{ik}(t) \geq x_j^i(t)+s_{ij}(t)
 157 \end{equation}
 158 for any  processor $i$ and any  $j \in V(i)$ such  that $x_i(t)>x_j^i(t)$.  This
 159 condition aims  at avoiding a processor  to send a  part of its load  and beeing
 160 less loaded after that.
 161
 162 Nevertheless,  we  think that  this  condition may  lead  to  deadlocks in  some
 163 cases. For example, if we consider  only three processors and that processor $1$
 164 is linked to processor $2$ which is  also linked to processor $3$ (i.e. a simple
 165 chain wich 3 processors). Now consider we have the following values at time $t$:
 166 \begin{eqnarray*}
 167 x_1(t)=10   \\
 168 x_2(t)=100   \\
 169 x_3(t)=99.99\\
 170  x_3^2(t)=99.99\\
 171 \end{eqnarray*}
 172 In this case, processor $2$ can  either sends load to processor $1$ or processor
 173 $3$.   If  it  sends  load  to  processor $1$  it  will  not  satisfy  condition
 174 (\ref{eq:ping-pong})  because  after the  sending  it  will  be less  loaded  that
 175 $x_3^2(t)$.  So we consider that the \texttt{ping-pong} condition is probably to
 176 strong. Currently, we did not try to make another convergence proof without this
 177 condition or with a weaker condition.
 178
 179
 180 \section{Best effort strategy}
 181 \label{Best-effort}
 182
 183 \textbf{À traduire} Ordonne les voisins du moins chargé au plus chargé.
 184 Trouve ensuite, en les prenant dans ce ordre, le nombre maximal de
 185 voisins tels que tous ont une charge inférieure à la moyenne des
 186 charges des voisins sélectionnés, et de soi-même.
 187
 188 Les transferts de charge sont ensuite fait en visant cette moyenne pour
 189 tous les voisins sélectionnés.  On envoie une quantité de charge égale
 190 à (moyenne - charge\_du\_voisin).
 191
 192 ~\\\textbf{Question} faut-il décrire les stratégies makhoul et simple ?
 193
 194 \paragraph{simple} Tentative de respecter simplement les conditions de Bertsekas.
 195 Parmi les voisins moins chargés que soi, on sélectionne :
 196 \begin{itemize}
 197 \item un des moins chargés (vmin) ;
 198 \item un des plus chargés (vmax),
 199 \end{itemize}
 200 puis on équilibre avec vmin en s'assurant que notre charge reste
 201 toujours supérieure à celle de vmin et à celle de vmax.
 202
 203 On envoie donc (avec "self" pour soi-même) :
 204 \[
 205     \min\left(\frac{load(self) - load(vmin)}{2}, load(self) - load(vmax)\right)
 206 \]
 207
 208 \paragraph{makhoul} Ordonne les voisins du moins chargé au plus chargé
 209 puis calcule les différences de charge entre soi-même et chacun des
 210 voisins.
 211
 212 Ensuite, pour chaque voisin, dans l'ordre, et tant qu'on reste plus
 213 chargé que le voisin en question, on lui envoie 1/(N+1) de la
 214 différence calculée au départ, avec N le nombre de voisins.
 215
 216 C'est l'algorithme~2 dans~\cite{bahi+giersch+makhoul.2008.scalable}.
 217
 218 \section{Virtual load}
 219 \label{Virtual load}
 220
 221 \section{Simulations}
 222 \label{Simulations}
 223
 224 In order to test and validate our approaches, we wrote a simulator
 225 using the SimGrid
 226 framework~\cite{casanova+legrand+quinson.2008.simgrid}.  The process
 227 model is detailed in the next section (\ref{Sim model}), then the
 228 results of the simulations are presented in section~\ref{Results}.
 229
 230 \subsection{Simulation model}
 231 \label{Sim model}
 232
 233 \begin{verbatim}
 234 Communications
 235 ==============
 236
 237 There are two receiving channels per host: control for information
 238 messages, and data for load transfers.
 239
 240 Process model
 241 =============
 242
 243 Each process is made of 3 threads: a receiver thread, a computing
 244 thread, and a load-balancer thread.
 245
 246 * Receiver thread
 247   ---------------
 248
 249     Loop
 250     | wait for a message to come, either on data channel, or on ctrl channel
 251     | push received message in a buffer of received messages
 252     | -> ctrl messages on the one side
 253     | -> data messages on the other side
 254     +-
 255
 256    The loop terminates when a "finalize" message is received on each
 257    channel.
 258
 259 * Computing thread
 260   ----------------
 261
 262     Loop
 263     | if we received some real load, get it (data messages)
 264     | if there is some real load to send, send it
 265     | if we own some load, simulate some computing on it
 266     | sleep a bit if we are looping too fast
 267     +-
 268     send CLOSE on data for all neighbors
 269     wait for CLOSE on data from all neighbors
 270
 271   The loop terminates when process::still_running() returns false.
 272   (read the source for full details...)
 273
 274 * Load-balancing thread
 275   ---------------------
 276
 277     Loop
 278     | call load-balancing algorithm
 279     | send ctrl messages
 280     | sleep (min_lb_iter_duration)
 281     | receive ctrl messages
 282     +-
 283     send CLOSE on ctrl for all neighbors
 284     wait for CLOSE on ctrl from all neighbors
 285
 286   The loop terminates when process::still_running() returns false.
 287   (read the source for full details...)
 288 \end{verbatim}
 289
 290 \subsection{Validation of our approaches}
 291 \label{Results}
 292
 293
 294 On veut montrer quoi ? :
 295
 296 1) best plus rapide que les autres (simple, makhoul)
 297 2) avantage virtual load
 298
 299 Est ce qu'on peut trouver des contre exemple?
 300 Topologies variées
 301
 302
 303 Simulation avec temps définies assez long et on mesure la qualité avec : volume de calcul effectué, volume de données échangées
 304 Mais aussi simulation avec temps court qui montre que seul best converge
 305
 306
 307 Expés avec ratio calcul/comm rapide et lent
 308
 309 Quelques expés avec charge initiale aléatoire plutot que sur le premier proc
 310
 311 Cadre processeurs homogènes
 312
 313 Topologies statiques
 314
 315 On ne tient pas compte de la vitesse des liens donc on la considère homogène
 316
 317 Prendre un réseau hétérogène et rendre processeur homogène
 318
 319 Taille : 10 100 très gros
 320
 321 \section{Conclusion and perspectives}
 322
 323
 324 \bibliographystyle{spmpsci}
 325 \bibliography{biblio}
 326
 327 \end{document}
 328
 329 %%% Local Variables:
 330 %%% mode: latex
 331 %%% TeX-master: t
 332 %%% ispell-local-dictionary: "american"
 333 %%% End:
 334
 335 % LocalWords:  Raphaël Couturier Arnaud Giersch Abderrahmane Sider
 336 % LocalWords:  Bertsekas Tsitsiklis SimGrid DASUD