X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/loba-papers.git/blobdiff_plain/b680616f7e0702003982aa20fd7208c18eba98f1..ed88631cce8e6737ba146dd0a98914baa97ee29c:/supercomp11/supercomp11.tex

diff --git a/supercomp11/supercomp11.tex b/supercomp11/supercomp11.tex
index f975555..a566b8a 100644
--- a/supercomp11/supercomp11.tex
+++ b/supercomp11/supercomp11.tex
@@ -1,3 +1,4 @@
+
 \documentclass[smallextended]{svjour3}
 \usepackage[utf8]{inputenc}
 \usepackage[T1]{fontenc}
@@ -52,7 +53,7 @@ based on SimGrid which allowed us to conduct many experiments.
 
 \end{abstract}
 
-
+\section{Introduction}
 
 Load  balancing algorithms  are  extensively used  in  parallel and  distributed
 applications in  order to  reduce the  execution times. They  can be  applied in
@@ -71,7 +72,251 @@ a  special  constraint   avoiding  \texttt{ping-pong}  effect,  an  asynchronous
 iterative algorithm  converge to  the uniform load  distribution. This  work has
 been extended by many authors. For example,
 DASUD~\cite{cortes+ripoll+cedo+al.2002.asynchronous} propose a version working
-with integer load.
+with integer load. {\bf Rajouter des choses ici}.
+
+Although  the Bertsekas  and Tsitsiklis'  algorithm describes  the  condition to
+ensure the convergence,  there is no indication or  strategy to really implement
+the load distribution. In other word, a node  can send a part of its load to one
+or   many  of   its  neighbors   while  all   the  convergence   conditions  are
+followed. Consequently,  we propose a  new strategy called  \texttt{best effort}
+that tries to balance the load of  a node to all its less loaded neighbors while
+ensuring that all the nodes concerned  by the load balancing phase have the same
+amount of  load.  Moreover, when real asynchronous  applications are considered,
+using  asynchronous   load  balancing   algorithms  can  reduce   the  execution
+times. Most of the times, it is simpler to distinguish load information messages
+from  data  migration  messages.  Formers  ones  allows  a  node to  inform  its
+neighbors of its  current load. These messages are very small,  they can be sent
+quite often.  For example, if an  computing iteration takes  a significant times
+(ranging from seconds to minutes), it is possible to send a new load information
+message at each  neighbor at each iteration. Latter  messages contains data that
+migrates from one node to another one. Depending on the application, it may have
+sense or not  that nodes try to balance  a part of their load  at each computing
+iteration. But the time to transfer a load message from a node to another one is
+often much nore longer that to  time to transfer a load information message. So,
+when a node receives the information  that later it will receive a data message,
+it can take this information into account  and it can consider that its new load
+is larger.   Consequently, it can  send a part  of it real  load to some  of its
+neighbors if required. We call this trick the \texttt{virtual load} mecanism.
+
+
+
+So, in  this work, we propose a  new strategy for improving  the distribution of
+the  load  and  a  simple  but  efficient trick  that  also  improves  the  load
+balacing. Moreover, we have conducted  many simulations with simgrid in order to
+validate our improvements are really efficient. Our simulations consider that in
+order  to send a  message, a  latency delays  the sending  and according  to the
+network  performance and  the message  size, the  time of  the reception  of the
+message also varies.
+
+In the  following of this  paper, Section~\ref{BT algo} describes  the Bertsekas
+and Tsitsiklis'  asynchronous load balancing  algorithm. Moreover, we  present a
+possible  problem  in  the  convergence  conditions.   Section~\ref{Best-effort}
+presents the best effort strategy which  provides an efficient way to reduce the
+execution  times. In Section~\ref{Virtual  load}, the  virtual load  mecanism is
+proposed. Simulations allowed to show that both our approaches are valid using a
+quite realistic  model detailed in  Section~\ref{Simulations}. Finally we  give a
+conclusion and some perspectives to this work.
+
+
+
+
+\section{Bertsekas  and Tsitsiklis' asynchronous load balancing algorithm}
+\label{BT algo}
+
+In  order  prove  the  convergence  of  asynchronous  iterative  load  balancing
+Bertesekas         and        Tsitsiklis         proposed         a        model
+in~\cite{bertsekas+tsitsiklis.1997.parallel}.   Here we  recall  some notations.
+Consider  that  $N={1,...,n}$  processors   are  connected  through  a  network.
+Communication links  are represented by  a connected undirected  graph $G=(N,V)$
+where $V$ is the set of links connecting differents processors. In this work, we
+consider that  processors are  homogeneous for sake  of simplicity. It  is quite
+easy to tackle the  heterogeneous case~\cite{ElsMonPre02}. Load of processor $i$
+at  time $t$  is  represented  by $x_i(t)\geq  0$.   Let $V(i)$  be  the set  of
+neighbors of processor  $i$.  Each processor $i$ has an estimate  of the load of
+each  of its  neighbors $j  \in V(i)$  represented by  $x_j^i(t)$.  According to
+asynchronism and communication  delays, this estimate may be  outdated.  We also
+consider that the load is described by a continuous variable.
+
+When a processor  send a part of its  load to one or some of  its neighbors, the
+transfer takes time to be completed.  Let $s_{ij}(t)$ be the amount of load that
+processor $i$ has transfered to processor $j$ at time $t$ and let $r_{ij}(t)$ be the
+amount of  load received by processor $j$  from processor $i$ at  time $t$. Then
+the amount of load of processor $i$ at time $t+1$ is given by:
+\begin{equation}
+x_i(t+1)=x_i(t)-\sum_{j\in V(i)} s_{ij}(t) + \sum_{j\in V(i)} r_{ji}(t)
+\label{eq:ping-pong}
+\end{equation}
+
+
+Some  conditions are  required to  ensure the  convergence. One  of them  can be
+called the \texttt{ping-pong} condition which specifies that:
+\begin{equation}
+x_i(t)-\sum _{k\in V(i)} s_{ik}(t) \geq x_j^i(t)+s_{ij}(t)
+\end{equation}
+for any  processor $i$ and any  $j \in V(i)$ such  that $x_i(t)>x_j^i(t)$.  This
+condition aims  at avoiding a processor  to send a  part of its load  and beeing
+less loaded after that.
+
+Nevertheless,  we  think that  this  condition may  lead  to  deadlocks in  some
+cases. For example, if we consider  only three processors and that processor $1$
+is linked to processor $2$ which is  also linked to processor $3$ (i.e. a simple
+chain wich 3 processors). Now consider we have the following values at time $t$:
+\begin{eqnarray*}
+x_1(t)=10   \\
+x_2(t)=100   \\
+x_3(t)=99.99\\
+ x_3^2(t)=99.99\\
+\end{eqnarray*}
+In this case, processor $2$ can  either sends load to processor $1$ or processor
+$3$.   If  it  sends  load  to  processor $1$  it  will  not  satisfy  condition
+(\ref{eq:ping-pong})  because  after the  sending  it  will  be less  loaded  that
+$x_3^2(t)$.  So we consider that the \texttt{ping-pong} condition is probably to
+strong. Currently, we did not try to make another convergence proof without this
+condition or with a weaker condition.
+
+
+\section{Best effort strategy}
+\label{Best-effort}
+
+\textbf{Ã traduire} Ordonne les voisins du moins chargÃ© au plus chargÃ©.
+Trouve ensuite, en les prenant dans ce ordre, le nombre maximal de
+voisins tels que tous ont une charge infÃ©rieure Ã  la moyenne des
+charges des voisins sÃ©lectionnÃ©s, et de soi-mÃªme.
+
+Les transferts de charge sont ensuite fait en visant cette moyenne pour
+tous les voisins sÃ©lectionnÃ©s.  On envoie une quantitÃ© de charge Ã©gale
+Ã  (moyenne - charge\_du\_voisin).
+
+~\\\textbf{Question} faut-il dÃ©crire les stratÃ©gies makhoul et simple ?
+
+\paragraph{simple} Tentative de respecter simplement les conditions de Bertsekas.
+Parmi les voisins moins chargÃ©s que soi, on sÃ©lectionne :
+\begin{itemize}
+\item un des moins chargÃ©s (vmin) ;
+\item un des plus chargÃ©s (vmax),
+\end{itemize}
+puis on Ã©quilibre avec vmin en s'assurant que notre charge reste
+toujours supÃ©rieure Ã  celle de vmin et Ã  celle de vmax.
+
+On envoie donc (avec "self" pour soi-mÃªme) :
+\[
+    \min\left(\frac{load(self) - load(vmin)}{2}, load(self) - load(vmax)\right)
+\]
+
+\paragraph{makhoul} Ordonne les voisins du moins chargÃ© au plus chargÃ©
+puis calcule les diffÃ©rences de charge entre soi-mÃªme et chacun des
+voisins.
+
+Ensuite, pour chaque voisin, dans l'ordre, et tant qu'on reste plus
+chargÃ© que le voisin en question, on lui envoie 1/(N+1) de la
+diffÃ©rence calculÃ©e au dÃ©part, avec N le nombre de voisins.
+
+C'est l'algorithme~2 dans~\cite{bahi+giersch+makhoul.2008.scalable}.
+
+\section{Virtual load}
+\label{Virtual load}
+
+\section{Simulations}
+\label{Simulations}
+
+In order to test and validate our approaches, we wrote a simulator
+using the SimGrid
+framework~\cite{casanova+legrand+quinson.2008.simgrid}.  The process
+model is detailed in the next section (\ref{Sim model}), then the
+results of the simulations are presented in section~\ref{Results}.
+
+\subsection{Simulation model}
+\label{Sim model}
+
+\begin{verbatim}
+Communications
+==============
+
+There are two receiving channels per host: control for information
+messages, and data for load transfers.
+
+Process model
+=============
+
+Each process is made of 3 threads: a receiver thread, a computing
+thread, and a load-balancer thread.
+
+* Receiver thread
+  ---------------
+
+    Loop
+    | wait for a message to come, either on data channel, or on ctrl channel
+    | push received message in a buffer of received messages
+    | -> ctrl messages on the one side
+    | -> data messages on the other side
+    +-
+
+   The loop terminates when a "finalize" message is received on each
+   channel.
+
+* Computing thread
+  ----------------
+
+    Loop
+    | if we received some real load, get it (data messages)
+    | if there is some real load to send, send it
+    | if we own some load, simulate some computing on it
+    | sleep a bit if we are looping too fast
+    +-
+    send CLOSE on data for all neighbors
+    wait for CLOSE on data from all neighbors
+
+  The loop terminates when process::still_running() returns false.
+  (read the source for full details...)
+
+* Load-balancing thread
+  ---------------------
+
+    Loop
+    | call load-balancing algorithm
+    | send ctrl messages
+    | sleep (min_lb_iter_duration)
+    | receive ctrl messages
+    +-
+    send CLOSE on ctrl for all neighbors
+    wait for CLOSE on ctrl from all neighbors
+
+  The loop terminates when process::still_running() returns false.
+  (read the source for full details...)
+\end{verbatim}
+
+\subsection{Validation of our approaches}
+\label{Results}
+
+
+On veut montrer quoi ? :
+
+1) best plus rapide que les autres (simple, makhoul)
+2) avantage virtual load
+
+Est ce qu'on peut trouver des contre exemple?
+Topologies variÃ©es
+
+
+Simulation avec temps dÃ©finies assez long et on mesure la qualitÃ© avec : volume de calcul effectuÃ©, volume de donnÃ©es Ã©changÃ©es
+Mais aussi simulation avec temps court qui montre que seul best converge
+
+
+ExpÃ©s avec ratio calcul/comm rapide et lent
+
+Quelques expÃ©s avec charge initiale alÃ©atoire plutot que sur le premier proc
+
+Cadre processeurs homogÃ¨nes
+
+Topologies statiques
+
+On ne tient pas compte de la vitesse des liens donc on la considÃ¨re homogÃ¨ne
+
+Prendre un rÃ©seau hÃ©tÃ©rogÃ¨ne et rendre processeur homogÃ¨ne
+
+Taille : 10 100 trÃ¨s gros
+
+\section{Conclusion and perspectives}
 
 
 \bibliographystyle{spmpsci}