Reindent.

diff --git a/supercomp11/supercomp11.tex b/supercomp11/supercomp11.tex
index 49477d700de8388240a743629056f9f400fb6149..1faf1b03ca6afe1d4407ab771812f940c5a32c07 100644
--- a/supercomp11/supercomp11.tex
+++ b/supercomp11/supercomp11.tex
@@ -29,12 +29,12 @@
  }
  
  \institute{R. Couturier \and A. Giersch \at
-              LIFC, University of Franche-Comté, Belfort, France \\
+              FEMTO-ST, University of Franche-Comté, Belfort, France \\
                % Tel.: +123-45-678910\\
                % Fax: +123-45-678910\\
                \email{%
-                raphael.couturier@univ-fcomte.fr,
-                arnaud.giersch@univ-fcomte.fr}
+                raphael.couturier@femto-st.fr,
+                arnaud.giersch@femto-st.fr}
  }
  
  \maketitle
@@ -489,8 +489,19 @@ To summarize the different load balancing strategies, we have:
  %
  This gives us as many as $4\times 2\times 2 = 16$ different strategies.
  
+\paragraph{End of the simulation}
  
-\paragraph{Configurations}
+The simulations were run until the load was nearly balanced among the
+participating nodes.  More precisely, a simulation stops when the load held by
+each node stays within 1\% of the average load for an arbitrarily chosen
+number of computing iterations (2000 in our case).
+
+Note that this convergence detection was implemented in a centralized manner.
+This is easy to do within the simulator, but it is obviously not realistic.  In
+a real application, we would have chosen a decentralized convergence detection
+algorithm, such as the one described in \cite{10.1109/TPDS.2005.2}.
+
+\paragraph{Platforms}
  
  In order to show the behavior of the different strategies in different
  settings, we simulated the executions on two sorts of platforms.  These two
@@ -511,11 +522,15 @@ bandwidth was fixed to 2.25~GB/s, with a latency of 500~$\mu$s.
  Then we derived each sort of platform with four different number of computing
  nodes: 16, 64, 256, and 1024 nodes.
  
+\paragraph{Configurations}
+
  The distributed processes of the application were then logically organized along
  three possible topologies: a line, a torus or an hypercube.  We ran tests where
  the total load was initially on an only node (at one end for the line topology),
-and other tests where the load was initially randomly distributed across all
-the participating nodes.
+and other tests where the load was initially randomly distributed across all the
+participating nodes.  The total amount of load was fixed to a number of load
+units equal to 1000 times the number of nodes.  The average load is thus 1000
+load units per node.
  
  For each of the preceding configuration, we finally had to choose the
  computation and communication costs of a load unit.  We chose them, such as to
@@ -552,13 +567,45 @@ time.
  
  \paragraph{Metrics}
  
+In order to evaluate and compare the different load balancing strategies, we
+had to define several metrics.  Our goal, when choosing these metrics, was to
+obtain measures that tend to a constant value, i.e. measures that no longer
+change once the convergence state is reached.  Moreover, we wanted these
+measures to be normalized, so that they can be compared across different
+settings.
+
+With these constraints in mind, we defined the following metrics:
+%
  \begin{description}
-\item[\textbf{average idle time}]
-\item[\textbf{average convergence date}]
-\item[\textbf{maximum convergence date}]
-\item[\textbf{data transfer amount}] relative to the total data amount
+\item[\textbf{average idle time:}] the total time that the nodes spend without
+  any share of load, and thus with nothing to compute.  This total time is
+  divided by the number of participating nodes, so as to obtain a number
+  that can be compared between simulations of different sizes.
+
+  This metric is expected to give an idea of the ability of the strategy to
+  diffuse the load quickly.  A smaller value is better.
+
+\item[\textbf{average convergence date:}] the average, over all nodes, of the
+  dates at which they reached the convergence state.  The dates are measured as
+  a number of (simulated) seconds since the beginning of the simulation.
+
+\item[\textbf{maximum convergence date:}] the date at which the last node
+  reached the convergence state.
+
+  These two dates give an idea of the time needed by the strategy to reach the
+  equilibrium state.  A smaller value is better.
+
+\item[\textbf{data transfer amount:}] the sum of the amounts of all data
+  transfers during the simulation.  This sum is then normalized by dividing it
+  by the total amount of data present in the system.
+
+  This metric is expected to give an idea of the efficiency of the strategy in
+  terms of data movements, i.e. its ability to reach the equilibrium with fewer
+  transfers.  Again, a smaller value is better.
+
  \end{description}
  
+
  \subsection{Validation of our approaches}
  \label{Results}