From 650bc60eaa30522c55c1f4dba5437dce233c7390 Mon Sep 17 00:00:00 2001
From: Arnaud Giersch <arnaud.giersch@iut-bm.univ-fcomte.fr>
Date: Tue, 18 Mar 2014 10:54:33 +0100
Subject: [PATCH 1/1] More remarks.

---
 paper.tex | 53 +++++++++++++++++++++++++++++++++--------------------
 1 file changed, 33 insertions(+), 20 deletions(-)

diff --git a/paper.tex b/paper.tex
index 8039f59..43d284f 100644
--- a/paper.tex
+++ b/paper.tex
@@ -109,7 +109,7 @@ objective function. Section~\ref{sec.optim} demonstrates the proposed
 energy-performance algorithm. Section~\ref{sec.expe} presents the results of our
 experiments.  Section~\ref{sec.compare} shows the comparison results. Finally,
 we conclude in Section~\ref{sec.concl}.
-
+\AG{There are too many sections!}
 \section{Related Works}
 \label{sec.relwork}
 
@@ -189,7 +189,7 @@ platform. These tasks can exchange the data via synchronous message passing.
 Therefore, the execution time of a task consists of the computation time and the
 communication time. Moreover, the synchronous communications between tasks can
 lead to idle time while tasks wait at the synchronization barrier for other tasks to
-finish their communications (see figure~(\ref{fig:h1})). The imbalanced communications happen when nodes have to send/receive different amount of data or each node is communicates with different number of nodes. Another source for idle times is the imbalanced computations. This happen when processing different
+finish their communications (see figure~(\ref{fig:h1})). Imbalanced communications happen when nodes have to send/receive different amounts of data or when each node communicates with a different number of nodes. Another source of idle time is imbalanced computations. This happens when processing different
 amounts of data on each processor  (see figure~(\ref{fig:h2})). In
 this case the fastest tasks have to wait at the synchronization barrier for the
 slowest tasks to finish their job. In both cases the overall execution time
@@ -223,6 +223,7 @@ design dependent parameter and $I_{leak}$ is a technology-dependent
 parameter. Energy consumed by an individual processor $E_{ind}$ is the summation
 of the dynamic and the static power multiplied by the execution time for example
 see~\cite{36,15}.
+\AG{What's an ``execution time for example''? Add the correct punctuation.}
 \begin{equation}
   \label{eq:eind}
    E_\textit{ind} = ( P_\textit{dyn} + P_\textit{static} ) \cdot T
@@ -309,11 +310,13 @@ communication process the processors remain idle until the communication has
 finished. For that reason any change in the frequency has no impact on the time
 of communication but it has obvious impact on the time of
 computation~\cite{17}. We have made many tests on a real cluster to prove that the
+\AG{Caution: in general, tests don't \emph{prove} anything}
 frequency scaling factor \emph S has a linear relation with computation time
 only. To predict the execution time of MPI program, the communication time and 
 the computation time for the slower task must be first precisely specified. Secondly, 
 these times are used to predict the execution time for any MPI program as a function of 
 the new scaling factor as in the EQ~(\ref{eq:tnew}).
+\AG{EQ~xx, without ``the''. Change everywhere.}
 \begin{equation}
   \label{eq:tnew}
  \textit  T_\textit{new} = T_\textit{Max Comp Old} \cdot S + T_{\textit{Max Comm Old}}
@@ -324,14 +327,15 @@ communication time consists of the beginning times which an MPI calls for
 sending or receiving till the message is synchronously sent or received. In this
 paper we predict the execution time of the program for any new scaling factor
 value. Depending on this prediction we can produce our energy-performance scaling
-method as we will show in the coming sections. In the next section we make to finishan
+method as we will show in the coming sections. In the next section we make to finishan\AG{finishan?}
 investigation study for the EQ~(\ref{eq:tnew}).
 
 \section{Performance Prediction Verification}
 \label{sec.verif}
 
+\AG{This section presents experimental results. It should be put just before Sec.~\ref{sec.expe}}
 In this section we evaluate the precision of our performance prediction methods
-on the NAS benchmark. We use the EQ~(\ref{eq:tnew}) that predicts the execution
+on the NAS benchmarks. We use the EQ~(\ref{eq:tnew}) that predicts the execution
 time for any scale value. The NAS programs run the class B for comparing the
 real execution time with the predicted execution time. Each program runs offline
 with all available scaling factors on 8 or 9 nodes to produce real execution
@@ -473,13 +477,17 @@ scaling factor for both energy and performance at the same time.
 \end{algorithm}
 The proposed EPSA algorithm works online during the execution time of the MPI
 program. It selects the optimal scaling factor by gathering some information
-from the program after one iteration. This algorithm has small execution time
-(between 0.00152 $ms$ for 4 nodes to 0.00665 $ms$ for 32 nodes). The data
+from the program after one iteration.
+\AG{Which information?}
+ This algorithm has small execution time
+(between 0.00152 $ms$ for 4 nodes to 0.00665 $ms$ for 32 nodes).
+\AG{Algorithmic complexity?}
+ The data
 required by this algorithm is the computation time and the communication time
 for each task from the first iteration only. When these times are measured, the
 MPI program calls the EPSA algorithm to choose the new frequency using the
-optimal scaling factor. Then the program set the new frequency to the
-system. The algorithm is called just one time during the execution of the
+optimal scaling factor. Then the program sets the new frequency to the
+system\AG[]{???}. The algorithm is called just one time during the execution of the
 program. The DVFS algorithm~(\ref{dvfs}) shows where and when the EPSA algorithm is called
 in the MPI program.
 %\begin{minipage}{\textwidth}
@@ -492,7 +500,7 @@ in the MPI program.
  \For {$J:=1$ to $Some-Iterations \; $}
   \State -Computations Section.
    \State -Communications Section.
-   \If {$(J==1)$} 
+   \If {$(J=1)$} 
      \State -Gather all times of computation and\par\hspace{13 pt} communication from each node.
      \State -Call EPSA with these times.
      \State -Calculate the new frequency from optimal scale.
@@ -502,7 +510,7 @@ in the MPI program.
 \end{algorithmic}
 \end{algorithm}
 
-After obtaining the optimal scale factor from the EPSA algorithm. The program
+After obtaining the optimal scale factor from the EPSA algorithm.\AG[]{comma} The program
 calculates the new frequency $F_i$ for each task proportionally to its time
 value $T_i$. By substitution of the EQ~(\ref{eq:s}) in the EQ~(\ref{eq:si}), we
 can calculate the new frequency $F_i$ as follows:
@@ -528,7 +536,9 @@ respectively. Our experiments are executed on the simulator SimGrid/SMPI
 v3.10. We design a platform file that simulates a cluster with one core per
 node. This cluster is a homogeneous architecture with distributed memory. The
 detailed characteristics of our platform file are shown in the
-table~(\ref{table:platform}). Each node in the cluster has 18 frequency values
+table~(\ref{table:platform}).
+\AG{Are those characteristics realistic?}
+ Each node in the cluster has 18 frequency values
 from 2.5 GHz to 800 MHz with 100 MHz difference between each two successive
 frequencies.
 \begin{table}[htb]
@@ -545,8 +555,10 @@ frequencies.
   \label{table:platform}
 \end{table}
 Depending on the EQ~(\ref{eq:energy}), we measure the energy consumption for all
-the NAS MPI programs while assuming the power dynamic is equal to 20W and the
-power static is equal to 4W for all experiments. We run the proposed EPSA
+the NAS MPI programs while assuming the power dynamic is equal to \np[W]{20} and
+the power static is equal to \np[W]{4} for all experiments.
+\AG{How did you choose those values (available frequencies, power consumption)?}
+ We run the proposed EPSA
 algorithm for all these programs. The results showed that the algorithm selected
 different scaling factors for each program depending on the communication
 features of the program as in the figure~(\ref{fig:nas}). This figure shows that
@@ -554,9 +566,9 @@ there are different distances between the normalized energy and the normalized
 inversed performance curves, because there are different communication features
 for each MPI program.  When there are little or not communications, the inversed
 performance curve is very close to the energy curve. Then the distance between
-the two curves is very small. This lead to small energy savings. The opposite
+the two curves is very small. This leads to small energy savings. The opposite
 happens when there are a lot of communication, theto finish distance between the two
-curves is big.  This lead to more energy savings (e.g. CG and FT), see
+curves is big.  This leads to more energy savings (e.g. CG and FT), see
 table~(\ref{table:factors results}). All discovered frequency scaling factors
 optimize both the energy and the performance simultaneously for all the NAS
 programs. In table~(\ref{table:factors results}), we record all optimal scaling
@@ -606,15 +618,15 @@ EPSA to selects smaller scaling factor values (inducing smaller energy savings).
 \label{sec.compare}
 
 In this section, we compare our EPSA algorithm results with Rauber and Rünger
-methods~\cite{3}. He had two scenarios, the first is to reduce energy to optimal
-level without considering the performance as in EQ~(\ref{eq:sopt}). We refer to
-this scenario as $R_{E}$. The second scenario is similar to the first
+methods~\cite{3}. They had two scenarios: the first is to reduce energy to
+optimal level without considering the performance as in EQ~(\ref{eq:sopt}). We
+refer to this scenario as $R_{E}$. The second scenario is similar to the first
 except setting the slower task to the maximum frequency (when the scale $S=1$)
 to keep the performance from degradation as mush as possible. We refer to this
 scenario as $R_{E-P}$. The comparison is made in tables~(\ref{table:compare
   Class A},\ref{table:compare Class B},\ref{table:compare Class C}). These
-tables show the results of our EPSA and Rauber and Rünger scenarios for all the NAS
-benchmarks programs for classes A,B and C.
+tables show the results of our EPSA and Rauber and Rünger scenarios for all the
+NAS benchmark programs for classes A, B and C.
 \begin{table}[p]
   \caption{Comparing Results for  The NAS Class A}
   % title of Table
@@ -771,6 +783,7 @@ than the first.
 \section{Conclusion}
 \label{sec.concl}
 In this paper we develop the simultaneous energy-performance algorithm. It is works based on the trade off relation between the energy and performance. The results showed that when the scaling factor is big value leads to more energy saving. Also, it show that when the the scaling factor is small value leads to the fact that the scaling factor has bigger impact on performance than energy. Then the algorithm optimize the energy saving and performance in the same time to have positive trade off. The optimal trade off refer to maximum distance between the energy and the inversed performance curves. Also, the results explained when setting the slowest task to maximum frequency usually not have a big improvement on performance. 
+\AG{Needs to be better written.  Add some future works.}
 
 \section*{Acknowledgment}
 
-- 
2.39.5