energy-performance algorithm. Section~\ref{sec.expe} presents the results of our
experiments. Section~\ref{sec.compare} shows the comparison results. Finally,
we conclude in Section~\ref{sec.concl}.
-
+\AG{There are too many sections!}
\section{Related Works}
\label{sec.relwork}
Therefore, the execution time of a task consists of the computation time and the
communication time. Moreover, the synchronous communications between tasks can
lead to idle time while tasks wait at the synchronization barrier for other tasks to
-finish their communications (see figure~(\ref{fig:h1})). The imbalanced communications happen when nodes have to send/receive different amount of data or each node is communicates with different number of nodes. Another source for idle times is the imbalanced computations. This happen when processing different
+finish their communications (see figure~(\ref{fig:h1})). Imbalanced communications happen when nodes have to send/receive different amounts of data or when each node communicates with a different number of nodes. Another source of idle time is imbalanced computations. This happens when processing different
amounts of data on each processor (see figure~(\ref{fig:h2})). In
this case the fastest tasks have to wait at the synchronization barrier for the
slowest tasks to finish their job. In both cases the overall execution time
parameter. The energy consumed by an individual processor, $E_\textit{ind}$, is the sum
of the dynamic and the static power multiplied by the execution time (see, for
example, \cite{36,15}).
+\AG{What's an ``execution time for example'' ? Add the correct punctuation.}
\begin{equation}
\label{eq:eind}
E_\textit{ind} = ( P_\textit{dyn} + P_\textit{static} ) \cdot T
finished. For that reason, any change in the frequency has no impact on the
communication time, but it has an obvious impact on the computation
time~\cite{17}. We ran many tests on a real cluster, which suggest that the
+\AG{Caution: in general, tests don't \emph{prove} anything}
frequency scaling factor $S$ has a linear relation with the computation time
only. To predict the execution time of an MPI program, the communication time and
the computation time of the slowest task must first be precisely measured. Then
these times are used to predict the execution time of any MPI program as a function of
the new scaling factor, as in EQ~(\ref{eq:tnew}).
+\AG{EQ~xx, without ``the''. Change everywhere.}
\begin{equation}
\label{eq:tnew}
T_\textit{new} = T_\textit{Max Comp Old} \cdot S + T_\textit{Max Comm Old}
sending or receiving till the message is synchronously sent or received. In this
paper we predict the execution time of the program for any new scaling factor
value. Depending on this prediction we can produce our energy-performance scaling
-method as we will show in the coming sections. In the next section we make to finishan
+method as we will show in the coming sections. In the next section we carry out an\AG{finishan?}
investigation of EQ~(\ref{eq:tnew}).
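As a purely illustrative sketch of how EQ~(\ref{eq:tnew}) is applied, consider
hypothetical values (not measurements): suppose the first iteration gives
$T_\textit{Max Comp Old} = 10$~s and $T_\textit{Max Comm Old} = 2$~s, so the
iteration takes 12~s at $S = 1$. For a new scaling factor $S = 1.25$, the
predicted execution time would be
\[
T_\textit{new} = 10 \cdot 1.25 + 2 = 14.5~\mathrm{s}.
\]
In this example only the computation part grows with $S$ (from 10~s to 12.5~s),
while the communication part keeps its value of 2~s.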
\section{Performance Prediction Verification}
\label{sec.verif}
+\AG{This section presents experimental results. It should be put just before Sec.~\ref{sec.expe}}
In this section we evaluate the precision of our performance prediction methods
-on the NAS benchmark. We use the EQ~(\ref{eq:tnew}) that predicts the execution
+on the NAS benchmarks. We use EQ~(\ref{eq:tnew}), which predicts the execution
time for any scaling factor. The NAS programs are run with class B to compare the
real execution time with the predicted execution time. Each program runs offline
with all available scaling factors on 8 or 9 nodes to produce real execution
\end{algorithm}
The proposed EPSA algorithm works online during the execution time of the MPI
program. It selects the optimal scaling factor by gathering some information
-from the program after one iteration. This algorithm has small execution time
-(between 0.00152 $ms$ for 4 nodes to 0.00665 $ms$ for 32 nodes). The data
+from the program after one iteration.
+\AG{Which information?}
+ This algorithm has a small execution time
+(between 0.00152~ms for 4 nodes and 0.00665~ms for 32 nodes).
+\AG{Algorithmic complexity?}
+ The data
required by this algorithm is the computation time and the communication time
for each task from the first iteration only. When these times are measured, the
MPI program calls the EPSA algorithm to choose the new frequency using the
-optimal scaling factor. Then the program set the new frequency to the
-system. The algorithm is called just one time during the execution of the
+optimal scaling factor. Then the program sets the new frequency to the
+system\AG[]{???}. The algorithm is called just one time during the execution of the
program. The DVFS algorithm~(\ref{dvfs}) shows where and when the EPSA algorithm is called
in the MPI program.
%\begin{minipage}{\textwidth}
\For {$J:=1$ to \textit{Some-Iterations}}
\State -Computations Section.
\State -Communications Section.
- \If {$(J==1)$}
+ \If {$(J=1)$}
\State -Gather all times of computation and\par\hspace{13 pt} communication from each node.
\State -Call EPSA with these times.
\State -Calculate the new frequency from optimal scale.
\end{algorithmic}
\end{algorithm}
-After obtaining the optimal scale factor from the EPSA algorithm. The program
+After obtaining the optimal scale factor from the EPSA algorithm,\AG[]{comma} the program
calculates the new frequency $F_i$ for each task proportionally to its time
value $T_i$. By substituting EQ~(\ref{eq:s}) into EQ~(\ref{eq:si}), we
can calculate the new frequency $F_i$ as follows:
v3.10. We design a platform file that simulates a cluster with one core per
node. This cluster is a homogeneous architecture with distributed memory. The
detailed characteristics of our platform file are shown in the
-table~(\ref{table:platform}). Each node in the cluster has 18 frequency values
+table~(\ref{table:platform}).
+\AG{Are those characteristics realistic?}
+ Each node in the cluster has 18 frequency values
from 2.5 GHz to 800 MHz, with a 100 MHz difference between successive
frequencies.
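
To make this range concrete, the available frequencies can be written as
\[
F_k = 2.5 - 0.1\,k~\mathrm{GHz}, \qquad k = 0, 1, \dots, 17,
\]
which gives $F_0 = 2.5$~GHz, $F_{17} = 0.8$~GHz and 18 values in total. As an
illustration only, if the scaling factor is assumed to be the ratio
$S = F_\textit{max} / F_\textit{new}$ (a definition assumed here for the
example), then the corresponding scaling factors range from
$S = 2.5 / 2.5 = 1$ to $S = 2.5 / 0.8 = 3.125$.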
\begin{table}[htb]
\label{table:platform}
\end{table}
Using EQ~(\ref{eq:energy}), we measure the energy consumption for all
-the NAS MPI programs while assuming the power dynamic is equal to 20W and the
-power static is equal to 4W for all experiments. We run the proposed EPSA
+the NAS MPI programs while assuming that the dynamic power is equal to \np[W]{20} and
+the static power is equal to \np[W]{4} for all experiments.
+\AG{How did you choose those values (available frequencies, power consumption)?}
+ We run the proposed EPSA
algorithm for all these programs. The results showed that the algorithm selected
a different scaling factor for each program depending on the communication
features of the program, as shown in figure~(\ref{fig:nas}). This figure shows that
inversed performance curves, because there are different communication features
for each MPI program. When there is little or no communication, the inversed
performance curve is very close to the energy curve, so the distance between
-the two curves is very small. This lead to small energy savings. The opposite
+the two curves is very small. This leads to small energy savings. The opposite
happens when there is a lot of communication: the distance between the two
-curves is big. This lead to more energy savings (e.g. CG and FT), see
+curves is big. This leads to more energy savings (e.g. CG and FT), see
table~(\ref{table:factors results}). All discovered frequency scaling factors
optimize both the energy and the performance simultaneously for all the NAS
programs. In table~(\ref{table:factors results}), we record all optimal scaling
\label{sec.compare}
In this section, we compare the results of our EPSA algorithm with the Rauber and Rünger
-methods~\cite{3}. He had two scenarios, the first is to reduce energy to optimal
-level without considering the performance as in EQ~(\ref{eq:sopt}). We refer to
-this scenario as $R_{E}$. The second scenario is similar to the first
+methods~\cite{3}. They propose two scenarios. The first reduces energy to the
+optimal level without considering the performance, as in EQ~(\ref{eq:sopt}). We
+refer to this scenario as $R_{E}$. The second scenario is similar to the first,
except that the slowest task is set to the maximum frequency (when the scale $S=1$)
to limit performance degradation as much as possible. We refer to this
scenario as $R_{E-P}$. The comparison is made in tables~(\ref{table:compare
Class A},\ref{table:compare Class B},\ref{table:compare Class C}). These
-tables show the results of our EPSA and Rauber and Rünger scenarios for all the NAS
-benchmarks programs for classes A,B and C.
+tables show the results of our EPSA and of the Rauber and Rünger scenarios for all the
+NAS benchmark programs for classes A, B and C.
\begin{table}[p]
\caption{Comparing Results for The NAS Class A}
% title of Table
\section{Conclusion}
\label{sec.concl}
In this paper we develop a simultaneous energy-performance algorithm. It is based on the trade-off relation between energy and performance. The results show that a large scaling factor leads to more energy savings, whereas with a small scaling factor the impact on performance is bigger than the impact on energy. The algorithm therefore optimizes the energy savings and the performance at the same time to obtain a positive trade-off. The optimal trade-off corresponds to the maximum distance between the energy curve and the inversed performance curve. The results also show that setting the slowest task to the maximum frequency usually does not bring a big improvement in performance.
+\AG{Needs to be better written. Add some future works.}
\section*{Acknowledgment}