From 650bc60eaa30522c55c1f4dba5437dce233c7390 Mon Sep 17 00:00:00 2001
From: Arnaud Giersch
Date: Tue, 18 Mar 2014 10:54:33 +0100
Subject: [PATCH 1/1] More remarks.

---
 paper.tex | 53 +++++++++++++++++++++++++++++++++--------------------
 1 file changed, 33 insertions(+), 20 deletions(-)

diff --git a/paper.tex b/paper.tex
index 8039f59..43d284f 100644
--- a/paper.tex
+++ b/paper.tex
@@ -109,7 +109,7 @@ objective function. Section~\ref{sec.optim} demonstrates the proposed
 energy-performance algorithm. Section~\ref{sec.expe} presents the results of our
 experiments. Section~\ref{sec.compare} shows the comparison results. Finally, we
 conclude in Section~\ref{sec.concl}.
-
+\AG{There are too many sections!}
 \section{Related Works}
 \label{sec.relwork}
 
@@ -189,7 +189,7 @@ platform. These tasks can exchange the data via synchronous message passing.
 Therefore, the execution time of a task consists of the computation time and the
 communication time. Moreover, the synchronous communications between tasks can
 lead to idle time while tasks wait at the synchronization barrier for other tasks to
-finish their communications (see figure~(\ref{fig:h1})). The imbalanced communications happen when nodes have to send/receive different amount of data or each node is communicates with different number of nodes. Another source for idle times is the imbalanced computations. This happen when processing different
+finish their communications (see figure~(\ref{fig:h1})). The imbalanced communications happen when nodes have to send/receive different amount of data or each node is communicates with different number of nodes. Another source for idle times is the imbalanced computations. This happens when processing different
 amounts of data on each processor (see figure~(\ref{fig:h2})). In this case the
 fastest tasks have to wait at the synchronization barrier for the slowest tasks
 to finish their job. In both cases the overall execution time
@@ -223,6 +223,7 @@ design dependent parameter and $I_{leak}$ is a technology-dependent parameter.
 Energy consumed by an individual processor $E_{ind}$ is the summation of the
 dynamic and the static power multiplied by the execution time for example
 see~\cite{36,15}.
+\AG{What's an ``execution time for example'' ? Add the correct punctuation.}
 \begin{equation}
 \label{eq:eind}
 E_\textit{ind} = ( P_\textit{dyn} + P_\textit{static} ) \cdot T
@@ -309,11 +310,13 @@ communication process the processors remain idle until the communication has
 finished. For that reason any change in the frequency has no impact on the time
 of communication but it has obvious impact on the time of computation~\cite{17}.
 We have made many tests on a real cluster to prove that the
+\AG{Caution: in general, tests don't \emph{prove} anything}
 frequency scaling factor \emph S has a linear relation with computation time
 only. To predict the execution time of MPI program, the communication time and
 the computation time for the slower task must be first precisely specified.
 Secondly, these times are used to predict the execution time for any MPI program
 as a function of the new scaling factor as in the EQ~(\ref{eq:tnew}).
+\AG{EQ~xx, without ``the''. Change everywhere.}
 \begin{equation}
 \label{eq:tnew}
 \textit T_\textit{new} = T_\textit{Max Comp Old} \cdot S + T_{\textit{Max Comm Old}}
@@ -324,14 +327,15 @@ communication time consists of the beginning times which an MPI calls for
 sending or receiving till the message is synchronously sent or received. In this
 paper we predict the execution time of the program for any new scaling factor
 value. Depending on this prediction we can produce our energy-performance scaling
-method as we will show in the coming sections. In the next section we make to finishan
+method as we will show in the coming sections. In the next section we make to finishan\AG{finishan?}
 investigation study for the EQ~(\ref{eq:tnew}).
 
 \section{Performance Prediction Verification}
 \label{sec.verif}
+\AG{This section presents experimental results. It should be put just before Sec.~\ref{sec.expe}}
 
 In this section we evaluate the precision of our performance prediction methods
-on the NAS benchmark. We use the EQ~(\ref{eq:tnew}) that predicts the execution
+on the NAS benchmarks. We use the EQ~(\ref{eq:tnew}) that predicts the execution
 time for any scale value. The NAS programs run the class B for comparing the
 real execution time with the predicted execution time. Each program runs offline
 with all available scaling factors on 8 or 9 nodes to produce real execution
@@ -473,13 +477,17 @@ scaling factor for both energy and performance at the same time.
 \end{algorithm}
 The proposed EPSA algorithm works online during the execution time of the MPI
 program. It selects the optimal scaling factor by gathering some information
-from the program after one iteration. This algorithm has small execution time
-(between 0.00152 $ms$ for 4 nodes to 0.00665 $ms$ for 32 nodes). The data
+from the program after one iteration.
+\AG{Which information?}
+ This algorithm has small execution time
+(between 0.00152 $ms$ for 4 nodes to 0.00665 $ms$ for 32 nodes).
+\AG{Algorithmic complexity?}
+ The data
 required by this algorithm is the computation time and the communication time
 for each task from the first iteration only. When these times are measured, the
 MPI program calls the EPSA algorithm to choose the new frequency using the
-optimal scaling factor. Then the program set the new frequency to the
-system. The algorithm is called just one time during the execution of the
+optimal scaling factor. Then the program sets the new frequency to the
+system\AG[]{???}. The algorithm is called just one time during the execution of the
 program. The DVFS algorithm~(\ref{dvfs}) shows where and when the EPSA algorithm is called
 in the MPI program.
 %\begin{minipage}{\textwidth}
@@ -492,7 +500,7 @@ in the MPI program.
 \For {$J:=1$ to $Some-Iterations \; $}
 \State -Computations Section.
 \State -Communications Section.
- \If {$(J==1)$}
+ \If {$(J=1)$}
 \State -Gather all times of computation and\par\hspace{13 pt} communication from each node.
 \State -Call EPSA with these times.
 \State -Calculate the new frequency from optimal scale.
@@ -502,7 +510,7 @@ in the MPI program.
 \end{algorithmic}
 \end{algorithm}
 
-After obtaining the optimal scale factor from the EPSA algorithm. The program
+After obtaining the optimal scale factor from the EPSA algorithm.\AG[]{comma} The program
 calculates the new frequency $F_i$ for each task proportionally to its time
 value $T_i$. By substitution of the EQ~(\ref{eq:s}) in the EQ~(\ref{eq:si}), we
 can calculate the new frequency $F_i$ as follows:
@@ -528,7 +536,9 @@ respectively. Our experiments are executed on the simulator SimGrid/SMPI v3.10.
 We design a platform file that simulates a cluster with one core per node. This
 cluster is a homogeneous architecture with distributed memory. The detailed
 characteristics of our platform file are shown in the
-table~(\ref{table:platform}). Each node in the cluster has 18 frequency values
+table~(\ref{table:platform}).
+\AG{Are those characteristics realistic?}
+ Each node in the cluster has 18 frequency values
 from 2.5 GHz to 800 MHz with 100 MHz difference between each two successive
 frequencies.
 \begin{table}[htb]
@@ -545,8 +555,10 @@ frequencies.
 \label{table:platform}
 \end{table}
 Depending on the EQ~(\ref{eq:energy}), we measure the energy consumption for all
-the NAS MPI programs while assuming the power dynamic is equal to 20W and the
-power static is equal to 4W for all experiments. We run the proposed EPSA
+the NAS MPI programs while assuming the power dynamic is equal to \np[W]{20} and
+the power static is equal to \np[W]{4} for all experiments.
+\AG{How did you choose those values (available frequencies, power consumption)?}
+ We run the proposed EPSA
 algorithm for all these programs. The results showed that the algorithm selected
 different scaling factors for each program depending on the communication
 features of the program as in the figure~(\ref{fig:nas}). This figure shows that
@@ -554,9 +566,9 @@ there are different distances between the normalized energy and the normalized
 inversed performance curves, because there are different communication features
 for each MPI program. When there are little or not communications, the inversed
 performance curve is very close to the energy curve. Then the distance between
-the two curves is very small. This lead to small energy savings. The opposite
+the two curves is very small. This leads to small energy savings. The opposite
 happens when there are a lot of communication, theto finish distance between the two
-curves is big. This lead to more energy savings (e.g. CG and FT), see
+curves is big. This leads to more energy savings (e.g. CG and FT), see
 table~(\ref{table:factors results}). All discovered frequency scaling factors
 optimize both the energy and the performance simultaneously for all the NAS
 programs. In table~(\ref{table:factors results}), we record all optimal scaling
@@ -606,15 +618,15 @@ EPSA to selects smaller scaling factor values (inducing smaller energy savings).
 \label{sec.compare}
 
 In this section, we compare our EPSA algorithm results with Rauber and Rünger
-methods~\cite{3}. He had two scenarios, the first is to reduce energy to optimal
-level without considering the performance as in EQ~(\ref{eq:sopt}). We refer to
-this scenario as $R_{E}$. The second scenario is similar to the first
+methods~\cite{3}. They had two scenarios, the first is to reduce energy to
+optimal level without considering the performance as in EQ~(\ref{eq:sopt}). We
+refer to this scenario as $R_{E}$. The second scenario is similar to the first
 except setting the slower task to the maximum frequency (when the scale $S=1$)
 to keep the performance from degradation as mush as possible. We refer to this
 scenario as $R_{E-P}$. The comparison is made in tables~(\ref{table:compare
 Class A},\ref{table:compare Class B},\ref{table:compare Class C}). These
-tables show the results of our EPSA and Rauber and Rünger scenarios for all the NAS
-benchmarks programs for classes A,B and C.
+tables show the results of our EPSA and Rauber and Rünger scenarios for all the
+NAS benchmarks programs for classes A,B and C.
 
 \begin{table}[p]
 \caption{Comparing Results for The NAS Class A} % title of Table
@@ -771,6 +783,7 @@ than the first.
 \section{Conclusion}
 \label{sec.concl}
 In this paper we develop the simultaneous energy-performance algorithm. It is works based on the trade off relation between the energy and performance. The results showed that when the scaling factor is big value leads to more energy saving. Also, it show that when the the scaling factor is small value leads to the fact that the scaling factor has bigger impact on performance than energy. Then the algorithm optimize the energy saving and performance in the same time to have positive trade off. The optimal trade off refer to maximum distance between the energy and the inversed performance curves. Also, the results explained when setting the slowest task to maximum frequency usually not have a big improvement on performance.
+\AG{Needs to be better written. Add some future works.}
 
 \section*{Acknowledgment}
 
-- 
2.39.5