-penchmarks (NPB v3.3) developed by NASA~\cite{44}. Our experiments are executed
-using the simulator Simgrid/SMPI v3.10~\cite{45} over an homogeneous distributed
-memory architecture. Furthermore, we compare the proposed algorithm with
-Rauber's methods. The comparison's results show that our algorithm gives better
-energy-time trade off.
-\sectionfont{\large}
-
-\section{.~Related Works }
-
-In the this section some heuristics, to compute the scaling factor, are
-presented and classified in two parts : offline and online methods.
- \sectionfont{\large}
-
-\subsection{~The offline DVFS orientations}
-
-The DVFS offline methods are static and are not executed during the runtime of
-the program. Some approaches used heuristics to select the best DVFS state
-during the compilation phases as an example in Azevedo et al.~\cite{40}. He used
-intra-task algorithm to choose the DVFS setting when there are dependency points
-between tasks. While in~\cite{29}, Xie et al. used breadth-first search
-algorithm to do that. Their goal is saving energy with time limits. Another
-approaches gathers and stores the runtime information for each DVFS state , then
-used their methods offline to select the suitable DVFS that optimize energy-time
-trade offs. As an example~\cite{8}, Rountree et al. used liner programming
-algorithm, while in~\cite{38,34}, Cochran et al. used multi logistic regression
-algorithm for the same goal. The offline study that shown the DVFS impact on the
-communication time of the MPI program is~\cite{17}, Freeh et al. show that these
-times not changed when the frequency is scaled down.
-\sectionfont{\large}
-
-\subsection{~The online DVFS orientations}
-
-The objective of these works is to dynamically compute and set the frequency of
-the CPU during the runtime of the program for saving energy. Estimating and
-predicting approaches for the energy-time trade offs developed by
-~\cite{11,2,31}. These works select the best DVFS setting depending on the slack
-times. These times happen when the processors have to wait for data from other
-processors to compute their task. For example, during the synchronous
-communication time that take place in the MPI programs, the processors are
-idle. The optimal DVFS can be selected using the learning methods. Therefore, in
-~\cite{39,19} used machine learning to converge to the suitable DVFS
-configuration. Their learning algorithms have big time to converge when the
-number of available frequencies is high. Also, the communication time of the MPI
-program used online for saving energy as in~\cite{1}, Lim et al. developed an
-algorithm that detects the communication sections and changes the frequency
-during these sections only. This approach changes the frequency many times
-because an iteration may contain more than one communication section. The domain
-of analytical modeling used for choosing the optimal frequency as in ~\cite{3},
-Rauber et al. developed an analytical mathematical model for determining the
-optimal frequency scaling factor for any number of concurrent tasks, without
-considering communication times. They set the slowest task to maximum frequency
-for maintaining performance. In this paper we compare our algorithm with
-Rauber's model~\cite{3}, because his model can be used for any number of
-concurrent tasks for homogeneous platform and this is the same direction of this
-paper. However, the primary contributions of this paper are:
+benchmarks (NPB v3.3) developed by NASA~\cite{44}. Our experiments are executed
+using the simulator SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183}
+over a homogeneous distributed memory architecture. Furthermore, we compare the
+proposed algorithm with the method of Rauber and Rünger~\cite{3}.
+The results of the comparison show that our algorithm gives a better energy-time trade-off.
+
+This paper is organized as follows: Section~\ref{sec.relwork} presents related works
+from other authors. Section~\ref{sec.exe} describes how parallel tasks are executed,
+identifies the sources of idle times and summarizes the energy model of a
+homogeneous platform. Section~\ref{sec.mpip} evaluates the performance of an MPI
+program. Section~\ref{sec.compet} presents the energy-performance trade-off
+objective function. Section~\ref{sec.optim} details the proposed energy-performance
+algorithm. Section~\ref{sec.expe} verifies the performance prediction model,
+presents the results of the proposed algorithm and shows the comparison results.
+Finally, we conclude in Section~\ref{sec.concl}.
+\section{Related works}
+\label{sec.relwork}
+
+\AG{Consider introducing the models (sec.~\ref{sec.exe}) before related works}
+
+In this section, some heuristics to compute the scaling factor are
+presented and classified into two categories: offline and online methods.
+
+\subsection{Offline scaling factor selection methods}
+
+The offline scaling factor selection methods are executed before the runtime of
+the program. They return static scaling factor values to the processors
+participating in the execution of the parallel program. On the one hand, the
+scaling factor values could be computed based on information retrieved by
+analyzing the code of the program and the computing system that will execute it.
+In~\cite{40}, Azevedo et al. detect during compilation the dependency points
+between tasks in a parallel program. This information is then used to lower the
+frequency of some processors in order to eliminate slack times. A slack time is
+the period during which a processor that has already finished its computation
+has to wait for a set of processors to finish their computations and send their
+results to the waiting processor, so that it can continue the part of its task
+that depends on the results being computed on the other processors.
+Freeh et al. showed in~\cite{17} that the communication times of MPI programs do
+not change when the frequency is scaled down. On the other hand, some offline
+scaling factor selection methods use the information gathered from previous full
+or partial executions of the program. A part of the program or the whole program
+is usually executed over all the available frequency gears, and the execution
+time and the energy consumed with each frequency gear are measured. Then a
+heuristic or an exact method uses the retrieved information to compute the
+values of the scaling factor for the processors.
+In~\cite{29}, Xie et al. use an exact exponential breadth-first search algorithm
+to compute the scaling factor values that give the optimal energy reduction
+while respecting a deadline for a sequential program. They also present a linear
+heuristic that approximates the optimal solution. In~\cite{8}, Rountree et al.
+use a linear programming algorithm, while in~\cite{38,34}, Cochran et al. use a
+multinomial logistic regression algorithm for the same goal. The main drawback
+of these methods is that they all require executing a part of the program or the
+whole program on all frequency gears for each new instance of the same program.
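+As an illustration, the following listing sketches the general shape of such
+profiling-based offline methods; the helper \texttt{run\_at\_gear} and the
+weighted objective are hypothetical placeholders for the platform-specific
+measurement tools and optimization criteria used by the cited works.
+\begin{verbatim}
+# Sketch of a profiling-based offline selection method.
+# run_at_gear() is a hypothetical placeholder for a benchmarking
+# helper returning (execution time, consumed energy) for one gear.
+def profile_gears(program, gears):
+    """Run the program once per frequency gear and record
+    its execution time and consumed energy."""
+    return {gear: run_at_gear(program, gear) for gear in gears}
+
+def select_gear(profile, beta=0.5):
+    """Pick the gear minimizing a normalized, weighted
+    energy-time objective (beta weights energy vs. time)."""
+    t_max = max(t for t, _ in profile.values())
+    e_max = max(e for _, e in profile.values())
+    def objective(gear):
+        t, e = profile[gear]
+        return beta * (e / e_max) + (1 - beta) * (t / t_max)
+    return min(profile, key=objective)
+\end{verbatim}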
+
+\subsection{Online scaling factor selection methods}
+The online scaling factor selection methods are executed during the runtime of
+the program. They are usually integrated into iterative programs where the same
+block of instructions is executed many times. During the first few iterations, a
+set of measurements is collected, such as the execution time, the energy
+consumed (using a multimeter, for example) and the slack times. A method then
+exploits these measurements to compute a scaling factor value for each
+processor. This operation, measuring and computing new scaling factors, can be
+repeated as often as needed if the iterations are not regular. Kimura, Peraza
+and Yu-Liang et al.~\cite{11,2,31} used learning methods to select the
+appropriate scaling factor values that eliminate the slack times during runtime.
+However, as shown in~\cite{39,19}, machine learning methods can take a long time
+to converge when the number of available gears is large.
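+The following listing sketches the general structure shared by these online
+methods for an iterative program; \texttt{run\_iteration},
+\texttt{compute\_scaling\_factors} and \texttt{set\_frequency} are hypothetical
+placeholders for the measurement, decision and DVFS interfaces used by the
+cited works.
+\begin{verbatim}
+# Sketch of an online scaling factor selection loop.
+# run_iteration(), compute_scaling_factors() and set_frequency()
+# are hypothetical placeholders for the measurement, decision and
+# DVFS interfaces of the cited works.
+def online_dvfs_loop(num_iterations, probe_iterations=3):
+    samples = []
+    for i in range(num_iterations):
+        if i == probe_iterations:
+            # decision phase: one scaling factor per processor,
+            # applied through the DVFS interface before continuing
+            for processor, factor in compute_scaling_factors(samples).items():
+                set_frequency(processor, factor)
+        stats = run_iteration(i)      # one iteration of the program
+        if i < probe_iterations:
+            samples.append(stats)     # probing phase: keep measurements
+\end{verbatim}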
+To reduce the impact of slack times, Lim et al.~\cite{1} developed an algorithm
+that detects the communication sections and changes the frequency during these
+sections only. This approach might change the frequency of each processor many
+times per iteration if an iteration contains more than one communication
+section. In~\cite{3}, Rauber and Rünger used an analytical model that, after
+measuring the energy consumed and the execution time with the highest frequency
+gear, can predict the energy consumption and the execution time for every other
+frequency gear. These predictions may be used to choose the optimal gear for
+each processor executing the parallel program in order to reduce energy
+consumption.
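+As an illustration, a first-order model in the same spirit (though not
+necessarily the exact formulation of~\cite{3}) assumes that the supply voltage
+scales roughly linearly with the frequency and that the execution time of a
+compute-bound task grows linearly when its frequency is divided by a scaling
+factor $S = F_{\max} / F_{new}$. From the execution time $T_{\max}$, the dynamic
+power $P_{dyn}$ and the static power $P_{static}$ measured at the highest gear,
+the predicted execution time and energy consumption for the gear corresponding
+to $S$ are then
+\[
+  T(S) \approx T_{\max} \cdot S, \qquad
+  E(S) \approx P_{dyn} \cdot S^{-2} \cdot T_{\max} + P_{static} \cdot S \cdot T_{\max} .
+\]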
+To maintain the performance of the parallel program, they set the processor with
+the biggest load to the highest gear and then compute the scaling factor values
+for the rest of the processors. Although this model was built for parallel
+architectures, it can be adapted to distributed architectures by taking into
+account the communication times.
+The primary contribution of this paper is to present a new online scaling factor
+selection method which has the following characteristics: