+\begin{abstract}
+ Dynamic Voltage Frequency Scaling (DVFS) can be applied to modern CPUs. This
+ technique is usually used to reduce the energy consumed by a CPU while
+ computing. Thus, decreasing the frequency
+ reduces the power consumed by the CPU. However, it can also significantly
+ affect the performance of the executed program if it is compute bound and if a
+ low CPU frequency is selected. Therefore, the chosen scaling factor must
+ give the best possible trade-off between energy reduction and performance.
+
+ In this paper we present an algorithm that predicts the energy consumed with
+ each frequency gear and selects the one that gives the best ratio between
+ energy consumption reduction and performance. This algorithm works online
+ without training or profiling and has a very small overhead. It also takes
+ into account synchronous communications between the nodes that are executing
+ the distributed algorithm. The algorithm has been evaluated over the SimGrid
+ simulator while being applied to the NAS parallel benchmark programs. The
+ results of the experiments show that it outperforms other existing scaling
+ factor selection algorithms.
+\end{abstract}
+
+\section{Introduction}
+\label{sec.intro}
+
+The need and demand for more computing power have been increasing since the
+birth of the first computing unit and it is not expected to slow down in the
+coming years. To satisfy this demand, researchers and supercomputers
+constructors have been regularly increasing the number of computing cores and
+processors in supercomputers (for example in November 2013, according to the
+TOP500 list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more
+than 3 million of cores and delivers more than \np[Tflop/s]{33} while consuming
+\np[kW]{17808}). This large increase in number of computing cores has led to
+large energy consumption by these architectures. Moreover, the price of energy
+is expected to continue its ascent according to the demand. For all these
+reasons energy reduction has become an important topic in the high performance
+computing field. To tackle this problem, many researchers use DVFS (Dynamic
+Voltage Frequency Scaling) operations which reduce dynamically the frequency and
+voltage of cores and thus their energy consumption. Indeed, modern CPUs offer a
+set of acceptable frequencies which are usually called gears, and the user or
+the operating system can modify the frequency of the processor according to its
+needs. However, DVFS also degrades the performance of computation. Therefore
+researchers try to reduce the frequency to the minimum when processors are idle
+(waiting for data from other processors or communicating with other processors).
+Moreover, depending on their objectives, they use heuristics to find the best
+scaling factor during the computation. If they aim for performance they choose
+the best scaling factor that reduces the consumed energy while affecting as
+little as possible the performance. On the other hand, if they aim for energy
+reduction, the chosen scaling factor must produce the most energy efficient
+execution without considering the degradation of the performance. It is
+important to notice that lowering the frequency to the minimum value does not always
+give the most energy efficient execution due to energy leakage. The best
+scaling factor might be chosen during execution (online) or during a
+pre-execution phase. In this paper, we present an algorithm that selects a
+frequency scaling factor that simultaneously takes into consideration the energy
+consumption by the CPU and the performance of the application. The main
+objective of HPC systems is to execute as fast as possible the application.
+Therefore, our algorithm selects the scaling factor online with very small
+overhead. The proposed algorithm takes into account the communication times of
+the MPI program to choose the scaling factor. This algorithm has the ability to
+predict both energy consumption and execution time over all available scaling
+factors. The prediction achieved depends on some computing time information,
+gathered at the beginning of the runtime. We apply this algorithm to the NAS parallel benchmarks (NPB v3.3)~\cite{44}. Our experiments are executed using the simulator
+SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183} over an homogeneous
+distributed memory architecture. Furthermore, we compare the proposed algorithm
+with Rauber and Rünger methods~\cite{3}. The comparison's results show that our
+algorithm gives better energy-time trade-off.
+
+This paper is organized as follows: Section~\ref{sec.relwork} presents some
+related works from other authors. Section~\ref{sec.exe} presents an energy
+model for homogeneous platforms. Section~\ref{sec.mpip} describes how the
+performance of MPI programs can be predicted. Section~\ref{sec.compet} presents
+the energy-performance objective function that maximizes the reduction of energy
+consumption while minimizing the degradation of the program's performance.
+Section~\ref{sec.optim} details the proposed energy-performance algorithm.
+Section~\ref{sec.expe} verifies the accuracy of the performance prediction model
+and presents the results of the proposed algorithm. It also shows the
+comparison results between our method and other existing methods. Finally, we
+conclude in Section~\ref{sec.concl} with a summary and some future works.
+
+\section{Related works}
+\label{sec.relwork}
+
+
+In this section, some heuristics to compute the scaling factor are presented and
+classified into two categories: offline and online methods.
+
+\subsection{Offline scaling factor selection methods}
+
+The offline scaling factor selection methods are executed before the runtime of
+the program. They return static scaling factor values to the processors
+participating in the execution of the parallel program. On the one hand, the
+scaling factor values could be computed based on information retrieved by
+analyzing the code of the program and the computing system that will execute it.
+In~\cite{40}, Azevedo et al. detect during compilation the dependency points
+between tasks in a multi-task program. This information is then used to lower
+the frequency of some processors in order to eliminate slack times. A slack
+time is the period of time during which a processor that has already finished
+its computation, has to wait for a set of processors to finish their
+computations and send their results to the waiting processor in order to
+continue its task that is dependent on the results of computations being
+executed on other processors. Freeh et al. showed in~\cite{17} that the
+communication times of MPI programs do not change when the frequency is scaled
+down. On the other hand, some offline scaling factor selection methods use the
+information gathered from previous full or partial executions of the program. The whole program or, a
+part of it, is usually executed over all the available frequency
+gears and the execution time and the energy consumed with each frequency
+gear are measured. Then a heuristic or an exact method uses the retrieved
+information to compute the values of the scaling factor for the processors.
+In~\cite{8} , Rountree et al. use a linear programming algorithm, while in~\cite{34}, Cochran et
+al. use a multi-logistic regression algorithm for the same goal. The main
+drawback of these methods is that they all require executing the
+whole program or, a part of it, on all frequency gears for each new instance of the same program.
+
+\subsection{Online scaling factor selection methods}
+
+The online scaling factor selection methods are executed during the runtime of
+the program. They are usually integrated into iterative programs where the same
+block of instructions is executed many times. During the first few iterations,
+a lot of information is measured such as the execution time, the energy consumed
+using a multimeter, the slack times, \dots{} Then a method will exploit these
+measurements to compute the scaling factor values for each processor. This
+operation, measurements and computing new scaling factor, can be repeated as
+much as needed if the iterations are not regular. Peraza, Yu-Liang et
+al.~\cite{2,31} used varied heuristics to select the appropriate scaling
+factor values to eliminate the slack times during runtime. However, as seen
+in~\cite{19}, machine learning method takes a lot of time to converge
+when the number of available gears is big. To reduce the impact of slack times,
+in~\cite{1}, Lim et al. developed an algorithm that detects the communication
+sections and changes the frequency during these sections only. This approach
+might change the frequency of each processor many times per iteration if an
+iteration contains more than one communication section. In~\cite{3}, Rauber and
+Rünger used an analytical model that can predict the consumed energy for every frequency gear after measuring the consumed energy. They
+maintain the performance as mush as possible by setting the highest frequency gear to the slowest task.
+
+The primary contribution of
+our paper is to present a new online scaling factor selection method which has the
+ following characteristics:\\
+1) It is based on Rauber and Rünger analytical model to predict the energy
+ consumption of the application with different frequency gears.
+2) It selects the frequency scaling factor for simultaneously optimizing
+ energy reduction and maintaining performance.
+3) It is well adapted to distributed architectures because it takes into
+ account the communication time.
+4) It is well adapted to distributed applications with imbalanced tasks.
+5) It has a very small overhead when compared to other methods
+ (e.g.,~\cite{19}) and does not require profiling or training as
+ in~\cite{34}.
+
+
+% \JC{The whole subsection ``Parallel Tasks Execution on Homogeneous Platform'',
+% can be deleted if we need space, we can just say we are interested in this
+% paper in homogeneous clusters}
+
+
+\section{Energy model for homogeneous platform}
+\label{sec.exe}
+Many researchers~\cite{9,3,15,26} divide the power consumed by a processor into
+two power metrics: the static and the dynamic power. While the first one is
+consumed as long as the computing unit is on, the latter is only consumed during
+computation times. The dynamic power $P_{dyn}$ is related to the switching
+activity $\alpha$, load capacitance $C_L$, the supply voltage $V$ and
+operational frequency $f$, as shown in EQ~(\ref{eq:pd}).
+\begin{equation}
+ \label{eq:pd}
+ P_\textit{dyn} = \alpha \cdot C_L \cdot V^2 \cdot f