-The need for computing power is still increasing and it is not expected to slow
-down in the coming years. To satisfy this demand, researchers and supercomputers
-constructors have been regularly increasing the number of computing cores in
-supercomputers (for example in November 2013, according to the top 500
-list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more than 3
-millions of cores and delivers more than 33 Tflop/s while consuming 17808
-kW). This large increase in number of computing cores has led to large energy
-consumption by these architectures. Moreover, the price of energy is expected to
-continue its ascent according to the demand. For all these reasons energy
-reduction became an important topic in the high performance computing field. To
-tackle this problem, many researchers used DVFS (Dynamic Voltage Frequency
-Scaling) operations which reduce dynamically the frequency and voltage of cores
-and thus their energy consumption. However, this operation also degrades the
-performance of computation. Therefore researchers try to reduce the frequency to
-minimum when processors are idle (waiting for data from other processors or
-communicating with other processors). Moreover, depending on their objectives
-they use heuristics to find the best scaling factor during the computation. If
-they aim for performance they choose the best scaling factor that reduces the
-consumed energy while affecting as little as possible the performance. On the
-other hand, if they aim for energy reduction, the chosen scaling factor must
-produce the most energy efficient execution without considering the degradation
-of the performance. It is important to notice that lowering the frequency to
-minimum value does not always give the most efficient execution due to energy
-leakage. The best scaling factor might be chosen during execution (online) or
-during a pre-execution phase. In this paper we emphasize to develop an
-algorithm that selects the optimal frequency scaling factor that takes into
-consideration simultaneously the energy consumption and the performance. The
-main objective of HPC systems is to run the application with less execution
-time. Therefore, our algorithm selects the optimal scaling factor online with
-very small footprint. The proposed algorithm takes into account the
-communication times of the MPI programs to choose the scaling factor. This
-algorithm has ability to predict both energy consumption and execution time over
-all available scaling factors. The prediction achieved depends on some
-computing time information, gathered at the beginning of the runtime. We apply
-this algorithm to seven MPI benchmarks. These MPI programs are the NAS parallel
-penchmarks (NPB v3.3) developed by NASA~\cite{44}. Our experiments are executed
-using the simulator Simgrid/SMPI v3.10~\cite{45} over an homogeneous distributed
-memory architecture. Furthermore, we compare the proposed algorithm with
-Rauber's methods. The comparison's results show that our algorithm gives better
-energy-time trade off.
-
-\section{Related Works}
-
-In the this section some heuristics, to compute the scaling factor, are
-presented and classified in two parts : offline and online methods.
-
-\subsection{The offline DVFS orientations}
-
-The DVFS offline methods are static and are not executed during the runtime of
-the program. Some approaches used heuristics to select the best DVFS state
-during the compilation phases as an example in Azevedo et al.~\cite{40}. He used
-intra-task algorithm to choose the DVFS setting when there are dependency points
-between tasks. While in~\cite{29}, Xie et al. used breadth-first search
-algorithm to do that. Their goal is saving energy with time limits. Another
-approaches gathers and stores the runtime information for each DVFS state , then
-used their methods offline to select the suitable DVFS that optimize energy-time
-trade offs. As an example~\cite{8}, Rountree et al. used liner programming
-algorithm, while in~\cite{38,34}, Cochran et al. used multi logistic regression
-algorithm for the same goal. The offline study that shown the DVFS impact on the
-communication time of the MPI program is~\cite{17}, Freeh et al. show that these
-times not changed when the frequency is scaled down.
-
-\subsection{The online DVFS orientations}
-
-The objective of these works is to dynamically compute and set the frequency of
-the CPU during the runtime of the program for saving energy. Estimating and
-predicting approaches for the energy-time trade offs developed by
-~\cite{11,2,31}. These works select the best DVFS setting depending on the slack
-times. These times happen when the processors have to wait for data from other
-processors to compute their task. For example, during the synchronous
-communication time that take place in the MPI programs, the processors are
-idle. The optimal DVFS can be selected using the learning methods. Therefore, in
-~\cite{39,19} used machine learning to converge to the suitable DVFS
-configuration. Their learning algorithms have big time to converge when the
-number of available frequencies is high. Also, the communication time of the MPI
-program used online for saving energy as in~\cite{1}, Lim et al. developed an
-algorithm that detects the communication sections and changes the frequency
-during these sections only. This approach changes the frequency many times
-because an iteration may contain more than one communication section. The domain
-of analytical modeling used for choosing the optimal frequency as in ~\cite{3},
-Rauber et al. developed an analytical mathematical model for determining the
-optimal frequency scaling factor for any number of concurrent tasks, without
-considering communication times. They set the slowest task to maximum frequency
-for maintaining performance. In this paper we compare our algorithm with
-Rauber's model~\cite{3}, because his model can be used for any number of
-concurrent tasks for homogeneous platform and this is the same direction of this
-paper. However, the primary contributions of this paper are:
+\section{Introduction}
+\label{sec.intro}
+
+The need and demand for more computing power have been increasing since the
+birth of the first computing unit and it is not expected to slow down in the
+coming years. To satisfy this demand, researchers and supercomputers
+constructors have been regularly increasing the number of computing cores and
+processors in supercomputers (for example in November 2013, according to the
+TOP500 list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more
+than 3 millions of cores and delivers more than \np[Tflop/s]{33} while consuming
+\np[kW]{17808}). This large increase in number of computing cores has led to
+large energy consumption by these architectures. Moreover, the price of energy
+is expected to continue its ascent according to the demand. For all these
+reasons energy reduction became an important topic in the high performance
+computing field. To tackle this problem, many researchers used DVFS (Dynamic
+Voltage Frequency Scaling) operations which reduce dynamically the frequency and
+voltage of cores and thus their energy consumption. Indeed, modern CPUs offer a
+set of acceptable frequencies which are usually called gears, and the user or
+the operating system can modify the frequency of the processor according to its
+needs. However, DVFS also degrades the performance of computation. Therefore
+researchers try to reduce the frequency to minimum when processors are idle
+(waiting for data from other processors or communicating with other processors).
+Moreover, depending on their objectives they use heuristics to find the best
+scaling factor during the computation. If they aim for performance they choose
+the best scaling factor that reduces the consumed energy while affecting as
+little as possible the performance. On the other hand, if they aim for energy
+reduction, the chosen scaling factor must produce the most energy efficient
+execution without considering the degradation of the performance. It is
+important to notice that lowering the frequency to minimum value does not always
+give the most energy efficient execution due to energy leakage. The best
+scaling factor might be chosen during execution (online) or during a
+pre-execution phase. In this paper, we present an algorithm that selects a
+frequency scaling factor that simultaneously takes into consideration the energy
+consumption by the CPU and the performance of the application. The main
+objective of HPC systems is to execute as fast as possible the application.
+Therefore, our algorithm selects the scaling factor online with very small
+footprint. The proposed algorithm takes into account the communication times of
+the MPI program to choose the scaling factor. This algorithm has ability to
+predict both energy consumption and execution time over all available scaling
+factors. The prediction achieved depends on some computing time information,
+gathered at the beginning of the runtime. We apply this algorithm to seven MPI
+benchmarks. These MPI programs are the NAS parallel benchmarks (NPB v3.3)
+developed by NASA~\cite{44}. Our experiments are executed using the simulator
+SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183} over an homogeneous
+distributed memory architecture. Furthermore, we compare the proposed algorithm
+with Rauber and Rünger methods~\cite{3}. The comparison's results show that our
+algorithm gives better energy-time trade-off.
+
+This paper is organized as follows: Section~\ref{sec.relwork} presents some
+related works from other authors. Section~\ref{sec.exe} explains the execution
+of parallel tasks and the sources of slack times. It also presents an energy
+model for homogeneous platforms. Section~\ref{sec.mpip} describes how the
+performance of MPI programs can be predicted. Section~\ref{sec.compet} presents
+the energy-performance objective function that maximizes the reduction of energy
+consumption while minimizing the degradation of the program's performance.
+Section~\ref{sec.optim} details the proposed energy-performance algorithm.
+Section~\ref{sec.expe} verifies the accuracy of the performance prediction model
+and presents the results of the proposed algorithm. It also shows the
+comparison results between our method and other existing methods. Finally, we
+conclude in Section~\ref{sec.concl} with a summary and some future works.
+
+\section{Related works}
+\label{sec.relwork}
+
+
+In this section, some heuristics to compute the scaling factor are presented and
+classified into two categories: offline and online methods.
+
+\subsection{Offline scaling factor selection methods}
+
+The offline scaling factor selection methods are executed before the runtime of
+the program. They return static scaling factor values to the processors
+participating in the execution of the parallel program. On one hand, the
+scaling factor values could be computed based on information retrieved by
+analyzing the code of the program and the computing system that will execute it.
+In~\cite{40}, Azevedo et al. detect during compilation the dependency points
+between tasks in a multi-task program. This information is then used to lower
+the frequency of some processors in order to eliminate slack times. A slack
+time is the period of time during which a processor that have already finished
+its computation, have to wait for a set of processors to finish their
+computations and send their results to the waiting processor in order to
+continue its task that is dependent on the results of computations being
+executed on other processors. Freeh et al. showed in~\cite{17} that the
+communication times of MPI programs do not change when the frequency is scaled
+down. On the other hand, some offline scaling factor selection methods use the
+information gathered from previous full or partial executions of the program. A
+part or the whole program is usually executed over all the available frequency
+gears and the the execution time and the energy consumed with each frequency
+gear are measured. Then an heuristic or an exact method uses the retrieved
+information to compute the values of the scaling factor for the processors.
+In~\cite{29}, Xie et al. use an exact exponential breadth-first search algorithm
+to compute the scaling factor values that give the optimal energy reduction
+while respecting a deadline for a sequential program. They also present a
+linear heuristic that approximates the optimal solution. In~\cite{8} , Rountree
+et al. use a linear programming algorithm, while in~\cite{38,34}, Cochran et
+al. use multi logistic regression algorithm for the same goal. The main
+drawback for these methods is that they all require executing a part or the
+whole program on all frequency gears for each new instance of the same program.
+
+\subsection{Online scaling factor selection methods}
+
+The online scaling factor selection methods are executed during the runtime of
+the program. They are usually integrated into iterative programs where the same
+block of instructions is executed many times. During the first few iterations,
+many informations are measured such as the execution time, the energy consumed
+using a multimeter, the slack times, \dots{} Then a method will exploit these
+measurements to compute the scaling factor values for each processor. This
+operation, measurements and computing new scaling factor, can be repeated as
+much as needed if the iterations are not regular. Kimura, Peraza, Yu-Liang et
+al.~\cite{11,2,31} used learning methods to select the appropriate scaling
+factor values to eliminate the slack times during runtime. However, as seen
+in~\cite{39,19}, machine learning methods can take a lot of time to converge
+when the number of available gears is big. To reduce the impact of slack times,
+in~\cite{1}, Lim et al. developed an algorithm that detects the communication
+sections and changes the frequency during these sections only. This approach
+might change the frequency of each processor many times per iteration if an
+iteration contains more than one communication section. In~\cite{3}, Rauber and
+Rünger used an analytical model that can predict the consumed energy and the
+execution time for every frequency gear after measuring the consumed energy and
+the execution time with the highest frequency gear. These predictions may be
+used to choose the optimal gear for each processor executing the parallel
+program to reduce energy consumption. To maintain the performance of the
+parallel program , they set the processor with the biggest load to the highest
+gear and then compute the scaling factor values for the rest of the processors.
+Although this model was built for parallel architectures, it can be adapted to
+distributed architectures by taking into account the communications. The
+primary contribution of our paper is presenting a new online scaling factor
+selection method which has the following characteristics: