-\begin{center}
-\Large
-\title*\textbf{Optimal Dynamic Frequency Scaling for Energy-Performance of Parallel MPI Programs}
- \end{center}
-\parskip 0pt
-\linespread{1.18}
-\normalsize
-\makeatletter
-\renewcommand*{\@seccntformat}[1]{\csname the#1\endcsname\hspace{0.01cm}}
-\makeatother
-\sectionfont{\large}
-\section{.~Introduction }
-The need for computing power is still increasing and it is not expected to slow down in the coming years. To satisfy this demand, researchers and supercomputers constructors have been regularly increasing the number of computing cores in supercomputers (for example in November 2013, according to the top 500 list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more than 3 millions of cores and delivers more than 33 Tflop/s while consuming 17808 kW). This large increase in number of computing cores has led to large energy consumption by these architectures. Moreover, the price of energy is expected to continue its ascent according to the demand. For all these reasons energy reduction became an important topic in the high performance computing field. To tackle this problem, many researchers used DVFS (Dynamic Voltage Frequency Scaling) operations which reduce dynamically the frequency and voltage of cores and thus their energy consumption. However, this operation also degrades the performance of computation. Therefore researchers try to reduce the frequency to minimum when processors are idle (waiting for data from other processors or communicating with other processors). Moreover, depending on their objectives they use heuristics to find the best scaling factor during the computation. If they aim for performance they choose the best scaling factor that reduces the consumed energy while affecting as little as possible the performance. On the other hand, if they aim for energy reduction, the chosen scaling factor must produce the most energy efficient execution without considering the degradation of the performance. It is important to notice that lowering the frequency to minimum value does not always give the most efficient execution due to energy leakage. The best scaling factor might be chosen during execution (online) or during a pre-execution phase.
-In this paper we emphasize to develop an algorithm that selects the optimal frequency scaling factor that takes into consideration simultaneously the energy consumption and the performance. The main objective of HPC systems is to run the application with less execution time. Therefore, our algorithm selects the optimal scaling factor online with very small footprint. The proposed algorithm takes into account the communication times of the MPI programs to choose the scaling factor. This algorithm has ability to predict both energy consumption and execution time over all available scaling factors. The prediction achieved depends on some computing time information, gathered at the beginning of the runtime.
-We apply this algorithm to seven MPI benchmarks. These MPI programs are the NAS parallel penchmarks (NPB v3.3) developed by NASA~\cite{44}. Our experiments are executed using the simulator Simgrid/SMPI v3.10~\cite{45} over an homogeneous distributed memory architecture. Furthermore, we compare the proposed algorithm with Rauber's methods. The comparison's results show that our algorithm gives better energy-time trade off.
-\sectionfont{\large}
-\section{.~Related Works }
-In the this section some heuristics, to compute the scaling factor, are presented and classified in two parts : offline and online methods.
- \sectionfont{\large}
-\subsection{~The offline DVFS orientations}
-The DVFS offline methods are static and are not executed during the runtime of the program. Some approaches used heuristics to select the best DVFS state during the compilation phases as an example in Azevedo et al.~\cite{40}. He used intra-task algorithm to choose the DVFS setting when there are dependency points between tasks. While in~\cite{29}, Xie et al. used breadth-first search algorithm to do that. Their goal is saving energy with time limits. Another approaches gathers and stores the runtime information for each DVFS state , then used their methods offline to select the suitable DVFS that optimize energy-time trade offs. As an example~\cite{8}, Rountree et al. used liner programming algorithm, while in~\cite{38,34}, Cochran et al. used multi logistic regression algorithm for the same goal. The offline study that shown the DVFS impact on the communication time of the MPI program is~\cite{17}, Freeh et al. show that these times not changed when the frequency is scaled down.
-\sectionfont{\large}
-\subsection{~The online DVFS orientations}
-The objective of these works is to dynamically compute and set the frequency of the CPU during the runtime of the program for saving energy. Estimating and predicting approaches for the energy-time trade offs developed by ~\cite{11,2,31}. These works select the best DVFS setting depending on the slack times. These times happen when the processors have to wait for data from other processors to compute their task. For example, during the synchronous communication time that take place in the MPI programs, the processors are idle. The optimal DVFS can be selected using the learning methods. Therefore, in ~\cite{39,19} used machine learning to converge to the suitable DVFS configuration. Their learning algorithms have big time to converge when the number of available frequencies is high. Also, the communication time of the MPI program used online for saving energy as in~\cite{1}, Lim et al. developed an algorithm that detects the communication sections and changes the frequency during these sections only. This approach changes the frequency many times because an iteration may contain more than one communication section. The domain of analytical modeling used for choosing the optimal frequency as in ~\cite{3}, Rauber et al. developed an analytical mathematical model for determining the optimal frequency scaling factor for any number of concurrent tasks, without considering communication times. They set the slowest task to maximum frequency for maintaining performance.
-In this paper we compare our algorithm with Rauber's model~\cite{3}, because his model can be used for any number of concurrent tasks for homogeneous platform and this is the same direction of this paper.
-However, the primary contributions of this paper are:
-\\1-Selecting the optimal frequency scaling factor for energy and performance
- simultaneously. While taking into account the communication time.
-\\2-Adapting our scale factor to taking into account the imbalanced tasks.
-\\3-The execution time of our algorithm is very small when compared to other methods (e.g.,~\cite{19}).
-\\4-The proposed algorithm works online without profiling or training as in~\cite{38,34}.
-\sectionfont{\large}
-\section{.~Parallel Tasks Execution on Homogeneous Platform}
-A homogeneous cluster consists of identical nodes in terms of the hardware and the software. Each node has its own memory and at least one processor which can be a multi-core. The nodes are connected via a high bandwidth network. Tasks executed on this model can be either synchronous or asynchronous. In this paper we consider execution of the synchronous tasks on distributed homogeneous platform. These tasks can exchange the data via synchronous memory passing.
-\begin{figure}[h]
-\centering
-\subfloat[Synch. Imbalanced Communications]{\includegraphics[scale=0.67]{synch_tasks}\label{fig:h1}}
-\subfloat[Synch. Imbalanced Computations]{\includegraphics[scale=0.67]{compt}\label{fig:h2}}
- \caption{Parallel Tasks on Homogeneous Platform}
- \label{fig:homo}
-\end{figure}
-Therefore, the execution time of a task consists of the computation time and the communication time. Moreover, the synchronous communications between tasks can lead to idle time while tasks wait at the synchronous point for others tasks to finish their communications see figure~(\ref{fig:h1}).
-Another source for idle times is the imbalanced computations. This happen when processing different amounts of data on each processor as an example see figure~(\ref{fig:h2}). In this case the fastest tasks have to wait at the synchronous barrier for the slowest tasks to finish their job. In both two cases the overall execution time of the program is the execution time of the slowest task as :
-\begin{equation} \label{eq:T1}
- Program Time=MAX_{i=1,2,..,N} (T_i) \hfill
+
+\title{Dynamic Frequency Scaling for Energy Consumption Reduction in Distributed MPI Programs}
+
+\author{%
+ \IEEEauthorblockN{%
+ Jean-Claude Charr,
+ Raphaël Couturier,
+ Ahmed Fanfakh and
+ Arnaud Giersch
+ }
+ \IEEEauthorblockA{%
+ FEMTO-ST Institute\\
+ University of Franche-Comté\\
+ IUT de Belfort-Montb\'{e}liard, Rue Engel Gros, BP 27, 90016 Belfort, France\\
+ Fax : (+33)~3~84~58~77~32\\
+ Email: \{jean-claude.charr, raphael.couturier, ahmed.fanfakh\_badri\_muslim, arnaud.giersch\}@univ-fcomte.fr
+ }
+ }
+
+\maketitle
+
+\begin{abstract}
+ Dynamic Voltage Frequency Scaling (DVFS) can be applied to modern CPUs.
+This technique is usually used to reduce the energy consumed by a CPU while
+computing . Indeed, power consumption by a processor at a given instant is
+exponentially related to its frequency. Thus, decreasing the frequency reduces
+the power consumed by the CPU. However, it can also significantly affect the
+performance of the executed program if it is compute bound and a low CPU
+frequency is selected. The performance degradation ratio can even be higher than
+the saved energy ratio. Therefore, the chosen scaling factor must give the best possible tradeoff
+between energy reduction and performance.
+
+In this paper we present an algorithm
+that predicts the energy consumed with each frequency gear and selects the one that
+gives the best ratio between energy consumption reduction and performance.
+This algorithm works online without training or profiling and
+has a very small overhead. It also takes into account synchronous communications between the nodes
+that are executing the distributed algorithm. The algorithm has been evaluated over the SimGrid simulator
+while being applied to the NAS parallel benchmark programs. The results of the experiments show that it outperforms other existing scaling factor selection algorithms.
+\end{abstract}
+
+\section{Introduction}
+\label{sec.intro}
+
+The need and demand for more computing power have been increasing since the birth of the first computing unit and it is not expected to slow
+down in the coming years. To satisfy this demand, researchers and supercomputers
+constructors have been regularly increasing the number of computing cores and processors in
+supercomputers (for example in November 2013, according to the TOP500
+list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more than 3
+millions of cores and delivers more than 33 Tflop/s while consuming 17808
+kW). This large increase in number of computing cores has led to large energy
+consumption by these architectures. Moreover, the price of energy is expected to
+continue its ascent according to the demand. For all these reasons energy
+reduction became an important topic in the high performance computing field. To
+tackle this problem, many researchers used DVFS (Dynamic Voltage Frequency
+Scaling) operations which reduce dynamically the frequency and voltage of cores
+and thus their energy consumption. Indeed, modern CPUs offer a set of acceptable frequencies which are usually called gears, and the user or the operating system can modify the frequency of the processor according to its needs. However, DVFS also degrades the
+performance of computation. Therefore researchers try to reduce the frequency to
+minimum when processors are idle (waiting for data from other processors or
+communicating with other processors). Moreover, depending on their objectives
+they use heuristics to find the best scaling factor during the computation. If
+they aim for performance they choose the best scaling factor that reduces the
+consumed energy while affecting as little as possible the performance. On the
+other hand, if they aim for energy reduction, the chosen scaling factor must
+produce the most energy efficient execution without considering the degradation
+of the performance. It is important to notice that lowering the frequency to
+minimum value does not always give the most energy efficient execution due to energy
+leakage. The best scaling factor might be chosen during execution (online) or
+during a pre-execution phase. In this paper, we present an
+algorithm that selects a frequency scaling factor that simultaneously takes into
+consideration the energy consumption by the CPU and the performance of the application. The
+main objective of HPC systems is to execute as fast as possible the application.
+Therefore, our algorithm selects the scaling factor online with
+very small footprint. The proposed algorithm takes into account the
+communication times of the MPI program to choose the scaling factor. This
+algorithm has ability to predict both energy consumption and execution time over
+all available scaling factors. The prediction achieved depends on some
+computing time information, gathered at the beginning of the runtime. We apply
+this algorithm to seven MPI benchmarks. These MPI programs are the NAS parallel
+benchmarks (NPB v3.3) developed by NASA~\cite{44}. Our experiments are executed
+using the simulator SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183}
+over an homogeneous distributed memory architecture. Furthermore, we compare the
+proposed algorithm with Rauber and Rünger methods~\cite{3}.
+The comparison's results show that our algorithm gives better energy-time tradeoff.
+
+This paper is organized as follows: Section~\ref{sec.relwork} presents related works
+from other authors. Section~\ref{sec.exe} shows the execution of parallel
+tasks and sources of idle times. It resumes the energy
+model of homogeneous platform. Section~\ref{sec.mpip} evaluates the performance
+of MPI program. Section~\ref{sec.compet} presents the energy-performance tradeoffs
+objective function. Section~\ref{sec.optim} demonstrates the proposed energy-performance algorithm. Section~\ref{sec.expe} verifies the performance prediction
+model and presents the results of the proposed algorithm. Also, It shows the comparison results. Finally,
+we conclude in Section~\ref{sec.concl}.
+\section{Related works}
+\label{sec.relwork}
+
+\AG{Consider introducing the models sec.~\ref{sec.exe} maybe before related works}
+
+In this section, some heuristics to compute the scaling factor are
+presented and classified into two categories: offline and online methods.
+
+\subsection{Offline scaling factor selection methods}
+
+The offline scaling factor selection methods are executed before the runtime of
+the program. They return static scaling factor values to the processors
+participating in the execution of the parallel program. On one hand, the scaling
+factor
+values could be computed based on information retrieved by analyzing the code of
+the program and the computing system that will execute it. In ~\cite{40},
+Azevedo et
+al. detect during compilation the dependency points between
+tasks in a parallel program. This information is then used to lower the frequency of
+some processors in order to eliminate slack times. A slack time is the period of time during which a processor that have already finished its computation, have to wait
+for a set of processors to finish their computations and send their results to the
+waiting processor in order to continue its task that is
+dependent on the results of computations being executed on other processors.
+Freeh et al. showed in ~\cite{17} that the
+communication times of MPI programs do not change when the frequency is scaled down.
+On the other hand, some offline scaling factor selection methods use the
+information gathered from previous full or
+partial executions of the program. A part or the whole program is usually executed over all the available frequency gears and the the execution time and the energy consumed with each frequency gear are measured. Then an heuristic or an exact method uses the retrieved information to compute the values of the scaling factor for the processors.
+ In~\cite{29}, Xie et al. use an exact exponential breadth-first search algorithm to compute the scaling factor values that give the optimal energy reduction while respecting a deadline for a sequential program. They also present a linear heuristic that approximates the optimal solution. In~\cite{8} , Rountree et al. use a linear programming
+algorithm, while in~\cite{38,34}, Cochran et al. use multi logistic regression algorithm for the same goal.
+The main drawback for these methods is that they all require executing a part or the whole program on all frequency gears for each new instance of the same program.
+
+\subsection{Online scaling factor selection methods}
+The online scaling factor selection methods are executed during the runtime of the program. They are usually integrated into iterative programs where the same block of instructions is executed many times. During the first few iterations, many informations are measured such as the execution time, the energy consumed using a multimeter, the slack times, ... Then a method will exploit these measurements to compute the scaling factor values for each processor. This operation, measurements and computing new scaling factor, can be repeated as much as needed if the iterations are not regular. Kimura, Peraza, Yu-Liang et al. ~\cite{11,2,31} used learning methods to select the appropriate scaling factor values to eliminate the slack times during runtime. However, as seen in ~\cite{39,19}, machine learning methods can take a lot of time to converge when the number of available gears is big. To reduce the impact of slack times, in~\cite{1}, Lim et al. developed an algorithm that detects the
+communication sections and changes the frequency during these sections
+only. This approach might change the frequency of each processor many times per iteration if an iteration
+contains more than one communication section. In ~\cite{3}, Rauber and Rünger used an analytical model that after measuring the energy consumed and the execution time with the highest frequency gear, it can predict the energy consumed and the execution time for every frequency gear . These predictions may be used to choose the optimal gear for each processor executing the parallel program to reduce energy consumption.
+To maintain the performance of the parallel program , they
+set the processor with the biggest load to the highest gear and then compute the scaling factor values for the rest of the processors. Although this model was built for parallel architectures, it can be adapted to distributed architectures by taking into account the communications.
+The primary contribution of this paper is presenting a new online scaling factor selection method which has the following characteristics :
+\begin{enumerate}
+\item Based on Rauber's analytical model to predict the energy consumption and the execution time of the application with different frequency gears.
+\item Selects the frequency scaling factor for simultaneously optimizing energy reduction and maintaining performance.
+\item Well adapted to distributed architectures because it takes into account the communication time.
+\item Well adapted to distributed applications with imbalanced tasks.
+\item Has very small footprint when compared to other
+ methods (e.g.,~\cite{19}) and does not require profiling or training as
+ in~\cite{38,34}.
+\end{enumerate}
+
+
+\section{Execution and energy of parallel tasks on homogeneous platform}
+\label{sec.exe}
+%\AG{The whole subsection ``Parallel Tasks Execution on Homogeneous Platform'', can be deleted if we need space, we can just say we are interested in this paper in homogeneous clusters}
+\subsection{Parallel tasks execution on homogeneous platform}
+A homogeneous cluster consists of identical nodes in terms of hardware and software.
+Each node has its own memory and at least one processor which can
+be a multi-core. The nodes are connected via a high bandwidth network. Tasks
+executed on this model can be either synchronous or asynchronous. In this paper
+we consider execution of the synchronous tasks on distributed homogeneous
+platform. These tasks can exchange the data via synchronous message passing.
+\begin{figure*}[t]
+ \centering
+ \subfloat[Sync. imbalanced communications]{\includegraphics[scale=0.67]{commtasks}\label{fig:h1}}
+ \subfloat[Sync. imbalanced computations]{\includegraphics[scale=0.67]{compt}\label{fig:h2}}
+ \caption{Parallel tasks on homogeneous platform}
+ \label{fig:homo}
+\end{figure*}
+Therefore, the execution time of a task consists of the computation time and the
+communication time. Moreover, the synchronous communications between tasks can
+lead to slack times while tasks wait at the synchronization barrier for other tasks to
+finish their tasks (see figure~(\ref{fig:h1})). The imbalanced communications
+happen when nodes have to send/receive different amount of data or they communicate
+with different number of nodes. Another source of idle times is the imbalanced computations.
+This happens when processing different amounts of data on each processor (see figure~(\ref{fig:h2})).
+In this case the fastest tasks have to wait at the synchronization barrier for the
+slowest ones to begin the next task. In both cases the overall execution time
+of the program is the execution time of the slowest task as in EQ~(\ref{eq:T1}).
+\begin{equation}
+ \label{eq:T1}
+ \textit{Program Time} = \max_{i=1,2,\dots,N} T_i