-\begin{center}
- \Large
- \title*\textbf{Optimal Dynamic Frequency Scaling for Energy-Performance of Parallel MPI Programs}
-\end{center}
-\parskip 0pt
-\linespread{1.18}
-\normalsize
-\makeatletter
-\renewcommand*{\@seccntformat}[1]{\csname the#1\endcsname\hspace{0.01cm}}
-\makeatother
-\sectionfont{\large}
-
-\section{.~Introduction }
-
-The need for computing power is still increasing and it is not expected to slow
-down in the coming years. To satisfy this demand, researchers and supercomputers
-constructors have been regularly increasing the number of computing cores in
-supercomputers (for example in November 2013, according to the top 500
-list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more than 3
-millions of cores and delivers more than 33 Tflop/s while consuming 17808
-kW). This large increase in number of computing cores has led to large energy
-consumption by these architectures. Moreover, the price of energy is expected to
-continue its ascent according to the demand. For all these reasons energy
-reduction became an important topic in the high performance computing field. To
-tackle this problem, many researchers used DVFS (Dynamic Voltage Frequency
-Scaling) operations which reduce dynamically the frequency and voltage of cores
-and thus their energy consumption. However, this operation also degrades the
-performance of computation. Therefore researchers try to reduce the frequency to
-minimum when processors are idle (waiting for data from other processors or
-communicating with other processors). Moreover, depending on their objectives
-they use heuristics to find the best scaling factor during the computation. If
-they aim for performance they choose the best scaling factor that reduces the
-consumed energy while affecting as little as possible the performance. On the
-other hand, if they aim for energy reduction, the chosen scaling factor must
-produce the most energy efficient execution without considering the degradation
-of the performance. It is important to notice that lowering the frequency to
-minimum value does not always give the most efficient execution due to energy
-leakage. The best scaling factor might be chosen during execution (online) or
-during a pre-execution phase. In this paper we emphasize to develop an
-algorithm that selects the optimal frequency scaling factor that takes into
-consideration simultaneously the energy consumption and the performance. The
-main objective of HPC systems is to run the application with less execution
-time. Therefore, our algorithm selects the optimal scaling factor online with
-very small footprint. The proposed algorithm takes into account the
-communication times of the MPI programs to choose the scaling factor. This
-algorithm has ability to predict both energy consumption and execution time over
-all available scaling factors. The prediction achieved depends on some
-computing time information, gathered at the beginning of the runtime. We apply
-this algorithm to seven MPI benchmarks. These MPI programs are the NAS parallel
-penchmarks (NPB v3.3) developed by NASA~\cite{44}. Our experiments are executed
-using the simulator Simgrid/SMPI v3.10~\cite{45} over an homogeneous distributed
-memory architecture. Furthermore, we compare the proposed algorithm with
-Rauber's methods. The comparison's results show that our algorithm gives better
-energy-time trade off.
-\sectionfont{\large}
-
-\section{.~Related Works }
-
-In the this section some heuristics, to compute the scaling factor, are
-presented and classified in two parts : offline and online methods.
- \sectionfont{\large}
-
-\subsection{~The offline DVFS orientations}
-
-The DVFS offline methods are static and are not executed during the runtime of
-the program. Some approaches used heuristics to select the best DVFS state
-during the compilation phases as an example in Azevedo et al.~\cite{40}. He used
-intra-task algorithm to choose the DVFS setting when there are dependency points
-between tasks. While in~\cite{29}, Xie et al. used breadth-first search
-algorithm to do that. Their goal is saving energy with time limits. Another
-approaches gathers and stores the runtime information for each DVFS state , then
-used their methods offline to select the suitable DVFS that optimize energy-time
-trade offs. As an example~\cite{8}, Rountree et al. used liner programming
-algorithm, while in~\cite{38,34}, Cochran et al. used multi logistic regression
-algorithm for the same goal. The offline study that shown the DVFS impact on the
-communication time of the MPI program is~\cite{17}, Freeh et al. show that these
-times not changed when the frequency is scaled down.
-\sectionfont{\large}
-
-\subsection{~The online DVFS orientations}
-
-The objective of these works is to dynamically compute and set the frequency of
-the CPU during the runtime of the program for saving energy. Estimating and
-predicting approaches for the energy-time trade offs developed by
-~\cite{11,2,31}. These works select the best DVFS setting depending on the slack
-times. These times happen when the processors have to wait for data from other
-processors to compute their task. For example, during the synchronous
-communication time that take place in the MPI programs, the processors are
-idle. The optimal DVFS can be selected using the learning methods. Therefore, in
-~\cite{39,19} used machine learning to converge to the suitable DVFS
-configuration. Their learning algorithms have big time to converge when the
-number of available frequencies is high. Also, the communication time of the MPI
-program used online for saving energy as in~\cite{1}, Lim et al. developed an
-algorithm that detects the communication sections and changes the frequency
-during these sections only. This approach changes the frequency many times
-because an iteration may contain more than one communication section. The domain
-of analytical modeling used for choosing the optimal frequency as in ~\cite{3},
-Rauber et al. developed an analytical mathematical model for determining the
-optimal frequency scaling factor for any number of concurrent tasks, without
-considering communication times. They set the slowest task to maximum frequency
-for maintaining performance. In this paper we compare our algorithm with
-Rauber's model~\cite{3}, because his model can be used for any number of
-concurrent tasks for homogeneous platform and this is the same direction of this
-paper. However, the primary contributions of this paper are:
-\begin{enumerate}
-\item Selecting the optimal frequency scaling factor for energy and performance
- simultaneously. While taking into account the communication time.
-\item Adapting our scale factor to taking into account the imbalanced tasks.
-\item The execution time of our algorithm is very small when compared to other
- methods (e.g.,~\cite{19}).
-\item The proposed algorithm works online without profiling or training as
- in~\cite{38,34}.
-\end{enumerate}
-\sectionfont{\large}
-
-\section{.~Parallel Tasks Execution on Homogeneous Platform}
-
-A homogeneous cluster consists of identical nodes in terms of the hardware and
-the software. Each node has its own memory and at least one processor which can
-be a multi-core. The nodes are connected via a high bandwidth network. Tasks
-executed on this model can be either synchronous or asynchronous. In this paper
-we consider execution of the synchronous tasks on distributed homogeneous
-platform. These tasks can exchange the data via synchronous memory passing.
-\begin{figure}[h]
- \centering
- \subfloat[Synch. Imbalanced Communications]{\includegraphics[scale=0.67]{synch_tasks}\label{fig:h1}}
- \subfloat[Synch. Imbalanced Computations]{\includegraphics[scale=0.67]{compt}\label{fig:h2}}
- \caption{Parallel Tasks on Homogeneous Platform}
- \label{fig:homo}
-\end{figure}
-Therefore, the execution time of a task consists of the computation time and the
-communication time. Moreover, the synchronous communications between tasks can
-lead to idle time while tasks wait at the synchronous point for others tasks to
-finish their communications see figure~(\ref{fig:h1}). Another source for idle
-times is the imbalanced computations. This happen when processing different
-amounts of data on each processor as an example see figure~(\ref{fig:h2}). In
-this case the fastest tasks have to wait at the synchronous barrier for the
-slowest tasks to finish their job. In both two cases the overall execution time
-of the program is the execution time of the slowest task as :
-\begin{equation} \label{eq:T1}
- Program Time=MAX_{i=1,2,..,N} (T_i) \hfill
-\end{equation}
-where $T_i$ is the execution time of process $i$.
-\sectionfont{\large}
-
-\section{.~Energy Model for Homogeneous Platform}
-
-The energy consumption by the processor consists of two powers metric: the
-dynamic and the static power. This general power formulation is used by many
-researchers see ~\cite{9,3,15,26}. The dynamic power of the CMOS processors
-$P_{dyn}$ is related to the switching activity $\alpha$, load capacitance $C_L$,
-the supply voltage $V$ and operational frequency $f$ respectively as follow :
-\begin{equation} \label{eq:pd}
- \displaystyle P_{dyn} = \alpha . C_L . V^2 . f
+
+\title{Dynamic Frequency Scaling for Energy Consumption
+ Reduction in Synchronous Distributed Applications}
+
+\author{%
+ \IEEEauthorblockN{%
+ Jean-Claude Charr,
+ Raphaël Couturier,
+ Ahmed Fanfakh and
+ Arnaud Giersch
+ }
+ \IEEEauthorblockA{%
+ FEMTO-ST Institute\\
+ University of Franche-Comté\\
+ IUT de Belfort-Montbéliard,
+ 19 avenue du Maréchal Juin, BP 527, 90016 Belfort cedex, France\\
+ % Telephone: \mbox{+33 3 84 58 77 86}, % Raphaël
+ % Fax: \mbox{+33 3 84 58 77 81}\\ % Dept Info
+ Email: \email{{jean-claude.charr,raphael.couturier,ahmed.fanfakh_badri_muslim,arnaud.giersch}@univ-fcomte.fr}
+ }
+ }
+
+\maketitle
+
+\begin{abstract}
+ Dynamic Voltage Frequency Scaling (DVFS) can be applied to modern CPUs. This
+ technique is usually used to reduce the energy consumed by a CPU while
+ computing. Thus, decreasing the frequency
+ reduces the power consumed by the CPU. However, it can also significantly
+ affect the performance of the executed program if it is compute bound and if a
+ low CPU frequency is selected. Therefore, the chosen scaling factor must
+ give the best possible trade-off between energy reduction and performance.
+
+ In this paper we present an algorithm that predicts the energy consumed with
+ each frequency gear and selects the one that gives the best ratio between
+ energy consumption reduction and performance. This algorithm works online
+ without training or profiling and has a very small overhead. It also takes
+ into account synchronous communications between the nodes that are executing
+ the distributed algorithm. The algorithm has been evaluated over the SimGrid
+ simulator while being applied to the NAS parallel benchmark programs. The
+ results of the experiments show that it outperforms other existing scaling
+ factor selection algorithms.
+\end{abstract}
+
+\section{Introduction}
+\label{sec.intro}
+
+The need and demand for more computing power have been increasing since the
+birth of the first computing unit and it is not expected to slow down in the
+coming years. To satisfy this demand, researchers and supercomputers
+constructors have been regularly increasing the number of computing cores and
+processors in supercomputers (for example in November 2013, according to the
+TOP500 list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more
+than 3 million of cores and delivers more than \np[Tflop/s]{33} while consuming
+\np[kW]{17808}). This large increase in number of computing cores has led to
+large energy consumption by these architectures. Moreover, the price of energy
+is expected to continue its ascent according to the demand. For all these
+reasons energy reduction has become an important topic in the high performance
+computing field. To tackle this problem, many researchers use DVFS (Dynamic
+Voltage Frequency Scaling) operations which reduce dynamically the frequency and
+voltage of cores and thus their energy consumption. Indeed, modern CPUs offer a
+set of acceptable frequencies which are usually called gears, and the user or
+the operating system can modify the frequency of the processor according to its
+needs. However, DVFS also degrades the performance of computation. Therefore
+researchers try to reduce the frequency to the minimum when processors are idle
+(waiting for data from other processors or communicating with other processors).
+Moreover, depending on their objectives, they use heuristics to find the best
+scaling factor during the computation. If they aim for performance they choose
+the best scaling factor that reduces the consumed energy while affecting as
+little as possible the performance. On the other hand, if they aim for energy
+reduction, the chosen scaling factor must produce the most energy efficient
+execution without considering the degradation of the performance. It is
+important to notice that lowering the frequency to the minimum value does not always
+give the most energy efficient execution due to energy leakage. The best
+scaling factor might be chosen during execution (online) or during a
+pre-execution phase. In this paper, we present an algorithm that selects a
+frequency scaling factor that simultaneously takes into consideration the energy
+consumption by the CPU and the performance of the application. The main
+objective of HPC systems is to execute as fast as possible the application.
+Therefore, our algorithm selects the scaling factor online with very small
+overhead. The proposed algorithm takes into account the communication times of
+the MPI program to choose the scaling factor. This algorithm has the ability to
+predict both energy consumption and execution time over all available scaling
+factors. The prediction achieved depends on some computing time information,
+gathered at the beginning of the runtime. We apply this algorithm to the NAS parallel benchmarks (NPB v3.3)~\cite{44}. Our experiments are executed using the simulator
+SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183} over an homogeneous
+distributed memory architecture. Furthermore, we compare the proposed algorithm
+with Rauber and Rünger methods~\cite{3}. The comparison's results show that our
+algorithm gives better energy-time trade-off.
+
+This paper is organized as follows: Section~\ref{sec.relwork} presents some
+related works from other authors. Section~\ref{sec.exe} presents an energy
+model for homogeneous platforms. Section~\ref{sec.mpip} describes how the
+performance of MPI programs can be predicted. Section~\ref{sec.compet} presents
+the energy-performance objective function that maximizes the reduction of energy
+consumption while minimizing the degradation of the program's performance.
+Section~\ref{sec.optim} details the proposed energy-performance algorithm.
+Section~\ref{sec.expe} verifies the accuracy of the performance prediction model
+and presents the results of the proposed algorithm. It also shows the
+comparison results between our method and other existing methods. Finally, we
+conclude in Section~\ref{sec.concl} with a summary and some future works.
+
+\section{Related works}
+\label{sec.relwork}
+
+
+In this section, some heuristics to compute the scaling factor are presented and
+classified into two categories: offline and online methods.
+
+\subsection{Offline scaling factor selection methods}
+
+The offline scaling factor selection methods are executed before the runtime of
+the program. They return static scaling factor values to the processors
+participating in the execution of the parallel program. On the one hand, the
+scaling factor values could be computed based on information retrieved by
+analyzing the code of the program and the computing system that will execute it.
+In~\cite{40}, Azevedo et al. detect during compilation the dependency points
+between tasks in a multi-task program. This information is then used to lower
+the frequency of some processors in order to eliminate slack times. A slack
+time is the period of time during which a processor that has already finished
+its computation, has to wait for a set of processors to finish their
+computations and send their results to the waiting processor in order to
+continue its task that is dependent on the results of computations being
+executed on other processors. Freeh et al. showed in~\cite{17} that the
+communication times of MPI programs do not change when the frequency is scaled
+down. On the other hand, some offline scaling factor selection methods use the
+information gathered from previous full or partial executions of the program. The whole program or, a
+part of it, is usually executed over all the available frequency
+gears and the execution time and the energy consumed with each frequency
+gear are measured. Then a heuristic or an exact method uses the retrieved
+information to compute the values of the scaling factor for the processors.
+In~\cite{8} , Rountree et al. use a linear programming algorithm, while in~\cite{34}, Cochran et
+al. use a multi-logistic regression algorithm for the same goal. The main
+drawback of these methods is that they all require executing the
+whole program or, a part of it, on all frequency gears for each new instance of the same program.
+
+\subsection{Online scaling factor selection methods}
+
+The online scaling factor selection methods are executed during the runtime of
+the program. They are usually integrated into iterative programs where the same
+block of instructions is executed many times. During the first few iterations,
+a lot of information is measured such as the execution time, the energy consumed
+using a multimeter, the slack times, \dots{} Then a method will exploit these
+measurements to compute the scaling factor values for each processor. This
+operation, measurements and computing new scaling factor, can be repeated as
+much as needed if the iterations are not regular. Peraza, Yu-Liang et
+al.~\cite{2,31} used varied heuristics to select the appropriate scaling
+factor values to eliminate the slack times during runtime. However, as seen
+in~\cite{19}, machine learning method takes a lot of time to converge
+when the number of available gears is big. To reduce the impact of slack times,
+in~\cite{1}, Lim et al. developed an algorithm that detects the communication
+sections and changes the frequency during these sections only. This approach
+might change the frequency of each processor many times per iteration if an
+iteration contains more than one communication section. In~\cite{3}, Rauber and
+Rünger used an analytical model that can predict the consumed energy for every frequency gear after measuring the consumed energy. They
+maintain the performance as mush as possible by setting the highest frequency gear to the slowest task.
+
+The primary contribution of
+our paper is to present a new online scaling factor selection method which has the
+ following characteristics:\\
+1) It is based on Rauber and Rünger analytical model to predict the energy
+ consumption of the application with different frequency gears.
+2) It selects the frequency scaling factor for simultaneously optimizing
+ energy reduction and maintaining performance.
+3) It is well adapted to distributed architectures because it takes into
+ account the communication time.
+4) It is well adapted to distributed applications with imbalanced tasks.
+5) It has a very small overhead when compared to other methods
+ (e.g.,~\cite{19}) and does not require profiling or training as
+ in~\cite{34}.
+
+
+% \JC{The whole subsection ``Parallel Tasks Execution on Homogeneous Platform'',
+% can be deleted if we need space, we can just say we are interested in this
+% paper in homogeneous clusters}
+
+
+\section{Energy model for homogeneous platform}
+\label{sec.exe}
+Many researchers~\cite{9,3,15,26} divide the power consumed by a processor into
+two power metrics: the static and the dynamic power. While the first one is
+consumed as long as the computing unit is on, the latter is only consumed during
+computation times. The dynamic power $P_{dyn}$ is related to the switching
+activity $\alpha$, load capacitance $C_L$, the supply voltage $V$ and
+operational frequency $f$, as shown in EQ~\eqref{eq:pd}.
+\begin{equation}
+ \label{eq:pd}
+ P_\textit{dyn} = \alpha \cdot C_L \cdot V^2 \cdot f