X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/mpi-energy.git/blobdiff_plain/b039491f5ad738be1d313f6203bec5d1d39a84e8..e10a2fae800c41166633eabe0cff9e7815befda7:/paper.tex?ds=inline diff --git a/paper.tex b/paper.tex index e6d6745..e62908a 100644 --- a/paper.tex +++ b/paper.tex @@ -1,705 +1,709 @@ -\documentclass[12pt]{article} -%\documentclass[12pt,twocolumn]{article} -\DeclareMathSizes{40}{4000}{200}{2000} +\documentclass[conference]{IEEEtran} + \usepackage[T1]{fontenc} +\usepackage[utf8]{inputenc} \usepackage[english]{babel} -\usepackage{algorithm,algorithmicx,algpseudocode} -\usepackage{graphicx,graphics} +\usepackage{algpseudocode} +\usepackage{graphicx} \usepackage{subfig} -\usepackage{listings} -\usepackage{colortbl} -\usepackage{sectsty} -\usepackage{titlesec} -\usepackage{secdot} -%\usepackage[font={footnotesize,bt}]{caption} -%\usepackage[font=scriptsize,labelfont=bf]{caption} +\usepackage{amsmath} + +\usepackage{url} +\DeclareUrlCommand\email{\urlstyle{same}} + +\usepackage[autolanguage,np]{numprint} +\AtBeginDocument{% + \renewcommand*\npunitcommand[1]{\text{#1}} + \npthousandthpartsep{}} + +\usepackage{xspace} +\usepackage[textsize=footnotesize]{todonotes} +\newcommand{\AG}[2][inline]{% + \todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace} +\newcommand{\JC}[2][inline]{% + \todo[color=red!10,#1]{\sffamily\textbf{JC:} #2}\xspace} + +\newcommand{\Xsub}[2]{\ensuremath{#1_\textit{#2}}} + +\newcommand{\Dist}{\textit{Dist}} +\newcommand{\Eind}{\Xsub{E}{ind}} +\newcommand{\Enorm}{\Xsub{E}{Norm}} +\newcommand{\Eoriginal}{\Xsub{E}{Original}} +\newcommand{\Ereduced}{\Xsub{E}{Reduced}} +\newcommand{\Fdiff}{\Xsub{F}{diff}} +\newcommand{\Fmax}{\Xsub{F}{max}} +\newcommand{\Fnew}{\Xsub{F}{new}} +\newcommand{\Ileak}{\Xsub{I}{leak}} +\newcommand{\Kdesign}{\Xsub{K}{design}} +\newcommand{\MaxDist}{\textit{Max Dist}} +\newcommand{\Ntrans}{\Xsub{N}{trans}} +\newcommand{\Pdyn}{\Xsub{P}{dyn}} +\newcommand{\PnormInv}{\Xsub{P}{NormInv}} +\newcommand{\Pnorm}{\Xsub{P}{Norm}} +\newcommand{\Tnorm}{\Xsub{T}{Norm}} +\newcommand{\Pstates}{\Xsub{P}{states}} +\newcommand{\Pstatic}{\Xsub{P}{static}} +\newcommand{\Sopt}{\Xsub{S}{opt}} +\newcommand{\Tcomp}{\Xsub{T}{comp}} +\newcommand{\TmaxCommOld}{\Xsub{T}{Max Comm Old}} +\newcommand{\TmaxCompOld}{\Xsub{T}{Max Comp Old}} +\newcommand{\Tmax}{\Xsub{T}{max}} +\newcommand{\Tnew}{\Xsub{T}{New}} +\newcommand{\Told}{\Xsub{T}{Old}} \begin{document} -\begin{center} - \Large - \title*\textbf{Optimal Dynamic Frequency Scaling for Energy-Performance of Parallel MPI Programs} -\end{center} -\parskip 0pt -\linespread{1.18} -\normalsize -\makeatletter -\renewcommand*{\@seccntformat}[1]{\csname the#1\endcsname\hspace{0.01cm}} -\makeatother -\sectionfont{\large} - -\section{.~Introduction } - -The need for computing power is still increasing and it is not expected to slow -down in the coming years. To satisfy this demand, researchers and supercomputers -constructors have been regularly increasing the number of computing cores in -supercomputers (for example in November 2013, according to the top 500 -list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more than 3 -millions of cores and delivers more than 33 Tflop/s while consuming 17808 -kW). This large increase in number of computing cores has led to large energy -consumption by these architectures. Moreover, the price of energy is expected to -continue its ascent according to the demand. For all these reasons energy -reduction became an important topic in the high performance computing field. To -tackle this problem, many researchers used DVFS (Dynamic Voltage Frequency -Scaling) operations which reduce dynamically the frequency and voltage of cores -and thus their energy consumption. However, this operation also degrades the -performance of computation. Therefore researchers try to reduce the frequency to -minimum when processors are idle (waiting for data from other processors or -communicating with other processors). Moreover, depending on their objectives -they use heuristics to find the best scaling factor during the computation. If -they aim for performance they choose the best scaling factor that reduces the -consumed energy while affecting as little as possible the performance. On the -other hand, if they aim for energy reduction, the chosen scaling factor must -produce the most energy efficient execution without considering the degradation -of the performance. It is important to notice that lowering the frequency to -minimum value does not always give the most efficient execution due to energy -leakage. The best scaling factor might be chosen during execution (online) or -during a pre-execution phase. In this paper we emphasize to develop an -algorithm that selects the optimal frequency scaling factor that takes into -consideration simultaneously the energy consumption and the performance. The -main objective of HPC systems is to run the application with less execution -time. Therefore, our algorithm selects the optimal scaling factor online with -very small footprint. The proposed algorithm takes into account the -communication times of the MPI programs to choose the scaling factor. This -algorithm has ability to predict both energy consumption and execution time over -all available scaling factors. The prediction achieved depends on some -computing time information, gathered at the beginning of the runtime. We apply -this algorithm to seven MPI benchmarks. These MPI programs are the NAS parallel -penchmarks (NPB v3.3) developed by NASA~\cite{44}. Our experiments are executed -using the simulator Simgrid/SMPI v3.10~\cite{45} over an homogeneous distributed -memory architecture. Furthermore, we compare the proposed algorithm with -Rauber's methods. The comparison's results show that our algorithm gives better -energy-time trade off. -\sectionfont{\large} - -\section{.~Related Works } - -In the this section some heuristics, to compute the scaling factor, are -presented and classified in two parts : offline and online methods. - \sectionfont{\large} - -\subsection{~The offline DVFS orientations} - -The DVFS offline methods are static and are not executed during the runtime of -the program. Some approaches used heuristics to select the best DVFS state -during the compilation phases as an example in Azevedo et al.~\cite{40}. He used -intra-task algorithm to choose the DVFS setting when there are dependency points -between tasks. While in~\cite{29}, Xie et al. used breadth-first search -algorithm to do that. Their goal is saving energy with time limits. Another -approaches gathers and stores the runtime information for each DVFS state , then -used their methods offline to select the suitable DVFS that optimize energy-time -trade offs. As an example~\cite{8}, Rountree et al. used liner programming -algorithm, while in~\cite{38,34}, Cochran et al. used multi logistic regression -algorithm for the same goal. The offline study that shown the DVFS impact on the -communication time of the MPI program is~\cite{17}, Freeh et al. show that these -times not changed when the frequency is scaled down. -\sectionfont{\large} - -\subsection{~The online DVFS orientations} - -The objective of these works is to dynamically compute and set the frequency of -the CPU during the runtime of the program for saving energy. Estimating and -predicting approaches for the energy-time trade offs developed by -~\cite{11,2,31}. These works select the best DVFS setting depending on the slack -times. These times happen when the processors have to wait for data from other -processors to compute their task. For example, during the synchronous -communication time that take place in the MPI programs, the processors are -idle. The optimal DVFS can be selected using the learning methods. Therefore, in -~\cite{39,19} used machine learning to converge to the suitable DVFS -configuration. Their learning algorithms have big time to converge when the -number of available frequencies is high. Also, the communication time of the MPI -program used online for saving energy as in~\cite{1}, Lim et al. developed an -algorithm that detects the communication sections and changes the frequency -during these sections only. This approach changes the frequency many times -because an iteration may contain more than one communication section. The domain -of analytical modeling used for choosing the optimal frequency as in ~\cite{3}, -Rauber et al. developed an analytical mathematical model for determining the -optimal frequency scaling factor for any number of concurrent tasks, without -considering communication times. They set the slowest task to maximum frequency -for maintaining performance. In this paper we compare our algorithm with -Rauber's model~\cite{3}, because his model can be used for any number of -concurrent tasks for homogeneous platform and this is the same direction of this -paper. However, the primary contributions of this paper are: -\begin{enumerate} -\item Selecting the optimal frequency scaling factor for energy and performance - simultaneously. While taking into account the communication time. -\item Adapting our scale factor to taking into account the imbalanced tasks. -\item The execution time of our algorithm is very small when compared to other - methods (e.g.,~\cite{19}). -\item The proposed algorithm works online without profiling or training as - in~\cite{38,34}. -\end{enumerate} -\sectionfont{\large} - -\section{.~Parallel Tasks Execution on Homogeneous Platform} - -A homogeneous cluster consists of identical nodes in terms of the hardware and -the software. Each node has its own memory and at least one processor which can -be a multi-core. The nodes are connected via a high bandwidth network. Tasks -executed on this model can be either synchronous or asynchronous. In this paper -we consider execution of the synchronous tasks on distributed homogeneous -platform. These tasks can exchange the data via synchronous memory passing. -\begin{figure}[h] - \centering - \subfloat[Synch. Imbalanced Communications]{\includegraphics[scale=0.67]{synch_tasks}\label{fig:h1}} - \subfloat[Synch. Imbalanced Computations]{\includegraphics[scale=0.67]{compt}\label{fig:h2}} - \caption{Parallel Tasks on Homogeneous Platform} - \label{fig:homo} -\end{figure} -Therefore, the execution time of a task consists of the computation time and the -communication time. Moreover, the synchronous communications between tasks can -lead to idle time while tasks wait at the synchronous point for others tasks to -finish their communications see figure~(\ref{fig:h1}). Another source for idle -times is the imbalanced computations. This happen when processing different -amounts of data on each processor as an example see figure~(\ref{fig:h2}). In -this case the fastest tasks have to wait at the synchronous barrier for the -slowest tasks to finish their job. In both two cases the overall execution time -of the program is the execution time of the slowest task as : -\begin{equation} \label{eq:T1} - Program Time=MAX_{i=1,2,..,N} (T_i) \hfill -\end{equation} -where $T_i$ is the execution time of process $i$. -\sectionfont{\large} - -\section{.~Energy Model for Homogeneous Platform} - -The energy consumption by the processor consists of two powers metric: the -dynamic and the static power. This general power formulation is used by many -researchers see ~\cite{9,3,15,26}. The dynamic power of the CMOS processors -$P_{dyn}$ is related to the switching activity $\alpha$, load capacitance $C_L$, -the supply voltage $V$ and operational frequency $f$ respectively as follow : -\begin{equation} \label{eq:pd} - \displaystyle P_{dyn} = \alpha . C_L . V^2 . f -\end{equation} -The static power $P_{static}$ captures the leakage power consumption as well as -the power consumption of peripheral devices like the I/O subsystem. -\begin{equation} \label{eq:ps} - \displaystyle P_{static} = V . N . K_{design} . I_{leak} -\end{equation} -where V is the supply voltage, N is the number of transistors, $K_{design}$ is a -design dependent parameter and $I_{leak}$ is a technology-dependent -parameter. Energy consumed by an individual processor $E_{ind}$ is the summation -of the dynamic and the static power multiply by the execution time for example -see~\cite{36,15} . -\begin{equation} \label{eq:eind} - \displaystyle E_{ind} = (P_{dyn} + P_{static} ) . T + +\title{Dynamic Frequency Scaling for Energy Consumption + Reduction in Synchronous Distributed Applications} + +\author{% + \IEEEauthorblockN{% + Jean-Claude Charr, + Raphaël Couturier, + Ahmed Fanfakh and + Arnaud Giersch + } + \IEEEauthorblockA{% + FEMTO-ST Institute\\ + University of Franche-Comté\\ + IUT de Belfort-Montbéliard, + 19 avenue du Maréchal Juin, BP 527, 90016 Belfort cedex, France\\ + % Telephone: \mbox{+33 3 84 58 77 86}, % Raphaël + % Fax: \mbox{+33 3 84 58 77 81}\\ % Dept Info + Email: \email{{jean-claude.charr,raphael.couturier,ahmed.fanfakh_badri_muslim,arnaud.giersch}@univ-fcomte.fr} + } + } + +\maketitle + +\begin{abstract} + Dynamic Voltage Frequency Scaling (DVFS) can be applied to modern CPUs. This + technique is usually used to reduce the energy consumed by a CPU while + computing. Thus, decreasing the frequency + reduces the power consumed by the CPU. However, it can also significantly + affect the performance of the executed program if it is compute bound and if a + low CPU frequency is selected. Therefore, the chosen scaling factor must + give the best possible trade-off between energy reduction and performance. + + In this paper we present an algorithm that predicts the energy consumed with + each frequency gear and selects the one that gives the best ratio between + energy consumption reduction and performance. This algorithm works online + without training or profiling and has a very small overhead. It also takes + into account synchronous communications between the nodes that are executing + the distributed algorithm. The algorithm has been evaluated over the SimGrid + simulator while being applied to the NAS parallel benchmark programs. The + results of the experiments show that it outperforms other existing scaling + factor selection algorithms. +\end{abstract} + +\section{Introduction} +\label{sec.intro} + +The need and demand for more computing power have been increasing since the +birth of the first computing unit and it is not expected to slow down in the +coming years. To satisfy this demand, researchers and supercomputers +constructors have been regularly increasing the number of computing cores and +processors in supercomputers (for example in November 2013, according to the +TOP500 list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more +than 3 million of cores and delivers more than \np[Tflop/s]{33} while consuming +\np[kW]{17808}). This large increase in number of computing cores has led to +large energy consumption by these architectures. Moreover, the price of energy +is expected to continue its ascent according to the demand. For all these +reasons energy reduction has become an important topic in the high performance +computing field. To tackle this problem, many researchers use DVFS (Dynamic +Voltage Frequency Scaling) operations which reduce dynamically the frequency and +voltage of cores and thus their energy consumption. Indeed, modern CPUs offer a +set of acceptable frequencies which are usually called gears, and the user or +the operating system can modify the frequency of the processor according to its +needs. However, DVFS also degrades the performance of computation. Therefore +researchers try to reduce the frequency to the minimum when processors are idle +(waiting for data from other processors or communicating with other processors). +Moreover, depending on their objectives, they use heuristics to find the best +scaling factor during the computation. If they aim for performance they choose +the best scaling factor that reduces the consumed energy while affecting as +little as possible the performance. On the other hand, if they aim for energy +reduction, the chosen scaling factor must produce the most energy efficient +execution without considering the degradation of the performance. It is +important to notice that lowering the frequency to the minimum value does not always +give the most energy efficient execution due to energy leakage. The best +scaling factor might be chosen during execution (online) or during a +pre-execution phase. In this paper, we present an algorithm that selects a +frequency scaling factor that simultaneously takes into consideration the energy +consumption by the CPU and the performance of the application. The main +objective of HPC systems is to execute as fast as possible the application. +Therefore, our algorithm selects the scaling factor online with very small +overhead. The proposed algorithm takes into account the communication times of +the MPI program to choose the scaling factor. This algorithm has the ability to +predict both energy consumption and execution time over all available scaling +factors. The prediction achieved depends on some computing time information, +gathered at the beginning of the runtime. We apply this algorithm to the NAS parallel benchmarks (NPB v3.3)~\cite{44}. Our experiments are executed using the simulator +SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183} over a homogeneous +distributed memory architecture. Furthermore, we compare the proposed algorithm +with Rauber and Rünger methods~\cite{3}. The comparison's results show that our +algorithm gives better energy-time trade-off. + +This paper is organized as follows: Section~\ref{sec.relwork} presents some +related works from other authors. Section~\ref{sec.exe} presents an energy +model for homogeneous platforms. Section~\ref{sec.mpip} describes how the +performance of MPI programs can be predicted. Section~\ref{sec.compet} presents +the energy-performance objective function that maximizes the reduction of energy +consumption while minimizing the degradation of the program's performance. +Section~\ref{sec.optim} details the proposed energy-performance algorithm. +Section~\ref{sec.expe} verifies the accuracy of the performance prediction model +and presents the results of the proposed algorithm. It also shows the +comparison results between our method and other existing methods. Finally, we +conclude in Section~\ref{sec.concl} with a summary and some future works. + +\section{Related works} +\label{sec.relwork} + + +In this section, some heuristics to compute the scaling factor are presented and +classified into two categories: offline and online methods. + +\subsection{Offline scaling factor selection methods} + +The offline scaling factor selection methods are executed before the runtime of +the program. They return static scaling factor values to the processors +participating in the execution of the parallel program. On the one hand, the +scaling factor values could be computed based on information retrieved by +analyzing the code of the program and the computing system that will execute it. +In~\cite{40}, Azevedo et al. detect during compilation the dependency points +between tasks in a multi-task program. This information is then used to lower +the frequency of some processors in order to eliminate slack times. A slack +time is the period of time during which a processor that has already finished +its computation, has to wait for a set of processors to finish their +computations and send their results to the waiting processor in order to +continue its task that is dependent on the results of computations being +executed on other processors. Freeh et al. showed in~\cite{17} that the +communication times of MPI programs do not change when the frequency is scaled +down. On the other hand, some offline scaling factor selection methods use the +information gathered from previous full or partial executions of the program. The whole program or, a +part of it, is usually executed over all the available frequency +gears and the execution time and the energy consumed with each frequency +gear are measured. Then a heuristic or an exact method uses the retrieved +information to compute the values of the scaling factor for the processors. +In~\cite{8} , Rountree et al. use a linear programming algorithm, while in~\cite{34}, Cochran et +al. use a multi-logistic regression algorithm for the same goal. The main +drawback of these methods is that they all require executing the +whole program or, a part of it, on all frequency gears for each new instance of the same program. + +\subsection{Online scaling factor selection methods} + +The online scaling factor selection methods are executed during the runtime of +the program. They are usually integrated into iterative programs where the same +block of instructions is executed many times. During the first few iterations, +a lot of information is measured such as the execution time, the energy consumed +using a multimeter, the slack times, \dots{} Then a method will exploit these +measurements to compute the scaling factor values for each processor. This +operation, measurements and computing new scaling factor, can be repeated as +much as needed if the iterations are not regular. Peraza, Yu-Liang et +al.~\cite{2,31} used varied heuristics to select the appropriate scaling +factor values to eliminate the slack times during runtime. However, as seen +in~\cite{19}, machine learning method takes a lot of time to converge +when the number of available gears is big. To reduce the impact of slack times, +in~\cite{1}, Lim et al. developed an algorithm that detects the communication +sections and changes the frequency during these sections only. This approach +might change the frequency of each processor many times per iteration if an +iteration contains more than one communication section. In~\cite{3}, Rauber and +Rünger used an analytical model that can predict the consumed energy for every frequency gear after measuring the consumed energy. They +maintain the performance as mush as possible by setting the highest frequency gear to the slowest task. + +The primary contribution of +our paper is to present a new online scaling factor selection method which has the + following characteristics:\\ +1) It is based on Rauber and Rünger analytical model to predict the energy + consumption of the application with different frequency gears. +2) It selects the frequency scaling factor for simultaneously optimizing + energy reduction and maintaining performance. +3) It is well adapted to distributed architectures because it takes into + account the communication time. +4) It is well adapted to distributed applications with imbalanced tasks. +5) It has a very small overhead when compared to other methods + (e.g.,~\cite{19}) and does not require profiling or training as + in~\cite{34}. + + +% \JC{The whole subsection ``Parallel Tasks Execution on Homogeneous Platform'', +% can be deleted if we need space, we can just say we are interested in this +% paper in homogeneous clusters} + + +\section{Energy model for a homogeneous platform} +\label{sec.exe} +Many researchers~\cite{9,3,15,26} divide the power consumed by a processor into +two power metrics: the static and the dynamic power. While the first one is +consumed as long as the computing unit is on, the latter is only consumed during +computation times. The dynamic power $\Pdyn$ is related to the switching +activity $\alpha$, load capacitance $C_L$, the supply voltage $V$ and +operational frequency $f$, as shown in EQ~\eqref{eq:pd}. +\begin{equation} + \label{eq:pd} + \Pdyn = \alpha \cdot C_L \cdot V^2 \cdot f \end{equation} -The dynamic voltage and frequency scaling (DVFS) is a process that allowed in -modern processors to reduce the dynamic power by scaling down the voltage and -frequency. Its main objective is to reduce the overall energy -consumption~\cite{37}. The operational frequency \emph f depends linearly on the -supply voltage $V$, i.e., $V = \beta . f$ with some constant $\beta$. This -equation is used to study the change of the dynamic voltage with respect to -various frequency values in~\cite{3}. The reduction process of the frequency are -expressed by scaling factor \emph S. The scale \emph S is the ratio between the -maximum and the new frequency as in EQ~(\ref{eq:s}). -\begin{equation} \label{eq:s} - S=\:\frac{F_{max}}{F_{new}} \hfill \newline +The static power $\Pstatic$ captures the leakage power as follows: +\begin{equation} + \label{eq:ps} + \Pstatic = V \cdot \Ntrans \cdot \Kdesign \cdot \Ileak \end{equation} -The value of the scale \emph S is grater than 1 when changing the frequency to -any new frequency value(\emph {P-state}) in governor.~It is equal to 1 when the -frequency are set to the maximum frequency. The energy consumption model for -parallel homogeneous platform is depending on the scaling factor \emph S. This -factor reduces quadratically the dynamic power. Also, this factor increases the -static energy linearly because the execution time is increased~\cite{36}. The -energy model, depending on the frequency scaling factor, of homogeneous platform -for any number of concurrent tasks develops by Rauber~\cite{3}. This model -consider the two powers metric for measuring the energy of the parallel tasks as -in EQ~(\ref{eq:energy}). - -\begin{equation} \label{eq:energy} - E= \displaystyle \;P_{dyn}\,.\,S_1^{-2}\;.\,(T_1+\sum\limits_{i=2}^{N}\frac{T_i^3}{T_1^2})+\;P_{static}\,.\,T_1\,.\,S_1\;\,.\,N - \hfill +where V is the supply voltage, $\Ntrans$ is the number of transistors, +$\Kdesign$ is a design dependent parameter and $\Ileak$ is a +technology-dependent parameter. The energy consumed by an individual processor +to execute a given program can be computed as: +\begin{equation} + \label{eq:eind} + \Eind = \Pdyn \cdot \Tcomp + \Pstatic \cdot T \end{equation} -Where \emph N is the number of parallel nodes, $T_1 $ is the time of the slower -task, $T_i$ is the time of the task $i$ and $S_1$ is the maximum scaling factor -for the slower task. The scaling factor $S_1$, as in EQ~(\ref{eq:s1}), selects -from the set of scales values $S_i$. Each of these scales are proportional to -the time value $T_i$ depends on the new frequency value as in EQ~(\ref{eq:si}). -\begin{equation} \label{eq:s1} - S_1=MAX_{i=1,2,..,F} (S_i) \hfill +where $T$ is the execution time of the program, $\Tcomp$ is the computation +time and $\Tcomp \leq T$. $\Tcomp$ may be equal to $T$ if there is no +communication, no slack time and no synchronization. + +DVFS is a process that is allowed in modern processors to reduce the dynamic +power by scaling down the voltage and frequency. Its main objective is to +reduce the overall energy consumption~\cite{37}. The operational frequency $f$ +depends linearly on the supply voltage $V$, i.e., $V = \beta \cdot f$ with some +constant $\beta$. This equation is used to study the change of the dynamic +voltage with respect to various frequency values in~\cite{3}. The reduction +process of the frequency can be expressed by the scaling factor $S$ which is the +ratio between the maximum and the new frequency as in EQ~\eqref{eq:s}. +\begin{equation} + \label{eq:s} + S = \frac{\Fmax}{\Fnew} \end{equation} -\begin{equation} \label{eq:si} - S_i=\:S\: .\:(\frac{T_1}{T_i})=\: (\frac{F_{max}}{F_{new}}).(\frac{T_1}{T_i}) \hfill +The value of the scaling factor $S$ is greater than 1 when changing the +frequency of the CPU to any new frequency value~(\emph{P-state}) in the +governor. This factor reduces quadratically +the dynamic power which may cause degradation in performance and thus, the +increase of the static energy because the execution time is increased~\cite{36}. +If the tasks are sorted according to their execution times before scaling in a +descending order, the total energy consumption model for a parallel homogeneous +platform, as presented by Rauber and Rünger~\cite{3}, can be written as a +function of the scaling factor $S$, as in EQ~\eqref{eq:energy}. + +\begin{equation} + \label{eq:energy} + E = \Pdyn \cdot S_1^{-2} \cdot + \left( T_1 + \sum_{i=2}^{N} \frac{T_i^3}{T_1^2} \right) + + \Pstatic \cdot T_1 \cdot S_1 \cdot N \end{equation} -Where $F$ is the number of available frequencies. In this paper we depend on -Rauber's energy model EQ~(\ref{eq:energy}) for two reasons : 1-this model used -for homogeneous platform that we work on in this paper. 2-we are compare our -algorithm with Rauber's scaling model. Rauber's optimal scaling factor for -optimal energy reduction derived from the EQ~(\ref{eq:energy}). He takes the -derivation for this equation (to be minimized) and set it to zero to produce the -scaling factor as in EQ~(\ref{eq:sopt}). -\begin{equation} \label{eq:sopt} - S_{opt}= {\sqrt [3~]{\frac{2}{n} \frac{P_{dyn}}{P_{static}} \Big(1+\sum\limits_{i=2}^{N}\frac{T_i^3}{T_1^3}\Big) }} \hfill +where $N$ is the number of parallel nodes, $T_i$ for $i=1,\dots,N$ are +the execution times of the sorted tasks. Therefore, $T_1$ is +the time of the slowest task, and $S_1$ its scaling factor which should be the +highest because they are proportional to the time values $T_i$. The scaling +factors $S_i$ are computed as in EQ~\eqref{eq:si}. +\begin{equation} + \label{eq:si} + S_i = S \cdot \frac{T_1}{T_i} + = \frac{\Fmax}{\Fnew} \cdot \frac{T_1}{T_i} \end{equation} -%[\Big 3] -\sectionfont{\large} - -\section{.~Performance Evaluation of MPI Programs} - -The performance (execution time) of the parallel MPI applications are depends on -the time of the slowest task as in figure~(\ref{fig:homo}). Normally the -execution time of the parallel programs are proportional to the operational -frequency. Therefore, any DVFS operation for the energy reduction increase the -execution time of the parallel program. As shown in EQ~(\ref{eq:energy}) the -energy affected by the scaling factor $S$. This factor also has a great impact -on the performance. When scaling down the frequency to the new value according -to EQ(~\ref{eq:s}) lead to the value of the scale $S$ has inverse relation with -new frequency value ($S \propto \frac{1}{F_{new}}$). Also when decrease the -frequency value, the execution time increase. Then the new frequency value has -inverse relation with time ($F_{new} \propto \frac{1}{T}$). This lead to the -frequency scaling factor $S$ proportional linearly with execution time ($S -\propto T$). Large scale MPI applications such as NAS benchmarks have -considerable amount of communications embedded in these programs. During the -communication process the processor remain idle until the communication has -finished. For that reason any change in the frequency has no impact on the time -of communication but it has obvious impact on the time of -computation~\cite{17}. We are made many tests on real cluster to prove that the -frequency scaling factor \emph S has a linear relation with computation time -only also see~\cite{41}. To predict the execution time of MPI program, firstly -must be precisely specifying communication time and the computation time for the -slower task. Secondly, we use these times for predicting the execution time for -any MPI program as a function of the new scaling factor as in the -EQ~(\ref{eq:tnew}). -\begin{equation} \label{eq:tnew} - \displaystyle T_{new}= T_{Max \:Comp \:Old} \; . \:S \;+ \;T_{Max\: Comm\: Old} - \hfill +In this paper we use Rauber and Rünger's energy model, EQ~\eqref{eq:energy}, because it can be applied to homogeneous clusters if the communication time is taken in consideration. Moreover, we compare our algorithm with Rauber and Rünger's scaling factor selection +method which uses the same energy model. In their method, the optimal scaling factor is +computed by minimizing the derivation of EQ~\eqref{eq:energy} which produces +EQ~\eqref{eq:sopt}. + +\begin{equation} + \label{eq:sopt} + \Sopt = \sqrt[3]{\frac{2}{N} \cdot \frac{\Pdyn}{\Pstatic} \cdot + \left( 1 + \sum_{i=2}^{N} \frac{T_i^3}{T_1^3} \right) } \end{equation} -The above equation shows that the scaling factor \emph S has linear relation -with the computation time without affecting the communication time. The -communication time consists of the beginning times which an MPI calls for -sending or receiving till the message is synchronously sent or received. In this -paper we predict the execution time of the program for any new scaling factor -value. Depending on this prediction we can produce our energy-performace scaling -method as we will show in the coming sections. In the next section we make an -investigation study for the EQ~(\ref{eq:tnew}). -\sectionfont{\large} - -\section{.~Performance Prediction Verification } - -In this section we evaluate the precision of our performance prediction methods -on the NAS benchmark. We use the EQ~(\ref{eq:tnew}) that predicts the execution -time for any scale value. The NAS programs run the class B for comparing the -real execution time with the predicted execution time. Each program runs offline -with all available scaling factors on 8 or 9 nodes to produce real execution -time values. These scaling factors are computed by dividing the maximum -frequency by the new one see EQ~(\ref{eq:s}). In all tests, we use the simulator -Simgrid/SMPI v3.10 to run the NAS programs. -\begin{figure}[width=\textwidth,height=\textheight,keepaspectratio] - \centering - \includegraphics[scale=0.60]{cg_per.eps} - \includegraphics[scale=0.60]{mg_pre.eps} - \includegraphics[scale=0.60]{bt_pre.eps} - \includegraphics[scale=0.60]{lu_pre.eps} - \caption{Fitting Predicted to Real Execution Time} - \label{fig:pred} -\end{figure} -%see Figure~\ref{fig:pred} -In our cluster there are 18 available frequency states for each processor from -2.5 GHz to 800 MHz, there is 100 MHz difference between two successive -frequencies. For more details on the characteristics of the platform refer to -table~(\ref{table:platform}). This lead to 18 run states for each program. We -use seven MPI programs of the NAS parallel benchmarks : CG, MG, EP, FT, BT, LU -and SP. The average normalized errors between the predicted execution time and -the real time (Simgrid time) for all programs is between 0.0032 to 0.0133. AS an -example, we are present the execution times of the NAS benchmarks as in the -figure~(\ref{fig:pred}). -\sectionfont{\large} - -\section{.~Performance to Energy Competition} -This section demonstrates our approach for choosing the optimal scaling -factor. This factor gives maximum energy reduction taking into account the -execution time for both computation and communication times . The relation -between the energy and the performance are nonlinear and complex, because the -relation of the energy with scaling factor is nonlinear and with the performance -it is linear see~\cite{17}. The relation between the energy and the performance -is not straightforward. Moreover, they are not measured using the same metric. -For solving this problem, we normalize the energy by calculating the ratio -between the consumed energy with scaled frequency and the consumed energy -without scaled frequency : -\begin{equation} \label{eq:enorm} - E_{Norm}=\displaystyle\frac{E_{Reduced}}{E_{Orginal}}= \frac{\displaystyle \;P_{dyn}\,.\,S_i^{-2}\,.\,(T_1+\sum\limits_{i=2}^{N}\frac{T_i^3}{T_1^2})+\;P_{static}\,.\,T_1\,.\,S_i\;\,.\,N }{\displaystyle \;P_{dyn}\,.\,(T_1+\sum\limits_{i=2}^{N}\frac{T_i^3}{T_1^2})+\;P_{static}\,.\,T_1\,\,.\,N } + + +\section{Performance evaluation of MPI programs} +\label{sec.mpip} + +The execution time of a parallel synchronous iterative application is +equal to the execution time of the slowest task. If there is no +communication and the application is not data bounded, the execution time of a +parallel program is linearly proportional to the operational frequency and any +DVFS operation for energy reduction increases the execution time of the parallel +program. Therefore, the scaling factor $S$ is linearly proportional to the +execution time. However, in most MPI applications the processes exchange +data. During these communications the processors involved remain idle until the +communications are finished. For that reason, any change in the frequency has no +impact on the time of communication~\cite{17}. The communication time for a +task is the summation of periods of time that begin with an MPI call for sending +or receiving a message till the message is synchronously sent or received. To +be able to predict the execution time of MPI program, the communication time and +the computation time for the slowest task must be measured before scaling. These +times are used to predict the execution time for any MPI program as a function +of the new scaling factor as in EQ~\eqref{eq:tnew}. +\begin{equation} + \label{eq:tnew} + \Tnew = \TmaxCompOld \cdot S + \TmaxCommOld \end{equation} -By the same way we can normalize the performance as follows : -\begin{equation} \label{eq:pnorm} - P_{Norm}=\displaystyle \frac{T_{New}}{T_{Old}}=\frac{T_{Max \:Comp \:Old} \;. \:S \;+ \;T_{Max\: Comm\: Old}}{T_{Old}} \;\; +In this paper, this prediction method is used to select the best scaling factor +for each processor as presented in the next section. + +\section{Performance and energy reduction trade-off} +\label{sec.compet} + +This section presents our method for choosing the optimal scaling factor that +gives the best tradeoff between energy reduction and performance. This method +takes into account the execution times for both computation and communication to +compute the scaling factor. Since the energy consumption and the performance +are not measured using the same metric, a normalized value of both measurements +can be used to compare them. The normalized energy is the ratio between the +consumed energy with scaled frequency and the consumed energy without scaled +frequency: +\begin{multline} + \label{eq:enorm} + \Enorm = \frac{ \Ereduced}{\Eoriginal} \\ + {} = \frac{\Pdyn \cdot S_1^{-2} \cdot + \left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) + + \Pstatic \cdot T_1 \cdot S_1 \cdot N}{ + \Pdyn \cdot \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) + + \Pstatic \cdot T_1 \cdot N } +\end{multline} +In the same way, the normalized execution time of a program is computed as follows: +\begin{equation} + \label{eq:pnorm} + \Tnorm = \frac{\Tnew}{\Told} + = \frac{\TmaxCompOld \cdot S + \TmaxCommOld}{ + \TmaxCompOld + \TmaxCommOld} \end{equation} -The second problem is the optimization operation for both energy and performance -is not in the same direction. In other words, the normalized energy and the -performance curves are not in the same direction see figure~(\ref{fig:r2}). -While the main goal is to optimize the energy and performance in the same -time. According to the equations~(\ref{eq:enorm}) and~(\ref{eq:pnorm}) the -scaling factor \emph S reduce both the energy and the performance -simultaneously. But the main objective is to produce maximum energy reduction -with minimum performance reduction. Many researchers used different strategies -to solve this nonlinear problem for example see~\cite{19,42}, their methods add -big overhead to the algorithm for selecting the suitable frequency. In this -paper we are present a method to find the optimal scaling factor \emph S for -optimize both energy and performance simultaneously without adding big -overheads. Our solution for this problem is to make the optimization process -have the same direction. Therefore, we inverse the equation of normalize -performance as follows : -\begin{equation} \label{eq:pnorm_en} - \displaystyle P^{-1}_{Norm}= \frac{T_{Old}}{T_{New}}=\frac{T_{Old}}{T_{Max \:Comp \:Old} \;. \:S \;+ \;T_{Max\: Comm\: Old}} +The relation between the execution time and the consumed energy of a program is nonlinear and complex. In consequences, the relation between the consumed energy and the scaling factor is also nonlinear, for more details refer to~\cite{17}. Therefore, the resulting normalized energy consumption curve and execution time curve, for different scaling factors, do not have the same direction see Figure~\ref{fig:rel}\subref{fig:r2}. To tackle this problem and optimize both terms, we inverse the equation of the normalized execution time as follows: +\begin{equation} + \label{eq:pnorm_en} + \Pnorm = \frac{ \Told}{ \Tnew} + = \frac{\TmaxCompOld + + \TmaxCommOld}{\TmaxCompOld \cdot S + + \TmaxCommOld} \end{equation} \begin{figure} \centering - \subfloat[Converted Relation.]{\includegraphics[scale=0.70]{file.eps}\label{fig:r1}} - \subfloat[Real Relation.]{\includegraphics[scale=0.70]{file3.eps}\label{fig:r2}} + \subfloat[Real relation.]{% + \includegraphics[width=.5\linewidth]{fig/file3}\label{fig:r2}}% + \subfloat[Converted relation.]{% + \includegraphics[width=.5\linewidth]{fig/file}\label{fig:r1}} + \caption{The energy and performance relation} \label{fig:rel} - \caption{The Energy and Performance Relation} \end{figure} -Then, we can modelize our objective function as finding the maximum distance -between the energy curve EQ~(\ref{eq:enorm}) and the inverse of performance -curve EQ~(\ref{eq:pnorm_en}) over all available scaling factors. This represent -the minimum energy consumption with minimum execution time (better performance) -in the same time, see figure~(\ref{fig:r1}). Then our objective function has the -following form: -\begin{equation} \label{eq:max} - \displaystyle MaxDist = Max \;(\;\overbrace{P^{-1}_{Norm}}^{Maximize}\; -\; \overbrace{E_{Norm}}^{Minimize} \;) +Then, we can model our objective function as finding the maximum distance +between the energy curve EQ~\eqref{eq:enorm} and the inverse of the execution time (performance) +curve EQ~\eqref{eq:pnorm_en} over all available scaling factors. This +represents the minimum energy consumption with minimum execution time (better +performance) at the same time, see Figure~\ref{fig:rel}\subref{fig:r1}. Then +our objective function has the following form: +\begin{equation} + \label{eq:max} + \MaxDist = \max_{j=1,2,\dots,F} + (\overbrace{\Pnorm(S_j)}^{\text{Maximize}} - + \overbrace{\Enorm(S_j)}^{\text{Minimize}} ) \end{equation} -Then we can select the optimal scaling factor that satisfy the -EQ~(\ref{eq:max}). Our objective function can works with any energy model or -static power values stored in a data file. Moreover, this function works in -optimal way when the energy function has a convex form with frequency scaling -factor as shown in ~\cite{15,3,19}. Energy measurement model is not the -objective of this paper and we choose Rauber's model as an example with two -reasons that mentioned before. -\sectionfont{\large} - -\section{.~Optimal Scaling Factor for Performance and Energy } - -In the previous section we described the objective function that satisfy our -goal in discovering optimal scaling factor for both performance and energy at -the same time. Therefore, we develop an energy to performance scaling algorithm -(EPSA). This algorithm is simple and has a direct way to calculate the optimal -scaling factor for both energy and performance at the same time. \clearpage -\linespread{1} -\begin{algorithm}[width=\textwidth,height=\textheight,keepaspectratio] - \caption{EPSA} +where $F$ is the number of available frequencies. Then we can select the optimal +scaling factor that satisfies EQ~\eqref{eq:max}. Our objective function can +work with any energy model or static power values stored in a data file. +Moreover, this function works in optimal way when the energy curve has a convex +form over the available frequency scaling factors as shown in~\cite{15,3,19}. + +\section{Optimal scaling factor for performance and energy} +\label{sec.optim} + +Algorithm on Figure~\ref{EPSA} computes the optimal scaling factor according to +the objective function described above. +\begin{figure}[tp] + \begin{algorithmic}[1] + % \footnotesize + \Require ~ + \begin{description} + \item[$\Pstatic$] static power value + \item[$\Pdyn$] dynamic power value + \item[$\Pstates$] number of available frequencies + \item[$\Fmax$] maximum frequency + \item[$\Fdiff$] difference between two successive freq. + \end{description} + \Ensure $\Sopt$ is the optimal scaling factor + + \State $\Sopt \gets 1$ + \State $\Dist \gets 0$ + \State $\Fnew \gets \Fmax$ + \For {$j = 2$ to $\Pstates$} + \State $\Fnew \gets \Fnew - \Fdiff$ + \State $S \gets \Fmax / \Fnew$ + \State $S_i \gets S \cdot \frac{T_1}{T_i} + = \frac{\Fmax}{\Fnew} \cdot \frac{T_1}{T_i}$ + for $i=1,\dots,N$ + \State $\Enorm \gets + \frac{\Pdyn \cdot S_1^{-2} \cdot + \left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) + + \Pstatic \cdot T_1 \cdot S_1 \cdot N }{ + \Pdyn \cdot + \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) + + \Pstatic \cdot T_1 \cdot N }$ + \State $\Pnorm \gets \Told / \Tnew$ + \If{$(\Pnorm - \Enorm > \Dist)$} + \State $\Sopt \gets S$ + \State $\Dist \gets \Pnorm - \Enorm$ + \EndIf + \EndFor + \State Return $\Sopt$ + \end{algorithmic} + \caption{Scaling factor selection algorithm} \label{EPSA} +\end{figure} + +The proposed algorithm works online during the execution time of the MPI +program. It selects the optimal scaling factor after gathering the computation +and communication times from the program after one iteration. Then the program +changes the new frequencies of the CPUs according to the computed scaling +factors. In our experiments over a homogeneous cluster described in +Section~\ref{sec.expe}, this algorithm has a small execution time. It takes +\np[$\mu$s]{1.52} on average for 4 nodes and \np[$\mu$s]{6.65} on average for 32 +nodes. The algorithm complexity is $O(F\cdot N)$, where $F$ is the number of +available frequencies and $N$ is the number of computing nodes. The algorithm +is called just once during the execution of the program. The DVFS algorithm on +Figure~\ref{dvfs} shows where and when the algorithm is called in the MPI +program. +%\begin{table}[htb] +% \caption{Platform file parameters} +% % title of Table +% \centering +% \begin{tabular}{|*{7}{l|}} +% \hline +% Max & Min & Backbone & Backbone & Link & Link & Sharing \\ +% Freq. & Freq. & Bandwidth & Latency & Bandwidth & Latency & Policy \\ +% \hline +% \np{2.5} & \np{800} & \np[GBps]{2.25} & \np[$\mu$s]{0.5} & \np[GBps]{1} & \np[$\mu$s]{50} & Full \\ +% GHz & MHz & & & & & Duplex \\ +% \hline +% \end{tabular} +% \label{table:platform} +%\end{table} + +\begin{figure}[tp] \begin{algorithmic}[1] - \State Initialize the variable $Dist=0$ - \State Set dynamic and static power values. - \State Set $P_{states}$ to the number of available frequencies. - \State Set the variable $F_{new}$ to max. frequency, $F_{new} = F_{max} $ - \State Set the variable $F_{diff}$ to the scale value between each two frequencies. - \For {$i=1$ to $P_{states} $} - \State - Calculate the new frequency as $F_{new}=F_{new} - F_{diff} $ - \State - Calculate the scale factor $S$ as in EQ~(\ref{eq:s}). - \State - Calculate all available scales $S_i$ depend on $S$ as in EQ~(\ref{eq:si}). - \State - Select the maximum scale factor $S_1$ from the set of scales $S_i$. - \State - Calculate the normalize energy $E_{Norm}=E_{R}/E_{O}$ as in EQ~(\ref{eq:enorm}). - \State - Calculate the normalize inverse of performance $P_{NormInv}=T_{old}/T_{new}$ - - as in EQ~(\ref{eq:pnorm_en}). - \If{ $(P_{NormInv}-E_{Norm}$ $>$ $Dist$) } - \State $S_{optimal}=S$ - \State $Dist = P_{NormInv} - E_{Norm}$ + % \footnotesize + \For {$k=1$ to \textit{some iterations}} + \State Computations section. + \State Communications section. + \If {$(k=1)$} + \State Gather all times of computation and\newline\hspace*{3em}% + communication from each node. + \State Call algorithm from Figure~\ref{EPSA} with these times. + \State Compute the new frequency from the\newline\hspace*{3em}% + returned optimal scaling factor. + \State Set the new frequency to the CPU. \EndIf \EndFor - \State $ Return \; \; (S_{optimal})$ \end{algorithmic} -\end{algorithm} -\linespread{1.2} The proposed EPSA algorithm works online during the execution -time of the MPI program. It selects the optimal scaling factor by gathering some -information from the program after one iteration. This algorithm has small -execution time (between 0.00152 $ms$ for 4 nodes to 0.00665 $ms$ for 32 -nodes). The data required by this algorithm is the computation time and the -communication time for each task from the first iteration only. When these times -are measured, the MPI program calls the EPSA algorithm to choose the new -frequency using the optimal scaling factor. Then the program set the new -frequency to the system. The algorithm is called just one time during the -execution of the program. The following example shows where and when the EPSA -algorithm is called in the MPI program : \clearpage -\begin{lstlisting} -FOR J:=1 to Some_iterations Do - -Computations Section. - -Communications Section. - IF (J==1) THEN - -Gather all times of computation and communication - from each node. - -Call EPSA with these times. - -Calculate the new frequency from optimal scale. - -Set the new frequency to the system. - ENDIF -ENDFOR -\end{lstlisting} -After obtaining the optimal scale factor from the EPSA algorithm. The program -calculates the new frequency $F_i$ for each task proportionally to its time -value $T_i$. By substitution of the EQ~(\ref{eq:s}) in the EQ~(\ref{eq:si}), we -can calculate the new frequency $F_i$ as follows : -\begin{equation} \label{eq:fi} - F_i=\frac{F_{max} \; . \;T_i}{S_{optimal} \; . \;T_{max}} \hfill + \caption{DVFS algorithm} + \label{dvfs} +\end{figure} +After obtaining the optimal scaling factor, the program calculates the new +frequency $F_i$ for each task proportionally to its time value $T_i$. By +substitution of EQ~\eqref{eq:s} in EQ~\eqref{eq:si}, we can calculate the new +frequency $F_i$ as follows: +\begin{equation} + \label{eq:fi} + F_i = \frac{\Fmax \cdot T_i}{\Sopt \cdot \Tmax} \end{equation} According to this equation all the nodes may have the same frequency value if -they have balanced workloads. Otherwise, they take different frequencies when -have imbalanced workloads. Then EQ~(\ref{eq:fi}) works in adaptive way to change -the freguency according to the nodes workloads. -\sectionfont{\large} - -\section{.~Experimental Results} - -The proposed ESPA algorithm was applied to seven MPI programs of the NAS -benchmarks (EP ,CG , MG ,FT , BT, LU and SP). We work on three classes (A, B and -C) for each program. Each program runs on specific number of processors +they have balanced workloads, otherwise, they take different frequencies when +having imbalanced workloads. Thus, EQ~\eqref{eq:fi} adapts the frequency of the +CPU to the nodes' workloads to maintain the performance of the program. + +\section{Experimental results} +\label{sec.expe} +Our experiments are executed on the simulator SimGrid/SMPI v3.10. We configure +the simulator to use a homogeneous cluster with one core per node. +%The detailed characteristics of our platform file are shown in Table~\ref{table:platform}. +Each node in the cluster has 18 frequency values +from \np[GHz]{2.5} to \np[MHz]{800} with \np[MHz]{100} difference between each +two successive frequencies. The nodes are connected via an ethernet network with 1Gbit/s bandwidth. + +\subsection{Execution time prediction verification} + +In this section we evaluate the precision of our execution time prediction method +based on EQ~\eqref{eq:tnew} by applying it to the NAS benchmarks. The NAS programs +are executed with the class B option to compare the real execution time with +the predicted execution time. Each program runs offline with all available +scaling factors on 8 or 9 nodes (depending on the benchmark) to produce real +execution time values. These scaling factors are computed by dividing the +maximum frequency by the new one see EQ~\eqref{eq:s}. +\begin{figure} + \centering + \includegraphics[width=.5\linewidth]{fig/cg_per}\hfill% + % \includegraphics[width=.5\linewidth]{fig/mg_pre}\hfill% + % \includegraphics[width=.5\linewidth]{fig/bt_pre}\qquad% + \includegraphics[width=.5\linewidth]{fig/lu_pre}\hfill% + \caption{Comparing predicted to real execution times} + \label{fig:pred} +\end{figure} +%see Figure~\ref{fig:pred} +In our cluster there are 18 available frequency states for each processor. This +leads to 18 run states for each program. We use seven MPI programs of the NAS +parallel benchmarks: CG, MG, EP, FT, BT, LU and SP. Figure~\ref{fig:pred} +presents plots of the real execution times and the simulated ones. The maximum +normalized error between these two execution times varies between \np{0.0073} to +\np{0.031} dependent on the executed benchmark. The smallest prediction error +was for CG and the worst one was for LU. + +\subsection{The experimental results for the scaling algorithm } +The proposed algorithm was applied to seven MPI programs of the NAS benchmarks +(EP, CG, MG, FT, BT, LU and SP) which were run with three classes (A, B and C). +For each instance the benchmarks were executed on a number of processors proportional to the size of the class. Each class represents the problem size -ascending from the class A to C. Additionally, depending on some speed up points +ascending from class A to C. Additionally, depending on some speed up points for each class we run the classes A, B and C on 4, 8 or 9 and 16 nodes -respectively. Our experiments are executed on the simulator Simgrid/SMPI -v3.10. We design a platform file that simulates a cluster with one core per -node. This cluster is a homogeneous architecture with distributed memory. The -detailed characteristics of our platform file are shown in -thetable~(\ref{table:platform}). Each node in the cluster has 18 frequency -values from 2.5 GHz to 800 MHz with 100 MHz difference between each two -successive frequencies. -\begin{table}[ht] - \caption{Platform File Parameters} - % title of Table - \centering - \begin{tabular}{ | l | l | l |l | l |l |l | p{2cm} |} - \hline - Max & Min & Backbone & Backbone&Link &Link& Sharing \\ - Freq. & Freq. & Bandwidth & Latency & Bandwidth& Latency&Policy \\ \hline - 2.5 &800 & 2.25 GBps &5E-7 s & 1 GBps & 5E-5 s&Full \\ - GHz& MHz& & & & &Duplex \\\hline - \end{tabular} - \label{table:platform} -\end{table} -Depending on the EQ~(\ref{eq:energy}), we measure the energy consumption for all -the NAS MPI programs while assuming the power dynamic is equal to 20W and the -power static is equal to 4W for all experiments. We run the proposed ESPA -algorithm for all these programs. The results showed that the algorithm selected +respectively. Depending on EQ~\eqref{eq:energy}, we measure the energy +consumption for all the NAS MPI programs while assuming that the dynamic power +with the highest frequency is equal to \np[W]{20} and the power static is equal +to \np[W]{4} for all experiments. These power values were also used by Rauber +and Rünger in~\cite{3}. The results showed that the algorithm selected different scaling factors for each program depending on the communication -features of the program as in the figure~(\ref{fig:nas}). This figure shows that -there are different distances between the normalized energy and the normalized -inversed performance curves, because there are different communication features -for each MPI program. When there are little or not communications, the inversed -performance curve is very close to the energy curve. Then the distance between -the two curves is very small. This lead to small energy savings. The opposite -happens when there are a lot of communication, the distance between the two -curves is big. This lead to more energy savings (e.g. CG and FT), see -table~(\ref{table:factors results}). All discovered frequency scaling factors -optimize both the energy and the performance simultaneously for all the NAS -programs. In table~(\ref{table:factors results}), we record all optimal scaling -factors results for each program on class C. These factors give the maximum -energy saving percent and the minimum performance degradation percent in the -same time over all available scales. -\begin{figure}[width=\textwidth,height=\textheight,keepaspectratio] +features of the program as in the plots from Figure~\ref{fig:nas}. These plots +illustrate that there are different distances between the normalized energy and +the normalized inverted execution time curves, because there are different +communication features for each benchmark. When there are little or no +communications, the inverted execution time curve is very close to the energy +curve. Then the distance between the two curves is very small. This leads to +small energy savings. The opposite happens when there are a lot of +communication, the distance between the two curves is big. This leads to more +energy savings (e.g. CG and FT), see Table~\ref{table:compareC}. All discovered +frequency scaling factors optimize both the energy and the execution time +simultaneously for all NAS benchmarks. In Table~\ref{table:compareC}, we record +all optimal scaling factors results for each benchmark running class C. These +scaling factors give the maximum energy saving percentage and the minimum +performance degradation percentage at the same time from all available scaling +factors. +\begin{figure*}[t] \centering - \includegraphics[scale=0.47]{ep.eps} - \includegraphics[scale=0.47]{cg.eps} - \includegraphics[scale=0.47]{sp.eps} - \includegraphics[scale=0.47]{lu.eps} - \includegraphics[scale=0.47]{bt.eps} - \includegraphics[scale=0.47]{ft.eps} - \caption{Optimal scaling factors for The NAS MPI Programs} + \includegraphics[width=.33\linewidth]{fig/ep}\hfill% + \includegraphics[width=.33\linewidth]{fig/cg}\hfill% + % \includegraphics[width=.328\linewidth]{fig/sp} + % \includegraphics[width=.328\linewidth]{fig/lu}\hfill% + \includegraphics[width=.33\linewidth]{fig/bt} + % \includegraphics[width=.328\linewidth]{fig/ft} + \caption{Optimal scaling factors for the predicted energy and performance of NAS benchmarks} \label{fig:nas} -\end{figure} -\linespread{1.1} -\begin{table}[width=\textwidth,height=\textheight,keepaspectratio] - \caption{Optimal Scaling Factors Results} - % title of Table - \centering - \begin{tabular}{ | l | l | l |l | l | p{2cm} |} - \hline - Program & Optimal & Energy & Performance&Energy-Perf.\\ - Name & Scaling Factor& Saving \%&Degradation \% &Distance \\ \hline - CG & 1.56 &39.23 & 14.88 & 24.35\\ \hline - MG & 1.47 &34.97&21.7& 13.27 \\ \hline - EP & 1.04 &22.14&20.73 &1.41\\ \hline - LU & 1.388 &35.83&22.49 &13.34\\ \hline - BT & 1.315 &29.6&21.28 &8.32\\ \hline - SP & 1.388 &33.48 &21.36&12.12\\ \hline - FT & 1.47 &34.72 &19&15.72\\ \hline - \end{tabular} - \label{table:factors results} - % is used to refer this table in the text -\end{table} -\linespread{1.2} +\end{figure*} -As shown in the table~(\ref{table:factors results}), when the optimal scaling -factor has big value we can gain more energy savings for example as in CG and -FT. The opposite happens when the optimal scaling factor is small value as -example BT and EP. Our algorithm selects big scaling factor value when the +As shown in Table~\ref{table:compareC}, when the optimal scaling factor has a +big value we can gain more energy savings as in CG and FT benchmarks. The +opposite happens when the optimal scaling factor has a small value as in BT and +EP benchmarks. Our algorithm selects a big scaling factor value when the communication and the other slacks times are big and smaller ones in opposite -cases. In EP there are no communications inside the iterations. This make our -EPSA to selects smaller scaling factor values (inducing smaller energy savings). - -% \clearpage -\sectionfont{\large} - -\section{.~Comparing Results} - -In this section, we compare our EPSA algorithm results with Rauber's -methods~\cite{3}. He had two scenarios, the first is to reduce energy to optimal -level without considering the performance as in EQ~(\ref{eq:sopt}). We refer to -this scenario as $Rauber_{E}$. The second scenario is similar to the first -except setting the slower task to the maximum frequency (when the scale $S=1$) -to keep the performance from degradation as mush as possible. We refer to this -scenario as $Rauber_{E-P}$. The comparison is made in tables~(\ref{table:compare - Class A},\ref{table:compare Class B},\ref{table:compare Class C}). These -tables show the results of our EPSA and Rauber's two scenarios for all the NAS -benchmarks programs for classes A,B and C. -%\linespread{1} -\begin{table}[ht] - \caption{Comparing Results for The NAS Class A} +cases. In EP there are no communication inside the iterations. This leads our +algorithm to select smaller scaling factor values (inducing smaller energy +savings). + +\subsection{Results comparison} + +In this section, we compare our scaling factor selection method with Rauber and +Rünger methods~\cite{3}. They had two scenarios, the first is to reduce energy +to the optimal level without considering the execution time as in +EQ~\eqref{eq:sopt}. We refer to this scenario as $R_{E}$. The second scenario +is similar to the first except setting the slower task to the maximum frequency +(when the scale $S=1$) to keep the performance from degradation as mush as +possible. We refer to this scenario as $R_{E-P}$. While we refer to our +algorithm as EPSA (Energy to Performance Scaling Algorithm). The comparison is +made in Table~\ref{table:compareC}. This table shows the results of our method and +Rauber and Rünger scenarios for all the NAS benchmarks programs for class C. + +\begin{table} + \caption{Comparing results for the NAS class C} % title of Table \centering - \begin{tabular}{ | l | l | l |l | l |l| } + \begin{tabular}{|l|l|*{4}{r|}} \hline - Method&Program&Factor& Energy& Performance &Energy-Perf.\\ - name &name&value& Saving \%&Degradation \% &Distance - \\ \hline - % \rowcolor[gray]{0.85} - EPSA&CG & 1.56 &37.02 & 13.88 & 23.14\\ \hline - $Rauber_{E-P}$&CG &2.14 &42.77 & 25.27 & 17.5\\ \hline - $Rauber_{E}$&CG &2.14 &42.77&26.46&16.31\\ \hline - - EPSA&MG & 1.47 &27.66&16.82&10.84\\ \hline - $Rauber_{E-P}$&MG &2.14&34.45&31.84&2.61\\ \hline - $Rauber_{E}$&MG &2.14&34.48&33.65&0.8 \\ \hline - - EPSA&EP &1.19 &25.32&20.79&4.53\\ \hline - $Rauber_{E-P}$&EP&2.05&41.45&55.67&-14.22\\ \hline - $Rauber_{E}$&EP&2.05&42.09&57.59&-15.5\\ \hline - - EPSA&LU&1.56& 39.55 &19.38& 20.17\\ \hline - $Rauber_{E-P}$&LU&2.14&45.62&27&18.62 \\ \hline - $Rauber_{E}$&LU&2.14&45.66&33.01&12.65\\ \hline - - EPSA&BT&1.315& 29.6&20.53&9.07 \\ \hline - $Rauber_{E-P}$&BT&2.1&45.53&49.63&-4.1\\ \hline - $Rauber_{E}$&BT&2.1&43.93&52.86&-8.93\\ \hline - - EPSA&SP&1.388& 33.51&15.65&17.86 \\ \hline - $Rauber_{E-P}$&SP&2.11&45.62&42.52&3.1\\ \hline - $Rauber_{E}$&SP&2.11&45.78&43.09&2.69\\ \hline - - EPSA&FT&1.25& 25&10.8&14.2 \\ \hline - $Rauber_{E-P}$&FT&2.1&39.29&34.3&4.99 \\ \hline - $Rauber_{E}$&FT&2.1&37.56&38.21&-0.65\\ \hline - \end{tabular} - \label{table:compare Class A} - % is used to refer this table in the text -\end{table} -\begin{table}[ht] - \caption{Comparing Results for The NAS Class B} - % title of Table - \centering - \begin{tabular}{ | l | l | l |l | l |l| } + Method & Program & Factor & Energy & Performance & Energy-Perf. \\ + Name & Name & Value & Saving \% & Degradation \% & Distance \\ \hline - Method&Program&Factor& Energy& Performance &Energy-Perf.\\ - name &name&value& Saving \%&Degradation \% &Distance - \\ \hline % \rowcolor[gray]{0.85} - EPSA&CG & 1.66 &39.23&16.63&22.6 \\ \hline - $Rauber_{E-P}$&CG &2.15 &45.34&27.6&17.74\\ \hline - $Rauber_{E}$&CG &2.15 &45.34&28.88&16.46\\ \hline + $EPSA$ & CG & 1.56 & 39.23 & 14.88 & 24.35 \\ \hline + $R_{E-P}$ & CG & 2.15 & 45.36 & 25.89 & 19.47 \\ \hline + $R_{E}$ & CG & 2.15 & 45.36 & 26.70 & 18.66 \\ \hline - EPSA&MG & 1.47 &34.98&18.35&16.63\\ \hline - $Rauber_{E-P}$&MG &2.14&43.55&36.42&7.13 \\ \hline - $Rauber_{E}$&MG &2.14&43.56&37.07&6.49 \\ \hline + $EPSA$ & MG & 1.47 & 34.97 & 21.69 & 13.27 \\ \hline + $R_{E-P}$ & MG & 2.15 & 43.65 & 40.45 & 3.20 \\ \hline + $R_{E}$ & MG & 2.15 & 43.64 & 41.38 & 2.26 \\ \hline - EPSA&EP &1.08 &20.29&17.15&3.14 \\ \hline - $Rauber_{E-P}$&EP&2&42.38&56.88&-14.5\\ \hline - $Rauber_{E}$&EP&2&39.73&59.94&-20.21\\ \hline + $EPSA$ & EP & 1.04 & 22.14 & 20.73 & 1.41 \\ \hline + $R_{E-P}$ & EP & 1.92 & 39.40 & 56.33 & -16.93 \\ \hline + $R_{E}$ & EP & 1.92 & 38.10 & 56.35 & -18.25 \\ \hline - EPSA&LU&1.47&38.57&21.34&17.23 \\ \hline - $Rauber_{E-P}$&LU&2.1&43.62&36.51&7.11 \\ \hline - $Rauber_{E}$&LU&2.1&43.61&38.54&5.07 \\ \hline + $EPSA$ & LU & 1.38 & 35.83 & 22.49 & 13.34 \\ \hline + $R_{E-P}$ & LU & 2.15 & 44.97 & 41.00 & 3.97 \\ \hline + $R_{E}$ & LU & 2.15 & 44.97 & 41.80 & 3.17 \\ \hline - EPSA&BT&1.315& 29.59&20.88&8.71\\ \hline - $Rauber_{E-P}$&BT&2.1&44.53&53.05&-8.52\\ \hline - $Rauber_{E}$&BT&2.1&42.93&52.806&-9.876\\ \hline + $EPSA$ & BT & 1.31 & 29.60 & 21.28 & 8.32 \\ \hline + $R_{E-P}$ & BT & 2.13 & 45.60 & 49.84 & -4.24 \\ \hline + $R_{E}$ & BT & 2.13 & 44.90 & 55.16 & -10.26 \\ \hline - EPSA&SP&1.388&33.44&19.24&14.2 \\ \hline - $Rauber_{E-P}$&SP&2.15&45.69&43.2&2.49\\ \hline - $Rauber_{E}$&SP&2.15&45.41&44.47&0.94\\ \hline + $EPSA$ & SP & 1.38 & 33.48 & 21.35 & 12.12 \\ \hline + $R_{E-P}$ & SP & 2.10 & 45.69 & 43.60 & 2.09 \\ \hline + $R_{E}$ & SP & 2.10 & 45.75 & 44.10 & 1.65 \\ \hline - EPSA&FT&1.388&34.4&14.57&19.83 \\ \hline - $Rauber_{E-P}$&FT&2.13&42.98&37.35&5.63 \\ \hline - $Rauber_{E}$&FT&2.13&43.04&37.9&5.14\\ \hline + $EPSA$ & FT & 1.47 & 34.72 & 19.00 & 15.72 \\ \hline + $R_{E-P}$ & FT & 2.04 & 39.40 & 37.10 & 2.30 \\ \hline + $R_{E}$ & FT & 2.04 & 39.35 & 37.70 & 1.65 \\ \hline \end{tabular} - \label{table:compare Class B} + \label{table:compareC} % is used to refer this table in the text \end{table} - -\begin{table}[ht] - \caption{Comparing Results for The NAS Class C} - % title of Table - \centering - \begin{tabular}{ | l | l | l |l | l |l| } - \hline - Method&Program&Factor& Energy& Performance &Energy-Perf.\\ - name &name&value& Saving \%&Degradation \% &Distance - \\ \hline - % \rowcolor[gray]{0.85} - EPSA&CG & 1.56 &39.23&14.88&24.35 \\ \hline - $Rauber_{E-P}$&CG &2.15 &45.36&25.89&19.47\\ \hline - $Rauber_{E}$&CG &2.15 &45.36&26.7&18.66\\ \hline - - EPSA&MG & 1.47 &34.97&21.697&13.273\\ \hline - $Rauber_{E-P}$&MG &2.15&43.65&40.45&3.2 \\ \hline - $Rauber_{E}$&MG &2.15&43.64&41.38&2.26 \\ \hline - - EPSA&EP &1.04 &22.14&20.73&1.41 \\ \hline - $Rauber_{E-P}$&EP&1.92&39.4&56.33&-16.93\\ \hline - $Rauber_{E}$&EP&1.92&38.1&56.35&-18.25\\ \hline - - EPSA&LU&1.388&35.83&22.49&13.34 \\ \hline - $Rauber_{E-P}$&LU&2.15&44.97&41&3.97 \\ \hline - $Rauber_{E}$&LU&2.15&44.97&41.8&3.17 \\ \hline - - EPSA&BT&1.315& 29.6&21.28&8.32\\ \hline - $Rauber_{E-P}$&BT&2.13&45.6&49.84&-4.24\\ \hline - $Rauber_{E}$&BT&2.13&44.9&55.16&-10.26\\ \hline - - EPSA&SP&1.388&33.48&21.35&12.12\\ \hline - $Rauber_{E-P}$&SP&2.1&45.69&43.6&2.09\\ \hline - $Rauber_{E}$&SP&2.1&45.75&44.1&1.65\\ \hline - - EPSA&FT&1.47&34.72&19&15.72 \\ \hline - $Rauber_{E-P}$&FT&2.04&39.4&37.1&2.3\\ \hline - $Rauber_{E}$&FT&2.04&39.35&37.7&1.65\\ \hline - \end{tabular} -\label{table:compare Class C} -% is used to refer this table in the text -\end{table} -%\linespread{1.2} -\clearpage As shown in these tables our scaling factor is not optimal for energy -saving such as Rauber's scaling factor EQ~(\ref{eq:sopt}), but it is optimal for -both the energy and the performance simultaneously. Our EPSA optimal scaling -factors has better simultaneous optimization for both the energy and the -performance compared to Rauber's energy-performance method -($Rauber_{E-P}$). Also, in ($Rauber_{E-P}$) method when setting the frequency to -maximum value for the slower task lead to a small improvement of the -performance. Also the results show that this method keep or improve energy -saving. Because of the energy consumption decrease when the execution time -decreased while the frequency value increased. - -Figure~(\ref{fig:compare}) shows the maximum distance between the energy saving -percent and the performance degradation percent. Therefore, this means it is the -same resultant of our objective function EQ~(\ref{eq:max}). Our algorithm always -gives positive energy to performance trade offs while Rauber's method -($Rauber_{E-P}$) gives in some time negative trade offs such as in BT and -EP. The positive trade offs with highest values lead to maximum energy savings -concatenating with less performance degradation and this the objective of this -paper. While the negative trade offs refers to improving energy saving (or may -be the performance) while degrading the performance (or may be the energy) more -than the first. -\begin{figure}[width=\textwidth,height=\textheight,keepaspectratio] +As shown in Table~\ref{table:compareC}, the ($R_{E-P}$) method outperforms the ($R_{E}$) +method in terms of performance and energy reduction. The ($R_{E-P}$) method +also gives better energy savings than our method. However, although our scaling +factor is not optimal for energy reduction, the results in this table prove +that our algorithm returns the best scaling factor that satisfy our objective +method: the largest distance between energy reduction and performance +degradation. Figure~\ref{fig:compare} illustrates even better the distance between +the energy reduction and performance degradation. The negative values mean that one of +the two objectives (energy or performance) have been degraded more than the +other. The positive trade-offs with the highest values lead to maximum energy +savings while keeping the performance degradation as low as possible. Our +algorithm always gives the highest positive energy to performance trade-offs +while Rauber and Rünger's method, ($R_{E-P}$), gives sometimes negative +trade-offs such as in BT and EP. +\begin{figure}[t] \centering - \includegraphics[scale=0.60]{compare_class_A.pdf} - \includegraphics[scale=0.60]{compare_class_B.pdf} - \includegraphics[scale=0.60]{compare_class_c.pdf} - % use scale 35 for all to be in the same line - \caption{Comparing Our EPSA with Rauber's Methods} +% \includegraphics[width=.328\linewidth]{fig/compare_class_A} +% \includegraphics[width=.328\linewidth]{fig/compare_class_B} + \includegraphics[width=\linewidth]{fig/compare_class_C} + \caption{Comparing our method to Rauber and Rünger's methods} \label{fig:compare} \end{figure} -\clearpage -\bibliographystyle{plain} -\bibliography{my_reference} +\section{Conclusion} +\label{sec.concl} + +In this paper, we have presented a new online scaling factor selection method +that optimizes simultaneously the energy and performance of a distributed +application running on a homogeneous cluster. It uses the computation and +communication times measured at the first iteration to predict energy +consumption and the execution time of the parallel application at every available +frequency. Then, it selects the scaling factor that gives the best trade-off +between energy reduction and performance which is the maximum distance between +the energy and the inverted execution time curves. To evaluate this method, we +have applied it to the NAS benchmarks and it was compared to Rauber and Rünger +methods while being executed on the simulator SimGrid. The results showed that +our method, outperforms Rauber and Rünger's methods in terms of energy-performance +ratio. + +In the near future, we would like to adapt this scaling factor selection method +to heterogeneous platforms where each node has different characteristics. In +particular, each CPU has different available frequencies, energy consumption and +performance. It would be also interesting to develop a new energy model for +asynchronous parallel iterative methods where the number of iterations is not +known in advance and depends on the global convergence of the iterative system. + +\section*{Acknowledgment} + +This work has been partially supported by the Labex ACTION project (contract +``ANR-11-LABX-01-01''). Computations have been performed on the supercomputer +facilities of the Mésocentre de calcul de Franche-Comté. As a PhD student, +Mr. Ahmed Fanfakh, would like to thank the University of Babylon (Iraq) for +supporting his work. + +% trigger a \newpage just before the given reference +% number - used to balance the columns on the last page +% adjust value as needed - may need to be readjusted if +% the document is modified later +%\IEEEtriggeratref{15} + +\newpage +\bibliographystyle{IEEEtran} +\bibliography{IEEEabrv,my_reference} \end{document} %%% Local Variables: @@ -708,3 +712,6 @@ than the first. %%% fill-column: 80 %%% ispell-local-dictionary: "american" %%% End: + +% LocalWords: Fanfakh Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber +% LocalWords: CMOS EQ EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex