X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/mpi-energy.git/blobdiff_plain/9b8155e7b3b8a90231159f4424dbd6b177a140e2..e10a2fae800c41166633eabe0cff9e7815befda7:/paper.tex diff --git a/paper.tex b/paper.tex index 89f43e9..e62908a 100644 --- a/paper.tex +++ b/paper.tex @@ -3,23 +3,58 @@ \usepackage[T1]{fontenc} \usepackage[utf8]{inputenc} \usepackage[english]{babel} -\usepackage{algorithm,algorithmicx,algpseudocode} -\usepackage{graphicx,graphics} +\usepackage{algpseudocode} +\usepackage{graphicx} \usepackage{subfig} -\usepackage{listings} -\usepackage{colortbl} \usepackage{amsmath} +\usepackage{url} +\DeclareUrlCommand\email{\urlstyle{same}} + \usepackage[autolanguage,np]{numprint} -\renewcommand*\npunitcommand[1]{\text{#1}} +\AtBeginDocument{% + \renewcommand*\npunitcommand[1]{\text{#1}} + \npthousandthpartsep{}} \usepackage{xspace} \usepackage[textsize=footnotesize]{todonotes} -\newcommand{\AG}[2][inline]{\todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace} +\newcommand{\AG}[2][inline]{% + \todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace} +\newcommand{\JC}[2][inline]{% + \todo[color=red!10,#1]{\sffamily\textbf{JC:} #2}\xspace} + +\newcommand{\Xsub}[2]{\ensuremath{#1_\textit{#2}}} + +\newcommand{\Dist}{\textit{Dist}} +\newcommand{\Eind}{\Xsub{E}{ind}} +\newcommand{\Enorm}{\Xsub{E}{Norm}} +\newcommand{\Eoriginal}{\Xsub{E}{Original}} +\newcommand{\Ereduced}{\Xsub{E}{Reduced}} +\newcommand{\Fdiff}{\Xsub{F}{diff}} +\newcommand{\Fmax}{\Xsub{F}{max}} +\newcommand{\Fnew}{\Xsub{F}{new}} +\newcommand{\Ileak}{\Xsub{I}{leak}} +\newcommand{\Kdesign}{\Xsub{K}{design}} +\newcommand{\MaxDist}{\textit{Max Dist}} +\newcommand{\Ntrans}{\Xsub{N}{trans}} +\newcommand{\Pdyn}{\Xsub{P}{dyn}} +\newcommand{\PnormInv}{\Xsub{P}{NormInv}} +\newcommand{\Pnorm}{\Xsub{P}{Norm}} +\newcommand{\Tnorm}{\Xsub{T}{Norm}} +\newcommand{\Pstates}{\Xsub{P}{states}} +\newcommand{\Pstatic}{\Xsub{P}{static}} +\newcommand{\Sopt}{\Xsub{S}{opt}} +\newcommand{\Tcomp}{\Xsub{T}{comp}} +\newcommand{\TmaxCommOld}{\Xsub{T}{Max Comm Old}} +\newcommand{\TmaxCompOld}{\Xsub{T}{Max Comp Old}} +\newcommand{\Tmax}{\Xsub{T}{max}} +\newcommand{\Tnew}{\Xsub{T}{New}} +\newcommand{\Told}{\Xsub{T}{Old}} \begin{document} -\title{Optimal Dynamic Frequency Scaling for Energy-Performance of Parallel MPI Programs} +\title{Dynamic Frequency Scaling for Energy Consumption + Reduction in Synchronous Distributed Applications} \author{% \IEEEauthorblockN{% @@ -30,747 +65,635 @@ } \IEEEauthorblockA{% FEMTO-ST Institute\\ - University of Franche-Comté + University of Franche-Comté\\ + IUT de Belfort-Montbéliard, + 19 avenue du Maréchal Juin, BP 527, 90016 Belfort cedex, France\\ + % Telephone: \mbox{+33 3 84 58 77 86}, % Raphaël + % Fax: \mbox{+33 3 84 58 77 81}\\ % Dept Info + Email: \email{{jean-claude.charr,raphael.couturier,ahmed.fanfakh_badri_muslim,arnaud.giersch}@univ-fcomte.fr} + } } -} \maketitle -\AG{``Optimal'' is a bit pretentious in the title.\\ - Complete affiliation, add an email address, etc.} - \begin{abstract} -The important technique for energy reduction of parallel systems is CPU frequency -scaling. This operation used by many researchers to reduce energy consumption in many -ways. Frequency scaling operation also has big impact on the performance. In some cases, -the performance degradation ratio is bigger than energy saving ratio when the frequency scaled -to down level. Therefore, the trade offs between the energy and performance becomes more -important topic when using this technique. 
In this paper we developed an algorithm that -select the frequency scaling factor for both energy and performance simultaneously. -This algorithm takes into account the communication times when selecting the frequency scaling -factor. It is works online without training or profiling to have very small overhead. -The algorithm has better energy-performance trade offs compared to other methods. + Dynamic Voltage Frequency Scaling (DVFS) can be applied to modern CPUs. This + technique is usually used to reduce the energy consumed by a CPU while + computing. Thus, decreasing the frequency + reduces the power consumed by the CPU. However, it can also significantly + affect the performance of the executed program if it is compute bound and if a + low CPU frequency is selected. Therefore, the chosen scaling factor must + give the best possible trade-off between energy reduction and performance. + + In this paper we present an algorithm that predicts the energy consumed with + each frequency gear and selects the one that gives the best ratio between + energy consumption reduction and performance. This algorithm works online + without training or profiling and has a very small overhead. It also takes + into account synchronous communications between the nodes that are executing + the distributed algorithm. The algorithm has been evaluated over the SimGrid + simulator while being applied to the NAS parallel benchmark programs. The + results of the experiments show that it outperforms other existing scaling + factor selection algorithms. \end{abstract} \section{Introduction} \label{sec.intro} -The need for computing power is still increasing and it is not expected to slow -down in the coming years. To satisfy this demand, researchers and supercomputers -constructors have been regularly increasing the number of computing cores in -supercomputers (for example in November 2013, according to the TOP500 -list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more than 3 -millions of cores and delivers more than 33 Tflop/s while consuming 17808 -kW). This large increase in number of computing cores has led to large energy -consumption by these architectures. Moreover, the price of energy is expected to -continue its ascent according to the demand. For all these reasons energy -reduction became an important topic in the high performance computing field. To -tackle this problem, many researchers used DVFS (Dynamic Voltage Frequency -Scaling) operations which reduce dynamically the frequency and voltage of cores -and thus their energy consumption. However, this operation also degrades the -performance of computation. Therefore researchers try to reduce the frequency to -minimum when processors are idle (waiting for data from other processors or -communicating with other processors). Moreover, depending on their objectives -they use heuristics to find the best scaling factor during the computation. If -they aim for performance they choose the best scaling factor that reduces the -consumed energy while affecting as little as possible the performance. On the -other hand, if they aim for energy reduction, the chosen scaling factor must -produce the most energy efficient execution without considering the degradation -of the performance. It is important to notice that lowering the frequency to -minimum value does not always give the most efficient execution due to energy -leakage. The best scaling factor might be chosen during execution (online) or -during a pre-execution phase. 
In this paper we emphasize to develop an -algorithm that selects a frequency scaling factor that simultaneously takes into -consideration the energy consumption and the performance. The -main objective of HPC systems is to run the application with less execution -time. Therefore, our algorithm selects the scaling factor online with -very small footprint. The proposed algorithm takes into account the -communication times of the MPI program to choose the scaling factor. This -algorithm has ability to predict both energy consumption and execution time over -all available scaling factors. The prediction achieved depends on some -computing time information, gathered at the beginning of the runtime. We apply -this algorithm to seven MPI benchmarks. These MPI programs are the NAS parallel -benchmarks (NPB v3.3) developed by NASA~\cite{44}. Our experiments are executed -using the simulator SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183} -over an homogeneous distributed memory architecture. Furthermore, we compare the -proposed algorithm with Rauber and Rünger methods~\cite{3}. -The comparison's results show that our -algorithm gives better energy-time trade off. - -This paper is organized as follows: Section~\ref{sec.relwork} presents the works from other authors. -Section~\ref{sec.ptasks} shows the execution of parallel tasks and sources of idle times. Section~\ref{sec.energy} resumes the -energy model of homogeneous platform. Section~\ref{sec.mpip} evaluates the performance of MPI program. -Section~\ref{sec.verif} verifies the performance prediction model. Section~\ref{sec.compet} presents -the energy-performance trade offs objective function. Section~\ref{sec.optim} demonstrates the proposed -energy-performance algorithm. Section~\ref{sec.expe} presents the results of our experiments. -Section~\ref{sec.compare} shows the comparison results. Finally, we conclude in Section~\ref{sec.concl}. - -\section{Related Works} +The need and demand for more computing power have been increasing since the +birth of the first computing unit and it is not expected to slow down in the +coming years. To satisfy this demand, researchers and supercomputers +constructors have been regularly increasing the number of computing cores and +processors in supercomputers (for example in November 2013, according to the +TOP500 list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more +than 3 million of cores and delivers more than \np[Tflop/s]{33} while consuming +\np[kW]{17808}). This large increase in number of computing cores has led to +large energy consumption by these architectures. Moreover, the price of energy +is expected to continue its ascent according to the demand. For all these +reasons energy reduction has become an important topic in the high performance +computing field. To tackle this problem, many researchers use DVFS (Dynamic +Voltage Frequency Scaling) operations which reduce dynamically the frequency and +voltage of cores and thus their energy consumption. Indeed, modern CPUs offer a +set of acceptable frequencies which are usually called gears, and the user or +the operating system can modify the frequency of the processor according to its +needs. However, DVFS also degrades the performance of computation. Therefore +researchers try to reduce the frequency to the minimum when processors are idle +(waiting for data from other processors or communicating with other processors). 
+Moreover, depending on their objectives, they use heuristics to find the best +scaling factor during the computation. If they aim for performance they choose +the best scaling factor that reduces the consumed energy while affecting as +little as possible the performance. On the other hand, if they aim for energy +reduction, the chosen scaling factor must produce the most energy efficient +execution without considering the degradation of the performance. It is +important to notice that lowering the frequency to the minimum value does not always +give the most energy efficient execution due to energy leakage. The best +scaling factor might be chosen during execution (online) or during a +pre-execution phase. In this paper, we present an algorithm that selects a +frequency scaling factor that simultaneously takes into consideration the energy +consumption by the CPU and the performance of the application. The main +objective of HPC systems is to execute as fast as possible the application. +Therefore, our algorithm selects the scaling factor online with very small +overhead. The proposed algorithm takes into account the communication times of +the MPI program to choose the scaling factor. This algorithm has the ability to +predict both energy consumption and execution time over all available scaling +factors. The prediction achieved depends on some computing time information, +gathered at the beginning of the runtime. We apply this algorithm to the NAS parallel benchmarks (NPB v3.3)~\cite{44}. Our experiments are executed using the simulator +SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183} over a homogeneous +distributed memory architecture. Furthermore, we compare the proposed algorithm +with Rauber and Rünger methods~\cite{3}. The comparison's results show that our +algorithm gives better energy-time trade-off. + +This paper is organized as follows: Section~\ref{sec.relwork} presents some +related works from other authors. Section~\ref{sec.exe} presents an energy +model for homogeneous platforms. Section~\ref{sec.mpip} describes how the +performance of MPI programs can be predicted. Section~\ref{sec.compet} presents +the energy-performance objective function that maximizes the reduction of energy +consumption while minimizing the degradation of the program's performance. +Section~\ref{sec.optim} details the proposed energy-performance algorithm. +Section~\ref{sec.expe} verifies the accuracy of the performance prediction model +and presents the results of the proposed algorithm. It also shows the +comparison results between our method and other existing methods. Finally, we +conclude in Section~\ref{sec.concl} with a summary and some future works. + +\section{Related works} \label{sec.relwork} -\AG{Consider introducing the models (sec.~\ref{sec.ptasks}, - maybe~\ref{sec.energy}) before related works} - -In the this section some heuristics to compute the scaling factor are -presented and classified in two parts: offline and online methods. - -\subsection{The offline DVFS orientations} - -The DVFS offline methods are static and are not executed during the runtime of -the program. Some approaches used heuristics to select the best DVFS state -during the compilation phases as for example in Azevedo et al.~\cite{40}. They use -dynamic voltage scaling (DVS) algorithm to choose the DVS setting when there are dependency points -between tasks. While in~\cite{29}, Xie et al. used breadth-first search -algorithm to do that. Their goal is to save energy with time limits. 
Another -approach gathers and stores the runtime information for each DVFS state, then -selects the suitable DVFS offline to optimize energy-time -trade offs. As an example Rountree et al.~\cite{8}, use liner programming -algorithm, while in~\cite{38,34}, Cochran et al. use multi logistic regression -algorithm for the same goal. The offline study that shows the DVFS impact on the -communication time of the MPI program is~\cite{17}, where Freeh et al. show that these -times do not change when the frequency is scaled down. - -\subsection{The online DVFS orientations} - -The objective of online DVFS orientations works is to dynamically compute and set the frequency of -the CPU during the runtime of the program for saving energy. Estimating and -predicting approaches for the energy-time trade offs are developed by Kimura, Peraza, Yu-Liang et al. -~\cite{11,2,31}. These works select the best DVFS setting depending on the slack -times. These times happen when the processors have to wait for data from other -processors to compute their task. For example, during the synchronous -communications that take place in MPI programs, some processors are -idle. The optimal DVFS can be selected using learning methods. Therefore, in Dhiman, Hao Shen et al. -~\cite{39,19} used machine learning to converge to the suitable DVFS -configuration. Their learning algorithms take big time to converge when the -number of available frequencies is high. Also, the communication sections of the MPI -program can be used to save energy. In~\cite{1}, Lim et al. developed an -algorithm that detects the communication sections and changes the frequency -during these sections only. This approach changes the frequency many times -because an iteration may contain more than one communication section. The domain -of analytical modeling used for choosing the optimal frequency as in Rauber and Rünger~\cite{3}. they -developed an analytical mathematical model to determine the -optimal frequency scaling factor for any number of concurrent tasks. They set the slowest task to maximum frequency for maintaining performance. In this paper we compare our algorithm with -Rauber and Rünger model~\cite{3}, because their model can be used for any number of -concurrent tasks for homogeneous platforms. The primary contributions of this paper are: -\begin{enumerate} -\item Selecting the frequency scaling factor for simultaneously optimizing energy and performance, - while taking into account the communication time. -\item Adapting our scaling factor to take into account the imbalanced tasks. -\item The execution time of our algorithm is very small when compared to other - methods (e.g.,~\cite{19}). -\item The proposed algorithm works online without profiling or training as - in~\cite{38,34}. -\end{enumerate} - -\section{Parallel Tasks Execution on Homogeneous Platform} -\label{sec.ptasks} - -A homogeneous cluster consists of identical nodes in terms of hardware and software. -Each node has its own memory and at least one processor which can -be a multi-core. The nodes are connected via a high bandwidth network. Tasks -executed on this model can be either synchronous or asynchronous. In this paper -we consider execution of the synchronous tasks on distributed homogeneous -platform. These tasks can exchange the data via synchronous message passing. -\begin{figure*}[t] - \centering - \subfloat[Sync. Imbalanced Communications]{\includegraphics[scale=0.67]{commtasks}\label{fig:h1}} - \subfloat[Sync. 
Imbalanced Computations]{\includegraphics[scale=0.67]{compt}\label{fig:h2}} - \caption{Parallel Tasks on Homogeneous Platform} - \label{fig:homo} -\end{figure*} -Therefore, the execution time of a task consists of the computation time and the -communication time. Moreover, the synchronous communications between tasks can -lead to idle time while tasks wait at the synchronization barrier for other tasks to -finish their communications (see figure~(\ref{fig:h1})). The imbalanced communications happen when nodes have to send/receive different amount of data or each node is communicates with different number of nodes. Another source for idle times is the imbalanced computations. This happen when processing different -amounts of data on each processor (see figure~(\ref{fig:h2})). In -this case the fastest tasks have to wait at the synchronization barrier for the -slowest tasks to finish their job. In both cases the overall execution time -of the program is the execution time of the slowest task as: -\begin{equation} - \label{eq:T1} - \textit{Program Time} = \max_{i=1,2,\dots,N} T_i -\end{equation} -where $T_i$ is the execution time of task $i$. -\section{Energy Model for Homogeneous Platform} -\label{sec.energy} - -The energy consumption by the processor consists of two power metrics: the -dynamic and the static power. This general power formulation is used by many -researchers~\cite{9,3,15,26}. The dynamic power of the CMOS processors -$P_{dyn}$ is related to the switching activity $\alpha$, load capacitance $C_L$, -the supply voltage $V$ and operational frequency $f$ respectively as follow: +In this section, some heuristics to compute the scaling factor are presented and +classified into two categories: offline and online methods. + +\subsection{Offline scaling factor selection methods} + +The offline scaling factor selection methods are executed before the runtime of +the program. They return static scaling factor values to the processors +participating in the execution of the parallel program. On the one hand, the +scaling factor values could be computed based on information retrieved by +analyzing the code of the program and the computing system that will execute it. +In~\cite{40}, Azevedo et al. detect during compilation the dependency points +between tasks in a multi-task program. This information is then used to lower +the frequency of some processors in order to eliminate slack times. A slack +time is the period of time during which a processor that has already finished +its computation, has to wait for a set of processors to finish their +computations and send their results to the waiting processor in order to +continue its task that is dependent on the results of computations being +executed on other processors. Freeh et al. showed in~\cite{17} that the +communication times of MPI programs do not change when the frequency is scaled +down. On the other hand, some offline scaling factor selection methods use the +information gathered from previous full or partial executions of the program. The whole program or, a +part of it, is usually executed over all the available frequency +gears and the execution time and the energy consumed with each frequency +gear are measured. Then a heuristic or an exact method uses the retrieved +information to compute the values of the scaling factor for the processors. +In~\cite{8} , Rountree et al. use a linear programming algorithm, while in~\cite{34}, Cochran et +al. use a multi-logistic regression algorithm for the same goal. 
The main
+drawback of these methods is that they all require executing the
+whole program, or a part of it, on all frequency gears for each new instance of the same program.
+
+\subsection{Online scaling factor selection methods}
+
+The online scaling factor selection methods are executed during the runtime of
+the program. They are usually integrated into iterative programs where the same
+block of instructions is executed many times. During the first few iterations,
+a lot of information is measured such as the execution time, the energy consumed
+using a multimeter, the slack times, \dots{} Then a method exploits these
+measurements to compute the scaling factor values for each processor. This
+operation, measuring and computing a new scaling factor, can be repeated as
+often as needed if the iterations are not regular. Peraza, Yu-Liang et
+al.~\cite{2,31} used various heuristics to select the appropriate scaling
+factor values to eliminate the slack times during runtime. However, as seen
+in~\cite{19}, machine learning methods take a lot of time to converge
+when the number of available gears is big. To reduce the impact of slack times,
+in~\cite{1}, Lim et al. developed an algorithm that detects the communication
+sections and changes the frequency during these sections only. This approach
+might change the frequency of each processor many times per iteration if an
+iteration contains more than one communication section. In~\cite{3}, Rauber and
+Rünger used an analytical model that, once the consumed energy has been measured,
+can predict the energy consumption for every frequency gear. They
+maintain the performance as much as possible by assigning the highest frequency gear to the slowest task.
+
+The primary contribution of
+our paper is to present a new online scaling factor selection method which has the
+following characteristics:\\
+1) It is based on Rauber and Rünger's analytical model to predict the energy
+  consumption of the application with different frequency gears.
+2) It selects the frequency scaling factor for simultaneously optimizing
+  energy reduction and maintaining performance.
+3) It is well adapted to distributed architectures because it takes into
+  account the communication time.
+4) It is well adapted to distributed applications with imbalanced tasks.
+5) It has a very small overhead when compared to other methods
+  (e.g.,~\cite{19}) and does not require profiling or training as
+  in~\cite{34}.
+
+
+% \JC{The whole subsection ``Parallel Tasks Execution on Homogeneous Platform'',
+% can be deleted if we need space, we can just say we are interested in this
+% paper in homogeneous clusters}
+
+
+\section{Energy model for a homogeneous platform}
+\label{sec.exe}
+Many researchers~\cite{9,3,15,26} divide the power consumed by a processor into
+two power metrics: the static and the dynamic power. While the first one is
+consumed as long as the computing unit is on, the latter is only consumed during
+computation times. The dynamic power $\Pdyn$ is related to the switching
+activity $\alpha$, load capacitance $C_L$, the supply voltage $V$ and
+operational frequency $f$, as shown in EQ~\eqref{eq:pd}.
 \begin{equation}
   \label{eq:pd}
-  P_\textit{dyn} = \alpha \cdot C_L \cdot V^2 \cdot f
+  \Pdyn = \alpha \cdot C_L \cdot V^2 \cdot f
 \end{equation}
-The static power $P_{static}$ captures the leakage power consumption as well as
-the power consumption of peripheral devices like the I/O subsystem.
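+As an illustration (assuming, as discussed below, that the supply voltage scales
+linearly with the frequency, $V = \beta \cdot f$), EQ~\eqref{eq:pd} becomes
+\begin{equation*}
+  \Pdyn = \alpha \cdot C_L \cdot \beta^2 \cdot f^3 ,
+\end{equation*}
+so halving the frequency divides the dynamic power by roughly eight while the
+computation time is only doubled, which is why frequency scaling mainly reduces
+the dynamic energy.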
+The static power $\Pstatic$ captures the leakage power as follows: \begin{equation} \label{eq:ps} - P_\textit{static} = V \cdot N \cdot K_{design} \cdot I_{leak} + \Pstatic = V \cdot \Ntrans \cdot \Kdesign \cdot \Ileak \end{equation} -where V is the supply voltage, N is the number of transistors, $K_{design}$ is a -design dependent parameter and $I_{leak}$ is a technology-dependent -parameter. Energy consumed by an individual processor $E_{ind}$ is the summation -of the dynamic and the static power multiplied by the execution time for example -see~\cite{36,15}. +where V is the supply voltage, $\Ntrans$ is the number of transistors, +$\Kdesign$ is a design dependent parameter and $\Ileak$ is a +technology-dependent parameter. The energy consumed by an individual processor +to execute a given program can be computed as: \begin{equation} \label{eq:eind} - E_\textit{ind} = ( P_\textit{dyn} + P_\textit{static} ) \cdot T + \Eind = \Pdyn \cdot \Tcomp + \Pstatic \cdot T \end{equation} -The dynamic voltage and frequency scaling (DVFS) is a process that is allowed in -modern processors to reduce the dynamic power by scaling down the voltage and -frequency. Its main objective is to reduce the overall energy -consumption~\cite{37}. The operational frequency \emph f depends linearly on the -supply voltage $V$, i.e., $V = \beta \cdot f$ with some constant $\beta$. This -equation is used to study the change of the dynamic voltage with respect to -various frequency values in~\cite{3}. The reduction process of the frequency are -expressed by scaling factor \emph S. The scale \emph S is the ratio between the -maximum and the new frequency as in EQ~(\ref{eq:s}). +where $T$ is the execution time of the program, $\Tcomp$ is the computation +time and $\Tcomp \leq T$. $\Tcomp$ may be equal to $T$ if there is no +communication, no slack time and no synchronization. + +DVFS is a process that is allowed in modern processors to reduce the dynamic +power by scaling down the voltage and frequency. Its main objective is to +reduce the overall energy consumption~\cite{37}. The operational frequency $f$ +depends linearly on the supply voltage $V$, i.e., $V = \beta \cdot f$ with some +constant $\beta$. This equation is used to study the change of the dynamic +voltage with respect to various frequency values in~\cite{3}. The reduction +process of the frequency can be expressed by the scaling factor $S$ which is the +ratio between the maximum and the new frequency as in EQ~\eqref{eq:s}. \begin{equation} \label{eq:s} - S = \frac{F_\textit{max}}{F_\textit{new}} + S = \frac{\Fmax}{\Fnew} \end{equation} -The value of the scale $S$ is greater than 1 when changing the frequency to -any new frequency value~(\emph {P-state}) in governor, the CPU governor is an interface -driver supplied by the operating system kernel (e.g. Linux) to lowering core's frequency. -The scaling factor is equal to 1 when the frequency set is to the maximum frequency. -The energy consumption model for parallel homogeneous platform depends on the scaling factor \emph S. This factor reduces quadratically the dynamic power. Also, this factor increases the -static energy linearly because the execution time is increased~\cite{36}. The -energy model depending on the frequency scaling factor for homogeneous platform -for any number of concurrent tasks was developed by Rauber and Rünger~\cite{3}. 
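+For example, on the homogeneous cluster simulated in Section~\ref{sec.expe},
+where the highest available frequency is \np[GHz]{2.5}, selecting the
+\np[GHz]{2} gear gives a scaling factor $S = 2.5/2 = 1.25$, while $S = 1$
+corresponds to running at the maximum frequency.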
This model
-considers the two power metrics for measuring the energy of the parallel tasks as
-in EQ~(\ref{eq:energy}):
+The value of the scaling factor $S$ is greater than 1 when the frequency of the
+CPU is changed to any new frequency value~(\emph{P-state}) in the governor.
+Increasing this factor reduces the dynamic energy quadratically, but it may also
+degrade the performance and thus increase the static energy, because the
+execution time is increased~\cite{36}.
+If the tasks are sorted according to their execution times before scaling in a
+descending order, the total energy consumption model for a parallel homogeneous
+platform, as presented by Rauber and Rünger~\cite{3}, can be written as a
+function of the scaling factor $S$, as in EQ~\eqref{eq:energy}.
 \begin{equation}
   \label{eq:energy}
-  E = P_\textit{dyn} \cdot S_1^{-2} \cdot
+  E = \Pdyn \cdot S_1^{-2} \cdot
     \left( T_1 + \sum_{i=2}^{N} \frac{T_i^3}{T_1^2} \right) +
-    P_\textit{static} \cdot T_1 \cdot S_1 \cdot N
- \hfill
-\end{equation}
-where \emph N is the number of parallel nodes, $T_1 $ is the time of the slowest
-task, $T_i$ is the time of the task $i$ and $S_1$ is the maximum scaling factor
-for the slower task. The scaling factor $S_1$, as in EQ~(\ref{eq:s1}), selects
-from the set of scales values $S_i$. Each of these scales are proportional to
-the time value $T_i$ depends on the new frequency value as in EQ~(\ref{eq:si}).
-\begin{equation}
-  \label{eq:s1}
-  S_1 = \max_{i=1,2,\dots,F} S_i
+    \Pstatic \cdot T_1 \cdot S_1 \cdot N
 \end{equation}
+where $N$ is the number of parallel nodes and $T_i$, for $i=1,\dots,N$, are
+the execution times of the sorted tasks. Therefore, $T_1$ is
+the time of the slowest task and $S_1$ its scaling factor, which is the highest
+one since the scaling factors are proportional to the execution times $T_i$. The
+scaling factors $S_i$ are computed as in EQ~\eqref{eq:si}.
 \begin{equation}
   \label{eq:si}
   S_i = S \cdot \frac{T_1}{T_i}
-      = \frac{F_\textit{max}}{F_\textit{new}} \cdot \frac{T_1}{T_i}
+      = \frac{\Fmax}{\Fnew} \cdot \frac{T_1}{T_i}
 \end{equation}
-where $F$ is the number of available frequencies. In this paper we depend on
-Rauber and Rünger energy model EQ~(\ref{eq:energy}) for two reasons: (1)-this model is used
-for homogeneous platform that we work on in this paper. 2-we compare our
-algorithm with Rauber and Rünger scaling model. Rauber and Rünger scaling factor that reduce
- energy consumption derived from the EQ~(\ref{eq:energy}). They take the
-derivation for this equation (to be minimized) and set it to zero to produce the
-scaling factor as in EQ~(\ref{eq:sopt}).
+In this paper we use Rauber and Rünger's energy model, EQ~\eqref{eq:energy}, because
+it can be applied to homogeneous clusters if the communication time is taken into
+consideration. Moreover, we compare our algorithm with Rauber and Rünger's scaling
+factor selection method which uses the same energy model. In their method, the
+optimal scaling factor is computed by setting the derivative of
+EQ~\eqref{eq:energy} with respect to $S_1$ to zero, which produces
+EQ~\eqref{eq:sopt}.
+ \begin{equation} \label{eq:sopt} - S_\textit{opt} = \sqrt[3]{\frac{2}{n} \cdot \frac{P_\textit{dyn}}{P_\textit{static}} \cdot + \Sopt = \sqrt[3]{\frac{2}{N} \cdot \frac{\Pdyn}{\Pstatic} \cdot \left( 1 + \sum_{i=2}^{N} \frac{T_i^3}{T_1^3} \right) } \end{equation} -\section{Performance Evaluation of MPI Programs} + +\section{Performance evaluation of MPI programs} \label{sec.mpip} -The performance (execution time) of parallel MPI applications depend on -the time of the slowest task as in figure~(\ref{fig:homo}). Normally the -execution time of the parallel programs are proportional to the operational -frequency. Therefore, any DVFS operation for the energy reduction increases the -execution time of the parallel program. As shown in EQ~(\ref{eq:energy}) the -energy is affected by the scaling factor $S$. This factor also has a great impact -on the performance. When scaling down the frequency to the new value according -to EQ~(\ref{eq:s}), the value of the scale $S$ has inverse relation with -new frequency value ($S \propto \frac{1}{F_{new}}$). Also when decreasing the -frequency value, the execution time increases. Then the new frequency value has -inverse relation with time ($F_{new} \propto \frac{1}{T}$). This leads to the -frequency scaling factor $S$ proportional linearly with execution time ($S -\propto T$). Large scale MPI applications such as NAS benchmarks have -considerable amount of communications embedded in these programs. During the -communication process the processors remain idle until the communication has -finished. For that reason any change in the frequency has no impact on the time -of communication but it has obvious impact on the time of -computation~\cite{17}. We have made many tests on a real cluster to prove that the -frequency scaling factor \emph S has a linear relation with computation time -only. To predict the execution time of MPI program, the communication time and -the computation time for the slower task must be first precisely specified. Secondly, -these times are used to predict the execution time for any MPI program as a function of -the new scaling factor as in the EQ~(\ref{eq:tnew}). +The execution time of a parallel synchronous iterative application is +equal to the execution time of the slowest task. If there is no +communication and the application is not data bounded, the execution time of a +parallel program is linearly proportional to the operational frequency and any +DVFS operation for energy reduction increases the execution time of the parallel +program. Therefore, the scaling factor $S$ is linearly proportional to the +execution time. However, in most MPI applications the processes exchange +data. During these communications the processors involved remain idle until the +communications are finished. For that reason, any change in the frequency has no +impact on the time of communication~\cite{17}. The communication time for a +task is the summation of periods of time that begin with an MPI call for sending +or receiving a message till the message is synchronously sent or received. To +be able to predict the execution time of MPI program, the communication time and +the computation time for the slowest task must be measured before scaling. These +times are used to predict the execution time for any MPI program as a function +of the new scaling factor as in EQ~\eqref{eq:tnew}. 
\begin{equation} \label{eq:tnew} - \textit T_\textit{new} = T_\textit{Max Comp Old} \cdot S + T_{\textit{Max Comm Old}} + \Tnew = \TmaxCompOld \cdot S + \TmaxCommOld \end{equation} -The above equation shows that the scaling factor \emph S has linear relation -with the computation time without affecting the communication time. The -communication time consists of the beginning times which an MPI calls for -sending or receiving till the message is synchronously sent or received. In this -paper we predict the execution time of the program for any new scaling factor -value. Depending on this prediction we can produce our energy-performance scaling -method as we will show in the coming sections. In the next section we make an -investigation study for the EQ~(\ref{eq:tnew}). - -\section{Performance Prediction Verification} -\label{sec.verif} - -In this section we evaluate the precision of our performance prediction methods -on the NAS benchmark. We use the EQ~(\ref{eq:tnew}) that predicts the execution -time for any scale value. The NAS programs run the class B for comparing the -real execution time with the predicted execution time. Each program runs offline -with all available scaling factors on 8 or 9 nodes to produce real execution -time values. These scaling factors are computed by dividing the maximum -frequency by the new one see EQ~(\ref{eq:s}). In all tests, we use the simulator -SimGrid/SMPI v3.10 to run the NAS programs. -\begin{figure*}[t] - \centering - \includegraphics[width=.4\textwidth]{cg_per.eps}\qquad% - \includegraphics[width=.4\textwidth]{mg_pre.eps} - \includegraphics[width=.4\textwidth]{bt_pre.eps}\qquad% - \includegraphics[width=.4\textwidth]{lu_pre.eps} - \caption{Fitting Predicted to Real Execution Time} - \label{fig:pred} -\end{figure*} -%see Figure~\ref{fig:pred} -In our cluster there are 18 available frequency states for each processor from -2.5 GHz to 800 MHz, there is 100 MHz difference between two successive -frequencies. For more details on the characteristics of the platform refer to -table~(\ref{table:platform}). This lead to 18 run states for each program. We -use seven MPI programs of the NAS parallel benchmarks: CG, MG, EP, FT, BT, LU -and SP. The average normalized errors between the predicted execution time and -the real time (SimGrid time) for all programs is between 0.0032 to 0.0133. AS an -example, we are present the execution times of the NAS benchmarks as in the -figure~(\ref{fig:pred}). - -\section{Performance to Energy Competition} +In this paper, this prediction method is used to select the best scaling factor +for each processor as presented in the next section. + +\section{Performance and energy reduction trade-off} \label{sec.compet} -This section demonstrates our approach for choosing the optimal scaling -factor. This factor gives maximum energy reduction taking into account the -execution time for both computation and communication times. The relation -between the energy and the performance are nonlinear and complex, because the -relation of the energy with scaling factor is nonlinear and with the performance -it is linear see~\cite{17}. The relation between the energy and the performance -is not straightforward. Moreover, they are not measured using the same metric. 
-For solving this problem, we normalize the energy by calculating the ratio -between the consumed energy with scaled frequency and the consumed energy -without scaled frequency: +This section presents our method for choosing the optimal scaling factor that +gives the best tradeoff between energy reduction and performance. This method +takes into account the execution times for both computation and communication to +compute the scaling factor. Since the energy consumption and the performance +are not measured using the same metric, a normalized value of both measurements +can be used to compare them. The normalized energy is the ratio between the +consumed energy with scaled frequency and the consumed energy without scaled +frequency: \begin{multline} \label{eq:enorm} - E_\textit{Norm} = \frac{ E_\textit{Reduced}}{E_\textit{Original}} \\ - {} = \frac{P_\textit{dyn} \cdot S_i^{-2} \cdot - \left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) + - P_\textit{static} \cdot T_1 \cdot S_i \cdot N }{ - P_\textit{dyn} \cdot \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) + - P_\textit{static} \cdot T_1 \cdot N } + \Enorm = \frac{ \Ereduced}{\Eoriginal} \\ + {} = \frac{\Pdyn \cdot S_1^{-2} \cdot + \left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) + + \Pstatic \cdot T_1 \cdot S_1 \cdot N}{ + \Pdyn \cdot \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) + + \Pstatic \cdot T_1 \cdot N } \end{multline} -By the same way we can normalize the performance as follows: +In the same way, the normalized execution time of a program is computed as follows: \begin{equation} \label{eq:pnorm} - P_\textit{Norm} = \frac{T_\textit{New}}{T_\textit{Old}} - = \frac{T_\textit{Max Comp Old} \cdot S + - T_\textit{Max Comm Old}}{ T_\textit{Old}} + \Tnorm = \frac{\Tnew}{\Told} + = \frac{\TmaxCompOld \cdot S + \TmaxCommOld}{ + \TmaxCompOld + \TmaxCommOld} \end{equation} -The second problem is the optimization operation for both energy and performance -is not in the same direction. In other words, the normalized energy and the -performance curves are not in the same direction see figure~(\ref{fig:r2}). -While the main goal is to optimize the energy and performance in the same -time. According to the equations~(\ref{eq:enorm}) and~(\ref{eq:pnorm}) the -scaling factor \emph S reduce both the energy and the performance -simultaneously. But the main objective is to produce maximum energy reduction -with minimum performance reduction. Many researchers used different strategies -to solve this nonlinear problem for example see~\cite{19,42}, their methods add -big overhead to the algorithm for selecting the suitable frequency. In this -paper we are present a method to find the optimal scaling factor \emph S for -optimize both energy and performance simultaneously without adding big -overheads. Our solution for this problem is to make the optimization process -have the same direction. Therefore, we inverse the equation of normalize -performance as follows: +The relation between the execution time and the consumed energy of a program is nonlinear and complex. In consequences, the relation between the consumed energy and the scaling factor is also nonlinear, for more details refer to~\cite{17}. Therefore, the resulting normalized energy consumption curve and execution time curve, for different scaling factors, do not have the same direction see Figure~\ref{fig:rel}\subref{fig:r2}. 
To tackle this problem and optimize both terms, we inverse the equation of the normalized execution time as follows: \begin{equation} \label{eq:pnorm_en} - P^{-1}_\textit{Norm} = \frac{ T_\textit{Old}}{ T_\textit{New}} - = \frac{ T_\textit{Old}}{T_\textit{Max Comp Old} \cdot S + - T_\textit{Max Comm Old}} + \Pnorm = \frac{ \Told}{ \Tnew} + = \frac{\TmaxCompOld + + \TmaxCommOld}{\TmaxCompOld \cdot S + + \TmaxCommOld} \end{equation} -\begin{figure*} +\begin{figure} \centering - \subfloat[Converted Relation.]{% - \includegraphics[width=.4\textwidth]{file.eps}\label{fig:r1}}% - \qquad% - \subfloat[Real Relation.]{% - \includegraphics[width=.4\textwidth]{file3.eps}\label{fig:r2}} + \subfloat[Real relation.]{% + \includegraphics[width=.5\linewidth]{fig/file3}\label{fig:r2}}% + \subfloat[Converted relation.]{% + \includegraphics[width=.5\linewidth]{fig/file}\label{fig:r1}} + \caption{The energy and performance relation} \label{fig:rel} - \caption{The Energy and Performance Relation} -\end{figure*} -Then, we can modelize our objective function as finding the maximum distance -between the energy curve EQ~(\ref{eq:enorm}) and the inverse of performance -curve EQ~(\ref{eq:pnorm_en}) over all available scaling factors. This represent -the minimum energy consumption with minimum execution time (better performance) -in the same time, see figure~(\ref{fig:r1}). Then our objective function has the -following form: +\end{figure} +Then, we can model our objective function as finding the maximum distance +between the energy curve EQ~\eqref{eq:enorm} and the inverse of the execution time (performance) +curve EQ~\eqref{eq:pnorm_en} over all available scaling factors. This +represents the minimum energy consumption with minimum execution time (better +performance) at the same time, see Figure~\ref{fig:rel}\subref{fig:r1}. Then +our objective function has the following form: \begin{equation} \label{eq:max} - \textit{MaxDist} = \max (\overbrace{P^{-1}_\textit{Norm}}^{\text{Maximize}} - - \overbrace{E_\textit{Norm}}^{\text{Minimize}} ) + \MaxDist = \max_{j=1,2,\dots,F} + (\overbrace{\Pnorm(S_j)}^{\text{Maximize}} - + \overbrace{\Enorm(S_j)}^{\text{Minimize}} ) \end{equation} -Then we can select the optimal scaling factor that satisfy the -EQ~(\ref{eq:max}). Our objective function can works with any energy model or -static power values stored in a data file. Moreover, this function works in -optimal way when the energy function has a convex form with frequency scaling -factor as shown in~\cite{15,3,19}. Energy measurement model is not the -objective of this paper and we choose Rauber and Rünger model as an example with two -reasons that mentioned before. - -\section{Optimal Scaling Factor for Performance and Energy} +where $F$ is the number of available frequencies. Then we can select the optimal +scaling factor that satisfies EQ~\eqref{eq:max}. Our objective function can +work with any energy model or static power values stored in a data file. +Moreover, this function works in optimal way when the energy curve has a convex +form over the available frequency scaling factors as shown in~\cite{15,3,19}. + +\section{Optimal scaling factor for performance and energy} \label{sec.optim} -In the previous section we described the objective function that satisfy our -goal in discovering optimal scaling factor for both performance and energy at -the same time. Therefore, we develop an energy to performance scaling algorithm -($EPSA$). 
This algorithm is simple and has a direct way to calculate the optimal -scaling factor for both energy and performance at the same time. -\begin{algorithm}[tp] - \caption{EPSA} +Algorithm on Figure~\ref{EPSA} computes the optimal scaling factor according to +the objective function described above. +\begin{figure}[tp] + \begin{algorithmic}[1] + % \footnotesize + \Require ~ + \begin{description} + \item[$\Pstatic$] static power value + \item[$\Pdyn$] dynamic power value + \item[$\Pstates$] number of available frequencies + \item[$\Fmax$] maximum frequency + \item[$\Fdiff$] difference between two successive freq. + \end{description} + \Ensure $\Sopt$ is the optimal scaling factor + + \State $\Sopt \gets 1$ + \State $\Dist \gets 0$ + \State $\Fnew \gets \Fmax$ + \For {$j = 2$ to $\Pstates$} + \State $\Fnew \gets \Fnew - \Fdiff$ + \State $S \gets \Fmax / \Fnew$ + \State $S_i \gets S \cdot \frac{T_1}{T_i} + = \frac{\Fmax}{\Fnew} \cdot \frac{T_1}{T_i}$ + for $i=1,\dots,N$ + \State $\Enorm \gets + \frac{\Pdyn \cdot S_1^{-2} \cdot + \left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) + + \Pstatic \cdot T_1 \cdot S_1 \cdot N }{ + \Pdyn \cdot + \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) + + \Pstatic \cdot T_1 \cdot N }$ + \State $\Pnorm \gets \Told / \Tnew$ + \If{$(\Pnorm - \Enorm > \Dist)$} + \State $\Sopt \gets S$ + \State $\Dist \gets \Pnorm - \Enorm$ + \EndIf + \EndFor + \State Return $\Sopt$ + \end{algorithmic} + \caption{Scaling factor selection algorithm} \label{EPSA} +\end{figure} + +The proposed algorithm works online during the execution time of the MPI +program. It selects the optimal scaling factor after gathering the computation +and communication times from the program after one iteration. Then the program +changes the new frequencies of the CPUs according to the computed scaling +factors. In our experiments over a homogeneous cluster described in +Section~\ref{sec.expe}, this algorithm has a small execution time. It takes +\np[$\mu$s]{1.52} on average for 4 nodes and \np[$\mu$s]{6.65} on average for 32 +nodes. The algorithm complexity is $O(F\cdot N)$, where $F$ is the number of +available frequencies and $N$ is the number of computing nodes. The algorithm +is called just once during the execution of the program. The DVFS algorithm on +Figure~\ref{dvfs} shows where and when the algorithm is called in the MPI +program. +%\begin{table}[htb] +% \caption{Platform file parameters} +% % title of Table +% \centering +% \begin{tabular}{|*{7}{l|}} +% \hline +% Max & Min & Backbone & Backbone & Link & Link & Sharing \\ +% Freq. & Freq. & Bandwidth & Latency & Bandwidth & Latency & Policy \\ +% \hline +% \np{2.5} & \np{800} & \np[GBps]{2.25} & \np[$\mu$s]{0.5} & \np[GBps]{1} & \np[$\mu$s]{50} & Full \\ +% GHz & MHz & & & & & Duplex \\ +% \hline +% \end{tabular} +% \label{table:platform} +%\end{table} + +\begin{figure}[tp] \begin{algorithmic}[1] - \State Initialize the variable $Dist=0$ - \State Set dynamic and static power values. - \State Set $P_{states}$ to the number of available frequencies. - \State Set the variable $F_{new}$ to max. frequency, $F_{new} = F_{max} $ - \State Set the variable $F_{diff}$ to the scale value between each two frequencies. - \For {$i=1$ to $P_{states} $} - \State - Calculate the new frequency as $F_{new}=F_{new} - F_{diff} $ - \State - Calculate the scale factor $S$ as in EQ~(\ref{eq:s}). - \State - Calculate all available scales $S_i$ depend on $S$ as\par\hspace{1 pt} in EQ~(\ref{eq:si}). 
- \State - Select the maximum scale factor $S_1$ from the set\par\hspace{1 pt} of scales $S_i$. - \State - Calculate the normalize energy $E_{Norm}=E_{R}/E_{O}$ - \par\hspace{1 pt} as in EQ~(\ref{eq:enorm}). - \State - Calculate the normalize inverse of performance\par\hspace{1 pt} - $P_{NormInv}=T_{old}/T_{new}$ as in EQ~(\ref{eq:pnorm_en}). - \If{ $(P_{NormInv}-E_{Norm} > Dist$) } - \State $S_{optimal} = S$ - \State $Dist = P_{NormInv} - E_{Norm}$ + % \footnotesize + \For {$k=1$ to \textit{some iterations}} + \State Computations section. + \State Communications section. + \If {$(k=1)$} + \State Gather all times of computation and\newline\hspace*{3em}% + communication from each node. + \State Call algorithm from Figure~\ref{EPSA} with these times. + \State Compute the new frequency from the\newline\hspace*{3em}% + returned optimal scaling factor. + \State Set the new frequency to the CPU. \EndIf \EndFor - \State Return $S_{optimal}$ \end{algorithmic} -\end{algorithm} -The proposed EPSA algorithm works online during the execution time of the MPI -program. It selects the optimal scaling factor by gathering some information -from the program after one iteration. This algorithm has small execution time -(between 0.00152 $ms$ for 4 nodes to 0.00665 $ms$ for 32 nodes). The data -required by this algorithm is the computation time and the communication time -for each task from the first iteration only. When these times are measured, the -MPI program calls the EPSA algorithm to choose the new frequency using the -optimal scaling factor. Then the program set the new frequency to the -system. The algorithm is called just one time during the execution of the -program. The DVFS algorithm~(\ref{dvfs}) shows where and when the EPSA algorithm is called -in the MPI program. -%\begin{minipage}{\textwidth} -%\AG{Use the same format as for Algorithm~\ref{$EPSA$}} - -\begin{algorithm}[tp] - \caption{DVFS} + \caption{DVFS algorithm} \label{dvfs} - \begin{algorithmic}[1] - \For {$J:=1$ to $Some-Iterations \; $} - \State -Computations Section. - \State -Communications Section. - \If {$(J==1)$} - \State -Gather all times of computation and\par\hspace{13 pt} communication from each node. - \State -Call EPSA with these times. - \State -Calculate the new frequency from optimal scale. - \State -Set the new frequency to the system. - \EndIf -\EndFor -\end{algorithmic} -\end{algorithm} - -After obtaining the optimal scale factor from the EPSA algorithm. The program -calculates the new frequency $F_i$ for each task proportionally to its time -value $T_i$. By substitution of the EQ~(\ref{eq:s}) in the EQ~(\ref{eq:si}), we -can calculate the new frequency $F_i$ as follows: +\end{figure} +After obtaining the optimal scaling factor, the program calculates the new +frequency $F_i$ for each task proportionally to its time value $T_i$. By +substitution of EQ~\eqref{eq:s} in EQ~\eqref{eq:si}, we can calculate the new +frequency $F_i$ as follows: \begin{equation} \label{eq:fi} - F_i = \frac{F_\textit{max} \cdot T_i}{S_\textit{optimal} \cdot T_\textit{max}} + F_i = \frac{\Fmax \cdot T_i}{\Sopt \cdot \Tmax} \end{equation} According to this equation all the nodes may have the same frequency value if -they have balanced workloads. Otherwise, they take different frequencies when -have imbalanced workloads. Then EQ~(\ref{eq:fi}) works in adaptive way to change -the frequency according to the nodes workloads. +they have balanced workloads, otherwise, they take different frequencies when +having imbalanced workloads. 
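+As a small example, assume the algorithm returns $\Sopt = 1.25$ on the platform
+of Section~\ref{sec.expe}, where $\Fmax$ is \np[GHz]{2.5}: according to
+EQ~\eqref{eq:fi}, the slowest node ($T_i = \Tmax$) runs at $\Fmax / \Sopt$,
+i.e., \np[GHz]{2}, whereas a node whose computation time is only half of
+$\Tmax$ is scaled down to \np[GHz]{1}, since it would otherwise remain idle at
+the synchronization barrier.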
Thus, EQ~\eqref{eq:fi} adapts the frequency of the +CPU to the nodes' workloads to maintain the performance of the program. -\section{Experimental Results} +\section{Experimental results} \label{sec.expe} - -The proposed EPSA algorithm was applied to seven MPI programs of the NAS -benchmarks (EP, CG, MG, FT, BT, LU and SP). We work on three classes (A, B and -C) for each program. Each program runs on specific number of processors +Our experiments are executed on the simulator SimGrid/SMPI v3.10. We configure +the simulator to use a homogeneous cluster with one core per node. +%The detailed characteristics of our platform file are shown in Table~\ref{table:platform}. +Each node in the cluster has 18 frequency values +from \np[GHz]{2.5} to \np[MHz]{800} with \np[MHz]{100} difference between each +two successive frequencies. The nodes are connected via an ethernet network with 1Gbit/s bandwidth. + +\subsection{Execution time prediction verification} + +In this section we evaluate the precision of our execution time prediction method +based on EQ~\eqref{eq:tnew} by applying it to the NAS benchmarks. The NAS programs +are executed with the class B option to compare the real execution time with +the predicted execution time. Each program runs offline with all available +scaling factors on 8 or 9 nodes (depending on the benchmark) to produce real +execution time values. These scaling factors are computed by dividing the +maximum frequency by the new one see EQ~\eqref{eq:s}. +\begin{figure} + \centering + \includegraphics[width=.5\linewidth]{fig/cg_per}\hfill% + % \includegraphics[width=.5\linewidth]{fig/mg_pre}\hfill% + % \includegraphics[width=.5\linewidth]{fig/bt_pre}\qquad% + \includegraphics[width=.5\linewidth]{fig/lu_pre}\hfill% + \caption{Comparing predicted to real execution times} + \label{fig:pred} +\end{figure} +%see Figure~\ref{fig:pred} +In our cluster there are 18 available frequency states for each processor. This +leads to 18 run states for each program. We use seven MPI programs of the NAS +parallel benchmarks: CG, MG, EP, FT, BT, LU and SP. Figure~\ref{fig:pred} +presents plots of the real execution times and the simulated ones. The maximum +normalized error between these two execution times varies between \np{0.0073} to +\np{0.031} dependent on the executed benchmark. The smallest prediction error +was for CG and the worst one was for LU. + +\subsection{The experimental results for the scaling algorithm } +The proposed algorithm was applied to seven MPI programs of the NAS benchmarks +(EP, CG, MG, FT, BT, LU and SP) which were run with three classes (A, B and C). +For each instance the benchmarks were executed on a number of processors proportional to the size of the class. Each class represents the problem size -ascending from the class A to C. Additionally, depending on some speed up points +ascending from class A to C. Additionally, depending on some speed up points for each class we run the classes A, B and C on 4, 8 or 9 and 16 nodes -respectively. Our experiments are executed on the simulator SimGrid/SMPI -v3.10. We design a platform file that simulates a cluster with one core per -node. This cluster is a homogeneous architecture with distributed memory. The -detailed characteristics of our platform file are shown in the -table~(\ref{table:platform}). Each node in the cluster has 18 frequency values -from 2.5 GHz to 800 MHz with 100 MHz difference between each two successive -frequencies. 
-\begin{table}[htb]
-  \caption{Platform File Parameters}
-  % title of Table
-  \centering
-  \begin{tabular}{|*{7}{l|}}
-    \hline
-    Max & Min & Backbone & Backbone&Link &Link& Sharing \\
-    Freq. & Freq. &  Bandwidth & Latency & Bandwidth& Latency&Policy \\ \hline
-    \np{2.5} & \np{800} & \np[GBps]{2.25} &\np[$\mu$s]{0.5}& \np[GBps]{1} & \np[$\mu$s]{50} &Full \\
-    GHz& MHz&  &  & & &Duplex \\\hline
-  \end{tabular}
-  \label{table:platform}
-\end{table}
-Depending on the EQ~(\ref{eq:energy}), we measure the energy consumption for all
-the NAS MPI programs while assuming the power dynamic is equal to 20W and the
-power static is equal to 4W for all experiments. We run the proposed EPSA
-algorithm for all these programs. The results showed that the algorithm selected
+respectively. Depending on EQ~\eqref{eq:energy}, we measure the energy
+consumption for all the NAS MPI programs while assuming that the dynamic power
+with the highest frequency is equal to \np[W]{20} and the static power is equal
+to \np[W]{4} for all experiments. These power values were also used by Rauber
+and Rünger in~\cite{3}. The results showed that the algorithm selected
 different scaling factors for each program depending on the communication
-features of the program as in the figure~(\ref{fig:nas}). This figure shows that
-there are different distances between the normalized energy and the normalized
-inversed performance curves, because there are different communication features
-for each MPI program. When there are little or not communications, the inversed
-performance curve is very close to the energy curve. Then the distance between
-the two curves is very small. This lead to small energy savings. The opposite
-happens when there are a lot of communication, the distance between the two
-curves is big. This lead to more energy savings (e.g. CG and FT), see
-table~(\ref{table:factors results}). All discovered frequency scaling factors
-optimize both the energy and the performance simultaneously for all the NAS
-programs. In table~(\ref{table:factors results}), we record all optimal scaling
-factors results for each program on class C. These factors give the maximum
-energy saving percent and the minimum performance degradation percent in the
-same time over all available scales.
+features of the program, as shown in the plots from Figure~\ref{fig:nas}. These plots
+illustrate that there are different distances between the normalized energy and
+the normalized inverted execution time curves, because the communication pattern
+of each benchmark is different. When there are few or no
+communications, the inverted execution time curve is very close to the energy
+curve. Then the distance between the two curves is very small, which leads to
+small energy savings. The opposite happens when there are a lot of
+communications: the distance between the two curves is big, which leads to more
+energy savings (e.g.\ CG and FT), see Table~\ref{table:compareC}. All discovered
+frequency scaling factors optimize both the energy and the execution time
+simultaneously for all NAS benchmarks. In Table~\ref{table:compareC}, we record
+the optimal scaling factor of each benchmark running class C. These
+scaling factors give the maximum energy saving percentage and the minimum
+performance degradation percentage at the same time over all available scaling
+factors.
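+As a small numerical illustration of how a given scaling factor is evaluated
+(the time values below are hypothetical, only the power values are those used
+in our experiments), consider two nodes with computation times of \np[s]{2} and
+\np[s]{1}, a communication time of \np[s]{1} for the slowest node and the gear
+$S = 1.25$. EQ~\eqref{eq:enorm} gives
+\begin{equation*}
+  \Enorm = \frac{20 \cdot 1.25^{-2} \cdot 2.25 + 4 \cdot 2 \cdot 1.25 \cdot 2}
+                {20 \cdot 2.25 + 4 \cdot 2 \cdot 2}
+         = \frac{48.8}{61} = 0.80
+\end{equation*}
+where $2.25 = T_1 + T_2^3/T_1^2$, and EQ~\eqref{eq:pnorm_en} gives
+$\Pnorm = 3/3.5 \approx 0.86$, hence a distance of about $0.06$ for this gear.
+The algorithm simply evaluates this distance for each available gear and keeps
+the one that maximizes it.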
\begin{figure*}[t]
   \centering
-  \includegraphics[width=.33\textwidth]{ep.eps}\hfill%
-  \includegraphics[width=.33\textwidth]{cg.eps}\hfill%
-  \includegraphics[width=.33\textwidth]{sp.eps}
-  \includegraphics[width=.33\textwidth]{lu.eps}\hfill%
-  \includegraphics[width=.33\textwidth]{bt.eps}\hfill%
-  \includegraphics[width=.33\textwidth]{ft.eps}
-  \caption{Optimal scaling factors for The NAS MPI Programs}
+  \includegraphics[width=.33\linewidth]{fig/ep}\hfill%
+  \includegraphics[width=.33\linewidth]{fig/cg}\hfill%
+  % \includegraphics[width=.328\linewidth]{fig/sp}
+  % \includegraphics[width=.328\linewidth]{fig/lu}\hfill%
+  \includegraphics[width=.33\linewidth]{fig/bt}
+  % \includegraphics[width=.328\linewidth]{fig/ft}
+  \caption{Optimal scaling factors for the predicted energy and performance of the NAS benchmarks}
   \label{fig:nas}
 \end{figure*}
-\begin{table}[htb]
-  \caption{Optimal Scaling Factors Results}
-  % title of Table
-  \centering
-  \AG{Use the same number of decimals for all numbers in a column,
-    and vertically align the numbers along the decimal points.
-    The same for all the following tables.}
-  \begin{tabular}{|l|*{4}{r|}}
-    \hline
-    Program & Optimal & Energy & Performance&Energy-Perf.\\
-    Name & Scaling Factor& Saving \%&Degradation \% &Distance \\ \hline
-    CG & 1.56 &39.23&14.88 &24.35\\ \hline
-    MG & 1.47 &34.97&21.70 &13.27 \\ \hline
-    EP & 1.04 &22.14&20.73 &1.41\\ \hline
-    LU & 1.38 &35.83&22.49 &13.34\\ \hline
-    BT & 1.31 &29.60&21.28 &8.32\\ \hline
-    SP & 1.38 &33.48&21.36 &12.12\\ \hline
-    FT & 1.47 &34.72&19.00 &15.72\\ \hline
-  \end{tabular}
-  \label{table:factors results}
-  % is used to refer this table in the text
-\end{table}
-As shown in the table~(\ref{table:factors results}), when the optimal scaling
-factor has big value we can gain more energy savings for example as in CG and
-FT. The opposite happens when the optimal scaling factor is small value as
-example BT and EP. Our algorithm selects big scaling factor value when the
+As shown in Table~\ref{table:compareC}, when the optimal scaling factor has a
+big value we can gain more energy savings, as in the CG and FT benchmarks. The
+opposite happens when the optimal scaling factor has a small value, as in the
+BT and EP benchmarks. Our algorithm selects a big scaling factor value when the
 communication and the other slack times are big, and smaller ones in the opposite
+cases. In EP there are no communications inside the iterations. This leads our
+algorithm to select smaller scaling factor values (inducing smaller energy
+savings).
+
+\subsection{Results comparison}
+
+In this section, we compare our scaling factor selection method with Rauber and
+Rünger's methods~\cite{3}. They proposed two scenarios: the first reduces the
+energy consumption to the optimal level without considering the execution time,
+as in EQ~\eqref{eq:sopt}. We refer to this scenario as $R_{E}$. The second
+scenario is similar to the first, except that the slowest task is set to the
+maximum frequency (i.e., its scaling factor $S=1$) to keep the performance from
+degrading as much as possible. We refer to this scenario as $R_{E-P}$, and to
+our algorithm as EPSA (Energy to Performance Scaling Algorithm). The comparison
+is made in Table~\ref{table:compareC}. This table shows the results of our
+method and of Rauber and Rünger's scenarios for all the NAS benchmark programs
+of class C.
+
+\begin{table}
+  \caption{Comparing results for the NAS class C}
   % title of Table
   \centering
   \begin{tabular}{|l|l|*{4}{r|}}
     \hline
-    Method&Program&Factor& Energy& Performance &Energy-Perf.\\
-    Name &Name&Value& Saving \%&Degradation \% &Distance
-    \\ \hline
-    % \rowcolor[gray]{0.85}
-    $EPSA$&CG & 1.56 &37.02 & 13.88 & 23.14\\ \hline
-    $R_{E-P}$&CG &2.14 &42.77 & 25.27 & 17.50\\ \hline
-    $R_{E}$&CG &2.14 &42.77&26.46&16.31\\ \hline
-
-    $EPSA$&MG & 1.47 &27.66&16.82&10.84\\ \hline
-    $R_{E-P}$&MG &2.14&34.45&31.84&2.61\\ \hline
-    $R_{E}$&MG &2.14&34.48&33.65&0.80 \\ \hline
-
-    $EPSA$&EP &1.19 &25.32&20.79&4.53\\ \hline
-    $R_{E-P}$&EP&2.05&41.45&55.67&-14.22\\ \hline
-    $R_{E}$&EP&2.05&42.09&57.59&-15.50\\ \hline
-
-    $EPSA$&LU&1.56& 39.55 &19.38& 20.17\\ \hline
-    $R_{E-P}$&LU&2.14&45.62&27.00&18.62 \\ \hline
-    $R_{E}$&LU&2.14&45.66&33.01&12.65\\ \hline
-
-    $EPSA$&BT&1.31& 29.60&20.53&9.07 \\ \hline
-    $R_{E-P}$&BT&2.10&45.53&49.63&-4.10\\ \hline
-    $R_{E}$&BT&2.10&43.93&52.86&-8.93\\ \hline
-
-    $EPSA$&SP&1.38& 33.51&15.65&17.86 \\ \hline
-    $R_{E-P}$&SP&2.11&45.62&42.52&3.10\\ \hline
-    $R_{E}$&SP&2.11&45.78&43.09&2.69\\ \hline
-
-    $EPSA$&FT&1.25&25.00&10.80&14.20 \\ \hline
-    $R_{E-P}$&FT&2.10&39.29&34.30&4.99 \\ \hline
-    $R_{E}$&FT&2.10&37.56&38.21&-0.65\\ \hline
-  \end{tabular}
-  \label{table:compare Class A}
-  % is used to refer this table in the text
-\end{table}
-\begin{table}[p]
-  \caption{Comparing Results for The NAS Class B}
-  % title of Table
-  \centering
-  \begin{tabular}{|l|l|*{4}{r|}}
    \hline
+    Method & Program & Factor & Energy & Performance & Energy-Perf.
\\ + Name & Name & Value & Saving \% & Degradation \% & Distance \\ \hline - Method&Program&Factor& Energy& Performance &Energy-Perf.\\ - Name &Name&Value& Saving \%&Degradation \% &Distance - \\ \hline % \rowcolor[gray]{0.85} - $EPSA$&CG & 1.66 &39.23&16.63&22.60 \\ \hline - $R_{E-P}$&CG &2.15 &45.34&27.60&17.74\\ \hline - $R_{E}$&CG &2.15 &45.34&28.88&16.46\\ \hline + $EPSA$ & CG & 1.56 & 39.23 & 14.88 & 24.35 \\ \hline + $R_{E-P}$ & CG & 2.15 & 45.36 & 25.89 & 19.47 \\ \hline + $R_{E}$ & CG & 2.15 & 45.36 & 26.70 & 18.66 \\ \hline - $EPSA$ &MG & 1.47 &34.98&18.35&16.63\\ \hline - $R_{E-P}$&MG &2.14&43.55&36.42&7.13 \\ \hline - $R_{E}$&MG &2.14&43.56&37.07&6.49 \\ \hline + $EPSA$ & MG & 1.47 & 34.97 & 21.69 & 13.27 \\ \hline + $R_{E-P}$ & MG & 2.15 & 43.65 & 40.45 & 3.20 \\ \hline + $R_{E}$ & MG & 2.15 & 43.64 & 41.38 & 2.26 \\ \hline - $EPSA$&EP &1.08 &20.29&17.15&3.14 \\ \hline - $R_{E-P}$&EP&2.00&42.38&56.88&-14.50\\ \hline - $R_{E}$&EP&2.00&39.73&59.94&-20.21\\ \hline + $EPSA$ & EP & 1.04 & 22.14 & 20.73 & 1.41 \\ \hline + $R_{E-P}$ & EP & 1.92 & 39.40 & 56.33 & -16.93 \\ \hline + $R_{E}$ & EP & 1.92 & 38.10 & 56.35 & -18.25 \\ \hline - $EPSA$&LU&1.47&38.57&21.34&17.23 \\ \hline - $R_{E-P}$&LU&2.10&43.62&36.51&7.11 \\ \hline - $R_{E}$&LU&2.10&43.61&38.54&5.07 \\ \hline + $EPSA$ & LU & 1.38 & 35.83 & 22.49 & 13.34 \\ \hline + $R_{E-P}$ & LU & 2.15 & 44.97 & 41.00 & 3.97 \\ \hline + $R_{E}$ & LU & 2.15 & 44.97 & 41.80 & 3.17 \\ \hline - $EPSA$&BT&1.31& 29.59&20.88&8.71\\ \hline - $R_{E-P}$&BT&2.10&44.53&53.05&-8.52\\ \hline - $R_{E}$&BT&2.10&42.93&52.80&-9.87\\ \hline + $EPSA$ & BT & 1.31 & 29.60 & 21.28 & 8.32 \\ \hline + $R_{E-P}$ & BT & 2.13 & 45.60 & 49.84 & -4.24 \\ \hline + $R_{E}$ & BT & 2.13 & 44.90 & 55.16 & -10.26 \\ \hline - $EPSA$&SP&1.38&33.44&19.24&14.20 \\ \hline - $R_{E-P}$&SP&2.15&45.69&43.20&2.49\\ \hline - $R_{E}$&SP&2.15&45.41&44.47&0.94\\ \hline + $EPSA$ & SP & 1.38 & 33.48 & 21.35 & 12.12 \\ \hline + $R_{E-P}$ & SP & 2.10 & 45.69 & 43.60 & 2.09 \\ \hline + $R_{E}$ & SP & 2.10 & 45.75 & 44.10 & 1.65 \\ \hline - $EPSA$&FT&1.38&34.40&14.57&19.83 \\ \hline - $R_{E-P}$&FT&2.13&42.98&37.35&5.63 \\ \hline - $R_{E}$&FT&2.13&43.04&37.90&5.14\\ \hline + $EPSA$ & FT & 1.47 & 34.72 & 19.00 & 15.72 \\ \hline + $R_{E-P}$ & FT & 2.04 & 39.40 & 37.10 & 2.30 \\ \hline + $R_{E}$ & FT & 2.04 & 39.35 & 37.70 & 1.65 \\ \hline \end{tabular} - \label{table:compare Class B} + \label{table:compareC} % is used to refer this table in the text \end{table} - -\begin{table}[p] - \caption{Comparing Results for The NAS Class C} - % title of Table - \centering - \begin{tabular}{|l|l|*{4}{r|}} - \hline - Method&Program&Factor& Energy& Performance &Energy-Perf.\\ - Name &Name&Value& Saving \%&Degradation \% &Distance - \\ \hline - % \rowcolor[gray]{0.85} - $EPSA$&CG & 1.56 &39.23&14.88&24.35 \\ \hline - $R_{E-P}$&CG &2.15 &45.36&25.89&19.47\\ \hline - $R_{E}$&CG &2.15 &45.36&26.70&18.66\\ \hline - - $EPSA$&MG & 1.47 &34.97&21.69&13.27\\ \hline - $R_{E-P}$&MG &2.15&43.65&40.45&3.20 \\ \hline - $R_{E}$&MG &2.15&43.64&41.38&2.26 \\ \hline - - $EPSA$&EP &1.04 &22.14&20.73&1.41 \\ \hline - $R_{E-P}$&EP&1.92&39.40&56.33&-16.93\\ \hline - $R_{E}$&EP&1.92&38.10&56.35&-18.25\\ \hline - - $EPSA$&LU&1.38&35.83&22.49&13.34 \\ \hline - $R_{E-P}$&LU&2.15&44.97&41.00&3.97 \\ \hline - $R_{E}$&LU&2.15&44.97&41.80&3.17 \\ \hline - - $EPSA$&BT&1.31& 29.60&21.28&8.32\\ \hline - $R_{E-P}$&BT&2.13&45.60&49.84&-4.24\\ \hline - $R_{E}$&BT&2.13&44.90&55.16&-10.26\\ \hline - - $EPSA$&SP&1.38&33.48&21.35&12.12\\ \hline - 
$R_{E-P}$&SP&2.10&45.69&43.60&2.09\\ \hline
-    $R_{E}$&SP&2.10&45.75&44.10&1.65\\ \hline
-
-    $EPSA$&FT&1.47&34.72&19.00&15.72 \\ \hline
-    $R_{E-P}$&FT&2.04&39.40&37.10&2.30\\ \hline
-    $R_{E}$&FT&2.04&39.35&37.70&1.65\\ \hline
-  \end{tabular}
-\label{table:compare Class C}
-% is used to refer this table in the text
-\end{table}
-As shown in these tables our scaling factor is not optimal for energy saving
-such as Rauber's scaling factor EQ~(\ref{eq:sopt}), but it is optimal for both
-the energy and the performance simultaneously. Our $EPSA$ optimal scaling factors
-has better simultaneous optimization for both the energy and the performance
-compared to Rauber and Rünger energy-performance method ($R_{E-P}$). Also, in
-($R_{E-P}$) method when setting the frequency to maximum value for the
-slower task lead to a small improvement of the performance. Also the results
-show that this method keep or improve energy saving. Because of the energy
-consumption decrease when the execution time decreased while the frequency value
-increased.
-
-Figure~(\ref{fig:compare}) shows the maximum distance between the energy saving
-percent and the performance degradation percent. Therefore, this means it is the
-same resultant of our objective function EQ~(\ref{eq:max}). Our algorithm always
-gives positive energy to performance trade offs while Rauber and Rünger method
-($R_{E-P}$) gives in some time negative trade offs such as in BT and
-EP. The positive trade offs with highest values lead to maximum energy savings
-concatenating with less performance degradation and this the objective of this
-paper. While the negative trade offs refers to improving energy saving (or may
-be the performance) while degrading the performance (or may be the energy) more
-than the first.
+As shown in Table~\ref{table:compareC}, the $R_{E-P}$ method outperforms the
+$R_{E}$ method in terms of performance and energy reduction. The $R_{E-P}$
+method also gives better energy savings than our method. However, although our
+scaling factor is not optimal for energy reduction, the results in this table
+prove that our algorithm returns the best scaling factor that satisfies our
+objective function: the largest distance between energy reduction and
+performance degradation. Figure~\ref{fig:compare} illustrates this distance
+between the energy reduction and the performance degradation even better. The
+negative values mean that one of the two objectives (energy or performance) has
+been degraded more than the other. The positive trade-offs with the highest
+values lead to maximum energy savings while keeping the performance degradation
+as low as possible. Our algorithm always gives the highest positive energy to
+performance trade-offs, while Rauber and Rünger's method, $R_{E-P}$, sometimes
+gives negative trade-offs, such as in BT and EP.
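+
+As an illustration of how the distance column of Table~\ref{table:compareC} is
+read (it is simply the energy saving percentage minus the performance
+degradation percentage), for the BT benchmark our method obtains a trade-off
+distance of
+\begin{equation*}
+  29.60\% - 21.28\% = 8.32\%,
+\end{equation*}
+whereas the $R_{E-P}$ scenario obtains $45.60\% - 49.84\% = -4.24\%$: the
+additional energy saved by $R_{E-P}$ is outweighed by its larger performance
+degradation.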
\begin{figure}[t]
   \centering
-  \includegraphics[width=.33\textwidth]{compare_class_A.pdf}
-  \includegraphics[width=.33\textwidth]{compare_class_B.pdf}
-  \includegraphics[width=.33\textwidth]{compare_class_c.pdf}
-  \caption{Comparing Our EPSA with Rauber and Rünger Methods}
+%  \includegraphics[width=.328\linewidth]{fig/compare_class_A}
+%  \includegraphics[width=.328\linewidth]{fig/compare_class_B}
+  \includegraphics[width=\linewidth]{fig/compare_class_C}
+  \caption{Comparing our method to Rauber and Rünger's methods}
   \label{fig:compare}
 \end{figure}
 
 \section{Conclusion}
 \label{sec.concl}
-\AG{the conclusion needs to be written\dots{} one day}
+In this paper, we have presented a new online scaling factor selection method
+that simultaneously optimizes the energy consumption and the performance of a
+distributed application running on a homogeneous cluster. It uses the
+computation and communication times measured at the first iteration to predict
+the energy consumption and the execution time of the parallel application at
+every available frequency. Then, it selects the scaling factor that gives the
+best trade-off between energy reduction and performance, which is the maximum
+distance between the energy and the inverted execution time curves. To evaluate
+this method, we have applied it to the NAS benchmarks and compared it to Rauber
+and Rünger's methods, all executed on the SimGrid simulator. The results showed
+that our method outperforms Rauber and Rünger's methods in terms of the
+energy-performance ratio.
+
+In the near future, we would like to adapt this scaling factor selection method
+to heterogeneous platforms where each node has different characteristics. In
+particular, each CPU may have different available frequencies, energy
+consumption and performance. It would also be interesting to develop a new
+energy model for asynchronous parallel iterative methods, where the number of
+iterations is not known in advance and depends on the global convergence of the
+iterative system.
 
 \section*{Acknowledgment}
 
-\AG{Right?}
-Computations have been performed on the supercomputer facilities of the
-Mésocentre de calcul de Franche-Comté.
+This work has been partially supported by the Labex ACTION project (contract
+``ANR-11-LABX-01-01''). Computations have been performed on the supercomputer
+facilities of the Mésocentre de calcul de Franche-Comté. As a PhD student,
+Mr. Ahmed Fanfakh would like to thank the University of Babylon (Iraq) for
+supporting his work.
 
 % trigger a \newpage just before the given reference
 % number - used to balance the columns on the last page
@@ -778,6 +701,7 @@ Mésocentre de calcul de Franche-Comté.
 % the document is modified later
 %\IEEEtriggeratref{15}
 
+\newpage
 \bibliographystyle{IEEEtran}
 \bibliography{IEEEabrv,my_reference}
 \end{document}
@@ -790,4 +714,4 @@ Mésocentre de calcul de Franche-Comté.
 %%% End:
 % LocalWords:  Fanfakh Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber
-% LocalWords:  CMOS EQ EPSA Franche Comté Tflop Rünger
+% LocalWords:  CMOS EQ EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex