X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/mpi-energy.git/blobdiff_plain/c3da7359fcccb593da2f53d83b99fd595bfae86c..ce79ce10130fe42f774d3898afeb5699b82274b8:/paper.tex?ds=inline diff --git a/paper.tex b/paper.tex index 1ce2f60..657bea7 100644 --- a/paper.tex +++ b/paper.tex @@ -9,777 +9,841 @@ \usepackage{listings} \usepackage{colortbl} \usepackage{amsmath} -% \usepackage{sectsty} -% \usepackage{titlesec} -% \usepackage{secdot} -%\usepackage[font={footnotesize,bt}]{caption} -%\usepackage[font=scriptsize,labelfont=bf]{caption} + +\usepackage{url} +\DeclareUrlCommand\email{\urlstyle{same}} + +\usepackage[autolanguage,np]{numprint} +\AtBeginDocument{% + \renewcommand*\npunitcommand[1]{\text{#1}} + \npthousandthpartsep{}} + \usepackage{xspace} \usepackage[textsize=footnotesize]{todonotes} -\newcommand{\AG}[2][inline]{\todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace} +\newcommand{\AG}[2][inline]{% + \todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace} +\newcommand{\JC}[2][inline]{% + \todo[color=red!10,#1]{\sffamily\textbf{JC:} #2}\xspace} \begin{document} -\title{Optimal Dynamic Frequency Scaling for Energy-Performance of Parallel MPI Programs} +\title{Dynamic Frequency Scaling for Energy Consumption + Reduction in Distributed MPI Programs} \author{% \IEEEauthorblockN{% - Ahmed Badri, Jean-Claude Charr, - Raphaël Couturier and + Raphaël Couturier, + Ahmed Fanfakh and Arnaud Giersch } \IEEEauthorblockA{% FEMTO-ST Institute\\ - University of Franche-Comté + University of Franche-Comté\\ + IUT de Belfort-Montbéliard, + 19 avenue du Maréchal Juin, BP 527, 90016 Belfort cedex, France\\ + % Telephone: \mbox{+33 3 84 58 77 86}, % Raphaël + % Fax: \mbox{+33 3 84 58 77 81}\\ % Dept Info + Email: \email{{jean-claude.charr,raphael.couturier,ahmed.fanfakh_badri_muslim,arnaud.giersch}@univ-fcomte.fr} + } } -} \maketitle -\AG{``Optimal'' is a bit pretentious in the title.\\ - Complete affiliation, add an email address, etc.} - \begin{abstract} - \AG{complete the abstract\dots} + Dynamic Voltage Frequency Scaling (DVFS) can be applied to modern CPUs. This + technique is usually used to reduce the energy consumed by a CPU while + computing. Indeed, power consumption by a processor at a given instant is + exponentially related to its frequency. Thus, decreasing the frequency + reduces the power consumed by the CPU. However, it can also significantly + affect the performance of the executed program if it is compute bound and if a + low CPU frequency is selected. The performance degradation ratio can even be + higher than the saved energy ratio. Therefore, the chosen scaling factor must + give the best possible trade-off between energy reduction and performance. + + In this paper we present an algorithm that predicts the energy consumed with + each frequency gear and selects the one that gives the best ratio between + energy consumption reduction and performance. This algorithm works online + without training or profiling and has a very small overhead. It also takes + into account synchronous communications between the nodes that are executing + the distributed algorithm. The algorithm has been evaluated over the SimGrid + simulator while being applied to the NAS parallel benchmark programs. The + results of the experiments show that it outperforms other existing scaling + factor selection algorithms. \end{abstract} \section{Introduction} \label{sec.intro} -The need for computing power is still increasing and it is not expected to slow -down in the coming years. 
To satisfy this demand, researchers and supercomputers -constructors have been regularly increasing the number of computing cores in -supercomputers (for example in November 2013, according to the TOP500 -list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more than 3 -millions of cores and delivers more than 33 Tflop/s while consuming 17808 -kW). This large increase in number of computing cores has led to large energy -consumption by these architectures. Moreover, the price of energy is expected to -continue its ascent according to the demand. For all these reasons energy -reduction became an important topic in the high performance computing field. To -tackle this problem, many researchers used DVFS (Dynamic Voltage Frequency -Scaling) operations which reduce dynamically the frequency and voltage of cores -and thus their energy consumption. However, this operation also degrades the -performance of computation. Therefore researchers try to reduce the frequency to -minimum when processors are idle (waiting for data from other processors or -communicating with other processors). Moreover, depending on their objectives -they use heuristics to find the best scaling factor during the computation. If -they aim for performance they choose the best scaling factor that reduces the -consumed energy while affecting as little as possible the performance. On the -other hand, if they aim for energy reduction, the chosen scaling factor must -produce the most energy efficient execution without considering the degradation -of the performance. It is important to notice that lowering the frequency to -minimum value does not always give the most efficient execution due to energy -leakage. The best scaling factor might be chosen during execution (online) or -during a pre-execution phase. In this paper we emphasize to develop an -algorithm that selects the optimal frequency scaling factor that takes into -consideration simultaneously the energy consumption and the performance. The -main objective of HPC systems is to run the application with less execution -time. Therefore, our algorithm selects the optimal scaling factor online with -very small footprint. The proposed algorithm takes into account the -communication times of the MPI programs to choose the scaling factor. This -algorithm has ability to predict both energy consumption and execution time over -all available scaling factors. The prediction achieved depends on some -computing time information, gathered at the beginning of the runtime. We apply -this algorithm to seven MPI benchmarks. These MPI programs are the NAS parallel -benchmarks (NPB v3.3) developed by NASA~\cite{44}. Our experiments are executed -using the simulator SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183} -over an homogeneous distributed memory architecture. Furthermore, we compare the -proposed algorithm with Rauber's methods. -\AG{Add citation for Rauber's methods. Moreover, Rauber was not alone to to this work (use ``Rauber et al.'', or ``Rauber and Gudula'', or \dots)} -The comparison's results show that our -algorithm gives better energy-time trade off. -% -\AG{Correctly reword the following}% -In Section~\ref{sec.relwork} we present works from other -authors. Then, in Sections~\ref{sec.ptasks} and~\ref{sec.energy}, we -introduce our model. [\dots] Finally, we conclude in -Section~\ref{sec.concl}. 
- -\section{Related Works} +The need and demand for more computing power have been increasing since the +birth of the first computing unit and it is not expected to slow down in the +coming years. To satisfy this demand, researchers and supercomputers +constructors have been regularly increasing the number of computing cores and +processors in supercomputers (for example in November 2013, according to the +TOP500 list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more +than 3 millions of cores and delivers more than \np[Tflop/s]{33} while consuming +\np[kW]{17808}). This large increase in number of computing cores has led to +large energy consumption by these architectures. Moreover, the price of energy +is expected to continue its ascent according to the demand. For all these +reasons energy reduction became an important topic in the high performance +computing field. To tackle this problem, many researchers used DVFS (Dynamic +Voltage Frequency Scaling) operations which reduce dynamically the frequency and +voltage of cores and thus their energy consumption. Indeed, modern CPUs offer a +set of acceptable frequencies which are usually called gears, and the user or +the operating system can modify the frequency of the processor according to its +needs. However, DVFS also degrades the performance of computation. Therefore +researchers try to reduce the frequency to minimum when processors are idle +(waiting for data from other processors or communicating with other processors). +Moreover, depending on their objectives they use heuristics to find the best +scaling factor during the computation. If they aim for performance they choose +the best scaling factor that reduces the consumed energy while affecting as +little as possible the performance. On the other hand, if they aim for energy +reduction, the chosen scaling factor must produce the most energy efficient +execution without considering the degradation of the performance. It is +important to notice that lowering the frequency to minimum value does not always +give the most energy efficient execution due to energy leakage. The best +scaling factor might be chosen during execution (online) or during a +pre-execution phase. In this paper, we present an algorithm that selects a +frequency scaling factor that simultaneously takes into consideration the energy +consumption by the CPU and the performance of the application. The main +objective of HPC systems is to execute as fast as possible the application. +Therefore, our algorithm selects the scaling factor online with very small +footprint. The proposed algorithm takes into account the communication times of +the MPI program to choose the scaling factor. This algorithm has ability to +predict both energy consumption and execution time over all available scaling +factors. The prediction achieved depends on some computing time information, +gathered at the beginning of the runtime. We apply this algorithm to seven MPI +benchmarks. These MPI programs are the NAS parallel benchmarks (NPB v3.3) +developed by NASA~\cite{44}. Our experiments are executed using the simulator +SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183} over an homogeneous +distributed memory architecture. Furthermore, we compare the proposed algorithm +with Rauber and Rünger methods~\cite{3}. The comparison's results show that our +algorithm gives better energy-time trade-off. + +This paper is organized as follows: Section~\ref{sec.relwork} presents some +related works from other authors. 
Section~\ref{sec.exe} explains the execution +of parallel tasks and the sources of slack times. It also presents an energy +model for homogeneous platforms. Section~\ref{sec.mpip} describes how the +performance of MPI programs can be predicted. Section~\ref{sec.compet} presents +the energy-performance objective function that maximizes the reduction of energy +consumption while minimizing the degradation of the program's performance. +Section~\ref{sec.optim} details the proposed energy-performance algorithm. +Section~\ref{sec.expe} verifies the accuracy of the performance prediction model +and presents the results of the proposed algorithm. It also shows the +comparison results between our method and other existing methods. Finally, we +conclude in Section~\ref{sec.concl} with a summary and some future works. + +\section{Related works} \label{sec.relwork} -\AG{Consider introducing the models (sec.~\ref{sec.ptasks}, - maybe~\ref{sec.energy}) before related works} - -In the this section some heuristics, to compute the scaling factor, are -presented and classified in two parts : offline and online methods. - -\subsection{The offline DVFS orientations} - -The DVFS offline methods are static and are not executed during the runtime of -the program. Some approaches used heuristics to select the best DVFS state -during the compilation phases as an example in Azevedo et al.~\cite{40}. He used -intra-task algorithm -\AG{what is an ``intra-task algorithm''?} -to choose the DVFS setting when there are dependency points -between tasks. While in~\cite{29}, Xie et al. used breadth-first search -algorithm to do that. Their goal is saving energy with time limits. Another -approaches gathers and stores the runtime information for each DVFS state, then -used their methods offline to select the suitable DVFS that optimize energy-time -trade offs. As an example~\cite{8}, Rountree et al. used liner programming -algorithm, while in~\cite{38,34}, Cochran et al. used multi logistic regression -algorithm for the same goal. The offline study that shown the DVFS impact on the -communication time of the MPI program is~\cite{17}, Freeh et al. show that these -times not changed when the frequency is scaled down. - -\subsection{The online DVFS orientations} - -The objective of these works is to dynamically compute and set the frequency of -the CPU during the runtime of the program for saving energy. Estimating and -predicting approaches for the energy-time trade offs developed by -~\cite{11,2,31}. These works select the best DVFS setting depending on the slack -times. These times happen when the processors have to wait for data from other -processors to compute their task. For example, during the synchronous -communication time that take place in the MPI programs, the processors are -idle. The optimal DVFS can be selected using the learning methods. Therefore, in -~\cite{39,19} used machine learning to converge to the suitable DVFS -configuration. Their learning algorithms have big time to converge when the -number of available frequencies is high. Also, the communication time of the MPI -program used online for saving energy as in~\cite{1}, Lim et al. developed an -algorithm that detects the communication sections and changes the frequency -during these sections only. This approach changes the frequency many times -because an iteration may contain more than one communication section. The domain -of analytical modeling used for choosing the optimal frequency as in~\cite{3}, -Rauber et al. 
developed an analytical mathematical model for determining the
-optimal frequency scaling factor for any number of concurrent tasks, without
-considering communication times. They set the slowest task to maximum frequency
-for maintaining performance. In this paper we compare our algorithm with
-Rauber's model~\cite{3}, because his model can be used for any number of
-concurrent tasks for homogeneous platform and this is the same direction of this
-paper. However, the primary contributions of this paper are:
+
+In this section, some heuristics to compute the scaling factor are presented and
+classified into two categories: offline and online methods.
+
+\subsection{Offline scaling factor selection methods}
+
+The offline scaling factor selection methods are executed before the runtime of
+the program. They return static scaling factor values to the processors
+participating in the execution of the parallel program. On one hand, the
+scaling factor values could be computed based on information retrieved by
+analyzing the code of the program and the computing system that will execute it.
+In~\cite{40}, Azevedo et al. detect during compilation the dependency points
+between tasks in a multi-task program. This information is then used to lower
+the frequency of some processors in order to eliminate slack times. A slack
+time is the period of time during which a processor that has already finished
+its computation has to wait for a set of processors to finish their
+computations and send their results to the waiting processor in order to
+continue its task that is dependent on the results of computations being
+executed on other processors. Freeh et al. showed in~\cite{17} that the
+communication times of MPI programs do not change when the frequency is scaled
+down. On the other hand, some offline scaling factor selection methods use the
+information gathered from previous full or partial executions of the program. A
+part or the whole program is usually executed over all the available frequency
+gears and the execution time and the energy consumed with each frequency
+gear are measured. Then a heuristic or an exact method uses the retrieved
+information to compute the values of the scaling factor for the processors.
+In~\cite{29}, Xie et al. use an exact exponential breadth-first search algorithm
+to compute the scaling factor values that give the optimal energy reduction
+while respecting a deadline for a sequential program. They also present a
+linear heuristic that approximates the optimal solution. In~\cite{8}, Rountree
+et al. use a linear programming algorithm, while in~\cite{38,34}, Cochran et
+al. use a multi-logistic regression algorithm for the same goal. The main
+drawback of these methods is that they all require executing a part or the
+whole program on all frequency gears for each new instance of the same program.
+
+\subsection{Online scaling factor selection methods}
+
+The online scaling factor selection methods are executed during the runtime of
+the program. They are usually integrated into iterative programs where the same
+block of instructions is executed many times. During the first few iterations,
+a lot of information is measured, such as the execution time, the energy consumed
+using a multimeter, the slack times, \dots{} Then a method exploits these
+measurements to compute the scaling factor values for each processor. This
+operation, measuring and computing new scaling factors, can be repeated as
+much as needed if the iterations are not regular.
Kimura, Peraza, Yu-Liang et +al.~\cite{11,2,31} used learning methods to select the appropriate scaling +factor values to eliminate the slack times during runtime. However, as seen +in~\cite{39,19}, machine learning methods can take a lot of time to converge +when the number of available gears is big. To reduce the impact of slack times, +in~\cite{1}, Lim et al. developed an algorithm that detects the communication +sections and changes the frequency during these sections only. This approach +might change the frequency of each processor many times per iteration if an +iteration contains more than one communication section. In~\cite{3}, Rauber and +Rünger used an analytical model that can predict the consumed energy and the +execution time for every frequency gear after measuring the consumed energy and +the execution time with the highest frequency gear. These predictions may be +used to choose the optimal gear for each processor executing the parallel +program to reduce energy consumption. To maintain the performance of the +parallel program , they set the processor with the biggest load to the highest +gear and then compute the scaling factor values for the rest of the processors. +Although this model was built for parallel architectures, it can be adapted to +distributed architectures by taking into account the communications. The +primary contribution of our paper is presenting a new online scaling factor +selection method which has the following characteristics: \begin{enumerate} -\item Selecting the optimal frequency scaling factor for energy and performance - simultaneously. While taking into account the communication time. -\item Adapting our scale factor to taking into account the imbalanced tasks. -\item The execution time of our algorithm is very small when compared to other - methods (e.g.,~\cite{19}). -\item The proposed algorithm works online without profiling or training as +\item It is based on Rauber and Rünger analytical model to predict the energy + consumption of the application with different frequency gears. +\item It selects the frequency scaling factor for simultaneously optimizing + energy reduction and maintaining performance. +\item It is well adapted to distributed architectures because it takes into + account the communication time. +\item It is well adapted to distributed applications with imbalanced tasks. +\item it has very small footprint when compared to other methods + (e.g.,~\cite{19}) and does not require profiling or training as in~\cite{38,34}. \end{enumerate} -\section{Parallel Tasks Execution on Homogeneous Platform} -\label{sec.ptasks} -A homogeneous cluster consists of identical nodes in terms of the hardware and -the software. Each node has its own memory and at least one processor which can -be a multi-core. The nodes are connected via a high bandwidth network. Tasks -executed on this model can be either synchronous or asynchronous. In this paper +\section{Execution and energy of parallel tasks on homogeneous platform} +\label{sec.exe} + +% \JC{The whole subsection ``Parallel Tasks Execution on Homogeneous Platform'', +% can be deleted if we need space, we can just say we are interested in this +% paper in homogeneous clusters} + +\subsection{Parallel tasks execution on homogeneous platform} + +A homogeneous cluster consists of identical nodes in terms of hardware and +software. Each node has its own memory and at least one processor which can be +a multi-core. The nodes are connected via a high bandwidth network. 
Tasks +executed on this model can be either synchronous or asynchronous. In this paper we consider execution of the synchronous tasks on distributed homogeneous -platform. These tasks can exchange the data via synchronous memory passing. +platform. These tasks can exchange the data via synchronous message passing. \begin{figure*}[t] \centering - \subfloat[Sync. Imbalanced Communications]{\includegraphics[scale=0.67]{synch_tasks}\label{fig:h1}} - \subfloat[Sync. Imbalanced Computations]{\includegraphics[scale=0.67]{compt}\label{fig:h2}} - \caption{Parallel Tasks on Homogeneous Platform} + \subfloat[Sync. imbalanced communications]{% + \includegraphics[scale=0.67]{fig/commtasks}\label{fig:h1}} + \subfloat[Sync. imbalanced computations]{% + \includegraphics[scale=0.67]{fig/compt}\label{fig:h2}} + \caption{Parallel tasks on homogeneous platform} \label{fig:homo} \end{figure*} -\AG{On fig.~\ref{fig:h1}, how can there be a synchronization point without communications just before ?\\ -Use ``Sync.'' to abbreviate ``Synchronization''} Therefore, the execution time of a task consists of the computation time and the -communication time. Moreover, the synchronous communications between tasks can -lead to idle time while tasks wait at the synchronous point for others tasks to -finish their communications see figure~(\ref{fig:h1}). Another source for idle -times is the imbalanced computations. This happen when processing different -amounts of data on each processor as an example see figure~(\ref{fig:h2}). In -this case the fastest tasks have to wait at the synchronous barrier for the -slowest tasks to finish their job. In both two cases the overall execution time -of the program is the execution time of the slowest task as : +communication time. Moreover, the synchronous communications between tasks can +lead to slack times while tasks wait at the synchronization barrier for other +tasks to finish their tasks (see figure~(\ref{fig:h1})). The imbalanced +communications happen when nodes have to send/receive different amount of data +or they communicate with different number of nodes. Another source of slack +times is the imbalanced computations. This happens when processing different +amounts of data on each processor (see figure~(\ref{fig:h2})). In this case the +fastest tasks have to wait at the synchronization barrier for the slowest ones +to begin the next task. In both cases the overall execution time of the program +is the execution time of the slowest task as in EQ~(\ref{eq:T1}). \begin{equation} \label{eq:T1} \textit{Program Time} = \max_{i=1,2,\dots,N} T_i \end{equation} -where $T_i$ is the execution time of process $i$. +where $T_i$ is the execution time of task $i$ and all the tasks are executed +concurrently on different processors. -\section{Energy Model for Homogeneous Platform} -\label{sec.energy} +\subsection{Energy model for homogeneous platform} -The energy consumption by the processor consists of two powers metric: the -dynamic and the static power. This general power formulation is used by many -researchers see~\cite{9,3,15,26}. The dynamic power of the CMOS processors -$P_{dyn}$ is related to the switching activity $\alpha$, load capacitance $C_L$, -the supply voltage $V$ and operational frequency $f$ respectively as follow : +Many researchers~\cite{9,3,15,26} divide the power consumed by a processor to +two power metrics: the static and the dynamic power. While the first one is +consumed as long as the computing unit is on, the latter is only consumed during +computation times. 
The dynamic power $P_{dyn}$ is related to the switching +activity $\alpha$, load capacitance $C_L$, the supply voltage $V$ and +operational frequency $f$, as shown in EQ~(\ref{eq:pd}). \begin{equation} \label{eq:pd} - P_{dyn} = \alpha \cdot C_L \cdot V^2 \cdot f + P_\textit{dyn} = \alpha \cdot C_L \cdot V^2 \cdot f \end{equation} -The static power $P_{static}$ captures the leakage power consumption as well as -the power consumption of peripheral devices like the I/O subsystem. +The static power $P_{static}$ captures the leakage power as follows: \begin{equation} \label{eq:ps} - P_{static} = V \cdot N \cdot K_{design} \cdot I_{leak} + P_\textit{static} = V \cdot N_{trans} \cdot K_{design} \cdot I_{leak} \end{equation} -where V is the supply voltage, N is the number of transistors, $K_{design}$ is a -design dependent parameter and $I_{leak}$ is a technology-dependent -parameter. Energy consumed by an individual processor $E_{ind}$ is the summation -of the dynamic and the static power multiply by the execution time for example -see~\cite{36,15}. +where V is the supply voltage, $N_{trans}$ is the number of transistors, +$K_{design}$ is a design dependent parameter and $I_{leak}$ is a +technology-dependent parameter. The energy consumed by an individual processor +to execute a given program can be computed as: \begin{equation} \label{eq:eind} - E_{ind} = ( P_{dyn} + P_{static} ) \cdot T + E_\textit{ind} = P_\textit{dyn} \cdot T_{Comp} + P_\textit{static} \cdot T \end{equation} -The dynamic voltage and frequency scaling (DVFS) is a process that allowed in -modern processors to reduce the dynamic power by scaling down the voltage and -frequency. Its main objective is to reduce the overall energy -consumption~\cite{37}. The operational frequency \emph f depends linearly on the -supply voltage $V$, i.e., $V = \beta \cdot f$ with some constant $\beta$. This -equation is used to study the change of the dynamic voltage with respect to -various frequency values in~\cite{3}. The reduction process of the frequency are -expressed by scaling factor \emph S. The scale \emph S is the ratio between the -maximum and the new frequency as in EQ~(\ref{eq:s}). +where $T$ is the execution time of the program, $T_{Comp}$ is the computation +time and $T_{Comp} \leq T$. $T_{Comp}$ may be equal to $T$ if there is no +communications, no slack times and no synchronizations. + +DVFS is a process that is allowed in modern processors to reduce the dynamic +power by scaling down the voltage and frequency. Its main objective is to +reduce the overall energy consumption~\cite{37}. The operational frequency $f$ +depends linearly on the supply voltage $V$, i.e., $V = \beta \cdot f$ with some +constant $\beta$. This equation is used to study the change of the dynamic +voltage with respect to various frequency values in~\cite{3}. The reduction +process of the frequency can be expressed by the scaling factor $S$ which is the +ratio between the maximum and the new frequency as in EQ~(\ref{eq:s}). \begin{equation} \label{eq:s} - S = \frac{F_{max}}{F_{new}} + S = \frac{F_\textit{max}}{F_\textit{new}} \end{equation} -The value of the scale $S$ is greater than 1 when changing the frequency to -any new frequency value (\emph {P-state}) in governor. -\AG{Explain what's a governor} -It is equal to 1 when the -frequency are set to the maximum frequency. The energy consumption model for -parallel homogeneous platform is depending on the scaling factor \emph S. This -factor reduces quadratically the dynamic power. 
Also, this factor increases the
-static energy linearly because the execution time is increased~\cite{36}. The
-energy model, depending on the frequency scaling factor, of homogeneous platform
-for any number of concurrent tasks develops by Rauber~\cite{3}. This model
-consider the two powers metric for measuring the energy of the parallel tasks as
-in EQ~(\ref{eq:energy}).
+The value of the scaling factor $S$ is greater than 1 when changing the
+frequency of the CPU to any new frequency value~(\emph{P-state}) in the
+governor. The CPU governor is an interface driver supplied by the operating
+system's kernel to lower a core's frequency. This factor reduces the dynamic
+power quadratically, but it may also degrade the performance and thus increase
+the static energy because the execution time is increased~\cite{36}.
+If the tasks are sorted according to their execution times before scaling in a
+descending order, the total energy consumption model for a parallel homogeneous
+platform, as presented by Rauber and Rünger~\cite{3}, can be written as a
+function of the scaling factor $S$, as in EQ~(\ref{eq:energy}).
 \begin{equation}
   \label{eq:energy}
-  E = P_{dyn} \cdot S_1^{-2} \cdot
+  E = P_\textit{dyn} \cdot S_1^{-2} \cdot
     \left( T_1 + \sum_{i=2}^{N} \frac{T_i^3}{T_1^2} \right) +
-  P_{static} \cdot T_1 \cdot S_1 \cdot N
+  P_\textit{static} \cdot T_1 \cdot S_1 \cdot N
 \hfill
 \end{equation}
-Where \emph N is the number of parallel nodes, $T_1 $ is the time of the slower
-task, $T_i$ is the time of the task $i$ and $S_1$ is the maximum scaling factor
-for the slower task. The scaling factor $S_1$, as in EQ~(\ref{eq:s1}), selects
-from the set of scales values $S_i$. Each of these scales are proportional to
-the time value $T_i$ depends on the new frequency value as in EQ~(\ref{eq:si}).
-\begin{equation}
-  \label{eq:s1}
-  S_1 = \max_{i=1,2,\dots,F} S_i
-\end{equation}
+where $N$ is the number of parallel nodes, and $T_i$ and $S_i$ for $i=1,\dots,N$
+are the execution times and scaling factors of the sorted tasks. Therefore,
+$T_1$ is the time of the slowest task and $S_1$ its scaling factor; according to
+EQ~(\ref{eq:si}), the scaling factors are inversely proportional to the time
+values $T_i$, so $S_1$ is the smallest one. The scaling factors are computed as
+in EQ~(\ref{eq:si}).
 \begin{equation}
   \label{eq:si}
   S_i = S \cdot \frac{T_1}{T_i}
-      = \frac{F_{max}}{F_{new}} \cdot \frac{T_1}{T_i}
+      = \frac{F_\textit{max}}{F_\textit{new}} \cdot \frac{T_1}{T_i}
 \end{equation}
-Where $F$ is the number of available frequencies. In this paper we depend on
-Rauber's energy model EQ~(\ref{eq:energy}) for two reasons : 1-this model used
-for homogeneous platform that we work on in this paper. 2-we are compare our
-algorithm with Rauber's scaling model. Rauber's optimal scaling factor for
-optimal energy reduction derived from the EQ~(\ref{eq:energy}). He takes the
-derivation for this equation (to be minimized) and set it to zero to produce the
-scaling factor as in EQ~(\ref{eq:sopt}).
+In this paper we rely on the Rauber and Rünger energy model of
+EQ~(\ref{eq:energy}) for two reasons: (1) this model can be used for any number
+of concurrent tasks, and (2) we compare our algorithm with the Rauber and Rünger
+scaling factor selection method, which is based on EQ~(\ref{eq:energy}). The
+optimal scaling factor for energy is obtained by setting the derivative of this
+equation with respect to $S_1$ to zero, which produces EQ~(\ref{eq:sopt}).
+ \begin{equation} \label{eq:sopt} - S_{opt} = \sqrt[3]{\frac{2}{n} \cdot \frac{P_{dyn}}{P_{static}} \cdot + S_\textit{opt} = \sqrt[3]{\frac{2}{N} \cdot \frac{P_\textit{dyn}}{P_\textit{static}} \cdot \left( 1 + \sum_{i=2}^{N} \frac{T_i^3}{T_1^3} \right) } \end{equation} -\section{Performance Evaluation of MPI Programs} + +\section{Performance evaluation of MPI programs} \label{sec.mpip} -The performance (execution time) of the parallel MPI applications are depends on -the time of the slowest task as in figure~(\ref{fig:homo}). Normally the -execution time of the parallel programs are proportional to the operational -frequency. Therefore, any DVFS operation for the energy reduction increase the -execution time of the parallel program. As shown in EQ~(\ref{eq:energy}) the -energy affected by the scaling factor $S$. This factor also has a great impact -on the performance. When scaling down the frequency to the new value according -to EQ~(\ref{eq:s}) lead to the value of the scale $S$ has inverse relation with -new frequency value ($S \propto \frac{1}{F_{new}}$). Also when decrease the -frequency value, the execution time increase. Then the new frequency value has -inverse relation with time ($F_{new} \propto \frac{1}{T}$). This lead to the -frequency scaling factor $S$ proportional linearly with execution time ($S -\propto T$). Large scale MPI applications such as NAS benchmarks have -considerable amount of communications embedded in these programs. During the -communication process the processor remain idle until the communication has -finished. For that reason any change in the frequency has no impact on the time -of communication but it has obvious impact on the time of -computation~\cite{17}. We are made many tests on real cluster to prove that the -frequency scaling factor \emph S has a linear relation with computation time -only also see~\cite{41}. To predict the execution time of MPI program, firstly -must be precisely specifying communication time and the computation time for the -slower task. Secondly, we use these times for predicting the execution time for -any MPI program as a function of the new scaling factor as in the -EQ~(\ref{eq:tnew}). +The performance (execution time) of parallel synchronous MPI applications depend +on the time of the slowest task as in figure~(\ref{fig:homo}). If there is no +communication and the application is not data bounded, the execution time of a +parallel program is linearly proportional to the operational frequency and any +DVFS operation for energy reduction increases the execution time of the parallel +program. Therefore, the scaling factor $S$ is linearly proportional to the +execution time. However, in most of MPI applications the processes exchange +data. During these communications the processors involved remain idle until the +communications are finished. For that reason any change in the frequency has no +impact on the time of communication~\cite{17}. The communication time for a +task is the summation of periods of time that begin with an MPI call for sending +or receiving a message till the message is synchronously sent or received. To +be able to predict the execution time of MPI program, the communication time and +the computation time for the slower task must be measured before scaling. These +times are used to predict the execution time for any MPI program as a function +of the new scaling factor as in EQ~(\ref{eq:tnew}). 
\begin{equation} \label{eq:tnew} - T_{new} = T_{\textit{Max Comp Old}} \cdot S + T_{\textit{Max Comm Old}} + \textit T_\textit{new} = T_\textit{Max Comp Old} \cdot S + T_{\textit{Max Comm Old}} \end{equation} -The above equation shows that the scaling factor \emph S has linear relation -with the computation time without affecting the communication time. The -communication time consists of the beginning times which an MPI calls for -sending or receiving till the message is synchronously sent or received. In this -paper we predict the execution time of the program for any new scaling factor -value. Depending on this prediction we can produce our energy-performance scaling -method as we will show in the coming sections. In the next section we make an -investigation study for the EQ~(\ref{eq:tnew}). - -\section{Performance Prediction Verification} -\label{sec.verif} - -In this section we evaluate the precision of our performance prediction methods -on the NAS benchmark. We use the EQ~(\ref{eq:tnew}) that predicts the execution -time for any scale value. The NAS programs run the class B for comparing the -real execution time with the predicted execution time. Each program runs offline -with all available scaling factors on 8 or 9 nodes to produce real execution -time values. These scaling factors are computed by dividing the maximum -frequency by the new one see EQ~(\ref{eq:s}). In all tests, we use the simulator -SimGrid/SMPI v3.10 to run the NAS programs. -\begin{figure*}[t] - \centering - \includegraphics[width=.4\textwidth]{cg_per.eps}\qquad% - \includegraphics[width=.4\textwidth]{mg_pre.eps} - \includegraphics[width=.4\textwidth]{bt_pre.eps}\qquad% - \includegraphics[width=.4\textwidth]{lu_pre.eps} - \caption{Fitting Predicted to Real Execution Time} - \label{fig:pred} -\end{figure*} -%see Figure~\ref{fig:pred} -In our cluster there are 18 available frequency states for each processor from -2.5 GHz to 800 MHz, there is 100 MHz difference between two successive -frequencies. For more details on the characteristics of the platform refer to -table~(\ref{table:platform}). This lead to 18 run states for each program. We -use seven MPI programs of the NAS parallel benchmarks : CG, MG, EP, FT, BT, LU -and SP. The average normalized errors between the predicted execution time and -the real time (SimGrid time) for all programs is between 0.0032 to 0.0133. AS an -example, we are present the execution times of the NAS benchmarks as in the -figure~(\ref{fig:pred}). - -\section{Performance to Energy Competition} +In this paper, this prediction method is used to select the best scaling factor +for each processor as presented in the next section. + +\section{Performance to energy competition} \label{sec.compet} -This section demonstrates our approach for choosing the optimal scaling -factor. This factor gives maximum energy reduction taking into account the -execution time for both computation and communication times. The relation -between the energy and the performance are nonlinear and complex, because the -relation of the energy with scaling factor is nonlinear and with the performance -it is linear see~\cite{17}. The relation between the energy and the performance -is not straightforward. Moreover, they are not measured using the same metric. -For solving this problem, we normalize the energy by calculating the ratio -between the consumed energy with scaled frequency and the consumed energy -without scaled frequency : +This section demonstrates our approach for choosing the optimal scaling factor. 
+This factor gives maximum energy reduction taking into account the execution
+times for both computation and communication. The relation between the energy
+and the performance is nonlinear and complex, because the relation of the energy
+with the scaling factor is nonlinear while its relation with the performance is
+linear, see~\cite{17}. Moreover, they are not measured using the same metric.
+For solving this problem, we normalize the energy by calculating the ratio
+between the consumed energy with scaled frequency and the consumed energy
+without scaled frequency:
 \begin{multline}
   \label{eq:enorm}
-  E_\textit{Norm} = \frac{E_{Reduced}}{E_{Original}}\\
-  {} = \frac{ P_{dyn} \cdot S_i^{-2} \cdot
+  E_\textit{Norm} = \frac{ E_\textit{Reduced}}{E_\textit{Original}} \\
+  {} = \frac{P_\textit{dyn} \cdot S_1^{-2} \cdot
        \left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
-      P_{static} \cdot T_1 \cdot S_i \cdot N }{
-      P_{dyn} \cdot \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
-      P_{static} \cdot T_1 \cdot N }
+      P_\textit{static} \cdot T_1 \cdot S_1 \cdot N }{
+      P_\textit{dyn} \cdot \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
+      P_\textit{static} \cdot T_1 \cdot N }
 \end{multline}
-\AG{Use \texttt{\textbackslash{}text\{xxx\}} or
-  \texttt{\textbackslash{}textit\{xxx\}} for all subscripted words in equations
-  (e.g. \mbox{\texttt{E\_\{\textbackslash{}text\{Norm\}\}}}).
-
-  Don't hesitate to define new commands :
-  \mbox{\texttt{\textbackslash{}newcommand\{\textbackslash{}ENorm\}\{E\_\{\textbackslash{}text\{Norm\}\}\}}}
-}
-By the same way we can normalize the performance as follows :
+In the same way, we can normalize the performance as follows:
 \begin{equation}
   \label{eq:pnorm}
-  P_{Norm} = \frac{T_{New}}{T_{Old}}
-          = \frac{T_{\textit{Max Comp Old}} \cdot S +
-           T_{\textit{Max Comm Old}}}{T_{Old}}
+  P_\textit{Norm} = \frac{T_\textit{New}}{T_\textit{Old}}
+          = \frac{T_\textit{Max Comp Old} \cdot S +
+           T_\textit{Max Comm Old}}{T_\textit{Max Comp Old} +
+           T_\textit{Max Comm Old}}
 \end{equation}
-The second problem is the optimization operation for both energy and performance
-is not in the same direction. In other words, the normalized energy and the
-performance curves are not in the same direction see figure~(\ref{fig:r2}).
-While the main goal is to optimize the energy and performance in the same
-time. According to the equations~(\ref{eq:enorm}) and~(\ref{eq:pnorm}) the
-scaling factor \emph S reduce both the energy and the performance
-simultaneously. But the main objective is to produce maximum energy reduction
-with minimum performance reduction. Many researchers used different strategies
-to solve this nonlinear problem for example see~\cite{19,42}, their methods add
-big overhead to the algorithm for selecting the suitable frequency. In this
-paper we are present a method to find the optimal scaling factor \emph S for
-optimize both energy and performance simultaneously without adding big
+The second problem is that the optimization operations for energy and for
+performance do not go in the same direction. In other words, the normalized
+energy and performance curves do not evolve in the same direction, see
+figure~(\ref{fig:r2}), while the main goal is to optimize both the energy and
+the performance at the same time. According to the equations~(\ref{eq:enorm})
+and~(\ref{eq:pnorm}), the scaling factor $S$ reduces both the energy and the
+performance simultaneously. But the main objective is to produce maximum energy
+reduction with minimum performance reduction.
Many researchers have used different
+strategies to solve this nonlinear problem, see for example~\cite{19,42}, but
+their methods add a big overhead to the algorithm that selects the suitable
+frequency. In this paper we present a method to find the optimal scaling factor
+$S$ for optimizing both energy and performance simultaneously without adding big
 overheads.  Our solution for this problem is to make the optimization process
-have the same direction.  Therefore, we inverse the equation of normalize
-performance as follows :
+have the same direction.  Therefore, we invert the normalized performance
+equation as follows:
 \begin{equation}
   \label{eq:pnorm_en}
-  P^{-1}_{Norm} = \frac{T_{Old}}{T_{New}}
-               = \frac{T_{Old}}{T_{\textit{Max Comp Old}} \cdot S +
-                 T_{\textit{Max Comm Old}}}
+  P^{-1}_\textit{Norm} = \frac{ T_\textit{Old}}{ T_\textit{New}}
+               = \frac{T_\textit{Max Comp Old} +
+                 T_\textit{Max Comm Old}}{T_\textit{Max Comp Old} \cdot S +
+                 T_\textit{Max Comm Old}}
 \end{equation}
 \begin{figure*}
   \centering
-  \subfloat[Converted Relation.]{%
-    \includegraphics[width=.33\textwidth]{file.eps}\label{fig:r1}}%
+  \subfloat[Converted relation.]{%
+    \includegraphics[width=.4\textwidth]{fig/file}\label{fig:r1}}%
   \qquad%
-  \subfloat[Real Relation.]{%
-    \includegraphics[width=.33\textwidth]{file3.eps}\label{fig:r2}}
+  \subfloat[Real relation.]{%
+    \includegraphics[width=.4\textwidth]{fig/file3}\label{fig:r2}}
   \label{fig:rel}
-  \caption{The Energy and Performance Relation}
+  \caption{The energy and performance relation}
 \end{figure*}
 Then, we can model our objective function as finding the maximum distance
 between the energy curve EQ~(\ref{eq:enorm}) and the inverted performance
-curve EQ~(\ref{eq:pnorm_en}) over all available scaling factors.  This represent
-the minimum energy consumption with minimum execution time (better performance)
-in the same time, see figure~(\ref{fig:r1}). Then our objective function has the
-following form:
+curve EQ~(\ref{eq:pnorm_en}) over all available scaling factors.  This
+represents the minimum energy consumption with minimum execution time (better
+performance) at the same time, see figure~(\ref{fig:r1}).  Then our objective
+function has the following form:
 \begin{equation}
   \label{eq:max}
-  \textit{MaxDist} = \max (\overbrace{P^{-1}_{Norm}}^{\text{Maximize}} -
-                           \overbrace{E_{Norm}}^{\text{Minimize}} )
+  \textit{MaxDist} = \max_{j=1,2,\dots,F}
+      (\overbrace{P^{-1}_\textit{Norm}(S_j)}^{\text{Maximize}} -
+       \overbrace{E_\textit{Norm}(S_j)}^{\text{Minimize}} )
 \end{equation}
-Then we can select the optimal scaling factor that satisfy the
-EQ~(\ref{eq:max}). Our objective function can works with any energy model or
-static power values stored in a data file. Moreover, this function works in
-optimal way when the energy function has a convex form with frequency scaling
-factor as shown in~\cite{15,3,19}. Energy measurement model is not the
-objective of this paper and we choose Rauber's model as an example with two
-reasons that mentioned before.
-
-\section{Optimal Scaling Factor for Performance and Energy}
+where $F$ is the number of available frequencies. Then we can select the optimal
+scaling factor that satisfies EQ~(\ref{eq:max}). Our objective function can
+work with any energy model or any static power values stored in a data file.
+Moreover, this function works optimally when the energy curve has a convex
+form over the available frequency scaling factors, as shown in~\cite{15,3,19}.
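+
+To make the selection process concrete, the following sketch (ours, not the
+authors' implementation) evaluates the normalized energy of EQ~(\ref{eq:enorm})
+and the inverted normalized performance of EQ~(\ref{eq:pnorm_en}) for every
+available frequency, and returns the scaling factor that maximizes their
+distance, as in EQ~(\ref{eq:max}). The task times, communication time, power
+values and frequency list are hypothetical inputs that would be measured or
+configured in practice.
+\begin{lstlisting}[language=Python]
+# Sketch of the scaling factor selection objective (not the authors' code).
+def normalized_energy(s1, times, p_dyn, p_stat):
+    # times: computation times sorted in descending order, times[0] = T_1
+    t1, n = times[0], len(times)
+    load = t1 + sum(t ** 3 / t1 ** 2 for t in times[1:])
+    e_reduced = p_dyn * s1 ** -2 * load + p_stat * t1 * s1 * n
+    e_original = p_dyn * load + p_stat * t1 * n
+    return e_reduced / e_original
+
+def inverted_normalized_performance(s, t_comp_max, t_comm_max):
+    # T_old / T_new, where only the computation part is scaled by S
+    return (t_comp_max + t_comm_max) / (t_comp_max * s + t_comm_max)
+
+def best_scaling_factor(frequencies, times, t_comm_max, p_dyn, p_stat):
+    # The slowest task keeps the smallest factor S_1 = S (see EQ (eq:si)).
+    f_max = max(frequencies)
+    best_s, best_dist = 1.0, float("-inf")
+    for f_new in frequencies:
+        s = f_max / f_new
+        dist = (inverted_normalized_performance(s, times[0], t_comm_max)
+                - normalized_energy(s, times, p_dyn, p_stat))
+        if dist > best_dist:
+            best_s, best_dist = s, dist
+    return best_s
+\end{lstlisting}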
+ +\section{Optimal scaling factor for performance and energy} \label{sec.optim} -In the previous section we described the objective function that satisfy our -goal in discovering optimal scaling factor for both performance and energy at -the same time. Therefore, we develop an energy to performance scaling algorithm -(EPSA). This algorithm is simple and has a direct way to calculate the optimal -scaling factor for both energy and performance at the same time. +Algorithm~\ref{EPSA} computes the optimal scaling factor according to the +objective function described above. \begin{algorithm}[tp] - \caption{EPSA} + \caption{Scaling factor selection algorithm} \label{EPSA} \begin{algorithmic}[1] \State Initialize the variable $Dist=0$ \State Set dynamic and static power values. \State Set $P_{states}$ to the number of available frequencies. \State Set the variable $F_{new}$ to max. frequency, $F_{new} = F_{max} $ - \State Set the variable $F_{diff}$ to the scale value between each two frequencies. - \For {$i=1$ to $P_{states} $} - \State - Calculate the new frequency as $F_{new}=F_{new} - F_{diff} $ - \State - Calculate the scale factor $S$ as in EQ~(\ref{eq:s}). - \State - Calculate all available scales $S_i$ depend on $S$ as in EQ~(\ref{eq:si}). - \State - Select the maximum scale factor $S_1$ from the set of scales $S_i$. - \State - Calculate the normalize energy $E_{Norm}=E_{R}/E_{O}$ as in EQ~(\ref{eq:enorm}). - \State - Calculate the normalize inverse of performance\par - $P_{NormInv}=T_{old}/T_{new}$ as in EQ~(\ref{eq:pnorm_en}). - \If{ $(P_{NormInv}-E_{Norm} > Dist$) } - \State $S_{optimal} = S$ + \State Set the variable $F_{diff}$ to the difference between two successive + frequencies. + \For {$j:=1$ to $P_{states} $} + \State $F_{new}=F_{new} - F_{diff} $ + \State $S = \frac{F_\textit{max}}{F_\textit{new}}$ + \State $S_i = S \cdot \frac{T_1}{T_i} + = \frac{F_\textit{max}}{F_\textit{new}} \cdot \frac{T_1}{T_i}$ + for $i=1,\dots,N$ + \State $E_\textit{Norm} = + \frac{P_\textit{dyn} \cdot S_1^{-2} \cdot + \left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) + + P_\textit{static} \cdot T_1 \cdot S_1 \cdot N }{ + P_\textit{dyn} \cdot + \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) + + P_\textit{static} \cdot T_1 \cdot N }$ + \State $P_{NormInv}=T_{old}/T_{new}$ + \If{$(P_{NormInv}-E_{Norm} > Dist)$} + \State $S_{opt} = S$ \State $Dist = P_{NormInv} - E_{Norm}$ \EndIf \EndFor - \State Return $S_{optimal}$ + \State Return $S_{opt}$ \end{algorithmic} \end{algorithm} -The proposed EPSA algorithm works online during the execution time of the MPI -program. It selects the optimal scaling factor by gathering some information -from the program after one iteration. This algorithm has small execution time -(between 0.00152 $ms$ for 4 nodes to 0.00665 $ms$ for 32 nodes). The data -required by this algorithm is the computation time and the communication time -for each task from the first iteration only. When these times are measured, the -MPI program calls the EPSA algorithm to choose the new frequency using the -optimal scaling factor. Then the program set the new frequency to the -system. The algorithm is called just one time during the execution of the -program. The DVFS algorithm~(\ref{dvfs}) shows where and when the EPSA algorithm is called -in the MPI program. -%\begin{minipage}{\textwidth} -%\AG{Use the same format as for Algorithm~\ref{EPSA}} + +The proposed algorithm works online during the execution time of the MPI +program. 
It selects the optimal scaling factor after gathering the computation +and communication times from the program after one iteration. Then the program +changes the new frequencies of the CPUs according to the computed scaling +factors. This algorithm has a small execution time: for a homogeneous cluster +composed of nodes having the characteristics presented in +table~\ref{table:platform}, it takes \np[ms]{0.00152} on average for 4 nodes and +\np[ms]{0.00665} on average for 32 nodes. The algorithm complexity is $O(F\cdot +N)$, where $F$ is the number of available frequencies and $N$ is the number of +computing nodes. The algorithm is called just once during the execution of the +program. The DVFS algorithm~(\ref{dvfs}) shows where and when the algorithm is +called in the MPI program. +\begin{table}[htb] + \caption{Platform file parameters} + % title of Table + \centering + \begin{tabular}{|*{7}{l|}} + \hline + Max & Min & Backbone & Backbone & Link & Link & Sharing \\ + Freq. & Freq. & Bandwidth & Latency & Bandwidth & Latency & Policy \\ + \hline + \np{2.5} & \np{800} & \np[GBps]{2.25} & \np[$\mu$s]{0.5} & \np[GBps]{1} & \np[$\mu$s]{50} & Full \\ + GHz & MHz & & & & & Duplex \\ + \hline + \end{tabular} + \label{table:platform} +\end{table} \begin{algorithm}[tp] \caption{DVFS} \label{dvfs} - \begin{algorithmic} - \For {$J:=1$ to $Some-Iterations \; $} - \State -Computations Section. - \State -Communications Section. - \If {$(J==1)$} - \State -Gather all times of computation and\par - \State communication from each node. - \State -Call EPSA with these times. - \State -Calculate the new frequency from optimal scale. - \State -Set the new frequency to the system. - \EndIf -\EndFor -\end{algorithmic} + \begin{algorithmic}[1] + \For {$k:=1$ to \textit{some iterations}} + \State Computations section. + \State Communications section. + \If {$(k=1)$} + \State Gather all times of computation and\newline\hspace*{3em}% + communication from each node. + \State Call algorithm~\ref{EPSA} with these times. + \State Compute the new frequency from the\newline\hspace*{3em}% + returned optimal scaling factor. + \State Set the new frequency to the CPU. + \EndIf + \EndFor + \end{algorithmic} \end{algorithm} - -After obtaining the optimal scale factor from the EPSA algorithm. The program -calculates the new frequency $F_i$ for each task proportionally to its time -value $T_i$. By substitution of the EQ~(\ref{eq:s}) in the EQ~(\ref{eq:si}), we -can calculate the new frequency $F_i$ as follows : +After obtaining the optimal scaling factor, the program calculates the new +frequency $F_i$ for each task proportionally to its time value $T_i$. By +substitution of EQ~(\ref{eq:s}) in EQ~(\ref{eq:si}), we can calculate the new +frequency $F_i$ as follows: \begin{equation} \label{eq:fi} - F_i = \frac{F_{max} \cdot T_i}{S_{optimal} \cdot T_{max}} + F_i = \frac{F_\textit{max} \cdot T_i}{S_\textit{optimal} \cdot T_\textit{max}} \end{equation} According to this equation all the nodes may have the same frequency value if -they have balanced workloads. Otherwise, they take different frequencies when -have imbalanced workloads. Then EQ~(\ref{eq:fi}) works in adaptive way to change -the frequency according to the nodes workloads. +they have balanced workloads, otherwise, they take different frequencies when +having imbalanced workloads. Thus, EQ~(\ref{eq:fi}) adapts the frequency of the +CPU to the nodes' workloads to maintain performance. 
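+
+To illustrate EQ~(\ref{eq:fi}), the following sketch (our own illustration, not
+the authors' code) computes the new frequency of each node from the scaling
+factor returned by algorithm~\ref{EPSA} and from the computation times measured
+at the first iteration. Snapping the result to the closest available gear is an
+assumption on our side, since a computed $F_i$ does not necessarily match one of
+the available \emph{P-states}; the gear list and times below are hypothetical.
+\begin{lstlisting}[language=Python]
+# Sketch: map the selected scaling factor to one frequency gear per node.
+def new_frequencies(s_opt, times, gears):
+    # gears: available CPU frequencies (e.g. in MHz); times: per-node
+    # computation times measured during the first iteration.
+    f_max, t_max = max(gears), max(times)
+    new_freqs = []
+    for t_i in times:
+        f_i = f_max * t_i / (s_opt * t_max)
+        # Assumption: round to the closest available gear.
+        new_freqs.append(min(gears, key=lambda g: abs(g - f_i)))
+    return new_freqs
+
+# Hypothetical example: 4 nodes, 18 gears from 800 MHz to 2500 MHz.
+gears = [800 + 100 * i for i in range(18)]
+print(new_frequencies(1.38, [12.0, 10.5, 9.0, 12.0], gears))
+\end{lstlisting}
+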
-\section{Experimental Results} +\section{Experimental results} \label{sec.expe} - -The proposed EPSA algorithm was applied to seven MPI programs of the NAS -benchmarks (EP, CG, MG, FT, BT, LU and SP). We work on three classes (A, B and -C) for each program. Each program runs on specific number of processors -proportional to the size of the class. Each class represents the problem size -ascending from the class A to C. Additionally, depending on some speed up points -for each class we run the classes A, B and C on 4, 8 or 9 and 16 nodes -respectively. Our experiments are executed on the simulator SimGrid/SMPI -v3.10. We design a platform file that simulates a cluster with one core per -node. This cluster is a homogeneous architecture with distributed memory. The -detailed characteristics of our platform file are shown in the -table~(\ref{table:platform}). Each node in the cluster has 18 frequency values -from 2.5 GHz to 800 MHz with 100 MHz difference between each two successive -frequencies. -\begin{table}[htb] - \caption{Platform File Parameters} - % title of Table +Our experiments are executed on the simulator SimGrid/SMPI v3.10. We configure +the simulator to use a homogeneous cluster with one core per node. The detailed +characteristics of our platform file are shown in the +table~(\ref{table:platform}). Each node in the cluster has 18 frequency values +from \np[GHz]{2.5} to \np[MHz]{800} with \np[MHz]{100} difference between each +two successive frequencies. The simulated network link is \np[GB]{1} Ethernet +(TCP/IP). The backbone of the cluster simulates a high performance switch. + +\subsection{Performance prediction verification} + +In this section we evaluate the precision of our performance prediction method +based on EQ~(\ref{eq:tnew}) by applying it the NAS benchmarks. The NAS programs +are executed with the class B option for comparing the real execution time with +the predicted execution time. Each program runs offline with all available +scaling factors on 8 or 9 nodes (depending on the benchmark) to produce real +execution time values. These scaling factors are computed by dividing the +maximum frequency by the new one see EQ~(\ref{eq:s}). +\begin{figure*}[t] \centering - \begin{tabular}{ | l | l | l |l | l |l |l | p{2cm} |} - \hline - Max & Min & Backbone & Backbone&Link &Link& Sharing \\ - Freq. & Freq. & Bandwidth & Latency & Bandwidth& Latency&Policy \\ \hline - 2.5 &800 & 2.25 GBps &$5\times 10^{-7} s$& 1 GBps & $5\times 10^{-5}$ s&Full \\ - GHz& MHz& & & & &Duplex \\\hline - \end{tabular} - \label{table:platform} -\end{table} -Depending on the EQ~(\ref{eq:energy}), we measure the energy consumption for all -the NAS MPI programs while assuming the power dynamic is equal to 20W and the -power static is equal to 4W for all experiments. We run the proposed EPSA -algorithm for all these programs. The results showed that the algorithm selected -different scaling factors for each program depending on the communication -features of the program as in the figure~(\ref{fig:nas}). This figure shows that -there are different distances between the normalized energy and the normalized -inversed performance curves, because there are different communication features -for each MPI program. When there are little or not communications, the inversed -performance curve is very close to the energy curve. Then the distance between -the two curves is very small. This lead to small energy savings. 
The opposite
+  \includegraphics[width=.328\textwidth]{fig/cg_per}\hfill%
+  \includegraphics[width=.328\textwidth]{fig/mg_pre}\hfill%
+  % \includegraphics[width=.4\textwidth]{fig/bt_pre}\qquad%
+  \includegraphics[width=.328\textwidth]{fig/lu_pre}\hfill%
+  \caption{Comparing predicted to real execution time}
+  \label{fig:pred}
+\end{figure*}
+%see Figure~\ref{fig:pred}
+In our cluster there are 18 available frequency states for each processor. This
+leads to 18 run states for each program. We use seven MPI programs of the NAS
+parallel benchmarks: CG, MG, EP, FT, BT, LU and SP. Figure~(\ref{fig:pred})
+presents plots of the real (simulated) execution times and the predicted ones.
+The maximum normalized error between these two execution times varies between
+\np{0.0073}\AG[]{unit?} and \np{0.031}, depending on the executed benchmark. The
+smallest prediction error was for CG and the worst one was for LU.
+
+\subsection{The experimental results for the scaling algorithm}
+
+The proposed algorithm was applied to seven MPI programs of the NAS benchmarks
+(EP, CG, MG, FT, BT, LU and SP) which were run with three classes (A, B and C).
+For each instance the benchmarks were executed on a number of processors
+proportional to the size of the class. Each class represents the problem size,
+ascending from class A to C. Additionally, depending on some speed-up points
+for each class, we run the classes A, B and C on 4, 8 or 9, and 16 nodes
+respectively. Using EQ~(\ref{eq:energy}), we measure the energy consumption for
+all the NAS MPI programs while assuming that the dynamic power at the highest
+frequency is equal to \np[W]{20} and the static power is equal to \np[W]{4} for
+all experiments. These power values were also used by Rauber and
+Rünger in~\cite{3}. The results showed that the algorithm selected different
+scaling factors for each program depending on the communication features of the
+program, as in the plots~(\ref{fig:nas}). These plots illustrate that there are
+different distances between the normalized energy and the normalized inverted
+performance curves, because there are different communication features for each
+benchmark. When there is little or no communication, the inverted
+performance curve is very close to the energy curve. Then the distance between
+the two curves is very small. This leads to small energy savings. The opposite
 happens when there are a lot of communications: the distance between the two
-curves is big. This lead to more energy savings (e.g. CG and FT), see
-table~(\ref{table:factors results}). All discovered frequency scaling factors
-optimize both the energy and the performance simultaneously for all the NAS
-programs. In table~(\ref{table:factors results}), we record all optimal scaling
-factors results for each program on class C. These factors give the maximum
-energy saving percent and the minimum performance degradation percent in the
-same time over all available scales.
+curves is big. This leads to more energy savings (e.g. CG and FT), see
+table~(\ref{table:factors results}). All discovered frequency scaling factors
+optimize both the energy and the performance simultaneously for all NAS
+benchmarks. In table~(\ref{table:factors results}), we record the optimal
+scaling factor results for each benchmark running class C. These scaling
+factors give the maximum energy saving percentage and the minimum performance
+degradation percentage at the same time, over all available scaling factors.
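+
+The percentages reported in table~(\ref{table:factors results}) and in the
+comparison tables of the next subsection can be related to the normalized
+metrics as in the sketch below. The formulas for the energy saving and the
+performance degradation are our reading and are not stated explicitly in the
+text; the distance column, however, is consistently the difference between the
+two percentages.
+\begin{lstlisting}[language=Python]
+# Sketch (our reading) of how the table columns relate to the normalized
+# energy and performance obtained for the selected scaling factor.
+def table_columns(e_norm, p_norm):
+    energy_saving = (1.0 - e_norm) * 100.0        # "Energy Saving %"
+    perf_degradation = (p_norm - 1.0) * 100.0     # "Performance Degradation %"
+    distance = energy_saving - perf_degradation   # "Energy-Perf. Distance"
+    return energy_saving, perf_degradation, distance
+
+# E.g., the CG line of the class C table (saving 39.23, degradation 14.88)
+# yields a distance of 39.23 - 14.88 = 24.35, as reported.
+\end{lstlisting}
+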
 \begin{figure*}[t]
  \centering
-  \includegraphics[width=.33\textwidth]{ep.eps}\hfill%
-  \includegraphics[width=.33\textwidth]{cg.eps}\hfill%
-  \includegraphics[width=.33\textwidth]{sp.eps}
-  \includegraphics[width=.33\textwidth]{lu.eps}\hfill%
-  \includegraphics[width=.33\textwidth]{bt.eps}\hfill%
-  \includegraphics[width=.33\textwidth]{ft.eps}
-  \caption{Optimal scaling factors for The NAS MPI Programs}
+  \includegraphics[width=.328\textwidth]{fig/ep}\hfill%
+  \includegraphics[width=.328\textwidth]{fig/cg}\hfill%
+  \includegraphics[width=.328\textwidth]{fig/sp}
+  \includegraphics[width=.328\textwidth]{fig/lu}\hfill%
+  \includegraphics[width=.328\textwidth]{fig/bt}\hfill%
+  \includegraphics[width=.328\textwidth]{fig/ft}
+  \caption{Optimal scaling factors for the predicted energy and performance of NAS benchmarks}
  \label{fig:nas}
 \end{figure*}
 \begin{table}[htb]
-  \caption{Optimal Scaling Factors Results}
+  \caption{The scaling factors results}
  % title of Table
  \centering
-  \AG{Use the same number of decimals for all numbers in a column,
-    and vertically align the numbers along the decimal points.
-    The same for all the following tables.}
-  \begin{tabular}{ | l | l | l |l | l |  p{2cm} |}
+  \begin{tabular}{|l|*{4}{r|}}
+    \hline
+    Program & Optimal        & Energy    & Performance    & Energy-Perf. \\
+    Name    & Scaling Factor & Saving \% & Degradation \% & Distance     \\
+    \hline
+    CG      & 1.56           & 39.23     & 14.88          & 24.35        \\
+    \hline
+    MG      & 1.47           & 34.97     & 21.70          & 13.27        \\
+    \hline
+    EP      & 1.04           & 22.14     & 20.73          & 1.41         \\
+    \hline
+    LU      & 1.38           & 35.83     & 22.49          & 13.34        \\
+    \hline
+    BT      & 1.31           & 29.60     & 21.28          & 8.32         \\
+    \hline
+    SP      & 1.38           & 33.48     & 21.36          & 12.12        \\
+    \hline
+    FT      & 1.47           & 34.72     & 19.00          & 15.72        \\
    \hline
-    Program & Optimal & Energy & Performance&Energy-Perf.\\
-    Name & Scaling Factor& Saving \%&Degradation \% &Distance \\ \hline
-    CG & 1.56 &39.23&14.88 &24.35\\ \hline
-    MG & 1.47 &34.97&21.70 &13.27 \\ \hline
-    EP & 1.04 &22.14&20.73 &1.41\\ \hline
-    LU & 1.38 &35.83&22.49 &13.34\\ \hline
-    BT & 1.31 &29.60&21.28 &8.32\\ \hline
-    SP & 1.38 &33.48&21.36 &12.12\\ \hline
-    FT & 1.47 &34.72&19.00 &15.72\\ \hline
  \end{tabular}
  \label{table:factors results}
  % is used to refer this table in the text
 \end{table}
-
 As shown in the table~(\ref{table:factors results}), when the optimal scaling
 factor has a big value, we can gain more energy savings, for example as in CG and
-FT. The opposite happens when the optimal scaling factor is small value as
-example BT and EP. Our algorithm selects big scaling factor value when the
+FT. The opposite happens when the optimal scaling factor has a small value, as
+for example in BT and EP.  Our algorithm selects a big scaling factor when the
 communication and the other slack times are big, and smaller ones in the opposite
-cases. In EP there are no communications inside the iterations. This make our
-EPSA to selects smaller scaling factor values (inducing smaller energy savings).
-
-\section{Comparing Results}
-\label{sec.compare}
-
-In this section, we compare our EPSA algorithm results with Rauber's
-methods~\cite{3}. He had two scenarios, the first is to reduce energy to optimal
-level without considering the performance as in EQ~(\ref{eq:sopt}). We refer to
-this scenario as $Rauber_{E}$. The second scenario is similar to the first
-except setting the slower task to the maximum frequency (when the scale $S=1$)
-to keep the performance from degradation as mush as possible. We refer to this
-scenario as $Rauber_{E-P}$.
The comparison is made in tables~(\ref{table:compare
-  Class A},\ref{table:compare Class B},\ref{table:compare Class C}). These
-tables show the results of our EPSA and Rauber's two scenarios for all the NAS
-benchmarks programs for classes A,B and C.
-\begin{table*}[p]
-  \caption{Comparing Results for The NAS Class A}
+cases. In EP there are no communications inside the iterations. This makes our
+algorithm select smaller scaling factor values (inducing smaller energy
+savings).
+
+\subsection{Results comparison}
+
+In this section, we compare our scaling factor selection method with the two
+methods of Rauber and Rünger~\cite{3}. Their first scenario reduces the energy
+to the optimal level without considering the performance, as in
+EQ~(\ref{eq:sopt}). We refer to this scenario as $R_{E}$. The second scenario
+is similar to the first, except that the slower task is set to the maximum
+frequency (when the scale $S=1$) to keep the performance from degrading as much
+as possible. We refer to this scenario as $R_{E-P}$. We refer to our own
+algorithm as EPSA (Energy to Performance Scaling Algorithm). The comparison is
+made in tables \ref{table:compareA}, \ref{table:compareB},
+and~\ref{table:compareC}. These tables show the results of our method and of
+the Rauber and Rünger scenarios for all the NAS benchmark programs for classes
+A, B and C.
+\begin{table}[p]
+  \caption{Comparing results for the NAS class A}
  % title of Table
  \centering
-  \begin{tabular}{ | l | l | l |l | l | l| }
+  \begin{tabular}{|l|l|*{4}{r|}}
+    \hline
+    Method    & Program & Factor & Energy    & Performance    & Energy-Perf. \\
+    Name      & Name    & Value  & Saving \% & Degradation \% & Distance     \\
    \hline
-    Method&Program&Factor& Energy& Performance &Energy-Perf.\\
-    name &name&value& Saving \%&Degradation \% &Distance
-    \\ \hline
    % \rowcolor[gray]{0.85}
-    EPSA&CG & 1.56 &37.02 & 13.88 & 23.14\\ \hline
-    $Rauber_{E-P}$&CG &2.14 &42.77 & 25.27 & 17.50\\ \hline
-    $Rauber_{E}$&CG &2.14 &42.77&26.46&16.31\\ \hline
+    $EPSA$    & CG & 1.56 & 37.02 & 13.88 & 23.14 \\ \hline
+    $R_{E-P}$ & CG & 2.14 & 42.77 & 25.27 & 17.50 \\ \hline
+    $R_{E}$   & CG & 2.14 & 42.77 & 26.46 & 16.31 \\ \hline

-    EPSA&MG & 1.47 &27.66&16.82&10.84\\ \hline
-    $Rauber_{E-P}$&MG &2.14&34.45&31.84&2.61\\ \hline
-    $Rauber_{E}$&MG &2.14&34.48&33.65&0.80 \\ \hline
+    $EPSA$    & MG & 1.47 & 27.66 & 16.82 & 10.84 \\ \hline
+    $R_{E-P}$ & MG & 2.14 & 34.45 & 31.84 & 2.61 \\ \hline
+    $R_{E}$   & MG & 2.14 & 34.48 & 33.65 & 0.80 \\ \hline

-    EPSA&EP &1.19 &25.32&20.79&4.53\\ \hline
-    $Rauber_{E-P}$&EP&2.05&41.45&55.67&-14.22\\ \hline
-    $Rauber_{E}$&EP&2.05&42.09&57.59&-15.50\\ \hline
+    $EPSA$    & EP & 1.19 & 25.32 & 20.79 & 4.53 \\ \hline
+    $R_{E-P}$ & EP & 2.05 & 41.45 & 55.67 & -14.22 \\ \hline
+    $R_{E}$   & EP & 2.05 & 42.09 & 57.59 & -15.50 \\ \hline

-    EPSA&LU&1.56& 39.55 &19.38& 20.17\\ \hline
-    $Rauber_{E-P}$&LU&2.14&45.62&27.00&18.62 \\ \hline
-    $Rauber_{E}$&LU&2.14&45.66&33.01&12.65\\ \hline
+    $EPSA$    & LU & 1.56 & 39.55 & 19.38 & 20.17 \\ \hline
+    $R_{E-P}$ & LU & 2.14 & 45.62 & 27.00 & 18.62 \\ \hline
+    $R_{E}$   & LU & 2.14 & 45.66 & 33.01 & 12.65 \\ \hline

-    EPSA&BT&1.31& 29.60&20.53&9.07 \\ \hline
-    $Rauber_{E-P}$&BT&2.10&45.53&49.63&-4.10\\ \hline
-    $Rauber_{E}$&BT&2.10&43.93&52.86&-8.93\\ \hline
+    $EPSA$    & BT & 1.31 & 29.60 & 20.53 & 9.07 \\ \hline
+    $R_{E-P}$ & BT & 2.10 & 45.53 & 49.63 & -4.10 \\ \hline
+    $R_{E}$   & BT & 2.10 & 43.93 & 52.86 & -8.93 \\ \hline

-    EPSA&SP&1.38& 33.51&15.65&17.86 \\ \hline
-    $Rauber_{E-P}$&SP&2.11&45.62&42.52&3.10\\ \hline
-
$Rauber_{E}$&SP&2.11&45.78&43.09&2.69\\ \hline + $EPSA$ & SP & 1.38 & 33.51 & 15.65 & 17.86 \\ \hline + $R_{E-P}$ & SP & 2.11 & 45.62 & 42.52 & 3.10 \\ \hline + $R_{E}$ & SP & 2.11 & 45.78 & 43.09 & 2.69 \\ \hline - EPSA&FT&1.25&25.00&10.80&14.20 \\ \hline - $Rauber_{E-P}$&FT&2.10&39.29&34.30&4.99 \\ \hline - $Rauber_{E}$&FT&2.10&37.56&38.21&-0.65\\ \hline + $EPSA$ & FT & 1.25 & 25.00 & 10.80 & 14.20 \\ \hline + $R_{E-P}$ & FT & 2.10 & 39.29 & 34.30 & 4.99 \\ \hline + $R_{E}$ & FT & 2.10 & 37.56 & 38.21 & -0.65 \\ \hline \end{tabular} - \label{table:compare Class A} + \label{table:compareA} % is used to refer this table in the text -\end{table*} -\begin{table*}[p] - \caption{Comparing Results for The NAS Class B} +\end{table} +\begin{table}[p] + \caption{Comparing results for the NAS class B} % title of Table \centering - \begin{tabular}{ | l | l | l |l | l |l| } + \begin{tabular}{|l|l|*{4}{r|}} + \hline + Method & Program & Factor & Energy & Performance & Energy-Perf. \\ + Name & Name & Value & Saving \% & Degradation \% & Distance \\ \hline - Method&Program&Factor& Energy& Performance &Energy-Perf.\\ - name &name&value& Saving \%&Degradation \% &Distance - \\ \hline % \rowcolor[gray]{0.85} - EPSA&CG & 1.66 &39.23&16.63&22.60 \\ \hline - $Rauber_{E-P}$&CG &2.15 &45.34&27.60&17.74\\ \hline - $Rauber_{E}$&CG &2.15 &45.34&28.88&16.46\\ \hline + $EPSA$ & CG & 1.66 & 39.23 & 16.63 & 22.60 \\ \hline + $R_{E-P}$ & CG & 2.15 & 45.34 & 27.60 & 17.74 \\ \hline + $R_{E}$ & CG & 2.15 & 45.34 & 28.88 & 16.46 \\ \hline - EPSA&MG & 1.47 &34.98&18.35&16.63\\ \hline - $Rauber_{E-P}$&MG &2.14&43.55&36.42&7.13 \\ \hline - $Rauber_{E}$&MG &2.14&43.56&37.07&6.49 \\ \hline + $EPSA$ & MG & 1.47 & 34.98 & 18.35 & 16.63 \\ \hline + $R_{E-P}$ & MG & 2.14 & 43.55 & 36.42 & 7.13 \\ \hline + $R_{E}$ & MG & 2.14 & 43.56 & 37.07 & 6.49 \\ \hline - EPSA&EP &1.08 &20.29&17.15&3.14 \\ \hline - $Rauber_{E-P}$&EP&2.00&42.38&56.88&-14.50\\ \hline - $Rauber_{E}$&EP&2.00&39.73&59.94&-20.21\\ \hline + $EPSA$ & EP & 1.08 & 20.29 & 17.15 & 3.14 \\ \hline + $R_{E-P}$ & EP & 2.00 & 42.38 & 56.88 & -14.50 \\ \hline + $R_{E}$ & EP & 2.00 & 39.73 & 59.94 & -20.21 \\ \hline - EPSA&LU&1.47&38.57&21.34&17.23 \\ \hline - $Rauber_{E-P}$&LU&2.10&43.62&36.51&7.11 \\ \hline - $Rauber_{E}$&LU&2.10&43.61&38.54&5.07 \\ \hline + $EPSA$ & LU & 1.47 & 38.57 & 21.34 & 17.23 \\ \hline + $R_{E-P}$ & LU & 2.10 & 43.62 & 36.51 & 7.11 \\ \hline + $R_{E}$ & LU & 2.10 & 43.61 & 38.54 & 5.07 \\ \hline - EPSA&BT&1.31& 29.59&20.88&8.71\\ \hline - $Rauber_{E-P}$&BT&2.10&44.53&53.05&-8.52\\ \hline - $Rauber_{E}$&BT&2.10&42.93&52.80&-9.87\\ \hline + $EPSA$ & BT & 1.31 & 29.59 & 20.88 & 8.71 \\ \hline + $R_{E-P}$ & BT & 2.10 & 44.53 & 53.05 & -8.52 \\ \hline + $R_{E}$ & BT & 2.10 & 42.93 & 52.80 & -9.87 \\ \hline - EPSA&SP&1.38&33.44&19.24&14.20 \\ \hline - $Rauber_{E-P}$&SP&2.15&45.69&43.20&2.49\\ \hline - $Rauber_{E}$&SP&2.15&45.41&44.47&0.94\\ \hline + $EPSA$ & SP & 1.38 & 33.44 & 19.24 & 14.20 \\ \hline + $R_{E-P}$ & SP & 2.15 & 45.69 & 43.20 & 2.49 \\ \hline + $R_{E}$ & SP & 2.15 & 45.41 & 44.47 & 0.94 \\ \hline - EPSA&FT&1.38&34.40&14.57&19.83 \\ \hline - $Rauber_{E-P}$&FT&2.13&42.98&37.35&5.63 \\ \hline - $Rauber_{E}$&FT&2.13&43.04&37.90&5.14\\ \hline + $EPSA$ & FT & 1.38 & 34.40 & 14.57 & 19.83 \\ \hline + $R_{E-P}$ & FT & 2.13 & 42.98 & 37.35 & 5.63 \\ \hline + $R_{E}$ & FT & 2.13 & 43.04 & 37.90 & 5.14 \\ \hline \end{tabular} - \label{table:compare Class B} + \label{table:compareB} % is used to refer this table in the text -\end{table*} +\end{table} 
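+Note that the last column of tables \ref{table:compareA}
+to~\ref{table:compareC} is simply the difference between the two preceding
+columns, as can be checked on every row; for instance, for the EPSA line of CG
+in class B:
+\begin{equation*}
+  \text{distance} = \text{energy saving} - \text{performance degradation}
+                  = 39.23\% - 16.63\% = 22.60\% .
+\end{equation*}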
-\begin{table*}[p] - \caption{Comparing Results for The NAS Class C} +\begin{table}[p] + \caption{Comparing results for the NAS class C} % title of Table \centering - \begin{tabular}{ | l | l | l |l | l |l| } + \begin{tabular}{|l|l|*{4}{r|}} + \hline + Method & Program & Factor & Energy & Performance & Energy-Perf. \\ + Name & Name & Value & Saving \% & Degradation \% & Distance \\ \hline - Method&Program&Factor& Energy& Performance &Energy-Perf.\\ - name &name&value& Saving \%&Degradation \% &Distance - \\ \hline % \rowcolor[gray]{0.85} - EPSA&CG & 1.56 &39.23&14.88&24.35 \\ \hline - $Rauber_{E-P}$&CG &2.15 &45.36&25.89&19.47\\ \hline - $Rauber_{E}$&CG &2.15 &45.36&26.70&18.66\\ \hline + $EPSA$ & CG & 1.56 & 39.23 & 14.88 & 24.35 \\ \hline + $R_{E-P}$ & CG & 2.15 & 45.36 & 25.89 & 19.47 \\ \hline + $R_{E}$ & CG & 2.15 & 45.36 & 26.70 & 18.66 \\ \hline - EPSA&MG & 1.47 &34.97&21.69&13.27\\ \hline - $Rauber_{E-P}$&MG &2.15&43.65&40.45&3.20 \\ \hline - $Rauber_{E}$&MG &2.15&43.64&41.38&2.26 \\ \hline + $EPSA$ & MG & 1.47 & 34.97 & 21.69 & 13.27 \\ \hline + $R_{E-P}$ & MG & 2.15 & 43.65 & 40.45 & 3.20 \\ \hline + $R_{E}$ & MG & 2.15 & 43.64 & 41.38 & 2.26 \\ \hline - EPSA&EP &1.04 &22.14&20.73&1.41 \\ \hline - $Rauber_{E-P}$&EP&1.92&39.40&56.33&-16.93\\ \hline - $Rauber_{E}$&EP&1.92&38.10&56.35&-18.25\\ \hline + $EPSA$ & EP & 1.04 & 22.14 & 20.73 & 1.41 \\ \hline + $R_{E-P}$ & EP & 1.92 & 39.40 & 56.33 & -16.93 \\ \hline + $R_{E}$ & EP & 1.92 & 38.10 & 56.35 & -18.25 \\ \hline - EPSA&LU&1.38&35.83&22.49&13.34 \\ \hline - $Rauber_{E-P}$&LU&2.15&44.97&41.00&3.97 \\ \hline - $Rauber_{E}$&LU&2.15&44.97&41.80&3.17 \\ \hline + $EPSA$ & LU & 1.38 & 35.83 & 22.49 & 13.34 \\ \hline + $R_{E-P}$ & LU & 2.15 & 44.97 & 41.00 & 3.97 \\ \hline + $R_{E}$ & LU & 2.15 & 44.97 & 41.80 & 3.17 \\ \hline - EPSA&BT&1.31& 29.60&21.28&8.32\\ \hline - $Rauber_{E-P}$&BT&2.13&45.60&49.84&-4.24\\ \hline - $Rauber_{E}$&BT&2.13&44.90&55.16&-10.26\\ \hline + $EPSA$ & BT & 1.31 & 29.60 & 21.28 & 8.32 \\ \hline + $R_{E-P}$ & BT & 2.13 & 45.60 & 49.84 & -4.24 \\ \hline + $R_{E}$ & BT & 2.13 & 44.90 & 55.16 & -10.26 \\ \hline - EPSA&SP&1.38&33.48&21.35&12.12\\ \hline - $Rauber_{E-P}$&SP&2.10&45.69&43.60&2.09\\ \hline - $Rauber_{E}$&SP&2.10&45.75&44.10&1.65\\ \hline + $EPSA$ & SP & 1.38 & 33.48 & 21.35 & 12.12 \\ \hline + $R_{E-P}$ & SP & 2.10 & 45.69 & 43.60 & 2.09 \\ \hline + $R_{E}$ & SP & 2.10 & 45.75 & 44.10 & 1.65 \\ \hline - EPSA&FT&1.47&34.72&19.00&15.72 \\ \hline - $Rauber_{E-P}$&FT&2.04&39.40&37.10&2.30\\ \hline - $Rauber_{E}$&FT&2.04&39.35&37.70&1.65\\ \hline + $EPSA$ & FT & 1.47 & 34.72 & 19.00 & 15.72 \\ \hline + $R_{E-P}$ & FT & 2.04 & 39.40 & 37.10 & 2.30 \\ \hline + $R_{E}$ & FT & 2.04 & 39.35 & 37.70 & 1.65 \\ \hline \end{tabular} -\label{table:compare Class C} -% is used to refer this table in the text -\end{table*} -As shown in these tables our scaling factor is not optimal for energy saving -such as Rauber's scaling factor EQ~(\ref{eq:sopt}), but it is optimal for both -the energy and the performance simultaneously. Our EPSA optimal scaling factors -has better simultaneous optimization for both the energy and the performance -compared to Rauber's energy-performance method ($Rauber_{E-P}$). Also, in -($Rauber_{E-P}$) method when setting the frequency to maximum value for the -slower task lead to a small improvement of the performance. Also the results -show that this method keep or improve energy saving. 
Because of the energy
-consumption decrease when the execution time decreased while the frequency value
-increased.
-
-Figure~(\ref{fig:compare}) shows the maximum distance between the energy saving
-percent and the performance degradation percent. Therefore, this means it is the
-same resultant of our objective function EQ~(\ref{eq:max}). Our algorithm always
-gives positive energy to performance trade offs while Rauber's method
-($Rauber_{E-P}$) gives in some time negative trade offs such as in BT and
-EP. The positive trade offs with highest values lead to maximum energy savings
-concatenating with less performance degradation and this the objective of this
-paper. While the negative trade offs refers to improving energy saving (or may
-be the performance) while degrading the performance (or may be the energy) more
-than the first.
-\begin{figure}[t]
-  \centering
-  \includegraphics[width=.33\textwidth]{compare_class_A.pdf}
-  \includegraphics[width=.33\textwidth]{compare_class_B.pdf}
-  \includegraphics[width=.33\textwidth]{compare_class_c.pdf}
-  \caption{Comparing Our EPSA with Rauber's Methods}
-  \label{fig:compare}
-\end{figure}
+  \label{table:compareC}
+  % is used to refer this table in the text
+\end{table}
+As shown in tables~\ref{table:compareA},~\ref{table:compareB}
+and~\ref{table:compareC}, the ($R_{E-P}$) method outperforms the ($R_{E}$)
+method in terms of performance and energy reduction. The ($R_{E-P}$) method
+also gives better energy savings than our method. However, although our scaling
+factor is not optimal for energy reduction, the results in these tables prove
+that our algorithm returns the best scaling factor that satisfies our objective
+function: the largest distance between energy reduction and performance
+degradation. Negative values in the energy-performance column mean that the
+performance degradation percentage is higher than the energy saving percentage.
+The positive trade-offs with the highest values lead to maximum energy savings
+while keeping the performance degradation as low as possible. Our algorithm
+always gives the highest positive energy-to-performance trade-offs, while the
+Rauber and Rünger method ($R_{E-P}$) sometimes gives negative trade-offs, such
+as for BT and EP.
+%\begin{figure*}[t]
+%  \centering
+%  \includegraphics[width=.328\textwidth]{fig/compare_class_A}
+%  \includegraphics[width=.328\textwidth]{fig/compare_class_B}
+%  \includegraphics[width=.328\textwidth]{fig/compare_class_C}
+%  \caption{Comparing our method to Rauber and Rünger methods}
+%  \label{fig:compare}
+%\end{figure*}
 \section{Conclusion}
 \label{sec.concl}
-\AG{the conclusion needs to be written\dots{} one day}
+In this paper, we have presented a new online scaling factor selection method
+that optimizes simultaneously the energy and performance of a distributed
+application running on a homogeneous cluster. It uses the computation and
+communication times measured at the first iteration to predict the energy
+consumption and the performance of the parallel application at every available
+frequency. Then, it selects the scaling factor that gives the best trade-off
+between energy reduction and performance, which is the maximum distance between
+the energy and the inverted performance curves. To evaluate this method, we
+have applied it to the NAS benchmarks and compared it to the Rauber and Rünger
+methods while being executed on the simulator SimGrid. The results showed that
+our method outperforms the Rauber and Rünger methods in terms of
+energy-performance ratio.
+
+In the near future, we would like to adapt this scaling factor selection method
+to heterogeneous platforms where each node has different characteristics. In
+particular, each CPU has different available frequencies, energy consumption and
+performance. It would also be interesting to develop a new energy model for
+asynchronous parallel iterative methods where the number of iterations is not
+known in advance and depends on the global convergence of the iterative system.

 \section*{Acknowledgment}

-\AG{Right?}
-Computations have been performed on the supercomputer facilities of the
-Mésocentre de calcul de Franche-Comté.
+This work has been supported by the Labex ACTION project (contract
+``ANR-11-LABX-01-01''). Computations have been performed on the supercomputer
+facilities of the Mésocentre de calcul de Franche-Comté. As a PhD student,
+Mr. Ahmed Fanfakh would like to thank the University of Babylon (Iraq) for
+supporting his work.

 % trigger a \newpage just before the given reference
 % number - used to balance the columns on the last page
@@ -798,5 +862,5 @@ Mésocentre de calcul de Franche-Comté.
 %%% ispell-local-dictionary: "american"
 %%% End:

-% LocalWords:  Badri Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber
-% LocalWords:  CMOS EQ EPSA Franche Comté Tflop
+% LocalWords:  Fanfakh Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber
+% LocalWords:  CMOS EQ EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex