-\documentclass[12pt]{article}
-%\documentclass[12pt,twocolumn]{article}
+\documentclass[conference]{IEEEtran}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{listings}
\usepackage{colortbl}
\usepackage{amsmath}
-% \usepackage{sectsty}
-% \usepackage{titlesec}
-% \usepackage{secdot}
-%\usepackage[font={footnotesize,bt}]{caption}
-%\usepackage[font=scriptsize,labelfont=bf]{caption}
-\usepackage{lmodern}
-\usepackage{todonotes}
-\newcommand{\AG}[2][inline]{\todo[color=green!50,#1]{\sffamily\small\textbf{AG:} #2}}
+\usepackage{url}
+\DeclareUrlCommand\email{\urlstyle{same}}
+
+\usepackage[autolanguage,np]{numprint}
+\AtBeginDocument{%
+ \renewcommand*\npunitcommand[1]{\text{#1}}
+ \npthousandthpartsep{}}
+
+\usepackage{xspace}
+\usepackage[textsize=footnotesize]{todonotes}
+\newcommand{\AG}[2][inline]{%
+ \todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace}
+\newcommand{\JC}[2][inline]{%
+ \todo[color=red!10,#1]{\sffamily\textbf{JC:} #2}\xspace}
\begin{document}
-\title{Optimal Dynamic Frequency Scaling for Energy-Performance of Parallel MPI Programs}
-\author{A. Badri \and J.-C. Charr \and R. Couturier \and A. Giersch}
-\maketitle
+\title{Dynamic Frequency Scaling for Energy Consumption
+ Reduction in Distributed MPI Programs}
+
+\author{%
+ \IEEEauthorblockN{%
+ Jean-Claude Charr,
+ Raphaël Couturier,
+ Ahmed Fanfakh and
+ Arnaud Giersch
+ }
+ \IEEEauthorblockA{%
+ FEMTO-ST Institute\\
+ University of Franche-Comté\\
+ IUT de Belfort-Montbéliard,
+ 19 avenue du Maréchal Juin, BP 527, 90016 Belfort cedex, France\\
+ % Telephone: \mbox{+33 3 84 58 77 86}, % Raphaël
+ % Fax: \mbox{+33 3 84 58 77 81}\\ % Dept Info
+ Email: \email{{jean-claude.charr,raphael.couturier,ahmed.fanfakh_badri_muslim,arnaud.giersch}@univ-fcomte.fr}
+ }
+ }
-\AG{``Optimal'' is a bit pretentious in the title}
+\maketitle
\begin{abstract}
- \AG{FIXME}
+ Dynamic Voltage Frequency Scaling (DVFS) can be applied to modern CPUs. This
+ technique is usually used to reduce the energy consumed by a CPU while
+ computing. Indeed, power consumption by a processor at a given instant is
+ exponentially related to its frequency. Thus, decreasing the frequency
+ reduces the power consumed by the CPU. However, it can also significantly
+ affect the performance of the executed program if it is compute bound and if a
+ low CPU frequency is selected. The performance degradation ratio can even be
+ higher than the saved energy ratio. Therefore, the chosen scaling factor must
+ give the best possible trade-off between energy reduction and performance.
+
+ In this paper we present an algorithm that predicts the energy consumed with
+ each frequency gear and selects the one that gives the best ratio between
+ energy consumption reduction and performance. This algorithm works online
+ without training or profiling and has a very small overhead. It also takes
+ into account synchronous communications between the nodes that are executing
+ the distributed algorithm. The algorithm has been evaluated over the SimGrid
+ simulator while being applied to the NAS parallel benchmark programs. The
+ results of the experiments show that it outperforms other existing scaling
+ factor selection algorithms.
\end{abstract}
\section{Introduction}
-
-The need for computing power is still increasing and it is not expected to slow
-down in the coming years. To satisfy this demand, researchers and supercomputers
-constructors have been regularly increasing the number of computing cores in
-supercomputers (for example in November 2013, according to the top 500
-list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more than 3
-millions of cores and delivers more than 33 Tflop/s while consuming 17808
-kW). This large increase in number of computing cores has led to large energy
-consumption by these architectures. Moreover, the price of energy is expected to
-continue its ascent according to the demand. For all these reasons energy
-reduction became an important topic in the high performance computing field. To
-tackle this problem, many researchers used DVFS (Dynamic Voltage Frequency
-Scaling) operations which reduce dynamically the frequency and voltage of cores
-and thus their energy consumption. However, this operation also degrades the
-performance of computation. Therefore researchers try to reduce the frequency to
-minimum when processors are idle (waiting for data from other processors or
-communicating with other processors). Moreover, depending on their objectives
-they use heuristics to find the best scaling factor during the computation. If
-they aim for performance they choose the best scaling factor that reduces the
-consumed energy while affecting as little as possible the performance. On the
-other hand, if they aim for energy reduction, the chosen scaling factor must
-produce the most energy efficient execution without considering the degradation
-of the performance. It is important to notice that lowering the frequency to
-minimum value does not always give the most efficient execution due to energy
-leakage. The best scaling factor might be chosen during execution (online) or
-during a pre-execution phase. In this paper we emphasize to develop an
-algorithm that selects the optimal frequency scaling factor that takes into
-consideration simultaneously the energy consumption and the performance. The
-main objective of HPC systems is to run the application with less execution
-time. Therefore, our algorithm selects the optimal scaling factor online with
-very small footprint. The proposed algorithm takes into account the
-communication times of the MPI programs to choose the scaling factor. This
-algorithm has ability to predict both energy consumption and execution time over
-all available scaling factors. The prediction achieved depends on some
-computing time information, gathered at the beginning of the runtime. We apply
-this algorithm to seven MPI benchmarks. These MPI programs are the NAS parallel
-benchmarks (NPB v3.3) developed by NASA~\cite{44}. Our experiments are executed
-using the simulator SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183}
-over an homogeneous distributed memory architecture. Furthermore, we compare the
-proposed algorithm with Rauber's methods. The comparison's results show that our
-algorithm gives better energy-time trade off.
-
-\section{Related Works}
-
-In the this section some heuristics, to compute the scaling factor, are
-presented and classified in two parts : offline and online methods.
-
-\subsection{The offline DVFS orientations}
-
-The DVFS offline methods are static and are not executed during the runtime of
-the program. Some approaches used heuristics to select the best DVFS state
-during the compilation phases as an example in Azevedo et al.~\cite{40}. He used
-intra-task algorithm to choose the DVFS setting when there are dependency points
-between tasks. While in~\cite{29}, Xie et al. used breadth-first search
-algorithm to do that. Their goal is saving energy with time limits. Another
-approaches gathers and stores the runtime information for each DVFS state, then
-used their methods offline to select the suitable DVFS that optimize energy-time
-trade offs. As an example~\cite{8}, Rountree et al. used liner programming
-algorithm, while in~\cite{38,34}, Cochran et al. used multi logistic regression
-algorithm for the same goal. The offline study that shown the DVFS impact on the
-communication time of the MPI program is~\cite{17}, Freeh et al. show that these
-times not changed when the frequency is scaled down.
-
-\subsection{The online DVFS orientations}
-
-The objective of these works is to dynamically compute and set the frequency of
-the CPU during the runtime of the program for saving energy. Estimating and
-predicting approaches for the energy-time trade offs developed by
-~\cite{11,2,31}. These works select the best DVFS setting depending on the slack
-times. These times happen when the processors have to wait for data from other
-processors to compute their task. For example, during the synchronous
-communication time that take place in the MPI programs, the processors are
-idle. The optimal DVFS can be selected using the learning methods. Therefore, in
-~\cite{39,19} used machine learning to converge to the suitable DVFS
-configuration. Their learning algorithms have big time to converge when the
-number of available frequencies is high. Also, the communication time of the MPI
-program used online for saving energy as in~\cite{1}, Lim et al. developed an
-algorithm that detects the communication sections and changes the frequency
-during these sections only. This approach changes the frequency many times
-because an iteration may contain more than one communication section. The domain
-of analytical modeling used for choosing the optimal frequency as in~\cite{3},
-Rauber et al. developed an analytical mathematical model for determining the
-optimal frequency scaling factor for any number of concurrent tasks, without
-considering communication times. They set the slowest task to maximum frequency
-for maintaining performance. In this paper we compare our algorithm with
-Rauber's model~\cite{3}, because his model can be used for any number of
-concurrent tasks for homogeneous platform and this is the same direction of this
-paper. However, the primary contributions of this paper are:
+\label{sec.intro}
+
+The need and demand for more computing power have been increasing since the
+birth of the first computing unit and it is not expected to slow down in the
+coming years. To satisfy this demand, researchers and supercomputers
+constructors have been regularly increasing the number of computing cores and
+processors in supercomputers (for example in November 2013, according to the
+TOP500 list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more
+than 3 millions of cores and delivers more than \np[Tflop/s]{33} while consuming
+\np[kW]{17808}). This large increase in number of computing cores has led to
+large energy consumption by these architectures. Moreover, the price of energy
+is expected to continue its ascent according to the demand. For all these
+reasons energy reduction became an important topic in the high performance
+computing field. To tackle this problem, many researchers used DVFS (Dynamic
+Voltage Frequency Scaling) operations which reduce dynamically the frequency and
+voltage of cores and thus their energy consumption. Indeed, modern CPUs offer a
+set of acceptable frequencies which are usually called gears, and the user or
+the operating system can modify the frequency of the processor according to its
+needs. However, DVFS also degrades the performance of computation. Therefore
+researchers try to reduce the frequency to minimum when processors are idle
+(waiting for data from other processors or communicating with other processors).
+Moreover, depending on their objectives they use heuristics to find the best
+scaling factor during the computation. If they aim for performance they choose
+the best scaling factor that reduces the consumed energy while affecting as
+little as possible the performance. On the other hand, if they aim for energy
+reduction, the chosen scaling factor must produce the most energy efficient
+execution without considering the degradation of the performance. It is
+important to notice that lowering the frequency to minimum value does not always
+give the most energy efficient execution due to energy leakage. The best
+scaling factor might be chosen during execution (online) or during a
+pre-execution phase. In this paper, we present an algorithm that selects a
+frequency scaling factor that simultaneously takes into consideration the energy
+consumption by the CPU and the performance of the application. The main
+objective of HPC systems is to execute as fast as possible the application.
+Therefore, our algorithm selects the scaling factor online with very small
+footprint. The proposed algorithm takes into account the communication times of
+the MPI program to choose the scaling factor. This algorithm has ability to
+predict both energy consumption and execution time over all available scaling
+factors. The prediction achieved depends on some computing time information,
+gathered at the beginning of the runtime. We apply this algorithm to seven MPI
+benchmarks. These MPI programs are the NAS parallel benchmarks (NPB v3.3)
+developed by NASA~\cite{44}. Our experiments are executed using the simulator
+SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183} over an homogeneous
+distributed memory architecture. Furthermore, we compare the proposed algorithm
+with Rauber and Rünger methods~\cite{3}. The comparison's results show that our
+algorithm gives better energy-time trade-off.
+
+This paper is organized as follows: Section~\ref{sec.relwork} presents some
+related works from other authors. Section~\ref{sec.exe} explains the execution
+of parallel tasks and the sources of slack times. It also presents an energy
+model for homogeneous platforms. Section~\ref{sec.mpip} describes how the
+performance of MPI programs can be predicted. Section~\ref{sec.compet} presents
+the energy-performance objective function that maximizes the reduction of energy
+consumption while minimizing the degradation of the program's performance.
+Section~\ref{sec.optim} details the proposed energy-performance algorithm.
+Section~\ref{sec.expe} verifies the accuracy of the performance prediction model
+and presents the results of the proposed algorithm. It also shows the
+comparison results between our method and other existing methods. Finally, we
+conclude in Section~\ref{sec.concl} with a summary and some future works.
+
+\section{Related works}
+\label{sec.relwork}
+
+
+In this section, some heuristics to compute the scaling factor are presented and
+classified into two categories: offline and online methods.
+
+\subsection{Offline scaling factor selection methods}
+
+The offline scaling factor selection methods are executed before the runtime of
+the program. They return static scaling factor values to the processors
+participating in the execution of the parallel program. On one hand, the
+scaling factor values could be computed based on information retrieved by
+analyzing the code of the program and the computing system that will execute it.
+In~\cite{40}, Azevedo et al. detect during compilation the dependency points
+between tasks in a multi-task program. This information is then used to lower
+the frequency of some processors in order to eliminate slack times. A slack
+time is the period of time during which a processor that have already finished
+its computation, have to wait for a set of processors to finish their
+computations and send their results to the waiting processor in order to
+continue its task that is dependent on the results of computations being
+executed on other processors. Freeh et al. showed in~\cite{17} that the
+communication times of MPI programs do not change when the frequency is scaled
+down. On the other hand, some offline scaling factor selection methods use the
+information gathered from previous full or partial executions of the program. A
+part or the whole program is usually executed over all the available frequency
+gears and the the execution time and the energy consumed with each frequency
+gear are measured. Then an heuristic or an exact method uses the retrieved
+information to compute the values of the scaling factor for the processors.
+In~\cite{29}, Xie et al. use an exact exponential breadth-first search algorithm
+to compute the scaling factor values that give the optimal energy reduction
+while respecting a deadline for a sequential program. They also present a
+linear heuristic that approximates the optimal solution. In~\cite{8} , Rountree
+et al. use a linear programming algorithm, while in~\cite{38,34}, Cochran et
+al. use multi logistic regression algorithm for the same goal. The main
+drawback for these methods is that they all require executing a part or the
+whole program on all frequency gears for each new instance of the same program.
+
+\subsection{Online scaling factor selection methods}
+
+The online scaling factor selection methods are executed during the runtime of
+the program. They are usually integrated into iterative programs where the same
+block of instructions is executed many times. During the first few iterations,
+many informations are measured such as the execution time, the energy consumed
+using a multimeter, the slack times, \dots{} Then a method will exploit these
+measurements to compute the scaling factor values for each processor. This
+operation, measurements and computing new scaling factor, can be repeated as
+much as needed if the iterations are not regular. Kimura, Peraza, Yu-Liang et
+al.~\cite{11,2,31} used learning methods to select the appropriate scaling
+factor values to eliminate the slack times during runtime. However, as seen
+in~\cite{39,19}, machine learning methods can take a lot of time to converge
+when the number of available gears is big. To reduce the impact of slack times,
+in~\cite{1}, Lim et al. developed an algorithm that detects the communication
+sections and changes the frequency during these sections only. This approach
+might change the frequency of each processor many times per iteration if an
+iteration contains more than one communication section. In~\cite{3}, Rauber and
+Rünger used an analytical model that can predict the consumed energy and the
+execution time for every frequency gear after measuring the consumed energy and
+the execution time with the highest frequency gear. These predictions may be
+used to choose the optimal gear for each processor executing the parallel
+program to reduce energy consumption. To maintain the performance of the
+parallel program , they set the processor with the biggest load to the highest
+gear and then compute the scaling factor values for the rest of the processors.
+Although this model was built for parallel architectures, it can be adapted to
+distributed architectures by taking into account the communications. The
+primary contribution of our paper is presenting a new online scaling factor
+selection method which has the following characteristics:
\begin{enumerate}
-\item Selecting the optimal frequency scaling factor for energy and performance
- simultaneously. While taking into account the communication time.
-\item Adapting our scale factor to taking into account the imbalanced tasks.
-\item The execution time of our algorithm is very small when compared to other
- methods (e.g.,~\cite{19}).
-\item The proposed algorithm works online without profiling or training as
+\item It is based on Rauber and Rünger analytical model to predict the energy
+ consumption of the application with different frequency gears.
+\item It selects the frequency scaling factor for simultaneously optimizing
+ energy reduction and maintaining performance.
+\item It is well adapted to distributed architectures because it takes into
+ account the communication time.
+\item It is well adapted to distributed applications with imbalanced tasks.
+\item it has very small footprint when compared to other methods
+ (e.g.,~\cite{19}) and does not require profiling or training as
in~\cite{38,34}.
\end{enumerate}
-\section{Parallel Tasks Execution on Homogeneous Platform}
-A homogeneous cluster consists of identical nodes in terms of the hardware and
-the software. Each node has its own memory and at least one processor which can
-be a multi-core. The nodes are connected via a high bandwidth network. Tasks
-executed on this model can be either synchronous or asynchronous. In this paper
+\section{Execution and energy of parallel tasks on homogeneous platform}
+\label{sec.exe}
+
+% \JC{The whole subsection ``Parallel Tasks Execution on Homogeneous Platform'',
+% can be deleted if we need space, we can just say we are interested in this
+% paper in homogeneous clusters}
+
+\subsection{Parallel tasks execution on homogeneous platform}
+
+A homogeneous cluster consists of identical nodes in terms of hardware and
+software. Each node has its own memory and at least one processor which can be
+a multi-core. The nodes are connected via a high bandwidth network. Tasks
+executed on this model can be either synchronous or asynchronous. In this paper
we consider execution of the synchronous tasks on distributed homogeneous
-platform. These tasks can exchange the data via synchronous memory passing.
-\begin{figure}[h]
+platform. These tasks can exchange the data via synchronous message passing.
+\begin{figure*}[t]
\centering
- \subfloat[Synch. Imbalanced Communications]{\includegraphics[scale=0.67]{synch_tasks}\label{fig:h1}}
- \subfloat[Synch. Imbalanced Computations]{\includegraphics[scale=0.67]{compt}\label{fig:h2}}
- \caption{Parallel Tasks on Homogeneous Platform}
+ \subfloat[Sync. imbalanced communications]{%
+ \includegraphics[scale=0.67]{fig/commtasks}\label{fig:h1}}
+ \subfloat[Sync. imbalanced computations]{%
+ \includegraphics[scale=0.67]{fig/compt}\label{fig:h2}}
+ \caption{Parallel tasks on homogeneous platform}
\label{fig:homo}
-\end{figure}
+\end{figure*}
Therefore, the execution time of a task consists of the computation time and the
-communication time. Moreover, the synchronous communications between tasks can
-lead to idle time while tasks wait at the synchronous point for others tasks to
-finish their communications see figure~(\ref{fig:h1}). Another source for idle
-times is the imbalanced computations. This happen when processing different
-amounts of data on each processor as an example see figure~(\ref{fig:h2}). In
-this case the fastest tasks have to wait at the synchronous barrier for the
-slowest tasks to finish their job. In both two cases the overall execution time
-of the program is the execution time of the slowest task as :
+communication time. Moreover, the synchronous communications between tasks can
+lead to slack times while tasks wait at the synchronization barrier for other
+tasks to finish their tasks (see figure~(\ref{fig:h1})). The imbalanced
+communications happen when nodes have to send/receive different amount of data
+or they communicate with different number of nodes. Another source of slack
+times is the imbalanced computations. This happens when processing different
+amounts of data on each processor (see figure~(\ref{fig:h2})). In this case the
+fastest tasks have to wait at the synchronization barrier for the slowest ones
+to begin the next task. In both cases the overall execution time of the program
+is the execution time of the slowest task as in EQ~(\ref{eq:T1}).
\begin{equation}
\label{eq:T1}
\textit{Program Time} = \max_{i=1,2,\dots,N} T_i
\end{equation}
-where $T_i$ is the execution time of process $i$.
+where $T_i$ is the execution time of task $i$ and all the tasks are executed
+concurrently on different processors.
-\section{Energy Model for Homogeneous Platform}
+\subsection{Energy model for homogeneous platform}
-The energy consumption by the processor consists of two powers metric: the
-dynamic and the static power. This general power formulation is used by many
-researchers see~\cite{9,3,15,26}. The dynamic power of the CMOS processors
-$P_{dyn}$ is related to the switching activity $\alpha$, load capacitance $C_L$,
-the supply voltage $V$ and operational frequency $f$ respectively as follow :
+Many researchers~\cite{9,3,15,26} divide the power consumed by a processor to
+two power metrics: the static and the dynamic power. While the first one is
+consumed as long as the computing unit is on, the latter is only consumed during
+computation times. The dynamic power $P_{dyn}$ is related to the switching
+activity $\alpha$, load capacitance $C_L$, the supply voltage $V$ and
+operational frequency $f$, as shown in EQ~(\ref{eq:pd}).
\begin{equation}
\label{eq:pd}
- P_{dyn} = \alpha \cdot C_L \cdot V^2 \cdot f
+ P_\textit{dyn} = \alpha \cdot C_L \cdot V^2 \cdot f
\end{equation}
-The static power $P_{static}$ captures the leakage power consumption as well as
-the power consumption of peripheral devices like the I/O subsystem.
+The static power $P_{static}$ captures the leakage power as follows:
\begin{equation}
\label{eq:ps}
- P_{static} = V \cdot N \cdot K_{design} \cdot I_{leak}
+ P_\textit{static} = V \cdot N_{trans} \cdot K_{design} \cdot I_{leak}
\end{equation}
-where V is the supply voltage, N is the number of transistors, $K_{design}$ is a
-design dependent parameter and $I_{leak}$ is a technology-dependent
-parameter. Energy consumed by an individual processor $E_{ind}$ is the summation
-of the dynamic and the static power multiply by the execution time for example
-see~\cite{36,15} .
+where V is the supply voltage, $N_{trans}$ is the number of transistors,
+$K_{design}$ is a design dependent parameter and $I_{leak}$ is a
+technology-dependent parameter. The energy consumed by an individual processor
+to execute a given program can be computed as:
\begin{equation}
\label{eq:eind}
- E_{ind} = ( P_{dyn} + P_{static} ) \cdot T
+ E_\textit{ind} = P_\textit{dyn} \cdot T_{Comp} + P_\textit{static} \cdot T
\end{equation}
-The dynamic voltage and frequency scaling (DVFS) is a process that allowed in
-modern processors to reduce the dynamic power by scaling down the voltage and
-frequency. Its main objective is to reduce the overall energy
-consumption~\cite{37}. The operational frequency \emph f depends linearly on the
-supply voltage $V$, i.e., $V = \beta . f$ with some constant $\beta$. This
-equation is used to study the change of the dynamic voltage with respect to
-various frequency values in~\cite{3}. The reduction process of the frequency are
-expressed by scaling factor \emph S. The scale \emph S is the ratio between the
-maximum and the new frequency as in EQ~(\ref{eq:s}).
+where $T$ is the execution time of the program, $T_{Comp}$ is the computation
+time and $T_{Comp} \leq T$. $T_{Comp}$ may be equal to $T$ if there is no
+communications, no slack times and no synchronizations.
+
+DVFS is a process that is allowed in modern processors to reduce the dynamic
+power by scaling down the voltage and frequency. Its main objective is to
+reduce the overall energy consumption~\cite{37}. The operational frequency $f$
+depends linearly on the supply voltage $V$, i.e., $V = \beta \cdot f$ with some
+constant $\beta$. This equation is used to study the change of the dynamic
+voltage with respect to various frequency values in~\cite{3}. The reduction
+process of the frequency can be expressed by the scaling factor $S$ which is the
+ratio between the maximum and the new frequency as in EQ~(\ref{eq:s}).
\begin{equation}
\label{eq:s}
- S = \frac{F_{max}}{F_{new}}
+ S = \frac{F_\textit{max}}{F_\textit{new}}
\end{equation}
-The value of the scale \emph S is grater than 1 when changing the frequency to
-any new frequency value(\emph {P-state}) in governor. It is equal to 1 when the
-frequency are set to the maximum frequency. The energy consumption model for
-parallel homogeneous platform is depending on the scaling factor \emph S. This
-factor reduces quadratically the dynamic power. Also, this factor increases the
-static energy linearly because the execution time is increased~\cite{36}. The
-energy model, depending on the frequency scaling factor, of homogeneous platform
-for any number of concurrent tasks develops by Rauber~\cite{3}. This model
-consider the two powers metric for measuring the energy of the parallel tasks as
-in EQ~(\ref{eq:energy}).
+The value of the scaling factor $S$ is greater than 1 when changing the
+frequency of the CPU to any new frequency value~(\emph{P-state}) in the
+governor. The CPU governor is an interface driver supplied by the operating
+system's kernel to lower a core's frequency. This factor reduces quadratically
+the dynamic power which may cause degradation in performance and thus, the
+increase of the static energy because the execution time is increased~\cite{36}.
+If the tasks are sorted according to their execution times before scaling in a
+descending order, the total energy consumption model for a parallel homogeneous
+platform, as presented by Rauber and Rünger~\cite{3}, can be written as a
+function of the scaling factor $S$, as in EQ~(\ref{eq:energy}).
\begin{equation}
\label{eq:energy}
- E = P_{dyn} \cdot S_1^{-2} \cdot
+ E = P_\textit{dyn} \cdot S_1^{-2} \cdot
\left( T_1 + \sum_{i=2}^{N} \frac{T_i^3}{T_1^2} \right) +
- P_{static} \cdot T_1 \cdot S_1 \cdot N
+ P_\textit{static} \cdot T_1 \cdot S_1 \cdot N
\hfill
\end{equation}
-Where \emph N is the number of parallel nodes, $T_1 $ is the time of the slower
-task, $T_i$ is the time of the task $i$ and $S_1$ is the maximum scaling factor
-for the slower task. The scaling factor $S_1$, as in EQ~(\ref{eq:s1}), selects
-from the set of scales values $S_i$. Each of these scales are proportional to
-the time value $T_i$ depends on the new frequency value as in EQ~(\ref{eq:si}).
-\begin{equation}
- \label{eq:s1}
- S_1 = \max_{i=1,2,\dots,F} S_i
-\end{equation}
+where $N$ is the number of parallel nodes, $T_i$ and $S_i$ for $i=1,\dots,N$ are
+the execution times and scaling factors of the sorted tasks. Therefore, $T1$ is
+the time of the slowest task, and $S_1$ its scaling factor which should be the
+highest because they are proportional to the time values $T_i$. The scaling
+factors are computed as in EQ~(\ref{eq:si}).
\begin{equation}
\label{eq:si}
S_i = S \cdot \frac{T_1}{T_i}
- = \frac{F_{max}}{F_{new}} \cdot \frac{T_1}{T_i}
+ = \frac{F_\textit{max}}{F_\textit{new}} \cdot \frac{T_1}{T_i}
\end{equation}
-Where $F$ is the number of available frequencies. In this paper we depend on
-Rauber's energy model EQ~(\ref{eq:energy}) for two reasons : 1-this model used
-for homogeneous platform that we work on in this paper. 2-we are compare our
-algorithm with Rauber's scaling model. Rauber's optimal scaling factor for
-optimal energy reduction derived from the EQ~(\ref{eq:energy}). He takes the
-derivation for this equation (to be minimized) and set it to zero to produce the
-scaling factor as in EQ~(\ref{eq:sopt}).
+In this paper we depend on Rauber and Rünger energy model EQ~(\ref{eq:energy})
+for two reasons: (1) this model is used for any number of concurrent tasks, and
+(2) we compare our algorithm with Rauber and Rünger scaling factor selection
+method which is based on EQ~(\ref{eq:energy}). The optimal scaling factor is
+computed by minimizing the derivation for this equation which produces
+EQ~(\ref{eq:sopt}).
+
\begin{equation}
\label{eq:sopt}
- S_{opt} = \sqrt[3]{\frac{2}{n} \cdot \frac{P_{dyn}}{P_{static}} \cdot
+ S_\textit{opt} = \sqrt[3]{\frac{2}{N} \cdot \frac{P_\textit{dyn}}{P_\textit{static}} \cdot
\left( 1 + \sum_{i=2}^{N} \frac{T_i^3}{T_1^3} \right) }
\end{equation}
-\section{Performance Evaluation of MPI Programs}
-
-The performance (execution time) of the parallel MPI applications are depends on
-the time of the slowest task as in figure~(\ref{fig:homo}). Normally the
-execution time of the parallel programs are proportional to the operational
-frequency. Therefore, any DVFS operation for the energy reduction increase the
-execution time of the parallel program. As shown in EQ~(\ref{eq:energy}) the
-energy affected by the scaling factor $S$. This factor also has a great impact
-on the performance. When scaling down the frequency to the new value according
-to EQ~(\ref{eq:s}) lead to the value of the scale $S$ has inverse relation with
-new frequency value ($S \propto \frac{1}{F_{new}}$). Also when decrease the
-frequency value, the execution time increase. Then the new frequency value has
-inverse relation with time ($F_{new} \propto \frac{1}{T}$). This lead to the
-frequency scaling factor $S$ proportional linearly with execution time ($S
-\propto T$). Large scale MPI applications such as NAS benchmarks have
-considerable amount of communications embedded in these programs. During the
-communication process the processor remain idle until the communication has
-finished. For that reason any change in the frequency has no impact on the time
-of communication but it has obvious impact on the time of
-computation~\cite{17}. We are made many tests on real cluster to prove that the
-frequency scaling factor \emph S has a linear relation with computation time
-only also see~\cite{41}. To predict the execution time of MPI program, firstly
-must be precisely specifying communication time and the computation time for the
-slower task. Secondly, we use these times for predicting the execution time for
-any MPI program as a function of the new scaling factor as in the
-EQ~(\ref{eq:tnew}).
+
+\section{Performance evaluation of MPI programs}
+\label{sec.mpip}
+
+The performance (execution time) of parallel synchronous MPI applications depend
+on the time of the slowest task as in figure~(\ref{fig:homo}). If there is no
+communication and the application is not data bounded, the execution time of a
+parallel program is linearly proportional to the operational frequency and any
+DVFS operation for energy reduction increases the execution time of the parallel
+program. Therefore, the scaling factor $S$ is linearly proportional to the
+execution time. However, in most of MPI applications the processes exchange
+data. During these communications the processors involved remain idle until the
+communications are finished. For that reason any change in the frequency has no
+impact on the time of communication~\cite{17}. The communication time for a
+task is the summation of periods of time that begin with an MPI call for sending
+or receiving a message till the message is synchronously sent or received. To
+be able to predict the execution time of MPI program, the communication time and
+the computation time for the slower task must be measured before scaling. These
+times are used to predict the execution time for any MPI program as a function
+of the new scaling factor as in EQ~(\ref{eq:tnew}).
\begin{equation}
\label{eq:tnew}
- T_{new} = T_{\textit{Max Comp Old}} \cdot S + T_{\textit{Max Comm Old}}
+ \textit T_\textit{new} = T_\textit{Max Comp Old} \cdot S + T_{\textit{Max Comm Old}}
\end{equation}
-The above equation shows that the scaling factor \emph S has linear relation
-with the computation time without affecting the communication time. The
-communication time consists of the beginning times which an MPI calls for
-sending or receiving till the message is synchronously sent or received. In this
-paper we predict the execution time of the program for any new scaling factor
-value. Depending on this prediction we can produce our energy-performance scaling
-method as we will show in the coming sections. In the next section we make an
-investigation study for the EQ~(\ref{eq:tnew}).
-
-\section{Performance Prediction Verification}
-
-In this section we evaluate the precision of our performance prediction methods
-on the NAS benchmark. We use the EQ~(\ref{eq:tnew}) that predicts the execution
-time for any scale value. The NAS programs run the class B for comparing the
-real execution time with the predicted execution time. Each program runs offline
-with all available scaling factors on 8 or 9 nodes to produce real execution
-time values. These scaling factors are computed by dividing the maximum
-frequency by the new one see EQ~(\ref{eq:s}). In all tests, we use the simulator
-SimGrid/SMPI v3.10 to run the NAS programs.
-\AG{Fig.~\ref{fig:pred} is hard to read when printed in black and white,
- especially the ``Normalize Real Perf.'' curve.}
-\begin{figure}[width=\textwidth,height=\textheight,keepaspectratio]
- \centering
- \includegraphics[scale=0.60]{cg_per.eps}
- \includegraphics[scale=0.60]{mg_pre.eps}
- \includegraphics[scale=0.60]{bt_pre.eps}
- \includegraphics[scale=0.60]{lu_pre.eps}
- \caption{Fitting Predicted to Real Execution Time}
- \label{fig:pred}
-\end{figure}
-%see Figure~\ref{fig:pred}
-In our cluster there are 18 available frequency states for each processor from
-2.5 GHz to 800 MHz, there is 100 MHz difference between two successive
-frequencies. For more details on the characteristics of the platform refer to
-table~(\ref{table:platform}). This lead to 18 run states for each program. We
-use seven MPI programs of the NAS parallel benchmarks : CG, MG, EP, FT, BT, LU
-and SP. The average normalized errors between the predicted execution time and
-the real time (SimGrid time) for all programs is between 0.0032 to 0.0133. AS an
-example, we are present the execution times of the NAS benchmarks as in the
-figure~(\ref{fig:pred}).
-
-\section{Performance to Energy Competition}
-This section demonstrates our approach for choosing the optimal scaling
-factor. This factor gives maximum energy reduction taking into account the
-execution time for both computation and communication times . The relation
-between the energy and the performance are nonlinear and complex, because the
-relation of the energy with scaling factor is nonlinear and with the performance
-it is linear see~\cite{17}. The relation between the energy and the performance
-is not straightforward. Moreover, they are not measured using the same metric.
-For solving this problem, we normalize the energy by calculating the ratio
-between the consumed energy with scaled frequency and the consumed energy
-without scaled frequency :
-\begin{equation}
+In this paper, this prediction method is used to select the best scaling factor
+for each processor as presented in the next section.
+
+\section{Performance to energy competition}
+\label{sec.compet}
+
+This section demonstrates our approach for choosing the optimal scaling factor.
+This factor gives maximum energy reduction taking into account the execution
+times for both computation and communication. The relation between the energy
+and the performance is nonlinear and complex, because the relation of the energy
+with scaling factor is nonlinear and with the performance it is linear
+see~\cite{17}. Moreover, they are not measured using the same metric. For
+solving this problem, we normalize the energy by calculating the ratio between
+the consumed energy with scaled frequency and the consumed energy without scaled
+frequency:
+\begin{multline}
\label{eq:enorm}
- E_{Norm} = \frac{E_{Reduced}}{E_{Original}}
- = \frac{ P_{dyn} \cdot S_i^{-2} \cdot
+ E_\textit{Norm} = \frac{ E_\textit{Reduced}}{E_\textit{Original}} \\
+ {} = \frac{P_\textit{dyn} \cdot S_1^{-2} \cdot
\left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
- P_{static} \cdot T_1 \cdot S_i \cdot N }{
- P_{dyn} \cdot \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
- P_{static} \cdot T_1 \cdot N }
-\end{equation}
-\AG{Use \texttt{\textbackslash{}text\{xxx\}} or
- \texttt{\textbackslash{}textit\{xxx\}} for all subscripted words in equations
- (e.g. \mbox{\texttt{E\_\{\textbackslash{}text\{Norm\}\}}}).
-
- Don't hesitate to define new commands :
- \mbox{\texttt{\textbackslash{}newcommand\{\textbackslash{}ENorm\}\{E\_\{\textbackslash{}text\{Norm\}\}\}}}
-}
-By the same way we can normalize the performance as follows :
+ P_\textit{static} \cdot T_1 \cdot S_1 \cdot N }{
+ P_\textit{dyn} \cdot \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
+ P_\textit{static} \cdot T_1 \cdot N }
+\end{multline}
+By the same way we can normalize the performance as follows:
\begin{equation}
\label{eq:pnorm}
- P_{Norm} = \frac{T_{New}}{T_{Old}}
- = \frac{T_{\textit{Max Comp Old}} \cdot S +
- T_{\textit{Max Comm Old}}}{T_{Old}}
+ P_\textit{Norm} = \frac{T_\textit{New}}{T_\textit{Old}}
+ = \frac{T_\textit{Max Comp Old} \cdot S +
+ T_\textit{Max Comm Old}}{T_\textit{Max Comp Old} +
+ T_\textit{Max Comm Old}}
\end{equation}
-The second problem is the optimization operation for both energy and performance
-is not in the same direction. In other words, the normalized energy and the
-performance curves are not in the same direction see figure~(\ref{fig:r2}).
-While the main goal is to optimize the energy and performance in the same
-time. According to the equations~(\ref{eq:enorm}) and~(\ref{eq:pnorm}) the
-scaling factor \emph S reduce both the energy and the performance
-simultaneously. But the main objective is to produce maximum energy reduction
-with minimum performance reduction. Many researchers used different strategies
-to solve this nonlinear problem for example see~\cite{19,42}, their methods add
-big overhead to the algorithm for selecting the suitable frequency. In this
-paper we are present a method to find the optimal scaling factor \emph S for
-optimize both energy and performance simultaneously without adding big
+The second problem is that the optimization operation for both energy and
+performance is not in the same direction. In other words, the normalized energy
+and the performance curves are not in the same direction see
+figure~(\ref{fig:r2}). While the main goal is to optimize the energy and
+performance in the same time. According to the equations~(\ref{eq:enorm})
+and~(\ref{eq:pnorm}), the scaling factor $S$ reduce both the energy and the
+performance simultaneously. But the main objective is to produce maximum energy
+reduction with minimum performance reduction. Many researchers used different
+strategies to solve this nonlinear problem for example see~\cite{19,42}, their
+methods add big overhead to the algorithm for selecting the suitable frequency.
+In this paper we present a method to find the optimal scaling factor $S$ for
+optimizing both energy and performance simultaneously without adding big
overheads. Our solution for this problem is to make the optimization process
-have the same direction. Therefore, we inverse the equation of normalize
-performance as follows :
+have the same direction. Therefore, we inverse the equation of normalize
+performance as follows:
\begin{equation}
\label{eq:pnorm_en}
- P^{-1}_{Norm} = \frac{T_{Old}}{T_{New}}
- = \frac{T_{Old}}{T_{\textit{Max Comp Old}} \cdot S +
- T_{\textit{Max Comm Old}}}
+ P^{-1}_\textit{Norm} = \frac{ T_\textit{Old}}{ T_\textit{New}}
+ = \frac{T_\textit{Max Comp Old} +
+ T_\textit{Max Comm Old}}{T_\textit{Max Comp Old} \cdot S +
+ T_\textit{Max Comm Old}}
\end{equation}
-\begin{figure}
+\begin{figure*}
\centering
- \subfloat[Converted Relation.]{\includegraphics[scale=0.70]{file.eps}\label{fig:r1}}
- \subfloat[Real Relation.]{\includegraphics[scale=0.70]{file3.eps}\label{fig:r2}}
+ \subfloat[Converted relation.]{%
+ \includegraphics[width=.4\textwidth]{fig/file}\label{fig:r1}}%
+ \qquad%
+ \subfloat[Real relation.]{%
+ \includegraphics[width=.4\textwidth]{fig/file3}\label{fig:r2}}
\label{fig:rel}
- \caption{The Energy and Performance Relation}
-\end{figure}
+ \caption{The energy and performance relation}
+\end{figure*}
Then, we can modelize our objective function as finding the maximum distance
between the energy curve EQ~(\ref{eq:enorm}) and the inverse of performance
-curve EQ~(\ref{eq:pnorm_en}) over all available scaling factors. This represent
-the minimum energy consumption with minimum execution time (better performance)
-in the same time, see figure~(\ref{fig:r1}). Then our objective function has the
-following form:
+curve EQ~(\ref{eq:pnorm_en}) over all available scaling factors. This
+represents the minimum energy consumption with minimum execution time (better
+performance) at the same time, see figure~(\ref{fig:r1}). Then our objective
+function has the following form:
\begin{equation}
\label{eq:max}
- \textit{MaxDist} = \max (\overbrace{P^{-1}_{Norm}}^{\text{Maximize}} -
- \overbrace{E_{Norm}}^{\text{Minimize}} )
+ Max Dist = \max_{j=1,2,\dots,F}
+ (\overbrace{P^{-1}_\textit{Norm}(S_j)}^{\text{Maximize}} -
+ \overbrace{E_\textit{Norm}(S_j)}^{\text{Minimize}} )
\end{equation}
-Then we can select the optimal scaling factor that satisfy the
-EQ~(\ref{eq:max}). Our objective function can works with any energy model or
-static power values stored in a data file. Moreover, this function works in
-optimal way when the energy function has a convex form with frequency scaling
-factor as shown in~\cite{15,3,19}. Energy measurement model is not the
-objective of this paper and we choose Rauber's model as an example with two
-reasons that mentioned before.
-
-\section{Optimal Scaling Factor for Performance and Energy}
-
-In the previous section we described the objective function that satisfy our
-goal in discovering optimal scaling factor for both performance and energy at
-the same time. Therefore, we develop an energy to performance scaling algorithm
-(EPSA). This algorithm is simple and has a direct way to calculate the optimal
-scaling factor for both energy and performance at the same time.
-\begin{algorithm}[t]
- \caption{EPSA}
+where $F$ is the number of available frequencies. Then we can select the optimal
+scaling factor that satisfies EQ~(\ref{eq:max}). Our objective function can
+work with any energy model or static power values stored in a data file.
+Moreover, this function works in optimal way when the energy curve has a convex
+form over the available frequency scaling factors as shown in~\cite{15,3,19}.
+
+\section{Optimal scaling factor for performance and energy}
+\label{sec.optim}
+
+Algorithm~\ref{EPSA} computes the optimal scaling factor according to the
+objective function described above.
+\begin{algorithm}[tp]
+ \caption{Scaling factor selection algorithm}
\label{EPSA}
\begin{algorithmic}[1]
\State Initialize the variable $Dist=0$
\State Set dynamic and static power values.
\State Set $P_{states}$ to the number of available frequencies.
\State Set the variable $F_{new}$ to max. frequency, $F_{new} = F_{max} $
- \State Set the variable $F_{diff}$ to the scale value between each two frequencies.
- \For {$i=1$ to $P_{states} $}
- \State - Calculate the new frequency as $F_{new}=F_{new} - F_{diff} $
- \State - Calculate the scale factor $S$ as in EQ~(\ref{eq:s}).
- \State - Calculate all available scales $S_i$ depend on $S$ as in EQ~(\ref{eq:si}).
- \State - Select the maximum scale factor $S_1$ from the set of scales $S_i$.
- \State - Calculate the normalize energy $E_{Norm}=E_{R}/E_{O}$ as in EQ~(\ref{eq:enorm}).
- \State - Calculate the normalize inverse of performance\par
- $P_{NormInv}=T_{old}/T_{new}$ as in EQ~(\ref{eq:pnorm_en}).
- \If{ $(P_{NormInv}-E_{Norm} > Dist$) }
- \State $S_{optimal} = S$
+ \State Set the variable $F_{diff}$ to the difference between two successive
+ frequencies.
+ \For {$j:=1$ to $P_{states} $}
+ \State $F_{new}=F_{new} - F_{diff} $
+ \State $S = \frac{F_\textit{max}}{F_\textit{new}}$
+ \State $S_i = S \cdot \frac{T_1}{T_i}
+ = \frac{F_\textit{max}}{F_\textit{new}} \cdot \frac{T_1}{T_i}$
+ for $i=1,\dots,N$
+ \State $E_\textit{Norm} =
+ \frac{P_\textit{dyn} \cdot S_1^{-2} \cdot
+ \left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
+ P_\textit{static} \cdot T_1 \cdot S_1 \cdot N }{
+ P_\textit{dyn} \cdot
+ \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
+ P_\textit{static} \cdot T_1 \cdot N }$
+ \State $P_{NormInv}=T_{old}/T_{new}$
+ \If{$(P_{NormInv}-E_{Norm} > Dist)$}
+ \State $S_{opt} = S$
\State $Dist = P_{NormInv} - E_{Norm}$
\EndIf
\EndFor
- \State Return $S_{optimal}$
+ \State Return $S_{opt}$
\end{algorithmic}
\end{algorithm}
-The proposed EPSA algorithm works online during the execution time of the MPI
-program. It selects the optimal scaling factor by gathering some information
-from the program after one iteration. This algorithm has small execution time
-(between 0.00152 $ms$ for 4 nodes to 0.00665 $ms$ for 32 nodes). The data
-required by this algorithm is the computation time and the communication time
-for each task from the first iteration only. When these times are measured, the
-MPI program calls the EPSA algorithm to choose the new frequency using the
-optimal scaling factor. Then the program set the new frequency to the
-system. The algorithm is called just one time during the execution of the
-program. The following example shows where and when the EPSA algorithm is called
-in the MPI program :
-\begin{minipage}{\textwidth}
-\AG{Use the same format as for Algorithm~\ref{EPSA}}
-\begin{lstlisting}[frame=tb]
-FOR J:=1 to Some_iterations Do
- -Computations Section.
- -Communications Section.
- IF (J==1) THEN
- -Gather all times of computation and communication
- from each node.
- -Call EPSA with these times.
- -Calculate the new frequency from optimal scale.
- -Set the new frequency to the system.
- ENDIF
-ENDFOR
-\end{lstlisting}
-\end{minipage}
-After obtaining the optimal scale factor from the EPSA algorithm. The program
-calculates the new frequency $F_i$ for each task proportionally to its time
-value $T_i$. By substitution of the EQ~(\ref{eq:s}) in the EQ~(\ref{eq:si}), we
-can calculate the new frequency $F_i$ as follows :
-\begin{equation}
- \label{eq:fi}
- F_i = \frac{F_{max} \cdot T_i}{S_{optimal} \cdot T_{max}}
-\end{equation}
-According to this equation all the nodes may have the same frequency value if
-they have balanced workloads. Otherwise, they take different frequencies when
-have imbalanced workloads. Then EQ~(\ref{eq:fi}) works in adaptive way to change
-the frequency according to the nodes workloads.
-\section{Experimental Results}
-
-The proposed EPSA algorithm was applied to seven MPI programs of the NAS
-benchmarks (EP, CG, MG, FT, BT, LU and SP). We work on three classes (A, B and
-C) for each program. Each program runs on specific number of processors
-proportional to the size of the class. Each class represents the problem size
-ascending from the class A to C. Additionally, depending on some speed up points
-for each class we run the classes A, B and C on 4, 8 or 9 and 16 nodes
-respectively. Our experiments are executed on the simulator SimGrid/SMPI
-v3.10. We design a platform file that simulates a cluster with one core per
-node. This cluster is a homogeneous architecture with distributed memory. The
-detailed characteristics of our platform file are shown in the
-table~(\ref{table:platform}). Each node in the cluster has 18 frequency values
-from 2.5 GHz to 800 MHz with 100 MHz difference between each two successive
-frequencies.
-\begin{table}[ht]
- \caption{Platform File Parameters}
+The proposed algorithm works online during the execution time of the MPI
+program. It selects the optimal scaling factor after gathering the computation
+and communication times from the program after one iteration. Then the program
+changes the new frequencies of the CPUs according to the computed scaling
+factors. This algorithm has a small execution time: for a homogeneous cluster
+composed of nodes having the characteristics presented in
+table~\ref{table:platform}, it takes \np[ms]{0.00152} on average for 4 nodes and
+\np[ms]{0.00665} on average for 32 nodes. The algorithm complexity is $O(F\cdot
+N)$, where $F$ is the number of available frequencies and $N$ is the number of
+computing nodes. The algorithm is called just once during the execution of the
+program. The DVFS algorithm~(\ref{dvfs}) shows where and when the algorithm is
+called in the MPI program.
+\begin{table}[htb]
+ \caption{Platform file parameters}
% title of Table
\centering
- \AG{Use e.g. $5\times 10^{-7}$ instead of 5E-7}
- \begin{tabular}{ | l | l | l |l | l |l |l | p{2cm} |}
+ \begin{tabular}{|*{7}{l|}}
+ \hline
+ Max & Min & Backbone & Backbone & Link & Link & Sharing \\
+ Freq. & Freq. & Bandwidth & Latency & Bandwidth & Latency & Policy \\
+ \hline
+ \np{2.5} & \np{800} & \np[GBps]{2.25} & \np[$\mu$s]{0.5} & \np[GBps]{1} & \np[$\mu$s]{50} & Full \\
+ GHz & MHz & & & & & Duplex \\
\hline
- Max & Min & Backbone & Backbone&Link &Link& Sharing \\
- Freq. & Freq. & Bandwidth & Latency & Bandwidth& Latency&Policy \\ \hline
- 2.5 &800 & 2.25 GBps &5E-7 s & 1 GBps & 5E-5 s&Full \\
- GHz& MHz& & & & &Duplex \\\hline
\end{tabular}
\label{table:platform}
\end{table}
-Depending on the EQ~(\ref{eq:energy}), we measure the energy consumption for all
-the NAS MPI programs while assuming the power dynamic is equal to 20W and the
-power static is equal to 4W for all experiments. We run the proposed EPSA
-algorithm for all these programs. The results showed that the algorithm selected
-different scaling factors for each program depending on the communication
-features of the program as in the figure~(\ref{fig:nas}). This figure shows that
-there are different distances between the normalized energy and the normalized
-inversed performance curves, because there are different communication features
-for each MPI program. When there are little or not communications, the inversed
-performance curve is very close to the energy curve. Then the distance between
-the two curves is very small. This lead to small energy savings. The opposite
+
+\begin{algorithm}[tp]
+ \caption{DVFS}
+ \label{dvfs}
+ \begin{algorithmic}[1]
+ \For {$k:=1$ to \textit{some iterations}}
+ \State Computations section.
+ \State Communications section.
+ \If {$(k=1)$}
+ \State Gather all times of computation and\newline\hspace*{3em}%
+ communication from each node.
+ \State Call algorithm~\ref{EPSA} with these times.
+ \State Compute the new frequency from the\newline\hspace*{3em}%
+ returned optimal scaling factor.
+ \State Set the new frequency to the CPU.
+ \EndIf
+ \EndFor
+ \end{algorithmic}
+\end{algorithm}
+After obtaining the optimal scaling factor, the program calculates the new
+frequency $F_i$ for each task proportionally to its time value $T_i$. By
+substitution of EQ~(\ref{eq:s}) in EQ~(\ref{eq:si}), we can calculate the new
+frequency $F_i$ as follows:
+\begin{equation}
+ \label{eq:fi}
+ F_i = \frac{F_\textit{max} \cdot T_i}{S_\textit{optimal} \cdot T_\textit{max}}
+\end{equation}
+According to this equation all the nodes may have the same frequency value if
+they have balanced workloads, otherwise, they take different frequencies when
+having imbalanced workloads. Thus, EQ~(\ref{eq:fi}) adapts the frequency of the
+CPU to the nodes' workloads to maintain performance.
+
+\section{Experimental results}
+\label{sec.expe}
+Our experiments are executed on the simulator SimGrid/SMPI v3.10. We configure
+the simulator to use a homogeneous cluster with one core per node. The detailed
+characteristics of our platform file are shown in the
+table~(\ref{table:platform}). Each node in the cluster has 18 frequency values
+from \np[GHz]{2.5} to \np[MHz]{800} with \np[MHz]{100} difference between each
+two successive frequencies. The simulated network link is \np[GB]{1} Ethernet
+(TCP/IP). The backbone of the cluster simulates a high performance switch.
+
+\subsection{Performance prediction verification}
+
+In this section we evaluate the precision of our performance prediction method
+based on EQ~(\ref{eq:tnew}) by applying it the NAS benchmarks. The NAS programs
+are executed with the class B option for comparing the real execution time with
+the predicted execution time. Each program runs offline with all available
+scaling factors on 8 or 9 nodes (depending on the benchmark) to produce real
+execution time values. These scaling factors are computed by dividing the
+maximum frequency by the new one see EQ~(\ref{eq:s}).
+\begin{figure*}[t]
+ \centering
+ \includegraphics[width=.328\textwidth]{fig/cg_per}\hfill%
+ \includegraphics[width=.328\textwidth]{fig/mg_pre}\hfill%
+ % \includegraphics[width=.4\textwidth]{fig/bt_pre}\qquad%
+ \includegraphics[width=.328\textwidth]{fig/lu_pre}\hfill%
+ \caption{Comparing predicted to real execution time}
+ \label{fig:pred}
+\end{figure*}
+%see Figure~\ref{fig:pred}
+In our cluster there are 18 available frequency states for each processor. This
+leads to 18 run states for each program. We use seven MPI programs of the NAS
+parallel benchmarks: CG, MG, EP, FT, BT, LU and SP. Figure~(\ref{fig:pred})
+presents plots of the real execution times and the simulated ones. The maximum
+normalized error between these two execution times varies between
+\np{0.0073}\AG[]{unit?} to \np{0.031} dependent on the executed benchmark. The
+smallest prediction error was for CG and the worst one was for LU.
+
+\subsection{The experimental results for the scaling algorithm }
+
+The proposed algorithm was applied to seven MPI programs of the NAS benchmarks
+(EP, CG, MG, FT, BT, LU and SP) which were run with three classes (A, B and C).
+For each instance the benchmarks were executed on a number of processors
+proportional to the size of the class. Each class represents the problem size
+ascending from the class A to C. Additionally, depending on some speed up
+points for each class we run the classes A, B and C on 4, 8 or 9 and 16 nodes
+respectively. Depending on EQ~(\ref{eq:energy}), we measure the energy
+consumption for all the NAS MPI programs while assuming the power dynamic with
+the highest frequency is equal to \np[W]{20} and the power static is equal to
+\np[W]{4} for all experiments. These power values were also used by Rauber and
+Rünger in~\cite{3}. The results showed that the algorithm selected different
+scaling factors for each program depending on the communication features of the
+program as in the plots~(\ref{fig:nas}). These plots illustrate that there are
+different distances between the normalized energy and the normalized inverted
+performance curves, because there are different communication features for each
+benchmark. When there are little or not communications, the inverted
+performance curve is very close to the energy curve. Then the distance between
+the two curves is very small. This leads to small energy savings. The opposite
happens when there are a lot of communication, the distance between the two
-curves is big. This lead to more energy savings (e.g. CG and FT), see
-table~(\ref{table:factors results}). All discovered frequency scaling factors
-optimize both the energy and the performance simultaneously for all the NAS
-programs. In table~(\ref{table:factors results}), we record all optimal scaling
-factors results for each program on class C. These factors give the maximum
-energy saving percent and the minimum performance degradation percent in the
-same time over all available scales.
-\begin{figure}[width=\textwidth,height=\textheight,keepaspectratio]
+curves is big. This leads to more energy savings (e.g. CG and FT), see
+table~(\ref{table:factors results}). All discovered frequency scaling factors
+optimize both the energy and the performance simultaneously for all NAS
+benchmarks. In table~(\ref{table:factors results}), we record all optimal
+scaling factors results for each benchmark running class C. These scaling
+factors give the maximum energy saving percent and the minimum performance
+degradation percent at the same time from all available scaling factors.
+\begin{figure*}[t]
\centering
- \includegraphics[scale=0.47]{ep.eps}
- \includegraphics[scale=0.47]{cg.eps}
- \includegraphics[scale=0.47]{sp.eps}
- \includegraphics[scale=0.47]{lu.eps}
- \includegraphics[scale=0.47]{bt.eps}
- \includegraphics[scale=0.47]{ft.eps}
- \caption{Optimal scaling factors for The NAS MPI Programs}
+ \includegraphics[width=.328\textwidth]{fig/ep}\hfill%
+ \includegraphics[width=.328\textwidth]{fig/cg}\hfill%
+ \includegraphics[width=.328\textwidth]{fig/sp}
+ \includegraphics[width=.328\textwidth]{fig/lu}\hfill%
+ \includegraphics[width=.328\textwidth]{fig/bt}\hfill%
+ \includegraphics[width=.328\textwidth]{fig/ft}
+ \caption{Optimal scaling factors for the predicted energy and performance of NAS benchmarks}
\label{fig:nas}
-\end{figure}
-\begin{table}[width=\textwidth,height=\textheight,keepaspectratio]
- \caption{Optimal Scaling Factors Results}
+\end{figure*}
+\begin{table}[htb]
+ \caption{The scaling factors results}
% title of Table
\centering
- \AG{Use the same number of decimals for all numbers in a column,
- and vertically align the numbers along the decimal points.
- The same for all the following tables.}
- \begin{tabular}{ | l | l | l |l | l | p{2cm} |}
+ \begin{tabular}{|l|*{4}{r|}}
+ \hline
+ Program & Optimal & Energy & Performance & Energy-Perf. \\
+ Name & Scaling Factor & Saving \% & Degradation \% & Distance \\
+ \hline
+ CG & 1.56 & 39.23 & 14.88 & 24.35 \\
+ \hline
+ MG & 1.47 & 34.97 & 21.70 & 13.27 \\
+ \hline
+ EP & 1.04 & 22.14 & 20.73 & 1.41 \\
+ \hline
+ LU & 1.38 & 35.83 & 22.49 & 13.34 \\
+ \hline
+ BT & 1.31 & 29.60 & 21.28 & 8.32 \\
+ \hline
+ SP & 1.38 & 33.48 & 21.36 & 12.12 \\
+ \hline
+ FT & 1.47 & 34.72 & 19.00 & 15.72 \\
\hline
- Program & Optimal & Energy & Performance&Energy-Perf.\\
- Name & Scaling Factor& Saving \%&Degradation \% &Distance \\ \hline
- CG & 1.56 &39.23 & 14.88 & 24.35\\ \hline
- MG & 1.47 &34.97&21.7& 13.27 \\ \hline
- EP & 1.04 &22.14&20.73 &1.41\\ \hline
- LU & 1.388 &35.83&22.49 &13.34\\ \hline
- BT & 1.315 &29.6&21.28 &8.32\\ \hline
- SP & 1.388 &33.48 &21.36&12.12\\ \hline
- FT & 1.47 &34.72 &19&15.72\\ \hline
\end{tabular}
\label{table:factors results}
% is used to refer this table in the text
\end{table}
-
As shown in the table~(\ref{table:factors results}), when the optimal scaling
factor has big value we can gain more energy savings for example as in CG and
-FT. The opposite happens when the optimal scaling factor is small value as
-example BT and EP. Our algorithm selects big scaling factor value when the
+FT. The opposite happens when the optimal scaling factor is small value as
+example BT and EP. Our algorithm selects big scaling factor value when the
communication and the other slacks times are big and smaller ones in opposite
-cases. In EP there are no communications inside the iterations. This make our
-EPSA to selects smaller scaling factor values (inducing smaller energy savings).
-
-\section{Comparing Results}
-
-In this section, we compare our EPSA algorithm results with Rauber's
-methods~\cite{3}. He had two scenarios, the first is to reduce energy to optimal
-level without considering the performance as in EQ~(\ref{eq:sopt}). We refer to
-this scenario as $Rauber_{E}$. The second scenario is similar to the first
-except setting the slower task to the maximum frequency (when the scale $S=1$)
-to keep the performance from degradation as mush as possible. We refer to this
-scenario as $Rauber_{E-P}$. The comparison is made in tables~(\ref{table:compare
- Class A},\ref{table:compare Class B},\ref{table:compare Class C}). These
-tables show the results of our EPSA and Rauber's two scenarios for all the NAS
-benchmarks programs for classes A,B and C.
-\begin{table}[ht]
- \caption{Comparing Results for The NAS Class A}
+cases. In EP there are no communications inside the iterations. This make our
+algorithm to selects smaller scaling factor values (inducing smaller energy
+savings).
+
+\subsection{Results comparison}
+
+In this section, we compare our scaling factor selection method with Rauber and
+Rünger methods~\cite{3}. They had two scenarios, the first is to reduce energy
+to the optimal level without considering the performance as in
+EQ~(\ref{eq:sopt}). We refer to this scenario as $R_{E}$. The second scenario
+is similar to the first except setting the slower task to the maximum frequency
+(when the scale $S=1$) to keep the performance from degradation as mush as
+possible. We refer to this scenario as $R_{E-P}$. While we refer to our
+algorithm as EPSA (Energy to Performance Scaling Algorithm). The comparison is
+made in tables \ref{table:compareA}, \ref{table:compareB},
+and~\ref{table:compareC}. These tables show the results of our method and
+Rauber and Rünger scenarios for all the NAS benchmarks programs for classes A, B
+and C.
+\begin{table}[p]
+ \caption{Comparing results for the NAS class A}
% title of Table
\centering
- \begin{tabular}{ | l | l | l |l | l |l| }
+ \begin{tabular}{|l|l|*{4}{r|}}
+ \hline
+ Method & Program & Factor & Energy & Performance & Energy-Perf. \\
+ Name & Name & Value & Saving \% & Degradation \% & Distance \\
\hline
- Method&Program&Factor& Energy& Performance &Energy-Perf.\\
- name &name&value& Saving \%&Degradation \% &Distance
- \\ \hline
% \rowcolor[gray]{0.85}
- EPSA&CG & 1.56 &37.02 & 13.88 & 23.14\\ \hline
- $Rauber_{E-P}$&CG &2.14 &42.77 & 25.27 & 17.5\\ \hline
- $Rauber_{E}$&CG &2.14 &42.77&26.46&16.31\\ \hline
+ $EPSA$ & CG & 1.56 & 37.02 & 13.88 & 23.14 \\ \hline
+ $R_{E-P}$ & CG & 2.14 & 42.77 & 25.27 & 17.50 \\ \hline
+ $R_{E}$ & CG & 2.14 & 42.77 & 26.46 & 16.31 \\ \hline
- EPSA&MG & 1.47 &27.66&16.82&10.84\\ \hline
- $Rauber_{E-P}$&MG &2.14&34.45&31.84&2.61\\ \hline
- $Rauber_{E}$&MG &2.14&34.48&33.65&0.8 \\ \hline
+ $EPSA$ & MG & 1.47 & 27.66 & 16.82 & 10.84 \\ \hline
+ $R_{E-P}$ & MG & 2.14 & 34.45 & 31.84 & 2.61 \\ \hline
+ $R_{E}$ & MG & 2.14 & 34.48 & 33.65 & 0.80 \\ \hline
- EPSA&EP &1.19 &25.32&20.79&4.53\\ \hline
- $Rauber_{E-P}$&EP&2.05&41.45&55.67&-14.22\\ \hline
- $Rauber_{E}$&EP&2.05&42.09&57.59&-15.5\\ \hline
+ $EPSA$ & EP & 1.19 & 25.32 & 20.79 & 4.53 \\ \hline
+ $R_{E-P}$ & EP & 2.05 & 41.45 & 55.67 & -14.22 \\ \hline
+ $R_{E}$ & EP & 2.05 & 42.09 & 57.59 & -15.50 \\ \hline
- EPSA&LU&1.56& 39.55 &19.38& 20.17\\ \hline
- $Rauber_{E-P}$&LU&2.14&45.62&27&18.62 \\ \hline
- $Rauber_{E}$&LU&2.14&45.66&33.01&12.65\\ \hline
+ $EPSA$ & LU & 1.56 & 39.55 & 19.38 & 20.17 \\ \hline
+ $R_{E-P}$ & LU & 2.14 & 45.62 & 27.00 & 18.62 \\ \hline
+ $R_{E}$ & LU & 2.14 & 45.66 & 33.01 & 12.65 \\ \hline
- EPSA&BT&1.315& 29.6&20.53&9.07 \\ \hline
- $Rauber_{E-P}$&BT&2.1&45.53&49.63&-4.1\\ \hline
- $Rauber_{E}$&BT&2.1&43.93&52.86&-8.93\\ \hline
+ $EPSA$ & BT & 1.31 & 29.60 & 20.53 & 9.07 \\ \hline
+ $R_{E-P}$ & BT & 2.10 & 45.53 & 49.63 & -4.10 \\ \hline
+ $R_{E}$ & BT & 2.10 & 43.93 & 52.86 & -8.93 \\ \hline
- EPSA&SP&1.388& 33.51&15.65&17.86 \\ \hline
- $Rauber_{E-P}$&SP&2.11&45.62&42.52&3.1\\ \hline
- $Rauber_{E}$&SP&2.11&45.78&43.09&2.69\\ \hline
+ $EPSA$ & SP & 1.38 & 33.51 & 15.65 & 17.86 \\ \hline
+ $R_{E-P}$ & SP & 2.11 & 45.62 & 42.52 & 3.10 \\ \hline
+ $R_{E}$ & SP & 2.11 & 45.78 & 43.09 & 2.69 \\ \hline
- EPSA&FT&1.25& 25&10.8&14.2 \\ \hline
- $Rauber_{E-P}$&FT&2.1&39.29&34.3&4.99 \\ \hline
- $Rauber_{E}$&FT&2.1&37.56&38.21&-0.65\\ \hline
+ $EPSA$ & FT & 1.25 & 25.00 & 10.80 & 14.20 \\ \hline
+ $R_{E-P}$ & FT & 2.10 & 39.29 & 34.30 & 4.99 \\ \hline
+ $R_{E}$ & FT & 2.10 & 37.56 & 38.21 & -0.65 \\ \hline
\end{tabular}
- \label{table:compare Class A}
+ \label{table:compareA}
% is used to refer this table in the text
\end{table}
-\begin{table}[ht]
- \caption{Comparing Results for The NAS Class B}
+\begin{table}[p]
+ \caption{Comparing results for the NAS class B}
% title of Table
\centering
- \begin{tabular}{ | l | l | l |l | l |l| }
+ \begin{tabular}{|l|l|*{4}{r|}}
+ \hline
+ Method & Program & Factor & Energy & Performance & Energy-Perf. \\
+ Name & Name & Value & Saving \% & Degradation \% & Distance \\
\hline
- Method&Program&Factor& Energy& Performance &Energy-Perf.\\
- name &name&value& Saving \%&Degradation \% &Distance
- \\ \hline
% \rowcolor[gray]{0.85}
- EPSA&CG & 1.66 &39.23&16.63&22.6 \\ \hline
- $Rauber_{E-P}$&CG &2.15 &45.34&27.6&17.74\\ \hline
- $Rauber_{E}$&CG &2.15 &45.34&28.88&16.46\\ \hline
+ $EPSA$ & CG & 1.66 & 39.23 & 16.63 & 22.60 \\ \hline
+ $R_{E-P}$ & CG & 2.15 & 45.34 & 27.60 & 17.74 \\ \hline
+ $R_{E}$ & CG & 2.15 & 45.34 & 28.88 & 16.46 \\ \hline
- EPSA&MG & 1.47 &34.98&18.35&16.63\\ \hline
- $Rauber_{E-P}$&MG &2.14&43.55&36.42&7.13 \\ \hline
- $Rauber_{E}$&MG &2.14&43.56&37.07&6.49 \\ \hline
+ $EPSA$ & MG & 1.47 & 34.98 & 18.35 & 16.63 \\ \hline
+ $R_{E-P}$ & MG & 2.14 & 43.55 & 36.42 & 7.13 \\ \hline
+ $R_{E}$ & MG & 2.14 & 43.56 & 37.07 & 6.49 \\ \hline
- EPSA&EP &1.08 &20.29&17.15&3.14 \\ \hline
- $Rauber_{E-P}$&EP&2&42.38&56.88&-14.5\\ \hline
- $Rauber_{E}$&EP&2&39.73&59.94&-20.21\\ \hline
+ $EPSA$ & EP & 1.08 & 20.29 & 17.15 & 3.14 \\ \hline
+ $R_{E-P}$ & EP & 2.00 & 42.38 & 56.88 & -14.50 \\ \hline
+ $R_{E}$ & EP & 2.00 & 39.73 & 59.94 & -20.21 \\ \hline
- EPSA&LU&1.47&38.57&21.34&17.23 \\ \hline
- $Rauber_{E-P}$&LU&2.1&43.62&36.51&7.11 \\ \hline
- $Rauber_{E}$&LU&2.1&43.61&38.54&5.07 \\ \hline
+ $EPSA$ & LU & 1.47 & 38.57 & 21.34 & 17.23 \\ \hline
+ $R_{E-P}$ & LU & 2.10 & 43.62 & 36.51 & 7.11 \\ \hline
+ $R_{E}$ & LU & 2.10 & 43.61 & 38.54 & 5.07 \\ \hline
- EPSA&BT&1.315& 29.59&20.88&8.71\\ \hline
- $Rauber_{E-P}$&BT&2.1&44.53&53.05&-8.52\\ \hline
- $Rauber_{E}$&BT&2.1&42.93&52.806&-9.876\\ \hline
+ $EPSA$ & BT & 1.31 & 29.59 & 20.88 & 8.71 \\ \hline
+ $R_{E-P}$ & BT & 2.10 & 44.53 & 53.05 & -8.52 \\ \hline
+ $R_{E}$ & BT & 2.10 & 42.93 & 52.80 & -9.87 \\ \hline
- EPSA&SP&1.388&33.44&19.24&14.2 \\ \hline
- $Rauber_{E-P}$&SP&2.15&45.69&43.2&2.49\\ \hline
- $Rauber_{E}$&SP&2.15&45.41&44.47&0.94\\ \hline
+ $EPSA$ & SP & 1.38 & 33.44 & 19.24 & 14.20 \\ \hline
+ $R_{E-P}$ & SP & 2.15 & 45.69 & 43.20 & 2.49 \\ \hline
+ $R_{E}$ & SP & 2.15 & 45.41 & 44.47 & 0.94 \\ \hline
- EPSA&FT&1.388&34.4&14.57&19.83 \\ \hline
- $Rauber_{E-P}$&FT&2.13&42.98&37.35&5.63 \\ \hline
- $Rauber_{E}$&FT&2.13&43.04&37.9&5.14\\ \hline
+ $EPSA$ & FT & 1.38 & 34.40 & 14.57 & 19.83 \\ \hline
+ $R_{E-P}$ & FT & 2.13 & 42.98 & 37.35 & 5.63 \\ \hline
+ $R_{E}$ & FT & 2.13 & 43.04 & 37.90 & 5.14 \\ \hline
\end{tabular}
- \label{table:compare Class B}
+ \label{table:compareB}
% is used to refer this table in the text
\end{table}
-\begin{table}[ht]
- \caption{Comparing Results for The NAS Class C}
+\begin{table}[p]
+ \caption{Comparing results for the NAS class C}
% title of Table
\centering
- \begin{tabular}{ | l | l | l |l | l |l| }
+ \begin{tabular}{|l|l|*{4}{r|}}
+ \hline
+ Method & Program & Factor & Energy & Performance & Energy-Perf. \\
+ Name & Name & Value & Saving \% & Degradation \% & Distance \\
\hline
- Method&Program&Factor& Energy& Performance &Energy-Perf.\\
- name &name&value& Saving \%&Degradation \% &Distance
- \\ \hline
% \rowcolor[gray]{0.85}
- EPSA&CG & 1.56 &39.23&14.88&24.35 \\ \hline
- $Rauber_{E-P}$&CG &2.15 &45.36&25.89&19.47\\ \hline
- $Rauber_{E}$&CG &2.15 &45.36&26.7&18.66\\ \hline
+ $EPSA$ & CG & 1.56 & 39.23 & 14.88 & 24.35 \\ \hline
+ $R_{E-P}$ & CG & 2.15 & 45.36 & 25.89 & 19.47 \\ \hline
+ $R_{E}$ & CG & 2.15 & 45.36 & 26.70 & 18.66 \\ \hline
- EPSA&MG & 1.47 &34.97&21.697&13.273\\ \hline
- $Rauber_{E-P}$&MG &2.15&43.65&40.45&3.2 \\ \hline
- $Rauber_{E}$&MG &2.15&43.64&41.38&2.26 \\ \hline
+ $EPSA$ & MG & 1.47 & 34.97 & 21.69 & 13.27 \\ \hline
+ $R_{E-P}$ & MG & 2.15 & 43.65 & 40.45 & 3.20 \\ \hline
+ $R_{E}$ & MG & 2.15 & 43.64 & 41.38 & 2.26 \\ \hline
- EPSA&EP &1.04 &22.14&20.73&1.41 \\ \hline
- $Rauber_{E-P}$&EP&1.92&39.4&56.33&-16.93\\ \hline
- $Rauber_{E}$&EP&1.92&38.1&56.35&-18.25\\ \hline
+ $EPSA$ & EP & 1.04 & 22.14 & 20.73 & 1.41 \\ \hline
+ $R_{E-P}$ & EP & 1.92 & 39.40 & 56.33 & -16.93 \\ \hline
+ $R_{E}$ & EP & 1.92 & 38.10 & 56.35 & -18.25 \\ \hline
- EPSA&LU&1.388&35.83&22.49&13.34 \\ \hline
- $Rauber_{E-P}$&LU&2.15&44.97&41&3.97 \\ \hline
- $Rauber_{E}$&LU&2.15&44.97&41.8&3.17 \\ \hline
+ $EPSA$ & LU & 1.38 & 35.83 & 22.49 & 13.34 \\ \hline
+ $R_{E-P}$ & LU & 2.15 & 44.97 & 41.00 & 3.97 \\ \hline
+ $R_{E}$ & LU & 2.15 & 44.97 & 41.80 & 3.17 \\ \hline
- EPSA&BT&1.315& 29.6&21.28&8.32\\ \hline
- $Rauber_{E-P}$&BT&2.13&45.6&49.84&-4.24\\ \hline
- $Rauber_{E}$&BT&2.13&44.9&55.16&-10.26\\ \hline
+ $EPSA$ & BT & 1.31 & 29.60 & 21.28 & 8.32 \\ \hline
+ $R_{E-P}$ & BT & 2.13 & 45.60 & 49.84 & -4.24 \\ \hline
+ $R_{E}$ & BT & 2.13 & 44.90 & 55.16 & -10.26 \\ \hline
- EPSA&SP&1.388&33.48&21.35&12.12\\ \hline
- $Rauber_{E-P}$&SP&2.1&45.69&43.6&2.09\\ \hline
- $Rauber_{E}$&SP&2.1&45.75&44.1&1.65\\ \hline
+ $EPSA$ & SP & 1.38 & 33.48 & 21.35 & 12.12 \\ \hline
+ $R_{E-P}$ & SP & 2.10 & 45.69 & 43.60 & 2.09 \\ \hline
+ $R_{E}$ & SP & 2.10 & 45.75 & 44.10 & 1.65 \\ \hline
- EPSA&FT&1.47&34.72&19&15.72 \\ \hline
- $Rauber_{E-P}$&FT&2.04&39.4&37.1&2.3\\ \hline
- $Rauber_{E}$&FT&2.04&39.35&37.7&1.65\\ \hline
+ $EPSA$ & FT & 1.47 & 34.72 & 19.00 & 15.72 \\ \hline
+ $R_{E-P}$ & FT & 2.04 & 39.40 & 37.10 & 2.30 \\ \hline
+ $R_{E}$ & FT & 2.04 & 39.35 & 37.70 & 1.65 \\ \hline
\end{tabular}
-\label{table:compare Class C}
-% is used to refer this table in the text
+ \label{table:compareC}
+ % is used to refer this table in the text
\end{table}
-As shown in these tables our scaling factor is not optimal for energy saving
-such as Rauber's scaling factor EQ~(\ref{eq:sopt}), but it is optimal for both
-the energy and the performance simultaneously. Our EPSA optimal scaling factors
-has better simultaneous optimization for both the energy and the performance
-compared to Rauber's energy-performance method ($Rauber_{E-P}$). Also, in
-($Rauber_{E-P}$) method when setting the frequency to maximum value for the
-slower task lead to a small improvement of the performance. Also the results
-show that this method keep or improve energy saving. Because of the energy
-consumption decrease when the execution time decreased while the frequency value
-increased.
-
-Figure~(\ref{fig:compare}) shows the maximum distance between the energy saving
-percent and the performance degradation percent. Therefore, this means it is the
-same resultant of our objective function EQ~(\ref{eq:max}). Our algorithm always
-gives positive energy to performance trade offs while Rauber's method
-($Rauber_{E-P}$) gives in some time negative trade offs such as in BT and
-EP. The positive trade offs with highest values lead to maximum energy savings
-concatenating with less performance degradation and this the objective of this
-paper. While the negative trade offs refers to improving energy saving (or may
-be the performance) while degrading the performance (or may be the energy) more
-than the first.
-\begin{figure}[width=\textwidth,height=\textheight,keepaspectratio]
- \centering
- \includegraphics[scale=0.60]{compare_class_A.pdf}
- \includegraphics[scale=0.60]{compare_class_B.pdf}
- \includegraphics[scale=0.60]{compare_class_c.pdf}
- % use scale 35 for all to be in the same line
- \caption{Comparing Our EPSA with Rauber's Methods}
- \label{fig:compare}
-\end{figure}
-
-\AG{\texttt{bibtex} gives many errors, please correct them}
-\bibliographystyle{plain}
-\bibliography{my_reference}
+As shown in tables~\ref{table:compareA},~\ref{table:compareB}
+and~\ref{table:compareC}, the ($R_{E-P}$) method outperforms the ($R_{E}$)
+method in terms of performance and energy reduction. The ($R_{E-P}$) method
+also gives better energy savings than our method. However, although our scaling
+factor is not optimal for energy reduction, the results in these tables prove
+that our algorithm returns the best scaling factor that satisfy our objective
+method: the largest distance between energy reduction and performance
+degradation. Negative values in the energy-performance column mean that one of
+the two objectives (energy or performance) have been degraded more than the
+other. The positive trade-offs with the highest values lead to maximum energy
+savings while keeping the performance degradation as low as possible. Our
+algorithm always gives the highest positive energy to performance trade-offs
+while Rauber and Rünger method ($R_{E-P}$) gives in some time negative
+trade-offs such as in BT and EP.
+%\begin{figure*}[t]
+% \centering
+% \includegraphics[width=.328\textwidth]{fig/compare_class_A}
+% \includegraphics[width=.328\textwidth]{fig/compare_class_B}
+% \includegraphics[width=.328\textwidth]{fig/compare_class_C}
+% \caption{Comparing our method to Rauber and Rünger methods}
+% \label{fig:compare}
+%\end{figure*}
+
+\section{Conclusion}
+\label{sec.concl}
+
+In this paper, we have presented a new online scaling factor selection method
+that optimizes simultaneously the energy and performance of a distributed
+application running on an homogeneous cluster. It uses the computation and
+communication times measured at the first iteration to predict energy
+consumption and the performance of the parallel application at every available
+frequency. Then, it selects the scaling factor that gives the best trade-off
+between energy reduction and performance which is the maximum distance between
+the energy and the inverted performance curves. To evaluate this method, we
+have applied it to the NAS benchmarks and it was compared to Rauber and Rünger
+methods while being executed on the simulator SimGrid. The results showed that
+our method, outperforms Rauber and Rünger methods in terms of energy-performance
+ratio.
+
+In the near future, we would like to adapt this scaling factor selection method
+to heterogeneous platforms where each node has different characteristics. In
+particular, each CPU has different available frequencies, energy consumption and
+performance. It would be also interesting to develop a new energy model for
+asynchronous parallel iterative methods where the number of iterations is not
+known in advance and depends on the global convergence of the iterative system.
+
+\section*{Acknowledgment}
+
+This work has been supported by the Labex ACTION project (contract
+``ANR-11-LABX-01-01''). Computations have been performed on the supercomputer
+facilities of the Mésocentre de calcul de Franche-Comté. As a PhD student,
+Mr. Ahmed Fanfakh, would like to thank the University of Babylon (Iraq) for
+supporting his work.
+
+% trigger a \newpage just before the given reference
+% number - used to balance the columns on the last page
+% adjust value as needed - may need to be readjusted if
+% the document is modified later
+%\IEEEtriggeratref{15}
+
+\bibliographystyle{IEEEtran}
+\bibliography{IEEEabrv,my_reference}
\end{document}
%%% Local Variables:
%%% ispell-local-dictionary: "american"
%%% End:
-% LocalWords: Badri Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber
-% LocalWords: CMOS EQ EPSA
+% LocalWords: Fanfakh Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber
+% LocalWords: CMOS EQ EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex