\usepackage{algorithm}
\usepackage{subfig}
\usepackage{amsmath}
-\usepackage{multirow}
\usepackage{url}
\DeclareUrlCommand\email{\urlstyle{same}}
\newcommand{\JC}[2][inline]{%
\todo[color=red!10,#1]{\sffamily\textbf{JC:} #2}\xspace}
-\newcommand{\Xsub}[2]{\ensuremath{#1_\textit{#2}}}
+\newcommand{\Xsub}[2]{{\ensuremath{#1_\mathit{#2}}}}
-\newcommand{\Dist}{\textit{Dist}}
+%% used to put some subscripts lower, and make them more legible
+\newcommand{\fxheight}[1]{\ifx#1\relax\relax\else\rule{0pt}{1.52ex}#1\fi}
+
+\newcommand{\CL}{\Xsub{C}{L}}
+\newcommand{\Dist}{\mathit{Dist}}
+\newcommand{\EdNew}{\Xsub{E}{dNew}}
\newcommand{\Eind}{\Xsub{E}{ind}}
\newcommand{\Enorm}{\Xsub{E}{Norm}}
\newcommand{\Eoriginal}{\Xsub{E}{Original}}
\newcommand{\Ereduced}{\Xsub{E}{Reduced}}
-\newcommand{\Fdiff}{\Xsub{F}{diff}}
-\newcommand{\Fmax}{\Xsub{F}{max}}
+\newcommand{\Es}{\Xsub{E}{S}}
+\newcommand{\Fdiff}[1][]{\Xsub{F}{diff}_{\!#1}}
+\newcommand{\Fmax}[1][]{\Xsub{F}{max}_{\fxheight{#1}}}
\newcommand{\Fnew}{\Xsub{F}{new}}
\newcommand{\Ileak}{\Xsub{I}{leak}}
\newcommand{\Kdesign}{\Xsub{K}{design}}
-\newcommand{\MaxDist}{\textit{Max Dist}}
+\newcommand{\MaxDist}{\mathit{Max}\Dist}
+\newcommand{\MinTcm}{\mathit{Min}\Tcm}
\newcommand{\Ntrans}{\Xsub{N}{trans}}
-\newcommand{\Pdyn}{\Xsub{P}{dyn}}
-\newcommand{\PnormInv}{\Xsub{P}{NormInv}}
+\newcommand{\Pd}[1][]{\Xsub{P}{d}_{\fxheight{#1}}}
+\newcommand{\PdNew}{\Xsub{P}{dNew}}
+\newcommand{\PdOld}{\Xsub{P}{dOld}}
\newcommand{\Pnorm}{\Xsub{P}{Norm}}
-\newcommand{\Tnorm}{\Xsub{T}{Norm}}
-\newcommand{\Pstates}{\Xsub{P}{states}}
-\newcommand{\Pstatic}{\Xsub{P}{static}}
-\newcommand{\Sopt}{\Xsub{S}{opt}}
-\newcommand{\Tcomp}{\Xsub{T}{comp}}
-\newcommand{\TmaxCommOld}{\Xsub{T}{Max Comm Old}}
-\newcommand{\TmaxCompOld}{\Xsub{T}{Max Comp Old}}
-\newcommand{\Tmax}{\Xsub{T}{max}}
+\newcommand{\Ps}[1][]{\Xsub{P}{s}_{\fxheight{#1}}}
+\newcommand{\Scp}[1][]{\Xsub{S}{cp}_{#1}}
+\newcommand{\Sopt}[1][]{\Xsub{S}{opt}_{#1}}
+\newcommand{\Tcm}[1][]{\Xsub{T}{cm}_{\fxheight{#1}}}
+\newcommand{\Tcp}[1][]{\Xsub{T}{cp}_{#1}}
+\newcommand{\TcpOld}[1][]{\Xsub{T}{cpOld}_{#1}}
\newcommand{\Tnew}{\Xsub{T}{New}}
-\newcommand{\Told}{\Xsub{T}{Old}}
-\begin{document}
+\newcommand{\Told}{\Xsub{T}{Old}}
+
+\begin{document}
-\title{Energy Consumption Reduction for Message Passing Iterative Applications in Heterogeneous Architecture Using DVFS}
-
-\author{%
+\title{Energy Consumption Reduction for \\
+ Message Passing Iterative Applications on \\
+ Heterogeneous Architectures Using DVFS}
+
+\author{%
\IEEEauthorblockN{%
Jean-Claude Charr,
Raphaël Couturier,
Ahmed Fanfakh and
Arnaud Giersch
- }
+ }
\IEEEauthorblockA{%
- FEMTO-ST Institute, University of Franche-Comte\\
+ FEMTO-ST Institute, University of Franche-Comté\\
IUT de Belfort-Montbéliard,
19 avenue du Maréchal Juin, BP 527, 90016 Belfort cedex, France\\
% Telephone: \mbox{+33 3 84 58 77 86}, % Raphaël
\maketitle
\begin{abstract}
-Computing platforms are consuming more and more energy due to the increasing
-number of nodes composing them. To minimize the operating costs of these
-platforms many techniques have been used. Dynamic voltage and frequency scaling
-(DVFS) is one of them. It reduces the frequency of a CPU to lower its energy
-consumption. However, lowering the frequency of a CPU might increase the
-execution time of an application running on that processor. Therefore, the
-frequency that gives the best tradeoff between the energy consumption and the
-performance of an application must be selected.\\
-In this paper, a new online frequencies selecting algorithm for heterogeneous
-platforms is presented. It selects the frequency which tries to give the best
-tradeoff between energy saving and performance degradation, for each node
-computing the message passing iterative application. The algorithm has a small
-overhead and works without training or profiling. It uses a new energy model for
-message passing iterative applications running on a heterogeneous platform. The
-proposed algorithm is evaluated on the Simgrid simulator while running the NAS
-parallel benchmarks. The experiments show that it reduces the energy
-consumption by up to 35\% while limiting the performance degradation as much as
-possible. Finally, the algorithm is compared to an existing method, the
-comparison results showing that it outperforms the latter.
+ Computing platforms are consuming more and more energy due to the increasing
+ number of nodes composing them. To minimize the operating costs of these
+ platforms many techniques have been used. Dynamic voltage and frequency
+ scaling (DVFS) is one of them. It reduces the frequency of a CPU to lower its
+ energy consumption. However, lowering the frequency of a CPU might increase
+ the execution time of an application running on that processor. Therefore,
+ the frequency that gives the best trade-off between the energy consumption and
+ the performance of an application must be selected.
+
+ In this paper, a new online frequency selecting algorithm for heterogeneous
+ platforms is presented. It selects the frequencies and tries to give the best
+ trade-off between energy saving and performance degradation, for each node
+ computing the message passing iterative application. The algorithm has a small
+ overhead and works without training or profiling. It uses a new energy model
+ for message passing iterative applications running on a heterogeneous
+ platform. The proposed algorithm is evaluated on the SimGrid simulator while
+ running the NAS parallel benchmarks. The experiments show that it reduces the
+ energy consumption by up to \np[\%]{35} while limiting the performance
+ degradation as much as possible. Finally, the algorithm is compared to an
+ existing method, the comparison results showing that it outperforms the
+ latter.
\end{abstract}
\section{Introduction}
\label{sec.intro}
-The need for more computing power is continually increasing. To partially
-satisfy this need, most supercomputers constructors just put more computing
-nodes in their platform. The resulting platforms might achieve higher floating
-point operations per second (FLOPS), but the energy consumption and the heat
-dissipation are also increased. As an example, the Chinese supercomputer
-Tianhe-2 had the highest FLOPS in November 2014 according to the Top500 list
-\cite{TOP500_Supercomputers_Sites}. However, it was also the most power hungry
-platform with its over 3 million cores consuming around 17.8 megawatts.
-Moreover, according to the U.S. annual energy outlook 2014
-\cite{U.S_Annual.Energy.Outlook.2014}, the price of energy for 1 megawatt-hour
+
+The need for more computing power is continually increasing. To partially
+satisfy this need, most supercomputers constructors just put more computing
+nodes in their platform. The resulting platforms might achieve higher floating
+point operations per second (FLOPS), but the energy consumption and the heat
+dissipation are also increased. As an example, the Chinese supercomputer
+Tianhe-2 had the highest FLOPS in November 2014 according to the Top500 list
+\cite{TOP500_Supercomputers_Sites}. However, it was also the most power hungry
+platform with its over 3 million cores consuming around 17.8 megawatts.
+Moreover, according to the U.S. annual energy outlook 2014
+\cite{U.S_Annual.Energy.Outlook.2014}, the price of energy for 1 megawatt-hour
was approximately equal to \$70. Therefore, the price of the energy consumed by
-the Tianhe-2 platform is approximately more than \$10 million each year. The
-computing platforms must be more energy efficient and offer the highest number
-of FLOPS per watt possible, such as the L-CSC from the GSI Helmholtz Center
+the Tianhe-2 platform is approximately more than \$10 million each year. The
+computing platforms must be more energy efficient and offer the highest number
+of FLOPS per watt possible, such as the L-CSC from the GSI Helmholtz Center
which became the top of the Green500 list in November 2014 \cite{Green500_List}.
This heterogeneous platform executes more than 5 GFLOPS per watt while consuming
57.15 kilowatts.
-Besides platform improvements, there are many software and hardware techniques
-to lower the energy consumption of these platforms, such as scheduling, DVFS,
-... DVFS is a widely used process to reduce the energy consumption of a
-processor by lowering its frequency
+Besides platform improvements, there are many software and hardware techniques
+to lower the energy consumption of these platforms, such as scheduling, DVFS,
+\dots{} DVFS is a widely used process to reduce the energy consumption of a
+processor by lowering its frequency
\cite{Rizvandi_Some.Observations.on.Optimal.Frequency}. However, it also reduces
the number of FLOPS executed by the processor which might increase the execution
time of the application running over that processor. Therefore, researchers use
-different optimization strategies to select the frequency that gives the best
-tradeoff between the energy reduction and performance degradation ratio. In
-\cite{Our_first_paper}, a frequency selecting algorithm was proposed to reduce
-the energy consumption of message passing iterative applications running over
-homogeneous platforms. The results of the experiments show significant energy
-consumption reductions. In this paper, a new frequency selecting algorithm
-adapted for heterogeneous platform is presented. It selects the vector of
-frequencies, for a heterogeneous platform running a message passing iterative
+different optimization strategies to select the frequency that gives the best
+trade-off between the energy reduction and performance degradation ratio. In
+\cite{Our_first_paper}, a frequency selecting algorithm was proposed to reduce
+the energy consumption of message passing iterative applications running over
+homogeneous platforms. The results of the experiments show significant energy
+consumption reductions. In this paper, a new frequency selecting algorithm
+adapted for heterogeneous platform is presented. It selects the vector of
+frequencies, for a heterogeneous platform running a message passing iterative
application, that simultaneously tries to offer the maximum energy reduction and
-minimum performance degradation ratio. The algorithm has a very small overhead,
+minimum performance degradation ratio. The algorithm has a very small overhead,
works online and does not need any training or profiling.
This paper is organized as follows: Section~\ref{sec.relwork} presents some
related works from other authors. Section~\ref{sec.exe} describes how the
-execution time of message passing programs can be predicted. It also presents an energy
-model that predicts the energy consumption of an application running over a heterogeneous platform. Section~\ref{sec.compet} presents
-the energy-performance objective function that maximizes the reduction of energy
+execution time of message passing programs can be predicted. It also presents
+an energy model that predicts the energy consumption of an application running
+over a heterogeneous platform. Section~\ref{sec.compet} presents the
+energy-performance objective function that maximizes the reduction of energy
consumption while minimizing the degradation of the program's performance.
-Section~\ref{sec.optim} details the proposed frequency selecting algorithm then the precision of the proposed algorithm is verified.
-Section~\ref{sec.expe} presents the results of applying the algorithm on the NAS parallel benchmarks and executing them
-on a heterogeneous platform. It shows the results of running three
-different power scenarios and comparing them. Moreover, it also shows the comparison results
-between the proposed method and an existing method.
-Finally, in Section~\ref{sec.concl} the paper ends with a summary and some future works.
+Section~\ref{sec.optim} details the proposed frequency selecting algorithm then
+the precision of the proposed algorithm is verified. Section~\ref{sec.expe}
+presents the results of applying the algorithm on the NAS parallel benchmarks
+and executing them on a heterogeneous platform. It shows the results of running
+three different power scenarios and comparing them. Moreover, it also shows the
+comparison results between the proposed method and an existing method. Finally,
+in Section~\ref{sec.concl} the paper ends with a summary and some future works.
\section{Related works}
\label{sec.relwork}
+
DVFS is a technique used in modern processors to scale down both the voltage and
-the frequency of the CPU while computing, in order to reduce the energy
-consumption of the processor. DVFS is also allowed in GPUs to achieve the
-same goal. Reducing the frequency of a processor lowers its number of FLOPS and
-might degrade the performance of the application running on that processor,
-especially if it is compute bound. Therefore selecting the appropriate frequency
-for a processor to satisfy some objectives while taking into account all the
-constraints, is not a trivial operation. Many researchers used different
-strategies to tackle this problem. Some of them developed online methods that
-compute the new frequency while executing the application, such as
-~\cite{Hao_Learning.based.DVFS,Spiliopoulos_Green.governors.Adaptive.DVFS}. Others
-used offline methods that might need to run the application and profile it
-before selecting the new frequency, such as
-~\cite{Rountree_Bounding.energy.consumption.in.MPI,Cochran_Pack_and_Cap_Adaptive_DVFS}. The
-methods could be heuristics, exact or brute force methods that satisfy varied
-objectives such as energy reduction or performance. They also could be adapted
-to the execution's environment and the type of the application such as
-sequential, parallel or distributed architecture, homogeneous or heterogeneous
-platform, synchronous or asynchronous application, ...
-
-In this paper, we are interested in reducing energy for message passing iterative synchronous applications running over heterogeneous platforms.
-Some works have already been done for such platforms and they can be classified into two types of heterogeneous platforms:
+the frequency of the CPU while computing, in order to reduce the energy
+consumption of the processor. DVFS is also allowed in GPUs to achieve the same
+goal. Reducing the frequency of a processor lowers its number of FLOPS and might
+degrade the performance of the application running on that processor, especially
+if it is compute bound. Therefore selecting the appropriate frequency for a
+processor to satisfy some objectives while taking into account all the
+constraints, is not a trivial operation. Many researchers used different
+strategies to tackle this problem. Some of them developed online methods that
+compute the new frequency while executing the application, such
+as~\cite{Hao_Learning.based.DVFS,Spiliopoulos_Green.governors.Adaptive.DVFS}.
+Others used offline methods that might need to run the application and profile
+it before selecting the new frequency, such
+as~\cite{Rountree_Bounding.energy.consumption.in.MPI,Cochran_Pack_and_Cap_Adaptive_DVFS}.
+The methods could be heuristics, exact or brute force methods that satisfy
+varied objectives such as energy reduction or performance. They also could be
+adapted to the execution's environment and the type of the application such as
+sequential, parallel or distributed architecture, homogeneous or heterogeneous
+platform, synchronous or asynchronous application, \dots{}
+
+In this paper, we are interested in reducing energy for message passing
+iterative synchronous applications running over heterogeneous platforms. Some
+works have already been done for such platforms and they can be classified into
+two types of heterogeneous platforms:
\begin{itemize}
-
\item the platform is composed of homogeneous GPUs and homogeneous CPUs.
\item the platform is only composed of heterogeneous CPUs.
-
\end{itemize}
-For the first type of platform, the computing intensive parallel tasks are executed on the GPUs and the rest are executed
-on the CPUs. Luley et al.
-~\cite{Luley_Energy.efficiency.evaluation.and.benchmarking}, proposed a heterogeneous
-cluster composed of Intel Xeon CPUs and NVIDIA GPUs. Their main goal was to maximize the
-energy efficiency of the platform during computation by maximizing the number of FLOPS per watt generated.
-In~\cite{KaiMa_Holistic.Approach.to.Energy.Efficiency.in.GPU-CPU}, Kai Ma et al. developed a scheduling
-algorithm that distributes workloads proportional to the computing power of the nodes which could be a GPU or a CPU. All the tasks must be completed at the same time.
-In~\cite{Rong_Effects.of.DVFS.on.K20.GPU}, Rong et al. showed that
-a heterogeneous (GPUs and CPUs) cluster that enables DVFS gave better energy and performance
-efficiency than other clusters only composed of CPUs.
-
-The work presented in this paper concerns the second type of platform, with heterogeneous CPUs.
-Many methods were conceived to reduce the energy consumption of this type of platform. Naveen et al.~\cite{Naveen_Power.Efficient.Resource.Scaling}
-developed a method that minimizes the value of $energy*delay^2$ (the delay is the sum of slack times that happen during synchronous communications) by dynamically assigning new frequencies to the CPUs of the heterogeneous cluster. Lizhe et al.~\cite{Lizhe_Energy.aware.parallel.task.scheduling} proposed
-an algorithm that divides the executed tasks into two types: the critical and
-non critical tasks. The algorithm scales down the frequency of non critical tasks proportionally to their slack and communication times while limiting the performance degradation percentage to less than 10\%. In~\cite{Joshi_Blackbox.prediction.of.impact.of.DVFS}, they developed
- a heterogeneous cluster composed of two types
-of Intel and AMD processors. They use a gradient method to predict the impact of DVFS operations on performance.
-In~\cite{Shelepov_Scheduling.on.Heterogeneous.Multicore} and \cite{Li_Minimizing.Energy.Consumption.for.Frame.Based.Tasks},
- the best frequencies for a specified heterogeneous cluster are selected offline using some
-heuristic. Chen et al.~\cite{Chen_DVFS.under.quality.of.service.requirements} used a greedy dynamic programming approach to
-minimize the power consumption of heterogeneous servers while respecting given time constraints. This approach
-had considerable overhead.
-In contrast to the above described papers, this paper presents the following contributions :
+For the first type of platform, the computing intensive parallel tasks are
+executed on the GPUs and the rest are executed on the CPUs. Luley et
+al.~\cite{Luley_Energy.efficiency.evaluation.and.benchmarking}, proposed a
+heterogeneous cluster composed of Intel Xeon CPUs and NVIDIA GPUs. Their main
+goal was to maximize the energy efficiency of the platform during computation by
+maximizing the number of FLOPS per watt generated.
+In~\cite{KaiMa_Holistic.Approach.to.Energy.Efficiency.in.GPU-CPU}, Kai Ma et
+al. developed a scheduling algorithm that distributes workloads proportional to
+the computing power of the nodes which could be a GPU or a CPU. All the tasks
+must be completed at the same time. In~\cite{Rong_Effects.of.DVFS.on.K20.GPU},
+Rong et al. showed that a heterogeneous (GPUs and CPUs) cluster that enables
+DVFS gave better energy and performance efficiency than other clusters only
+composed of CPUs.
+
+The work presented in this paper concerns the second type of platform, with
+heterogeneous CPUs. Many methods were conceived to reduce the energy
+consumption of this type of platform. Naveen et
+al.~\cite{Naveen_Power.Efficient.Resource.Scaling} developed a method that
+minimizes the value of $\mathit{energy}\times \mathit{delay}^2$ (the delay is
+the sum of slack times that happen during synchronous communications) by
+dynamically assigning new frequencies to the CPUs of the heterogeneous cluster.
+Lizhe et al.~\cite{Lizhe_Energy.aware.parallel.task.scheduling} proposed an
+algorithm that divides the executed tasks into two types: the critical and non
+critical tasks. The algorithm scales down the frequency of non critical tasks
+proportionally to their slack and communication times while limiting the
+performance degradation percentage to less than \np[\%]{10}.
+In~\cite{Joshi_Blackbox.prediction.of.impact.of.DVFS}, they developed a
+heterogeneous cluster composed of two types of Intel and AMD processors. They
+use a gradient method to predict the impact of DVFS operations on performance.
+In~\cite{Shelepov_Scheduling.on.Heterogeneous.Multicore} and
+\cite{Li_Minimizing.Energy.Consumption.for.Frame.Based.Tasks}, the best
+frequencies for a specified heterogeneous cluster are selected offline using
+some heuristic. Chen et
+al.~\cite{Chen_DVFS.under.quality.of.service.requirements} used a greedy dynamic
+programming approach to minimize the power consumption of heterogeneous servers
+while respecting given time constraints. This approach had considerable
+overhead. In contrast to the above described papers, this paper presents the
+following contributions :
\begin{enumerate}
-\item two new energy and performance models for message passing iterative synchronous applications running over
- a heterogeneous platform. Both models take into account communication and slack times. The models can predict the required energy and the execution time of the application.
-
-\item a new online frequency selecting algorithm for heterogeneous platforms. The algorithm has a very small
- overhead and does not need any training or profiling. It uses a new optimization function which simultaneously maximizes the performance and minimizes the energy consumption of a message passing iterative synchronous application.
-
+\item two new energy and performance models for message passing iterative
+ synchronous applications running over a heterogeneous platform. Both models
+ take into account communication and slack times. The models can predict the
+ required energy and the execution time of the application.
+
+\item a new online frequency selecting algorithm for heterogeneous
+ platforms. The algorithm has a very small overhead and does not need any
+ training or profiling. It uses a new optimization function which
+ simultaneously maximizes the performance and minimizes the energy consumption
+ of a message passing iterative synchronous application.
+
\end{enumerate}
\section{The performance and energy consumption measurements on heterogeneous architecture}
\label{sec.exe}
-
-
-\subsection{The execution time of message passing distributed
- iterative applications on a heterogeneous platform}
+\subsection{The execution time of message passing distributed iterative
+ applications on a heterogeneous platform}
In this paper, we are interested in reducing the energy consumption of message
passing distributed iterative synchronous applications running over
power (FLOPS), energy consumption, CPU's frequency range, \dots{} but they all
have the same network bandwidth and latency.
-The overall execution time of a distributed iterative synchronous application
-over a heterogeneous platform consists of the sum of the computation time and
-the communication time for every iteration on a node. However, due to the
-heterogeneous computation power of the computing nodes, slack times might occur
-when fast nodes have to wait, during synchronous communications, for the slower
-nodes to finish their computations (see Figure~(\ref{fig:heter})).
-Therefore, the overall execution time of the program is the execution time of the slowest
-task which has the highest computation time and no slack time.
-
- \begin{figure}[t]
+\begin{figure}[!t]
\centering
- \includegraphics[scale=0.6]{fig/commtasks}
+ \includegraphics[scale=0.6]{fig/commtasks}
\caption{Parallel tasks on a heterogeneous platform}
\label{fig:heter}
\end{figure}
-Dynamic Voltage and Frequency Scaling (DVFS) is a process, implemented in
-modern processors, that reduces the energy consumption of a CPU by scaling
-down its voltage and frequency. Since DVFS lowers the frequency of a CPU
-and consequently its computing power, the execution time of a program running
-over that scaled down processor might increase, especially if the program is
-compute bound. The frequency reduction process can be expressed by the scaling
-factor S which is the ratio between the maximum and the new frequency of a CPU
+The overall execution time of a distributed iterative synchronous application
+over a heterogeneous platform consists of the sum of the computation time and
+the communication time for every iteration on a node. However, due to the
+heterogeneous computation power of the computing nodes, slack times might occur
+when fast nodes have to wait, during synchronous communications, for the slower
+nodes to finish their computations (see Figure~\ref{fig:heter}). Therefore, the
+overall execution time of the program is the execution time of the slowest task
+which has the highest computation time and no slack time.
+
+Dynamic Voltage and Frequency Scaling (DVFS) is a process, implemented in
+modern processors, that reduces the energy consumption of a CPU by scaling
+down its voltage and frequency. Since DVFS lowers the frequency of a CPU
+and consequently its computing power, the execution time of a program running
+over that scaled down processor might increase, especially if the program is
+compute bound. The frequency reduction process can be expressed by the scaling
+factor S which is the ratio between the maximum and the new frequency of a CPU
as in (\ref{eq:s}).
\begin{equation}
\label{eq:s}
- S = \frac{F_\textit{max}}{F_\textit{new}}
+ S = \frac{\Fmax}{\Fnew}
\end{equation}
- The execution time of a compute bound sequential program is linearly proportional
- to the frequency scaling factor $S$. On the other hand, message passing
- distributed applications consist of two parts: computation and communication.
- The execution time of the computation part is linearly proportional to the
- frequency scaling factor $S$ but the communication time is not affected by the
- scaling factor because the processors involved remain idle during the
- communications~\cite{Freeh_Exploring.the.Energy.Time.Tradeoff}.
- The communication time for a task is the summation of periods of
- time that begin with an MPI call for sending or receiving a message
- until the message is synchronously sent or received.
+The execution time of a compute bound sequential program is linearly
+proportional to the frequency scaling factor $S$. On the other hand, message
+passing distributed applications consist of two parts: computation and
+communication. The execution time of the computation part is linearly
+proportional to the frequency scaling factor $S$ but the communication time is
+not affected by the scaling factor because the processors involved remain idle
+during the communications~\cite{Freeh_Exploring.the.Energy.Time.Tradeoff}. The
+communication time for a task is the summation of periods of time that begin
+with an MPI call for sending or receiving a message until the message is
+synchronously sent or received.
Since in a heterogeneous platform each node has different characteristics,
especially different frequency gears, when applying DVFS operations on these
vector of scaling factors can be predicted using (\ref{eq:perf}).
\begin{equation}
\label{eq:perf}
- \textit T_\textit{new} =
- \max_{i=1,2,\dots,N} ({TcpOld_{i}} \cdot S_{i}) + MinTcm
+ \Tnew = \max_{i=1,2,\dots,N} ({\TcpOld[i]} \cdot S_{i}) + \MinTcm
\end{equation}
-Where:\\
+Where:
\begin{equation}
-\label{eq:perf2}
- MinTcm = \min_{i=1,2,\dots,N} (Tcm_i)
+ \label{eq:perf2}
+ \MinTcm = \min_{i=1,2,\dots,N} (\Tcm[i])
\end{equation}
-where $TcpOld_i$ is the computation time of processor $i$ during the first
-iteration and $MinTcm$ is the communication time of the slowest processor from
-the first iteration. The model computes the maximum computation time with
-scaling factor from each node added to the communication time of the slowest
-node. It means only the communication time without any slack time is taken into
+where $\TcpOld[i]$ is the computation time of processor $i$ during the first
+iteration and $\MinTcm$ is the communication time of the slowest processor from
+the first iteration. The model computes the maximum computation time with
+scaling factor from each node added to the communication time of the slowest
+node. It means only the communication time without any slack time is taken into
account. Therefore, the execution time of the iterative application is equal to
-the execution time of one iteration as in (\ref{eq:perf}) multiplied by the
+the execution time of one iteration as in (\ref{eq:perf}) multiplied by the
number of iterations of that application.
-This prediction model is developed from the model to predict the execution time
-of message passing distributed applications for homogeneous
-architectures~\cite{Our_first_paper}. The execution time prediction model is
-used in the method to optimize both the energy consumption and the performance of
-iterative methods, which is presented in the following sections.
-
+This prediction model is developed from the model to predict the execution time
+of message passing distributed applications for homogeneous
+architectures~\cite{Our_first_paper}. The execution time prediction model is
+used in the method to optimize both the energy consumption and the performance
+of iterative methods, which is presented in the following sections.
\subsection{Energy model for heterogeneous platform}
+
Many researchers~\cite{Malkowski_energy.efficient.high.performance.computing,
-Rauber_Analytical.Modeling.for.Energy,Zhuo_Energy.efficient.Dynamic.Task.Scheduling,
-Rizvandi_Some.Observations.on.Optimal.Frequency} divide the power consumed by a processor into
-two power metrics: the static and the dynamic power. While the first one is
-consumed as long as the computing unit is turned on, the latter is only consumed during
-computation times. The dynamic power $Pd$ is related to the switching
-activity $\alpha$, load capacitance $C_L$, the supply voltage $V$ and
-operational frequency $F$, as shown in (\ref{eq:pd}).
+ Rauber_Analytical.Modeling.for.Energy,Zhuo_Energy.efficient.Dynamic.Task.Scheduling,
+ Rizvandi_Some.Observations.on.Optimal.Frequency} divide the power consumed by
+a processor into two power metrics: the static and the dynamic power. While the
+first one is consumed as long as the computing unit is turned on, the latter is
+only consumed during computation times. The dynamic power $\Pd$ is related to
+the switching activity $\alpha$, load capacitance $\CL$, the supply voltage $V$
+and operational frequency $F$, as shown in (\ref{eq:pd}).
\begin{equation}
\label{eq:pd}
- Pd = \alpha \cdot C_L \cdot V^2 \cdot F
+ \Pd = \alpha \cdot \CL \cdot V^2 \cdot F
\end{equation}
-The static power $Ps$ captures the leakage power as follows:
+The static power $\Ps$ captures the leakage power as follows:
\begin{equation}
\label{eq:ps}
- Ps = V \cdot N_{trans} \cdot K_{design} \cdot I_{leak}
+ \Ps = V \cdot \Ntrans \cdot \Kdesign \cdot \Ileak
\end{equation}
-where V is the supply voltage, $N_{trans}$ is the number of transistors,
-$K_{design}$ is a design dependent parameter and $I_{leak}$ is a
+where V is the supply voltage, $\Ntrans$ is the number of transistors,
+$\Kdesign$ is a design dependent parameter and $\Ileak$ is a
technology dependent parameter. The energy consumed by an individual processor
to execute a given program can be computed as:
\begin{equation}
\label{eq:eind}
- E_\textit{ind} = Pd \cdot Tcp + Ps \cdot T
+ \Eind = \Pd \cdot \Tcp + \Ps \cdot T
\end{equation}
-where $T$ is the execution time of the program, $Tcp$ is the computation
-time and $Tcp \le T$. $Tcp$ may be equal to $T$ if there is no
+where $T$ is the execution time of the program, $\Tcp$ is the computation
+time and $\Tcp \le T$. $\Tcp$ may be equal to $T$ if there is no
communication and no slack time.
-The main objective of DVFS operation is to reduce the overall energy consumption~\cite{Le_DVFS.Laws.of.Diminishing.Returns}.
-The operational frequency $F$ depends linearly on the supply voltage $V$, i.e., $V = \beta \cdot F$ with some
-constant $\beta$.~This equation is used to study the change of the dynamic
-voltage with respect to various frequency values in~\cite{Rauber_Analytical.Modeling.for.Energy}. The reduction
-process of the frequency can be expressed by the scaling factor $S$ which is the
-ratio between the maximum and the new frequency as in (\ref{eq:s}).
-The CPU governors are power schemes supplied by the operating
-system's kernel to lower a core's frequency. The new frequency
-$F_{new}$ from (\ref{eq:s}) can be calculated as follows:
+The main objective of DVFS operation is to reduce the overall energy
+consumption~\cite{Le_DVFS.Laws.of.Diminishing.Returns}. The operational
+frequency $F$ depends linearly on the supply voltage $V$, i.e., $V = \beta \cdot
+F$ with some constant $\beta$.~This equation is used to study the change of the
+dynamic voltage with respect to various frequency values
+in~\cite{Rauber_Analytical.Modeling.for.Energy}. The reduction process of the
+frequency can be expressed by the scaling factor $S$ which is the ratio between
+the maximum and the new frequency as in (\ref{eq:s}). The CPU governors are
+power schemes supplied by the operating system's kernel to lower a core's
+frequency. The new frequency $\Fnew$ from (\ref{eq:s}) can be calculated as
+follows:
\begin{equation}
\label{eq:fnew}
- F_\textit{new} = S^{-1} \cdot F_\textit{max}
+ \Fnew = S^{-1} \cdot \Fmax
\end{equation}
-Replacing $F_{new}$ in (\ref{eq:pd}) as in (\ref{eq:fnew}) gives the following
+Replacing $\Fnew$ in (\ref{eq:pd}) as in (\ref{eq:fnew}) gives the following
equation for dynamic power consumption:
\begin{multline}
\label{eq:pdnew}
- {P}_\textit{dNew} = \alpha \cdot C_L \cdot V^2 \cdot F_{new} = \alpha \cdot C_L \cdot \beta^2 \cdot F_{new}^3 \\
- {} = \alpha \cdot C_L \cdot V^2 \cdot F_{max} \cdot S^{-3} = P_{dOld} \cdot S^{-3}
+ \PdNew = \alpha \cdot \CL \cdot V^2 \cdot \Fnew = \alpha \cdot \CL \cdot \beta^2 \cdot \Fnew^3 \\
+ {} = \alpha \cdot \CL \cdot V^2 \cdot \Fmax \cdot S^{-3} = \PdOld \cdot S^{-3}
\end{multline}
-where $ {P}_\textit{dNew}$ and $P_{dOld}$ are the dynamic power consumed with the
+where $\PdNew$ and $\PdOld$ are the dynamic power consumed with the
new frequency and the maximum frequency respectively.
-According to (\ref{eq:pdnew}) the dynamic power is reduced by a factor of $S^{-3}$ when
-reducing the frequency by a factor of $S$~\cite{Rauber_Analytical.Modeling.for.Energy}. Since the FLOPS of a CPU is proportional
-to the frequency of a CPU, the computation time is increased proportionally to $S$.
-The new dynamic energy is the dynamic power multiplied by the new time of computation
-and is given by the following equation:
+According to (\ref{eq:pdnew}) the dynamic power is reduced by a factor of
+$S^{-3}$ when reducing the frequency by a factor of
+$S$~\cite{Rauber_Analytical.Modeling.for.Energy}. Since the FLOPS of a CPU is
+proportional to the frequency of a CPU, the computation time is increased
+proportionally to $S$. The new dynamic energy is the dynamic power multiplied
+by the new time of computation and is given by the following equation:
\begin{equation}
\label{eq:Edyn}
- E_\textit{dNew} = P_{dOld} \cdot S^{-3} \cdot (Tcp \cdot S)= S^{-2}\cdot P_{dOld} \cdot Tcp
+ \EdNew = \PdOld \cdot S^{-3} \cdot (\Tcp \cdot S)= S^{-2}\cdot \PdOld \cdot \Tcp
\end{equation}
-The static power is related to the power leakage of the CPU and is consumed during computation
-and even when idle. As in~\cite{Rauber_Analytical.Modeling.for.Energy,Zhuo_Energy.efficient.Dynamic.Task.Scheduling},
- the static power of a processor is considered as constant
-during idle and computation periods, and for all its available frequencies.
-The static energy is the static power multiplied by the execution time of the program.
-According to the execution time model in (\ref{eq:perf}), the execution time of the program
-is the sum of the computation and the communication times. The computation time is linearly related
-to the frequency scaling factor, while this scaling factor does not affect the communication time.
-The static energy of a processor after scaling its frequency is computed as follows:
+The static power is related to the power leakage of the CPU and is consumed
+during computation and even when idle. As
+in~\cite{Rauber_Analytical.Modeling.for.Energy,Zhuo_Energy.efficient.Dynamic.Task.Scheduling},
+the static power of a processor is considered as constant during idle and
+computation periods, and for all its available frequencies. The static energy
+is the static power multiplied by the execution time of the program. According
+to the execution time model in (\ref{eq:perf}), the execution time of the
+program is the sum of the computation and the communication times. The
+computation time is linearly related to the frequency scaling factor, while this
+scaling factor does not affect the communication time. The static energy of a
+processor after scaling its frequency is computed as follows:
\begin{equation}
\label{eq:Estatic}
- E_\textit{s} = Ps \cdot (Tcp \cdot S + Tcm)
+ \Es = \Ps \cdot (\Tcp \cdot S + \Tcm)
\end{equation}
-In the considered heterogeneous platform, each processor $i$ might have
-different dynamic and static powers, noted as $Pd_{i}$ and $Ps_{i}$
-respectively. Therefore, even if the distributed message passing iterative
-application is load balanced, the computation time of each CPU $i$ noted
-$Tcp_{i}$ might be different and different frequency scaling factors might be
-computed in order to decrease the overall energy consumption of the application
-and reduce slack times. The communication time of a processor $i$ is noted as
-$Tcm_{i}$ and could contain slack times when communicating with slower
-nodes, see figure(\ref{fig:heter}). Therefore, all nodes do not have equal
-communication times. While the dynamic energy is computed according to the
-frequency scaling factor and the dynamic power of each node as in
-(\ref{eq:Edyn}), the static energy is computed as the sum of the execution time
-of one iteration multiplied by the static power of each processor. The overall
-energy consumption of a message passing distributed application executed over a
-heterogeneous platform during one iteration is the summation of all dynamic and
+In the considered heterogeneous platform, each processor $i$ might have
+different dynamic and static powers, noted as $\Pd[i]$ and $\Ps[i]$
+respectively. Therefore, even if the distributed message passing iterative
+application is load balanced, the computation time of each CPU $i$ noted
+$\Tcp[i]$ might be different and different frequency scaling factors might be
+computed in order to decrease the overall energy consumption of the application
+and reduce slack times. The communication time of a processor $i$ is noted as
+$\Tcm[i]$ and could contain slack times when communicating with slower nodes,
+see Figure~\ref{fig:heter}. Therefore, all nodes do not have equal
+communication times. While the dynamic energy is computed according to the
+frequency scaling factor and the dynamic power of each node as in
+(\ref{eq:Edyn}), the static energy is computed as the sum of the execution time
+of one iteration multiplied by the static power of each processor. The overall
+energy consumption of a message passing distributed application executed over a
+heterogeneous platform during one iteration is the summation of all dynamic and
static energies for each processor. It is computed as follows:
\begin{multline}
\label{eq:energy}
- E = \sum_{i=1}^{N} {(S_i^{-2} \cdot Pd_{i} \cdot Tcp_i)} + {} \\
- \sum_{i=1}^{N} (Ps_{i} \cdot (\max_{i=1,2,\dots,N} (Tcp_i \cdot S_{i}) +
- {MinTcm))}
- \end{multline}
-
-Reducing the frequencies of the processors according to the vector of
-scaling factors $(S_1, S_2,\dots, S_N)$ may degrade the performance of the
-application and thus, increase the static energy because the execution time is
-increased~\cite{Kim_Leakage.Current.Moore.Law}. The overall energy consumption for the iterative
-application can be measured by measuring the energy consumption for one iteration as in (\ref{eq:energy})
-multiplied by the number of iterations of that application.
+ E = \sum_{i=1}^{N} {(S_i^{-2} \cdot \Pd[i] \cdot \Tcp[i])} + {} \\
+ \sum_{i=1}^{N} (\Ps[i] \cdot (\max_{i=1,2,\dots,N} (\Tcp[i] \cdot S_{i}) +
+ {\MinTcm))}
+\end{multline}
+Reducing the frequencies of the processors according to the vector of scaling
+factors $(S_1, S_2,\dots, S_N)$ may degrade the performance of the application
+and thus, increase the static energy because the execution time is
+increased~\cite{Kim_Leakage.Current.Moore.Law}. The overall energy consumption
+for the iterative application can be measured by measuring the energy
+consumption for one iteration as in (\ref{eq:energy}) multiplied by the number
+of iterations of that application.
\section{Optimization of both energy consumption and performance}
\label{sec.compet}
Using the lowest frequency for each processor does not necessarily give the most
-energy efficient execution of an application. Indeed, even though the dynamic
-power is reduced while scaling down the frequency of a processor, its
-computation power is proportionally decreased. Hence, the execution time might
-be drastically increased and during that time, dynamic and static powers are
-being consumed. Therefore, it might cancel any gains achieved by scaling down
-the frequency of all nodes to the minimum and the overall energy consumption of
-the application might not be the optimal one. It is not trivial to select the
-appropriate frequency scaling factor for each processor while considering the
-characteristics of each processor (computation power, range of frequencies,
-dynamic and static powers) and the task executed (computation/communication
-ratio). The aim being to reduce the overall energy consumption and to avoid
-increasing significantly the execution time. In our previous
-work~\cite{Our_first_paper}, we proposed a method that selects the optimal
-frequency scaling factor for a homogeneous cluster executing a message passing
-iterative synchronous application while giving the best trade-off between the
-energy consumption and the performance for such applications. In this work we
-are interested in heterogeneous clusters as described above. Due to the
-heterogeneity of the processors, a vector of scaling factors should
-be selected and it must give the best trade-off between energy consumption and
-performance.
-
-The relation between the energy consumption and the execution time for an
-application is complex and nonlinear, Thus, unlike the relation between the
-execution time and the scaling factor, the relation between the energy and the
-frequency scaling factors is nonlinear, for more details refer
-to~\cite{Freeh_Exploring.the.Energy.Time.Tradeoff}. Moreover, these relations
-are not measured using the same metric. To solve this problem, the execution
-time is normalized by computing the ratio between the new execution time (after
-scaling down the frequencies of some processors) and the initial one (with
+energy efficient execution of an application. Indeed, even though the dynamic
+power is reduced while scaling down the frequency of a processor, its
+computation power is proportionally decreased. Hence, the execution time might
+be drastically increased and during that time, dynamic and static powers are
+being consumed. Therefore, it might cancel any gains achieved by scaling down
+the frequency of all nodes to the minimum and the overall energy consumption of
+the application might not be the optimal one. It is not trivial to select the
+appropriate frequency scaling factor for each processor while considering the
+characteristics of each processor (computation power, range of frequencies,
+dynamic and static powers) and the task executed (computation/communication
+ratio). The aim being to reduce the overall energy consumption and to avoid
+increasing significantly the execution time. In our previous
+work~\cite{Our_first_paper}, we proposed a method that selects the optimal
+frequency scaling factor for a homogeneous cluster executing a message passing
+iterative synchronous application while giving the best trade-off between the
+energy consumption and the performance for such applications. In this work we
+are interested in heterogeneous clusters as described above. Due to the
+heterogeneity of the processors, a vector of scaling factors should be selected
+and it must give the best trade-off between energy consumption and performance.
+
+The relation between the energy consumption and the execution time for an
+application is complex and nonlinear, Thus, unlike the relation between the
+execution time and the scaling factor, the relation between the energy and the
+frequency scaling factors is nonlinear, for more details refer
+to~\cite{Freeh_Exploring.the.Energy.Time.Tradeoff}. Moreover, these relations
+are not measured using the same metric. To solve this problem, the execution
+time is normalized by computing the ratio between the new execution time (after
+scaling down the frequencies of some processors) and the initial one (with
maximum frequency for all nodes) as follows:
\begin{multline}
\label{eq:pnorm}
- P_\textit{Norm} = \frac{T_\textit{New}}{T_\textit{Old}}\\
- {} = \frac{ \max_{i=1,2,\dots,N} (Tcp_{i} \cdot S_{i}) +MinTcm}
- {\max_{i=1,2,\dots,N}{(Tcp_i+Tcm_i)}}
+ \Pnorm = \frac{\Tnew}{\Told}\\
+ {} = \frac{ \max_{i=1,2,\dots,N} (\Tcp[i] \cdot S_{i}) +\MinTcm}
+ {\max_{i=1,2,\dots,N}{(\Tcp[i]+\Tcm[i])}}
\end{multline}
-
-In the same way, the energy is normalized by computing the ratio between the consumed energy
-while scaling down the frequency and the consumed energy with maximum frequency for all nodes:
+In the same way, the energy is normalized by computing the ratio between the
+consumed energy while scaling down the frequency and the consumed energy with
+maximum frequency for all nodes:
\begin{multline}
\label{eq:enorm}
- E_\textit{Norm} = \frac{E_\textit{Reduced}}{E_\textit{Original}} \\
- {} = \frac{ \sum_{i=1}^{N}{(S_i^{-2} \cdot Pd_i \cdot Tcp_i)} +
- \sum_{i=1}^{N} {(Ps_i \cdot T_{New})}}{\sum_{i=1}^{N}{( Pd_i \cdot Tcp_i)} +
- \sum_{i=1}^{N} {(Ps_i \cdot T_{Old})}}
-\end{multline}
-Where $E_\textit{Reduced}$ and $E_\textit{Original}$ are computed using (\ref{eq:energy}) and
- $T_{New}$ and $T_{Old}$ are computed as in (\ref{eq:pnorm}).
-
-While the main
-goal is to optimize the energy and execution time at the same time, the normalized
-energy and execution time curves are not in the same direction. According
-to the equations~(\ref{eq:pnorm}) and (\ref{eq:enorm}), the vector of frequency
-scaling factors $S_1,S_2,\dots,S_N$ reduce both the energy and the execution
-time simultaneously. But the main objective is to produce maximum energy
-reduction with minimum execution time reduction.
-
-This problem can be solved by making the optimization process for energy and
-execution time following the same direction. Therefore, the equation of the
-normalized execution time is inverted which gives the normalized performance equation, as follows:
+ \Enorm = \frac{\Ereduced}{\Eoriginal} \\
+ {} = \frac{ \sum_{i=1}^{N}{(S_i^{-2} \cdot \Pd[i] \cdot \Tcp[i])} +
+ \sum_{i=1}^{N} {(\Ps[i] \cdot \Tnew)}}{\sum_{i=1}^{N}{( \Pd[i] \cdot \Tcp[i])} +
+ \sum_{i=1}^{N} {(\Ps[i] \cdot \Told)}}
+\end{multline}
+Where $\Ereduced$ and $\Eoriginal$ are computed using (\ref{eq:energy}) and
+$\Tnew$ and $\Told$ are computed as in (\ref{eq:pnorm}).
+
+While the main goal is to optimize the energy and execution time at the same
+time, the normalized energy and execution time curves are not in the same
+direction. According to the equations~(\ref{eq:pnorm}) and (\ref{eq:enorm}), the
+vector of frequency scaling factors $S_1,S_2,\dots,S_N$ reduce both the energy
+and the execution time simultaneously. But the main objective is to produce
+maximum energy reduction with minimum execution time reduction.
+
+This problem can be solved by making the optimization process for energy and
+execution time following the same direction. Therefore, the equation of the
+normalized execution time is inverted which gives the normalized performance
+equation, as follows:
\begin{multline}
\label{eq:pnorm_inv}
- P_\textit{Norm} = \frac{T_\textit{Old}}{T_\textit{New}}\\
- = \frac{\max_{i=1,2,\dots,N}{(Tcp_i+Tcm_i)}}
- { \max_{i=1,2,\dots,N} (Tcp_{i} \cdot S_{i}) + MinTcm}
+ \Pnorm = \frac{\Told}{\Tnew}\\
+ = \frac{\max_{i=1,2,\dots,N}{(\Tcp[i]+\Tcm[i])}}
+ { \max_{i=1,2,\dots,N} (\Tcp[i] \cdot S_{i}) + \MinTcm}
\end{multline}
-
-\begin{figure}
+\begin{figure}[!t]
\centering
\subfloat[Homogeneous platform]{%
\includegraphics[width=.33\textwidth]{fig/homo}\label{fig:r1}}%
-
-
+
\subfloat[Heterogeneous platform]{%
\includegraphics[width=.33\textwidth]{fig/heter}\label{fig:r2}}
\label{fig:rel}
\caption{The energy and performance relation}
\end{figure}
-Then, the objective function can be modeled in order to find the maximum distance
-between the energy curve (\ref{eq:enorm}) and the performance
-curve (\ref{eq:pnorm_inv}) over all available sets of scaling factors. This
-represents the minimum energy consumption with minimum execution time (maximum
-performance) at the same time, see figure~(\ref{fig:r1}) or figure~(\ref{fig:r2}). Then the objective
-function has the following form:
+Then, the objective function can be modeled in order to find the maximum
+distance between the energy curve (\ref{eq:enorm}) and the performance curve
+(\ref{eq:pnorm_inv}) over all available sets of scaling factors. This
+represents the minimum energy consumption with minimum execution time (maximum
+performance) at the same time, see Figure~\ref{fig:r1} or
+Figure~\ref{fig:r2}. Then the objective function has the following form:
\begin{equation}
\label{eq:max}
- Max Dist =
- \max_{i=1,\dots F, j=1,\dots,N}
- (\overbrace{P_\textit{Norm}(S_{ij})}^{\text{Maximize}} -
- \overbrace{E_\textit{Norm}(S_{ij})}^{\text{Minimize}} )
+ \MaxDist =
+ \mathop{\max_{i=1,\dots F}}_{j=1,\dots,N}
+ (\overbrace{\Pnorm(S_{ij})}^{\text{Maximize}} -
+ \overbrace{\Enorm(S_{ij})}^{\text{Minimize}} )
\end{equation}
-where $N$ is the number of nodes and $F$ is the number of available frequencies for each node.
-Then, the optimal set of scaling factors that satisfies (\ref{eq:max}) can be selected.
-The objective function can work with any energy model or any power values for each node
-(static and dynamic powers). However, the most important energy reduction gain can be achieved when
-the energy curve has a convex form as shown in~\cite{Zhuo_Energy.efficient.Dynamic.Task.Scheduling,Rauber_Analytical.Modeling.for.Energy,Hao_Learning.based.DVFS}.
+where $N$ is the number of nodes and $F$ is the number of available frequencies
+for each node. Then, the optimal set of scaling factors that satisfies
+(\ref{eq:max}) can be selected. The objective function can work with any energy
+model or any power values for each node (static and dynamic powers). However,
+the most important energy reduction gain can be achieved when the energy curve
+has a convex form as shown
+in~\cite{Zhuo_Energy.efficient.Dynamic.Task.Scheduling,Rauber_Analytical.Modeling.for.Energy,Hao_Learning.based.DVFS}.
\section{The scaling factors selection algorithm for heterogeneous platforms }
\label{sec.optim}
-\subsection{The algorithm details}
-In this section, algorithm \ref{HSA} is presented. It selects the frequency scaling factors
-vector that gives the best trade-off between minimizing the energy consumption and maximizing
-the performance of a message passing synchronous iterative application executed on a heterogeneous
-platform. It works online during the execution time of the iterative message passing program.
-It uses information gathered during the first iteration such as the computation time and the
-communication time in one iteration for each node. The algorithm is executed after the first
-iteration and returns a vector of optimal frequency scaling factors that satisfies the objective
-function (\ref{eq:max}). The program applies DVFS operations to change the frequencies of the CPUs
-according to the computed scaling factors. This algorithm is called just once during the execution
-of the program. Algorithm~(\ref{dvfs}) shows where and when the proposed scaling algorithm is called
-in the iterative MPI program.
-
-The nodes in a heterogeneous platform have different computing powers, thus while executing message
-passing iterative synchronous applications, fast nodes have to wait for the slower ones to finish their
-computations before being able to synchronously communicate with them as in figure (\ref{fig:heter}).
-These periods are called idle or slack times.
-The algorithm takes into account this problem and tries to reduce these slack times when selecting the
-frequency scaling factors vector. At first, it selects initial frequency scaling factors that increase
-the execution times of fast nodes and minimize the differences between the computation times of
-fast and slow nodes. The value of the initial frequency scaling factor for each node is inversely
-proportional to its computation time that was gathered from the first iteration. These initial frequency
-scaling factors are computed as a ratio between the computation time of the slowest node and the
-computation time of the node $i$ as follows:
-\begin{equation}
- \label{eq:Scp}
- Scp_{i} = \frac{\max_{i=1,2,\dots,N}(Tcp_i)}{Tcp_i}
-\end{equation}
-Using the initial frequency scaling factors computed in (\ref{eq:Scp}), the algorithm computes
-the initial frequencies for all nodes as a ratio between the maximum frequency of node $i$
-and the computation scaling factor $Scp_i$ as follows:
-\begin{equation}
- \label{eq:Fint}
- F_{i} = \frac{Fmax_i}{Scp_i},~{i=1,2,\cdots,N}
-\end{equation}
-If the computed initial frequency for a node is not available in the gears of
-that node, it is replaced by the nearest available frequency. In figure
-(\ref{fig:st_freq}), the nodes are sorted by their computing power in ascending
-order and the frequencies of the faster nodes are scaled down according to the
-computed initial frequency scaling factors. The resulting new frequencies are
-colored in blue in figure (\ref{fig:st_freq}). This set of frequencies can be
-considered as a higher bound for the search space of the optimal vector of
-frequencies because selecting frequency scaling factors higher than the higher
-bound will not improve the performance of the application and it will increase
-its overall energy consumption. Therefore the algorithm that selects the
-frequency scaling factors starts the search method from these initial
-frequencies and takes a downward search direction toward lower frequencies. The
-algorithm iterates on all left frequencies, from the higher bound until all
-nodes reach their minimum frequencies, to compute their overall energy
-consumption and performance, and select the optimal frequency scaling factors
-vector. At each iteration the algorithm determines the slowest node according to
-the equation (\ref{eq:perf}) and keeps its frequency unchanged, while it lowers
-the frequency of all other nodes by one gear. The new overall energy
-consumption and execution time are computed according to the new scaling
-factors. The optimal set of frequency scaling factors is the set that gives the
-highest distance according to the objective function (\ref{eq:max}).
-
-Figures~\ref{fig:r1} and \ref{fig:r2} illustrate the normalized performance and
-consumed energy for an application running on a homogeneous platform and a
-heterogeneous platform respectively while increasing the scaling factors. It can
-be noticed that in a homogeneous platform the search for the optimal scaling
-factor should start from the maximum frequency because the performance and the
-consumed energy decrease from the beginning of the plot. On the other hand,
-in the heterogeneous platform the performance is maintained at the beginning of
-the plot even if the frequencies of the faster nodes decrease until the
-computing power of scaled down nodes are lower than the slowest node. In other
-words, until they reach the higher bound. It can also be noticed that the higher
-the difference between the faster nodes and the slower nodes is, the bigger the
-maximum distance between the energy curve and the performance curve is while
- the scaling factors are varying which results in bigger energy savings.
-\begin{figure}[t]
- \centering
- \includegraphics[scale=0.5]{fig/start_freq}
- \caption{Selecting the initial frequencies}
- \label{fig:st_freq}
-\end{figure}
-
-
-
-
\begin{algorithm}
\begin{algorithmic}[1]
% \footnotesize
\Require ~
\begin{description}
- \item[$Tcp_i$] array of all computation times for all nodes during one iteration and with highest frequency.
- \item[$Tcm_i$] array of all communication times for all nodes during one iteration and with highest frequency.
- \item[$Fmax_i$] array of the maximum frequencies for all nodes.
- \item[$Pd_i$] array of the dynamic powers for all nodes.
- \item[$Ps_i$] array of the static powers for all nodes.
- \item[$Fdiff_i$] array of the difference between two successive frequencies for all nodes.
+ \item[{$\Tcp[i]$}] array of all computation times for all nodes during one iteration and with highest frequency.
+ \item[{$\Tcm[i]$}] array of all communication times for all nodes during one iteration and with highest frequency.
+ \item[{$\Fmax[i]$}] array of the maximum frequencies for all nodes.
+ \item[{$\Pd[i]$}] array of the dynamic powers for all nodes.
+ \item[{$\Ps[i]$}] array of the static powers for all nodes.
+ \item[{$\Fdiff[i]$}] array of the difference between two successive frequencies for all nodes.
\end{description}
- \Ensure $Sopt_1,Sopt_2 \dots, Sopt_N$ is a vector of optimal scaling factors
+ \Ensure $\Sopt[1],\Sopt[2] \dots, \Sopt[N]$ is a vector of optimal scaling factors
- \State $ Scp_i \gets \frac{\max_{i=1,2,\dots,N}(Tcp_i)}{Tcp_i} $
- \State $F_{i} \gets \frac{Fmax_i}{Scp_i},~{i=1,2,\cdots,N}$
+ \State $\Scp[i] \gets \frac{\max_{i=1,2,\dots,N}(\Tcp[i])}{\Tcp[i]} $
+ \State $F_{i} \gets \frac{\Fmax[i]}{\Scp[i]},~{i=1,2,\cdots,N}$
\State Round the computed initial frequencies $F_i$ to the closest one available in each node.
\If{(not the first frequency)}
- \State $F_i \gets F_i+Fdiff_i,~i=1,\dots,N.$
- \EndIf
- \State $T_\textit{Old} \gets max_{~i=1,\dots,N } (Tcp_i+Tcm_i)$
- \State $E_\textit{Original} \gets \sum_{i=1}^{N}{( Pd_i \cdot Tcp_i)} +\sum_{i=1}^{N} {(Ps_i \cdot T_{Old})}$
- \State $Sopt_{i} \gets 1,~i=1,\dots,N. $
- \State $Dist \gets 0 $
+ \State $F_i \gets F_i+\Fdiff[i],~i=1,\dots,N.$
+ \EndIf
+ \State $\Told \gets \max_{i=1,\dots,N} (\Tcp[i]+\Tcm[i])$
+ % \State $\Eoriginal \gets \sum_{i=1}^{N}{( \Pd[i] \cdot \Tcp[i])} +\sum_{i=1}^{N} {(\Ps[i] \cdot \Told)}$
+ \State $\Eoriginal \gets \sum_{i=1}^{N}{( \Pd[i] \cdot \Tcp[i] + \Ps[i] \cdot \Told)}$
+ \State $\Sopt[i] \gets 1,~i=1,\dots,N. $
+ \State $\Dist \gets 0 $
\While {(all nodes not reach their minimum frequency)}
\If{(not the last freq. \textbf{and} not the slowest node)}
- \State $F_i \gets F_i - Fdiff_i,~i=1,\dots,N.$
- \State $S_i \gets \frac{Fmax_i}{F_i},~i=1,\dots,N.$
+ \State $F_i \gets F_i - \Fdiff[i],~i=1,\dots,N.$
+ \State $S_i \gets \frac{\Fmax[i]}{F_i},~i=1,\dots,N.$
\EndIf
- \State $T_{New} \gets max_\textit{~i=1,\dots,N} (Tcp_{i} \cdot S_{i}) + MinTcm $
- \State $E_\textit{Reduced} \gets \sum_{i=1}^{N}{(S_i^{-2} \cdot Pd_i \cdot Tcp_i)} + $ \hspace*{43 mm}
- $\sum_{i=1}^{N} {(Ps_i \cdot T_{New})} $
- \State $ P_\textit{Norm} \gets \frac{T_\textit{Old}}{T_\textit{New}}$
- \State $E_\textit{Norm}\gets \frac{E_\textit{Reduced}}{E_\textit{Original}}$
+ \State $\Tnew \gets \max_{i=1,\dots,N} (\Tcp[i] \cdot S_{i}) + \MinTcm $
+% \State $\Ereduced \gets \sum_{i=1}^{N}{(S_i^{-2} \cdot \Pd[i] \cdot \Tcp[i])} + \sum_{i=1}^{N} {(\Ps[i] \cdot \rlap{\Tnew)}} $
+ \State $\Ereduced \gets \sum_{i=1}^{N}{(S_i^{-2} \cdot \Pd[i] \cdot \Tcp[i] + \Ps[i] \cdot \rlap{\Tnew)}} $
+ \State $\Pnorm \gets \frac{\Told}{\Tnew}$
+ \State $\Enorm\gets \frac{\Ereduced}{\Eoriginal}$
\If{$(\Pnorm - \Enorm > \Dist)$}
- \State $Sopt_{i} \gets S_{i},~i=1,\dots,N. $
+ \State $\Sopt[i] \gets S_{i},~i=1,\dots,N. $
\State $\Dist \gets \Pnorm - \Enorm$
\EndIf
\EndWhile
- \State Return $Sopt_1,Sopt_2,\dots,Sopt_N$
+ \State Return $\Sopt[1],\Sopt[2],\dots,\Sopt[N]$
\end{algorithmic}
\caption{frequency scaling factors selection algorithm}
\label{HSA}
\If {$(k=1)$}
\State Gather all times of computation and\newline\hspace*{3em}%
communication from each node.
- \State Call algorithm \ref{HSA}.
+ \State Call Algorithm \ref{HSA}.
\State Compute the new frequencies from the\newline\hspace*{3em}%
returned optimal scaling factors.
\State Set the new frequencies to nodes.
\label{dvfs}
\end{algorithm}
-\subsection{The evaluation of the proposed algorithm}
-\label{sec.verif.algo}
-The precision of the proposed algorithm mainly depends on the execution time
-prediction model defined in (\ref{eq:perf}) and the energy model computed by
-(\ref{eq:energy}). The energy model is also significantly dependent on the
-execution time model because the static energy is linearly related to the
-execution time and the dynamic energy is related to the computation time. So,
-all the works presented in this paper are based on the execution time model. To
-verify this model, the predicted execution time was compared to the real
-execution time over SimGrid/SMPI simulator,
-v3.10~\cite{casanova+giersch+legrand+al.2014.versatile}, for all the NAS
-parallel benchmarks NPB v3.3 \cite{NAS.Parallel.Benchmarks}, running class B on
-8 or 9 nodes. The comparison showed that the proposed execution time model is
-very precise, the maximum normalized difference between the predicted execution
-time and the real execution time is equal to 0.03 for all the NAS benchmarks.
+\subsection{The algorithm details}
-Since the proposed algorithm is not an exact method it does not test all the possible solutions (vectors of scaling factors)
-in the search space. To prove its efficiency, it was compared on small instances to a brute force search algorithm
-that tests all the possible solutions. The brute force algorithm was applied to different NAS benchmarks classes with
-different number of nodes. The solutions returned by the brute force algorithm and the proposed algorithm were identical
-and the proposed algorithm was on average 10 times faster than the brute force algorithm. It has a small execution time:
-for a heterogeneous cluster composed of four different types of nodes having the characteristics presented in
-table~\ref{table:platform}, it takes on average \np[ms]{0.04} for 4 nodes and \np[ms]{0.15} on average for 144 nodes
-to compute the best scaling factors vector. The algorithm complexity is $O(F\cdot (N \cdot4) )$, where $F$ is the number
-of iterations and $N$ is the number of computing nodes. The algorithm needs from 12 to 20 iterations to select the best
-vector of frequency scaling factors that gives the results of the next sections.
+In this section, Algorithm~\ref{HSA} is presented. It selects the frequency
+scaling factors vector that gives the best trade-off between minimizing the
+energy consumption and maximizing the performance of a message passing
+synchronous iterative application executed on a heterogeneous platform. It works
+online during the execution time of the iterative message passing program. It
+uses information gathered during the first iteration such as the computation
+time and the communication time in one iteration for each node. The algorithm is
+executed after the first iteration and returns a vector of optimal frequency
+scaling factors that satisfies the objective function (\ref{eq:max}). The
+program applies DVFS operations to change the frequencies of the CPUs according
+to the computed scaling factors. This algorithm is called just once during the
+execution of the program. Algorithm~\ref{dvfs} shows where and when the proposed
+scaling algorithm is called in the iterative MPI program.
+
+\begin{figure}[!t]
+ \centering
+ \includegraphics[scale=0.5]{fig/start_freq}
+ \caption{Selecting the initial frequencies}
+ \label{fig:st_freq}
+\end{figure}
-\section{Experimental results}
-\label{sec.expe}
-To evaluate the efficiency and the overall energy consumption reduction of
-algorithm~\ref{HSA}, it was applied to the NAS parallel benchmarks NPB v3.3. The
-experiments were executed on the simulator SimGrid/SMPI which offers easy tools
-to create a heterogeneous platform and run message passing applications over it.
-The heterogeneous platform that was used in the experiments, had one core per
-node because just one process was executed per node. The heterogeneous platform
-was composed of four types of nodes. Each type of nodes had different
-characteristics such as the maximum CPU frequency, the number of available
-frequencies and the computational power, see Table \ref{table:platform}. The
-characteristics of these different types of nodes are inspired from the
-specifications of real Intel processors. The heterogeneous platform had up to
-144 nodes and had nodes from the four types in equal proportions, for example if
-a benchmark was executed on 8 nodes, 2 nodes from each type were used. Since the
-constructors of CPUs do not specify the dynamic and the static power of their
-CPUs, for each type of node they were chosen proportionally to its computing
-power (FLOPS). In the initial heterogeneous platform, while computing with
-highest frequency, each node consumed an amount of power proportional to its
-computing power (which corresponds to 80\% of its dynamic power and the
-remaining 20\% to the static power), the same assumption was made in
-\cite{Our_first_paper,Rauber_Analytical.Modeling.for.Energy}. Finally, These
-nodes were connected via an ethernet network with 1 Gbit/s bandwidth.
+The nodes in a heterogeneous platform have different computing powers, thus
+while executing message passing iterative synchronous applications, fast nodes
+have to wait for the slower ones to finish their computations before being able
+to synchronously communicate with them as in Figure~\ref{fig:heter}. These
+periods are called idle or slack times. The algorithm takes into account this
+problem and tries to reduce these slack times when selecting the frequency
+scaling factors vector. At first, it selects initial frequency scaling factors
+that increase the execution times of fast nodes and minimize the differences
+between the computation times of fast and slow nodes. The value of the initial
+frequency scaling factor for each node is inversely proportional to its
+computation time that was gathered from the first iteration. These initial
+frequency scaling factors are computed as a ratio between the computation time
+of the slowest node and the computation time of the node $i$ as follows:
+\begin{equation}
+ \label{eq:Scp}
+ \Scp[i] = \frac{\max_{i=1,2,\dots,N}(\Tcp[i])}{\Tcp[i]}
+\end{equation}
+Using the initial frequency scaling factors computed in (\ref{eq:Scp}), the
+algorithm computes the initial frequencies for all nodes as a ratio between the
+maximum frequency of node $i$ and the computation scaling factor $\Scp[i]$ as
+follows:
+\begin{equation}
+ \label{eq:Fint}
+ F_{i} = \frac{\Fmax[i]}{\Scp[i]},~{i=1,2,\dots,N}
+\end{equation}
+If the computed initial frequency for a node is not available in the gears of
+that node, it is replaced by the nearest available frequency. In
+Figure~\ref{fig:st_freq}, the nodes are sorted by their computing power in
+ascending order and the frequencies of the faster nodes are scaled down
+according to the computed initial frequency scaling factors. The resulting new
+frequencies are highlighted in Figure~\ref{fig:st_freq}. This set of
+frequencies can be considered as a higher bound for the search space of the
+optimal vector of frequencies because selecting frequency scaling factors higher
+than the higher bound will not improve the performance of the application and it
+will increase its overall energy consumption. Therefore the algorithm that
+selects the frequency scaling factors starts the search method from these
+initial frequencies and takes a downward search direction toward lower
+frequencies. The algorithm iterates on all left frequencies, from the higher
+bound until all nodes reach their minimum frequencies, to compute their overall
+energy consumption and performance, and select the optimal frequency scaling
+factors vector. At each iteration the algorithm determines the slowest node
+according to the equation (\ref{eq:perf}) and keeps its frequency unchanged,
+while it lowers the frequency of all other nodes by one gear. The new overall
+energy consumption and execution time are computed according to the new scaling
+factors. The optimal set of frequency scaling factors is the set that gives the
+highest distance according to the objective function (\ref{eq:max}).
+
+Figures~\ref{fig:r1} and \ref{fig:r2} illustrate the normalized performance and
+consumed energy for an application running on a homogeneous platform and a
+heterogeneous platform respectively while increasing the scaling factors. It can
+be noticed that in a homogeneous platform the search for the optimal scaling
+factor should start from the maximum frequency because the performance and the
+consumed energy decrease from the beginning of the plot. On the other hand, in
+the heterogeneous platform the performance is maintained at the beginning of the
+plot even if the frequencies of the faster nodes decrease until the computing
+power of scaled down nodes are lower than the slowest node. In other words,
+until they reach the higher bound. It can also be noticed that the higher the
+difference between the faster nodes and the slower nodes is, the bigger the
+maximum distance between the energy curve and the performance curve is while the
+scaling factors are varying which results in bigger energy savings.
+
+\subsection{The evaluation of the proposed algorithm}
+\label{sec.verif.algo}
+The precision of the proposed algorithm mainly depends on the execution time
+prediction model defined in (\ref{eq:perf}) and the energy model computed by
+(\ref{eq:energy}). The energy model is also significantly dependent on the
+execution time model because the static energy is linearly related to the
+execution time and the dynamic energy is related to the computation time. So,
+all the works presented in this paper are based on the execution time model. To
+verify this model, the predicted execution time was compared to the real
+execution time over SimGrid/SMPI simulator,
+v3.10~\cite{casanova+giersch+legrand+al.2014.versatile}, for all the NAS
+parallel benchmarks NPB v3.3 \cite{NAS.Parallel.Benchmarks}, running class B on
+8 or 9 nodes. The comparison showed that the proposed execution time model is
+very precise, the maximum normalized difference between the predicted execution
+time and the real execution time is equal to 0.03 for all the NAS benchmarks.
-\begin{table}[htb]
+Since the proposed algorithm is not an exact method it does not test all the
+possible solutions (vectors of scaling factors) in the search space. To prove
+its efficiency, it was compared on small instances to a brute force search
+algorithm that tests all the possible solutions. The brute force algorithm was
+applied to different NAS benchmarks classes with different number of nodes. The
+solutions returned by the brute force algorithm and the proposed algorithm were
+identical and the proposed algorithm was on average 10 times faster than the
+brute force algorithm. It has a small execution time: for a heterogeneous
+cluster composed of four different types of nodes having the characteristics
+presented in Table~\ref{table:platform}, it takes on average \np[ms]{0.04} for 4
+nodes and \np[ms]{0.15} on average for 144 nodes to compute the best scaling
+factors vector. The algorithm complexity is $O(F\cdot N)$, where $F$ is the
+number of iterations and $N$ is the number of computing nodes. The algorithm
+needs from 12 to 20 iterations to select the best vector of frequency scaling
+factors that gives the results of the next sections.
+
+\begin{table}[!t]
\caption{Heterogeneous nodes characteristics}
% title of Table
\centering
- \begin{tabular}{|*{7}{l|}}
+ \begin{tabular}{|*{7}{r|}}
\hline
- Node &Simulated & Max & Min & Diff. & Dynamic & Static \\
- type &GFLOPS & Freq. & Freq. & Freq. & power & power \\
- & & GHz & GHz &GHz & & \\
+ Node & Simulated & Max & Min & Diff. & Dynamic & Static \\
+ type & GFLOPS & Freq. & Freq. & Freq. & power & power \\
+ & & GHz & GHz & GHz & & \\
\hline
- 1 &40 & 2.5 & 1.2 & 0.1 & 20~w &4~w \\
-
+ 1 & 40 & 2.50 & 1.20 & 0.100 & \np[W]{20} & \np[W]{4} \\
\hline
- 2 &50 & 2.66 & 1.6 & 0.133 & 25~w &5~w \\
-
+ 2 & 50 & 2.66 & 1.60 & 0.133 & \np[W]{25} & \np[W]{5} \\
\hline
- 3 &60 & 2.9 & 1.2 & 0.1 & 30~w &6~w \\
-
+ 3 & 60 & 2.90 & 1.20 & 0.100 & \np[W]{30} & \np[W]{6} \\
\hline
- 4 &70 & 3.4 & 1.6 & 0.133 & 35~w &7~w \\
-
+ 4 & 70 & 3.40 & 1.60 & 0.133 & \np[W]{35} & \np[W]{7} \\
\hline
\end{tabular}
\label{table:platform}
\end{table}
-
-%\subsection{Performance prediction verification}
+\section{Experimental results}
+\label{sec.expe}
+
+To evaluate the efficiency and the overall energy consumption reduction of
+Algorithm~\ref{HSA}, it was applied to the NAS parallel benchmarks NPB v3.3. The
+experiments were executed on the simulator SimGrid/SMPI which offers easy tools
+to create a heterogeneous platform and run message passing applications over it.
+The heterogeneous platform that was used in the experiments, had one core per
+node because just one process was executed per node. The heterogeneous platform
+was composed of four types of nodes. Each type of nodes had different
+characteristics such as the maximum CPU frequency, the number of available
+frequencies and the computational power, see Table~\ref{table:platform}. The
+characteristics of these different types of nodes are inspired from the
+specifications of real Intel processors. The heterogeneous platform had up to
+144 nodes and had nodes from the four types in equal proportions, for example if
+a benchmark was executed on 8 nodes, 2 nodes from each type were used. Since the
+constructors of CPUs do not specify the dynamic and the static power of their
+CPUs, for each type of node they were chosen proportionally to its computing
+power (FLOPS). In the initial heterogeneous platform, while computing with
+highest frequency, each node consumed an amount of power proportional to its
+computing power (which corresponds to \np[\%]{80} of its dynamic power and the
+remaining \np[\%]{20} to the static power), the same assumption was made in
+\cite{Our_first_paper,Rauber_Analytical.Modeling.for.Energy}. Finally, These
+nodes were connected via an Ethernet network with \np[Gbit/s]{1} bandwidth.
\subsection{The experimental results of the scaling algorithm}
\label{sec.res}
-
The proposed algorithm was applied to the seven parallel NAS benchmarks (EP, CG,
-MG, FT, BT, LU and SP) and the benchmarks were executed with the three classes:
+MG, FT, BT, LU and SP) and the benchmarks were executed with the three classes:
A, B and C. However, due to the lack of space in this paper, only the results of
-the biggest class, C, are presented while being run on different number of
-nodes, ranging from 4 to 128 or 144 nodes depending on the benchmark being
-executed. Indeed, the benchmarks CG, MG, LU, EP and FT had to be executed on $1,
-2, 4, 8, 16, 32, 64, 128$ nodes. The other benchmarks such as BT and SP had to
-be executed on $1, 4, 9, 16, 36, 64, 144$ nodes.
-
-
-
-\begin{table}[htb]
+the biggest class, C, are presented while being run on different number of
+nodes, ranging from 4 to 128 or 144 nodes depending on the benchmark being
+executed. Indeed, the benchmarks CG, MG, LU, EP and FT had to be executed on 1,
+2, 4, 8, 16, 32, 64, or 128 nodes. The other benchmarks such as BT and SP had
+to be executed on 1, 4, 9, 16, 36, 64, or 144 nodes.
+
+\begin{table}[!t]
\caption{Running NAS benchmarks on 4 nodes }
% title of Table
\centering
- \begin{tabular}{|*{7}{l|}}
+ \begin{tabular}{|*{7}{r|}}
\hline
- Program & Execution & Energy & Energy & Performance & Distance \\
- name & time/s & consumption/J & saving\% & degradation\% & \\
+ \hspace{-2.2084pt}%
+ Program & Execution & Energy & Energy & Performance & Distance \\
+ name & time/s & consumption/J & saving\% & degradation\% & \\
\hline
- CG & 64.64 & 3560.39 &34.16 &6.72 &27.44 \\
- \hline
- MG & 18.89 & 1074.87 &35.37 &4.34 &31.03 \\
- \hline
- EP &79.73 &5521.04 &26.83 &3.04 &23.79 \\
- \hline
- LU &308.65 &21126.00 &34.00 &6.16 &27.84 \\
+ CG & 64.64 & 3560.39 & 34.16 & 6.72 & 27.44 \\
+ \hline
+ MG & 18.89 & 1074.87 & 35.37 & 4.34 & 31.03 \\
+ \hline
+ EP & 79.73 & 5521.04 & 26.83 & 3.04 & 23.79 \\
+ \hline
+ LU & 308.65 & 21126.00 & 34.00 & 6.16 & 27.84 \\
+ \hline
+ BT & 360.12 & 21505.55 & 35.36 & 8.49 & 26.87 \\
+ \hline
+ SP & 234.24 & 13572.16 & 35.22 & 5.70 & 29.52 \\
+ \hline
+ FT & 81.58 & 4151.48 & 35.58 & 0.99 & 34.59 \\
\hline
- BT &360.12 &21505.55 &35.36 &8.49 &26.87 \\
- \hline
- SP &234.24 &13572.16 &35.22 &5.70 &29.52 \\
- \hline
- FT &81.58 &4151.48 &35.58 &0.99 &34.59 \\
-\hline
\end{tabular}
\label{table:res_4n}
-\end{table}
+% \end{table}
-\begin{table}[htb]
+ \medskip
+% \begin{table}[!t]
\caption{Running NAS benchmarks on 8 and 9 nodes }
% title of Table
\centering
- \begin{tabular}{|*{7}{l|}}
+ \begin{tabular}{|*{7}{r|}}
\hline
- Program & Execution & Energy & Energy & Performance & Distance \\
- name & time/s & consumption/J & saving\% & degradation\% & \\
+ \hspace{-2.2084pt}%
+ Program & Execution & Energy & Energy & Performance & Distance \\
+ name & time/s & consumption/J & saving\% & degradation\% & \\
\hline
- CG &36.11 &3263.49 &31.25 &7.12 &24.13 \\
- \hline
- MG &8.99 &953.39 &33.78 &6.41 &27.37 \\
- \hline
- EP &40.39 &5652.81 &27.04 &0.49 &26.55 \\
- \hline
- LU &218.79 &36149.77 &28.23 &0.01 &28.22 \\
+ CG & 36.11 & 3263.49 & 31.25 & 7.12 & 24.13 \\
+ \hline
+ MG & 8.99 & 953.39 & 33.78 & 6.41 & 27.37 \\
+ \hline
+ EP & 40.39 & 5652.81 & 27.04 & 0.49 & 26.55 \\
+ \hline
+ LU & 218.79 & 36149.77 & 28.23 & 0.01 & 28.22 \\
+ \hline
+ BT & 166.89 & 23207.42 & 32.32 & 7.89 & 24.43 \\
+ \hline
+ SP & 104.73 & 18414.62 & 24.73 & 2.78 & 21.95 \\
+ \hline
+ FT & 51.10 & 4913.26 & 31.02 & 2.54 & 28.48 \\
\hline
- BT &166.89 &23207.42 &32.32 &7.89 &24.43 \\
- \hline
- SP &104.73 &18414.62 &24.73 &2.78 &21.95 \\
- \hline
- FT &51.10 &4913.26 &31.02 &2.54 &28.48 \\
-\hline
\end{tabular}
\label{table:res_8n}
-\end{table}
+% \end{table}
-\begin{table}[htb]
+ \medskip
+% \begin{table}[!t]
\caption{Running NAS benchmarks on 16 nodes }
% title of Table
\centering
- \begin{tabular}{|*{7}{l|}}
+ \begin{tabular}{|*{7}{r|}}
\hline
- Program & Execution & Energy & Energy & Performance & Distance \\
- name & time/s & consumption/J & saving\% & degradation\% & \\
+ \hspace{-2.2084pt}%
+ Program & Execution & Energy & Energy & Performance & Distance \\
+ name & time/s & consumption/J & saving\% & degradation\% & \\
\hline
- CG &31.74 &4373.90 &26.29 &9.57 &16.72 \\
- \hline
- MG &5.71 &1076.19 &32.49 &6.05 &26.44 \\
- \hline
- EP &20.11 &5638.49 &26.85 &0.56 &26.29 \\
- \hline
- LU &144.13 &42529.06 &28.80 &6.56 &22.24 \\
+ CG & 31.74 & 4373.90 & 26.29 & 9.57 & 16.72 \\
+ \hline
+ MG & 5.71 & 1076.19 & 32.49 & 6.05 & 26.44 \\
+ \hline
+ EP & 20.11 & 5638.49 & 26.85 & 0.56 & 26.29 \\
+ \hline
+ LU & 144.13 & 42529.06 & 28.80 & 6.56 & 22.24 \\
+ \hline
+ BT & 97.29 & 22813.86 & 34.95 & 5.80 & 29.15 \\
+ \hline
+ SP & 66.49 & 20821.67 & 22.49 & 3.82 & 18.67 \\
+ \hline
+ FT & 37.01 & 5505.60 & 31.59 & 6.48 & 25.11 \\
\hline
- BT &97.29 &22813.86 &34.95 &5.80 &29.15 \\
- \hline
- SP &66.49 &20821.67 &22.49 &3.82 &18.67 \\
- \hline
- FT &37.01 &5505.60 &31.59 &6.48 &25.11 \\
-\hline
\end{tabular}
\label{table:res_16n}
-\end{table}
+% \end{table}
-\begin{table}[htb]
+ \medskip
+% \begin{table}[!t]
\caption{Running NAS benchmarks on 32 and 36 nodes }
% title of Table
\centering
- \begin{tabular}{|*{7}{l|}}
+ \begin{tabular}{|*{7}{r|}}
\hline
- Program & Execution & Energy & Energy & Performance & Distance \\
- name & time/s & consumption/J & saving\% & degradation\% & \\
+ \hspace{-2.2084pt}%
+ Program & Execution & Energy & Energy & Performance & Distance \\
+ name & time/s & consumption/J & saving\% & degradation\% & \\
\hline
- CG &32.35 &6704.21 &16.15 &5.30 &10.85 \\
- \hline
- MG &4.30 &1355.58 &28.93 &8.85 &20.08 \\
- \hline
- EP &9.96 &5519.68 &26.98 &0.02 &26.96 \\
- \hline
- LU &99.93 &67463.43 &23.60 &2.45 &21.15 \\
+ CG & 32.35 & 6704.21 & 16.15 & 5.30 & 10.85 \\
+ \hline
+ MG & 4.30 & 1355.58 & 28.93 & 8.85 & 20.08 \\
+ \hline
+ EP & 9.96 & 5519.68 & 26.98 & 0.02 & 26.96 \\
+ \hline
+ LU & 99.93 & 67463.43 & 23.60 & 2.45 & 21.15 \\
+ \hline
+ BT & 48.61 & 23796.97 & 34.62 & 5.83 & 28.79 \\
+ \hline
+ SP & 46.01 & 27007.43 & 22.72 & 3.45 & 19.27 \\
+ \hline
+ FT & 28.06 & 7142.69 & 23.09 & 2.90 & 20.19 \\
\hline
- BT &48.61 &23796.97 &34.62 &5.83 &28.79 \\
- \hline
- SP &46.01 &27007.43 &22.72 &3.45 &19.27 \\
- \hline
- FT &28.06 &7142.69 &23.09 &2.90 &20.19 \\
-\hline
\end{tabular}
\label{table:res_32n}
-\end{table}
+% \end{table}
-\begin{table}[htb]
+ \medskip
+% \begin{table}[!t]
\caption{Running NAS benchmarks on 64 nodes }
% title of Table
\centering
- \begin{tabular}{|*{7}{l|}}
+ \begin{tabular}{|*{7}{r|}}
\hline
- Program & Execution & Energy & Energy & Performance & Distance \\
- name & time/s & consumption/J & saving\% & degradation\% & \\
+ \hspace{-2.2084pt}%
+ Program & Execution & Energy & Energy & Performance & Distance \\
+ name & time/s & consumption/J & saving\% & degradation\% & \\
\hline
- CG &46.65 &17521.83 &8.13 &1.68 &6.45 \\
- \hline
- MG &3.27 &1534.70 &29.27 &14.35 &14.92 \\
- \hline
- EP &5.05 &5471.1084 &27.12 &3.11 &24.01 \\
- \hline
- LU &73.92 &101339.16 &21.96 &3.67 &18.29 \\
+ CG & 46.65 & 17521.83 & 8.13 & 1.68 & 6.45 \\
+ \hline
+ MG & 3.27 & 1534.70 & 29.27 & 14.35 & 14.92 \\
+ \hline
+ EP & 5.05 & 5471.11 & 27.12 & 3.11 & 24.01 \\
+ \hline
+ LU & 73.92 & 101339.16 & 21.96 & 3.67 & 18.29 \\
+ \hline
+ BT & 39.99 & 27166.71 & 32.02 & 12.28 & 19.74 \\
+ \hline
+ SP & 52.00 & 49099.28 & 24.84 & 0.03 & 24.81 \\
+ \hline
+ FT & 25.97 & 10416.82 & 20.15 & 4.87 & 15.28 \\
\hline
- BT &39.99 &27166.71 &32.02 &12.28 &19.74 \\
- \hline
- SP &52.00 &49099.28 &24.84 &0.03 &24.81 \\
- \hline
- FT &25.97 &10416.82 &20.15 &4.87 &15.28 \\
-\hline
\end{tabular}
\label{table:res_64n}
-\end{table}
+% \end{table}
-
-\begin{table}[htb]
+ \medskip
+% \begin{table}[!t]
\caption{Running NAS benchmarks on 128 and 144 nodes }
% title of Table
\centering
- \begin{tabular}{|*{7}{l|}}
+ \begin{tabular}{|*{7}{r|}}
\hline
- Program & Execution & Energy & Energy & Performance & Distance \\
- name & time/s & consumption/J & saving\% & degradation\% & \\
+ \hspace{-2.2084pt}%
+ Program & Execution & Energy & Energy & Performance & Distance \\
+ name & time/s & consumption/J & saving\% & degradation\% & \\
\hline
- CG &56.92 &41163.36 &4.00 &1.10 &2.90 \\
- \hline
- MG &3.55 &2843.33 &18.77 &10.38 &8.39 \\
- \hline
- EP &2.67 &5669.66 &27.09 &0.03 &27.06 \\
- \hline
- LU &51.23 &144471.90 &16.67 &2.36 &14.31 \\
+ CG & 56.92 & 41163.36 & 4.00 & 1.10 & 2.90 \\
+ \hline
+ MG & 3.55 & 2843.33 & 18.77 & 10.38 & 8.39 \\
+ \hline
+ EP & 2.67 & 5669.66 & 27.09 & 0.03 & 27.06 \\
+ \hline
+ LU & 51.23 & 144471.90 & 16.67 & 2.36 & 14.31 \\
+ \hline
+ BT & 37.96 & 44243.82 & 23.18 & 1.28 & 21.90 \\
+ \hline
+ SP & 64.53 & 115409.71 & 26.72 & 0.05 & 26.67 \\
+ \hline
+ FT & 25.51 & 18808.72 & 12.85 & 2.84 & 10.01 \\
\hline
- BT &37.96 &44243.82 &23.18 &1.28 &21.90 \\
- \hline
- SP &64.53 &115409.71 &26.72 &0.05 &26.67 \\
- \hline
- FT &25.51 &18808.72 &12.85 &2.84 &10.01 \\
-\hline
\end{tabular}
\label{table:res_128n}
\end{table}
-The overall energy consumption was computed for each instance according to the
-energy consumption model (\ref{eq:energy}), with and without applying the
-algorithm. The execution time was also measured for all these experiments. Then,
-the energy saving and performance degradation percentages were computed for each
-instance. The results are presented in Tables (\ref{table:res_4n},
-\ref{table:res_8n}, \ref{table:res_16n}, \ref{table:res_32n},
-\ref{table:res_64n} and \ref{table:res_128n}). All these results are the average
-values from many experiments for energy savings and performance degradation.
-The tables show the experimental results for running the NAS parallel benchmarks
-on different number of nodes. The experiments show that the algorithm
-significantly reduces the energy consumption (up to 35\%) and tries to limit the
-performance degradation. They also show that the energy saving percentage
-decreases when the number of computing nodes increases. This reduction is due
-to the increase of the communication times compared to the execution times when
-the benchmarks are run over a high number of nodes. Indeed, the benchmarks with
-the same class, C, are executed on different numbers of nodes, so the
-computation required for each iteration is divided by the number of computing
-nodes. On the other hand, more communications are required when increasing the
-number of nodes so the static energy increases linearly according to the
-communication time and the dynamic power is less relevant in the overall energy
-consumption. Therefore, reducing the frequency with algorithm~(\ref{HSA}) is
-less effective in reducing the overall energy savings. It can also be noticed
-that for the benchmarks EP and SP that contain little or no communications, the
-energy savings are not significantly affected by the high number of nodes. No
-experiments were conducted using bigger classes than D, because they require a
-lot of memory (more than 64GB) when being executed by the simulator on one
-machine. The maximum distance between the normalized energy curve and the
-normalized performance for each instance is also shown in the result tables. It
-decrease in the same way as the energy saving percentage. The tables also show
-that the performance degradation percentage is not significantly increased when
-the number of computing nodes is increased because the computation times are
-small when compared to the communication times.
-
-
-
-\begin{figure}
+
+\begin{figure}[!t]
\centering
- \subfloat[Energy saving]{%
+ \subfloat[Energy saving (\%)]{%
\includegraphics[width=.33\textwidth]{fig/energy}\label{fig:energy}}%
-
- \subfloat[Performance degradation ]{%
+
+ \subfloat[Performance degradation (\%)]{%
\includegraphics[width=.33\textwidth]{fig/per_deg}\label{fig:per_deg}}
\label{fig:avg}
\caption{The energy and performance for all NAS benchmarks running with a different number of nodes}
\end{figure}
-Figures \ref{fig:energy} and \ref{fig:per_deg} present the energy saving and
-performance degradation respectively for all the benchmarks according to the
-number of used nodes. As shown in the first plot, the energy saving percentages
-of the benchmarks MG, LU, BT and FT decrease linearly when the number of nodes
-increase. While for the EP and SP benchmarks, the energy saving percentage is
-not affected by the increase of the number of computing nodes, because in these
-benchmarks there are little or no communications. Finally, the energy saving of
-the GC benchmark significantly decrease when the number of nodes increase
-because this benchmark has more communications than the others. The second plot
-shows that the performance degradation percentages of most of the benchmarks
-decrease when they run on a big number of nodes because they spend more time
-communicating than computing, thus, scaling down the frequencies of some nodes
+The overall energy consumption was computed for each instance according to the
+energy consumption model (\ref{eq:energy}), with and without applying the
+algorithm. The execution time was also measured for all these experiments. Then,
+the energy saving and performance degradation percentages were computed for each
+instance. The results are presented in Tables~\ref{table:res_4n},
+\ref{table:res_8n}, \ref{table:res_16n}, \ref{table:res_32n},
+\ref{table:res_64n} and \ref{table:res_128n}. All these results are the average
+values from many experiments for energy savings and performance degradation.
+The tables show the experimental results for running the NAS parallel benchmarks
+on different number of nodes. The experiments show that the algorithm
+significantly reduces the energy consumption (up to \np[\%]{35}) and tries to
+limit the performance degradation. They also show that the energy saving
+percentage decreases when the number of computing nodes increases. This
+reduction is due to the increase of the communication times compared to the
+execution times when the benchmarks are run over a higher number of nodes.
+Indeed, the benchmarks with the same class, C, are executed on different numbers
+of nodes, so the computation required for each iteration is divided by the
+number of computing nodes. On the other hand, more communications are required
+when increasing the number of nodes so the static energy increases linearly
+according to the communication time and the dynamic power is less relevant in
+the overall energy consumption. Therefore, reducing the frequency with
+Algorithm~\ref{HSA} is less effective in reducing the overall energy savings. It
+can also be noticed that for the benchmarks EP and SP that contain little or no
+communications, the energy savings are not significantly affected by the high
+number of nodes. No experiments were conducted using bigger classes than D,
+because they require a lot of memory (more than \np[GB]{64}) when being executed
+by the simulator on one machine. The maximum distance between the normalized
+energy curve and the normalized performance for each instance is also shown in
+the result tables. It decrease in the same way as the energy saving percentage.
+The tables also show that the performance degradation percentage is not
+significantly increased when the number of computing nodes is increased because
+the computation times are small when compared to the communication times.
+
+Figures~\ref{fig:energy} and \ref{fig:per_deg} present the energy saving and
+performance degradation respectively for all the benchmarks according to the
+number of used nodes. As shown in the first plot, the energy saving percentages
+of the benchmarks MG, LU, BT and FT decrease linearly when the number of nodes
+increase. While for the EP and SP benchmarks, the energy saving percentage is
+not affected by the increase of the number of computing nodes, because in these
+benchmarks there are little or no communications. Finally, the energy saving of
+the GC benchmark significantly decrease when the number of nodes increase
+because this benchmark has more communications than the others. The second plot
+shows that the performance degradation percentages of most of the benchmarks
+decrease when they run on a big number of nodes because they spend more time
+communicating than computing, thus, scaling down the frequencies of some nodes
has less effect on the performance.
-
-
-
\subsection{The results for different power consumption scenarios}
\label{sec.compare}
-The results of the previous section were obtained while using processors that
-consume during computation an overall power which is 80\% composed of dynamic
-power and of 20\% of static power. In this section, these ratios are changed and
-two new power scenarios are considered in order to evaluate how the proposed
-algorithm adapts itself according to the static and dynamic power values. The
-two new power scenarios are the following:
+
+The results of the previous section were obtained while using processors that
+consume during computation an overall power which is \np[\%]{80} composed of
+dynamic power and of \np[\%]{20} of static power. In this section, these ratios
+are changed and two new power scenarios are considered in order to evaluate how
+the proposed algorithm adapts itself according to the static and dynamic power
+values. The two new power scenarios are the following:
\begin{itemize}
-\item 70\% of dynamic power and 30\% of static power
-\item 90\% of dynamic power and 10\% of static power
+\item \np[\%]{70} of dynamic power and \np[\%]{30} of static power
+\item \np[\%]{90} of dynamic power and \np[\%]{10} of static power
\end{itemize}
-The NAS parallel benchmarks were executed again over processors that follow the
-new power scenarios. The class C of each benchmark was run over 8 or 9 nodes
-and the results are presented in Tables \ref{table:res_s1} and
-\ref{table:res_s2}. These tables show that the energy saving percentage of the
-70\%-30\% scenario is smaller for all benchmarks compared to the energy saving
-of the 90\%-10\% scenario. Indeed, in the latter more dynamic power is consumed
-when nodes are running on their maximum frequencies, thus, scaling down the
-frequency of the nodes results in higher energy savings than in the 70\%-30\%
-scenario. On the other hand, the performance degradation percentage is smaller
-in the 70\%-30\% scenario compared to the 90\%-10\% scenario. This is due to the
-higher static power percentage in the first scenario which makes it more
-relevant in the overall consumed energy. Indeed, the static energy is related
-to the execution time and if the performance is degraded the amount of consumed
-static energy directly increas. Therefore, the proposed algorithm does not
-really significantly scale down much the frequencies of the nodes in order to
-limit the increase of the execution time and thus limiting the effect of the
+The NAS parallel benchmarks were executed again over processors that follow the
+new power scenarios. The class C of each benchmark was run over 8 or 9 nodes
+and the results are presented in Tables~\ref{table:res_s1} and
+\ref{table:res_s2}. These tables show that the energy saving percentage of the
+\np[\%]{70}-\np[\%]{30} scenario is smaller for all benchmarks compared to the
+energy saving of the \np[\%]{90}-\np[\%]{10} scenario. Indeed, in the latter
+more dynamic power is consumed when nodes are running on their maximum
+frequencies, thus, scaling down the frequency of the nodes results in higher
+energy savings than in the \np[\%]{70}-\np[\%]{30} scenario. On the other hand,
+the performance degradation percentage is smaller in the \np[\%]{70}-\np[\%]{30}
+scenario compared to the \np[\%]{90}-\np[\%]{10} scenario. This is due to the
+higher static power percentage in the first scenario which makes it more
+relevant in the overall consumed energy. Indeed, the static energy is related
+to the execution time and if the performance is degraded the amount of consumed
+static energy directly increases. Therefore, the proposed algorithm does not
+really significantly scale down much the frequencies of the nodes in order to
+limit the increase of the execution time and thus limiting the effect of the
consumed static energy.
-Both new power scenarios are compared to the old one in figure
-(\ref{fig:sen_comp}). It shows the average of the performance degradation, the
-energy saving and the distances for all NAS benchmarks of class C running on 8
-or 9 nodes. The comparison shows that the energy saving ratio is proportional
-to the dynamic power ratio: it is increased when applying the 90\%-10\% scenario
-because at maximum frequency the dynamic energy is the most relevant in the
-overall consumed energy and can be reduced by lowering the frequency of some
-processors. On the other hand, the energy saving decreases when the 70\%-30\%
-scenario is used because the dynamic energy is less relevant in the overall
-consumed energy and lowering the frequency does not return big energy savings.
-Moreover, the average of the performance degradation is decreased when using a
-higher ratio for static power (e.g. 70\%-30\% scenario and 80\%-20\%
-scenario). Since the proposed algorithm optimizes the energy consumption when
-using a higher ratio for dynamic power the algorithm selects bigger frequency
-scaling factors that result in more energy saving but less performance, for
-example see Figure (\ref{fig:scales_comp}). The opposite happens when using a
-higher ratio for static power, the algorithm proportionally selects smaller
-scaling values which result in less energy saving but also less performance
+Both new power scenarios are compared to the old one in
+Figure~\ref{fig:sen_comp}. It shows the average of the performance degradation,
+the energy saving and the distances for all NAS benchmarks of class C running on
+8 or 9 nodes. The comparison shows that the energy saving ratio is proportional
+to the dynamic power ratio: it is increased when applying the
+\np[\%]{90}-\np[\%]{10} scenario because at maximum frequency the dynamic energy
+is the most relevant in the overall consumed energy and can be reduced by
+lowering the frequency of some processors. On the other hand, the energy saving
+decreases when the \np[\%]{70}-\np[\%]{30} scenario is used because the dynamic
+energy is less relevant in the overall consumed energy and lowering the
+frequency does not return big energy savings. Moreover, the average of the
+performance degradation is decreased when using a higher ratio for static power
+(e.g. \np[\%]{70}-\np[\%]{30} scenario and \np[\%]{80}-\np[\%]{20}
+scenario). Since the proposed algorithm optimizes the energy consumption when
+using a higher ratio for dynamic power the algorithm selects bigger frequency
+scaling factors that result in more energy saving but less performance, for
+example see Figure~\ref{fig:scales_comp}. The opposite happens when using a
+higher ratio for static power, the algorithm proportionally selects smaller
+scaling values which result in less energy saving but also less performance
degradation.
-
- \begin{table}[htb]
- \caption{The results of the 70\%-30\% power scenario}
+\begin{table}[!t]
+ \caption{The results of the \np[\%]{70}-\np[\%]{30} power scenario}
% title of Table
\centering
- \begin{tabular}{|*{6}{l|}}
+ \begin{tabular}{|*{6}{r|}}
\hline
- Program & Energy & Energy & Performance & Distance \\
- name & consumption/J & saving\% & degradation\% & \\
+ Program & Energy & Energy & Performance & Distance \\
+ name & consumption/J & saving\% & degradation\% & \\
\hline
- CG &4144.21 &22.42 &7.72 &14.70 \\
- \hline
- MG &1133.23 &24.50 &5.34 &19.16 \\
+ CG & 4144.21 & 22.42 & 7.72 & 14.70 \\
+ \hline
+ MG & 1133.23 & 24.50 & 5.34 & 19.16 \\
\hline
- EP &6170.30 &16.19 &0.02 &16.17 \\
+ EP & 6170.30 & 16.19 & 0.02 & 16.17 \\
\hline
- LU &39477.28 &20.43 &0.07 &20.36 \\
+ LU & 39477.28 & 20.43 & 0.07 & 20.36 \\
\hline
- BT &26169.55 &25.34 &6.62 &18.71 \\
+ BT & 26169.55 & 25.34 & 6.62 & 18.71 \\
\hline
- SP &19620.09 &19.32 &3.66 &15.66 \\
+ SP & 19620.09 & 19.32 & 3.66 & 15.66 \\
\hline
- FT &6094.07 &23.17 &0.36 &22.81 \\
-\hline
+ FT & 6094.07 & 23.17 & 0.36 & 22.81 \\
+ \hline
\end{tabular}
\label{table:res_s1}
\end{table}
-
-
-\begin{table}[htb]
- \caption{The results of the 90\%-10\% power scenario}
+\begin{table}[!t]
+ \caption{The results of the \np[\%]{90}-\np[\%]{10} power scenario}
% title of Table
\centering
- \begin{tabular}{|*{6}{l|}}
+ \begin{tabular}{|*{6}{r|}}
\hline
- Program & Energy & Energy & Performance & Distance \\
- name & consumption/J & saving\% & degradation\% & \\
+ Program & Energy & Energy & Performance & Distance \\
+ name & consumption/J & saving\% & degradation\% & \\
\hline
- CG &2812.38 &36.36 &6.80 &29.56 \\
- \hline
- MG &825.427 &38.35 &6.41 &31.94 \\
- \hline
- EP &5281.62 &35.02 &2.68 &32.34 \\
- \hline
- LU &31611.28 &39.15 &3.51 &35.64 \\
+ CG & 2812.38 & 36.36 & 6.80 & 29.56 \\
+ \hline
+ MG & 825.43 & 38.35 & 6.41 & 31.94 \\
+ \hline
+ EP & 5281.62 & 35.02 & 2.68 & 32.34 \\
+ \hline
+ LU & 31611.28 & 39.15 & 3.51 & 35.64 \\
+ \hline
+ BT & 21296.46 & 36.70 & 6.60 & 30.10 \\
+ \hline
+ SP & 15183.42 & 35.19 & 11.76 & 23.43 \\
+ \hline
+ FT & 3856.54 & 40.80 & 5.67 & 35.13 \\
\hline
- BT &21296.46 &36.70 &6.60 &30.10 \\
- \hline
- SP &15183.42 &35.19 &11.76 &23.43 \\
- \hline
- FT &3856.54 &40.80 &5.67 &35.13 \\
-\hline
\end{tabular}
\label{table:res_s2}
\end{table}
+\begin{table}[!t]
+ \caption{Comparing the proposed algorithm}
+ \centering
+ \begin{tabular}{|*{7}{r|}}
+ \hline
+ Program & \multicolumn{2}{c|}{Energy saving \%}
+ & \multicolumn{2}{c|}{Perf. degradation \%}
+ & \multicolumn{2}{c|}{Distance} \\
+ \cline{2-7}
+ name & EDP & MaxDist & EDP & MaxDist & EDP & MaxDist \\
+ \hline
+ CG & 27.58 & 31.25 & 5.82 & 7.12 & 21.76 & 24.13 \\
+ \hline
+ MG & 29.49 & 33.78 & 3.74 & 6.41 & 25.75 & 27.37 \\
+ \hline
+ LU & 19.55 & 28.33 & 0.00 & 0.01 & 19.55 & 28.22 \\
+ \hline
+ EP & 28.40 & 27.04 & 4.29 & 0.49 & 24.11 & 26.55 \\
+ \hline
+ BT & 27.68 & 32.32 & 6.45 & 7.87 & 21.23 & 24.43 \\
+ \hline
+ SP & 20.52 & 24.73 & 5.21 & 2.78 & 15.31 & 21.95 \\
+ \hline
+ FT & 27.03 & 31.02 & 2.75 & 2.54 & 24.28 & 28.48 \\
+ \hline
+ \end{tabular}
+ \label{table:compare_EDP}
+\end{table}
-\begin{figure}
+\begin{figure}[!t]
\centering
\subfloat[Comparison between the results on 8 nodes]{%
\includegraphics[width=.33\textwidth]{fig/sen_comp}\label{fig:sen_comp}}%
\includegraphics[width=.33\textwidth]{fig/three_scenarios}\label{fig:scales_comp}}
\label{fig:comp}
\caption{The comparison of the three power scenarios}
-\end{figure}
-
+\end{figure}
+\begin{figure}[!t]
+ \centering
+ \includegraphics[scale=0.5]{fig/compare_EDP.pdf}
+ \caption{Trade-off comparison for NAS benchmarks class C}
+ \label{fig:compare_EDP}
+\end{figure}
\subsection{The comparison of the proposed scaling algorithm }
\label{sec.compare_EDP}
-In this section, the scaling factors selection algorithm, called MaxDist,
-is compared to Spiliopoulos et al. algorithm \cite{Spiliopoulos_Green.governors.Adaptive.DVFS}, called EDP.
-They developed a green governor that regularly applies an online frequency selecting algorithm to reduce the energy consumed by a multicore architecture without degrading much its performance. The algorithm selects the frequencies that minimize the energy and delay products, $EDP=Enegry*Delay$ using the predicted overall energy consumption and execution time delay for each frequency.
-To fairly compare both algorithms, the same energy and execution time models, equations (\ref{eq:energy}) and (\ref{eq:fnew}), were used for both algorithms to predict the energy consumption and the execution times. Also Spiliopoulos et al. algorithm was adapted to start the search from the
-initial frequencies computed using the equation (\ref{eq:Fint}). The resulting algorithm is an exhaustive search algorithm that minimizes the EDP and has the initial frequencies values as an upper bound.
-
-Both algorithms were applied to the parallel NAS benchmarks to compare their efficiency. Table \ref{table:compare_EDP} presents the results of comparing the execution times and the energy consumptions for both versions of the NAS benchmarks while running the class C of each benchmark over 8 or 9 heterogeneous nodes. The results show that our algorithm provides better energy savings than Spiliopoulos et al. algorithm,
-on average it results in 29.76\% energy saving while their algorithm returns just 25.75\%. The average of performance degradation percentage is approximately the same for both algorithms, about 4\%.
+In this section, the scaling factors selection algorithm, called MaxDist, is
+compared to Spiliopoulos et al. algorithm
+\cite{Spiliopoulos_Green.governors.Adaptive.DVFS}, called EDP. They developed a
+green governor that regularly applies an online frequency selecting algorithm to
+reduce the energy consumed by a multicore architecture without degrading much
+its performance. The algorithm selects the frequencies that minimize the energy
+and delay products, $\mathit{EDP}=\mathit{energy}\times \mathit{delay}$ using
+the predicted overall energy consumption and execution time delay for each
+frequency. To fairly compare both algorithms, the same energy and execution
+time models, equations (\ref{eq:energy}) and (\ref{eq:fnew}), were used for both
+algorithms to predict the energy consumption and the execution times. Also
+Spiliopoulos et al. algorithm was adapted to start the search from the initial
+frequencies computed using the equation (\ref{eq:Fint}). The resulting algorithm
+is an exhaustive search algorithm that minimizes the EDP and has the initial
+frequencies values as an upper bound.
+
+Both algorithms were applied to the parallel NAS benchmarks to compare their
+efficiency. Table~\ref{table:compare_EDP} presents the results of comparing the
+execution times and the energy consumption for both versions of the NAS
+benchmarks while running the class C of each benchmark over 8 or 9 heterogeneous
+nodes. The results show that our algorithm provides better energy savings than
+Spiliopoulos et al. algorithm, on average it results in \np[\%]{29.76} energy
+saving while their algorithm returns just \np[\%]{25.75}. The average of
+performance degradation percentage is approximately the same for both
+algorithms, about \np[\%]{4}.
For all benchmarks, our algorithm outperforms Spiliopoulos et al. algorithm in
-terms of energy and performance tradeoff, see figure (\ref{fig:compare_EDP}),
+terms of energy and performance trade-off, see Figure~\ref{fig:compare_EDP},
because it maximizes the distance between the energy saving and the performance
degradation values while giving the same weight for both metrics.
-
-
-
-\begin{table}[h]
- \caption{Comparing the proposed algorithm}
- \centering
-\begin{tabular}{|l|l|l|l|l|l|l|l|}
-\hline
-\multicolumn{2}{|l|}{\multirow{2}{*}{\begin{tabular}[c]{@{}l@{}}Program \\ name\end{tabular}}} & \multicolumn{2}{l|}{Energy saving \%} & \multicolumn{2}{l|}{Perf. degradation \%} & \multicolumn{2}{l|}{Distance} \\ \cline{3-8}
-\multicolumn{2}{|l|}{} & EDP & MaxDist & EDP & MaxDist & EDP & MaxDist \\ \hline
-\multicolumn{2}{|l|}{CG} & 27.58 & 31.25 & 5.82 & 7.12 & 21.76 & 24.13 \\ \hline
-\multicolumn{2}{|l|}{MG} & 29.49 & 33.78 & 3.74 & 6.41 & 25.75 & 27.37 \\ \hline
-\multicolumn{2}{|l|}{LU} & 19.55 & 28.33 & 0.0 & 0.01 & 19.55 & 28.22 \\ \hline
-\multicolumn{2}{|l|}{EP} & 28.40 & 27.04 & 4.29 & 0.49 & 24.11 & 26.55 \\ \hline
-\multicolumn{2}{|l|}{BT} & 27.68 & 32.32 & 6.45 & 7.87 & 21.23 & 24.43 \\ \hline
-\multicolumn{2}{|l|}{SP} & 20.52 & 24.73 & 5.21 & 2.78 & 15.31 & 21.95 \\ \hline
-\multicolumn{2}{|l|}{FT} & 27.03 & 31.02 & 2.75 & 2.54 & 24.28 & 28.48 \\ \hline
-
-\end{tabular}
-\label{table:compare_EDP}
-\end{table}
-
-
-
-
-
-\begin{figure}[t]
- \centering
- \includegraphics[scale=0.5]{fig/compare_EDP.pdf}
- \caption{Tradeoff comparison for NAS benchmarks class C}
- \label{fig:compare_EDP}
-\end{figure}
-
-
\section{Conclusion}
-\label{sec.concl}
+\label{sec.concl}
+
In this paper, a new online frequency selecting algorithm has been presented. It
-selects the best possible vector of frequency scaling factors that gives the
-maximum distance (optimal tradeoff) between the predicted energy and the
+selects the best possible vector of frequency scaling factors that gives the
+maximum distance (optimal trade-off) between the predicted energy and the
predicted performance curves for a heterogeneous platform. This algorithm uses a
-new energy model for measuring and predicting the energy of distributed
-iterative applications running over heterogeneous platforms. To evaluate the
+new energy model for measuring and predicting the energy of distributed
+iterative applications running over heterogeneous platforms. To evaluate the
proposed method, it was applied on the NAS parallel benchmarks and executed over
-a heterogeneous platform simulated by Simgrid. The results of the experiments
-showed that the algorithm reduces up to 35\% the energy consumption of a message
-passing iterative method while limiting the degradation of the performance. The
-algorithm also selects different scaling factors according to the percentage of
-the computing and communication times, and according to the values of the static
-and dynamic powers of the CPUs. Finally, the algorithm was compared to
-Spiliopoulos et al. algorithm and the results showed that it outperforms their
-algorithm in terms of energy-time tradeoff.
+a heterogeneous platform simulated by SimGrid. The results of the experiments
+showed that the algorithm reduces up to \np[\%]{35} the energy consumption of a
+message passing iterative method while limiting the degradation of the
+performance. The algorithm also selects different scaling factors according to
+the percentage of the computing and communication times, and according to the
+values of the static and dynamic powers of the CPUs. Finally, the algorithm was
+compared to Spiliopoulos et al. algorithm and the results showed that it
+outperforms their algorithm in terms of energy-time trade-off.
In the near future, this method will be applied to real heterogeneous platforms
to evaluate its performance in a real study case. It would also be interesting
\section*{Acknowledgment}
This work has been partially supported by the Labex
-ACTION project (contract “ANR-11-LABX-01-01”). As a PhD student,
+ACTION project (contract ``ANR-11-LABX-01-01''). As a PhD student,
Mr. Ahmed Fanfakh, would like to thank the University of
-Babylon (Iraq) for supporting his work.
+Babylon (Iraq) for supporting his work.
% trigger a \newpage just before the given reference
\bibliographystyle{IEEEtran}
\bibliography{IEEEabrv,my_reference}
\end{document}
-
+
%%% Local Variables:
%%% mode: latex
%%% TeX-master: t
%%% End:
% LocalWords: Fanfakh Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber
-% LocalWords: CMOS EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex
-% LocalWords: de badri muslim MPI TcpOld TcmOld dNew dOld cp Sopt Tcp Tcm Ps
-% LocalWords: Scp Fmax Fdiff SimGrid GFlops Xeon EP BT
+% LocalWords: CMOS EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex GPU
+% LocalWords: de badri muslim MPI SimGrid GFlops Xeon EP BT GPUs CPUs AMD
+% LocalWords: Spiliopoulos scalability