diff --git a/Heter_paper.tex b/Heter_paper.tex
index 9d743db..a9a1e6d 100644
--- a/Heter_paper.tex
+++ b/Heter_paper.tex
@@ -8,7 +8,6 @@
 \usepackage{algorithm}
 \usepackage{subfig}
 \usepackage{amsmath}
-\usepackage{multirow}
 \usepackage{url}
 \DeclareUrlCommand\email{\urlstyle{same}}

@@ -24,47 +23,55 @@
 \newcommand{\JC}[2][inline]{%
   \todo[color=red!10,#1]{\sffamily\textbf{JC:} #2}\xspace}

-\newcommand{\Xsub}[2]{\ensuremath{#1_\textit{#2}}}
+\newcommand{\Xsub}[2]{{\ensuremath{#1_\mathit{#2}}}}

-\newcommand{\Dist}{\textit{Dist}}
+%% used to put some subscripts lower, and make them more legible
+\newcommand{\fxheight}[1]{\ifx#1\relax\relax\else\rule{0pt}{1.52ex}#1\fi}
+
+\newcommand{\CL}{\Xsub{C}{L}}
+\newcommand{\Dist}{\mathit{Dist}}
+\newcommand{\EdNew}{\Xsub{E}{dNew}}
 \newcommand{\Eind}{\Xsub{E}{ind}}
 \newcommand{\Enorm}{\Xsub{E}{Norm}}
 \newcommand{\Eoriginal}{\Xsub{E}{Original}}
 \newcommand{\Ereduced}{\Xsub{E}{Reduced}}
-\newcommand{\Fdiff}{\Xsub{F}{diff}}
-\newcommand{\Fmax}{\Xsub{F}{max}}
+\newcommand{\Es}{\Xsub{E}{S}}
+\newcommand{\Fdiff}[1][]{\Xsub{F}{diff}_{\!#1}}
+\newcommand{\Fmax}[1][]{\Xsub{F}{max}_{\fxheight{#1}}}
 \newcommand{\Fnew}{\Xsub{F}{new}}
 \newcommand{\Ileak}{\Xsub{I}{leak}}
 \newcommand{\Kdesign}{\Xsub{K}{design}}
-\newcommand{\MaxDist}{\textit{Max Dist}}
+\newcommand{\MaxDist}{\mathit{Max}\Dist}
+\newcommand{\MinTcm}{\mathit{Min}\Tcm}
 \newcommand{\Ntrans}{\Xsub{N}{trans}}
-\newcommand{\Pdyn}{\Xsub{P}{dyn}}
-\newcommand{\PnormInv}{\Xsub{P}{NormInv}}
+\newcommand{\Pd}[1][]{\Xsub{P}{d}_{\fxheight{#1}}}
+\newcommand{\PdNew}{\Xsub{P}{dNew}}
+\newcommand{\PdOld}{\Xsub{P}{dOld}}
 \newcommand{\Pnorm}{\Xsub{P}{Norm}}
-\newcommand{\Tnorm}{\Xsub{T}{Norm}}
-\newcommand{\Pstates}{\Xsub{P}{states}}
-\newcommand{\Pstatic}{\Xsub{P}{static}}
-\newcommand{\Sopt}{\Xsub{S}{opt}}
-\newcommand{\Tcomp}{\Xsub{T}{comp}}
-\newcommand{\TmaxCommOld}{\Xsub{T}{Max Comm Old}}
-\newcommand{\TmaxCompOld}{\Xsub{T}{Max Comp Old}}
-\newcommand{\Tmax}{\Xsub{T}{max}}
+\newcommand{\Ps}[1][]{\Xsub{P}{s}_{\fxheight{#1}}}
+\newcommand{\Scp}[1][]{\Xsub{S}{cp}_{#1}}
+\newcommand{\Sopt}[1][]{\Xsub{S}{opt}_{#1}}
+\newcommand{\Tcm}[1][]{\Xsub{T}{cm}_{\fxheight{#1}}}
+\newcommand{\Tcp}[1][]{\Xsub{T}{cp}_{#1}}
+\newcommand{\TcpOld}[1][]{\Xsub{T}{cpOld}_{#1}}
 \newcommand{\Tnew}{\Xsub{T}{New}}
-\newcommand{\Told}{\Xsub{T}{Old}}
-\begin{document}
+\newcommand{\Told}{\Xsub{T}{Old}}
+
+\begin{document}

-\title{Energy Consumption Reduction for Message Passing Iterative Applications in Heterogeneous Architecture Using DVFS}
-
-\author{%
+\title{Energy Consumption Reduction with DVFS for \\
+  Message Passing Iterative Applications on \\
+  Heterogeneous Architectures}
+
+\author{%
   \IEEEauthorblockN{%
     Jean-Claude Charr, Raphaël Couturier, Ahmed Fanfakh and Arnaud Giersch
-  }
+  }
   \IEEEauthorblockA{%
-    FEMTO-ST Institute\\
-    University of Franche-Comté\\
+    FEMTO-ST Institute, University of Franche-Comté\\
     IUT de Belfort-Montbéliard,
     19 avenue du Maréchal Juin, BP 527, 90016 Belfort cedex, France\\
     % Telephone: \mbox{+33 3 84 58 77 86}, % Raphaël
@@ -76,152 +83,174 @@
 \maketitle

 \begin{abstract}
-Computing platforms are consuming more and more energy due to the increasing
-number of nodes composing them. To minimize the operating costs of these
-platforms many techniques have been used. 
Dynamic voltage and frequency scaling -(DVFS) is one of them. It reduces the frequency of a CPU to lower its energy -consumption. However, lowering the frequency of a CPU might increase the -execution time of an application running on that processor. Therefore, the -frequency that gives the best tradeoff between the energy consumption and the -performance of an application must be selected. - -In this paper, a new online frequencies selecting algorithm for heterogeneous -platforms is presented. It selects the frequency which tries to give the best -tradeoff between energy saving and performance degradation, for each node -computing the message passing iterative application. The algorithm has a small -overhead and works without training or profiling. It uses a new energy model for -message passing iterative applications running on a heterogeneous platform. The -proposed algorithm is evaluated on the Simgrid simulator while running the NAS -parallel benchmarks. The experiments show that it reduces the energy -consumption by up to 35\% while limiting the performance degradation as much as -possible. Finally, the algorithm is compared to an existing method, the -comparison results showing that it outperforms the latter. + Computing platforms are consuming more and more energy due to the increasing + number of nodes composing them. To minimize the operating costs of these + platforms many techniques have been used. Dynamic voltage and frequency + scaling (DVFS) is one of them. It reduces the frequency of a CPU to lower its + energy consumption. However, lowering the frequency of a CPU may increase + the execution time of an application running on that processor. Therefore, + the frequency that gives the best trade-off between the energy consumption and + the performance of an application must be selected. + + In this paper, a new online frequency selecting algorithm for heterogeneous + platforms is presented. It selects the frequencies and tries to give the best + trade-off between energy saving and performance degradation, for each node + computing the message passing iterative application. The algorithm has a small + overhead and works without training or profiling. It uses a new energy model + for message passing iterative applications running on a heterogeneous + platform. The proposed algorithm is evaluated on the SimGrid simulator while + running the NAS parallel benchmarks. The experiments show that it reduces the + energy consumption by up to \np[\%]{35} while limiting the performance + degradation as much as possible. Finally, the algorithm is compared to an + existing method, the comparison results showing that it outperforms the + latter. \end{abstract} \section{Introduction} \label{sec.intro} -The need for more computing power is continually increasing. To partially -satisfy this need, most supercomputers constructors just put more computing -nodes in their platform. The resulting platforms might achieve higher floating -point operations per second (FLOPS), but the energy consumption and the heat -dissipation are also increased. As an example, the Chinese supercomputer -Tianhe-2 had the highest FLOPS in November 2014 according to the Top500 list -\cite{TOP500_Supercomputers_Sites}. However, it was also the most power hungry -platform with its over 3 million cores consuming around 17.8 megawatts. -Moreover, according to the U.S. 
annual energy outlook 2014
-\cite{U.S_Annual.Energy.Outlook.2014}, the price of energy for 1 megawatt-hour
+
+The need for more computing power is continually increasing. To partially
+satisfy this need, most supercomputer constructors just put more computing
+nodes in their platform. The resulting platforms may achieve higher floating
+point operations per second (FLOPS), but the energy consumption and the heat
+dissipation are also increased. As an example, the Chinese supercomputer
+Tianhe-2 had the highest FLOPS in November 2014 according to the Top500 list
+\cite{TOP500_Supercomputers_Sites}. However, it was also the most power-hungry
+platform, with over 3 million cores consuming around 17.8 megawatts.
+Moreover, according to the U.S. Annual Energy Outlook 2014
+\cite{U.S_Annual.Energy.Outlook.2014}, the price of energy for 1 megawatt-hour
 was approximately equal to \$70. Therefore, the price of the energy consumed by
-the Tianhe-2 platform is approximately more than \$10 million each year. The
-computing platforms must be more energy efficient and offer the highest number
-of FLOPS per watt possible, such as the L-CSC from the GSI Helmholtz Center
+the Tianhe-2 platform is more than \$10 million each year. The
+computing platforms must be more energy efficient and offer the highest number
+of FLOPS per watt possible, such as the L-CSC from the GSI Helmholtz Center
 which became the top of the Green500 list in November 2014
 \cite{Green500_List}. This heterogeneous platform executes more than 5 GFLOPS
 per watt while consuming 57.15 kilowatts.

-Besides platform improvements, there are many software and hardware techniques
-to lower the energy consumption of these platforms, such as scheduling, DVFS,
-... DVFS is a widely used process to reduce the energy consumption of a
-processor by lowering its frequency
+Besides platform improvements, there are many software and hardware techniques
+to lower the energy consumption of these platforms, such as scheduling, DVFS,
+\dots{} DVFS is a widely used process to reduce the energy consumption of a
+processor by lowering its frequency
 \cite{Rizvandi_Some.Observations.on.Optimal.Frequency}. However, it also reduces
-the number of FLOPS executed by the processor which might increase the execution
+the number of FLOPS executed by the processor, which may increase the execution
 time of the application running over that processor. Therefore, researchers use
-different optimization strategies to select the frequency that gives the best
-tradeoff between the energy reduction and performance degradation ratio. In
-\cite{Our_first_paper}, a frequency selecting algorithm was proposed to reduce
-the energy consumption of message passing iterative applications running over
-homogeneous platforms. The results of the experiments show significant energy
-consumption reductions. In this paper, a new frequency selecting algorithm
-adapted for heterogeneous platform is presented. It selects the vector of
-frequencies, for a heterogeneous platform running a message passing iterative
+different optimization strategies to select the frequency that gives the best
+trade-off between the energy reduction and performance degradation ratio. In
+\cite{Our_first_paper}, a frequency selecting algorithm was proposed to reduce
+the energy consumption of message passing iterative applications running over
+homogeneous platforms. The results of the experiments show significant energy
+consumption reductions. In this paper, a new frequency selecting algorithm
+adapted to heterogeneous platforms is presented. It selects the vector of
+frequencies, for a heterogeneous platform running a message passing iterative
 application, that simultaneously tries to offer the maximum energy reduction and
-minimum performance degradation ratio. The algorithm has a very small overhead,
+minimum performance degradation ratio. The algorithm has a very small overhead,
 works online and does not need any training or profiling.

 This paper is organized as follows: Section~\ref{sec.relwork} presents some
 related works from other authors.  Section~\ref{sec.exe} describes how the
-execution time of message passing programs can be predicted.  It also presents an energy
-model that predicts the energy consumption of an application running over a heterogeneous platform. Section~\ref{sec.compet} presents
-the energy-performance objective function that maximizes the reduction of energy
+execution time of message passing programs can be predicted.  It also presents
+an energy model that predicts the energy consumption of an application running
+over a heterogeneous platform. Section~\ref{sec.compet} presents the
+energy-performance objective function that maximizes the reduction of energy
 consumption while minimizing the degradation of the program's performance.
-Section~\ref{sec.optim} details the proposed frequency selecting algorithm then the precision of the proposed algorithm is verified.
-Section~\ref{sec.expe} presents the results of applying the algorithm on the NAS parallel benchmarks and executing them
-on a heterogeneous platform. It shows the results of running three
-different power scenarios and comparing them. Moreover, it also shows the comparison results
-between the proposed method and an existing method.
-Finally, in Section~\ref{sec.concl} the paper ends with a summary and some future works.
+Section~\ref{sec.optim} details the proposed frequency selecting algorithm; then
+the precision of the proposed algorithm is verified. Section~\ref{sec.expe}
+presents the results of applying the algorithm on the NAS parallel benchmarks
+and executing them on a heterogeneous platform. It shows the results of running
+three different power scenarios and comparing them. Moreover, it also shows the
+comparison results between the proposed method and an existing method. Finally,
+in Section~\ref{sec.concl} the paper ends with a summary and some directions
+for future work.

 \section{Related works}
 \label{sec.relwork}
+
 DVFS is a technique used in modern processors to scale down both the voltage and
-the frequency of the CPU while computing, in order to reduce the energy
-consumption of the processor. DVFS is also allowed in GPUs to achieve the
-same goal. Reducing the frequency of a processor lowers its number of FLOPS and
-might degrade the performance of the application running on that processor,
-especially if it is compute bound. Therefore selecting the appropriate frequency
-for a processor to satisfy some objectives while taking into account all the
-constraints, is not a trivial operation. Many researchers used different
-strategies to tackle this problem. Some of them developed online methods that
-compute the new frequency while executing the application, such as
-~\cite{Hao_Learning.based.DVFS,Spiliopoulos_Green.governors.Adaptive.DVFS}. Others
-used offline methods that might need to run the application and profile it
-before selecting the new frequency, such as
-~\cite{Rountree_Bounding.energy.consumption.in.MPI,Cochran_Pack_and_Cap_Adaptive_DVFS}. The
-methods could be heuristics, exact or brute force methods that satisfy varied
-objectives such as energy reduction or performance. They also could be adapted
-to the execution's environment and the type of the application such as
-sequential, parallel or distributed architecture, homogeneous or heterogeneous
-platform, synchronous or asynchronous application, ...
-
-In this paper, we are interested in reducing energy for message passing iterative synchronous applications running over heterogeneous platforms.
-Some works have already been done for such platforms and they can be classified into two types of heterogeneous platforms:
+the frequency of the CPU while computing, in order to reduce the energy
+consumption of the processor. DVFS can also be used in GPUs to achieve the same
+goal. Reducing the frequency of a processor lowers its number of FLOPS and may
+degrade the performance of the application running on that processor, especially
+if it is compute bound. Therefore, selecting the appropriate frequency for a
+processor to satisfy some objectives while taking into account all the
+constraints is not a trivial operation. Many researchers have used different
+strategies to tackle this problem. Some of them developed online methods that
+compute the new frequency while executing the application, such
+as~\cite{Hao_Learning.based.DVFS,Spiliopoulos_Green.governors.Adaptive.DVFS}.
+Others used offline methods that may need to run the application and profile
+it before selecting the new frequency, such
+as~\cite{Rountree_Bounding.energy.consumption.in.MPI,Cochran_Pack_and_Cap_Adaptive_DVFS}.
+The methods could be heuristics, exact or brute force methods that satisfy
+varied objectives such as energy reduction or performance. They could also be
+adapted to the execution environment and the type of the application such as
+sequential, parallel or distributed architecture, homogeneous or heterogeneous
+platform, synchronous or asynchronous application, \dots{}
+
+In this paper, we are interested in reducing the energy consumption of message
+passing iterative synchronous applications running over heterogeneous
+platforms, which can be classified into two types:
 \begin{itemize}
 \item the platform is composed of homogeneous GPUs and homogeneous CPUs.
 \item the platform is only composed of heterogeneous CPUs.
 \end{itemize}
-For the first type of platform, the computing intensive parallel tasks are executed on the GPUs and the rest are executed
-on the CPUs. Luley et al.
-~\cite{Luley_Energy.efficiency.evaluation.and.benchmarking}, proposed a heterogeneous
-cluster composed of Intel Xeon CPUs and NVIDIA GPUs. Their main goal was to maximize the
-energy efficiency of the platform during computation by maximizing the number of FLOPS per watt generated.
-In~\cite{KaiMa_Holistic.Approach.to.Energy.Efficiency.in.GPU-CPU}, Kai Ma et al. developed a scheduling
-algorithm that distributes workloads proportional to the computing power of the nodes which could be a GPU or a CPU. All the tasks must be completed at the same time.
-In~\cite{Rong_Effects.of.DVFS.on.K20.GPU}, Rong et al. showed that
-a heterogeneous (GPUs and CPUs) cluster that enables DVFS gave better energy and performance
-efficiency than other clusters only composed of CPUs.
-
-The work presented in this paper concerns the second type of platform, with heterogeneous CPUs.
-Many methods were conceived to reduce the energy consumption of this type of platform. Naveen et al.~\cite{Naveen_Power.Efficient.Resource.Scaling}
-developed a method that minimizes the value of $energy*delay^2$ (the delay is the sum of slack times that happen during synchronous communications) by dynamically assigning new frequencies to the CPUs of the heterogeneous cluster. Lizhe et al.~\cite{Lizhe_Energy.aware.parallel.task.scheduling} proposed
-an algorithm that divides the executed tasks into two types: the critical and
-non critical tasks. The algorithm scales down the frequency of non critical tasks proportionally to their slack and communication times while limiting the performance degradation percentage to less than 10\%. In~\cite{Joshi_Blackbox.prediction.of.impact.of.DVFS}, they developed
- a heterogeneous cluster composed of two types
-of Intel and AMD processors. They use a gradient method to predict the impact of DVFS operations on performance.
-In~\cite{Shelepov_Scheduling.on.Heterogeneous.Multicore} and \cite{Li_Minimizing.Energy.Consumption.for.Frame.Based.Tasks},
- the best frequencies for a specified heterogeneous cluster are selected offline using some
-heuristic. Chen et al.~\cite{Chen_DVFS.under.quality.of.service.requirements} used a greedy dynamic programming approach to
-minimize the power consumption of heterogeneous servers while respecting given time constraints. This approach
-had considerable overhead.
-In contrast to the above described papers, this paper presents the following contributions :
+For the first type of platform, the computing intensive parallel tasks are
+executed on the GPUs and the rest are executed on the CPUs. Luley et
+al.~\cite{Luley_Energy.efficiency.evaluation.and.benchmarking} proposed a
+heterogeneous cluster composed of Intel Xeon CPUs and NVIDIA GPUs. Their main
+goal was to maximize the energy efficiency of the platform during computation by
+maximizing the number of FLOPS per watt generated.
+In~\cite{KaiMa_Holistic.Approach.to.Energy.Efficiency.in.GPU-CPU}, Kai Ma et
+al. developed a scheduling algorithm that distributes workloads proportionally
+to the computing power of the nodes, which could be a GPU or a CPU. All the
+tasks must be completed at the same time. In~\cite{Rong_Effects.of.DVFS.on.K20.GPU},
+Rong et al. showed that a heterogeneous (GPUs and CPUs) cluster that enables
+DVFS gave better energy and performance efficiency than other clusters only
+composed of CPUs.
+
+The work presented in this paper concerns the second type of platform, with
+heterogeneous CPUs. Many methods were conceived to reduce the energy
+consumption of this type of platform. Naveen et
+al.~\cite{Naveen_Power.Efficient.Resource.Scaling} developed a method that
+minimizes the value of $\mathit{energy}\times \mathit{delay}^2$ (the delay is
+the sum of slack times that happen during synchronous communications) by
+dynamically assigning new frequencies to the CPUs of the heterogeneous cluster.
+Lizhe et al.~\cite{Lizhe_Energy.aware.parallel.task.scheduling} proposed an
+algorithm that divides the executed tasks into two types: the critical and non
+critical tasks. The algorithm scales down the frequency of non critical tasks
+proportionally to their slack and communication times while limiting the
+performance degradation percentage to less than \np[\%]{10}.
+In~\cite{Joshi_Blackbox.prediction.of.impact.of.DVFS}, the authors developed a
+heterogeneous cluster composed of two types of Intel and AMD processors. They
+used a gradient method to predict the impact of DVFS operations on performance.
+In~\cite{Shelepov_Scheduling.on.Heterogeneous.Multicore} and
+\cite{Li_Minimizing.Energy.Consumption.for.Frame.Based.Tasks}, the best
+frequencies for a specified heterogeneous cluster are selected offline using
+some heuristic. Chen et
+al.~\cite{Chen_DVFS.under.quality.of.service.requirements} used a greedy dynamic
+programming approach to minimize the power consumption of heterogeneous servers
+while respecting given time constraints. This approach had considerable
+overhead. In contrast to the papers described above, this paper presents the
+following contributions:
 \begin{enumerate}
-\item two new energy and performance models for message passing iterative synchronous applications running over
-  a heterogeneous platform. Both models take into account communication and slack times. The models can predict the required energy and the execution time of the application.
-
-\item a new online frequency selecting algorithm for heterogeneous platforms. The algorithm has a very small
-  overhead and does not need any training or profiling. It uses a new optimization function which simultaneously maximizes the performance and minimizes the energy consumption of a message passing iterative synchronous application.
-
+\item two new energy and performance models for message passing iterative
+  synchronous applications running over a heterogeneous platform. Both models
+  take into account communication and slack times. The models can predict the
+  required energy and the execution time of the application.
+
+\item a new online frequency selecting algorithm for heterogeneous
+  platforms. The algorithm has a very small overhead and does not need any
+  training or profiling. It uses a new optimization function which
+  simultaneously maximizes the performance and minimizes the energy consumption
+  of a message passing iterative synchronous application.
+
 \end{enumerate}

 \section{The performance and energy consumption measurements on heterogeneous architecture}
 \label{sec.exe}

-
-
-\subsection{The execution time of message passing distributed
-  iterative applications on a heterogeneous platform}
+\subsection{The execution time of message passing distributed iterative
+  applications on a heterogeneous platform}

 In this paper, we are interested in reducing the energy consumption of message
 passing distributed iterative synchronous applications running over
@@ -231,44 +260,44 @@ network. Therefore, each node has different characteristics such as computing
 power (FLOPS), energy consumption, CPU's frequency range, \dots{} but they all
 have the same network bandwidth and latency.

-The overall execution time of a distributed iterative synchronous application
-over a heterogeneous platform consists of the sum of the computation time and
-the communication time for every iteration on a node. However, due to the
-heterogeneous computation power of the computing nodes, slack times might occur
-when fast nodes have to wait, during synchronous communications, for the slower
-nodes to finish their computations (see Figure~(\ref{fig:heter})).
-Therefore, the overall execution time of the program is the execution time of the slowest
-task which has the highest computation time and no slack time.
-
- \begin{figure}[t]
+\begin{figure}[!t]
   \centering
-  \includegraphics[scale=0.6]{fig/commtasks}
+  \includegraphics[scale=0.6]{fig/commtasks}
   \caption{Parallel tasks on a heterogeneous platform}
   \label{fig:heter}
 \end{figure}

-Dynamic Voltage and Frequency Scaling (DVFS) is a process, implemented in
-modern processors, that reduces the energy consumption of a CPU by scaling
-down its voltage and frequency. Since DVFS lowers the frequency of a CPU
-and consequently its computing power, the execution time of a program running
-over that scaled down processor might increase, especially if the program is
-compute bound. The frequency reduction process can be expressed by the scaling
-factor S which is the ratio between the maximum and the new frequency of a CPU
+The overall execution time of a distributed iterative synchronous application
+over a heterogeneous platform consists of the sum of the computation time and
+the communication time for every iteration on a node. However, due to the
+heterogeneous computation power of the computing nodes, slack times may occur
+when fast nodes have to wait, during synchronous communications, for the slower
+nodes to finish their computations (see Figure~\ref{fig:heter}). Therefore, the
+overall execution time of the program is the execution time of the slowest task
+which has the highest computation time and no slack time.
+
+Dynamic Voltage and Frequency Scaling (DVFS) is a process, implemented in
+modern processors, that reduces the energy consumption of a CPU by scaling
+down its voltage and frequency. Since DVFS lowers the frequency of a CPU
+and consequently its computing power, the execution time of a program running
+over that scaled-down processor may increase, especially if the program is
+compute bound. The frequency reduction process can be expressed by the scaling
+factor $S$, which is the ratio between the maximum and the new frequency of a CPU
 as in (\ref{eq:s}).
 \begin{equation}
   \label{eq:s}
-  S = \frac{F_\textit{max}}{F_\textit{new}}
+  S = \frac{\Fmax}{\Fnew}
 \end{equation}
- The execution time of a compute bound sequential program is linearly proportional
- to the frequency scaling factor $S$. On the other hand, message passing
- distributed applications consist of two parts: computation and communication.
- The execution time of the computation part is linearly proportional to the
- frequency scaling factor $S$ but the communication time is not affected by the
- scaling factor because the processors involved remain idle during the
- communications~\cite{Freeh_Exploring.the.Energy.Time.Tradeoff}.
- The communication time for a task is the summation of periods of
- time that begin with an MPI call for sending or receiving a message
- until the message is synchronously sent or received.
+The execution time of a compute bound sequential program is linearly
+proportional to the frequency scaling factor $S$. On the other hand, message
+passing distributed applications consist of two parts: computation and
+communication. The execution time of the computation part is linearly
+proportional to the frequency scaling factor $S$, but the communication time is
+not affected by the scaling factor because the processors involved remain idle
+during the communications~\cite{Freeh_Exploring.the.Energy.Time.Tradeoff}.
+The communication time for a task is the summation of periods of time that
+begin with an MPI call for sending or receiving a message until the message is
+synchronously sent or received.

 Since in a heterogeneous platform each node has different characteristics,
 especially different frequency gears, when applying DVFS operations on these
@@ -282,364 +311,288 @@ operation. Then the execution time for one iteration of the application
 with any vector of scaling factors can be predicted using (\ref{eq:perf}).
 \begin{equation}
   \label{eq:perf}
-  \textit  T_\textit{new} =
-  \max_{i=1,2,\dots,N} ({TcpOld_{i}} \cdot S_{i}) +  MinTcm
+  \Tnew = \max_{i=1,2,\dots,N} ({\TcpOld[i]} \cdot S_{i}) + \MinTcm
 \end{equation}
-Where:\\
+where:
 \begin{equation}
-\label{eq:perf2}
- MinTcm = \min_{i=1,2,\dots,N} (Tcm_i)
+  \label{eq:perf2}
+  \MinTcm = \min_{i=1,2,\dots,N} (\Tcm[i])
 \end{equation}
-where $TcpOld_i$ is the computation time of processor $i$ during the first
-iteration and $MinTcm$ is the communication time of the slowest processor from
-the first iteration. The model computes the maximum computation time with
-scaling factor from each node added to the communication time of the slowest
-node. It means only the communication time without any slack time is taken into
+where $\TcpOld[i]$ is the computation time of processor $i$ during the first
+iteration and $\MinTcm$ is the communication time of the slowest processor from
+the first iteration. The model computes the maximum computation time, with the
+scaling factor applied on each node, added to the communication time of the
+slowest node. This means that only the communication time without any slack
+time is taken into
 account. Therefore, the execution time of the iterative application is equal to
-the execution time of one iteration as in (\ref{eq:perf}) multiplied by the
+the execution time of one iteration as in (\ref{eq:perf}) multiplied by the
 number of iterations of that application.

-This prediction model is developed from the model to predict the execution time
-of message passing distributed applications for homogeneous
-architectures~\cite{Our_first_paper}. The execution time prediction model is
-used in the method to optimize both the energy consumption and the performance of
-iterative methods, which is presented in the following sections.
-
+This prediction model is developed from the model to predict the execution time
+of message passing distributed applications for homogeneous
+architectures~\cite{Our_first_paper}. The execution time prediction model is
+used in the method to optimize both the energy consumption and the performance
+of iterative methods, which is presented in the following sections.

 \subsection{Energy model for heterogeneous platform}
+
 Many researchers~\cite{Malkowski_energy.efficient.high.performance.computing,
-Rauber_Analytical.Modeling.for.Energy,Zhuo_Energy.efficient.Dynamic.Task.Scheduling,
-Rizvandi_Some.Observations.on.Optimal.Frequency} divide the power consumed by a processor into
-two power metrics: the static and the dynamic power. While the first one is
-consumed as long as the computing unit is turned on, the latter is only consumed during
-computation times. The dynamic power $Pd$ is related to the switching
-activity $\alpha$, load capacitance $C_L$, the supply voltage $V$ and
-operational frequency $F$, as shown in (\ref{eq:pd}).
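To make the prediction model concrete, here is a minimal C sketch of
(\ref{eq:perf}) and (\ref{eq:perf2}). It is illustrative only: the function and
variable names are ours, not taken from the paper's implementation, and it
assumes the per-node timings of the first iteration have already been gathered.
\begin{verbatim}
#include <stddef.h>

/* Predicted time of one iteration, as in (eq:perf)/(eq:perf2).
   tcp_old[i]: computation time of node i at maximum frequency,
   tcm[i]:     communication time of node i (first iteration),
   s[i]:       candidate frequency scaling factor of node i.   */
double predict_tnew(const double *tcp_old, const double *tcm,
                    const double *s, size_t n)
{
    double max_comp = 0.0;      /* max_i (TcpOld_i * S_i)   */
    double min_tcm  = tcm[0];   /* MinTcm = min_i (Tcm_i)   */
    for (size_t i = 0; i < n; i++) {
        double comp = tcp_old[i] * s[i];
        if (comp > max_comp) max_comp = comp;
        if (tcm[i] < min_tcm) min_tcm = tcm[i];
    }
    return max_comp + min_tcm;  /* multiply by the number of
                                   iterations for the whole run */
}
\end{verbatim}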
+  Rauber_Analytical.Modeling.for.Energy,Zhuo_Energy.efficient.Dynamic.Task.Scheduling,
+  Rizvandi_Some.Observations.on.Optimal.Frequency} divide the power consumed by
+a processor into two power metrics: the static and the dynamic power. While the
+first one is consumed as long as the computing unit is turned on, the latter is
+only consumed during computation times. The dynamic power $\Pd$ is related to
+the switching activity $\alpha$, load capacitance $\CL$, the supply voltage $V$
+and operational frequency $F$, as shown in (\ref{eq:pd}).
 \begin{equation}
   \label{eq:pd}
-  Pd = \alpha \cdot C_L \cdot V^2 \cdot F
+  \Pd = \alpha \cdot \CL \cdot V^2 \cdot F
 \end{equation}
-The static power $Ps$ captures the leakage power as follows:
+The static power $\Ps$ captures the leakage power as follows:
 \begin{equation}
   \label{eq:ps}
-  Ps = V \cdot N_{trans} \cdot K_{design} \cdot I_{leak}
+  \Ps = V \cdot \Ntrans \cdot \Kdesign \cdot \Ileak
 \end{equation}
-where V is the supply voltage, $N_{trans}$ is the number of transistors,
-$K_{design}$ is a design dependent parameter and $I_{leak}$ is a
+where $V$ is the supply voltage, $\Ntrans$ is the number of transistors,
+$\Kdesign$ is a design-dependent parameter and $\Ileak$ is a
 technology-dependent parameter. The energy consumed by an individual processor
 to execute a given program can be computed as:
 \begin{equation}
   \label{eq:eind}
-  E_\textit{ind} =  Pd \cdot Tcp + Ps \cdot T
+  \Eind =  \Pd \cdot \Tcp + \Ps \cdot T
 \end{equation}
-where $T$ is the execution time of the program, $Tcp$ is the computation
-time and $Tcp \le T$. $Tcp$ may be equal to $T$ if there is no
+where $T$ is the execution time of the program, $\Tcp$ is the computation
+time and $\Tcp \le T$. $\Tcp$ may be equal to $T$ if there is no
 communication and no slack time.

-The main objective of DVFS operation is to reduce the overall energy consumption~\cite{Le_DVFS.Laws.of.Diminishing.Returns}.
-The operational frequency $F$ depends linearly on the supply voltage $V$, i.e., $V = \beta \cdot F$ with some
-constant $\beta$.~This equation is used to study the change of the dynamic
-voltage with respect to various frequency values in~\cite{Rauber_Analytical.Modeling.for.Energy}. The reduction
-process of the frequency can be expressed by the scaling factor $S$ which is the
-ratio between the maximum and the new frequency as in (\ref{eq:s}).
-The CPU governors are power schemes supplied by the operating
-system's kernel to lower a core's frequency.  The new frequency
-$F_{new}$ from (\ref{eq:s}) can be calculated as follows:
+The main objective of a DVFS operation is to reduce the overall energy
+consumption~\cite{Le_DVFS.Laws.of.Diminishing.Returns}. The operational
+frequency $F$ depends linearly on the supply voltage $V$, i.e., $V = \beta \cdot
+F$ with some constant $\beta$. This equation is used to study the change of the
+dynamic voltage with respect to various frequency values
+in~\cite{Rauber_Analytical.Modeling.for.Energy}. The reduction process of the
+frequency can be expressed by the scaling factor $S$, which is the ratio between
+the maximum and the new frequency as in (\ref{eq:s}). The CPU governors are
+power schemes supplied by the operating system's kernel to lower a core's
+frequency.
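For illustration, the three formulas above translate directly into code. The
following C sketch is hedged: all arguments are model parameters with
hardware-dependent values, and none of the names come from the paper.
\begin{verbatim}
/* Illustrative evaluation of (eq:pd), (eq:ps) and (eq:eind). */
double dynamic_power(double alpha, double cl, double v, double f)
{
    return alpha * cl * v * v * f;       /* Pd = alpha.CL.V^2.F */
}

double static_power(double v, double ntrans, double kdesign,
                    double ileak)
{
    return v * ntrans * kdesign * ileak; /* Ps = V.Ntrans.Kdesign.Ileak */
}

double individual_energy(double pd, double ps, double tcp, double t)
{
    /* Eind = Pd.Tcp + Ps.T, with Tcp <= T */
    return pd * tcp + ps * t;
}
\end{verbatim}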
The new frequency $\Fnew$ from (\ref{eq:s}) can be calculated as +follows: \begin{equation} \label{eq:fnew} - F_\textit{new} = S^{-1} \cdot F_\textit{max} + \Fnew = S^{-1} \cdot \Fmax \end{equation} -Replacing $F_{new}$ in (\ref{eq:pd}) as in (\ref{eq:fnew}) gives the following +Replacing $\Fnew$ in (\ref{eq:pd}) as in (\ref{eq:fnew}) gives the following equation for dynamic power consumption: \begin{multline} \label{eq:pdnew} - {P}_\textit{dNew} = \alpha \cdot C_L \cdot V^2 \cdot F_{new} = \alpha \cdot C_L \cdot \beta^2 \cdot F_{new}^3 \\ - {} = \alpha \cdot C_L \cdot V^2 \cdot F_{max} \cdot S^{-3} = P_{dOld} \cdot S^{-3} + \PdNew = \alpha \cdot \CL \cdot V^2 \cdot \Fnew = \alpha \cdot \CL \cdot \beta^2 \cdot \Fnew^3 \\ + {} = \alpha \cdot \CL \cdot V^2 \cdot \Fmax \cdot S^{-3} = \PdOld \cdot S^{-3} \end{multline} -where $ {P}_\textit{dNew}$ and $P_{dOld}$ are the dynamic power consumed with the +where $\PdNew$ and $\PdOld$ are the dynamic power consumed with the new frequency and the maximum frequency respectively. -According to (\ref{eq:pdnew}) the dynamic power is reduced by a factor of $S^{-3}$ when -reducing the frequency by a factor of $S$~\cite{Rauber_Analytical.Modeling.for.Energy}. Since the FLOPS of a CPU is proportional -to the frequency of a CPU, the computation time is increased proportionally to $S$. -The new dynamic energy is the dynamic power multiplied by the new time of computation -and is given by the following equation: +According to (\ref{eq:pdnew}) the dynamic power is reduced by a factor of +$S^{-3}$ when reducing the frequency by a factor of +$S$~\cite{Rauber_Analytical.Modeling.for.Energy}. Since the FLOPS of a CPU is +proportional to the frequency of a CPU, the computation time is increased +proportionally to $S$. The new dynamic energy is the dynamic power multiplied +by the new time of computation and is given by the following equation: \begin{equation} \label{eq:Edyn} - E_\textit{dNew} = P_{dOld} \cdot S^{-3} \cdot (Tcp \cdot S)= S^{-2}\cdot P_{dOld} \cdot Tcp + \EdNew = \PdOld \cdot S^{-3} \cdot (\Tcp \cdot S)= S^{-2}\cdot \PdOld \cdot \Tcp \end{equation} -The static power is related to the power leakage of the CPU and is consumed during computation -and even when idle. As in~\cite{Rauber_Analytical.Modeling.for.Energy,Zhuo_Energy.efficient.Dynamic.Task.Scheduling}, - the static power of a processor is considered as constant -during idle and computation periods, and for all its available frequencies. -The static energy is the static power multiplied by the execution time of the program. -According to the execution time model in (\ref{eq:perf}), the execution time of the program -is the sum of the computation and the communication times. The computation time is linearly related -to the frequency scaling factor, while this scaling factor does not affect the communication time. -The static energy of a processor after scaling its frequency is computed as follows: +The static power is related to the power leakage of the CPU and is consumed +during computation and even when idle. As +in~\cite{Rauber_Analytical.Modeling.for.Energy,Zhuo_Energy.efficient.Dynamic.Task.Scheduling}, +the static power of a processor is considered as constant during idle and +computation periods, and for all its available frequencies. The static energy +is the static power multiplied by the execution time of the program. According +to the execution time model in (\ref{eq:perf}), the execution time of the +program is the sum of the computation and the communication times. 
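As a hypothetical numerical illustration (the values are chosen for exposition,
not measured), scaling a core from $\Fmax = \np[GHz]{2.5}$ down to
$\Fnew = \np[GHz]{2}$ gives $S = 1.25$, so by (\ref{eq:pdnew}) and
(\ref{eq:Edyn}):
\begin{equation*}
  \PdNew = 1.25^{-3} \cdot \PdOld \approx 0.51 \cdot \PdOld,
  \qquad
  \EdNew = 1.25^{-2} \cdot \PdOld \cdot \Tcp = 0.64 \cdot \PdOld \cdot \Tcp
\end{equation*}
i.e., the computation part runs \np[\%]{25} longer while its dynamic energy
drops by \np[\%]{36}.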
+The computation time is linearly related to the frequency scaling factor, while
+this scaling factor does not affect the communication time. The static energy
+of a processor after scaling its frequency is computed as follows:
 \begin{equation}
   \label{eq:Estatic}
-  E_\textit{s} = Ps \cdot (Tcp \cdot S  + Tcm)
+  \Es = \Ps \cdot (\Tcp \cdot S  + \Tcm)
 \end{equation}

-In the considered heterogeneous platform, each processor $i$ might have
-different dynamic and static powers, noted as $Pd_{i}$ and $Ps_{i}$
-respectively. Therefore, even if the distributed message passing iterative
-application is load balanced, the computation time of each CPU $i$ noted
-$Tcp_{i}$ might be different and different frequency scaling factors might be
-computed in order to decrease the overall energy consumption of the application
-and reduce slack times. The communication time of a processor $i$ is noted as
-$Tcm_{i}$ and could contain slack times when communicating with slower
-nodes, see figure(\ref{fig:heter}). Therefore, all nodes do not have equal
-communication times. While the dynamic energy is computed according to the
-frequency scaling factor and the dynamic power of each node as in
-(\ref{eq:Edyn}), the static energy is computed as the sum of the execution time
-of one iteration multiplied by the static power of each processor. The overall
-energy consumption of a message passing distributed application executed over a
-heterogeneous platform during one iteration is the summation of all dynamic and
+In the considered heterogeneous platform, each processor $i$ may have
+different dynamic and static powers, denoted $\Pd[i]$ and $\Ps[i]$
+respectively. Therefore, even if the distributed message passing iterative
+application is load balanced, the computation time of each CPU $i$, denoted
+$\Tcp[i]$, may be different, and different frequency scaling factors may be
+computed in order to decrease the overall energy consumption of the application
+and reduce slack times. The communication time of a processor $i$ is denoted
+$\Tcm[i]$ and could contain slack times when communicating with slower nodes,
+see Figure~\ref{fig:heter}. Therefore, not all nodes have equal
+communication times. While the dynamic energy is computed according to the
+frequency scaling factor and the dynamic power of each node as in
+(\ref{eq:Edyn}), the static energy is computed as the sum of the execution time
+of one iteration multiplied by the static power of each processor. The overall
+energy consumption of a message passing distributed application executed over a
+heterogeneous platform during one iteration is the summation of all dynamic and
 static energies for each processor. It is computed as follows:
 \begin{multline}
   \label{eq:energy}
-  E = \sum_{i=1}^{N} {(S_i^{-2} \cdot Pd_{i} \cdot  Tcp_i)} + {} \\
-  \sum_{i=1}^{N} (Ps_{i} \cdot (\max_{i=1,2,\dots,N} (Tcp_i \cdot S_{i}) +
-  {MinTcm))}
- \end{multline}
-
-Reducing the frequencies of the processors according to the vector of
-scaling factors $(S_1, S_2,\dots, S_N)$ may degrade the performance of the
-application and thus, increase the static energy because the execution time is
-increased~\cite{Kim_Leakage.Current.Moore.Law}. The overall energy consumption for the iterative
-application can be measured by measuring the energy consumption for one iteration as in (\ref{eq:energy})
-multiplied by the number of iterations of that application.
+ E = \sum_{i=1}^{N} {(S_i^{-2} \cdot \Pd[i] \cdot \Tcp[i])} + {} \\ + \sum_{i=1}^{N} (\Ps[i] \cdot (\max_{i=1,2,\dots,N} (\Tcp[i] \cdot S_{i}) + + {\MinTcm))} +\end{multline} +Reducing the frequencies of the processors according to the vector of scaling +factors $(S_1, S_2,\dots, S_N)$ may degrade the performance of the application +and thus, increase the static energy because the execution time is +increased~\cite{Kim_Leakage.Current.Moore.Law}. The overall energy consumption +for the iterative application can be measured by measuring the energy +consumption for one iteration as in (\ref{eq:energy}) multiplied by the number +of iterations of that application. \section{Optimization of both energy consumption and performance} \label{sec.compet} Using the lowest frequency for each processor does not necessarily give the most -energy efficient execution of an application. Indeed, even though the dynamic -power is reduced while scaling down the frequency of a processor, its -computation power is proportionally decreased. Hence, the execution time might -be drastically increased and during that time, dynamic and static powers are -being consumed. Therefore, it might cancel any gains achieved by scaling down -the frequency of all nodes to the minimum and the overall energy consumption of -the application might not be the optimal one. It is not trivial to select the -appropriate frequency scaling factor for each processor while considering the -characteristics of each processor (computation power, range of frequencies, -dynamic and static powers) and the task executed (computation/communication -ratio). The aim being to reduce the overall energy consumption and to avoid -increasing significantly the execution time. In our previous -work~\cite{Our_first_paper}, we proposed a method that selects the optimal -frequency scaling factor for a homogeneous cluster executing a message passing -iterative synchronous application while giving the best trade-off between the -energy consumption and the performance for such applications. In this work we -are interested in heterogeneous clusters as described above. Due to the -heterogeneity of the processors, a vector of scaling factors should -be selected and it must give the best trade-off between energy consumption and -performance. - -The relation between the energy consumption and the execution time for an -application is complex and nonlinear, Thus, unlike the relation between the -execution time and the scaling factor, the relation between the energy and the -frequency scaling factors is nonlinear, for more details refer -to~\cite{Freeh_Exploring.the.Energy.Time.Tradeoff}. Moreover, these relations -are not measured using the same metric. To solve this problem, the execution -time is normalized by computing the ratio between the new execution time (after -scaling down the frequencies of some processors) and the initial one (with +energy efficient execution of an application. Indeed, even though the dynamic +power is reduced while scaling down the frequency of a processor, its +computation power is proportionally decreased. Hence, the execution time might +be drastically increased and during that time, dynamic and static powers are +being consumed. Therefore, it might cancel any gains achieved by scaling down +the frequency of all nodes to the minimum and the overall energy consumption of +the application might not be the optimal one. 
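As a sketch of how (\ref{eq:energy}) can be evaluated for one candidate vector
of scaling factors, here is an illustrative C helper (hypothetical names, not
the paper's code); \texttt{min\_tcm} is the $\MinTcm$ value of
(\ref{eq:perf2}).
\begin{verbatim}
#include <stddef.h>

/* Energy of one iteration over N heterogeneous nodes, as in (eq:energy).
   pd[i], ps[i]: dynamic and static power of node i,
   tcp[i]:       computation time of node i at maximum frequency,
   s[i]:         frequency scaling factor of node i.               */
double iteration_energy(const double *pd, const double *ps,
                        const double *tcp, const double *s,
                        double min_tcm, size_t n)
{
    double e_dyn = 0.0, max_comp = 0.0;
    for (size_t i = 0; i < n; i++) {
        e_dyn += pd[i] * tcp[i] / (s[i] * s[i]); /* S_i^{-2}.Pd_i.Tcp_i */
        if (tcp[i] * s[i] > max_comp) max_comp = tcp[i] * s[i];
    }
    double t_iter   = max_comp + min_tcm;        /* iteration time      */
    double e_static = 0.0;
    for (size_t i = 0; i < n; i++)
        e_static += ps[i] * t_iter;              /* Ps_i . Tnew         */
    return e_dyn + e_static;                     /* times the number of
                                                    iterations overall  */
}
\end{verbatim}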
It is not trivial to select the
-appropriate frequency scaling factor for each processor while considering the
-characteristics of each processor (computation power, range of frequencies,
-dynamic and static powers) and the task executed (computation/communication
-ratio). The aim being to reduce the overall energy consumption and to avoid
-increasing significantly the execution time. In our previous
-work~\cite{Our_first_paper}, we proposed a method that selects the optimal
-frequency scaling factor for a homogeneous cluster executing a message passing
-iterative synchronous application while giving the best trade-off between the
-energy consumption and the performance for such applications. In this work we
-are interested in heterogeneous clusters as described above. Due to the
-heterogeneity of the processors, a vector of scaling factors should
-be selected and it must give the best trade-off between energy consumption and
-performance.
-
-The relation between the energy consumption and the execution time for an
-application is complex and nonlinear, Thus, unlike the relation between the
-execution time and the scaling factor, the relation between the energy and the
-frequency scaling factors is nonlinear, for more details refer
-to~\cite{Freeh_Exploring.the.Energy.Time.Tradeoff}. Moreover, these relations
-are not measured using the same metric. To solve this problem, the execution
-time is normalized by computing the ratio between the new execution time (after
-scaling down the frequencies of some processors) and the initial one (with
+energy efficient execution of an application. Indeed, even though the dynamic
+power is reduced while scaling down the frequency of a processor, its
+computation power is proportionally decreased. Hence, the execution time might
+be drastically increased and during that time, dynamic and static powers are
+being consumed. Therefore, it might cancel any gains achieved by scaling down
+the frequency of all nodes to the minimum and the overall energy consumption of
+the application might not be the optimal one. It is not trivial to select the
+appropriate frequency scaling factor for each processor while considering the
+characteristics of each processor (computation power, range of frequencies,
+dynamic and static powers) and the task executed (computation/communication
+ratio). The aim is to reduce the overall energy consumption and to avoid
+significantly increasing the execution time. In our previous
+work~\cite{Our_first_paper}, we proposed a method that selects the optimal
+frequency scaling factor for a homogeneous cluster executing a message passing
+iterative synchronous application while giving the best trade-off between the
+energy consumption and the performance for such applications. In this work we
+are interested in heterogeneous clusters as described above. Due to the
+heterogeneity of the processors, a vector of scaling factors should be selected,
+and it must give the best trade-off between energy consumption and performance.
+
+The relation between the energy consumption and the execution time for an
+application is complex and nonlinear. Thus, unlike the relation between the
+execution time and the scaling factor, the relation between the energy and the
+frequency scaling factors is nonlinear; for more details refer
+to~\cite{Freeh_Exploring.the.Energy.Time.Tradeoff}. Moreover, these relations
+are not measured using the same metric. To solve this problem, the execution
+time is normalized by computing the ratio between the new execution time (after
+scaling down the frequencies of some processors) and the initial one (with
 maximum frequency for all nodes) as follows:
 \begin{multline}
   \label{eq:pnorm}
-  P_\textit{Norm} = \frac{T_\textit{New}}{T_\textit{Old}}\\
-  {} = \frac{ \max_{i=1,2,\dots,N} (Tcp_{i} \cdot S_{i}) +MinTcm}
-           {\max_{i=1,2,\dots,N}{(Tcp_i+Tcm_i)}}
+  \Pnorm = \frac{\Tnew}{\Told}\\
+  {} = \frac{ \max_{i=1,2,\dots,N} (\Tcp[i] \cdot S_{i}) +\MinTcm}
+           {\max_{i=1,2,\dots,N}{(\Tcp[i]+\Tcm[i])}}
 \end{multline}
-
-In the same way, the energy is normalized by computing the ratio between the consumed energy
-while scaling down the frequency and the consumed energy with maximum frequency for all nodes:
+In the same way, the energy is normalized by computing the ratio between the
+consumed energy while scaling down the frequency and the consumed energy with
+maximum frequency for all nodes:
 \begin{multline}
   \label{eq:enorm}
-  E_\textit{Norm} = \frac{E_\textit{Reduced}}{E_\textit{Original}} \\
-  {} = \frac{ \sum_{i=1}^{N}{(S_i^{-2} \cdot Pd_i \cdot Tcp_i)} +
-  \sum_{i=1}^{N} {(Ps_i \cdot T_{New})}}{\sum_{i=1}^{N}{( Pd_i \cdot Tcp_i)} +
-  \sum_{i=1}^{N} {(Ps_i \cdot T_{Old})}}
-\end{multline}
-Where $E_\textit{Reduced}$ and $E_\textit{Original}$ are computed using (\ref{eq:energy}) and
- $T_{New}$ and $T_{Old}$ are computed as in (\ref{eq:pnorm}).
-
-While the main
-goal is to optimize the energy and execution time at the same time, the normalized
-energy and execution time curves are not in the same direction. According
-to the equations~(\ref{eq:pnorm}) and (\ref{eq:enorm}), the vector of frequency
-scaling factors $S_1,S_2,\dots,S_N$ reduce both the energy and the execution
-time simultaneously. But the main objective is to produce maximum energy
-reduction with minimum execution time reduction.
-
-This problem can be solved by making the optimization process for energy and
-execution time following the same direction. Therefore, the equation of the
-normalized execution time is inverted which gives the normalized performance equation, as follows:
+  \Enorm = \frac{\Ereduced}{\Eoriginal} \\
+  {} = \frac{ \sum_{i=1}^{N}{(S_i^{-2} \cdot \Pd[i] \cdot \Tcp[i])} +
+  \sum_{i=1}^{N} {(\Ps[i] \cdot \Tnew)}}{\sum_{i=1}^{N}{( \Pd[i] \cdot \Tcp[i])} +
+  \sum_{i=1}^{N} {(\Ps[i] \cdot \Told)}}
+\end{multline}
+where $\Ereduced$ and $\Eoriginal$ are computed using (\ref{eq:energy}) and
+$\Tnew$ and $\Told$ are computed as in (\ref{eq:pnorm}).
+
+While the main goal is to optimize the energy and execution time at the same
+time, the normalized energy and execution time curves are not in the same
+direction. According to the equations~(\ref{eq:pnorm}) and (\ref{eq:enorm}), the
+vector of frequency scaling factors $S_1,S_2,\dots,S_N$ reduces both the energy
+and the execution time simultaneously. But the main objective is to produce
+maximum energy reduction with minimum execution time reduction.
+
+This problem can be solved by making the optimization processes for energy and
+execution time follow the same direction. Therefore, the equation of the
+normalized execution time is inverted, which gives the normalized performance
+equation, as follows:
 \begin{multline}
   \label{eq:pnorm_inv}
-  P_\textit{Norm} = \frac{T_\textit{Old}}{T_\textit{New}}\\
-  = \frac{\max_{i=1,2,\dots,N}{(Tcp_i+Tcm_i)}}
-           { \max_{i=1,2,\dots,N} (Tcp_{i} \cdot S_{i}) + MinTcm}
+  \Pnorm = \frac{\Told}{\Tnew}\\
+  = \frac{\max_{i=1,2,\dots,N}{(\Tcp[i]+\Tcm[i])}}
+          { \max_{i=1,2,\dots,N} (\Tcp[i] \cdot S_{i}) + \MinTcm}
 \end{multline}

-\begin{figure}
+\begin{figure}[!t]
   \centering
   \subfloat[Homogeneous platform]{%
     \includegraphics[width=.33\textwidth]{fig/homo}\label{fig:r1}}%
-
-
+
   \subfloat[Heterogeneous platform]{%
     \includegraphics[width=.33\textwidth]{fig/heter}\label{fig:r2}}
   \caption{The energy and performance relation}
   \label{fig:rel}
 \end{figure}

-Then, the objective function can be modeled in order to find the maximum distance
-between the energy curve (\ref{eq:enorm}) and the performance
-curve (\ref{eq:pnorm_inv}) over all available sets of scaling factors. This
-represents the minimum energy consumption with minimum execution time (maximum
-performance) at the same time, see figure~(\ref{fig:r1}) or figure~(\ref{fig:r2}). Then the objective
-function has the following form:
+Then, the objective function can be modeled in order to find the maximum
+distance between the energy curve (\ref{eq:enorm}) and the performance curve
+(\ref{eq:pnorm_inv}) over all available sets of scaling factors. This
+represents the minimum energy consumption with minimum execution time (maximum
+performance) at the same time, see Figure~\ref{fig:r1} or
+Figure~\ref{fig:r2}. Then the objective function has the following form:
 \begin{equation}
   \label{eq:max}
-  Max Dist =
-  \max_{i=1,\dots F, j=1,\dots,N}
-      (\overbrace{P_\textit{Norm}(S_{ij})}^{\text{Maximize}} -
-       \overbrace{E_\textit{Norm}(S_{ij})}^{\text{Minimize}} )
+  \MaxDist =
+  \mathop{\max_{i=1,\dots,F}}_{j=1,\dots,N}
+      (\overbrace{\Pnorm(S_{ij})}^{\text{Maximize}} -
+       \overbrace{\Enorm(S_{ij})}^{\text{Minimize}} )
 \end{equation}
-where $N$ is the number of nodes and $F$ is the number of available frequencies for each node.
-Then, the optimal set of scaling factors that satisfies (\ref{eq:max}) can be selected.
-The objective function can work with any energy model or any power values for each node
-(static and dynamic powers).
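Concretely, scoring one candidate vector against (\ref{eq:max}) only needs the
two normalized ratios. The following hedged C sketch reuses the illustrative
helpers sketched earlier (again, not the paper's code); the caller keeps the
candidate with the largest returned distance.
\begin{verbatim}
/* Distance maximized in (eq:max) for one candidate vector s[];
   t_old and e_original are computed once at maximum frequency. */
double tradeoff_distance(const double *pd, const double *ps,
                         const double *tcp, const double *tcm,
                         const double *s, size_t n,
                         double t_old, double e_original)
{
    double min_tcm = tcm[0];
    for (size_t i = 1; i < n; i++)
        if (tcm[i] < min_tcm) min_tcm = tcm[i];

    double t_new  = predict_tnew(tcp, tcm, s, n);
    double e_new  = iteration_energy(pd, ps, tcp, s, min_tcm, n);
    double p_norm = t_old / t_new;       /* (eq:pnorm_inv), maximized */
    double e_norm = e_new / e_original;  /* (eq:enorm),     minimized */
    return p_norm - e_norm;
}
\end{verbatim}
The selection algorithm presented below simply evaluates this distance while
walking the frequency gears downwards and keeps the best vector found.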
However, the most important energy reduction gain can be achieved when -the energy curve has a convex form as shown in~\cite{Zhuo_Energy.efficient.Dynamic.Task.Scheduling,Rauber_Analytical.Modeling.for.Energy,Hao_Learning.based.DVFS}. +where $N$ is the number of nodes and $F$ is the number of available frequencies +for each node. Then, the optimal set of scaling factors that satisfies +(\ref{eq:max}) can be selected. The objective function can work with any energy +model or any power values for each node (static and dynamic powers). However, +the most important energy reduction gain can be achieved when the energy curve +has a convex form as shown +in~\cite{Zhuo_Energy.efficient.Dynamic.Task.Scheduling,Rauber_Analytical.Modeling.for.Energy,Hao_Learning.based.DVFS}. \section{The scaling factors selection algorithm for heterogeneous platforms } \label{sec.optim} -\subsection{The algorithm details} -In this section, algorithm \ref{HSA} is presented. It selects the frequency scaling factors -vector that gives the best trade-off between minimizing the energy consumption and maximizing -the performance of a message passing synchronous iterative application executed on a heterogeneous -platform. It works online during the execution time of the iterative message passing program. -It uses information gathered during the first iteration such as the computation time and the -communication time in one iteration for each node. The algorithm is executed after the first -iteration and returns a vector of optimal frequency scaling factors that satisfies the objective -function (\ref{eq:max}). The program applies DVFS operations to change the frequencies of the CPUs -according to the computed scaling factors. This algorithm is called just once during the execution -of the program. Algorithm~(\ref{dvfs}) shows where and when the proposed scaling algorithm is called -in the iterative MPI program. - -The nodes in a heterogeneous platform have different computing powers, thus while executing message -passing iterative synchronous applications, fast nodes have to wait for the slower ones to finish their -computations before being able to synchronously communicate with them as in figure (\ref{fig:heter}). -These periods are called idle or slack times. -The algorithm takes into account this problem and tries to reduce these slack times when selecting the -frequency scaling factors vector. At first, it selects initial frequency scaling factors that increase -the execution times of fast nodes and minimize the differences between the computation times of -fast and slow nodes. The value of the initial frequency scaling factor for each node is inversely -proportional to its computation time that was gathered from the first iteration. These initial frequency -scaling factors are computed as a ratio between the computation time of the slowest node and the -computation time of the node $i$ as follows: -\begin{equation} - \label{eq:Scp} - Scp_{i} = \frac{\max_{i=1,2,\dots,N}(Tcp_i)}{Tcp_i} -\end{equation} -Using the initial frequency scaling factors computed in (\ref{eq:Scp}), the algorithm computes -the initial frequencies for all nodes as a ratio between the maximum frequency of node $i$ -and the computation scaling factor $Scp_i$ as follows: -\begin{equation} - \label{eq:Fint} - F_{i} = \frac{Fmax_i}{Scp_i},~{i=1,2,\cdots,N} -\end{equation} -If the computed initial frequency for a node is not available in the gears of -that node, it is replaced by the nearest available frequency. 
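A short C sketch of this initialization step follows; the helper name is
hypothetical, and the rounding to each node's available gears is platform
specific, so it is only indicated by a comment.
\begin{verbatim}
#include <stddef.h>

/* Initial frequencies from (eq:Scp) and (eq:Fint): the faster nodes
   are slowed down so that all computation times roughly match the
   slowest node.
   tcp[i]:  computation time of node i at maximum frequency,
   fmax[i]: maximum frequency of node i,
   f[i]:    output, initial frequency of node i.                    */
void initial_frequencies(const double *tcp, const double *fmax,
                         double *f, size_t n)
{
    double tcp_slowest = tcp[0];
    for (size_t i = 1; i < n; i++)
        if (tcp[i] > tcp_slowest) tcp_slowest = tcp[i];

    for (size_t i = 0; i < n; i++) {
        double scp = tcp_slowest / tcp[i]; /* Scp_i >= 1, = 1 for the
                                              slowest node           */
        f[i] = fmax[i] / scp;              /* F_i = Fmax_i / Scp_i   */
        /* f[i] is then rounded to the nearest available gear */
    }
}
\end{verbatim}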
In figure -(\ref{fig:st_freq}), the nodes are sorted by their computing power in ascending -order and the frequencies of the faster nodes are scaled down according to the -computed initial frequency scaling factors. The resulting new frequencies are -colored in blue in figure (\ref{fig:st_freq}). This set of frequencies can be -considered as a higher bound for the search space of the optimal vector of -frequencies because selecting frequency scaling factors higher than the higher -bound will not improve the performance of the application and it will increase -its overall energy consumption. Therefore the algorithm that selects the -frequency scaling factors starts the search method from these initial -frequencies and takes a downward search direction toward lower frequencies. The -algorithm iterates on all left frequencies, from the higher bound until all -nodes reach their minimum frequencies, to compute their overall energy -consumption and performance, and select the optimal frequency scaling factors -vector. At each iteration the algorithm determines the slowest node according to -the equation (\ref{eq:perf}) and keeps its frequency unchanged, while it lowers -the frequency of all other nodes by one gear. The new overall energy -consumption and execution time are computed according to the new scaling -factors. The optimal set of frequency scaling factors is the set that gives the -highest distance according to the objective function (\ref{eq:max}). - -Figures~\ref{fig:r1} and \ref{fig:r2} illustrate the normalized performance and -consumed energy for an application running on a homogeneous platform and a -heterogeneous platform respectively while increasing the scaling factors. It can -be noticed that in a homogeneous platform the search for the optimal scaling -factor should start from the maximum frequency because the performance and the -consumed energy decrease from the beginning of the plot. On the other hand, -in the heterogeneous platform the performance is maintained at the beginning of -the plot even if the frequencies of the faster nodes decrease until the -computing power of scaled down nodes are lower than the slowest node. In other -words, until they reach the higher bound. It can also be noticed that the higher -the difference between the faster nodes and the slower nodes is, the bigger the -maximum distance between the energy curve and the performance curve is while - the scaling factors are varying which results in bigger energy savings. -\begin{figure}[t] - \centering - \includegraphics[scale=0.5]{fig/start_freq} - \caption{Selecting the initial frequencies} - \label{fig:st_freq} -\end{figure} - - - - \begin{algorithm} \begin{algorithmic}[1] % \footnotesize \Require ~ \begin{description} - \item[$Tcp_i$] array of all computation times for all nodes during one iteration and with highest frequency. - \item[$Tcm_i$] array of all communication times for all nodes during one iteration and with highest frequency. - \item[$Fmax_i$] array of the maximum frequencies for all nodes. - \item[$Pd_i$] array of the dynamic powers for all nodes. - \item[$Ps_i$] array of the static powers for all nodes. - \item[$Fdiff_i$] array of the difference between two successive frequencies for all nodes. + \item[{$\Tcp[i]$}] array of all computation times for all nodes during one iteration and with highest frequency. + \item[{$\Tcm[i]$}] array of all communication times for all nodes during one iteration and with highest frequency. 
+ \item[{$\Fmax[i]$}] array of the maximum frequencies for all nodes. + \item[{$\Pd[i]$}] array of the dynamic powers for all nodes. + \item[{$\Ps[i]$}] array of the static powers for all nodes. + \item[{$\Fdiff[i]$}] array of the difference between two successive frequencies for all nodes. \end{description} - \Ensure $Sopt_1,Sopt_2 \dots, Sopt_N$ is a vector of optimal scaling factors + \Ensure $\Sopt[1],\Sopt[2] \dots, \Sopt[N]$ is a vector of optimal scaling factors - \State $ Scp_i \gets \frac{\max_{i=1,2,\dots,N}(Tcp_i)}{Tcp_i} $ - \State $F_{i} \gets \frac{Fmax_i}{Scp_i},~{i=1,2,\cdots,N}$ + \State $\Scp[i] \gets \frac{\max_{i=1,2,\dots,N}(\Tcp[i])}{\Tcp[i]} $ + \State $F_{i} \gets \frac{\Fmax[i]}{\Scp[i]},~{i=1,2,\cdots,N}$ \State Round the computed initial frequencies $F_i$ to the closest one available in each node. \If{(not the first frequency)} - \State $F_i \gets F_i+Fdiff_i,~i=1,\dots,N.$ - \EndIf - \State $T_\textit{Old} \gets max_{~i=1,\dots,N } (Tcp_i+Tcm_i)$ - \State $E_\textit{Original} \gets \sum_{i=1}^{N}{( Pd_i \cdot Tcp_i)} +\sum_{i=1}^{N} {(Ps_i \cdot T_{Old})}$ - \State $Sopt_{i} \gets 1,~i=1,\dots,N. $ - \State $Dist \gets 0 $ + \State $F_i \gets F_i+\Fdiff[i],~i=1,\dots,N.$ + \EndIf + \State $\Told \gets \max_{i=1,\dots,N} (\Tcp[i]+\Tcm[i])$ + % \State $\Eoriginal \gets \sum_{i=1}^{N}{( \Pd[i] \cdot \Tcp[i])} +\sum_{i=1}^{N} {(\Ps[i] \cdot \Told)}$ + \State $\Eoriginal \gets \sum_{i=1}^{N}{( \Pd[i] \cdot \Tcp[i] + \Ps[i] \cdot \Told)}$ + \State $\Sopt[i] \gets 1,~i=1,\dots,N. $ + \State $\Dist \gets 0 $ \While {(all nodes not reach their minimum frequency)} \If{(not the last freq. \textbf{and} not the slowest node)} - \State $F_i \gets F_i - Fdiff_i,~i=1,\dots,N.$ - \State $S_i \gets \frac{Fmax_i}{F_i},~i=1,\dots,N.$ + \State $F_i \gets F_i - \Fdiff[i],~i=1,\dots,N.$ + \State $S_i \gets \frac{\Fmax[i]}{F_i},~i=1,\dots,N.$ \EndIf - \State $T_{New} \gets max_\textit{~i=1,\dots,N} (Tcp_{i} \cdot S_{i}) + MinTcm $ - \State $E_\textit{Reduced} \gets \sum_{i=1}^{N}{(S_i^{-2} \cdot Pd_i \cdot Tcp_i)} + $ \hspace*{43 mm} - $\sum_{i=1}^{N} {(Ps_i \cdot T_{New})} $ - \State $ P_\textit{Norm} \gets \frac{T_\textit{Old}}{T_\textit{New}}$ - \State $E_\textit{Norm}\gets \frac{E_\textit{Reduced}}{E_\textit{Original}}$ + \State $\Tnew \gets \max_{i=1,\dots,N} (\Tcp[i] \cdot S_{i}) + \MinTcm $ +% \State $\Ereduced \gets \sum_{i=1}^{N}{(S_i^{-2} \cdot \Pd[i] \cdot \Tcp[i])} + \sum_{i=1}^{N} {(\Ps[i] \cdot \rlap{\Tnew)}} $ + \State $\Ereduced \gets \sum_{i=1}^{N}{(S_i^{-2} \cdot \Pd[i] \cdot \Tcp[i] + \Ps[i] \cdot \rlap{\Tnew)}} $ + \State $\Pnorm \gets \frac{\Told}{\Tnew}$ + \State $\Enorm\gets \frac{\Ereduced}{\Eoriginal}$ \If{$(\Pnorm - \Enorm > \Dist)$} - \State $Sopt_{i} \gets S_{i},~i=1,\dots,N. $ + \State $\Sopt[i] \gets S_{i},~i=1,\dots,N. $ \State $\Dist \gets \Pnorm - \Enorm$ \EndIf \EndWhile - \State Return $Sopt_1,Sopt_2,\dots,Sopt_N$ + \State Return $\Sopt[1],\Sopt[2],\dots,\Sopt[N]$ \end{algorithmic} \caption{frequency scaling factors selection algorithm} \label{HSA} @@ -654,7 +607,7 @@ maximum distance between the energy curve and the performance curve is while \If {$(k=1)$} \State Gather all times of computation and\newline\hspace*{3em}% communication from each node. - \State Call algorithm \ref{HSA}. + \State Call Algorithm \ref{HSA}. \State Compute the new frequencies from the\newline\hspace*{3em}% returned optimal scaling factors. \State Set the new frequencies to nodes. 
@@ -665,439 +618,553 @@ maximum distance between the energy curve and the performance curve is while \label{dvfs} \end{algorithm} -\subsection{The evaluation of the proposed algorithm} -\label{sec.verif.algo} -The precision of the proposed algorithm mainly depends on the execution time -prediction model defined in (\ref{eq:perf}) and the energy model computed by -(\ref{eq:energy}). The energy model is also significantly dependent on the -execution time model because the static energy is linearly related to the -execution time and the dynamic energy is related to the computation time. So, -all the works presented in this paper are based on the execution time model. To -verify this model, the predicted execution time was compared to the real -execution time over SimGrid/SMPI simulator, -v3.10~\cite{casanova+giersch+legrand+al.2014.versatile}, for all the NAS -parallel benchmarks NPB v3.3 \cite{NAS.Parallel.Benchmarks}, running class B on -8 or 9 nodes. The comparison showed that the proposed execution time model is -very precise, the maximum normalized difference between the predicted execution -time and the real execution time is equal to 0.03 for all the NAS benchmarks. +\subsection{The algorithm details} -Since the proposed algorithm is not an exact method it does not test all the possible solutions (vectors of scaling factors) -in the search space. To prove its efficiency, it was compared on small instances to a brute force search algorithm -that tests all the possible solutions. The brute force algorithm was applied to different NAS benchmarks classes with -different number of nodes. The solutions returned by the brute force algorithm and the proposed algorithm were identical -and the proposed algorithm was on average 10 times faster than the brute force algorithm. It has a small execution time: -for a heterogeneous cluster composed of four different types of nodes having the characteristics presented in -table~\ref{table:platform}, it takes on average \np[ms]{0.04} for 4 nodes and \np[ms]{0.15} on average for 144 nodes -to compute the best scaling factors vector. The algorithm complexity is $O(F\cdot (N \cdot4) )$, where $F$ is the number -of iterations and $N$ is the number of computing nodes. The algorithm needs from 12 to 20 iterations to select the best -vector of frequency scaling factors that gives the results of the next sections. +In this section, Algorithm~\ref{HSA} is presented. It selects the frequency +scaling factors vector that gives the best trade-off between minimizing the +energy consumption and maximizing the performance of a message passing +synchronous iterative application executed on a heterogeneous platform. It works +online during the execution time of the iterative message passing program. It +uses information gathered during the first iteration such as the computation +time and the communication time in one iteration for each node. The algorithm is +executed after the first iteration and returns a vector of optimal frequency +scaling factors that satisfies the objective function (\ref{eq:max}). The +program applies DVFS operations to change the frequencies of the CPUs according +to the computed scaling factors. This algorithm is called just once during the +execution of the program. Algorithm~\ref{dvfs} shows where and when the proposed +scaling algorithm is called in the iterative MPI program. 
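For illustration, the calling scheme of Algorithm~\ref{dvfs} can be sketched in
a few lines (a minimal sketch assuming the mpi4py MPI binding;
\texttt{compute\_iteration}, \texttt{exchange\_data},
\texttt{select\_scaling\_factors} and \texttt{apply\_dvfs} are hypothetical
placeholders for the application kernels, Algorithm~\ref{HSA} and the DVFS
interface; none of them is part of the evaluated implementation):
\begin{verbatim}
from mpi4py import MPI   # assumption: an MPI binding for Python
import time

comm = MPI.COMM_WORLD
max_iter = 1000          # application-dependent
for k in range(1, max_iter + 1):
    t0 = time.time()
    compute_iteration()  # hypothetical computation kernel
    t1 = time.time()
    exchange_data(comm)  # synchronous message passing phase
    t2 = time.time()
    if k == 1:           # measure once, scale once
        tcp = comm.gather(t1 - t0, root=0)  # computation times
        tcm = comm.gather(t2 - t1, root=0)  # communication times
        sopt = None
        if comm.rank == 0:
            sopt = select_scaling_factors(tcp, tcm)
        sopt = comm.bcast(sopt, root=0)
        apply_dvfs(sopt[comm.rank])  # set F_i = Fmax_i / Sopt_i
\end{verbatim}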
+ +\begin{figure}[!t] + \centering + \includegraphics[scale=0.5]{fig/start_freq} + \caption{Selecting the initial frequencies} + \label{fig:st_freq} +\end{figure} -\section{Experimental results} -\label{sec.expe} -To evaluate the efficiency and the overall energy consumption reduction of -algorithm~\ref{HSA}, it was applied to the NAS parallel benchmarks NPB v3.3. The -experiments were executed on the simulator SimGrid/SMPI which offers easy tools -to create a heterogeneous platform and run message passing applications over it. -The heterogeneous platform that was used in the experiments, had one core per -node because just one process was executed per node. The heterogeneous platform -was composed of four types of nodes. Each type of nodes had different -characteristics such as the maximum CPU frequency, the number of available -frequencies and the computational power, see Table \ref{table:platform}. The -characteristics of these different types of nodes are inspired from the -specifications of real Intel processors. The heterogeneous platform had up to -144 nodes and had nodes from the four types in equal proportions, for example if -a benchmark was executed on 8 nodes, 2 nodes from each type were used. Since the -constructors of CPUs do not specify the dynamic and the static power of their -CPUs, for each type of node they were chosen proportionally to its computing -power (FLOPS). In the initial heterogeneous platform, while computing with -highest frequency, each node consumed an amount of power proportional to its -computing power (which corresponds to 80\% of its dynamic power and the -remaining 20\% to the static power), the same assumption was made in -\cite{Our_first_paper,Rauber_Analytical.Modeling.for.Energy}. Finally, These -nodes were connected via an ethernet network with 1 Gbit/s bandwidth. +The nodes in a heterogeneous platform have different computing powers, thus +while executing message passing iterative synchronous applications, fast nodes +have to wait for the slower ones to finish their computations before being able +to synchronously communicate with them as in Figure~\ref{fig:heter}. These +periods are called idle or slack times. The algorithm takes into account this +problem and tries to reduce these slack times when selecting the frequency +scaling factors vector. At first, it selects initial frequency scaling factors +that increase the execution times of fast nodes and minimize the differences +between the computation times of fast and slow nodes. The value of the initial +frequency scaling factor for each node is inversely proportional to its +computation time that was gathered from the first iteration. These initial +frequency scaling factors are computed as a ratio between the computation time +of the slowest node and the computation time of the node $i$ as follows: +\begin{equation} + \label{eq:Scp} + \Scp[i] = \frac{\max_{i=1,2,\dots,N}(\Tcp[i])}{\Tcp[i]} +\end{equation} +Using the initial frequency scaling factors computed in (\ref{eq:Scp}), the +algorithm computes the initial frequencies for all nodes as a ratio between the +maximum frequency of node $i$ and the computation scaling factor $\Scp[i]$ as +follows: +\begin{equation} + \label{eq:Fint} + F_{i} = \frac{\Fmax[i]}{\Scp[i]},~{i=1,2,\dots,N} +\end{equation} +If the computed initial frequency for a node is not available in the gears of +that node, it is replaced by the nearest available frequency. 
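Equations (\ref{eq:Scp}) and (\ref{eq:Fint}) translate directly into code; a
minimal Python sketch (hypothetical helper with illustrative inputs:
\texttt{tcp} holds the computation times measured during the first iteration,
\texttt{fmax} the maximum frequencies and \texttt{gears} the available
frequencies of each node):
\begin{verbatim}
def initial_frequencies(tcp, fmax, gears):
    # Scp_i = max_j(Tcp_j) / Tcp_i   and   F_i = Fmax_i / Scp_i,
    # rounded to the nearest gear available on node i
    tmax = max(tcp)
    freqs = []
    for t, fm, g in zip(tcp, fmax, gears):
        f = fm / (tmax / t)
        freqs.append(min(g, key=lambda x: abs(x - f)))
    return freqs
\end{verbatim}
For instance, a node whose computation time is half that of the slowest node
obtains $\Scp[i] = 2$ and starts the search at half of its maximum frequency.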
In
+Figure~\ref{fig:st_freq}, the nodes are sorted by their computing power in
+ascending order and the frequencies of the faster nodes are scaled down
+according to the computed initial frequency scaling factors. The resulting new
+frequencies are highlighted in Figure~\ref{fig:st_freq}. This set of
+frequencies can be considered as a higher bound for the search space of the
+optimal vector of frequencies because selecting frequency scaling factors higher
+than the higher bound will not improve the performance of the application and it
+will increase its overall energy consumption. Therefore, the algorithm that
+selects the frequency scaling factors starts the search method from these
+initial frequencies and takes a downward search direction toward lower
+frequencies. The algorithm iterates on all remaining frequencies, from the
+higher bound until all nodes reach their minimum frequencies, to compute their
+overall energy consumption and performance, and select the optimal frequency
+scaling factors vector. At each iteration the algorithm determines the slowest
+node according to equation (\ref{eq:perf}) and keeps its frequency unchanged,
+while it lowers the frequency of all other nodes by one gear. The new overall
+energy consumption and execution time are computed according to the new scaling
+factors. The optimal set of frequency scaling factors is the set that gives the
+highest distance according to the objective function (\ref{eq:max}).
+
+Figures~\ref{fig:r1} and \ref{fig:r2} illustrate the normalized performance and
+consumed energy for an application running on a homogeneous platform and a
+heterogeneous platform respectively, while increasing the scaling factors. It
+can be noticed that in a homogeneous platform the search for the optimal scaling
+factor should start from the maximum frequency because the performance and the
+consumed energy decrease from the beginning of the plot. On the other hand, in
+the heterogeneous platform the performance is maintained at the beginning of the
+plot even if the frequencies of the faster nodes decrease until the computing
+power of the scaled down nodes is lower than that of the slowest node, in other
+words, until they reach the higher bound. It can also be noticed that the higher
+the difference between the faster nodes and the slower nodes is, the bigger the
+maximum distance between the energy curve and the performance curve is while the
+scaling factors are varying, which results in bigger energy savings.
+
+\subsection{The evaluation of the proposed algorithm}
+\label{sec.verif.algo}
+The precision of the proposed algorithm mainly depends on the execution time
+prediction model defined in (\ref{eq:perf}) and the energy model computed by
+(\ref{eq:energy}). The energy model is also significantly dependent on the
+execution time model because the static energy is linearly related to the
+execution time and the dynamic energy is related to the computation time. So,
+all the works presented in this paper are based on the execution time model. To
+verify this model, the predicted execution time was compared to the real
+execution time over the SimGrid/SMPI simulator,
+v3.10~\cite{casanova+giersch+legrand+al.2014.versatile}, for all the NAS
+parallel benchmarks NPB v3.3 \cite{NAS.Parallel.Benchmarks}, running class B on
+8 or 9 nodes. The comparison showed that the proposed execution time model is
+very precise: the maximum normalized difference between the predicted execution
+time and the real execution time is equal to 0.03 for all the NAS benchmarks.
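The verified models are cheap enough to be evaluated at every step of the
search, so the whole selection loop can be sketched compactly (a Python sketch
of the search described above, not the evaluated implementation; \texttt{fmin}
and \texttt{fdiff} are the per-node minimum frequencies and gear widths, and
\texttt{f} the initial frequencies computed from (\ref{eq:Fint})):
\begin{verbatim}
def max_dist(tcp, tcm, f, fmax, fmin, fdiff, pd, ps):
    n = len(tcp)
    f = list(f)                            # do not modify the input
    s = [fmax[i] / f[i] for i in range(n)]
    t_old = max(tcp[i] + tcm[i] for i in range(n))
    e_orig = sum(pd[i] * tcp[i] + ps[i] * t_old for i in range(n))
    min_tcm = min(tcm)
    sopt, dist = [1.0] * n, 0.0
    while True:
        slowest = max(range(n), key=lambda i: tcp[i] * s[i])
        moved = False                      # lower the other nodes one gear
        for i in range(n):
            if i != slowest and f[i] - fdiff[i] >= fmin[i]:
                f[i] -= fdiff[i]
                s[i] = fmax[i] / f[i]
                moved = True
        if not moved:                      # no node can be lowered further
            break
        t_new = max(tcp[i] * s[i] for i in range(n)) + min_tcm
        e_red = sum(pd[i] * tcp[i] / s[i] ** 2 + ps[i] * t_new
                    for i in range(n))
        p_norm, e_norm = t_old / t_new, e_red / e_orig
        if p_norm - e_norm > dist:         # keep the largest distance
            sopt, dist = list(s), p_norm - e_norm
    return sopt
\end{verbatim}
Each pass of the while loop lowers every node except the slowest one by one
gear, exactly as in Algorithm~\ref{HSA}.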
-\begin{table}[htb]
+Since the proposed algorithm is not an exact method, it does not test all the
+possible solutions (vectors of scaling factors) in the search space. To prove
+its efficiency, it was compared on small instances to a brute force search
+algorithm that tests all the possible solutions. The brute force algorithm was
+applied to different NAS benchmark classes with different numbers of nodes. The
+solutions returned by the brute force algorithm and the proposed algorithm were
+identical, and the proposed algorithm was on average 10 times faster than the
+brute force algorithm. It has a small execution time: for a heterogeneous
+cluster composed of four different types of nodes having the characteristics
+presented in Table~\ref{table:platform}, it takes on average \np[ms]{0.04} for 4
+nodes and \np[ms]{0.15} for 144 nodes to compute the best scaling
+factors vector. The algorithm complexity is $O(F\cdot N)$, where $F$ is the
+maximum number of available frequencies, and $N$ is the number of computing
+nodes. The algorithm needs from 12 to 20 iterations to select the best vector of
+frequency scaling factors that gives the results of the next sections.
+
+\begin{table}[!t]
  \caption{Heterogeneous nodes characteristics}
  % title of Table
  \centering
-  \begin{tabular}{|*{7}{l|}}
+  \begin{tabular}{|*{7}{r|}}
    \hline
-    Node &Simulated & Max & Min & Diff. & Dynamic & Static \\
-    type &GFLOPS & Freq. & Freq. & Freq. & power & power \\
-    & & GHz & GHz &GHz & & \\
+    Node & Simulated & Max & Min & Diff. & Dynamic & Static \\
+    type & GFLOPS & Freq. & Freq. & Freq. & power & power \\
+    & & GHz & GHz & GHz & & \\
    \hline
-    1 &40 & 2.5 & 1.2 & 0.1 & 20~w &4~w \\
-
+    1 & 40 & 2.50 & 1.20 & 0.100 & \np[W]{20} & \np[W]{4} \\
    \hline
-    2 &50 & 2.66 & 1.6 & 0.133 & 25~w &5~w \\
-
+    2 & 50 & 2.66 & 1.60 & 0.133 & \np[W]{25} & \np[W]{5} \\
    \hline
-    3 &60 & 2.9 & 1.2 & 0.1 & 30~w &6~w \\
-
+    3 & 60 & 2.90 & 1.20 & 0.100 & \np[W]{30} & \np[W]{6} \\
    \hline
-    4 &70 & 3.4 & 1.6 & 0.133 & 35~w &7~w \\
-
+    4 & 70 & 3.40 & 1.60 & 0.133 & \np[W]{35} & \np[W]{7} \\
    \hline
  \end{tabular}
  \label{table:platform}
\end{table}

-
-%\subsection{Performance prediction verification}
+\section{Experimental results}
+\label{sec.expe}
+
+To evaluate the efficiency and the overall energy consumption reduction of
+Algorithm~\ref{HSA}, it was applied to the NAS parallel benchmarks NPB v3.3. The
+experiments were executed on the SimGrid/SMPI simulator, which offers easy tools
+to create a heterogeneous platform and run message passing applications over it.
+The heterogeneous platform that was used in the experiments had one core per
+node because just one process was executed per node. The heterogeneous platform
+was composed of four types of nodes. Each type of node had different
+characteristics such as the maximum CPU frequency, the number of available
+frequencies and the computational power (see Table~\ref{table:platform}). The
+characteristics of these different types of nodes are inspired by the
+specifications of real Intel processors. The heterogeneous platform had up to
+144 nodes and had nodes from the four types in equal proportions; for example,
+if a benchmark was executed on 8 nodes, 2 nodes from each type were used. Since
+the constructors of CPUs do not specify the dynamic and the static power of
+their CPUs, for each type of node they were chosen proportionally to its
+computing power (FLOPS). 
In the initial heterogeneous platform, while computing at the
+highest frequency, each node consumed an amount of power proportional to its
+computing power (which corresponds to \np[\%]{80} of its dynamic power and the
+remaining \np[\%]{20} to the static power); the same assumption was made in
+\cite{Our_first_paper,Rauber_Analytical.Modeling.for.Energy}. Finally, these
+nodes were connected via an Ethernet network with \np[Gbit/s]{1} bandwidth.
 
 \subsection{The experimental results of the scaling algorithm}
 \label{sec.res}
-
 The proposed algorithm was applied to the seven parallel NAS benchmarks (EP, CG,
-MG, FT, BT, LU and SP) and the benchmarks were executed with the three classes:
+MG, FT, BT, LU and SP) and the benchmarks were executed with the three classes:
 A, B and C. However, due to the lack of space in this paper, only the results of
-the biggest class, C, are presented while being run on different number of
-nodes, ranging from 4 to 128 or 144 nodes depending on the benchmark being
-executed. Indeed, the benchmarks CG, MG, LU, EP and FT had to be executed on $1,
-2, 4, 8, 16, 32, 64, 128$ nodes. The other benchmarks such as BT and SP had to
-be executed on $1, 4, 9, 16, 36, 64, 144$ nodes.
-
-
-
-\begin{table}[htb]
+the biggest class, C, are presented while being run on different numbers of
+nodes, ranging from 4 to 128 or 144 nodes depending on the benchmark being
+executed. Indeed, the benchmarks CG, MG, LU, EP and FT had to be executed on 1,
+2, 4, 8, 16, 32, 64, or 128 nodes. The other benchmarks such as BT and SP had
+to be executed on 1, 4, 9, 16, 36, 64, or 144 nodes.
+
+\begin{table}[!t]
  \caption{Running NAS benchmarks on 4 nodes }
  % title of Table
  \centering
-  \begin{tabular}{|*{7}{l|}}
+  \begin{tabular}{|*{7}{r|}}
    \hline
-    Program & Execution & Energy & Energy & Performance & Distance \\
-    name & time/s & consumption/J & saving\% & degradation\% & \\
+    \hspace{-2.2084pt}%
+    Program & Execution & Energy & Energy & Performance & Distance \\
+    name & time/s & consumption/J & saving\% & degradation\% & \\
    \hline
-    CG & 64.64 & 3560.39 &34.16 &6.72 &27.44 \\
-    \hline
-    MG & 18.89 & 1074.87 &35.37 &4.34 &31.03 \\
-    \hline
-    EP &79.73 &5521.04 &26.83 &3.04 &23.79 \\
-    \hline
-    LU &308.65 &21126.00 &34.00 &6.16 &27.84 \\
+    CG & 64.64 & 3560.39 & 34.16 & 6.72 & 27.44 \\
+    \hline
+    MG & 18.89 & 1074.87 & 35.37 & 4.34 & 31.03 \\
+    \hline
+    EP & 79.73 & 5521.04 & 26.83 & 3.04 & 23.79 \\
+    \hline
+    LU & 308.65 & 21126.00 & 34.00 & 6.16 & 27.84 \\
+    \hline
+    BT & 360.12 & 21505.55 & 35.36 & 8.49 & 26.87 \\
+    \hline
+    SP & 234.24 & 13572.16 & 35.22 & 5.70 & 29.52 \\
+    \hline
+    FT & 81.58 & 4151.48 & 35.58 & 0.99 & 34.59 \\
    \hline
-    BT &360.12 &21505.55 &35.36 &8.49 &26.87 \\
-    \hline
-    SP &234.24 &13572.16 &35.22 &5.70 &29.52 \\
-    \hline
-    FT &81.58 &4151.48 &35.58 &0.99 &34.59 \\
-\hline
  \end{tabular}
  \label{table:res_4n}
-\end{table}
+% \end{table}

-\begin{table}[htb]
+  \medskip
+% \begin{table}[!t]
  \caption{Running NAS benchmarks on 8 and 9 nodes }
  % title of Table
  \centering
-  \begin{tabular}{|*{7}{l|}}
+  \begin{tabular}{|*{7}{r|}}
    \hline
-    Program & Execution & Energy & Energy & Performance & Distance \\
-    name & time/s & consumption/J & saving\% & degradation\% & \\
+    \hspace{-2.2084pt}%
+    Program & Execution & Energy & Energy & Performance & Distance \\
+    name & time/s & consumption/J & saving\% & degradation\% & \\
    \hline
-    CG &36.11 &3263.49 &31.25 &7.12 &24.13 \\
-    \hline
-    MG &8.99 &953.39 &33.78 &6.41 &27.37 \\
-    \hline
-    EP &40.39 &5652.81 &27.04 &0.49 &26.55 \\
-    \hline
-    
LU &218.79 &36149.77 &28.23 &0.01 &28.22 \\ + CG & 36.11 & 3263.49 & 31.25 & 7.12 & 24.13 \\ + \hline + MG & 8.99 & 953.39 & 33.78 & 6.41 & 27.37 \\ + \hline + EP & 40.39 & 5652.81 & 27.04 & 0.49 & 26.55 \\ + \hline + LU & 218.79 & 36149.77 & 28.23 & 0.01 & 28.22 \\ + \hline + BT & 166.89 & 23207.42 & 32.32 & 7.89 & 24.43 \\ + \hline + SP & 104.73 & 18414.62 & 24.73 & 2.78 & 21.95 \\ + \hline + FT & 51.10 & 4913.26 & 31.02 & 2.54 & 28.48 \\ \hline - BT &166.89 &23207.42 &32.32 &7.89 &24.43 \\ - \hline - SP &104.73 &18414.62 &24.73 &2.78 &21.95 \\ - \hline - FT &51.10 &4913.26 &31.02 &2.54 &28.48 \\ -\hline \end{tabular} \label{table:res_8n} -\end{table} +% \end{table} -\begin{table}[htb] + \medskip +% \begin{table}[!t] \caption{Running NAS benchmarks on 16 nodes } % title of Table \centering - \begin{tabular}{|*{7}{l|}} + \begin{tabular}{|*{7}{r|}} \hline - Program & Execution & Energy & Energy & Performance & Distance \\ - name & time/s & consumption/J & saving\% & degradation\% & \\ + \hspace{-2.2084pt}% + Program & Execution & Energy & Energy & Performance & Distance \\ + name & time/s & consumption/J & saving\% & degradation\% & \\ \hline - CG &31.74 &4373.90 &26.29 &9.57 &16.72 \\ - \hline - MG &5.71 &1076.19 &32.49 &6.05 &26.44 \\ - \hline - EP &20.11 &5638.49 &26.85 &0.56 &26.29 \\ - \hline - LU &144.13 &42529.06 &28.80 &6.56 &22.24 \\ + CG & 31.74 & 4373.90 & 26.29 & 9.57 & 16.72 \\ + \hline + MG & 5.71 & 1076.19 & 32.49 & 6.05 & 26.44 \\ + \hline + EP & 20.11 & 5638.49 & 26.85 & 0.56 & 26.29 \\ + \hline + LU & 144.13 & 42529.06 & 28.80 & 6.56 & 22.24 \\ + \hline + BT & 97.29 & 22813.86 & 34.95 & 5.80 & 29.15 \\ + \hline + SP & 66.49 & 20821.67 & 22.49 & 3.82 & 18.67 \\ + \hline + FT & 37.01 & 5505.60 & 31.59 & 6.48 & 25.11 \\ \hline - BT &97.29 &22813.86 &34.95 &5.80 &29.15 \\ - \hline - SP &66.49 &20821.67 &22.49 &3.82 &18.67 \\ - \hline - FT &37.01 &5505.60 &31.59 &6.48 &25.11 \\ -\hline \end{tabular} \label{table:res_16n} -\end{table} +% \end{table} -\begin{table}[htb] + \medskip +% \begin{table}[!t] \caption{Running NAS benchmarks on 32 and 36 nodes } % title of Table \centering - \begin{tabular}{|*{7}{l|}} + \begin{tabular}{|*{7}{r|}} \hline - Program & Execution & Energy & Energy & Performance & Distance \\ - name & time/s & consumption/J & saving\% & degradation\% & \\ + \hspace{-2.2084pt}% + Program & Execution & Energy & Energy & Performance & Distance \\ + name & time/s & consumption/J & saving\% & degradation\% & \\ \hline - CG &32.35 &6704.21 &16.15 &5.30 &10.85 \\ - \hline - MG &4.30 &1355.58 &28.93 &8.85 &20.08 \\ - \hline - EP &9.96 &5519.68 &26.98 &0.02 &26.96 \\ - \hline - LU &99.93 &67463.43 &23.60 &2.45 &21.15 \\ + CG & 32.35 & 6704.21 & 16.15 & 5.30 & 10.85 \\ + \hline + MG & 4.30 & 1355.58 & 28.93 & 8.85 & 20.08 \\ + \hline + EP & 9.96 & 5519.68 & 26.98 & 0.02 & 26.96 \\ + \hline + LU & 99.93 & 67463.43 & 23.60 & 2.45 & 21.15 \\ + \hline + BT & 48.61 & 23796.97 & 34.62 & 5.83 & 28.79 \\ + \hline + SP & 46.01 & 27007.43 & 22.72 & 3.45 & 19.27 \\ + \hline + FT & 28.06 & 7142.69 & 23.09 & 2.90 & 20.19 \\ \hline - BT &48.61 &23796.97 &34.62 &5.83 &28.79 \\ - \hline - SP &46.01 &27007.43 &22.72 &3.45 &19.27 \\ - \hline - FT &28.06 &7142.69 &23.09 &2.90 &20.19 \\ -\hline \end{tabular} \label{table:res_32n} -\end{table} +% \end{table} -\begin{table}[htb] + \medskip +% \begin{table}[!t] \caption{Running NAS benchmarks on 64 nodes } % title of Table \centering - \begin{tabular}{|*{7}{l|}} + \begin{tabular}{|*{7}{r|}} \hline - Program & Execution & Energy & Energy & 
Performance & Distance \\ - name & time/s & consumption/J & saving\% & degradation\% & \\ + \hspace{-2.2084pt}% + Program & Execution & Energy & Energy & Performance & Distance \\ + name & time/s & consumption/J & saving\% & degradation\% & \\ \hline - CG &46.65 &17521.83 &8.13 &1.68 &6.45 \\ - \hline - MG &3.27 &1534.70 &29.27 &14.35 &14.92 \\ - \hline - EP &5.05 &5471.1084 &27.12 &3.11 &24.01 \\ - \hline - LU &73.92 &101339.16 &21.96 &3.67 &18.29 \\ + CG & 46.65 & 17521.83 & 8.13 & 1.68 & 6.45 \\ + \hline + MG & 3.27 & 1534.70 & 29.27 & 14.35 & 14.92 \\ + \hline + EP & 5.05 & 5471.11 & 27.12 & 3.11 & 24.01 \\ + \hline + LU & 73.92 & 101339.16 & 21.96 & 3.67 & 18.29 \\ + \hline + BT & 39.99 & 27166.71 & 32.02 & 12.28 & 19.74 \\ + \hline + SP & 52.00 & 49099.28 & 24.84 & 0.03 & 24.81 \\ + \hline + FT & 25.97 & 10416.82 & 20.15 & 4.87 & 15.28 \\ \hline - BT &39.99 &27166.71 &32.02 &12.28 &19.74 \\ - \hline - SP &52.00 &49099.28 &24.84 &0.03 &24.81 \\ - \hline - FT &25.97 &10416.82 &20.15 &4.87 &15.28 \\ -\hline \end{tabular} \label{table:res_64n} -\end{table} +% \end{table} - -\begin{table}[htb] + \medskip +% \begin{table}[!t] \caption{Running NAS benchmarks on 128 and 144 nodes } % title of Table \centering - \begin{tabular}{|*{7}{l|}} + \begin{tabular}{|*{7}{r|}} \hline - Program & Execution & Energy & Energy & Performance & Distance \\ - name & time/s & consumption/J & saving\% & degradation\% & \\ + \hspace{-2.2084pt}% + Program & Execution & Energy & Energy & Performance & Distance \\ + name & time/s & consumption/J & saving\% & degradation\% & \\ \hline - CG &56.92 &41163.36 &4.00 &1.10 &2.90 \\ - \hline - MG &3.55 &2843.33 &18.77 &10.38 &8.39 \\ - \hline - EP &2.67 &5669.66 &27.09 &0.03 &27.06 \\ - \hline - LU &51.23 &144471.90 &16.67 &2.36 &14.31 \\ + CG & 56.92 & 41163.36 & 4.00 & 1.10 & 2.90 \\ + \hline + MG & 3.55 & 2843.33 & 18.77 & 10.38 & 8.39 \\ + \hline + EP & 2.67 & 5669.66 & 27.09 & 0.03 & 27.06 \\ + \hline + LU & 51.23 & 144471.90 & 16.67 & 2.36 & 14.31 \\ + \hline + BT & 37.96 & 44243.82 & 23.18 & 1.28 & 21.90 \\ + \hline + SP & 64.53 & 115409.71 & 26.72 & 0.05 & 26.67 \\ + \hline + FT & 25.51 & 18808.72 & 12.85 & 2.84 & 10.01 \\ \hline - BT &37.96 &44243.82 &23.18 &1.28 &21.90 \\ - \hline - SP &64.53 &115409.71 &26.72 &0.05 &26.67 \\ - \hline - FT &25.51 &18808.72 &12.85 &2.84 &10.01 \\ -\hline \end{tabular} \label{table:res_128n} \end{table} -The overall energy consumption was computed for each instance according to the -energy consumption model (\ref{eq:energy}), with and without applying the -algorithm. The execution time was also measured for all these experiments. Then, -the energy saving and performance degradation percentages were computed for each -instance. The results are presented in Tables (\ref{table:res_4n}, -\ref{table:res_8n}, \ref{table:res_16n}, \ref{table:res_32n}, -\ref{table:res_64n} and \ref{table:res_128n}). All these results are the average -values from many experiments for energy savings and performance degradation. -The tables show the experimental results for running the NAS parallel benchmarks -on different number of nodes. The experiments show that the algorithm -significantly reduces the energy consumption (up to 35\%) and tries to limit the -performance degradation. They also show that the energy saving percentage -decreases when the number of computing nodes increases. This reduction is due -to the increase of the communication times compared to the execution times when -the benchmarks are run over a high number of nodes. 
Indeed, the benchmarks with
-the same class, C, are executed on different numbers of nodes, so the
-computation required for each iteration is divided by the number of computing
-nodes. On the other hand, more communications are required when increasing the
-number of nodes so the static energy increases linearly according to the
-communication time and the dynamic power is less relevant in the overall energy
-consumption. Therefore, reducing the frequency with algorithm~(\ref{HSA}) is
-less effective in reducing the overall energy savings. It can also be noticed
-that for the benchmarks EP and SP that contain little or no communications, the
-energy savings are not significantly affected by the high number of nodes. No
-experiments were conducted using bigger classes than D, because they require a
-lot of memory (more than 64GB) when being executed by the simulator on one
-machine. The maximum distance between the normalized energy curve and the
-normalized performance for each instance is also shown in the result tables. It
-decrease in the same way as the energy saving percentage. The tables also show
-that the performance degradation percentage is not significantly increased when
-the number of computing nodes is increased because the computation times are
-small when compared to the communication times.
-
-
-
-\begin{figure}
+
+\begin{figure}[!t]
  \centering
-  \subfloat[Energy saving]{%
+  \subfloat[Energy saving (\%)]{%
    \includegraphics[width=.33\textwidth]{fig/energy}\label{fig:energy}}%
-
-  \subfloat[Performance degradation ]{%
+
+  \subfloat[Performance degradation (\%)]{%
    \includegraphics[width=.33\textwidth]{fig/per_deg}\label{fig:per_deg}}
  \label{fig:avg}
  \caption{The energy and performance for all NAS benchmarks running with
    different numbers of nodes}
\end{figure}

-Figures \ref{fig:energy} and \ref{fig:per_deg} present the energy saving and
-performance degradation respectively for all the benchmarks according to the
-number of used nodes. As shown in the first plot, the energy saving percentages
-of the benchmarks MG, LU, BT and FT decrease linearly when the number of nodes
-increase. While for the EP and SP benchmarks, the energy saving percentage is
-not affected by the increase of the number of computing nodes, because in these
-benchmarks there are little or no communications. Finally, the energy saving of
-the GC benchmark significantly decrease when the number of nodes increase
-because this benchmark has more communications than the others. The second plot
-shows that the performance degradation percentages of most of the benchmarks
-decrease when they run on a big number of nodes because they spend more time
-communicating than computing, thus, scaling down the frequencies of some nodes
+The overall energy consumption was computed for each instance according to the
+energy consumption model (\ref{eq:energy}), with and without applying the
+algorithm. The execution time was also measured for all these experiments. Then,
+the energy saving and performance degradation percentages were computed for each
+instance. The results are presented in Tables~\ref{table:res_4n},
+\ref{table:res_8n}, \ref{table:res_16n}, \ref{table:res_32n},
+\ref{table:res_64n} and \ref{table:res_128n}. All these results are the average
+values from many experiments for energy savings and performance degradation.
+The tables show the experimental results for running the NAS parallel benchmarks
+on different numbers of nodes. 
The experiments show that the algorithm
+significantly reduces the energy consumption (up to \np[\%]{35}) and tries to
+limit the performance degradation. They also show that the energy saving
+percentage decreases when the number of computing nodes increases. This
+reduction is due to the increase of the communication times compared to the
+execution times when the benchmarks are run over a higher number of nodes.
+Indeed, the benchmarks with the same class, C, are executed on different numbers
+of nodes, so the computation required for each iteration is divided by the
+number of computing nodes. On the other hand, more communications are required
+when increasing the number of nodes, so the static energy increases linearly
+according to the communication time and the dynamic power is less relevant in
+the overall energy consumption. Therefore, reducing the frequency with
+Algorithm~\ref{HSA} is less effective in reducing the overall energy savings. It
+can also be noticed that for the benchmarks EP and SP that contain little or no
+communications, the energy savings are not significantly affected by the high
+number of nodes. No experiments were conducted using bigger classes than D,
+because they require a lot of memory (more than \np[GB]{64}) when being executed
+by the simulator on one machine. The maximum distance between the normalized
+energy curve and the normalized performance for each instance is also shown in
+the result tables. It decreases in the same way as the energy saving percentage.
+The tables also show that the performance degradation percentage is not
+significantly increased when the number of computing nodes is increased because
+the computation times are small when compared to the communication times.
+
+Figures~\ref{fig:energy} and \ref{fig:per_deg} present the energy saving and
+performance degradation respectively for all the benchmarks according to the
+number of used nodes. As shown in the first plot, the energy saving percentages
+of the benchmarks MG, LU, BT and FT decrease linearly when the number of nodes
+increases, while for the EP and SP benchmarks the energy saving percentage is
+not affected by the increase of the number of computing nodes, because in these
+benchmarks there are little or no communications. Finally, the energy saving of
+the CG benchmark significantly decreases when the number of nodes increases
+because this benchmark has more communications than the others. The second plot
+shows that the performance degradation percentages of most of the benchmarks
+decrease when they run on a large number of nodes because they spend more time
+communicating than computing; thus, scaling down the frequencies of some nodes
 has less effect on the performance.
-
-
-
 \subsection{The results for different power consumption scenarios}
 \label{sec.compare}
-The results of the previous section were obtained while using processors that
-consume during computation an overall power which is 80\% composed of dynamic
-power and of 20\% of static power. In this section, these ratios are changed and
-two new power scenarios are considered in order to evaluate how the proposed
-algorithm adapts itself according to the static and dynamic power values. The
-two new power scenarios are the following:
+
+The results of the previous section were obtained while using processors that
+consume during computation an overall power which is \np[\%]{80} composed of
+dynamic power and of \np[\%]{20} of static power. 
In this section, these ratios
+are changed and two new power scenarios are considered in order to evaluate how
+the proposed algorithm adapts itself according to the static and dynamic power
+values. The two new power scenarios are the following:
 \begin{itemize}
-\item 70\% of dynamic power and 30\% of static power
-\item 90\% of dynamic power and 10\% of static power
+\item \np[\%]{70} of dynamic power and \np[\%]{30} of static power
+\item \np[\%]{90} of dynamic power and \np[\%]{10} of static power
 \end{itemize}
 
-The NAS parallel benchmarks were executed again over processors that follow the
-new power scenarios. The class C of each benchmark was run over 8 or 9 nodes
-and the results are presented in Tables \ref{table:res_s1} and
-\ref{table:res_s2}. These tables show that the energy saving percentage of the
-70\%-30\% scenario is smaller for all benchmarks compared to the energy saving
-of the 90\%-10\% scenario. Indeed, in the latter more dynamic power is consumed
-when nodes are running on their maximum frequencies, thus, scaling down the
-frequency of the nodes results in higher energy savings than in the 70\%-30\%
-scenario. On the other hand, the performance degradation percentage is smaller
-in the 70\%-30\% scenario compared to the 90\%-10\% scenario. This is due to the
-higher static power percentage in the first scenario which makes it more
-relevant in the overall consumed energy. Indeed, the static energy is related
-to the execution time and if the performance is degraded the amount of consumed
-static energy directly increas. Therefore, the proposed algorithm does not
-really significantly scale down much the frequencies of the nodes in order to
-limit the increase of the execution time and thus limiting the effect of the
-consumed static energy.
+The NAS parallel benchmarks were executed again over processors that follow the
+new power scenarios. The class C of each benchmark was run over 8 or 9 nodes
+and the results are presented in Tables~\ref{table:res_s1} and
+\ref{table:res_s2}. These tables show that the energy saving percentage of the
+\np[\%]{70}-\np[\%]{30} scenario is smaller for all benchmarks compared to the
+energy saving of the \np[\%]{90}-\np[\%]{10} scenario. Indeed, in the latter
+more dynamic power is consumed when nodes are running at their maximum
+frequencies; thus, scaling down the frequency of the nodes results in higher
+energy savings than in the \np[\%]{70}-\np[\%]{30} scenario. On the other hand,
+the performance degradation percentage is smaller in the \np[\%]{70}-\np[\%]{30}
+scenario compared to the \np[\%]{90}-\np[\%]{10} scenario. This is due to the
+higher static power percentage in the first scenario, which makes it more
+relevant in the overall consumed energy. Indeed, the static energy is related
+to the execution time, and if the performance is degraded the amount of consumed
+static energy directly increases. Therefore, the proposed algorithm does not
+significantly scale down the frequencies of the nodes, in order to
+limit the increase of the execution time and thus limit the effect of the
+consumed static energy.
 
-Both new power scenarios are compared to the old one in figure
-(\ref{fig:sen_comp}). It shows the average of the performance degradation, the
-energy saving and the distances for all NAS benchmarks of class C running on 8
-or 9 nodes. 
The comparison shows that the energy saving ratio is proportional
-to the dynamic power ratio: it is increased when applying the 90\%-10\% scenario
-because at maximum frequency the dynamic energy is the most relevant in the
-overall consumed energy and can be reduced by lowering the frequency of some
-processors. On the other hand, the energy saving decreases when the 70\%-30\%
-scenario is used because the dynamic energy is less relevant in the overall
-consumed energy and lowering the frequency does not return big energy savings.
-Moreover, the average of the performance degradation is decreased when using a
-higher ratio for static power (e.g. 70\%-30\% scenario and 80\%-20\%
-scenario). Since the proposed algorithm optimizes the energy consumption when
-using a higher ratio for dynamic power the algorithm selects bigger frequency
-scaling factors that result in more energy saving but less performance, for
-example see Figure (\ref{fig:scales_comp}). The opposite happens when using a
-higher ratio for static power, the algorithm proportionally selects smaller
-scaling values which result in less energy saving but also less performance
+Both new power scenarios are compared to the old one in
+Figure~\ref{fig:sen_comp}. It shows the average of the performance degradation,
+the energy saving and the distances for all NAS benchmarks of class C running on
+8 or 9 nodes. The comparison shows that the energy saving ratio is proportional
+to the dynamic power ratio: it is increased when applying the
+\np[\%]{90}-\np[\%]{10} scenario because at maximum frequency the dynamic energy
+is the most relevant in the overall consumed energy and can be reduced by
+lowering the frequency of some processors. On the other hand, the energy saving
+decreases when the \np[\%]{70}-\np[\%]{30} scenario is used because the dynamic
+energy is less relevant in the overall consumed energy and lowering the
+frequency does not yield big energy savings. Moreover, the average of the
+performance degradation is decreased when using a higher ratio for static power
+(e.g., the \np[\%]{70}-\np[\%]{30} and \np[\%]{80}-\np[\%]{20}
+scenarios). Since the proposed algorithm optimizes the energy consumption when
+using a higher ratio for dynamic power, it selects bigger frequency
+scaling factors that result in more energy saving but less performance (see for
+example Figure~\ref{fig:scales_comp}). The opposite happens when using a
+higher ratio for static power: the algorithm proportionally selects smaller
+scaling values which result in less energy saving but also less performance
 degradation. 
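This behavior can be reproduced with a toy single-node example (illustrative
numbers only, not measured data): with $\Tcp = 1$, no communications and a
normalized power budget, the distance $\Pnorm - \Enorm$ peaks at a larger
scaling factor when the dynamic share of the power grows.
\begin{verbatim}
def distance(s, dyn, stat, tcp=1.0):
    t_new = tcp * s                  # one node, no communications
    e_new = dyn * tcp / s**2 + stat * t_new
    return tcp / t_new - e_new / (dyn * tcp + stat * tcp)

for dyn, stat in [(0.9, 0.1), (0.7, 0.3)]:
    best = max((1 + i / 10 for i in range(16)),
               key=lambda s: distance(s, dyn, stat))
    print(dyn, stat, round(best, 2), round(distance(best, dyn, stat), 3))
\end{verbatim}
The sketch selects a scaling factor of about 1.5 with a distance of about 0.12
for the \np[\%]{90}-\np[\%]{10} split, but barely more than 1 with an almost
null distance for the \np[\%]{70}-\np[\%]{30} one.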
- - \begin{table}[htb] - \caption{The results of the 70\%-30\% power scenario} +\begin{table}[!t] + \caption{The results of the \np[\%]{70}-\np[\%]{30} power scenario} % title of Table \centering - \begin{tabular}{|*{6}{l|}} + \begin{tabular}{|*{6}{r|}} \hline - Program & Energy & Energy & Performance & Distance \\ - name & consumption/J & saving\% & degradation\% & \\ + Program & Energy & Energy & Performance & Distance \\ + name & consumption/J & saving\% & degradation\% & \\ \hline - CG &4144.21 &22.42 &7.72 &14.70 \\ - \hline - MG &1133.23 &24.50 &5.34 &19.16 \\ + CG & 4144.21 & 22.42 & 7.72 & 14.70 \\ + \hline + MG & 1133.23 & 24.50 & 5.34 & 19.16 \\ \hline - EP &6170.30 &16.19 &0.02 &16.17 \\ + EP & 6170.30 & 16.19 & 0.02 & 16.17 \\ \hline - LU &39477.28 &20.43 &0.07 &20.36 \\ + LU & 39477.28 & 20.43 & 0.07 & 20.36 \\ \hline - BT &26169.55 &25.34 &6.62 &18.71 \\ + BT & 26169.55 & 25.34 & 6.62 & 18.71 \\ \hline - SP &19620.09 &19.32 &3.66 &15.66 \\ + SP & 19620.09 & 19.32 & 3.66 & 15.66 \\ \hline - FT &6094.07 &23.17 &0.36 &22.81 \\ -\hline + FT & 6094.07 & 23.17 & 0.36 & 22.81 \\ + \hline \end{tabular} \label{table:res_s1} \end{table} - - -\begin{table}[htb] - \caption{The results of the 90\%-10\% power scenario} +\begin{table}[!t] + \caption{The results of the \np[\%]{90}-\np[\%]{10} power scenario} % title of Table \centering - \begin{tabular}{|*{6}{l|}} + \begin{tabular}{|*{6}{r|}} \hline - Program & Energy & Energy & Performance & Distance \\ - name & consumption/J & saving\% & degradation\% & \\ + Program & Energy & Energy & Performance & Distance \\ + name & consumption/J & saving\% & degradation\% & \\ \hline - CG &2812.38 &36.36 &6.80 &29.56 \\ - \hline - MG &825.427 &38.35 &6.41 &31.94 \\ - \hline - EP &5281.62 &35.02 &2.68 &32.34 \\ - \hline - LU &31611.28 &39.15 &3.51 &35.64 \\ + CG & 2812.38 & 36.36 & 6.80 & 29.56 \\ + \hline + MG & 825.43 & 38.35 & 6.41 & 31.94 \\ + \hline + EP & 5281.62 & 35.02 & 2.68 & 32.34 \\ + \hline + LU & 31611.28 & 39.15 & 3.51 & 35.64 \\ + \hline + BT & 21296.46 & 36.70 & 6.60 & 30.10 \\ + \hline + SP & 15183.42 & 35.19 & 11.76 & 23.43 \\ + \hline + FT & 3856.54 & 40.80 & 5.67 & 35.13 \\ \hline - BT &21296.46 &36.70 &6.60 &30.10 \\ - \hline - SP &15183.42 &35.19 &11.76 &23.43 \\ - \hline - FT &3856.54 &40.80 &5.67 &35.13 \\ -\hline \end{tabular} \label{table:res_s2} \end{table} +\begin{table}[!t] + \caption{Comparing the proposed algorithm} + \centering + \begin{tabular}{|*{7}{r|}} + \hline + Program & \multicolumn{2}{c|}{Energy saving \%} + & \multicolumn{2}{c|}{Perf. degradation \%} + & \multicolumn{2}{c|}{Distance} \\ + \cline{2-7} + name & EDP & MaxDist & EDP & MaxDist & EDP & MaxDist \\ + \hline + CG & 27.58 & 31.25 & 5.82 & 7.12 & 21.76 & 24.13 \\ + \hline + MG & 29.49 & 33.78 & 3.74 & 6.41 & 25.75 & 27.37 \\ + \hline + LU & 19.55 & 28.33 & 0.00 & 0.01 & 19.55 & 28.22 \\ + \hline + EP & 28.40 & 27.04 & 4.29 & 0.49 & 24.11 & 26.55 \\ + \hline + BT & 27.68 & 32.32 & 6.45 & 7.87 & 21.23 & 24.43 \\ + \hline + SP & 20.52 & 24.73 & 5.21 & 2.78 & 15.31 & 21.95 \\ + \hline + FT & 27.03 & 31.02 & 2.75 & 2.54 & 24.28 & 28.48 \\ + \hline + \end{tabular} + \label{table:compare_EDP} +\end{table} -\begin{figure} +\begin{figure}[!t] \centering \subfloat[Comparison between the results on 8 nodes]{% \includegraphics[width=.33\textwidth]{fig/sen_comp}\label{fig:sen_comp}}% @@ -1106,79 +1173,68 @@ degradation. 
\includegraphics[width=.33\textwidth]{fig/three_scenarios}\label{fig:scales_comp}}
  \label{fig:comp}
  \caption{The comparison of the three power scenarios}
-\end{figure}
-
+\end{figure}
 
+\begin{figure}[!t]
+  \centering
+  \includegraphics[scale=0.5]{fig/compare_EDP.pdf}
+  \caption{Trade-off comparison for NAS benchmarks class C}
+  \label{fig:compare_EDP}
+\end{figure}
 
 \subsection{The comparison of the proposed scaling algorithm }
 \label{sec.compare_EDP}
-In this section, the scaling factors selection algorithm, called MaxDist,
-is compared to Spiliopoulos et al. algorithm \cite{Spiliopoulos_Green.governors.Adaptive.DVFS}, called EDP.
-They developed a green governor that regularly applies an online frequency selecting algorithm to reduce the energy consumed by a multicore architecture without degrading much its performance. The algorithm selects the frequencies that minimize the energy and delay products, $EDP=Enegry*Delay$ using the predicted overall energy consumption and execution time delay for each frequency.
-To fairly compare both algorithms, the same energy and execution time models, equations (\ref{eq:energy}) and (\ref{eq:fnew}), were used for both algorithms to predict the energy consumption and the execution times. Also Spiliopoulos et al. algorithm was adapted to start the search from the
-initial frequencies computed using the equation (\ref{eq:Fint}). The resulting algorithm is an exhaustive search algorithm that minimizes the EDP and has the initial frequencies values as an upper bound.
-
-Both algorithms were applied to the parallel NAS benchmarks to compare their efficiency. Table \ref{table:compare_EDP} presents the results of comparing the execution times and the energy consumptions for both versions of the NAS benchmarks while running the class C of each benchmark over 8 or 9 heterogeneous nodes. The results show that our algorithm provides better energy savings than Spiliopoulos et al. algorithm,
-on average it results in 29.76\% energy saving while their algorithm returns just 25.75\%. The average of performance degradation percentage is approximately the same for both algorithms, about 4\%.
+In this section, the scaling factors selection algorithm, called MaxDist, is
+compared to Spiliopoulos et al. algorithm
+\cite{Spiliopoulos_Green.governors.Adaptive.DVFS}, called EDP. They developed a
+green governor that regularly applies an online frequency selecting algorithm to
+reduce the energy consumed by a multicore architecture without degrading much
+its performance. The algorithm selects the frequencies that minimize the energy
+and delay product, $\mathit{EDP}=\mathit{energy}\times\mathit{delay}$, using
+the predicted overall energy consumption and execution time delay for each
+frequency. To fairly compare both algorithms, the same energy and execution
+time models, equations (\ref{eq:energy}) and (\ref{eq:fnew}), were used for both
+algorithms to predict the energy consumption and the execution times. Also,
+Spiliopoulos et al. algorithm was adapted to start the search from the initial
+frequencies computed using equation (\ref{eq:Fint}). The resulting algorithm
+is an exhaustive search algorithm that minimizes the EDP and has the initial
+frequency values as an upper bound.
+
+Both algorithms were applied to the parallel NAS benchmarks to compare their
+efficiency. 
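The difference between the two selection criteria can be made explicit with a
small sketch (hypothetical helper code; \texttt{predict} stands for the shared
energy and execution time models applied to a candidate vector of scaling
factors, and \texttt{candidates} for the bounded search space):
\begin{verbatim}
def edp_score(e, t):
    return e * t                    # EDP keeps the smallest product

def maxdist_score(e, t, e_orig, t_old):
    return t_old / t - e / e_orig   # MaxDist keeps the largest distance

best_edp = min(candidates, key=lambda v: edp_score(*predict(v)))
best_maxdist = max(candidates,
                   key=lambda v: maxdist_score(*predict(v), e_orig, t_old))
\end{verbatim}
Both scores are computed from the same predictions; only the shape of the
objective differs, which explains why the two algorithms can pick different
frequency vectors for the same benchmark.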
Table~\ref{table:compare_EDP} presents the results of comparing the
+execution times and the energy consumption for both versions of the NAS
+benchmarks while running the class C of each benchmark over 8 or 9 heterogeneous
+nodes. The results show that our algorithm provides better energy savings than
+Spiliopoulos et al. algorithm: on average it results in \np[\%]{29.76} energy
+saving while their algorithm returns just \np[\%]{25.75}. The average
+performance degradation percentage is approximately the same for both
+algorithms, about \np[\%]{4}.
 
 For all benchmarks, our algorithm outperforms Spiliopoulos et al. algorithm in
-terms of energy and performance tradeoff, see figure (\ref{fig:compare_EDP}),
+terms of energy and performance trade-off, see Figure~\ref{fig:compare_EDP},
 because it maximizes the distance between the energy saving and the performance
 degradation values while giving the same weight for both metrics.
-
-
-
-\begin{table}[h]
-  \caption{Comparing the proposed algorithm}
-  \centering
-\begin{tabular}{|l|l|l|l|l|l|l|l|}
-\hline
-\multicolumn{2}{|l|}{\multirow{2}{*}{\begin{tabular}[c]{@{}l@{}}Program \\ name\end{tabular}}} & \multicolumn{2}{l|}{Energy saving \%} & \multicolumn{2}{l|}{Perf. degradation \%} & \multicolumn{2}{l|}{Distance} \\ \cline{3-8}
-\multicolumn{2}{|l|}{} & EDP & MaxDist & EDP & MaxDist & EDP & MaxDist \\ \hline
-\multicolumn{2}{|l|}{CG} & 27.58 & 31.25 & 5.82 & 7.12 & 21.76 & 24.13 \\ \hline
-\multicolumn{2}{|l|}{MG} & 29.49 & 33.78 & 3.74 & 6.41 & 25.75 & 27.37 \\ \hline
-\multicolumn{2}{|l|}{LU} & 19.55 & 28.33 & 0.0 & 0.01 & 19.55 & 28.22 \\ \hline
-\multicolumn{2}{|l|}{EP} & 28.40 & 27.04 & 4.29 & 0.49 & 24.11 & 26.55 \\ \hline
-\multicolumn{2}{|l|}{BT} & 27.68 & 32.32 & 6.45 & 7.87 & 21.23 & 24.43 \\ \hline
-\multicolumn{2}{|l|}{SP} & 20.52 & 24.73 & 5.21 & 2.78 & 15.31 & 21.95 \\ \hline
-\multicolumn{2}{|l|}{FT} & 27.03 & 31.02 & 2.75 & 2.54 & 24.28 & 28.48 \\ \hline
-
-\end{tabular}
-\label{table:compare_EDP}
-\end{table}
-
-
-
-
-
-\begin{figure}[t]
-  \centering
-  \includegraphics[scale=0.5]{fig/compare_EDP.pdf}
-  \caption{Tradeoff comparison for NAS benchmarks class C}
-  \label{fig:compare_EDP}
-\end{figure}
-
-
 \section{Conclusion}
-\label{sec.concl}
+\label{sec.concl}
+
 In this paper, a new online frequency selecting algorithm has been presented. It
-selects the best possible vector of frequency scaling factors that gives the
-maximum distance (optimal tradeoff) between the predicted energy and the
+selects the best possible vector of frequency scaling factors that gives the
+maximum distance (optimal trade-off) between the predicted energy and the
 predicted performance curves for a heterogeneous platform. This algorithm uses a
-new energy model for measuring and predicting the energy of distributed
-iterative applications running over heterogeneous platforms. To evaluate the
+new energy model for measuring and predicting the energy of distributed
+iterative applications running over heterogeneous platforms. To evaluate the
 proposed method, it was applied on the NAS parallel benchmarks and executed over
-a heterogeneous platform simulated by Simgrid. The results of the experiments
-showed that the algorithm reduces up to 35\% the energy consumption of a message
-passing iterative method while limiting the degradation of the performance. The
-algorithm also selects different scaling factors according to the percentage of
-the computing and communication times, and according to the values of the static
-and dynamic powers of the CPUs. 
Finally, the algorithm was
-compared to Spiliopoulos et al. algorithm and the results showed that it
-outperforms their algorithm in terms of energy-time tradeoff.
+a heterogeneous platform simulated by SimGrid. The results of the experiments
+showed that the algorithm reduces the energy consumption of a message passing
+iterative method by up to \np[\%]{35} while limiting the degradation of the
+performance. The algorithm also selects different scaling factors according to
+the percentage of the computing and communication times, and according to the
+values of the static and dynamic powers of the CPUs. Finally, the algorithm was
+compared to Spiliopoulos et al. algorithm and the results showed that it
+outperforms their algorithm in terms of energy-time trade-off.
 
 In the near future, this method will be applied to real heterogeneous platforms
 to evaluate its performance in a real case study. It would also be interesting
@@ -1193,9 +1249,9 @@ the iterative system.
 \section*{Acknowledgment}
 
 This work has been partially supported by the Labex
-ACTION project (contract “ANR-11-LABX-01-01”). As a PhD student,
+ACTION project (contract ``ANR-11-LABX-01-01''). As a PhD student,
 Mr. Ahmed Fanfakh, would like to thank the University of
-Babylon (Iraq) for supporting his work.
+Babylon (Iraq) for supporting his work.
 
 
 % trigger a \newpage just before the given reference
@@ -1207,7 +1263,7 @@ Babylon (Iraq) for supporting his work.
 \bibliographystyle{IEEEtran}
 \bibliography{IEEEabrv,my_reference}
 \end{document}
-
+
 %%% Local Variables:
 %%% mode: latex
 %%% TeX-master: t
@@ -1216,6 +1272,6 @@ Babylon (Iraq) for supporting his work.
 %%% End:
 
 % LocalWords:  Fanfakh Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber
-% LocalWords:  CMOS EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex
-% LocalWords:  de badri muslim MPI TcpOld TcmOld dNew dOld cp Sopt Tcp Tcm Ps
-% LocalWords:  Scp Fmax Fdiff SimGrid GFlops Xeon EP BT
+% LocalWords:  CMOS EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex GPU
+% LocalWords:  de badri muslim MPI SimGrid GFlops Xeon EP BT GPUs CPUs AMD
+% LocalWords:  Spiliopoulos scalability