frequency that gives the best trade-off between the energy consumption and the
performance of an application must be selected.
In this paper, a new online frequency selecting algorithm for heterogeneous
platforms is presented. For each node computing the message passing iterative
application, it selects the frequency that gives the best trade-off between
energy saving and performance degradation. The algorithm has a small
overhead and works without training or profiling. It uses a new energy model for
message passing iterative applications running on a heterogeneous platform. The
proposed algorithm is evaluated on the SimGrid simulator while running the NAS
parallel benchmarks. The experiments show that it reduces the energy
consumption by up to \np[\%]{35} while limiting the performance degradation as
much as possible. Finally, the algorithm is compared to an existing method; the
comparison results show that it outperforms the latter.
\end{abstract}
al.~\cite{Naveen_Power.Efficient.Resource.Scaling} developed a method that
minimizes the value of $\mathit{energy}\times \mathit{delay}^2$ (the delay is
the sum of slack times that happen during synchronous communications) by
dynamically assigning new frequencies to the CPUs of the heterogeneous cluster.
Lizhe et al.~\cite{Lizhe_Energy.aware.parallel.task.scheduling} proposed an
algorithm that divides the executed tasks into two types: critical and
non-critical tasks. The algorithm scales down the frequency of non-critical
tasks proportionally to their slack and communication times while limiting the
performance degradation percentage to less than \np[\%]{10}.
In~\cite{Joshi_Blackbox.prediction.of.impact.of.DVFS}, the authors built a
heterogeneous cluster composed of two types of processors, Intel and AMD, and
used a gradient method to predict the impact of DVFS operations on performance.
In~\cite{Shelepov_Scheduling.on.Heterogeneous.Multicore} and
\cite{Li_Minimizing.Energy.Consumption.for.Frame.Based.Tasks}, the best
frequencies for a specified heterogeneous cluster are selected offline using
some heuristic. Chen et
al.~\cite{Chen_DVFS.under.quality.of.service.requirements} used a greedy dynamic
programming approach to minimize the power consumption of heterogeneous servers
while respecting given time constraints. This approach had considerable
overhead. In contrast to the above-described papers, this paper presents the
following contributions:
\begin{enumerate}
\item two new energy and performance models for message passing iterative
  synchronous applications running over a heterogeneous platform. Both models
  take into account communication and slack times. The models can predict the
  required energy and the execution time of the application.
\item a new online frequency selecting algorithm for heterogeneous
  platforms. The algorithm has a very small overhead and does not need any
  training or profiling. It uses a new optimization function which
  simultaneously maximizes the performance and minimizes the energy consumption
  of a message passing iterative synchronous application (a sketch of this
  selection criterion is given just after this list).
\end{enumerate}
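To make the selection criterion of the second contribution concrete, the
following minimal sketch illustrates how such an optimization function can be
evaluated. It is only an illustration and not the exact implementation of the
proposed algorithm: the \texttt{predict\_time} and \texttt{predict\_energy}
callbacks stand for the performance and energy models, and the set of candidate
frequency vectors as well as the normalization used here are simplifying
assumptions.
\begin{verbatim}
# Minimal sketch of the trade-off criterion (illustration only, not the
# exact implementation of the proposed algorithm). predict_time() and
# predict_energy() stand for the performance and energy models; the set of
# candidate frequency vectors and the normalization are assumptions.
def select_frequencies(candidates, predict_time, predict_energy):
    t_max = max(predict_time(c) for c in candidates)    # slowest prediction
    e_max = max(predict_energy(c) for c in candidates)  # highest prediction
    best, best_dist = None, float("-inf")
    for c in candidates:
        perf_norm = t_max / predict_time(c)      # normalized performance
        energy_norm = predict_energy(c) / e_max  # normalized energy
        dist = perf_norm - energy_norm           # distance to maximize
        if dist > best_dist:
            best, best_dist = c, dist
    return best
\end{verbatim}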
\section{Experimental results}
\label{sec.expe}
To evaluate the efficiency and the overall energy consumption reduction of
Algorithm~\ref{HSA}, it was applied to the NAS parallel benchmarks NPB v3.3. The
experiments were executed on the simulator SimGrid/SMPI, which offers easy tools
to create a heterogeneous platform and run message passing applications over it.
The heterogeneous platform used in the experiments had one core per node because
just one process was executed per node. The heterogeneous platform was composed
of four types of nodes. Each type of node had different characteristics such as
the maximum CPU frequency, the number of available frequencies and the
computational power (see Table~\ref{table:platform}). The characteristics of
these different types of nodes are inspired by the specifications of real Intel
processors. The heterogeneous platform had up to 144 nodes composed of the four
types in equal proportions; for example, if a benchmark was executed on 8 nodes,
2 nodes of each type were used. Since CPU manufacturers do not specify the
dynamic and the static power of their CPUs, for each type of node they were
chosen proportionally to its computing power (FLOPS). In the initial
heterogeneous platform, while computing at the highest frequency, each node
consumed an amount of power proportional to its computing power (\np[\%]{80} of
which corresponds to the dynamic power and the remaining \np[\%]{20} to the
static power); the same assumption was made in
\cite{Our_first_paper,Rauber_Analytical.Modeling.for.Energy}. Finally, these
nodes were connected via an Ethernet network with 1 Gbit/s bandwidth.
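As a simple illustration of this power assignment, the per-type powers can be
derived as sketched below. The numbers used here are hypothetical and do not
correspond to the values of Table~\ref{table:platform}; only the proportionality
to the computing power and the \np[\%]{80}/\np[\%]{20} split come from the setup
described above.
\begin{verbatim}
# Illustration of the power assignment described above. The GFLOPS values
# and the proportionality constant are hypothetical; only the 80%/20%
# dynamic/static split and the proportionality to the computing power come
# from the experimental setup.
GFLOPS = {"type 1": 40.0, "type 2": 50.0, "type 3": 60.0, "type 4": 70.0}
WATTS_PER_GFLOPS = 0.5            # hypothetical proportionality constant

power = {}
for node_type, gflops in GFLOPS.items():
    total = WATTS_PER_GFLOPS * gflops       # power at the highest frequency
    power[node_type] = {"dynamic": 0.80 * total,   # 80% dynamic power
                        "static":  0.20 * total}   # 20% static power
\end{verbatim}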
The proposed algorithm was applied to the seven parallel NAS benchmarks (EP, CG,
MG, FT, BT, LU and SP) and the benchmarks were executed with the three classes:
A, B and C. However, due to the lack of space in this paper, only the results of
the biggest class, C, are presented while being run on different numbers of
nodes, ranging from 4 to 128 or 144 nodes depending on the benchmark being
executed. Indeed, the benchmarks CG, MG, LU, EP and FT had to be executed on 1,
2, 4, 8, 16, 32, 64, or 128 nodes. The other benchmarks such as BT and SP had
to be executed on 1, 4, 9, 16, 36, 64, or 144 nodes.
\end{tabular}
\label{table:res_128n}
\end{table}
The overall energy consumption was computed for each instance according to the
energy consumption model (\ref{eq:energy}), with and without applying the
algorithm. The execution time was also measured for all these experiments. Then,
the energy saving and performance degradation percentages were computed for each
instance. The results are presented in Tables~\ref{table:res_4n},
\ref{table:res_8n}, \ref{table:res_16n}, \ref{table:res_32n},
\ref{table:res_64n} and \ref{table:res_128n}. All these results are average
values obtained from many experiments for both the energy savings and the
performance degradation.
The tables show the experimental results for running the NAS parallel benchmarks
on different numbers of nodes. The experiments show that the algorithm
significantly reduces the energy consumption (up to \np[\%]{35}) and tries to
limit the performance degradation. They also show that the energy saving
percentage decreases when the number of computing nodes increases. This
reduction is due to the increase of the communication times compared to the
execution times when the benchmarks are run over a higher number of nodes.
Indeed, the benchmarks with the same class, C, are executed on different numbers
of nodes, so the computation required for each iteration is divided by the
number of computing nodes. On the other hand, more communications are required
when increasing the number of nodes, so the static energy increases linearly
with the communication time and the dynamic power becomes less relevant in the
overall energy consumption. Therefore, reducing the frequency with
Algorithm~\ref{HSA} is less effective in reducing the overall energy
consumption. It can also be noticed that for the benchmarks EP and SP, which
contain little or no communications, the energy savings are not significantly
affected by the high number of nodes. No experiments were conducted using bigger
classes than D, because they require a lot of memory (more than 64 GB) when
being executed by the simulator on one machine. The maximum distance between the
normalized energy curve and the normalized performance curve for each instance
is also shown in the result tables. It decreases in the same way as the energy
saving percentage. The tables also show that the performance degradation
percentage is not significantly increased when the number of computing nodes is
increased because the computation times are small when compared to the
communication times.
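For clarity, the following sketch shows how the reported percentages can be
derived from a pair of runs, with and without the proposed algorithm; the
function names are illustrative only.
\begin{verbatim}
# How the reported percentages can be derived from a pair of runs
# (without / with the frequency selection algorithm); illustrative names.
def energy_saving_percent(e_without, e_with):
    # relative reduction of the consumed energy
    return 100.0 * (e_without - e_with) / e_without

def performance_degradation_percent(t_without, t_with):
    # relative increase of the execution time
    return 100.0 * (t_with - t_without) / t_without
\end{verbatim}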
\begin{figure}[!t]
\centering
  \subfloat[Energy saving (\%)]{%
    \includegraphics[width=.33\textwidth]{fig/energy}\label{fig:energy}}%
  \subfloat[Performance degradation (\%)]{%
    \includegraphics[width=.33\textwidth]{fig/per_deg}\label{fig:per_deg}}
  \caption{The energy and performance for all NAS benchmarks running with
    different numbers of nodes}
  \label{fig:avg}
\subsection{The results for different power consumption scenarios}
\label{sec.compare}
The results of the previous section were obtained with processors that consume,
during computation, an overall power composed of \np[\%]{80} of dynamic power
and \np[\%]{20} of static power. In this section, these ratios are changed and
two new power scenarios are considered in order to evaluate how the proposed
algorithm adapts itself to the static and dynamic power values. The two new
power scenarios are the following:
\begin{itemize}
\item \np[\%]{70} of dynamic power and \np[\%]{30} of static power
\item \np[\%]{90} of dynamic power and \np[\%]{10} of static power
\end{itemize}
The NAS parallel benchmarks were executed again over processors that follow the
new power scenarios. The class C of each benchmark was run over 8 or 9 nodes
and the results are presented in Tables~\ref{table:res_s1} and
\ref{table:res_s2}. These tables show that the energy saving percentage of the
\np[\%]{70}-\np[\%]{30} scenario is smaller for all benchmarks compared to the
energy saving of the \np[\%]{90}-\np[\%]{10} scenario. Indeed, in the latter
more dynamic power is consumed when nodes are running on their maximum
frequencies; thus, scaling down the frequency of the nodes results in higher
energy savings than in the \np[\%]{70}-\np[\%]{30} scenario. On the other hand,
the performance degradation percentage is smaller in the \np[\%]{70}-\np[\%]{30}
scenario compared to the \np[\%]{90}-\np[\%]{10} scenario. This is due to the
higher static power percentage in the first scenario, which makes it more
relevant in the overall consumed energy. Indeed, the static energy is related
to the execution time and if the performance is degraded the amount of consumed
static energy directly increases. Therefore, the proposed algorithm does not
significantly scale down the frequencies of the nodes, in order to limit the
increase of the execution time and thus the increase of the
consumed static energy.
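To make this dependence explicit, the consumed energy can be written, in a
simplified form of the model referenced in (\ref{eq:energy}), as the sum of a
dynamic term and a static term that grows linearly with the execution time:
\[
  E \approx E_{\mathrm{dynamic}} + P_{\mathrm{static}} \cdot T ,
\]
so any increase of the execution time $T$ caused by frequency scaling directly
increases the static term. This is only a simplified illustration and not the
full per-node model used in the experiments.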
Both new power scenarios are compared to the old one in
Figure~\ref{fig:sen_comp}. It shows the average of the performance degradation,
the energy saving and the distances for all NAS benchmarks of class C running on
8 or 9 nodes. The comparison shows that the energy saving ratio is proportional
to the dynamic power ratio: it is increased when applying the
\np[\%]{90}-\np[\%]{10} scenario because at maximum frequency the dynamic energy
is the most relevant in the overall consumed energy and can be reduced by
lowering the frequency of some processors. On the other hand, the energy saving
decreases when the \np[\%]{70}-\np[\%]{30} scenario is used because the dynamic
energy is less relevant in the overall consumed energy and lowering the
frequency does not yield big energy savings. Moreover, the average of the
performance degradation is decreased when using a higher ratio for static power
(e.g.\ the \np[\%]{70}-\np[\%]{30} and \np[\%]{80}-\np[\%]{20} scenarios).
Since the proposed algorithm optimizes the energy consumption, when a higher
ratio for dynamic power is used the algorithm selects bigger frequency scaling
factors that result in more energy saving but less performance (see for example
Figure~\ref{fig:scales_comp}). The opposite happens when using a higher ratio
for static power: the algorithm proportionally selects smaller scaling values,
which result in less energy saving but also less performance
degradation.
\begin{table}[!t]
  \caption{The results of the \np[\%]{70}-\np[\%]{30} power scenario}
% title of Table
\centering
\begin{tabular}{|*{6}{r|}}
\begin{table}[!t]
  \caption{The results of the \np[\%]{90}-\np[\%]{10} power scenario}
% title of Table
\centering
\begin{tabular}{|*{6}{r|}}
execution times and the energy consumption for both versions of the NAS
benchmarks while running the class C of each benchmark over 8 or 9 heterogeneous
nodes. The results show that our algorithm provides better energy savings than
the algorithm of Spiliopoulos et al.; on average, it results in \np[\%]{29.76}
energy saving while their algorithm yields just \np[\%]{25.75}. The average
performance degradation percentage is approximately the same for both
algorithms, about \np[\%]{4}.
For all benchmarks, our algorithm outperforms the algorithm of Spiliopoulos et al. in
\section{Conclusion}
\label{sec.concl}
In this paper, a new online frequency selecting algorithm has been presented. It
selects the best possible vector of frequency scaling factors that gives the
maximum distance (optimal trade-off) between the predicted energy and the
predicted performance curves for a heterogeneous platform. This algorithm uses a
new energy model for measuring and predicting the energy consumption of
distributed iterative applications running over heterogeneous platforms. To
evaluate the proposed method, it was applied to the NAS parallel benchmarks and
executed over a heterogeneous platform simulated by SimGrid. The results of the
experiments showed that the algorithm reduces the energy consumption of a
message passing iterative method by up to \np[\%]{35} while limiting the
degradation of the performance. The algorithm also selects different scaling
factors according to the percentage of the computing and communication times,
and according to the values of the static and dynamic powers of the CPUs.
Finally, the algorithm was compared to the algorithm of Spiliopoulos et al. and
the results showed that it outperforms their algorithm in terms of the
energy-time trade-off.
In the near future, this method will be applied to real heterogeneous platforms
to evaluate its performance in a real case study. It would also be interesting