X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/mpi-energy2.git/blobdiff_plain/1cea7b40511f60228c23a644b707615cd1559471..e2d0acd1461bf4e64de95e87f02267cc513c1094:/Heter_paper.tex diff --git a/Heter_paper.tex b/Heter_paper.tex index a604b96..8b0f807 100644 --- a/Heter_paper.tex +++ b/Heter_paper.tex @@ -90,16 +90,16 @@ execution time of an application running on that processor. Therefore, the frequency that gives the best trade-off between the energy consumption and the performance of an application must be selected. -In this paper, a new online frequency selecting algorithm for heterogeneous -platforms is presented. It selects the frequencies and tries to give the best -trade-off between energy saving and performance degradation, for each node -computing the message passing iterative application. The algorithm has a small +In this paper, a new online frequency selecting algorithm for heterogeneous +platforms is presented. It selects the frequencies and tries to give the best +trade-off between energy saving and performance degradation, for each node +computing the message passing iterative application. The algorithm has a small overhead and works without training or profiling. It uses a new energy model for -message passing iterative applications running on a heterogeneous platform. The -proposed algorithm is evaluated on the SimGrid simulator while running the NAS -parallel benchmarks. The experiments show that it reduces the energy -consumption by up to 35\% while limiting the performance degradation as much as -possible. Finally, the algorithm is compared to an existing method, the +message passing iterative applications running on a heterogeneous platform. The +proposed algorithm is evaluated on the SimGrid simulator while running the NAS +parallel benchmarks. The experiments show that it reduces the energy +consumption by up to \np[\%]{35} while limiting the performance degradation as +much as possible. Finally, the algorithm is compared to an existing method, the comparison results showing that it outperforms the latter. \end{abstract} @@ -207,30 +207,35 @@ consumption of this type of platform. Naveen et al.~\cite{Naveen_Power.Efficient.Resource.Scaling} developed a method that minimizes the value of $\mathit{energy}\times \mathit{delay}^2$ (the delay is the sum of slack times that happen during synchronous communications) by -dynamically assigning new frequencies to the CPUs of the heterogeneous -cluster. Lizhe et al.~\cite{Lizhe_Energy.aware.parallel.task.scheduling} -proposed an algorithm that divides the executed tasks into two types: the -critical and non critical tasks. The algorithm scales down the frequency of non -critical tasks proportionally to their slack and communication times while -limiting the performance degradation percentage to less than -10\%. In~\cite{Joshi_Blackbox.prediction.of.impact.of.DVFS}, they developed a +dynamically assigning new frequencies to the CPUs of the heterogeneous cluster. +Lizhe et al.~\cite{Lizhe_Energy.aware.parallel.task.scheduling} proposed an +algorithm that divides the executed tasks into two types: the critical and non +critical tasks. The algorithm scales down the frequency of non critical tasks +proportionally to their slack and communication times while limiting the +performance degradation percentage to less than \np[\%]{10}. +In~\cite{Joshi_Blackbox.prediction.of.impact.of.DVFS}, they developed a heterogeneous cluster composed of two types of Intel and AMD processors. They use a gradient method to predict the impact of DVFS operations on performance. In~\cite{Shelepov_Scheduling.on.Heterogeneous.Multicore} and \cite{Li_Minimizing.Energy.Consumption.for.Frame.Based.Tasks}, the best frequencies for a specified heterogeneous cluster are selected offline using -some heuristic. Chen et +some heuristic. Chen et al.~\cite{Chen_DVFS.under.quality.of.service.requirements} used a greedy dynamic programming approach to minimize the power consumption of heterogeneous servers -while respecting given time constraints. This approach had considerable +while respecting given time constraints. This approach had considerable overhead. In contrast to the above described papers, this paper presents the following contributions : \begin{enumerate} -\item two new energy and performance models for message passing iterative synchronous applications running over - a heterogeneous platform. Both models take into account communication and slack times. The models can predict the required energy and the execution time of the application. +\item two new energy and performance models for message passing iterative + synchronous applications running over a heterogeneous platform. Both models + take into account communication and slack times. The models can predict the + required energy and the execution time of the application. -\item a new online frequency selecting algorithm for heterogeneous platforms. The algorithm has a very small - overhead and does not need any training or profiling. It uses a new optimization function which simultaneously maximizes the performance and minimizes the energy consumption of a message passing iterative synchronous application. +\item a new online frequency selecting algorithm for heterogeneous + platforms. The algorithm has a very small overhead and does not need any + training or profiling. It uses a new optimization function which + simultaneously maximizes the performance and minimizes the energy consumption + of a message passing iterative synchronous application. \end{enumerate} @@ -722,26 +727,26 @@ scaling factors that gives the results of the next sections. \section{Experimental results} \label{sec.expe} -To evaluate the efficiency and the overall energy consumption reduction of +To evaluate the efficiency and the overall energy consumption reduction of Algorithm~\ref{HSA}, it was applied to the NAS parallel benchmarks NPB v3.3. The -experiments were executed on the simulator SimGrid/SMPI which offers easy tools +experiments were executed on the simulator SimGrid/SMPI which offers easy tools to create a heterogeneous platform and run message passing applications over it. -The heterogeneous platform that was used in the experiments, had one core per +The heterogeneous platform that was used in the experiments, had one core per node because just one process was executed per node. The heterogeneous platform -was composed of four types of nodes. Each type of nodes had different -characteristics such as the maximum CPU frequency, the number of available -frequencies and the computational power, see Table~\ref{table:platform}. The -characteristics of these different types of nodes are inspired from the -specifications of real Intel processors. The heterogeneous platform had up to +was composed of four types of nodes. Each type of nodes had different +characteristics such as the maximum CPU frequency, the number of available +frequencies and the computational power, see Table~\ref{table:platform}. The +characteristics of these different types of nodes are inspired from the +specifications of real Intel processors. The heterogeneous platform had up to 144 nodes and had nodes from the four types in equal proportions, for example if a benchmark was executed on 8 nodes, 2 nodes from each type were used. Since the -constructors of CPUs do not specify the dynamic and the static power of their -CPUs, for each type of node they were chosen proportionally to its computing -power (FLOPS). In the initial heterogeneous platform, while computing with -highest frequency, each node consumed an amount of power proportional to its -computing power (which corresponds to 80\% of its dynamic power and the -remaining 20\% to the static power), the same assumption was made in -\cite{Our_first_paper,Rauber_Analytical.Modeling.for.Energy}. Finally, These +constructors of CPUs do not specify the dynamic and the static power of their +CPUs, for each type of node they were chosen proportionally to its computing +power (FLOPS). In the initial heterogeneous platform, while computing with +highest frequency, each node consumed an amount of power proportional to its +computing power (which corresponds to \np[\%]{80} of its dynamic power and the +remaining \np[\%]{20} to the static power), the same assumption was made in +\cite{Our_first_paper,Rauber_Analytical.Modeling.for.Energy}. Finally, These nodes were connected via an Ethernet network with 1 Gbit/s bandwidth. @@ -780,13 +785,13 @@ nodes were connected via an Ethernet network with 1 Gbit/s bandwidth. The proposed algorithm was applied to the seven parallel NAS benchmarks (EP, CG, -MG, FT, BT, LU and SP) and the benchmarks were executed with the three classes: +MG, FT, BT, LU and SP) and the benchmarks were executed with the three classes: A, B and C. However, due to the lack of space in this paper, only the results of -the biggest class, C, are presented while being run on different number of -nodes, ranging from 4 to 128 or 144 nodes depending on the benchmark being -executed. Indeed, the benchmarks CG, MG, LU, EP and FT had to be executed on $1, -2, 4, 8, 16, 32, 64, 128$ nodes. The other benchmarks such as BT and SP had to -be executed on $1, 4, 9, 16, 36, 64, 144$ nodes. +the biggest class, C, are presented while being run on different number of +nodes, ranging from 4 to 128 or 144 nodes depending on the benchmark being +executed. Indeed, the benchmarks CG, MG, LU, EP and FT had to be executed on 1, +2, 4, 8, 16, 32, 64, or 128 nodes. The other benchmarks such as BT and SP had +to be executed on 1, 4, 9, 16, 36, 64, or 144 nodes. @@ -962,47 +967,47 @@ be executed on $1, 4, 9, 16, 36, 64, 144$ nodes. \end{tabular} \label{table:res_128n} \end{table} -The overall energy consumption was computed for each instance according to the -energy consumption model (\ref{eq:energy}), with and without applying the +The overall energy consumption was computed for each instance according to the +energy consumption model (\ref{eq:energy}), with and without applying the algorithm. The execution time was also measured for all these experiments. Then, the energy saving and performance degradation percentages were computed for each -instance. The results are presented in Tables~\ref{table:res_4n}, -\ref{table:res_8n}, \ref{table:res_16n}, \ref{table:res_32n}, +instance. The results are presented in Tables~\ref{table:res_4n}, +\ref{table:res_8n}, \ref{table:res_16n}, \ref{table:res_32n}, \ref{table:res_64n} and \ref{table:res_128n}. All these results are the average -values from many experiments for energy savings and performance degradation. +values from many experiments for energy savings and performance degradation. The tables show the experimental results for running the NAS parallel benchmarks -on different number of nodes. The experiments show that the algorithm -significantly reduces the energy consumption (up to 35\%) and tries to limit the -performance degradation. They also show that the energy saving percentage -decreases when the number of computing nodes increases. This reduction is due -to the increase of the communication times compared to the execution times when -the benchmarks are run over a high number of nodes. Indeed, the benchmarks with -the same class, C, are executed on different numbers of nodes, so the -computation required for each iteration is divided by the number of computing -nodes. On the other hand, more communications are required when increasing the -number of nodes so the static energy increases linearly according to the -communication time and the dynamic power is less relevant in the overall energy -consumption. Therefore, reducing the frequency with Algorithm~\ref{HSA} is -less effective in reducing the overall energy savings. It can also be noticed -that for the benchmarks EP and SP that contain little or no communications, the -energy savings are not significantly affected by the high number of nodes. No -experiments were conducted using bigger classes than D, because they require a -lot of memory (more than 64GB) when being executed by the simulator on one -machine. The maximum distance between the normalized energy curve and the -normalized performance for each instance is also shown in the result tables. It -decrease in the same way as the energy saving percentage. The tables also show -that the performance degradation percentage is not significantly increased when -the number of computing nodes is increased because the computation times are -small when compared to the communication times. +on different number of nodes. The experiments show that the algorithm +significantly reduces the energy consumption (up to \np[\%]{35}) and tries to +limit the performance degradation. They also show that the energy saving +percentage decreases when the number of computing nodes increases. This +reduction is due to the increase of the communication times compared to the +execution times when the benchmarks are run over a higher number of nodes. +Indeed, the benchmarks with the same class, C, are executed on different numbers +of nodes, so the computation required for each iteration is divided by the +number of computing nodes. On the other hand, more communications are required +when increasing the number of nodes so the static energy increases linearly +according to the communication time and the dynamic power is less relevant in +the overall energy consumption. Therefore, reducing the frequency with +Algorithm~\ref{HSA} is less effective in reducing the overall energy savings. It +can also be noticed that for the benchmarks EP and SP that contain little or no +communications, the energy savings are not significantly affected by the high +number of nodes. No experiments were conducted using bigger classes than D, +because they require a lot of memory (more than 64GB) when being executed by the +simulator on one machine. The maximum distance between the normalized energy +curve and the normalized performance for each instance is also shown in the +result tables. It decrease in the same way as the energy saving percentage. The +tables also show that the performance degradation percentage is not +significantly increased when the number of computing nodes is increased because +the computation times are small when compared to the communication times. \begin{figure}[!t] \centering - \subfloat[Energy saving]{% + \subfloat[Energy saving (\%)]{% \includegraphics[width=.33\textwidth]{fig/energy}\label{fig:energy}}% - \subfloat[Performance degradation ]{% + \subfloat[Performance degradation (\%)]{% \includegraphics[width=.33\textwidth]{fig/per_deg}\label{fig:per_deg}} \label{fig:avg} \caption{The energy and performance for all NAS benchmarks running with a different number of nodes} @@ -1027,59 +1032,61 @@ has less effect on the performance. \subsection{The results for different power consumption scenarios} \label{sec.compare} -The results of the previous section were obtained while using processors that -consume during computation an overall power which is 80\% composed of dynamic -power and of 20\% of static power. In this section, these ratios are changed and -two new power scenarios are considered in order to evaluate how the proposed -algorithm adapts itself according to the static and dynamic power values. The -two new power scenarios are the following: +The results of the previous section were obtained while using processors that +consume during computation an overall power which is \np[\%]{80} composed of +dynamic power and of \np[\%]{20} of static power. In this section, these ratios +are changed and two new power scenarios are considered in order to evaluate how +the proposed algorithm adapts itself according to the static and dynamic power +values. The two new power scenarios are the following: \begin{itemize} -\item 70\% of dynamic power and 30\% of static power -\item 90\% of dynamic power and 10\% of static power +\item \np[\%]{70} of dynamic power and \np[\%]{30} of static power +\item \np[\%]{90} of dynamic power and \np[\%]{10} of static power \end{itemize} -The NAS parallel benchmarks were executed again over processors that follow the -new power scenarios. The class C of each benchmark was run over 8 or 9 nodes -and the results are presented in Tables~\ref{table:res_s1} and -\ref{table:res_s2}. These tables show that the energy saving percentage of the -70\%-30\% scenario is smaller for all benchmarks compared to the energy saving -of the 90\%-10\% scenario. Indeed, in the latter more dynamic power is consumed -when nodes are running on their maximum frequencies, thus, scaling down the -frequency of the nodes results in higher energy savings than in the 70\%-30\% -scenario. On the other hand, the performance degradation percentage is smaller -in the 70\%-30\% scenario compared to the 90\%-10\% scenario. This is due to the -higher static power percentage in the first scenario which makes it more -relevant in the overall consumed energy. Indeed, the static energy is related -to the execution time and if the performance is degraded the amount of consumed -static energy directly increases. Therefore, the proposed algorithm does not -really significantly scale down much the frequencies of the nodes in order to -limit the increase of the execution time and thus limiting the effect of the +The NAS parallel benchmarks were executed again over processors that follow the +new power scenarios. The class C of each benchmark was run over 8 or 9 nodes +and the results are presented in Tables~\ref{table:res_s1} and +\ref{table:res_s2}. These tables show that the energy saving percentage of the +\np[\%]{70}-\np[\%]{30} scenario is smaller for all benchmarks compared to the +energy saving of the \np[\%]{90}-\np[\%]{10} scenario. Indeed, in the latter +more dynamic power is consumed when nodes are running on their maximum +frequencies, thus, scaling down the frequency of the nodes results in higher +energy savings than in the \np[\%]{70}-\np[\%]{30} scenario. On the other hand, +the performance degradation percentage is smaller in the \np[\%]{70}-\np[\%]{30} +scenario compared to the \np[\%]{90}-\np[\%]{10} scenario. This is due to the +higher static power percentage in the first scenario which makes it more +relevant in the overall consumed energy. Indeed, the static energy is related +to the execution time and if the performance is degraded the amount of consumed +static energy directly increases. Therefore, the proposed algorithm does not +really significantly scale down much the frequencies of the nodes in order to +limit the increase of the execution time and thus limiting the effect of the consumed static energy. -Both new power scenarios are compared to the old one in -Figure~\ref{fig:sen_comp}. It shows the average of the performance degradation, the -energy saving and the distances for all NAS benchmarks of class C running on 8 -or 9 nodes. The comparison shows that the energy saving ratio is proportional -to the dynamic power ratio: it is increased when applying the 90\%-10\% scenario -because at maximum frequency the dynamic energy is the most relevant in the -overall consumed energy and can be reduced by lowering the frequency of some -processors. On the other hand, the energy saving decreases when the 70\%-30\% -scenario is used because the dynamic energy is less relevant in the overall -consumed energy and lowering the frequency does not return big energy savings. -Moreover, the average of the performance degradation is decreased when using a -higher ratio for static power (e.g. 70\%-30\% scenario and 80\%-20\% -scenario). Since the proposed algorithm optimizes the energy consumption when -using a higher ratio for dynamic power the algorithm selects bigger frequency -scaling factors that result in more energy saving but less performance, for -example see Figure~\ref{fig:scales_comp}. The opposite happens when using a -higher ratio for static power, the algorithm proportionally selects smaller -scaling values which result in less energy saving but also less performance +Both new power scenarios are compared to the old one in +Figure~\ref{fig:sen_comp}. It shows the average of the performance degradation, +the energy saving and the distances for all NAS benchmarks of class C running on +8 or 9 nodes. The comparison shows that the energy saving ratio is proportional +to the dynamic power ratio: it is increased when applying the +\np[\%]{90}-\np[\%]{10} scenario because at maximum frequency the dynamic energy +is the most relevant in the overall consumed energy and can be reduced by +lowering the frequency of some processors. On the other hand, the energy saving +decreases when the \np[\%]{70}-\np[\%]{30} scenario is used because the dynamic +energy is less relevant in the overall consumed energy and lowering the +frequency does not return big energy savings. Moreover, the average of the +performance degradation is decreased when using a higher ratio for static power +(e.g. \np[\%]{70}-\np[\%]{30} scenario and \np[\%]{80}-\np[\%]{20} +scenario). Since the proposed algorithm optimizes the energy consumption when +using a higher ratio for dynamic power the algorithm selects bigger frequency +scaling factors that result in more energy saving but less performance, for +example see Figure~\ref{fig:scales_comp}. The opposite happens when using a +higher ratio for static power, the algorithm proportionally selects smaller +scaling values which result in less energy saving but also less performance degradation. \begin{table}[!t] - \caption{The results of the 70\%-30\% power scenario} + \caption{The results of the \np[\%]{70}-\np[\%]{30} power scenario} % title of Table \centering \begin{tabular}{|*{6}{r|}} @@ -1108,7 +1115,7 @@ degradation. \begin{table}[!t] - \caption{The results of the 90\%-10\% power scenario} + \caption{The results of the \np[\%]{90}-\np[\%]{10} power scenario} % title of Table \centering \begin{tabular}{|*{6}{r|}} @@ -1195,9 +1202,10 @@ efficiency. Table~\ref{table:compare_EDP} presents the results of comparing the execution times and the energy consumption for both versions of the NAS benchmarks while running the class C of each benchmark over 8 or 9 heterogeneous nodes. The results show that our algorithm provides better energy savings than -Spiliopoulos et al. algorithm, on average it results in 29.76\% energy saving -while their algorithm returns just 25.75\%. The average of performance -degradation percentage is approximately the same for both algorithms, about 4\%. +Spiliopoulos et al. algorithm, on average it results in \np[\%]{29.76} energy +saving while their algorithm returns just \np[\%]{25.75}. The average of +performance degradation percentage is approximately the same for both +algorithms, about \np[\%]{4}. For all benchmarks, our algorithm outperforms Spiliopoulos et al. algorithm in @@ -1209,20 +1217,20 @@ degradation values while giving the same weight for both metrics. \section{Conclusion} \label{sec.concl} In this paper, a new online frequency selecting algorithm has been presented. It -selects the best possible vector of frequency scaling factors that gives the -maximum distance (optimal trade-off) between the predicted energy and the +selects the best possible vector of frequency scaling factors that gives the +maximum distance (optimal trade-off) between the predicted energy and the predicted performance curves for a heterogeneous platform. This algorithm uses a -new energy model for measuring and predicting the energy of distributed -iterative applications running over heterogeneous platforms. To evaluate the +new energy model for measuring and predicting the energy of distributed +iterative applications running over heterogeneous platforms. To evaluate the proposed method, it was applied on the NAS parallel benchmarks and executed over -a heterogeneous platform simulated by SimGrid. The results of the experiments -showed that the algorithm reduces up to 35\% the energy consumption of a message -passing iterative method while limiting the degradation of the performance. The -algorithm also selects different scaling factors according to the percentage of -the computing and communication times, and according to the values of the static -and dynamic powers of the CPUs. Finally, the algorithm was compared to -Spiliopoulos et al. algorithm and the results showed that it outperforms their -algorithm in terms of energy-time trade-off. +a heterogeneous platform simulated by SimGrid. The results of the experiments +showed that the algorithm reduces up to \np[\%]{35} the energy consumption of a +message passing iterative method while limiting the degradation of the +performance. The algorithm also selects different scaling factors according to +the percentage of the computing and communication times, and according to the +values of the static and dynamic powers of the CPUs. Finally, the algorithm was +compared to Spiliopoulos et al. algorithm and the results showed that it +outperforms their algorithm in terms of energy-time trade-off. In the near future, this method will be applied to real heterogeneous platforms to evaluate its performance in a real study case. It would also be interesting