X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/mpi-energy2.git/blobdiff_plain/054bc9f5e26028cd84aa816c5c06ed131959fdde..93836a322514b561c0e5f2c9a2fcdf584617852e:/Heter_paper.tex?ds=sidebyside diff --git a/Heter_paper.tex b/Heter_paper.tex index 9d743db..5d28377 100644 --- a/Heter_paper.tex +++ b/Heter_paper.tex @@ -63,8 +63,7 @@ Arnaud Giersch } \IEEEauthorblockA{% - FEMTO-ST Institute\\ - University of Franche-Comté\\ + FEMTO-ST Institute, University of Franche-Comte\\ IUT de Belfort-Montbéliard, 19 avenue du Maréchal Juin, BP 527, 90016 Belfort cedex, France\\ % Telephone: \mbox{+33 3 84 58 77 86}, % Raphaël @@ -82,16 +81,15 @@ platforms many techniques have been used. Dynamic voltage and frequency scaling (DVFS) is one of them. It reduces the frequency of a CPU to lower its energy consumption. However, lowering the frequency of a CPU might increase the execution time of an application running on that processor. Therefore, the -frequency that gives the best tradeoff between the energy consumption and the -performance of an application must be selected. - +frequency that gives the best trade-off between the energy consumption and the +performance of an application must be selected.\\ In this paper, a new online frequencies selecting algorithm for heterogeneous platforms is presented. It selects the frequency which tries to give the best -tradeoff between energy saving and performance degradation, for each node +trade-off between energy saving and performance degradation, for each node computing the message passing iterative application. The algorithm has a small overhead and works without training or profiling. It uses a new energy model for message passing iterative applications running on a heterogeneous platform. The -proposed algorithm is evaluated on the Simgrid simulator while running the NAS +proposed algorithm is evaluated on the SimGrid simulator while running the NAS parallel benchmarks. The experiments show that it reduces the energy consumption by up to 35\% while limiting the performance degradation as much as possible. Finally, the algorithm is compared to an existing method, the @@ -127,7 +125,7 @@ processor by lowering its frequency the number of FLOPS executed by the processor which might increase the execution time of the application running over that processor. Therefore, researchers use different optimization strategies to select the frequency that gives the best -tradeoff between the energy reduction and performance degradation ratio. In +trade-off between the energy reduction and performance degradation ratio. In \cite{Our_first_paper}, a frequency selecting algorithm was proposed to reduce the energy consumption of message passing iterative applications running over homogeneous platforms. The results of the experiments show significant energy @@ -195,7 +193,7 @@ efficiency than other clusters only composed of CPUs. The work presented in this paper concerns the second type of platform, with heterogeneous CPUs. Many methods were conceived to reduce the energy consumption of this type of platform. Naveen et al.~\cite{Naveen_Power.Efficient.Resource.Scaling} -developed a method that minimizes the value of $energy*delay^2$ (the delay is the sum of slack times that happen during synchronous communications) by dynamically assigning new frequencies to the CPUs of the heterogeneous cluster. Lizhe et al.~\cite{Lizhe_Energy.aware.parallel.task.scheduling} proposed +developed a method that minimizes the value of $energy\cdot delay^2$ (the delay is the sum of slack times that happen during synchronous communications) by dynamically assigning new frequencies to the CPUs of the heterogeneous cluster. Lizhe et al.~\cite{Lizhe_Energy.aware.parallel.task.scheduling} proposed an algorithm that divides the executed tasks into two types: the critical and non critical tasks. The algorithm scales down the frequency of non critical tasks proportionally to their slack and communication times while limiting the performance degradation percentage to less than 10\%. In~\cite{Joshi_Blackbox.prediction.of.impact.of.DVFS}, they developed a heterogeneous cluster composed of two types @@ -240,7 +238,7 @@ nodes to finish their computations (see Figure~(\ref{fig:heter})). Therefore, the overall execution time of the program is the execution time of the slowest task which has the highest computation time and no slack time. - \begin{figure}[t] + \begin{figure}[!t] \centering \includegraphics[scale=0.6]{fig/commtasks} \caption{Parallel tasks on a heterogeneous platform} @@ -285,7 +283,7 @@ vector of scaling factors can be predicted using (\ref{eq:perf}). \textit T_\textit{new} = \max_{i=1,2,\dots,N} ({TcpOld_{i}} \cdot S_{i}) + MinTcm \end{equation} -Where:\\ +Where: \begin{equation} \label{eq:perf2} MinTcm = \min_{i=1,2,\dots,N} (Tcm_i) @@ -486,7 +484,7 @@ normalized execution time is inverted which gives the normalized performance equ \end{multline} -\begin{figure} +\begin{figure}[!t] \centering \subfloat[Homogeneous platform]{% \includegraphics[width=.33\textwidth]{fig/homo}\label{fig:r1}}% @@ -590,7 +588,7 @@ words, until they reach the higher bound. It can also be noticed that the higher the difference between the faster nodes and the slower nodes is, the bigger the maximum distance between the energy curve and the performance curve is while the scaling factors are varying which results in bigger energy savings. -\begin{figure}[t] +\begin{figure}[!t] \centering \includegraphics[scale=0.5]{fig/start_freq} \caption{Selecting the initial frequencies} @@ -714,10 +712,10 @@ highest frequency, each node consumed an amount of power proportional to its computing power (which corresponds to 80\% of its dynamic power and the remaining 20\% to the static power), the same assumption was made in \cite{Our_first_paper,Rauber_Analytical.Modeling.for.Energy}. Finally, These -nodes were connected via an ethernet network with 1 Gbit/s bandwidth. +nodes were connected via an Ethernet network with 1 Gbit/s bandwidth. -\begin{table}[htb] +\begin{table}[!t] \caption{Heterogeneous nodes characteristics} % title of Table \centering @@ -762,7 +760,7 @@ be executed on $1, 4, 9, 16, 36, 64, 144$ nodes. -\begin{table}[htb] +\begin{table}[!t] \caption{Running NAS benchmarks on 4 nodes } % title of Table \centering @@ -787,9 +785,10 @@ be executed on $1, 4, 9, 16, 36, 64, 144$ nodes. \hline \end{tabular} \label{table:res_4n} -\end{table} +% \end{table} -\begin{table}[htb] +\medskip +% \begin{table}[!t] \caption{Running NAS benchmarks on 8 and 9 nodes } % title of Table \centering @@ -814,9 +813,10 @@ be executed on $1, 4, 9, 16, 36, 64, 144$ nodes. \hline \end{tabular} \label{table:res_8n} -\end{table} +% \end{table} -\begin{table}[htb] +\medskip +% \begin{table}[!t] \caption{Running NAS benchmarks on 16 nodes } % title of Table \centering @@ -841,9 +841,10 @@ be executed on $1, 4, 9, 16, 36, 64, 144$ nodes. \hline \end{tabular} \label{table:res_16n} -\end{table} +% \end{table} -\begin{table}[htb] +\medskip +% \begin{table}[!t] \caption{Running NAS benchmarks on 32 and 36 nodes } % title of Table \centering @@ -868,9 +869,10 @@ be executed on $1, 4, 9, 16, 36, 64, 144$ nodes. \hline \end{tabular} \label{table:res_32n} -\end{table} +% \end{table} -\begin{table}[htb] +\medskip +% \begin{table}[!t] \caption{Running NAS benchmarks on 64 nodes } % title of Table \centering @@ -895,10 +897,10 @@ be executed on $1, 4, 9, 16, 36, 64, 144$ nodes. \hline \end{tabular} \label{table:res_64n} -\end{table} - +% \end{table} -\begin{table}[htb] +\medskip +% \begin{table}[!t] \caption{Running NAS benchmarks on 128 and 144 nodes } % title of Table \centering @@ -959,7 +961,7 @@ small when compared to the communication times. -\begin{figure} +\begin{figure}[!t] \centering \subfloat[Energy saving]{% \includegraphics[width=.33\textwidth]{fig/energy}\label{fig:energy}}% @@ -1014,7 +1016,7 @@ in the 70\%-30\% scenario compared to the 90\%-10\% scenario. This is due to the higher static power percentage in the first scenario which makes it more relevant in the overall consumed energy. Indeed, the static energy is related to the execution time and if the performance is degraded the amount of consumed -static energy directly increas. Therefore, the proposed algorithm does not +static energy directly increases. Therefore, the proposed algorithm does not really significantly scale down much the frequencies of the nodes in order to limit the increase of the execution time and thus limiting the effect of the consumed static energy. @@ -1040,7 +1042,7 @@ scaling values which result in less energy saving but also less performance degradation. - \begin{table}[htb] + \begin{table}[!t] \caption{The results of the 70\%-30\% power scenario} % title of Table \centering @@ -1069,7 +1071,7 @@ degradation. -\begin{table}[htb] +\begin{table}[!t] \caption{The results of the 90\%-10\% power scenario} % title of Table \centering @@ -1097,7 +1099,7 @@ degradation. \end{table} -\begin{figure} +\begin{figure}[!t] \centering \subfloat[Comparison between the results on 8 nodes]{% \includegraphics[width=.33\textwidth]{fig/sen_comp}\label{fig:sen_comp}}% @@ -1115,23 +1117,23 @@ degradation. \label{sec.compare_EDP} In this section, the scaling factors selection algorithm, called MaxDist, is compared to Spiliopoulos et al. algorithm \cite{Spiliopoulos_Green.governors.Adaptive.DVFS}, called EDP. -They developed a green governor that regularly applies an online frequency selecting algorithm to reduce the energy consumed by a multicore architecture without degrading much its performance. The algorithm selects the frequencies that minimize the energy and delay products, $EDP=Enegry*Delay$ using the predicted overall energy consumption and execution time delay for each frequency. +They developed a green governor that regularly applies an online frequency selecting algorithm to reduce the energy consumed by a multicore architecture without degrading much its performance. The algorithm selects the frequencies that minimize the energy and delay products, $EDP=Energy\cdot Delay$ using the predicted overall energy consumption and execution time delay for each frequency. To fairly compare both algorithms, the same energy and execution time models, equations (\ref{eq:energy}) and (\ref{eq:fnew}), were used for both algorithms to predict the energy consumption and the execution times. Also Spiliopoulos et al. algorithm was adapted to start the search from the initial frequencies computed using the equation (\ref{eq:Fint}). The resulting algorithm is an exhaustive search algorithm that minimizes the EDP and has the initial frequencies values as an upper bound. -Both algorithms were applied to the parallel NAS benchmarks to compare their efficiency. Table \ref{table:compare_EDP} presents the results of comparing the execution times and the energy consumptions for both versions of the NAS benchmarks while running the class C of each benchmark over 8 or 9 heterogeneous nodes. The results show that our algorithm provides better energy savings than Spiliopoulos et al. algorithm, +Both algorithms were applied to the parallel NAS benchmarks to compare their efficiency. Table \ref{table:compare_EDP} presents the results of comparing the execution times and the energy consumption for both versions of the NAS benchmarks while running the class C of each benchmark over 8 or 9 heterogeneous nodes. The results show that our algorithm provides better energy savings than Spiliopoulos et al. algorithm, on average it results in 29.76\% energy saving while their algorithm returns just 25.75\%. The average of performance degradation percentage is approximately the same for both algorithms, about 4\%. For all benchmarks, our algorithm outperforms Spiliopoulos et al. algorithm in -terms of energy and performance tradeoff, see figure (\ref{fig:compare_EDP}), +terms of energy and performance trade-off, see figure (\ref{fig:compare_EDP}), because it maximizes the distance between the energy saving and the performance degradation values while giving the same weight for both metrics. -\begin{table}[h] +\begin{table}[!t] \caption{Comparing the proposed algorithm} \centering \begin{tabular}{|l|l|l|l|l|l|l|l|} @@ -1154,10 +1156,10 @@ degradation values while giving the same weight for both metrics. -\begin{figure}[t] +\begin{figure}[!t] \centering \includegraphics[scale=0.5]{fig/compare_EDP.pdf} - \caption{Tradeoff comparison for NAS benchmarks class C} + \caption{Trade-off comparison for NAS benchmarks class C} \label{fig:compare_EDP} \end{figure} @@ -1166,19 +1168,19 @@ degradation values while giving the same weight for both metrics. \label{sec.concl} In this paper, a new online frequency selecting algorithm has been presented. It selects the best possible vector of frequency scaling factors that gives the -maximum distance (optimal tradeoff) between the predicted energy and the +maximum distance (optimal trade-off) between the predicted energy and the predicted performance curves for a heterogeneous platform. This algorithm uses a new energy model for measuring and predicting the energy of distributed iterative applications running over heterogeneous platforms. To evaluate the proposed method, it was applied on the NAS parallel benchmarks and executed over -a heterogeneous platform simulated by Simgrid. The results of the experiments +a heterogeneous platform simulated by SimGrid. The results of the experiments showed that the algorithm reduces up to 35\% the energy consumption of a message passing iterative method while limiting the degradation of the performance. The algorithm also selects different scaling factors according to the percentage of the computing and communication times, and according to the values of the static and dynamic powers of the CPUs. Finally, the algorithm was compared to Spiliopoulos et al. algorithm and the results showed that it outperforms their -algorithm in terms of energy-time tradeoff. +algorithm in terms of energy-time trade-off. In the near future, this method will be applied to real heterogeneous platforms to evaluate its performance in a real study case. It would also be interesting @@ -1216,6 +1218,7 @@ Babylon (Iraq) for supporting his work. %%% End: % LocalWords: Fanfakh Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber -% LocalWords: CMOS EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex +% LocalWords: CMOS EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex GPU % LocalWords: de badri muslim MPI TcpOld TcmOld dNew dOld cp Sopt Tcp Tcm Ps -% LocalWords: Scp Fmax Fdiff SimGrid GFlops Xeon EP BT +% LocalWords: Scp Fmax Fdiff SimGrid GFlops Xeon EP BT GPUs CPUs AMD +% LocalWords: Spiliopoulos scalability