X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/mpi-energy2.git/blobdiff_plain/0ac070873920bcf8d6092cfb1bf75ffa3eb05fcb..300741fe3069be22399dd00ac07a1b213375a604:/Heter_paper.tex diff --git a/Heter_paper.tex b/Heter_paper.tex index 2050d2a..31a476a 100644 --- a/Heter_paper.tex +++ b/Heter_paper.tex @@ -49,9 +49,8 @@ \newcommand{\TmaxCompOld}{\Xsub{T}{Max Comp Old}} \newcommand{\Tmax}{\Xsub{T}{max}} \newcommand{\Tnew}{\Xsub{T}{New}} -\newcommand{\Told}{\Xsub{T}{Old}} - -\begin{document} +\newcommand{\Told}{\Xsub{T}{Old}} +\begin{document} \title{Energy Consumption Reduction in heterogeneous architecture using DVFS} @@ -143,7 +142,7 @@ vector of scaling factors can be predicted using EQ (\ref{eq:perf}). \textit T_\textit{new} = \max_{i=1,2,\dots,N} (TcpOld_{i} \cdot S_{i}) + MinTcm \end{equation} -where $TcpOld_i$ is the computation time of processor $i$ during the first iteration and $MinT_{c}m$ is the communication time of the slowest processor from the first iteration. The model computes the maximum computation time +where $TcpOld_i$ is the computation time of processor $i$ during the first iteration and $MinTcm$ is the communication time of the slowest processor from the first iteration. The model computes the maximum computation time with scaling factor from each node added to the communication time of the slowest node, it means only the communication time without any slack time. Therefore, we can consider the execution time of the iterative application is the execution time of one iteration as in EQ(\ref{eq:perf}) multiply by the number of iterations of the application. @@ -198,7 +197,7 @@ power consumption: \begin{multline} \label{eq:pdnew} {P}_\textit{dNew} = \alpha \cdot C_L \cdot V^2 \cdot F_{new} = \alpha \cdot C_L \cdot \beta^2 \cdot F_{new}^3 \\ - {} = \alpha \cdot C_L \cdot V^2 \cdot F \cdot S^{-3} = P_{dOld} \cdot S^{-3} + {} = \alpha \cdot C_L \cdot V^2 \cdot F_{max} \cdot S^{-3} = P_{dOld} \cdot S^{-3} \end{multline} where $ {P}_\textit{dNew}$ and $P_{dOld}$ are the dynamic power consumed with the new frequency and the maximum frequency respectively. @@ -206,7 +205,7 @@ According to EQ(\ref{eq:pdnew}) the dynamic power is reduced by a factor of $S^{ reducing the frequency by a factor of $S$~\cite{3}. Since the FLOPS of a CPU is proportional to the frequency of a CPU, the computation time is increased proportionally to $S$. The new dynamic energy is the dynamic power multiplied by the new time of computation and is given by the following equation: \begin{equation} \label{eq:Edyn} - E_\textit{dNew} = P_{dOld} \cdot S^{-3} \cdot (T_{cp} \cdot S)= S^{-2}\cdot P_{dOld} \cdot Tcp + E_\textit{dNew} = P_{dOld} \cdot S^{-3} \cdot (Tcp \cdot S)= S^{-2}\cdot P_{dOld} \cdot Tcp \end{equation} The static power is related to the power leakage of the CPU and is consumed during computation and even when idle. As in~\cite{3,46}, we assume that the static power of a processor is constant during idle and computation periods, and for all its available frequencies. The static energy is the static power multiplied by the execution time of the program. According to the execution time model in EQ(\ref{eq:perf}), @@ -218,7 +217,7 @@ of a processor after scaling its frequency is computed as follows: E_\textit{s} = P_\textit{s} \cdot (Tcp \cdot S + Tcm) \end{equation} -In the considered heterogeneous platform, each processor $i$ might have different dynamic and static powers, noted as $P_{di}$ and $P_{si}$ respectively. Therefore, even if the distributed message passing iterative application is load balanced, the computation time of each CPU $i$ noted $T_{cpi}$ might be different and different frequency scaling factors might be computed in order to decrease the overall energy consumption of the application and reduce the slack times. The communication time of a processor $i$ is noted as $T_{cmi}$ and could contain slack times if it is communicating with slower nodes, see figure(\ref{fig:heter}). Therefore, all nodes do not have equal communication times. While the dynamic energy is computed according to the frequency scaling factor and the dynamic power of each node as in EQ(\ref{eq:Edyn}), the static energy is computed as the sum of the execution time of each processor multiplied by its static power. The overall energy consumption of a message passing distributed application executed over a heterogeneous platform during one iteration is the summation of all dynamic and static energies for each processor. It is computed as follows: +In the considered heterogeneous platform, each processor $i$ might have different dynamic and static powers, noted as $Pd_{i}$ and $Ps_{i}$ respectively. Therefore, even if the distributed message passing iterative application is load balanced, the computation time of each CPU $i$ noted $Tcp_{i}$ might be different and different frequency scaling factors might be computed in order to decrease the overall energy consumption of the application and reduce the slack times. The communication time of a processor $i$ is noted as $Tcm_{i}$ and could contain slack times if it is communicating with slower nodes, see figure(\ref{fig:heter}). Therefore, all nodes do not have equal communication times. While the dynamic energy is computed according to the frequency scaling factor and the dynamic power of each node as in EQ(\ref{eq:Edyn}), the static energy is computed as the sum of the execution time of each processor multiplied by its static power. The overall energy consumption of a message passing distributed application executed over a heterogeneous platform during one iteration is the summation of all dynamic and static energies for each processor. It is computed as follows: \begin{multline} \label{eq:energy} E = \sum_{i=1}^{N} {(S_i^{-2} \cdot Pd_{i} \cdot Tcp_i)} + {} \\ @@ -325,9 +324,9 @@ However, the most energy reduction gain can be achieved when the energy curve ha In this section we are proposed a heterogeneous scaling algorithm, (figure~\ref{HSA}), that selects the optimal vector of the frequency scaling factors from each -node. The algorithm is numerates the suitable range of available frequency scaling -factors for each node in a heterogeneous cluster, returns a vector of optimal -frequency scaling factors for all node define as $Sopt_1,Sopt_2,\dots,Sopt_N$. Using heterogeneous cluster +node. The algorithm is numerates the suitable range of available vectors of the frequency scaling +factors for all node in a heterogeneous cluster, returns a vector of optimal +frequency scaling factors define as $Sopt_1,Sopt_2,\dots,Sopt_N$. Using heterogeneous cluster has different computing powers is produces different workloads for each node. Therefore, the fastest nodes waiting at the synchronous barrier for the slowest nodes to finish there work as in figure (\ref{fig:heter}). Our algorithm is takes into account these imbalanced workloads @@ -335,13 +334,14 @@ when is starts to search for selecting the best vector of the frequency scaling algorithm is selects the initial frequencies values for each node proportional to the times of computations that gathered from the first iteration. As an example in figure (\ref{fig:st_freq}), the algorithm don't tests the first -frequencies of the computing nodes until it is converge their frequencies to the -frequency of the slowest node. If the algorithm is starts to test changing the +frequencies of the computing nodes until their frequencies is converge to the +frequency of the slowest node. The operational frequency gear not surly related to computing power, therefore the algorithm +rapprochement the frequencies according to the computing power of each frequency. Moreover, If the algorithm is starts to test change the frequency of the slowest node from the first gear, we are loosing the performance and then the best trade-off relation (the maximum distance) be not reachable. This case will be similar -to a homogeneous cluster when all nodes scales their frequencies together from -the first gear. Therefore, there is a small distance between the energy and -the performance curves in a homogeneous cluster compare to heterogeneous one, for example see the figure(\ref{fig:r1}). Then the +to a homogeneous cluster when all nodes scale down their frequencies together from +the first gears. Therefore, there is a small distance between the energy and +the performance curves in a homogeneous cluster compare to heterogeneous one, for example see the figure(\ref{fig:r1}) and figure(\ref{fig:r2}) . Then the algorithm starts to search for the optimal vector of the frequency scaling factors from the selected initial frequencies until all node reach their minimum frequencies. \begin{figure}[t] @@ -375,7 +375,7 @@ maximum frequency of node $i$ and the computation scaling factor $Scp_i$ as fol \item[$Ps_i$] array of the static powers for all nodes. \item[$Fdiff_i$] array of the difference between two successive frequencies for all nodes. \end{description} - \Ensure $Sopt_1, \dots, Sopt_N$ is a set of optimal scaling factors + \Ensure $Sopt_1,Sopt_2 \dots, Sopt_N$ is a vector of optimal scaling factors \State $ Scp_i \gets \frac{\max_{i=1,2,\dots,N}(Tcp_i)}{Tcp_i} $ \State $F_{i} \gets \frac{Fmax_i}{Scp_i},~{i=1,2,\cdots,N}$ @@ -410,14 +410,14 @@ maximum frequency of node $i$ and the computation scaling factor $Scp_i$ as fol When the initial frequencies are computed, the algorithm numerates all available frequency scaling factors starting from the initial frequencies until all nodes reach their minimum frequencies. At each iteration the algorithm determine the slowest node according to EQ(\ref{eq:perf}). -It is remains the frequency of the slowest node without change, while it is scale down the frequency of the other +It is remains the frequency of the slowest node without change, while it is scales down the frequency of the other nodes. This is improved the execution time degradation and energy saving in the same time. The proposed algorithm works online during the execution time of the iterative MPI program. It is returns a vector of optimal frequency scaling factors depending on the objective function EQ(\ref{eq:max}). The program changes the new frequencies of the CPUs according to the computed scaling factors. This algorithm has a small execution time: for a heterogeneous cluster composed of four different types of -nodes having the characteristics presented in table~(\ref{table:platform}), it is +nodes having the characteristics presented in table~(\ref{table:platform}), it takes \np[ms]{0.04} on average for 4 nodes and \np[ms]{0.15} on average for 144 nodes. The algorithm complexity is $O(F\cdot (N \cdot4) )$, where $F$ is the number of iterations and $N$ is the number of computing nodes. The algorithm @@ -500,7 +500,7 @@ The proposed algorithm was applied to seven MPI programs of the NAS benchmarks ( \cite{44}, which were run with three classes (A, B and C). In this experiments we are interested to run the class C, the biggest class compared to A and B, on different number of nodes, from 4 to 128 or 144 nodes according to the type of the iterative MPI program. Depending on the proposed energy consumption model EQ(\ref{eq:energy}), - we are measure the energy consumption for all NAS MPI programs. The dynamic and static power values are used under the same assumption used by \cite{45,3}. We are used a percentage of 80\% for dynamic power and 20\% for static of the total power consumption of a CPU. The heterogeneous nodes in table (\ref{table:platform}) have different simulated computing power (FLOPS), ranked from the node of type 1 with smaller computing power (FLOPS) to the highest computing power (FLOPS) for node of type 4. Therefore, the power values are used proportionally increased from nodes of type 1 to nodes of type 4 that with highest computing power. Then, we are used an assumption that the power consumption is increased linearly when the computing power (FLOPS) of the processor is increased, see \cite{48}. + we are measure the energy consumption for all NAS MPI programs. The dynamic and static power values are used under the same assumption used by \cite{45,3}, we are used a percentage of 80\% for dynamic power and 20\% for static of the total power consumption of a CPU. The heterogeneous nodes in table (\ref{table:platform}) have different simulated computing power (FLOPS), ranked from the node of type 1 with smaller computing power (FLOPS) to the highest computing power (FLOPS) for node of type 4. Therefore, the power values are used proportionally increased from nodes of type 1 to nodes of type 4 that with highest computing power. Then, we are used an assumption that the power consumption is increased linearly when the computing power (FLOPS) of the processor is increased, see \cite{48}. \begin{table}[htb] \caption{Running NAS benchmarks on 4 nodes } @@ -723,8 +723,7 @@ running the NAS benchmarks class C on 8 or 9 nodes are place in the tables \hline EP &6170.30 &16.19 &0.02 &16.17 \\ \hline - LU &39477.28 &20.43 &0.07 to changes it -behaviour &20.36 \\ + LU &39477.28 &20.43 &0.07 &20.36 \\ \hline BT &26169.55 &25.34 &6.62 &18.71 \\ \hline