\newcommand{\TmaxCompOld}{\Xsub{T}{Max Comp Old}}
\newcommand{\Tmax}{\Xsub{T}{max}}
\newcommand{\Tnew}{\Xsub{T}{New}}
-\newcommand{\Told}{\Xsub{T}{Old}}
-
-\begin{document}
+\newcommand{\Told}{\Xsub{T}{Old}}
+\begin{document}
\title{Energy Consumption Reduction in heterogeneous architecture using DVFS}
\textit T_\textit{new} =
\max_{i=1,2,\dots,N} (TcpOld_{i} \cdot S_{i}) + MinTcm
\end{equation}
-where $TcpOld_i$ is the computation time of processor $i$ during the first iteration and $MinT_{c}m$ is the communication time of the slowest processor from the first iteration. The model computes the maximum computation time
+where $TcpOld_i$ is the computation time of processor $i$ during the first iteration and $MinTcm$ is the communication time of the slowest processor from the first iteration. The model computes the maximum computation time
with scaling factor from each node added to the communication time of the slowest node, it means only the
communication time without any slack time. Therefore, we can consider the execution time of the iterative application is the execution time of one iteration as in EQ(\ref{eq:perf}) multiply by the number of iterations of the application.
\begin{multline}
\label{eq:pdnew}
{P}_\textit{dNew} = \alpha \cdot C_L \cdot V^2 \cdot F_{new} = \alpha \cdot C_L \cdot \beta^2 \cdot F_{new}^3 \\
- {} = \alpha \cdot C_L \cdot V^2 \cdot F \cdot S^{-3} = P_{dOld} \cdot S^{-3}
+ {} = \alpha \cdot C_L \cdot V^2 \cdot F_{max} \cdot S^{-3} = P_{dOld} \cdot S^{-3}
\end{multline}
where $ {P}_\textit{dNew}$ and $P_{dOld}$ are the dynamic power consumed with the new frequency and the maximum frequency respectively.
reducing the frequency by a factor of $S$~\cite{3}. Since the FLOPS of a CPU is proportional to the frequency of a CPU, the computation time is increased proportionally to $S$. The new dynamic energy is the dynamic power multiplied by the new time of computation and is given by the following equation:
\begin{equation}
\label{eq:Edyn}
- E_\textit{dNew} = P_{dOld} \cdot S^{-3} \cdot (T_{cp} \cdot S)= S^{-2}\cdot P_{dOld} \cdot Tcp
+ E_\textit{dNew} = P_{dOld} \cdot S^{-3} \cdot (Tcp \cdot S)= S^{-2}\cdot P_{dOld} \cdot Tcp
\end{equation}
The static power is related to the power leakage of the CPU and is consumed during computation and even when idle. As in~\cite{3,46}, we assume that the static power of a processor is constant during idle and computation periods, and for all its available frequencies.
The static energy is the static power multiplied by the execution time of the program. According to the execution time model in EQ(\ref{eq:perf}),
E_\textit{s} = P_\textit{s} \cdot (Tcp \cdot S + Tcm)
\end{equation}
-In the considered heterogeneous platform, each processor $i$ might have different dynamic and static powers, noted as $P_{di}$ and $P_{si}$ respectively. Therefore, even if the distributed message passing iterative application is load balanced, the computation time of each CPU $i$ noted $T_{cpi}$ might be different and different frequency scaling factors might be computed in order to decrease the overall energy consumption of the application and reduce the slack times. The communication time of a processor $i$ is noted as $T_{cmi}$ and could contain slack times if it is communicating with slower nodes, see figure(\ref{fig:heter}). Therefore, all nodes do not have equal communication times. While the dynamic energy is computed according to the frequency scaling factor and the dynamic power of each node as in EQ(\ref{eq:Edyn}), the static energy is computed as the sum of the execution time of each processor multiplied by its static power. The overall energy consumption of a message passing distributed application executed over a heterogeneous platform during one iteration is the summation of all dynamic and static energies for each processor. It is computed as follows:
+In the considered heterogeneous platform, each processor $i$ might have different dynamic and static powers, noted as $Pd_{i}$ and $Ps_{i}$ respectively. Therefore, even if the distributed message passing iterative application is load balanced, the computation time of each CPU $i$ noted $Tcp_{i}$ might be different and different frequency scaling factors might be computed in order to decrease the overall energy consumption of the application and reduce the slack times. The communication time of a processor $i$ is noted as $Tcm_{i}$ and could contain slack times if it is communicating with slower nodes, see figure(\ref{fig:heter}). Therefore, all nodes do not have equal communication times. While the dynamic energy is computed according to the frequency scaling factor and the dynamic power of each node as in EQ(\ref{eq:Edyn}), the static energy is computed as the sum of the execution time of each processor multiplied by its static power. The overall energy consumption of a message passing distributed application executed over a heterogeneous platform during one iteration is the summation of all dynamic and static energies for each processor. It is computed as follows:
\begin{multline}
\label{eq:energy}
E = \sum_{i=1}^{N} {(S_i^{-2} \cdot Pd_{i} \cdot Tcp_i)} + {} \\
In this section we are proposed a heterogeneous scaling algorithm,
(figure~\ref{HSA}), that selects the optimal vector of the frequency scaling factors from each
-node. The algorithm is numerates the suitable range of available frequency scaling
-factors for each node in a heterogeneous cluster, returns a vector of optimal
-frequency scaling factors for all node define as $Sopt_1,Sopt_2,\dots,Sopt_N$. Using heterogeneous cluster
+node. The algorithm is numerates the suitable range of available vectors of the frequency scaling
+factors for all node in a heterogeneous cluster, returns a vector of optimal
+frequency scaling factors define as $Sopt_1,Sopt_2,\dots,Sopt_N$. Using heterogeneous cluster
has different computing powers is produces different workloads for each node. Therefore, the fastest nodes waiting at the
synchronous barrier for the slowest nodes to finish there work as in figure
(\ref{fig:heter}). Our algorithm is takes into account these imbalanced workloads
algorithm is selects the initial frequencies values for each node proportional
to the times of computations that gathered from the first iteration. As an
example in figure (\ref{fig:st_freq}), the algorithm don't tests the first
-frequencies of the computing nodes until it is converge their frequencies to the
-frequency of the slowest node. If the algorithm is starts to test changing the
+frequencies of the computing nodes until their frequencies is converge to the
+frequency of the slowest node. The operational frequency gear not surly related to computing power, therefore the algorithm
+rapprochement the frequencies according to the computing power of each frequency. Moreover, If the algorithm is starts to test change the
frequency of the slowest node from the first gear, we are loosing the performance and
then the best trade-off relation (the maximum distance) be not reachable. This case will be similar
-to a homogeneous cluster when all nodes scales their frequencies together from
-the first gear. Therefore, there is a small distance between the energy and
-the performance curves in a homogeneous cluster compare to heterogeneous one, for example see the figure(\ref{fig:r1}). Then the
+to a homogeneous cluster when all nodes scale down their frequencies together from
+the first gears. Therefore, there is a small distance between the energy and
+the performance curves in a homogeneous cluster compare to heterogeneous one, for example see the figure(\ref{fig:r1}) and figure(\ref{fig:r2}) . Then the
algorithm starts to search for the optimal vector of the frequency scaling factors from the selected initial
frequencies until all node reach their minimum frequencies.
\begin{figure}[t]
\item[$Ps_i$] array of the static powers for all nodes.
\item[$Fdiff_i$] array of the difference between two successive frequencies for all nodes.
\end{description}
- \Ensure $Sopt_1, \dots, Sopt_N$ is a set of optimal scaling factors
+ \Ensure $Sopt_1,Sopt_2 \dots, Sopt_N$ is a vector of optimal scaling factors
\State $ Scp_i \gets \frac{\max_{i=1,2,\dots,N}(Tcp_i)}{Tcp_i} $
\State $F_{i} \gets \frac{Fmax_i}{Scp_i},~{i=1,2,\cdots,N}$
When the initial frequencies are computed, the algorithm numerates all available
frequency scaling factors starting from the initial frequencies until all nodes reach their
minimum frequencies. At each iteration the algorithm determine the slowest node according to EQ(\ref{eq:perf}).
-It is remains the frequency of the slowest node without change, while it is scale down the frequency of the other
+It is remains the frequency of the slowest node without change, while it is scales down the frequency of the other
nodes. This is improved the execution time degradation and energy saving in the same time.
The proposed algorithm works online during the execution time of the iterative MPI program. It is
returns a vector of optimal frequency scaling factors depending on the
objective function EQ(\ref{eq:max}). The program changes the new frequencies of
the CPUs according to the computed scaling factors. This algorithm has a small
execution time: for a heterogeneous cluster composed of four different types of
-nodes having the characteristics presented in table~(\ref{table:platform}), it is
+nodes having the characteristics presented in table~(\ref{table:platform}), it
takes \np[ms]{0.04} on average for 4 nodes and \np[ms]{0.15} on average for 144
nodes. The algorithm complexity is $O(F\cdot (N \cdot4) )$, where $F$ is the
number of iterations and $N$ is the number of computing nodes. The algorithm
\cite{44}, which were run with three classes (A, B and C).
In this experiments we are interested to run the class C, the biggest class compared to A and B, on different number of
nodes, from 4 to 128 or 144 nodes according to the type of the iterative MPI program. Depending on the proposed energy consumption model EQ(\ref{eq:energy}),
- we are measure the energy consumption for all NAS MPI programs. The dynamic and static power values are used under the same assumption used by \cite{45,3}. We are used a percentage of 80\% for dynamic power and 20\% for static of the total power consumption of a CPU. The heterogeneous nodes in table (\ref{table:platform}) have different simulated computing power (FLOPS), ranked from the node of type 1 with smaller computing power (FLOPS) to the highest computing power (FLOPS) for node of type 4. Therefore, the power values are used proportionally increased from nodes of type 1 to nodes of type 4 that with highest computing power. Then, we are used an assumption that the power consumption is increased linearly when the computing power (FLOPS) of the processor is increased, see \cite{48}.
+ we are measure the energy consumption for all NAS MPI programs. The dynamic and static power values are used under the same assumption used by \cite{45,3}, we are used a percentage of 80\% for dynamic power and 20\% for static of the total power consumption of a CPU. The heterogeneous nodes in table (\ref{table:platform}) have different simulated computing power (FLOPS), ranked from the node of type 1 with smaller computing power (FLOPS) to the highest computing power (FLOPS) for node of type 4. Therefore, the power values are used proportionally increased from nodes of type 1 to nodes of type 4 that with highest computing power. Then, we are used an assumption that the power consumption is increased linearly when the computing power (FLOPS) of the processor is increased, see \cite{48}.
\begin{table}[htb]
\caption{Running NAS benchmarks on 4 nodes }
\hline
EP &6170.30 &16.19 &0.02 &16.17 \\
\hline
- LU &39477.28 &20.43 &0.07 to changes it
-behaviour &20.36 \\
+ LU &39477.28 &20.43 &0.07 &20.36 \\
\hline
BT &26169.55 &25.34 &6.62 &18.71 \\
\hline