clusters (heterogeneous CPUs) and grid platforms are presented.
They select the frequencies that give the best
trade-off between energy saving and performance degradation for each node
computing the synchronous message passing application with iterations. These algorithms have a small
overhead and work without training or profiling. They use new energy models
for message passing synchronous applications with iterations running on both the heterogeneous
local cluster and the grid platform. The first proposed algorithm for a heterogeneous local
cluster was evaluated on the SimGrid simulator while running the class C of the NAS parallel
benchmarks. The experiments conducted over 8 heterogeneous nodes show that it reduces on
average the energy consumption by 29.8\% while limiting the performance degradation to 3.8\%.
The second proposed algorithm for a grid platform was evaluated on the Grid5000 testbed
platform while running the class D of the NAS parallel benchmarks.
The experiments were run on 16 nodes, distributed over three clusters, and show that the algorithm reduces
on average the energy consumption by 30\% while the performance is on average only degraded
by 3.2\%.
Finally, both algorithms were compared to the EDP method. The comparison
selection algorithm for synchronous message passing programs running over a grid platform.
Section~\ref{ch3:4} presents the results of applying the algorithm on the
NAS parallel benchmarks (class D) and executing them on the Grid'5000 testbed.
The algorithm is also evaluated over multi-core architectures and over three different power scenarios. Moreover, Section~\ref{ch3:4} shows the comparison results between the proposed method and the EDP method.
Finally, in Section~\ref{ch3:concl} the chapter ends with a summary.
\section{Related works}
\label{ch3:relwork}
The process of selecting the appropriate frequency for a
processor to satisfy some objectives, while taking into account all the
constraints, is not a trivial operation. Many researchers used different
platforms, synchronous or asynchronous applications, \dots{}
In this chapter, we are interested in reducing the energy consumption when running message passing
synchronous applications with iterations over a heterogeneous platform. Some
works have already been done for such platforms, which can be classified into
two types of heterogeneous platforms:
\begin{itemize}
overhead. In contrast to the works described above, this chapter presents the
following contributions:
\begin{enumerate}
\item two new energy models and two new performance models for message passing
  synchronous applications with iterations running over a heterogeneous local cluster and a grid platform.
All the models take into account the communications and the slack times. The models can predict the
energy consumption and the execution time of the application.
local cluster and a grid platform. The algorithms have a very small overhead and do not need any
training or profiling. They use a new optimization function which
simultaneously maximizes the performance and minimizes the energy consumption
  of a message passing synchronous application with iterations.
\end{enumerate}
\section[The energy optimization of a heterogeneous cluster]{The energy optimization of parallel applications with iterations running over local heterogeneous
clusters}
\label{ch3:1}
\subsection{The execution time of message passing distributed
  applications with iterations on a heterogeneous local cluster}
\label{ch3:1:1}
In this section, we are interested in reducing the energy consumption of message
passing distributed synchronous applications with iterations running over heterogeneous local clusters.
In this work, a heterogeneous local cluster is defined as a collection of
heterogeneous computing nodes interconnected via a high speed homogeneous
network. Therefore, the nodes may have different characteristics such as computing
\label{fig:task-heter}
\end{figure}
The overall execution time of a distributed synchronous application with iterations
over a heterogeneous local cluster consists of the sum of the computation time and
the communication time for every iteration on a node. However, due to the
heterogeneous computation power of the computing nodes, slack times may occur
especially different frequency gears, when applying DVFS operations on these
nodes, they may get different scaling factors represented by a scaling vector:
$(S_1, S_2,\dots, S_N)$ where $S_i$ is the scaling factor of processor $i$. To
be able to predict the execution time of message passing synchronous
applications with iterations running over a heterogeneous local cluster, for different vectors of
scaling factors, the communication time and the computation time for all the
tasks must be measured during the first iteration before applying any DVFS
operation. Then the execution time for one iteration of the application with any
iteration. The model computes the maximum computation time with
scaling factor from each node added to the communication time of the slowest
node. It means only the communication time without any slack time is taken into
account. Therefore, the execution time of the application with iterations is equal to
the execution time of one iteration as in (\ref{eq:perf_heter}) multiplied by the
number of iterations of that application.
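As a reading aid, this prediction can be sketched as follows, where $T_{new}$ is the predicted time of one iteration and $\mathit{MinTcm}$ denotes the communication time of the slowest node measured during the first iteration (both notations are used here only for illustration of Equation (\ref{eq:perf_heter})):
\begin{equation*}
  T_{new} = \max_{i=1,\dots,N} (\Tcp[i] \cdot S_i) + \mathit{MinTcm}
\end{equation*}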
This prediction model is improved from the model that predicts the execution time
of message passing distributed applications for homogeneous
architectures presented in Chapter~\ref{ch2}, Section~\ref{ch2:3}. The execution time prediction model is
used in the method that optimizes both the energy consumption and the performance
of parallel applications with iterations, which is presented in the following sections.
\subsection{Energy model for heterogeneous local cluster}
\label{ch3:1:2}
In Chapter~\ref{ch2}, the dynamic and the static energy consumption of a
processor are computed according to Equations \ref{eq:Edyn_new} and \ref{eq:Estatic_new} respectively. Then, the total energy consumption of a processor is the sum of these two terms.
Therefore, the overall energy consumption for the parallel tasks over a parallel cluster
is the summation of the energies consumed by all the processors.
In the considered heterogeneous platform, each processor $i$ may have
different dynamic and static powers, noted as $\Pd[i]$ and $\Ps[i]$
respectively. Therefore, even if the distributed message passing
application with iterations is load balanced, the computation time of each CPU $i$ noted
$\Tcp[i]$ may be different and different frequency scaling factors may be
computed in order to decrease the overall energy consumption of the application
and reduce the slack times. The communication time of a processor $i$ is noted as
(\ref{eq:Edyn_new}), the static energy is computed as the execution time
of one iteration, as in Equation (\ref{eq:perf_heter}), multiplied by the static power of each processor.
The overall energy consumption of a message passing distributed application executed over a
heterogeneous cluster during one iteration is the summation of the dynamic and
static energies for all the processors. It is computed as follows:
\begin{equation}
\label{eq:energy-heter}
factors $(S_1, S_2,\dots, S_N)$ may degrade the performance of the application
and thus increase the consumed static energy because the execution time is
increased~\cite{ref78}. The overall energy consumption
for an application with iterations can be computed by measuring the energy
consumption of one iteration, as in (\ref{eq:energy-heter}), and multiplying it by the number
of iterations of that application.
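For illustration only, combining the static energy description above with the dynamic energy model of Chapter~\ref{ch2} (assuming, as a sketch of Equation (\ref{eq:Edyn_new}), that the dynamic energy of processor $i$ scales as $\Pd[i] \cdot S_i^{-2} \cdot \Tcp[i]$), the energy consumed during one iteration takes the following shape, where $T_{new}$ is the predicted time of one iteration given by Equation (\ref{eq:perf_heter}):
\begin{equation*}
  E \approx \sum_{i=1}^{N} \Pd[i] \cdot S_i^{-2} \cdot \Tcp[i] \;+\; \sum_{i=1}^{N} \Ps[i] \cdot T_{new}
\end{equation*}
and the total consumed energy is this quantity multiplied by the number of iterations.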
appropriate frequency scaling factor for each processor while considering the
characteristics of each processor (computation power, range of frequencies,
dynamic and static powers) and the task it is executing (computation/communication
ratio). In Chapter~\ref{ch2}, we proposed a method that selects the optimal
frequency scaling factor for a homogeneous cluster executing a message passing
synchronous application with iterations while giving the best trade-off between the
energy consumption and the performance for such applications.
In this section, this optimization method is improved to consider heterogeneous clusters.
As described before, the relation between the energy consumption and the execution time for an
application is complex and nonlinear. Thus, to find the trade-off relation between the energy consumption computed with Equation \ref{eq:energy-heter} and the performance computed with Equation \ref{eq:perf_heter} for message passing applications with iterations, both terms must first be normalized as follows:
\begin{equation}
\begin{figure}[!t]
\centering
\includegraphics[width=.7\textwidth]{fig/ch3/heter}
  \caption{The energy and performance relation in a heterogeneous cluster}
\label{fig:rel-heter}
\end{figure}
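As a sketch of the normalization mentioned above, and following the same approach as in Chapter~\ref{ch2}, the normalized energy and performance can be expressed as the ratios
\begin{equation*}
  E_{norm} = \frac{E_{reduced}}{E_{original}}, \qquad
  P_{norm} = \frac{T_{old}}{T_{new}},
\end{equation*}
where $E_{reduced}$ and $T_{new}$ are the predicted energy and time with the selected scaling factors, and $E_{original}$ and $T_{old}$ are their values when all the nodes run at maximum frequency (these names are only illustrative). The selected vector of scaling factors is then the one that maximizes the distance $P_{norm} - E_{norm}$, which corresponds to the distance between the predicted energy and performance curves of Figure~\ref{fig:rel-heter}.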
In this section, Algorithm~\ref{HSA} is presented. It selects the frequency
scaling factors vector that gives the best trade-off between minimizing the
energy consumption and maximizing the performance of a message passing
synchronous application with iterations executed on a heterogeneous local cluster. It works
online during the execution time of the message passing program with iterations. It
uses information gathered during the first iteration such as the computation
time and the communication time in one iteration for each node. The algorithm is
executed after the first iteration and returns a vector of optimal frequency
program applies DVFS operations to change the frequencies of the CPUs according
to the computed scaling factors. This algorithm is called just once during the
execution of the program. Algorithm~\ref{dvfs-heter} shows where and when the proposed
scaling algorithm is called in the MPI program with iterations.
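To make the selection step concrete, the following simplified C sketch illustrates how the distance between the normalized performance and the normalized energy can be maximized over the available frequency gears. It is not the exact Algorithm~\ref{HSA}: for brevity all the nodes are moved through the same gear index, whereas the real algorithm scales each node independently starting from initial frequencies that remove the slack times, and all the values below (times, powers, frequencies) are illustrative placeholders. The dynamic energy term assumes the $\Pd \cdot S^{-2} \cdot \Tcp$ model of Chapter~\ref{ch2}.
\begin{verbatim}
#include <stdio.h>

#define N 4       /* number of nodes */
#define GEARS 5   /* frequency gears per node */

int main(void) {
    /* First-iteration measurements (illustrative values). */
    double tcp[N]  = {4.0, 3.2, 2.5, 2.0};          /* computation times (s)   */
    double tcm_min = 0.5;         /* communication time of the slowest node (s) */
    double pd[N]   = {20.0, 17.0, 15.0, 12.0};      /* dynamic powers (W)      */
    double ps[N]   = {5.0, 4.2, 3.7, 3.0};          /* static powers (W)       */
    double freq[GEARS] = {2.5, 2.2, 1.9, 1.6, 1.3}; /* frequencies (GHz)       */

    /* Reference time and energy at maximum frequency (all S_i = 1). */
    double t_old = 0.0, e_orig = 0.0;
    for (int i = 0; i < N; i++) {
        if (tcp[i] > t_old) t_old = tcp[i];
        e_orig += pd[i] * tcp[i];
    }
    t_old += tcm_min;
    for (int i = 0; i < N; i++)
        e_orig += ps[i] * t_old;

    /* Try every gear and keep the one maximizing Pnorm - Enorm. */
    int best_gear = 0;
    double best_dist = -1.0e9;
    for (int g = 0; g < GEARS; g++) {
        double s = freq[0] / freq[g];          /* scaling factor S = Fmax/Fnew */
        double t_new = 0.0, e_new = 0.0;
        for (int i = 0; i < N; i++)
            if (tcp[i] * s > t_new) t_new = tcp[i] * s;
        t_new += tcm_min;                      /* predicted time of one iteration */
        for (int i = 0; i < N; i++)
            e_new += pd[i] * tcp[i] / (s * s)  /* dynamic energy: Pd * S^-2 * Tcp */
                   + ps[i] * t_new;            /* static energy:  Ps * T_new      */
        double dist = t_old / t_new - e_new / e_orig;   /* Pnorm - Enorm */
        if (dist > best_dist) { best_dist = dist; best_gear = g; }
    }
    printf("selected gear %d (%.1f GHz), distance %.3f\n",
           best_gear, freq[best_gear], best_dist);
    return 0;
}
\end{verbatim}
Despite these simplifications, the sketch follows the same structure as the proposed method: predict the time and the energy for a candidate vector, normalize both, and keep the vector with the maximum distance.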
\begin{figure}[!t]
\centering
144 nodes and had nodes from the four types in equal proportions. For example, if
a benchmark was executed on 8 nodes, 2 nodes from each type were used. Since the
constructors of CPUs do not specify the dynamic and the static power of their
CPUs, for each type of node they were chosen proportionally to their computing
powers (FLOPS). The dynamic power corresponds to 80\% of the overall power consumption while executing at
the highest frequency and the
remaining 20\% is the static power. The same assumption was made in Chapter \ref{ch2} and
\cite{ref3}. Finally, these nodes were connected via an Ethernet network with 1 \textit{Gbit/s} bandwidth.
frequencies, thus, scaling down the frequency of the nodes results in higher
energy savings than in the 70\%-30\% scenario. On the other hand,
the performance degradation percentage is smaller in the 70\%-30\%
scenario compared to the 90\%-10\% scenario. This is due to the
higher static power percentage in the first scenario which makes it more
relevant in the overall consumed energy. Indeed, the static energy is related
to the execution time and if the performance is degraded the amount of consumed
decreases when the 70\%-30\% scenario is used because the dynamic
energy is less relevant in the overall consumed energy and lowering the
frequency does not return big energy savings. Moreover, the average of the
performance degradation is decreased when using a higher ratio for the static power
(e.g. 70\%-30\% scenario and 80\%-20\% scenario). Since the proposed
algorithm optimizes the energy consumption when
using a higher ratio for the dynamic power, the algorithm selects bigger frequency
scaling factors that result in more energy saving but degrade the performance, for
example see Figure~\ref{fig:powers-heter} (b). The opposite happens when using a
higher ratio for the static power, the algorithm proportionally selects smaller
scaling values which result in less energy saving but also less performance
degradation.
\begin{table}[!t]
because it maximizes the distance between the energy saving and the performance
degradation values while giving the same weight to both metrics.
\section[The energy optimization of grid]{The energy optimization of parallel applications with iterations running over grids}
\label{ch3:3}
\subsection{The energy and performance models of grid platform}
\label{ch3:3:1}
In this section, we are interested in reducing the energy consumption of message
passing applications with synchronous iterations running over
heterogeneous grid platforms. A heterogeneous grid platform could be defined as a collection of
heterogeneous computing clusters interconnected via a long distance network which has a lower bandwidth
and a higher latency than the local networks of the clusters. Each computing cluster in the grid is composed of homogeneous nodes that are connected together via a high speed network. However, nodes from distinct clusters may have different characteristics such as computing power (FLOPS), energy consumption, CPU's frequency range, network bandwidth and latency.
Since in a heterogeneous grid each cluster has different characteristics,
when applying DVFS operations on the nodes
of these clusters, they may get different scaling factors represented by a scaling vector:
$(S_{11}, S_{12},\dots, S_{NM})$ where $S_{ij}$ is the scaling factor of processor $j$ in cluster $i$. To
be able to predict the execution time of message passing
applications with synchronous iterations running over a heterogeneous grid, for different vectors of
scaling factors, the communication time and the computation time for all the
tasks must be measured during the first iteration before applying any DVFS
operation. Then the execution time for one iteration of the application with any
and the slowest communication time without slack time during one iteration.
The latter is equal to the communication time of the slowest node in the slowest cluster $h$.
It means that only the communication time without any slack time is taken into account.
Therefore, the execution time of the parallel application with iterations is equal to
the execution time of one iteration as in Equation (\ref{eq:perf-grid}) multiplied by the
number of iterations of that application.
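For illustration, and using the notations introduced above, the predicted time of one iteration on the grid can be sketched as follows, where $\mathit{Tcm}_{h}$ denotes the communication time of the slowest node in the slowest cluster $h$ (an illustrative notation) and the maximum is taken over all the nodes $j$ of all the clusters $i$:
\begin{equation*}
  T_{new} = \max_{i,j} (\Tcp[ij] \cdot S_{ij}) + \mathit{Tcm}_{h}
\end{equation*}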
In the considered heterogeneous grid platform, each node $j$ in cluster $i$ may have
different dynamic and static powers from the nodes of the other clusters,
noted as $\Pd[ij]$ and $\Ps[ij]$ respectively. Therefore, even if the distributed
message passing application with iterations is load balanced, the computation time of each CPU $j$
in cluster $i$ noted $\Tcp[ij]$ may be different and different frequency scaling factors may be
computed in order to decrease the overall energy consumption of the application
and reduce slack times. The communication time of a processor $j$ in cluster $i$ is noted as
To optimize both the energy consumption model computed with Equation \ref{eq:energy-grid} and the performance model computed with Equation \ref{eq:perf-grid},
they must be normalized as in Equations \ref{eq:enorm-heter} and \ref{eq:pnorm-heter} respectively.
The original energy consumption is the energy consumed while all the nodes operate at the
maximum frequency. It is computed as follows:
\begin{figure}[!t]
\centering
\includegraphics[scale=0.7]{fig/ch3/init_freq}
  \caption{Selecting the initial frequencies in the grid architecture}
\label{fig:st_freq-grid}
\end{figure}
is presented. It selects the vector of frequency
scaling factors that gives the best trade-off between minimizing the
energy consumption and maximizing the performance of a message passing
application with synchronous iterations executed on a grid.
It is similar to the frequency selection algorithm for heterogeneous
local clusters presented in Section \ref{ch3:1:4}.
The value of the initial frequency scaling factor for each node is inversely proportional to its
computation time measured during the first iteration. The initial
Therefore, the dynamic power of one core is computed as the difference between the maximum
measured value in the maximum powers vector and the minimum measured value in the idle powers vector.
On the other hand, the static power consumed by one core is a part of the measured idle power consumption of the node. Since, in Grid'5000, there is no way to precisely measure the consumed static power, it was assumed, as in Sections \ref{ch3:2} and \ref{ch2:6}, that the static power represents a ratio of the dynamic power; thus, the value of the static power is set to 20\% of the dynamic power consumption of the core.
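To summarize, for each node the two power values used by the algorithm can be written as
\begin{equation*}
  \Pd = P_{max} - P_{idle}, \qquad \Ps = 0.2 \cdot \Pd,
\end{equation*}
where $P_{max}$ and $P_{idle}$ denote the maximum measured power and the measured idle power respectively. For example, with $P_{max}=140$ W and $P_{idle}=100$ W (values chosen only for illustration), one core would be assigned $\Pd=40$ W and $\Ps=8$ W.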
In the experiments presented in the following sections, two sites of Grid'5000 were used, the Lyon and the Nancy sites. These two sites have in total seven different clusters as shown in Figure~\ref{fig:grid5000}.
\end{tabular}
\label{table:grid5000-1}
\end{table}
\subsection{The experimental results of the scaling algorithm on a grid}
is exponentially related to the CPU's frequency value. On the other hand, the increase in the number of computing nodes can
increase the communication times and thus produce less energy saving depending on the
benchmarks being executed. The results of the benchmarks CG, MG, BT and FT show more
energy saving percentage in the one site scenario when executed over 16 nodes than on 32 nodes. LU and SP consume more energy with 16 nodes than with 32 nodes on one site because their computations to communications ratio is not affected by the increase of the number of local communications.
\begin{figure}[!h]
\centering
\centering
\begin{figure*}[!h]
\centering
\includegraphics[width=.7\textwidth]{fig/ch3/eng_s.eps}
\caption{The energy reduction percentages while executing the NAS benchmarks over different scenarios}
\label{fig:eng_s}
\end{figure*}
\begin{figure*}[!h]
\centering
\includegraphics[width=.7\textwidth]{fig/ch3/per_d.eps}
\caption{The performance degradation percentages of the NAS benchmarks over different scenarios}
\label{fig:per_d}
\end{figure*}
\begin{figure*}[!h]
\centering
\includegraphics[width=.7\textwidth]{fig/ch3/dist.eps}
\caption{The trade-off distance percentages between the energy reduction and the performance of the NAS benchmarks
over different scenarios}
\label{fig:dist-grid}
\end{figure*}
\begin{figure}[!h]
\centering
\includegraphics[width=.7\textwidth]{fig/ch3/time.eps}
  \caption{The execution times of the NAS benchmarks running over the one core and the multi-core scenarios}
\label{fig:time-mc}
\end{figure}
\begin{figure}[!h]
\centering
\includegraphics[width=.7\textwidth]{fig/ch3/eng_con.eps}
  \caption{The energy consumptions and execution times of the NAS benchmarks over one core and multi-core per node architectures}
\label{fig:eng-cons-mc}
\end{figure}
\begin{figure*}[!h]
\centering
\includegraphics[width=.7\textwidth]{fig/ch3/eng_s_mc.eps}
  \caption{The energy saving percentages of running the NAS benchmarks over the one core and multi-core scenarios}
\label{fig:eng-s-mc}
\end{figure*}
\begin{figure*}[!h]
\centering
\includegraphics[width=.7\textwidth]{fig/ch3/per_d_mc.eps}
  \caption{The performance degradation percentages of running the NAS benchmarks over the one core and multi-core scenarios}
\label{fig:per-d-mc}
\end{figure*}
\begin{figure*}[!h]
\centering
\includegraphics[width=.7\textwidth]{fig/ch3/dist_mc.eps}
  \caption{The trade-off distance percentages of running the NAS benchmarks over the one core and multi-core scenarios}
\label{fig:dist-mc}
\end{figure*}
The energy saving percentages of all the NAS benchmarks running over these two scenarios are presented in Figure~\ref{fig:eng-s-mc}.
\begin{figure}[!h]
\centering
\includegraphics[width=.7\textwidth]{fig/ch3/dist_pow.eps}
  \caption{The trade-off distance percentages between the energy reduction and the performance of the NAS benchmarks over the three power scenarios}
\label{fig:dist-pow}
\end{figure}
\begin{figure*}[!h]
\centering
\includegraphics[width=.7\textwidth]{fig/ch3/edp_eng}
\caption{The energy reduction percentages induced by the Maxdist method and the EDP method}
\label{fig:edp-eng}
\end{figure*}
\begin{figure*}[!h]
\centering
\includegraphics[width=.7\textwidth]{fig/ch3/edp_per}
\caption{The performance degradation percentages induced by the Maxdist method and the EDP method}
\label{fig:edp-perf}
\end{figure*}
\begin{figure*}[!h]
\centering
\includegraphics[width=.7\textwidth]{fig/ch3/edp_dist}
\caption{The trade-off distance percentages between the energy consumption reduction and the performance for the Maxdist method and the EDP method}
\label{fig:edp-dist}
\end{figure*}
On the other hand, the EDP algorithm sometimes gives negative trade-off values for some benchmarks in the two sites scenarios.
These negative trade-off values mean that the performance degradation percentage is higher than the energy saving percentage.
The high positive values of the trade-off distance percentage mean that the energy saving percentage is much higher than the performance degradation percentage.
The complexities of both algorithms, Maxdist and EDP, are of the order of $O(N \cdot M_i \cdot F_j)$ and
$O(N \cdot M_i \cdot F_j^2)$ respectively, where $N$ is the number of clusters, $M_i$ is the number of nodes in cluster $i$ and $F_j$ is the
maximum number of available frequencies of node $j$. When Maxdist is applied to a benchmark that is being executed over 32 nodes distributed between the Nancy and Lyon sites, it takes on average $0.01~ms$ to compute the best frequencies, while the EDP method is on average ten times slower on the same architecture.
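To illustrate this difference, with for instance $N=3$ clusters, $M_i=11$ nodes per cluster and $F_j=15$ available frequencies per node (values chosen only for illustration), Maxdist evaluates on the order of $3 \times 11 \times 15 \approx 500$ combinations while EDP evaluates on the order of $3 \times 11 \times 15^2 \approx 7400$, i.e. $F_j$ times more, which is consistent with the measured order-of-magnitude difference in computation time.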
maximum distance (optimal trade-off) between the predicted energy and the
predicted performance curves for a heterogeneous cluster and a grid. Both algorithms use
new energy models for measuring and predicting the energy consumption of message passing
applications with iterations running over a heterogeneous local cluster and a grid platform.
Firstly, the proposed scaling factors selection algorithm for a heterogeneous local cluster is applied to the class C of the NAS parallel benchmarks and simulated by SimGrid.
The results of the simulations showed that the algorithm reduces on average the energy
consumption of the NAS benchmarks executed over 8 nodes by 29.8\% while limiting the performance degradation to 3.8\%. The algorithm also selects different scaling factors according to
the percentage of the computing and communication times, and according to the