-\section{Experimental results}
-\label{sec.expe}
-To evaluate the efficiency and the overall energy consumption reduction of algorithm~(\ref{HSA}),
-it was applied to the NAS parallel benchmarks NPB v3.3 \cite{44}. The experiments were executed
-on the simulator SimGrid/SMPI v3.10~\cite{casanova+giersch+legrand+al.2014.versatile} which offers
-easy tools to create a heterogeneous platform and run message passing applications over it. The
-heterogeneous platform that was used in the experiments, had one core per node because just one
-process was executed per node. The heterogeneous platform was composed of four types of nodes.
-Each type of nodes had different characteristics such as the maximum CPU frequency, the number of
-available frequencies and the computational power, see table (\ref{table:platform}). The characteristics
-of these different types of nodes are inspired from the specifications of real Intel processors.
-The heterogeneous platform had up to 144 nodes and had nodes from the four types in equal proportions,
-for example if a benchmark was executed on 8 nodes, 2 nodes from each type were used. Since the constructors
-of CPUs do not specify the dynamic and the static power of their CPUs, for each type of node they were
-chosen proportionally to its computing power (FLOPS). In the initial heterogeneous platform, while computing
-with highest frequency, each node consumed power proportional to its computing power which 80\% of it was
-dynamic power and the rest was 20\% for the static power, the same assumption was made in \cite{45,3}.
-Finally, These nodes were connected via an ethernet network with 1 Gbit/s bandwidth.
-
-
-\begin{table}[htb]
+\subsection{The algorithm details}
+
+In this section, Algorithm~\ref{HSA} is presented. It selects the frequency
+scaling factors vector that gives the best trade-off between minimizing the
+energy consumption and maximizing the performance of a message passing
+synchronous iterative application executed on a heterogeneous platform. It works
+online during the execution time of the iterative message passing program. It
+uses information gathered during the first iteration such as the computation
+time and the communication time in one iteration for each node. The algorithm is
+executed after the first iteration and returns a vector of optimal frequency
+scaling factors that satisfies the objective function (\ref{eq:max}). The
+program applies DVFS operations to change the frequencies of the CPUs according
+to the computed scaling factors. This algorithm is called just once during the
+execution of the program. Algorithm~\ref{dvfs} shows where and when the proposed
+scaling algorithm is called in the iterative MPI program.
+
+\begin{figure}[!t]
+ \centering
+ \includegraphics[scale=0.5]{fig/start_freq}
+ \caption{Selecting the initial frequencies}
+ \label{fig:st_freq}
+\end{figure}
+
+The nodes in a heterogeneous platform have different computing powers, thus
+while executing message passing iterative synchronous applications, fast nodes
+have to wait for the slower ones to finish their computations before being able
+to synchronously communicate with them as in Figure~\ref{fig:heter}. These
+periods are called idle or slack times. The algorithm takes into account this
+problem and tries to reduce these slack times when selecting the frequency
+scaling factors vector. At first, it selects initial frequency scaling factors
+that increase the execution times of fast nodes and minimize the differences
+between the computation times of fast and slow nodes. The value of the initial
+frequency scaling factor for each node is inversely proportional to its
+computation time that was gathered from the first iteration. These initial
+frequency scaling factors are computed as a ratio between the computation time
+of the slowest node and the computation time of the node $i$ as follows:
+\begin{equation}
+ \label{eq:Scp}
+ \Scp[i] = \frac{\max_{i=1,2,\dots,N}(\Tcp[i])}{\Tcp[i]}
+\end{equation}
+Using the initial frequency scaling factors computed in (\ref{eq:Scp}), the
+algorithm computes the initial frequencies for all nodes as a ratio between the
+maximum frequency of node $i$ and the computation scaling factor $\Scp[i]$ as
+follows:
+\begin{equation}
+ \label{eq:Fint}
+ F_{i} = \frac{\Fmax[i]}{\Scp[i]},~{i=1,2,\dots,N}
+\end{equation}
+If the computed initial frequency for a node is not available in the gears of
+that node, it is replaced by the nearest available frequency. In
+Figure~\ref{fig:st_freq}, the nodes are sorted by their computing power in
+ascending order and the frequencies of the faster nodes are scaled down
+according to the computed initial frequency scaling factors. The resulting new
+frequencies are highlighted in Figure~\ref{fig:st_freq}. This set of
+frequencies can be considered as a higher bound for the search space of the
+optimal vector of frequencies because selecting frequency scaling factors higher
+than the higher bound will not improve the performance of the application and it
+will increase its overall energy consumption. Therefore the algorithm that
+selects the frequency scaling factors starts the search method from these
+initial frequencies and takes a downward search direction toward lower
+frequencies. The algorithm iterates on all left frequencies, from the higher
+bound until all nodes reach their minimum frequencies, to compute their overall
+energy consumption and performance, and select the optimal frequency scaling
+factors vector. At each iteration the algorithm determines the slowest node
+according to the equation (\ref{eq:perf}) and keeps its frequency unchanged,
+while it lowers the frequency of all other nodes by one gear. The new overall
+energy consumption and execution time are computed according to the new scaling
+factors. The optimal set of frequency scaling factors is the set that gives the
+highest distance according to the objective function (\ref{eq:max}).
+
+Figures~\ref{fig:r1} and \ref{fig:r2} illustrate the normalized performance and
+consumed energy for an application running on a homogeneous platform and a
+heterogeneous platform respectively while increasing the scaling factors. It can
+be noticed that in a homogeneous platform the search for the optimal scaling
+factor should start from the maximum frequency because the performance and the
+consumed energy decrease from the beginning of the plot. On the other hand, in
+the heterogeneous platform the performance is maintained at the beginning of the
+plot even if the frequencies of the faster nodes decrease until the computing
+power of scaled down nodes are lower than the slowest node. In other words,
+until they reach the higher bound. It can also be noticed that the higher the
+difference between the faster nodes and the slower nodes is, the bigger the
+maximum distance between the energy curve and the performance curve is while the
+scaling factors are varying which results in bigger energy savings.
+
+\subsection{The evaluation of the proposed algorithm}
+\label{sec.verif.algo}
+
+The precision of the proposed algorithm mainly depends on the execution time
+prediction model defined in (\ref{eq:perf}) and the energy model computed by
+(\ref{eq:energy}). The energy model is also significantly dependent on the
+execution time model because the static energy is linearly related to the
+execution time and the dynamic energy is related to the computation time. So,
+all the works presented in this paper are based on the execution time model. To
+verify this model, the predicted execution time was compared to the real
+execution time over SimGrid/SMPI simulator,
+v3.10~\cite{casanova+giersch+legrand+al.2014.versatile}, for all the NAS
+parallel benchmarks NPB v3.3 \cite{NAS.Parallel.Benchmarks}, running class B on
+8 or 9 nodes. The comparison showed that the proposed execution time model is
+very precise, the maximum normalized difference between the predicted execution
+time and the real execution time is equal to 0.03 for all the NAS benchmarks.
+
+Since the proposed algorithm is not an exact method it does not test all the
+possible solutions (vectors of scaling factors) in the search space. To prove
+its efficiency, it was compared on small instances to a brute force search
+algorithm that tests all the possible solutions. The brute force algorithm was
+applied to different NAS benchmarks classes with different number of nodes. The
+solutions returned by the brute force algorithm and the proposed algorithm were
+identical and the proposed algorithm was on average 10 times faster than the
+brute force algorithm. It has a small execution time: for a heterogeneous
+cluster composed of four different types of nodes having the characteristics
+presented in Table~\ref{table:platform}, it takes on average \np[ms]{0.04} for 4
+nodes and \np[ms]{0.15} on average for 144 nodes to compute the best scaling
+factors vector. The algorithm complexity is $O(F\cdot N)$, where $F$ is the
+number of iterations and $N$ is the number of computing nodes. The algorithm
+needs from 12 to 20 iterations to select the best vector of frequency scaling
+factors that gives the results of the next sections.
+
+\begin{table}[!t]