-In this section we proposed an heterogeneous scaling algorithm,
-(figure~\ref{HSA}), that selects the optimal set of scaling factors from each
-node. The algorithm is numerates the suitable range of available scaling
-factors for each node in the heterogeneous cluster, returns a set of optimal
-frequency scaling factors for each node. Using heterogeneous cluster is produces
-different workloads for each node. Therefore, the fastest nodes waiting at the
-barrier for the slowest nodes to finish there work as in figure
-(\ref{fig:heter}). Our algorithm takes into account these imbalanced workloads
-when is starts to search for selecting the best scaling factors. So, the
-algorithm is selecting the initial frequencies values for each node proportional
-to the times of computations that gathered from the first iteration. As an
-example in figure (\ref{fig:st_freq}), the algorithm don't test the first
-frequencies of the fastest nodes until it converge their frequencies to the
-frequency of the slowest node. If the algorithm is starts test changing the
-frequency of the slowest nodes from beginning, we are loosing performance and
-then not selecting the best trade-off (the distance). This case will be similar
-to the homogeneous cluster when all nodes scales their frequencies together from
-the beginning. In this case there is a small distance between energy and
-performance curves, for example see the figure(\ref{fig:r1}). Then the
-algorithm searching for optimal frequency scaling factor from the selected
-frequencies until the last available ones.
-\begin{figure}[t]
+\subsection{The algorithm details}
+In this section, Algorithm~\ref{HSA} is presented. It selects the frequency
+scaling factors vector that gives the best trade-off between minimizing the
+energy consumption and maximizing the performance of a message passing
+synchronous iterative application executed on a heterogeneous platform. It works
+online during the execution time of the iterative message passing program. It
+uses information gathered during the first iteration such as the computation
+time and the communication time in one iteration for each node. The algorithm is
+executed after the first iteration and returns a vector of optimal frequency
+scaling factors that satisfies the objective function (\ref{eq:max}). The
+program applies DVFS operations to change the frequencies of the CPUs according
+to the computed scaling factors. This algorithm is called just once during the
+execution of the program. Algorithm~\ref{dvfs} shows where and when the proposed
+scaling algorithm is called in the iterative MPI program.
+
+The nodes in a heterogeneous platform have different computing powers, thus
+while executing message passing iterative synchronous applications, fast nodes
+have to wait for the slower ones to finish their computations before being able
+to synchronously communicate with them as in Figure~\ref{fig:heter}. These
+periods are called idle or slack times. The algorithm takes into account this
+problem and tries to reduce these slack times when selecting the frequency
+scaling factors vector. At first, it selects initial frequency scaling factors
+that increase the execution times of fast nodes and minimize the differences
+between the computation times of fast and slow nodes. The value of the initial
+frequency scaling factor for each node is inversely proportional to its
+computation time that was gathered from the first iteration. These initial
+frequency scaling factors are computed as a ratio between the computation time
+of the slowest node and the computation time of the node $i$ as follows:
+\begin{equation}
+ \label{eq:Scp}
+ \Scp[i] = \frac{\max_{i=1,2,\dots,N}(\Tcp[i])}{\Tcp[i]}
+\end{equation}
+Using the initial frequency scaling factors computed in (\ref{eq:Scp}), the algorithm computes
+the initial frequencies for all nodes as a ratio between the maximum frequency of node $i$
+and the computation scaling factor $\Scp[i]$ as follows:
+\begin{equation}
+ \label{eq:Fint}
+ F_{i} = \frac{\Fmax[i]}{\Scp[i]},~{i=1,2,\dots,N}
+\end{equation}
+If the computed initial frequency for a node is not available in the gears of
+that node, it is replaced by the nearest available frequency. In
+Figure~\ref{fig:st_freq}, the nodes are sorted by their computing power in
+ascending order and the frequencies of the faster nodes are scaled down
+according to the computed initial frequency scaling factors. The resulting new
+frequencies are colored in blue in Figure~\ref{fig:st_freq}. This set of
+frequencies can be considered as a higher bound for the search space of the
+optimal vector of frequencies because selecting frequency scaling factors higher
+than the higher bound will not improve the performance of the application and it
+will increase its overall energy consumption. Therefore the algorithm that
+selects the frequency scaling factors starts the search method from these
+initial frequencies and takes a downward search direction toward lower
+frequencies. The algorithm iterates on all left frequencies, from the higher
+bound until all nodes reach their minimum frequencies, to compute their overall
+energy consumption and performance, and select the optimal frequency scaling
+factors vector. At each iteration the algorithm determines the slowest node
+according to the equation (\ref{eq:perf}) and keeps its frequency unchanged,
+while it lowers the frequency of all other nodes by one gear. The new overall
+energy consumption and execution time are computed according to the new scaling
+factors. The optimal set of frequency scaling factors is the set that gives the
+highest distance according to the objective function (\ref{eq:max}).
+
+Figures~\ref{fig:r1} and \ref{fig:r2} illustrate the normalized performance and
+consumed energy for an application running on a homogeneous platform and a
+heterogeneous platform respectively while increasing the scaling factors. It can
+be noticed that in a homogeneous platform the search for the optimal scaling
+factor should start from the maximum frequency because the performance and the
+consumed energy decrease from the beginning of the plot. On the other hand,
+in the heterogeneous platform the performance is maintained at the beginning of
+the plot even if the frequencies of the faster nodes decrease until the
+computing power of scaled down nodes are lower than the slowest node. In other
+words, until they reach the higher bound. It can also be noticed that the higher
+the difference between the faster nodes and the slower nodes is, the bigger the
+maximum distance between the energy curve and the performance curve is while
+ the scaling factors are varying which results in bigger energy savings.
+\begin{figure}[!t]