A directory for the extended version have been added

[mpi-energy2.git] / Heter_paper.tex
diff --git a/Heter_paper.tex b/Heter_paper.tex

index 9d743db707e2ffa874517ab225d0d211b722b9f3..a9a1e6da497ddbd13099d0c00c389ec2b6a52213 100644 (file)
--- a/Heter_paper.tex
+++ b/Heter_paper.tex
@@ -8,7 +8,6 @@
  \usepackage{algorithm}
  \usepackage{subfig}
  \usepackage{amsmath}
-\usepackage{multirow}
  \usepackage{url}
  \DeclareUrlCommand\email{\urlstyle{same}}
  
@@ -24,47 +23,55 @@
  \newcommand{\JC}[2][inline]{%
    \todo[color=red!10,#1]{\sffamily\textbf{JC:} #2}\xspace}
  
-\newcommand{\Xsub}[2]{\ensuremath{#1_\textit{#2}}}
+\newcommand{\Xsub}[2]{{\ensuremath{#1_\mathit{#2}}}}
  
-\newcommand{\Dist}{\textit{Dist}}
+%% used to put some subscripts lower, and make them more legible
+\newcommand{\fxheight}[1]{\ifx#1\relax\relax\else\rule{0pt}{1.52ex}#1\fi}
+
+\newcommand{\CL}{\Xsub{C}{L}}
+\newcommand{\Dist}{\mathit{Dist}}
+\newcommand{\EdNew}{\Xsub{E}{dNew}}
  \newcommand{\Eind}{\Xsub{E}{ind}}
  \newcommand{\Enorm}{\Xsub{E}{Norm}}
  \newcommand{\Eoriginal}{\Xsub{E}{Original}}
  \newcommand{\Ereduced}{\Xsub{E}{Reduced}}
-\newcommand{\Fdiff}{\Xsub{F}{diff}}
-\newcommand{\Fmax}{\Xsub{F}{max}}
+\newcommand{\Es}{\Xsub{E}{S}}
+\newcommand{\Fdiff}[1][]{\Xsub{F}{diff}_{\!#1}}
+\newcommand{\Fmax}[1][]{\Xsub{F}{max}_{\fxheight{#1}}}
  \newcommand{\Fnew}{\Xsub{F}{new}}
  \newcommand{\Ileak}{\Xsub{I}{leak}}
  \newcommand{\Kdesign}{\Xsub{K}{design}}
-\newcommand{\MaxDist}{\textit{Max Dist}}
+\newcommand{\MaxDist}{\mathit{Max}\Dist}
+\newcommand{\MinTcm}{\mathit{Min}\Tcm}
  \newcommand{\Ntrans}{\Xsub{N}{trans}}
-\newcommand{\Pdyn}{\Xsub{P}{dyn}}
-\newcommand{\PnormInv}{\Xsub{P}{NormInv}}
+\newcommand{\Pd}[1][]{\Xsub{P}{d}_{\fxheight{#1}}}
+\newcommand{\PdNew}{\Xsub{P}{dNew}}
+\newcommand{\PdOld}{\Xsub{P}{dOld}}
  \newcommand{\Pnorm}{\Xsub{P}{Norm}}
-\newcommand{\Tnorm}{\Xsub{T}{Norm}}
-\newcommand{\Pstates}{\Xsub{P}{states}}
-\newcommand{\Pstatic}{\Xsub{P}{static}}
-\newcommand{\Sopt}{\Xsub{S}{opt}}
-\newcommand{\Tcomp}{\Xsub{T}{comp}}
-\newcommand{\TmaxCommOld}{\Xsub{T}{Max Comm Old}}
-\newcommand{\TmaxCompOld}{\Xsub{T}{Max Comp Old}}
-\newcommand{\Tmax}{\Xsub{T}{max}}
+\newcommand{\Ps}[1][]{\Xsub{P}{s}_{\fxheight{#1}}}
+\newcommand{\Scp}[1][]{\Xsub{S}{cp}_{#1}}
+\newcommand{\Sopt}[1][]{\Xsub{S}{opt}_{#1}}
+\newcommand{\Tcm}[1][]{\Xsub{T}{cm}_{\fxheight{#1}}}
+\newcommand{\Tcp}[1][]{\Xsub{T}{cp}_{#1}}
+\newcommand{\TcpOld}[1][]{\Xsub{T}{cpOld}_{#1}}
  \newcommand{\Tnew}{\Xsub{T}{New}}
-\newcommand{\Told}{\Xsub{T}{Old}} 
-\begin{document} 
+\newcommand{\Told}{\Xsub{T}{Old}}
+
+\begin{document}
  
-\title{Energy Consumption Reduction for Message Passing Iterative  Applications in Heterogeneous Architecture Using DVFS}
- 
-\author{% 
+\title{Energy Consumption Reduction with DVFS for \\
+  Message Passing Iterative Applications on \\
+  Heterogeneous Architectures}
+
+\author{%
    \IEEEauthorblockN{%
      Jean-Claude Charr,
      Raphaël Couturier,
      Ahmed Fanfakh and
      Arnaud Giersch
-  } 
+  }
    \IEEEauthorblockA{%
-    FEMTO-ST Institute\\
-    University of Franche-Comté\\
+    FEMTO-ST Institute, University of Franche-Comté\\
      IUT de Belfort-Montbéliard,
      19 avenue du Maréchal Juin, BP 527, 90016 Belfort cedex, France\\
      % Telephone: \mbox{+33 3 84 58 77 86}, % Raphaël
@@ -76,152 +83,174 @@
  \maketitle
  
  \begin{abstract}
-Computing platforms  are consuming  more and more  energy due to  the increasing
-number  of nodes  composing  them.  To  minimize  the operating  costs of  these
-platforms many techniques have been  used. Dynamic voltage and frequency scaling
-(DVFS) is  one of them. It  reduces the frequency of  a CPU to  lower its energy
-consumption.  However,  lowering the  frequency  of  a  CPU might  increase  the
-execution  time of  an application  running on  that processor.   Therefore, the
-frequency that  gives the best tradeoff  between the energy  consumption and the
-performance of an application must be selected.
-
-In this  paper, a new  online frequencies selecting algorithm  for heterogeneous
-platforms is presented.   It selects the frequency which tries  to give the best
-tradeoff  between  energy saving  and  performance  degradation,  for each  node
-computing the message  passing iterative application. The algorithm  has a small
-overhead and works without training or profiling. It uses a new energy model for
-message passing iterative applications  running on a heterogeneous platform. The
-proposed algorithm is  evaluated on the Simgrid simulator  while running the NAS
-parallel  benchmarks.  The  experiments   show  that  it  reduces  the  energy
-consumption by up to 35\% while  limiting the performance degradation as much as
-possible.   Finally,  the algorithm  is  compared  to  an existing  method,  the
-comparison results showing that it outperforms the latter.
+  Computing platforms are consuming more and more energy due to the increasing
+  number of nodes composing them.  To minimize the operating costs of these
+  platforms many techniques have been used. Dynamic voltage and frequency
+  scaling (DVFS) is one of them. It reduces the frequency of a CPU to lower its
+  energy consumption.  However, lowering the frequency of a CPU may increase
+  the execution time of an application running on that processor.  Therefore,
+  the frequency that gives the best trade-off between the energy consumption and
+  the performance of an application must be selected.
+
+  In this paper, a new online frequency selecting algorithm for heterogeneous
+  platforms is presented.  It selects the frequencies and tries to give the best
+  trade-off between energy saving and performance degradation, for each node
+  computing the message passing iterative application. The algorithm has a small
+  overhead and works without training or profiling. It uses a new energy model
+  for message passing iterative applications running on a heterogeneous
+  platform. The proposed algorithm is evaluated on the SimGrid simulator while
+  running the NAS parallel benchmarks.  The experiments show that it reduces the
+  energy consumption by up to \np[\%]{35} while limiting the performance
+  degradation as much as possible.  Finally, the algorithm is compared to an
+  existing method, the comparison results showing that it outperforms the
+  latter.
  
  \end{abstract}
  
  \section{Introduction}
  \label{sec.intro}
-The  need for  more  computing  power is  continually  increasing. To  partially
-satisfy  this need,  most supercomputers  constructors just  put  more computing
-nodes in their  platform. The resulting platforms might  achieve higher floating
-point operations  per second  (FLOPS), but the  energy consumption and  the heat
-dissipation  are  also increased.   As  an  example,  the Chinese  supercomputer
-Tianhe-2 had  the highest FLOPS  in November 2014  according to the  Top500 list
-\cite{TOP500_Supercomputers_Sites}.  However, it was  also the most power hungry
-platform  with  its  over  3  million cores  consuming  around  17.8  megawatts.
-Moreover,    according   to    the    U.S.    annual    energy   outlook    2014
-\cite{U.S_Annual.Energy.Outlook.2014}, the  price of energy  for 1 megawatt-hour
+
+The need for more computing power is continually increasing. To partially
+satisfy this need, most supercomputers constructors just put more computing
+nodes in their platform. The resulting platforms may achieve higher floating
+point operations per second (FLOPS), but the energy consumption and the heat
+dissipation are also increased.  As an example, the Chinese supercomputer
+Tianhe-2 had the highest FLOPS in November 2014 according to the Top500 list
+\cite{TOP500_Supercomputers_Sites}.  However, it was also the most power hungry
+platform with its over 3 million cores consuming around 17.8 megawatts.
+Moreover, according to the U.S.  annual energy outlook 2014
+\cite{U.S_Annual.Energy.Outlook.2014}, the price of energy for 1 megawatt-hour
  was approximately equal to \$70.  Therefore, the price of the energy consumed by
-the Tianhe-2  platform is approximately more  than \$10 million  each year.  The
-computing platforms must  be more energy efficient and  offer the highest number
-of FLOPS  per watt  possible, such as  the L-CSC  from the GSI  Helmholtz Center
+the Tianhe-2 platform is approximately more than \$10 million each year.  The
+computing platforms must be more energy efficient and offer the highest number
+of FLOPS per watt possible, such as the L-CSC from the GSI Helmholtz Center
  which became the top of the Green500 list in November 2014 \cite{Green500_List}.
  This heterogeneous platform executes more than 5 GFLOPS per watt while consuming
  57.15 kilowatts.
  
-Besides platform  improvements, there are many software  and hardware techniques
-to lower  the energy consumption of  these platforms, such  as scheduling, DVFS,
-...   DVFS is  a  widely used  process to  reduce  the energy  consumption of  a
-processor            by             lowering            its            frequency
+Besides platform improvements, there are many software and hardware techniques
+to lower the energy consumption of these platforms, such as scheduling, DVFS,
+\dots{} DVFS is a widely used process to reduce the energy consumption of a
+processor by lowering its frequency
  \cite{Rizvandi_Some.Observations.on.Optimal.Frequency}. However, it also reduces
-the number of FLOPS executed by the processor which might increase the execution
+the number of FLOPS executed by the processor which may increase the execution
  time of the application running over that processor.  Therefore, researchers use
-different optimization  strategies to select  the frequency that gives  the best
-tradeoff  between the  energy reduction  and performance  degradation  ratio. In
-\cite{Our_first_paper}, a  frequency selecting algorithm was  proposed to reduce
-the energy  consumption of message  passing iterative applications  running over
-homogeneous platforms.  The results of  the experiments show  significant energy
-consumption  reductions. In  this  paper, a  new  frequency selecting  algorithm
-adapted  for heterogeneous  platform  is  presented. It  selects  the vector  of
-frequencies, for  a heterogeneous platform  running a message  passing iterative
+different optimization strategies to select the frequency that gives the best
+trade-off between the energy reduction and performance degradation ratio. In
+\cite{Our_first_paper}, a frequency selecting algorithm was proposed to reduce
+the energy consumption of message passing iterative applications running over
+homogeneous platforms.  The results of the experiments show significant energy
+consumption reductions. In this paper, a new frequency selecting algorithm
+adapted for heterogeneous platform is presented. It selects the vector of
+frequencies, for a heterogeneous platform running a message passing iterative
  application, that simultaneously tries to offer the maximum energy reduction and
-minimum performance degradation ratio. The  algorithm has a very small overhead,
+minimum performance degradation ratio. The algorithm has a very small overhead,
  works online and does not need any training or profiling.
  
  This paper is organized as follows: Section~\ref{sec.relwork} presents some
  related works from other authors.  Section~\ref{sec.exe} describes how the
-execution time of message passing programs can be predicted.  It also presents an energy
-model that predicts the energy consumption of an application running over a heterogeneous platform. Section~\ref{sec.compet} presents
-the energy-performance objective function that maximizes the reduction of energy
+execution time of message passing programs can be predicted.  It also presents
+an energy model that predicts the energy consumption of an application running
+over a heterogeneous platform. Section~\ref{sec.compet} presents the
+energy-performance objective function that maximizes the reduction of energy
  consumption while minimizing the degradation of the program's performance.
-Section~\ref{sec.optim} details the proposed frequency selecting algorithm then the precision of the proposed algorithm is verified. 
-Section~\ref{sec.expe} presents the results of applying the algorithm on  the NAS parallel benchmarks and executing them 
-on a heterogeneous platform. It shows the results of running three 
-different power scenarios and comparing them. Moreover, it also shows the comparison results
-between the proposed method and an existing method.
-Finally, in Section~\ref{sec.concl} the paper ends with a summary and some future works.
+Section~\ref{sec.optim} details the proposed frequency selecting algorithm then
+the precision of the proposed algorithm is verified.  Section~\ref{sec.expe}
+presents the results of applying the algorithm on the NAS parallel benchmarks
+and executing them on a heterogeneous platform. It shows the results of running
+three different power scenarios and comparing them. Moreover, it also shows the
+comparison results between the proposed method and an existing method.  Finally,
+in Section~\ref{sec.concl} the paper ends with a summary and some future works.
  
  \section{Related works}
  \label{sec.relwork}
+
  DVFS is a technique used in modern processors to scale down both the voltage and
-the  frequency  of the  CPU  while  computing, in  order  to  reduce the  energy
-consumption of  the processor. DVFS is also  allowed in  GPUs  to achieve the
-same goal. Reducing the frequency of  a processor lowers its number of FLOPS and
-might  degrade the  performance of  the application  running on  that processor,
-especially if it is compute bound. Therefore selecting the appropriate frequency
-for a processor to satisfy some objectives while taking into account all the
-constraints,  is  not a  trivial  operation.   Many  researchers used  different
-strategies to  tackle this problem. Some  of them developed  online methods that
-compute   the  new   frequency  while   executing  the   application,   such  as
-~\cite{Hao_Learning.based.DVFS,Spiliopoulos_Green.governors.Adaptive.DVFS}. Others
-used  offline methods  that might  need to  run the  application and  profile it
-before       selecting       the        new       frequency,       such       as
-~\cite{Rountree_Bounding.energy.consumption.in.MPI,Cochran_Pack_and_Cap_Adaptive_DVFS}. The
-methods could  be heuristics, exact or  brute force methods  that satisfy varied
-objectives such as  energy reduction or performance. They  also could be adapted
-to  the  execution's  environment  and  the  type of  the  application  such  as
-sequential, parallel  or distributed architecture,  homogeneous or heterogeneous
-platform, synchronous or asynchronous application, ...
-
-In this paper, we are interested in reducing energy for message passing iterative synchronous applications running over heterogeneous platforms.
-Some works have already been done for such platforms and they can be classified into two types of heterogeneous platforms: 
+the frequency of the CPU while computing, in order to reduce the energy
+consumption of the processor. DVFS is also allowed in GPUs to achieve the same
+goal. Reducing the frequency of a processor lowers its number of FLOPS and may
+degrade the performance of the application running on that processor, especially
+if it is compute bound. Therefore selecting the appropriate frequency for a
+processor to satisfy some objectives while taking into account all the
+constraints, is not a trivial operation.  Many researchers used different
+strategies to tackle this problem. Some of them developed online methods that
+compute the new frequency while executing the application, such
+as~\cite{Hao_Learning.based.DVFS,Spiliopoulos_Green.governors.Adaptive.DVFS}.
+Others used offline methods that may need to run the application and profile
+it before selecting the new frequency, such
+as~\cite{Rountree_Bounding.energy.consumption.in.MPI,Cochran_Pack_and_Cap_Adaptive_DVFS}.
+The methods could be heuristics, exact or brute force methods that satisfy
+varied objectives such as energy reduction or performance. They also could be
+adapted to the execution's environment and the type of the application such as
+sequential, parallel or distributed architecture, homogeneous or heterogeneous
+platform, synchronous or asynchronous application, \dots{}
+
+In this paper, we are interested in reducing energy for message passing
+iterative synchronous applications running over heterogeneous platforms.  Some
+works have already been done for such platforms and they can be classified into
+two types of heterogeneous platforms:
  \begin{itemize}
-
  \item the platform is composed of homogeneous GPUs and homogeneous CPUs.
  \item the platform is only composed of heterogeneous CPUs.
-
  \end{itemize}
  
-For the first type of platform, the computing intensive parallel tasks are executed on the  GPUs and the rest are executed 
-on the CPUs.  Luley et al.
-~\cite{Luley_Energy.efficiency.evaluation.and.benchmarking}, proposed  a heterogeneous 
-cluster composed of Intel Xeon CPUs and NVIDIA GPUs. Their main goal was to maximize the 
-energy efficiency of the platform during computation by maximizing the number of FLOPS per watt generated. 
-In~\cite{KaiMa_Holistic.Approach.to.Energy.Efficiency.in.GPU-CPU}, Kai Ma et al. developed a scheduling 
-algorithm that distributes  workloads proportional to the computing power of the nodes which could be a GPU or a CPU. All the tasks must be completed at the same time.
-In~\cite{Rong_Effects.of.DVFS.on.K20.GPU}, Rong et al. showed that 
-a heterogeneous (GPUs and CPUs) cluster that enables DVFS gave better energy and performance 
-efficiency than other clusters only composed of  CPUs.
- 
-The work presented in this paper concerns the second type of platform, with heterogeneous CPUs.
-Many methods were conceived to reduce the energy consumption of this type of platform.  Naveen et al.~\cite{Naveen_Power.Efficient.Resource.Scaling}  
-developed a method that minimizes the value of $energy*delay^2$ (the delay is the sum of slack times that happen during synchronous communications) by dynamically assigning new frequencies to the CPUs of the heterogeneous cluster. Lizhe et al.~\cite{Lizhe_Energy.aware.parallel.task.scheduling} proposed
-an algorithm that divides the executed tasks into two types: the critical and 
-non critical tasks. The algorithm scales down the frequency of  non critical tasks proportionally to their  slack and communication times while limiting  the performance degradation percentage to less than 10\%. In~\cite{Joshi_Blackbox.prediction.of.impact.of.DVFS}, they developed 
-  a heterogeneous cluster composed of two  types 
-of Intel and AMD processors. They use a gradient method to predict the impact of DVFS operations on performance.
-In~\cite{Shelepov_Scheduling.on.Heterogeneous.Multicore} and \cite{Li_Minimizing.Energy.Consumption.for.Frame.Based.Tasks}, 
- the best frequencies for a specified heterogeneous cluster are selected offline using some 
-heuristic. Chen et al.~\cite{Chen_DVFS.under.quality.of.service.requirements} used a greedy dynamic programming approach to  
-minimize the power consumption of heterogeneous servers  while respecting given time constraints. This approach 
-had considerable overhead.
-In contrast to the above described papers, this paper presents the following contributions :
+For the first type of platform, the computing intensive parallel tasks are
+executed on the GPUs and the rest are executed on the CPUs.  Luley et
+al.~\cite{Luley_Energy.efficiency.evaluation.and.benchmarking}, proposed a
+heterogeneous cluster composed of Intel Xeon CPUs and NVIDIA GPUs. Their main
+goal was to maximize the energy efficiency of the platform during computation by
+maximizing the number of FLOPS per watt generated.
+In~\cite{KaiMa_Holistic.Approach.to.Energy.Efficiency.in.GPU-CPU}, Kai Ma et
+al. developed a scheduling algorithm that distributes workloads proportional to
+the computing power of the nodes which could be a GPU or a CPU. All the tasks
+must be completed at the same time.  In~\cite{Rong_Effects.of.DVFS.on.K20.GPU},
+Rong et al. showed that a heterogeneous (GPUs and CPUs) cluster that enables
+DVFS gave better energy and performance efficiency than other clusters only
+composed of CPUs.
+
+The work presented in this paper concerns the second type of platform, with
+heterogeneous CPUs.  Many methods were conceived to reduce the energy
+consumption of this type of platform.  Naveen et
+al.~\cite{Naveen_Power.Efficient.Resource.Scaling} developed a method that
+minimizes the value of $\mathit{energy}\times \mathit{delay}^2$ (the delay is
+the sum of slack times that happen during synchronous communications) by
+dynamically assigning new frequencies to the CPUs of the heterogeneous cluster.
+Lizhe et al.~\cite{Lizhe_Energy.aware.parallel.task.scheduling} proposed an
+algorithm that divides the executed tasks into two types: the critical and non
+critical tasks. The algorithm scales down the frequency of non critical tasks
+proportionally to their slack and communication times while limiting the
+performance degradation percentage to less than \np[\%]{10}.
+In~\cite{Joshi_Blackbox.prediction.of.impact.of.DVFS}, they developed a
+heterogeneous cluster composed of two types of Intel and AMD processors. They
+use a gradient method to predict the impact of DVFS operations on performance.
+In~\cite{Shelepov_Scheduling.on.Heterogeneous.Multicore} and
+\cite{Li_Minimizing.Energy.Consumption.for.Frame.Based.Tasks}, the best
+frequencies for a specified heterogeneous cluster are selected offline using
+some heuristic.  Chen et
+al.~\cite{Chen_DVFS.under.quality.of.service.requirements} used a greedy dynamic
+programming approach to minimize the power consumption of heterogeneous servers
+while respecting given time constraints.  This approach had considerable
+overhead.  In contrast to the above described papers, this paper presents the
+following contributions :
  \begin{enumerate}
-\item  two new energy and performance models for message passing iterative synchronous applications running over 
-       a heterogeneous platform. Both models take into account  communication and slack times. The models can predict the required energy and the execution time of the application.
-       
-\item a new online frequency selecting algorithm for heterogeneous platforms. The algorithm has a very small 
-      overhead and does not need any training or profiling. It uses a new optimization function which simultaneously maximizes the performance and minimizes the energy consumption of a message passing iterative synchronous application.
-      
+\item two new energy and performance models for message passing iterative
+  synchronous applications running over a heterogeneous platform. Both models
+  take into account communication and slack times. The models can predict the
+  required energy and the execution time of the application.
+
+\item a new online frequency selecting algorithm for heterogeneous
+  platforms. The algorithm has a very small overhead and does not need any
+  training or profiling. It uses a new optimization function which
+  simultaneously maximizes the performance and minimizes the energy consumption
+  of a message passing iterative synchronous application.
+
  \end{enumerate}
  
  \section{The performance and energy consumption measurements on heterogeneous architecture}
  \label{sec.exe}
  
-
-
-\subsection{The execution time of message passing distributed 
-                iterative applications on a heterogeneous platform}
+\subsection{The execution time of message passing distributed iterative
+  applications on a heterogeneous platform}
  
  In this paper, we are interested in reducing the energy consumption of message
  passing distributed iterative synchronous applications running over
@@ -231,44 +260,44 @@ network. Therefore, each node has different characteristics such as computing
  power (FLOPS), energy consumption, CPU's frequency range, \dots{} but they all
  have the same network bandwidth and latency.
  
-The overall execution time  of a distributed iterative synchronous application 
-over a heterogeneous platform  consists of the sum of the computation time and 
-the communication time for every iteration on a node. However, due to the 
-heterogeneous computation power of the computing nodes, slack times might occur 
-when fast nodes have to  wait, during synchronous communications, for  the slower 
-nodes to finish  their computations (see Figure~(\ref{fig:heter})). 
-Therefore,  the overall execution time  of the program is the execution time of the slowest
-task which has the highest computation time and no slack time.
-  
- \begin{figure}[t]
+\begin{figure}[!t]
    \centering
-   \includegraphics[scale=0.6]{fig/commtasks}
+  \includegraphics[scale=0.6]{fig/commtasks}
    \caption{Parallel tasks on a heterogeneous platform}
    \label{fig:heter}
  \end{figure}
  
-Dynamic Voltage and Frequency Scaling (DVFS) is a process, implemented in 
-modern processors, that reduces the energy consumption of a CPU by scaling 
-down its voltage and frequency.  Since DVFS lowers the frequency of a CPU 
-and consequently its computing power, the execution time of a program running 
-over that scaled down processor might increase, especially if the program is 
-compute bound.  The frequency reduction process can be  expressed by the scaling 
-factor S which is the ratio between  the maximum and the new frequency of a CPU 
+The overall execution time of a distributed iterative synchronous application
+over a heterogeneous platform consists of the sum of the computation time and
+the communication time for every iteration on a node. However, due to the
+heterogeneous computation power of the computing nodes, slack times may occur
+when fast nodes have to wait, during synchronous communications, for the slower
+nodes to finish their computations (see Figure~\ref{fig:heter}).  Therefore, the
+overall execution time of the program is the execution time of the slowest task
+which has the highest computation time and no slack time.
+
+Dynamic Voltage and Frequency Scaling (DVFS) is a process, implemented in
+modern processors, that reduces the energy consumption of a CPU by scaling
+down its voltage and frequency.  Since DVFS lowers the frequency of a CPU
+and consequently its computing power, the execution time of a program running
+over that scaled down processor may increase, especially if the program is
+compute bound.  The frequency reduction process can be  expressed by the scaling
+factor S which is the ratio between  the maximum and the new frequency of a CPU
  as in (\ref{eq:s}).
  \begin{equation}
    \label{eq:s}
- S = \frac{F_\textit{max}}{F_\textit{new}}
+  S = \frac{\Fmax}{\Fnew}
  \end{equation}
- The execution time of a compute bound sequential program is linearly proportional 
- to the frequency scaling factor $S$.  On the other hand,  message passing 
- distributed applications consist of two parts: computation and communication. 
- The execution time of the computation part is linearly proportional to the 
- frequency scaling factor $S$ but  the communication time is not affected by the 
- scaling factor because  the processors involved remain idle during the  
- communications~\cite{Freeh_Exploring.the.Energy.Time.Tradeoff}. 
- The communication time for a task is the summation of  periods of 
- time that begin with an MPI call for sending or receiving   a message 
- until the message is synchronously sent or received.
+The execution time of a compute bound sequential program is linearly
+proportional to the frequency scaling factor $S$.  On the other hand, message
+passing distributed applications consist of two parts: computation and
+communication.  The execution time of the computation part is linearly
+proportional to the frequency scaling factor $S$ but the communication time is
+not affected by the scaling factor because the processors involved remain idle
+during the communications~\cite{Freeh_Exploring.the.Energy.Time.Tradeoff}.  The
+communication time for a task is the summation of periods of time that begin
+with an MPI call for sending or receiving a message until the message is
+synchronously sent or received.
  
  Since in a heterogeneous platform each node has different characteristics,
  especially different frequency gears, when applying DVFS operations on these
@@ -282,364 +311,288 @@ operation. Then the execution time for one iteration of the application with any
  vector of scaling factors can be predicted using (\ref{eq:perf}).
  \begin{equation}
    \label{eq:perf}
- \textit  T_\textit{new} = 
- \max_{i=1,2,\dots,N} ({TcpOld_{i}} \cdot S_{i}) +  MinTcm 
+  \Tnew = \max_{i=1,2,\dots,N} ({\TcpOld[i]} \cdot S_{i}) +  \MinTcm
  \end{equation}
-Where:\\
+Where:
  \begin{equation}
-\label{eq:perf2}
- MinTcm = \min_{i=1,2,\dots,N} (Tcm_i)
+  \label{eq:perf2}
+  \MinTcm = \min_{i=1,2,\dots,N} (\Tcm[i])
  \end{equation}
-where  $TcpOld_i$ is  the computation  time of  processor $i$  during  the first
-iteration and $MinTcm$  is the communication time of  the slowest processor from
-the  first iteration.   The model  computes  the maximum  computation time  with
-scaling factor  from each node  added to the  communication time of  the slowest
-node. It means only the communication  time without any slack time is taken into
+where $\TcpOld[i]$ is the computation time of processor $i$ during the first
+iteration and $\MinTcm$ is the communication time of the slowest processor from
+the first iteration.  The model computes the maximum computation time with
+scaling factor from each node added to the communication time of the slowest
+node. It means only the communication time without any slack time is taken into
  account.  Therefore, the execution time of the iterative application is equal to
-the  execution time of  one iteration  as in  (\ref{eq:perf}) multiplied  by the
+the execution time of one iteration as in (\ref{eq:perf}) multiplied by the
  number of iterations of that application.
  
-This prediction model is developed from  the model to predict the execution time
-of     message    passing     distributed    applications     for    homogeneous
-architectures~\cite{Our_first_paper}.   The execution  time prediction  model is
-used in  the method  to optimize both the energy consumption and the performance of
-iterative methods, which is presented in the following sections.
-
+This prediction model is developed from the model to predict the execution time
+of message passing distributed applications for homogeneous
+architectures~\cite{Our_first_paper}.  The execution time prediction model is
+used in the method to optimize both the energy consumption and the performance
+of iterative methods, which is presented in the following sections.
  
  \subsection{Energy model for heterogeneous platform}
+
  Many researchers~\cite{Malkowski_energy.efficient.high.performance.computing,
-Rauber_Analytical.Modeling.for.Energy,Zhuo_Energy.efficient.Dynamic.Task.Scheduling,
-Rizvandi_Some.Observations.on.Optimal.Frequency} divide the power consumed by a processor into
-two power metrics: the static and the dynamic power.  While the first one is
-consumed as long as the computing unit is turned on, the latter is only consumed during
-computation times.  The dynamic power $Pd$ is related to the switching
-activity $\alpha$, load capacitance $C_L$, the supply voltage $V$ and
-operational frequency $F$, as shown in (\ref{eq:pd}).
+  Rauber_Analytical.Modeling.for.Energy,Zhuo_Energy.efficient.Dynamic.Task.Scheduling,
+  Rizvandi_Some.Observations.on.Optimal.Frequency} divide the power consumed by
+a processor into two power metrics: the static and the dynamic power.  While the
+first one is consumed as long as the computing unit is turned on, the latter is
+only consumed during computation times.  The dynamic power $\Pd$ is related to
+the switching activity $\alpha$, load capacitance $\CL$, the supply voltage $V$
+and operational frequency $F$, as shown in (\ref{eq:pd}).
  \begin{equation}
    \label{eq:pd}
-  Pd = \alpha \cdot C_L \cdot V^2 \cdot F
+  \Pd = \alpha \cdot \CL \cdot V^2 \cdot F
  \end{equation}
-The static power $Ps$ captures the leakage power as follows:
+The static power $\Ps$ captures the leakage power as follows:
  \begin{equation}
    \label{eq:ps}
-   Ps  = V \cdot N_{trans} \cdot K_{design} \cdot I_{leak}
+   \Ps  = V \cdot \Ntrans \cdot \Kdesign \cdot \Ileak
  \end{equation}
-where V is the supply voltage, $N_{trans}$ is the number of transistors,
-$K_{design}$ is a design dependent parameter and $I_{leak}$ is a
+where V is the supply voltage, $\Ntrans$ is the number of transistors,
+$\Kdesign$ is a design dependent parameter and $\Ileak$ is a
  technology dependent parameter.  The energy consumed by an individual processor
  to execute a given program can be computed as:
  \begin{equation}
    \label{eq:eind}
-   E_\textit{ind} =  Pd \cdot Tcp + Ps \cdot T
+  \Eind =  \Pd \cdot \Tcp + \Ps \cdot T
  \end{equation}
-where $T$ is the execution time of the program, $Tcp$ is the computation
-time and $Tcp \le T$.  $Tcp$ may be equal to $T$ if there is no
+where $T$ is the execution time of the program, $\Tcp$ is the computation
+time and $\Tcp \le T$.  $\Tcp$ may be equal to $T$ if there is no
  communication and no slack time.
  
-The main objective of DVFS operation is to reduce the overall energy consumption~\cite{Le_DVFS.Laws.of.Diminishing.Returns}.  
-The operational frequency $F$ depends linearly on the supply voltage $V$, i.e., $V = \beta \cdot F$ with some
-constant $\beta$.~This equation is used to study the change of the dynamic
-voltage with respect to various frequency values in~\cite{Rauber_Analytical.Modeling.for.Energy}.  The reduction
-process of the frequency can be expressed by the scaling factor $S$ which is the
-ratio between the maximum and the new frequency as in (\ref{eq:s}).
-The CPU governors are power schemes supplied by the operating
-system's kernel to lower a core's frequency. The new frequency 
-$F_{new}$ from (\ref{eq:s}) can be calculated as follows:
+The main objective of DVFS operation is to reduce the overall energy
+consumption~\cite{Le_DVFS.Laws.of.Diminishing.Returns}.  The operational
+frequency $F$ depends linearly on the supply voltage $V$, i.e., $V = \beta \cdot
+F$ with some constant $\beta$.~This equation is used to study the change of the
+dynamic voltage with respect to various frequency values
+in~\cite{Rauber_Analytical.Modeling.for.Energy}.  The reduction process of the
+frequency can be expressed by the scaling factor $S$ which is the ratio between
+the maximum and the new frequency as in (\ref{eq:s}).  The CPU governors are
+power schemes supplied by the operating system's kernel to lower a core's
+frequency. The new frequency $\Fnew$ from (\ref{eq:s}) can be calculated as
+follows:
  \begin{equation}
    \label{eq:fnew}
-   F_\textit{new} = S^{-1} \cdot F_\textit{max}
+   \Fnew = S^{-1} \cdot \Fmax
  \end{equation}
-Replacing $F_{new}$ in (\ref{eq:pd}) as in (\ref{eq:fnew}) gives the following 
+Replacing $\Fnew$ in (\ref{eq:pd}) as in (\ref{eq:fnew}) gives the following
  equation for dynamic power consumption:
  \begin{multline}
    \label{eq:pdnew}
-   {P}_\textit{dNew} = \alpha \cdot C_L \cdot V^2 \cdot F_{new} = \alpha \cdot C_L \cdot \beta^2 \cdot F_{new}^3 \\
-   {} = \alpha \cdot C_L \cdot V^2 \cdot F_{max} \cdot S^{-3} = P_{dOld} \cdot S^{-3}
+   \PdNew = \alpha \cdot \CL \cdot V^2 \cdot \Fnew = \alpha \cdot \CL \cdot \beta^2 \cdot \Fnew^3 \\
+   {} = \alpha \cdot \CL \cdot V^2 \cdot \Fmax \cdot S^{-3} = \PdOld \cdot S^{-3}
  \end{multline}
-where $ {P}_\textit{dNew}$  and $P_{dOld}$ are the  dynamic power consumed with the 
+where $\PdNew$  and $\PdOld$ are the  dynamic power consumed with the
  new frequency and the maximum frequency respectively.
  
-According to (\ref{eq:pdnew}) the dynamic power is reduced by a factor of $S^{-3}$ when 
-reducing the frequency by a factor of $S$~\cite{Rauber_Analytical.Modeling.for.Energy}. Since the FLOPS of a CPU is proportional 
-to the frequency of a CPU, the computation time is increased proportionally to $S$.  
-The new dynamic energy is the  dynamic power multiplied by the new time of computation 
-and is given by the following equation:
+According to (\ref{eq:pdnew}) the dynamic power is reduced by a factor of
+$S^{-3}$ when reducing the frequency by a factor of
+$S$~\cite{Rauber_Analytical.Modeling.for.Energy}. Since the FLOPS of a CPU is
+proportional to the frequency of a CPU, the computation time is increased
+proportionally to $S$.  The new dynamic energy is the dynamic power multiplied
+by the new time of computation and is given by the following equation:
  \begin{equation}
    \label{eq:Edyn}
-   E_\textit{dNew} = P_{dOld} \cdot S^{-3} \cdot (Tcp \cdot S)= S^{-2}\cdot P_{dOld} \cdot  Tcp 
+   \EdNew = \PdOld \cdot S^{-3} \cdot (\Tcp \cdot S)= S^{-2}\cdot \PdOld \cdot  \Tcp
  \end{equation}
-The static power is related to the power leakage of the CPU and is consumed during computation 
-and even when idle. As in~\cite{Rauber_Analytical.Modeling.for.Energy,Zhuo_Energy.efficient.Dynamic.Task.Scheduling}, 
- the static power of a processor is considered as constant 
-during idle and computation periods, and for all its available frequencies. 
-The static energy is the static power multiplied by the execution time of the program. 
-According to the execution time model in (\ref{eq:perf}), the execution time of the program 
-is the sum of the computation and the communication times. The computation time is linearly related  
-to the frequency scaling factor, while this scaling factor does not affect the communication time. 
-The static energy of a processor after scaling its frequency is computed as follows: 
+The static power is related to the power leakage of the CPU and is consumed
+during computation and even when idle. As
+in~\cite{Rauber_Analytical.Modeling.for.Energy,Zhuo_Energy.efficient.Dynamic.Task.Scheduling},
+the static power of a processor is considered as constant during idle and
+computation periods, and for all its available frequencies.  The static energy
+is the static power multiplied by the execution time of the program.  According
+to the execution time model in (\ref{eq:perf}), the execution time of the
+program is the sum of the computation and the communication times. The
+computation time is linearly related to the frequency scaling factor, while this
+scaling factor does not affect the communication time.  The static energy of a
+processor after scaling its frequency is computed as follows:
  \begin{equation}
    \label{eq:Estatic}
- E_\textit{s} = Ps \cdot (Tcp \cdot S  + Tcm)
+  \Es = \Ps \cdot (\Tcp \cdot S  + \Tcm)
  \end{equation}
  
-In  the  considered  heterogeneous  platform,  each  processor  $i$  might  have
-different   dynamic  and  static   powers,  noted   as  $Pd_{i}$   and  $Ps_{i}$
-respectively.  Therefore,  even if  the  distributed  message passing  iterative
-application  is  load balanced,  the  computation time  of  each  CPU $i$  noted
-$Tcp_{i}$ might  be different and  different frequency scaling factors  might be
-computed in order to decrease  the overall energy consumption of the application
-and reduce slack  times.  The communication time of a processor  $i$ is noted as
-$Tcm_{i}$  and could  contain slack  times when  communicating  with slower
-nodes,  see figure(\ref{fig:heter}).  Therefore,  all nodes  do  not have  equal
-communication  times. While  the dynamic  energy  is computed  according to  the
-frequency  scaling   factor  and   the  dynamic  power   of  each  node   as  in
-(\ref{eq:Edyn}), the static energy is computed  as the sum of the execution time
-of  one iteration multiplied  by the static  power of  each processor.   The overall
-energy consumption of a message  passing distributed application executed over a
-heterogeneous platform during one iteration  is the summation of all dynamic and
+In the considered heterogeneous platform, each processor $i$ may have
+different dynamic and static powers, noted as $\Pd[i]$ and $\Ps[i]$
+respectively.  Therefore, even if the distributed message passing iterative
+application is load balanced, the computation time of each CPU $i$ noted
+$\Tcp[i]$ may be different and different frequency scaling factors may be
+computed in order to decrease the overall energy consumption of the application
+and reduce slack times.  The communication time of a processor $i$ is noted as
+$\Tcm[i]$ and could contain slack times when communicating with slower nodes,
+see Figure~\ref{fig:heter}.  Therefore, all nodes do not have equal
+communication times. While the dynamic energy is computed according to the
+frequency scaling factor and the dynamic power of each node as in
+(\ref{eq:Edyn}), the static energy is computed as the sum of the execution time
+of one iteration multiplied by the static power of each processor.  The overall
+energy consumption of a message passing distributed application executed over a
+heterogeneous platform during one iteration is the summation of all dynamic and
  static energies for each processor.  It is computed as follows:
  \begin{multline}
    \label{eq:energy}
- E = \sum_{i=1}^{N} {(S_i^{-2} \cdot Pd_{i} \cdot  Tcp_i)} + {} \\
- \sum_{i=1}^{N} (Ps_{i} \cdot (\max_{i=1,2,\dots,N} (Tcp_i \cdot S_{i}) +
-  {MinTcm))}
- \end{multline}
-
-Reducing the frequencies of the processors according to the vector of
-scaling factors $(S_1, S_2,\dots, S_N)$ may degrade the performance of the
-application and thus, increase the static energy because the execution time is
-increased~\cite{Kim_Leakage.Current.Moore.Law}. The overall energy consumption for the iterative 
-application can be measured by measuring  the energy consumption for one iteration as in (\ref{eq:energy}) 
-multiplied by the number of iterations of that application.
+ E = \sum_{i=1}^{N} {(S_i^{-2} \cdot \Pd[i] \cdot  \Tcp[i])} + {} \\
+ \sum_{i=1}^{N} (\Ps[i] \cdot (\max_{i=1,2,\dots,N} (\Tcp[i] \cdot S_{i}) +
+  {\MinTcm))}
+\end{multline}
  
+Reducing the frequencies of the processors according to the vector of scaling
+factors $(S_1, S_2,\dots, S_N)$ may degrade the performance of the application
+and thus, increase the static energy because the execution time is
+increased~\cite{Kim_Leakage.Current.Moore.Law}. The overall energy consumption
+for the iterative application can be measured by measuring the energy
+consumption for one iteration as in (\ref{eq:energy}) multiplied by the number
+of iterations of that application.
  
  \section{Optimization of both energy consumption and performance}
  \label{sec.compet}
  
  Using the lowest frequency for each processor does not necessarily give the most
-energy efficient  execution of an  application. Indeed, even though  the dynamic
-power  is  reduced  while  scaling  down  the  frequency  of  a  processor,  its
-computation power  is proportionally decreased. Hence, the  execution time might
-be drastically  increased and  during that time,  dynamic and static  powers are
-being consumed.  Therefore,  it might cancel any gains  achieved by scaling down
-the frequency of all nodes to  the minimum and the overall energy consumption of
-the application might not  be the optimal one.  It is not  trivial to select the
-appropriate frequency  scaling factor for  each processor while  considering the
-characteristics  of each  processor  (computation power,  range of  frequencies,
-dynamic  and static  powers)  and the  task executed  (computation/communication
-ratio). The  aim being  to reduce  the overall energy  consumption and  to avoid
-increasing    significantly    the    execution    time.   In    our    previous
-work~\cite{Our_first_paper},  we  proposed a  method  that  selects the  optimal
-frequency scaling factor  for a homogeneous cluster executing  a message passing
-iterative synchronous  application while giving  the best trade-off  between the
-energy consumption and  the performance for such applications.   In this work we
-are  interested  in heterogeneous  clusters  as  described  above.  Due  to  the
-heterogeneity of the processors, a vector of scaling factors should
-be selected and  it must give the best trade-off  between energy consumption and
-performance.
-
-The  relation between  the  energy consumption  and  the execution  time for  an
-application  is complex  and nonlinear,  Thus, unlike  the relation  between the
-execution time and  the scaling factor, the relation between  the energy and the
-frequency   scaling    factors   is   nonlinear,   for    more   details   refer
-to~\cite{Freeh_Exploring.the.Energy.Time.Tradeoff}.   Moreover,  these relations
-are not  measured using the same  metric.  To solve this  problem, the execution
-time is normalized by computing the  ratio between the new execution time (after
-scaling  down the  frequencies of  some processors)  and the  initial  one (with
+energy efficient execution of an application. Indeed, even though the dynamic
+power is reduced while scaling down the frequency of a processor, its
+computation power is proportionally decreased. Hence, the execution time might
+be drastically increased and during that time, dynamic and static powers are
+being consumed.  Therefore, it might cancel any gains achieved by scaling down
+the frequency of all nodes to the minimum and the overall energy consumption of
+the application might not be the optimal one.  It is not trivial to select the
+appropriate frequency scaling factor for each processor while considering the
+characteristics of each processor (computation power, range of frequencies,
+dynamic and static powers) and the task executed (computation/communication
+ratio). The aim being to reduce the overall energy consumption and to avoid
+increasing significantly the execution time.  In our previous
+work~\cite{Our_first_paper}, we proposed a method that selects the optimal
+frequency scaling factor for a homogeneous cluster executing a message passing
+iterative synchronous application while giving the best trade-off between the
+energy consumption and the performance for such applications.  In this work we
+are interested in heterogeneous clusters as described above.  Due to the
+heterogeneity of the processors, a vector of scaling factors should be selected
+and it must give the best trade-off between energy consumption and performance.
+
+The relation between the energy consumption and the execution time for an
+application is complex and nonlinear, Thus, unlike the relation between the
+execution time and the scaling factor, the relation between the energy and the
+frequency scaling factors is nonlinear, for more details refer
+to~\cite{Freeh_Exploring.the.Energy.Time.Tradeoff}.  Moreover, these relations
+are not measured using the same metric.  To solve this problem, the execution
+time is normalized by computing the ratio between the new execution time (after
+scaling down the frequencies of some processors) and the initial one (with
  maximum frequency for all nodes) as follows:
  \begin{multline}
    \label{eq:pnorm}
-  P_\textit{Norm} = \frac{T_\textit{New}}{T_\textit{Old}}\\
-       {} = \frac{ \max_{i=1,2,\dots,N} (Tcp_{i} \cdot S_{i}) +MinTcm}
-           {\max_{i=1,2,\dots,N}{(Tcp_i+Tcm_i)}}
+  \Pnorm = \frac{\Tnew}{\Told}\\
+       {} = \frac{ \max_{i=1,2,\dots,N} (\Tcp[i] \cdot S_{i}) +\MinTcm}
+           {\max_{i=1,2,\dots,N}{(\Tcp[i]+\Tcm[i])}}
  \end{multline}
  
-
-In the same way, the energy is normalized by computing the ratio between the consumed energy 
-while scaling down the frequency and the consumed energy with maximum frequency for all nodes:
+In the same way, the energy is normalized by computing the ratio between the
+consumed energy while scaling down the frequency and the consumed energy with
+maximum frequency for all nodes:
  \begin{multline}
    \label{eq:enorm}
-  E_\textit{Norm} = \frac{E_\textit{Reduced}}{E_\textit{Original}} \\
-  {} = \frac{ \sum_{i=1}^{N}{(S_i^{-2} \cdot Pd_i \cdot  Tcp_i)} +
- \sum_{i=1}^{N} {(Ps_i \cdot T_{New})}}{\sum_{i=1}^{N}{( Pd_i \cdot  Tcp_i)} +
- \sum_{i=1}^{N} {(Ps_i \cdot T_{Old})}}
-\end{multline} 
-Where $E_\textit{Reduced}$ and $E_\textit{Original}$ are computed using (\ref{eq:energy}) and
-  $T_{New}$ and $T_{Old}$ are computed as in (\ref{eq:pnorm}).
-
-While the main 
-goal is to optimize the energy and execution time at the same time, the normalized 
-energy and execution time curves are not in the same direction. According 
-to the equations~(\ref{eq:pnorm}) and (\ref{eq:enorm}), the vector  of frequency
-scaling factors $S_1,S_2,\dots,S_N$ reduce both the energy and the execution
-time simultaneously.  But the main objective is to produce maximum energy
-reduction with minimum execution time reduction.  
-  
-This problem can be solved by making the optimization process for energy and 
-execution time following the same direction.  Therefore, the equation of the 
-normalized execution time is inverted which gives the normalized performance equation, as follows:
+  \Enorm = \frac{\Ereduced}{\Eoriginal} \\
+  {} = \frac{ \sum_{i=1}^{N}{(S_i^{-2} \cdot \Pd[i] \cdot  \Tcp[i])} +
+ \sum_{i=1}^{N} {(\Ps[i] \cdot \Tnew)}}{\sum_{i=1}^{N}{( \Pd[i] \cdot  \Tcp[i])} +
+ \sum_{i=1}^{N} {(\Ps[i] \cdot \Told)}}
+\end{multline}
+Where $\Ereduced$ and $\Eoriginal$ are computed using (\ref{eq:energy}) and
+$\Tnew$ and $\Told$ are computed as in (\ref{eq:pnorm}).
+
+While the main goal is to optimize the energy and execution time at the same
+time, the normalized energy and execution time curves are not in the same
+direction. According to the equations~(\ref{eq:pnorm}) and (\ref{eq:enorm}), the
+vector of frequency scaling factors $S_1,S_2,\dots,S_N$ reduce both the energy
+and the execution time simultaneously.  But the main objective is to produce
+maximum energy reduction with minimum execution time reduction.
+
+This problem can be solved by making the optimization process for energy and
+execution time following the same direction.  Therefore, the equation of the
+normalized execution time is inverted which gives the normalized performance
+equation, as follows:
  \begin{multline}
    \label{eq:pnorm_inv}
-  P_\textit{Norm} = \frac{T_\textit{Old}}{T_\textit{New}}\\
-          = \frac{\max_{i=1,2,\dots,N}{(Tcp_i+Tcm_i)}}
-            { \max_{i=1,2,\dots,N} (Tcp_{i} \cdot S_{i}) + MinTcm} 
+  \Pnorm = \frac{\Told}{\Tnew}\\
+          = \frac{\max_{i=1,2,\dots,N}{(\Tcp[i]+\Tcm[i])}}
+            { \max_{i=1,2,\dots,N} (\Tcp[i] \cdot S_{i}) + \MinTcm}
  \end{multline}
  
-
-\begin{figure}
+\begin{figure}[!t]
    \centering
    \subfloat[Homogeneous platform]{%
      \includegraphics[width=.33\textwidth]{fig/homo}\label{fig:r1}}%
-  
-  
+
    \subfloat[Heterogeneous platform]{%
      \includegraphics[width=.33\textwidth]{fig/heter}\label{fig:r2}}
    \label{fig:rel}
    \caption{The energy and performance relation}
  \end{figure}
  
-Then, the objective function can be modeled in order to find the maximum distance
-between the energy curve (\ref{eq:enorm}) and the  performance
-curve (\ref{eq:pnorm_inv}) over all available sets of scaling factors.  This
-represents the minimum energy consumption with minimum execution time (maximum 
-performance) at the same time, see figure~(\ref{fig:r1}) or figure~(\ref{fig:r2}). Then the objective
-function has the following form:
+Then, the objective function can be modeled in order to find the maximum
+distance between the energy curve (\ref{eq:enorm}) and the performance curve
+(\ref{eq:pnorm_inv}) over all available sets of scaling factors.  This
+represents the minimum energy consumption with minimum execution time (maximum
+performance) at the same time, see Figure~\ref{fig:r1} or
+Figure~\ref{fig:r2}. Then the objective function has the following form:
  \begin{equation}
    \label{eq:max}
-  Max Dist = 
-  \max_{i=1,\dots F, j=1,\dots,N}
-      (\overbrace{P_\textit{Norm}(S_{ij})}^{\text{Maximize}} -
-       \overbrace{E_\textit{Norm}(S_{ij})}^{\text{Minimize}} )
+  \MaxDist =
+  \mathop{\max_{i=1,\dots F}}_{j=1,\dots,N}
+      (\overbrace{\Pnorm(S_{ij})}^{\text{Maximize}} -
+       \overbrace{\Enorm(S_{ij})}^{\text{Minimize}} )
  \end{equation}
-where $N$ is the number of nodes and $F$ is the  number of available frequencies for each node. 
-Then, the optimal set of scaling factors that satisfies (\ref{eq:max}) can be selected.  
-The objective function can work with any energy model or any power values for each node 
-(static and dynamic powers). However, the most important energy reduction gain can be achieved when 
-the energy curve has a convex form as shown in~\cite{Zhuo_Energy.efficient.Dynamic.Task.Scheduling,Rauber_Analytical.Modeling.for.Energy,Hao_Learning.based.DVFS}.
+where $N$ is the number of nodes and $F$ is the number of available frequencies
+for each node.  Then, the optimal set of scaling factors that satisfies
+(\ref{eq:max}) can be selected.  The objective function can work with any energy
+model or any power values for each node (static and dynamic powers). However,
+the most important energy reduction gain can be achieved when the energy curve
+has a convex form as shown
+in~\cite{Zhuo_Energy.efficient.Dynamic.Task.Scheduling,Rauber_Analytical.Modeling.for.Energy,Hao_Learning.based.DVFS}.
  
  \section{The scaling factors selection algorithm for heterogeneous platforms }
  \label{sec.optim}
  
-\subsection{The algorithm details}
-In this section, algorithm \ref{HSA} is presented. It selects the frequency scaling factors 
-vector that gives the best trade-off between minimizing the energy consumption  and maximizing 
-the performance of a message passing synchronous iterative application executed on a heterogeneous 
-platform. It works online during the execution time of the iterative message passing program.  
-It uses information gathered during the first iteration such as the computation time and the 
-communication time in one iteration for each node. The algorithm is executed  after the first 
-iteration and returns a vector of optimal frequency scaling factors   that satisfies the objective 
-function (\ref{eq:max}). The program applies DVFS operations to change the frequencies of the CPUs 
-according to the computed scaling factors.  This algorithm is called just once during the execution 
-of the program. Algorithm~(\ref{dvfs}) shows where and when the proposed scaling algorithm is called 
-in the iterative MPI program.
-
-The nodes in a heterogeneous platform have different computing powers, thus while executing message 
-passing iterative synchronous applications, fast nodes have to wait for the slower ones to finish their 
-computations before being able to synchronously communicate with them as in figure (\ref{fig:heter}). 
-These periods are called idle or slack times. 
-The algorithm takes into account this problem and tries to reduce these slack times when selecting the 
-frequency scaling factors vector. At first, it selects initial frequency scaling factors that increase 
-the execution times of fast nodes and  minimize the  differences between  the  computation times of 
-fast and slow nodes. The value of the initial frequency scaling factor  for each node is inversely 
-proportional to its computation time that was gathered from the first iteration. These initial frequency 
-scaling factors are computed as   a ratio between the computation time of the slowest node and the 
-computation time of the node $i$ as follows:
-\begin{equation}
-  \label{eq:Scp}
- Scp_{i} = \frac{\max_{i=1,2,\dots,N}(Tcp_i)}{Tcp_i}
-\end{equation}
-Using the initial  frequency scaling factors computed in (\ref{eq:Scp}), the algorithm computes 
-the initial frequencies for all nodes as a ratio between the maximum frequency of node $i$  
-and the computation scaling factor $Scp_i$ as follows:
-\begin{equation}
-  \label{eq:Fint}
- F_{i} = \frac{Fmax_i}{Scp_i},~{i=1,2,\cdots,N}
-\end{equation}
-If the computed  initial frequency for a  node is not available in  the gears of
-that  node,  it  is replaced  by  the  nearest  available frequency.  In  figure
-(\ref{fig:st_freq}), the nodes are sorted by their computing power in ascending
-order and the  frequencies of the faster nodes are scaled  down according to the
-computed initial  frequency scaling factors.  The resulting new  frequencies are
-colored in  blue in figure (\ref{fig:st_freq}).  This set of  frequencies can be
-considered  as a higher  bound for  the search  space of  the optimal  vector of
-frequencies because  selecting frequency scaling factors higher  than the higher
-bound will not  improve the performance of the application  and it will increase
-its  overall  energy  consumption.  Therefore  the algorithm  that  selects  the
-frequency  scaling   factors  starts  the  search  method   from  these  initial
-frequencies and takes a downward  search direction toward lower frequencies. The
-algorithm  iterates on all  left frequencies,  from the  higher bound  until all
-nodes  reach  their  minimum   frequencies,  to  compute  their  overall  energy
-consumption and  performance, and select  the optimal frequency  scaling factors
-vector. At each iteration the algorithm determines the slowest node according to
-the equation (\ref{eq:perf}) and keeps  its frequency unchanged, while it lowers
-the  frequency  of  all  other  nodes  by one  gear.   The  new  overall  energy
-consumption  and  execution time  are  computed  according  to the  new  scaling
-factors.  The optimal set of frequency scaling factors is the set that gives the
-highest distance according to the objective function (\ref{eq:max}).
-
-Figures~\ref{fig:r1} and \ref{fig:r2}  illustrate the normalized performance and
-consumed  energy for  an application  running on  a homogeneous  platform  and a
-heterogeneous platform respectively while increasing the scaling factors. It can
-be noticed  that in a  homogeneous platform the  search for the  optimal scaling
-factor should start  from the maximum frequency because  the performance and the
-consumed energy decrease from the beginning of the plot. On the other hand,
-in the heterogeneous platform the  performance is maintained at the beginning of
-the plot  even if the  frequencies of the  faster nodes decrease  until the
-computing power of scaled down  nodes are lower than the slowest  node. In other
-words, until they reach the higher bound. It can also be noticed that the higher
-the difference between the faster nodes  and the slower nodes is, the bigger the
-maximum distance  between the  energy curve and  the performance curve  is while
- the scaling factors are varying which results in bigger energy savings.
-\begin{figure}[t]
-  \centering
-    \includegraphics[scale=0.5]{fig/start_freq}
-  \caption{Selecting the initial frequencies}
-  \label{fig:st_freq}
-\end{figure}
-
-
-
-
  \begin{algorithm}
    \begin{algorithmic}[1]
      % \footnotesize
      \Require ~
      \begin{description}
-    \item[$Tcp_i$] array of all computation times for all nodes during one iteration and with highest frequency.
-    \item[$Tcm_i$] array of all communication times for all nodes during one iteration and with highest frequency.
-    \item[$Fmax_i$] array of the maximum frequencies for all nodes.
-    \item[$Pd_i$] array of the dynamic powers for all nodes.
-    \item[$Ps_i$] array of the static powers for all nodes.
-    \item[$Fdiff_i$] array of the difference between two successive frequencies for all nodes.
+    \item[{$\Tcp[i]$}] array of all computation times for all nodes during one iteration and with highest frequency.
+    \item[{$\Tcm[i]$}] array of all communication times for all nodes during one iteration and with highest frequency.
+    \item[{$\Fmax[i]$}] array of the maximum frequencies for all nodes.
+    \item[{$\Pd[i]$}] array of the dynamic powers for all nodes.
+    \item[{$\Ps[i]$}] array of the static powers for all nodes.
+    \item[{$\Fdiff[i]$}] array of the difference between two successive frequencies for all nodes.
      \end{description}
-    \Ensure $Sopt_1,Sopt_2 \dots, Sopt_N$ is a vector of optimal scaling factors
+    \Ensure $\Sopt[1],\Sopt[2] \dots, \Sopt[N]$ is a vector of optimal scaling factors
  
-    \State $ Scp_i \gets \frac{\max_{i=1,2,\dots,N}(Tcp_i)}{Tcp_i} $
-    \State $F_{i} \gets  \frac{Fmax_i}{Scp_i},~{i=1,2,\cdots,N}$
+    \State $\Scp[i] \gets \frac{\max_{i=1,2,\dots,N}(\Tcp[i])}{\Tcp[i]} $
+    \State $F_{i} \gets  \frac{\Fmax[i]}{\Scp[i]},~{i=1,2,\cdots,N}$
      \State Round the computed initial frequencies $F_i$ to the closest one available in each node.
      \If{(not the first frequency)}
-          \State $F_i \gets F_i+Fdiff_i,~i=1,\dots,N.$
-    \EndIf 
-    \State $T_\textit{Old} \gets max_{~i=1,\dots,N } (Tcp_i+Tcm_i)$
-    \State $E_\textit{Original} \gets \sum_{i=1}^{N}{( Pd_i \cdot  Tcp_i)} +\sum_{i=1}^{N} {(Ps_i \cdot T_{Old})}$
-    \State  $Sopt_{i} \gets 1,~i=1,\dots,N. $
-    \State $Dist \gets 0 $
+          \State $F_i \gets F_i+\Fdiff[i],~i=1,\dots,N.$
+    \EndIf
+    \State $\Told \gets \max_{i=1,\dots,N} (\Tcp[i]+\Tcm[i])$
+    % \State $\Eoriginal \gets \sum_{i=1}^{N}{( \Pd[i] \cdot  \Tcp[i])} +\sum_{i=1}^{N} {(\Ps[i] \cdot \Told)}$
+    \State $\Eoriginal \gets \sum_{i=1}^{N}{( \Pd[i] \cdot  \Tcp[i] + \Ps[i] \cdot \Told)}$
+    \State $\Sopt[i] \gets 1,~i=1,\dots,N. $
+    \State $\Dist \gets 0 $
      \While {(all nodes not reach their  minimum  frequency)}
          \If{(not the last freq. \textbf{and} not the slowest node)}
-        \State $F_i \gets F_i - Fdiff_i,~i=1,\dots,N.$
-        \State $S_i \gets \frac{Fmax_i}{F_i},~i=1,\dots,N.$
+        \State $F_i \gets F_i - \Fdiff[i],~i=1,\dots,N.$
+        \State $S_i \gets \frac{\Fmax[i]}{F_i},~i=1,\dots,N.$
          \EndIf
-       \State $T_{New} \gets max_\textit{~i=1,\dots,N} (Tcp_{i} \cdot S_{i}) + MinTcm $
-       \State $E_\textit{Reduced} \gets \sum_{i=1}^{N}{(S_i^{-2} \cdot Pd_i \cdot  Tcp_i)} + $  \hspace*{43 mm} 
-               $\sum_{i=1}^{N} {(Ps_i \cdot T_{New})} $
-       \State $ P_\textit{Norm} \gets \frac{T_\textit{Old}}{T_\textit{New}}$
-       \State $E_\textit{Norm}\gets \frac{E_\textit{Reduced}}{E_\textit{Original}}$
+       \State $\Tnew \gets \max_{i=1,\dots,N} (\Tcp[i] \cdot S_{i}) + \MinTcm $
+%       \State $\Ereduced \gets \sum_{i=1}^{N}{(S_i^{-2} \cdot \Pd[i] \cdot  \Tcp[i])} + \sum_{i=1}^{N} {(\Ps[i] \cdot \rlap{\Tnew)}} $
+       \State $\Ereduced \gets \sum_{i=1}^{N}{(S_i^{-2} \cdot \Pd[i] \cdot  \Tcp[i] + \Ps[i] \cdot \rlap{\Tnew)}} $
+       \State $\Pnorm \gets \frac{\Told}{\Tnew}$
+       \State $\Enorm\gets \frac{\Ereduced}{\Eoriginal}$
        \If{$(\Pnorm - \Enorm > \Dist)$}
-        \State $Sopt_{i} \gets S_{i},~i=1,\dots,N. $
+        \State $\Sopt[i] \gets S_{i},~i=1,\dots,N. $
          \State $\Dist \gets \Pnorm - \Enorm$
        \EndIf
      \EndWhile
-    \State  Return $Sopt_1,Sopt_2,\dots,Sopt_N$
+    \State  Return $\Sopt[1],\Sopt[2],\dots,\Sopt[N]$
    \end{algorithmic}
    \caption{frequency scaling factors selection algorithm}
    \label{HSA}
@@ -654,7 +607,7 @@ maximum distance  between the  energy curve and  the performance curve  is while
        \If {$(k=1)$}
          \State Gather all times of computation and\newline\hspace*{3em}%
                 communication from each node.
-        \State Call algorithm \ref{HSA}.
+        \State Call Algorithm \ref{HSA}.
          \State Compute the new frequencies from the\newline\hspace*{3em}%
                 returned optimal scaling factors.
          \State Set the new frequencies to nodes.
@@ -665,439 +618,553 @@ maximum distance  between the  energy curve and  the performance curve  is while
    \label{dvfs}
  \end{algorithm}
  
-\subsection{The evaluation of the proposed algorithm}
-\label{sec.verif.algo}
-The precision  of the  proposed algorithm mainly  depends on the  execution time
-prediction model  defined in  (\ref{eq:perf}) and the  energy model  computed by
-(\ref{eq:energy}).   The energy  model is  also significantly  dependent  on the
-execution  time model  because  the static  energy  is linearly  related to  the
-execution time  and the dynamic energy  is related to the  computation time. So,
-all the works presented  in this paper are based on the  execution time model. To
-verify  this  model, the  predicted  execution time  was  compared  to the  real
-execution          time           over          SimGrid/SMPI          simulator,
-v3.10~\cite{casanova+giersch+legrand+al.2014.versatile},   for   all   the   NAS
-parallel benchmarks NPB v3.3  \cite{NAS.Parallel.Benchmarks}, running class B on
-8 or  9 nodes. The comparison showed  that the proposed execution  time model is
-very precise, the maximum  normalized difference between the predicted execution
-time and the real execution time is equal to 0.03 for all the NAS benchmarks.
+\subsection{The algorithm details}
  
-Since  the proposed algorithm is not an exact method it does not test all the possible solutions (vectors of scaling factors) 
-in the search space. To prove its efficiency, it was compared on small instances to a brute force search algorithm 
-that tests all the possible solutions. The brute force algorithm was applied to different NAS benchmarks classes with 
-different number of nodes. The solutions returned by the brute force algorithm and the proposed algorithm were identical 
-and the proposed algorithm was on average 10 times faster than the brute force algorithm. It has a small execution time: 
-for a heterogeneous cluster composed of four different types of nodes having the characteristics presented in 
-table~\ref{table:platform}, it takes on average \np[ms]{0.04}  for 4 nodes and \np[ms]{0.15} on average for 144 nodes 
-to compute the best scaling factors vector.  The algorithm complexity is $O(F\cdot (N \cdot4) )$, where $F$ is the number 
-of iterations and $N$ is the number of computing nodes. The algorithm needs  from 12 to 20 iterations to select the best 
-vector of frequency scaling factors that gives the results of the next sections.
+In this section, Algorithm~\ref{HSA} is presented. It selects the frequency
+scaling factors vector that gives the best trade-off between minimizing the
+energy consumption and maximizing the performance of a message passing
+synchronous iterative application executed on a heterogeneous platform. It works
+online during the execution time of the iterative message passing program.  It
+uses information gathered during the first iteration such as the computation
+time and the communication time in one iteration for each node. The algorithm is
+executed after the first iteration and returns a vector of optimal frequency
+scaling factors that satisfies the objective function (\ref{eq:max}). The
+program applies DVFS operations to change the frequencies of the CPUs according
+to the computed scaling factors.  This algorithm is called just once during the
+execution of the program. Algorithm~\ref{dvfs} shows where and when the proposed
+scaling algorithm is called in the iterative MPI program.
+
+\begin{figure}[!t]
+  \centering
+  \includegraphics[scale=0.5]{fig/start_freq}
+  \caption{Selecting the initial frequencies}
+  \label{fig:st_freq}
+\end{figure}
  
-\section{Experimental results}
-\label{sec.expe}
-To  evaluate the  efficiency and  the  overall energy  consumption reduction  of
-algorithm~\ref{HSA}, it was applied to the NAS parallel benchmarks NPB v3.3. The
-experiments were executed on the  simulator SimGrid/SMPI which offers easy tools
-to create a heterogeneous platform and run message passing applications over it.
-The heterogeneous  platform that was used  in the experiments, had  one core per
-node because just one process was executed per node.  The heterogeneous platform
-was  composed  of  four  types  of  nodes. Each  type  of  nodes  had  different
-characteristics  such as  the maximum  CPU  frequency, the  number of  available
-frequencies  and the  computational power,  see Table  \ref{table:platform}. The
-characteristics  of  these  different  types  of nodes  are  inspired  from  the
-specifications of real  Intel processors.  The heterogeneous platform  had up to
-144 nodes and had nodes from the four types in equal proportions, for example if
-a benchmark was executed on 8 nodes, 2 nodes from each type were used. Since the
-constructors of  CPUs do not specify the  dynamic and the static  power of their
-CPUs, for  each type of  node they were  chosen proportionally to  its computing
-power  (FLOPS).  In  the initial  heterogeneous platform,  while  computing with
-highest frequency,  each node  consumed an amount  of power proportional  to its
-computing  power  (which  corresponds to  80\%  of  its  dynamic power  and  the
-remaining  20\%  to  the  static   power),  the  same  assumption  was  made  in
-\cite{Our_first_paper,Rauber_Analytical.Modeling.for.Energy}.    Finally,  These
-nodes were connected via an ethernet network with 1 Gbit/s bandwidth.
+The nodes in a heterogeneous platform have different computing powers, thus
+while executing message passing iterative synchronous applications, fast nodes
+have to wait for the slower ones to finish their computations before being able
+to synchronously communicate with them as in Figure~\ref{fig:heter}.  These
+periods are called idle or slack times.  The algorithm takes into account this
+problem and tries to reduce these slack times when selecting the frequency
+scaling factors vector. At first, it selects initial frequency scaling factors
+that increase the execution times of fast nodes and minimize the differences
+between the computation times of fast and slow nodes. The value of the initial
+frequency scaling factor for each node is inversely proportional to its
+computation time that was gathered from the first iteration. These initial
+frequency scaling factors are computed as a ratio between the computation time
+of the slowest node and the computation time of the node $i$ as follows:
+\begin{equation}
+  \label{eq:Scp}
+  \Scp[i] = \frac{\max_{i=1,2,\dots,N}(\Tcp[i])}{\Tcp[i]}
+\end{equation}
+Using the initial frequency scaling factors computed in (\ref{eq:Scp}), the
+algorithm computes the initial frequencies for all nodes as a ratio between the
+maximum frequency of node $i$ and the computation scaling factor $\Scp[i]$ as
+follows:
+\begin{equation}
+  \label{eq:Fint}
+  F_{i} = \frac{\Fmax[i]}{\Scp[i]},~{i=1,2,\dots,N}
+\end{equation}
+If the computed initial frequency for a node is not available in the gears of
+that node, it is replaced by the nearest available frequency.  In
+Figure~\ref{fig:st_freq}, the nodes are sorted by their computing power in
+ascending order and the frequencies of the faster nodes are scaled down
+according to the computed initial frequency scaling factors.  The resulting new
+frequencies are highlighted in Figure~\ref{fig:st_freq}.  This set of
+frequencies can be considered as a higher bound for the search space of the
+optimal vector of frequencies because selecting frequency scaling factors higher
+than the higher bound will not improve the performance of the application and it
+will increase its overall energy consumption.  Therefore the algorithm that
+selects the frequency scaling factors starts the search method from these
+initial frequencies and takes a downward search direction toward lower
+frequencies. The algorithm iterates on all left frequencies, from the higher
+bound until all nodes reach their minimum frequencies, to compute their overall
+energy consumption and performance, and select the optimal frequency scaling
+factors vector. At each iteration the algorithm determines the slowest node
+according to the equation (\ref{eq:perf}) and keeps its frequency unchanged,
+while it lowers the frequency of all other nodes by one gear.  The new overall
+energy consumption and execution time are computed according to the new scaling
+factors.  The optimal set of frequency scaling factors is the set that gives the
+highest distance according to the objective function (\ref{eq:max}).
+
+Figures~\ref{fig:r1} and \ref{fig:r2} illustrate the normalized performance and
+consumed energy for an application running on a homogeneous platform and a
+heterogeneous platform respectively while increasing the scaling factors. It can
+be noticed that in a homogeneous platform the search for the optimal scaling
+factor should start from the maximum frequency because the performance and the
+consumed energy decrease from the beginning of the plot. On the other hand, in
+the heterogeneous platform the performance is maintained at the beginning of the
+plot even if the frequencies of the faster nodes decrease until the computing
+power of scaled down nodes are lower than the slowest node. In other words,
+until they reach the higher bound. It can also be noticed that the higher the
+difference between the faster nodes and the slower nodes is, the bigger the
+maximum distance between the energy curve and the performance curve is while the
+scaling factors are varying which results in bigger energy savings.
+
+\subsection{The evaluation of the proposed algorithm}
+\label{sec.verif.algo}
  
+The precision of the proposed algorithm mainly depends on the execution time
+prediction model defined in (\ref{eq:perf}) and the energy model computed by
+(\ref{eq:energy}).  The energy model is also significantly dependent on the
+execution time model because the static energy is linearly related to the
+execution time and the dynamic energy is related to the computation time. So,
+all the works presented in this paper are based on the execution time model. To
+verify this model, the predicted execution time was compared to the real
+execution time over SimGrid/SMPI simulator,
+v3.10~\cite{casanova+giersch+legrand+al.2014.versatile}, for all the NAS
+parallel benchmarks NPB v3.3 \cite{NAS.Parallel.Benchmarks}, running class B on
+8 or 9 nodes. The comparison showed that the proposed execution time model is
+very precise, the maximum normalized difference between the predicted execution
+time and the real execution time is equal to 0.03 for all the NAS benchmarks.
  
-\begin{table}[htb]
+Since the proposed algorithm is not an exact method it does not test all the
+possible solutions (vectors of scaling factors) in the search space. To prove
+its efficiency, it was compared on small instances to a brute force search
+algorithm that tests all the possible solutions. The brute force algorithm was
+applied to different NAS benchmarks classes with different number of nodes. The
+solutions returned by the brute force algorithm and the proposed algorithm were
+identical and the proposed algorithm was on average 10 times faster than the
+brute force algorithm. It has a small execution time: for a heterogeneous
+cluster composed of four different types of nodes having the characteristics
+presented in Table~\ref{table:platform}, it takes on average \np[ms]{0.04} for 4
+nodes and \np[ms]{0.15} on average for 144 nodes to compute the best scaling
+factors vector.  The algorithm complexity is $O(F\cdot N)$, where $F$ is the
+maximum number of available frequencies, and $N$ is the number of computing
+nodes. The algorithm needs from 12 to 20 iterations to select the best vector of
+frequency scaling factors that gives the results of the next sections.
+
+\begin{table}[!t]
    \caption{Heterogeneous nodes characteristics}
    % title of Table
    \centering
-  \begin{tabular}{|*{7}{l|}}
+  \begin{tabular}{|*{7}{r|}}
      \hline
-    Node          &Simulated  & Max      & Min          & Diff.          & Dynamic      & Static \\
-    type          &GFLOPS     & Freq.    & Freq.        & Freq.          & power        & power \\
-                  &           & GHz      & GHz          &GHz             &              &       \\
+    Node   & Simulated & Max   & Min   & Diff. & Dynamic    & Static    \\
+    type   & GFLOPS    & Freq. & Freq. & Freq. & power      & power     \\
+           &           & GHz   & GHz   & GHz   &            &           \\
      \hline
-    1             &40         & 2.5      & 1.2          & 0.1            & 20~w         &4~w    \\
-         
+    1      & 40        & 2.50  & 1.20  & 0.100 & \np[W]{20} & \np[W]{4} \\
      \hline
-    2             &50         & 2.66     & 1.6          & 0.133          & 25~w         &5~w    \\
-                  
+    2      & 50        & 2.66  & 1.60  & 0.133 & \np[W]{25} & \np[W]{5} \\
      \hline
-    3             &60         & 2.9      & 1.2          & 0.1            & 30~w         &6~w    \\
-                  
+    3      & 60        & 2.90  & 1.20  & 0.100 & \np[W]{30} & \np[W]{6} \\
      \hline
-    4             &70         & 3.4      & 1.6          & 0.133          & 35~w         &7~w    \\
-                  
+    4      & 70        & 3.40  & 1.60  & 0.133 & \np[W]{35} & \np[W]{7} \\
      \hline
    \end{tabular}
    \label{table:platform}
  \end{table}
  
- 
-%\subsection{Performance prediction verification}
+\section{Experimental results}
+\label{sec.expe}
+
+To evaluate the efficiency and the overall energy consumption reduction of
+Algorithm~\ref{HSA}, it was applied to the NAS parallel benchmarks NPB v3.3. The
+experiments were executed on the simulator SimGrid/SMPI which offers easy tools
+to create a heterogeneous platform and run message passing applications over it.
+The heterogeneous platform that was used in the experiments, had one core per
+node because just one process was executed per node.  The heterogeneous platform
+was composed of four types of nodes. Each type of nodes had different
+characteristics such as the maximum CPU frequency, the number of available
+frequencies and the computational power, see Table~\ref{table:platform}. The
+characteristics of these different types of nodes are inspired from the
+specifications of real Intel processors.  The heterogeneous platform had up to
+144 nodes and had nodes from the four types in equal proportions, for example if
+a benchmark was executed on 8 nodes, 2 nodes from each type were used. Since the
+constructors of CPUs do not specify the dynamic and the static power of their
+CPUs, for each type of node they were chosen proportionally to its computing
+power (FLOPS).  In the initial heterogeneous platform, while computing with
+highest frequency, each node consumed an amount of power proportional to its
+computing power (which corresponds to \np[\%]{80} of its dynamic power and the
+remaining \np[\%]{20} to the static power), the same assumption was made in
+\cite{Our_first_paper,Rauber_Analytical.Modeling.for.Energy}.  Finally, These
+nodes were connected via an Ethernet network with \np[Gbit/s]{1} bandwidth.
  
  
  \subsection{The experimental results of the scaling algorithm}
  \label{sec.res}
  
-
  The proposed algorithm was applied to the seven parallel NAS benchmarks (EP, CG,
-MG, FT, BT, LU and SP) and  the benchmarks were executed with the three classes:
+MG, FT, BT, LU and SP) and the benchmarks were executed with the three classes:
  A, B and C. However, due to the lack of space in this paper, only the results of
-the  biggest class,  C, are  presented while  being run  on different  number of
-nodes,  ranging from 4  to 128  or 144  nodes depending  on the  benchmark being
-executed. Indeed, the benchmarks CG, MG, LU, EP and FT had to be executed on $1,
-2, 4, 8, 16, 32, 64, 128$ nodes.   The other benchmarks such as BT and SP had to
-be executed on $1, 4, 9, 16, 36, 64, 144$ nodes.
-
- 
- 
-\begin{table}[htb]
+the biggest class, C, are presented while being run on different number of
+nodes, ranging from 4 to 128 or 144 nodes depending on the benchmark being
+executed. Indeed, the benchmarks CG, MG, LU, EP and FT had to be executed on 1,
+2, 4, 8, 16, 32, 64, or 128 nodes.  The other benchmarks such as BT and SP had
+to be executed on 1, 4, 9, 16, 36, 64, or 144 nodes.
+
+\begin{table}[!t]
    \caption{Running NAS benchmarks on 4 nodes }
    % title of Table
    \centering
-  \begin{tabular}{|*{7}{l|}}
+  \begin{tabular}{|*{7}{r|}}
      \hline
-    Program     & Execution     & Energy         & Energy      & Performance        & Distance      \\
-    name       & time/s        & consumption/J  & saving\%    & degradation\%      &               \\
+    \hspace{-2.2084pt}%
+    Program & Execution & Energy        & Energy   & Performance   & Distance \\
+    name    & time/s    & consumption/J & saving\% & degradation\% &          \\
      \hline
-    CG         &  64.64        & 3560.39        &34.16        &6.72               &27.44       \\
-    \hline 
-    MG         & 18.89         & 1074.87           &35.37            &4.34                   &31.03       \\
-   \hline
-    EP         &79.73         &5521.04         &26.83            &3.04               &23.79      \\
-   \hline
-    LU         &308.65        &21126.00           &34.00             &6.16                   &27.84      \\
+    CG      &  64.64    &  3560.39      & 34.16    & 6.72          & 27.44    \\
+    \hline
+    MG      &  18.89    &  1074.87      & 35.37    & 4.34          & 31.03    \\
+    \hline
+    EP      &  79.73    &  5521.04      & 26.83    & 3.04          & 23.79    \\
+    \hline
+    LU      & 308.65    & 21126.00      & 34.00    & 6.16          & 27.84    \\
+    \hline
+    BT      & 360.12    & 21505.55      & 35.36    & 8.49          & 26.87    \\
+    \hline
+    SP      & 234.24    & 13572.16      & 35.22    & 5.70          & 29.52    \\
+    \hline
+    FT      &  81.58    &  4151.48      & 35.58    & 0.99          & 34.59    \\
      \hline
-    BT         &360.12         &21505.55          &35.36         &8.49               &26.87     \\
-   \hline
-    SP         &234.24        &13572.16           &35.22         &5.70               &29.52    \\
-   \hline
-    FT         &81.58          &4151.48        &35.58         &0.99                  &34.59    \\
-\hline 
    \end{tabular}
    \label{table:res_4n}
-\end{table}
+% \end{table}
  
-\begin{table}[htb]
+  \medskip
+% \begin{table}[!t]
    \caption{Running NAS benchmarks on 8 and 9 nodes }
    % title of Table
    \centering
-  \begin{tabular}{|*{7}{l|}}
+  \begin{tabular}{|*{7}{r|}}
      \hline
-    Program    & Execution     & Energy         & Energy      & Performance        & Distance      \\
-    name       & time/s        & consumption/J  & saving\%    & degradation\%      &               \\
+    \hspace{-2.2084pt}%
+    Program & Execution & Energy        & Energy   & Performance   & Distance \\
+    name    & time/s    & consumption/J & saving\% & degradation\% &          \\
      \hline
-    CG         &36.11             &3263.49             &31.25        &7.12                    &24.13     \\
-    \hline 
-    MG         &8.99          &953.39          &33.78        &6.41                    &27.37     \\
-   \hline
-    EP         &40.39         &5652.81         &27.04        &0.49                    &26.55     \\
-   \hline
-    LU         &218.79            &36149.77        &28.23        &0.01                    &28.22      \\
+    CG      &  36.11    &  3263.49      & 31.25    & 7.12          & 24.13    \\
+    \hline
+    MG      &   8.99    &   953.39      & 33.78    & 6.41          & 27.37    \\
+    \hline
+    EP      &  40.39    &  5652.81      & 27.04    & 0.49          & 26.55    \\
+    \hline
+    LU      & 218.79    & 36149.77      & 28.23    & 0.01          & 28.22    \\
+    \hline
+    BT      & 166.89    & 23207.42      & 32.32    & 7.89          & 24.43    \\
+    \hline
+    SP      & 104.73    & 18414.62      & 24.73    & 2.78          & 21.95    \\
+    \hline
+    FT      &  51.10    &  4913.26      & 31.02    & 2.54          & 28.48    \\
      \hline
-    BT         &166.89                &23207.42            &32.32            &7.89                    &24.43      \\
-   \hline
-    SP         &104.73        &18414.62            &24.73            &2.78                    &21.95      \\
-   \hline
-    FT         &51.10         &4913.26         &31.02        &2.54                    &28.48      \\
-\hline 
    \end{tabular}
    \label{table:res_8n}
-\end{table}
+% \end{table}
  
-\begin{table}[htb]
+  \medskip
+% \begin{table}[!t]
    \caption{Running NAS benchmarks on 16 nodes }
    % title of Table
    \centering
-  \begin{tabular}{|*{7}{l|}}
+  \begin{tabular}{|*{7}{r|}}
      \hline
-    Program     & Execution     & Energy         & Energy      & Performance        & Distance      \\
-    name       & time/s        & consumption/J  & saving\%    & degradation\%      &               \\
+    \hspace{-2.2084pt}%
+    Program & Execution & Energy        & Energy   & Performance   & Distance \\
+    name    & time/s    & consumption/J & saving\% & degradation\% &          \\
      \hline
-    CG         &31.74         &4373.90         &26.29        &9.57                    &16.72          \\
-    \hline 
-    MG         &5.71          &1076.19         &32.49        &6.05                    &26.44         \\
-   \hline
-    EP         &20.11         &5638.49         &26.85        &0.56                    &26.29         \\
-   \hline
-    LU         &144.13        &42529.06            &28.80            &6.56                    &22.24         \\
+    CG      &  31.74    &  4373.90      & 26.29    & 9.57          & 16.72    \\
+    \hline
+    MG      &   5.71    &  1076.19      & 32.49    & 6.05          & 26.44    \\
+    \hline
+    EP      &  20.11    &  5638.49      & 26.85    & 0.56          & 26.29    \\
+    \hline
+    LU      & 144.13    & 42529.06      & 28.80    & 6.56          & 22.24    \\
+    \hline
+    BT      &  97.29    & 22813.86      & 34.95    & 5.80          & 29.15    \\
+    \hline
+    SP      &  66.49    & 20821.67      & 22.49    & 3.82          & 18.67    \\
+    \hline
+    FT      &  37.01    &  5505.60      & 31.59    & 6.48          & 25.11    \\
      \hline
-    BT         &97.29         &22813.86            &34.95        &5.80                &29.15         \\
-   \hline
-    SP         &66.49         &20821.67            &22.49            &3.82                    &18.67         \\
-   \hline
-    FT            &37.01          &5505.60             &31.59        &6.48                    &25.11         \\
-\hline 
    \end{tabular}
    \label{table:res_16n}
-\end{table}
+% \end{table}
  
-\begin{table}[htb]
+  \medskip
+% \begin{table}[!t]
    \caption{Running NAS benchmarks on 32 and 36 nodes }
    % title of Table
    \centering
-  \begin{tabular}{|*{7}{l|}}
+  \begin{tabular}{|*{7}{r|}}
      \hline
-    Program    & Execution     & Energy         & Energy      & Performance        & Distance      \\
-    name       & time/s        & consumption/J  & saving\%    & degradation\%      &               \\
+    \hspace{-2.2084pt}%
+    Program & Execution & Energy        & Energy   & Performance   & Distance \\
+    name    & time/s    & consumption/J & saving\% & degradation\% &          \\
      \hline
-    CG         &32.35         &6704.21         &16.15        &5.30                    &10.85           \\
-    \hline 
-    MG         &4.30          &1355.58         &28.93        &8.85                    &20.08          \\
-   \hline
-    EP         &9.96           &5519.68                &26.98        &0.02                    &26.96          \\
-   \hline
-    LU         &99.93         &67463.43            &23.60            &2.45                    &21.15          \\
+    CG      & 32.35     &  6704.21      & 16.15    & 5.30          & 10.85    \\
+    \hline
+    MG      &  4.30     &  1355.58      & 28.93    & 8.85          & 20.08    \\
+    \hline
+    EP      &  9.96     &  5519.68      & 26.98    & 0.02          & 26.96    \\
+    \hline
+    LU      & 99.93     & 67463.43      & 23.60    & 2.45          & 21.15    \\
+    \hline
+    BT      & 48.61     & 23796.97      & 34.62    & 5.83          & 28.79    \\
+    \hline
+    SP      & 46.01     & 27007.43      & 22.72    & 3.45          & 19.27    \\
+    \hline
+    FT      & 28.06     &  7142.69      & 23.09    & 2.90          & 20.19    \\
      \hline
-    BT         &48.61         &23796.97            &34.62            &5.83                    &28.79          \\
-   \hline
-    SP         &46.01         &27007.43            &22.72            &3.45                    &19.27           \\
-   \hline
-    FT            &28.06          &7142.69             &23.09        &2.90                    &20.19           \\
-\hline 
    \end{tabular}
    \label{table:res_32n}
-\end{table}
+% \end{table}
  
-\begin{table}[htb]
+  \medskip
+% \begin{table}[!t]
    \caption{Running NAS benchmarks on 64 nodes }
    % title of Table
    \centering
-  \begin{tabular}{|*{7}{l|}}
+  \begin{tabular}{|*{7}{r|}}
      \hline
-    Program    & Execution     & Energy         & Energy      & Performance        & Distance      \\
-    name       & time/s        & consumption/J  & saving\%    & degradation\%      &               \\
+    \hspace{-2.2084pt}%
+    Program & Execution & Energy        & Energy   & Performance   & Distance \\
+    name    & time/s    & consumption/J & saving\% & degradation\% &          \\
      \hline
-    CG         &46.65         &17521.83            &8.13             &1.68                    &6.45           \\
-    \hline 
-    MG         &3.27          &1534.70         &29.27        &14.35               &14.92          \\
-   \hline
-    EP         &5.05           &5471.1084          &27.12            &3.11                &24.01         \\
-   \hline
-    LU         &73.92         &101339.16           &21.96            &3.67                    &18.29         \\
+    CG      & 46.65     &  17521.83     &  8.13    &  1.68         &  6.45    \\
+    \hline
+    MG      &  3.27     &   1534.70     & 29.27    & 14.35         & 14.92    \\
+    \hline
+    EP      &  5.05     &   5471.11     & 27.12    &  3.11         & 24.01    \\
+    \hline
+    LU      & 73.92     & 101339.16     & 21.96    &  3.67         & 18.29    \\
+    \hline
+    BT      & 39.99     &  27166.71     & 32.02    & 12.28         & 19.74    \\
+    \hline
+    SP      & 52.00     &  49099.28     & 24.84    &  0.03         & 24.81    \\
+    \hline
+    FT      & 25.97     &  10416.82     & 20.15    &  4.87         & 15.28    \\
      \hline
-    BT         &39.99         &27166.71            &32.02            &12.28               &19.74         \\
-   \hline
-    SP         &52.00         &49099.28            &24.84            &0.03                    &24.81         \\
-   \hline
-    FT         &25.97         &10416.82        &20.15        &4.87                    &15.28         \\
-\hline 
    \end{tabular}
    \label{table:res_64n}
-\end{table}
+% \end{table}
  
-
-\begin{table}[htb]
+  \medskip
+% \begin{table}[!t]
    \caption{Running NAS benchmarks on 128 and 144 nodes }
    % title of Table
    \centering
-  \begin{tabular}{|*{7}{l|}}
+  \begin{tabular}{|*{7}{r|}}
      \hline
-    Program    & Execution     & Energy         & Energy      & Performance        & Distance     \\
-    name       & time/s        & consumption/J  & saving\%    & degradation\%      &              \\
+    \hspace{-2.2084pt}%
+    Program & Execution & Energy        & Energy   & Performance   & Distance \\
+    name    & time/s    & consumption/J & saving\% & degradation\% &          \\
      \hline
-    CG         &56.92         &41163.36        &4.00         &1.10                    &2.90          \\
-    \hline 
-    MG         &3.55           &2843.33         &18.77       &10.38               &8.39          \\
-   \hline
-    EP         &2.67           &5669.66                &27.09        &0.03                    &27.06         \\
-   \hline
-    LU         &51.23         &144471.90       &16.67        &2.36                    &14.31         \\
+    CG      & 56.92     &  41163.36     &  4.00    &  1.10         &  2.90    \\
+    \hline
+    MG      &  3.55     &   2843.33     & 18.77    & 10.38         &  8.39    \\
+    \hline
+    EP      &  2.67     &   5669.66     & 27.09    &  0.03         & 27.06    \\
+    \hline
+    LU      & 51.23     & 144471.90     & 16.67    &  2.36         & 14.31    \\
+    \hline
+    BT      & 37.96     &  44243.82     & 23.18    &  1.28         & 21.90    \\
+    \hline
+    SP      & 64.53     & 115409.71     & 26.72    &  0.05         & 26.67    \\
+    \hline
+    FT      & 25.51     &  18808.72     & 12.85    &  2.84         & 10.01    \\
      \hline
-    BT         &37.96          &44243.82           &23.18            &1.28                    &21.90         \\
-   \hline
-    SP         &64.53         &115409.71           &26.72            &0.05                    &26.67         \\
-   \hline
-    FT         &25.51         &18808.72            &12.85            &2.84                    &10.01         \\
-\hline 
    \end{tabular}
    \label{table:res_128n}
  \end{table}
-The overall energy  consumption was computed for each  instance according to the
-energy  consumption  model  (\ref{eq:energy}),  with and  without  applying  the
-algorithm. The execution time was also measured for all these experiments. Then,
-the energy saving and performance degradation percentages were computed for each
-instance.    The   results   are   presented  in   Tables   (\ref{table:res_4n},
-\ref{table:res_8n},           \ref{table:res_16n},          \ref{table:res_32n},
-\ref{table:res_64n} and \ref{table:res_128n}). All these results are the average
-values  from many experiments  for energy  savings and  performance degradation.
-The tables show the experimental results for running the NAS parallel benchmarks
-on  different  number  of  nodes.   The  experiments  show  that  the  algorithm
-significantly reduces the energy consumption (up to 35\%) and tries to limit the
-performance  degradation.  They  also  show that  the  energy saving  percentage
-decreases when the  number of computing nodes increases.   This reduction is due
-to the increase of the communication  times compared to the execution times when
-the benchmarks are run over a high number of nodes.  Indeed, the benchmarks with
-the  same  class,  C,  are  executed  on different  numbers  of  nodes,  so  the
-computation required  for each iteration is  divided by the  number of computing
-nodes.  On the other hand,  more communications are required when increasing the
-number  of  nodes so  the  static energy  increases  linearly  according to  the
-communication time and the dynamic power  is less relevant in the overall energy
-consumption.   Therefore, reducing the  frequency with  algorithm~(\ref{HSA}) is
-less effective  in reducing the overall  energy savings. It can  also be noticed
-that for the benchmarks EP and  SP that contain little or no communications, the
-energy savings are  not significantly affected by the high  number of nodes.  No
-experiments were conducted  using bigger classes than D,  because they require a
-lot  of memory (more  than 64GB)  when being  executed by  the simulator  on one
-machine.   The maximum  distance between  the  normalized energy  curve and  the
-normalized performance for each instance is  also shown in the result tables. It
-decrease in the same way as  the energy saving percentage.  The tables also show
-that the performance degradation  percentage is not significantly increased when
-the number  of computing  nodes is increased  because the computation  times are
-small when compared to the communication times.
-
-
- 
-\begin{figure}
+
+\begin{figure}[!t]
    \centering
-  \subfloat[Energy saving]{%
+  \subfloat[Energy saving (\%)]{%
      \includegraphics[width=.33\textwidth]{fig/energy}\label{fig:energy}}%
-  
-  \subfloat[Performance degradation ]{%
+
+  \subfloat[Performance degradation (\%)]{%
      \includegraphics[width=.33\textwidth]{fig/per_deg}\label{fig:per_deg}}
    \label{fig:avg}
    \caption{The energy and performance for all NAS benchmarks running with a different number of nodes}
  \end{figure}
  
-Figures  \ref{fig:energy} and  \ref{fig:per_deg} present  the energy  saving and
-performance  degradation respectively for  all the  benchmarks according  to the
-number of used nodes. As shown  in the first plot, the energy saving percentages
-of the benchmarks MG,  LU, BT and FT decrease linearly when  the number of nodes
-increase. While  for the EP and  SP benchmarks, the energy  saving percentage is
-not affected by the increase of  the number of computing nodes, because in these
-benchmarks there are little or  no communications. Finally, the energy saving of
-the  GC benchmark  significantly  decrease  when the  number  of nodes  increase
-because this benchmark has more  communications than the others. The second plot
-shows that  the performance  degradation percentages of  most of  the benchmarks
-decrease when  they run on a  big number of  nodes because they spend  more time
-communicating than computing,  thus, scaling down the frequencies  of some nodes
+The overall energy consumption was computed for each instance according to the
+energy consumption model (\ref{eq:energy}), with and without applying the
+algorithm. The execution time was also measured for all these experiments. Then,
+the energy saving and performance degradation percentages were computed for each
+instance.  The results are presented in Tables~\ref{table:res_4n},
+\ref{table:res_8n}, \ref{table:res_16n}, \ref{table:res_32n},
+\ref{table:res_64n} and \ref{table:res_128n}. All these results are the average
+values from many experiments for energy savings and performance degradation.
+The tables show the experimental results for running the NAS parallel benchmarks
+on different number of nodes.  The experiments show that the algorithm
+significantly reduces the energy consumption (up to \np[\%]{35}) and tries to
+limit the performance degradation.  They also show that the energy saving
+percentage decreases when the number of computing nodes increases.  This
+reduction is due to the increase of the communication times compared to the
+execution times when the benchmarks are run over a higher number of nodes.
+Indeed, the benchmarks with the same class, C, are executed on different numbers
+of nodes, so the computation required for each iteration is divided by the
+number of computing nodes.  On the other hand, more communications are required
+when increasing the number of nodes so the static energy increases linearly
+according to the communication time and the dynamic power is less relevant in
+the overall energy consumption.  Therefore, reducing the frequency with
+Algorithm~\ref{HSA} is less effective in reducing the overall energy savings. It
+can also be noticed that for the benchmarks EP and SP that contain little or no
+communications, the energy savings are not significantly affected by the high
+number of nodes.  No experiments were conducted using bigger classes than D,
+because they require a lot of memory (more than \np[GB]{64}) when being executed
+by the simulator on one machine.  The maximum distance between the normalized
+energy curve and the normalized performance for each instance is also shown in
+the result tables. It decrease in the same way as the energy saving percentage.
+The tables also show that the performance degradation percentage is not
+significantly increased when the number of computing nodes is increased because
+the computation times are small when compared to the communication times.
+
+Figures~\ref{fig:energy} and \ref{fig:per_deg} present the energy saving and
+performance degradation respectively for all the benchmarks according to the
+number of used nodes. As shown in the first plot, the energy saving percentages
+of the benchmarks MG, LU, BT and FT decrease linearly when the number of nodes
+increase. While for the EP and SP benchmarks, the energy saving percentage is
+not affected by the increase of the number of computing nodes, because in these
+benchmarks there are little or no communications. Finally, the energy saving of
+the GC benchmark significantly decrease when the number of nodes increase
+because this benchmark has more communications than the others. The second plot
+shows that the performance degradation percentages of most of the benchmarks
+decrease when they run on a big number of nodes because they spend more time
+communicating than computing, thus, scaling down the frequencies of some nodes
  has less effect on the performance.
  
-
-
-
  \subsection{The results for different power consumption scenarios}
  \label{sec.compare}
-The results  of the previous section  were obtained while  using processors that
-consume during  computation an overall power  which is 80\%  composed of dynamic
-power and of 20\% of static power. In this section, these ratios are changed and
-two new  power scenarios are  considered in order  to evaluate how  the proposed
-algorithm adapts itself  according to the static and  dynamic power values.  The
-two new power scenarios are the following:
+
+The results of the previous section were obtained while using processors that
+consume during computation an overall power which is \np[\%]{80} composed of
+dynamic power and of \np[\%]{20} of static power. In this section, these ratios
+are changed and two new power scenarios are considered in order to evaluate how
+the proposed algorithm adapts itself according to the static and dynamic power
+values.  The two new power scenarios are the following:
  
  \begin{itemize}
-\item 70\% of dynamic power  and 30\% of static power
-\item 90\% of dynamic power  and 10\% of static power
+\item \np[\%]{70} of dynamic power and \np[\%]{30} of static power
+\item \np[\%]{90} of dynamic power and \np[\%]{10} of static power
  \end{itemize}
  
-The NAS parallel benchmarks were  executed again over processors that follow the
-new power scenarios.   The class C of each  benchmark was run over 8  or 9 nodes
-and   the    results   are   presented   in    Tables   \ref{table:res_s1}   and
-\ref{table:res_s2}. These tables  show that the energy saving  percentage of the
-70\%-30\% scenario is  smaller for all benchmarks compared  to the energy saving
-of the 90\%-10\% scenario. Indeed, in  the latter more dynamic power is consumed
-when  nodes are running  on their  maximum frequencies,  thus, scaling  down the
-frequency of  the nodes results in  higher energy savings than  in the 70\%-30\%
-scenario. On the  other hand, the performance degradation  percentage is smaller
-in the 70\%-30\% scenario compared to the 90\%-10\% scenario. This is due to the
-higher  static  power percentage  in  the first  scenario  which  makes it  more
-relevant in the  overall consumed energy.  Indeed, the  static energy is related
-to the execution time and if  the performance is degraded the amount of consumed
-static  energy directly  increas.  Therefore,  the proposed  algorithm  does not
-really significantly  scale down much the  frequencies of the nodes  in order to
-limit the  increase of the  execution time and  thus limiting the effect  of the
+The NAS parallel benchmarks were executed again over processors that follow the
+new power scenarios.  The class C of each benchmark was run over 8 or 9 nodes
+and the results are presented in Tables~\ref{table:res_s1} and
+\ref{table:res_s2}. These tables show that the energy saving percentage of the
+\np[\%]{70}-\np[\%]{30} scenario is smaller for all benchmarks compared to the
+energy saving of the \np[\%]{90}-\np[\%]{10} scenario.  Indeed, in the latter
+more dynamic power is consumed when nodes are running on their maximum
+frequencies, thus, scaling down the frequency of the nodes results in higher
+energy savings than in the \np[\%]{70}-\np[\%]{30} scenario. On the other hand,
+the performance degradation percentage is smaller in the \np[\%]{70}-\np[\%]{30}
+scenario compared to the \np[\%]{90}-\np[\%]{10} scenario. This is due to the
+higher static power percentage in the first scenario which makes it more
+relevant in the overall consumed energy.  Indeed, the static energy is related
+to the execution time and if the performance is degraded the amount of consumed
+static energy directly increases.  Therefore, the proposed algorithm does not
+really significantly scale down much the frequencies of the nodes in order to
+limit the increase of the execution time and thus limiting the effect of the
  consumed static energy.
  
-Both   new  power   scenarios   are  compared   to   the  old   one  in   figure
-(\ref{fig:sen_comp}). It  shows the average of the  performance degradation, the
-energy saving and the  distances for all NAS benchmarks of class  C running on 8
-or 9 nodes.   The comparison shows that the energy  saving ratio is proportional
-to the dynamic power ratio: it is increased when applying the 90\%-10\% scenario
-because at  maximum frequency  the dynamic  energy is the  most relevant  in the
-overall consumed  energy and can  be reduced by  lowering the frequency  of some
-processors. On  the other hand, the  energy saving decreases  when the 70\%-30\%
-scenario is  used because  the dynamic  energy is less  relevant in  the overall
-consumed energy and  lowering the frequency does not  return big energy savings.
-Moreover, the average  of the performance degradation is  decreased when using a
-higher  ratio   for  static  power  (e.g.   70\%-30\%   scenario  and  80\%-20\%
-scenario). Since  the proposed algorithm  optimizes the energy  consumption when
-using a  higher ratio for dynamic  power the algorithm  selects bigger frequency
-scaling  factors that result  in more  energy saving  but less  performance, for
-example see  Figure (\ref{fig:scales_comp}). The  opposite happens when  using a
-higher  ratio for  static power,  the algorithm  proportionally  selects smaller
-scaling  values which result  in less  energy saving  but also  less performance
+Both new power scenarios are compared to the old one in
+Figure~\ref{fig:sen_comp}. It shows the average of the performance degradation,
+the energy saving and the distances for all NAS benchmarks of class C running on
+8 or 9 nodes.  The comparison shows that the energy saving ratio is proportional
+to the dynamic power ratio: it is increased when applying the
+\np[\%]{90}-\np[\%]{10} scenario because at maximum frequency the dynamic energy
+is the most relevant in the overall consumed energy and can be reduced by
+lowering the frequency of some processors. On the other hand, the energy saving
+decreases when the \np[\%]{70}-\np[\%]{30} scenario is used because the dynamic
+energy is less relevant in the overall consumed energy and lowering the
+frequency does not return big energy savings.  Moreover, the average of the
+performance degradation is decreased when using a higher ratio for static power
+(e.g.  \np[\%]{70}-\np[\%]{30} scenario and \np[\%]{80}-\np[\%]{20}
+scenario). Since the proposed algorithm optimizes the energy consumption when
+using a higher ratio for dynamic power the algorithm selects bigger frequency
+scaling factors that result in more energy saving but less performance, for
+example see Figure~\ref{fig:scales_comp}. The opposite happens when using a
+higher ratio for static power, the algorithm proportionally selects smaller
+scaling values which result in less energy saving but also less performance
  degradation.
  
-
- \begin{table}[htb]
-  \caption{The results of the 70\%-30\% power scenario}
+\begin{table}[!t]
+  \caption{The results of the \np[\%]{70}-\np[\%]{30} power scenario}
    % title of Table
    \centering
-  \begin{tabular}{|*{6}{l|}}
+  \begin{tabular}{|*{6}{r|}}
      \hline
-    Program    & Energy          & Energy      & Performance        & Distance     \\
-    name       & consumption/J   & saving\%    & degradation\%      &              \\
+    Program & Energy        & Energy   & Performance   & Distance \\
+    name    & consumption/J & saving\% & degradation\% &          \\
      \hline
-    CG         &4144.21          &22.42        &7.72                &14.70         \\
-    \hline 
-    MG         &1133.23          &24.50        &5.34                &19.16          \\
+    CG      &  4144.21      & 22.42    & 7.72          & 14.70    \\
+    \hline
+    MG      &  1133.23      & 24.50    & 5.34          & 19.16    \\
     \hline
-    EP         &6170.30                &16.19         &0.02                &16.17          \\
+    EP      &  6170.30      & 16.19    & 0.02          & 16.17    \\
     \hline
-    LU         &39477.28        &20.43        &0.07                &20.36          \\
+    LU      & 39477.28      & 20.43    & 0.07          & 20.36    \\
      \hline
-    BT         &26169.55           &25.34             &6.62                &18.71          \\
+    BT      & 26169.55      & 25.34    & 6.62          & 18.71    \\
     \hline
-    SP         &19620.09           &19.32             &3.66                &15.66          \\
+    SP      & 19620.09      & 19.32    & 3.66          & 15.66    \\
     \hline
-    FT         &6094.07                &23.17         &0.36                &22.81          \\
-\hline 
+    FT      &  6094.07      & 23.17    & 0.36          & 22.81    \\
+    \hline
    \end{tabular}
    \label{table:res_s1}
  \end{table}
  
-
-
-\begin{table}[htb]
-  \caption{The results of the 90\%-10\% power scenario}
+\begin{table}[!t]
+  \caption{The results of the \np[\%]{90}-\np[\%]{10} power scenario}
    % title of Table
    \centering
-  \begin{tabular}{|*{6}{l|}}
+  \begin{tabular}{|*{6}{r|}}
      \hline
-    Program    & Energy          & Energy      & Performance        & Distance     \\
-    name       & consumption/J   & saving\%    & degradation\%      &              \\
+    Program & Energy          & Energy      & Performance   & Distance \\
+    name    & consumption/J   & saving\%    & degradation\% &          \\
      \hline
-    CG         &2812.38                 &36.36        &6.80                &29.56         \\
-    \hline 
-    MG         &825.427                 &38.35        &6.41                &31.94         \\
-   \hline
-    EP         &5281.62                 &35.02        &2.68                &32.34         \\
-   \hline
-    LU         &31611.28            &39.15        &3.51                    &35.64        \\
+    CG      &  2812.38        & 36.36       &  6.80         & 29.56    \\
+    \hline
+    MG      &   825.43        & 38.35       &  6.41         & 31.94    \\
+    \hline
+    EP      &  5281.62        & 35.02       &  2.68         & 32.34    \\
+    \hline
+    LU      & 31611.28        & 39.15       &  3.51         & 35.64    \\
+    \hline
+    BT      & 21296.46        & 36.70       &  6.60         & 30.10    \\
+    \hline
+    SP      & 15183.42        & 35.19       & 11.76         & 23.43    \\
+    \hline
+    FT      &  3856.54        & 40.80       &  5.67         & 35.13    \\
      \hline
-    BT         &21296.46            &36.70            &6.60                &30.10       \\
-   \hline
-    SP         &15183.42            &35.19            &11.76               &23.43        \\
-   \hline
-    FT         &3856.54                 &40.80        &5.67                &35.13        \\
-\hline 
    \end{tabular}
    \label{table:res_s2}
  \end{table}
  
+\begin{table}[!t]
+  \caption{Comparing the proposed algorithm}
+  \centering
+  \begin{tabular}{|*{7}{r|}}
+    \hline
+    Program & \multicolumn{2}{c|}{Energy saving \%}
+            & \multicolumn{2}{c|}{Perf.  degradation \%}
+            & \multicolumn{2}{c|}{Distance} \\
+    \cline{2-7}
+    name    & EDP   & MaxDist & EDP  & MaxDist & EDP   & MaxDist \\
+    \hline
+    CG      & 27.58 & 31.25   & 5.82 & 7.12    & 21.76 & 24.13   \\
+    \hline
+    MG      & 29.49 & 33.78   & 3.74 & 6.41    & 25.75 & 27.37   \\
+    \hline
+    LU      & 19.55 & 28.33   & 0.00 & 0.01    & 19.55 & 28.22   \\
+    \hline
+    EP      & 28.40 & 27.04   & 4.29 & 0.49    & 24.11 & 26.55   \\
+    \hline
+    BT      & 27.68 & 32.32   & 6.45 & 7.87    & 21.23 & 24.43   \\
+    \hline
+    SP      & 20.52 & 24.73   & 5.21 & 2.78    & 15.31 & 21.95   \\
+    \hline
+    FT      & 27.03 & 31.02   & 2.75 & 2.54    & 24.28 & 28.48   \\
+    \hline
+  \end{tabular}
+  \label{table:compare_EDP}
+\end{table}
  
-\begin{figure}
+\begin{figure}[!t]
    \centering
    \subfloat[Comparison  between the results on 8 nodes]{%
      \includegraphics[width=.33\textwidth]{fig/sen_comp}\label{fig:sen_comp}}%
@@ -1106,79 +1173,68 @@ degradation.
      \includegraphics[width=.33\textwidth]{fig/three_scenarios}\label{fig:scales_comp}}
    \label{fig:comp}
    \caption{The comparison of the three power scenarios}
-\end{figure}  
-
+\end{figure}
  
+\begin{figure}[!t]
+  \centering
+   \includegraphics[scale=0.5]{fig/compare_EDP.pdf}
+  \caption{Trade-off comparison for NAS benchmarks class C}
+  \label{fig:compare_EDP}
+\end{figure}
  
  
  \subsection{The comparison of the proposed scaling algorithm }
  \label{sec.compare_EDP}
-In this section, the scaling  factors selection algorithm, called MaxDist,
-is compared to Spiliopoulos et al. algorithm \cite{Spiliopoulos_Green.governors.Adaptive.DVFS}, called EDP. 
-They developed a green governor that regularly applies an online frequency selecting algorithm to reduce the energy consumed by a multicore architecture without degrading much its performance. The algorithm selects the frequencies that minimize the energy and delay products, $EDP=Enegry*Delay$ using the predicted overall energy consumption and execution time delay for each frequency.
-To fairly compare both algorithms, the same energy and execution time models, equations (\ref{eq:energy}) and  (\ref{eq:fnew}), were used for both algorithms to predict the energy consumption and the execution times. Also Spiliopoulos et al. algorithm was adapted to  start the search from the 
-initial frequencies computed using the equation (\ref{eq:Fint}). The resulting algorithm is an exhaustive search algorithm that minimizes the EDP and has the initial frequencies values as an upper bound.
-
-Both algorithms were applied to the parallel NAS benchmarks to compare their efficiency. Table \ref{table:compare_EDP}  presents the results of comparing the execution times and the energy consumptions for both versions of the NAS benchmarks while running the class C of each benchmark over 8 or 9 heterogeneous nodes. The results show that our algorithm provides better energy savings than Spiliopoulos et al. algorithm, 
-on average it results in 29.76\% energy saving while their algorithm returns just 25.75\%. The average of performance degradation percentage is approximately the same for both algorithms, about 4\%. 
  
+In this section, the scaling factors selection algorithm, called MaxDist, is
+compared to Spiliopoulos et al. algorithm
+\cite{Spiliopoulos_Green.governors.Adaptive.DVFS}, called EDP.  They developed a
+green governor that regularly applies an online frequency selecting algorithm to
+reduce the energy consumed by a multicore architecture without degrading much
+its performance. The algorithm selects the frequencies that minimize the energy
+and delay products, $\mathit{EDP}=\mathit{energy}\times \mathit{delay}$ using
+the predicted overall energy consumption and execution time delay for each
+frequency.  To fairly compare both algorithms, the same energy and execution
+time models, equations (\ref{eq:energy}) and (\ref{eq:fnew}), were used for both
+algorithms to predict the energy consumption and the execution times. Also
+Spiliopoulos et al. algorithm was adapted to start the search from the initial
+frequencies computed using the equation (\ref{eq:Fint}). The resulting algorithm
+is an exhaustive search algorithm that minimizes the EDP and has the initial
+frequencies values as an upper bound.
+
+Both algorithms were applied to the parallel NAS benchmarks to compare their
+efficiency. Table~\ref{table:compare_EDP} presents the results of comparing the
+execution times and the energy consumption for both versions of the NAS
+benchmarks while running the class C of each benchmark over 8 or 9 heterogeneous
+nodes. The results show that our algorithm provides better energy savings than
+Spiliopoulos et al. algorithm, on average it results in \np[\%]{29.76} energy
+saving while their algorithm returns just \np[\%]{25.75}. The average of
+performance degradation percentage is approximately the same for both
+algorithms, about \np[\%]{4}.
  
  For all benchmarks,  our algorithm outperforms Spiliopoulos et  al. algorithm in
-terms of  energy and  performance tradeoff, see  figure (\ref{fig:compare_EDP}),
+terms of  energy and  performance trade-off, see  Figure~\ref{fig:compare_EDP},
  because it maximizes the distance  between the energy saving and the performance
  degradation values while giving the same weight for both metrics.
  
-
-
-
-\begin{table}[h]
- \caption{Comparing the proposed algorithm}
- \centering
-\begin{tabular}{|l|l|l|l|l|l|l|l|}
-\hline
-\multicolumn{2}{|l|}{\multirow{2}{*}{\begin{tabular}[c]{@{}l@{}}Program \\ name\end{tabular}}} & \multicolumn{2}{l|}{Energy saving \%} & \multicolumn{2}{l|}{Perf.  degradation \%} & \multicolumn{2}{l|}{Distance} \\ \cline{3-8} 
-\multicolumn{2}{|l|}{}                                                                         & EDP             & MaxDist          & EDP            & MaxDist           & EDP          & MaxDist        \\ \hline
-\multicolumn{2}{|l|}{CG}                                                                       & 27.58           & 31.25            & 5.82           & 7.12              & 21.76        & 24.13          \\ \hline
-\multicolumn{2}{|l|}{MG}                                                                       & 29.49           & 33.78            & 3.74           & 6.41              & 25.75        & 27.37          \\ \hline
-\multicolumn{2}{|l|}{LU}                                                                       & 19.55           & 28.33            & 0.0            & 0.01              & 19.55        & 28.22          \\ \hline
-\multicolumn{2}{|l|}{EP}                                                                       & 28.40           & 27.04            & 4.29           & 0.49              & 24.11        & 26.55          \\ \hline
-\multicolumn{2}{|l|}{BT}                                                                       & 27.68           & 32.32            & 6.45           & 7.87              & 21.23        & 24.43          \\ \hline
-\multicolumn{2}{|l|}{SP}                                                                       & 20.52           & 24.73            & 5.21           & 2.78              & 15.31         & 21.95         \\ \hline
-\multicolumn{2}{|l|}{FT}                                                                       & 27.03           & 31.02            & 2.75           & 2.54              & 24.28        & 28.48           \\ \hline
-
-\end{tabular}
-\label{table:compare_EDP}
-\end{table}
-
-
-
-
-
-\begin{figure}[t]
-  \centering
-   \includegraphics[scale=0.5]{fig/compare_EDP.pdf}
-  \caption{Tradeoff comparison for NAS benchmarks class C}
-  \label{fig:compare_EDP}
-\end{figure}
-
-
  \section{Conclusion}
-\label{sec.concl} 
+\label{sec.concl}
+
  In this paper, a new online frequency selecting algorithm has been presented. It
-selects the  best possible  vector of frequency  scaling factors that  gives the
-maximum  distance  (optimal  tradeoff)  between  the predicted  energy  and  the
+selects the best possible vector of frequency scaling factors that gives the
+maximum distance (optimal trade-off) between the predicted energy and the
  predicted performance curves for a heterogeneous platform. This algorithm uses a
-new  energy  model  for  measuring  and predicting  the  energy  of  distributed
-iterative  applications running  over heterogeneous  platforms. To  evaluate the
+new energy model for measuring and predicting the energy of distributed
+iterative applications running over heterogeneous platforms. To evaluate the
  proposed method, it was applied on the NAS parallel benchmarks and executed over
-a heterogeneous  platform simulated by  Simgrid. The results of  the experiments
-showed that the algorithm reduces up to 35\% the energy consumption of a message
-passing iterative method while limiting  the degradation of the performance. The
-algorithm also selects different scaling  factors according to the percentage of
-the computing and communication times, and according to the values of the static
-and  dynamic  powers  of the  CPUs.   Finally,  the  algorithm was  compared  to
-Spiliopoulos et al.  algorithm and  the results showed that it outperforms their
-algorithm in terms of energy-time tradeoff.
+a heterogeneous platform simulated by SimGrid. The results of the experiments
+showed that the algorithm reduces up to \np[\%]{35} the energy consumption of a
+message passing iterative method while limiting the degradation of the
+performance. The algorithm also selects different scaling factors according to
+the percentage of the computing and communication times, and according to the
+values of the static and dynamic powers of the CPUs.  Finally, the algorithm was
+compared to Spiliopoulos et al.  algorithm and the results showed that it
+outperforms their algorithm in terms of energy-time trade-off.
  
  In the near future, this method  will be applied to real heterogeneous platforms
  to evaluate its  performance in a real study case. It  would also be interesting
@@ -1193,9 +1249,9 @@ the iterative system.
  \section*{Acknowledgment}
  
  This work has been partially supported by the Labex
-ACTION project (contract “ANR-11-LABX-01-01”). As a PhD student, 
+ACTION project (contract ``ANR-11-LABX-01-01''). As a PhD student,
  Mr. Ahmed Fanfakh, would like to thank the University of
-Babylon (Iraq) for supporting his work. 
+Babylon (Iraq) for supporting his work.
  
  
  % trigger a \newpage just before the given reference
@@ -1207,7 +1263,7 @@ Babylon (Iraq) for supporting his work.
  \bibliographystyle{IEEEtran}
  \bibliography{IEEEabrv,my_reference}
  \end{document}
-   
+
  %%% Local Variables:
  %%% mode: latex
  %%% TeX-master: t
@@ -1216,6 +1272,6 @@ Babylon (Iraq) for supporting his work.
  %%% End:
  
  % LocalWords:  Fanfakh Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber
-% LocalWords:  CMOS EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex
-% LocalWords:  de badri muslim MPI TcpOld TcmOld dNew dOld cp Sopt Tcp Tcm Ps
-% LocalWords:  Scp Fmax Fdiff SimGrid GFlops Xeon EP BT
+% LocalWords:  CMOS EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex GPU
+% LocalWords:  de badri muslim MPI SimGrid GFlops Xeon EP BT GPUs CPUs AMD
+%  LocalWords:  Spiliopoulos scalability