Merge branch 'master' of ssh://info.iut-bm.univ-fcomte.fr/mpi-energy

[mpi-energy.git] / paper.tex
diff --git a/paper.tex b/paper.tex

index a405346a5ef4fbc7a2295502ab6b0cece5337230..53a1cf78d501428bf64ca4d5bad550671ceba416 100644 (file)
--- a/paper.tex
+++ b/paper.tex
@@ -1,5 +1,4 @@
-\documentclass[12pt]{article}
-%\documentclass[12pt,twocolumn]{article}
+\documentclass[conference]{IEEEtran}
  
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc}
@@ -10,743 +9,858 @@
  \usepackage{listings}
  \usepackage{colortbl}
  \usepackage{amsmath}
-% \usepackage{sectsty}
-% \usepackage{titlesec}
-% \usepackage{secdot}
-%\usepackage[font={footnotesize,bt}]{caption}
-%\usepackage[font=scriptsize,labelfont=bf]{caption}
-\usepackage{lmodern}
-\usepackage{todonotes}
-\newcommand{\AG}[2][inline]{\todo[color=green!50,#1]{\sffamily\small\textbf{AG:} #2}}
+
+\usepackage{url}
+\DeclareUrlCommand\email{\urlstyle{same}}
+
+\usepackage[autolanguage,np]{numprint}
+\AtBeginDocument{%
+  \renewcommand*\npunitcommand[1]{\text{#1}}
+  \npthousandthpartsep{}}
+
+\usepackage{xspace}
+\usepackage[textsize=footnotesize]{todonotes}
+\newcommand{\AG}[2][inline]{%
+  \todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace}
+\newcommand{\JC}[2][inline]{%
+  \todo[color=red!10,#1]{\sffamily\textbf{JC:} #2}\xspace}
  
  \begin{document}
  
-\title{Optimal Dynamic Frequency Scaling for Energy-Performance of Parallel MPI Programs}
-\author{A. Badri \and J.-C. Charr \and R. Couturier \and A. Giersch}
-\maketitle
+\title{Dynamic Frequency Scaling for Energy Consumption
+  Reduction in Distributed MPI Programs}
+
+\author{%
+  \IEEEauthorblockN{%
+    Jean-Claude Charr,
+    Raphaël Couturier,
+    Ahmed Fanfakh and
+    Arnaud Giersch
+  }
+  \IEEEauthorblockA{%
+    FEMTO-ST Institute\\
+    University of Franche-Comté\\
+    IUT de Belfort-Montbéliard,
+    19 avenue du Maréchal Juin, BP 527, 90016 Belfort cedex, France\\
+    % Telephone: \mbox{+33 3 84 58 77 86}, % Raphaël
+    % Fax: \mbox{+33 3 84 58 77 81}\\      % Dept Info
+    Email: \email{{jean-claude.charr,raphael.couturier,ahmed.fanfakh_badri_muslim,arnaud.giersch}@univ-fcomte.fr}
+   }
+  }
  
-\AG{``Optimal'' is a bit pretentious in the title}
+\maketitle
  
  \begin{abstract}
-  \AG{complete the abstract\dots}
+  Dynamic Voltage Frequency Scaling (DVFS) can be applied to modern CPUs.  This
+  technique is usually used to reduce the energy consumed by a CPU while
+  computing.  Indeed, power consumption by a processor at a given instant is
+  exponentially related to its frequency.  Thus, decreasing the frequency
+  reduces the power consumed by the CPU.  However, it can also significantly
+  affect the performance of the executed program if it is compute bound and if a
+  low CPU frequency is selected.  The performance degradation ratio can even be
+  higher than the saved energy ratio.  Therefore, the chosen scaling factor must
+  give the best possible trade-off between energy reduction and performance.
+
+  In this paper we present an algorithm that predicts the energy consumed with
+  each frequency gear and selects the one that gives the best ratio between
+  energy consumption reduction and performance.  This algorithm works online
+  without training or profiling and has a very small overhead.  It also takes
+  into account synchronous communications between the nodes that are executing
+  the distributed algorithm.  The algorithm has been evaluated over the SimGrid
+  simulator while being applied to the NAS parallel benchmark programs.  The
+  results of the experiments show that it outperforms other existing scaling
+  factor selection algorithms.
  \end{abstract}
  
  \section{Introduction}
-
-The need for computing power is still increasing and it is not expected to slow
-down in the coming years. To satisfy this demand, researchers and supercomputers
-constructors have been regularly increasing the number of computing cores in
-supercomputers (for example in November 2013, according to the top 500
-list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more than 3
-millions of cores and delivers more than 33 Tflop/s while consuming 17808
-kW). This large increase in number of computing cores has led to large energy
-consumption by these architectures. Moreover, the price of energy is expected to
-continue its ascent according to the demand. For all these reasons energy
-reduction became an important topic in the high performance computing field. To
-tackle this problem, many researchers used DVFS (Dynamic Voltage Frequency
-Scaling) operations which reduce dynamically the frequency and voltage of cores
-and thus their energy consumption. However, this operation also degrades the
-performance of computation. Therefore researchers try to reduce the frequency to
-minimum when processors are idle (waiting for data from other processors or
-communicating with other processors). Moreover, depending on their objectives
-they use heuristics to find the best scaling factor during the computation. If
-they aim for performance they choose the best scaling factor that reduces the
-consumed energy while affecting as little as possible the performance. On the
-other hand, if they aim for energy reduction, the chosen scaling factor must
-produce the most energy efficient execution without considering the degradation
-of the performance. It is important to notice that lowering the frequency to
-minimum value does not always give the most efficient execution due to energy
-leakage. The best scaling factor might be chosen during execution (online) or
-during a pre-execution phase.  In this paper we emphasize to develop an
-algorithm that selects the optimal frequency scaling factor that takes into
-consideration simultaneously the energy consumption and the performance. The
-main objective of HPC systems is to run the application with less execution
-time. Therefore, our algorithm selects the optimal scaling factor online with
-very small footprint. The proposed algorithm takes into account the
-communication times of the MPI programs to choose the scaling factor. This
-algorithm has ability to predict both energy consumption and execution time over
-all available scaling factors.  The prediction achieved depends on some
-computing time information, gathered at the beginning of the runtime.  We apply
-this algorithm to seven MPI benchmarks. These MPI programs are the NAS parallel
-benchmarks (NPB v3.3) developed by NASA~\cite{44}. Our experiments are executed
-using the simulator SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183}
-over an homogeneous distributed memory architecture. Furthermore, we compare the
-proposed algorithm with Rauber's methods. The comparison's results show that our
-algorithm gives better energy-time trade off.
-
-\section{Related Works}
-
-In the this section some heuristics, to compute the scaling factor, are
-presented and classified in two parts : offline and online methods.
-
-\subsection{The offline DVFS orientations}
-
-The DVFS offline methods are static and are not executed during the runtime of
-the program. Some approaches used heuristics to select the best DVFS state
-during the compilation phases as an example in Azevedo et al.~\cite{40}. He used
-intra-task algorithm to choose the DVFS setting when there are dependency points
-between tasks. While in~\cite{29}, Xie et al. used breadth-first search
-algorithm to do that. Their goal is saving energy with time limits. Another
-approaches gathers and stores the runtime information for each DVFS state, then
-used their methods offline to select the suitable DVFS that optimize energy-time
-trade offs. As an example~\cite{8}, Rountree et al. used liner programming
-algorithm, while in~\cite{38,34}, Cochran et al. used multi logistic regression
-algorithm for the same goal. The offline study that shown the DVFS impact on the
-communication time of the MPI program is~\cite{17}, Freeh et al. show that these
-times not changed when the frequency is scaled down.
-
-\subsection{The online DVFS orientations}
-
-The objective of these works is to dynamically compute and set the frequency of
-the CPU during the runtime of the program for saving energy. Estimating and
-predicting approaches for the energy-time trade offs developed by
-~\cite{11,2,31}. These works select the best DVFS setting depending on the slack
-times. These times happen when the processors have to wait for data from other
-processors to compute their task. For example, during the synchronous
-communication time that take place in the MPI programs, the processors are
-idle. The optimal DVFS can be selected using the learning methods. Therefore, in
-~\cite{39,19} used machine learning to converge to the suitable DVFS
-configuration. Their learning algorithms have big time to converge when the
-number of available frequencies is high. Also, the communication time of the MPI
-program used online for saving energy as in~\cite{1}, Lim et al. developed an
-algorithm that detects the communication sections and changes the frequency
-during these sections only. This approach changes the frequency many times
-because an iteration may contain more than one communication section. The domain
-of analytical modeling used for choosing the optimal frequency as in~\cite{3},
-Rauber et al. developed an analytical mathematical model for determining the
-optimal frequency scaling factor for any number of concurrent tasks, without
-considering communication times. They set the slowest task to maximum frequency
-for maintaining performance.  In this paper we compare our algorithm with
-Rauber's model~\cite{3}, because his model can be used for any number of
-concurrent tasks for homogeneous platform and this is the same direction of this
-paper.  However, the primary contributions of this paper are:
+\label{sec.intro}
+
+The need and demand for more computing power have been increasing since the
+birth of the first computing unit and it is not expected to slow down in the
+coming years.  To satisfy this demand, researchers and supercomputers
+constructors have been regularly increasing the number of computing cores and
+processors in supercomputers (for example in November 2013, according to the
+TOP500 list~\cite{43}, the Tianhe-2 was the fastest supercomputer.  It has more
+than 3 millions of cores and delivers more than \np[Tflop/s]{33} while consuming
+\np[kW]{17808}).  This large increase in number of computing cores has led to
+large energy consumption by these architectures.  Moreover, the price of energy
+is expected to continue its ascent according to the demand.  For all these
+reasons energy reduction became an important topic in the high performance
+computing field.  To tackle this problem, many researchers used DVFS (Dynamic
+Voltage Frequency Scaling) operations which reduce dynamically the frequency and
+voltage of cores and thus their energy consumption.  Indeed, modern CPUs offer a
+set of acceptable frequencies which are usually called gears, and the user or
+the operating system can modify the frequency of the processor according to its
+needs.  However, DVFS also degrades the performance of computation.  Therefore
+researchers try to reduce the frequency to minimum when processors are idle
+(waiting for data from other processors or communicating with other processors).
+Moreover, depending on their objectives they use heuristics to find the best
+scaling factor during the computation.  If they aim for performance they choose
+the best scaling factor that reduces the consumed energy while affecting as
+little as possible the performance.  On the other hand, if they aim for energy
+reduction, the chosen scaling factor must produce the most energy efficient
+execution without considering the degradation of the performance.  It is
+important to notice that lowering the frequency to minimum value does not always
+give the most energy efficient execution due to energy leakage.  The best
+scaling factor might be chosen during execution (online) or during a
+pre-execution phase.  In this paper, we present an algorithm that selects a
+frequency scaling factor that simultaneously takes into consideration the energy
+consumption by the CPU and the performance of the application.  The main
+objective of HPC systems is to execute as fast as possible the application.
+Therefore, our algorithm selects the scaling factor online with very small
+footprint.  The proposed algorithm takes into account the communication times of
+the MPI program to choose the scaling factor.  This algorithm has ability to
+predict both energy consumption and execution time over all available scaling
+factors.  The prediction achieved depends on some computing time information,
+gathered at the beginning of the runtime.  We apply this algorithm to seven MPI
+benchmarks.  These MPI programs are the NAS parallel benchmarks (NPB v3.3)
+developed by NASA~\cite{44}.  Our experiments are executed using the simulator
+SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183} over an homogeneous
+distributed memory architecture.  Furthermore, we compare the proposed algorithm
+with Rauber and Rünger methods~\cite{3}.  The comparison's results show that our
+algorithm gives better energy-time trade-off.
+
+This paper is organized as follows: Section~\ref{sec.relwork} presents some
+related works from other authors.  Section~\ref{sec.exe} explains the execution
+of parallel tasks and the sources of slack times.  It also presents an energy
+model for homogeneous platforms.  Section~\ref{sec.mpip} describes how the
+performance of MPI programs can be predicted.  Section~\ref{sec.compet} presents
+the energy-performance objective function that maximizes the reduction of energy
+consumption while minimizing the degradation of the program's performance.
+Section~\ref{sec.optim} details the proposed energy-performance algorithm.
+Section~\ref{sec.expe} verifies the accuracy of the performance prediction model
+and presents the results of the proposed algorithm.  It also shows the
+comparison results between our method and other existing methods.  Finally, we
+conclude in Section~\ref{sec.concl} with a summary and some future works.
+
+\section{Related works}
+\label{sec.relwork}
+
+
+In this section, some heuristics to compute the scaling factor are presented and
+classified into two categories: offline and online methods.
+
+\subsection{Offline scaling factor selection methods}
+
+The offline scaling factor selection methods are executed before the runtime of
+the program.  They return static scaling factor values to the processors
+participating in the execution of the parallel program.  On one hand, the
+scaling factor values could be computed based on information retrieved by
+analyzing the code of the program and the computing system that will execute it.
+In~\cite{40}, Azevedo et al. detect during compilation the dependency points
+between tasks in a multi-task program.  This information is then used to lower
+the frequency of some processors in order to eliminate slack times.  A slack
+time is the period of time during which a processor that have already finished
+its computation, have to wait for a set of processors to finish their
+computations and send their results to the waiting processor in order to
+continue its task that is dependent on the results of computations being
+executed on other processors.  Freeh et al. showed in~\cite{17} that the
+communication times of MPI programs do not change when the frequency is scaled
+down.  On the other hand, some offline scaling factor selection methods use the
+information gathered from previous full or partial executions of the program.  A
+part or the whole program is usually executed over all the available frequency
+gears and the the execution time and the energy consumed with each frequency
+gear are measured.  Then an heuristic or an exact method uses the retrieved
+information to compute the values of the scaling factor for the processors.
+In~\cite{29}, Xie et al. use an exact exponential breadth-first search algorithm
+to compute the scaling factor values that give the optimal energy reduction
+while respecting a deadline for a sequential program.  They also present a
+linear heuristic that approximates the optimal solution.  In~\cite{8} , Rountree
+et al. use a linear programming algorithm, while in~\cite{38,34}, Cochran et
+al. use multi logistic regression algorithm for the same goal.  The main
+drawback for these methods is that they all require executing a part or the
+whole program on all frequency gears for each new instance of the same program.
+
+\subsection{Online scaling factor selection methods}
+
+The online scaling factor selection methods are executed during the runtime of
+the program.  They are usually integrated into iterative programs where the same
+block of instructions is executed many times.  During the first few iterations,
+many informations are measured such as the execution time, the energy consumed
+using a multimeter, the slack times, \dots{} Then a method will exploit these
+measurements to compute the scaling factor values for each processor.  This
+operation, measurements and computing new scaling factor, can be repeated as
+much as needed if the iterations are not regular.  Kimura, Peraza, Yu-Liang et
+al.~\cite{11,2,31} used varied heuristics to select the appropriate scaling
+factor values to eliminate the slack times during runtime.  However, as seen
+in~\cite{39,19}, machine learning methods can take a lot of time to converge
+when the number of available gears is big.  To reduce the impact of slack times,
+in~\cite{1}, Lim et al. developed an algorithm that detects the communication
+sections and changes the frequency during these sections only.  This approach
+might change the frequency of each processor many times per iteration if an
+iteration contains more than one communication section.  In~\cite{3}, Rauber and
+Rünger used an analytical model that can predict the consumed energy and the
+execution time for every frequency gear after measuring the consumed energy and
+the execution time with the highest frequency gear.  These predictions may be
+used to choose the optimal gear for each processor executing the parallel
+program to reduce energy consumption.  To maintain the performance of the
+parallel program , they set the processor with the biggest load to the highest
+gear and then compute the scaling factor values for the rest of the processors.
+Although this model was built for parallel architectures, it can be adapted to
+distributed architectures by taking into account the communications.  The
+primary contribution of our paper is presenting a new online scaling factor
+selection method which has the following characteristics:
  \begin{enumerate}
-\item Selecting the optimal frequency scaling factor for energy and performance
-  simultaneously. While taking into account the communication time.
-\item Adapting our scale factor to taking into account the imbalanced tasks.
-\item The execution time of our algorithm is very small when compared to other
-  methods (e.g.,~\cite{19}).
-\item The proposed algorithm works online without profiling or training as
+\item It is based on Rauber and Rünger analytical model to predict the energy
+  consumption of the application with different frequency gears.
+\item It selects the frequency scaling factor for simultaneously optimizing
+  energy reduction and maintaining performance.
+\item It is well adapted to distributed architectures because it takes into
+  account the communication time.
+\item It is well adapted to distributed applications with imbalanced tasks.
+\item it has very small footprint when compared to other methods
+  (e.g.,~\cite{19}) and does not require profiling or training as
    in~\cite{38,34}.
  \end{enumerate}
  
-\section{Parallel Tasks Execution on Homogeneous Platform}
  
-A homogeneous cluster consists of identical nodes in terms of the hardware and
-the software. Each node has its own memory and at least one processor which can
-be a multi-core. The nodes are connected via a high bandwidth network. Tasks
-executed on this model can be either synchronous or asynchronous. In this paper
+\section{Execution and energy of parallel tasks on homogeneous platform}
+\label{sec.exe}
+
+% \JC{The whole subsection ``Parallel Tasks Execution on Homogeneous Platform'',
+%   can be deleted if we need space, we can just say we are interested in this
+%   paper in homogeneous clusters}
+
+\subsection{Parallel tasks execution on homogeneous platform}
+
+A homogeneous cluster consists of identical nodes in terms of hardware and
+software.  Each node has its own memory and at least one processor which can be
+a multi-core.  The nodes are connected via a high bandwidth network.  Tasks
+executed on this model can be either synchronous or asynchronous.  In this paper
  we consider execution of the synchronous tasks on distributed homogeneous
-platform. These tasks can exchange the data via synchronous memory passing.
-\begin{figure}[h]
+platform.  These tasks can exchange the data via synchronous message passing.
+\begin{figure*}[t]
    \centering
-  \subfloat[Synch. Imbalanced Communications]{\includegraphics[scale=0.67]{synch_tasks}\label{fig:h1}}
-  \subfloat[Synch. Imbalanced Computations]{\includegraphics[scale=0.67]{compt}\label{fig:h2}}
-  \caption{Parallel Tasks on Homogeneous Platform}
+  \subfloat[Sync. imbalanced communications]{%
+    \includegraphics[scale=0.67]{fig/commtasks}\label{fig:h1}}
+  \subfloat[Sync. imbalanced computations]{%
+    \includegraphics[scale=0.67]{fig/compt}\label{fig:h2}}
+  \caption{Parallel tasks on homogeneous platform}
    \label{fig:homo}
-\end{figure}
+\end{figure*}
  Therefore, the execution time of a task consists of the computation time and the
-communication time. Moreover, the synchronous communications between tasks can
-lead to idle time while tasks wait at the synchronous point for others tasks to
-finish their communications see figure~(\ref{fig:h1}).  Another source for idle
-times is the imbalanced computations. This happen when processing different
-amounts of data on each processor as an example see figure~(\ref{fig:h2}). In
-this case the fastest tasks have to wait at the synchronous barrier for the
-slowest tasks to finish their job. In both two cases the overall execution time
-of the program is the execution time of the slowest task as :
+communication time.  Moreover, the synchronous communications between tasks can
+lead to slack times while tasks wait at the synchronization barrier for other
+tasks to finish their tasks (see figure~(\ref{fig:h1})).  The imbalanced
+communications happen when nodes have to send/receive different amount of data
+or they communicate with different number of nodes.  Another source of slack
+times is the imbalanced computations.  This happens when processing different
+amounts of data on each processor (see figure~(\ref{fig:h2})).  In this case the
+fastest tasks have to wait at the synchronization barrier for the slowest ones
+to begin the next task.  In both cases the overall execution time of the program
+is the execution time of the slowest task as in EQ~(\ref{eq:T1}).
  \begin{equation}
    \label{eq:T1}
    \textit{Program Time} = \max_{i=1,2,\dots,N} T_i
  \end{equation}
-where $T_i$ is the execution time of process $i$.
+where $T_i$ is the execution time of task $i$ and all the tasks are executed
+concurrently on different processors.
  
-\section{Energy Model for Homogeneous Platform}
+\subsection{Energy model for homogeneous platform}
  
-The energy consumption by the processor consists of two powers metric: the
-dynamic and the static power. This general power formulation is used by many
-researchers see~\cite{9,3,15,26}. The dynamic power of the CMOS processors
-$P_{dyn}$ is related to the switching activity $\alpha$, load capacitance $C_L$,
-the supply voltage $V$ and operational frequency $f$ respectively as follow :
+Many researchers~\cite{9,3,15,26} divide the power consumed by a processor to
+two power metrics: the static and the dynamic power.  While the first one is
+consumed as long as the computing unit is on, the latter is only consumed during
+computation times.  The dynamic power $P_{dyn}$ is related to the switching
+activity $\alpha$, load capacitance $C_L$, the supply voltage $V$ and
+operational frequency $f$, as shown in EQ~(\ref{eq:pd}).
  \begin{equation}
    \label{eq:pd}
-  P_{dyn} = \alpha \cdot C_L \cdot V^2 \cdot f
+  P_\textit{dyn} = \alpha \cdot C_L \cdot V^2 \cdot f
  \end{equation}
-The static power $P_{static}$ captures the leakage power consumption as well as
-the power consumption of peripheral devices like the I/O subsystem.
+The static power $P_{static}$ captures the leakage power as follows:
  \begin{equation}
    \label{eq:ps}
-  P_{static}  = V \cdot N \cdot K_{design} \cdot I_{leak}
+   P_\textit{static}  = V \cdot N_{trans} \cdot K_{design} \cdot I_{leak}
  \end{equation}
-where V is the supply voltage, N is the number of transistors, $K_{design}$ is a
-design dependent parameter and $I_{leak}$ is a technology-dependent
-parameter. Energy consumed by an individual processor $E_{ind}$ is the summation
-of the dynamic and the static power multiply by the execution time for example
-see~\cite{36,15} .
+where V is the supply voltage, $N_{trans}$ is the number of transistors,
+$K_{design}$ is a design dependent parameter and $I_{leak}$ is a
+technology-dependent parameter.  The energy consumed by an individual processor
+to execute a given program can be computed as:
  \begin{equation}
    \label{eq:eind}
-  E_{ind} = ( P_{dyn} + P_{static} ) \cdot T
+   E_\textit{ind} =  P_\textit{dyn} \cdot T_{Comp} + P_\textit{static} \cdot T
  \end{equation}
-The dynamic voltage and frequency scaling (DVFS) is a process that allowed in
-modern processors to reduce the dynamic power by scaling down the voltage and
-frequency. Its main objective is to reduce the overall energy
-consumption~\cite{37}. The operational frequency \emph f depends linearly on the
-supply voltage $V$, i.e., $V = \beta . f$ with some constant $\beta$. This
-equation is used to study the change of the dynamic voltage with respect to
-various frequency values in~\cite{3}. The reduction process of the frequency are
-expressed by scaling factor \emph S. The scale \emph S is the ratio between the
-maximum and the new frequency as in EQ~(\ref{eq:s}).
+where $T$ is the execution time of the program, $T_{Comp}$ is the computation
+time and $T_{Comp} \leq T$.  $T_{Comp}$ may be equal to $T$ if there is no
+communications, no slack times and no synchronizations.
+
+DVFS is a process that is allowed in modern processors to reduce the dynamic
+power by scaling down the voltage and frequency.  Its main objective is to
+reduce the overall energy consumption~\cite{37}.  The operational frequency $f$
+depends linearly on the supply voltage $V$, i.e., $V = \beta \cdot f$ with some
+constant $\beta$.  This equation is used to study the change of the dynamic
+voltage with respect to various frequency values in~\cite{3}.  The reduction
+process of the frequency can be expressed by the scaling factor $S$ which is the
+ratio between the maximum and the new frequency as in EQ~(\ref{eq:s}).
  \begin{equation}
    \label{eq:s}
-  S = \frac{F_{max}}{F_{new}}
+ S = \frac{F_\textit{max}}{F_\textit{new}}
  \end{equation}
-The value of the scale \emph S is grater than 1 when changing the frequency to
-any new frequency value(\emph {P-state}) in governor. It is equal to 1 when the
-frequency are set to the maximum frequency.  The energy consumption model for
-parallel homogeneous platform is depending on the scaling factor \emph S. This
-factor reduces quadratically the dynamic power.  Also, this factor increases the
-static energy linearly because the execution time is increased~\cite{36}.  The
-energy model, depending on the frequency scaling factor, of homogeneous platform
-for any number of concurrent tasks develops by Rauber~\cite{3}. This model
-consider the two powers metric for measuring the energy of the parallel tasks as
-in EQ~(\ref{eq:energy}).
+The value of the scaling factor $S$ is greater than 1 when changing the
+frequency of the CPU to any new frequency value~(\emph{P-state}) in the
+governor.  The CPU governor is an interface driver supplied by the operating
+system's kernel to lower a core's frequency.  This factor reduces quadratically
+the dynamic power which may cause degradation in performance and thus, the
+increase of the static energy because the execution time is increased~\cite{36}.
+If the tasks are sorted according to their execution times before scaling in a
+descending order, the total energy consumption model for a parallel homogeneous
+platform, as presented by Rauber and Rünger~\cite{3}, can be written as a
+function of the scaling factor $S$, as in EQ~(\ref{eq:energy}).
  
  \begin{equation}
    \label{eq:energy}
-  E = P_{dyn} \cdot S_1^{-2} \cdot
+  E = P_\textit{dyn} \cdot S_1^{-2} \cdot
      \left( T_1 + \sum_{i=2}^{N} \frac{T_i^3}{T_1^2} \right) +
-    P_{static} \cdot T_1 \cdot S_1 \cdot N
+    P_\textit{static} \cdot T_1 \cdot S_1 \cdot N
   \hfill
  \end{equation}
-Where \emph N is the number of parallel nodes, $T_1 $ is the time of the slower
-task, $T_i$ is the time of the task $i$ and $S_1$ is the maximum scaling factor
-for the slower task. The scaling factor $S_1$, as in EQ~(\ref{eq:s1}), selects
-from the set of scales values $S_i$. Each of these scales are proportional to
-the time value $T_i$ depends on the new frequency value as in EQ~(\ref{eq:si}).
-\begin{equation}
-  \label{eq:s1}
-  S_1 = \max_{i=1,2,\dots,F} S_i
-\end{equation}
+where $N$ is the number of parallel nodes, $T_i$ and $S_i$ for $i=1,\dots,N$ are
+the execution times and scaling factors of the sorted tasks.  Therefore, $T1$ is
+the time of the slowest task, and $S_1$ its scaling factor which should be the
+highest because they are proportional to the time values $T_i$.  The scaling
+factors are computed as in EQ~(\ref{eq:si}).
  \begin{equation}
    \label{eq:si}
    S_i = S \cdot \frac{T_1}{T_i}
-      = \frac{F_{max}}{F_{new}} \cdot \frac{T_1}{T_i}
+      = \frac{F_\textit{max}}{F_\textit{new}} \cdot \frac{T_1}{T_i}
  \end{equation}
-Where $F$ is the number of available frequencies. In this paper we depend on
-Rauber's energy model EQ~(\ref{eq:energy}) for two reasons : 1-this model used
-for homogeneous platform that we work on in this paper. 2-we are compare our
-algorithm with Rauber's scaling model.  Rauber's optimal scaling factor for
-optimal energy reduction derived from the EQ~(\ref{eq:energy}). He takes the
-derivation for this equation (to be minimized) and set it to zero to produce the
-scaling factor as in EQ~(\ref{eq:sopt}).
+In this paper we depend on Rauber and Rünger energy model EQ~(\ref{eq:energy})
+for two reasons: (1) this model is used for any number of concurrent tasks, and
+(2) we compare our algorithm with Rauber and Rünger scaling factor selection
+method which is based on EQ~(\ref{eq:energy}).  The optimal scaling factor is
+computed by minimizing the derivation for this equation which produces
+EQ~(\ref{eq:sopt}).
+
  \begin{equation}
    \label{eq:sopt}
-  S_{opt} = \sqrt[3]{\frac{2}{n} \cdot \frac{P_{dyn}}{P_{static}} \cdot
+  S_\textit{opt} = \sqrt[3]{\frac{2}{N} \cdot \frac{P_\textit{dyn}}{P_\textit{static}} \cdot
      \left( 1 + \sum_{i=2}^{N} \frac{T_i^3}{T_1^3} \right) }
  \end{equation}
  
-\section{Performance Evaluation of MPI Programs}
-
-The performance (execution time) of the parallel MPI applications are depends on
-the time of the slowest task as in figure~(\ref{fig:homo}). Normally the
-execution time of the parallel programs are proportional to the operational
-frequency. Therefore, any DVFS operation for the energy reduction increase the
-execution time of the parallel program. As shown in EQ~(\ref{eq:energy}) the
-energy affected by the scaling factor $S$. This factor also has a great impact
-on the performance. When scaling down the frequency to the new value according
-to EQ~(\ref{eq:s}) lead to the value of the scale $S$ has inverse relation with
-new frequency value ($S \propto \frac{1}{F_{new}}$). Also when decrease the
-frequency value, the execution time increase. Then the new frequency value has
-inverse relation with time ($F_{new} \propto \frac{1}{T}$). This lead to the
-frequency scaling factor $S$ proportional linearly with execution time ($S
-\propto T$). Large scale MPI applications such as NAS benchmarks have
-considerable amount of communications embedded in these programs. During the
-communication process the processor remain idle until the communication has
-finished. For that reason any change in the frequency has no impact on the time
-of communication but it has obvious impact on the time of
-computation~\cite{17}. We are made many tests on real cluster to prove that the
-frequency scaling factor \emph S has a linear relation with computation time
-only also see~\cite{41}. To predict the execution time of MPI program, firstly
-must be precisely specifying communication time and the computation time for the
-slower task. Secondly, we use these times for predicting the execution time for
-any MPI program as a function of the new scaling factor as in the
-EQ~(\ref{eq:tnew}).
+
+\section{Performance evaluation of MPI programs}
+\label{sec.mpip}
+
+The performance (execution time) of parallel synchronous MPI applications depend
+on the time of the slowest task as in figure~(\ref{fig:homo}).  If there is no
+communication and the application is not data bounded, the execution time of a
+parallel program is linearly proportional to the operational frequency and any
+DVFS operation for energy reduction increases the execution time of the parallel
+program.  Therefore, the scaling factor $S$ is linearly proportional to the
+execution time.  However, in most of MPI applications the processes exchange
+data.  During these communications the processors involved remain idle until the
+communications are finished.  For that reason any change in the frequency has no
+impact on the time of communication~\cite{17}.  The communication time for a
+task is the summation of periods of time that begin with an MPI call for sending
+or receiving a message till the message is synchronously sent or received.  To
+be able to predict the execution time of MPI program, the communication time and
+the computation time for the slower task must be measured before scaling.  These
+times are used to predict the execution time for any MPI program as a function
+of the new scaling factor as in EQ~(\ref{eq:tnew}).
  \begin{equation}
    \label{eq:tnew}
-  T_{new} = T_{\textit{Max Comp Old}} \cdot S + T_{\textit{Max Comm Old}}
+ \textit  T_\textit{new} = T_\textit{Max Comp Old} \cdot S + T_{\textit{Max Comm Old}}
  \end{equation}
-The above equation shows that the scaling factor \emph S has linear relation
-with the computation time without affecting the communication time. The
-communication time consists of the beginning times which an MPI calls for
-sending or receiving till the message is synchronously sent or received. In this
-paper we predict the execution time of the program for any new scaling factor
-value. Depending on this prediction we can produce our energy-performance scaling
-method as we will show in the coming sections. In the next section we make an
-investigation study for the EQ~(\ref{eq:tnew}).
-
-\section{Performance Prediction Verification}
-
-In this section we evaluate the precision of our performance prediction methods
-on the NAS benchmark. We use the EQ~(\ref{eq:tnew}) that predicts the execution
-time for any scale value. The NAS programs run the class B for comparing the
-real execution time with the predicted execution time. Each program runs offline
-with all available scaling factors on 8 or 9 nodes to produce real execution
-time values. These scaling factors are computed by dividing the maximum
-frequency by the new one see EQ~(\ref{eq:s}). In all tests, we use the simulator
-SimGrid/SMPI v3.10 to run the NAS programs.
-\begin{figure}[width=\textwidth,height=\textheight,keepaspectratio]
-  \centering
-  \includegraphics[scale=0.60]{cg_per.eps}
-  \includegraphics[scale=0.60]{mg_pre.eps}
-  \includegraphics[scale=0.60]{bt_pre.eps}
-  \includegraphics[scale=0.60]{lu_pre.eps}
-  \caption{Fitting Predicted to Real Execution Time}
-  \label{fig:pred}
-\end{figure}
-%see Figure~\ref{fig:pred}
-In our cluster there are 18 available frequency states for each processor from
-2.5 GHz to 800 MHz, there is 100 MHz difference between two successive
-frequencies. For more details on the characteristics of the platform refer to
-table~(\ref{table:platform}). This lead to 18 run states for each program. We
-use seven MPI programs of the NAS parallel benchmarks : CG, MG, EP, FT, BT, LU
-and SP. The average normalized errors between the predicted execution time and
-the real time (SimGrid time) for all programs is between 0.0032 to 0.0133. AS an
-example, we are present the execution times of the NAS benchmarks as in the
-figure~(\ref{fig:pred}).
-
-\section{Performance to Energy Competition}
-This section demonstrates our approach for choosing the optimal scaling
-factor. This factor gives maximum energy reduction taking into account the
-execution time for both computation and communication times . The relation
-between the energy and the performance are nonlinear and complex, because the
-relation of the energy with scaling factor is nonlinear and with the performance
-it is linear see~\cite{17}. The relation between the energy and the performance
-is not straightforward. Moreover, they are not measured using the same metric.
-For solving this problem, we normalize the energy by calculating the ratio
-between the consumed energy with scaled frequency and the consumed energy
-without scaled frequency :
-\begin{equation}
+In this paper, this prediction method is used to select the best scaling factor
+for each processor as presented in the next section.
+
+\section{Performance to energy competition}
+\label{sec.compet}
+
+This section demonstrates our approach for choosing the optimal scaling factor.
+This factor gives maximum energy reduction taking into account the execution
+times for both computation and communication.  The relation between the energy
+and the performance is nonlinear and complex, because the relation of the energy
+with scaling factor is nonlinear and with the performance it is linear
+see~\cite{17}.  Moreover, they are not measured using the same metric.  For
+solving this problem, we normalize the energy by calculating the ratio between
+the consumed energy with scaled frequency and the consumed energy without scaled
+frequency:
+\begin{multline}
    \label{eq:enorm}
-  E_\textit{Norm} = \frac{E_{Reduced}}{E_{Original}}
-          = \frac{ P_{dyn} \cdot S_i^{-2} \cdot
+  E_\textit{Norm} = \frac{ E_\textit{Reduced}}{E_\textit{Original}} \\
+        {} = \frac{P_\textit{dyn} \cdot S_1^{-2} \cdot
                 \left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
-               P_{static} \cdot T_1 \cdot S_i \cdot N  }{
-              P_{dyn} \cdot \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
-              P_{static} \cdot T_1 \cdot N }
-\end{equation}
-\AG{Use \texttt{\textbackslash{}text\{xxx\}} or
-  \texttt{\textbackslash{}textit\{xxx\}} for all subscripted words in equations
-  (e.g. \mbox{\texttt{E\_\{\textbackslash{}text\{Norm\}\}}}).
-
-  Don't hesitate to define new commands :
-  \mbox{\texttt{\textbackslash{}newcommand\{\textbackslash{}ENorm\}\{E\_\{\textbackslash{}text\{Norm\}\}\}}}
-}
-By the same way we can normalize the performance as follows :
+               P_\textit{static} \cdot T_1 \cdot S_1 \cdot N  }{
+              P_\textit{dyn} \cdot \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
+              P_\textit{static} \cdot T_1 \cdot N }
+\end{multline}
+By the same way we can normalize the performance as follows:
  \begin{equation}
    \label{eq:pnorm}
-  P_{Norm} = \frac{T_{New}}{T_{Old}}
-          = \frac{T_{\textit{Max Comp Old}} \cdot S +
-              T_{\textit{Max Comm Old}}}{T_{Old}}
+  P_\textit{Norm} = \frac{T_\textit{New}}{T_\textit{Old}}
+          = \frac{T_\textit{Max Comp Old} \cdot S +
+           T_\textit{Max Comm Old}}{T_\textit{Max Comp Old} +
+           T_\textit{Max Comm Old}}
  \end{equation}
-The second problem is the optimization operation for both energy and performance
-is not in the same direction. In other words, the normalized energy and the
-performance curves are not in the same direction see figure~(\ref{fig:r2}).
-While the main goal is to optimize the energy and performance in the same
-time. According to the equations~(\ref{eq:enorm}) and~(\ref{eq:pnorm}) the
-scaling factor \emph S reduce both the energy and the performance
-simultaneously. But the main objective is to produce maximum energy reduction
-with minimum performance reduction. Many researchers used different strategies
-to solve this nonlinear problem for example see~\cite{19,42}, their methods add
-big overhead to the algorithm for selecting the suitable frequency. In this
-paper we are present a method to find the optimal scaling factor \emph S for
-optimize both energy and performance simultaneously without adding big
+The second problem is that the optimization operation for both energy and
+performance is not in the same direction.  In other words, the normalized energy
+and the performance curves are not in the same direction see
+figure~(\ref{fig:r2}).  While the main goal is to optimize the energy and
+performance in the same time.  According to the equations~(\ref{eq:enorm})
+and~(\ref{eq:pnorm}), the scaling factor $S$ reduce both the energy and the
+performance simultaneously.  But the main objective is to produce maximum energy
+reduction with minimum performance reduction.  Many researchers used different
+strategies to solve this nonlinear problem for example see~\cite{19,42}, their
+methods add big overhead to the algorithm for selecting the suitable frequency.
+In this paper we present a method to find the optimal scaling factor $S$ for
+optimizing both energy and performance simultaneously without adding big
  overheads.  Our solution for this problem is to make the optimization process
-have the same direction. Therefore, we inverse the equation of normalize
-performance as follows :
+have the same direction.  Therefore, we inverse the equation of normalize
+performance as follows:
  \begin{equation}
    \label{eq:pnorm_en}
-  P^{-1}_{Norm} = \frac{T_{Old}}{T_{New}}
-               = \frac{T_{Old}}{T_{\textit{Max Comp Old}} \cdot S +
-                 T_{\textit{Max Comm Old}}}
+  P^{-1}_\textit{Norm} = \frac{ T_\textit{Old}}{ T_\textit{New}}
+               = \frac{T_\textit{Max Comp Old} +
+                 T_\textit{Max Comm Old}}{T_\textit{Max Comp Old} \cdot S +
+                 T_\textit{Max Comm Old}}
  \end{equation}
-\begin{figure}
+\begin{figure*}
    \centering
-  \subfloat[Converted Relation.]{\includegraphics[scale=0.70]{file.eps}\label{fig:r1}}
-  \subfloat[Real Relation.]{\includegraphics[scale=0.70]{file3.eps}\label{fig:r2}}
+  \subfloat[Converted relation.]{%
+    \includegraphics[width=.4\textwidth]{fig/file}\label{fig:r1}}%
+  \qquad%
+  \subfloat[Real relation.]{%
+    \includegraphics[width=.4\textwidth]{fig/file3}\label{fig:r2}}
    \label{fig:rel}
-  \caption{The Energy and Performance Relation}
-\end{figure}
+  \caption{The energy and performance relation}
+\end{figure*}
  Then, we can modelize our objective function as finding the maximum distance
  between the energy curve EQ~(\ref{eq:enorm}) and the inverse of performance
-curve EQ~(\ref{eq:pnorm_en}) over all available scaling factors. This represent
-the minimum energy consumption with minimum execution time (better performance)
-in the same time, see figure~(\ref{fig:r1}). Then our objective function has the
-following form:
+curve EQ~(\ref{eq:pnorm_en}) over all available scaling factors.  This
+represents the minimum energy consumption with minimum execution time (better
+performance) at the same time, see figure~(\ref{fig:r1}).  Then our objective
+function has the following form:
  \begin{equation}
    \label{eq:max}
-  \textit{MaxDist} = \max (\overbrace{P^{-1}_{Norm}}^{\text{Maximize}} -
-                           \overbrace{E_{Norm}}^{\text{Minimize}} )
+  Max Dist = \max_{j=1,2,\dots,F}
+      (\overbrace{P^{-1}_\textit{Norm}(S_j)}^{\text{Maximize}} -
+       \overbrace{E_\textit{Norm}(S_j)}^{\text{Minimize}} )
  \end{equation}
-Then we can select the optimal scaling factor that satisfy the
-EQ~(\ref{eq:max}).  Our objective function can works with any energy model or
-static power values stored in a data file. Moreover, this function works in
-optimal way when the energy function has a convex form with frequency scaling
-factor as shown in~\cite{15,3,19}. Energy measurement model is not the
-objective of this paper and we choose Rauber's model as an example with two
-reasons that mentioned before.
-
-\section{Optimal Scaling Factor for Performance and Energy}
-
-In the previous section we described the objective function that satisfy our
-goal in discovering optimal scaling factor for both performance and energy at
-the same time. Therefore, we develop an energy to performance scaling algorithm
-(EPSA). This algorithm is simple and has a direct way to calculate the optimal
-scaling factor for both energy and performance at the same time.
+where $F$ is the number of available frequencies. Then we can select the optimal
+scaling factor that satisfies EQ~(\ref{eq:max}).  Our objective function can
+work with any energy model or static power values stored in a data file.
+Moreover, this function works in optimal way when the energy curve has a convex
+form over the available frequency scaling factors as shown in~\cite{15,3,19}.
+
+\section{Optimal scaling factor for performance and energy}
+\label{sec.optim}
+
+Algorithm~\ref{EPSA} computes the optimal scaling factor according to the
+objective function described above.
  \begin{algorithm}[tp]
-  \caption{EPSA}
+  \caption{Scaling factor selection algorithm}
    \label{EPSA}
    \begin{algorithmic}[1]
      \State  Initialize the variable $Dist=0$
      \State Set dynamic and static power values.
      \State Set $P_{states}$ to the number of available frequencies.
      \State Set the variable $F_{new}$ to max. frequency,  $F_{new} = F_{max} $
-    \State Set the variable $F_{diff}$ to the scale value between each two frequencies.
-    \For {$i=1$   to   $P_{states} $}
-      \State - Calculate the new frequency as $F_{new}=F_{new} - F_{diff} $
-      \State - Calculate the scale factor $S$ as in EQ~(\ref{eq:s}).
-      \State - Calculate all available scales $S_i$  depend on $S$ as in EQ~(\ref{eq:si}).
-      \State - Select the maximum scale factor $S_1$ from the set of scales $S_i$.
-      \State - Calculate the normalize energy $E_{Norm}=E_{R}/E_{O}$ as in EQ~(\ref{eq:enorm}).
-      \State - Calculate the normalize inverse of performance\par
-               $P_{NormInv}=T_{old}/T_{new}$ as in EQ~(\ref{eq:pnorm_en}).
-      \If{  $(P_{NormInv}-E_{Norm} > Dist$) }
-        \State $S_{optimal} = S$
+    \State Set the variable $F_{diff}$ to the difference between two successive
+           frequencies.
+    \For {$j:=1$   to   $P_{states} $}
+      \State $F_{new}=F_{new} - F_{diff} $
+      \State $S = \frac{F_\textit{max}}{F_\textit{new}}$
+      \State $S_i = S \cdot \frac{T_1}{T_i}
+                  = \frac{F_\textit{max}}{F_\textit{new}} \cdot \frac{T_1}{T_i}$
+             for $i=1,\dots,N$
+      \State $E_\textit{Norm} =
+          \frac{P_\textit{dyn} \cdot S_1^{-2} \cdot
+                  \left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
+                  P_\textit{static} \cdot T_1 \cdot S_1 \cdot N }{
+                P_\textit{dyn} \cdot
+                  \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
+                  P_\textit{static} \cdot T_1 \cdot N }$
+      \State $P_{NormInv}=T_{old}/T_{new}$
+      \If{$(P_{NormInv}-E_{Norm} > Dist)$}
+        \State $S_{opt} = S$
          \State $Dist = P_{NormInv} - E_{Norm}$
        \EndIf
      \EndFor
-    \State  Return $S_{optimal}$
+    \State  Return $S_{opt}$
    \end{algorithmic}
  \end{algorithm}
-The proposed EPSA algorithm works online during the execution time of the MPI
-program. It selects the optimal scaling factor by gathering some information
-from the program after one iteration. This algorithm has small execution time
-(between 0.00152 $ms$ for 4 nodes to 0.00665 $ms$ for 32 nodes). The data
-required by this algorithm is the computation time and the communication time
-for each task from the first iteration only. When these times are measured, the
-MPI program calls the EPSA algorithm to choose the new frequency using the
-optimal scaling factor. Then the program set the new frequency to the
-system. The algorithm is called just one time during the execution of the
-program. The DVFS algorithm~(\ref{dvfs}) shows where and when the EPSA algorithm is called
-in the MPI program.
-%\begin{minipage}{\textwidth}
-%\AG{Use the same format as for Algorithm~\ref{EPSA}}
+
+The proposed algorithm works online during the execution time of the MPI
+program.  It selects the optimal scaling factor after gathering the computation
+and communication times from the program after one iteration.  Then the program
+changes the new frequencies of the CPUs according to the computed scaling
+factors.  This algorithm has a small execution time: for a homogeneous cluster
+composed of nodes having the characteristics presented in
+table~\ref{table:platform}, it takes \np[ms]{0.00152} on average for 4 nodes and
+\np[ms]{0.00665} on average for 32 nodes.  The algorithm complexity is $O(F\cdot
+N)$, where $F$ is the number of available frequencies and $N$ is the number of
+computing nodes.  The algorithm is called just once during the execution of the
+program.  The DVFS algorithm~(\ref{dvfs}) shows where and when the algorithm is
+called in the MPI program.
+\begin{table}[htb]
+  \caption{Platform file parameters}
+  % title of Table
+  \centering
+  \begin{tabular}{|*{7}{l|}}
+    \hline
+    Max      & Min      & Backbone        & Backbone         & Link         & Link            & Sharing \\
+    Freq.    & Freq.    & Bandwidth       & Latency          & Bandwidth    & Latency         & Policy  \\
+    \hline
+    \np{2.5} & \np{800} & \np[GBps]{2.25} & \np[$\mu$s]{0.5} & \np[GBps]{1} & \np[$\mu$s]{50} & Full    \\
+    GHz      & MHz      &                 &                  &              &                 & Duplex  \\
+    \hline
+  \end{tabular}
+  \label{table:platform}
+\end{table}
  
  \begin{algorithm}[tp]
    \caption{DVFS}
    \label{dvfs}
-  \begin{algorithmic}
- \For {$J:=1$ to $Some-Iterations \; $}
-  \State -Computations Section.
-   \State -Communications Section.
-   \If {$(J==1)$} 
-     \State -Gather all times of computation and\par
-        \State      communication from each node.
-     \State -Call EPSA with these times.
-     \State -Calculate the new frequency from optimal scale.
-     \State -Set the new frequency to the system.
-   \EndIf
-\EndFor
-\end{algorithmic}
+  \begin{algorithmic}[1]
+    \For {$k:=1$ to \textit{some iterations}}
+      \State Computations section.
+      \State Communications section.
+      \If {$(k=1)$}
+        \State Gather all times of computation and\newline\hspace*{3em}%
+               communication from each node.
+        \State Call algorithm~\ref{EPSA} with these times.
+        \State Compute the new frequency from the\newline\hspace*{3em}%
+               returned optimal scaling factor.
+        \State Set the new frequency to the CPU.
+      \EndIf
+    \EndFor
+  \end{algorithmic}
  \end{algorithm}
-
-After obtaining the optimal scale factor from the EPSA algorithm. The program
-calculates the new frequency $F_i$ for each task proportionally to its time
-value $T_i$. By substitution of the EQ~(\ref{eq:s}) in the EQ~(\ref{eq:si}), we
-can calculate the new frequency $F_i$ as follows :
+After obtaining the optimal scaling factor, the program calculates the new
+frequency $F_i$ for each task proportionally to its time value $T_i$.  By
+substitution of EQ~(\ref{eq:s}) in EQ~(\ref{eq:si}), we can calculate the new
+frequency $F_i$ as follows:
  \begin{equation}
    \label{eq:fi}
-  F_i = \frac{F_{max} \cdot T_i}{S_{optimal} \cdot T_{max}}
+  F_i = \frac{F_\textit{max} \cdot T_i}{S_\textit{optimal} \cdot T_\textit{max}}
  \end{equation}
  According to this equation all the nodes may have the same frequency value if
-they have balanced workloads. Otherwise, they take different frequencies when
-have imbalanced workloads. Then EQ~(\ref{eq:fi}) works in adaptive way to change
-the frequency according to the nodes workloads.
-
-\section{Experimental Results}
-
-The proposed EPSA algorithm was applied to seven MPI programs of the NAS
-benchmarks (EP, CG, MG, FT, BT, LU and SP). We work on three classes (A, B and
-C) for each program. Each program runs on specific number of processors
-proportional to the size of the class.  Each class represents the problem size
-ascending from the class A to C. Additionally, depending on some speed up points
-for each class we run the classes A, B and C on 4, 8 or 9 and 16 nodes
-respectively. Our experiments are executed on the simulator SimGrid/SMPI
-v3.10. We design a platform file that simulates a cluster with one core per
-node. This cluster is a homogeneous architecture with distributed memory. The
-detailed characteristics of our platform file are shown in the
-table~(\ref{table:platform}). Each node in the cluster has 18 frequency values
-from 2.5 GHz to 800 MHz with 100 MHz difference between each two successive
-frequencies.
-\begin{table}[ht]
-  \caption{Platform File Parameters}
-  % title of Table
+they have balanced workloads, otherwise, they take different frequencies when
+having imbalanced workloads.  Thus, EQ~(\ref{eq:fi}) adapts the frequency of the
+CPU to the nodes' workloads to maintain performance.
+
+\section{Experimental results}
+\label{sec.expe}
+Our experiments are executed on the simulator SimGrid/SMPI v3.10.  We configure
+the simulator to use a homogeneous cluster with one core per node.  The detailed
+characteristics of our platform file are shown in the
+table~(\ref{table:platform}).  Each node in the cluster has 18 frequency values
+from \np[GHz]{2.5} to \np[MHz]{800} with \np[MHz]{100} difference between each
+two successive frequencies.  The simulated network link is \np[GB]{1} Ethernet
+(TCP/IP).  The backbone of the cluster simulates a high performance switch.
+
+\subsection{Performance prediction verification}
+
+In this section we evaluate the precision of our performance prediction method
+based on EQ~(\ref{eq:tnew}) by applying it the NAS benchmarks.  The NAS programs
+are executed with the class B option for comparing the real execution time with
+the predicted execution time.  Each program runs offline with all available
+scaling factors on 8 or 9 nodes (depending on the benchmark) to produce real
+execution time values.  These scaling factors are computed by dividing the
+maximum frequency by the new one see EQ~(\ref{eq:s}).
+\begin{figure*}[t]
    \centering
-  \begin{tabular}{ | l | l | l |l | l |l |l |  p{2cm} |}
-    \hline
-    Max & Min & Backbone & Backbone&Link &Link& Sharing  \\
-    Freq. & Freq. & Bandwidth & Latency & Bandwidth& Latency&Policy  \\ \hline
-    2.5 &800 & 2.25 GBps &$5\times 10^{-7} s$& 1 GBps & $5\times 10^{-5}$ s&Full  \\
-    GHz& MHz&  & & &  &Duplex  \\\hline
-  \end{tabular}
-  \label{table:platform}
-\end{table}
-Depending on the EQ~(\ref{eq:energy}), we measure the energy consumption for all
-the NAS MPI programs while assuming the power dynamic is equal to 20W and the
-power static is equal to 4W for all experiments. We run the proposed EPSA
-algorithm for all these programs. The results showed that the algorithm selected
-different scaling factors for each program depending on the communication
-features of the program as in the figure~(\ref{fig:nas}). This figure shows that
-there are different distances between the normalized energy and the normalized
-inversed performance curves, because there are different communication features
-for each MPI program.  When there are little or not communications, the inversed
-performance curve is very close to the energy curve. Then the distance between
-the two curves is very small. This lead to small energy savings. The opposite
+  \includegraphics[width=.328\textwidth]{fig/cg_per}\hfill%
+  \includegraphics[width=.328\textwidth]{fig/mg_pre}\hfill%
+ % \includegraphics[width=.4\textwidth]{fig/bt_pre}\qquad%
+  \includegraphics[width=.328\textwidth]{fig/lu_pre}\hfill%
+  \caption{Comparing predicted to real execution time}
+  \label{fig:pred}
+\end{figure*}
+%see Figure~\ref{fig:pred}
+In our cluster there are 18 available frequency states for each processor.  This
+leads to 18 run states for each program.  We use seven MPI programs of the NAS
+parallel benchmarks: CG, MG, EP, FT, BT, LU and SP.  Figure~(\ref{fig:pred})
+presents plots of the real execution times and the simulated ones.  The maximum
+normalized error between these two execution times varies between \np{0.0073} to
+\np{0.031} dependent on the executed benchmark.  The smallest prediction error
+was for CG and the worst one was for LU.
+
+\subsection{The experimental results for the scaling algorithm }
+
+The proposed algorithm was applied to seven MPI programs of the NAS benchmarks
+(EP, CG, MG, FT, BT, LU and SP) which were run with three classes (A, B and C).
+For each instance the benchmarks were executed on a number of processors
+proportional to the size of the class.  Each class represents the problem size
+ascending from the class A to C.  Additionally, depending on some speed up
+points for each class we run the classes A, B and C on 4, 8 or 9 and 16 nodes
+respectively.  Depending on EQ~(\ref{eq:energy}), we measure the energy
+consumption for all the NAS MPI programs while assuming the power dynamic with
+the highest frequency is equal to \np[W]{20} and the power static is equal to
+\np[W]{4} for all experiments.  These power values were also used by Rauber and
+Rünger in~\cite{3}.  The results showed that the algorithm selected different
+scaling factors for each program depending on the communication features of the
+program as in the plots~(\ref{fig:nas}).  These plots illustrate that there are
+different distances between the normalized energy and the normalized inverted
+performance curves, because there are different communication features for each
+benchmark.  When there are little or not communications, the inverted
+performance curve is very close to the energy curve.  Then the distance between
+the two curves is very small.  This leads to small energy savings.  The opposite
  happens when there are a lot of communication, the distance between the two
-curves is big.  This lead to more energy savings (e.g. CG and FT), see
-table~(\ref{table:factors results}). All discovered frequency scaling factors
-optimize both the energy and the performance simultaneously for all the NAS
-programs. In table~(\ref{table:factors results}), we record all optimal scaling
-factors results for each program on class C. These factors give the maximum
-energy saving percent and the minimum performance degradation percent in the
-same time over all available scales.
-\begin{figure}[width=\textwidth,height=\textheight,keepaspectratio]
+curves is big.  This leads to more energy savings (e.g. CG and FT), see
+table~(\ref{table:factors results}).  All discovered frequency scaling factors
+optimize both the energy and the performance simultaneously for all NAS
+benchmarks.  In table~(\ref{table:factors results}), we record all optimal
+scaling factors results for each benchmark running class C.  These scaling
+factors give the maximum energy saving percent and the minimum performance
+degradation percent at the same time from all available scaling factors.
+\begin{figure*}[t]
    \centering
-  \includegraphics[scale=0.47]{ep.eps}
-  \includegraphics[scale=0.47]{cg.eps}
-  \includegraphics[scale=0.47]{sp.eps}
-  \includegraphics[scale=0.47]{lu.eps}
-  \includegraphics[scale=0.47]{bt.eps}
-  \includegraphics[scale=0.47]{ft.eps}
-  \caption{Optimal scaling factors for The NAS MPI Programs}
+  \includegraphics[width=.328\textwidth]{fig/ep}\hfill%
+  \includegraphics[width=.328\textwidth]{fig/cg}\hfill%
+  \includegraphics[width=.328\textwidth]{fig/sp}
+  \includegraphics[width=.328\textwidth]{fig/lu}\hfill%
+  \includegraphics[width=.328\textwidth]{fig/bt}\hfill%
+  \includegraphics[width=.328\textwidth]{fig/ft}
+  \caption{Optimal scaling factors for the predicted energy and performance of NAS benchmarks}
    \label{fig:nas}
-\end{figure}
-\begin{table}[width=\textwidth,height=\textheight,keepaspectratio]
-  \caption{Optimal Scaling Factors Results}
+\end{figure*}
+\begin{table}[htb]
+  \caption{The scaling factors results}
    % title of Table
    \centering
-  \AG{Use the same number of decimals for all numbers in a column,
-    and vertically align the numbers along the decimal points.
-    The same for all the following tables.}
-  \begin{tabular}{ | l | l | l |l | l | p{2cm} |}
+  \begin{tabular}{|l|*{4}{r|}}
+    \hline
+    Program & Optimal        & Energy    & Performance    & Energy-Perf. \\
+    Name    & Scaling Factor & Saving \% & Degradation \% & Distance     \\
+    \hline
+    CG      & 1.56           & 39.23     & 14.88          & 24.35 \\
+    \hline
+    MG      & 1.47           & 34.97     & 21.70          & 13.27 \\
+    \hline
+    EP      & 1.04           & 22.14     & 20.73          &  1.41 \\
+    \hline
+    LU      & 1.38           & 35.83     & 22.49          & 13.34 \\
+    \hline
+    BT      & 1.31           & 29.60     & 21.28          &  8.32 \\
+    \hline
+    SP      & 1.38           & 33.48     & 21.36          & 12.12 \\
+    \hline
+    FT      & 1.47           & 34.72     & 19.00          & 15.72 \\
      \hline
-    Program & Optimal & Energy  & Performance&Energy-Perf.\\
-    Name & Scaling Factor& Saving \%&Degradation \% &Distance  \\ \hline
-    CG & 1.56 &39.23&14.88 &24.35\\ \hline
-    MG & 1.47 &34.97&21.70 &13.27 \\ \hline
-    EP & 1.04 &22.14&20.73 &1.41\\ \hline
-    LU & 1.38 &35.83&22.49 &13.34\\ \hline
-    BT & 1.31 &29.60&21.28 &8.32\\ \hline
-    SP & 1.38 &33.48&21.36 &12.12\\ \hline
-    FT & 1.47 &34.72&19.00 &15.72\\ \hline
    \end{tabular}
    \label{table:factors results}
    % is used to refer this table in the text
  \end{table}
-
  As shown in the table~(\ref{table:factors results}), when the optimal scaling
  factor has big value we can gain more energy savings for example as in CG and
-FT. The opposite happens when the optimal scaling factor is small value as
-example BT and EP. Our algorithm selects big scaling factor value when the
+FT.  The opposite happens when the optimal scaling factor is small value as
+example BT and EP.  Our algorithm selects big scaling factor value when the
  communication and the other slacks times are big and smaller ones in opposite
-cases. In EP there are no communications inside the iterations. This make our
-EPSA to selects smaller scaling factor values (inducing smaller energy savings).
-
-\section{Comparing Results}
-
-In this section, we compare our EPSA algorithm results with Rauber's
-methods~\cite{3}. He had two scenarios, the first is to reduce energy to optimal
-level without considering the performance as in EQ~(\ref{eq:sopt}). We refer to
-this scenario as $Rauber_{E}$. The second scenario is similar to the first
-except setting the slower task to the maximum frequency (when the scale $S=1$)
-to keep the performance from degradation as mush as possible. We refer to this
-scenario as $Rauber_{E-P}$. The comparison is made in tables~(\ref{table:compare
-  Class A},\ref{table:compare Class B},\ref{table:compare Class C}). These
-tables show the results of our EPSA and Rauber's two scenarios for all the NAS
-benchmarks programs for classes A,B and C.
-\begin{table}[ht]
-  \caption{Comparing Results for  The NAS Class A}
+cases.  In EP there are no communications inside the iterations.  This make our
+algorithm to selects smaller scaling factor values (inducing smaller energy
+savings).
+
+\subsection{Results comparison}
+
+In this section, we compare our scaling factor selection method with Rauber and
+Rünger methods~\cite{3}.  They had two scenarios, the first is to reduce energy
+to the optimal level without considering the performance as in
+EQ~(\ref{eq:sopt}).  We refer to this scenario as $R_{E}$.  The second scenario
+is similar to the first except setting the slower task to the maximum frequency
+(when the scale $S=1$) to keep the performance from degradation as mush as
+possible.  We refer to this scenario as $R_{E-P}$.  While we refer to our
+algorithm as EPSA (Energy to Performance Scaling Algorithm).  The comparison is
+made in tables \ref{table:compareA}, \ref{table:compareB},
+and~\ref{table:compareC}.  These tables show the results of our method and
+Rauber and Rünger scenarios for all the NAS benchmarks programs for classes A, B
+and C.
+\begin{table}[p]
+  \caption{Comparing results for  the NAS class A}
    % title of Table
    \centering
-  \begin{tabular}{ | l | l | l |l | l | l|  }
+  \begin{tabular}{|l|l|*{4}{r|}}
+    \hline
+    Method    & Program & Factor & Energy    & Performance    & Energy-Perf. \\
+    Name      & Name    & Value  & Saving \% & Degradation \% & Distance     \\
      \hline
-    Method&Program&Factor& Energy& Performance &Energy-Perf.\\
-    name &name&value& Saving \%&Degradation \% &Distance
-    \\ \hline
      % \rowcolor[gray]{0.85}
-    EPSA&CG & 1.56 &37.02 & 13.88 & 23.14\\ \hline
-    $Rauber_{E-P}$&CG &2.14 &42.77 & 25.27 & 17.50\\ \hline
-    $Rauber_{E}$&CG &2.14 &42.77&26.46&16.31\\ \hline
+    $EPSA$    & CG      & 1.56   & 37.02     & 13.88          &  23.14 \\ \hline
+    $R_{E-P}$ & CG      & 2.14   & 42.77     & 25.27          &  17.50 \\ \hline
+    $R_{E}$   & CG      & 2.14   & 42.77     & 26.46          &  16.31 \\ \hline
  
-    EPSA&MG & 1.47 &27.66&16.82&10.84\\ \hline
-    $Rauber_{E-P}$&MG &2.14&34.45&31.84&2.61\\ \hline
-    $Rauber_{E}$&MG &2.14&34.48&33.65&0.80 \\ \hline
+    $EPSA$    & MG      & 1.47   & 27.66     & 16.82          &  10.84 \\ \hline
+    $R_{E-P}$ & MG      & 2.14   & 34.45     & 31.84          &   2.61 \\ \hline
+    $R_{E}$   & MG      & 2.14   & 34.48     & 33.65          &   0.80 \\ \hline
  
-    EPSA&EP &1.19 &25.32&20.79&4.53\\ \hline
-    $Rauber_{E-P}$&EP&2.05&41.45&55.67&-14.22\\ \hline
-    $Rauber_{E}$&EP&2.05&42.09&57.59&-15.50\\ \hline
+    $EPSA$    & EP      & 1.19   & 25.32     & 20.79          &   4.53 \\ \hline
+    $R_{E-P}$ & EP      & 2.05   & 41.45     & 55.67          & -14.22 \\ \hline
+    $R_{E}$   & EP      & 2.05   & 42.09     & 57.59          & -15.50 \\ \hline
  
-    EPSA&LU&1.56& 39.55 &19.38& 20.17\\ \hline
-    $Rauber_{E-P}$&LU&2.14&45.62&27.00&18.62 \\ \hline
-    $Rauber_{E}$&LU&2.14&45.66&33.01&12.65\\ \hline
+    $EPSA$    & LU      & 1.56   & 39.55     & 19.38          &  20.17 \\ \hline
+    $R_{E-P}$ & LU      & 2.14   & 45.62     & 27.00          &  18.62 \\ \hline
+    $R_{E}$   & LU      & 2.14   & 45.66     & 33.01          &  12.65 \\ \hline
  
-    EPSA&BT&1.31& 29.60&20.53&9.07 \\ \hline
-    $Rauber_{E-P}$&BT&2.10&45.53&49.63&-4.10\\ \hline
-    $Rauber_{E}$&BT&2.10&43.93&52.86&-8.93\\ \hline
+    $EPSA$    & BT      & 1.31   & 29.60     & 20.53          &   9.07 \\ \hline
+    $R_{E-P}$ & BT      & 2.10   & 45.53     & 49.63          &  -4.10 \\ \hline
+    $R_{E}$   & BT      & 2.10   & 43.93     & 52.86          &  -8.93 \\ \hline
  
-    EPSA&SP&1.38& 33.51&15.65&17.86 \\ \hline
-    $Rauber_{E-P}$&SP&2.11&45.62&42.52&3.10\\ \hline
-    $Rauber_{E}$&SP&2.11&45.78&43.09&2.69\\ \hline
+    $EPSA$    & SP      & 1.38   & 33.51     & 15.65          &  17.86 \\ \hline
+    $R_{E-P}$ & SP      & 2.11   & 45.62     & 42.52          &   3.10 \\ \hline
+    $R_{E}$   & SP      & 2.11   & 45.78     & 43.09          &   2.69 \\ \hline
  
-    EPSA&FT&1.25&25.00&10.80&14.20 \\ \hline
-    $Rauber_{E-P}$&FT&2.10&39.29&34.30&4.99 \\ \hline
-    $Rauber_{E}$&FT&2.10&37.56&38.21&-0.65\\ \hline
+    $EPSA$    & FT      & 1.25   & 25.00     & 10.80          &  14.20 \\ \hline
+    $R_{E-P}$ & FT      & 2.10   & 39.29     & 34.30          &   4.99 \\ \hline
+    $R_{E}$   & FT      & 2.10   & 37.56     & 38.21          &  -0.65 \\ \hline
    \end{tabular}
-  \label{table:compare Class A}
+  \label{table:compareA}
    % is used to refer this table in the text
  \end{table}
-\begin{table}[ht]
-  \caption{Comparing Results for The NAS Class B}
+\begin{table}[p]
+  \caption{Comparing results for the NAS class B}
    % title of Table
    \centering
-  \begin{tabular}{ | l | l | l |l | l |l|  }
+  \begin{tabular}{|l|l|*{4}{r|}}
+    \hline
+    Method    & Program & Factor & Energy    & Performance    & Energy-Perf. \\
+    Name      & Name    & Value  & Saving \% & Degradation \% & Distance     \\
      \hline
-    Method&Program&Factor& Energy& Performance &Energy-Perf.\\
-    name &name&value& Saving \%&Degradation \% &Distance
-    \\ \hline
      % \rowcolor[gray]{0.85}
-    EPSA&CG & 1.66 &39.23&16.63&22.60   \\ \hline
-    $Rauber_{E-P}$&CG &2.15 &45.34&27.60&17.74\\ \hline
-    $Rauber_{E}$&CG &2.15 &45.34&28.88&16.46\\ \hline
+    $EPSA$    & CG      & 1.66   & 39.23     & 16.63          &  22.60 \\ \hline
+    $R_{E-P}$ & CG      & 2.15   & 45.34     & 27.60          &  17.74 \\ \hline
+    $R_{E}$   & CG      & 2.15   & 45.34     & 28.88          &  16.46 \\ \hline
  
-    EPSA&MG & 1.47 &34.98&18.35&16.63\\ \hline
-    $Rauber_{E-P}$&MG &2.14&43.55&36.42&7.13 \\ \hline
-    $Rauber_{E}$&MG &2.14&43.56&37.07&6.49 \\ \hline
+    $EPSA$    & MG      & 1.47   & 34.98     & 18.35          &  16.63 \\ \hline
+    $R_{E-P}$ & MG      & 2.14   & 43.55     & 36.42          &   7.13 \\ \hline
+    $R_{E}$   & MG      & 2.14   & 43.56     & 37.07          &   6.49 \\ \hline
  
-    EPSA&EP &1.08 &20.29&17.15&3.14 \\ \hline
-    $Rauber_{E-P}$&EP&2.00&42.38&56.88&-14.50\\ \hline
-    $Rauber_{E}$&EP&2.00&39.73&59.94&-20.21\\ \hline
+    $EPSA$    & EP      & 1.08   & 20.29     & 17.15          &   3.14 \\ \hline
+    $R_{E-P}$ & EP      & 2.00   & 42.38     & 56.88          & -14.50 \\ \hline
+    $R_{E}$   & EP      & 2.00   & 39.73     & 59.94          & -20.21 \\ \hline
  
-    EPSA&LU&1.47&38.57&21.34&17.23 \\ \hline
-    $Rauber_{E-P}$&LU&2.10&43.62&36.51&7.11 \\ \hline
-    $Rauber_{E}$&LU&2.10&43.61&38.54&5.07 \\ \hline
+    $EPSA$    & LU      & 1.47   & 38.57     & 21.34          &  17.23 \\ \hline
+    $R_{E-P}$ & LU      & 2.10   & 43.62     & 36.51          &   7.11 \\ \hline
+    $R_{E}$   & LU      & 2.10   & 43.61     & 38.54          &   5.07 \\ \hline
  
-    EPSA&BT&1.31& 29.59&20.88&8.71\\ \hline
-    $Rauber_{E-P}$&BT&2.10&44.53&53.05&-8.52\\ \hline
-    $Rauber_{E}$&BT&2.10&42.93&52.80&-9.87\\ \hline
+    $EPSA$    & BT      & 1.31   & 29.59     & 20.88          &   8.71 \\ \hline
+    $R_{E-P}$ & BT      & 2.10   & 44.53     & 53.05          &  -8.52 \\ \hline
+    $R_{E}$   & BT      & 2.10   & 42.93     & 52.80          &  -9.87 \\ \hline
  
-    EPSA&SP&1.38&33.44&19.24&14.20 \\ \hline
-    $Rauber_{E-P}$&SP&2.15&45.69&43.20&2.49\\ \hline
-    $Rauber_{E}$&SP&2.15&45.41&44.47&0.94\\ \hline
+    $EPSA$    & SP      & 1.38   & 33.44     & 19.24          &  14.20 \\ \hline
+    $R_{E-P}$ & SP      & 2.15   & 45.69     & 43.20          &   2.49 \\ \hline
+    $R_{E}$   & SP      & 2.15   & 45.41     & 44.47          &   0.94 \\ \hline
  
-    EPSA&FT&1.38&34.40&14.57&19.83 \\ \hline
-    $Rauber_{E-P}$&FT&2.13&42.98&37.35&5.63 \\ \hline
-    $Rauber_{E}$&FT&2.13&43.04&37.90&5.14\\ \hline
+    $EPSA$    & FT      & 1.38   & 34.40     & 14.57          &  19.83 \\ \hline
+    $R_{E-P}$ & FT      & 2.13   & 42.98     & 37.35          &   5.63 \\ \hline
+    $R_{E}$   & FT      & 2.13   & 43.04     & 37.90          &   5.14 \\ \hline
    \end{tabular}
-  \label{table:compare Class B}
+  \label{table:compareB}
    % is used to refer this table in the text
  \end{table}
  
-\begin{table}[ht]
-  \caption{Comparing Results for The NAS Class C}
+\begin{table}[p]
+  \caption{Comparing results for the NAS class C}
    % title of Table
    \centering
-  \begin{tabular}{ | l | l | l |l | l |l|  }
+  \begin{tabular}{|l|l|*{4}{r|}}
+    \hline
+    Method    & Program & Factor & Energy    & Performance    & Energy-Perf. \\
+    Name      & Name    & Value  & Saving \% & Degradation \% & Distance     \\
      \hline
-    Method&Program&Factor& Energy& Performance &Energy-Perf.\\
-    name &name&value& Saving \%&Degradation \% &Distance
-    \\ \hline
      % \rowcolor[gray]{0.85}
-    EPSA&CG & 1.56 &39.23&14.88&24.35  \\ \hline
-    $Rauber_{E-P}$&CG &2.15 &45.36&25.89&19.47\\ \hline
-    $Rauber_{E}$&CG &2.15 &45.36&26.70&18.66\\ \hline
+    $EPSA$    & CG      & 1.56   & 39.23     & 14.88          &  24.35 \\ \hline
+    $R_{E-P}$ & CG      & 2.15   & 45.36     & 25.89          &  19.47 \\ \hline
+    $R_{E}$   & CG      & 2.15   & 45.36     & 26.70          &  18.66 \\ \hline
  
-    EPSA&MG & 1.47 &34.97&21.69&13.27\\ \hline
-    $Rauber_{E-P}$&MG &2.15&43.65&40.45&3.20 \\ \hline
-    $Rauber_{E}$&MG &2.15&43.64&41.38&2.26 \\ \hline
+    $EPSA$    & MG      & 1.47   & 34.97     & 21.69          &  13.27 \\ \hline
+    $R_{E-P}$ & MG      & 2.15   & 43.65     & 40.45          &   3.20 \\ \hline
+    $R_{E}$   & MG      & 2.15   & 43.64     & 41.38          &   2.26 \\ \hline
  
-    EPSA&EP &1.04 &22.14&20.73&1.41 \\ \hline
-    $Rauber_{E-P}$&EP&1.92&39.40&56.33&-16.93\\ \hline
-    $Rauber_{E}$&EP&1.92&38.10&56.35&-18.25\\ \hline
+    $EPSA$    & EP      & 1.04   & 22.14     & 20.73          &   1.41 \\ \hline
+    $R_{E-P}$ & EP      & 1.92   & 39.40     & 56.33          & -16.93 \\ \hline
+    $R_{E}$   & EP      & 1.92   & 38.10     & 56.35          & -18.25 \\ \hline
  
-    EPSA&LU&1.38&35.83&22.49&13.34 \\ \hline
-    $Rauber_{E-P}$&LU&2.15&44.97&41.00&3.97 \\ \hline
-    $Rauber_{E}$&LU&2.15&44.97&41.80&3.17 \\ \hline
+    $EPSA$    & LU      & 1.38   & 35.83     & 22.49          &  13.34 \\ \hline
+    $R_{E-P}$ & LU      & 2.15   & 44.97     & 41.00          &   3.97 \\ \hline
+    $R_{E}$   & LU      & 2.15   & 44.97     & 41.80          &   3.17 \\ \hline
  
-    EPSA&BT&1.31& 29.60&21.28&8.32\\ \hline
-    $Rauber_{E-P}$&BT&2.13&45.60&49.84&-4.24\\ \hline
-    $Rauber_{E}$&BT&2.13&44.90&55.16&-10.26\\ \hline
+    $EPSA$    & BT      & 1.31   & 29.60     & 21.28          &   8.32 \\ \hline
+    $R_{E-P}$ & BT      & 2.13   & 45.60     & 49.84          &  -4.24 \\ \hline
+    $R_{E}$   & BT      & 2.13   & 44.90     & 55.16          & -10.26 \\ \hline
  
-    EPSA&SP&1.38&33.48&21.35&12.12\\ \hline
-    $Rauber_{E-P}$&SP&2.10&45.69&43.60&2.09\\ \hline
-    $Rauber_{E}$&SP&2.10&45.75&44.10&1.65\\ \hline
+    $EPSA$    & SP      & 1.38   & 33.48     & 21.35          &  12.12 \\ \hline
+    $R_{E-P}$ & SP      & 2.10   & 45.69     & 43.60          &   2.09 \\ \hline
+    $R_{E}$   & SP      & 2.10   & 45.75     & 44.10          &   1.65 \\ \hline
  
-    EPSA&FT&1.47&34.72&19.00&15.72 \\ \hline
-    $Rauber_{E-P}$&FT&2.04&39.40&37.10&2.30\\ \hline
-    $Rauber_{E}$&FT&2.04&39.35&37.70&1.65\\ \hline
+    $EPSA$    & FT      & 1.47   & 34.72     & 19.00          &  15.72 \\ \hline
+    $R_{E-P}$ & FT      & 2.04   & 39.40     & 37.10          &   2.30 \\ \hline
+    $R_{E}$   & FT      & 2.04   & 39.35     & 37.70          &   1.65 \\ \hline
    \end{tabular}
-\label{table:compare Class C}
-% is used to refer this table in the text
+  \label{table:compareC}
+  % is used to refer this table in the text
  \end{table}
-As shown in these tables our scaling factor is not optimal for energy saving
-such as Rauber's scaling factor EQ~(\ref{eq:sopt}), but it is optimal for both
-the energy and the performance simultaneously. Our EPSA optimal scaling factors
-has better simultaneous optimization for both the energy and the performance
-compared to Rauber's energy-performance method ($Rauber_{E-P}$). Also, in
-($Rauber_{E-P}$) method when setting the frequency to maximum value for the
-slower task lead to a small improvement of the performance. Also the results
-show that this method keep or improve energy saving. Because of the energy
-consumption decrease when the execution time decreased while the frequency value
-increased.
-
-Figure~(\ref{fig:compare}) shows the maximum distance between the energy saving
-percent and the performance degradation percent. Therefore, this means it is the
-same resultant of our objective function EQ~(\ref{eq:max}). Our algorithm always
-gives positive energy to performance trade offs while Rauber's method
-($Rauber_{E-P}$) gives in some time negative trade offs such as in BT and
-EP. The positive trade offs with highest values lead to maximum energy savings
-concatenating with less performance degradation and this the objective of this
-paper. While the negative trade offs refers to improving energy saving (or may
-be the performance) while degrading the performance (or may be the energy) more
-than the first.
-\begin{figure}[width=\textwidth,height=\textheight,keepaspectratio]
-  \centering
-  \includegraphics[scale=0.60]{compare_class_A.pdf}
-  \includegraphics[scale=0.60]{compare_class_B.pdf}
-  \includegraphics[scale=0.60]{compare_class_c.pdf}
-  % use scale 35 for all to be in the same line
-  \caption{Comparing Our EPSA with Rauber's Methods}
-  \label{fig:compare}
-\end{figure}
+As shown in tables~\ref{table:compareA},~\ref{table:compareB}
+and~\ref{table:compareC}, the ($R_{E-P}$) method outperforms the ($R_{E}$)
+method in terms of performance and energy reduction.  The ($R_{E-P}$) method
+also gives better energy savings than our method.  However, although our scaling
+factor is not optimal for energy reduction, the results in these tables prove
+that our algorithm returns the best scaling factor that satisfy our objective
+method: the largest distance between energy reduction and performance
+degradation.  Negative values in the energy-performance column mean that one of
+the two objectives (energy or performance) have been degraded more than the
+other.  The positive trade-offs with the highest values lead to maximum energy
+savings while keeping the performance degradation as low as possible.  Our
+algorithm always gives the highest positive energy to performance trade-offs
+while Rauber and Rünger method ($R_{E-P}$) gives in some time negative
+trade-offs such as in BT and EP.
+%\begin{figure*}[t]
+%  \centering
+%  \includegraphics[width=.328\textwidth]{fig/compare_class_A}
+%  \includegraphics[width=.328\textwidth]{fig/compare_class_B}
+%  \includegraphics[width=.328\textwidth]{fig/compare_class_C}
+%  \caption{Comparing our method to Rauber and Rünger methods}
+%  \label{fig:compare}
+%\end{figure*}
  
  \section{Conclusion}
-\label{sec.conc}
-
-\AG{the conclusion needs to be written\dots{} one day}
+\label{sec.concl}
+
+In this paper, we have presented a new online scaling factor selection method
+that optimizes simultaneously the energy and performance of a distributed
+application running on an homogeneous cluster.  It uses the computation and
+communication times measured at the first iteration to predict energy
+consumption and the performance of the parallel application at every available
+frequency.  Then, it selects the scaling factor that gives the best trade-off
+between energy reduction and performance which is the maximum distance between
+the energy and the inverted performance curves.  To evaluate this method, we
+have applied it to the NAS benchmarks and it was compared to Rauber and Rünger
+methods while being executed on the simulator SimGrid.  The results showed that
+our method, outperforms Rauber and Rünger methods in terms of energy-performance
+ratio.
+
+In the near future, we would like to adapt this scaling factor selection method
+to heterogeneous platforms where each node has different characteristics.  In
+particular, each CPU has different available frequencies, energy consumption and
+performance.  It would be also interesting to develop a new energy model for
+asynchronous parallel iterative methods where the number of iterations is not
+known in advance and depends on the global convergence of the iterative system.
  
  \section*{Acknowledgment}
  
-Computations have been performed on the supercomputer facilities of the
-Mésocentre de calcul de Franche-Comté.
+This work has been supported by the Labex ACTION project (contract
+``ANR-11-LABX-01-01'').  Computations have been performed on the supercomputer
+facilities of the Mésocentre de calcul de Franche-Comté.  As a PhD student,
+Mr. Ahmed Fanfakh, would like to thank the University of Babylon (Iraq) for
+supporting his work.
+
+% trigger a \newpage just before the given reference
+% number - used to balance the columns on the last page
+% adjust value as needed - may need to be readjusted if
+% the document is modified later
+%\IEEEtriggeratref{15}
  
-\bibliographystyle{plain}
-\bibliography{my_reference}
+\bibliographystyle{IEEEtran}
+\bibliography{IEEEabrv,my_reference}
  \end{document}
  
  %%% Local Variables:
  %%% mode: latex
  %%% TeX-master: t
  %%% fill-column: 80
-%%%ispell-local-dictionary: "american"
+%%% ispell-local-dictionary: "american"
  %%% End:
  
-%  LocalWords:  Badri Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber
-%  LocalWords:  CMOS EQ EPSA
+%  LocalWords:  Fanfakh Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber
+%  LocalWords:  CMOS EQ EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex