Fix perms.

[mpi-energy.git] / paper.tex
diff --git a/paper.tex b/paper.tex

index 1ce9ae117070ebe5f851f0dbc9e1c4876637bdf6..e62908a6ffab9c5c2b8a14a1e0f6de133fba4da7 100644 (file)
--- a/paper.tex
+++ b/paper.tex
@@ -1,737 +1,709 @@
-\documentclass[12pt]{article}
-%\documentclass[12pt,twocolumn]{article}
+\documentclass[conference]{IEEEtran}
  
  \usepackage[T1]{fontenc}
  \usepackage[utf8]{inputenc}
  \usepackage[english]{babel}
-\usepackage{algorithm,algorithmicx,algpseudocode}
-\usepackage{graphicx,graphics}
+\usepackage{algpseudocode}
+\usepackage{graphicx}
  \usepackage{subfig}
-\usepackage{listings}
-\usepackage{colortbl}
  \usepackage{amsmath}
-% \usepackage{sectsty}
-% \usepackage{titlesec}
-% \usepackage{secdot}
-%\usepackage[font={footnotesize,bt}]{caption}
-%\usepackage[font=scriptsize,labelfont=bf]{caption}
-\usepackage{lmodern}
  
-\usepackage{todonotes}
-\newcommand{\AG}[2][inline]{\todo[color=green!50,#1]{\sffamily\small\textbf{AG:} #2}}
+\usepackage{url}
+\DeclareUrlCommand\email{\urlstyle{same}}
+
+\usepackage[autolanguage,np]{numprint}
+\AtBeginDocument{%
+  \renewcommand*\npunitcommand[1]{\text{#1}}
+  \npthousandthpartsep{}}
+
+\usepackage{xspace}
+\usepackage[textsize=footnotesize]{todonotes}
+\newcommand{\AG}[2][inline]{%
+  \todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace}
+\newcommand{\JC}[2][inline]{%
+  \todo[color=red!10,#1]{\sffamily\textbf{JC:} #2}\xspace}
+
+\newcommand{\Xsub}[2]{\ensuremath{#1_\textit{#2}}}
+
+\newcommand{\Dist}{\textit{Dist}}
+\newcommand{\Eind}{\Xsub{E}{ind}}
+\newcommand{\Enorm}{\Xsub{E}{Norm}}
+\newcommand{\Eoriginal}{\Xsub{E}{Original}}
+\newcommand{\Ereduced}{\Xsub{E}{Reduced}}
+\newcommand{\Fdiff}{\Xsub{F}{diff}}
+\newcommand{\Fmax}{\Xsub{F}{max}}
+\newcommand{\Fnew}{\Xsub{F}{new}}
+\newcommand{\Ileak}{\Xsub{I}{leak}}
+\newcommand{\Kdesign}{\Xsub{K}{design}}
+\newcommand{\MaxDist}{\textit{Max Dist}}
+\newcommand{\Ntrans}{\Xsub{N}{trans}}
+\newcommand{\Pdyn}{\Xsub{P}{dyn}}
+\newcommand{\PnormInv}{\Xsub{P}{NormInv}}
+\newcommand{\Pnorm}{\Xsub{P}{Norm}}
+\newcommand{\Tnorm}{\Xsub{T}{Norm}}
+\newcommand{\Pstates}{\Xsub{P}{states}}
+\newcommand{\Pstatic}{\Xsub{P}{static}}
+\newcommand{\Sopt}{\Xsub{S}{opt}}
+\newcommand{\Tcomp}{\Xsub{T}{comp}}
+\newcommand{\TmaxCommOld}{\Xsub{T}{Max Comm Old}}
+\newcommand{\TmaxCompOld}{\Xsub{T}{Max Comp Old}}
+\newcommand{\Tmax}{\Xsub{T}{max}}
+\newcommand{\Tnew}{\Xsub{T}{New}}
+\newcommand{\Told}{\Xsub{T}{Old}}
  
  \begin{document}
  
-\title{Optimal Dynamic Frequency Scaling for Energy-Performance of Parallel MPI Programs}
-\author{A. Badri \and J.-C. Charr \and R. Couturier \and A. Giersch}
-\maketitle
+\title{Dynamic Frequency Scaling for Energy Consumption
+  Reduction in Synchronous Distributed Applications}
+
+\author{%
+  \IEEEauthorblockN{%
+    Jean-Claude Charr,
+    Raphaël Couturier,
+    Ahmed Fanfakh and
+    Arnaud Giersch
+  }
+  \IEEEauthorblockA{%
+    FEMTO-ST Institute\\
+    University of Franche-Comté\\
+    IUT de Belfort-Montbéliard,
+    19 avenue du Maréchal Juin, BP 527, 90016 Belfort cedex, France\\
+    % Telephone: \mbox{+33 3 84 58 77 86}, % Raphaël
+    % Fax: \mbox{+33 3 84 58 77 81}\\      % Dept Info
+    Email: \email{{jean-claude.charr,raphael.couturier,ahmed.fanfakh_badri_muslim,arnaud.giersch}@univ-fcomte.fr}
+   }
+  }
  
-\AG{``Optimal'' is a bit pretentious in the title}
+\maketitle
  
  \begin{abstract}
-  \AG{FIXME}
+  Dynamic Voltage Frequency Scaling (DVFS) can be applied to modern CPUs.  This
+  technique is usually used to reduce the energy consumed by a CPU while
+  computing.  Thus, decreasing the frequency
+  reduces the power consumed by the CPU.  However, it can also significantly
+  affect the performance of the executed program if it is compute bound and if a
+  low CPU frequency is selected.  Therefore, the chosen scaling factor must
+  give the best possible trade-off between energy reduction and performance.
+
+  In this paper we present an algorithm that predicts the energy consumed with
+  each frequency gear and selects the one that gives the best ratio between
+  energy consumption reduction and performance.  This algorithm works online
+  without training or profiling and has a very small overhead.  It also takes
+  into account synchronous communications between the nodes that are executing
+  the distributed algorithm.  The algorithm has been evaluated over the SimGrid
+  simulator while being applied to the NAS parallel benchmark programs.  The
+  results of the experiments show that it outperforms other existing scaling
+  factor selection algorithms.
  \end{abstract}
  
  \section{Introduction}
-
-bla bla bla
-
-
-The need for computing power is still increasing and it is not expected to slow
-down in the coming years. To satisfy this demand, researchers and supercomputers
-constructors have been regularly increasing the number of computing cores in
-supercomputers (for example in November 2013, according to the top 500
-list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more than 3
-millions of cores and delivers more than 33 Tflop/s while consuming 17808
-kW). This large increase in number of computing cores has led to large energy
-consumption by these architectures. Moreover, the price of energy is expected to
-continue its ascent according to the demand. For all these reasons energy
-reduction became an important topic in the high performance computing field. To
-tackle this problem, many researchers used DVFS (Dynamic Voltage Frequency
-Scaling) operations which reduce dynamically the frequency and voltage of cores
-and thus their energy consumption. However, this operation also degrades the
-performance of computation. Therefore researchers try to reduce the frequency to
-minimum when processors are idle (waiting for data from other processors or
-communicating with other processors). Moreover, depending on their objectives
-they use heuristics to find the best scaling factor during the computation. If
-they aim for performance they choose the best scaling factor that reduces the
-consumed energy while affecting as little as possible the performance. On the
-other hand, if they aim for energy reduction, the chosen scaling factor must
-produce the most energy efficient execution without considering the degradation
-of the performance. It is important to notice that lowering the frequency to
-minimum value does not always give the most efficient execution due to energy
-leakage. The best scaling factor might be chosen during execution (online) or
-during a pre-execution phase.  In this paper we emphasize to develop an
-algorithm that selects the optimal frequency scaling factor that takes into
-consideration simultaneously the energy consumption and the performance. The
-main objective of HPC systems is to run the application with less execution
-time. Therefore, our algorithm selects the optimal scaling factor online with
-very small footprint. The proposed algorithm takes into account the
-communication times of the MPI programs to choose the scaling factor. This
-algorithm has ability to predict both energy consumption and execution time over
-all available scaling factors.  The prediction achieved depends on some
-computing time information, gathered at the beginning of the runtime.  We apply
-this algorithm to seven MPI benchmarks. These MPI programs are the NAS parallel
-benchmarks (NPB v3.3) developed by NASA~\cite{44}. Our experiments are executed
-using the simulator SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183}
-over an homogeneous distributed memory architecture. Furthermore, we compare the
-proposed algorithm with Rauber's methods. The comparison's results show that our
-algorithm gives better energy-time trade off.
-
-\section{Related Works}
-
-In the this section some heuristics, to compute the scaling factor, are
-presented and classified in two parts : offline and online methods.
-
-\subsection{The offline DVFS orientations}
-
-The DVFS offline methods are static and are not executed during the runtime of
-the program. Some approaches used heuristics to select the best DVFS state
-during the compilation phases as an example in Azevedo et al.~\cite{40}. He used
-intra-task algorithm to choose the DVFS setting when there are dependency points
-between tasks. While in~\cite{29}, Xie et al. used breadth-first search
-algorithm to do that. Their goal is saving energy with time limits. Another
-approaches gathers and stores the runtime information for each DVFS state, then
-used their methods offline to select the suitable DVFS that optimize energy-time
-trade offs. As an example~\cite{8}, Rountree et al. used liner programming
-algorithm, while in~\cite{38,34}, Cochran et al. used multi logistic regression
-algorithm for the same goal. The offline study that shown the DVFS impact on the
-communication time of the MPI program is~\cite{17}, Freeh et al. show that these
-times not changed when the frequency is scaled down.
-
-\subsection{The online DVFS orientations}
-
-The objective of these works is to dynamically compute and set the frequency of
-the CPU during the runtime of the program for saving energy. Estimating and
-predicting approaches for the energy-time trade offs developed by
-~\cite{11,2,31}. These works select the best DVFS setting depending on the slack
-times. These times happen when the processors have to wait for data from other
-processors to compute their task. For example, during the synchronous
-communication time that take place in the MPI programs, the processors are
-idle. The optimal DVFS can be selected using the learning methods. Therefore, in
-~\cite{39,19} used machine learning to converge to the suitable DVFS
-configuration. Their learning algorithms have big time to converge when the
-number of available frequencies is high. Also, the communication time of the MPI
-program used online for saving energy as in~\cite{1}, Lim et al. developed an
-algorithm that detects the communication sections and changes the frequency
-during these sections only. This approach changes the frequency many times
-because an iteration may contain more than one communication section. The domain
-of analytical modeling used for choosing the optimal frequency as in~\cite{3},
-Rauber et al. developed an analytical mathematical model for determining the
-optimal frequency scaling factor for any number of concurrent tasks, without
-considering communication times. They set the slowest task to maximum frequency
-for maintaining performance.  In this paper we compare our algorithm with
-Rauber's model~\cite{3}, because his model can be used for any number of
-concurrent tasks for homogeneous platform and this is the same direction of this
-paper.  However, the primary contributions of this paper are:
-\begin{enumerate}
-\item Selecting the optimal frequency scaling factor for energy and performance
-  simultaneously. While taking into account the communication time.
-\item Adapting our scale factor to taking into account the imbalanced tasks.
-\item The execution time of our algorithm is very small when compared to other
-  methods (e.g.,~\cite{19}).
-\item The proposed algorithm works online without profiling or training as
-  in~\cite{38,34}.
-\end{enumerate}
-
-\section{Parallel Tasks Execution on Homogeneous Platform}
-
-A homogeneous cluster consists of identical nodes in terms of the hardware and
-the software. Each node has its own memory and at least one processor which can
-be a multi-core. The nodes are connected via a high bandwidth network. Tasks
-executed on this model can be either synchronous or asynchronous. In this paper
-we consider execution of the synchronous tasks on distributed homogeneous
-platform. These tasks can exchange the data via synchronous memory passing.
-\begin{figure}[h]
-  \centering
-  \subfloat[Synch. Imbalanced Communications]{\includegraphics[scale=0.67]{synch_tasks}\label{fig:h1}}
-  \subfloat[Synch. Imbalanced Computations]{\includegraphics[scale=0.67]{compt}\label{fig:h2}}
-  \caption{Parallel Tasks on Homogeneous Platform}
-  \label{fig:homo}
-\end{figure}
-Therefore, the execution time of a task consists of the computation time and the
-communication time. Moreover, the synchronous communications between tasks can
-lead to idle time while tasks wait at the synchronous point for others tasks to
-finish their communications see figure~(\ref{fig:h1}).  Another source for idle
-times is the imbalanced computations. This happen when processing different
-amounts of data on each processor as an example see figure~(\ref{fig:h2}). In
-this case the fastest tasks have to wait at the synchronous barrier for the
-slowest tasks to finish their job. In both two cases the overall execution time
-of the program is the execution time of the slowest task as :
-\begin{equation}
-  \label{eq:T1}
-  \textit{Program Time} = \max_{i=1,2,\dots,N} T_i
-\end{equation}
-where $T_i$ is the execution time of process $i$.
-
-\section{Energy Model for Homogeneous Platform}
-
-The energy consumption by the processor consists of two powers metric: the
-dynamic and the static power. This general power formulation is used by many
-researchers see~\cite{9,3,15,26}. The dynamic power of the CMOS processors
-$P_{dyn}$ is related to the switching activity $\alpha$, load capacitance $C_L$,
-the supply voltage $V$ and operational frequency $f$ respectively as follow :
+\label{sec.intro}
+
+The need and demand for more computing power have been increasing since the
+birth of the first computing unit and it is not expected to slow down in the
+coming years.  To satisfy this demand, researchers and supercomputers
+constructors have been regularly increasing the number of computing cores and
+processors in supercomputers (for example in November 2013, according to the
+TOP500 list~\cite{43}, the Tianhe-2 was the fastest supercomputer.  It has more
+than 3 million of cores and delivers more than \np[Tflop/s]{33} while consuming
+\np[kW]{17808}).  This large increase in number of computing cores has led to
+large energy consumption by these architectures.  Moreover, the price of energy
+is expected to continue its ascent according to the demand.  For all these
+reasons energy reduction has become an important topic in the high performance
+computing field.  To tackle this problem, many researchers use DVFS (Dynamic
+Voltage Frequency Scaling) operations which reduce dynamically the frequency and
+voltage of cores and thus their energy consumption.  Indeed, modern CPUs offer a
+set of acceptable frequencies which are usually called gears, and the user or
+the operating system can modify the frequency of the processor according to its
+needs.  However, DVFS also degrades the performance of computation.  Therefore
+researchers try to reduce the frequency to the minimum when processors are idle
+(waiting for data from other processors or communicating with other processors).
+Moreover, depending on their objectives, they use heuristics to find the best
+scaling factor during the computation.  If they aim for performance they choose
+the best scaling factor that reduces the consumed energy while affecting as
+little as possible the performance.  On the other hand, if they aim for energy
+reduction, the chosen scaling factor must produce the most energy efficient
+execution without considering the degradation of the performance.  It is
+important to notice that lowering the frequency to the minimum value does not always
+give the most energy efficient execution due to energy leakage.  The best
+scaling factor might be chosen during execution (online) or during a
+pre-execution phase.  In this paper, we present an algorithm that selects a
+frequency scaling factor that simultaneously takes into consideration the energy
+consumption by the CPU and the performance of the application.  The main
+objective of HPC systems is to execute as fast as possible the application.
+Therefore, our algorithm selects the scaling factor online with very small
+overhead.  The proposed algorithm takes into account the communication times of
+the MPI program to choose the scaling factor.  This algorithm has the ability to
+predict both energy consumption and execution time over all available scaling
+factors.  The prediction achieved depends on some computing time information,
+gathered at the beginning of the runtime.  We apply this algorithm to the NAS parallel benchmarks (NPB v3.3)~\cite{44}.  Our experiments are executed using the simulator
+SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183} over a homogeneous
+distributed memory architecture.  Furthermore, we compare the proposed algorithm
+with Rauber and Rünger methods~\cite{3}.  The comparison's results show that our
+algorithm gives better energy-time trade-off.
+
+This paper is organized as follows: Section~\ref{sec.relwork} presents some
+related works from other authors.  Section~\ref{sec.exe}  presents an energy
+model for homogeneous platforms.  Section~\ref{sec.mpip} describes how the
+performance of MPI programs can be predicted.  Section~\ref{sec.compet} presents
+the energy-performance objective function that maximizes the reduction of energy
+consumption while minimizing the degradation of the program's performance.
+Section~\ref{sec.optim} details the proposed energy-performance algorithm.
+Section~\ref{sec.expe} verifies the accuracy of the performance prediction model
+and presents the results of the proposed algorithm.  It also shows the
+comparison results between our method and other existing methods.  Finally, we
+conclude in Section~\ref{sec.concl} with a summary and some future works.
+
+\section{Related works}
+\label{sec.relwork}
+
+
+In this section, some heuristics to compute the scaling factor are presented and
+classified into two categories: offline and online methods.
+
+\subsection{Offline scaling factor selection methods}
+
+The offline scaling factor selection methods are executed before the runtime of
+the program.  They return static scaling factor values to the processors
+participating in the execution of the parallel program.  On the one hand, the
+scaling factor values could be computed based on information retrieved by
+analyzing the code of the program and the computing system that will execute it.
+In~\cite{40}, Azevedo et al. detect during compilation the dependency points
+between tasks in a multi-task program.  This information is then used to lower
+the frequency of some processors in order to eliminate slack times.  A slack
+time is the period of time during which a processor that has already finished
+its computation, has to wait for a set of processors to finish their
+computations and send their results to the waiting processor in order to
+continue its task that is dependent on the results of computations being
+executed on other processors.  Freeh et al. showed in~\cite{17} that the
+communication times of MPI programs do not change when the frequency is scaled
+down.  On the other hand, some offline scaling factor selection methods use the
+information gathered from previous full or partial executions of the program. The whole program or, a 
+part of it,  is usually executed over all the available frequency
+gears and the execution time and the energy consumed with each frequency
+gear are measured.  Then a heuristic or an exact method uses the retrieved
+information to compute the values of the scaling factor for the processors.
+In~\cite{8} , Rountree et al. use a linear programming algorithm, while in~\cite{34}, Cochran et
+al. use a multi-logistic regression algorithm for the same goal.  The main
+drawback of these methods is that they all require executing the
+whole program or, a part of it, on all frequency gears for each new instance of the same program.
+
+\subsection{Online scaling factor selection methods}
+
+The online scaling factor selection methods are executed during the runtime of
+the program.  They are usually integrated into iterative programs where the same
+block of instructions is executed many times.  During the first few iterations,
+a lot of information is measured such as the execution time, the energy consumed
+using a multimeter, the slack times, \dots{} Then a method will exploit these
+measurements to compute the scaling factor values for each processor.  This
+operation, measurements and computing new scaling factor, can be repeated as
+much as needed if the iterations are not regular. Peraza, Yu-Liang et
+al.~\cite{2,31} used varied heuristics to select the appropriate scaling
+factor values to eliminate the slack times during runtime.  However, as seen
+in~\cite{19}, machine learning method takes a lot of time to converge
+when the number of available gears is big.  To reduce the impact of slack times,
+in~\cite{1}, Lim et al. developed an algorithm that detects the communication
+sections and changes the frequency during these sections only.  This approach
+might change the frequency of each processor many times per iteration if an
+iteration contains more than one communication section.  In~\cite{3}, Rauber and
+Rünger used an analytical model that can predict the consumed energy for every frequency gear after measuring the consumed energy. They
+maintain the performance as mush as possible by setting the highest frequency gear to the slowest task. 
+
+The primary contribution of 
+our paper is to present a new online scaling factor selection method which has the
+ following characteristics:\\
+1) It is based on Rauber and Rünger analytical model to predict the energy
+  consumption of the application with different frequency gears. 
+2) It selects the frequency scaling factor for simultaneously optimizing
+  energy reduction and maintaining performance. 
+3) It is well adapted to distributed architectures because it takes into
+  account the communication time. 
+4) It is well adapted to distributed applications with imbalanced tasks.
+5) It has a very small overhead when compared to other methods
+  (e.g.,~\cite{19}) and does not require profiling or training as
+  in~\cite{34}.
+
+
+% \JC{The whole subsection ``Parallel Tasks Execution on Homogeneous Platform'',
+%   can be deleted if we need space, we can just say we are interested in this
+%   paper in homogeneous clusters}
+
+
+\section{Energy model for a homogeneous platform}
+\label{sec.exe}
+Many researchers~\cite{9,3,15,26} divide the power consumed by a processor into
+two power metrics: the static and the dynamic power.  While the first one is
+consumed as long as the computing unit is on, the latter is only consumed during
+computation times.  The dynamic power $\Pdyn$ is related to the switching
+activity $\alpha$, load capacitance $C_L$, the supply voltage $V$ and
+operational frequency $f$, as shown in EQ~\eqref{eq:pd}.
  \begin{equation}
    \label{eq:pd}
-  P_{dyn} = \alpha \cdot C_L \cdot V^2 \cdot f
+  \Pdyn = \alpha \cdot C_L \cdot V^2 \cdot f
  \end{equation}
-The static power $P_{static}$ captures the leakage power consumption as well as
-the power consumption of peripheral devices like the I/O subsystem.
+The static power $\Pstatic$ captures the leakage power as follows:
  \begin{equation}
    \label{eq:ps}
-  P_{static}  = V \cdot N \cdot K_{design} \cdot I_{leak}
+   \Pstatic  = V \cdot \Ntrans \cdot \Kdesign \cdot \Ileak
  \end{equation}
-where V is the supply voltage, N is the number of transistors, $K_{design}$ is a
-design dependent parameter and $I_{leak}$ is a technology-dependent
-parameter. Energy consumed by an individual processor $E_{ind}$ is the summation
-of the dynamic and the static power multiply by the execution time for example
-see~\cite{36,15} .
+where V is the supply voltage, $\Ntrans$ is the number of transistors,
+$\Kdesign$ is a design dependent parameter and $\Ileak$ is a
+technology-dependent parameter.  The energy consumed by an individual processor
+to execute a given program can be computed as:
  \begin{equation}
    \label{eq:eind}
-  E_{ind} = ( P_{dyn} + P_{static} ) \cdot T
+   \Eind =  \Pdyn \cdot \Tcomp + \Pstatic \cdot T
  \end{equation}
-The dynamic voltage and frequency scaling (DVFS) is a process that allowed in
-modern processors to reduce the dynamic power by scaling down the voltage and
-frequency. Its main objective is to reduce the overall energy
-consumption~\cite{37}. The operational frequency \emph f depends linearly on the
-supply voltage $V$, i.e., $V = \beta . f$ with some constant $\beta$. This
-equation is used to study the change of the dynamic voltage with respect to
-various frequency values in~\cite{3}. The reduction process of the frequency are
-expressed by scaling factor \emph S. The scale \emph S is the ratio between the
-maximum and the new frequency as in EQ~(\ref{eq:s}).
+where $T$ is the execution time of the program, $\Tcomp$ is the computation
+time and $\Tcomp \leq T$. $\Tcomp$ may be equal to $T$ if there is no
+communication, no slack time and no synchronization.
+
+DVFS is a process that is allowed in modern processors to reduce the dynamic
+power by scaling down the voltage and frequency.  Its main objective is to
+reduce the overall energy consumption~\cite{37}.  The operational frequency $f$
+depends linearly on the supply voltage $V$, i.e., $V = \beta \cdot f$ with some
+constant $\beta$.  This equation is used to study the change of the dynamic
+voltage with respect to various frequency values in~\cite{3}.  The reduction
+process of the frequency can be expressed by the scaling factor $S$ which is the
+ratio between the maximum and the new frequency as in EQ~\eqref{eq:s}.
  \begin{equation}
    \label{eq:s}
-  S = \frac{F_{max}}{F_{new}}
+  S = \frac{\Fmax}{\Fnew}
  \end{equation}
-The value of the scale \emph S is grater than 1 when changing the frequency to
-any new frequency value(\emph {P-state}) in governor. It is equal to 1 when the
-frequency are set to the maximum frequency.  The energy consumption model for
-parallel homogeneous platform is depending on the scaling factor \emph S. This
-factor reduces quadratically the dynamic power.  Also, this factor increases the
-static energy linearly because the execution time is increased~\cite{36}.  The
-energy model, depending on the frequency scaling factor, of homogeneous platform
-for any number of concurrent tasks develops by Rauber~\cite{3}. This model
-consider the two powers metric for measuring the energy of the parallel tasks as
-in EQ~(\ref{eq:energy}).
+The value of the scaling factor $S$ is greater than 1 when changing the
+frequency of the CPU to any new frequency value~(\emph{P-state}) in the
+governor. This factor reduces quadratically
+the dynamic power which may cause degradation in performance and thus, the
+increase of the static energy because the execution time is increased~\cite{36}.
+If the tasks are sorted according to their execution times before scaling in a
+descending order, the total energy consumption model for a parallel homogeneous
+platform, as presented by Rauber and Rünger~\cite{3}, can be written as a
+function of the scaling factor $S$, as in EQ~\eqref{eq:energy}.
  
  \begin{equation}
    \label{eq:energy}
-  E = P_{dyn} \cdot S_1^{-2} \cdot
+  E = \Pdyn \cdot S_1^{-2} \cdot
      \left( T_1 + \sum_{i=2}^{N} \frac{T_i^3}{T_1^2} \right) +
-    P_{static} \cdot T_1 \cdot S_1 \cdot N
- \hfill
-\end{equation}
-Where \emph N is the number of parallel nodes, $T_1 $ is the time of the slower
-task, $T_i$ is the time of the task $i$ and $S_1$ is the maximum scaling factor
-for the slower task. The scaling factor $S_1$, as in EQ~(\ref{eq:s1}), selects
-from the set of scales values $S_i$. Each of these scales are proportional to
-the time value $T_i$ depends on the new frequency value as in EQ~(\ref{eq:si}).
-\begin{equation}
-  \label{eq:s1}
-  S_1 = \max_{i=1,2,\dots,F} S_i
+      \Pstatic \cdot T_1 \cdot S_1 \cdot N
  \end{equation}
+where $N$ is the number of parallel nodes, $T_i$ for $i=1,\dots,N$ are
+the execution times of the sorted tasks.  Therefore, $T_1$ is
+the time of the slowest task, and $S_1$ its scaling factor which should be the
+highest because they are proportional to the time values $T_i$.  The scaling
+factors $S_i$ are computed as in EQ~\eqref{eq:si}.
  \begin{equation}
    \label{eq:si}
    S_i = S \cdot \frac{T_1}{T_i}
-      = \frac{F_{max}}{F_{new}} \cdot \frac{T_1}{T_i}
+      = \frac{\Fmax}{\Fnew} \cdot \frac{T_1}{T_i}
  \end{equation}
-Where $F$ is the number of available frequencies. In this paper we depend on
-Rauber's energy model EQ~(\ref{eq:energy}) for two reasons : 1-this model used
-for homogeneous platform that we work on in this paper. 2-we are compare our
-algorithm with Rauber's scaling model.  Rauber's optimal scaling factor for
-optimal energy reduction derived from the EQ~(\ref{eq:energy}). He takes the
-derivation for this equation (to be minimized) and set it to zero to produce the
-scaling factor as in EQ~(\ref{eq:sopt}).
+In this paper we use Rauber and Rünger's energy model, EQ~\eqref{eq:energy}, because it can be applied to homogeneous clusters if the communication time is taken in consideration. Moreover, we compare our algorithm with Rauber and Rünger's scaling factor selection
+method which uses the same energy model.  In their method, the optimal scaling factor is
+computed by minimizing the derivation of EQ~\eqref{eq:energy} which produces
+EQ~\eqref{eq:sopt}.
+
  \begin{equation}
    \label{eq:sopt}
-  S_{opt} = \sqrt[3]{\frac{2}{n} \cdot \frac{P_{dyn}}{P_{static}} \cdot
+  \Sopt = \sqrt[3]{\frac{2}{N} \cdot \frac{\Pdyn}{\Pstatic} \cdot
      \left( 1 + \sum_{i=2}^{N} \frac{T_i^3}{T_1^3} \right) }
  \end{equation}
  
-\section{Performance Evaluation of MPI Programs}
-
-The performance (execution time) of the parallel MPI applications are depends on
-the time of the slowest task as in figure~(\ref{fig:homo}). Normally the
-execution time of the parallel programs are proportional to the operational
-frequency. Therefore, any DVFS operation for the energy reduction increase the
-execution time of the parallel program. As shown in EQ~(\ref{eq:energy}) the
-energy affected by the scaling factor $S$. This factor also has a great impact
-on the performance. When scaling down the frequency to the new value according
-to EQ~(\ref{eq:s}) lead to the value of the scale $S$ has inverse relation with
-new frequency value ($S \propto \frac{1}{F_{new}}$). Also when decrease the
-frequency value, the execution time increase. Then the new frequency value has
-inverse relation with time ($F_{new} \propto \frac{1}{T}$). This lead to the
-frequency scaling factor $S$ proportional linearly with execution time ($S
-\propto T$). Large scale MPI applications such as NAS benchmarks have
-considerable amount of communications embedded in these programs. During the
-communication process the processor remain idle until the communication has
-finished. For that reason any change in the frequency has no impact on the time
-of communication but it has obvious impact on the time of
-computation~\cite{17}. We are made many tests on real cluster to prove that the
-frequency scaling factor \emph S has a linear relation with computation time
-only also see~\cite{41}. To predict the execution time of MPI program, firstly
-must be precisely specifying communication time and the computation time for the
-slower task. Secondly, we use these times for predicting the execution time for
-any MPI program as a function of the new scaling factor as in the
-EQ~(\ref{eq:tnew}).
+
+\section{Performance evaluation of MPI programs}
+\label{sec.mpip}
+
+The execution time of a parallel synchronous iterative application is
+equal to the execution time of the slowest task.  If there is no
+communication and the application is not data bounded, the execution time of a
+parallel program is linearly proportional to the operational frequency and any
+DVFS operation for energy reduction increases the execution time of the parallel
+program.  Therefore, the scaling factor $S$ is linearly proportional to the
+execution time.  However, in most MPI applications the processes exchange
+data.  During these communications the processors involved remain idle until the
+communications are finished.  For that reason, any change in the frequency has no
+impact on the time of communication~\cite{17}.  The communication time for a
+task is the summation of periods of time that begin with an MPI call for sending
+or receiving a message till the message is synchronously sent or received.  To
+be able to predict the execution time of MPI program, the communication time and
+the computation time for the slowest task must be measured before scaling.  These
+times are used to predict the execution time for any MPI program as a function
+of the new scaling factor as in EQ~\eqref{eq:tnew}.
  \begin{equation}
    \label{eq:tnew}
-  T_{new} = T_{\textit{Max Comp Old}} \cdot S + T_{\textit{Max Comm Old}}
+  \Tnew = \TmaxCompOld \cdot S + \TmaxCommOld
  \end{equation}
-The above equation shows that the scaling factor \emph S has linear relation
-with the computation time without affecting the communication time. The
-communication time consists of the beginning times which an MPI calls for
-sending or receiving till the message is synchronously sent or received. In this
-paper we predict the execution time of the program for any new scaling factor
-value. Depending on this prediction we can produce our energy-performance scaling
-method as we will show in the coming sections. In the next section we make an
-investigation study for the EQ~(\ref{eq:tnew}).
-
-\section{Performance Prediction Verification}
-
-In this section we evaluate the precision of our performance prediction methods
-on the NAS benchmark. We use the EQ~(\ref{eq:tnew}) that predicts the execution
-time for any scale value. The NAS programs run the class B for comparing the
-real execution time with the predicted execution time. Each program runs offline
-with all available scaling factors on 8 or 9 nodes to produce real execution
-time values. These scaling factors are computed by dividing the maximum
-frequency by the new one see EQ~(\ref{eq:s}). In all tests, we use the simulator
-SimGrid/SMPI v3.10 to run the NAS programs.
-\AG{Fig.~\ref{fig:pred} is hard to read when printed in black and white,
-  especially the ``Normalize Real Perf.'' curve.}
-\begin{figure}[width=\textwidth,height=\textheight,keepaspectratio]
-  \centering
-  \includegraphics[scale=0.60]{cg_per.eps}
-  \includegraphics[scale=0.60]{mg_pre.eps}
-  \includegraphics[scale=0.60]{bt_pre.eps}
-  \includegraphics[scale=0.60]{lu_pre.eps}
-  \caption{Fitting Predicted to Real Execution Time}
-  \label{fig:pred}
-\end{figure}
-%see Figure~\ref{fig:pred}
-In our cluster there are 18 available frequency states for each processor from
-2.5 GHz to 800 MHz, there is 100 MHz difference between two successive
-frequencies. For more details on the characteristics of the platform refer to
-table~(\ref{table:platform}). This lead to 18 run states for each program. We
-use seven MPI programs of the NAS parallel benchmarks : CG, MG, EP, FT, BT, LU
-and SP. The average normalized errors between the predicted execution time and
-the real time (SimGrid time) for all programs is between 0.0032 to 0.0133. AS an
-example, we are present the execution times of the NAS benchmarks as in the
-figure~(\ref{fig:pred}).
-
-\section{Performance to Energy Competition}
-This section demonstrates our approach for choosing the optimal scaling
-factor. This factor gives maximum energy reduction taking into account the
-execution time for both computation and communication times . The relation
-between the energy and the performance are nonlinear and complex, because the
-relation of the energy with scaling factor is nonlinear and with the performance
-it is linear see~\cite{17}. The relation between the energy and the performance
-is not straightforward. Moreover, they are not measured using the same metric.
-For solving this problem, we normalize the energy by calculating the ratio
-between the consumed energy with scaled frequency and the consumed energy
-without scaled frequency :
-\begin{equation}
+In this paper, this prediction method is used to select the best scaling factor
+for each processor as presented in the next section.
+
+\section{Performance and energy reduction trade-off}
+\label{sec.compet}
+
+This section presents our method for choosing the optimal scaling factor that
+gives the best tradeoff between energy reduction and performance. This method
+takes into account the execution times for both computation and communication to
+compute the scaling factor.  Since the energy consumption and the performance
+are not measured using the same metric, a normalized value of both measurements
+can be used to compare them.  The normalized energy is the ratio between the
+consumed energy with scaled frequency and the consumed energy without scaled
+frequency:
+\begin{multline}
    \label{eq:enorm}
-  E_{Norm} = \frac{E_{Reduced}}{E_{Original}}
-          = \frac{ P_{dyn} \cdot S_i^{-2} \cdot
-               \left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
-               P_{static} \cdot T_1 \cdot S_i \cdot N  }{
-              P_{dyn} \cdot \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
-              P_{static} \cdot T_1 \cdot N }
-\end{equation}
-\AG{Use \texttt{\textbackslash{}text\{xxx\}} or
-  \texttt{\textbackslash{}textit\{xxx\}} for all subscripted words in equations
-  (e.g. \mbox{\texttt{E\_\{\textbackslash{}text\{Norm\}\}}}).
-
-  Don't hesitate to define new commands :
-  \mbox{\texttt{\textbackslash{}newcommand\{\textbackslash{}ENorm\}\{E\_\{\textbackslash{}text\{Norm\}\}\}}}
-}
-By the same way we can normalize the performance as follows :
+  \Enorm = \frac{ \Ereduced}{\Eoriginal} \\
+      {} = \frac{\Pdyn \cdot S_1^{-2} \cdot
+             \left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
+               \Pstatic \cdot T_1 \cdot S_1 \cdot N}{
+             \Pdyn \cdot \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
+               \Pstatic \cdot T_1 \cdot N }
+\end{multline}
+In the same way, the normalized execution time of a program is computed as follows:
  \begin{equation}
    \label{eq:pnorm}
-  P_{Norm} = \frac{T_{New}}{T_{Old}}
-          = \frac{T_{\textit{Max Comp Old}} \cdot S +
-              T_{\textit{Max Comm Old}}}{T_{Old}}
+  \Tnorm = \frac{\Tnew}{\Told}
+         = \frac{\TmaxCompOld \cdot S + \TmaxCommOld}{
+             \TmaxCompOld + \TmaxCommOld}
  \end{equation}
-The second problem is the optimization operation for both energy and performance
-is not in the same direction. In other words, the normalized energy and the
-performance curves are not in the same direction see figure~(\ref{fig:r2}).
-While the main goal is to optimize the energy and performance in the same
-time. According to the equations~(\ref{eq:enorm}) and~(\ref{eq:pnorm}) the
-scaling factor \emph S reduce both the energy and the performance
-simultaneously. But the main objective is to produce maximum energy reduction
-with minimum performance reduction. Many researchers used different strategies
-to solve this nonlinear problem for example see~\cite{19,42}, their methods add
-big overhead to the algorithm for selecting the suitable frequency. In this
-paper we are present a method to find the optimal scaling factor \emph S for
-optimize both energy and performance simultaneously without adding big
-overheads.  Our solution for this problem is to make the optimization process
-have the same direction. Therefore, we inverse the equation of normalize
-performance as follows :
+The relation between the execution time and the consumed energy of a program is nonlinear and complex. In consequences,  the relation between the consumed energy and the scaling factor is also nonlinear, for more details refer to~\cite{17}. Therefore, the resulting normalized energy consumption curve and execution time curve, for different scaling factors, do not have the same direction see Figure~\ref{fig:rel}\subref{fig:r2}.  To tackle this problem and optimize both terms,  we inverse the equation of the normalized execution time as follows:
  \begin{equation}
    \label{eq:pnorm_en}
-  P^{-1}_{Norm} = \frac{T_{Old}}{T_{New}}
-               = \frac{T_{Old}}{T_{\textit{Max Comp Old}} \cdot S +
-                 T_{\textit{Max Comm Old}}}
+  \Pnorm = \frac{ \Told}{ \Tnew}
+               = \frac{\TmaxCompOld +
+                 \TmaxCommOld}{\TmaxCompOld \cdot S +
+                 \TmaxCommOld}
  \end{equation}
  \begin{figure}
    \centering
-  \subfloat[Converted Relation.]{\includegraphics[scale=0.70]{file.eps}\label{fig:r1}}
-  \subfloat[Real Relation.]{\includegraphics[scale=0.70]{file3.eps}\label{fig:r2}}
+  \subfloat[Real relation.]{%
+    \includegraphics[width=.5\linewidth]{fig/file3}\label{fig:r2}}%
+  \subfloat[Converted relation.]{%
+    \includegraphics[width=.5\linewidth]{fig/file}\label{fig:r1}}
+  \caption{The energy and performance relation}
    \label{fig:rel}
-  \caption{The Energy and Performance Relation}
  \end{figure}
-Then, we can modelize our objective function as finding the maximum distance
-between the energy curve EQ~(\ref{eq:enorm}) and the inverse of performance
-curve EQ~(\ref{eq:pnorm_en}) over all available scaling factors. This represent
-the minimum energy consumption with minimum execution time (better performance)
-in the same time, see figure~(\ref{fig:r1}). Then our objective function has the
-following form:
+Then, we can model our objective function as finding the maximum distance
+between the energy curve EQ~\eqref{eq:enorm} and the inverse of the execution time (performance)
+curve EQ~\eqref{eq:pnorm_en} over all available scaling factors.  This
+represents the minimum energy consumption with minimum execution time (better
+performance) at the same time, see Figure~\ref{fig:rel}\subref{fig:r1}.  Then
+our objective function has the following form:
  \begin{equation}
    \label{eq:max}
-  \textit{MaxDist} = \max (\overbrace{P^{-1}_{Norm}}^{\text{Maximize}} -
-                           \overbrace{E_{Norm}}^{\text{Minimize}} )
+  \MaxDist = \max_{j=1,2,\dots,F}
+      (\overbrace{\Pnorm(S_j)}^{\text{Maximize}} -
+       \overbrace{\Enorm(S_j)}^{\text{Minimize}} )
  \end{equation}
-Then we can select the optimal scaling factor that satisfy the
-EQ~(\ref{eq:max}).  Our objective function can works with any energy model or
-static power values stored in a data file. Moreover, this function works in
-optimal way when the energy function has a convex form with frequency scaling
-factor as shown in~\cite{15,3,19}. Energy measurement model is not the
-objective of this paper and we choose Rauber's model as an example with two
-reasons that mentioned before.
-
-\section{Optimal Scaling Factor for Performance and Energy}
-
-In the previous section we described the objective function that satisfy our
-goal in discovering optimal scaling factor for both performance and energy at
-the same time. Therefore, we develop an energy to performance scaling algorithm
-(EPSA). This algorithm is simple and has a direct way to calculate the optimal
-scaling factor for both energy and performance at the same time.
-\begin{algorithm}[t]
-  \caption{EPSA}
+where $F$ is the number of available frequencies. Then we can select the optimal
+scaling factor that satisfies EQ~\eqref{eq:max}.  Our objective function can
+work with any energy model or static power values stored in a data file.
+Moreover, this function works in optimal way when the energy curve has a convex
+form over the available frequency scaling factors as shown in~\cite{15,3,19}.
+
+\section{Optimal scaling factor for performance and energy}
+\label{sec.optim}
+
+Algorithm on Figure~\ref{EPSA} computes the optimal scaling factor according to
+the objective function described above.
+\begin{figure}[tp]
+  \begin{algorithmic}[1]
+    % \footnotesize
+    \Require ~
+    \begin{description}
+    \item[$\Pstatic$] static power value
+    \item[$\Pdyn$] dynamic power value
+    \item[$\Pstates$] number of available frequencies
+    \item[$\Fmax$] maximum frequency
+    \item[$\Fdiff$] difference between two successive freq.
+    \end{description}
+    \Ensure $\Sopt$ is the optimal scaling factor
+
+    \State $\Sopt \gets 1$
+    \State $\Dist \gets 0$
+    \State $\Fnew \gets \Fmax$
+    \For {$j = 2$ to $\Pstates$}
+      \State $\Fnew \gets \Fnew - \Fdiff$
+      \State $S \gets \Fmax / \Fnew$
+      \State $S_i \gets S \cdot \frac{T_1}{T_i}
+                  = \frac{\Fmax}{\Fnew} \cdot \frac{T_1}{T_i}$
+             for $i=1,\dots,N$
+      \State $\Enorm \gets
+          \frac{\Pdyn \cdot S_1^{-2} \cdot
+                  \left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
+                  \Pstatic \cdot T_1 \cdot S_1 \cdot N }{
+                \Pdyn \cdot
+                  \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
+                  \Pstatic \cdot T_1 \cdot N }$
+      \State $\Pnorm \gets \Told / \Tnew$
+      \If{$(\Pnorm - \Enorm > \Dist)$}
+        \State $\Sopt \gets S$
+        \State $\Dist \gets \Pnorm - \Enorm$
+      \EndIf
+    \EndFor
+    \State  Return $\Sopt$
+  \end{algorithmic}
+  \caption{Scaling factor selection algorithm}
    \label{EPSA}
+\end{figure}
+
+The proposed algorithm works online during the execution time of the MPI
+program.  It selects the optimal scaling factor after gathering the computation
+and communication times from the program after one iteration.  Then the program
+changes the new frequencies of the CPUs according to the computed scaling
+factors. In our experiments over a homogeneous cluster described in
+Section~\ref{sec.expe}, this algorithm has a small execution time. It takes
+\np[$\mu$s]{1.52} on average for 4 nodes and \np[$\mu$s]{6.65} on average for 32
+nodes.  The algorithm complexity is $O(F\cdot N)$, where $F$ is the number of
+available frequencies and $N$ is the number of computing nodes.  The algorithm
+is called just once during the execution of the program.  The DVFS algorithm on
+Figure~\ref{dvfs} shows where and when the algorithm is called in the MPI
+program.
+%\begin{table}[htb]
+%  \caption{Platform file parameters}
+%  % title of Table
+%  \centering
+%  \begin{tabular}{|*{7}{l|}}
+%    \hline
+%    Max      & Min      & Backbone        & Backbone         & Link         & Link            & Sharing \\
+%    Freq.    & Freq.    & Bandwidth       & Latency          & Bandwidth    & Latency         & Policy  \\
+%    \hline
+%    \np{2.5} & \np{800} & \np[GBps]{2.25} & \np[$\mu$s]{0.5} & \np[GBps]{1} & \np[$\mu$s]{50} & Full    \\
+%    GHz      & MHz      &                 &                  &              &                 & Duplex  \\
+%    \hline
+%  \end{tabular}
+%  \label{table:platform}
+%\end{table}
+
+\begin{figure}[tp]
    \begin{algorithmic}[1]
-    \State  Initialize the variable $Dist=0$
-    \State Set dynamic and static power values.
-    \State Set $P_{states}$ to the number of available frequencies.
-    \State Set the variable $F_{new}$ to max. frequency,  $F_{new} = F_{max} $
-    \State Set the variable $F_{diff}$ to the scale value between each two frequencies.
-    \For {$i=1$   to   $P_{states} $}
-      \State - Calculate the new frequency as $F_{new}=F_{new} - F_{diff} $
-      \State - Calculate the scale factor $S$ as in EQ~(\ref{eq:s}).
-      \State - Calculate all available scales $S_i$  depend on $S$ as in EQ~(\ref{eq:si}).
-      \State - Select the maximum scale factor $S_1$ from the set of scales $S_i$.
-      \State - Calculate the normalize energy $E_{Norm}=E_{R}/E_{O}$ as in EQ~(\ref{eq:enorm}).
-      \State - Calculate the normalize inverse of performance\par
-               $P_{NormInv}=T_{old}/T_{new}$ as in EQ~(\ref{eq:pnorm_en}).
-      \If{  $(P_{NormInv}-E_{Norm} > Dist$) }
-        \State $S_{optimal} = S$
-        \State $Dist = P_{NormInv} - E_{Norm}$
+    % \footnotesize
+    \For {$k=1$ to \textit{some iterations}}
+      \State Computations section.
+      \State Communications section.
+      \If {$(k=1)$}
+        \State Gather all times of computation and\newline\hspace*{3em}%
+               communication from each node.
+        \State Call algorithm from Figure~\ref{EPSA} with these times.
+        \State Compute the new frequency from the\newline\hspace*{3em}%
+              returned optimal scaling factor.
+        \State Set the new frequency to the CPU.
        \EndIf
      \EndFor
-    \State  Return $S_{optimal}$
    \end{algorithmic}
-\end{algorithm}
-The proposed EPSA algorithm works online during the execution time of the MPI
-program. It selects the optimal scaling factor by gathering some information
-from the program after one iteration. This algorithm has small execution time
-(between 0.00152 $ms$ for 4 nodes to 0.00665 $ms$ for 32 nodes). The data
-required by this algorithm is the computation time and the communication time
-for each task from the first iteration only. When these times are measured, the
-MPI program calls the EPSA algorithm to choose the new frequency using the
-optimal scaling factor. Then the program set the new frequency to the
-system. The algorithm is called just one time during the execution of the
-program. The following example shows where and when the EPSA algorithm is called
-in the MPI program :
-\begin{minipage}{\textwidth}
-\AG{Use the same format as for Algorithm~\ref{EPSA}}
-\begin{lstlisting}[frame=tb]
-FOR J:=1 to Some_iterations Do
-   -Computations Section.
-   -Communications Section.
-   IF (J==1) THEN
-     -Gather all times of computation and communication
-      from each node.
-     -Call EPSA with these times.
-     -Calculate the new frequency from optimal scale.
-     -Set the new frequency to the system.
-   ENDIF
-ENDFOR
-\end{lstlisting}
-\end{minipage}
-After obtaining the optimal scale factor from the EPSA algorithm. The program
-calculates the new frequency $F_i$ for each task proportionally to its time
-value $T_i$. By substitution of the EQ~(\ref{eq:s}) in the EQ~(\ref{eq:si}), we
-can calculate the new frequency $F_i$ as follows :
+  \caption{DVFS algorithm}
+  \label{dvfs}
+\end{figure}
+After obtaining the optimal scaling factor, the program calculates the new
+frequency $F_i$ for each task proportionally to its time value $T_i$.  By
+substitution of EQ~\eqref{eq:s} in EQ~\eqref{eq:si}, we can calculate the new
+frequency $F_i$ as follows:
  \begin{equation}
    \label{eq:fi}
-  F_i = \frac{F_{max} \cdot T_i}{S_{optimal} \cdot T_{max}}
+  F_i = \frac{\Fmax \cdot T_i}{\Sopt \cdot \Tmax}
  \end{equation}
  According to this equation all the nodes may have the same frequency value if
-they have balanced workloads. Otherwise, they take different frequencies when
-have imbalanced workloads. Then EQ~(\ref{eq:fi}) works in adaptive way to change
-the frequency according to the nodes workloads.
-
-\section{Experimental Results}
-
-The proposed EPSA algorithm was applied to seven MPI programs of the NAS
-benchmarks (EP, CG, MG, FT, BT, LU and SP). We work on three classes (A, B and
-C) for each program. Each program runs on specific number of processors
+they have balanced workloads, otherwise, they take different frequencies when
+having imbalanced workloads.  Thus, EQ~\eqref{eq:fi} adapts the frequency of the
+CPU to the nodes' workloads to maintain the performance of the program.
+
+\section{Experimental results}
+\label{sec.expe}
+Our experiments are executed on the simulator SimGrid/SMPI v3.10.  We configure
+the simulator to use a homogeneous cluster with one core per node.  
+%The detailed characteristics of our platform file are shown in Table~\ref{table:platform}.
+Each node in the cluster has 18 frequency values
+from \np[GHz]{2.5} to \np[MHz]{800} with \np[MHz]{100} difference between each
+two successive frequencies. The nodes are connected via an ethernet network with 1Gbit/s  bandwidth. 
+
+\subsection{Execution time prediction verification}
+
+In this section we evaluate the precision of our execution time prediction method
+based on EQ~\eqref{eq:tnew} by applying it to the NAS benchmarks.  The NAS programs
+are executed with the class B option to compare the real execution time with
+the predicted execution time.  Each program runs offline with all available
+scaling factors on 8 or 9 nodes (depending on the benchmark) to produce real
+execution time values.  These scaling factors are computed by dividing the
+maximum frequency by the new one see EQ~\eqref{eq:s}.
+\begin{figure}
+  \centering
+  \includegraphics[width=.5\linewidth]{fig/cg_per}\hfill%
+ % \includegraphics[width=.5\linewidth]{fig/mg_pre}\hfill%
+ % \includegraphics[width=.5\linewidth]{fig/bt_pre}\qquad%
+   \includegraphics[width=.5\linewidth]{fig/lu_pre}\hfill%
+  \caption{Comparing predicted to real execution times}
+  \label{fig:pred}
+\end{figure}
+%see Figure~\ref{fig:pred}
+In our cluster there are 18 available frequency states for each processor.  This
+leads to 18 run states for each program.  We use seven MPI programs of the NAS
+parallel benchmarks: CG, MG, EP, FT, BT, LU and SP.  Figure~\ref{fig:pred}
+presents plots of the real execution times and the simulated ones. The maximum
+normalized error between these two execution times varies between \np{0.0073} to
+\np{0.031} dependent on the executed benchmark.  The smallest prediction error
+was for CG and the worst one was for LU.
+
+\subsection{The experimental results for the scaling algorithm }
+The proposed algorithm was applied to seven MPI programs of the NAS benchmarks
+(EP, CG, MG, FT, BT, LU and SP) which were run with three classes (A, B and C).
+For each instance the benchmarks were executed on a number of processors
  proportional to the size of the class.  Each class represents the problem size
-ascending from the class A to C. Additionally, depending on some speed up points
+ascending from class A to C.  Additionally, depending on some speed up points
  for each class we run the classes A, B and C on 4, 8 or 9 and 16 nodes
-respectively. Our experiments are executed on the simulator SimGrid/SMPI
-v3.10. We design a platform file that simulates a cluster with one core per
-node. This cluster is a homogeneous architecture with distributed memory. The
-detailed characteristics of our platform file are shown in the
-table~(\ref{table:platform}). Each node in the cluster has 18 frequency values
-from 2.5 GHz to 800 MHz with 100 MHz difference between each two successive
-frequencies.
-\begin{table}[ht]
-  \caption{Platform File Parameters}
-  % title of Table
-  \centering
-  \AG{Use e.g. $5\times 10^{-7}$ instead of 5E-7}
-  \begin{tabular}{ | l | l | l |l | l |l |l |  p{2cm} |}
-    \hline
-    Max & Min & Backbone & Backbone&Link &Link& Sharing  \\
-    Freq. & Freq. & Bandwidth & Latency & Bandwidth& Latency&Policy  \\ \hline
-    2.5 &800 & 2.25 GBps &5E-7 s & 1 GBps & 5E-5 s&Full  \\
-    GHz& MHz&  & & &  &Duplex  \\\hline
-  \end{tabular}
-  \label{table:platform}
-\end{table}
-Depending on the EQ~(\ref{eq:energy}), we measure the energy consumption for all
-the NAS MPI programs while assuming the power dynamic is equal to 20W and the
-power static is equal to 4W for all experiments. We run the proposed EPSA
-algorithm for all these programs. The results showed that the algorithm selected
+respectively.  Depending on EQ~\eqref{eq:energy}, we measure the energy
+consumption for all the NAS MPI programs while assuming that the dynamic power
+with the highest frequency is equal to \np[W]{20} and the power static is equal
+to \np[W]{4} for all experiments.  These power values were also used by Rauber
+and Rünger in~\cite{3}.  The results showed that the algorithm selected
  different scaling factors for each program depending on the communication
-features of the program as in the figure~(\ref{fig:nas}). This figure shows that
-there are different distances between the normalized energy and the normalized
-inversed performance curves, because there are different communication features
-for each MPI program.  When there are little or not communications, the inversed
-performance curve is very close to the energy curve. Then the distance between
-the two curves is very small. This lead to small energy savings. The opposite
-happens when there are a lot of communication, the distance between the two
-curves is big.  This lead to more energy savings (e.g. CG and FT), see
-table~(\ref{table:factors results}). All discovered frequency scaling factors
-optimize both the energy and the performance simultaneously for all the NAS
-programs. In table~(\ref{table:factors results}), we record all optimal scaling
-factors results for each program on class C. These factors give the maximum
-energy saving percent and the minimum performance degradation percent in the
-same time over all available scales.
-\begin{figure}[width=\textwidth,height=\textheight,keepaspectratio]
+features of the program as in the plots from Figure~\ref{fig:nas}.  These plots
+illustrate that there are different distances between the normalized energy and
+the normalized inverted execution time curves, because there are different
+communication features for each benchmark.  When there are little or no
+communications, the inverted execution time curve is very close to the energy
+curve.  Then the distance between the two curves is very small.  This leads to
+small energy savings.  The opposite happens when there are a lot of
+communication, the distance between the two curves is big.  This leads to more
+energy savings (e.g. CG and FT), see Table~\ref{table:compareC}.  All discovered
+frequency scaling factors optimize both the energy and the execution time
+simultaneously for all NAS benchmarks.  In Table~\ref{table:compareC}, we record
+all optimal scaling factors results for each benchmark running class C.  These
+scaling factors give the maximum energy saving percentage and the minimum
+performance degradation percentage at the same time from all available scaling
+factors.
+\begin{figure*}[t]
    \centering
-  \includegraphics[scale=0.47]{ep.eps}
-  \includegraphics[scale=0.47]{cg.eps}
-  \includegraphics[scale=0.47]{sp.eps}
-  \includegraphics[scale=0.47]{lu.eps}
-  \includegraphics[scale=0.47]{bt.eps}
-  \includegraphics[scale=0.47]{ft.eps}
-  \caption{Optimal scaling factors for The NAS MPI Programs}
+  \includegraphics[width=.33\linewidth]{fig/ep}\hfill%
+  \includegraphics[width=.33\linewidth]{fig/cg}\hfill%
+ % \includegraphics[width=.328\linewidth]{fig/sp}
+ %  \includegraphics[width=.328\linewidth]{fig/lu}\hfill%
+  \includegraphics[width=.33\linewidth]{fig/bt}
+ % \includegraphics[width=.328\linewidth]{fig/ft}
+  \caption{Optimal scaling factors for the predicted energy and performance of NAS benchmarks}
    \label{fig:nas}
-\end{figure}
-\begin{table}[width=\textwidth,height=\textheight,keepaspectratio]
-  \caption{Optimal Scaling Factors Results}
-  % title of Table
-  \centering
-  \AG{Use the same number of decimals for all numbers in a column,
-    and vertically align the numbers along the decimal points.
-    The same for all the following tables.}
-  \begin{tabular}{ | l | l | l |l | l | p{2cm} |}
-    \hline
-    Program & Optimal & Energy  & Performance&Energy-Perf.\\
-    Name & Scaling Factor& Saving \%&Degradation \% &Distance  \\ \hline
-    CG & 1.56 &39.23 & 14.88 & 24.35\\ \hline
-    MG & 1.47 &34.97&21.7& 13.27   \\ \hline
-    EP & 1.04 &22.14&20.73 &1.41\\ \hline
-    LU & 1.388 &35.83&22.49 &13.34\\ \hline
-    BT & 1.315 &29.6&21.28 &8.32\\ \hline
-    SP & 1.388 &33.48 &21.36&12.12\\ \hline
-    FT & 1.47 &34.72 &19&15.72\\ \hline
-  \end{tabular}
-  \label{table:factors results}
-  % is used to refer this table in the text
-\end{table}
+\end{figure*}
  
-As shown in the table~(\ref{table:factors results}), when the optimal scaling
-factor has big value we can gain more energy savings for example as in CG and
-FT. The opposite happens when the optimal scaling factor is small value as
-example BT and EP. Our algorithm selects big scaling factor value when the
+As shown in Table~\ref{table:compareC}, when the optimal scaling factor has a
+big value we can gain more energy savings as in CG and FT benchmarks.  The
+opposite happens when the optimal scaling factor has a small value as in BT and
+EP benchmarks.  Our algorithm selects a big scaling factor value when the
  communication and the other slacks times are big and smaller ones in opposite
-cases. In EP there are no communications inside the iterations. This make our
-EPSA to selects smaller scaling factor values (inducing smaller energy savings).
-
-\section{Comparing Results}
-
-In this section, we compare our EPSA algorithm results with Rauber's
-methods~\cite{3}. He had two scenarios, the first is to reduce energy to optimal
-level without considering the performance as in EQ~(\ref{eq:sopt}). We refer to
-this scenario as $Rauber_{E}$. The second scenario is similar to the first
-except setting the slower task to the maximum frequency (when the scale $S=1$)
-to keep the performance from degradation as mush as possible. We refer to this
-scenario as $Rauber_{E-P}$. The comparison is made in tables~(\ref{table:compare
-  Class A},\ref{table:compare Class B},\ref{table:compare Class C}). These
-tables show the results of our EPSA and Rauber's two scenarios for all the NAS
-benchmarks programs for classes A,B and C.
-\begin{table}[ht]
-  \caption{Comparing Results for  The NAS Class A}
+cases.  In EP there are no communication inside the iterations.  This leads our
+algorithm to select smaller scaling factor values (inducing smaller energy
+savings).
+
+\subsection{Results comparison}
+
+In this section, we compare our scaling factor selection method with Rauber and
+Rünger methods~\cite{3}.  They had two scenarios, the first is to reduce energy
+to the optimal level without considering the execution time as in
+EQ~\eqref{eq:sopt}.  We refer to this scenario as $R_{E}$.  The second scenario
+is similar to the first except setting the slower task to the maximum frequency
+(when the scale $S=1$) to keep the performance from degradation as mush as
+possible.  We refer to this scenario as $R_{E-P}$.  While we refer to our
+algorithm as EPSA (Energy to Performance Scaling Algorithm).  The comparison is
+made in Table~\ref{table:compareC}.  This table shows the results of our method and
+Rauber and Rünger scenarios for all the NAS benchmarks programs for class C.
+
+\begin{table}
+  \caption{Comparing results for the NAS class C}
    % title of Table
    \centering
-  \begin{tabular}{ | l | l | l |l | l |l|  }
+  \begin{tabular}{|l|l|*{4}{r|}}
      \hline
-    Method&Program&Factor& Energy& Performance &Energy-Perf.\\
-    name &name&value& Saving \%&Degradation \% &Distance
-    \\ \hline
-    % \rowcolor[gray]{0.85}
-    EPSA&CG & 1.56 &37.02 & 13.88 & 23.14\\ \hline
-    $Rauber_{E-P}$&CG &2.14 &42.77 & 25.27 & 17.5\\ \hline
-    $Rauber_{E}$&CG &2.14 &42.77&26.46&16.31\\ \hline
-
-    EPSA&MG & 1.47 &27.66&16.82&10.84\\ \hline
-    $Rauber_{E-P}$&MG &2.14&34.45&31.84&2.61\\ \hline
-    $Rauber_{E}$&MG &2.14&34.48&33.65&0.8 \\ \hline
-
-    EPSA&EP &1.19 &25.32&20.79&4.53\\ \hline
-    $Rauber_{E-P}$&EP&2.05&41.45&55.67&-14.22\\ \hline
-    $Rauber_{E}$&EP&2.05&42.09&57.59&-15.5\\ \hline
-
-    EPSA&LU&1.56& 39.55 &19.38& 20.17\\ \hline
-    $Rauber_{E-P}$&LU&2.14&45.62&27&18.62 \\ \hline
-    $Rauber_{E}$&LU&2.14&45.66&33.01&12.65\\ \hline
-
-    EPSA&BT&1.315& 29.6&20.53&9.07 \\ \hline
-    $Rauber_{E-P}$&BT&2.1&45.53&49.63&-4.1\\ \hline
-    $Rauber_{E}$&BT&2.1&43.93&52.86&-8.93\\ \hline
-
-    EPSA&SP&1.388& 33.51&15.65&17.86 \\ \hline
-    $Rauber_{E-P}$&SP&2.11&45.62&42.52&3.1\\ \hline
-    $Rauber_{E}$&SP&2.11&45.78&43.09&2.69\\ \hline
-
-    EPSA&FT&1.25& 25&10.8&14.2 \\ \hline
-    $Rauber_{E-P}$&FT&2.1&39.29&34.3&4.99 \\ \hline
-    $Rauber_{E}$&FT&2.1&37.56&38.21&-0.65\\ \hline
-  \end{tabular}
-  \label{table:compare Class A}
-  % is used to refer this table in the text
-\end{table}
-\begin{table}[ht]
-  \caption{Comparing Results for The NAS Class B}
-  % title of Table
-  \centering
-  \begin{tabular}{ | l | l | l |l | l |l|  }
+    Method    & Program & Factor & Energy    & Performance    & Energy-Perf. \\
+    Name      & Name    & Value  & Saving \% & Degradation  \% & Distance     \\
      \hline
-    Method&Program&Factor& Energy& Performance &Energy-Perf.\\
-    name &name&value& Saving \%&Degradation \% &Distance
-    \\ \hline
      % \rowcolor[gray]{0.85}
-    EPSA&CG & 1.66 &39.23&16.63&22.6   \\ \hline
-    $Rauber_{E-P}$&CG &2.15 &45.34&27.6&17.74\\ \hline
-    $Rauber_{E}$&CG &2.15 &45.34&28.88&16.46\\ \hline
+    $EPSA$    & CG      & 1.56   & 39.23     & 14.88          &  24.35 \\ \hline
+    $R_{E-P}$ & CG      & 2.15   & 45.36     & 25.89          &  19.47 \\ \hline
+    $R_{E}$   & CG      & 2.15   & 45.36     & 26.70          &  18.66 \\ \hline
  
-    EPSA&MG & 1.47 &34.98&18.35&16.63\\ \hline
-    $Rauber_{E-P}$&MG &2.14&43.55&36.42&7.13 \\ \hline
-    $Rauber_{E}$&MG &2.14&43.56&37.07&6.49 \\ \hline
+    $EPSA$    & MG      & 1.47   & 34.97     & 21.69          &  13.27 \\ \hline
+    $R_{E-P}$ & MG      & 2.15   & 43.65     & 40.45          &   3.20 \\ \hline
+    $R_{E}$   & MG      & 2.15   & 43.64     & 41.38          &   2.26 \\ \hline
  
-    EPSA&EP &1.08 &20.29&17.15&3.14 \\ \hline
-    $Rauber_{E-P}$&EP&2&42.38&56.88&-14.5\\ \hline
-    $Rauber_{E}$&EP&2&39.73&59.94&-20.21\\ \hline
+    $EPSA$    & EP      & 1.04   & 22.14     & 20.73          &   1.41 \\ \hline
+    $R_{E-P}$ & EP      & 1.92   & 39.40     & 56.33          & -16.93 \\ \hline
+    $R_{E}$   & EP      & 1.92   & 38.10     & 56.35          & -18.25 \\ \hline
  
-    EPSA&LU&1.47&38.57&21.34&17.23 \\ \hline
-    $Rauber_{E-P}$&LU&2.1&43.62&36.51&7.11 \\ \hline
-    $Rauber_{E}$&LU&2.1&43.61&38.54&5.07 \\ \hline
+    $EPSA$    & LU      & 1.38   & 35.83     & 22.49          &  13.34 \\ \hline
+    $R_{E-P}$ & LU      & 2.15   & 44.97     & 41.00          &   3.97 \\ \hline
+    $R_{E}$   & LU      & 2.15   & 44.97     & 41.80          &   3.17 \\ \hline
  
-    EPSA&BT&1.315& 29.59&20.88&8.71\\ \hline
-    $Rauber_{E-P}$&BT&2.1&44.53&53.05&-8.52\\ \hline
-    $Rauber_{E}$&BT&2.1&42.93&52.806&-9.876\\ \hline
+    $EPSA$    & BT      & 1.31   & 29.60     & 21.28          &   8.32 \\ \hline
+    $R_{E-P}$ & BT      & 2.13   & 45.60     & 49.84          &  -4.24 \\ \hline
+    $R_{E}$   & BT      & 2.13   & 44.90     & 55.16          & -10.26 \\ \hline
  
-    EPSA&SP&1.388&33.44&19.24&14.2 \\ \hline
-    $Rauber_{E-P}$&SP&2.15&45.69&43.2&2.49\\ \hline
-    $Rauber_{E}$&SP&2.15&45.41&44.47&0.94\\ \hline
+    $EPSA$    & SP      & 1.38   & 33.48     & 21.35          &  12.12 \\ \hline
+    $R_{E-P}$ & SP      & 2.10   & 45.69     & 43.60          &   2.09 \\ \hline
+    $R_{E}$   & SP      & 2.10   & 45.75     & 44.10          &   1.65 \\ \hline
  
-    EPSA&FT&1.388&34.4&14.57&19.83 \\ \hline
-    $Rauber_{E-P}$&FT&2.13&42.98&37.35&5.63 \\ \hline
-    $Rauber_{E}$&FT&2.13&43.04&37.9&5.14\\ \hline
+    $EPSA$    & FT      & 1.47   & 34.72     & 19.00          &  15.72 \\ \hline
+    $R_{E-P}$ & FT      & 2.04   & 39.40     & 37.10          &   2.30 \\ \hline
+    $R_{E}$   & FT      & 2.04   & 39.35     & 37.70          &   1.65 \\ \hline
    \end{tabular}
-  \label{table:compare Class B}
+  \label{table:compareC}
    % is used to refer this table in the text
  \end{table}
-
-\begin{table}[ht]
-  \caption{Comparing Results for The NAS Class C}
-  % title of Table
-  \centering
-  \begin{tabular}{ | l | l | l |l | l |l|  }
-    \hline
-    Method&Program&Factor& Energy& Performance &Energy-Perf.\\
-    name &name&value& Saving \%&Degradation \% &Distance
-    \\ \hline
-    % \rowcolor[gray]{0.85}
-    EPSA&CG & 1.56 &39.23&14.88&24.35  \\ \hline
-    $Rauber_{E-P}$&CG &2.15 &45.36&25.89&19.47\\ \hline
-    $Rauber_{E}$&CG &2.15 &45.36&26.7&18.66\\ \hline
-
-    EPSA&MG & 1.47 &34.97&21.697&13.273\\ \hline
-    $Rauber_{E-P}$&MG &2.15&43.65&40.45&3.2 \\ \hline
-    $Rauber_{E}$&MG &2.15&43.64&41.38&2.26 \\ \hline
-
-    EPSA&EP &1.04 &22.14&20.73&1.41 \\ \hline
-    $Rauber_{E-P}$&EP&1.92&39.4&56.33&-16.93\\ \hline
-    $Rauber_{E}$&EP&1.92&38.1&56.35&-18.25\\ \hline
-
-    EPSA&LU&1.388&35.83&22.49&13.34 \\ \hline
-    $Rauber_{E-P}$&LU&2.15&44.97&41&3.97 \\ \hline
-    $Rauber_{E}$&LU&2.15&44.97&41.8&3.17 \\ \hline
-
-    EPSA&BT&1.315& 29.6&21.28&8.32\\ \hline
-    $Rauber_{E-P}$&BT&2.13&45.6&49.84&-4.24\\ \hline
-    $Rauber_{E}$&BT&2.13&44.9&55.16&-10.26\\ \hline
-
-    EPSA&SP&1.388&33.48&21.35&12.12\\ \hline
-    $Rauber_{E-P}$&SP&2.1&45.69&43.6&2.09\\ \hline
-    $Rauber_{E}$&SP&2.1&45.75&44.1&1.65\\ \hline
-
-    EPSA&FT&1.47&34.72&19&15.72 \\ \hline
-    $Rauber_{E-P}$&FT&2.04&39.4&37.1&2.3\\ \hline
-    $Rauber_{E}$&FT&2.04&39.35&37.7&1.65\\ \hline
-  \end{tabular}
-\label{table:compare Class C}
-% is used to refer this table in the text
-\end{table}
-As shown in these tables our scaling factor is not optimal for energy saving
-such as Rauber's scaling factor EQ~(\ref{eq:sopt}), but it is optimal for both
-the energy and the performance simultaneously. Our EPSA optimal scaling factors
-has better simultaneous optimization for both the energy and the performance
-compared to Rauber's energy-performance method ($Rauber_{E-P}$). Also, in
-($Rauber_{E-P}$) method when setting the frequency to maximum value for the
-slower task lead to a small improvement of the performance. Also the results
-show that this method keep or improve energy saving. Because of the energy
-consumption decrease when the execution time decreased while the frequency value
-increased.
-
-Figure~(\ref{fig:compare}) shows the maximum distance between the energy saving
-percent and the performance degradation percent. Therefore, this means it is the
-same resultant of our objective function EQ~(\ref{eq:max}). Our algorithm always
-gives positive energy to performance trade offs while Rauber's method
-($Rauber_{E-P}$) gives in some time negative trade offs such as in BT and
-EP. The positive trade offs with highest values lead to maximum energy savings
-concatenating with less performance degradation and this the objective of this
-paper. While the negative trade offs refers to improving energy saving (or may
-be the performance) while degrading the performance (or may be the energy) more
-than the first.
-\begin{figure}[width=\textwidth,height=\textheight,keepaspectratio]
+As shown in Table~\ref{table:compareC}, the ($R_{E-P}$) method outperforms the ($R_{E}$)
+method in terms of performance and energy reduction.  The ($R_{E-P}$) method
+also gives better energy savings than our method.  However, although our scaling
+factor is not optimal for energy reduction, the results in this table prove
+that our algorithm returns the best scaling factor that satisfy our objective
+method: the largest distance between energy reduction and performance
+degradation. Figure~\ref{fig:compare}  illustrates even better the distance between 
+the energy reduction and performance degradation. The negative values mean that one of
+the two objectives (energy or performance) have been degraded more than the
+other.  The positive trade-offs with the highest values lead to maximum energy
+savings while keeping the performance degradation as low as possible.  Our
+algorithm always gives the highest positive energy to performance trade-offs
+while Rauber and Rünger's method, ($R_{E-P}$), gives sometimes negative
+trade-offs such as in BT and EP.
+\begin{figure}[t]
    \centering
-  \includegraphics[scale=0.60]{compare_class_A.pdf}
-  \includegraphics[scale=0.60]{compare_class_B.pdf}
-  \includegraphics[scale=0.60]{compare_class_c.pdf}
-  % use scale 35 for all to be in the same line
-  \caption{Comparing Our EPSA with Rauber's Methods}
+%  \includegraphics[width=.328\linewidth]{fig/compare_class_A}
+%  \includegraphics[width=.328\linewidth]{fig/compare_class_B}
+  \includegraphics[width=\linewidth]{fig/compare_class_C}
+  \caption{Comparing our method to Rauber and Rünger's methods}
    \label{fig:compare}
  \end{figure}
  
-\AG{\texttt{bibtex} gives many errors, please correct them}
-\bibliographystyle{plain}
-\bibliography{my_reference}
+\section{Conclusion}
+\label{sec.concl}
+
+In this paper, we have presented a new online scaling factor selection method
+that optimizes simultaneously the energy and performance of a distributed
+application running on a homogeneous cluster.  It uses the computation and
+communication times measured at the first iteration to predict energy
+consumption and the execution time of the parallel application at every available
+frequency.  Then, it selects the scaling factor that gives the best trade-off
+between energy reduction and performance which is the maximum distance between
+the energy and the inverted execution time curves.  To evaluate this method, we
+have applied it to the NAS benchmarks and it was compared to Rauber and Rünger
+methods while being executed on the simulator SimGrid.  The results showed that
+our method, outperforms Rauber and Rünger's methods in terms of energy-performance
+ratio. 
+
+In the near future, we would like to adapt this scaling factor selection method
+to heterogeneous platforms where each node has different characteristics.  In
+particular, each CPU has different available frequencies, energy consumption and
+performance.  It would be also interesting to develop a new energy model for
+asynchronous parallel iterative methods where the number of iterations is not
+known in advance and depends on the global convergence of the iterative system.
+
+\section*{Acknowledgment}
+
+This work has been partially supported by the Labex ACTION project (contract
+``ANR-11-LABX-01-01'').  Computations have been performed on the supercomputer
+facilities of the Mésocentre de calcul de Franche-Comté.  As a PhD student,
+Mr. Ahmed Fanfakh, would like to thank the University of Babylon (Iraq) for
+supporting his work.
+
+% trigger a \newpage just before the given reference
+% number - used to balance the columns on the last page
+% adjust value as needed - may need to be readjusted if
+% the document is modified later
+%\IEEEtriggeratref{15}
+
+\newpage
+\bibliographystyle{IEEEtran}
+\bibliography{IEEEabrv,my_reference}
  \end{document}
  
  %%% Local Variables:
@@ -741,5 +713,5 @@ than the first.
  %%% ispell-local-dictionary: "american"
  %%% End:
  
-%  LocalWords:  Badri Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber
-%  LocalWords:  CMOS EQ EPSA
+%  LocalWords:  Fanfakh Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber
+%  LocalWords:  CMOS EQ EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex