\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
-\usepackage{algorithm,algorithmicx,algpseudocode}
-\usepackage{graphicx,graphics}
+\usepackage{algpseudocode}
+\usepackage{graphicx}
\usepackage{subfig}
-\usepackage{listings}
-\usepackage{colortbl}
\usepackage{amsmath}
-% \usepackage{sectsty}
-% \usepackage{titlesec}
-% \usepackage{secdot}
-%\usepackage[font={footnotesize,bt}]{caption}
-%\usepackage[font=scriptsize,labelfont=bf]{caption}
+
+\usepackage{url}
+\DeclareUrlCommand\email{\urlstyle{same}}
+
+\usepackage[autolanguage,np]{numprint}
+\AtBeginDocument{%
+ \renewcommand*\npunitcommand[1]{\text{#1}}
+ \npthousandthpartsep{}}
+
\usepackage{xspace}
\usepackage[textsize=footnotesize]{todonotes}
-\newcommand{\AG}[2][inline]{\todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace}
+\newcommand{\AG}[2][inline]{%
+ \todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace}
+\newcommand{\JC}[2][inline]{%
+ \todo[color=red!10,#1]{\sffamily\textbf{JC:} #2}\xspace}
+
+\newcommand{\Xsub}[2]{\ensuremath{#1_\textit{#2}}}
+
+\newcommand{\Dist}{\textit{Dist}}
+\newcommand{\Eind}{\Xsub{E}{ind}}
+\newcommand{\Enorm}{\Xsub{E}{Norm}}
+\newcommand{\Eoriginal}{\Xsub{E}{Original}}
+\newcommand{\Ereduced}{\Xsub{E}{Reduced}}
+\newcommand{\Fdiff}{\Xsub{F}{diff}}
+\newcommand{\Fmax}{\Xsub{F}{max}}
+\newcommand{\Fnew}{\Xsub{F}{new}}
+\newcommand{\Ileak}{\Xsub{I}{leak}}
+\newcommand{\Kdesign}{\Xsub{K}{design}}
+\newcommand{\MaxDist}{\textit{Max Dist}}
+\newcommand{\Ntrans}{\Xsub{N}{trans}}
+\newcommand{\Pdyn}{\Xsub{P}{dyn}}
+\newcommand{\PnormInv}{\Xsub{P}{NormInv}}
+\newcommand{\Pnorm}{\Xsub{P}{Norm}}
+\newcommand{\Tnorm}{\Xsub{T}{Norm}}
+\newcommand{\Pstates}{\Xsub{P}{states}}
+\newcommand{\Pstatic}{\Xsub{P}{static}}
+\newcommand{\Sopt}{\Xsub{S}{opt}}
+\newcommand{\Tcomp}{\Xsub{T}{comp}}
+\newcommand{\TmaxCommOld}{\Xsub{T}{Max Comm Old}}
+\newcommand{\TmaxCompOld}{\Xsub{T}{Max Comp Old}}
+\newcommand{\Tmax}{\Xsub{T}{max}}
+\newcommand{\Tnew}{\Xsub{T}{New}}
+\newcommand{\Told}{\Xsub{T}{Old}}
\begin{document}
-\title{Optimal Dynamic Frequency Scaling for Energy-Performance of Parallel MPI Programs}
+\title{Dynamic Frequency Scaling for Energy Consumption
+ Reduction in Synchronous Distributed Applications}
\author{%
\IEEEauthorblockN{%
}
\IEEEauthorblockA{%
FEMTO-ST Institute\\
- University of Franche-Comté
+ University of Franche-Comté\\
+ IUT de Belfort-Montbéliard,
+ 19 avenue du Maréchal Juin, BP 527, 90016 Belfort cedex, France\\
+ % Telephone: \mbox{+33 3 84 58 77 86}, % Raphaël
+ % Fax: \mbox{+33 3 84 58 77 81}\\ % Dept Info
+ Email: \email{{jean-claude.charr,raphael.couturier,ahmed.fanfakh_badri_muslim,arnaud.giersch}@univ-fcomte.fr}
+ }
}
-}
\maketitle
-\AG{``Optimal'' is a bit pretentious in the title.\\
- Complete affiliation, add an email address, etc.}
-
\begin{abstract}
-The important technique for energy reduction of parallel systems is CPU frequency
-scaling. This operation used by many researchers to reduce energy consumption in many
-ways. Frequency scaling operation also has big impact on the performance. In some cases,
-the performance degradation ratio is bigger than energy saving ratio when the frequency scaled
-to down level. Therefore, the trade offs between the energy and performance becomes more
-important topic when using this technique. In this paper we developed an algorithm that
-select the frequency scaling factor for both energy and performance simultaneously.
-This algorithm takes into account the communication times when selecting the frequency scaling
-factor. It is works online without training or profiling to have very small overhead.
-The algorithm has better energy-performance trade offs compared to other methods.
+ Dynamic Voltage Frequency Scaling (DVFS) can be applied to modern CPUs. This
+ technique is usually used to reduce the energy consumed by a CPU while
+ computing. Thus, decreasing the frequency
+ reduces the power consumed by the CPU. However, it can also significantly
+ affect the performance of the executed program if it is compute bound and if a
+ low CPU frequency is selected. Therefore, the chosen scaling factor must
+ give the best possible trade-off between energy reduction and performance.
+
+ In this paper we present an algorithm that predicts the energy consumed with
+ each frequency gear and selects the one that gives the best ratio between
+ energy consumption reduction and performance. This algorithm works online
+ without training or profiling and has a very small overhead. It also takes
+ into account synchronous communications between the nodes that are executing
+ the distributed algorithm. The algorithm has been evaluated over the SimGrid
+ simulator while being applied to the NAS parallel benchmark programs. The
+ results of the experiments show that it outperforms other existing scaling
+ factor selection algorithms.
\end{abstract}
\section{Introduction}
\label{sec.intro}
-The need for computing power is still increasing and it is not expected to slow
-down in the coming years. To satisfy this demand, researchers and supercomputers
-constructors have been regularly increasing the number of computing cores in
-supercomputers (for example in November 2013, according to the TOP500
-list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more than 3
-millions of cores and delivers more than 33 Tflop/s while consuming 17808
-kW). This large increase in number of computing cores has led to large energy
-consumption by these architectures. Moreover, the price of energy is expected to
-continue its ascent according to the demand. For all these reasons energy
-reduction became an important topic in the high performance computing field. To
-tackle this problem, many researchers used DVFS (Dynamic Voltage Frequency
-Scaling) operations which reduce dynamically the frequency and voltage of cores
-and thus their energy consumption. However, this operation also degrades the
-performance of computation. Therefore researchers try to reduce the frequency to
-minimum when processors are idle (waiting for data from other processors or
-communicating with other processors). Moreover, depending on their objectives
-they use heuristics to find the best scaling factor during the computation. If
-they aim for performance they choose the best scaling factor that reduces the
-consumed energy while affecting as little as possible the performance. On the
-other hand, if they aim for energy reduction, the chosen scaling factor must
-produce the most energy efficient execution without considering the degradation
-of the performance. It is important to notice that lowering the frequency to
-minimum value does not always give the most efficient execution due to energy
-leakage. The best scaling factor might be chosen during execution (online) or
-during a pre-execution phase. In this paper we emphasize to develop an
-algorithm that selects a frequency scaling factor that simultaneously takes into
-consideration the energy consumption and the performance. The
-main objective of HPC systems is to run the application with less execution
-time. Therefore, our algorithm selects the scaling factor online with
-very small footprint. The proposed algorithm takes into account the
-communication times of the MPI program to choose the scaling factor. This
-algorithm has ability to predict both energy consumption and execution time over
-all available scaling factors. The prediction achieved depends on some
-computing time information, gathered at the beginning of the runtime. We apply
-this algorithm to seven MPI benchmarks. These MPI programs are the NAS parallel
-benchmarks (NPB v3.3) developed by NASA~\cite{44}. Our experiments are executed
-using the simulator SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183}
-over an homogeneous distributed memory architecture. Furthermore, we compare the
-proposed algorithm with Rauber and Rünger methods~\cite{3}.
-The comparison's results show that our
-algorithm gives better energy-time trade off.
-
-This paper is organized as follows: Section~\ref{sec.relwork} presents the works from other authors.
-Section~\ref{sec.ptasks} shows the execution of parallel tasks and sources of idle times. Section~\ref{sec.energy} resumes the
-energy model of homogeneous platform. Section~\ref{sec.mpip} evaluates the performance of MPI program.
-Section~\ref{sec.verif} verifies the performance prediction model. Section~\ref{sec.compet} presents
-the energy-performance trade offs objective function. Section~\ref{sec.optim} demonstrates the proposed
-energy-performance algorithm. Section~\ref{sec.expe} presents the results of our experiments.
-Section~\ref{sec.compare} shows the comparison results. Finally, we conclude in Section~\ref{sec.concl}.
-
-\section{Related Works}
+The need and demand for more computing power have been increasing since the
+birth of the first computing unit and it is not expected to slow down in the
+coming years. To satisfy this demand, researchers and supercomputers
+constructors have been regularly increasing the number of computing cores and
+processors in supercomputers (for example in November 2013, according to the
+TOP500 list~\cite{43}, the Tianhe-2 was the fastest supercomputer. It has more
+than 3 million of cores and delivers more than \np[Tflop/s]{33} while consuming
+\np[kW]{17808}). This large increase in number of computing cores has led to
+large energy consumption by these architectures. Moreover, the price of energy
+is expected to continue its ascent according to the demand. For all these
+reasons energy reduction has become an important topic in the high performance
+computing field. To tackle this problem, many researchers use DVFS (Dynamic
+Voltage Frequency Scaling) operations which reduce dynamically the frequency and
+voltage of cores and thus their energy consumption. Indeed, modern CPUs offer a
+set of acceptable frequencies which are usually called gears, and the user or
+the operating system can modify the frequency of the processor according to its
+needs. However, DVFS also degrades the performance of computation. Therefore
+researchers try to reduce the frequency to the minimum when processors are idle
+(waiting for data from other processors or communicating with other processors).
+Moreover, depending on their objectives, they use heuristics to find the best
+scaling factor during the computation. If they aim for performance they choose
+the best scaling factor that reduces the consumed energy while affecting as
+little as possible the performance. On the other hand, if they aim for energy
+reduction, the chosen scaling factor must produce the most energy efficient
+execution without considering the degradation of the performance. It is
+important to notice that lowering the frequency to the minimum value does not always
+give the most energy efficient execution due to energy leakage. The best
+scaling factor might be chosen during execution (online) or during a
+pre-execution phase. In this paper, we present an algorithm that selects a
+frequency scaling factor that simultaneously takes into consideration the energy
+consumption by the CPU and the performance of the application. The main
+objective of HPC systems is to execute as fast as possible the application.
+Therefore, our algorithm selects the scaling factor online with very small
+overhead. The proposed algorithm takes into account the communication times of
+the MPI program to choose the scaling factor. This algorithm has the ability to
+predict both energy consumption and execution time over all available scaling
+factors. The prediction achieved depends on some computing time information,
+gathered at the beginning of the runtime. We apply this algorithm to the NAS parallel benchmarks (NPB v3.3)~\cite{44}. Our experiments are executed using the simulator
+SimGrid/SMPI v3.10~\cite{Casanova:2008:SGF:1397760.1398183} over a homogeneous
+distributed memory architecture. Furthermore, we compare the proposed algorithm
+with Rauber and Rünger methods~\cite{3}. The comparison's results show that our
+algorithm gives better energy-time trade-off.
+
+This paper is organized as follows: Section~\ref{sec.relwork} presents some
+related works from other authors. Section~\ref{sec.exe} presents an energy
+model for homogeneous platforms. Section~\ref{sec.mpip} describes how the
+performance of MPI programs can be predicted. Section~\ref{sec.compet} presents
+the energy-performance objective function that maximizes the reduction of energy
+consumption while minimizing the degradation of the program's performance.
+Section~\ref{sec.optim} details the proposed energy-performance algorithm.
+Section~\ref{sec.expe} verifies the accuracy of the performance prediction model
+and presents the results of the proposed algorithm. It also shows the
+comparison results between our method and other existing methods. Finally, we
+conclude in Section~\ref{sec.concl} with a summary and some future works.
+
+\section{Related works}
\label{sec.relwork}
-\AG{Consider introducing the models (sec.~\ref{sec.ptasks},
- maybe~\ref{sec.energy}) before related works}
-
-In the this section some heuristics to compute the scaling factor are
-presented and classified in two parts: offline and online methods.
-
-\subsection{The offline DVFS orientations}
-
-The DVFS offline methods are static and are not executed during the runtime of
-the program. Some approaches used heuristics to select the best DVFS state
-during the compilation phases as for example in Azevedo et al.~\cite{40}. They use
-dynamic voltage scaling (DVS) algorithm to choose the DVS setting when there are dependency points
-between tasks. While in~\cite{29}, Xie et al. used breadth-first search
-algorithm to do that. Their goal is to save energy with time limits. Another
-approach gathers and stores the runtime information for each DVFS state, then
-selects the suitable DVFS offline to optimize energy-time
-trade offs. As an example Rountree et al.~\cite{8}, use liner programming
-algorithm, while in~\cite{38,34}, Cochran et al. use multi logistic regression
-algorithm for the same goal. The offline study that shows the DVFS impact on the
-communication time of the MPI program is~\cite{17}, where Freeh et al. show that these
-times do not change when the frequency is scaled down.
-
-\subsection{The online DVFS orientations}
-
-The objective of online DVFS orientations works is to dynamically compute and set the frequency of
-the CPU during the runtime of the program for saving energy. Estimating and
-predicting approaches for the energy-time trade offs are developed by Kimura, Peraza, Yu-Liang et al.
-~\cite{11,2,31}. These works select the best DVFS setting depending on the slack
-times. These times happen when the processors have to wait for data from other
-processors to compute their task. For example, during the synchronous
-communications that take place in MPI programs, some processors are
-idle. The optimal DVFS can be selected using learning methods. Therefore, in Dhiman, Hao Shen et al.
-~\cite{39,19} used machine learning to converge to the suitable DVFS
-configuration. Their learning algorithms take big time to converge when the
-number of available frequencies is high. Also, the communication sections of the MPI
-program can be used to save energy. In~\cite{1}, Lim et al. developed an
-algorithm that detects the communication sections and changes the frequency
-during these sections only. This approach changes the frequency many times
-because an iteration may contain more than one communication section. The domain
-of analytical modeling used for choosing the optimal frequency as in Rauber and Rünger~\cite{3}. they
-developed an analytical mathematical model to determine the
-optimal frequency scaling factor for any number of concurrent tasks. They set the slowest task to maximum frequency for maintaining performance. In this paper we compare our algorithm with
-Rauber and Rünger model~\cite{3}, because their model can be used for any number of
-concurrent tasks for homogeneous platforms. The primary contributions of this paper are:
-\begin{enumerate}
-\item Selecting the frequency scaling factor for simultaneously optimizing energy and performance,
- while taking into account the communication time.
-\item Adapting our scaling factor to take into account the imbalanced tasks.
-\item The execution time of our algorithm is very small when compared to other
- methods (e.g.,~\cite{19}).
-\item The proposed algorithm works online without profiling or training as
- in~\cite{38,34}.
-\end{enumerate}
-
-\section{Parallel Tasks Execution on Homogeneous Platform}
-\label{sec.ptasks}
-
-A homogeneous cluster consists of identical nodes in terms of hardware and software.
-Each node has its own memory and at least one processor which can
-be a multi-core. The nodes are connected via a high bandwidth network. Tasks
-executed on this model can be either synchronous or asynchronous. In this paper
-we consider execution of the synchronous tasks on distributed homogeneous
-platform. These tasks can exchange the data via synchronous message passing.
-\begin{figure*}[t]
- \centering
- \subfloat[Sync. Imbalanced Communications]{\includegraphics[scale=0.67]{commtasks}\label{fig:h1}}
- \subfloat[Sync. Imbalanced Computations]{\includegraphics[scale=0.67]{compt}\label{fig:h2}}
- \caption{Parallel Tasks on Homogeneous Platform}
- \label{fig:homo}
-\end{figure*}
-Therefore, the execution time of a task consists of the computation time and the
-communication time. Moreover, the synchronous communications between tasks can
-lead to idle time while tasks wait at the synchronization barrier for other tasks to
-finish their communications (see figure~(\ref{fig:h1})). The imbalanced communications happen when nodes have to send/receive different amount of data or each node is communicates with different number of nodes. Another source for idle times is the imbalanced computations. This happen when processing different
-amounts of data on each processor (see figure~(\ref{fig:h2})). In
-this case the fastest tasks have to wait at the synchronization barrier for the
-slowest tasks to finish their job. In both cases the overall execution time
-of the program is the execution time of the slowest task as:
-\begin{equation}
- \label{eq:T1}
- \textit{Program Time} = \max_{i=1,2,\dots,N} T_i
-\end{equation}
-where $T_i$ is the execution time of task $i$.
-
-\section{Energy Model for Homogeneous Platform}
-\label{sec.energy}
-The energy consumption by the processor consists of two power metrics: the
-dynamic and the static power. This general power formulation is used by many
-researchers~\cite{9,3,15,26}. The dynamic power of the CMOS processors
-$P_{dyn}$ is related to the switching activity $\alpha$, load capacitance $C_L$,
-the supply voltage $V$ and operational frequency $f$ respectively as follow:
+In this section, some heuristics to compute the scaling factor are presented and
+classified into two categories: offline and online methods.
+
+\subsection{Offline scaling factor selection methods}
+
+The offline scaling factor selection methods are executed before the runtime of
+the program. They return static scaling factor values to the processors
+participating in the execution of the parallel program. On the one hand, the
+scaling factor values could be computed based on information retrieved by
+analyzing the code of the program and the computing system that will execute it.
+In~\cite{40}, Azevedo et al. detect during compilation the dependency points
+between tasks in a multi-task program. This information is then used to lower
+the frequency of some processors in order to eliminate slack times. A slack
+time is the period of time during which a processor that has already finished
+its computation, has to wait for a set of processors to finish their
+computations and send their results to the waiting processor in order to
+continue its task that is dependent on the results of computations being
+executed on other processors. Freeh et al. showed in~\cite{17} that the
+communication times of MPI programs do not change when the frequency is scaled
+down. On the other hand, some offline scaling factor selection methods use the
+information gathered from previous full or partial executions of the program. The whole program or, a
+part of it, is usually executed over all the available frequency
+gears and the execution time and the energy consumed with each frequency
+gear are measured. Then a heuristic or an exact method uses the retrieved
+information to compute the values of the scaling factor for the processors.
+In~\cite{8} , Rountree et al. use a linear programming algorithm, while in~\cite{34}, Cochran et
+al. use a multi-logistic regression algorithm for the same goal. The main
+drawback of these methods is that they all require executing the
+whole program or, a part of it, on all frequency gears for each new instance of the same program.
+
+\subsection{Online scaling factor selection methods}
+
+The online scaling factor selection methods are executed during the runtime of
+the program. They are usually integrated into iterative programs where the same
+block of instructions is executed many times. During the first few iterations,
+a lot of information is measured such as the execution time, the energy consumed
+using a multimeter, the slack times, \dots{} Then a method will exploit these
+measurements to compute the scaling factor values for each processor. This
+operation, measurements and computing new scaling factor, can be repeated as
+much as needed if the iterations are not regular. Peraza, Yu-Liang et
+al.~\cite{2,31} used varied heuristics to select the appropriate scaling
+factor values to eliminate the slack times during runtime. However, as seen
+in~\cite{19}, machine learning method takes a lot of time to converge
+when the number of available gears is big. To reduce the impact of slack times,
+in~\cite{1}, Lim et al. developed an algorithm that detects the communication
+sections and changes the frequency during these sections only. This approach
+might change the frequency of each processor many times per iteration if an
+iteration contains more than one communication section. In~\cite{3}, Rauber and
+Rünger used an analytical model that can predict the consumed energy for every frequency gear after measuring the consumed energy. They
+maintain the performance as mush as possible by setting the highest frequency gear to the slowest task.
+
+The primary contribution of
+our paper is to present a new online scaling factor selection method which has the
+ following characteristics:\\
+1) It is based on Rauber and Rünger analytical model to predict the energy
+ consumption of the application with different frequency gears.
+2) It selects the frequency scaling factor for simultaneously optimizing
+ energy reduction and maintaining performance.
+3) It is well adapted to distributed architectures because it takes into
+ account the communication time.
+4) It is well adapted to distributed applications with imbalanced tasks.
+5) It has a very small overhead when compared to other methods
+ (e.g.,~\cite{19}) and does not require profiling or training as
+ in~\cite{34}.
+
+
+% \JC{The whole subsection ``Parallel Tasks Execution on Homogeneous Platform'',
+% can be deleted if we need space, we can just say we are interested in this
+% paper in homogeneous clusters}
+
+
+\section{Energy model for a homogeneous platform}
+\label{sec.exe}
+Many researchers~\cite{9,3,15,26} divide the power consumed by a processor into
+two power metrics: the static and the dynamic power. While the first one is
+consumed as long as the computing unit is on, the latter is only consumed during
+computation times. The dynamic power $\Pdyn$ is related to the switching
+activity $\alpha$, load capacitance $C_L$, the supply voltage $V$ and
+operational frequency $f$, as shown in EQ~\eqref{eq:pd}.
\begin{equation}
\label{eq:pd}
- P_\textit{dyn} = \alpha \cdot C_L \cdot V^2 \cdot f
+ \Pdyn = \alpha \cdot C_L \cdot V^2 \cdot f
\end{equation}
-The static power $P_{static}$ captures the leakage power consumption as well as
-the power consumption of peripheral devices like the I/O subsystem.
+The static power $\Pstatic$ captures the leakage power as follows:
\begin{equation}
\label{eq:ps}
- P_\textit{static} = V \cdot N \cdot K_{design} \cdot I_{leak}
+ \Pstatic = V \cdot \Ntrans \cdot \Kdesign \cdot \Ileak
\end{equation}
-where V is the supply voltage, N is the number of transistors, $K_{design}$ is a
-design dependent parameter and $I_{leak}$ is a technology-dependent
-parameter. Energy consumed by an individual processor $E_{ind}$ is the summation
-of the dynamic and the static power multiplied by the execution time for example
-see~\cite{36,15}.
+where V is the supply voltage, $\Ntrans$ is the number of transistors,
+$\Kdesign$ is a design dependent parameter and $\Ileak$ is a
+technology-dependent parameter. The energy consumed by an individual processor
+to execute a given program can be computed as:
\begin{equation}
\label{eq:eind}
- E_\textit{ind} = ( P_\textit{dyn} + P_\textit{static} ) \cdot T
+ \Eind = \Pdyn \cdot \Tcomp + \Pstatic \cdot T
\end{equation}
-The dynamic voltage and frequency scaling (DVFS) is a process that is allowed in
-modern processors to reduce the dynamic power by scaling down the voltage and
-frequency. Its main objective is to reduce the overall energy
-consumption~\cite{37}. The operational frequency \emph f depends linearly on the
-supply voltage $V$, i.e., $V = \beta \cdot f$ with some constant $\beta$. This
-equation is used to study the change of the dynamic voltage with respect to
-various frequency values in~\cite{3}. The reduction process of the frequency are
-expressed by scaling factor \emph S. The scale \emph S is the ratio between the
-maximum and the new frequency as in EQ~(\ref{eq:s}).
+where $T$ is the execution time of the program, $\Tcomp$ is the computation
+time and $\Tcomp \leq T$. $\Tcomp$ may be equal to $T$ if there is no
+communication, no slack time and no synchronization.
+
+DVFS is a process that is allowed in modern processors to reduce the dynamic
+power by scaling down the voltage and frequency. Its main objective is to
+reduce the overall energy consumption~\cite{37}. The operational frequency $f$
+depends linearly on the supply voltage $V$, i.e., $V = \beta \cdot f$ with some
+constant $\beta$. This equation is used to study the change of the dynamic
+voltage with respect to various frequency values in~\cite{3}. The reduction
+process of the frequency can be expressed by the scaling factor $S$ which is the
+ratio between the maximum and the new frequency as in EQ~\eqref{eq:s}.
\begin{equation}
\label{eq:s}
- S = \frac{F_\textit{max}}{F_\textit{new}}
+ S = \frac{\Fmax}{\Fnew}
\end{equation}
-The value of the scale $S$ is greater than 1 when changing the frequency to
-any new frequency value~(\emph {P-state}) in governor, the CPU governor is an interface
-driver supplied by the operating system kernel (e.g. Linux) to lowering core's frequency.
-The scaling factor is equal to 1 when the frequency set is to the maximum frequency.
-The energy consumption model for parallel homogeneous platform depends on the scaling factor \emph S. This factor reduces quadratically the dynamic power. Also, this factor increases the
-static energy linearly because the execution time is increased~\cite{36}. The
-energy model depending on the frequency scaling factor for homogeneous platform
-for any number of concurrent tasks was developed by Rauber and Rünger~\cite{3}. This model
-considers the two power metrics for measuring the energy of the parallel tasks as
-in EQ~(\ref{eq:energy}):
+The value of the scaling factor $S$ is greater than 1 when changing the
+frequency of the CPU to any new frequency value~(\emph{P-state}) in the
+governor. This factor reduces quadratically
+the dynamic power which may cause degradation in performance and thus, the
+increase of the static energy because the execution time is increased~\cite{36}.
+If the tasks are sorted according to their execution times before scaling in a
+descending order, the total energy consumption model for a parallel homogeneous
+platform, as presented by Rauber and Rünger~\cite{3}, can be written as a
+function of the scaling factor $S$, as in EQ~\eqref{eq:energy}.
\begin{equation}
\label{eq:energy}
- E = P_\textit{dyn} \cdot S_1^{-2} \cdot
+ E = \Pdyn \cdot S_1^{-2} \cdot
\left( T_1 + \sum_{i=2}^{N} \frac{T_i^3}{T_1^2} \right) +
- P_\textit{static} \cdot T_1 \cdot S_1 \cdot N
- \hfill
-\end{equation}
-where \emph N is the number of parallel nodes, $T_1 $ is the time of the slowest
-task, $T_i$ is the time of the task $i$ and $S_1$ is the maximum scaling factor
-for the slower task. The scaling factor $S_1$, as in EQ~(\ref{eq:s1}), selects
-from the set of scales values $S_i$. Each of these scales are proportional to
-the time value $T_i$ depends on the new frequency value as in EQ~(\ref{eq:si}).
-\begin{equation}
- \label{eq:s1}
- S_1 = \max_{i=1,2,\dots,F} S_i
+ \Pstatic \cdot T_1 \cdot S_1 \cdot N
\end{equation}
+where $N$ is the number of parallel nodes, $T_i$ for $i=1,\dots,N$ are
+the execution times of the sorted tasks. Therefore, $T_1$ is
+the time of the slowest task, and $S_1$ its scaling factor which should be the
+highest because they are proportional to the time values $T_i$. The scaling
+factors $S_i$ are computed as in EQ~\eqref{eq:si}.
\begin{equation}
\label{eq:si}
S_i = S \cdot \frac{T_1}{T_i}
- = \frac{F_\textit{max}}{F_\textit{new}} \cdot \frac{T_1}{T_i}
+ = \frac{\Fmax}{\Fnew} \cdot \frac{T_1}{T_i}
\end{equation}
-where $F$ is the number of available frequencies. In this paper we depend on
-Rauber and Rünger energy model EQ~(\ref{eq:energy}) for two reasons: (1)-this model is used
-for homogeneous platform that we work on in this paper. 2-we compare our
-algorithm with Rauber and Rünger scaling model. Rauber and Rünger scaling factor that reduce
- energy consumption derived from the EQ~(\ref{eq:energy}). They take the
-derivation for this equation (to be minimized) and set it to zero to produce the
-scaling factor as in EQ~(\ref{eq:sopt}).
+In this paper we use Rauber and Rünger's energy model, EQ~\eqref{eq:energy}, because it can be applied to homogeneous clusters if the communication time is taken in consideration. Moreover, we compare our algorithm with Rauber and Rünger's scaling factor selection
+method which uses the same energy model. In their method, the optimal scaling factor is
+computed by minimizing the derivation of EQ~\eqref{eq:energy} which produces
+EQ~\eqref{eq:sopt}.
+
\begin{equation}
\label{eq:sopt}
- S_\textit{opt} = \sqrt[3]{\frac{2}{n} \cdot \frac{P_\textit{dyn}}{P_\textit{static}} \cdot
+ \Sopt = \sqrt[3]{\frac{2}{N} \cdot \frac{\Pdyn}{\Pstatic} \cdot
\left( 1 + \sum_{i=2}^{N} \frac{T_i^3}{T_1^3} \right) }
\end{equation}
-\section{Performance Evaluation of MPI Programs}
+
+\section{Performance evaluation of MPI programs}
\label{sec.mpip}
-The performance (execution time) of parallel MPI applications depend on
-the time of the slowest task as in figure~(\ref{fig:homo}). Normally the
-execution time of the parallel programs are proportional to the operational
-frequency. Therefore, any DVFS operation for the energy reduction increases the
-execution time of the parallel program. As shown in EQ~(\ref{eq:energy}) the
-energy is affected by the scaling factor $S$. This factor also has a great impact
-on the performance. When scaling down the frequency to the new value according
-to EQ~(\ref{eq:s}), the value of the scale $S$ has inverse relation with
-new frequency value ($S \propto \frac{1}{F_{new}}$). Also when decreasing the
-frequency value, the execution time increases. Then the new frequency value has
-inverse relation with time ($F_{new} \propto \frac{1}{T}$). This leads to the
-frequency scaling factor $S$ proportional linearly with execution time ($S
-\propto T$). Large scale MPI applications such as NAS benchmarks have
-considerable amount of communications embedded in these programs. During the
-communication process the processors remain idle until the communication has
-finished. For that reason any change in the frequency has no impact on the time
-of communication but it has obvious impact on the time of
-computation~\cite{17}. We have made many tests on a real cluster to prove that the
-frequency scaling factor \emph S has a linear relation with computation time
-only. To predict the execution time of MPI program, the communication time and
-the computation time for the slower task must be first precisely specified. Secondly,
-these times are used to predict the execution time for any MPI program as a function of
-the new scaling factor as in the EQ~(\ref{eq:tnew}).
+The execution time of a parallel synchronous iterative application is
+equal to the execution time of the slowest task. If there is no
+communication and the application is not data bounded, the execution time of a
+parallel program is linearly proportional to the operational frequency and any
+DVFS operation for energy reduction increases the execution time of the parallel
+program. Therefore, the scaling factor $S$ is linearly proportional to the
+execution time. However, in most MPI applications the processes exchange
+data. During these communications the processors involved remain idle until the
+communications are finished. For that reason, any change in the frequency has no
+impact on the time of communication~\cite{17}. The communication time for a
+task is the summation of periods of time that begin with an MPI call for sending
+or receiving a message till the message is synchronously sent or received. To
+be able to predict the execution time of MPI program, the communication time and
+the computation time for the slowest task must be measured before scaling. These
+times are used to predict the execution time for any MPI program as a function
+of the new scaling factor as in EQ~\eqref{eq:tnew}.
\begin{equation}
\label{eq:tnew}
- \textit T_\textit{new} = T_\textit{Max Comp Old} \cdot S + T_{\textit{Max Comm Old}}
+ \Tnew = \TmaxCompOld \cdot S + \TmaxCommOld
\end{equation}
-The above equation shows that the scaling factor \emph S has linear relation
-with the computation time without affecting the communication time. The
-communication time consists of the beginning times which an MPI calls for
-sending or receiving till the message is synchronously sent or received. In this
-paper we predict the execution time of the program for any new scaling factor
-value. Depending on this prediction we can produce our energy-performance scaling
-method as we will show in the coming sections. In the next section we make an
-investigation study for the EQ~(\ref{eq:tnew}).
-
-\section{Performance Prediction Verification}
-\label{sec.verif}
-
-In this section we evaluate the precision of our performance prediction methods
-on the NAS benchmark. We use the EQ~(\ref{eq:tnew}) that predicts the execution
-time for any scale value. The NAS programs run the class B for comparing the
-real execution time with the predicted execution time. Each program runs offline
-with all available scaling factors on 8 or 9 nodes to produce real execution
-time values. These scaling factors are computed by dividing the maximum
-frequency by the new one see EQ~(\ref{eq:s}). In all tests, we use the simulator
-SimGrid/SMPI v3.10 to run the NAS programs.
-\begin{figure*}[t]
- \centering
- \includegraphics[width=.4\textwidth]{cg_per.eps}\qquad%
- \includegraphics[width=.4\textwidth]{mg_pre.eps}
- \includegraphics[width=.4\textwidth]{bt_pre.eps}\qquad%
- \includegraphics[width=.4\textwidth]{lu_pre.eps}
- \caption{Fitting Predicted to Real Execution Time}
- \label{fig:pred}
-\end{figure*}
-%see Figure~\ref{fig:pred}
-In our cluster there are 18 available frequency states for each processor from
-2.5 GHz to 800 MHz, there is 100 MHz difference between two successive
-frequencies. For more details on the characteristics of the platform refer to
-table~(\ref{table:platform}). This lead to 18 run states for each program. We
-use seven MPI programs of the NAS parallel benchmarks: CG, MG, EP, FT, BT, LU
-and SP. The average normalized errors between the predicted execution time and
-the real time (SimGrid time) for all programs is between 0.0032 to 0.0133. AS an
-example, we are present the execution times of the NAS benchmarks as in the
-figure~(\ref{fig:pred}).
-
-\section{Performance to Energy Competition}
+In this paper, this prediction method is used to select the best scaling factor
+for each processor as presented in the next section.
+
+\section{Performance and energy reduction trade-off}
\label{sec.compet}
-This section demonstrates our approach for choosing the optimal scaling
-factor. This factor gives maximum energy reduction taking into account the
-execution time for both computation and communication times. The relation
-between the energy and the performance are nonlinear and complex, because the
-relation of the energy with scaling factor is nonlinear and with the performance
-it is linear see~\cite{17}. The relation between the energy and the performance
-is not straightforward. Moreover, they are not measured using the same metric.
-For solving this problem, we normalize the energy by calculating the ratio
-between the consumed energy with scaled frequency and the consumed energy
-without scaled frequency:
+This section presents our method for choosing the optimal scaling factor that
+gives the best tradeoff between energy reduction and performance. This method
+takes into account the execution times for both computation and communication to
+compute the scaling factor. Since the energy consumption and the performance
+are not measured using the same metric, a normalized value of both measurements
+can be used to compare them. The normalized energy is the ratio between the
+consumed energy with scaled frequency and the consumed energy without scaled
+frequency:
\begin{multline}
\label{eq:enorm}
- E_\textit{Norm} = \frac{ E_\textit{Reduced}}{E_\textit{Original}} \\
- {} = \frac{P_\textit{dyn} \cdot S_i^{-2} \cdot
- \left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
- P_\textit{static} \cdot T_1 \cdot S_i \cdot N }{
- P_\textit{dyn} \cdot \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
- P_\textit{static} \cdot T_1 \cdot N }
+ \Enorm = \frac{ \Ereduced}{\Eoriginal} \\
+ {} = \frac{\Pdyn \cdot S_1^{-2} \cdot
+ \left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
+ \Pstatic \cdot T_1 \cdot S_1 \cdot N}{
+ \Pdyn \cdot \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
+ \Pstatic \cdot T_1 \cdot N }
\end{multline}
-By the same way we can normalize the performance as follows:
+In the same way, the normalized execution time of a program is computed as follows:
\begin{equation}
\label{eq:pnorm}
- P_\textit{Norm} = \frac{T_\textit{New}}{T_\textit{Old}}
- = \frac{T_\textit{Max Comp Old} \cdot S +
- T_\textit{Max Comm Old}}{ T_\textit{Old}}
+ \Tnorm = \frac{\Tnew}{\Told}
+ = \frac{\TmaxCompOld \cdot S + \TmaxCommOld}{
+ \TmaxCompOld + \TmaxCommOld}
\end{equation}
-The second problem is the optimization operation for both energy and performance
-is not in the same direction. In other words, the normalized energy and the
-performance curves are not in the same direction see figure~(\ref{fig:r2}).
-While the main goal is to optimize the energy and performance in the same
-time. According to the equations~(\ref{eq:enorm}) and~(\ref{eq:pnorm}) the
-scaling factor \emph S reduce both the energy and the performance
-simultaneously. But the main objective is to produce maximum energy reduction
-with minimum performance reduction. Many researchers used different strategies
-to solve this nonlinear problem for example see~\cite{19,42}, their methods add
-big overhead to the algorithm for selecting the suitable frequency. In this
-paper we are present a method to find the optimal scaling factor \emph S for
-optimize both energy and performance simultaneously without adding big
-overheads. Our solution for this problem is to make the optimization process
-have the same direction. Therefore, we inverse the equation of normalize
-performance as follows:
+The relation between the execution time and the consumed energy of a program is nonlinear and complex. In consequences, the relation between the consumed energy and the scaling factor is also nonlinear, for more details refer to~\cite{17}. Therefore, the resulting normalized energy consumption curve and execution time curve, for different scaling factors, do not have the same direction see Figure~\ref{fig:rel}\subref{fig:r2}. To tackle this problem and optimize both terms, we inverse the equation of the normalized execution time as follows:
\begin{equation}
\label{eq:pnorm_en}
- P^{-1}_\textit{Norm} = \frac{ T_\textit{Old}}{ T_\textit{New}}
- = \frac{ T_\textit{Old}}{T_\textit{Max Comp Old} \cdot S +
- T_\textit{Max Comm Old}}
+ \Pnorm = \frac{ \Told}{ \Tnew}
+ = \frac{\TmaxCompOld +
+ \TmaxCommOld}{\TmaxCompOld \cdot S +
+ \TmaxCommOld}
\end{equation}
-\begin{figure*}
+\begin{figure}
\centering
- \subfloat[Converted Relation.]{%
- \includegraphics[width=.4\textwidth]{file.eps}\label{fig:r1}}%
- \qquad%
- \subfloat[Real Relation.]{%
- \includegraphics[width=.4\textwidth]{file3.eps}\label{fig:r2}}
+ \subfloat[Real relation.]{%
+ \includegraphics[width=.5\linewidth]{fig/file3}\label{fig:r2}}%
+ \subfloat[Converted relation.]{%
+ \includegraphics[width=.5\linewidth]{fig/file}\label{fig:r1}}
+ \caption{The energy and performance relation}
\label{fig:rel}
- \caption{The Energy and Performance Relation}
-\end{figure*}
-Then, we can modelize our objective function as finding the maximum distance
-between the energy curve EQ~(\ref{eq:enorm}) and the inverse of performance
-curve EQ~(\ref{eq:pnorm_en}) over all available scaling factors. This represent
-the minimum energy consumption with minimum execution time (better performance)
-in the same time, see figure~(\ref{fig:r1}). Then our objective function has the
-following form:
+\end{figure}
+Then, we can model our objective function as finding the maximum distance
+between the energy curve EQ~\eqref{eq:enorm} and the inverse of the execution time (performance)
+curve EQ~\eqref{eq:pnorm_en} over all available scaling factors. This
+represents the minimum energy consumption with minimum execution time (better
+performance) at the same time, see Figure~\ref{fig:rel}\subref{fig:r1}. Then
+our objective function has the following form:
\begin{equation}
\label{eq:max}
- \textit{MaxDist} = \max (\overbrace{P^{-1}_\textit{Norm}}^{\text{Maximize}} -
- \overbrace{E_\textit{Norm}}^{\text{Minimize}} )
+ \MaxDist = \max_{j=1,2,\dots,F}
+ (\overbrace{\Pnorm(S_j)}^{\text{Maximize}} -
+ \overbrace{\Enorm(S_j)}^{\text{Minimize}} )
\end{equation}
-Then we can select the optimal scaling factor that satisfy the
-EQ~(\ref{eq:max}). Our objective function can works with any energy model or
-static power values stored in a data file. Moreover, this function works in
-optimal way when the energy function has a convex form with frequency scaling
-factor as shown in~\cite{15,3,19}. Energy measurement model is not the
-objective of this paper and we choose Rauber and Rünger model as an example with two
-reasons that mentioned before.
-
-\section{Optimal Scaling Factor for Performance and Energy}
+where $F$ is the number of available frequencies. Then we can select the optimal
+scaling factor that satisfies EQ~\eqref{eq:max}. Our objective function can
+work with any energy model or static power values stored in a data file.
+Moreover, this function works in optimal way when the energy curve has a convex
+form over the available frequency scaling factors as shown in~\cite{15,3,19}.
+
+\section{Optimal scaling factor for performance and energy}
\label{sec.optim}
-In the previous section we described the objective function that satisfy our
-goal in discovering optimal scaling factor for both performance and energy at
-the same time. Therefore, we develop an energy to performance scaling algorithm
-($EPSA$). This algorithm is simple and has a direct way to calculate the optimal
-scaling factor for both energy and performance at the same time.
-\begin{algorithm}[tp]
- \caption{EPSA}
+Algorithm on Figure~\ref{EPSA} computes the optimal scaling factor according to
+the objective function described above.
+\begin{figure}[tp]
+ \begin{algorithmic}[1]
+ % \footnotesize
+ \Require ~
+ \begin{description}
+ \item[$\Pstatic$] static power value
+ \item[$\Pdyn$] dynamic power value
+ \item[$\Pstates$] number of available frequencies
+ \item[$\Fmax$] maximum frequency
+ \item[$\Fdiff$] difference between two successive freq.
+ \end{description}
+ \Ensure $\Sopt$ is the optimal scaling factor
+
+ \State $\Sopt \gets 1$
+ \State $\Dist \gets 0$
+ \State $\Fnew \gets \Fmax$
+ \For {$j = 2$ to $\Pstates$}
+ \State $\Fnew \gets \Fnew - \Fdiff$
+ \State $S \gets \Fmax / \Fnew$
+ \State $S_i \gets S \cdot \frac{T_1}{T_i}
+ = \frac{\Fmax}{\Fnew} \cdot \frac{T_1}{T_i}$
+ for $i=1,\dots,N$
+ \State $\Enorm \gets
+ \frac{\Pdyn \cdot S_1^{-2} \cdot
+ \left( T_1 + \sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
+ \Pstatic \cdot T_1 \cdot S_1 \cdot N }{
+ \Pdyn \cdot
+ \left(T_1+\sum_{i=2}^{N}\frac{T_i^3}{T_1^2}\right) +
+ \Pstatic \cdot T_1 \cdot N }$
+ \State $\Pnorm \gets \Told / \Tnew$
+ \If{$(\Pnorm - \Enorm > \Dist)$}
+ \State $\Sopt \gets S$
+ \State $\Dist \gets \Pnorm - \Enorm$
+ \EndIf
+ \EndFor
+ \State Return $\Sopt$
+ \end{algorithmic}
+ \caption{Scaling factor selection algorithm}
\label{EPSA}
+\end{figure}
+
+The proposed algorithm works online during the execution time of the MPI
+program. It selects the optimal scaling factor after gathering the computation
+and communication times from the program after one iteration. Then the program
+changes the new frequencies of the CPUs according to the computed scaling
+factors. In our experiments over a homogeneous cluster described in
+Section~\ref{sec.expe}, this algorithm has a small execution time. It takes
+\np[$\mu$s]{1.52} on average for 4 nodes and \np[$\mu$s]{6.65} on average for 32
+nodes. The algorithm complexity is $O(F\cdot N)$, where $F$ is the number of
+available frequencies and $N$ is the number of computing nodes. The algorithm
+is called just once during the execution of the program. The DVFS algorithm on
+Figure~\ref{dvfs} shows where and when the algorithm is called in the MPI
+program.
+%\begin{table}[htb]
+% \caption{Platform file parameters}
+% % title of Table
+% \centering
+% \begin{tabular}{|*{7}{l|}}
+% \hline
+% Max & Min & Backbone & Backbone & Link & Link & Sharing \\
+% Freq. & Freq. & Bandwidth & Latency & Bandwidth & Latency & Policy \\
+% \hline
+% \np{2.5} & \np{800} & \np[GBps]{2.25} & \np[$\mu$s]{0.5} & \np[GBps]{1} & \np[$\mu$s]{50} & Full \\
+% GHz & MHz & & & & & Duplex \\
+% \hline
+% \end{tabular}
+% \label{table:platform}
+%\end{table}
+
+\begin{figure}[tp]
\begin{algorithmic}[1]
- \State Initialize the variable $Dist=0$
- \State Set dynamic and static power values.
- \State Set $P_{states}$ to the number of available frequencies.
- \State Set the variable $F_{new}$ to max. frequency, $F_{new} = F_{max} $
- \State Set the variable $F_{diff}$ to the scale value between each two frequencies.
- \For {$i=1$ to $P_{states} $}
- \State - Calculate the new frequency as $F_{new}=F_{new} - F_{diff} $
- \State - Calculate the scale factor $S$ as in EQ~(\ref{eq:s}).
- \State - Calculate all available scales $S_i$ depend on $S$ as\par\hspace{1 pt} in EQ~(\ref{eq:si}).
- \State - Select the maximum scale factor $S_1$ from the set\par\hspace{1 pt} of scales $S_i$.
- \State - Calculate the normalize energy $E_{Norm}=E_{R}/E_{O}$
- \par\hspace{1 pt} as in EQ~(\ref{eq:enorm}).
- \State - Calculate the normalize inverse of performance\par\hspace{1 pt}
- $P_{NormInv}=T_{old}/T_{new}$ as in EQ~(\ref{eq:pnorm_en}).
- \If{ $(P_{NormInv}-E_{Norm} > Dist$) }
- \State $S_{optimal} = S$
- \State $Dist = P_{NormInv} - E_{Norm}$
+ % \footnotesize
+ \For {$k=1$ to \textit{some iterations}}
+ \State Computations section.
+ \State Communications section.
+ \If {$(k=1)$}
+ \State Gather all times of computation and\newline\hspace*{3em}%
+ communication from each node.
+ \State Call algorithm from Figure~\ref{EPSA} with these times.
+ \State Compute the new frequency from the\newline\hspace*{3em}%
+ returned optimal scaling factor.
+ \State Set the new frequency to the CPU.
\EndIf
\EndFor
- \State Return $S_{optimal}$
\end{algorithmic}
-\end{algorithm}
-The proposed EPSA algorithm works online during the execution time of the MPI
-program. It selects the optimal scaling factor by gathering some information
-from the program after one iteration. This algorithm has small execution time
-(between 0.00152 $ms$ for 4 nodes to 0.00665 $ms$ for 32 nodes). The data
-required by this algorithm is the computation time and the communication time
-for each task from the first iteration only. When these times are measured, the
-MPI program calls the EPSA algorithm to choose the new frequency using the
-optimal scaling factor. Then the program set the new frequency to the
-system. The algorithm is called just one time during the execution of the
-program. The DVFS algorithm~(\ref{dvfs}) shows where and when the EPSA algorithm is called
-in the MPI program.
-%\begin{minipage}{\textwidth}
-%\AG{Use the same format as for Algorithm~\ref{$EPSA$}}
-
-\begin{algorithm}[tp]
- \caption{DVFS}
+ \caption{DVFS algorithm}
\label{dvfs}
- \begin{algorithmic}[1]
- \For {$J:=1$ to $Some-Iterations \; $}
- \State -Computations Section.
- \State -Communications Section.
- \If {$(J==1)$}
- \State -Gather all times of computation and\par\hspace{13 pt} communication from each node.
- \State -Call EPSA with these times.
- \State -Calculate the new frequency from optimal scale.
- \State -Set the new frequency to the system.
- \EndIf
-\EndFor
-\end{algorithmic}
-\end{algorithm}
-
-After obtaining the optimal scale factor from the EPSA algorithm. The program
-calculates the new frequency $F_i$ for each task proportionally to its time
-value $T_i$. By substitution of the EQ~(\ref{eq:s}) in the EQ~(\ref{eq:si}), we
-can calculate the new frequency $F_i$ as follows:
+\end{figure}
+After obtaining the optimal scaling factor, the program calculates the new
+frequency $F_i$ for each task proportionally to its time value $T_i$. By
+substitution of EQ~\eqref{eq:s} in EQ~\eqref{eq:si}, we can calculate the new
+frequency $F_i$ as follows:
\begin{equation}
\label{eq:fi}
- F_i = \frac{F_\textit{max} \cdot T_i}{S_\textit{optimal} \cdot T_\textit{max}}
+ F_i = \frac{\Fmax \cdot T_i}{\Sopt \cdot \Tmax}
\end{equation}
According to this equation all the nodes may have the same frequency value if
-they have balanced workloads. Otherwise, they take different frequencies when
-have imbalanced workloads. Then EQ~(\ref{eq:fi}) works in adaptive way to change
-the frequency according to the nodes workloads.
+they have balanced workloads, otherwise, they take different frequencies when
+having imbalanced workloads. Thus, EQ~\eqref{eq:fi} adapts the frequency of the
+CPU to the nodes' workloads to maintain the performance of the program.
-\section{Experimental Results}
+\section{Experimental results}
\label{sec.expe}
-
-The proposed EPSA algorithm was applied to seven MPI programs of the NAS
-benchmarks (EP, CG, MG, FT, BT, LU and SP). We work on three classes (A, B and
-C) for each program. Each program runs on specific number of processors
+Our experiments are executed on the simulator SimGrid/SMPI v3.10. We configure
+the simulator to use a homogeneous cluster with one core per node.
+%The detailed characteristics of our platform file are shown in Table~\ref{table:platform}.
+Each node in the cluster has 18 frequency values
+from \np[GHz]{2.5} to \np[MHz]{800} with \np[MHz]{100} difference between each
+two successive frequencies. The nodes are connected via an ethernet network with 1Gbit/s bandwidth.
+
+\subsection{Execution time prediction verification}
+
+In this section we evaluate the precision of our execution time prediction method
+based on EQ~\eqref{eq:tnew} by applying it to the NAS benchmarks. The NAS programs
+are executed with the class B option to compare the real execution time with
+the predicted execution time. Each program runs offline with all available
+scaling factors on 8 or 9 nodes (depending on the benchmark) to produce real
+execution time values. These scaling factors are computed by dividing the
+maximum frequency by the new one see EQ~\eqref{eq:s}.
+\begin{figure}
+ \centering
+ \includegraphics[width=.5\linewidth]{fig/cg_per}\hfill%
+ % \includegraphics[width=.5\linewidth]{fig/mg_pre}\hfill%
+ % \includegraphics[width=.5\linewidth]{fig/bt_pre}\qquad%
+ \includegraphics[width=.5\linewidth]{fig/lu_pre}\hfill%
+ \caption{Comparing predicted to real execution times}
+ \label{fig:pred}
+\end{figure}
+%see Figure~\ref{fig:pred}
+In our cluster there are 18 available frequency states for each processor. This
+leads to 18 run states for each program. We use seven MPI programs of the NAS
+parallel benchmarks: CG, MG, EP, FT, BT, LU and SP. Figure~\ref{fig:pred}
+presents plots of the real execution times and the simulated ones. The maximum
+normalized error between these two execution times varies between \np{0.0073} to
+\np{0.031} dependent on the executed benchmark. The smallest prediction error
+was for CG and the worst one was for LU.
+
+\subsection{The experimental results for the scaling algorithm }
+The proposed algorithm was applied to seven MPI programs of the NAS benchmarks
+(EP, CG, MG, FT, BT, LU and SP) which were run with three classes (A, B and C).
+For each instance the benchmarks were executed on a number of processors
proportional to the size of the class. Each class represents the problem size
-ascending from the class A to C. Additionally, depending on some speed up points
+ascending from class A to C. Additionally, depending on some speed up points
for each class we run the classes A, B and C on 4, 8 or 9 and 16 nodes
-respectively. Our experiments are executed on the simulator SimGrid/SMPI
-v3.10. We design a platform file that simulates a cluster with one core per
-node. This cluster is a homogeneous architecture with distributed memory. The
-detailed characteristics of our platform file are shown in the
-table~(\ref{table:platform}). Each node in the cluster has 18 frequency values
-from 2.5 GHz to 800 MHz with 100 MHz difference between each two successive
-frequencies.
-\begin{table}[htb]
- \caption{Platform File Parameters}
- % title of Table
- \centering
- \begin{tabular}{ | l | l | l |l | l |l |l | p{2cm} |}
- \hline
- Max & Min & Backbone & Backbone&Link &Link& Sharing \\
- Freq. & Freq. & Bandwidth & Latency & Bandwidth& Latency&Policy \\ \hline
- 2.5 &800 & 2.25 GBps &$5\times 10^{-7} s$& 1 GBps & $5\times 10^{-5} s$ &Full \\
- GHz& MHz& & & & &Duplex \\\hline
- \end{tabular}
- \label{table:platform}
-\end{table}
-Depending on the EQ~(\ref{eq:energy}), we measure the energy consumption for all
-the NAS MPI programs while assuming the power dynamic is equal to 20W and the
-power static is equal to 4W for all experiments. We run the proposed EPSA
-algorithm for all these programs. The results showed that the algorithm selected
+respectively. Depending on EQ~\eqref{eq:energy}, we measure the energy
+consumption for all the NAS MPI programs while assuming that the dynamic power
+with the highest frequency is equal to \np[W]{20} and the power static is equal
+to \np[W]{4} for all experiments. These power values were also used by Rauber
+and Rünger in~\cite{3}. The results showed that the algorithm selected
different scaling factors for each program depending on the communication
-features of the program as in the figure~(\ref{fig:nas}). This figure shows that
-there are different distances between the normalized energy and the normalized
-inversed performance curves, because there are different communication features
-for each MPI program. When there are little or not communications, the inversed
-performance curve is very close to the energy curve. Then the distance between
-the two curves is very small. This lead to small energy savings. The opposite
-happens when there are a lot of communication, the distance between the two
-curves is big. This lead to more energy savings (e.g. CG and FT), see
-table~(\ref{table:factors results}). All discovered frequency scaling factors
-optimize both the energy and the performance simultaneously for all the NAS
-programs. In table~(\ref{table:factors results}), we record all optimal scaling
-factors results for each program on class C. These factors give the maximum
-energy saving percent and the minimum performance degradation percent in the
-same time over all available scales.
+features of the program as in the plots from Figure~\ref{fig:nas}. These plots
+illustrate that there are different distances between the normalized energy and
+the normalized inverted execution time curves, because there are different
+communication features for each benchmark. When there are little or no
+communications, the inverted execution time curve is very close to the energy
+curve. Then the distance between the two curves is very small. This leads to
+small energy savings. The opposite happens when there are a lot of
+communication, the distance between the two curves is big. This leads to more
+energy savings (e.g. CG and FT), see Table~\ref{table:compareC}. All discovered
+frequency scaling factors optimize both the energy and the execution time
+simultaneously for all NAS benchmarks. In Table~\ref{table:compareC}, we record
+all optimal scaling factors results for each benchmark running class C. These
+scaling factors give the maximum energy saving percentage and the minimum
+performance degradation percentage at the same time from all available scaling
+factors.
\begin{figure*}[t]
\centering
- \includegraphics[width=.33\textwidth]{ep.eps}\hfill%
- \includegraphics[width=.33\textwidth]{cg.eps}\hfill%
- \includegraphics[width=.33\textwidth]{sp.eps}
- \includegraphics[width=.33\textwidth]{lu.eps}\hfill%
- \includegraphics[width=.33\textwidth]{bt.eps}\hfill%
- \includegraphics[width=.33\textwidth]{ft.eps}
- \caption{Optimal scaling factors for The NAS MPI Programs}
+ \includegraphics[width=.33\linewidth]{fig/ep}\hfill%
+ \includegraphics[width=.33\linewidth]{fig/cg}\hfill%
+ % \includegraphics[width=.328\linewidth]{fig/sp}
+ % \includegraphics[width=.328\linewidth]{fig/lu}\hfill%
+ \includegraphics[width=.33\linewidth]{fig/bt}
+ % \includegraphics[width=.328\linewidth]{fig/ft}
+ \caption{Optimal scaling factors for the predicted energy and performance of NAS benchmarks}
\label{fig:nas}
\end{figure*}
-\begin{table}[htb]
- \caption{Optimal Scaling Factors Results}
- % title of Table
- \centering
- \AG{Use the same number of decimals for all numbers in a column,
- and vertically align the numbers along the decimal points.
- The same for all the following tables.}
- \begin{tabular}{|l|*{4}{r|}}
- \hline
- Program & Optimal & Energy & Performance&Energy-Perf.\\
- Name & Scaling Factor& Saving \%&Degradation \% &Distance \\ \hline
- CG & 1.56 &39.23&14.88 &24.35\\ \hline
- MG & 1.47 &34.97&21.70 &13.27 \\ \hline
- EP & 1.04 &22.14&20.73 &1.41\\ \hline
- LU & 1.38 &35.83&22.49 &13.34\\ \hline
- BT & 1.31 &29.60&21.28 &8.32\\ \hline
- SP & 1.38 &33.48&21.36 &12.12\\ \hline
- FT & 1.47 &34.72&19.00 &15.72\\ \hline
- \end{tabular}
- \label{table:factors results}
- % is used to refer this table in the text
-\end{table}
-As shown in the table~(\ref{table:factors results}), when the optimal scaling
-factor has big value we can gain more energy savings for example as in CG and
-FT. The opposite happens when the optimal scaling factor is small value as
-example BT and EP. Our algorithm selects big scaling factor value when the
+As shown in Table~\ref{table:compareC}, when the optimal scaling factor has a
+big value we can gain more energy savings as in CG and FT benchmarks. The
+opposite happens when the optimal scaling factor has a small value as in BT and
+EP benchmarks. Our algorithm selects a big scaling factor value when the
communication and the other slacks times are big and smaller ones in opposite
-cases. In EP there are no communications inside the iterations. This make our
-EPSA to selects smaller scaling factor values (inducing smaller energy savings).
-
-\section{Comparing Results}
-\label{sec.compare}
-
-In this section, we compare our EPSA algorithm results with Rauber and Rünger
-methods~\cite{3}. He had two scenarios, the first is to reduce energy to optimal
-level without considering the performance as in EQ~(\ref{eq:sopt}). We refer to
-this scenario as $R_{E}$. The second scenario is similar to the first
-except setting the slower task to the maximum frequency (when the scale $S=1$)
-to keep the performance from degradation as mush as possible. We refer to this
-scenario as $R_{E-P}$. The comparison is made in tables~(\ref{table:compare
- Class A},\ref{table:compare Class B},\ref{table:compare Class C}). These
-tables show the results of our EPSA and Rauber and Rünger scenarios for all the NAS
-benchmarks programs for classes A,B and C.
-\begin{table}[p]
- \caption{Comparing Results for The NAS Class A}
+cases. In EP there are no communication inside the iterations. This leads our
+algorithm to select smaller scaling factor values (inducing smaller energy
+savings).
+
+\subsection{Results comparison}
+
+In this section, we compare our scaling factor selection method with Rauber and
+Rünger methods~\cite{3}. They had two scenarios, the first is to reduce energy
+to the optimal level without considering the execution time as in
+EQ~\eqref{eq:sopt}. We refer to this scenario as $R_{E}$. The second scenario
+is similar to the first except setting the slower task to the maximum frequency
+(when the scale $S=1$) to keep the performance from degradation as mush as
+possible. We refer to this scenario as $R_{E-P}$. While we refer to our
+algorithm as EPSA (Energy to Performance Scaling Algorithm). The comparison is
+made in Table~\ref{table:compareC}. This table shows the results of our method and
+Rauber and Rünger scenarios for all the NAS benchmarks programs for class C.
+
+\begin{table}
+ \caption{Comparing results for the NAS class C}
% title of Table
\centering
\begin{tabular}{|l|l|*{4}{r|}}
\hline
- Method&Program&Factor& Energy& Performance &Energy-Perf.\\
- Name &Name&Value& Saving \%&Degradation \% &Distance
- \\ \hline
- % \rowcolor[gray]{0.85}
- $EPSA$&CG & 1.56 &37.02 & 13.88 & 23.14\\ \hline
- $R_{E-P}$&CG &2.14 &42.77 & 25.27 & 17.50\\ \hline
- $R_{E}$&CG &2.14 &42.77&26.46&16.31\\ \hline
-
- $EPSA$&MG & 1.47 &27.66&16.82&10.84\\ \hline
- $R_{E-P}$&MG &2.14&34.45&31.84&2.61\\ \hline
- $R_{E}$&MG &2.14&34.48&33.65&0.80 \\ \hline
-
- $EPSA$&EP &1.19 &25.32&20.79&4.53\\ \hline
- $R_{E-P}$&EP&2.05&41.45&55.67&-14.22\\ \hline
- $R_{E}$&EP&2.05&42.09&57.59&-15.50\\ \hline
-
- $EPSA$&LU&1.56& 39.55 &19.38& 20.17\\ \hline
- $R_{E-P}$&LU&2.14&45.62&27.00&18.62 \\ \hline
- $R_{E}$&LU&2.14&45.66&33.01&12.65\\ \hline
-
- $EPSA$&BT&1.31& 29.60&20.53&9.07 \\ \hline
- $R_{E-P}$&BT&2.10&45.53&49.63&-4.10\\ \hline
- $R_{E}$&BT&2.10&43.93&52.86&-8.93\\ \hline
-
- $EPSA$&SP&1.38& 33.51&15.65&17.86 \\ \hline
- $R_{E-P}$&SP&2.11&45.62&42.52&3.10\\ \hline
- $R_{E}$&SP&2.11&45.78&43.09&2.69\\ \hline
-
- $EPSA$&FT&1.25&25.00&10.80&14.20 \\ \hline
- $R_{E-P}$&FT&2.10&39.29&34.30&4.99 \\ \hline
- $R_{E}$&FT&2.10&37.56&38.21&-0.65\\ \hline
- \end{tabular}
- \label{table:compare Class A}
- % is used to refer this table in the text
-\end{table}
-\begin{table}[p]
- \caption{Comparing Results for The NAS Class B}
- % title of Table
- \centering
- \begin{tabular}{|l|l|*{4}{r|}}
+ Method & Program & Factor & Energy & Performance & Energy-Perf. \\
+ Name & Name & Value & Saving \% & Degradation \% & Distance \\
\hline
- Method&Program&Factor& Energy& Performance &Energy-Perf.\\
- Name &Name&Value& Saving \%&Degradation \% &Distance
- \\ \hline
% \rowcolor[gray]{0.85}
- $EPSA$&CG & 1.66 &39.23&16.63&22.60 \\ \hline
- $R_{E-P}$&CG &2.15 &45.34&27.60&17.74\\ \hline
- $R_{E}$&CG &2.15 &45.34&28.88&16.46\\ \hline
+ $EPSA$ & CG & 1.56 & 39.23 & 14.88 & 24.35 \\ \hline
+ $R_{E-P}$ & CG & 2.15 & 45.36 & 25.89 & 19.47 \\ \hline
+ $R_{E}$ & CG & 2.15 & 45.36 & 26.70 & 18.66 \\ \hline
- $EPSA$ &MG & 1.47 &34.98&18.35&16.63\\ \hline
- $R_{E-P}$&MG &2.14&43.55&36.42&7.13 \\ \hline
- $R_{E}$&MG &2.14&43.56&37.07&6.49 \\ \hline
+ $EPSA$ & MG & 1.47 & 34.97 & 21.69 & 13.27 \\ \hline
+ $R_{E-P}$ & MG & 2.15 & 43.65 & 40.45 & 3.20 \\ \hline
+ $R_{E}$ & MG & 2.15 & 43.64 & 41.38 & 2.26 \\ \hline
- $EPSA$&EP &1.08 &20.29&17.15&3.14 \\ \hline
- $R_{E-P}$&EP&2.00&42.38&56.88&-14.50\\ \hline
- $R_{E}$&EP&2.00&39.73&59.94&-20.21\\ \hline
+ $EPSA$ & EP & 1.04 & 22.14 & 20.73 & 1.41 \\ \hline
+ $R_{E-P}$ & EP & 1.92 & 39.40 & 56.33 & -16.93 \\ \hline
+ $R_{E}$ & EP & 1.92 & 38.10 & 56.35 & -18.25 \\ \hline
- $EPSA$&LU&1.47&38.57&21.34&17.23 \\ \hline
- $R_{E-P}$&LU&2.10&43.62&36.51&7.11 \\ \hline
- $R_{E}$&LU&2.10&43.61&38.54&5.07 \\ \hline
+ $EPSA$ & LU & 1.38 & 35.83 & 22.49 & 13.34 \\ \hline
+ $R_{E-P}$ & LU & 2.15 & 44.97 & 41.00 & 3.97 \\ \hline
+ $R_{E}$ & LU & 2.15 & 44.97 & 41.80 & 3.17 \\ \hline
- $EPSA$&BT&1.31& 29.59&20.88&8.71\\ \hline
- $R_{E-P}$&BT&2.10&44.53&53.05&-8.52\\ \hline
- $R_{E}$&BT&2.10&42.93&52.80&-9.87\\ \hline
+ $EPSA$ & BT & 1.31 & 29.60 & 21.28 & 8.32 \\ \hline
+ $R_{E-P}$ & BT & 2.13 & 45.60 & 49.84 & -4.24 \\ \hline
+ $R_{E}$ & BT & 2.13 & 44.90 & 55.16 & -10.26 \\ \hline
- $EPSA$&SP&1.38&33.44&19.24&14.20 \\ \hline
- $R_{E-P}$&SP&2.15&45.69&43.20&2.49\\ \hline
- $R_{E}$&SP&2.15&45.41&44.47&0.94\\ \hline
+ $EPSA$ & SP & 1.38 & 33.48 & 21.35 & 12.12 \\ \hline
+ $R_{E-P}$ & SP & 2.10 & 45.69 & 43.60 & 2.09 \\ \hline
+ $R_{E}$ & SP & 2.10 & 45.75 & 44.10 & 1.65 \\ \hline
- $EPSA$&FT&1.38&34.40&14.57&19.83 \\ \hline
- $R_{E-P}$&FT&2.13&42.98&37.35&5.63 \\ \hline
- $R_{E}$&FT&2.13&43.04&37.90&5.14\\ \hline
+ $EPSA$ & FT & 1.47 & 34.72 & 19.00 & 15.72 \\ \hline
+ $R_{E-P}$ & FT & 2.04 & 39.40 & 37.10 & 2.30 \\ \hline
+ $R_{E}$ & FT & 2.04 & 39.35 & 37.70 & 1.65 \\ \hline
\end{tabular}
- \label{table:compare Class B}
+ \label{table:compareC}
% is used to refer this table in the text
\end{table}
-
-\begin{table}[p]
- \caption{Comparing Results for The NAS Class C}
- % title of Table
- \centering
- \begin{tabular}{|l|l|*{4}{r|}}
- \hline
- Method&Program&Factor& Energy& Performance &Energy-Perf.\\
- Name &Name&Value& Saving \%&Degradation \% &Distance
- \\ \hline
- % \rowcolor[gray]{0.85}
- $EPSA$&CG & 1.56 &39.23&14.88&24.35 \\ \hline
- $R_{E-P}$&CG &2.15 &45.36&25.89&19.47\\ \hline
- $R_{E}$&CG &2.15 &45.36&26.70&18.66\\ \hline
-
- $EPSA$&MG & 1.47 &34.97&21.69&13.27\\ \hline
- $R_{E-P}$&MG &2.15&43.65&40.45&3.20 \\ \hline
- $R_{E}$&MG &2.15&43.64&41.38&2.26 \\ \hline
-
- $EPSA$&EP &1.04 &22.14&20.73&1.41 \\ \hline
- $R_{E-P}$&EP&1.92&39.40&56.33&-16.93\\ \hline
- $R_{E}$&EP&1.92&38.10&56.35&-18.25\\ \hline
-
- $EPSA$&LU&1.38&35.83&22.49&13.34 \\ \hline
- $R_{E-P}$&LU&2.15&44.97&41.00&3.97 \\ \hline
- $R_{E}$&LU&2.15&44.97&41.80&3.17 \\ \hline
-
- $EPSA$&BT&1.31& 29.60&21.28&8.32\\ \hline
- $R_{E-P}$&BT&2.13&45.60&49.84&-4.24\\ \hline
- $R_{E}$&BT&2.13&44.90&55.16&-10.26\\ \hline
-
- $EPSA$&SP&1.38&33.48&21.35&12.12\\ \hline
- $R_{E-P}$&SP&2.10&45.69&43.60&2.09\\ \hline
- $R_{E}$&SP&2.10&45.75&44.10&1.65\\ \hline
-
- $EPSA$&FT&1.47&34.72&19.00&15.72 \\ \hline
- $R_{E-P}$&FT&2.04&39.40&37.10&2.30\\ \hline
- $R_{E}$&FT&2.04&39.35&37.70&1.65\\ \hline
- \end{tabular}
-\label{table:compare Class C}
-% is used to refer this table in the text
-\end{table}
-As shown in these tables our scaling factor is not optimal for energy saving
-such as Rauber's scaling factor EQ~(\ref{eq:sopt}), but it is optimal for both
-the energy and the performance simultaneously. Our $EPSA$ optimal scaling factors
-has better simultaneous optimization for both the energy and the performance
-compared to Rauber and Rünger energy-performance method ($R_{E-P}$). Also, in
-($R_{E-P}$) method when setting the frequency to maximum value for the
-slower task lead to a small improvement of the performance. Also the results
-show that this method keep or improve energy saving. Because of the energy
-consumption decrease when the execution time decreased while the frequency value
-increased.
-
-Figure~(\ref{fig:compare}) shows the maximum distance between the energy saving
-percent and the performance degradation percent. Therefore, this means it is the
-same resultant of our objective function EQ~(\ref{eq:max}). Our algorithm always
-gives positive energy to performance trade offs while Rauber and Rünger method
-($R_{E-P}$) gives in some time negative trade offs such as in BT and
-EP. The positive trade offs with highest values lead to maximum energy savings
-concatenating with less performance degradation and this the objective of this
-paper. While the negative trade offs refers to improving energy saving (or may
-be the performance) while degrading the performance (or may be the energy) more
-than the first.
+As shown in Table~\ref{table:compareC}, the ($R_{E-P}$) method outperforms the ($R_{E}$)
+method in terms of performance and energy reduction. The ($R_{E-P}$) method
+also gives better energy savings than our method. However, although our scaling
+factor is not optimal for energy reduction, the results in this table prove
+that our algorithm returns the best scaling factor that satisfy our objective
+method: the largest distance between energy reduction and performance
+degradation. Figure~\ref{fig:compare} illustrates even better the distance between
+the energy reduction and performance degradation. The negative values mean that one of
+the two objectives (energy or performance) have been degraded more than the
+other. The positive trade-offs with the highest values lead to maximum energy
+savings while keeping the performance degradation as low as possible. Our
+algorithm always gives the highest positive energy to performance trade-offs
+while Rauber and Rünger's method, ($R_{E-P}$), gives sometimes negative
+trade-offs such as in BT and EP.
\begin{figure}[t]
\centering
- \includegraphics[width=.33\textwidth]{compare_class_A.pdf}
- \includegraphics[width=.33\textwidth]{compare_class_B.pdf}
- \includegraphics[width=.33\textwidth]{compare_class_c.pdf}
- \caption{Comparing Our EPSA with Rauber and Rünger Methods}
+% \includegraphics[width=.328\linewidth]{fig/compare_class_A}
+% \includegraphics[width=.328\linewidth]{fig/compare_class_B}
+ \includegraphics[width=\linewidth]{fig/compare_class_C}
+ \caption{Comparing our method to Rauber and Rünger's methods}
\label{fig:compare}
\end{figure}
\section{Conclusion}
\label{sec.concl}
-\AG{the conclusion needs to be written\dots{} one day}
+In this paper, we have presented a new online scaling factor selection method
+that optimizes simultaneously the energy and performance of a distributed
+application running on a homogeneous cluster. It uses the computation and
+communication times measured at the first iteration to predict energy
+consumption and the execution time of the parallel application at every available
+frequency. Then, it selects the scaling factor that gives the best trade-off
+between energy reduction and performance which is the maximum distance between
+the energy and the inverted execution time curves. To evaluate this method, we
+have applied it to the NAS benchmarks and it was compared to Rauber and Rünger
+methods while being executed on the simulator SimGrid. The results showed that
+our method, outperforms Rauber and Rünger's methods in terms of energy-performance
+ratio.
+
+In the near future, we would like to adapt this scaling factor selection method
+to heterogeneous platforms where each node has different characteristics. In
+particular, each CPU has different available frequencies, energy consumption and
+performance. It would be also interesting to develop a new energy model for
+asynchronous parallel iterative methods where the number of iterations is not
+known in advance and depends on the global convergence of the iterative system.
\section*{Acknowledgment}
-\AG{Right?}
-Computations have been performed on the supercomputer facilities of the
-Mésocentre de calcul de Franche-Comté.
+This work has been partially supported by the Labex ACTION project (contract
+``ANR-11-LABX-01-01''). Computations have been performed on the supercomputer
+facilities of the Mésocentre de calcul de Franche-Comté. As a PhD student,
+Mr. Ahmed Fanfakh, would like to thank the University of Babylon (Iraq) for
+supporting his work.
% trigger a \newpage just before the given reference
% number - used to balance the columns on the last page
% the document is modified later
%\IEEEtriggeratref{15}
+\newpage
\bibliographystyle{IEEEtran}
\bibliography{IEEEabrv,my_reference}
\end{document}
%%% End:
% LocalWords: Fanfakh Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber
-% LocalWords: CMOS EQ EPSA Franche Comté Tflop Rünger
+% LocalWords: CMOS EQ EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex