-\AG{Consider introducing the models sec.~\ref{sec.exe} maybe before related works}
-
-In the this section some heuristics to compute the scaling factor are
-presented and classified in two parts: offline and online methods.
-
-\subsection{The offline DVFS orientations}
-
-The DVFS offline methods are static and are not executed during the runtime of
-the program. Some approaches used heuristics to select the best DVFS state
-during the compilation phases as for example in Azevedo et al.~\cite{40}. They
-use dynamic voltage scaling (DVS) algorithm to choose the DVS setting when there
-are dependency points between tasks. While in~\cite{29}, Xie et al. used
-breadth-first search algorithm to do that. Their goal is to save energy with
-time limits. Another approach gathers and stores the runtime information for
-each DVFS state, then selects the suitable DVFS offline to optimize energy-time
-trade offs. As an example, Rountree et al.~\cite{8} use liner programming
-algorithm, while in~\cite{38,34}, Cochran et al. use multi logistic regression
-algorithm for the same goal. The offline study that shows the DVFS impact on the
-communication time of the MPI program is~\cite{17}, where Freeh et al. show that
-these times do not change when the frequency is scaled down.
-
-\subsection{The online DVFS orientations}
-
-The objective of the online DVFS orientations is to dynamically compute and set
-the frequency of the CPU for saving energy during the runtime of the
-programs. Estimating and predicting approaches for the energy-time trade offs
-are developed by Kimura, Peraza, Yu-Liang et al. ~\cite{11,2,31}. These works
-select the best DVFS setting depending on the slack times. These times happen
-when the processors have to wait for data from other processors to compute their
-task. For example, during the synchronous communications that take place in MPI
-programs, some processors are idle. The optimal DVFS can be selected using
-learning methods. Therefore, in Dhiman, Hao Shen et al. ~\cite{39,19} used
-machine learning to converge to the suitable DVFS configuration. Their learning
-algorithms take big time to converge when the number of available frequencies is
-high. Also, the communication sections of the MPI program can be used to save
-energy. In~\cite{1}, Lim et al. developed an algorithm that detects the
-communication sections and changes the frequency during these sections
-only. This approach changes the frequency many times because an iteration may
-contain more than one communication section. The domain of analytical modeling
-can also be used for choosing the optimal frequency as in Rauber and
-Rünger~\cite{3}. They developed an analytical mathematical model to determine
-the optimal frequency scaling factor for any number of concurrent tasks. They
-set the slowest task to maximum frequency for maintaining performance. In this
-paper we compare our algorithm with Rauber and Rünger model~\cite{3}, because
-their model can be used for any number of concurrent tasks for homogeneous
-platforms. The primary contributions of this paper are:
-\begin{enumerate}
-\item Selecting the frequency scaling factor for simultaneously optimizing energy and performance,
- while taking into account the communication time.
-\item Adapting our scaling factor to take into account the imbalanced tasks.
-\item The execution time of our algorithm is very small when compared to other
- methods (e.g.,~\cite{19}).
-\item The proposed algorithm works online without profiling or training as
- in~\cite{38,34}.
-\end{enumerate}
-\section{Execution and Energy of Parallel Tasks on Homogeneous Platform}
-\label{sec.exe}
-
-\subsection{Parallel Tasks Execution on Homogeneous Platform}
-A homogeneous cluster consists of identical nodes in terms of hardware and software.
-Each node has its own memory and at least one processor which can
-be a multi-core. The nodes are connected via a high bandwidth network. Tasks
-executed on this model can be either synchronous or asynchronous. In this paper
-we consider execution of the synchronous tasks on distributed homogeneous
-platform. These tasks can exchange the data via synchronous message passing.
-\begin{figure*}[t]
- \centering
- \subfloat[Sync. Imbalanced Communications]{\includegraphics[scale=0.67]{commtasks}\label{fig:h1}}
- \subfloat[Sync. Imbalanced Computations]{\includegraphics[scale=0.67]{compt}\label{fig:h2}}
- \caption{Parallel Tasks on Homogeneous Platform}
- \label{fig:homo}
-\end{figure*}
-Therefore, the execution time of a task consists of the computation time and the
-communication time. Moreover, the synchronous communications between tasks can
-lead to idle time while tasks wait at the synchronization barrier for other tasks to
-finish their communications (see figure~(\ref{fig:h1})). The imbalanced communications
-happen when nodes have to send/receive different amount of data or each node is communicates
-with different number of nodes. Another source for idle times is the imbalanced computations.
-This happens when processing different amounts of data on each processor (see figure~(\ref{fig:h2})).
-In this case the fastest tasks have to wait at the synchronization barrier for the
-slowest tasks to finish their job. In both cases the overall execution time
-of the program is the execution time of the slowest task as:
-\begin{equation}
- \label{eq:T1}
- \textit{Program Time} = \max_{i=1,2,\dots,N} T_i
-\end{equation}
-where $T_i$ is the execution time of task $i$.
-
-\subsection{Energy Model for Homogeneous Platform}