cluster $i$, $\TcpOld[ij]$ is the computation time of processor $j$ in the cluster $i$
and $\Tcm[hj]$ is the communication time of processor $j$ in the cluster $h$ during the
first iteration. The execution time for one iteration is equal to the sum of the maximum computation time for all nodes with the new scaling factors
-and the communication time of the slower node without slack time during one iteration.
-The slower node $h$ is the node that gives the maximum execution time in all the clusters before applying DVFS.
+and the communication time of the slowest node without slack time during one iteration.
+ The slowest node $h$ is the node which takes the maximum execution time to execute an iteration before scaling down its frequency.
It means that only the communication time without any slack time is taken into account.
Therefore, the execution time of the application is equal to
the execution time of one iteration as in Equation (\ref{eq:perf}) multiplied by the
Both methods select the frequencies that give the best trade-off between
energy consumption reduction and performance for message passing
synchronous applications \textcolor{blue}{with iterations}. In this work we
-are interested in grids that are composed of heterogeneous clusters, \textcolor{blue}{where} the nodes
-have different characteristics such as dynamic power, static power, computation power,
+are interested in grids that are composed of heterogeneous clusters. The nodes from distinct clusters may have
+ different characteristics such as dynamic power, static power, computation power,
frequency range, network latency and bandwidth.
Due to the heterogeneity of the processors, a vector of scaling factors should be selected
and it must give the best trade-off between energy consumption and performance.
publisher = {IEEE Computer Society},
address = {Washington, DC, USA}
}
-@inproceedings{4,
+@inproceedings{5,
author={Charr, Jean-Claude and Couturier, Rapha\"{e}l and Fanfakh, Ahmed and Giersch, Arnaud},
booktitle={Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International},
title={Energy Consumption Reduction with DVFS for Message Passing Iterative Applications on Heterogeneous Architectures},
keywords={Arrays;Computational modeling;Degradation;Energy consumption;Mathematical model;Message passing;Time-frequency analysis},
doi={10.1109/IPDPSW.2015.44},
month={May}
+}
+
+@inproceedings{4,
+ title = {Dynamic frequency scaling for energy consumption reduction in synchronous distributed applications},
+ author = {Charr, Jean-Claude and Couturier, Rapha\"{e}l and Fanfakh, Ahmed and Giersch, Arnaud},
+ year = {2014},
+ address = {Milan, Italy},
+ booktitle = {ISPA 2014, 12th IEEE Int. Symposium on Parallel and Distributed Processing with Applications},
+ month = {aug},
+ pages = {225--230},
+ doi = {10.1109/ISPA.2014.38},
+ publisher = {IEEE}
}
\ No newline at end of file
\begin{enumerate}
\item Move the contributions from related work to introduction
-\item why emphasize it is a grid platform? the presentation of related work follows the logic of heterogeneous CPUs. Grid is only a type of platform with heterogeneous CPUs
+\item Why emphasize it is a grid platform? The presentation of related work follows the logic of heterogeneous CPUs. Grid is only a type of platform with heterogeneous CPUs.
\textbf{Answer:} We agree with the reviewer that a grid is a type of heterogeneous architecture and the proposed algorithm can also work on any heterogeneous architecture.
- In \cite{4}, we have proposed a frequency selection algorithm for distributed applications running on heterogeneous clusters, while in this work, the proposed algorithm was adapted to the grid architecture which is composed
+ In \cite{5}, we have proposed a frequency selection algorithm for distributed applications running on heterogeneous clusters, while in this work, the proposed algorithm was adapted to the grid architecture which is composed
of homogeneous clusters interconnected by a wide area network which is slower than the local
network in each cluster.
\textbf{Answer:} Figure 1 was redrawn and the white space before the barrier is the slack time. Slack times occur when a node has to wait for another node to finish its computation to synchronously communicate with it. In Figure 1, task 1 was assumed to be the slowest task. All the other tasks will finish their computations before the slowest task and wait until it finishes its computation before being able to synchronously communicate with it. This waiting time is the slack time and, since the slowest task does not have to wait for the other tasks, it has almost no slack time.
-\item define the parameters in eq. 1.
+\item Define the parameters in eq. 1.
\textbf{Answer:} Fmax and Fnew have been defined as follows in the paper: ``$\Fmax$ is the maximum frequency before applying any DVFS and $\Fnew$ is the new frequency after applying DVFS''.
-\item eq. 2: are you assuming each cluster has the same number of nodes?
+\item Eq. 2: are you assuming each cluster has the same number of nodes?
\textbf{Answer:} No, each cluster can have a different number of nodes. Therefore, in the paper, $M$, the number of nodes in a cluster, was replaced by $M_i$, the number of nodes in cluster $i$, in all the equations.
Asynchronous applications are beyond the scope of this paper and will be considered in a future work.
-\item eq. 2 is not clear:
-
--how to define and determine the slowest cluster h? the one before scaling or after scaling?
+\item Eq. 2 is not clear:
+\begin{enumerate}
+\item How to define and determine the slowest cluster h? The one before scaling or after scaling?
-\textcolor{blue}{Answer: The slower task is the task which gives maximum execution time before scaling the frequency of the node. We have added this sentence to the paper (page 8).}
+\textbf{Answer:} The slowest node $h$ is the node which takes the maximum execution time to execute an iteration before scaling down its frequency. The previous sentence has been added to the paper.
-- what is the communication time without slack time
+\item What is the communication time without slack time?
+
+\textbf{Answer:} There are no synchronous communications with zero slack time. However, if a node sends a message to another node that is already waiting for that message, the latter will acknowledge the reception of the message from the sender without any delay. On the other hand, if the receiving node is still computing, the sender has to wait for it to finish its computation before it can acknowledge the reception of the message. This waiting time is called the slack time.
-\textcolor{blue}{Answer: There is no synchronous communications with zero slack times, but if a node send a message to another node which is already waiting for that message. The latter will acknowledge the reception of the message from the sender without any delay. On the other hand, if the receiving node is still computing the sender has to wait for it to finish its computation to acknowledge the reception of the message. This time is called the slack time. }
+\item In equation, min operation is used to get the communication time, but in text, it says to use the slowest communication time, which should use the max operation then.
+
+\textbf{Answer:} We agree with the reviewer and the phrase ``the slowest communication time'' has been changed to ``the communication time of the slowest node'' in the paper.
+
+\end{enumerate}
-- in equation, min operation is used to get the communication time, but in text, it says to use the slowest communication time, which should use the max operation then.
-\textcolor{blue}{Answer: We agree with the reviewer and the sentence "slower communication time" changed to "communication time of the slower node" in the paper.}
+\item Discuss the difference between eq. 2 and the prediction model in references \cite{4} and \cite{5}.
-\item discuss the difference between eq. 2 and the prediction model in references [5] and [6]
+\textbf{Answer:} The prediction models in \cite{4} and \cite{5} are for homogeneous and heterogeneous clusters respectively, while the model in Equation 2 is adapted for grids. We have adapted the prediction models to the used architecture. Each architecture has its own characteristics. For example, in a homogeneous cluster all the nodes have the same specifications and only one scaling factor is computed by the algorithm for all the nodes of the cluster.
+On the other hand, in a heterogeneous cluster, the nodes may have different specifications and a scaling factor should be computed for each node. The prediction models of a heterogeneous cluster can also be used for a homogeneous cluster. In the same way, the models in this paper take more characteristics into consideration, such as the different networks, in order to be adapted to grids, and they can also be applied to a heterogeneous cluster. Therefore, the models presented in this paper are more complete than those presented in \cite{4} and \cite{5}.
-\textcolor{blue}{Answer: The prediction models in [5] and [6] are for homogeneous and heterogeneous clusters respectively, while eq. 2 is for a grid. where the homogeneous cluster predication model was used one scaling factor denoted as $S$, because all the nodes in the cluster have the same computing powers. Whereas, in heterogeneous cluster prediction model all the nodes have different scales and the scaling factors have denoted as one dimensional vector $(S_1, S_2, \dots, S_N)$. The execution time prediction model for a grid Equation (2) defines a two dimensional array of scales
-$(S_{11}, S_{12},\dots, S_{NM_i})$. We have added this to the paper (page 8).}
\item Eq. 10: Can the authors comment on the energy consumed by communications?
-\textcolor{blue}{Answer: The CPU during communications consumed only the static power power. While
-in computations the CPU consumes both the dynamic and static communication, refer to \cite{3}. We have added this sentience to the paper, page 11.}
+\textbf{Answer:} During communications, the CPU only consumes static power, whereas during computations it consumes both dynamic and static power. For more information, the reviewer can refer to \cite{3}.
\item This work assume homogeneous cpu in one cluster. Line 55 says: even if the distributed message
passing iterative application is load balanced, the computation time of each cpu j in cluster i may be different Why?
-\textcolor{blue}{Answer: The computation times may be slightly different due to the delay caused
-by the scheduler of the operating system. We have added this in the paper.}
+\textbf{Answer:} In a homogeneous cluster executing a load balanced distributed application, the computation time of each node may differ slightly from that of the others due to delays caused
+by the scheduler of the operating system of the node.
\item Comment why the applications in NAS parallel benchmark are iterative application? These applications are normally run in one cluster. Describe in more detail how they are run across multiple clusters.
-\textcolor{blue}{Answer: The applications in NAS parallel benchmark are application with iterations because they iterate the same block of instructions (communications and computations) many times. All the benchmarks are MPI programs that allowed to be executed on any distributed memory platform such as clusters and grids with no required modifications. Since, we have deployed the same operating system on all the nodes, we just compile the source on one cluster and then copied the executable program on all the clusters. }
+\textbf{Answer:} The term ``iterative applications'' was replaced by ``applications with iterations'' because the proposed algorithm can be applied to any application that executes the same block of instructions many times and it is not limited to iterative methods that terminate when they converge. The NAS parallel benchmarks are applications with iterations because they iterate the same block of instructions until convergence or for a fixed number of iterations. These benchmarks can be executed on any distributed memory platform such as clusters or grids without any modification. Since we have deployed the same operating system on all the nodes, we just compile the source on one node and then copy the executable program to all the nodes. The application can then be executed with an ``mpirun'' command that takes three arguments:
+\begin{itemize}
+\item the name of the application to execute
+\item the number of processes required to execute the application
+\item the architecture file that contains the names of the nodes that will execute the application. They could be from different clusters.
+\end{itemize}
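+
+As an illustration only, assuming an Open MPI installation, such a command could look as follows. The executable name, the number of processes and the host file name are hypothetical placeholders, and other MPI implementations may use different option names:
+\begin{verbatim}
+mpirun -np 64 -hostfile nodes.txt ./application
+\end{verbatim}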
\item broken sentence in line 28 on page 12
-\textcolor{blue}{Answer: The word "were" replaced with "where".}
+\textbf{Answer:} The sentence was corrected.
\item Why $T_{old}$ is computed using eq. 12, which applies MAX over computation time and communication time, while in $T_{new}$, max and min operations are applied over computation and communication separately?
\item Line 55 on page 16 is to define the slack time, which should be introduced at the beginning of the paper, such as in figure 1.
-\textcolor{blue}{Answer: We have changed it in the paper and added to page 6.}
+\textbf{Answer:} We agree with the reviewer and the slack time is now presented at the beginning of the paper.
\item Authors comment whether (and how) the proposed methods can be applied/extended to other programming models and/or platform, such as mapreduce, heterogeneous cluster with CPU+GPU.