\title{Answers to the questions of the reviewers}
\maketitle
+
+We would like to thank the reviewers for taking the time to review our paper. Their remarks were very constructive and allowed us to improve our paper and clarify some ambiguous points. We took all the remarks of the reviewers into consideration and modified the paper accordingly. In the following sections, the reviewers can find our answers to their questions:
+
\section{Questions and remarks of the first reviewer}
\begin{enumerate}
variability has been ignored. What would be the solution for that?
Consider variability in the model.
-\textbf{Answer:} In this paper we have considered that the application executes regular iterations over stable computers computing only this application. Therefore, we have assumed that the execution times of all the iterations of the application executed on the same computing node should be almost the same. For this reason we did not take into consideration the variability of the computer system. Moreover, applying the frequency scaling algorithm after many iterations would reduce its impact on the energy consumption especially for applications executing a relatively low number of iterations.
+\textbf{Answer:} In this paper we have considered that the application executes regular iterations over stable machines computing only this application. Therefore, we have assumed that the execution times of all the iterations of the application executed on the same computing node should be almost the same. For this reason we did not take into consideration the variability of the computer system. Moreover, applying the frequency scaling algorithm after many iterations would reduce its impact on the energy consumption especially for applications executing a relatively low number of iterations.
However, the variability of the computing system can be taken into consideration in future work. For example, the proposed algorithm can be executed twice: after the first iteration the frequencies are scaled down according to the execution times measured in the first iteration, then, after a fixed number of iterations, the frequencies are adjusted according to the execution times measured during these iterations. If the computing power of the system is constantly changing, it would be interesting to implement a mechanism that detects this change and adjusts the frequencies according to the variability of the system.
- Taking account of the variability of the system has been added as a perspective at the end of the paper.
+ Taking the variability of the system into account has been added as a perspective at the end of the paper.
\item Another point is that you mention
where considered.
\textbf{Answer:}
-We agree with the reviewer that the algorithm is centralized and might be a bottleneck if it was applied to an application running on many thousands of nodes. However, up to 144 nodes in a heterogeneous cluster, the overhead of the algorithm was very small, 0.15 ms, as presented in the simulation results of [6]. We did not execute experiments with more than 32 nodes on Grid'5000 because it does not have many nodes that allow DVFS operations and have energy measurement tools.
+We agree with the reviewer that the algorithm is centralized and might be a bottleneck if it were applied to an application running on many thousands of nodes. However, for up to 144 nodes in a heterogeneous cluster, the overhead of the algorithm was very small, 0.15 ms, as presented in the simulation results of \cite{5}. We did not execute experiments with more than 32 nodes on Grid'5000 because few of its nodes both allow DVFS operations and have energy measurement tools.
On the other hand, the scalability of the proposed algorithm can be improved if we use asynchronous computations or if the algorithm is distributed in a hierarchical manner where a leader is chosen for each cluster or group of nodes to compute their scaled frequencies. Improving the scalability of the algorithm is beyond the scope of this paper.
site/32, two sites/16, two sites/32). I believe it would be much
easier to compare and avoid the problem of lines.
- \textbf{Answer:} We agree with the reviewer. The curves in Figures 6 and 8 in the paper were replaced by histograms.
+ \textbf{Answer:} We agree with the reviewer. The curves in Figures 6 and 8 in the paper have been replaced by histograms.
detected by MPI if you pin processes to cores).
\textbf{Answer:} We did not manually pin processes to cores. Since the communication times
-increased, we think that the shared memory was not used when two processes, running on the same node, exchange data.
+have increased, we think that shared memory was not used when two processes running on the same node exchanged data.
\item In P33, Sec 6.5, you mention that the proposed algorithm outperforms
EDP because the former considers both metrics (time, energy) and the
+ Same for Fig 7.
- \textbf{Answer:} Answer: We have taken in consideration all these remarks and the paper was modified accordingly.
+ \textbf{Answer:} We have taken all these remarks into consideration and the paper has been modified accordingly.
\item From the design of experiments, did you consider using replications?
There is no variability metric in your results. Have you run multiple
\item In summary, I think this is a very interesting work but the experimental evaluation lacks variability measurements, consider larger experiments (1K nodes for instance) to see how everything scales, and there is no overhead measurements although authors stress that in abstract/introduction.
-\textbf{Answer:} For the time being, we do not have the resources nor the time to evaluate the proposed algorithm over large platforms composed of more than 1K nodes. However, as said in the perspectives of the paper, the evaluation of the scalability of the algorithm will be in a conducted in a future work as soon as we have access to larger resources. We have discussed the overhead of the algorithm and its complexity in section 6.5 and given in the answer to question 2 some solutions to improve its scalability and reduce its overhead.
+\textbf{Answer:} For the time being, we have neither the resources nor the time to evaluate the proposed algorithm over large platforms composed of more than 1K nodes. However, as stated in the perspectives of the paper, the scalability of the algorithm will be evaluated in future work as soon as we have access to larger resources. We have discussed the overhead of the algorithm and its complexity in Section 6.5 and, in the answer to question 2, given some solutions to improve its scalability and reduce its overhead.
For the variability issue, please refer to the answer to question 1.
\section{Questions and remarks of the second reviewer}
\begin{enumerate}
-\item Move the contributions from related work to introduction
+\item Move the contributions from related work to introduction.
+
+\textbf{Answer:} The contributions have been moved to the introduction section.
\item Why emphasize it is a grid platform? the presentation of related work follows the logic of heterogeneous CPUs. Grid is only a type of platform with heterogeneous CPUs.
\item Figure 1 is not clearly explained. Where is the slack time in figure 1 and why slack time =0 for task 1?
-\textbf{Answer:} Figure 1 was redrawn, the white space before the barrier is the slack time. Slack times occur when a node has to wait for another node to finish its computation to synchronously communicate with it. In Figure 1, task 1 was assumed to be the slowest task. All the other tasks will finish their computations before the slowest task and wait until it finishes its computation before being able to synchronously communicate with it. This waiting time is the slack time and since the slowest task do not have to wait for the other tasks it has almost no slack time.
+\textbf{Answer:} Figure 1 was redrawn; the white space before the barrier is the slack time. Slack times occur when a node has to wait for another node to finish its computation to synchronously communicate with it. In Figure 1, task 1 was assumed to be the slowest task. All the other tasks will finish their computations before the slowest task and wait until it finishes its computation before being able to synchronously communicate with it. This waiting time is the slack time and, since the slowest task does not have to wait for the other tasks, it has almost no slack time.
\item Define the parameters in eq. 1.
\item What is the communication time without slack time?
-\textbf{Answer:} There is no synchronous communications with zero slack times, but if a node sends a message to another node who is already waiting for that message. The latter will acknowledge the reception of the message from the sender without any delay. On the other hand, if the receiving node is still computing the sender has to wait for it to finish its computation to acknowledge the reception of the message. This time is called the slack time.
+\textbf{Answer:} There is no synchronous communication with zero slack time, but if a node sends a message to another node that is already waiting for that message, the latter will acknowledge the reception of the message without any delay. On the other hand, if the receiving node is still computing, the sender has to wait for it to finish its computation before the reception of the message is acknowledged. This waiting time is called the slack time.
\item In equation, min operation is used to get the communication time, but in text, it says to use the slowest communication time, which should use the max operation then.
\item Discuss the difference between eq. 2 and the prediction model in references \cite{4} and \cite{5}.
-\textbf{Answer:} The prediction models in \cite{4} and \cite{5} are for homogeneous and heterogeneous clusters respectively, while the model in Equation 2 is adapted for grids. We have adapted the prediction models to the used architecture. Each architecture has its own characteristics. For example, in a homogeneous cluster all the nodes have the same specifications and only one scaling factor is computed by the algorithm to all the nodes of the cluster.
-On the other hand, in a heterogeneous cluster, the nodes may have different specifications and a scaling factor should be computed to each node. The prediction models of a heterogeneous cluster can be used for a homogeneous cluster. In the same the models in this paper take more characteristics into considerations such as different networks to be adapted for grids and they can also be applied to a heterogeneous cluster. Therefore, the models presented in this paper are more complete than those presented in \cite{4} and \cite{5} and take more characteristics into consideration.
+\textbf{Answer:} The prediction models in \cite{4} and \cite{5} are for homogeneous and heterogeneous clusters respectively, while the model in Equation 2 is adapted for grids. We have adapted the prediction models to the target architecture because each architecture has its own characteristics. For example, in a homogeneous cluster all the nodes have the same specifications and only one scaling factor is computed by the algorithm for all the nodes of the cluster.
+On the other hand, in a heterogeneous cluster, the nodes may have different specifications and a scaling factor should be computed for each node. The prediction models of a heterogeneous cluster can be used for a homogeneous cluster. In the same way, the models in this paper take additional characteristics into consideration, such as different networks, to be adapted for grids, and they can also be applied to a heterogeneous cluster. Therefore, the models presented in this paper are more complete than those presented in \cite{4} and \cite{5}.
\item Eq. 10: Can the authors comment on the energy consumed by communications?
-\textbf{Answer:} During communications, the CPU only consumes the static power power and during computations it consumes both dynamic and static power. For more information the reviewer can refer to \cite{3}.
+\textbf{Answer:} During communications, the CPU only consumes the static power and during computations it consumes both dynamic and static power. For more information the reviewer can refer to \cite{3}.
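+
+Schematically, and using notation introduced here only for illustration ($P_d$ and $P_s$ for the dynamic and static powers of a CPU, $T_{cp}$ and $T_{cm}$ for its computation and communication times, slack included), the sentence above corresponds to the following unscaled form, which is a reading of that sentence and not a restatement of Eq. 10:
+\[
+E = (P_d + P_s) \cdot T_{cp} + P_s \cdot T_{cm}
+\]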
\item This work assume homogeneous cpu in one cluster. Line 55 says: even if the distributed message
passing iterative application is load balanced, the computation time of each cpu j in cluster i may be different Why?
\item Comment why the applications in NAS parallel benchmark are iterative application? These applications are normally run in one cluster. Describe in more detail how they are run across multiple clusters.
-\textbf{Answer:} The sentence ``iterative applications'' was replaced by ``applications with iterations'' because the proposed algorithm can be applied to any application that executes the same block of instructions many times and it is not limited to iterative methods that terminate when they converge. The NAS parallel benchmarks are application with iterations because they iterate the same block of instructions until convergence or for fixed number of iterations. These benchmarks can be executed on any distributed memory platform such as clusters or grids with no required modifications. Since, we have deployed the same operating system on all the nodes, we just compile the source on one node and then copy the executable program on all the nodes. The application can then be executed with an ``mpirun'' command that takes three arguments:
+\textbf{Answer:} The expression ``iterative applications'' was replaced by ``applications with iterations'' because the proposed algorithm can be applied to any application that executes the same block of instructions many times and it is not limited to iterative methods that terminate when they converge. The NAS parallel benchmarks are applications with iterations because they iterate the same block of instructions until convergence or for a fixed number of iterations. These benchmarks can be executed on any distributed memory platform such as clusters or grids with no required modifications. Since we have deployed the same operating system on all the nodes, we just compile the source on one node and then copy the executable program to all the other nodes. The application can then be executed with an ``mpirun'' command that takes three arguments:
\begin{itemize}
\item the name of the application to execute
\item the number of processes required to execute the application
\item the architecture file that contains the names of the nodes that will execute the application. They could be from different clusters.
\end{itemize}
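+
+For illustration only, such a command could look as follows, assuming Open MPI's \texttt{mpirun} syntax; the executable name \texttt{./cg.C.32}, the process count and the file name \texttt{nodes.txt} are hypothetical placeholders and not the exact command used in the experiments:
+\begin{verbatim}
+mpirun -np 32 --hostfile nodes.txt ./cg.C.32
+\end{verbatim}
+Here \texttt{nodes.txt} would list one host name per line, possibly taken from several clusters.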
-\item broken sentence in line 28 on page 12
+\item Broken sentence in line 28 on page 12.
-\textbf{Answer:} The sentence was corrected.
+\textbf{Answer:} The sentence has been corrected.
\item Why $T_{old}$ is computed using eq. 12, which applies MAX over computation time and communication time, while in $T_{new}$, max and min operations are applied over computation and communication separately?
-\textcolor{blue}{Answer: We agree with the reviewer, $T_{old}$ is the maximum execution time of the application before scaling the frequency and it is computed as in $T_{new}$ equation without scaling factors. So, we have changed the $T_{old}$ in the paper as as follows:
+\textbf{Answer:} Both forms can be used for computing $T_{old}$ and $T_{new}$. To avoid this confusion, the same form was used for both equations in the paper.
+
\begin{equation}
\label{eq:told}
T_{old} = \mathop{\max_{i=1,2,\dots,N}}_{j=1,2,\dots,M_i} (\Tcp[ij]) +
- \mathop{\min_{i=1,2,\dots,N}} (\Tcm[hj] )
+ \mathop{\min_{j=1,2,\dots,M_h}} (\Tcm[hj] )
\end{equation}
-}
+where $h$ is the index of the slowest cluster.
+
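+As a purely illustrative example with hypothetical values (not taken from the experiments), consider $N=2$ clusters of two nodes each, with computation times of 4 and 5 seconds in the first cluster and 6 and 8 seconds in the second one, and communication times of 2 and 3 seconds for the two nodes of the second cluster. Assuming that the slowest cluster is the one containing the slowest node, $h=2$ and the equation above gives, in seconds:
+\[
+T_{old} = \max(4, 5, 6, 8) + \min(2, 3) = 8 + 2 = 10
+\]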
+
\item Line 55 on page 16 is to define the slack time, which should be introduced at the beginning of the paper, such as in figure 1.
\item Authors comment whether (and how) the proposed methods can be applied/extended to other programming models and/or platform, such as mapreduce, heterogeneous cluster with CPU+GPU.
-Revision
-\textcolor{blue}{Answer: The proposed method can only be applied to parallel programming with iteration
-and with or without message passing. Indeed, the proposed method can be applied to the parallel application with mapreduce if it is a regular application with iterations. Therefore, the time of each map and reduce operations (communications) and the computation times in the program must be computed at the first iterations to predict the energy consumption and the execution time. After, the proposed algorithm can be used as it to select the best frequencies.
-The proposed method can be applied to a heterogeneous platform composed from GPUs and CPUs, since modern GPUs like CPUs allow the use of DVFS operation.}
+\textbf{Answer:} The proposed method can only be applied to parallel models with iterations,
+with or without message passing. If only a few map and reduce operations are executed in the application and these operations are not iterative, the proposed algorithm cannot be adapted to that type of application. On the other hand, if the map or reduce operations are iterative, the proposed algorithm can be applied when executing these operations. Finally, if the same map and reduce operations are executed many times iteratively in the application, the proposed algorithm can be applied to the whole application by considering that an iteration consists of a map operation followed by a reduce operation.
+
+The proposed method can, with some adaptations, be applied to applications with iterations running on heterogeneous platforms composed of GPUs and CPUs because modern GPUs, like CPUs, allow the use of DVFS.
\end{enumerate}
\section{Questions and remarks of the third reviewer}
\begin{enumerate}
-\item suggest the authors to use much larger size of nodes, instead of on 16 nodes, distributed on three clusters, to see the scalability of the energy saving
+\item Suggest the authors to use much larger size of nodes, instead of on 16 nodes, distributed on three clusters, to see the scalability of the energy saving
+
+\textbf{Answer:} The experiments were not only conducted over 16 nodes, but they were also executed over 32 nodes distributed over three clusters.
+In \cite{5} the algorithm was evaluated on a simulated heterogeneous cluster composed of up to 144 nodes. The overhead of the algorithm was very small, just 0.15 ms.
+
+ The experiments were not conducted on more than 32 nodes of Grid'5000 because few of its nodes both allow DVFS operations and have energy measurement tools. We agree with the reviewer that experiments using a much larger number of nodes should be conducted to evaluate the scalability of the proposed algorithm. When we have access to such platforms, we will evaluate the proposed method over a larger number of nodes.
+
+\item The energy saving is actually calculated by the quantitative formula instead of the real measurements. Can you have any discussions on the real measurements?
+
+\textbf{Answer:} This paper does not focus on measuring the energy consumption of CPUs in a grid. It presents models to predict the energy consumption and the performance of an application with iterations running on a grid. These models use the given dynamic and static powers to predict the energy consumption of each CPU with different scaling factors. Moreover, since we do not have physical access to the nodes of the grid which are geographically distributed on many sites in France, we cannot use hardware tools to measure the consumption of CPUs. Therefore, we used Grid'5000's tool which measures the overall power consumption of a node in real-time. These values were used to deduce the dynamic power of the node when computing with the maximum frequency.
+
+ As future work, it would be interesting to compare the results of the proposed energy model with the values given by instruments that measure the energy consumption of CPUs during execution, as in \cite{2}.
+
+\item The overhead is not measured, can you present something on this as well to demonstrate what the authors claimed ``has a small overhead and works without training or profiling''?
-\textcolor{blue}{Answer: We have made the experiments not only on 16 nodes, but we have also made them over 32 nodes distributed over three clusters and in the near future we will apply the proposed method over a larger number of nodes.}
+\textbf{Answer:} In the comparison in Section 6.5, we have presented the execution time of the algorithm when it is executed over 32 nodes from three clusters located at two different sites. It takes on average 0.01 ms. In \cite{5} the algorithm was evaluated on a simulated heterogeneous cluster composed of up to 144 nodes. The overhead of the algorithm was just 0.15 ms.
-\item the energy saving is actually calculated by the quantitative formula instead of the real measurements. Can you have any discussions on the real measurements?
-\textcolor{blue}{Answer: The scope of this paper is not mainly focuses on the energy measurements, but it focuses on modelling and optimizing the energy and performance of grid systems. The proposed energy model depends on the dynamic and static power values for each CPU. We have used a real power measurement tools allowed in Grid'5000 sites to measure the dynamic power consumption. Moreover, the real measurements are difficult for a grid platform when the nodes are geographically distributed. As a future work, it is interesting to compare the accuracy of the proposed energy model with a real instruments to measure the energy consumption for local clusters such as the measurement tools presented in \cite{2}.}
+The sentence ``the algorithm works online without profiling'' means that it only uses the communication and computation times measured at run-time and does not require profiling the application before run-time. Some methods use profilers before executing the application to gather a lot of information about the application, such as computation-to-communication ratios and dependencies between tasks. The gathered information is then used to scale the frequency of each node before executing the application.
-\item the overhead is not measured, can you present something on this as well to demonstrate what the authors claimed "has a small overhead and works without training or profiling"?
+The algorithm works without training because it does not require the partial or total execution of the application before run-time. Indeed, some methods run parts of the application with various frequencies to measure their energy consumption in advance. Then, using these values, they select the frequency of each node before executing the application.
-\textcolor{blue}{Answer: In the comparison section 6.5, we have presented the execution time of the algorithm when it is executed over 32 nodes distributed over three sites located at two different sites, it takes on average 0.01 $ms$. The algorithm works online without training which means it only uses the measured communication and computation times during the runtime and do not require any profiling or training executed before runtime.}
\end{enumerate}
\bibliographystyle{plain}
\bibliography{ref.bib}