X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/mpi-energy2.git/blobdiff_plain/23d36dd1f5ae679a51998b8d4898c54fc117b950..79fdd4e76758f6190bd71b7ea2c749504714ab13:/mpi-energy2-extension/review/review.tex?ds=sidebyside diff --git a/mpi-energy2-extension/review/review.tex b/mpi-energy2-extension/review/review.tex index c815cf3..a7253f6 100644 --- a/mpi-energy2-extension/review/review.tex +++ b/mpi-energy2-extension/review/review.tex @@ -54,6 +54,9 @@ \title{Answers to the questions of the reviewers} \maketitle + +We would like to thank the reviewers for taking the time to review our paper. Their remarks were very constructive and allowed us to improve our paper and clarify some ambiguous points. We took into consideration all the remarks of the reviewers and modified the paper accordingly. In the following sections, the reviewers can find our answers to their questions: + \section{Questions and remarks of the first reviewer} \begin{enumerate} @@ -85,7 +88,7 @@ machines for example. In this paper experiments, only 16 and 32 nodes where considered. \textbf{Answer:} -We agree with the reviewer that the algorithm is centralized and might be a bottleneck if it was applied to an application running on many thousands of nodes. However, up to 144 nodes in a heterogeneous cluster, the overhead of the algorithm was very small, 0.15 ms, as presented in the simulation results of [6]. We did not execute experiments with more than 32 nodes on Grid'5000 because it does not have many nodes that allow DVFS operations and have energy measurement tools. +We agree with the reviewer that the algorithm is centralized and might be a bottleneck if it was applied to an application running on many thousands of nodes. However, up to 144 nodes in a heterogeneous cluster, the overhead of the algorithm was very small, 0.15 ms, as presented in the simulation results of \cite{5}. 
We did not execute experiments with more than 32 nodes on Grid'5000 because it does not have many nodes that allow DVFS operations and have energy measurement tools. On the other hand, the scalability of the proposed algorithm can be improved if we use asynchronous computations or if the algorithm was distributed in a hierarchical manner where a leader is chosen for each cluster or a group of nodes to compute their scaled frequencies. Improving the scalability of the algorithm is beyond the scope of this paper. @@ -184,7 +187,9 @@ For the variability issue, please refer to the answer to question 1. \section{Questions and remarks of the second reviewer} \begin{enumerate} -\item Move the contributions from related work to introduction +\item Move the contributions from related work to introduction. + +\textbf{Answer:} The contributions were moved to the introduction section. \item Why emphasize it is a grid platform? the presentation of related work follows the logic of heterogeneous CPUs. Grid is only a type of platform with heterogeneous CPUs. @@ -264,19 +269,22 @@ by the scheduler of the operating system of the node. \item the architecture file that contains the names of the nodes that will execute the application. They could be from different clusters. \end{itemize} -\item broken sentence in line 28 on page 12 +\item Broken sentence in line 28 on page 12. \textbf{Answer:} The sentence was corrected. \item Why $T_{old}$ is computed using eq. 12, which applies MAX over computation time and communication time, while in $T_{new}$, max and min operations are applied over computation and communication separately? -\textcolor{blue}{Answer: We agree with the reviewer, $T_{old}$ is the maximum execution time of the application before scaling the frequency and it is computed as in $T_{new}$ equation without scaling factors. So, we have changed the $T_{old}$ in the paper as as follows: +\textbf{Answer:} Both forms can be used for computing $T_{old}$ and $T_{new}$. 
To avoid this confusion, the same form was used for both equations in the paper. + \begin{equation} \label{eq:told} T_{old} = \mathop{\max_{i=1,2,\dots,N}}_{j=1,2,\dots,M_i} (\Tcp[ij]) + - \mathop{\min_{i=1,2,\dots,N}} (\Tcm[hj] ) + \mathop{\min_{j=1,2,\dots,M_h}} (\Tcm[hj] ) \end{equation} -} +where $h$ is the index of the slowest cluster. + + \item Line 55 on page 16 is to define the slack time, which should be introduced at the beginning of the paper, such as in figure 1. @@ -284,27 +292,38 @@ by the scheduler of the operating system of the node. \item Authors comment whether (and how) the proposed methods can be applied/extended to other programming models and/or platform, such as mapreduce, heterogeneous cluster with CPU+GPU. -Revision -\textcolor{blue}{Answer: The proposed method can only be applied to parallel programming with iteration -and with or without message passing. Indeed, the proposed method can be applied to the parallel application with mapreduce if it is a regular application with iterations. Therefore, the time of each map and reduce operations (communications) and the computation times in the program must be computed at the first iterations to predict the energy consumption and the execution time. After, the proposed algorithm can be used as it to select the best frequencies. -The proposed method can be applied to a heterogeneous platform composed from GPUs and CPUs, since modern GPUs like CPUs allow the use of DVFS operation.} +\textbf{Answer:} The proposed method can only be applied to parallel models with iterations +and with or without message passing. If only a few map and reduce operations are executed in the application and these operations are not iterative, the proposed algorithm cannot be adapted to that type of application. On the other hand, if the map or reduce operations are iterative, the proposed algorithm can be applied when executing these operations. 
Finally, if the same map and reduce operations are executed many times iteratively in the application, the proposed algorithm can then be applied to the whole application while considering that an iteration consists of a map operation followed by a reduce operation. + +The proposed method with some adaptations can be applied to applications with iterations running on heterogeneous platforms composed of GPUs and CPUs because modern GPUs, like CPUs, allow the use of DVFS. \end{enumerate} \section{Questions and remarks of the third reviewer} \begin{enumerate} -\item suggest the authors to use much larger size of nodes, instead of on 16 nodes, distributed on three clusters, to see the scalability of the energy saving +\item Suggest the authors to use much larger size of nodes, instead of on 16 nodes, distributed on three clusters, to see the scalability of the energy saving + +\textbf{Answer:} The experiments were not only conducted over 16 nodes, but they were also executed over 32 nodes distributed over three clusters. +In \cite{5} the algorithm was evaluated on a simulated heterogeneous cluster composed of up to 144 nodes. The overhead of the algorithm was very small, just 0.15 ms. + + The experiments were not conducted on more than 32 nodes of Grid'5000 because it does not have many nodes that allow DVFS operations and have energy measurement tools. We agree with the reviewer that experiments using many more nodes should be conducted to evaluate the scalability of the proposed algorithm, and when we have access to such platforms, we will evaluate the proposed method over a larger number of nodes. + +\item The energy saving is actually calculated by the quantitative formula instead of the real measurements. Can you have any discussions on the real measurements? + +\textbf{Answer:} This paper does not focus on measuring the energy consumption of CPUs in a grid. 
It presents models to predict the energy consumption and the performance of an application with iterations running on a grid. These models use the given dynamic and static powers to predict the energy consumption of each CPU with different scaling factors. Moreover, since we do not have physical access to the nodes of the grid, which are geographically distributed over many sites in France, we cannot use hardware tools to measure the consumption of CPUs. Therefore, we used Grid'5000's tool which measures the overall power consumption of a node in real-time. These values were used to deduce the dynamic power of the node when computing with the maximum frequency. + + As future work, it would be interesting to compare the accuracy of the results of the proposed energy model to the values given by instruments that measure the energy consumption of CPUs during execution, as in \cite{2}. + +\item The overhead is not measured, can you present something on this as well to demonstrate what the authors claimed "has a small overhead and works without training or profiling"? -\textcolor{blue}{Answer: We have made the experiments not only on 16 nodes, but we have also made them over 32 nodes distributed over three clusters and in the near future we will apply the proposed method over a larger number of nodes.} +\textbf{Answer:} In the comparison section 6.5, we have presented the execution time of the algorithm when it is executed over 32 nodes from three clusters located at two different sites. It takes on average 0.01 ms. In \cite{5} the algorithm was evaluated on a simulated heterogeneous cluster composed of up to 144 nodes. The overhead of the algorithm was just 0.15 ms. -\item the energy saving is actually calculated by the quantitative formula instead of the real measurements. Can you have any discussions on the real measurements? 
-\textcolor{blue}{Answer: The scope of this paper is not mainly focuses on the energy measurements, but it focuses on modelling and optimizing the energy and performance of grid systems. The proposed energy model depends on the dynamic and static power values for each CPU. We have used a real power measurement tools allowed in Grid'5000 sites to measure the dynamic power consumption. Moreover, the real measurements are difficult for a grid platform when the nodes are geographically distributed. As a future work, it is interesting to compare the accuracy of the proposed energy model with a real instruments to measure the energy consumption for local clusters such as the measurement tools presented in \cite{2}.} +That the algorithm works online without profiling means that it only uses the communication and computation times measured at run-time and does not require profiling the application before run-time. Some methods use profilers before executing the application to gather a lot of information about the application, such as computation to communication ratios and dependencies between tasks. The gathered information is used to scale the frequency of each node before executing the application. -\item the overhead is not measured, can you present something on this as well to demonstrate what the authors claimed "has a small overhead and works without training or profiling"? +The algorithm works without training because it does not require the partial or the total execution of the application before run-time. Indeed, some methods run parts of the application while using various frequencies to measure their energy consumption in advance. Then, using these values, they select the frequency of each node before executing the application. 
-\textcolor{blue}{Answer: In the comparison section 6.5, we have presented the execution time of the algorithm when it is executed over 32 nodes distributed over three sites located at two different sites, it takes on average 0.01 $ms$. The algorithm works online without training which means it only uses the measured communication and computation times during the runtime and do not require any profiling or training executed before runtime.} \end{enumerate} \bibliographystyle{plain} \bibliography{ref.bib}