From: jean-claude Date: Thu, 26 May 2016 09:53:00 +0000 (+0200) Subject: corrections X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/mpi-energy2.git/commitdiff_plain/63db3888ceb38d26393ccd23243db9c4939810ad corrections --- diff --git a/mpi-energy2-extension/review/review.pdf b/mpi-energy2-extension/review/review.pdf index 4e6dea0..fa09432 100644 Binary files a/mpi-energy2-extension/review/review.pdf and b/mpi-energy2-extension/review/review.pdf differ diff --git a/mpi-energy2-extension/review/review.tex b/mpi-energy2-extension/review/review.tex index 7302d62..e7525c1 100644 --- a/mpi-energy2-extension/review/review.tex +++ b/mpi-energy2-extension/review/review.tex @@ -1,11 +1,11 @@ -\documentclass[12pt,a4paper]{report} +\documentclass[12pt,a4paper]{journal} \usepackage[utf8]{inputenc} \usepackage{amsmath} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{graphicx} \usepackage{color} -\title{Reviewers' comments} +%\title{Reviewers' comments} \newcommand{\AG}[2][inline]{% \todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace} @@ -52,34 +52,10 @@ \begin{document} +\title{Answers to the questions of the reviewers} +\maketitle +\section{Questions of the first reviewer} - -\section{Reviewer 1} -This work tries to reduce energy consumption of regular applications, -with no dynamic load balancing, that execute in heterogeneous -platforms with different machine configurations. A energy and -execution time model are given for such configuration that allows the -reader to easily understand the context. The proposed energy reduction -strategy slows down (using DVFS) the faster processes, reducing their -slack time. The technique could be seen as "load balancing" through -de-acceleration. Of course the objective here is to reduce energy -consumption, and the proposed technique is indeed interesting. 
I
-suggest you to read the following paper that has a similar strategy
-(slack time is called residual imbalances in this paper) applied in a
-slightly different scenario:
-
-+ Padoin et. al. Saving energy by exploiting residual imbalances on
-  iterative applications. 21st International Conference on High
-  Performance Computing (HiPC), HIPC 2014.
-
-It is very hard to optimize energy consumption and performance. One
-affects the other, very few workarounds for that. I found the
-discussion in Sec. 4 very interesting since it details a possible
-workaround by exploring the fact that the application is distributed
-in the platform and we know that the overall execution time is
-dominated by the critical path.
-
-Remarks \begin{enumerate} @@ -92,15 +68,27 @@ measuring the computation time and energy consumption for one iteration only. Let's suppose something went bad in this first iteration. The scaling factors will not be the best tradeoff because variability has been ignored. What would be the solution for that?
-Consider variability in the model. Another point is that you mention
+Consider variability in the model.
+
+\textbf{Answer:} In this paper, we have considered that the application executes regular iterations on stable computing nodes that run only this application. Therefore, we have assumed that the execution times of all the iterations of the application executed on the same computing node should be almost the same. For this reason, we did not take the variability of the computing system into consideration. Moreover, applying the frequency scaling algorithm after many iterations would reduce its impact on the energy consumption, especially for applications executing a relatively low number of iterations.
+
+However, the variability of the computing system can be taken into consideration in future work. 
For example, the proposed algorithm can be executed twice: after the first iteration, the frequencies are scaled down according to the execution times measured in the first iteration; then, after a fixed number of iterations, the frequencies are adjusted according to the execution times measured during those iterations. If the computing power of the system is constantly changing, it would be interesting to implement a mechanism that detects this change and adjusts the frequencies according to the variability of the system.
+
+ Taking the variability of the system into account has been added as a perspective at the end of the paper.
+
+
+\item Another point is that you mention in the abstract and introduction that your solution has low overhead, but it is a centralized solution. Probably it won't scale when we reach hundreds or thousands of computer nodes: take one of that large machines for example. In this paper experiments, only 16 and 32 nodes where considered.
-\textcolor{blue}{Answer: We plan to take the variability in the proposed algorithm as a future works in two steps. In the first step, the algorithm selects the best frequencies at the end of the first iterations and apply them to the system. In the second step, after some iterations (e.g. 5 iterations) the algorithm recomputes the frequencies depending on the average of the communication and computation times for all previous iterations. It will change the frequency of each node if the new frequency is different from the old one. Otherwise, it keeps the old frequency. We have added this to our perspectives at the paper.
- The algorithm overhead is very small, for example in the simulation results [6], it takes 0.15 ms on average for 144 nodes to compute the best scaling factors vector for a heterogeneous cluster. On Grid'5000 it is very hard to book a lot of nodes that allow DVFS operations and have an energy measurement tools. 
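The two-phase frequency adjustment described in the answer above can be sketched as follows (a minimal illustration under simplifying assumptions; `select_scaling_factors` and `two_phase_adjustment` are hypothetical names, and the slack-time rule below stands in for the paper's actual frequency selection model):

```python
# Illustrative sketch of the two-phase adjustment; not the paper's
# implementation. iteration_times[k][i] is the measured computation
# time of node i during iteration k.

def select_scaling_factors(times):
    # Stand-in selection rule: slow every node down so that it
    # finishes with the slowest one (slack-time reduction).
    slowest = max(times)
    return [slowest / t for t in times]

def two_phase_adjustment(iteration_times, readjust_after=5):
    # Phase 1: scale frequencies using the first iteration only.
    first_factors = select_scaling_factors(iteration_times[0])
    # Phase 2: re-adjust using the average times of the next
    # `readjust_after` iterations, absorbing variability in the
    # first measurement.
    window = iteration_times[1:1 + readjust_after]
    averaged = [sum(t[i] for t in window) / len(window)
                for i in range(len(iteration_times[0]))]
    final_factors = select_scaling_factors(averaged)
    return first_factors, final_factors
```

If the first iteration is anomalous (for example, one node temporarily slowed down by the system), phase 2 corrects the factors computed in phase 1.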
}
+\textbf{Answer:}
+We agree with the reviewer that the algorithm is centralized and might be a bottleneck if it were applied to an application running on many thousands of nodes. However, for up to 144 nodes in a heterogeneous cluster, the overhead of the algorithm was very small, 0.15 ms on average, as presented in the simulation results of [6]. We did not execute experiments with more than 32 nodes on Grid'5000 because it does not have many nodes that both allow DVFS operations and provide energy measurement tools.
+
+On the other hand, the scalability of the proposed algorithm can be improved if we use asynchronous computations or if the algorithm were distributed in a hierarchical manner, where a leader is chosen in each cluster or group of nodes to compute their scaled frequencies. Improving the scalability of the algorithm is beyond the scope of this paper.
+
\item In Fig 6, you draw lines between the points. Lines here mean nothing @@ -109,7 +97,7 @@ instance a non-stacked bar plot with four colors (one site/16, one site/32, two sites/16, two sites/32). I believe it would be much easier to compare and avoid the problem of lines.
- \textcolor{blue}{Answer: we agree with reviewer. We have changed figures 6 and 8 in the paper.}
+ \textbf{Answer:} We agree with the reviewer. The curves in Figures 6 and 8 in the paper were replaced by histograms.
@@ -124,8 +112,9 @@ instance). You say that on scale this would produce less energy savings, but your arguments for providing a solution for this was based that today's supercomputers are achieving massive scale.
-\textcolor{blue}{Answer: In grid, the cost of communications between distinct sites is the main factor. The NAS benchmarks are significantly affected by the number of nodes and the increase in the communications between them. So, the instance is too small to be executed over 32 nodes and the computation to communication ratio is very small. Therefore, bigger instances should be executed on
-much number of nodes. 
}
+\textbf{Answer:} In Figure 7, the energy consumption of the benchmarks solving class D in several scenarios is presented. The number of used nodes varies between 16 and 32 in these scenarios, while the size of the problem is not modified. Therefore, the computation to communication times ratio is lower when 32 nodes are used instead of 16. When this ratio is small, there are not enough computations compared to the communication times, and the impact of scaling down the frequency of the CPU on its energy consumption is reduced. To avoid this, the problem
+should be solved on a number of nodes appropriate to its size. For example, for the NAS benchmarks, class E should have been solved on 32 nodes to obtain a good computation to communication times ratio.
+
\item In Sec 6.3, why did you choose to keep 32 processes for the evaluation with multi-core clusters? How did you configure MPI for the results @@ -188,7 +177,7 @@ very hard to affirm anything about measurements. \end{enumerate}
-\section{Reviewer 2}
+\section{Questions of the second reviewer}
This paper presents detailed performance and energy model for iterative message passing applications. Further a method is proposed to select the frequencies of heterogeneous cpus online. The selection method itself is not difficult. But I like the systematic modeling for energy consumption and performance. This paper is well written in general. The technical contents are presented in a logical way overall. The experiments are conducted in real platform, which shows the practicality of this work and also makes the work have more impact on the field. However, I have the following comments and concerns for this paper. The authors should clarify them in the revised version. @@ -301,7 +290,7 @@ and with or without message passing. 
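The reduced impact of frequency scaling at a low computation-to-communication ratio, discussed in the answer about Figure 7 above, can be illustrated with a toy energy model (the cubic power law and all constants here are illustrative assumptions, not the model of the paper):

```python
# Toy model: dynamic power scales with f^3, computation time with 1/f,
# communication time is frequency independent, and static power is
# drawn during the whole execution. Not the paper's exact equations.

def energy(t_comp, t_comm, f=1.0, p_dyn=1.0, p_static=0.2):
    t_total = t_comp / f + t_comm
    return p_dyn * f**3 * (t_comp / f) + p_static * t_total

def saving(t_comp, t_comm, f_scaled=0.5):
    # Relative energy saved by scaling the frequency down to f_scaled.
    return 1 - energy(t_comp, t_comm, f_scaled) / energy(t_comp, t_comm, 1.0)
```

Under these assumptions, halving the frequency saves more energy when computations dominate (e.g. `saving(8.0, 2.0)`) than when communications dominate (e.g. `saving(2.0, 8.0)`), mirroring the behaviour observed when moving from 16 to 32 nodes on a fixed problem size.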
Indeed, the proposed method can be applied The proposed method can be applied to a heterogeneous platform composed from GPUs and CPUs, since modern GPUs like CPUs allow the use of DVFS operation.} \end{enumerate} -\section{Reviewer 3} +\section{Questions of the third reviewer} In this paper, a new online frequency selecting algorithm for grids, composed of heterogeneous clusters, is presented. It selects the frequencies and tries to give the best trade-off between energy saving and performance degradation, for each node computing the message passing iterative application. The algorithm has a small overhead and works without training or profiling. It uses a new energy model for message passing iterative applications running on a grid. The proposed algorithm is evaluated on a real grid, the Grid'5000 platform, while running the NAS parallel benchmarks. The experiments on 16 nodes, distributed on three clusters, show