----------------------- REVIEW 1 ---------------------
PAPER: 36
TITLE: Dynamic Frequency Scaling for Energy Consumption Reduction in Distributed MPI Programs
AUTHORS: Jean-Claude Charr, Raphaël Couturier, Ahmed Fanfakh and Arnaud Giersch

OVERALL EVALUATION: 0 (borderline paper)
REVIEWER'S CONFIDENCE: 4 (high)

----------- REVIEW -----------
Paper Summary: This paper describes an online algorithm for estimating the energy consumption of a program when running in different DVFS states, or frequency gears. The algorithm uses information obtained at runtime on the compute vs. communication time of the application. The system picks the frequency gear with the biggest delta between inverse performance and normalized energy. The algorithm is evaluated using simulation of a small cluster (8 to 16 nodes) of homogeneous single-core nodes with an Ethernet network.

Importance of Problem: Energy efficiency must be improved significantly (by more than 20x) to meet the power budget goal for exascale-class supercomputers. This paper proposes an online algorithm for reducing the energy consumed by an application with as little performance impact as possible, thus improving energy efficiency.

Strengths and weaknesses: The paper is well written overall. The background material is clear and the power models are presented in an understandable way. In my view, the main weakness of this work is that it does not sufficiently distinguish itself from existing work. It seems similar to several other dynamic DVFS control papers that I have read. In terms of execution, another weakness is that the evaluation section uses only simulation and includes only small-scale results. Other work has used real HPC systems at a much larger scale (hundreds or thousands of nodes). I also question the wisdom of always choosing the largest gap between the performance and energy curves. This seems to be equivalent to minimizing the Energy*Delay product metric.
In many of the NPB results, performance is significantly affected (> 30%). This may be unacceptable for many HPC use cases. Others have used Energy*Delay^2 or Energy*Delay^3 metrics to further emphasize the importance of performance. Finally, the paper could be improved by examining workloads in addition to the NPBs.

Additional comments:
- In the abstract, is the frequency/energy relationship really exponential?
- Suggest changing footprint -> overhead
- modelize -> model
- What is "Backbone Bandwidth"? Since only 16 nodes were simulated, the simulated Ethernet switch can easily be a full crossbar. Also, do you mean 1 gigabit per second Ethernet or 1 gigabyte per second Ethernet? What would happen if a much faster network were used? Would that eliminate much of the slack time opportunity?

----------------------- REVIEW 2 ---------------------
PAPER: 36
TITLE: Dynamic Frequency Scaling for Energy Consumption Reduction in Distributed MPI Programs
AUTHORS: Jean-Claude Charr, Raphaël Couturier, Ahmed Fanfakh and Arnaud Giersch

OVERALL EVALUATION: 0 (borderline paper)
REVIEWER'S CONFIDENCE: 4 (high)

----------- REVIEW -----------
The authors propose an algorithm to optimize the performance and energy consumption of message-passing parallel programs. The algorithm is an extension of reference [4], replacing the idle time due to the variance in computation time between processes with the communication times.

First, the authors have made an elementary but fundamental mistake. The (relative) performance is the inverse of the execution time ratio. In other words, Eq. (12) is the relative performance metric (not the inverse of the performance).

Before presenting the experimental results, they verified the accuracy of the predicted performance (for a given scaling factor S) by comparing it to what they measured on the simulator. This does not seem to be sufficient. Why did they not compare the predicted performance (and energy!)
to those of executions on a real machine?

The specifications of the simulated machine seem to be chosen in favor of their algorithm. For example, they assumed one core per node. This implies that any communication with other nodes must go through the gigabit Ethernet, stretching the communication time and giving more freedom to their algorithm. On current and future machines, each 'processor' should have multiple cores (and each node may have more than one 'processor'). Therefore, some fraction of the communication is performed through shared memory (even if the parallel programs themselves are written with message-passing primitives). Moreover, the range of the clock frequency and its granularity (0.8 GHz to 2.4 GHz in 0.1 GHz increments) is NOT impossible but seems to be chosen to support the results. If the number of available frequencies is smaller, the effectiveness of their algorithm should also be limited.

While they claim that their algorithm works 'online', it is not an 'on-the-fly' algorithm. It requires a full iteration of running the program, and uses the results for the later executions. This assumes that the second and later executions have the same performance and energy consumption for the selected scaling factor. This assumption may limit the applicability of their algorithm.

A few more comments:
- In Section VII-C, they compared their algorithm with that in [4]. Should they not include the option of maximum energy reduction with zero performance
- Their work is not specific to MPI, and the last part of the title should be "synchronous message passing program."
- For the readability of the paper, the text should be more polished. For example, use appropriate connectives.
- Similarly, some sentences look redundant. For example, "To be able to predict ..." (in the first paragraph of Section IV) could be "To predict ..."
- Capitalization, such as "Table" and "Figure" (not "table" and "figure")
- Use appropriate units (6.65us instead of 0.00665ms)
- What is a 'platform file'? (e.g., Table 1)

----------------------- REVIEW 3 ---------------------
PAPER: 36
TITLE: Dynamic Frequency Scaling for Energy Consumption Reduction in Distributed MPI Programs
AUTHORS: Jean-Claude Charr, Raphaël Couturier, Ahmed Fanfakh and Arnaud Giersch

OVERALL EVALUATION: 1 (weak accept)
REVIEWER'S CONFIDENCE: 4 (high)

----------- REVIEW -----------
The paper presents an online algorithmic approach to reducing energy consumption in MPI programs. The authors consider the computation and communication times of the sub-tasks in the MPI program, and scale the voltage and frequency of the individual cores to achieve an overall energy reduction that balances the loss in performance against the gain in energy savings.

VF scaling is the dominant runtime method for cutting power consumption, hence approaches to identifying and exploiting program slack are very useful. The algorithm and the results are explained well.

My central concern is the "run once" nature of the online approach. I am not convinced that arriving at the scaling factors once, after one iteration of the sub-tasks (ending in the barrier), is sufficient to determine the right scaling factors for the threads. I suspect the runtime interactions of the threads will lead to different threads becoming critical at various iterations. It would be good to compare against an idealized scenario in which the algorithm runs every iteration, ignoring overheads.

It would be useful to summarize the results in Tables III through V in a chart.

The cost of core frequency transitions is ignored in the paper (minor).

There are plenty of other works in this area. For example, how does your work compare to Thrifty barrier [ http://csl.cornell.edu/~martinez/doc/hpca04.pdf ] (minor)?