ispa14_review.txt

   1 ----------------------- REVIEW 1 ---------------------
   2 PAPER: 36
   3 TITLE: Dynamic Frequency Scaling for Energy Consumption Reduction in Distributed MPI Programs
   4 AUTHORS: Jean-Claude Charr, Raphaël Couturier, Ahmed Fanfakh and Arnaud Giersch
   5
   6 OVERALL EVALUATION: 0 (borderline paper)
   7 REVIEWER'S CONFIDENCE: 4 (high)
   8
   9 ----------- REVIEW -----------
  10 Paper Summary:
  11 This paper describes an online algorithm for estimating the energy consumption
  12 of a program when running in different DVFS states, or frequency gears.  The
  13 algorithm uses information obtained at runtime on the compute vs. communication
  14 time of the application.  Their system picks the frequency gear with the biggest
  15 delta between inverse performance and normalized energy.  Their algorithm is
  16 evaluated using simulation of a small cluster (8 to 16 nodes) of homogeneous
  17 single-core nodes with an Ethernet network.
  18
  19 Importance of Problem:
  20 Energy efficiency must be improved significantly – more than 20x – to meet the
  21 power budget goal for exascale-class supercomputers.  This paper proposes an
  22 online algorithm for reducing the energy consumed by an application with as
  23 little performance impact as possible, thus improving energy efficiency.
  24
  25 Strengths and weaknesses:
  26 The paper is well written overall.  The background material is clear and the
  27 power models are presented in an understandable way.  In my view the main
  28 weakness of this work is that it doesn’t sufficiently distinguish itself from
  29 existing work.  This seems similar to several other dynamic DVFS control papers
  30 that I’ve read.  In terms of execution, another weakness of this paper is that the
  31 evaluation section uses only simulation and only includes small-scale results.
  32 Other work has used real HPC systems at a much larger scale (100s or 1000s of
  33 nodes).  I also question the wisdom of always choosing the largest gap between
  34 performance and energy curves.  This seems to be equivalent to minimizing the
  35 Energy*Delay product metric.  In many of the NPB results, performance is
  36 significantly affected (> 30%).  This may be unacceptable for many HPC use
  37 cases.  Others have used Energy*Delay^2 or Energy*Delay^3 metrics to further
  38 emphasize the importance of performance.  Finally, the paper could be
  39 improved by examining workloads in addition to the NPBs.
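
The effect of the metric choice can be sketched with a toy model. Everything in it is an assumption made for illustration and is not the paper's model or data: the compute/communication split, the power constants, and the cubic dependence of dynamic power on frequency are hypothetical. Only the gear range (0.8 GHz to 2.4 GHz in 0.1 GHz steps) comes from the simulated platform described in the reviews. The sketch picks the gear that minimizes Energy*Delay^n for several exponents n.

```python
# Toy model: all constants are hypothetical, chosen only for illustration.
F_MAX = 2.4                                            # GHz, highest gear
GEARS = [round(0.8 + 0.1 * i, 1) for i in range(17)]   # 0.8 .. 2.4 GHz

T_COMP, T_COMM = 0.7, 0.3    # compute / communication time at F_MAX (s)
P_DYN, P_STATIC = 20.0, 4.0  # dynamic power at F_MAX, static power (W)


def delay(freq):
    """Execution time: only the compute part stretches as the clock slows."""
    scale = F_MAX / freq
    return T_COMP * scale + T_COMM


def energy(freq):
    """Dynamic energy (power assumed ~ f^3) plus static energy over the run."""
    scale = F_MAX / freq
    dynamic = P_DYN * scale ** -3 * (T_COMP * scale)
    static = P_STATIC * delay(freq)
    return dynamic + static


def best_gear(n):
    """Gear minimizing Energy * Delay^n (n = 0 is pure energy minimization)."""
    return min(GEARS, key=lambda f: energy(f) * delay(f) ** n)


for n in range(4):
    f = best_gear(n)
    slowdown = delay(f) / delay(F_MAX) - 1.0
    print(f"E*D^{n}: {f:.1f} GHz, slowdown {slowdown:.0%}")
```

In a model of this shape, raising the exponent can never select a slower gear, because delay only grows as the frequency drops. This is why the higher-order metrics put more weight on performance.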
  40
  41 Additional comments:
  42 - In abstract, is the frequency / energy relationship really exponential?
  43 - Suggest changing footprint -> overhead
  44 - modelize -> model
  45 - What is “Backbone Bandwidth”?  Since only 16 nodes were simulated, the
  46   Ethernet switch simulated can easily be a full crossbar.  Also, do you mean 1
  47   gigabit per second Ethernet or 1 gigabyte per second Ethernet?  What would
  48   happen if a much faster network was used?  Would that eliminate much of the
  49   slack time opportunity?
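
The gigabit/gigabyte question matters because the two readings differ by a factor of eight in transfer time. A minimal sketch, assuming a hypothetical 1 MB message and ignoring latency and protocol overhead (neither value is taken from the paper):

```python
def transfer_time(message_bytes, link_bits_per_second):
    """Seconds to push a message through the link, ignoring latency."""
    return message_bytes * 8 / link_bits_per_second


MESSAGE_BYTES = 1_000_000   # hypothetical 1 MB message
GIGABIT = 1e9               # 1 Gb/s
GIGABYTE = 8e9              # 1 GB/s expressed in bits per second

t_gigabit = transfer_time(MESSAGE_BYTES, GIGABIT)
t_gigabyte = transfer_time(MESSAGE_BYTES, GIGABYTE)
print(f"1 Gb/s: {t_gigabit * 1e3:.1f} ms, 1 GB/s: {t_gigabyte * 1e3:.1f} ms")
```

Under these assumptions the message takes 8 ms on the slower link and 1 ms on the faster one, so the communication time available as slack shrinks accordingly.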
  50
  51
  52 ----------------------- REVIEW 2 ---------------------
  53 PAPER: 36
  54 TITLE: Dynamic Frequency Scaling for Energy Consumption Reduction in Distributed MPI Programs
  55 AUTHORS: Jean-Claude Charr, Raphaël Couturier, Ahmed Fanfakh and Arnaud Giersch
  56
  57 OVERALL EVALUATION: 0 (borderline paper)
  58 REVIEWER'S CONFIDENCE: 4 (high)
  59
  60 ----------- REVIEW -----------
  61 The authors proposed an algorithm to optimize the performance and energy
  62 consumption for message-passing parallel programs.  The algorithm is an
  63 extension of [4] in the reference, by replacing the idle time due to the
  64 variance in the computation time between processes with the communication times.
  65
  66 First, the authors have made a basic but essential mistake.  The
  67 (relative) performance is the inverse of the execution time ratio. In other
  68 words, EQ (12) is the relative performance metric (not inverse of the
  69 performance).
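
A small worked example of the distinction, using hypothetical symbols and numbers that are not taken from the paper: let T_old be the execution time at the maximum frequency and T_new the execution time at the scaled frequency.

```latex
\[
  \frac{T_{\text{new}}}{T_{\text{old}}}
    = \frac{12.5\,\text{s}}{10\,\text{s}} = 1.25
  \quad\text{(execution time ratio)},
  \qquad
  P_{\text{rel}} = \frac{T_{\text{old}}}{T_{\text{new}}}
    = \frac{10\,\text{s}}{12.5\,\text{s}} = 0.8
  \quad\text{(relative performance)}.
\]
```

A value below 1 for the relative performance indicates a slowdown, while its inverse, the execution time ratio, is above 1 in the same situation.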
  70
  71 Before presenting the experimental results, they verified the accuracy of the
  72 predicted performance (for a given scaling factor S) by comparing it to what
  73 they measured on the simulator.  This does not seem to be sufficient. Why did
  74 they not compare the predicted performance (and energy!) to those of executions
  75 on a real machine?
  76
  77 The specifications of the simulated machine seem to be chosen in favor of their
  78 algorithm. For example, they assumed one core for each node. This implies any
  79 communication with other nodes must go through the gigabit Ethernet, stretching
  80 the communication time and giving more freedom to their algorithm.  On the
  81 current and future machines, each 'processor' should have multiple cores (and
  82 each node may have more than one 'processor').  Therefore, some fraction of
  83 communication is performed using the shared memory (even if the parallel programs
  84 themselves are written with the message passing primitives).
  85
  86 Moreover, the range of the clock frequency and its granularity (0.8 GHz to 2.4
  87 GHz with 0.1 GHz increment) is NOT impossible but seems to be chosen to support the
  88 results. If the number of available frequencies is smaller, the effectiveness of
  89 their algorithm should also be limited.
  90
  91 While they claim that their algorithm works 'online', it is not an 'on-the-fly'
  92 algorithm. It requires a full iteration of running the program, and uses the
  93 results for the later executions.  This assumes that the 2nd and later
  94 executions have the same performance and energy consumption against the selected
  95 scaling factor.  This assumption may limit the applicability of their algorithm.
  96
  97 A few more comments:
  98
  99 - In Section VII-C, they compared their algorithm with that in [4].
 100   Should not they include the option of maximum energy reduction with zero
 101   performance
 102
 103 - Their work is not specific to the MPI and the last part of the title should be
 104   " synchronous message passing program."
 105
 106 - For the readability of the paper, the text should be more polished.
 107   For example, use appropriate connectives.
 108
 109 - Similarly, some sentences look redundant. For example, "To be able to predict
 110   ..." (in the first paragraph of Section IV) could be "To predict .."
 111
 112 - Capitalization, such as "Table" and "Figure" (not "table" and "figure")
 113
 114 - Use appropriate units (6.65us instead of 0.00665ms)
 115
 116 - What is a 'platform file'? (e.g. Table 1)
 117
 118
 119 ----------------------- REVIEW 3 ---------------------
 120 PAPER: 36
 121 TITLE: Dynamic Frequency Scaling for Energy Consumption Reduction in Distributed MPI Programs
 122 AUTHORS: Jean-Claude Charr, Raphaël Couturier, Ahmed Fanfakh and Arnaud Giersch
 123
 124 OVERALL EVALUATION: 1 (weak accept)
 125 REVIEWER'S CONFIDENCE: 4 (high)
 126
 127 ----------- REVIEW -----------
 128 The paper presents an online algorithmic approach to reduce energy consumption
 129 in MPI programs.  The authors consider the computation and communication times
 130 of the sub-tasks in the MPI program, and scale the voltage and frequency of the
 131 individual cores to achieve overall energy reduction that balances reduction in
 132 performance with the increase in energy savings.
 133
 134 VF scaling is the dominant runtime method to cut down power consumption, hence
 135 approaches to identifying and exploiting program slack are very useful.
 136 The algorithm and the results are explained well.
 137
 138 My central concern is the "run once" nature of the online approach.  I'm not
 139 convinced that arriving at the scaling factors once after one iteration of the
 140 subtasks (ending in the barrier) is sufficient to determine the right scaling factors
 141 of the threads.  I suspect the runtime interactions of the threads will lead to
 142 different threads becoming critical at various iterations.  It would be good to
 143 do a comparison with an idealized scenario of the algorithm running every iteration,
 144 ignoring overheads.
 145
 146 It will be useful to summarize the results in Tables III through V in a chart.
 147
 148 The cost of core frequency transitions is ignored in the paper (minor).
 149
 150 There are plenty of other works in this area.  For example, how does your work
 151 compare to Thrifty barrier [ http://csl.cornell.edu/~martinez/doc/hpca04.pdf ]
 152 (minor)?