corrections

author jean-claude <jean-claude.charr@univ-fcomte.fr>

Fri, 27 May 2016 09:38:05 +0000 (11:38 +0200)

committer jean-claude <jean-claude.charr@univ-fcomte.fr>

Fri, 27 May 2016 09:38:05 +0000 (11:38 +0200)
diff --git a/mpi-energy2-extension/Heter_paper.tex b/mpi-energy2-extension/Heter_paper.tex

index 5ef27298690893c35cdb23d425ac60510eec3e40..a72c5530e965cdaa7acbf261f00f1b1be5708e9c 100644
--- a/mpi-energy2-extension/Heter_paper.tex
+++ b/mpi-energy2-extension/Heter_paper.tex
@@ -107,9 +107,9 @@
  
  
  
-\title{Optimizing Energy Consumption with DVFS for Message \\
-         Passing Applications \textcolor{blue}{with iterations} on \\
-                    Grid Architectures} 
+\title{Optimizing the Energy Consumption \\ 
+of Message Passing Applications with Iterations \\ 
+Executed over Grids} 
    
  
  
@@ -143,10 +143,10 @@ scaling (DVFS) is one of them. It can be used to reduce the power consumption of
    In this paper, a new online frequency selecting algorithm for grids, composed of heterogeneous clusters, is presented.  
    It selects the frequencies and tries to give the best
    trade-off between energy saving and performance degradation, for each node
-  computing the message passing  application \textcolor{blue}{with iterations}. 
+  computing the message passing  application with iterations. 
    The algorithm has a small
    overhead and works without training or profiling. It uses a new energy model
-  for message passing  applications \textcolor{blue}{with iterations} running on a  grid. 
+  for message passing  applications with iterations running on a  grid. 
    The proposed algorithm is evaluated on a real grid, the Grid'5000 platform, while
    running the NAS parallel benchmarks.  The experiments on 16 nodes, distributed on three clusters, show that it reduces  on average the
    energy consumption  by \np[\%]{30} while  the performance  is on average only degraded
@@ -192,7 +192,7 @@ This heterogeneous platform executes more than 7 GFlops per watt while consuming
  50.32 kilowatts.
  
  Besides platform improvements, there are many software and hardware techniques
-to lower the energy consumption of these platforms, such as DVFS, scheduling \textcolor{blue}{and other techniques}.
+to lower the energy consumption of these platforms, such as DVFS, scheduling and other techniques.
   DVFS is a widely used process to reduce the energy consumption of a
  processor by lowering its frequency
  \cite{Rizvandi_Some.Observations.on.Optimal.Frequency}. However, it also reduces
@@ -1229,8 +1229,8 @@ The experimental results, the energy saving, performance degradation and trade-o
  presented in  Figures~\ref{fig:edp-eng}, \ref{fig:edp-perf} and \ref{fig:edp-dist} respectively.
  
  As shown in these figures, the proposed frequencies selection algorithm, Maxdist, outperforms the EDP algorithm in terms of energy consumption reduction and performance for all of the benchmarks executed over the two scenarios. 
-The proposed algorithm gives better results than the EDP method because it 
-maximizes the energy saving and the performance at the same time. 
+The proposed algorithm gives better results than the EDP method because the former selects the set of frequencies that  
+gives the best trade-off between energy saving and performance. 
  Moreover, the proposed scaling algorithm gives the same weight for these two metrics.
  Whereas, the EDP algorithm gives sometimes negative trade-off values for some benchmarks in the two sites scenarios.
  These negative trade-off values mean that the performance degradation percentage is higher than the energy saving percentage.
diff --git a/mpi-energy2-extension/review/review.pdf b/mpi-energy2-extension/review/review.pdf

index fa094326cab6529b6734a3e5c8d1bb13edf9987f..61518cfdb0b16f40fd0f3af6241f9666c95da34d 100644

Binary files a/mpi-energy2-extension/review/review.pdf and b/mpi-energy2-extension/review/review.pdf differ
diff --git a/mpi-energy2-extension/review/review.tex b/mpi-energy2-extension/review/review.tex

index e7525c15231fa124332292a6c386c8d871ea3c53..4e5eef6248bdc2c1c8f6033e3b70b8f5a64be09f 100644
--- a/mpi-energy2-extension/review/review.tex
+++ b/mpi-energy2-extension/review/review.tex
@@ -54,7 +54,7 @@
  
  \title{Answers to the questions of the reviewers}
  \maketitle
-\section{Questions of the first reviewer} 
+\section{Questions and remarks of the first reviewer} 
  
  \begin{enumerate}
  
@@ -112,16 +112,17 @@ instance). You say that on scale this would produce less energy
  savings, but your arguments for providing a solution for this was
  based that today's supercomputers are achieving massive scale.
  
-\textbf{Answer:} In the Figure 7, the energy consumption of the benchmarks solving the class D and running on many scenarios are presented. The number of used nodes varies between 16 and 32 in the scenarios while the size of the problem is not modified. Therefore, the computations to communications times ratio is lower when 32 nodes are used instead of 16. When this ratio is small, it means there are not enough computations when compared to the communications times and the impact of scaling down the frequency of the CPU on its energy consumption is reduced. To solve this problem, the problem 
+\textbf{Answer:} In Figure 7, the energy consumption of the benchmarks solving the class D and running on many scenarios is presented. The number of used nodes varies between 16 and 32 in the scenarios while the size of the problem is not modified. Therefore, the computations to communications times ratio is lower when 32 nodes are used instead of 16. When this ratio is small, there are not enough computations compared to the communication times, and the impact of scaling down the frequency of the CPU on its energy consumption is reduced. To address this issue, the problem 
  should be solved on a number of nodes adequate to its size. For example, for the NAS benchmarks, the class E should have been solved on 32 nodes to have a good computations to communications times ratio.  
  
  
  \item In Sec 6.3, why did you choose to keep 32 processes for the evaluation
  with multi-core clusters? How did you configure MPI for the results
  
-\textcolor{blue}{Answer:  We keep choosing 32 nodes in both scenarios 
-to compare them while one core per node scenario has distributed communications (one network link for each node) and multi-core scenario uses shared network link communications and thus comparing their impact on the results. 
-We configure MPI on one core per node scenario by choosing one core per  nodes (e.g in machine file we did: node1, node2 ,node3, node4). While in multi-core scenario we choose one machine with four cores (e.g. node1 slots=4).}
+\textbf{Answer:} In section 6.3, we wanted to evaluate how much energy can be saved when applying the proposed algorithm to message passing applications with iterations running over a grid composed of multi-core nodes. Therefore, the same experiments as in section 6.2 were conducted on the new  multi-core platform. Instead of running one process per node as in the previous section, 3 or 4 processes were executed on each multi-core node. The total number of processes, 32 processes, was not modified in order to fairly compare the single core and the multi-core versions. 
+
+Only the architecture file was modified between the single and the multi-core architectures. For the single core architecture, the architecture file contains the names of 32 different nodes. For the multi-core architecture, the architecture file contains fewer nodes and for every node 3 or 4 slots (cores) are used. The total number of slots is equal to 32.
+
   
  
  \item shown in Fig 8a? Some MPI implementations have an option to use shared
@@ -130,8 +131,8 @@ the explanation of the network card utilization, but this
  shared-memory optimization is possible (sometimes automatically
  detected by MPI if you pin processes to cores).
  
-\textcolor{blue}{Answer: We didn't  manually pin processes to cores and since the communication times  
-increased. We guess that the shared memory wasn't used.}
+\textbf{Answer:} We did not  manually pin processes to cores. Since the communication times  
+increased, we think that the shared memory was not used when two processes, running on the same node, exchanged data.
  
  \item In P33, Sec 6.5, you mention that the proposed algorithm outperforms
  EDP because the former considers both metrics (time, energy) and the
@@ -139,8 +140,7 @@ same time. EDP does also, but using a single metric which you have
  defined: energy x execution time. I think this is only a matter of
  phrasing.
  
-\textcolor{blue}{Answer: we use the delay in execution time not the execution time. Then, the equation that we used is EDP= energy x (Tnew-Told). The experiments shows that our objective function 
-is better than the EDP objective.}
+ \textbf{Answer:}  We agree with the reviewer: EDP also uses two metrics in the objective function, energy and delay. The sentence in the paper was modified to clarify this misunderstanding. The main difference between our algorithm and the EDP method is the objective function used. For EDP, the product of energy and delay must be minimized, while for our algorithm, the difference between the normalized performance and the normalized energy should be maximized. This new formulation of the objective function allows our algorithm to select the set of frequencies that gives the best trade-off between the energy consumption and the performance. The objective function of EDP does not give the same frequencies as our algorithm and thus it is outperformed by our method. The results of the experiments confirm that the objective function used by our algorithm  is more efficient than the one used by EDP.
  
  
  \item Other complementary points to consider:
@@ -159,7 +159,7 @@ is better than the EDP objective.}
  
  + Same for Fig 7.
  
-\item \textcolor{blue}{ Answer: We have considered these points in the paper.}
+  \textbf{Answer:} We have taken all these remarks into consideration and the paper was modified accordingly.
  
  \item From the design of experiments, did you consider using replications?
  There is no variability metric in your results. Have you run multiple
@@ -167,24 +167,24 @@ times and got the average (execution time and energy consumption)? I
  feel that such variability needs to be accounted for, otherwise it is
  very hard to affirm anything about measurements.
  
- \textcolor{blue}{Answer: Each experiment has been executed many times and the results presented in the 
- figures are the average values of many executions.}
+ \textbf{Answer:}  Each experiment has been executed many times and the results presented in the 
+ figures are the average values of many executions. Since we have deployed the same operating system on the booked machines and we were the only users executing processes on them during the experiments, no significant variability in the execution time of the applications was noticed. 
  
  
  \item In summary, I think this is a very interesting work but the experimental evaluation lacks variability measurements, consider larger experiments (1K nodes for instance) to see how everything scales,  and there is no overhead measurements although authors stress that in abstract/introduction.
  
-\textcolor{blue}{Answer: We will expand the experimental over a large number of nodes in the future work while increasing the  problem size and considering the variability issues. We have discussed the algorithm overhead and its complexity in section 6.5.}
-\end{enumerate}
+\textbf{Answer:} For the time being, we have neither the resources nor the time to evaluate the proposed algorithm over large platforms composed of more than 1K nodes. However, as said in the perspectives of the paper, the evaluation of the scalability of the algorithm will be conducted in future work as soon as we have access to larger resources. We have discussed the overhead of the algorithm and its complexity in section 6.5 and given in the answer to question 2 some solutions to improve its scalability and reduce its overhead. 
  
+For the variability issue, please refer to the answer to question 1. 
  
-\section{Questions of the second reviewer} 
-This paper presents detailed performance and energy model for iterative message passing applications. Further a method is proposed to select the frequencies of heterogeneous cpus online. The selection method itself is not difficult. But I like the systematic modeling for energy consumption and performance. This paper is well written in general. The technical contents are presented in a logical way overall. The experiments are conducted in real platform, which shows the practicality of this work and also makes the work have more impact on the field. However, I have the following comments and concerns for this paper. The authors should clarify them in the revised version. 
+  
+\end{enumerate}
  
  
--move the contributions from related work to introduction
+\section{Questions and remarks of the second reviewer} 
  
  \begin{enumerate}
-
+\item Move the contributions from related work to introduction
  
  \item why emphasize it is a grid platform? the presentation of related work follows the logic of heterogeneous CPUs. Grid is only a type of platform with heterogeneous CPUs
  
@@ -290,14 +290,7 @@ and with or without message passing. Indeed, the proposed method can be applied
  The proposed method can be applied to a heterogeneous platform composed from GPUs and CPUs, since modern GPUs like CPUs allow the use of DVFS operation.}
  \end{enumerate}
  
-\section{Questions of the third reviewer}
-In this paper, a new online frequency selecting algorithm for grids, composed of heterogeneous clusters, is presented. It selects the frequencies and tries to give the best trade-off between
-energy saving and performance degradation, for each node computing the message passing iterative application. The algorithm has a small overhead and works without training or profiling. It uses a new energy model for message
-passing iterative applications running on a grid. The proposed algorithm is evaluated on a real grid, the Grid'5000 platform, while running the NAS parallel benchmarks. The experiments on 16 nodes, distributed on three clusters, show
-that it reduces on average the energy consumption by 30\% while the performance is on average only degraded by 3.2\%. Finally, the algorithm is compared to an existing method. The comparison results show that it outperforms the
-latter in terms of energy consumption reduction and performance.
-
-this paper is quite interesting and solid. But before acceptance, I suggest to have the following major revisions:
+\section{Questions and remarks of the third reviewer}
  \begin{enumerate}
  
  \item suggest the authors to use much larger size of nodes, instead of on 16 nodes, distributed on three clusters, to see the scalability of the energy saving