Fix underfull hbox.

[mpi-energy2.git] / Heter_paper.tex
diff --git a/Heter_paper.tex b/Heter_paper.tex

index 9d743db707e2ffa874517ab225d0d211b722b9f3..5d2837732a9458ededc33d95f2c1293ffb17bba1 100644 (file)
--- a/Heter_paper.tex
+++ b/Heter_paper.tex
@@ -63,8 +63,7 @@
      Arnaud Giersch
    } 
    \IEEEauthorblockA{%
      Arnaud Giersch
    } 
    \IEEEauthorblockA{%
-    FEMTO-ST Institute\\
-    University of Franche-Comté\\
+    FEMTO-ST Institute, University of Franche-Comte\\
      IUT de Belfort-Montbéliard,
      19 avenue du Maréchal Juin, BP 527, 90016 Belfort cedex, France\\
      % Telephone: \mbox{+33 3 84 58 77 86}, % Raphaël
      IUT de Belfort-Montbéliard,
      19 avenue du Maréchal Juin, BP 527, 90016 Belfort cedex, France\\
      % Telephone: \mbox{+33 3 84 58 77 86}, % Raphaël
@@ -82,16 +81,15 @@ platforms many techniques have been  used. Dynamic voltage and frequency scaling
  (DVFS) is  one of them. It  reduces the frequency of  a CPU to  lower its energy
  consumption.  However,  lowering the  frequency  of  a  CPU might  increase  the
  execution  time of  an application  running on  that processor.   Therefore, the
  (DVFS) is  one of them. It  reduces the frequency of  a CPU to  lower its energy
  consumption.  However,  lowering the  frequency  of  a  CPU might  increase  the
  execution  time of  an application  running on  that processor.   Therefore, the
-frequency that  gives the best tradeoff  between the energy  consumption and the
-performance of an application must be selected.
-
+frequency that  gives the best trade-off  between the energy  consumption and the
+performance of an application must be selected.\\
  In this  paper, a new  online frequencies selecting algorithm  for heterogeneous
  platforms is presented.   It selects the frequency which tries  to give the best
  In this  paper, a new  online frequencies selecting algorithm  for heterogeneous
  platforms is presented.   It selects the frequency which tries  to give the best
-tradeoff  between  energy saving  and  performance  degradation,  for each  node
+trade-off  between  energy saving  and  performance  degradation,  for each  node
  computing the message  passing iterative application. The algorithm  has a small
  overhead and works without training or profiling. It uses a new energy model for
  message passing iterative applications  running on a heterogeneous platform. The
  computing the message  passing iterative application. The algorithm  has a small
  overhead and works without training or profiling. It uses a new energy model for
  message passing iterative applications  running on a heterogeneous platform. The
-proposed algorithm is  evaluated on the Simgrid simulator  while running the NAS
+proposed algorithm is  evaluated on the SimGrid simulator  while running the NAS
  parallel  benchmarks.  The  experiments   show  that  it  reduces  the  energy
  consumption by up to 35\% while  limiting the performance degradation as much as
  possible.   Finally,  the algorithm  is  compared  to  an existing  method,  the
  parallel  benchmarks.  The  experiments   show  that  it  reduces  the  energy
  consumption by up to 35\% while  limiting the performance degradation as much as
  possible.   Finally,  the algorithm  is  compared  to  an existing  method,  the
@@ -127,7 +125,7 @@ processor            by             lowering            its            frequency
  the number of FLOPS executed by the processor which might increase the execution
  time of the application running over that processor.  Therefore, researchers use
  different optimization  strategies to select  the frequency that gives  the best
  the number of FLOPS executed by the processor which might increase the execution
  time of the application running over that processor.  Therefore, researchers use
  different optimization  strategies to select  the frequency that gives  the best
-tradeoff  between the  energy reduction  and performance  degradation  ratio. In
+trade-off  between the  energy reduction  and performance  degradation  ratio. In
  \cite{Our_first_paper}, a  frequency selecting algorithm was  proposed to reduce
  the energy  consumption of message  passing iterative applications  running over
  homogeneous platforms.  The results of  the experiments show  significant energy
  \cite{Our_first_paper}, a  frequency selecting algorithm was  proposed to reduce
  the energy  consumption of message  passing iterative applications  running over
  homogeneous platforms.  The results of  the experiments show  significant energy
@@ -195,7 +193,7 @@ efficiency than other clusters only composed of  CPUs.
   
  The work presented in this paper concerns the second type of platform, with heterogeneous CPUs.
  Many methods were conceived to reduce the energy consumption of this type of platform.  Naveen et al.~\cite{Naveen_Power.Efficient.Resource.Scaling}  
   
  The work presented in this paper concerns the second type of platform, with heterogeneous CPUs.
  Many methods were conceived to reduce the energy consumption of this type of platform.  Naveen et al.~\cite{Naveen_Power.Efficient.Resource.Scaling}  
-developed a method that minimizes the value of $energy*delay^2$ (the delay is the sum of slack times that happen during synchronous communications) by dynamically assigning new frequencies to the CPUs of the heterogeneous cluster. Lizhe et al.~\cite{Lizhe_Energy.aware.parallel.task.scheduling} proposed
+developed a method that minimizes the value of $energy\cdot delay^2$ (the delay is the sum of slack times that happen during synchronous communications) by dynamically assigning new frequencies to the CPUs of the heterogeneous cluster. Lizhe et al.~\cite{Lizhe_Energy.aware.parallel.task.scheduling} proposed
  an algorithm that divides the executed tasks into two types: the critical and 
  non critical tasks. The algorithm scales down the frequency of  non critical tasks proportionally to their  slack and communication times while limiting  the performance degradation percentage to less than 10\%. In~\cite{Joshi_Blackbox.prediction.of.impact.of.DVFS}, they developed 
    a heterogeneous cluster composed of two  types 
  an algorithm that divides the executed tasks into two types: the critical and 
  non critical tasks. The algorithm scales down the frequency of  non critical tasks proportionally to their  slack and communication times while limiting  the performance degradation percentage to less than 10\%. In~\cite{Joshi_Blackbox.prediction.of.impact.of.DVFS}, they developed 
    a heterogeneous cluster composed of two  types 
@@ -240,7 +238,7 @@ nodes to finish  their computations (see Figure~(\ref{fig:heter})).
  Therefore,  the overall execution time  of the program is the execution time of the slowest
  task which has the highest computation time and no slack time.
    
  Therefore,  the overall execution time  of the program is the execution time of the slowest
  task which has the highest computation time and no slack time.
    
- \begin{figure}[t]
+ \begin{figure}[!t]
    \centering
     \includegraphics[scale=0.6]{fig/commtasks}
    \caption{Parallel tasks on a heterogeneous platform}
    \centering
     \includegraphics[scale=0.6]{fig/commtasks}
    \caption{Parallel tasks on a heterogeneous platform}
@@ -285,7 +283,7 @@ vector of scaling factors can be predicted using (\ref{eq:perf}).
   \textit  T_\textit{new} = 
   \max_{i=1,2,\dots,N} ({TcpOld_{i}} \cdot S_{i}) +  MinTcm 
  \end{equation}
   \textit  T_\textit{new} = 
   \max_{i=1,2,\dots,N} ({TcpOld_{i}} \cdot S_{i}) +  MinTcm 
  \end{equation}
-Where:\\
+Where:
  \begin{equation}
  \label{eq:perf2}
   MinTcm = \min_{i=1,2,\dots,N} (Tcm_i)
  \begin{equation}
  \label{eq:perf2}
   MinTcm = \min_{i=1,2,\dots,N} (Tcm_i)
@@ -486,7 +484,7 @@ normalized execution time is inverted which gives the normalized performance equ
  \end{multline}
  
  
  \end{multline}
  
  
-\begin{figure}
+\begin{figure}[!t]
    \centering
    \subfloat[Homogeneous platform]{%
      \includegraphics[width=.33\textwidth]{fig/homo}\label{fig:r1}}%
    \centering
    \subfloat[Homogeneous platform]{%
      \includegraphics[width=.33\textwidth]{fig/homo}\label{fig:r1}}%
@@ -590,7 +588,7 @@ words, until they reach the higher bound. It can also be noticed that the higher
  the difference between the faster nodes  and the slower nodes is, the bigger the
  maximum distance  between the  energy curve and  the performance curve  is while
   the scaling factors are varying which results in bigger energy savings.
  the difference between the faster nodes  and the slower nodes is, the bigger the
  maximum distance  between the  energy curve and  the performance curve  is while
   the scaling factors are varying which results in bigger energy savings.
-\begin{figure}[t]
+\begin{figure}[!t]
    \centering
      \includegraphics[scale=0.5]{fig/start_freq}
    \caption{Selecting the initial frequencies}
    \centering
      \includegraphics[scale=0.5]{fig/start_freq}
    \caption{Selecting the initial frequencies}
@@ -714,10 +712,10 @@ highest frequency,  each node  consumed an amount  of power proportional  to its
  computing  power  (which  corresponds to  80\%  of  its  dynamic power  and  the
  remaining  20\%  to  the  static   power),  the  same  assumption  was  made  in
  \cite{Our_first_paper,Rauber_Analytical.Modeling.for.Energy}.    Finally,  These
  computing  power  (which  corresponds to  80\%  of  its  dynamic power  and  the
  remaining  20\%  to  the  static   power),  the  same  assumption  was  made  in
  \cite{Our_first_paper,Rauber_Analytical.Modeling.for.Energy}.    Finally,  These
-nodes were connected via an ethernet network with 1 Gbit/s bandwidth.
+nodes were connected via an Ethernet network with 1 Gbit/s bandwidth.
  
  
  
  
-\begin{table}[htb]
+\begin{table}[!t]
    \caption{Heterogeneous nodes characteristics}
    % title of Table
    \centering
    \caption{Heterogeneous nodes characteristics}
    % title of Table
    \centering
@@ -762,7 +760,7 @@ be executed on $1, 4, 9, 16, 36, 64, 144$ nodes.
  
   
   
  
   
   
-\begin{table}[htb]
+\begin{table}[!t]
    \caption{Running NAS benchmarks on 4 nodes }
    % title of Table
    \centering
    \caption{Running NAS benchmarks on 4 nodes }
    % title of Table
    \centering
@@ -787,9 +785,10 @@ be executed on $1, 4, 9, 16, 36, 64, 144$ nodes.
  \hline 
    \end{tabular}
    \label{table:res_4n}
  \hline 
    \end{tabular}
    \label{table:res_4n}
-\end{table}
+% \end{table}
  
  
-\begin{table}[htb]
+\medskip
+% \begin{table}[!t]
    \caption{Running NAS benchmarks on 8 and 9 nodes }
    % title of Table
    \centering
    \caption{Running NAS benchmarks on 8 and 9 nodes }
    % title of Table
    \centering
@@ -814,9 +813,10 @@ be executed on $1, 4, 9, 16, 36, 64, 144$ nodes.
  \hline 
    \end{tabular}
    \label{table:res_8n}
  \hline 
    \end{tabular}
    \label{table:res_8n}
-\end{table}
+% \end{table}
  
  
-\begin{table}[htb]
+\medskip
+% \begin{table}[!t]
    \caption{Running NAS benchmarks on 16 nodes }
    % title of Table
    \centering
    \caption{Running NAS benchmarks on 16 nodes }
    % title of Table
    \centering
@@ -841,9 +841,10 @@ be executed on $1, 4, 9, 16, 36, 64, 144$ nodes.
  \hline 
    \end{tabular}
    \label{table:res_16n}
  \hline 
    \end{tabular}
    \label{table:res_16n}
-\end{table}
+% \end{table}
  
  
-\begin{table}[htb]
+\medskip
+% \begin{table}[!t]
    \caption{Running NAS benchmarks on 32 and 36 nodes }
    % title of Table
    \centering
    \caption{Running NAS benchmarks on 32 and 36 nodes }
    % title of Table
    \centering
@@ -868,9 +869,10 @@ be executed on $1, 4, 9, 16, 36, 64, 144$ nodes.
  \hline 
    \end{tabular}
    \label{table:res_32n}
  \hline 
    \end{tabular}
    \label{table:res_32n}
-\end{table}
+% \end{table}
  
  
-\begin{table}[htb]
+\medskip
+% \begin{table}[!t]
    \caption{Running NAS benchmarks on 64 nodes }
    % title of Table
    \centering
    \caption{Running NAS benchmarks on 64 nodes }
    % title of Table
    \centering
@@ -895,10 +897,10 @@ be executed on $1, 4, 9, 16, 36, 64, 144$ nodes.
  \hline 
    \end{tabular}
    \label{table:res_64n}
  \hline 
    \end{tabular}
    \label{table:res_64n}
-\end{table}
-
+% \end{table}
  
  
-\begin{table}[htb]
+\medskip
+% \begin{table}[!t]
    \caption{Running NAS benchmarks on 128 and 144 nodes }
    % title of Table
    \centering
    \caption{Running NAS benchmarks on 128 and 144 nodes }
    % title of Table
    \centering
@@ -959,7 +961,7 @@ small when compared to the communication times.
  
  
   
  
  
   
-\begin{figure}
+\begin{figure}[!t]
    \centering
    \subfloat[Energy saving]{%
      \includegraphics[width=.33\textwidth]{fig/energy}\label{fig:energy}}%
    \centering
    \subfloat[Energy saving]{%
      \includegraphics[width=.33\textwidth]{fig/energy}\label{fig:energy}}%
@@ -1014,7 +1016,7 @@ in the 70\%-30\% scenario compared to the 90\%-10\% scenario. This is due to the
  higher  static  power percentage  in  the first  scenario  which  makes it  more
  relevant in the  overall consumed energy.  Indeed, the  static energy is related
  to the execution time and if  the performance is degraded the amount of consumed
  higher  static  power percentage  in  the first  scenario  which  makes it  more
  relevant in the  overall consumed energy.  Indeed, the  static energy is related
  to the execution time and if  the performance is degraded the amount of consumed
-static  energy directly  increas.  Therefore,  the proposed  algorithm  does not
+static  energy directly  increases.  Therefore,  the proposed  algorithm  does not
  really significantly  scale down much the  frequencies of the nodes  in order to
  limit the  increase of the  execution time and  thus limiting the effect  of the
  consumed static energy.
  really significantly  scale down much the  frequencies of the nodes  in order to
  limit the  increase of the  execution time and  thus limiting the effect  of the
  consumed static energy.
@@ -1040,7 +1042,7 @@ scaling  values which result  in less  energy saving  but also  less performance
  degradation.
  
  
  degradation.
  
  
- \begin{table}[htb]
+ \begin{table}[!t]
    \caption{The results of the 70\%-30\% power scenario}
    % title of Table
    \centering
    \caption{The results of the 70\%-30\% power scenario}
    % title of Table
    \centering
@@ -1069,7 +1071,7 @@ degradation.
  
  
  
  
  
  
-\begin{table}[htb]
+\begin{table}[!t]
    \caption{The results of the 90\%-10\% power scenario}
    % title of Table
    \centering
    \caption{The results of the 90\%-10\% power scenario}
    % title of Table
    \centering
@@ -1097,7 +1099,7 @@ degradation.
  \end{table}
  
  
  \end{table}
  
  
-\begin{figure}
+\begin{figure}[!t]
    \centering
    \subfloat[Comparison  between the results on 8 nodes]{%
      \includegraphics[width=.33\textwidth]{fig/sen_comp}\label{fig:sen_comp}}%
    \centering
    \subfloat[Comparison  between the results on 8 nodes]{%
      \includegraphics[width=.33\textwidth]{fig/sen_comp}\label{fig:sen_comp}}%
@@ -1115,23 +1117,23 @@ degradation.
  \label{sec.compare_EDP}
  In this section, the scaling  factors selection algorithm, called MaxDist,
  is compared to Spiliopoulos et al. algorithm \cite{Spiliopoulos_Green.governors.Adaptive.DVFS}, called EDP. 
  \label{sec.compare_EDP}
  In this section, the scaling  factors selection algorithm, called MaxDist,
  is compared to Spiliopoulos et al. algorithm \cite{Spiliopoulos_Green.governors.Adaptive.DVFS}, called EDP. 
-They developed a green governor that regularly applies an online frequency selecting algorithm to reduce the energy consumed by a multicore architecture without degrading much its performance. The algorithm selects the frequencies that minimize the energy and delay products, $EDP=Enegry*Delay$ using the predicted overall energy consumption and execution time delay for each frequency.
+They developed a green governor that regularly applies an online frequency selecting algorithm to reduce the energy consumed by a multicore architecture without degrading much its performance. The algorithm selects the frequencies that minimize the energy and delay products, $EDP=Energy\cdot Delay$ using the predicted overall energy consumption and execution time delay for each frequency.
  To fairly compare both algorithms, the same energy and execution time models, equations (\ref{eq:energy}) and  (\ref{eq:fnew}), were used for both algorithms to predict the energy consumption and the execution times. Also Spiliopoulos et al. algorithm was adapted to  start the search from the 
  initial frequencies computed using the equation (\ref{eq:Fint}). The resulting algorithm is an exhaustive search algorithm that minimizes the EDP and has the initial frequencies values as an upper bound.
  
  To fairly compare both algorithms, the same energy and execution time models, equations (\ref{eq:energy}) and  (\ref{eq:fnew}), were used for both algorithms to predict the energy consumption and the execution times. Also Spiliopoulos et al. algorithm was adapted to  start the search from the 
  initial frequencies computed using the equation (\ref{eq:Fint}). The resulting algorithm is an exhaustive search algorithm that minimizes the EDP and has the initial frequencies values as an upper bound.
  
-Both algorithms were applied to the parallel NAS benchmarks to compare their efficiency. Table \ref{table:compare_EDP}  presents the results of comparing the execution times and the energy consumptions for both versions of the NAS benchmarks while running the class C of each benchmark over 8 or 9 heterogeneous nodes. The results show that our algorithm provides better energy savings than Spiliopoulos et al. algorithm, 
+Both algorithms were applied to the parallel NAS benchmarks to compare their efficiency. Table \ref{table:compare_EDP}  presents the results of comparing the execution times and the energy consumption for both versions of the NAS benchmarks while running the class C of each benchmark over 8 or 9 heterogeneous nodes. The results show that our algorithm provides better energy savings than Spiliopoulos et al. algorithm, 
  on average it results in 29.76\% energy saving while their algorithm returns just 25.75\%. The average of performance degradation percentage is approximately the same for both algorithms, about 4\%. 
  
  
  For all benchmarks,  our algorithm outperforms Spiliopoulos et  al. algorithm in
  on average it results in 29.76\% energy saving while their algorithm returns just 25.75\%. The average of performance degradation percentage is approximately the same for both algorithms, about 4\%. 
  
  
  For all benchmarks,  our algorithm outperforms Spiliopoulos et  al. algorithm in
-terms of  energy and  performance tradeoff, see  figure (\ref{fig:compare_EDP}),
+terms of  energy and  performance trade-off, see  figure (\ref{fig:compare_EDP}),
  because it maximizes the distance  between the energy saving and the performance
  degradation values while giving the same weight for both metrics.
  
  
  
  
  because it maximizes the distance  between the energy saving and the performance
  degradation values while giving the same weight for both metrics.
  
  
  
  
-\begin{table}[h]
+\begin{table}[!t]
   \caption{Comparing the proposed algorithm}
   \centering
  \begin{tabular}{|l|l|l|l|l|l|l|l|}
   \caption{Comparing the proposed algorithm}
   \centering
  \begin{tabular}{|l|l|l|l|l|l|l|l|}
@@ -1154,10 +1156,10 @@ degradation values while giving the same weight for both metrics.
  
  
  
  
  
  
-\begin{figure}[t]
+\begin{figure}[!t]
    \centering
     \includegraphics[scale=0.5]{fig/compare_EDP.pdf}
    \centering
     \includegraphics[scale=0.5]{fig/compare_EDP.pdf}
-  \caption{Tradeoff comparison for NAS benchmarks class C}
+  \caption{Trade-off comparison for NAS benchmarks class C}
    \label{fig:compare_EDP}
  \end{figure}
  
    \label{fig:compare_EDP}
  \end{figure}
  
@@ -1166,19 +1168,19 @@ degradation values while giving the same weight for both metrics.
  \label{sec.concl} 
  In this paper, a new online frequency selecting algorithm has been presented. It
  selects the  best possible  vector of frequency  scaling factors that  gives the
  \label{sec.concl} 
  In this paper, a new online frequency selecting algorithm has been presented. It
  selects the  best possible  vector of frequency  scaling factors that  gives the
-maximum  distance  (optimal  tradeoff)  between  the predicted  energy  and  the
+maximum  distance  (optimal  trade-off)  between  the predicted  energy  and  the
  predicted performance curves for a heterogeneous platform. This algorithm uses a
  new  energy  model  for  measuring  and predicting  the  energy  of  distributed
  iterative  applications running  over heterogeneous  platforms. To  evaluate the
  proposed method, it was applied on the NAS parallel benchmarks and executed over
  predicted performance curves for a heterogeneous platform. This algorithm uses a
  new  energy  model  for  measuring  and predicting  the  energy  of  distributed
  iterative  applications running  over heterogeneous  platforms. To  evaluate the
  proposed method, it was applied on the NAS parallel benchmarks and executed over
-a heterogeneous  platform simulated by  Simgrid. The results of  the experiments
+a heterogeneous  platform simulated by  SimGrid. The results of  the experiments
  showed that the algorithm reduces up to 35\% the energy consumption of a message
  passing iterative method while limiting  the degradation of the performance. The
  algorithm also selects different scaling  factors according to the percentage of
  the computing and communication times, and according to the values of the static
  and  dynamic  powers  of the  CPUs.   Finally,  the  algorithm was  compared  to
  Spiliopoulos et al.  algorithm and  the results showed that it outperforms their
  showed that the algorithm reduces up to 35\% the energy consumption of a message
  passing iterative method while limiting  the degradation of the performance. The
  algorithm also selects different scaling  factors according to the percentage of
  the computing and communication times, and according to the values of the static
  and  dynamic  powers  of the  CPUs.   Finally,  the  algorithm was  compared  to
  Spiliopoulos et al.  algorithm and  the results showed that it outperforms their
-algorithm in terms of energy-time tradeoff.
+algorithm in terms of energy-time trade-off.
  
  In the near future, this method  will be applied to real heterogeneous platforms
  to evaluate its  performance in a real study case. It  would also be interesting
  
  In the near future, this method  will be applied to real heterogeneous platforms
  to evaluate its  performance in a real study case. It  would also be interesting
@@ -1216,6 +1218,7 @@ Babylon (Iraq) for supporting his work.
  %%% End:
  
  % LocalWords:  Fanfakh Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber
  %%% End:
  
  % LocalWords:  Fanfakh Charr FIXME Tianhe DVFS HPC NAS NPB SMPI Rauber's Rauber
-% LocalWords:  CMOS EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex
+% LocalWords:  CMOS EPSA Franche Comté Tflop Rünger IUT Maréchal Juin cedex GPU
  % LocalWords:  de badri muslim MPI TcpOld TcmOld dNew dOld cp Sopt Tcp Tcm Ps
  % LocalWords:  de badri muslim MPI TcpOld TcmOld dNew dOld cp Sopt Tcp Tcm Ps
-% LocalWords:  Scp Fmax Fdiff SimGrid GFlops Xeon EP BT
+% LocalWords:  Scp Fmax Fdiff SimGrid GFlops Xeon EP BT GPUs CPUs AMD
+%  LocalWords:  Spiliopoulos scalability