adding some changes

diff --git a/CHAPITRE_03.tex b/CHAPITRE_03.tex

index 27d95ecb0c5c36a5aa89295f4ef0876050a8ef00..953bcfcf71a9a7027f67bccae389933b7cae6421 100644 (file)
--- a/CHAPITRE_03.tex
+++ b/CHAPITRE_03.tex
@@ -68,16 +68,16 @@
    clusters (heterogeneous CPUs)  and grid platforms are presented.  
    They select the frequencies that try to give the best
    trade-off between energy saving and performance degradation, for each node
-  computing the synchronous message passing iterative application. These algorithms have a small
+  computing the synchronous message passing  application with iterations. These algorithms have a small
    overhead and work without training or profiling. They use new energy models
-  for  message passing iterative synchronous applications running on both the heterogeneous
+  for  message passing  synchronous applications with iterations running on both the heterogeneous
    local cluster and the grid platform. The first proposed algorithm for a heterogeneous local 
    cluster was evaluated on the SimGrid simulator while running the class C of the NAS parallel
    benchmarks. The experiments conducted  over 8 heterogeneous nodes show that it reduces on 
    average the energy consumption by  29.8\%  while limiting the performance degradation to 3.8\%.  
    The second proposed algorithm for a grid platform was evaluated on the Grid5000 testbed 
    platform  while running the class D of the NAS parallel benchmarks.
-  The experiments were run on 16 nodes, distributed on three clusters, and show that it reduces  
+  The experiments were run on 16 nodes, distributed on three clusters, and show that the algorithm reduces  
    on average the energy consumption by 30\% while  the performance  is on average only degraded
     by 3.2\%.  
    Finally, both algorithms were compared to the EDP method. The comparison 
@@ -96,13 +96,12 @@ Section~\ref{ch3:3} shows the energy and performance models in addition to the f
  selecting algorithm of synchronous message passing programs running over a grid platform.
  Section~\ref{ch3:4} presents the results of applying the algorithm on the 
  NAS parallel benchmarks (class D) and executing them on the Grid'5000 testbed. 
-The algorithm is also evaluated over multi-core architectures and over three different power scenarios. Moreover, section~\ref{ch3:4}, shows the comparison results between the proposed method and the EDP method.
+The algorithm is also evaluated over multi-core architectures and over three different power scenarios. Moreover, Section~\ref{ch3:4} shows the comparison results between the proposed method and the EDP method.
  Finally, in Section~\ref{ch3:concl}  the chapter ends with a summary.
  
  \section{Related works}
  \label{ch3:relwork}
  
-As same as in CPUs, DVFS is also allowed in GPUs to reduce their energy consumption. 
  The process of selecting the appropriate frequency for a
  processor to satisfy some objectives, while taking into account all the
  constraints, is not a trivial operation.  Many researchers used different
@@ -119,7 +118,7 @@ sequential, parallel or distributed architecture, homogeneous or heterogeneous
  platform, synchronous or asynchronous application, \dots{}
  
  In this chapter, we are interested in reducing the energy consumption when running a message passing
-iterative synchronous applications  over a heterogeneous platform.  Some
+synchronous applications with iterations over a heterogeneous platform.  Some
  works have already been done for such platforms which can be classified into
  two types of heterogeneous platforms:
  \begin{itemize}
@@ -166,8 +165,8 @@ while respecting the given time constraint.  This approach had considerable
  overhead.  In contrast to the above described works, the work of this chapter presents the
  following contributions:
  \begin{enumerate}
-\item two new energy and two performance models for message passing iterative
-  synchronous applications running over a heterogeneous local cluster and a grid platform. 
+\item two new energy and two performance models for message passing 
+  synchronous applications with iterations running over a heterogeneous local cluster and a grid platform. 
    All the models take into account the communications and the slack times. The models can predict the
    energy  consumption and the execution time of the application.
  
@@ -175,18 +174,18 @@ following contributions:
    local cluster and a grid platform. The algorithms have a very small overhead and do not need any
    training or profiling. They use a new optimization function which
    simultaneously maximizes the performance and minimizes the energy consumption
-  of a message passing iterative synchronous application.
+  of a message passing synchronous application with iterations.
  \end{enumerate}
  
-\section[The energy optimization of a heterogeneous cluster]{The energy optimization of parallel iterative applications running over local heterogeneous 
+\section[The energy optimization of a heterogeneous cluster]{The energy optimization of parallel  applications with iterations running over local heterogeneous 
  clusters}
  \label{ch3:1}
  
-\subsection{The execution time of message passing distributed iterative
-  applications on a heterogeneous local cluster}
+\subsection{The execution time of message passing distributed 
+  applications with iterations on a heterogeneous local cluster}
  \label{ch3:1:1}
  In this section, we are interested in reducing the energy consumption of message
-passing distributed iterative synchronous applications running over heterogeneous local clusters. 
+passing distributed synchronous applications with iterations running over heterogeneous local clusters. 
  In this work, a heterogeneous local cluster is defined as a collection of
  heterogeneous computing nodes interconnected via a high speed homogeneous
  network. Therefore, the nodes may have different characteristics such as computing
@@ -200,7 +199,7 @@ have the same network bandwidth and latency.
    \label{fig:task-heter}
  \end{figure}
  
-The overall execution time of a distributed iterative synchronous application
+The overall execution time of a distributed synchronous application with iterations 
  over a heterogeneous local cluster consists of the sum of the computation time and
  the communication time for every iteration on a node. However, due to the
  heterogeneous computation power of the computing nodes, slack times may occur
@@ -227,8 +226,8 @@ Since in a heterogeneous cluster the nodes may have different characteristics,
  especially different frequency gears, when applying DVFS operations on these
  nodes, they may get different scaling factors represented by a scaling vector:
  $(S_1, S_2,\dots, S_N)$ where $S_i$ is the scaling factor of processor $i$. To
-be able to predict the execution time of message passing synchronous iterative
-applications running over a heterogeneous local cluster, for different vectors of
+be able to predict the execution time of message passing synchronous 
+applications with iterations running over a heterogeneous local cluster, for different vectors of
  scaling factors, the communication time and the computation time for all the
  tasks must be measured during the first iteration before applying any DVFS
  operation. Then the execution time for one iteration of the application with any
@@ -242,27 +241,27 @@ where $\TcpOld[i]$ is the computation time of processor $i$ during the first
  iteration.  The model computes the maximum computation time with
  scaling factor from each node added to the communication time of the slowest
  node. It means only the communication time without any slack time is taken into
-account.  Therefore, the execution time of the iterative application is equal to
+account.  Therefore, the execution time of the  application with iterations is equal to
  the execution time of one iteration as in (\ref{eq:perf_heter}) multiplied by the
  number of iterations of that application.
  
  This prediction model is improved from the model that predicts the execution time
  of message passing distributed applications for homogeneous
-architectures presented in chapter \ref{ch2} section \ref{ch2:3}.  The execution time prediction model is
+architectures presented in Chapter~\ref{ch2}, Section~\ref{ch2:3}.  The execution time prediction model is
  used in the method that optimizes both the energy consumption and the performance
-of iterative methods, which is presented in the following sections.
+of parallel applications with iterations, which is presented in the following sections.
  
  \subsection{Energy model for heterogeneous local cluster}
  \label{ch3:1:2}
-In  chapter \ref{ch2}, the dynamic and the static  energy consumption of a 
+In Chapter \ref{ch2}, the dynamic and the static  energy consumption of a 
  processor is computed according to Equations \ref{eq:Edyn_new} and \ref{eq:Estatic_new} respectively. Then, the total energy consumption of a processor is the sum of these two metrics.  
  Therefore, the overall energy consumption for the parallel tasks over  a parallel cluster 
  is the  summation of the energies consumed by all the processors. 
  
  In the considered heterogeneous platform, each processor $i$ may have
  different dynamic and static powers, noted as $\Pd[i]$ and $\Ps[i]$
-respectively.  Therefore, even if the distributed message passing iterative
-application is load balanced, the computation time of each CPU $i$ noted
+respectively.  Therefore, even if the distributed message passing  
+application with iterations is load balanced, the computation time of each CPU $i$ noted
  $\Tcp[i]$ may be different and different frequency scaling factors may be
  computed in order to decrease the overall energy consumption of the application
  and reduce the slack times.  The communication time of a processor $i$ is noted as
@@ -273,7 +272,7 @@ frequency scaling factor and the dynamic power of each node as in
  (\ref{eq:Edyn_new}), the static energy is computed as the sum of the execution time
  of one iteration as in \ref{eq:perf_heter}  multiplied by the static power of each processor.  
  The overall energy consumption of a message passing distributed application executed over a
-heterogeneous cluster during one iteration is the summation of all the dynamic and
+heterogeneous cluster during one iteration is the summation of the dynamic and
  static energies for all the processors.  It is computed as follows:
  \begin{equation}
    \label{eq:energy-heter}
@@ -286,7 +285,7 @@ Reducing the frequencies of the processors according to the vector of scaling
  factors $(S_1, S_2,\dots, S_N)$ may degrade the performance of the application
  and thus, increase the consumed static energy because the execution time is
  increased~\cite{ref78}. The overall energy consumption
-for an iterative application can be measured by measuring the energy
+for an  application with iterations can be measured by measuring the energy
  consumption for one iteration as in (\ref{eq:energy-heter}) multiplied by the number
  of iterations of that application.
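
As a hedged illustration of the two statements above (not the thesis implementation), the following Python sketch assumes that a scaling factor $S_i$ multiplies the computation time of node $i$ measured during the first iteration, and that `tcm[i]` holds the communication time of node $i$ without slack. The function names and the sample values are hypothetical. Only the static part of the energy is modelled, as the execution time of one iteration multiplied by the static power of each processor.

```python
from typing import Sequence


def iteration_time(tcp_old: Sequence[float], tcm: Sequence[float],
                   scaling: Sequence[float]) -> float:
    """Predicted time of one iteration for a given scaling vector.

    Assumption: the scaled computation time of node i is tcp_old[i] * scaling[i];
    the slowest node is the one with the largest scaled computation time, and
    only its communication time (no slack) is added.
    """
    scaled = [t * s for t, s in zip(tcp_old, scaling)]
    slowest = max(range(len(scaled)), key=scaled.__getitem__)
    return scaled[slowest] + tcm[slowest]


def static_energy(ps: Sequence[float], t_iter: float, iterations: int) -> float:
    """Static energy: one-iteration time times each node's static power,
    summed over the nodes and multiplied by the number of iterations."""
    return sum(p * t_iter for p in ps) * iterations


# Hypothetical two-node example (seconds and watts chosen for illustration).
tcp_old = [2.0, 1.0]
tcm = [0.5, 1.5]
ps = [10.0, 20.0]

t_max_freq = iteration_time(tcp_old, tcm, [1.0, 1.0])
t_scaled = iteration_time(tcp_old, tcm, [1.5, 1.5])
e_max_freq = static_energy(ps, t_max_freq, 100)
e_scaled = static_energy(ps, t_scaled, 100)
```

With these sample values, scaling both nodes lengthens the iteration, so the static energy over all iterations grows, which is the effect described in the paragraph above.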
  
@@ -303,14 +302,14 @@ the application might not be the optimal one.  It is not trivial to select the
  appropriate frequency scaling factor for each processor while considering the
  characteristics of each processor (computation power, range of frequencies,
  dynamic and static powers) and the task it is executing (computation/communication
-ratio).  In  chapter~\ref{ch2}, we proposed a method that selects the optimal
+ratio).  In  Chapter~\ref{ch2}, we proposed a method that selects the optimal
  frequency scaling factor for a homogeneous cluster executing a message passing
-iterative synchronous application while giving the best trade-off between the
+ synchronous application with iterations while giving the best trade-off between the
  energy consumption and the performance for such applications.  
  In this section, this optimization method is improved while considering a heterogeneous clusters.
  
  As described before, the  relation between the energy consumption and the execution time for an
-application is complex and nonlinear. Thus, to find the trade-off relation between the energy consumption computed in Equation \ref{eq:energy-heter} and the performance with Equation \ref{eq:perf_heter}  for the iterative message passing applications, first we need to normalize both term as follows:
+application is complex and nonlinear. Thus, to find the trade-off relation between the energy consumption computed in Equation \ref{eq:energy-heter} and the performance with Equation \ref{eq:perf_heter}  for the  message passing applications with iterations, first we need to normalize both terms as follows:
  
  
  \begin{equation}
@@ -334,7 +333,7 @@ application is complex and nonlinear. Thus, to find the trade-off relation betwe
  \begin{figure}[!t]
    \centering
      \includegraphics[width=.7\textwidth]{fig/ch3/heter}
-  \caption{The energy and performance relation in Heterogeneous cluster}
+  \caption{The energy and performance relation in heterogeneous cluster}
    \label{fig:rel-heter}
  \end{figure}
  
@@ -431,8 +430,8 @@ for the node $i$.  Then, the set of scaling factors that maximizes the objective
  In this section, Algorithm~\ref{HSA} is presented. It selects the frequency
  scaling factors vector that gives the best trade-off between minimizing the
  energy consumption and maximizing the performance of a message passing
-synchronous iterative application executed on a heterogeneous local cluster. It works
-online during the execution time of the iterative message passing program.  It
+synchronous application with iterations executed on a heterogeneous local cluster. It works
+online during the execution time of the  message passing program with iterations.  It
  uses information gathered during the first iteration such as the computation
  time and the communication time in one iteration for each node. The algorithm is
  executed after the first iteration and returns a vector of optimal frequency
@@ -440,7 +439,7 @@ scaling factors that satisfies the objective function (\ref{eq:max-heter}). The
  program applies DVFS operations to change the frequencies of the CPUs according
  to the computed scaling factors.  This algorithm is called just once during the
  execution of the program. Algorithm~\ref{dvfs-heter} shows where and when the proposed
-scaling algorithm is called in the iterative MPI program.
+scaling algorithm is called in the  MPI program with iterations.
  
  \begin{figure}[!t]
    \centering
@@ -578,10 +577,10 @@ specifications of real Intel processors.  The heterogeneous cluster had up to
  144 nodes and had nodes from the four types in equal proportions, for example if
  a benchmark was executed on 8 nodes, 2 nodes from each type were used. Since the
  constructors of CPUs do not specify the dynamic and the static power of their
-CPUs, for each type of node they were chosen proportionally to its computing
+CPUs, for each type of node they were chosen proportionally to their computing
  powers (FLOPS).  The dynamic power corresponds to 80\% of the overall power consumption while executing at 
  the higher frequency and the
-remaining 20\% is the static power. The same assumption was made in chapter \ref{ch2} and
+remaining 20\% is the static power. The same assumption was made in Chapter \ref{ch2} and
  \cite{ref3}.  Finally, These nodes were connected via an Ethernet network with 1 \textit{Gbit/s} bandwidth.
  
  
@@ -824,7 +823,7 @@ more dynamic power is consumed when nodes are running on their maximum
  frequencies, thus, scaling down the frequency of the nodes results in higher
  energy savings than in the 70\%-30\% scenario. On the other hand,
  the performance degradation percentage is smaller in the 70\%-30\%
-scenario compared to the 90\%-\%10 scenario. This is due to the
+scenario compared to the 90\%-10\% scenario. This is due to the
  higher static power percentage in the first scenario which makes it more
  relevant in the overall consumed energy.  Indeed, the static energy is related
  to the execution time and if the performance is degraded the amount of consumed
@@ -844,14 +843,14 @@ lowering the frequency of some processors. On the other hand, the energy saving
  decreases when the 70\%-30\% scenario is used because the dynamic
  energy is less relevant in the overall consumed energy and lowering the
  frequency does not return big energy savings.  Moreover, the average of the
-performance degradation is decreased when using a higher ratio for static power
+performance degradation is decreased when using a higher ratio for the static power
  (e.g.  70\%-30\% scenario and 80\%-20\% scenario). Since the proposed 
  algorithm optimizes the energy consumption when
  using a higher ratio for the dynamic power, the algorithm selects bigger frequency
  scaling factors that results in more energy saving but degrade the performance, for
  example see Figure~\ref{fig:powers-heter} (b). The opposite happens when using a
  higher ratio for the static power, the algorithm proportionally selects smaller
-scaling values which result in less energy saving but also less performance
+scaling values which result in less energy saving but also less performance
  degradation.
  
  \begin{table}[!t]
@@ -989,13 +988,13 @@ the   energy reduction  to  performance trade-off, see  Figure~\ref{fig:compare_
  because it maximizes the distance  between the energy saving and the performance
  degradation values while giving the same weight for both metrics.
  
-\section[The energy optimization of grid]{The energy optimization of parallel iterative applications running over grids}
+\section[The energy optimization of grid]{The energy optimization of parallel  applications with iterations  running over grids}
  \label{ch3:3}
  
  \subsection{The energy and performance models of grid platform}
  \label{ch3:3:1}
  In this section, we are interested in reducing the energy consumption of message
-passing  iterative synchronous applications running over
+passing   applications with synchronous iterations running over
  heterogeneous grid platforms. A heterogeneous grid platform could be defined as a collection of
  heterogeneous computing clusters interconnected via a long distance network which has a lower bandwidth 
  and a higher latency than the local networks of the clusters. Each computing cluster in the grid is composed of homogeneous nodes that are connected together via a high speed network. However, nodes from distinct  clusters may have different characteristics such as computing power (FLOPS), energy consumption, CPU's frequency range, network bandwidth and latency.
@@ -1003,9 +1002,9 @@ and a higher latency than the local networks of the clusters. Each computing clu
  Since in a heterogeneous grid each cluster has different characteristics,
   when applying DVFS operations on the nodes 
  of these clusters, they may get different scaling factors represented by a scaling vector:
-$(S_{11}, S_{12},\dots, S_{NM})$ where $S_{ij}$ is the scaling factor of processor $j$ in cluster $i$ . To
-be able to predict the execution time of message passing synchronous iterative
-applications running over a heterogeneous grid, for different vectors of
+$(S_{11}, S_{12},\dots, S_{NM})$ where $S_{ij}$ is the scaling factor of processor $j$ in cluster $i$. To
+be able to predict the execution time of message passing  
+applications with synchronous iterations running over a heterogeneous grid, for different vectors of
  scaling factors, the communication time and the computation time for all the
  tasks must be measured during the first iteration before applying any DVFS
  operation. Then the execution time for one iteration of the application with any
@@ -1024,7 +1023,7 @@ first  iteration.  The execution time for one iteration is equal to the sum of t
  and the slowest communication time without slack time during one iteration.
  The latter is equal to the  communication time of the slowest node in the slowest cluster $h$.
  It means that only the communication time without any slack time is taken into account.
-Therefore, the execution time of the iterative application is equal to
+Therefore, the execution time of the parallel  application with iterations is equal to
  the execution time of one iteration as in Equation (\ref{eq:perf-grid}) multiplied by the
  number of iterations of that application.
  
@@ -1032,7 +1031,7 @@ number of iterations of that application.
  In the considered heterogeneous grid platform, each node $j$ in cluster $i$ may have
  different dynamic and static powers from the nodes of the other clusters, 
  noted as $\Pd[ij]$ and $\Ps[ij]$ respectively.  Therefore, even if the distributed 
-message passing iterative application is load balanced, the computation time of each CPU $j$ 
+message passing  application with iterations is load balanced, the computation time of each CPU $j$ 
  in cluster $i$ noted $\Tcp[ij]$ may be different and different frequency scaling factors may be
  computed in order to decrease the overall energy consumption of the application
  and reduce slack times.  The communication time of a processor $j$ in cluster $i$ is noted as
@@ -1055,7 +1054,7 @@ static energies for $M_i$ processors in $N$ clusters.  It is computed as follows
  
  
  To optimize both of the energy consumption model computed by \ref{eq:energy-grid} and the performance model computed by \ref{eq:perf-grid},
-they must be normalized  as in \ref{eq:enorm-heter} and \ref{eq:pnorm-heter} Equations respectively.
+they must be normalized  as in Equation \ref{eq:enorm-heter} and Equation \ref{eq:pnorm-heter}  respectively.
  While the original energy consumption is the consumed energy with the 
  maximum frequency for all the  nodes computed as follows:
  
@@ -1149,7 +1148,7 @@ of scaling factors that satisfies (\ref{eq:max-grid}) can be selected.
  \begin{figure}[!t]
    \centering
    \includegraphics[scale=0.7]{fig/ch3/init_freq}
-  \caption{Selecting the initial frequencies in grid}
+  \caption{Selecting the initial frequencies in the grid architecture}
    \label{fig:st_freq-grid}
  \end{figure}
  
@@ -1165,9 +1164,9 @@ In this section, the scaling factors selection algorithm for a grid, Algorithm~\
  is presented. It selects the vector of  frequency
  scaling factors  that gives the best trade-off between minimizing the
  energy consumption and maximizing the performance of a message passing
-synchronous iterative application executed on a grid.
+  application with synchronous iterations executed on a grid.
  It is similar to the frequency selection algorithm for heterogeneous 
-local clusters presented in section \ref{ch3:1:4}. 
+local clusters presented in Section \ref{ch3:1:4}. 
  
  The value of the initial frequency scaling factor for each node is inversely proportional to its
  computation time that was gathered in the first iteration. The initial
@@ -1237,7 +1236,7 @@ $\lbrace\Theta_1,\Theta_2\rbrace$ is the time interval for the measured  idle po
  Therefore, the dynamic power of one core is computed as the difference between the maximum 
  measured value in maximum powers vector and the minimum measured value in the idle powers vector.
  
-On the other hand, the static power consumption by one core is a part of the measured idle power consumption of the node. Since in Grid'5000 there is no way to measure precisely the consumed static power and it was assumed, as in sections  \ref{ch3:2} and \ref{ch2:6}, that  the static power  represents a ratio of the dynamic power, the value of the static power is assumed to be equal to 20\% of the dynamic power consumption of the core.
+On the other hand, the static power consumption by one core is a part of the measured idle power consumption of the node. Since in Grid'5000 there is no way to measure precisely the consumed static power and it was assumed, as in Sections  \ref{ch3:2} and \ref{ch2:6}, that  the static power  represents a ratio of the dynamic power, the value of the static power is assumed to be equal to 20\% of the dynamic power consumption of the core.
  
  In the experiments presented in the following sections, two sites of Grid'5000 were used, Lyon and Nancy sites. These two sites have in total seven different clusters as shown in Figure~\ref{fig:grid5000}.
  
@@ -1297,7 +1296,6 @@ The benchmarks have seven different classes, S, W, A, B, C, D and E, that repres
    \end{tabular}
    \label{table:grid5000-1}
  \end{table} 
-CPUs 
  
  
  \subsection{The experimental results of the scaling algorithm on a Grid}
@@ -1387,7 +1385,7 @@ results in  a lower energy consumption. Indeed, the dynamic  consumed power
  is exponentially related to the CPU's frequency value. On the other hand, the increase in the number of computing nodes can 
  increase the communication times and thus produces less energy saving depending on the 
  benchmarks being executed. The results of the benchmarks CG, MG, BT and FT show more 
-energy saving percentage in the one site scenario when executed over 16 nodes than on 32 nodes.  LU and SP consume more energy with 16 nodes than with 32 node  on one site  because their computations to communications ratio is not affected by the increase of the number of local communications. 
+energy saving percentage in the one site scenario when executed over 16 nodes than on 32 nodes.  LU and SP consume more energy with 16 nodes than with 32 nodes  on one site  because their computations to communications ratio is not affected by the increase of the number of local communications. 
  \begin{figure}[!h]
    \centering
    \centering
@@ -1400,21 +1398,21 @@ energy saving percentage in the one site scenario when executed over 16 nodes th
  \begin{figure*}[!h]
  \centering
  \includegraphics[width=.7\textwidth]{fig/ch3/eng_s.eps}
-\caption{The energy reduction while executing the NAS benchmarks over different scenarios}
+\caption{The energy reduction percentages while executing the NAS benchmarks over different scenarios}
  \label{fig:eng_s}
  \end{figure*}
  
  \begin{figure*}[!h]
  \centering
  \includegraphics[width=.7\textwidth]{fig/ch3/per_d.eps}
-\caption{The performance degradation of the NAS benchmarks over different scenarios}
+\caption{The performance degradation percentages of the NAS benchmarks over different scenarios}
  \label{fig:per_d}
  \end{figure*}
  
  \begin{figure*}[!h]
  \centering
  \includegraphics[width=.7\textwidth]{fig/ch3/dist.eps}
-\caption{The trade-off distance between the energy reduction and the performance of the NAS benchmarks  
+\caption{The trade-off distance percentages between the energy reduction and the performance of the NAS benchmarks  
        over different scenarios}
  \label{fig:dist-grid}
  \end{figure*}
@@ -1486,13 +1484,13 @@ Scenario name                          & Cluster name & Nodes per cluster &
  \begin{figure}[!h]
   \centering
   \includegraphics[width=.7\textwidth]{fig/ch3/time.eps}
- \caption{The execution times of  NAS benchmarks running over the one core and the multi-core scenarios}
+ \caption{The execution times of the NAS benchmarks running over the one core and the multi-core scenarios}
    \label{fig:time-mc}
  \end{figure}
  \begin{figure}[!h]
   \centering
   \includegraphics[width=.7\textwidth]{fig/ch3/eng_con.eps}
- \caption{The energy consumptions and execution times of NAS benchmarks over one core and multi-core per node architectures}
+ \caption{The energy consumptions and execution times of the NAS benchmarks over one core and multi-core per node architectures}
  \label{fig:eng-cons-mc}
  \end{figure}
  
@@ -1518,21 +1516,21 @@ scenarios  because there are no or small communications. Contrary to EP and MG,
  \begin{figure*}[!h]
   \centering
   \includegraphics[width=.7\textwidth]{fig/ch3/eng_s_mc.eps}
-  \caption{The energy saving of running NAS benchmarks over one core and multi-core scenarios}
+  \caption{The energy saving percentages of running NAS benchmarks over one core and multi-core scenarios}
    \label{fig:eng-s-mc}
  \end{figure*}   
  
  \begin{figure*}[!h]
   \centering
  \includegraphics[width=.7\textwidth]{fig/ch3/per_d_mc.eps}
-  \caption{The performance degradation of running NAS benchmarks over one core and multi-core scenarios}
+  \caption{The performance degradation percentages of running NAS benchmarks over one core and multi-core scenarios}
    \label{fig:per-d-mc}
  \end{figure*}
  
  \begin{figure*}[!h]
   \centering
   \includegraphics[width=.7\textwidth]{fig/ch3/dist_mc.eps}
-  \caption{The trade-off distance of running NAS benchmarks over one core and multi-core scenarios}
+  \caption{The trade-off distance percentages of running NAS benchmarks over one core and multi-core scenarios}
    \label{fig:dist-mc}
  \end{figure*}
  The energy saving percentages of all the NAS benchmarks running over these two scenarios are presented in Figure~\ref{fig:eng-s-mc}. 
@@ -1577,7 +1575,7 @@ In these experiments, the class D of the NAS parallel benchmarks were executed o
  \begin{figure}[!h]
    \centering
    \includegraphics[width=.7\textwidth]{fig/ch3/dist_pow.eps}
-  \caption{The trade-off distance between the energy reduction and the performance of the NAS benchmarks over the three power scenarios}
+  \caption{The trade-off distance percentages between the energy reduction and the performance of the NAS benchmarks over the three power scenarios}
    \label{fig:dist-pow}
  \end{figure}
  
@@ -1632,21 +1630,21 @@ presented in  Figures~\ref{fig:edp-eng}, \ref{fig:edp-perf} and \ref{fig:edp-dis
  \begin{figure*}[!h]
    \centering
    \includegraphics[width=.7\textwidth]{fig/ch3/edp_eng}
-\caption{The energy reduction induced by the Maxdist method and the EDP method}
+\caption{The energy reduction percentages induced by the Maxdist method and the EDP method}
  \label{fig:edp-eng}
  \end{figure*}
  
  \begin{figure*}[!h]
    \centering
    \includegraphics[width=.7\textwidth]{fig/ch3/edp_per}
-\caption{The performance degradation induced by the Maxdist method and the EDP method}
+\caption{The performance degradation percentages induced by the Maxdist method and the EDP method}
  \label{fig:edp-perf}
  \end{figure*}
  
  \begin{figure*}[!h]
    \centering
     \includegraphics[width=.7\textwidth]{fig/ch3/edp_dist}
-\caption{The trade-off distance between the energy consumption reduction and the performance for the Maxdist method and the  EDP method}
+\caption{The trade-off distance percentages between the energy consumption reduction and the performance for the Maxdist method and the  EDP method}
  \label{fig:edp-dist}
  \end{figure*}
  
@@ -1658,7 +1656,7 @@ Moreover, the proposed scaling algorithm gives the same weight for these two met
  Whereas, the EDP algorithm gives sometimes negative trade-off values for some benchmarks in the two sites scenarios.
  These negative trade-off values mean that the performance degradation percentage is higher than the energy saving percentage.
  The high positive values of the trade-off distance percentage mean that the  energy saving percentage is much higher than the performance degradation percentage. 
-The  complexity of both algoriths, Maxdist and EDP, are of order $O(N \cdot M_i \cdot F_j)$ and 
+The  complexity of both algorithms, Maxdist and EDP, are of order $O(N \cdot M_i \cdot F_j)$ and 
  $O(N \cdot M_i \cdot F_j^2)$ respectively, where $N$ is the number of the clusters, $M_i$ is the number of nodes and $F_j$ is the 
  maximum number of available frequencies of node $j$. When Maxdist is applied to a benchmark that is being executed over 32 nodes distributed between Nancy and Lyon sites, it takes on average  $0.01$ $ms$  to compute the best frequencies while the EDP method is on average ten times slower over the same architecture.  
  
@@ -1669,8 +1667,8 @@ In this chapter,  two new online frequency scaling factors selecting algorithms
  maximum distance (optimal trade-off) between the predicted energy   and the
  predicted performance curves for a heterogeneous cluster and grid. Both  algorithms use a
  new energy models for measuring and predicting the energy consumption of message passing 
-iterative applications running over a heterogeneous local cluster and a grid platform. 
-Firstly, the  proposed scaling factors selection algorithm for a heterogeneous local cluster  is applied to  the class C of NAS parallel benchmarks and  simulated by SimGrid.
+ applications with iterations running over a heterogeneous local cluster and a grid platform. 
+Firstly, the  proposed scaling factors selection algorithm for a heterogeneous local cluster  is applied to  the class C of the NAS parallel benchmarks and  simulated by SimGrid.
  The results of the simulations showed  that the algorithm on average reduces by 29.8\% the energy 
  consumption of the NAS benchmarks executed over 8 nodes while limiting the degradation of the performance by 3.8\%.  The algorithm also selects different scaling factors according to
  the percentage of the computing and communication times, and according to the