\begin{abbreviations}
-%\item[ADC] Analog to Digital Converters
-%\item[AIMMS] Advanced Interactive Multidimensional Modeling System
-%\item[AMPL] A Mathematical Programming Language
-\item[]
-%\item[WSNL] Wireless Sensor Node Leader
+\item[AIAC] \textbf{A}synchronous \textbf{I}terations and \textbf{A}synchronous \textbf{C}ommunications
+\item[BLP] \textbf{B}it \textbf{L}evel \textbf{P}arallelism
+\item[BT] \textbf{B}lock \textbf{T}ridiagonal
+\item[CG] \textbf{C}onjugate \textbf{G}radient
+\item[CPU] \textbf{C}entral \textbf{P}rocessing \textbf{U}nit
+\item[CUDA] \textbf{C}ompute \textbf{U}nified \textbf{D}evice \textbf{A}rchitecture
+\item[DLP] \textbf{D}ata \textbf{L}evel \textbf{P}arallelism
+\item[DVFS] \textbf{D}ynamic \textbf{V}oltage and \textbf{F}requency \textbf{S}caling
+\item[EDP] \textbf{E}nergy and \textbf{D}elay \textbf{P}roduct
+\item[EP] \textbf{E}mbarrassingly \textbf{P}arallel
+\item[EPSA] \textbf{E}nergy and \textbf{P}erformance \textbf{S}caling \textbf{A}lgorithm
+\item[FLOPS] \textbf{Fl}oating-point \textbf{O}perations \textbf{P}er \textbf{S}econd
+\item[FT] Fast \textbf{F}ourier \textbf{T}ransform
+\item[GMRES] \textbf{G}eneralized \textbf{M}inimal \textbf{Res}idual
+\item[GPU] \textbf{G}raphics \textbf{P}rocessing \textbf{U}nit
+\item[HSA] \textbf{H}eterogeneous \textbf{S}caling \textbf{A}lgorithm
+\item[ILP] \textbf{I}nstruction \textbf{L}evel \textbf{P}arallelism
+\item[LAN] \textbf{L}ocal \textbf{A}rea \textbf{N}etwork
+\item[LLP] \textbf{L}oop \textbf{L}evel \textbf{P}arallelism
+\item[LU] \textbf{L}ower-\textbf{U}pper Symmetric Gauss-Seidel
+\item[MaxDist] \textbf{Max}imum \textbf{Dist}ance
+\item[MG] \textbf{M}ulti-\textbf{G}rid
+\item[MIMD] \textbf{M}ultiple \textbf{I}nstruction and \textbf{M}ultiple \textbf{D}ata
+\item[MISD] \textbf{M}ultiple \textbf{I}nstruction and \textbf{S}ingle \textbf{D}ata
+\item[MPI] \textbf{M}essage \textbf{P}assing \textbf{I}nterface
+\item[MPICH] \textbf{M}essage \textbf{P}assing \textbf{I}nterface of \textbf{Ch}ameleon
+\item[MS] \textbf{M}ulti-\textbf{S}plitting
+\item[NAS] \textbf{N}umerical \textbf{A}erodynamic \textbf{S}imulation
+\item[NASA] \textbf{N}ational \textbf{A}eronautics and \textbf{S}pace \textbf{A}dministration
+\item[OPENCL] \textbf{O}pen \textbf{C}omputing \textbf{L}anguage
+\item[OPENMP] \textbf{O}pen \textbf{M}ulti-\textbf{P}rocessing
+\item[RENATER] \textbf{Ré}seau \textbf{Na}tional de \textbf{T}élécommunications pour la \textbf{T}echnologie, \newline\hspace*{3em}
+ l'\textbf{E}nseignement et la \textbf{R}echerche
+\item[SIAC] \textbf{S}ynchronous \textbf{I}terations and \textbf{A}synchronous \textbf{C}ommunications
+\item[SIMD] \textbf{S}ingle \textbf{I}nstruction and \textbf{M}ultiple \textbf{D}ata
+\item[SISC] \textbf{S}ynchronous \textbf{I}terations and \textbf{S}ynchronous \textbf{C}ommunications
+\item[SISD] \textbf{S}ingle \textbf{I}nstruction and \textbf{S}ingle \textbf{D}ata
+\item[SP] \textbf{S}calar \textbf{P}entadiagonal
+\item[TLP] \textbf{T}hread \textbf{L}evel \textbf{P}arallelism
+\item[WAN] \textbf{W}ide \textbf{A}rea \textbf{N}etwork
\end{abbreviations}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\emph{ \begin{center} \Large Thesis title \end{center}}
+\emph{ \begin{center} \Large Energy Consumption Optimization of Parallel Iterative Applications using CPU Frequency Scaling\end{center}}
%\emph{ \begin{center} \large By \end{center}}
\emph{ \begin{center} \large Ahmed Badri Muslim Fanfakh \\ University of Franche-Comt\'e, 2016 \end{center}}
%\emph{ \begin{center} \large The University of Franche-Comt\'e, 2015 \end{center}}
\emph{ \begin{center} \large Supervisors: Raphaël Couturier, Jean-Claude Charr and Arnaud Giersch \end{center}}
-
-
-
-
-
-\textbf{KEY WORDS:} kw1,kw2,...
-
+In recent years, green computing has become an important topic in the supercomputing
+research domain. However, the computing platforms are still consuming more and more energy due to the increasing number of nodes composing them.
+To minimize the operating costs of these platforms, many techniques have
+been used. Dynamic voltage and frequency scaling (DVFS) is one of them. It
+can be used to reduce the power consumption of the CPU while computing, by
+lowering its frequency. However, lowering the frequency of a CPU may increase
+the execution time of an application running on that processor. Therefore, the
+frequency that gives the best trade-off between the energy consumption and the
+performance of an application must be selected.
+In this thesis, a set of works is presented to optimize the energy consumption
+and the performance of synchronous and asynchronous message passing iterative applications running over clusters and grids. The energy consumption and performance models for each type of iterative application
+are defined according to the characteristics of both the application itself and the parallel architecture executing it.
+
+Firstly, we propose a frequency scaling factors selection algorithm for synchronous iterative applications running over a homogeneous platform. It is applied to the NAS parallel benchmarks and evaluated with the SimGrid simulator.
+Secondly, we develop two frequency scaling factors selection algorithms for synchronous iterative applications running over a heterogeneous cluster and a grid. Both algorithms are applied to the NAS parallel benchmarks and evaluated over the SimGrid simulator and the Grid'5000 testbed.
+Thirdly, we propose a frequency scaling factors selection algorithm for asynchronous and hybrid iterative applications running over a grid. The algorithm is evaluated over the SimGrid simulator and the Grid'5000 testbed while running a multi-splitting application that solves a 3D problem.
+All the proposed algorithms are compared to existing methods: the Rauber and Rünger method and the energy and delay product (EDP) method. The comparison results show that all the proposed algorithms give better energy consumption and performance trade-offs. They add a very small overhead to the execution time of the applications and work online without any training or profiling.
+
+\textbf{KEY WORDS:} Dynamic voltage and frequency scaling, grid computing, energy optimization, iterative applications, online frequency scaling algorithm.
\ No newline at end of file
To reduce the energy consumption of a CPU executing the asynchronous iterative method, the dynamic voltage and frequency scaling (DVFS) technique can be used. Modern operating systems automatically adjust the frequency of the processor according to their needs using DVFS operations. However, the user can scale down the frequency of the CPU using the on-demand governor \cite{ref96}. It lowers the frequency of a CPU to reduce its energy
consumption, but it also decreases its computing power and thus it might increase the
execution time of an application running on that processor. Therefore, the frequency that gives the best trade-off between energy consumption and performance must be selected. For parallel asynchronous methods running over a grid, a different frequency might be selected for each CPU in the grid depending on its characteristics.
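The trade-off described above can be sketched as a small search over the available frequency gears: for each gear, predict the new execution time and energy consumption, normalize them against the highest-frequency run, and keep the gear that maximizes the distance between the performance and energy ratios. The following Python sketch only illustrates this idea; the function and variable names are hypothetical, and the per-gear predictions are assumed to come from models such as those presented in this chapter.

```python
def normalized_metrics(t_max, t_new, e_max, e_new):
    """Normalize performance and energy against the highest-frequency run."""
    p_norm = t_max / t_new   # <= 1: performance ratio after scaling down
    e_norm = e_new / e_max   # <= 1: energy ratio after scaling down
    return p_norm, e_norm

def best_gear(gears):
    """Pick the gear maximizing the distance P_norm - E_norm.

    `gears` is a list of (predicted_time, predicted_energy) pairs, one
    per frequency gear, where index 0 is the highest frequency.
    """
    t_max, e_max = gears[0]
    best, best_dist = 0, float("-inf")
    for i, (t_new, e_new) in enumerate(gears):
        p_norm, e_norm = normalized_metrics(t_max, t_new, e_max, e_new)
        dist = p_norm - e_norm   # energy saving vs. performance loss
        if dist > best_dist:
            best, best_dist = i, dist
    return best
```

In a grid, this selection would be repeated per CPU (or extended to a vector of frequencies), since each node has its own frequency gears and power characteristics.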
-In chapters \ref{ch2} and \ref{ch3}, three frequencies selecting algorithms were proposed
-to reduce the energy consumption of synchronous message passing iterative applications running over homogeneous and heterogeneous platforms respectively. In this chapter, a new frequency selecting algorithm for asynchronous iterative message passing applications running over grids is presented. An adaptation for hybrid methods, with synchronous and asynchronous communications, is also proposed.
+In chapters \ref{ch2} and \ref{ch3}, three frequency selecting algorithms were proposed
+to reduce the energy consumption of synchronous message passing applications with iterations running over homogeneous or heterogeneous platforms. In this chapter, a new frequency selecting algorithm for asynchronous iterative message passing applications running over grids is presented. An adaptation for hybrid methods, with synchronous and asynchronous communications, is also proposed.
The algorithm and its adaptation select the vector of frequencies which simultaneously offers the maximum energy reduction and the minimum performance degradation. The algorithm has a very small overhead and works online without needing any training or profiling.
-
This chapter is organized as follows: Section~\ref{ch4:2} presents some
-related works from other authors. models for predicting the performance and the energy consumption
- of both synchronous and asynchronous message passing programs
-running over a grid are explained in Section~\ref{ch4:3}.
+related works from other authors. Models for predicting the performance and the energy consumption
+ of both synchronous and asynchronous iterative message passing programs
+executed over grids are explained in Section~\ref{ch4:3}.
It also presents the objective function that maximizes the reduction of energy consumption while minimizing
the degradation of the program's performance, used to select the frequencies.
-Section~\ref{ch4:5} details the proposed frequencies selecting algorithm.
+Section~\ref{ch4:5} details the proposed frequency selecting algorithm.
Section~\ref{ch4:6} presents the iterative multi-splitting application which is a hybrid method and was used as a benchmark to evaluate the efficiency of the proposed algorithm.
Section~\ref{ch4:7} presents the simulation results of applying the algorithm on the multi-splitting application
and executing it on different grid scenarios. It also shows the results of running
-three different power scenarios and comparing them. Moreover, in the last subsection, the proposed algorithm is compared to the energy and delay product (EDP) method. Section \ref{ch4:8} shows the real experiment results of applying the proposed algorithm over Grid'5000 platform and the results with the EDP method . Finally, the chapter ends with a summary in section
-\ref{ch4:9}.
+three different power scenarios and comparing them. Moreover, in the last subsection, the proposed algorithm is compared to the energy and delay product (EDP) method. Section \ref{ch4:8} presents the results of real experiments executed over the Grid'5000 platform and their comparison with the EDP method. Finally, the chapter ends with a summary.
+
\end{figure}
-An iterative application consists of a block of instructions that is repeatedly executed until convergence. A distributed iterative application with interdependent tasks requires, at each iteration, exchanging data between nodes to compute the distributed tasks. The communications between the nodes can be done synchronously or asynchronously. In the synchronous model, each node has to wait to receive data from all its neighbors to compute its iteration, see figures \ref{fig:ch1:15} and \ref{fig:ch1:16}.
+An iterative application consists of a block of instructions that is repeatedly executed until convergence. A distributed iterative application with interdependent tasks requires, at each iteration, exchanging data between nodes to compute the distributed tasks. The communications between the nodes can be done synchronously or asynchronously. In the synchronous model, each node has to wait to receive data from all its neighbours to compute its iteration, see figures \ref{fig:ch1:15} and \ref{fig:ch1:16}.
Since the tasks are synchronized, all the nodes execute the same number of iterations.
Then, the overall execution time of an iterative synchronous message passing application with balanced tasks, running on the grid described above, is equal to the execution time of the slowest node in the slowest cluster running a task, as presented in Equation (\ref{eq:perf_heter}).
-Whereas, in the asynchronous model, the fast nodes do not have to wait for the slower nodes to finish their computations to exchange data, see Figure \ref{fig:ch1:17}. Therefore, there are no idle times between successive iterations, the node executes the computations with the last received data from its neighbors and the communications are overlapped by computations. Since there are no synchronizations between nodes, all nodes do not have the same number of iterations.
+Whereas, in the asynchronous model, the fast nodes do not have to wait for the slower nodes to finish their computations to exchange data, see Figure \ref{fig:ch1:17}. Therefore, there are no idle times between successive iterations: each node executes its computations with the last received data from its neighbours and the communications are overlapped by computations. Since there are no synchronizations between the nodes, they do not all execute the same number of iterations.
The difference in the number of executed iterations between the nodes depends on the heterogeneity of the computing powers of the nodes. The execution time of an asynchronous iterative message passing application is not equal to the execution time of the slowest node like in the synchronous mode because each node executes a different number of iterations. Moreover, the overall execution time is directly dependent on the method used to detect the global convergence of the asynchronous iterative application. The global convergence detection method might be synchronous or asynchronous and centralized or distributed.
In a grid, the nodes in each cluster have different characteristics, especially different frequency gears.
In Equation (\ref{eq:asyn_perf}), the communication times $\Ltcm[ij]$ are only the communications between the local nodes because the communications between the clusters are asynchronous and overlapped by computations.
-\subsection{The energy model and tradeoff optimization}
+\subsection{The energy model and trade-off optimization}
\label{ch3:3:3}
The energy consumption of an asynchronous application running over a heterogeneous grid is the summation of
calling the scaling factor selection algorithm (Algorithm~\ref{HSA-asyn}). The communications of the DVFS algorithm
can be applied synchronously or asynchronously which results in four different versions of the application: synchronous or asynchronous MS with synchronous or asynchronous DVFS communications. Figures \ref{fig:eng_time_dvfs} (a) and (b) present the energy consumption and the execution time for the four different versions of the application running on all the scenarios in Table \ref{table:comp}.
-
\begin{figure}[!t]
\centering
\centering
As in Figure \ref{fig:eng_time_dvfs} (a) and for the same reasons presented above, the asynchronous MS with synchronous DVFS version gives the best energy saving percentage when compared to the other versions.
- \begin{figure}[!t]
+ \begin{figure}[!h]
\centering
\includegraphics[scale=0.7]{fig/ch4/perf_degra.eps}
\caption{The results of the performance degradation}
\label{fig:perf_degr}
\end{figure}
- \begin{figure}[!t]
+ \begin{figure}[!h]
\centering
\includegraphics[scale=0.7]{fig/ch4/dist.eps}
\caption{The results of the trade-off distance}
Therefore, applying the HSA algorithm over asynchronous applications is very promising. In this section, the number of iterations executed by the asynchronous MS method, while solving a 3D problem of size $400^3$ with and without applying the HSA algorithm, is evaluated. Table \ref{table:sd} presents the standard deviation of the number of iterations executed by the asynchronous application over all the grid platform scenarios.
-\begin{table}[h]
+\begin{table}[!h]
\centering
\caption{The standard deviation of the numbers of iterations for different asynchronous MS versions running over different grid platforms}
\label{table:sd}
The energy saving, performance degradation and distance percentages for both versions over both platform
scenarios and with the three power scenarios are presented in Figures \ref{fig:three_power_syn} and \ref{fig:three_power_asyn}.
-\begin{figure}[!t]
+\begin{figure}[!h]
\centering
\includegraphics[width=.7\textwidth]{fig/ch4/three_powers_syn.eps}
\caption{The results of the three power scenarios: Synchronous application of the HSA algorithm}
\label{fig:three_power_syn}
\end{figure}
-\begin{figure}
+\begin{figure}[!h]
\centering
\includegraphics[width=.7\textwidth]{fig/ch4/three_powers_Asyn.eps}
\caption{The results of the three power scenarios: Asynchronous application of the HSA algorithm}
\label{fig:three_power_asyn}
\end{figure}
-\begin{figure}[!t]
+\begin{figure}[!h]
\centering
\includegraphics[scale=.7]{fig/ch4/three_scenarios.pdf}
\caption{Comparison of the selected frequency scaling factors by the HSA algorithm for the three power scenarios}
However, others added some weights to the factors in order to direct the optimization towards more energy saving or less performance degradation. For example, in~\cite{ref71}, the authors used the product $\mathit{EDP}=\mathit{energy}\times \mathit{delay}^2$ which favours performance over energy consumption reduction.
In this work, the proposed scaling factors selection algorithm optimizes both the energy consumption and the performance at the same time and gives the same weight to both factors as in Equation \ref{eq:max-grid}. In this section, to evaluate the performance of the HSA algorithm, it is compared to the algorithm proposed by Spiliopoulos et al. \cite{ref67}. The latter is an online method that selects for each processor the frequency that minimizes the energy and delay product in order to reduce the energy consumption of a parallel application running over a homogeneous multi-cores platform. It gives the same weight to both metrics and predicts both the energy consumption and the execution time for each frequency gear as in the HSA algorithm.
-To fairly compare the HSA algorithm with the algorithm of Spiliopoulos et al., the same energy models, Equation (\ref{eq:energy-grid}) or (\ref{eq:asyn_energy}), and execution time models, Equation (\ref{eq:perf-grid}) or (\ref{eq:asyn_perf}), are used to predict the energy consumptions and the execution times.
-
-The EDP objective function can be equal to zero when the predicted delay is equal to zero. Moreover, this product is equal to zero before applying any DVFS operation. To eliminate the zero values, the EDP function must take the following form:
-
-
-\begin{equation}
- \label{eq:EDP}
- EDP = E_{Norm} \times (1+ D_{Norm})
-\end{equation}
-where $E_{Norm}$ is the normalized energy consumption which is computed as in Equation (\ref{eq:enorm})
-and $D_{Norm}$ is the normalized delay of the execution time which is computed as follows:
-\begin{equation}
- \label{eq:Dnorm}
- D_{Norm}= 1 -P_{Norm}= 1- (\frac{T_{old}}{T_{new}})
-\end{equation}
-Where $P_{Norm}$ is computed as in Equation (\ref{eq:pnorm}). Furthermore, the EDP algorithm starts the search process from the initial frequencies that are computed as in Equation (\ref{eq:Fint}). It stops the search process when it reaches the minimum available frequency for each processor. The EDP algorithm was applied to the synchronous and asynchronous MS algorithm solving a 3D problem of size $400^3$. Two platform scenarios, Grid 4*4 and Grid 4*8, were chosen for this experiment. The EDP method was applied synchronously and asynchronously to the MS application as for the HSA algorithm. The comparison results of the EDP and HSA algorithms are presented in the Figures \ref{fig:compare_syndvfs_synms}, \ref{fig:compare_asyndvfs_asynms},\ref{fig:compare_asyndvfs_synms} and \ref{fig:compare_asyndvfs_asynms}. Each of these figures presents the energy saving, performance degradation and distance percentages for one version of the MS algorithm. The results shown in these figures are also the average of the results obtained from running each version of the MS method over the two platform scenarios described above.
-
-
-
-
-\begin{figure}[!h]
+\begin{figure}[!t]
\centering
\includegraphics[width=.7\textwidth]{fig/ch4/compare_syndvfs_synms.eps}
\caption{Synchronous application of the frequency scaling selection method on the synchronous MS version}
\caption{Asynchronous application of the frequency scaling selection method on the synchronous MS version}
\label{fig:compare_asyndvfs_synms}
\end{figure}
+
\begin{figure}[!h]
\centering
\includegraphics[width=.7\textwidth]{fig/ch4/compare_asyndvfs_asynms.eps}
\caption{Asynchronous application of the frequency scaling selection method on the asynchronous MS version}
\label{fig:compare_asyndvfs_asynms}
\end{figure}
-
-
-
-
-All the figures show that the proposed HSA algorithm outperforms the EDP algorithm
-in terms of energy saving and performance degradation. EDP gave for some scenarios negative trade-off values which mean that the performance degradation percentages are higher than
-the energy saving percentages, while the HSA algorithm gives positive trade-off values over all scenarios.
-The frequency scaling factors selected by the EDP are most of the time higher than those selected by the HSA algorithm as shown in Figure \ref{fig:three_methods}.
-The results confirm that higher frequency scaling factors do not always give more energy saving, especially when the overall execution time is drastically increased. Therefore, the HSA method that computes the maximum distance between the energy saving and the performance degradation is an effective method to optimize these two metrics at the same time.
-
-\begin{figure}[h]
+To fairly compare the HSA algorithm with the algorithm of Spiliopoulos et al., the same energy models, Equation (\ref{eq:energy-grid}) or (\ref{eq:asyn_energy}), and execution time models, Equation (\ref{eq:perf-grid}) or (\ref{eq:asyn_perf}), are used to predict the energy consumptions and the execution times.
+The EDP objective function can be equal to zero when the predicted delay is equal to zero. Moreover, this product is equal to zero before applying any DVFS operation. To eliminate the zero values, the EDP function must take the following form:
+\begin{equation}
+ \label{eq:EDP}
+ EDP = E_{Norm} \times (1+ D_{Norm})
+\end{equation}
+where $E_{Norm}$ is the normalized energy consumption which is computed as in Equation (\ref{eq:enorm})
+and $D_{Norm}$ is the normalized delay of the execution time which is computed as follows:
+\begin{equation}
+  \label{eq:Dnorm}
+  D_{Norm} = 1 - P_{Norm} = 1 - \frac{T_{old}}{T_{new}}
+\end{equation}
+where $P_{Norm}$ is computed as in Equation (\ref{eq:pnorm}). Furthermore, the EDP algorithm starts the search process from the initial frequencies that are computed as in Equation (\ref{eq:Fint}). It stops the search process when it reaches the minimum available frequency for each processor. The EDP algorithm was applied to the synchronous and asynchronous MS algorithms solving a 3D problem of size $400^3$. Two platform scenarios, Grid 4*4 and Grid 4*8, were chosen for this experiment. The EDP method was applied synchronously and asynchronously to the MS application, as for the HSA algorithm. The comparison results of the EDP and HSA algorithms are presented in Figures \ref{fig:compare_syndvfs_synms}, \ref{fig:compare_syndvfs_asynms}, \ref{fig:compare_asyndvfs_synms} and \ref{fig:compare_asyndvfs_asynms}. Each of these figures presents the energy saving, performance degradation and distance percentages for one version of the MS algorithm. The results shown in these figures are the average of the results obtained from running each version of the MS method over the two platform scenarios described above.
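As a minimal illustration of the modified EDP objective above, the following Python sketch computes $EDP = E_{Norm} \times (1 + D_{Norm})$ with $D_{Norm} = 1 - T_{old}/T_{new}$. The inputs are hypothetical normalized values, not measurements from the experiments.

```python
def edp(e_norm, t_old, t_new):
    """Modified EDP objective: E_Norm * (1 + D_Norm),
    where D_Norm = 1 - T_old / T_new is the normalized delay."""
    d_norm = 1.0 - t_old / t_new
    return e_norm * (1.0 + d_norm)

# With no frequency change (T_old == T_new), the delay term vanishes and
# EDP reduces to the normalized energy alone, so the product is never
# zero before applying any DVFS operation.
```

Lower values are better; the EDP search described above evaluates this quantity for each candidate frequency, starting from the initial frequencies and stopping at the minimum available frequency of each processor.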
+\begin{figure}[!h]
\centering
\includegraphics[scale=0.6]{fig/ch4/compare_scales.eps}
\caption{Comparison of the selected frequency scaling factors by the two algorithms
over the Grid 4*4 platform scenario}
\label{fig:three_methods}
\end{figure}
+All the figures show that the proposed HSA algorithm outperforms the EDP algorithm
+in terms of energy saving and performance degradation. EDP gave for some scenarios negative trade-off values which mean that the performance degradation percentages are higher than
+the energy saving percentages, while the HSA algorithm gives positive trade-off values over all scenarios.
+The frequency scaling factors selected by the EDP are most of the time higher than those selected by the HSA algorithm as shown in Figure \ref{fig:three_methods}.
+The results confirm that higher frequency scaling factors do not always give more energy saving, especially when the overall execution time is drastically increased. Therefore, the HSA method that computes the maximum distance between the energy saving and the performance degradation is an effective method to optimize these two metrics at the same time.
+
This testbed is a large-scale platform that consists of ten sites distributed
all over metropolitan France and Luxembourg. Moreover, some sites are equipped with power measurement tools that capture the power consumption of each node on those sites. The same method for computing the dynamic power consumption, described in Section \ref{ch3:4}, is used.
Table \ref{table:grid5000} presents the characteristics of the selected clusters which are located on four different sites.
-\begin{table}[!t]
+
+\begin{table}[!h]
\caption{CPUs characteristics of the selected clusters}
% title of Table
\centering
Also, it can be noticed that both the asynchronous and synchronous MS with synchronous application of the HSA algorithm consume less energy than the other versions of the application. Synchronously applying the HSA algorithm allows them to scale down the CPUs' frequencies at the beginning of the second iteration. Thus, the consumption of dynamic energy by the application is reduced from the second iteration until the end of the application. On the contrary, with the asynchronous application of the HSA algorithm, the new frequencies cannot be computed at the end of the first iteration and consequently cannot be applied at the beginning of the second iteration. Indeed, since the performance information gathered during the first iteration is not sent synchronously at the end of the first iteration, fast nodes might execute many iterations before receiving the performance information, computing the new frequencies based on this information and applying the new computed frequencies. Therefore, many iterations might be computed by CPUs running on their highest frequency and consuming more dynamic energy than the scaled down processors.
Moreover, the execution time of the asynchronous MS version is lower than the execution time of the synchronous MS version because there is no idle time in the asynchronous version and the communications are overlapped by computations. Since the consumption of static energy is proportional to the execution time, the asynchronous MS version consumes less static energy than the synchronous version.
-\begin{figure}[!t]
+\begin{figure}[!h]
\centering
\includegraphics[width=.8\textwidth]{fig/ch4/time-compare.eps}
\caption{ Comparing the execution time}
\label{fig:time-compare}
\end{figure}
-\begin{figure}[!t]
+\begin{figure}[!h]
\centering
\includegraphics[width=.8\textwidth]{fig/ch4/energy-compare.eps}
\caption{ Comparing the energy consumption}
\end{figure}
-\begin{table}[]
+\begin{table}[!h]
\centering
\begin{tabular}{|l|l|l|l|l|}
\hline
$21.48\%$. This overall improvement is due to combining asynchronous computing and the synchronous application of the HSA algorithm.
+Finally, this section shows that the results obtained over Grid'5000 are comparable to the
+simulation results of Section \ref{ch4:7:2}: the asynchronous MS version synchronously applying the HSA algorithm is the best version in both cases. Moreover, the results over Grid'5000 are better
+than the simulation results because the computing clusters used in the Grid'5000 experiments are more heterogeneous, in terms of computing power and network characteristics, than the platform simulated with SimGrid. For example, the nodes of the StRemi cluster have lower computing powers than those of the three other Grid'5000 clusters used.
+As a result, the increased heterogeneity between the clusters' computing nodes increases the idle times, which forces the proposed algorithm to select bigger scaling factors and thus to save more energy.
-Finally, this section shows that the obtained results over Grid'5000 are comparable to
-simulation results of section \ref{ch4:7:2}, where the asynchronous MS applying synchronously the HSA algorithm is the best version in both of them. Moreover, results of Grid'5000 are better
-than simulation ones because its computing clusters are more heterogeneous in term of the computing power and network characteristics. For example, the StRemi cluster has smaller computing power compared to others three clusters of Grid'5000 platform.
-As a result, The increase in the idle times forces the proposed algorithm to select a big scaling factors and thus more energy saving.
\subsection{Comparing the HSA algorithm to the energy and delay product method}
\label{res-comp}
-
The EDP algorithm, described in section \ref{ch4:7:5}, was applied synchronously and asynchronously to both the synchronous and asynchronous MS application of size $N=400^3$. The experiments were conducted over 4 distributed clusters, described in Table \ref{table:grid5000}, and 8 homogeneous nodes were used from each cluster.
Table \ref{table:comapre} presents the results of energy saving, performance degradation and distance percentages when applying the EDP method on four different MS versions.
Figure \ref{fig:compare} compares the distance percentages, computed as the difference between energy saving and performance degradation percentages, of the EDP and HSA
-algorithms. This comparison shows that the proposed HSA algorithm gives better energy reduction and performance trade-off than the EDP method. The results of EDP method over Grid'5000 are better than those for EDP obtained by the simulation according to the increase in the heterogeneity between the computing clusters of Grid'5000 as mentioned before.
+algorithms. This comparison shows that the proposed HSA algorithm gives better energy reduction and performance trade-off than the EDP method. EDP gives better results when evaluated over Grid'5000 than over the simulator because the nodes used from Grid'5000 are more heterogeneous than those simulated with SimGrid.
-\begin{table}
+\begin{table}[!h]
\centering
\caption{The EDP algorithm results over the Grid'5000}
\label{table:comapre}
\centering
\includegraphics[scale=0.65]{fig/ch4/compare.eps}
\caption{Comparing the trade-off percentages of HSA and EDP methods over the Grid'5000}
- \label{fig:compare}
+ \label{fig:compare}
\end{figure}
-
-
-
\section{Conclusions}
\label{ch4:9}
by selecting a vector of frequencies that gives a better trade-off between the energy
consumption reduction and the performance.
-The experiments conducted over Grid'5000 were showed that applying the synchronous HSA algorithm on an asynchronous MS application saves the energy consumption by 26.93\% and reduces the execution time of the application by 21.48\%. On the other hand, these results are better than simulation ones, according to the increase in the heterogeneity level between the clusters of Grid'5000 compared to the simulated grid platform.
\ No newline at end of file
+The experiments conducted over Grid'5000 showed that applying the synchronous HSA algorithm on an asynchronous MS application gives the best results. It reduced the energy consumption by 26.93\% and the execution time of the application by 21.48\%. The experiments executed over Grid'5000 gave better results than the simulations because the nodes used in Grid'5000 were more heterogeneous than the ones simulated with SimGrid.
-%\chapter*{Conclusion and Perspectives\markboth{Conclusion and Perspectives}{Conclusion and Perspectives}}
-%\label{ch7}
-%\addcontentsline{toc}{chapter}{Conclusion and Perspectives}
\chapter{Conclusion and Perspectives}
-\label{ch07}
+\label{ch5}
\section{Conclusion}
+In this dissertation, we have proposed a method to optimize both the energy consumption and
+the performance of synchronous and asynchronous iterative applications running over
+cluster and grid platforms. The dynamic voltage and frequency scaling (DVFS) technique was used to
+lower the frequency of the processor in order to reduce its energy consumption while computing. However, reducing
+the frequency of a processor also reduces its computing power, and thus the performance of the application running over that processor decreases too. Therefore, as the first step in this work, different energy consumption and performance models were developed to effectively predict and measure the energy consumption and the performance of iterative applications. Based on these models, an objective function was defined to express the best trade-off between the energy consumption and the performance. Then, this objective function was used in algorithms that optimize both the energy consumption and the performance of the iterative application at the same time.
+The first part of this dissertation, chapter \ref{ch1}, presented the different levels of parallelism, categorized according to hardware and software techniques. Different parallel architectures were
+presented and classified according to how their processing units and memory are interconnected. Both shared and distributed memory platforms, as well as their corresponding parallel programming models, were categorized. The two types of parallel iterative applications, synchronous and asynchronous, were
+discussed and compared to each other. It was shown that synchronous iterative methods are well adapted to local homogeneous clusters with a high-speed network link, while asynchronous iterative methods are more suited to distributed heterogeneous clusters. At the end of this chapter, an energy consumption model proposed in the literature to measure the energy consumption of parallel applications was presented. This model does not consider the communication time of the parallel application being executed. Moreover, it is not well adapted to heterogeneous architectures, where each type of processor may have a different power consumption.
+In the second part of the dissertation, we have developed new energy and performance models for
+synchronous and asynchronous message passing applications running over a cluster or a grid. To simultaneously optimize the energy consumption and the performance of these applications, the trade-off relation between the two metrics has been defined as the maximum distance between the predicted energy and performance curves, which is used to select the best possible frequency scaling factors. We have proposed four different frequency selection algorithms that choose the frequencies giving the maximum energy reduction with the smallest possible performance degradation. They use the computation and communication times measured at the first iteration of the iterative application to predict its energy consumption and performance at every available frequency. All these algorithms work online, with a very small overhead on the execution time of the parallel application.
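The maximum-distance trade-off described above can be sketched in a few lines; this is an illustrative simplification only (the function and variable names are ours, not from the dissertation), assuming the per-gear energy and execution-time predictions are already available:

```python
def max_dist_gear(freqs, energy, time):
    """Pick the frequency gear maximizing the distance between the
    normalized performance curve and the normalized energy curve.

    freqs:  available frequency gears (GHz)
    energy: predicted energy consumption for each gear (J)
    time:   predicted execution time for each gear (s)
    """
    e_max = max(energy)   # used to normalize energy into [0, 1]
    t_min = min(time)     # shortest predicted time, i.e. best performance
    # Performance is normalized as t_min / t (1.0 = fastest); the selected
    # gear maximizes normalized performance minus normalized energy.
    best = max(range(len(freqs)),
               key=lambda i: (t_min / time[i]) - (energy[i] / e_max))
    return freqs[best]
```

For example, a middle gear that cuts the predicted energy by 20\% while stretching the predicted time by only 10\% yields a larger distance than either extreme gear and is therefore selected.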
+In chapter \ref{ch2}, we proposed a new online scaling factor selection method that simultaneously optimizes the energy consumption and the performance of a distributed synchronous application running on a homogeneous cluster. We applied this algorithm to the class C NAS benchmarks and evaluated it with the SimGrid simulator. Firstly, Rauber and Rünger's energy model was used by the proposed algorithm to select the best frequency gear. The proposed algorithm was compared to Rauber and Rünger's optimization method and gave better energy and performance trade-off ratios. Secondly, a new energy consumption model was developed to take into consideration both the computation and communication times and their relation with the frequency scaling factor. This new energy model was used by the proposed algorithm, leading it to select different frequencies. The resulting simulation results are more accurate and realistic than those obtained with Rauber and Rünger's energy model.
+In chapter \ref{ch3}, we proposed two new online algorithms that select the vectors of frequency scaling factors giving the best trade-off between the predicted energy consumption and the predicted performance of synchronous iterative applications running over a heterogeneous cluster or a grid. Each algorithm relies on new energy and performance models that take into account the underlying parallel platform. Firstly, the proposed
+scaling factor selection algorithm for a heterogeneous local cluster was applied to the class C NAS parallel benchmarks and simulated with SimGrid. The results of the experiments showed that the algorithm reduces on average the energy consumption of the NAS benchmarks executed over 8 nodes by 29.8\% while limiting the performance degradation to 3.8\%.
+The algorithm selected different frequency scaling factors according to the ratio between the computation and communication times when different numbers of nodes were used and when different values were used for the static and dynamic powers of the CPU. Secondly, the proposed scaling factor selection algorithm for a grid was applied to the class D NAS parallel benchmarks and executed over the Grid'5000 testbed. The experiments on 16 nodes, distributed over three clusters, showed that the algorithm reduces on average the energy consumption of all the NAS benchmarks by 30\% while only degrading the performance by 3.2\% on average.
+The algorithm was also evaluated in different scenarios that vary the distribution of the computing nodes between the clusters' sites, use a multi-core per node architecture, or assume different static power values. The algorithm selects different vectors of frequencies according to the ratios between the computation and communication times and to the values of the static and measured dynamic powers of the CPUs.
+Both proposed algorithms were compared to another method that uses the well-known energy and delay product as an objective function. The comparison results showed that the proposed algorithms outperform the latter by selecting vectors of frequencies that give a better trade-off between energy consumption reduction and performance.
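For comparison, the energy and delay product baseline reduces to a one-line selection rule; a minimal sketch under the same assumptions as above (names are ours, and the per-gear predictions are assumed given):

```python
def edp_gear(freqs, energy, time):
    """Energy-Delay Product baseline: pick the gear minimizing E * T."""
    return min(zip(freqs, energy, time), key=lambda g: g[1] * g[2])[0]
```

Because EDP multiplies the two raw metrics instead of comparing their normalized curves, it weights energy and delay implicitly, which is one reason its trade-off can differ from the maximum-distance selection.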
+In chapter \ref{ch4}, we presented a new online frequency selection algorithm for asynchronous iterative applications running over a grid. The algorithm uses new energy and performance models to predict the energy consumption and the execution time of asynchronous or hybrid message passing
+iterative applications running over a grid. The proposed algorithm was evaluated twice,
+over the SimGrid simulator and over the Grid'5000 testbed, while running a multi-splitting (MS)
+application that solves 3D problems. The experiments were executed over different grid
+scenarios composed of different numbers of clusters and different numbers of nodes
+per cluster. The proposed algorithm was applied synchronously and asynchronously to
+synchronous and asynchronous versions of the MS application. Both the simulation
+and real experiment results show that applying the synchronous frequency selection algorithm to an
+asynchronous MS application gives the best trade-off between energy consumption reduction
+and performance compared to the other scenarios. In the simulation results, this scenario
+saves on average 22\% of the energy consumption and reduces the execution time of
+the application by 5.72\%. This version optimizes both the dynamic energy
+consumption, by applying the HSA algorithm synchronously at the end of the first iteration, and the
+static energy consumption, by using asynchronous communications between nodes from
+different clusters, which are overlapped by computations. The proposed algorithm was also
+evaluated over three power scenarios, in which it selects different vectors of frequencies proportionally to the dynamic and static power values. More energy reduction was achieved when the ratio of the
+dynamic power was increased, and vice versa, whereas the performance degradation percentages decreased when the static power ratio was increased.
+In the Grid'5000 results, this scenario reduces the energy consumption by 26.93\% and
+the execution time of the application by 21.48\%. The experiments executed over Grid'5000 give better results than those simulated with SimGrid because the nodes used in Grid'5000 were more heterogeneous than the ones simulated with SimGrid.
+In both the simulation and the Grid'5000 testbed experiments, we compared the proposed algorithm to a method that uses the well-known energy and delay product as an objective function. The comparison results showed that the proposed algorithm outperforms the latter by selecting a vector of frequencies that gives
+a better trade-off between the energy consumption reduction and the performance.
+Finally, the next section outlines some perspectives for the future of this work.
\section{Perspectives}
+In the near future, we will adapt the proposed algorithms to take into consideration the variability between iterations. For example, each proposed algorithm can be executed twice: after the first iteration the frequencies are scaled down according to the execution times measured in the first iteration, then after a fixed number of iterations the frequencies are adjusted according to the execution times measured during that interval. If the computing power of the system is constantly changing, it would be interesting to implement a mechanism that detects this change and adjusts the frequencies according to the variability of the system.
+
+Also, it would be interesting to evaluate the scalability of the proposed algorithms by running them on large platforms composed of many thousands of cores. The scalability of the algorithms can be improved by distributing them in a hierarchical manner, where a leader is chosen for each cluster or group of nodes to compute their scaled frequencies, and by using asynchronous messages to exchange the data measured at the first iteration.
+
+The proposed algorithms should also be evaluated with other message passing iterative methods in order to see how they adapt to the characteristics of these methods.
+Moreover, it would be interesting to explore whether a relation can be found between the number of asynchronous iterations required for global convergence and the frequencies applied to the
+nodes. The number of iterations required by each node for global convergence
+is not known in advance, and a change in the CPUs' frequencies changes the
+number of iterations required by each node for global convergence.
+
+Furthermore, the algorithms proposed for heterogeneous platforms in chapters \ref{ch3} and \ref{ch4} should be applied to heterogeneous platforms composed of CPUs and GPUs, since the energy minimization works in the literature have shown that platforms combining GPUs and CPUs are more energy efficient than those composed only of CPUs.
+
+Finally, it would be interesting to compare the accuracy of the
+results of the proposed energy models to the values given by instruments
+that measure the energy consumption of CPUs during the execution time, as in \cite{ref106}.
+
+
+
+
+
+
+
\section*{1. General Introduction}
\addcontentsline{toc}{section}{1. General Introduction }
+The need and demand for more computing power have been increasing since the
+birth of the first computing unit, and they are not expected to slow down in the
+coming years. To meet these demands, the frequency of the CPU was at first regularly increased until reaching its thermal limit. Then, researchers and supercomputer
+constructors turned to regularly increasing the number of computing cores and
+processors in supercomputers. Many parallel and distributed architectures, such as multi-core processors, clusters and grids, were implemented in order to obtain more computing power. This approach consists in using many computing nodes at the same time to solve a big problem that cannot be solved on a single node.
+These two approaches are the most common ones up to now to get more computing power, but they increase the energy consumption of the resulting computing architecture.
+Indeed, the power consumed by a processor increases exponentially when its frequency increases, and a platform consisting of $N$ computing nodes consumes as much as the sum of the power consumed by each of its computing nodes.
+As an example, the Chinese supercomputer
+Tianhe-2 had the highest FLOPS count in November 2015 according to the Top500 list
+\cite{ref101}. However, it was also the most power hungry
+platform, with its over 3 million cores consuming around 17.8 megawatts.
+Moreover, according to the U.S. annual energy outlook 2015
+\cite{ref102}, the price of one megawatt-hour of energy
+was approximately \$70. Therefore, the price of the energy consumed by
+the Tianhe-2 platform is more than \$10 million per year.
+Moreover, the platform generates a lot of heat, so a cooling infrastructure \cite{ref103}, which also consumes a lot of energy, must be implemented to keep the platform from overheating. High CPU temperatures can also drastically increase the CPU's energy consumption, see \cite{ref104} for more details.
+Computing platforms must therefore be more energy efficient and offer the highest possible number
+of FLOPS per watt, such as the Shoubu-ExaScaler from RIKEN,
+which reached the top of the Green500 list in November 2015 \cite{ref105}.
+This heterogeneous platform executes more than 7 GFLOPS per watt while consuming
+50.32 kilowatts.
+
+For all these reasons, energy reduction has become an important topic in the high performance
+computing (HPC) field. To tackle this problem, many researchers use DVFS (Dynamic
+Voltage and Frequency Scaling) operations which dynamically reduce the frequency and
+voltage of cores and thus their energy consumption \cite{ref49}.
+Indeed, modern CPUs offer a set of available frequencies, usually called gears, and the user or
+the operating system can modify the frequency of the processor according to its
+needs. However, DVFS reduces the number of FLOPS executed by the processor, which may increase the execution
+time of the application running on that processor.
+Therefore, researchers try to reduce the frequency to the minimum when processors are idle
+(waiting for data from other processors or communicating with other processors).
+Moreover, depending on their objectives, they use heuristics to find the best
+frequency scaling factor during the computation. If they aim for performance, they choose
+the frequency scaling factor that reduces the consumed energy while affecting as
+little as possible the performance. On the other hand, if they aim for energy
+reduction, the chosen frequency scaling factor must produce the most energy efficient
+execution without considering the degradation of the performance. However, it is
+important to notice that lowering the frequency to its minimum value does not always
+give the most energy efficient execution, due to energy leakage that increases the total energy consumption of the CPU when the execution time increases. Hence, the more important question is: how to select the frequency gears that minimize the total energy consumption and maximize the performance of a parallel application running over a parallel platform at the same time?
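The energy leakage effect mentioned above can be illustrated with a toy model (all constants below are hypothetical and chosen only for illustration): dynamic power falls roughly with the cube of the frequency, while static power is paid for the whole, stretched execution time, so the minimum-energy gear is usually an intermediate one rather than the lowest:

```python
def cpu_energy(f, f_max=3.0, pd_max=80.0, ps=20.0, t_max=10.0):
    """Total energy (J) for a fixed compute-bound workload run at f GHz.

    f_max:  maximum frequency (GHz); pd_max: dynamic power at f_max (W)
    ps:     static (leakage) power (W); t_max: execution time at f_max (s)
    """
    s = f_max / f                 # frequency scaling factor S >= 1
    t = t_max * s                 # compute-bound time stretches linearly
    p_dyn = pd_max / s ** 3       # P_dyn ~ C * V^2 * f, with V roughly ~ f
    return (p_dyn + ps) * t       # dynamic + static energy over time t
```

With these toy constants, running at 1.5 GHz consumes less total energy than either 3.0 GHz (high dynamic power) or 1.0 GHz (static power paid over a much longer run), which is exactly why naively picking the lowest gear is not optimal.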
+
+
+
+
\section*{2. Motivation of the Dissertation}
\addcontentsline{toc}{section}{2. Motivation of the Dissertation }
-\section*{3. The Objective of this Dissertation}
-\addcontentsline{toc}{section}{3. The Objective of this Dissertation}
+The main objective of HPC systems is to execute applications as fast as possible over a parallel architecture.
+However, using DVFS to scale down the frequencies of the CPUs composing the parallel architecture in order to reduce the energy consumption can also significantly degrade the performance of the executed program, especially if the program is compute bound, i.e. if it depends mainly on the computing power of the processor, and if a low CPU frequency is selected.
+Therefore, the chosen frequency scaling factor must give the best possible trade-off between the energy reduction and the performance of the parallel application.
+
+On the other hand, the relation between the energy consumption and the execution time of parallel applications is complex and non-linear. It is very hard to optimize both the energy consumption and the performance of a parallel application when scaling the frequency of the processors executing it, because one affects the other. Very few works in the state of the art have been dedicated to this problem. Therefore, mathematical models of both the energy consumption and the performance of the parallel application running over a parallel platform are required, and they should be defined precisely in order to discover the best relation between them.
+
+Furthermore, researchers use different optimization strategies to select the frequencies that give the best trade-off between energy reduction and performance degradation; the frequencies might be chosen during execution (online) or during a pre-execution phase (offline). Thus, the best optimization approach is one that optimizes the energy consumption and the performance at the same time and is applied online, with a very low overhead on the execution time of the parallel application.
\section*{3. Main Contributions of this Dissertation}
\addcontentsline{toc}{section}{3. Main Contributions of this Dissertation}
+The main contributions of this dissertation focus on optimizing both the energy consumption and the performance of iterative parallel applications running over clusters and grids. They can be summarized as follows:
-
-%\section{Methodology}
-%In this dissertation, analytical, as well as computational, methods are used.
+\begin{enumerate} [I)]
+\item We developed energy consumption and performance models for synchronous and asynchronous message passing iterative applications. These models take into consideration both the computation and communication times of these applications, in addition to their relation with the frequency scaling factors.
-% \section{ Refereed Journal and Conference Publications}
+\item The iterative parallel applications were executed over different parallel architectures: a homogeneous local cluster, a heterogeneous local cluster and distributed clusters (a grid platform). The main goal behind using these different platforms is to study the effect on the energy consumption and the performance of iterative applications of the heterogeneity in the computing powers of the nodes and in the communication networks connecting them.
+
+\item Based on the proposed energy consumption and performance models, we defined a new objective function to optimize both the energy consumption and the performance of iterative parallel applications at the same time. The proposed objective function computes the maximum distance between the predicted energy consumption and the predicted performance curves to define the best possible trade-off between them.
+
+\item New online frequency selection algorithms for clusters and grids were developed using the new objective function. These algorithms select the frequency scaling factors that simultaneously optimize both the energy consumption and the performance. They have a very small overhead compared to other methods in the state of the art, and they work without training or profiling.
+
+\item We conducted extensive simulation experiments over the SimGrid simulator \cite{ref66}, which offers flexible and easy tools to build different types of parallel architectures. Furthermore, real experiments were conducted over the Grid'5000 testbed \cite{ref21} and compared with the simulated ones.
+The experiments were executed over different numbers of nodes and different platform scenarios.
+
+\item In both the simulation and real experiments, the NAS parallel benchmarks \cite{ref65} and the multi-splitting method solving 3D problems of different sizes were used as parallel applications executed on clusters and grids. The final goal is to evaluate the proposed methods over these applications and to test how they adapt to applications with different computation to communication ratios.
+
+\item All the methods proposed in this work were compared to two approaches found in the literature: the methods of Rauber and Rünger \cite{ref47} and of Spiliopoulos et al. \cite{ref67}. Both the simulation and real testbed results showed that the proposed methods give better energy and performance trade-off ratios than these methods.
+
+\end{enumerate}
+
+
\section*{4. Dissertation Outline}
\addcontentsline{toc}{section}{4. Dissertation Outline}
+The dissertation is organized as follows: chapter \ref{ch1} presents the scientific background on the types of parallel architectures, parallel iterative applications and an energy consumption model from the state of the art that can be used to measure the energy consumption of these applications.
+Chapter \ref{ch2} describes the proposed energy and performance optimization method for synchronous iterative applications running over a homogeneous cluster. Chapter \ref{ch3} presents two algorithms for the energy and performance optimization of synchronous iterative applications running over a heterogeneous cluster and a grid. Chapter \ref{ch4} presents
+the proposed energy and performance optimization method for asynchronous iterative applications running over a grid. Finally, chapter \ref{ch5} concludes the work of this dissertation.
+
\ No newline at end of file
\addcontentsline{toc}{chapter}{List of Algorithms}
\setlength{\parindent}{0.5cm}
-%\addcontentsline{toc}{chapter}{List of Abbreviations}
+\addcontentsline{toc}{chapter}{List of Abbreviations}
%% tr
-%\include{ACRONYMS}
+\include{ACRONYMS}
\include{Dedication}
%% The second mandatory parameter is the date of the PhD defense.
%% The third mandatory parameter is the reference number given by the University Library after the PhD defense.
%\declarethesis[Sous-titre]{Titre}{17 septembre 2012}{XXX}
-\declarethesis{Thesis title}{Date}{21306697}
+\declarethesis{Energy Consumption Optimization of Parallel Iterative Applications using CPU Frequency Scaling}{Date}{21306697}
%%--------------------
%% Set the author of the PhD thesis
url = {http://www.green500.org}
}
+@inproceedings{ref106,
+ title = {GreenHPC: a novel framework to measure energy consumption on HPC applications},
+ author = {Gustavo Rostirolla and Rodrigo Da Rosa Righi and Vinicius Facco Rodrigues and Pedro Velho and Edson Luiz Padoin},
+ year = {2015},
+ pages = {1-8},
+ booktitle = {2015 Sustainable Internet and ICT for Sustainability, SustainIT 2015, Madrid, Spain, April 14-15, 2015},
+ publisher = {IEEE},
+
+}
+
\begin{center}%
-{\huge \textbf{Thesis title}} \\[.5cm]
+{\huge \textbf{Energy Consumption Optimization of Parallel Iterative Applications using
+CPU \\ Frequency Scaling}} \\[.5cm]
{\large By}\\[.5cm]%
{\huge Ahmed Badri Muslim Fanfakh}\\[.5cm]%
\end{center}