From: David Laiymani Date: Wed, 6 May 2015 13:59:36 +0000 (+0200) Subject: Merge branch 'master' of ssh://bilbo.iut-bm.univ-fcomte.fr/rce2015 X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/rce2015.git/commitdiff_plain/f99a638c921635746c3df9d45c46fe345f57a074?ds=inline;hp=-c Merge branch 'master' of ssh://bilbo.iut-bm.univ-fcomte.fr/rce2015 --- f99a638c921635746c3df9d45c46fe345f57a074 diff --combined paper.tex index b11da8c,8fd9fb4..e60e242 --- a/paper.tex +++ b/paper.tex @@@ -96,21 -96,20 +96,21 @@@ performed. With some applications, it i accurate performance models. That is why another solution is to use a simulation tool which allows us to change many parameters of the architecture (network bandwidth, latency, number of processors) and to simulate the execution of such -applications. We have decided to use SimGrid as it enables to benchmark MPI -applications. +applications. The main contribution of this paper is to show that the use of a +simulation tool (here we have decided to use the SimGrid toolkit) can really +help developpers to better tune their applications for a given multi-core +architecture. -In this paper, we focus our attention on two parallel iterative algorithms based +In particular we focus our attention on two parallel iterative algorithms based on the Multisplitting algorithm and we compare them to the GMRES algorithm. -These algorithms are used to solve libear systems. Two different variants of +These algorithms are used to solve linear systems. Two different variants of the Multisplitting are studied: one using synchronoous iterations and another -one with asynchronous iterations. For each algorithm we have tested different -parameters to see their influence. We strongly recommend people interested -by investing into a new expensive hardware architecture to benchmark -their applications using a simulation tool before. - - - +one with asynchronous iterations. For each algorithm we have simulated +different architecture parameters to evaluate their influence on the overall +execution time. The obtain simulated results confirm the real results +previously obtained on different real multi-core architectures and also confirm +the efficiency of the asynchronous multisplitting algorithm compared to the +synchronous GMRES method. \end{abstract} @@@ -519,26 -518,28 +519,28 @@@ multisplitting method \end{figure} - The execution time difference between the two algorithms is important when - comparing between different grid architectures, even with the same number of - processors (like 2x16 and 4x8 = 32 processors for example). The - experiment concludes the low sensitivity of the multisplitting method - (compared with the classical GMRES) when scaling up the number of the processors in the grid: in average, the GMRES (resp. Multisplitting) algorithm performs 40\% better (resp. 48\%) less when running from 2x16=32 to 8x8=64 processors. - - \textit{\\3.b Running on two different speed cluster inter-networks\\} + The execution times between the two algorithms is significant with different + grid architectures, even with the same number of processors (for example, 2x16 + and 4x8). We can observ the low sensitivity of the Krylov multisplitting method + (compared with the classical GMRES) when scaling up the number of the processors + in the grid: in average, the GMRES (resp. Multisplitting) algorithm performs + 40\% better (resp. 48\%) less when running from 2x16=32 to 8x8=64 processors. + + \subsubsection{Running on two different speed cluster inter-networks} + \ \\ - % environment - \begin{footnotesize} + \begin{figure} [ht!] + \begin{center} \begin{tabular}{r c } \hline Grid & 2x16, 4x8\\ %\hline Network & N1 : bw=10Gbs-lat=8.10$^{-6}$ \\ %\hline - & N2 : bw=1Gbs-lat=5.10$^{-5}$ \\ - Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline \\ + Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline \end{tabular} - Table 2 : Clusters x Nodes - Networks N1 x N2 \\ - - \end{footnotesize} + \caption{Clusters x Nodes - Networks N1 x N2} + \end{center} + \end{figure} @@@ -547,31 -548,30 +549,30 @@@ \centering \includegraphics[width=100mm]{cluster_x_nodes_n1_x_n2.pdf} \caption{Cluster x Nodes N1 x N2} - %\label{overflow}} + \label{fig:02} \end{figure} %\end{wrapfigure} - The experiments compare the behavior of the algorithms running first on - a speed inter- cluster network (N1) and also on a less performant network (N2). - Figure 4 shows that end users will gain to reduce the execution time - for both algorithms in using a grid architecture like 4x16 or 8x8: the - performance was increased in a factor of 2. The results depict also that - when the network speed drops down (12.5\%), the difference between the execution - times can reach more than 25\%. + These experiments compare the behavior of the algorithms running first on a + speed inter-cluster network (N1) and also on a less performant network (N2). + Figure~\ref{fig:02} shows that end users will gain to reduce the execution time + for both algorithms in using a grid architecture like 4x16 or 8x8: the + performance was increased in a factor of 2. The results depict also that when + the network speed drops down (12.5\%), the difference between the execution + times can reach more than 25\%. \RC{c'est pas clair : la différence entre quoi et quoi?} - \textit{\\3.c Network latency impacts on performance\\} - - % environment - \begin{footnotesize} + \subsubsection{Network latency impacts on performance} + \ \\ + \begin{figure} [ht!] + \centering \begin{tabular}{r c } \hline Grid & 2x16\\ %\hline Network & N1 : bw=1Gbs \\ %\hline - Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline\\ + Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline \end{tabular} - Table 3 : Network latency impact \\ - - \end{footnotesize} + \caption{Network latency impact} + \end{figure} @@@ -579,123 -579,124 +580,124 @@@ \centering \includegraphics[width=100mm]{network_latency_impact_on_execution_time.pdf} \caption{Network latency impact on execution time} - %\label{overflow}} + \label{fig:03} \end{figure} - According the results in figure 5, degradation of the network - latency from 8.10$^{-6}$ to 6.10$^{-5}$ implies an absolute time - increase more than 75\% (resp. 82\%) of the execution for the classical - GMRES (resp. multisplitting) algorithm. In addition, it appears that the - multisplitting method tolerates more the network latency variation with - a less rate increase of the execution time. Consequently, in the worst case (lat=6.10$^{-5 - }$), the execution time for GMRES is almost the double of the time for - the multisplitting, even though, the performance was on the same order - of magnitude with a latency of 8.10$^{-6}$. - - \textit{\\3.d Network bandwidth impacts on performance\\} + According the results in Figure~\ref{fig:03}, a degradation of the network + latency from 8.10$^{-6}$ to 6.10$^{-5}$ implies an absolute time increase more + than 75\% (resp. 82\%) of the execution for the classical GMRES (resp. Krylov + multisplitting) algorithm. In addition, it appears that the Krylov + multisplitting method tolerates more the network latency variation with a less + rate increase of the execution time. Consequently, in the worst case + (lat=6.10$^{-5 }$), the execution time for GMRES is almost the double than the + time of the Krylov multisplitting, even though, the performance was on the same + order of magnitude with a latency of 8.10$^{-6}$. - % environment - \begin{footnotesize} + \subsubsection{Network bandwidth impacts on performance} + \ \\ + \begin{figure} [ht!] + \centering \begin{tabular}{r c } \hline Grid & 2x16\\ %\hline Network & N1 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline \\ \end{tabular} - Table 4 : Network bandwidth impact \\ - - \end{footnotesize} + \caption{Network bandwidth impact} + \end{figure} \begin{figure} [ht!] \centering \includegraphics[width=100mm]{network_bandwith_impact_on_execution_time.pdf} \caption{Network bandwith impact on execution time} - %\label{overflow} + \label{fig:04} \end{figure} - The results of increasing the network bandwidth show the improvement - of the performance for both of the two algorithms by reducing the execution time (Figure 6). However, and again in this case, the multisplitting method presents a better performance in the considered bandwidth interval with a gain of 40\% which is only around 24\% for classical GMRES. - - \textit{\\3.e Input matrix size impacts on performance\\} + The results of increasing the network bandwidth show the improvement of the + performance for both algorithms by reducing the execution time (see + Figure~\ref{fig:04}). However, in this case, the Krylov multisplitting method + presents a better performance in the considered bandwidth interval with a gain + of 40\% which is only around 24\% for classical GMRES. - % environment - \begin{footnotesize} + \subsubsection{Input matrix size impacts on performance} + \ \\ + \begin{figure} [ht!] + \centering \begin{tabular}{r c } \hline Grid & 4x8\\ %\hline - Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline - Input matrix size & N$_{x}$ = From 40 to 200\\ \hline \\ + Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\ + Input matrix size & N$_{x}$ = From 40 to 200\\ \hline \end{tabular} - Table 5 : Input matrix size impact\\ - - \end{footnotesize} + \caption{Input matrix size impact} + \end{figure} \begin{figure} [ht!] \centering \includegraphics[width=100mm]{pb_size_impact_on_execution_time.pdf} - \caption{Pb size impact on execution time} - %\label{overflow}} + \caption{Problem size impact on execution time} + \label{fig:05} \end{figure} - In this experimentation, the input matrix size has been set from - N$_{x}$ = N$_{y}$ = N$_{z}$ = 40 to 200 side elements that is from 40$^{3}$ = 64.000 to - 200$^{3}$ = 8.000.000 points. Obviously, as shown in the figure 7, - the execution time for the two algorithms convergence increases with the - iinput matrix size. But the interesting results here direct on (i) the - drastic increase (300 times) of the number of iterations needed before - the convergence for the classical GMRES algorithm when the matrix size - go beyond N$_{x}$=150; (ii) the classical GMRES execution time also almost - the double from N$_{x}$=140 compared with the convergence time of the - multisplitting method. These findings may help a lot end users to setup - the best and the optimal targeted environment for the application - deployment when focusing on the problem size scale up. Note that the - same test has been done with the grid 2x16 getting the same conclusion. - - \textit{\\3.f CPU Power impact on performance\\} + In these experiments, the input matrix size has been set from N$_{x}$ = N$_{y}$ + = N$_{z}$ = 40 to 200 side elements that is from 40$^{3}$ = 64.000 to 200$^{3}$ + = 8,000,000 points. Obviously, as shown in Figure~\ref{fig:05}, the execution + time for both algorithms increases when the input matrix size also increases. + But the interesting results are: + \begin{enumerate} + \item the drastic increase (300 times) \RC{Je ne vois pas cela sur la figure} + of the number of iterations needed to reach the convergence for the classical + GMRES algorithm when the matrix size go beyond N$_{x}$=150; + \item the classical GMRES execution time is almost the double for N$_{x}$=140 + compared with the Krylov multisplitting method. + \end{enumerate} - % environment - \begin{footnotesize} + These findings may help a lot end users to setup the best and the optimal + targeted environment for the application deployment when focusing on the problem + size scale up. It should be noticed that the same test has been done with the + grid 2x16 leading to the same conclusion. + + \subsubsection{CPU Power impact on performance} + + \begin{figure} [ht!] + \centering \begin{tabular}{r c } \hline Grid & 2x16\\ %\hline Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline \end{tabular} - Table 6 : CPU Power impact \\ - - \end{footnotesize} - + \caption{CPU Power impact} + \end{figure} \begin{figure} [ht!] \centering \includegraphics[width=100mm]{cpu_power_impact_on_execution_time.pdf} \caption{CPU Power impact on execution time} - %\label{overflow}} - s\end{figure} - - Using the Simgrid simulator flexibility, we have tried to determine the - impact on the algorithms performance in varying the CPU power of the - clusters nodes from 1 to 19 GFlops. The outputs depicted in the figure 6 - confirm the performance gain, around 95\% for both of the two methods, - after adding more powerful CPU. - - \subsection{Comparing GMRES in native synchronous mode and - Multisplitting algorithms in asynchronous mode} - - The previous paragraphs put in evidence the interests to simulate the - behavior of the application before any deployment in a real environment. - We have focused the study on analyzing the performance in varying the - key factors impacting the results. The study compares - the performance of the two proposed algorithms both in \textit{synchronous mode - }. In this section, following the same previous methodology, the goal is to - demonstrate the efficiency of the multisplitting method in \textit{ - asynchronous mode} compared with the classical GMRES staying in - \textit{synchronous mode}. + \label{fig:06} + \end{figure} + + Using the Simgrid simulator flexibility, we have tried to determine the impact + on the algorithms performance in varying the CPU power of the clusters nodes + from 1 to 19 GFlops. The outputs depicted in Figure~\ref{fig:06} confirm the + performance gain, around 95\% for both of the two methods, after adding more + powerful CPU. + + \subsection{Comparing GMRES in native synchronous mode and the multisplitting algorithm in asynchronous mode} + + The previous paragraphs put in evidence the interests to simulate the behavior + of the application before any deployment in a real environment. We have focused + the study on analyzing the performance in varying the key factors impacting the + results. The study compares the performance of the two proposed algorithms both + in \textit{synchronous mode }. In this section, following the same previous + methodology, the goal is to demonstrate the efficiency of the multisplitting + method in \textit{ asynchronous mode} compared with the classical GMRES staying + in \textit{synchronous mode}. Note that the interest of using the asynchronous mode for data exchange is mainly, in opposite of the synchronous mode, the non-wait aspects of