X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/rce2015.git/blobdiff_plain/2ef29db4c155662efe60a4020c208c62025628f4..34053b2cbdb34bf90e60922c39c759d343ed375d:/paper.tex diff --git a/paper.tex b/paper.tex index 2c9b309..82777e3 100644 --- a/paper.tex +++ b/paper.tex @@ -518,26 +518,28 @@ multisplitting method. \end{figure} -The execution time difference between the two algorithms is important when -comparing between different grid architectures, even with the same number of -processors (like 2x16 and 4x8 = 32 processors for example). The -experiment concludes the low sensitivity of the multisplitting method -(compared with the classical GMRES) when scaling up the number of the processors in the grid: in average, the GMRES (resp. Multisplitting) algorithm performs 40\% better (resp. 48\%) less when running from 2x16=32 to 8x8=64 processors. - -\textit{\\3.b Running on two different speed cluster inter-networks\\} +The execution times between the two algorithms is significant with different +grid architectures, even with the same number of processors (for example, 2x16 +and 4x8). We can observ the low sensitivity of the Krylov multisplitting method +(compared with the classical GMRES) when scaling up the number of the processors +in the grid: in average, the GMRES (resp. Multisplitting) algorithm performs +40\% better (resp. 48\%) less when running from 2x16=32 to 8x8=64 processors. + +\subsubsection{Running on two different speed cluster inter-networks} +\ \\ -% environment -\begin{footnotesize} +\begin{figure} [ht!] +\begin{center} \begin{tabular}{r c } \hline Grid & 2x16, 4x8\\ %\hline Network & N1 : bw=10Gbs-lat=8.10$^{-6}$ \\ %\hline - & N2 : bw=1Gbs-lat=5.10$^{-5}$ \\ - Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline \\ + Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline \end{tabular} -Table 2 : Clusters x Nodes - Networks N1 x N2 \\ - - \end{footnotesize} +\caption{Clusters x Nodes - Networks N1 x N2} +\end{center} +\end{figure} @@ -546,31 +548,30 @@ Table 2 : Clusters x Nodes - Networks N1 x N2 \\ \centering \includegraphics[width=100mm]{cluster_x_nodes_n1_x_n2.pdf} \caption{Cluster x Nodes N1 x N2} -%\label{overflow}} +\label{fig:02} \end{figure} %\end{wrapfigure} -The experiments compare the behavior of the algorithms running first on -a speed inter- cluster network (N1) and also on a less performant network (N2). -Figure 4 shows that end users will gain to reduce the execution time -for both algorithms in using a grid architecture like 4x16 or 8x8: the -performance was increased in a factor of 2. The results depict also that -when the network speed drops down (12.5\%), the difference between the execution -times can reach more than 25\%. +These experiments compare the behavior of the algorithms running first on a +speed inter-cluster network (N1) and also on a less performant network (N2). +Figure~\ref{fig:02} shows that end users will gain to reduce the execution time +for both algorithms in using a grid architecture like 4x16 or 8x8: the +performance was increased in a factor of 2. The results depict also that when +the network speed drops down (12.5\%), the difference between the execution +times can reach more than 25\%. \RC{c'est pas clair : la différence entre quoi et quoi?} -\textit{\\3.c Network latency impacts on performance\\} - -% environment -\begin{footnotesize} +\subsubsection{Network latency impacts on performance} +\ \\ +\begin{figure} [ht!] +\centering \begin{tabular}{r c } \hline Grid & 2x16\\ %\hline Network & N1 : bw=1Gbs \\ %\hline - Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline\\ + Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline \end{tabular} -Table 3 : Network latency impact \\ - -\end{footnotesize} +\caption{Network latency impact} +\end{figure} @@ -578,142 +579,141 @@ Table 3 : Network latency impact \\ \centering \includegraphics[width=100mm]{network_latency_impact_on_execution_time.pdf} \caption{Network latency impact on execution time} -%\label{overflow}} +\label{fig:03} \end{figure} -According the results in figure 5, degradation of the network -latency from 8.10$^{-6}$ to 6.10$^{-5}$ implies an absolute time -increase more than 75\% (resp. 82\%) of the execution for the classical -GMRES (resp. multisplitting) algorithm. In addition, it appears that the -multisplitting method tolerates more the network latency variation with -a less rate increase of the execution time. Consequently, in the worst case (lat=6.10$^{-5 -}$), the execution time for GMRES is almost the double of the time for -the multisplitting, even though, the performance was on the same order -of magnitude with a latency of 8.10$^{-6}$. +According the results in Figure~\ref{fig:03}, a degradation of the network +latency from 8.10$^{-6}$ to 6.10$^{-5}$ implies an absolute time increase more +than 75\% (resp. 82\%) of the execution for the classical GMRES (resp. Krylov +multisplitting) algorithm. In addition, it appears that the Krylov +multisplitting method tolerates more the network latency variation with a less +rate increase of the execution time. Consequently, in the worst case +(lat=6.10$^{-5 }$), the execution time for GMRES is almost the double than the +time of the Krylov multisplitting, even though, the performance was on the same +order of magnitude with a latency of 8.10$^{-6}$. -\textit{\\3.d Network bandwidth impacts on performance\\} - -% environment -\begin{footnotesize} +\subsubsection{Network bandwidth impacts on performance} +\ \\ +\begin{figure} [ht!] +\centering \begin{tabular}{r c } \hline Grid & 2x16\\ %\hline Network & N1 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline \\ \end{tabular} -Table 4 : Network bandwidth impact \\ - -\end{footnotesize} +\caption{Network bandwidth impact} +\end{figure} \begin{figure} [ht!] \centering \includegraphics[width=100mm]{network_bandwith_impact_on_execution_time.pdf} \caption{Network bandwith impact on execution time} -%\label{overflow} +\label{fig:04} \end{figure} -The results of increasing the network bandwidth show the improvement -of the performance for both of the two algorithms by reducing the execution time (Figure 6). However, and again in this case, the multisplitting method presents a better performance in the considered bandwidth interval with a gain of 40\% which is only around 24\% for classical GMRES. +The results of increasing the network bandwidth show the improvement of the +performance for both algorithms by reducing the execution time (see +Figure~\ref{fig:04}). However, in this case, the Krylov multisplitting method +presents a better performance in the considered bandwidth interval with a gain +of 40\% which is only around 24\% for classical GMRES. -\textit{\\3.e Input matrix size impacts on performance\\} - -% environment -\begin{footnotesize} +\subsubsection{Input matrix size impacts on performance} +\ \\ +\begin{figure} [ht!] +\centering \begin{tabular}{r c } \hline Grid & 4x8\\ %\hline - Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline - Input matrix size & N$_{x}$ = From 40 to 200\\ \hline \\ + Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\ + Input matrix size & N$_{x}$ = From 40 to 200\\ \hline \end{tabular} -Table 5 : Input matrix size impact\\ - -\end{footnotesize} +\caption{Input matrix size impact} +\end{figure} \begin{figure} [ht!] \centering \includegraphics[width=100mm]{pb_size_impact_on_execution_time.pdf} -\caption{Pb size impact on execution time} -%\label{overflow}} +\caption{Problem size impact on execution time} +\label{fig:05} \end{figure} -In this experimentation, the input matrix size has been set from -N$_{x}$ = N$_{y}$ = N$_{z}$ = 40 to 200 side elements that is from 40$^{3}$ = 64.000 to -200$^{3}$ = 8.000.000 points. Obviously, as shown in the figure 7, -the execution time for the two algorithms convergence increases with the -iinput matrix size. But the interesting results here direct on (i) the -drastic increase (300 times) of the number of iterations needed before -the convergence for the classical GMRES algorithm when the matrix size -go beyond N$_{x}$=150; (ii) the classical GMRES execution time also almost -the double from N$_{x}$=140 compared with the convergence time of the -multisplitting method. These findings may help a lot end users to setup -the best and the optimal targeted environment for the application -deployment when focusing on the problem size scale up. Note that the -same test has been done with the grid 2x16 getting the same conclusion. - -\textit{\\3.f CPU Power impact on performance\\} +In these experiments, the input matrix size has been set from N$_{x}$ = N$_{y}$ += N$_{z}$ = 40 to 200 side elements that is from 40$^{3}$ = 64.000 to 200$^{3}$ += 8,000,000 points. Obviously, as shown in Figure~\ref{fig:05}, the execution +time for both algorithms increases when the input matrix size also increases. +But the interesting results are: +\begin{enumerate} + \item the drastic increase (300 times) \RC{Je ne vois pas cela sur la figure} +of the number of iterations needed to reach the convergence for the classical +GMRES algorithm when the matrix size go beyond N$_{x}$=150; +\item the classical GMRES execution time is almost the double for N$_{x}$=140 + compared with the Krylov multisplitting method. +\end{enumerate} -% environment -\begin{footnotesize} +These findings may help a lot end users to setup the best and the optimal +targeted environment for the application deployment when focusing on the problem +size scale up. It should be noticed that the same test has been done with the +grid 2x16 leading to the same conclusion. + +\subsubsection{CPU Power impact on performance} + +\begin{figure} [ht!] +\centering \begin{tabular}{r c } \hline Grid & 2x16\\ %\hline Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline \end{tabular} -Table 6 : CPU Power impact \\ - -\end{footnotesize} - +\caption{CPU Power impact} +\end{figure} \begin{figure} [ht!] \centering \includegraphics[width=100mm]{cpu_power_impact_on_execution_time.pdf} \caption{CPU Power impact on execution time} -%\label{overflow}} -s\end{figure} - -Using the Simgrid simulator flexibility, we have tried to determine the -impact on the algorithms performance in varying the CPU power of the -clusters nodes from 1 to 19 GFlops. The outputs depicted in the figure 6 -confirm the performance gain, around 95\% for both of the two methods, -after adding more powerful CPU. - -\subsection{Comparing GMRES in native synchronous mode and -Multisplitting algorithms in asynchronous mode} - -The previous paragraphs put in evidence the interests to simulate the -behavior of the application before any deployment in a real environment. -We have focused the study on analyzing the performance in varying the -key factors impacting the results. The study compares -the performance of the two proposed algorithms both in \textit{synchronous mode -}. In this section, following the same previous methodology, the goal is to -demonstrate the efficiency of the multisplitting method in \textit{ -asynchronous mode} compared with the classical GMRES staying in -\textit{synchronous mode}. - -Note that the interest of using the asynchronous mode for data exchange -is mainly, in opposite of the synchronous mode, the non-wait aspects of -the current computation after a communication operation like sending -some data between nodes. Each processor can continue their local -calculation without waiting for the end of the communication. Thus, the -asynchronous may theoretically reduce the overall execution time and can -improve the algorithm performance. - -As stated supra, Simgrid simulator tool has been used to prove the -efficiency of the multisplitting in asynchronous mode and to find the -best combination of the grid resources (CPU, Network, input matrix size, -\ldots ) to get the highest \textit{"relative gain"} (exec\_time$_{GMRES}$ / exec\_time$_{multisplitting}$) in comparison with the classical GMRES time. +\label{fig:06} +\end{figure} + +Using the Simgrid simulator flexibility, we have tried to determine the impact +on the algorithms performance in varying the CPU power of the clusters nodes +from 1 to 19 GFlops. The outputs depicted in Figure~\ref{fig:06} confirm the +performance gain, around 95\% for both of the two methods, after adding more +powerful CPU. + +\subsection{Comparing GMRES in native synchronous mode and the multisplitting algorithm in asynchronous mode} + +The previous paragraphs put in evidence the interests to simulate the behavior +of the application before any deployment in a real environment. In this +section, following the same previous methodology, our goal is to compare the +efficiency of the multisplitting method in \textit{ asynchronous mode} with the +classical GMRES in \textit{synchronous mode}. + +The interest of using an asynchronous algorithm is that there is no more +synchronization. With geographically distant clusters, this may be essential. +In this case, each processor can compute its iteration freely without any +synchronization with the other processors. Thus, the asynchronous may +theoretically reduce the overall execution time and can improve the algorithm +performance. + +\RC{la phrase suivante est bizarre, je ne comprends pas pourquoi elle vient ici} +As stated before, the Simgrid simulator tool has been successfully used to show +the efficiency of the multisplitting in asynchronous mode and to find the best +combination of the grid resources (CPU, Network, input matrix size, \ldots ) to +get the highest \textit{"relative gain"} (exec\_time$_{GMRES}$ / +exec\_time$_{multisplitting}$) in comparison with the classical GMRES time. The test conditions are summarized in the table below : \\ -% environment -\begin{footnotesize} +\begin{figure} [ht!] +\centering \begin{tabular}{r c } \hline Grid & 2x50 totaling 100 processors\\ %\hline @@ -723,15 +723,17 @@ The test conditions are summarized in the table below : \\ Input matrix size & N$_{x}$ = From 62 to 150\\ %\hline Residual error precision & 10$^{-5}$ to 10$^{-9}$\\ \hline \\ \end{tabular} -\end{footnotesize} +\end{figure} -Again, comprehensive and extensive tests have been conducted varying the -CPU power and the network parameters (bandwidth and latency) in the -simulator tool with different problem size. The relative gains greater -than 1 between the two algorithms have been captured after each step of -the test. Table 7 below has recorded the best grid configurations -allowing the multisplitting method execution time more performant 2.5 times than -the classical GMRES execution and convergence time. The experimentation has demonstrated the relative multisplitting algorithm tolerance when using a low speed network that we encounter usually with distant clusters thru the internet. +Again, comprehensive and extensive tests have been conducted with different +parametes as the CPU power, the network parameters (bandwidth and latency) in +the simulator tool and with different problem size. The relative gains greater +than 1 between the two algorithms have been captured after each step of the +test. In Figure~\ref{table:01} are reported the best grid configurations +allowing the multisplitting method to be more than 2.5 times faster than the +classical GMRES. These experiments also show the relative tolerance of the +multisplitting algorithm when using a low speed network as usually observed with +geographically distant clusters throuth the internet. % use the same column width for the following three tables \newlength{\mytablew}\settowidth{\mytablew}{\footnotesize\np{E-11}} @@ -742,14 +744,12 @@ the classical GMRES execution and convergence time. The experimentation has demo \end{tabular}} -\begin{table}[!t] - \centering +\begin{figure}[!t] +\centering +%\begin{table} % \caption{Relative gain of the multisplitting algorithm compared with the classical GMRES} % \label{"Table 7"} -Table 7. Relative gain of the multisplitting algorithm compared with -the classical GMRES \\ - - \begin{mytable}{11} + \begin{mytable}{11} \hline bandwidth (Mbit/s) & 5 & 5 & 5 & 5 & 5 & 50 & 50 & 50 & 50 & 50 \\ @@ -770,7 +770,11 @@ the classical GMRES \\ & 2.52 & 2.55 & 2.52 & 2.57 & 2.54 & 2.53 & 2.51 & 2.58 & 2.55 & 2.54 \\ \hline \end{mytable} -\end{table} +%\end{table} + \caption{Relative gain of the multisplitting algorithm compared with the classical GMRES} + \label{table:01} +\end{figure} + \section{Conclusion} CONCLUSION