X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/rce2015.git/blobdiff_plain/d19f827e7641db37b6e37ee51db8e11b4ab92407..3a3dd233534f018748e5621c296ad59fad7c50d4:/paper.tex?ds=sidebyside

diff --git a/paper.tex b/paper.tex
index cdcce00..1391f0f 100644
--- a/paper.tex
+++ b/paper.tex
@@ -588,6 +588,7 @@ efficient for distributed systems with high latency networks.
 \includegraphics[width=100mm]{cluster_x_nodes_n1_x_n2.pdf}
 \caption{Various grid configurations with networks $N1$ vs. $N2$}
 \LZK{CE, replace the decimal ``,'' separators with ``.''}
+\RCE{ok}
 \label{fig:02}
 \end{figure}

@@ -597,7 +598,7 @@ Figure~\ref{fig:03} shows the impact of the network latency on the performances
 \begin{figure}[ht]
 \centering
 \includegraphics[width=100mm]{network_latency_impact_on_execution_time.pdf}
-\caption{Network latency impacts on execution times}
+\caption{Network latency impact on performance}
 \label{fig:03}
 \end{figure}

@@ -607,7 +608,7 @@ Figure~\ref{fig:04} reports the results obtained for the simulation of a grid of
 \begin{figure}[ht]
 \centering
 \includegraphics[width=100mm]{network_bandwith_impact_on_execution_time.pdf}
-\caption{Network bandwith impacts on execution time}
+\caption{Network bandwidth impact on performance}
 \label{fig:04}
 \end{figure}

@@ -618,73 +619,28 @@ These findings may help a lot end users to setup the best and the optimal target
 \begin{figure}[ht]
 \centering
 \includegraphics[width=100mm]{pb_size_impact_on_execution_time.pdf}
-\caption{Problem size impacts on execution times}
+\caption{Problem size impact on performance}
 \label{fig:05}
 \end{figure}

+\subsubsection{CPU power impact on performance\\}
+Thanks to the flexibility of the SimGrid simulator, we have tried to determine the impact of the CPU power of the processors in the different clusters on the performance of both algorithms. We have varied the CPU power from $1$ GFlops to $19$ GFlops. The simulations are conducted on a grid of 2$\times$16 processors interconnected by the network $N2$ (see Table~\ref{tab:01}) to solve a 3D Poisson problem of size $150^3$. The results depicted in Figure~\ref{fig:06} confirm the performance gain, about $95\%$ for both algorithms, obtained by increasing the CPU power of the processors.

-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-\subsubsection{CPU Power impacts on performance\\}
-
-
-\begin{table} [htbp]
-\centering
-\begin{tabular}{r c }
- \hline
- Grid architecture & 2 $\times$ 16\\ %\hline
- Inter Network & N2 : $bw$=1Gbs - $lat$=5.10$^{-5}$ \\ %\hline
- Input matrix size & $N_{x} = 150 \times 150 \times 150$\\
- CPU Power & From 3 to 19 GFlops \\ \hline
- \end{tabular}
-\caption{Test conditions: CPU Power impacts}
-\label{tab:06}
-\end{table}
-
-\begin{figure} [ht!]
+\begin{figure}[ht]
 \centering
 \includegraphics[width=100mm]{cpu_power_impact_on_execution_time.pdf}
-\caption{CPU Power impacts on execution time}
+\caption{CPU power impact on performance}
 \label{fig:06}
 \end{figure}
-
-Using the Simgrid simulator flexibility, we have tried to determine the impact
-on the algorithms performance in varying the CPU power of the clusters nodes
-from $1$ to $19$ GFlops. The outputs depicted in Figure~\ref{fig:06} confirm the
-performance gain, around $95\%$ for both of the two methods, after adding more
-powerful CPU. \ \\
-%\DL{il faut une conclusion sur ces tests : ils confirment les résultats déjà
-%obtenus en grandeur réelle. Donc c'est une aide précieuse pour les dev. Pas
-%besoin de déployer sur une archi réelle}
-

 To conclude this series of experiments, with SimGrid we have been able to make
 many simulations with many parameter variations. Doing all these experiments
 with a real platform is most of the time not possible. Moreover, the behavior of
-both GMRES and Krylov multisplitting methods is in accordance with larger real
-executions on large scale supercomputer~\cite{couturier15}.
+both GMRES and Krylov two-stage algorithms is in accordance with larger real
+executions on large-scale supercomputers~\cite{couturier15}.

-\subsection{Comparing GMRES in native synchronous mode and the multisplitting algorithm in asynchronous mode}
+\subsection{Comparison between synchronous GMRES and asynchronous two-stage multisplitting algorithms}

 The previous paragraphs have highlighted the interest of simulating the behavior
 of the application before any deployment in a real environment. In this
@@ -699,42 +655,34 @@ synchronization with the other processors. Thus, the asynchronous may
 theoretically reduce the overall execution time and can improve the algorithm
 performance.

-In this section, the Simgrid simulator is used to compare the behavior of the
-multisplitting in asynchronous mode with GMRES in synchronous mode. Several
-benchmarks have been performed with various combination of the grid resources
-(CPU, Network, input matrix size, \ldots ). The test conditions are summarized
-in Table~\ref{tab:07}. In order to compare the execution times, this table
+In this section, the SimGrid simulator is used to compare the behavior of the
+two-stage algorithm in asynchronous mode with GMRES in synchronous mode. Several
+benchmarks have been performed with various combinations of the grid resources
+(CPU, network, matrix size, \ldots). The test conditions are summarized
+in Table~\ref{tab:02}. In order to compare the execution times, Table~\ref{tab:03}
 reports the relative gain between both algorithms. It is defined as the ratio
 between the execution time of GMRES and the execution time of the
-multisplitting. The ratio is greater than one because the asynchronous
+multisplitting algorithm.
+\LZK{Which table reports the relative gains?? Certainly not Table II!!}
+\RCE{Table III with the new numbering}
+The ratio is greater than one because the asynchronous
 multisplitting version is faster than GMRES.
-
-
-\begin{table} [htbp]
+\begin{table}[htbp]
 \centering
-\begin{tabular}{r c }
+\begin{tabular}{ll}
 \hline
- Grid Architecture & 2 $\times$ 50 totaling 100 processors\\ %\hline
- Processors Power & 1 GFlops to 1.5 GFlops\\
- Intra-Network & bw=1.25 Gbits - lat=5.10$^{-5}$ \\ %\hline
- Inter-Network & bw=5 Mbits - lat=2.10$^{-2}$\\
- Input matrix size & $N_{x}$ = From 62 to 150\\ %\hline
- Residual error precision & 10$^{-5}$ to 10$^{-9}$\\ \hline \\
+ Grid architecture & 2$\times$50 totaling 100 processors\\
+ Processor power & 1 to 1.5 GFlops \\
+ Network intra-cluster & $bw$=1.25 Gbits, $lat$=50 $\mu$s \\
+ Network inter-cluster & $bw$=5 Mbits, $lat$=20 ms \\
+ Matrix size & from $62^3$ to $150^3$\\
+ Residual error precision & $10^{-5}$ to $10^{-9}$\\ \hline
 \end{tabular}
-\caption{Test conditions: GMRES in synchronous mode vs Krylov Multisplitting in asynchronous mode}
-\label{tab:07}
+\caption{Test conditions: GMRES in synchronous mode vs. Krylov two-stage in asynchronous mode}
+\label{tab:02}
 \end{table}

-Again, comprehensive and extensive tests have been conducted with different
-parameters as the CPU power, the network parameters (bandwidth and latency)
-and with different problem size. The relative gains greater than $1$ between the
-two algorithms have been captured after each step of the test. In
-Table~\ref{tab:08} are reported the best grid configurations allowing
-the multisplitting method to be more than $2.5$ times faster than the
-classical GMRES. These experiments also show the relative tolerance of the
-multisplitting algorithm when using a low speed network as usually observed with
-geographically distant clusters through the internet.

 % use the same column width for the following three tables
 \newlength{\mytablew}\settowidth{\mytablew}{\footnotesize\np{E-11}}

@@ -772,15 +720,24 @@ geographically distant clusters through the internet.
 \hline
 \end{mytable}
 %\end{table}
-  \caption{Relative gain of the multisplitting algorithm compared with the classical GMRES}
-  \label{tab:08}
+  \caption{Relative gains of the two-stage multisplitting algorithm compared with the classical GMRES}
+  \label{tab:03}
 \end{table}

+Again, comprehensive and extensive tests have been conducted with different
+parameters such as the CPU power, the network parameters (bandwidth and latency)
+and different problem sizes. Relative gains greater than $1$ between the
+two algorithms have been recorded for each tested configuration.
+Table~\ref{tab:03} reports the best grid configurations allowing
+the two-stage multisplitting algorithm to be more than $2.5$ times faster than the
+classical GMRES. These experiments also show the relative tolerance of the
+multisplitting algorithm when using a low-speed network, as usually observed with
+geographically distant clusters communicating through the internet.

-\section{Conclusion}
+\section{Conclusion}
 In this paper we have presented the simulation of the execution of three
-different parallel solvers on some multi-core architectures. We have show that
+different parallel solvers on some multi-core architectures. We have shown that
 the SimGrid toolkit is an interesting simulation tool that has allowed us to
 determine which method to choose given a specified multi-core architecture.
 Moreover, the simulated results are in accordance (i.e. with the same order of
@@ -802,7 +759,7 @@ converge and so to very different execution times. In future works, we plan to
 investigate how to simulate the behavior of really
 large-scale applications. For example, if we are interested in simulating the
 execution of the solvers of this paper with thousands or even tens of thousands
-or core, it is not possible to do that with SimGrid. In fact, this tool will
+of cores, it is not possible to do that with SimGrid. In fact, this tool will
 actually perform the real computation. So we plan to focus our research on this problem.
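
To make the gain metric above fully explicit: the relative gain reported in Table~\ref{tab:03} is described in the text as the ratio of the two execution times. A minimal LaTeX sketch of that definition is given below; the symbols $T_{\mathrm{GMRES}}$ and $T_{\mathrm{TS}}$ are introduced here for illustration only and do not appear in paper.tex.

% Illustrative notation (not from paper.tex): T_GMRES and T_TS denote the
% execution times of synchronous GMRES and of the asynchronous two-stage
% multisplitting solver measured on the same grid configuration.
\[
  \mathrm{gain} \;=\; \frac{T_{\mathrm{GMRES}}}{T_{\mathrm{TS}}} \;>\; 1
  \qquad\Longleftrightarrow\qquad
  T_{\mathrm{TS}} \;<\; T_{\mathrm{GMRES}} .
\]

Under this convention, the best configurations reported above, with gains greater than $2.5$, correspond to asynchronous two-stage runs completing in less than $40\%$ of the GMRES execution time.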