paper, we show that it is interesting to use SimGrid to simulate the behavior
of asynchronous iterative algorithms. To that end, we compare the behavior of a
synchronous GMRES algorithm with an asynchronous multisplitting one with
-simulations in which we choose some parameters. Both codes are real MPI
-codes. Simulations allow us to see when the multisplitting algorithm can be more
+simulations that let us easily choose some parameters. Both codes are real MPI
+codes, and the simulations allow us to see when the asynchronous multisplitting algorithm can be more
efficient than the GMRES one to solve a 3D Poisson problem.
$X_{0}$ to find an approximate value $X^*$ of the solution with a very low residual error. Several well-known methods
demonstrate the convergence of these algorithms~\cite{BT89,Bahi07}.
-Parallelization of such algorithms generally involve the division of the problem
+Parallelization of such algorithms generally involves the division of the problem
into several \emph{blocks} that will be solved in parallel on multiple
processing units. The latter communicate their intermediate results before each
new iteration starts, until the approximate solution is reached. These
\begin{figure}[!t]
\centering
- \includegraphics[width=60mm,keepaspectratio]{clustering}
-\caption{Example of three clusters of processors interconnected by a virtual unidirectional ring network.}
+ \includegraphics[width=60mm,keepaspectratio]{clustering2}
+\caption{Example of two distant clusters of processors.}
\label{fig:4.1}
\end{figure}
all clusters are interconnected by a virtual unidirectional ring network (see
Figure~\ref{fig:4.1}). During the resolution, a Boolean token circulates around
the virtual ring from a master processor to another until the global convergence
-is achieved. So starting from the cluster with rank 1, each master processor $i$
+is achieved. So starting from the cluster with rank 1, each master processor $\ell$
sets the token to \textit{True} if the local convergence is achieved or to
-\textit{False} otherwise, and sends it to master processor $i+1$. Finally, the
+\textit{False} otherwise, and sends it to master processor $\ell+1$. Finally, the
global convergence is detected when the master of cluster 1 receives from the
master of cluster $L$ a token set to \textit{True}. In this case, the master of
cluster 1 broadcasts a stop message to masters of other clusters. In this work,
compared to the asynchronous multisplitting algorithm ($t_\text{GMRES} / t_\text{Multisplitting}$) is defined as the \emph{relative gain}. So,
our objective in running the algorithm in SimGrid is to obtain a relative gain greater than 1.
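Written as a display, and introducing the symbol $G$ for the relative gain purely as an illustrative notation, this objective reads:
\[
G = \frac{t_\text{GMRES}}{t_\text{Multisplitting}}, \qquad G > 1 \iff t_\text{Multisplitting} < t_\text{GMRES}.
\]
As a purely hypothetical example (these timings are not measured values), $t_\text{GMRES} = 50$~s and $t_\text{Multisplitting} = 20$~s would give $G = 2.5$.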
A priori, obtaining a relative gain greater than 1 would be difficult in a local
-area network configuration where the synchronous mode will take advantage on the
+area network configuration where the synchronous GMRES method will take advantage of the
rapid exchange of information on such high-speed links. Thus, the methodology
adopted was to launch the application on a clustered network. In this
configuration, degrading the inter-cluster network performance will penalize the
\begin{table}[!t]
\centering
- \caption{2 clusters, each with 50 nodes}
+ \caption{Relative gain of the multisplitting algorithm compared to GMRES for
+ different configurations with 2 clusters, each one composed of 50 nodes.}
\label{tab.cluster.2x50}
\begin{mytable}{5}
& 5 & 5 & 5 & 5 & 5 \\
\hline
latency (ms)
- & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 \\
+ & 20 & 20 & 20 & 20 & 20 \\
\hline
power (GFlops)
& 1 & 1 & 1 & 1.5 & 1.5 \\
& 50 & 50 & 50 & 50 & 50 \\ % & 10 & 10 \\
\hline
latency (ms)
- & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 \\ % & 0.03 & 0.01 \\
+ & 20 & 20 & 20 & 20 & 20 \\ % & 0.03 & 0.01 \\
\hline
power (GFlops)
& 1.5 & 1.5 & 1.5 & 1.5 & 1.5 \\ % & 1 & 1.5 \\
\end{mytable}
\end{table}
+\RC{So the latency is always the same; why include it in the table?}
+
%Then we have changed the network configuration using three clusters containing
%respectively 33, 33 and 34 hosts, or again by on hundred hosts for all the
%clusters. In the same way as above, a judicious choice of key parameters has
%permitted to get the results in Table~\ref{tab.cluster.3x33} which shows the
%relative gains greater than 1 with a matrix size from 62 to 100 elements.
-\CER{In agreement with RC, we have removed Tables 2 and 3 for now, since the results obtained are borderline. Likewise, we have also removed the last two columns of Table I pending better performance and better precision.}
+%\CER{In agreement with RC, we have removed Tables 2 and 3 for now, since the results obtained are borderline. Likewise, we have also removed the last two columns of Table I pending better performance and better precision.}
%\begin{table}[!t]
% \centering
% \caption{3 clusters, each with 33 nodes}
\begin{itemize}
\item 2 clusters of 50 hosts each;
\item Processor unit power: \np[GFlops]{1} or \np[GFlops]{1.5};
- \item Intra-cluster network bandwidth: \np[Gbit/s]{1.25} and latency: \np[$\mu$s]{0.05};
- \item Inter-cluster network bandwidth: \np[Mbit/s]{5} or \np[Mbit/s]{50} and latency: \np[$\mu$s]{20};
+ \item Intra-cluster network bandwidth: \np[Gbit/s]{1.25} and latency: \np[$\mu$s]{50};
+ \item Inter-cluster network bandwidth: \np[Mbit/s]{5} or \np[Mbit/s]{50} and latency: \np[ms]{20};
\end{itemize}
\end{itemize}
After analyzing the outputs, we observe that for the configuration with two clusters totaling one hundred hosts (Table~\ref{tab.cluster.2x50}), some combinations of the parameters affecting
the results have given a relative gain of more than 2.5, showing the effectiveness of the
-asynchronous performance compared to the synchronous mode.
+asynchronous multisplitting compared to GMRES with two distant clusters.
With these settings, Table~\ref{tab.cluster.2x50} shows
-that after a deterioration of inter cluster network with a bandwidth of \np[Mbit/s]{5} and a latency in order of one hundredth of millisecond and a processor power
+that with an inter-cluster network bandwidth of \np[Mbit/s]{5}, a latency of \np[ms]{20}, and a processor power
of one GFlops, an efficiency of about \np[\%]{40} is
obtained in asynchronous mode for a matrix size of 62 elements. We note that the result remains
stable even when we vary the residual error precision from \np{E-5} to \np{E-9}. By
%\LZK{My question is: are the bandwidth and latency those of the inter-cluster network, or of both the inter- and intra-cluster networks??}
%\CER{Definitely, the variable network parameters here refer to the INTER-cluster network.}
\section{Conclusion}
-The experimental results on executing a parallel iterative algorithm in
-asynchronous mode on an environment simulating a large scale of virtual
-computers organized with interconnected clusters have been presented.
-Our work has demonstrated that using such a simulation tool allow us to
-reach the following three objectives:
+The simulation of the execution of parallel asynchronous iterative algorithms on large-scale clusters has been presented.
+In this work, we have shown that SimGrid is an efficient simulation tool that allows us to
+reach the following two objectives:
\begin{enumerate}
-\item To have a flexible configurable execution platform resolving the
-hard exercise to access to very limited but so solicited physical
-resources;
-\item to ensure the algorithm convergence with a reasonable time and
-iteration number ;
-\item and finally and more importantly, to find the correct combination
-of the cluster and network specifications permitting to save time in
-executing the algorithm in asynchronous mode.
+\item To have a flexible, configurable execution platform that allows us to
+  simulate algorithms for which execution of all parts of
+  the code is necessary. Using simulations before real executions is a good
+  way to detect potential scalability problems.
+
+\item To test combinations of cluster and network specifications that allow an asynchronous algorithm to execute faster than a synchronous one.
\end{enumerate}
-Our results have shown that in certain conditions, asynchronous mode is
-speeder up to \np[\%]{40} than executing the algorithm in synchronous mode
+Our results have shown that with two distant clusters, the asynchronous multisplitting method is up to \np[\%]{40} faster than the synchronous GMRES method,
which is not negligible for solving complex practical problems of ever
increasing size.
- Several studies have already addressed the performance execution time of
+Several studies have already addressed the execution time of
this class of algorithms. The work presented in this paper has
demonstrated an original solution to optimize the use of a simulation
tool to efficiently run an iterative parallel algorithm in asynchronous
mode on a grid architecture.
-\LZK{Perspectives???}
+In future work, we plan to extend our experiments to larger-scale platforms by increasing the number of computing cores and the number of clusters.
+We will also have to increase the size of the input problem, which will require the use of a more powerful simulation platform. Finally, we expect to compare our simulation results with real execution results on real architectures in order to experimentally validate our study.
\section*{Acknowledgment}
This work is partially funded by the Labex ACTION program (contract ANR-11-LABX-01-01).
-\todo[inline]{The authors would like to thank\dots{}}
+%\todo[inline]{The authors would like to thank\dots{}}
% trigger a \newpage just before the given reference
% number - used to balance the columns on the last page