accurate performance models. That is why another solution is to use a simulation
tool which allows us to change many parameters of the architecture (network
bandwidth, latency, number of processors) and to simulate the execution of such
-applications. We have decided to use SimGrid as it enables to benchmark MPI
-applications.
+applications. The main contribution of this paper is to show that the use of a
+simulation tool (here we have decided to use the SimGrid toolkit) can really
+help developpers to better tune their applications for a given multi-core
+architecture.
-In this paper, we focus our attention on two parallel iterative algorithms based
+In particular we focus our attention on two parallel iterative algorithms based
on the Multisplitting algorithm and we compare them to the GMRES algorithm.
-These algorithms are used to solve libear systems. Two different variants of
+These algorithms are used to solve linear systems. Two different variants of
the Multisplitting are studied: one using synchronoous iterations and another
-one with asynchronous iterations. For each algorithm we have tested different
-parameters to see their influence. We strongly recommend people interested
-by investing into a new expensive hardware architecture to benchmark
-their applications using a simulation tool before.
-
-
-
+one with asynchronous iterations. For each algorithm we have simulated
+different architecture parameters to evaluate their influence on the overall
+execution time. The obtain simulated results confirm the real results
+previously obtained on different real multi-core architectures and also confirm
+the efficiency of the asynchronous multisplitting algorithm compared to the
+synchronous GMRES method.
\end{abstract}
\end{figure}
- The execution time difference between the two algorithms is important when
- comparing between different grid architectures, even with the same number of
- processors (like 2x16 and 4x8 = 32 processors for example). The
- experiment concludes the low sensitivity of the multisplitting method
- (compared with the classical GMRES) when scaling up the number of the processors in the grid: in average, the GMRES (resp. Multisplitting) algorithm performs 40\% better (resp. 48\%) less when running from 2x16=32 to 8x8=64 processors.
-
- \textit{\\3.b Running on two different speed cluster inter-networks\\}
+ The execution times between the two algorithms is significant with different
+ grid architectures, even with the same number of processors (for example, 2x16
+ and 4x8). We can observ the low sensitivity of the Krylov multisplitting method
+ (compared with the classical GMRES) when scaling up the number of the processors
+ in the grid: in average, the GMRES (resp. Multisplitting) algorithm performs
+ 40\% better (resp. 48\%) less when running from 2x16=32 to 8x8=64 processors.
+
+ \subsubsection{Running on two different speed cluster inter-networks}
+ \ \\
- % environment
- \begin{footnotesize}
+ \begin{figure} [ht!]
+ \begin{center}
\begin{tabular}{r c }
\hline
Grid & 2x16, 4x8\\ %\hline
Network & N1 : bw=10Gbs-lat=8.10$^{-6}$ \\ %\hline
- & N2 : bw=1Gbs-lat=5.10$^{-5}$ \\
- Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline \\
+ Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline
\end{tabular}
- Table 2 : Clusters x Nodes - Networks N1 x N2 \\
-
- \end{footnotesize}
+ \caption{Clusters x Nodes - Networks N1 x N2}
+ \end{center}
+ \end{figure}
\centering
\includegraphics[width=100mm]{cluster_x_nodes_n1_x_n2.pdf}
\caption{Cluster x Nodes N1 x N2}
- %\label{overflow}}
+ \label{fig:02}
\end{figure}
%\end{wrapfigure}
- The experiments compare the behavior of the algorithms running first on
- a speed inter- cluster network (N1) and also on a less performant network (N2).
- Figure 4 shows that end users will gain to reduce the execution time
- for both algorithms in using a grid architecture like 4x16 or 8x8: the
- performance was increased in a factor of 2. The results depict also that
- when the network speed drops down (12.5\%), the difference between the execution
- times can reach more than 25\%.
+ These experiments compare the behavior of the algorithms running first on a
+ speed inter-cluster network (N1) and also on a less performant network (N2).
+ Figure~\ref{fig:02} shows that end users will gain to reduce the execution time
+ for both algorithms in using a grid architecture like 4x16 or 8x8: the
+ performance was increased in a factor of 2. The results depict also that when
+ the network speed drops down (12.5\%), the difference between the execution
+ times can reach more than 25\%. \RC{c'est pas clair : la différence entre quoi et quoi?}
- \textit{\\3.c Network latency impacts on performance\\}
-
- % environment
- \begin{footnotesize}
+ \subsubsection{Network latency impacts on performance}
+ \ \\
+ \begin{figure} [ht!]
+ \centering
\begin{tabular}{r c }
\hline
Grid & 2x16\\ %\hline
Network & N1 : bw=1Gbs \\ %\hline
- Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline\\
+ Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline
\end{tabular}
- Table 3 : Network latency impact \\
-
- \end{footnotesize}
+ \caption{Network latency impact}
+ \end{figure}
\centering
\includegraphics[width=100mm]{network_latency_impact_on_execution_time.pdf}
\caption{Network latency impact on execution time}
- %\label{overflow}}
+ \label{fig:03}
\end{figure}
- According the results in figure 5, degradation of the network
- latency from 8.10$^{-6}$ to 6.10$^{-5}$ implies an absolute time
- increase more than 75\% (resp. 82\%) of the execution for the classical
- GMRES (resp. multisplitting) algorithm. In addition, it appears that the
- multisplitting method tolerates more the network latency variation with
- a less rate increase of the execution time. Consequently, in the worst case (lat=6.10$^{-5
- }$), the execution time for GMRES is almost the double of the time for
- the multisplitting, even though, the performance was on the same order
- of magnitude with a latency of 8.10$^{-6}$.
-
- \textit{\\3.d Network bandwidth impacts on performance\\}
+ According the results in Figure~\ref{fig:03}, a degradation of the network
+ latency from 8.10$^{-6}$ to 6.10$^{-5}$ implies an absolute time increase more
+ than 75\% (resp. 82\%) of the execution for the classical GMRES (resp. Krylov
+ multisplitting) algorithm. In addition, it appears that the Krylov
+ multisplitting method tolerates more the network latency variation with a less
+ rate increase of the execution time. Consequently, in the worst case
+ (lat=6.10$^{-5 }$), the execution time for GMRES is almost the double than the
+ time of the Krylov multisplitting, even though, the performance was on the same
+ order of magnitude with a latency of 8.10$^{-6}$.
- % environment
- \begin{footnotesize}
+ \subsubsection{Network bandwidth impacts on performance}
+ \ \\
+ \begin{figure} [ht!]
+ \centering
\begin{tabular}{r c }
\hline
Grid & 2x16\\ %\hline
Network & N1 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline
Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline \\
\end{tabular}
- Table 4 : Network bandwidth impact \\
-
- \end{footnotesize}
+ \caption{Network bandwidth impact}
+ \end{figure}
\begin{figure} [ht!]
\centering
\includegraphics[width=100mm]{network_bandwith_impact_on_execution_time.pdf}
\caption{Network bandwith impact on execution time}
- %\label{overflow}
+ \label{fig:04}
\end{figure}
- The results of increasing the network bandwidth show the improvement
- of the performance for both of the two algorithms by reducing the execution time (Figure 6). However, and again in this case, the multisplitting method presents a better performance in the considered bandwidth interval with a gain of 40\% which is only around 24\% for classical GMRES.
-
- \textit{\\3.e Input matrix size impacts on performance\\}
+ The results of increasing the network bandwidth show the improvement of the
+ performance for both algorithms by reducing the execution time (see
+ Figure~\ref{fig:04}). However, in this case, the Krylov multisplitting method
+ presents a better performance in the considered bandwidth interval with a gain
+ of 40\% which is only around 24\% for classical GMRES.
- % environment
- \begin{footnotesize}
+ \subsubsection{Input matrix size impacts on performance}
+ \ \\
+ \begin{figure} [ht!]
+ \centering
\begin{tabular}{r c }
\hline
Grid & 4x8\\ %\hline
- Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline
- Input matrix size & N$_{x}$ = From 40 to 200\\ \hline \\
+ Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\
+ Input matrix size & N$_{x}$ = From 40 to 200\\ \hline
\end{tabular}
- Table 5 : Input matrix size impact\\
-
- \end{footnotesize}
+ \caption{Input matrix size impact}
+ \end{figure}
\begin{figure} [ht!]
\centering
\includegraphics[width=100mm]{pb_size_impact_on_execution_time.pdf}
- \caption{Pb size impact on execution time}
- %\label{overflow}}
+ \caption{Problem size impact on execution time}
+ \label{fig:05}
\end{figure}
- In this experimentation, the input matrix size has been set from
- N$_{x}$ = N$_{y}$ = N$_{z}$ = 40 to 200 side elements that is from 40$^{3}$ = 64.000 to
- 200$^{3}$ = 8.000.000 points. Obviously, as shown in the figure 7,
- the execution time for the two algorithms convergence increases with the
- iinput matrix size. But the interesting results here direct on (i) the
- drastic increase (300 times) of the number of iterations needed before
- the convergence for the classical GMRES algorithm when the matrix size
- go beyond N$_{x}$=150; (ii) the classical GMRES execution time also almost
- the double from N$_{x}$=140 compared with the convergence time of the
- multisplitting method. These findings may help a lot end users to setup
- the best and the optimal targeted environment for the application
- deployment when focusing on the problem size scale up. Note that the
- same test has been done with the grid 2x16 getting the same conclusion.
-
- \textit{\\3.f CPU Power impact on performance\\}
+ In these experiments, the input matrix size has been set from N$_{x}$ = N$_{y}$
+ = N$_{z}$ = 40 to 200 side elements that is from 40$^{3}$ = 64.000 to 200$^{3}$
+ = 8,000,000 points. Obviously, as shown in Figure~\ref{fig:05}, the execution
+ time for both algorithms increases when the input matrix size also increases.
+ But the interesting results are:
+ \begin{enumerate}
+ \item the drastic increase (300 times) \RC{Je ne vois pas cela sur la figure}
+ of the number of iterations needed to reach the convergence for the classical
+ GMRES algorithm when the matrix size go beyond N$_{x}$=150;
+ \item the classical GMRES execution time is almost the double for N$_{x}$=140
+ compared with the Krylov multisplitting method.
+ \end{enumerate}
- % environment
- \begin{footnotesize}
+ These findings may help a lot end users to setup the best and the optimal
+ targeted environment for the application deployment when focusing on the problem
+ size scale up. It should be noticed that the same test has been done with the
+ grid 2x16 leading to the same conclusion.
+
+ \subsubsection{CPU Power impact on performance}
+
+ \begin{figure} [ht!]
+ \centering
\begin{tabular}{r c }
\hline
Grid & 2x16\\ %\hline
Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline
Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline
\end{tabular}
- Table 6 : CPU Power impact \\
-
- \end{footnotesize}
-
+ \caption{CPU Power impact}
+ \end{figure}
\begin{figure} [ht!]
\centering
\includegraphics[width=100mm]{cpu_power_impact_on_execution_time.pdf}
\caption{CPU Power impact on execution time}
- %\label{overflow}}
- s\end{figure}
-
- Using the Simgrid simulator flexibility, we have tried to determine the
- impact on the algorithms performance in varying the CPU power of the
- clusters nodes from 1 to 19 GFlops. The outputs depicted in the figure 6
- confirm the performance gain, around 95\% for both of the two methods,
- after adding more powerful CPU.
-
- \subsection{Comparing GMRES in native synchronous mode and
- Multisplitting algorithms in asynchronous mode}
-
- The previous paragraphs put in evidence the interests to simulate the
- behavior of the application before any deployment in a real environment.
- We have focused the study on analyzing the performance in varying the
- key factors impacting the results. The study compares
- the performance of the two proposed algorithms both in \textit{synchronous mode
- }. In this section, following the same previous methodology, the goal is to
- demonstrate the efficiency of the multisplitting method in \textit{
- asynchronous mode} compared with the classical GMRES staying in
- \textit{synchronous mode}.
+ \label{fig:06}
+ \end{figure}
+
+ Using the Simgrid simulator flexibility, we have tried to determine the impact
+ on the algorithms performance in varying the CPU power of the clusters nodes
+ from 1 to 19 GFlops. The outputs depicted in Figure~\ref{fig:06} confirm the
+ performance gain, around 95\% for both of the two methods, after adding more
+ powerful CPU.
+
+ \subsection{Comparing GMRES in native synchronous mode and the multisplitting algorithm in asynchronous mode}
+
+ The previous paragraphs put in evidence the interests to simulate the behavior
+ of the application before any deployment in a real environment. We have focused
+ the study on analyzing the performance in varying the key factors impacting the
+ results. The study compares the performance of the two proposed algorithms both
+ in \textit{synchronous mode }. In this section, following the same previous
+ methodology, the goal is to demonstrate the efficiency of the multisplitting
+ method in \textit{ asynchronous mode} compared with the classical GMRES staying
+ in \textit{synchronous mode}.
Note that the interest of using the asynchronous mode for data exchange
is mainly, in opposite of the synchronous mode, the non-wait aspects of