\todo[color=blue!10,#1]{\sffamily\textbf{LZK:} #2}\xspace}
\newcommand{\RCE}[2][inline]{%
\todo[color=yellow!10,#1]{\sffamily\textbf{RCE:} #2}\xspace}
+\newcommand{\DL}[2][inline]{%
+ \todo[color=pink!10,#1]{\sffamily\textbf{DL:} #2}\xspace}
\algnewcommand\algorithmicinput{\textbf{Input:}}
\algnewcommand\Input{\item[\algorithmicinput]}
accurate performance models. That is why another solution is to use a simulation
tool which allows us to change many parameters of the architecture (network
bandwidth, latency, number of processors) and to simulate the execution of such
-applications. We have decided to use SimGrid as it enables to benchmark MPI
-applications.
+applications. The main contribution of this paper is to show that the use of a
+simulation tool (here we have decided to use the SimGrid toolkit) can really
+help developers to better tune their applications for a given multi-core
+architecture.
-In this paper, we focus our attention on two parallel iterative algorithms based
+In particular we focus our attention on two parallel iterative algorithms based
on the Multisplitting algorithm and we compare them to the GMRES algorithm.
-These algorithms are used to solve libear systems. Two different variants of
+These algorithms are used to solve linear systems. Two different variants of
the Multisplitting are studied: one using synchronous iterations and another
-one with asynchronous iterations. For each algorithm we have tested different
-parameters to see their influence. We strongly recommend people interested
-by investing into a new expensive hardware architecture to benchmark
-their applications using a simulation tool before.
-
-
-
+one with asynchronous iterations. For each algorithm we have simulated
+different architecture parameters to evaluate their influence on the overall
+execution time. The simulated results obtained confirm the real results
+previously obtained on different real multi-core architectures and also confirm
+the efficiency of the asynchronous multisplitting algorithm compared to the
+synchronous GMRES method.
\end{abstract}
In the scope of this paper, our first objective is to analyze when the Krylov
Multisplitting method performs better than the classical GMRES
-method. With an iterative method, better performances mean a smaller number of
-iterations and execution time before reaching the convergence. For a systematic
-study, the experiments should figure out that, for various grid parameters
-values, the simulator will confirm the targeted outcomes, particularly for poor
-and slow networks, focusing on the impact on the communication performance on
-the chosen class of algorithm.
+method. With a synchronous iterative method, better performance means a
+smaller number of iterations and a shorter execution time before reaching convergence.
+For a systematic study, the experiments should show that, for various
+grid parameter values, the simulator confirms the targeted outcomes,
+particularly for poor and slow networks, focusing on the impact of
+communication performance on the chosen class of algorithms.
The following paragraphs present the test conditions, the output results
and our comments.\\
-\subsubsection{Execution of the the algorithms on various computational grid
-architecture and scaling up the input matrix size}
+\subsubsection{Execution of the algorithms on various computational grid
+architectures and scaling up the input matrix size}
\ \\
% environment
In this section, we analyze the performance of algorithms running on various
-grid configuration (2x16, 4x8, 4x16 and 8x8). First, the results in Figure~\ref{fig:01}
-show for all grid configuration the non-variation of the number of iterations of
-classical GMRES for a given input matrix size; it is not the case for the
+grid configurations (2x16, 4x8, 4x16 and 8x8). First, the results in Figure~\ref{fig:01}
+show that, for all grid configurations, the number of iterations of
+classical GMRES does not vary for a given input matrix size; this is not the case for the
multisplitting method.
\RC{CE: careful, you did not put labels in your figures, so it is a mess; I am adding some, but please check...}
and 4x8). We can observe the low sensitivity of the Krylov multisplitting method
(compared with the classical GMRES) when scaling up the number of processors
in the grid: on average, the GMRES (resp. Multisplitting) algorithm performs
-40\% better (resp. 48\%) less when running from 2x16=32 to 8x8=64 processors.
+$40\%$ better (resp. $48\%$) when running from 2x16=32 to 8x8=64 processors.
-\subsubsection{Running on two different speed cluster inter-networks}
+\subsubsection{Running on two different inter-cluster network speeds}
\ \\
\begin{figure} [ht!]
Grid & 2x16, 4x8\\ %\hline
Network & N1 : bw=10Gbs-lat=8.10$^{-6}$ \\ %\hline
- & N2 : bw=1Gbs-lat=5.10$^{-5}$ \\
- Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline
+ Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline
\end{tabular}
\caption{Clusters x Nodes - Networks N1 x N2}
\end{center}
speed inter-cluster network (N1) and also on a less performant network (N2).
Figure~\ref{fig:02} shows that end users can reduce the execution time
of both algorithms by using a grid architecture such as 4x16 or 8x8: the
-performance was increased in a factor of 2. The results depict also that when
+performance was increased by a factor of $2$. The results also show that when
the network speed drops down (12.5\%), the difference between the execution
times can reach more than 25\%. \RC{This is unclear: the difference between what and what?}
+\DL{unclear}
\subsubsection{Network latency impacts on performance}
\ \\
Network & N1 : bw=1Gbs \\ %\hline
Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline
\end{tabular}
-\caption{Network latency impact}
+\caption{Network latency impacts}
\end{figure}
\begin{figure} [ht!]
\centering
\includegraphics[width=100mm]{network_latency_impact_on_execution_time.pdf}
-\caption{Network latency impact on execution time}
+\caption{Network latency impacts on execution time}
\label{fig:03}
\end{figure}
-According the results in Figure~\ref{fig:03}, a degradation of the network
-latency from 8.10$^{-6}$ to 6.10$^{-5}$ implies an absolute time increase more
-than 75\% (resp. 82\%) of the execution for the classical GMRES (resp. Krylov
+According to the results of Figure~\ref{fig:03}, a degradation of the network
+latency from $8 \times 10^{-6}$ to $6 \times 10^{-5}$ increases the execution time by more
+than $75\%$ (resp. $82\%$) for the classical GMRES (resp. Krylov
multisplitting) algorithm. In addition, it appears that the Krylov
multisplitting method tolerates the network latency variation better, with a
lower rate of increase of the execution time. Consequently, in the worst case
-(lat=6.10$^{-5 }$), the execution time for GMRES is almost the double than the
+($lat=6 \times 10^{-5}$), the execution time for GMRES is almost double the
time of the Krylov multisplitting, even though the performance was of the same
-order of magnitude with a latency of 8.10$^{-6}$.
+order of magnitude with a latency of $8 \times 10^{-6}$.
\subsubsection{Network bandwidth impacts on performance}
\ \\
Network & N1 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline
Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline \\
\end{tabular}
-\caption{Network bandwidth impact}
+\caption{Network bandwidth impacts}
\end{figure}
\begin{figure} [ht!]
\centering
\includegraphics[width=100mm]{network_bandwith_impact_on_execution_time.pdf}
-\caption{Network bandwith impact on execution time}
+\caption{Network bandwidth impacts on execution time}
\label{fig:04}
\end{figure}
-
-
Increasing the network bandwidth improves the
performance of both algorithms by reducing the execution time (see
Figure~\ref{fig:04}). However, in this case, the Krylov multisplitting method
presents a better performance in the considered bandwidth interval with a gain
-of 40\% which is only around 24\% for classical GMRES.
+of $40\%$, compared with only around $24\%$ for the classical GMRES.
\subsubsection{Input matrix size impacts on performance}
\ \\
\begin{tabular}{r c }
\hline
Grid & 4x8\\ %\hline
- Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\
+ Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\
Input matrix size & N$_{x}$ = From 40 to 200\\ \hline
\end{tabular}
-\caption{Input matrix size impact}
+\caption{Input matrix size impacts}
\end{figure}
\begin{figure} [ht!]
\centering
\includegraphics[width=100mm]{pb_size_impact_on_execution_time.pdf}
-\caption{Problem size impact on execution time}
+\caption{Problem size impacts on execution time}
\label{fig:05}
\end{figure}
-In these experiments, the input matrix size has been set from N$_{x}$ = N$_{y}$
-= N$_{z}$ = 40 to 200 side elements that is from 40$^{3}$ = 64.000 to 200$^{3}$
-= 8,000,000 points. Obviously, as shown in Figure~\ref{fig:05}, the execution
+In these experiments, the input matrix size has been set from $N_{x} = N_{y}
+= N_{z} = 40$ to $200$ side elements, that is, from $40^{3} = 64{,}000$ to $200^{3}
+= 8{,}000{,}000$ points. Obviously, as shown in Figure~\ref{fig:05}, the execution
time for both algorithms increases when the input matrix size increases.
But the interesting results are:
\begin{enumerate}
- \item the drastic increase (300 times) \RC{Je ne vois pas cela sur la figure}
+ \item the drastic increase ($300$ times) \RC{I do not see this in the figure}
of the number of iterations needed to reach the convergence for the classical
-GMRES algorithm when the matrix size go beyond N$_{x}$=150;
-\item the classical GMRES execution time is almost the double for N$_{x}$=140
+GMRES algorithm when the matrix size goes beyond $N_{x}=150$;
+\item the classical GMRES execution time is almost double for $N_{x}=140$
compared with the Krylov multisplitting method.
\end{enumerate}
size scale up. It should be noticed that the same test has been done with the
grid 2x16 leading to the same conclusion.
-\subsubsection{CPU Power impact on performance}
+\subsubsection{CPU Power impacts on performance}
\begin{figure} [ht!]
\centering
Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline
Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline
\end{tabular}
-\caption{CPU Power impact}
+\caption{CPU Power impacts}
\end{figure}
\begin{figure} [ht!]
\centering
\includegraphics[width=100mm]{cpu_power_impact_on_execution_time.pdf}
-\caption{CPU Power impact on execution time}
+\caption{CPU Power impacts on execution time}
\label{fig:06}
\end{figure}
Using the flexibility of the SimGrid simulator, we have tried to determine the impact
on the algorithms' performance of varying the CPU power of the cluster nodes
-from 1 to 19 GFlops. The outputs depicted in Figure~\ref{fig:06} confirm the
-performance gain, around 95\% for both of the two methods, after adding more
+from $1$ to $19$ GFlops. The outputs depicted in Figure~\ref{fig:06} confirm the
+performance gain, around $95\%$ for both methods, after adding more
powerful CPUs.
+\DL{A conclusion on these tests is needed: they confirm the results already
+obtained at real scale. So this is a valuable aid for developers. No
+need to deploy on a real architecture}
+
\subsection{Comparing GMRES in native synchronous mode and the multisplitting algorithm in asynchronous mode}
The previous paragraphs highlight the benefit of simulating the behavior