\todo[color=blue!10,#1]{\sffamily\textbf{LZK:} #2}\xspace}
\newcommand{\RCE}[2][inline]{%
\todo[color=yellow!10,#1]{\sffamily\textbf{RCE:} #2}\xspace}
+\newcommand{\DL}[2][inline]{%
+ \todo[color=pink!10,#1]{\sffamily\textbf{DL:} #2}\xspace}
\algnewcommand\algorithmicinput{\textbf{Input:}}
\algnewcommand\Input{\item[\algorithmicinput]}
In addition, the following arguments are given to the programs at runtime:
\begin{itemize}
- \item maximum number of inner and outer iterations;
- \item inner and outer precisions;
- \item maximum number of the GMRES restarts in the Arnorldi process;
- \item maximum number of iterations and the tolerance threshold in classical GMRES;
- \item tolerance threshold for outer and inner-iterations;
- \item matrix size (N$_{x}$, N$_{y}$ and N$_{z}$) respectively on $x, y, z$ axis;
- \item matrix diagonal value is fixed to $6.0$ for synchronous Krylov multisplitting experiments and $6.2$ for asynchronous block Jacobi experiments; \RC{CE tu vérifies, je dis ca de tête}
- \item matrix off-diagonal value;
- \item execution mode: synchronous or asynchronous;
- \RCE {C'est ok la liste des arguments du programme mais si Lilia ou toi pouvez preciser pour les arguments pour CGLS ci dessous} \RC{Vu que tu n'as pas fait varier ce paramètre, on peut ne pas en parler}
- \item Size of matrix S;
- \item Maximum number of iterations and tolerance threshold for CGLS.
+ \item maximum number of inner iterations $\MIG$ and outer iterations $\MIM$,
+ \item inner precision $\TOLG$ and outer precision $\TOLM$,
+ \item matrix sizes of the 3D Poisson problem: $N_{x}$, $N_{y}$ and $N_{z}$ along the $x$, $y$ and $z$ axes respectively,
+ \item matrix diagonal value is fixed to $6.0$ for synchronous Krylov multisplitting experiments and $6.2$ for asynchronous block Jacobi experiments,
+ \item matrix off-diagonal value is fixed to $-1.0$,
+ \item number of vectors in matrix $S$ (i.e. value of $s$),
+ \item maximum number of iterations $\MIC$ and precision $\TOLC$ for the CGLS method,
+ \item maximum number of iterations and precision for the classical GMRES method,
+ \item maximum number of restarts for the Arnoldi process in the GMRES method,
+ \item execution mode: synchronous or asynchronous.
\end{itemize}
+\LZK{CE, could you check and confirm the values of the diagonal and off-diagonal elements of the matrix?}
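+
+To illustrate the matrix entries listed above, and assuming the usual
+seven-point finite-difference discretization of the 3D Poisson problem (the
+discretization scheme and the symbols $d$, $u_{i,j,k}$ and $b_{i,j,k}$ are
+assumptions introduced here for illustration only), the row associated with an
+interior grid point $(i,j,k)$ would read:
+% illustrative sketch: d is the diagonal value, u the unknowns, b the right-hand side
+\[
+ d\,u_{i,j,k} - u_{i-1,j,k} - u_{i+1,j,k} - u_{i,j-1,k} - u_{i,j+1,k}
+ - u_{i,j,k-1} - u_{i,j,k+1} = b_{i,j,k},
+\]
+where $d = 6.0$ in the synchronous experiments, $d = 6.2$ in the asynchronous
+ones, and each off-diagonal entry is $-1.0$. Under this assumption, a row has
+at most six off-diagonal entries of magnitude $1$, so the matrix is strictly
+diagonally dominant when $d = 6.2$.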
It should also be noted that both solvers have been executed with the Simgrid option \texttt{-cfg=smpi/running\_power}, which determines the computational power (here 19~GFlops) of the simulator host machine.
performance for both algorithms by reducing the execution time (see
Figure~\ref{fig:04}). However, in this case, the Krylov multisplitting method
presents a better performance in the considered bandwidth interval with a gain
-of 40\% which is only around 24\% for classical GMRES.
+of $40\%$, compared with only around $24\%$ for the classical GMRES.
\subsubsection{Input matrix size impacts on performance}
\ \\
Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\
Input matrix size & N$_{x}$ = From 40 to 200\\ \hline
\end{tabular}
-\caption{Input matrix size impact}
+\caption{Input matrix size impacts}
\end{figure}
\begin{figure} [ht!]
\centering
\includegraphics[width=100mm]{pb_size_impact_on_execution_time.pdf}
-\caption{Problem size impact on execution time}
+\caption{Problem size impacts on execution time}
\label{fig:05}
\end{figure}
-In these experiments, the input matrix size has been set from N$_{x}$ = N$_{y}$
-= N$_{z}$ = 40 to 200 side elements that is from 40$^{3}$ = 64.000 to 200$^{3}$
-= 8,000,000 points. Obviously, as shown in Figure~\ref{fig:05}, the execution
+In these experiments, the input matrix size has been set from $N_{x} = N_{y}
+= N_{z} = 40$ to $200$ side elements, that is, from $40^{3} = 64{,}000$ to $200^{3}
+= 8{,}000{,}000$ points. Obviously, as shown in Figure~\ref{fig:05}, the execution
time for both algorithms increases when the input matrix size also increases.
But the interesting results are:
\begin{enumerate}
- \item the drastic increase (300 times) \RC{Je ne vois pas cela sur la figure}
+ \item the drastic increase ($300$ times) \RC{I do not see this in the figure}
of the number of iterations needed to reach the convergence for the classical
-GMRES algorithm when the matrix size go beyond N$_{x}$=150;
-\item the classical GMRES execution time is almost the double for N$_{x}$=140
+GMRES algorithm when the matrix size goes beyond $N_{x}=150$;
+\item the classical GMRES execution time is almost double for $N_{x}=140$
compared with the Krylov multisplitting method.
\end{enumerate}
size scale up. It should be noted that the same test has been run on the
2x16 grid, leading to the same conclusion.
-\subsubsection{CPU Power impact on performance}
+\subsubsection{CPU Power impacts on performance}
\begin{figure} [ht!]
\centering
Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline
Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline
\end{tabular}
-\caption{CPU Power impact}
+\caption{CPU Power impacts}
\end{figure}
\begin{figure} [ht!]
\centering
\includegraphics[width=100mm]{cpu_power_impact_on_execution_time.pdf}
-\caption{CPU Power impact on execution time}
+\caption{CPU Power impacts on execution time}
\label{fig:06}
\end{figure}
Using the flexibility of the Simgrid simulator, we have tried to determine the impact
on the algorithms' performance of varying the CPU power of the cluster nodes
-from 1 to 19 GFlops. The outputs depicted in Figure~\ref{fig:06} confirm the
-performance gain, around 95\% for both of the two methods, after adding more
+from $1$ to $19$ GFlops. The outputs depicted in Figure~\ref{fig:06} confirm the
+performance gain, around $95\%$ for both methods, after adding more
powerful CPUs.
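+
+As a rough point of comparison, and under the simplifying assumption (ours,
+not measured here) that the execution time $T$ is purely compute-bound and
+inversely proportional to the CPU power $P$, the expected relative reduction
+in execution time when going from $P_{1} = 1$ to $P_{2} = 19$ GFlops would be
+% illustrative estimate only: T, P_1 and P_2 are notation introduced for this sketch
+\[
+ 1 - \frac{T(P_{2})}{T(P_{1})} = 1 - \frac{P_{1}}{P_{2}} = 1 - \frac{1}{19}
+ \approx 94.7\%,
+\]
+which is of the same order as the gain of around $95\%$ reported above, if
+that gain is read as a relative reduction in execution time.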
+\DL{these tests need a conclusion: they confirm the results already
+obtained at real scale. So this is a valuable aid for developers. No
+need to deploy on a real architecture}
+
\subsection{Comparing GMRES in native synchronous mode and the multisplitting algorithm in asynchronous mode}
The previous paragraphs highlighted the benefits of simulating the behavior
-of the application before any deployment in a real environment. We have focused
-the study on analyzing the performance in varying the key factors impacting the
-results. The study compares the performance of the two proposed algorithms both
-in \textit{synchronous mode }. In this section, following the same previous
-methodology, the goal is to demonstrate the efficiency of the multisplitting
-method in \textit{ asynchronous mode} compared with the classical GMRES staying
-in \textit{synchronous mode}.
-
-Note that the interest of using the asynchronous mode for data exchange
-is mainly, in opposite of the synchronous mode, the non-wait aspects of
-the current computation after a communication operation like sending
-some data between nodes. Each processor can continue their local
-calculation without waiting for the end of the communication. Thus, the
-asynchronous may theoretically reduce the overall execution time and can
-improve the algorithm performance.
-
-As stated supra, Simgrid simulator tool has been used to prove the
-efficiency of the multisplitting in asynchronous mode and to find the
-best combination of the grid resources (CPU, Network, input matrix size,
-\ldots ) to get the highest \textit{"relative gain"} (exec\_time$_{GMRES}$ / exec\_time$_{multisplitting}$) in comparison with the classical GMRES time.
-
-
-The test conditions are summarized in the table below : \\
+of the application before any deployment in a real environment. In this
+section, following the same methodology, our goal is to compare the
+efficiency of the multisplitting method in \textit{asynchronous mode} with that
+of the classical GMRES in \textit{synchronous mode}.
-% environment
-\begin{footnotesize}
+The benefit of an asynchronous algorithm is that synchronization is no longer
+required. With geographically distant clusters, this may be essential.
+In this case, each processor can compute its iterations freely without any
+synchronization with the other processors. Thus, the asynchronous mode may
+theoretically reduce the overall execution time and improve the performance of
+the algorithm.
+
+\RC{the following sentence is odd, I do not understand why it comes here}
+As stated before, the Simgrid simulator has been successfully used to show
+the efficiency of the multisplitting method in asynchronous mode and to find the
+best combination of the grid resources (CPU, network, input matrix size, \ldots) to
+get the highest \textit{``relative gain''} (exec\_time$_{GMRES}$ /
+exec\_time$_{multisplitting}$) in comparison with the classical GMRES time.
+
+
+The test conditions are summarized in the table below: \\
+
+\begin{figure} [ht!]
+\centering
\begin{tabular}{r c }
\hline
Grid & 2x50 totaling 100 processors\\ %\hline
Input matrix size & N$_{x}$ = From 62 to 150\\ %\hline
Residual error precision & 10$^{-5}$ to 10$^{-9}$\\ \hline \\
\end{tabular}
-\end{footnotesize}
+\end{figure}
-Again, comprehensive and extensive tests have been conducted varying the
-CPU power and the network parameters (bandwidth and latency) in the
-simulator tool with different problem size. The relative gains greater
-than 1 between the two algorithms have been captured after each step of
-the test. Table 7 below has recorded the best grid configurations
-allowing the multisplitting method execution time more performant 2.5 times than
-the classical GMRES execution and convergence time. The experimentation has demonstrated the relative multisplitting algorithm tolerance when using a low speed network that we encounter usually with distant clusters thru the internet.
+Again, comprehensive and extensive tests have been conducted with different
+parameters such as the CPU power, the network parameters (bandwidth and latency)
+and the problem size. The relative gains greater than $1$ between the
+two algorithms have been captured after each step of the test.
+Figure~\ref{table:01} reports the best grid configurations allowing
+the multisplitting method to be more than $2.5$ times faster than the
+classical GMRES. These experiments also show the relative tolerance of the
+multisplitting algorithm to a low-speed network, as usually observed with
+geographically distant clusters connected through the Internet.
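+
+To make the reading of these gains concrete, let $T_{GMRES}$ and
+$T_{multisplitting}$ denote the two execution times and $g$ the relative gain
+(these symbols are introduced here only for illustration). Then
+% illustrative notation only: g, T_GMRES and T_multisplitting are not defined elsewhere
+\[
+ g = \frac{T_{GMRES}}{T_{multisplitting}}
+ \quad\Longleftrightarrow\quad
+ T_{multisplitting} = \frac{T_{GMRES}}{g},
+\]
+so that $g > 1$ means the multisplitting method is faster, and a relative gain
+of $g = 2.5$ means that it needs $1/2.5 = 40\%$ of the classical GMRES
+execution time.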
% use the same column width for the following three tables
\newlength{\mytablew}\settowidth{\mytablew}{\footnotesize\np{E-11}}
\end{tabular}}
-\begin{table}[!t]
- \centering
+\begin{figure}[!t]
+\centering
+%\begin{table}
% \caption{Relative gain of the multisplitting algorithm compared with the classical GMRES}
% \label{"Table 7"}
-Table 7. Relative gain of the multisplitting algorithm compared with
-the classical GMRES \\
-
- \begin{mytable}{11}
+ \begin{mytable}{11}
\hline
bandwidth (Mbit/s)
& 5 & 5 & 5 & 5 & 5 & 50 & 50 & 50 & 50 & 50 \\
& 2.52 & 2.55 & 2.52 & 2.57 & 2.54 & 2.53 & 2.51 & 2.58 & 2.55 & 2.54 \\
\hline
\end{mytable}
-\end{table}
+%\end{table}
+ \caption{Relative gain of the multisplitting algorithm compared with the classical GMRES}
+ \label{table:01}
+\end{figure}
+
\section{Conclusion}
CONCLUSION