\end{equation}
where right-hand sides $c_\ell=b_\ell-\sum_{m\neq\ell}A_{\ell m}x_m$ are computed using the shared vectors $x_m$. In this paper, we use the well-known iterative method GMRES~\cite{saad86} as an inner iteration to approximate the solutions of the different splittings arising from the block Jacobi multisplitting of matrix $A$. The algorithm in Figure~\ref{alg:01} shows the main key points of our block Jacobi two-stage method executed by a cluster of processors. In line~\ref{solve}, the linear sub-system~(\ref{eq:03}) is solved in parallel using GMRES method where $\MIG$ and $\TOLG$ are the maximum number of inner iterations and the tolerance threshold for GMRES respectively. The convergence of the two-stage multisplitting methods, based on synchronous or asynchronous iterations, has been studied by many authors for example~\cite{Bru95,bahi07}.
-\begin{figure}[t]
+\begin{figure}[htpb]
%\begin{algorithm}[t]
%\caption{Block Jacobi two-stage multisplitting method}
\begin{algorithmic}[1]
\end{equation}
The algorithm in Figure~\ref{alg:02} includes the procedure of the residual minimization and the outer iteration is restarted with a new approximation $\tilde{x}$ at every $s$ iterations. The least-squares problem~(\ref{eq:06}) is solved in parallel by all clusters using CGLS method~\cite{Hestenes52} such that $\MIC$ is the maximum number of iterations and $\TOLC$ is the tolerance threshold for this method (line~\ref{cgls} in Figure~\ref{alg:02}).
-\begin{figure}[t]
+\begin{figure}[htbp]
%\begin{algorithm}[t]
%\caption{Krylov two-stage method using block Jacobi multisplitting}
\begin{algorithmic}[1]
inter-cluster communications. In the following, these parameters are described:
\begin{itemize}
- \item hostfile: hosts description file.
+ \item hostfile: hosts description file,
\item platform: file describing the platform architecture: clusters (CPU power,
\dots{}), intra cluster network description, inter cluster network (bandwidth $bw$,
-latency $lat$, \dots{}).
+latency $lat$, \dots{}),
\item archi : grid computational description (number of clusters, number of
nodes/processors in each cluster).
\end{itemize}
on the one hand the algorithm execution mode (synchronous and asynchronous)
and on the other hand the execution time and the number of iterations to reach the convergence. \\
-\textbf{Step 4 }: Set up the different grid testbed environments that will be
+\textbf{Step 4}: Set up the different grid testbed environments that will be
simulated in the simulator tool to run the program. The following architectures
have been configured in SimGrid : 2$\times$16, 4$\times$8, 4$\times$16, 8$\times$8 and 2$\times$50. The first number
represents the number of clusters in the grid and the second number represents
latency of 8.10$^{-6}$ seconds (resp. 5.10$^{-5}$) for the intra-clusters links
(resp. inter-clusters backbone links). \\
-\LZK{Il me semble que le bw et lat des deux réseaux varient dans les expés d'une simu à l'autre. On vire la dernière phrase?}
-\RC{il me semble qu'on peut laisser ca}
+%\LZK{Il me semble que le bw et lat des deux réseaux varient dans les expés d'une simu à l'autre. On vire la dernière phrase?}
+%\RC{il me semble qu'on peut laisser ca}
\textbf{Step 5}: Conduct an extensive and comprehensive testings
within these configurations by varying the key parameters, especially
\ \\
% environment
+\RC{Je ne comprends plus rien CE : pourquoi dans 5.4.1 il y a 2 network et aussi dans 5.4.2. Quelle est la différence? Dans la figure 3 de la section 5.4.1 pourquoi il n'y a pas N1 et N2?}
+
\begin{table} [ht!]
\begin{center}
\begin{tabular}{ll }
%\RC{Les légendes ne sont pas explicites...}
%\RCE{Corrige}
-\begin{figure} [ht!]
+\begin{figure} [htbp]
\begin{center}
\includegraphics[width=100mm]{cluster_x_nodes_nx_150_and_nx_170.pdf}
\end{center}
%\begin{wrapfigure}{l}{100mm}
-\begin{figure} [ht!]
+\begin{figure} [htbp]
\centering
\includegraphics[width=100mm]{cluster_x_nodes_n1_x_n2.pdf}
-\caption{Various grid configurations with networks N1 vs N2
-\AG{\np{8E-6}, \np{5E-6} au lieu de 8E-6, 5E-6}}
-\RCE{Corrige}
+\caption{Various grid configurations with networks N1 vs N2}
+%\AG{\np{8E-6}, \np{5E-6} au lieu de 8E-6, 5E-6}}
+%\RCE{Corrige}
\label{fig:02}
\end{figure}
%\end{wrapfigure}
\label{tab:03}
\end{table}
-\begin{figure} [ht!]
+\begin{figure} [htbp]
\centering
\includegraphics[width=100mm]{network_latency_impact_on_execution_time.pdf}
-\caption{Network latency impacts on execution time
-\AG{\np{E-6}}}
+\caption{Network latency impacts on execution time}
+%\AG{\np{E-6}}}
\label{fig:03}
\end{figure}
-According to the results of Figure~\ref{fig:03}, a degradation of the network
-latency from $8.10^{-6}$ to $6.10^{-5}$ implies an absolute time increase of
-more than $75\%$ (resp. $82\%$) of the execution for the classical GMRES
-(resp. Krylov multisplitting) algorithm which means that the GMRES seems tolerate more the network latency variation with a less rate increase of the execution time. However, the execution time factor between the two algorithms varies from 2.2 to 1.5 times with a network latency decreasing from $8.10^{-6}$ to $6.10^{-5}$.
+In Table~\ref{tab:03}, parameters for the influence of the network latency are
+reported. According to the results of Figure~\ref{fig:03}, a degradation of the
+network latency from $8.10^{-6}$ to $6.10^{-5}$ implies an absolute time
+increase of more than $75\%$ (resp. $82\%$) of the execution for the classical
+GMRES (resp. Krylov multisplitting) algorithm. The execution time factor
+between the two algorithms varies from 2.2 to 1.5 times with a network latency
+decreasing from $8.10^{-6}$ to $6.10^{-5}$.
-\RC{Les 2 précédentes phrases me semblent en contradiction....}
-\RCE{Reformule}
\subsubsection{Network bandwidth impacts on performance}
\ \\
& $lat$= 5.10$^{-5}$ second \\
Input matrix size & $N_{x} \times N_{y} \times N_{z} =150 \times 150 \times 150$\\ \hline \\
\end{tabular}
-\caption{Test conditions: Network bandwidth impacts\RC{Qu'est ce qui varie ici? Il n'y a pas de variation dans le tableau}}
-\RCE{C est le bw}
+\caption{Test conditions: Network bandwidth impacts}
+% \RC{Qu'est ce qui varie ici? Il n'y a pas de variation dans le tableau}
+%\RCE{C est le bw}
\label{tab:04}
\end{table}
-\begin{figure} [ht!]
+\begin{figure} [htbp]
\centering
\includegraphics[width=100mm]{network_bandwith_impact_on_execution_time.pdf}
-\caption{Network bandwith impacts on execution time
-\AG{``Execution time'' avec un 't' minuscule}. Idem autres figures.}
-\RCE{Corrige}
+\caption{Network bandwith impacts on execution time}
+%\AG{``Execution time'' avec un 't' minuscule}. Idem autres figures.}
+%\RCE{Corrige}
\label{fig:04}
\end{figure}
\hline
Grid Architecture & 4 $\times$ 8\\ %\hline
Inter Network & $bw$=1Gbs - $lat$=5.10$^{-5}$ \\
- Input matrix size & $N_{x} \times N_{y} \times N_{z}$ = From 40$^{3}$ to 200$^{3}$\\ \hline
+ Input matrix size & $N_{x} \times N_{y} \times N_{z}$ = From 50$^{3}$ to 190$^{3}$\\ \hline
\end{tabular}
\caption{Test conditions: Input matrix size impacts}
\label{tab:05}
\end{table}
-\begin{figure} [ht!]
+\begin{figure} [htbp]
\centering
\includegraphics[width=100mm]{pb_size_impact_on_execution_time.pdf}
\caption{Problem size impacts on execution time}
\label{fig:05}
\end{figure}
-In these experiments, the input matrix size has been set from $N_{x} = N_{y}
-= N_{z} = 40$ to $200$ side elements that is from $40^{3} = 64.000$ to $200^{3}
-= 8,000,000$ points. Obviously, as shown in Figure~\ref{fig:05}, the execution
-time for both algorithms increases when the input matrix size also increases.
-But the interesting results are:
-\begin{enumerate}
- \item the important increase ($10$ times) of the number of iterations needed to
- reach the convergence for the classical GMRES algorithm particularly, when the matrix size
- go beyond $N_{x}=150$; \RC{C'est toujours pas clair... ok le nommbre d'itérations est 10 fois plus long mais la suite de la phrase ne veut rien dire}
- \RCE{Le nombre d'iterations augmente de 10 fois, cela surtout a partir de N=150}
-
-\item the classical GMRES execution time is almost the double for $N_{x}=140$
- compared with the Krylov multisplitting method.
-\end{enumerate}
+In these experiments, the input matrix size has been set from $50^3$ to
+$190^3$. Obviously, as shown in Figure~\ref{fig:05}, the execution time for both
+algorithms increases when the input matrix size also increases. For all problem
+sizes, GMRES is always slower than the Krylov multisplitting. Moreover, for this
+benchmark, it seems that the greater the problem size is, the bigger the ratio
+between both algorithm execution times is. We can also observ that for some
+problem sizes, the Krylov multisplitting convergence varies quite a
+lot. Consequently the execution times in that cases also varies.
+
These findings may help a lot end users to setup the best and the optimal
targeted environment for the application deployment when focusing on the problem
\subsubsection{CPU Power impacts on performance}
-\begin{table} [ht!]
+\begin{table} [htbp]
\centering
\begin{tabular}{r c }
\hline
theoretically reduce the overall execution time and can improve the algorithm
performance.
-\RC{la phrase suivante est bizarre, je ne comprends pas pourquoi elle vient ici}
-\RCE{C est la description du dernier test sync/async avec l'introduction de la notion de relative gain}
-In this section, Simgrid simulator tool has been successfully used to show
-the efficiency of the multisplitting in asynchronous mode and to find the best
-combination of the grid resources (CPU, Network, input matrix size, \ldots ) to
-get the highest \textit{"relative gain"} (exec\_time$_{GMRES}$ /
-exec\_time$_{multisplitting}$) in comparison with the classical GMRES time.
+In this section, the Simgrid simulator is used to compare the behavior of the
+multisplitting in asynchronous mode with GMRES in synchronous mode. Several
+benchmarks have been performed with various combination of the grid resources
+(CPU, Network, input matrix size, \ldots ). The test conditions are summarized
+in Table~\ref{tab:07}. In order to compare the execution times, this table
+reports the relative gain between both algorithms. It is defined by the ratio
+between the execution time of GMRES and the execution time of the
+multisplitting. The ration is greater than one because the asynchronous
+multisplitting version is faster than GMRES.
-The test conditions are summarized in the table~\ref{tab:07}: \\
-\begin{table} [ht!]
+\begin{table} [htbp]
\centering
\begin{tabular}{r c }
\hline
power (GFlops)
& 1 & 1 & 1 & 1.5 & 1.5 & 1.5 & 1.5 & 1 & 1.5 & 1.5 \\
\hline
- size (N)
+ size ($N^3$)
& 62 & 62 & 62 & 100 & 100 & 110 & 120 & 130 & 140 & 150 \\
\hline
Precision