+Once the ``real'' application runs in the simulation environment and produces
+the expected results, varying the input parameters and the program arguments
+allows us to compare the outputs of different executions. This study shows
+that the results depend on the following parameters: (1) at the network
+level, the most critical values are the bandwidth (bw) and the
+network latency (lat); (2) the computing power of the hosts (in GFlops) also
+influences the results; and (3) when submitting job batches for execution, the
+values of the arguments passed to the program, such as the maximum number of
+iterations or the ``external'' precision, are critical. These arguments
+determine not only whether the algorithm converges but also whether the main
+objective of the simulation experiments is met: an execution time that is
+shorter in asynchronous mode than in synchronous mode, in other words a
+``speedup'' less than 1 (Speedup = Execution time in asynchronous mode /
+Execution time in synchronous mode).
+
+A priori, obtaining a speedup less than 1 is difficult in a local area
+network configuration, where the synchronous mode takes advantage of the rapid
+exchange of information over high-speed links. The methodology adopted
+was therefore to launch the application on a clustered network. In this
+configuration, degrading the inter-cluster network performance ``penalizes''
+the synchronous mode and makes a speedup lower than 1 attainable. This
+simulates the case of clusters linked by a long-distance network such as the
+Internet.
+
+As a first step, the algorithm was run on a network consisting of two clusters
+of fifty hosts each, one hundred hosts in total. Various combinations of
+the above factors produced the results shown in Table~\ref{tab.cluster.2x50}, with a matrix size
+ranging from Nx = Ny = Nz = 62 to 171 elements, that is, from 62$^{3}$ = 238,328 to
+171$^{3}$ = 5,000,211 entries.
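+
+As a rough order of magnitude, and assuming (this is not stated above) that
+the $N_x N_y N_z$ entries are split evenly over the 100 hosts, the per-host
+share at the two ends of this range would be:
+\[
+  \frac{62^{3}}{100} = \frac{238\,328}{100} \approx 2\,383
+  \qquad \mbox{and} \qquad
+  \frac{171^{3}}{100} = \frac{5\,000\,211}{100} \approx 50\,002 .
+\]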
+
+We then changed the network configuration to three clusters containing
+33, 33 and 34 hosts respectively, again one hundred hosts in total.
+As above, a judicious choice of the key parameters
+yielded the results in Table~\ref{tab.cluster.3x33}, which shows speedups less than 1 with
+a matrix size from 62 to 100 elements.
+
+In a final step, we attempted to scale up the three-cluster configuration
+to two hundred hosts; the results are recorded in Table~\ref{tab.cluster.3x67}.
+
+Note that the program was run with the following parameters:
+
+\paragraph*{SMPI parameters}
+
+\begin{itemize}
+ \item HOSTFILE: description file listing the hosts.
+ \item PLATFORM: description file of the platform architecture: clusters (CPU power,
+\dots), intra-cluster network, inter-cluster network (bandwidth bw,
+latency lat, \dots).
+\end{itemize}
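+
+For illustration, these two files are typically passed to \texttt{smpirun}
+through its \texttt{-hostfile} and \texttt{-platform} options. The sketch below
+is only indicative: the file names, the executable name \texttt{./solver} and
+the \texttt{ARGS} placeholder for the program arguments are hypothetical and
+are not taken from the experiments.
+\begin{verbatim}
+# Hypothetical invocation: 100 MPI processes on the simulated platform.
+# hostfile.txt, platform.xml, ./solver and ARGS are placeholder names.
+smpirun -np 100 -hostfile hostfile.txt -platform platform.xml ./solver ARGS
+\end{verbatim}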
+
+
+\paragraph*{Arguments of the program}
+
+\begin{itemize}
+ \item Description of the cluster architecture;
+ \item Maximum number of internal and external iterations;
+ \item Internal and external precisions;
+ \item Matrix size NX, NY and NZ;
+ \item Matrix diagonal value = 6.0;
+ \item Execution mode: synchronous or asynchronous.
+\end{itemize}
+
+\begin{table}
+ \centering
+ \caption{2 clusters $\times$ 50 nodes}
+ \label{tab.cluster.2x50}
+ \includegraphics[width=209pt]{img1.jpg}
+\end{table}
+
+\begin{table}
+ \centering
+ \caption{3 clusters $\times$ 33 nodes}
+ \label{tab.cluster.3x33}
+ \includegraphics[width=209pt]{img2.jpg}
+\end{table}
+
+\begin{table}
+ \centering
+ \caption{3 clusters $\times$ 67 nodes}
+ \label{tab.cluster.3x67}
+% \includegraphics[width=160pt]{img3.jpg}
+ \includegraphics[scale=0.5]{img3.jpg}
+\end{table}
+
+\paragraph*{Interpretations and comments}
+
+An analysis of the outputs shows that, for the configurations with two or three
+clusters totaling one hundred hosts (Tables~\ref{tab.cluster.2x50} and~\ref{tab.cluster.3x33}), some combinations of the
+parameters yield a speedup less than 1, showing
+that the asynchronous mode can outperform the synchronous
+mode.
+
+In the two-cluster configuration, Table~\ref{tab.cluster.2x50} shows that with a
+degraded inter-cluster network set to a bandwidth of 5 Mbit/s, a latency
+on the order of a hundredth of a millisecond and a host power of one GFlops, an
+efficiency of about 40\% in asynchronous mode is obtained for a matrix size of 62
+elements. This result remains stable when the
+external precision varies from E-05 to E-09. When the problem size is increased to 100
+elements, the CPU power had to be increased by 50\% to 1.5 GFlops for the
+algorithm to converge with the same order of asynchronous mode efficiency.
+With the same host power but an inter-cluster bandwidth
+raised to 50 Mbit/s, an efficiency of about 40\% is again
+obtained with a high external precision of E-11 for matrix sizes from 110 to 150
+elements per side.
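+
+To make the reading of this percentage explicit, let $T_{\mathrm{sync}}$ and
+$T_{\mathrm{async}}$ denote the execution times in the two modes (notation
+introduced here for illustration only). Assuming that the reported efficiency
+is the ratio $T_{\mathrm{async}} / T_{\mathrm{sync}}$, which is consistent
+with a target value below 1, a value of about 40\% means
+\[
+  \frac{T_{\mathrm{async}}}{T_{\mathrm{sync}}} \approx 0.4
+  \quad \Longleftrightarrow \quad
+  \frac{T_{\mathrm{sync}}}{T_{\mathrm{async}}} \approx \frac{1}{0.4} = 2.5 ,
+\]
+that is, under this assumption the asynchronous run would be about 2.5 times
+faster than the synchronous one.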
+
+For the three-cluster architecture with a total of 100 hosts, Table~\ref{tab.cluster.3x33} shows
+that it was difficult to find a combination giving an asynchronous
+efficiency below 80\%. Indeed, for a matrix size of 62 elements, the two
+modes (synchronous and asynchronous) perform equally
+with an inter-cluster bandwidth of 10 Mbit/s and a latency of E-01 ms. To
+reach an efficiency of 78\% with a matrix size of 100 points, it was
+necessary to degrade the inter-cluster network bandwidth from 5 to 2 Mbit/s.
+
+A last attempt was made with a larger configuration of three clusters
+totaling 200 nodes. Convergence with a speedup of 90\% was obtained
+with a bandwidth of 1 Mbit/s, as shown in Table~\ref{tab.cluster.3x67}.
+