+%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\section{Two-stage splitting methods}
+\label{sec:04}
+\subsection{Multisplitting methods for sparse linear systems}
+\label{sec:04.01}
+Let us consider the following sparse linear system of $n$ equations in $\mathbb{R}$:
+\begin{equation}
+Ax=b,
+\label{eq:01}
+\end{equation}
+where $A$ is a sparse square and nonsingular matrix, $b$ is the right-hand side and $x$ is the solution of the system. The multisplitting methods solve the linear system~(\ref{eq:01}) iteratively as follows:
+\begin{equation}
+x^{k+1}=\displaystyle\sum^L_{\ell=1} E_\ell M^{-1}_\ell (N_\ell x^k + b),~k=1,2,3,\ldots
+\label{eq:02}
+\end{equation}
+where a collection of $L$ triplets $(M_\ell, N_\ell, E_\ell)$ defines the multisplitting of matrix $A$, such that: the different splittings are defined as $A=M_\ell-N_\ell$ where $M_\ell$ are nonsingular matrices, and $\sum_\ell{E_\ell=I}$ are diagonal nonnegative weighting matrices and $I$ is the identity matrix.
+
+\subsection{Simulation of two-stage methods using SimGrid framework}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\section{Experimental, Results and Comments}
+
+
+\textbf{V.1. Setup study and Methodology}
+
+To conduct our study, we have put in place the following methodology
+which can be reused with any grid-enabled applications.
+
+\textbf{Step 1} : Choose with the end users the class of algorithms or
+the application to be tested. Numerical parallel iterative algorithms
+have been chosen for the study in the paper.
+
+\textbf{Step 2} : Collect the software materials needed for the
+experimentation. In our case, we have three variants algorithms for the
+resolution of three 3D-Poisson problem: (1) using the classical GMRES
+\textit{(Generalized Minimal RESidual Method)} alias Algo-1 in this
+paper, (2) using the multisplitting method alias Algo-2 and (3) an
+enhanced version of the multisplitting method as Algo-3. In addition,
+SIMGRID simulator has been chosen to simulate the behaviors of the
+distributed applications. SIMGRID is running on the Mesocentre
+datacenter in Franche-Comte University $[$10$]$ but also in a virtual
+machine on a laptop.
+
+\textbf{Step 3} : Fix the criteria which will be used for the future
+results comparison and analysis. In the scope of this study, we retain
+in one hand the algorithm execution mode (synchronous and asynchronous)
+and in the other hand the execution time and the number of iterations of
+the application before obtaining the convergence.
+
+\textbf{Step 4 }: Setup up the different grid testbeds environment
+which will be simulated in the simulator tool to run the program. The
+following architecture has been configured in Simgrid : 2x16 - that is a
+grid containing 2 clusters with 16 hosts (processors/cores) each -, 4x8,
+4x16, 8x8 and 2x50. The network has been designed to operate with a
+bandwidth equals to 10Gbits (resp. 1Gbits/s) and a latency of 8E-6
+microseconds (resp. 5E-5) for the intra-clusters links (resp.
+inter-clusters backbone links).
+
+\textbf{Step 5}: Process an extensive and comprehensive testings
+within these configurations in varying the key parameters, especially
+the CPU power capacity, the network parameters and also the size of the
+input matrix. Note that some parameters should be invariant to allow the
+comparison like some program input arguments.
+
+\textbf{Step 6} : Collect and analyze the output results.
+
+\textbf{ V.2. Factors impacting distributed applications performance in
+a grid environment}
+
+From our previous experience on running distributed application in a
+computational grid, many factors are identified to have an impact on the
+program behavior and performance on this specific environment. Mainly,
+first of all, the architecture of the grid itself can obviously
+influence the performance results of the program. The performance gain
+might be important theoretically when the number of clusters and/or the
+number of nodes (processors/cores) in each individual cluster increase.
+
+Another important factor impacting the overall performance of the
+application is the network configuration. Two main network parameters
+can modify drastically the program output results : (i) the network
+bandwidth (bw=bits/s) also known as "the data-carrying capacity"
+$[$13$]$ of the network is defined as the maximum of data that can pass
+from one point to another in a unit of time. (ii) the network latency
+(lat : microsecond) defined as the delay from the start time to send the
+data from a source and the final time the destination have finished to
+receive it. Upon the network characteristics, another impacting factor
+is the application dependent volume of data exchanged between the nodes
+in the cluster and between distant clusters. Large volume of data can be
+transferred in transit between the clusters and nodes during the code
+execution.
+
+ In a grid environment, it is common to distinguish in one hand, the
+"\,intra-network" which refers to the links between nodes within a
+cluster and in the other hand, the "\,inter-network" which is the
+backbone link between clusters. By design, these two networks perform
+with different speed. The intra-network generally works like a high
+speed local network with a high bandwith and very low latency. In
+opposite, the inter-network connects clusters sometime via heterogeneous
+networks components thru internet with a lower speed. The network
+between distant clusters might be a bottleneck for the global
+performance of the application.
+
+\textbf{V.3 Comparing GMRES and Multisplitting algorithms in
+synchronous mode}
+
+In the scope of this paper, our first objective is to demonstrate the
+Algo-2 (Multisplitting method) shows a better performance in grid
+architecture compared with Algo-1 (Classical GMRES) both running in
+\textbf{\textit{synchronous mode}}. Better algorithm performance
+should mean a less number of iterations output and a less execution time
+before reaching the convergence. For a systematic study, the experiments
+should figure out that, for various grid parameters values, the
+simulator will confirm the targeted outcomes, particularly for poor and
+slow networks, focusing on the impact on the communication performance
+on the chosen class of algorithm $[$12$]$.
+
+The following paragraphs present the test conditions, the output results
+and our comments.
+
+
+\textit{3.a Executing the algorithms on various computational grid
+architecture scaling up the input matrix size}
+\\
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x16, 4x8, 4x16 and 8x8\\ %\hline
+ Network & N2 : bw=1Gbs-lat=5E-05 \\ %\hline
+ Input matrix size & N$_{x}$ =150 x 150 x 150 and\\ %\hline
+ - & N$_{x}$ =170 x 170 x 170 \\ \hline
+ \end{tabular}
+\end{footnotesize}
+
+
+ Table 1 : Clusters x Nodes with NX=150 or NX=170
+
+\RCE{J'ai voulu mettre les tableaux des données mais je pense que c'est inutile et ça va surcharger}
+
+
+The results in figure 1 show the non-variation of the number of
+iterations of classical GMRES for a given input matrix size; it is not
+the case for the multisplitting method.
+
+%\begin{wrapfigure}{l}{60mm}
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{cluster_x_nodes_nx_150_and_nx_170.pdf}
+\caption{Cluster x Nodes NX=150 and NX=170}
+%\label{overflow}}
+\end{figure}
+%\end{wrapfigure}
+
+Unless the 8x8 cluster, the time
+execution difference between the two algorithms is important when
+comparing between different grid architectures, even with the same number of
+processors (like 2x16 and 4x8 = 32 processors for example). The
+experiment concludes the low sensitivity of the multisplitting method
+(compared with the classical GMRES) when scaling up to higher input
+matrix size.
+
+\textit{3.b Running on various computational grid architecture}
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x16, 4x8\\ %\hline
+ Network & N1 : bw=10Gbs-lat=8E-06 \\ %\hline
+ - & N2 : bw=1Gbs-lat=5E-05 \\
+ Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline \\
+ \end{tabular}
+\end{footnotesize}
+
+%Table 2 : Clusters x Nodes - Networks N1 x N2
+%\RCE{idem pour tous les tableaux de donnees}
+
+
+%\begin{wrapfigure}{l}{60mm}
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{cluster_x_nodes_n1_x_n2.pdf}
+\caption{Cluster x Nodes N1 x N2}
+%\label{overflow}}
+\end{figure}
+%\end{wrapfigure}
+
+The experiments compare the behavior of the algorithms running first on
+speed inter- cluster network (N1) and a less performant network (N2).
+The figure 2 shows that end users will gain to reduce the execution time
+for both algorithms in using a grid architecture like 4x16 or 8x8: the
+performance was increased in a factor of 2. The results depict also that
+when the network speed drops down, the difference between the execution
+times can reach more than 25\%.
+
+\textit{\\\\\\\\\\\\\\\\\\3.c Network latency impacts on performance}
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x16\\ %\hline
+ Network & N1 : bw=1Gbs \\ %\hline
+ Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline\\
+ \end{tabular}
+\end{footnotesize}
+
+Table 3 : Network latency impact
+
+
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{network_latency_impact_on_execution_time.pdf}
+\caption{Network latency impact on execution time}
+%\label{overflow}}
+\end{figure}
+
+
+According the results in table and figure 3, degradation of the network
+latency from 8.10$^{-6}$ to 6.10$^{-5}$ implies an absolute time
+increase more than 75\% (resp. 82\%) of the execution for the classical
+GMRES (resp. multisplitting) algorithm. In addition, it appears that the
+multisplitting method tolerates more the network latency variation with
+a less rate increase. Consequently, in the worst case (lat=6.10$^{-5
+}$), the execution time for GMRES is almost the double of the time for
+the multisplitting, even though, the performance was on the same order
+of magnitude with a latency of 8.10$^{-6}$.
+
+\textit{3.d Network bandwidth impacts on performance}
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x16\\ %\hline
+ Network & N1 : bw=1Gbs - lat=5E-05 \\ %\hline
+ Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline
+ \end{tabular}
+\end{footnotesize}
+
+Table 4 : Network bandwidth impact
+
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{network_bandwith_impact_on_execution_time.pdf}
+\caption{Network bandwith impact on execution time}
+%\label{overflow}
+\end{figure}
+
+
+
+The results of increasing the network bandwidth depict the improvement
+of the performance by reducing the execution time for both of the two
+algorithms. However, and again in this case, the multisplitting method
+presents a better performance in the considered bandwidth interval with
+a gain of 40\% which is only around 24\% for classical GMRES.
+
+\textit{3.e Input matrix size impacts on performance}
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 4x8\\ %\hline
+ Network & N2 : bw=1Gbs - lat=5E-05 \\ %\hline
+ Input matrix size & N$_{x}$ = From 40 to 200\\ \hline
+ \end{tabular}
+\end{footnotesize}
+
+Table 5 : Input matrix size impact
+
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{pb_size_impact_on_execution_time.pdf}
+\caption{Pb size impact on execution time}
+%\label{overflow}}
+\end{figure}
+
+In this experimentation, the input matrix size has been set from
+Nx=Ny=Nz=40 to 200 side elements that is from 40$^{3}$ = 64.000 to
+200$^{3}$ = 8.000.000 points. Obviously, as shown in the figure 5,
+the execution time for the algorithms convergence increases with the
+input matrix size. But the interesting result here direct on (i) the
+drastic increase (300 times) of the number of iterations needed before
+the convergence for the classical GMRES algorithm when the matrix size
+go beyond Nx=150; (ii) the classical GMRES execution time also almost
+the double from Nx=140 compared with the convergence time of the
+multisplitting method. These findings may help a lot end users to setup
+the best and the optimal targeted environment for the application
+deployment when focusing on the problem size scale up. Note that the
+same test has been done with the grid 2x16 getting the same conclusion.
+
+\textit{3.f CPU Power impact on performance}
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x16\\ %\hline
+ Network & N2 : bw=1Gbs - lat=5E-05 \\ %\hline
+ Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline
+ \end{tabular}
+\end{footnotesize}
+
+Table 6 : CPU Power impact
+
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{cpu_power_impact_on_execution_time.pdf}
+\caption{CPU Power impact on execution time}
+%\label{overflow}}
+\end{figure}
+
+Using the SIMGRID simulator flexibility, we have tried to determine the
+impact on the algorithms performance in varying the CPU power of the
+clusters nodes from 1 to 19 GFlops. The outputs depicted in the figure 6
+confirm the performance gain, around 95\% for both of the two methods,
+after adding more powerful CPU. Note that the execution time axis in the
+figure is in logarithmic scale.
+
+ \textbf{V.4 Comparing GMRES in native synchronous mode and
+Multisplitting algorithms in asynchronous mode}
+
+The previous paragraphs put in evidence the interests to simulate the
+behavior of the application before any deployment in a real environment.
+We have focused the study on analyzing the performance in varying the
+key factors impacting the results. In the same line, the study compares
+the performance of the two proposed methods in \textbf{synchronous mode
+}. In this section, with the same previous methodology, the goal is to
+demonstrate the efficiency of the multisplitting method in \textbf{
+asynchronous mode} compare with the classical GMRES staying in the
+synchronous mode.
+
+Note that the interest of using the asynchronous mode for data exchange
+is mainly, in opposite of the synchronous mode, the non-wait aspects of
+the current computation after a communication operation like sending
+some data between nodes. Each processor can continue their local
+calculation without waiting for the end of the communication. Thus, the
+asynchronous may theoretically reduce the overall execution time and can
+improve the algorithm performance.
+
+As stated supra, SIMGRID simulator tool has been used to prove the
+efficiency of the multisplitting in asynchronous mode and to find the
+best combination of the grid resources (CPU, Network, input matrix size,
+\ldots ) to get the highest "\,relative gain" in comparison with the
+classical GMRES time.
+
+
+The test conditions are summarized in the table below :
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x50 totaling 100 processors\\ %\hline
+ Processors & 1 GFlops to 1.5 GFlops\\
+ Intra-Network & bw=1.25 Gbits - lat=5E-05 \\ %\hline
+ Inter-Network & bw=5 Mbits - lat=2E-02\\
+ Input matrix size & N$_{x}$ = From 62 to 150\\ %\hline
+ Residual error precision: 10$^{-5}$ to 10$^{-9}$\\ \hline
+ \end{tabular}
+\end{footnotesize}
+
+Again, comprehensive and extensive tests have been conducted varying the
+CPU power and the network parameters (bandwidth and latency) in the
+simulator tool with different problem size. The relative gains greater
+than 1 between the two algorithms have been captured after each step of
+the test. Table I below has recorded the best grid configurations
+allowing a multiplitting method time more than 2.5 times lower than
+classical GMRES execution and convergence time. The finding thru this
+experimentation is the tolerance of the multisplitting method under a
+low speed network that we encounter usually with distant clusters thru the
+internet.
+
+% use the same column width for the following three tables
+\newlength{\mytablew}\settowidth{\mytablew}{\footnotesize\np{E-11}}
+\newenvironment{mytable}[1]{% #1: number of columns for data
+ \renewcommand{\arraystretch}{1.3}%
+ \begin{tabular}{|>{\bfseries}r%
+ |*{#1}{>{\centering\arraybackslash}p{\mytablew}|}}}{%
+ \end{tabular}}
+
+\begin{table}[!t]
+ \centering
+ \caption{Relative gain of the multisplitting algorithm compared with
+the classical GMRES}
+ \label{tab.cluster.2x50}
+
+ \begin{mytable}{6}
+ \hline
+ bw
+ & 5 & 5 & 5 & 5 & 5 & 50 \\
+ \hline
+ lat
+ & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 \\
+ \hline
+ power
+ & 1 & 1 & 1 & 1.5 & 1.5 & 1.5 \\
+ \hline
+ size
+ & 62 & 62 & 62 & 100 & 100 & 110 \\
+ \hline
+ Prec/Eprec
+ & \np{E-5} & \np{E-8} & \np{E-9} & \np{E-11} & \np{E-11} & \np{E-11} \\
+ \hline
+ speedup
+ & 0.396 & 0.392 & 0.396 & 0.391 & 0.393 & 0.395 \\
+ \hline
+ \end{mytable}
+
+ \smallskip
+
+ \begin{mytable}{6}
+ \hline
+ bw
+ & 50 & 50 & 50 & 50 & 10 & 10 \\
+ \hline
+ lat
+ & 0.02 & 0.02 & 0.02 & 0.02 & 0.03 & 0.01 \\
+ \hline
+ power
+ & 1.5 & 1.5 & 1.5 & 1.5 & 1 & 1.5 \\
+ \hline
+ size
+ & 120 & 130 & 140 & 150 & 171 & 171 \\
+ \hline
+ Prec/Eprec
+ & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-5} & \np{E-5} \\
+ \hline
+ speedup
+ & 0.398 & 0.388 & 0.393 & 0.394 & 0.63 & 0.778 \\
+ \hline
+ \end{mytable}
+\end{table}