\usepackage[utf8]{inputenc}
\usepackage{amsfonts,amssymb}
\usepackage{amsmath}
-\usepackage{algorithm}
+%\usepackage{algorithm}
\usepackage{algpseudocode}
%\usepackage{amsthm}
\usepackage{graphicx}
\usepackage[textsize=footnotesize]{todonotes}
\newcommand{\AG}[2][inline]{%
\todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace}
-\newcommand{\RC}[2][inline]{%
- \todo[color=red!10,#1]{\sffamily\textbf{RC:} #2}\xspace}
+\newcommand{\DL}[2][inline]{%
+ \todo[color=yellow!50,#1]{\sffamily\textbf{DL:} #2}\xspace}
\newcommand{\LZK}[2][inline]{%
\todo[color=blue!10,#1]{\sffamily\textbf{LZK:} #2}\xspace}
+\newcommand{\RC}[2][inline]{%
+ \todo[color=red!10,#1]{\sffamily\textbf{RC:} #2}\xspace}
\algnewcommand\algorithmicinput{\textbf{Input:}}
\algnewcommand\Input{\item[\algorithmicinput]}
\newcommand{\MI}{\mathit{MaxIter}}
-
\begin{document}
\title{Simulation of Asynchronous Iterative Numerical Algorithms Using SimGrid}
\author{%
\IEEEauthorblockN{%
- Charles Emile Ramamonjisoa and
- David Laiymani and
- Arnaud Giersch and
- Lilia Ziane Khodja and
- Raphaël Couturier
+ Charles Emile Ramamonjisoa\IEEEauthorrefmark{1},
+ David Laiymani\IEEEauthorrefmark{1},
+ Arnaud Giersch\IEEEauthorrefmark{1},
+ Lilia Ziane Khodja\IEEEauthorrefmark{2} and
+ Raphaël Couturier\IEEEauthorrefmark{1}
+ }
+ \IEEEauthorblockA{\IEEEauthorrefmark{1}%
+ Femto-ST Institute -- DISC Department\\
+ Université de Franche-Comté,
+ IUT de Belfort-Montbéliard\\
+ 19 avenue du Maréchal Juin, BP 527, 90016 Belfort cedex, France\\
+ Email: \email{{charles.ramamonjisoa,david.laiymani,arnaud.giersch,raphael.couturier}@univ-fcomte.fr}
}
- \IEEEauthorblockA{%
- Femto-ST Institute - DISC Department\\
- Université de Franche-Comté\\
- Belfort\\
- Email: \email{{raphael.couturier,arnaud.giersch,david.laiymani,charles.ramamonjisoa}@univ-fcomte.fr}
+ \IEEEauthorblockA{\IEEEauthorrefmark{2}%
+ Inria Bordeaux Sud-Ouest\\
+ 200 avenue de la Vieille Tour, 33405 Talence cedex, France \\
+ Email: \email{lilia.ziane@inria.fr}
}
}
\maketitle
-\RC{Ordre des autheurs pas définitif.}
-\LZK{Adresse de Lilia: Inria Bordeaux Sud-Ouest, 200 Avenue de la Vieille Tour, 33405 Talence Cedex, France \\ Email: lilia.ziane@inria.fr}
+\RC{Ordre des auteurs pas définitif.}
\begin{abstract}
-The abstract goes here.
+In recent years, the scalability of large-scale implementation in a
+distributed environment of algorithms becoming more and more complex has
+always been hampered by the limits of physical computing resources
+capacity. One solution is to run the program in a virtual environment
+simulating a real interconnected computers architecture. The results are
+convincing and useful solutions are obtained with far fewer resources
+than in a real platform. However, challenges remain for the convergence
+and efficiency of a class of algorithms that concern us here, namely
+numerical parallel iterative algorithms executed in asynchronous mode,
+especially in a large scale level. Actually, such algorithm requires a
+balance and a compromise between computation and communication time
+during the execution. Two important factors determine the success of the
+experimentation: the convergence of the iterative algorithm on a large
+scale and the execution time reduction in asynchronous mode. Once again,
+from the current work, a simulated environment like SimGrid provides
+accurate results which are difficult or even impossible to obtain in a
+physical platform by exploiting the flexibility of the simulator on the
+computing units clusters and the network structure design. Our
+experimental outputs showed a saving of up to \np[\%]{40} for the algorithm
+execution time in asynchronous mode compared to the synchronous one with
+a residual precision up to \np{E-11}. Such successful results open
+perspectives on experimentations for running the algorithm on a
+simulated large scale growing environment and with larger problem size.
+
+% no keywords for IEEE conferences
+% Keywords: Algorithm distributed iterative asynchronous simulation SimGrid
\end{abstract}
\section{Introduction}
-Parallel computing and high performance computing (HPC) are becoming
-more and more imperative for solving various problems raised by
-researchers on various scientific disciplines but also by industrial in
-the field. Indeed, the increasing complexity of these requested
-applications combined with a continuous increase of their sizes lead to
-write distributed and parallel algorithms requiring significant hardware
-resources (grid computing, clusters, broadband network, etc\dots{}) but
-also a non-negligible CPU execution time. We consider in this paper a
-class of highly efficient parallel algorithms called iterative executed
-in a distributed environment. As their name suggests, these algorithm
-solves a given problem that might be NP- complete complex by successive
-iterations ($X_{n +1} = f(X_{n})$) from an initial value $X_{0}$ to find
-an approximate value $X^*$ of the solution with a very low
-residual error. Several well-known methods demonstrate the convergence
-of these algorithms. Generally, to reduce the complexity and the
-execution time, the problem is divided into several \emph{pieces} that will
-be solved in parallel on multiple processing units. The latter will
-communicate each intermediate results before a new iteration starts
-until the approximate solution is reached. These distributed parallel
-computations can be performed either in \emph{synchronous} communication mode
-where a new iteration begin only when all nodes communications are
-completed, either \emph{asynchronous} mode where processors can continue
-independently without or few synchronization points. Despite the
-effectiveness of iterative approach, a major drawback of the method is
-the requirement of huge resources in terms of computing capacity,
-storage and high speed communication network. Indeed, limited physical
-resources are blocking factors for large-scale deployment of parallel
-algorithms.
-
-In recent years, the use of a simulation environment to execute parallel
-iterative algorithms found some interests in reducing the highly cost of
-access to computing resources: (1) for the applications development life
-cycle and in code debugging (2) and in production to get results in a
-reasonable execution time with a simulated infrastructure not accessible
-with physical resources. Indeed, the launch of distributed iterative
-asynchronous algorithms to solve a given problem on a large-scale
-simulated environment challenges to find optimal configurations giving
-the best results with a lowest residual error and in the best of
-execution time. According our knowledge, no testing of large-scale
-simulation of the class of algorithm solving to achieve real results has
-been undertaken to date. We had in the scope of this work implemented a
-program for solving large non-symmetric linear system of equations by
-numerical method GMRES (Generalized Minimal Residual) in the simulation
-environment SimGrid. The simulated platform had allowed us to launch
-the application from a modest computing infrastructure by simulating
-different distributed architectures composed by clusters nodes
-interconnected by variable speed networks. In addition, it has been
-permitted to show the effectiveness of asynchronous mode algorithm by
-comparing its performance with the synchronous mode time. With selected
-parameters on the network platforms (bandwidth, latency of inter cluster
-network) and on the clusters architecture (number, capacity calculation
-power) in the simulated environment, the experimental results have
-demonstrated not only the algorithm convergence within a reasonable time
-compared with the physical environment performance, but also a time
-saving of up to \np[\%]{40} in asynchronous mode.
-
-This article is structured as follows: after this introduction, the next
-section will give a brief description of iterative asynchronous model.
-Then, the simulation framework SimGrid will be presented with the
-settings to create various distributed architectures. The algorithm of
-the multi -splitting method used by GMRES written with MPI primitives
-and its adaptation to SimGrid with SMPI (Simulated MPI) will be in the
-next section. At last, the experiments results carried out will be
-presented before the conclusion which we will announce the opening of
-our future work after the results.
+Parallel computing and high performance computing (HPC) are becoming more and more imperative for solving various
+problems raised by researchers on various scientific disciplines but also by industrial in the field. Indeed, the
+increasing complexity of these requested applications combined with a continuous increase of their sizes lead to write
+distributed and parallel algorithms requiring significant hardware resources (grid computing, clusters, broadband
+network, etc.) but also a non-negligible CPU execution time. We consider in this paper a class of highly efficient
+parallel algorithms called \emph{numerical iterative algorithms} executed in a distributed environment. As their name
+suggests, these algorithms solve a given problem by successive iterations ($X_{n +1} = f(X_{n})$) from an initial value
+$X_{0}$ to find an approximate value $X^*$ of the solution with a very low residual error. Several well-known methods
+demonstrate the convergence of these algorithms~\cite{BT89,Bahi07}.
+
+Parallelization of such algorithms generally involve the division of the problem into several \emph{blocks} that will
+be solved in parallel on multiple processing units. The latter will communicate each intermediate results before a new
+iteration starts and until the approximate solution is reached. These parallel computations can be performed either in
+\emph{synchronous} mode where a new iteration begins only when all nodes communications are completed,
+or in \emph{asynchronous} mode where processors can continue independently with few or no synchronization points. For
+instance in the \textit{Asynchronous Iterations~-- Asynchronous Communications (AIAC)} model~\cite{bcvc06:ij}, local
+computations do not need to wait for required data. Processors can then perform their iterations with the data present
+at that time. Even if the number of iterations required before the convergence is generally greater than for the
+synchronous case, AIAC algorithms can significantly reduce overall execution times by suppressing idle times due to
+synchronizations especially in a grid computing context (see~\cite{Bahi07} for more details).
+
+Parallel numerical applications (synchronous or asynchronous) may have different configuration and deployment
+requirements. Quantifying their resource allocation policies and application scheduling algorithms in
+grid computing environments under varying load, CPU power and network speeds is very costly, very labor intensive and very time
+consuming~\cite{Calheiros:2011:CTM:1951445.1951450}. The case of AIAC algorithms is even more problematic since they are very sensible to the
+execution environment context. For instance, variations in the network bandwidth (intra and inter-clusters), in the
+number and the power of nodes, in the number of clusters... can lead to very different number of iterations and so to
+very different execution times. Then, it appears that the use of simulation tools to explore various platform
+scenarios and to run large numbers of experiments quickly can be very promising. In this way, the use of a simulation
+environment to execute parallel iterative algorithms found some interests in reducing the highly cost of access to
+computing resources: (1) for the applications development life cycle and in code debugging (2) and in production to get
+results in a reasonable execution time with a simulated infrastructure not accessible with physical resources. Indeed,
+the launch of distributed iterative asynchronous algorithms to solve a given problem on a large-scale simulated
+environment challenges to find optimal configurations giving the best results with a lowest residual error and in the
+best of execution time.
+
+To our knowledge, there is no existing work on the large-scale simulation of a real AIAC application. The aim of this
+paper is twofold. First we give a first approach of the simulation of AIAC algorithms using a simulation tool (i.e. the
+SimGrid toolkit~\cite{SimGrid}). Second, we confirm the effectiveness of asynchronous mode algorithms by comparing their
+performance with the synchronous mode. More precisely, we had implemented a program for solving large non-symmetric
+linear system of equations by numerical method GMRES (Generalized Minimal Residual) []. We show, that with minor
+modifications of the initial MPI code, the SimGrid toolkit allows us to perform a test campaign of a real AIAC
+application on different computing architectures. The simulated results we obtained are in line with real results
+exposed in ??\AG[]{??}. SimGrid had allowed us to launch the application from a modest computing infrastructure by simulating
+different distributed architectures composed by clusters nodes interconnected by variable speed networks.
+With selected parameters on the network platforms (bandwidth, latency of inter cluster network) and
+on the clusters architecture (number, capacity calculation power) in the simulated environment, the experimental results
+have demonstrated not only the algorithm convergence within a reasonable time compared with the physical environment
+performance, but also a time saving of up to \np[\%]{40} in asynchronous mode.
+
+This article is structured as follows: after this introduction, the next section will give a brief description of
+iterative asynchronous model. Then, the simulation framework SimGrid is presented with the settings to create various
+distributed architectures. The algorithm of the multisplitting method used by GMRES written with MPI primitives and
+its adaptation to SimGrid with SMPI (Simulated MPI) is detailed in the next section. At last, the experiments results
+carried out will be presented before some concluding remarks and future works.
-\section{The asynchronous iteration model}
-
-Décrire le modèle asynchrone. Je m'en charge (DL)
+\section{Motivations and scientific context}
+
+As exposed in the introduction, parallel iterative methods are now widely used in many scientific domains. They can be
+classified in three main classes depending on how iterations and communications are managed (for more details readers
+can refer to~\cite{bcvc06:ij}). In the \textit{Synchronous Iterations~-- Synchronous Communications (SISC)} model data
+are exchanged at the end of each iteration. All the processors must begin the same iteration at the same time and
+important idle times on processors are generated. The \textit{Synchronous Iterations~-- Asynchronous Communications
+(SIAC)} model can be compared to the previous one except that data required on another processor are sent asynchronously
+i.e. without stopping current computations. This technique allows to partially overlap communications by computations
+but unfortunately, the overlapping is only partial and important idle times remain. It is clear that, in a grid
+computing context, where the number of computational nodes is large, heterogeneous and widely distributed, the idle
+times generated by synchronizations are very penalizing. One way to overcome this problem is to use the
+\textit{Asynchronous Iterations~-- Asynchronous Communications (AIAC)} model. Here, local computations do not need to
+wait for required data. Processors can then perform their iterations with the data present at that time. Figure~\ref{fig:aiac}
+illustrates this model where the gray blocks represent the computation phases, the white spaces the idle
+times and the arrows the communications. With this algorithmic model, the number of iterations required before the
+convergence is generally greater than for the two former classes. But, and as detailed in~\cite{bcvc06:ij}, AIAC
+algorithms can significantly reduce overall execution times by suppressing idle times due to synchronizations especially
+in a grid computing context.
+
+\begin{figure}[!t]
+ \centering
+ \includegraphics[width=8cm]{AIAC.pdf}
+ \caption{The Asynchronous Iterations~-- Asynchronous Communications model}
+ \label{fig:aiac}
+\end{figure}
-\section{SimGrid}
-Décrire SimGrid~\cite{casanova+legrand+quinson.2008.simgrid} (Arnaud)
+It is very challenging to develop efficient applications for large scale, heterogeneous and distributed platforms such
+as computing grids. Researchers and engineers have to develop techniques for maximizing application performance of these
+multi-cluster platforms, by redesigning the applications and/or by using novel algorithms that can account for the
+composite and heterogeneous nature of the platform. Unfortunately, the deployment of such applications on these very
+large scale systems is very costly, labor intensive and time consuming. In this context, it appears that the use of
+simulation tools to explore various platform scenarios at will and to run enormous numbers of experiments quickly can be
+very promising. Several works...
+In the context of AIAC algorithms, the use of simulation tools is even more relevant. Indeed, this class of applications
+is very sensible to the execution environment context. For instance, variations in the network bandwidth (intra and
+inter-clusters), in the number and the power of nodes, in the number of clusters... can lead to very different number of
+iterations and so to very different execution times.
+\section{SimGrid}
+SimGrid~\cite{SimGrid,casanova+legrand+quinson.2008.simgrid} is a simulation
+framework to study the behavior of large-scale distributed systems. As its name
+says, it emanates from the grid computing community, but is nowadays used to
+study grids, clouds, HPC or peer-to-peer systems. The early versions of SimGrid
+date from 1999, but it's still actively developed and distributed as an open
+source software. Today, it's one of the major generic tools in the field of
+simulation for large-scale distributed systems.
+
+SimGrid provides several programming interfaces: MSG to simulate Concurrent
+Sequential Processes, SimDAG to simulate DAGs of (parallel) tasks, and SMPI to
+run real applications written in MPI~\cite{MPI}. Apart from the native C
+interface, SimGrid provides bindings for the C++, Java, Lua and Ruby programming
+languages. The SMPI interface supports applications written in C or Fortran,
+with little or no modifications. SMPI implements about \np[\%]{80} of the MPI
+2.0 standard~\cite{bedaride:hal-00919507}.
+
+%%% explain simulation
+%- simulated processes folded in one real process
+%- simulates interactions on the network, fluid model
+%- able to skip long-lasting computations
+%- traces + visu?
+
+%%% platforms
+%- describe resources and their interconnection, with their properties
+%- XML files
+
+%%% validation + refs
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Simulation of the multisplitting method}
%Décrire le problème (algo) traité ainsi que le processus d'adaptation à SimGrid.
-Let $Ax=b$ be a large sparse system of $n$ linear equations in $\mathbb{R}$, where $A$ is a sparse square and nonsingular matrix, $x$ is the solution vector and $b$ is the right-hand side vector. We use a multisplitting method based on the block Jacobi splitting to solve this linear system on a large scale platform composed of $L$ clusters of processors. In this case, we apply a row-by-row splitting without overlapping
+Let $Ax=b$ be a large sparse system of $n$ linear equations in $\mathbb{R}$, where $A$ is a sparse square and nonsingular matrix, $x$ is the solution vector and $b$ is the right-hand side vector. We use a multisplitting method based on the block Jacobi splitting to solve this linear system on a large scale platform composed of $L$ clusters of processors~\cite{o1985multi}. In this case, we apply a row-by-row splitting without overlapping
\[
\left(\begin{array}{ccc}
A_{11} & \cdots & A_{1L} \\
\vdots\\
B_L
\end{array} \right)\]
-in such a way that successive rows of matrix $A$ and both vectors $x$ and $b$ are assigned to one cluster, where for all $l,m\in\{1,\ldots,L\}$ $A_{lm}$ is a rectangular block of $A$ of size $n_l\times n_m$, $X_l$ and $B_l$ are sub-vectors of $x$ and $b$, respectively, each of size $n_l$ and $\sum_{l} n_l=\sum_{m} n_m=n$.
+in such a way that successive rows of matrix $A$ and both vectors $x$ and $b$ are assigned to one cluster, where for all $l,m\in\{1,\ldots,L\}$ $A_{lm}$ is a rectangular block of $A$ of size $n_l\times n_m$, $X_l$ and $B_l$ are sub-vectors of $x$ and $b$, respectively, of size $n_l$ each and $\sum_{l} n_l=\sum_{m} n_m=n$.
The multisplitting method proceeds by iteration to solve in parallel the linear system on $L$ clusters of processors, in such a way each sub-system
\begin{equation}
\end{equation}
is solved independently by a cluster and communications are required to update the right-hand side sub-vector $Y_l$, such that the sub-vectors $X_m$ represent the data dependencies between the clusters. As each sub-system (\ref{eq:4.1}) is solved in parallel by a cluster of processors, our multisplitting method uses an iterative method as an inner solver which is easier to parallelize and more scalable than a direct method. In this work, we use the parallel algorithm of GMRES method~\cite{ref1} which is one of the most used iterative method by many researchers.
-\begin{algorithm}
-\caption{A multisplitting solver with GMRES method}
+\begin{figure}[!t]
+ %%% IEEE instructions forbid to use an algorithm environment here, use figure
+ %%% instead
\begin{algorithmic}[1]
\Input $A_l$ (sparse sub-matrix), $B_l$ (right-hand side sub-vector)
\Output $X_l$ (solution sub-vector)\vspace{0.2cm}
\For {$k=0,1,2,\ldots$ until the global convergence}
\State Restart outer iteration with $x^0=x^k$
\State Inner iteration: \Call{InnerSolver}{$x^0$, $k+1$}
-\State Send shared elements of $X_l^{k+1}$ to neighboring clusters
-\State Receive shared elements in $\{X_m^{k+1}\}_{m\neq l}$
+\State\label{algo:01:send} Send shared elements of $X_l^{k+1}$ to neighboring clusters
+\State\label{algo:01:recv} Receive shared elements in $\{X_m^{k+1}\}_{m\neq l}$
\EndFor
\Statex
\State \Return $X_l^k$
\EndFunction
\end{algorithmic}
+\caption{A multisplitting solver with GMRES method}
\label{algo:01}
-\end{algorithm}
-
-Algorithm~\ref{algo:01} shows the main key points of the multisplitting method to solve a large sparse linear system. This algorithm is based on an outer-inner iteration method where the parallel synchronous GMRES method is used to solve the inner iteration. It is executed in parallel by each cluster of processors. For all $l,m\in\{1,\ldots,L\}$, the matrices and vectors with the subscript $l$ represent the local data for cluster $l$, while $\{A_{lm}\}_{m\neq l}$ are off-diagonal matrices of sparse matrix $A$ and $\{X_m\}_{m\neq l}$ contain vector elements of solution $x$ shared with neighboring clusters. At every outer iteration $k$, asynchronous communications are performed between processors of the local cluster and those of distant clusters (lines $6$ and $7$ in Algorithm~\ref{algo:01}). The shared vector elements of the solution $x$ are exchanged by message passing using MPI non-blocking communication routines.
+\end{figure}
-\begin{figure}
+Algorithm on Figure~\ref{algo:01} shows the main key points of the
+multisplitting method to solve a large sparse linear system. This algorithm is
+based on an outer-inner iteration method where the parallel synchronous GMRES
+method is used to solve the inner iteration. It is executed in parallel by each
+cluster of processors. For all $l,m\in\{1,\ldots,L\}$, the matrices and vectors
+with the subscript $l$ represent the local data for cluster $l$, while
+$\{A_{lm}\}_{m\neq l}$ are off-diagonal matrices of sparse matrix $A$ and
+$\{X_m\}_{m\neq l}$ contain vector elements of solution $x$ shared with
+neighboring clusters. At every outer iteration $k$, asynchronous communications
+are performed between processors of the local cluster and those of distant
+clusters (lines~\ref{algo:01:send} and~\ref{algo:01:recv} in
+Figure~\ref{algo:01}). The shared vector elements of the solution $x$ are
+exchanged by message passing using MPI non-blocking communication routines.
+
+\begin{figure}[!t]
\centering
\includegraphics[width=60mm,keepaspectratio]{clustering}
\caption{Example of three clusters of processors interconnected by a virtual unidirectional ring network.}
\LZK{Description du processus d'adaptation de l'algo multisplitting à SimGrid}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
+We did not encounter major blocking problems when adapting the multisplitting algorithm previously described to a simulation environment like SIMGRID unless some code
+debugging. Indeed, apart from the review of the program sequence for asynchronous exchanges between the six neighbors of each point in a submatrix within a cluster or
+between clusters, the algorithm was executed successfully with SMPI and provided identical outputs as those obtained with direct execution under MPI. In synchronous
+mode, the execution of the program raised no particular issue but in asynchronous mode, the review of the sequence of MPI\_Isend, MPI\_Irecv and MPI\_Waitall instructions
+and with the addition of the primitive MPI\_Test was needed to avoid a memory fault due to an infinite loop resulting from the non-convergence of the algorithm. Note here that the use of SMPI
+functions optimizer for memory footprint and CPU usage is not recommended knowing that one wants to get real results by simulation.
+As mentioned, upon this adaptation, the algorithm is executed as in the real life in the simulated environment after the following minor changes. First, all declared
+global variables have been moved to local variables for each subroutine. In fact, global variables generate side effects arising from the concurrent access of
+shared memory used by threads simulating each computing units in the SimGrid architecture. Second, the alignment of certain types of variables such as ``long int'' had
+also to be reviewed. Finally, some compilation errors on MPI\_Waitall and MPI\_Finalize primitives have been fixed with the latest version of SimGrid.
+In total, the initial MPI program running on the simulation environment SMPI gave after a very simple adaptation the same results as those obtained in a real
+environment. We have tested in synchronous mode with a simulated platform starting from a modest 2 or 3 clusters grid to a larger configuration like simulating
+Grid5000 with more than 1500 hosts with 5000 cores~\cite{bolze2006grid}. Once the code debugging and adaptation were complete, the next section shows our methodology and experimental
+results.
\section{Experimental results}
-When the ``real'' application runs in the simulation environment and produces
-the expected results, varying the input parameters and the program arguments
-allows us to compare outputs from the code execution. We have noticed from this
-study that the results depend on the following parameters: (1) at the network
-level, we found that the most critical values are the bandwidth (bw) and the
-network latency (lat). (2) Hosts power (GFlops) can also influence on the
-results. And finally, (3) when submitting job batches for execution, the
-arguments values passed to the program like the maximum number of iterations or
-the ``external'' precision are critical to ensure not only the convergence of the
-algorithm but also to get the main objective of the experimentation of the
-simulation in having an execution time in asynchronous less than in synchronous
-mode, in others words, in having a ``speedup'' less than 1 (Speedup = Execution
-time in synchronous mode / Execution time in asynchronous mode).
-
-A priori, obtaining a speedup less than 1 would be difficult in a local area
+When the \emph{real} application runs in the simulation environment and produces the expected results, varying the input
+parameters and the program arguments allows us to compare outputs from the code execution. We have noticed from this
+study that the results depend on the following parameters:
+\begin{itemize}
+\item At the network level, we found that
+the most critical values are the bandwidth (bw) and the network latency (lat).
+\item Hosts power (GFlops) can also
+influence on the results.
+\item Finally, when submitting job batches for execution, the arguments values passed to the
+program like the maximum number of iterations or the \emph{external} precision are critical. They allow to ensure not
+only the convergence of the algorithm but also to get the main objective of the experimentation of the simulation in
+having an execution time in asynchronous less than in synchronous mode (i.e. speed-up less than $1$).
+\end{itemize}
+
+A priori, obtaining a speedup less than $1$ would be difficult in a local area
network configuration where the synchronous mode will take advantage on the rapid
exchange of information on such high-speed links. Thus, the methodology adopted
was to launch the application on clustered network. In this last configuration,
degrading the inter-cluster network performance will \emph{penalize} the synchronous
-mode allowing to get a speedup lower than 1. This action simulates the case of
+mode allowing to get a speedup lower than $1$. This action simulates the case of
clusters linked with long distance network like Internet.
As a first step, the algorithm was run on a network consisting of two clusters
-containing fifty hosts each, totaling one hundred hosts. Various combinations of
-the above factors have providing the results shown in Table~\ref{tab.cluster.2x50} with a matrix size
-ranging from Nx = Ny = Nz = 62 to 171 elements or from $62^{3} = \np{238328}$ to
-$171^{3} = \np{5211000}$ entries.
+containing $50$ hosts each, totaling $100$ hosts. Various combinations of
+the above factors have providing the results shown in
+Table~\ref{tab.cluster.2x50} with a matrix size ranging from $N_x = N_y = N_z =
+62 \text{ to } 171$ elements or from $62^{3} = \np{238328}$ to $171^{3} =
+\np{5211000}$ entries.
+
+% use the same column width for the following three tables
+\newlength{\mytablew}\settowidth{\mytablew}{\footnotesize\np{E-11}}
+\newenvironment{mytable}[1]{% #1: number of columns for data
+ \renewcommand{\arraystretch}{1.3}%
+ \begin{tabular}{|>{\bfseries}r%
+ |*{#1}{>{\centering\arraybackslash}p{\mytablew}|}}}{%
+ \end{tabular}}
+
+\begin{table}[!t]
+ \centering
+ \caption{$2$ clusters, each with $50$ nodes}
+ \label{tab.cluster.2x50}
+ \begin{mytable}{6}
+ \hline
+ bw
+ & 5 & 5 & 5 & 5 & 5 & 50 \\
+ \hline
+ lat
+ & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 \\
+ \hline
+ power
+ & 1 & 1 & 1 & 1.5 & 1.5 & 1.5 \\
+ \hline
+ size
+ & 62 & 62 & 62 & 100 & 100 & 110 \\
+ \hline
+ Prec/Eprec
+ & \np{E-5} & \np{E-8} & \np{E-9} & \np{E-11} & \np{E-11} & \np{E-11} \\
+ \hline
+ speedup
+ & 0.396 & 0.392 & 0.396 & 0.391 & 0.393 & 0.395 \\
+ \hline
+ \end{mytable}
+
+ \smallskip
+
+ \begin{mytable}{6}
+ \hline
+ bw
+ & 50 & 50 & 50 & 50 & 10 & 10 \\
+ \hline
+ lat
+ & 0.02 & 0.02 & 0.02 & 0.02 & 0.03 & 0.01 \\
+ \hline
+ power
+ & 1.5 & 1.5 & 1.5 & 1.5 & 1 & 1.5 \\
+ \hline
+ size
+ & 120 & 130 & 140 & 150 & 171 & 171 \\
+ \hline
+ Prec/Eprec
+ & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-5} & \np{E-5} \\
+ \hline
+ speedup
+ & 0.398 & 0.388 & 0.393 & 0.394 & 0.63 & 0.778 \\
+ \hline
+ \end{mytable}
+\end{table}
+
Then we have changed the network configuration using three clusters containing
-respectively 33, 33 and 34 hosts, or again by on hundred hosts for all the
+respectively $33$, $33$ and $34$ hosts, or again by on hundred hosts for all the
clusters. In the same way as above, a judicious choice of key parameters has
-permitted to get the results in Table~\ref{tab.cluster.3x33} which shows the speedups less than 1 with
-a matrix size from 62 to 100 elements.
+permitted to get the results in Table~\ref{tab.cluster.3x33} which shows the
+speedups less than $1$ with a matrix size from $62$ to $100$ elements.
+
+\begin{table}[!t]
+ \centering
+ \caption{$3$ clusters, each with $33$ nodes}
+ \label{tab.cluster.3x33}
+
+ \begin{mytable}{6}
+ \hline
+ bw
+ & 10 & 5 & 4 & 3 & 2 & 6 \\
+ \hline
+ lat
+ & 0.01 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 \\
+ \hline
+ power
+ & 1 & 1 & 1 & 1 & 1 & 1 \\
+ \hline
+ size
+ & 62 & 100 & 100 & 100 & 100 & 171 \\
+ \hline
+ Prec/Eprec
+ & \np{E-5} & \np{E-5} & \np{E-5} & \np{E-5} & \np{E-5} & \np{E-5} \\
+ \hline
+ speedup
+ & 0.997 & 0.99 & 0.93 & 0.84 & 0.78 & 0.99 \\
+ \hline
+ \end{mytable}
+\end{table}
In a final step, results of an execution attempt to scale up the three clustered
-configuration but increasing by two hundreds hosts has been recorded in Table~\ref{tab.cluster.3x67}.
+configuration but increasing by two hundreds hosts has been recorded in
+Table~\ref{tab.cluster.3x67}.
+
+\begin{table}[!t]
+ \centering
+ \caption{3 clusters, each with 66 nodes}
+ \label{tab.cluster.3x67}
+
+ \begin{mytable}{1}
+ \hline
+ bw & 1 \\
+ \hline
+ lat & 0.02 \\
+ \hline
+ power & 1 \\
+ \hline
+ size & 62 \\
+ \hline
+ Prec/Eprec & \np{E-5} \\
+ \hline
+ speedup & 0.9 \\
+ \hline
+ \end{mytable}
+\end{table}
Note that the program was run with the following parameters:
\item Description of the cluster architecture;
\item Maximum number of internal and external iterations;
\item Internal and external precisions;
- \item Matrix size NX, NY and NZ;
- \item Matrix diagonal value = 6.0;
+ \item Matrix size $N_x$, $N_y$ and $N_z$;
+ \item Matrix diagonal value: \np{6.0};
\item Execution Mode: synchronous or asynchronous.
\end{itemize}
-\begin{table}
- \centering
- \caption{2 clusters X 50 nodes}
- \label{tab.cluster.2x50}
- \AG{Les images manquent dans le dépôt Git. Si ce sont vraiment des tableaux, utiliser un format vectoriel (eps ou pdf), et surtout pas de jpeg!}
- \includegraphics[width=209pt]{img1.jpg}
-\end{table}
-
-\begin{table}
- \centering
- \caption{3 clusters X 33 nodes}
- \label{tab.cluster.3x33}
- \AG{Le fichier manque.}
- \includegraphics[width=209pt]{img2.jpg}
-\end{table}
-
-\begin{table}
- \centering
- \caption{3 clusters X 67 nodes}
- \label{tab.cluster.3x67}
- \AG{Le fichier manque.}
-% \includegraphics[width=160pt]{img3.jpg}
- \includegraphics[scale=0.5]{img3.jpg}
-\end{table}
-
\paragraph*{Interpretations and comments}
After analyzing the outputs, generally, for the configuration with two or three
-clusters including one hundred hosts (Tables~\ref{tab.cluster.2x50} and~\ref{tab.cluster.3x33}), some combinations of the
-used parameters affecting the results have given a speedup less than 1, showing
-the effectiveness of the asynchronous performance compared to the synchronous
-mode.
-
-In the case of a two clusters configuration, Table~\ref{tab.cluster.2x50} shows that with a
-deterioration of inter cluster network set with \np[Mbits/s]{5} of bandwidth, a latency
-in order of a hundredth of a millisecond and a system power of one GFlops, an
-efficiency of about \np[\%]{40} in asynchronous mode is obtained for a matrix size of 62
-elements. It is noticed that the result remains stable even if we vary the
-external precision from \np{E-5} to \np{E-9}. By increasing the problem size up to 100
-elements, it was necessary to increase the CPU power of \np[\%]{50} to \np[GFlops]{1.5} for a
-convergence of the algorithm with the same order of asynchronous mode efficiency.
-Maintaining such a system power but this time, increasing network throughput
-inter cluster up to \np[Mbits/s]{50}, the result of efficiency of about \np[\%]{40} is
-obtained with high external precision of \np{E-11} for a matrix size from 110 to 150
-side elements.
-
-For the 3 clusters architecture including a total of 100 hosts, Table~\ref{tab.cluster.3x33} shows
-that it was difficult to have a combination which gives an efficiency of
-asynchronous below \np[\%]{80}. Indeed, for a matrix size of 62 elements, equality
-between the performance of the two modes (synchronous and asynchronous) is
-achieved with an inter cluster of \np[Mbits/s]{10} and a latency of \np{E-1} ms. To
-challenge an efficiency by \np[\%]{78} with a matrix size of 100 points, it was
-necessary to degrade the inter cluster network bandwidth from 5 to 2 Mbit/s.
-
-A last attempt was made for a configuration of three clusters but more power
-with 200 nodes in total. The convergence with a speedup of \np[\%]{90} was obtained
-with a bandwidth of \np[Mbits/s]{1} as shown in Table~\ref{tab.cluster.3x67}.
+clusters including one hundred hosts (Tables~\ref{tab.cluster.2x50}
+and~\ref{tab.cluster.3x33}), some combinations of the used parameters affecting
+the results have given a speedup less than 1, showing the effectiveness of the
+asynchronous performance compared to the synchronous mode.
+
+In the case of a two clusters configuration, Table~\ref{tab.cluster.2x50} shows
+that with a deterioration of inter cluster network set with \np[Mbit/s]{5} of
+bandwidth, a latency in order of a hundredth of a millisecond and a system power
+of one GFlops, an efficiency of about \np[\%]{40} in asynchronous mode is
+obtained for a matrix size of 62 elements. It is noticed that the result remains
+stable even if we vary the external precision from \np{E-5} to \np{E-9}. By
+increasing the problem size up to $100$ elements, it was necessary to increase the
+CPU power of \np[\%]{50} to \np[GFlops]{1.5} for a convergence of the algorithm
+with the same order of asynchronous mode efficiency. Maintaining such a system
+power but this time, increasing network throughput inter cluster up to
+\np[Mbit/s]{50}, the result of efficiency of about \np[\%]{40} is obtained with
+high external precision of \np{E-11} for a matrix size from $110$ to $150$ side
+elements.
+
+For the $3$ clusters architecture including a total of 100 hosts,
+Table~\ref{tab.cluster.3x33} shows that it was difficult to have a combination
+which gives an efficiency of asynchronous below \np[\%]{80}. Indeed, for a
+matrix size of $62$ elements, equality between the performance of the two modes
+(synchronous and asynchronous) is achieved with an inter cluster of
+\np[Mbit/s]{10} and a latency of \np[ms]{E-1}. To challenge an efficiency by
+\np[\%]{78} with a matrix size of $100$ points, it was necessary to degrade the
+inter cluster network bandwidth from 5 to \np[Mbit/s]{2}.
+
+A last attempt was made for a configuration of three clusters but more powerful
+with 200 nodes in total. The convergence with a speedup of \np[\%]{90} was
+obtained with a bandwidth of \np[Mbit/s]{1} as shown in
+Table~\ref{tab.cluster.3x67}.
\section{Conclusion}
+The experimental results on executing a parallel iterative algorithm in
+asynchronous mode on an environment simulating a large scale of virtual
+computers organized with interconnected clusters have been presented.
+Our work has demonstrated that using such a simulation tool allow us to
+reach the following three objectives:
+
+\begin{enumerate}
+\item To have a flexible configurable execution platform resolving the
+hard exercise to access to very limited but so solicited physical
+resources;
+\item to ensure the algorithm convergence with a reasonable time and
+iteration number ;
+\item and finally and more importantly, to find the correct combination
+of the cluster and network specifications permitting to save time in
+executing the algorithm in asynchronous mode.
+\end{enumerate}
+Our results have shown that in certain conditions, asynchronous mode is
+speeder up to \np[\%]{40} than executing the algorithm in synchronous mode
+which is not negligible for solving complex practical problems with more
+and more increasing size.
+
+ Several studies have already addressed the performance execution time of
+this class of algorithm. The work presented in this paper has
+demonstrated an original solution to optimize the use of a simulation
+tool to run efficiently an iterative parallel algorithm in asynchronous
+mode in a grid architecture.
\section*{Acknowledgment}
-
-The authors would like to thank\dots{}
+This work is partially funded by the Labex ACTION program (contract ANR-11-LABX-01-01).
+\todo[inline]{The authors would like to thank\dots{}}
% trigger a \newpage just before the given reference
% adjust value as needed - may need to be readjusted if
% the document is modified later
\bibliographystyle{IEEEtran}
-\bibliography{hpccBib}
+\bibliography{IEEEabrv,hpccBib}
\end{document}
%%% fill-column: 80
%%% ispell-local-dictionary: "american"
%%% End:
+
+% LocalWords: Ramamonjisoa Laiymani Arnaud Giersch Ziane Khodja Raphaël Femto
+% LocalWords: Université Franche Comté IUT Montbéliard Maréchal Juin Inria Sud
+% LocalWords: Ouest Vieille Talence cedex scalability experimentations HPC MPI
+% LocalWords: Parallelization AIAC GMRES multi SMPI SISC SIAC SimDAG DAGs Lua
+% LocalWords: Fortran GFlops priori Mbit de du fcomte multisplitting scalable
+% LocalWords: SimGrid Belfort parallelize Labex ANR LABX IEEEabrv hpccBib