X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/hpcc2014.git/blobdiff_plain/e421bb1806cde3dfe6a238e56119fbae3025fe9d..00fb583f643f96a407cb5cee06fd1df95268be7a:/hpcc.tex?ds=inline diff --git a/hpcc.tex b/hpcc.tex index 94e9ee7..92d275c 100644 --- a/hpcc.tex +++ b/hpcc.tex @@ -4,7 +4,7 @@ \usepackage[utf8]{inputenc} \usepackage{amsfonts,amssymb} \usepackage{amsmath} -\usepackage{algorithm} +%\usepackage{algorithm} \usepackage{algpseudocode} %\usepackage{amsthm} \usepackage{graphicx} @@ -40,7 +40,6 @@ \newcommand{\MI}{\mathit{MaxIter}} - \begin{document} \title{Simulation of Asynchronous Iterative Numerical Algorithms Using SimGrid} @@ -69,10 +68,8 @@ \maketitle -\RC{Ordre des autheurs pas définitif.} +\RC{Ordre des auteurs pas définitif.} \begin{abstract} -ABSTRACT - In recent years, the scalability of large-scale implementation in a distributed environment of algorithms becoming more and more complex has always been hampered by the limits of physical computing resources @@ -87,7 +84,7 @@ balance and a compromise between computation and communication time during the execution. Two important factors determine the success of the experimentation: the convergence of the iterative algorithm on a large scale and the execution time reduction in asynchronous mode. Once again, -from the current work, a simulated environment like Simgrid provides +from the current work, a simulated environment like SimGrid provides accurate results which are difficult or even impossible to obtain in a physical platform by exploiting the flexibility of the simulator on the computing units clusters and the network structure design. Our @@ -98,90 +95,145 @@ perspectives on experimentations for running the algorithm on a simulated large scale growing environment and with larger problem size. % no keywords for IEEE conferences -% Keywords: Algorithm distributed iterative asynchronous simulation simgrid +% Keywords: Algorithm distributed iterative asynchronous simulation SimGrid \end{abstract} \section{Introduction} -Parallel computing and high performance computing (HPC) are becoming -more and more imperative for solving various problems raised by -researchers on various scientific disciplines but also by industrial in -the field. Indeed, the increasing complexity of these requested -applications combined with a continuous increase of their sizes lead to -write distributed and parallel algorithms requiring significant hardware -resources (grid computing, clusters, broadband network, etc\dots{}) but -also a non-negligible CPU execution time. We consider in this paper a -class of highly efficient parallel algorithms called iterative executed -in a distributed environment. As their name suggests, these algorithm -solves a given problem that might be NP- complete complex by successive -iterations ($X_{n +1} = f(X_{n})$) from an initial value $X_{0}$ to find -an approximate value $X^*$ of the solution with a very low -residual error. Several well-known methods demonstrate the convergence -of these algorithms. Generally, to reduce the complexity and the -execution time, the problem is divided into several \emph{pieces} that will -be solved in parallel on multiple processing units. The latter will -communicate each intermediate results before a new iteration starts -until the approximate solution is reached. These distributed parallel -computations can be performed either in \emph{synchronous} communication mode -where a new iteration begin only when all nodes communications are -completed, either \emph{asynchronous} mode where processors can continue -independently without or few synchronization points. Despite the -effectiveness of iterative approach, a major drawback of the method is -the requirement of huge resources in terms of computing capacity, -storage and high speed communication network. Indeed, limited physical -resources are blocking factors for large-scale deployment of parallel -algorithms. - -In recent years, the use of a simulation environment to execute parallel -iterative algorithms found some interests in reducing the highly cost of -access to computing resources: (1) for the applications development life -cycle and in code debugging (2) and in production to get results in a -reasonable execution time with a simulated infrastructure not accessible -with physical resources. Indeed, the launch of distributed iterative -asynchronous algorithms to solve a given problem on a large-scale -simulated environment challenges to find optimal configurations giving -the best results with a lowest residual error and in the best of -execution time. According our knowledge, no testing of large-scale -simulation of the class of algorithm solving to achieve real results has -been undertaken to date. We had in the scope of this work implemented a -program for solving large non-symmetric linear system of equations by -numerical method GMRES (Generalized Minimal Residual) in the simulation -environment SimGrid. The simulated platform had allowed us to launch -the application from a modest computing infrastructure by simulating -different distributed architectures composed by clusters nodes -interconnected by variable speed networks. In addition, it has been -permitted to show the effectiveness of asynchronous mode algorithm by -comparing its performance with the synchronous mode time. With selected -parameters on the network platforms (bandwidth, latency of inter cluster -network) and on the clusters architecture (number, capacity calculation -power) in the simulated environment, the experimental results have -demonstrated not only the algorithm convergence within a reasonable time -compared with the physical environment performance, but also a time -saving of up to \np[\%]{40} in asynchronous mode. - -This article is structured as follows: after this introduction, the next -section will give a brief description of iterative asynchronous model. -Then, the simulation framework SimGrid will be presented with the -settings to create various distributed architectures. The algorithm of -the multi -splitting method used by GMRES written with MPI primitives -and its adaptation to SimGrid with SMPI (Simulated MPI) will be in the -next section. At last, the experiments results carried out will be -presented before the conclusion which we will announce the opening of -our future work after the results. +Parallel computing and high performance computing (HPC) are becoming more and more imperative for solving various +problems raised by researchers on various scientific disciplines but also by industrial in the field. Indeed, the +increasing complexity of these requested applications combined with a continuous increase of their sizes lead to write +distributed and parallel algorithms requiring significant hardware resources (grid computing, clusters, broadband +network, etc.) but also a non-negligible CPU execution time. We consider in this paper a class of highly efficient +parallel algorithms called \emph{numerical iterative algorithms} executed in a distributed environment. As their name +suggests, these algorithms solve a given problem by successive iterations ($X_{n +1} = f(X_{n})$) from an initial value +$X_{0}$ to find an approximate value $X^*$ of the solution with a very low residual error. Several well-known methods +demonstrate the convergence of these algorithms~\cite{BT89,Bahi07}. + +Parallelization of such algorithms generally involve the division of the problem into several \emph{blocks} that will +be solved in parallel on multiple processing units. The latter will communicate each intermediate results before a new +iteration starts and until the approximate solution is reached. These parallel computations can be performed either in +\emph{synchronous} mode where a new iteration begins only when all nodes communications are completed, +or in \emph{asynchronous} mode where processors can continue independently with few or no synchronization points. For +instance in the \textit{Asynchronous Iterations~-- Asynchronous Communications (AIAC)} model~\cite{bcvc06:ij}, local +computations do not need to wait for required data. Processors can then perform their iterations with the data present +at that time. Even if the number of iterations required before the convergence is generally greater than for the +synchronous case, AIAC algorithms can significantly reduce overall execution times by suppressing idle times due to +synchronizations especially in a grid computing context (see~\cite{Bahi07} for more details). + +Parallel numerical applications (synchronous or asynchronous) may have different configuration and deployment +requirements. Quantifying their resource allocation policies and application scheduling algorithms in +grid computing environments under varying load, CPU power and network speeds is very costly, very labor intensive and very time +consuming~\cite{Calheiros:2011:CTM:1951445.1951450}. The case of AIAC algorithms is even more problematic since they are very sensible to the +execution environment context. For instance, variations in the network bandwidth (intra and inter-clusters), in the +number and the power of nodes, in the number of clusters... can lead to very different number of iterations and so to +very different execution times. Then, it appears that the use of simulation tools to explore various platform +scenarios and to run large numbers of experiments quickly can be very promising. In this way, the use of a simulation +environment to execute parallel iterative algorithms found some interests in reducing the highly cost of access to +computing resources: (1) for the applications development life cycle and in code debugging (2) and in production to get +results in a reasonable execution time with a simulated infrastructure not accessible with physical resources. Indeed, +the launch of distributed iterative asynchronous algorithms to solve a given problem on a large-scale simulated +environment challenges to find optimal configurations giving the best results with a lowest residual error and in the +best of execution time. + +To our knowledge, there is no existing work on the large-scale simulation of a real AIAC application. The aim of this +paper is twofold. First we give a first approach of the simulation of AIAC algorithms using a simulation tool (i.e. the +SimGrid toolkit~\cite{SimGrid}). Second, we confirm the effectiveness of asynchronous mode algorithms by comparing their +performance with the synchronous mode. More precisely, we had implemented a program for solving large non-symmetric +linear system of equations by numerical method GMRES (Generalized Minimal Residual) []. We show, that with minor +modifications of the initial MPI code, the SimGrid toolkit allows us to perform a test campaign of a real AIAC +application on different computing architectures. The simulated results we obtained are in line with real results +exposed in ??\AG[]{??}. SimGrid had allowed us to launch the application from a modest computing infrastructure by simulating +different distributed architectures composed by clusters nodes interconnected by variable speed networks. +With selected parameters on the network platforms (bandwidth, latency of inter cluster network) and +on the clusters architecture (number, capacity calculation power) in the simulated environment, the experimental results +have demonstrated not only the algorithm convergence within a reasonable time compared with the physical environment +performance, but also a time saving of up to \np[\%]{40} in asynchronous mode. + +This article is structured as follows: after this introduction, the next section will give a brief description of +iterative asynchronous model. Then, the simulation framework SimGrid is presented with the settings to create various +distributed architectures. The algorithm of the multisplitting method used by GMRES written with MPI primitives and +its adaptation to SimGrid with SMPI (Simulated MPI) is detailed in the next section. At last, the experiments results +carried out will be presented before some concluding remarks and future works. -\section{The asynchronous iteration model} +\section{Motivations and scientific context} + +As exposed in the introduction, parallel iterative methods are now widely used in many scientific domains. They can be +classified in three main classes depending on how iterations and communications are managed (for more details readers +can refer to~\cite{bcvc06:ij}). In the \textit{Synchronous Iterations~-- Synchronous Communications (SISC)} model data +are exchanged at the end of each iteration. All the processors must begin the same iteration at the same time and +important idle times on processors are generated. The \textit{Synchronous Iterations~-- Asynchronous Communications +(SIAC)} model can be compared to the previous one except that data required on another processor are sent asynchronously +i.e. without stopping current computations. This technique allows to partially overlap communications by computations +but unfortunately, the overlapping is only partial and important idle times remain. It is clear that, in a grid +computing context, where the number of computational nodes is large, heterogeneous and widely distributed, the idle +times generated by synchronizations are very penalizing. One way to overcome this problem is to use the +\textit{Asynchronous Iterations~-- Asynchronous Communications (AIAC)} model. Here, local computations do not need to +wait for required data. Processors can then perform their iterations with the data present at that time. Figure~\ref{fig:aiac} +illustrates this model where the gray blocks represent the computation phases, the white spaces the idle +times and the arrows the communications. With this algorithmic model, the number of iterations required before the +convergence is generally greater than for the two former classes. But, and as detailed in~\cite{bcvc06:ij}, AIAC +algorithms can significantly reduce overall execution times by suppressing idle times due to synchronizations especially +in a grid computing context. + +\begin{figure}[!t] + \centering + \includegraphics[width=8cm]{AIAC.pdf} + \caption{The Asynchronous Iterations~-- Asynchronous Communications model} + \label{fig:aiac} +\end{figure} -\DL{Décrire le modèle asynchrone. Je m'en charge} -\section{SimGrid} +It is very challenging to develop efficient applications for large scale, heterogeneous and distributed platforms such +as computing grids. Researchers and engineers have to develop techniques for maximizing application performance of these +multi-cluster platforms, by redesigning the applications and/or by using novel algorithms that can account for the +composite and heterogeneous nature of the platform. Unfortunately, the deployment of such applications on these very +large scale systems is very costly, labor intensive and time consuming. In this context, it appears that the use of +simulation tools to explore various platform scenarios at will and to run enormous numbers of experiments quickly can be +very promising. Several works... + +In the context of AIAC algorithms, the use of simulation tools is even more relevant. Indeed, this class of applications +is very sensible to the execution environment context. For instance, variations in the network bandwidth (intra and +inter-clusters), in the number and the power of nodes, in the number of clusters... can lead to very different number of +iterations and so to very different execution times. -\AG{Décrire SimGrid~\cite{casanova+legrand+quinson.2008.simgrid} (Arnaud)} + +\section{SimGrid} + +SimGrid~\cite{SimGrid,casanova+legrand+quinson.2008.simgrid} is a simulation +framework to study the behavior of large-scale distributed systems. As its name +says, it emanates from the grid computing community, but is nowadays used to +study grids, clouds, HPC or peer-to-peer systems. The early versions of SimGrid +date from 1999, but it's still actively developed and distributed as an open +source software. Today, it's one of the major generic tools in the field of +simulation for large-scale distributed systems. + +SimGrid provides several programming interfaces: MSG to simulate Concurrent +Sequential Processes, SimDAG to simulate DAGs of (parallel) tasks, and SMPI to +run real applications written in MPI~\cite{MPI}. Apart from the native C +interface, SimGrid provides bindings for the C++, Java, Lua and Ruby programming +languages. The SMPI interface supports applications written in C or Fortran, +with little or no modifications. SMPI implements about \np[\%]{80} of the MPI +2.0 standard~\cite{bedaride:hal-00919507}. + +%%% explain simulation +%- simulated processes folded in one real process +%- simulates interactions on the network, fluid model +%- able to skip long-lasting computations +%- traces + visu? + +%%% platforms +%- describe resources and their interconnection, with their properties +%- XML files + +%%% validation + refs + %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Simulation of the multisplitting method} %Décrire le problème (algo) traité ainsi que le processus d'adaptation à SimGrid. -Let $Ax=b$ be a large sparse system of $n$ linear equations in $\mathbb{R}$, where $A$ is a sparse square and nonsingular matrix, $x$ is the solution vector and $b$ is the right-hand side vector. We use a multisplitting method based on the block Jacobi splitting to solve this linear system on a large scale platform composed of $L$ clusters of processors. In this case, we apply a row-by-row splitting without overlapping +Let $Ax=b$ be a large sparse system of $n$ linear equations in $\mathbb{R}$, where $A$ is a sparse square and nonsingular matrix, $x$ is the solution vector and $b$ is the right-hand side vector. We use a multisplitting method based on the block Jacobi splitting to solve this linear system on a large scale platform composed of $L$ clusters of processors~\cite{o1985multi}. In this case, we apply a row-by-row splitting without overlapping \[ \left(\begin{array}{ccc} A_{11} & \cdots & A_{1L} \\ @@ -214,8 +266,9 @@ Y_l = B_l - \displaystyle\sum_{\substack{m=1\\ m\neq l}}^{L}A_{lm}X_m \end{equation} is solved independently by a cluster and communications are required to update the right-hand side sub-vector $Y_l$, such that the sub-vectors $X_m$ represent the data dependencies between the clusters. As each sub-system (\ref{eq:4.1}) is solved in parallel by a cluster of processors, our multisplitting method uses an iterative method as an inner solver which is easier to parallelize and more scalable than a direct method. In this work, we use the parallel algorithm of GMRES method~\cite{ref1} which is one of the most used iterative method by many researchers. -\begin{algorithm} -\caption{A multisplitting solver with GMRES method} +\begin{figure}[!t] + %%% IEEE instructions forbid to use an algorithm environment here, use figure + %%% instead \begin{algorithmic}[1] \Input $A_l$ (sparse sub-matrix), $B_l$ (right-hand side sub-vector) \Output $X_l$ (solution sub-vector)\vspace{0.2cm} @@ -236,12 +289,25 @@ is solved independently by a cluster and communications are required to update t \State \Return $X_l^k$ \EndFunction \end{algorithmic} +\caption{A multisplitting solver with GMRES method} \label{algo:01} -\end{algorithm} - -Algorithm~\ref{algo:01} shows the main key points of the multisplitting method to solve a large sparse linear system. This algorithm is based on an outer-inner iteration method where the parallel synchronous GMRES method is used to solve the inner iteration. It is executed in parallel by each cluster of processors. For all $l,m\in\{1,\ldots,L\}$, the matrices and vectors with the subscript $l$ represent the local data for cluster $l$, while $\{A_{lm}\}_{m\neq l}$ are off-diagonal matrices of sparse matrix $A$ and $\{X_m\}_{m\neq l}$ contain vector elements of solution $x$ shared with neighboring clusters. At every outer iteration $k$, asynchronous communications are performed between processors of the local cluster and those of distant clusters (lines $6$ and $7$ in Algorithm~\ref{algo:01}). The shared vector elements of the solution $x$ are exchanged by message passing using MPI non-blocking communication routines. +\end{figure} -\begin{figure} +Algorithm on Figure~\ref{algo:01} shows the main key points of the +multisplitting method to solve a large sparse linear system. This algorithm is +based on an outer-inner iteration method where the parallel synchronous GMRES +method is used to solve the inner iteration. It is executed in parallel by each +cluster of processors. For all $l,m\in\{1,\ldots,L\}$, the matrices and vectors +with the subscript $l$ represent the local data for cluster $l$, while +$\{A_{lm}\}_{m\neq l}$ are off-diagonal matrices of sparse matrix $A$ and +$\{X_m\}_{m\neq l}$ contain vector elements of solution $x$ shared with +neighboring clusters. At every outer iteration $k$, asynchronous communications +are performed between processors of the local cluster and those of distant +clusters (lines $6$ and $7$ in Figure~\ref{algo:01}). The shared vector +elements of the solution $x$ are exchanged by message passing using MPI +non-blocking communication routines. + +\begin{figure}[!t] \centering \includegraphics[width=60mm,keepaspectratio]{clustering} \caption{Example of three clusters of processors interconnected by a virtual unidirectional ring network.} @@ -254,7 +320,20 @@ where $\MI$ is the maximum number of outer iterations and $\epsilon$ is the tole \LZK{Description du processus d'adaptation de l'algo multisplitting à SimGrid} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% - +We did not encounter major blocking problems when adapting the multisplitting algorithm previously described to a simulation environment like SIMGRID unless some code +debugging. Indeed, apart from the review of the program sequence for asynchronous exchanges between the six neighbors of each point in a submatrix within a cluster or +between clusters, the algorithm was executed successfully with SMPI and provided identical outputs as those obtained with direct execution under MPI. In synchronous +mode, the execution of the program raised no particular issue but in asynchronous mode, the review of the sequence of MPI\_Isend, MPI\_Irecv and MPI\_Waitall instructions +and with the addition of the primitive MPI\_Test was needed to avoid a memory fault due to an infinite loop resulting from the non-convergence of the algorithm. Note here that the use of SMPI +functions optimizer for memory footprint and CPU usage is not recommended knowing that one wants to get real results by simulation. +As mentioned, upon this adaptation, the algorithm is executed as in the real life in the simulated environment after the following minor changes. First, all declared +global variables have been moved to local variables for each subroutine. In fact, global variables generate side effects arising from the concurrent access of +shared memory used by threads simulating each computing units in the SimGrid architecture. Second, the alignment of certain types of variables such as ``long int'' had +also to be reviewed. Finally, some compilation errors on MPI\_Waitall and MPI\_Finalize primitives have been fixed with the latest version of SimGrid. +In total, the initial MPI program running on the simulation environment SMPI gave after a very simple adaptation the same results as those obtained in a real +environment. We have tested in synchronous mode with a simulated platform starting from a modest 2 or 3 clusters grid to a larger configuration like simulating +Grid5000 with more than 1500 hosts with 5000 cores~\cite{bolze2006grid}. Once the code debugging and adaptation were complete, the next section shows our methodology and experimental +results. @@ -264,42 +343,154 @@ where $\MI$ is the maximum number of outer iterations and $\epsilon$ is the tole \section{Experimental results} -When the ``real'' application runs in the simulation environment and produces -the expected results, varying the input parameters and the program arguments -allows us to compare outputs from the code execution. We have noticed from this -study that the results depend on the following parameters: (1) at the network -level, we found that the most critical values are the bandwidth (bw) and the -network latency (lat). (2) Hosts power (GFlops) can also influence on the -results. And finally, (3) when submitting job batches for execution, the -arguments values passed to the program like the maximum number of iterations or -the ``external'' precision are critical to ensure not only the convergence of the -algorithm but also to get the main objective of the experimentation of the -simulation in having an execution time in asynchronous less than in synchronous -mode, in others words, in having a ``speedup'' less than 1 (Speedup = Execution -time in synchronous mode / Execution time in asynchronous mode). - -A priori, obtaining a speedup less than 1 would be difficult in a local area +When the \emph{real} application runs in the simulation environment and produces the expected results, varying the input +parameters and the program arguments allows us to compare outputs from the code execution. We have noticed from this +study that the results depend on the following parameters: +\begin{itemize} +\item At the network level, we found that +the most critical values are the bandwidth (bw) and the network latency (lat). +\item Hosts power (GFlops) can also +influence on the results. +\item Finally, when submitting job batches for execution, the arguments values passed to the +program like the maximum number of iterations or the \emph{external} precision are critical. They allow to ensure not +only the convergence of the algorithm but also to get the main objective of the experimentation of the simulation in +having an execution time in asynchronous less than in synchronous mode (i.e. speed-up less than $1$). +\end{itemize} + +A priori, obtaining a speedup less than $1$ would be difficult in a local area network configuration where the synchronous mode will take advantage on the rapid exchange of information on such high-speed links. Thus, the methodology adopted was to launch the application on clustered network. In this last configuration, degrading the inter-cluster network performance will \emph{penalize} the synchronous -mode allowing to get a speedup lower than 1. This action simulates the case of +mode allowing to get a speedup lower than $1$. This action simulates the case of clusters linked with long distance network like Internet. As a first step, the algorithm was run on a network consisting of two clusters -containing fifty hosts each, totaling one hundred hosts. Various combinations of -the above factors have providing the results shown in Table~\ref{tab.cluster.2x50} with a matrix size -ranging from Nx = Ny = Nz = 62 to 171 elements or from $62^{3} = \np{238328}$ to -$171^{3} = \np{5211000}$ entries. +containing $50$ hosts each, totaling $100$ hosts. Various combinations of +the above factors have providing the results shown in +Table~\ref{tab.cluster.2x50} with a matrix size ranging from $N_x = N_y = N_z = +62 \text{ to } 171$ elements or from $62^{3} = \np{238328}$ to $171^{3} = +\np{5211000}$ entries. + +% use the same column width for the following three tables +\newlength{\mytablew}\settowidth{\mytablew}{\footnotesize\np{E-11}} +\newenvironment{mytable}[1]{% #1: number of columns for data + \renewcommand{\arraystretch}{1.3}% + \begin{tabular}{|>{\bfseries}r% + |*{#1}{>{\centering\arraybackslash}p{\mytablew}|}}}{% + \end{tabular}} + +\begin{table}[!t] + \centering + \caption{$2$ clusters, each with $50$ nodes} + \label{tab.cluster.2x50} + \begin{mytable}{6} + \hline + bw + & 5 & 5 & 5 & 5 & 5 & 50 \\ + \hline + lat + & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 \\ + \hline + power + & 1 & 1 & 1 & 1.5 & 1.5 & 1.5 \\ + \hline + size + & 62 & 62 & 62 & 100 & 100 & 110 \\ + \hline + Prec/Eprec + & \np{E-5} & \np{E-8} & \np{E-9} & \np{E-11} & \np{E-11} & \np{E-11} \\ + \hline + speedup + & 0.396 & 0.392 & 0.396 & 0.391 & 0.393 & 0.395 \\ + \hline + \end{mytable} + + \smallskip + + \begin{mytable}{6} + \hline + bw + & 50 & 50 & 50 & 50 & 10 & 10 \\ + \hline + lat + & 0.02 & 0.02 & 0.02 & 0.02 & 0.03 & 0.01 \\ + \hline + power + & 1.5 & 1.5 & 1.5 & 1.5 & 1 & 1.5 \\ + \hline + size + & 120 & 130 & 140 & 150 & 171 & 171 \\ + \hline + Prec/Eprec + & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-5} & \np{E-5} \\ + \hline + speedup + & 0.398 & 0.388 & 0.393 & 0.394 & 0.63 & 0.778 \\ + \hline + \end{mytable} +\end{table} + Then we have changed the network configuration using three clusters containing -respectively 33, 33 and 34 hosts, or again by on hundred hosts for all the +respectively $33$, $33$ and $34$ hosts, or again by on hundred hosts for all the clusters. In the same way as above, a judicious choice of key parameters has -permitted to get the results in Table~\ref{tab.cluster.3x33} which shows the speedups less than 1 with -a matrix size from 62 to 100 elements. +permitted to get the results in Table~\ref{tab.cluster.3x33} which shows the +speedups less than $1$ with a matrix size from $62$ to $100$ elements. + +\begin{table}[!t] + \centering + \caption{$3$ clusters, each with $33$ nodes} + \label{tab.cluster.3x33} + + \begin{mytable}{6} + \hline + bw + & 10 & 5 & 4 & 3 & 2 & 6 \\ + \hline + lat + & 0.01 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 \\ + \hline + power + & 1 & 1 & 1 & 1 & 1 & 1 \\ + \hline + size + & 62 & 100 & 100 & 100 & 100 & 171 \\ + \hline + Prec/Eprec + & \np{E-5} & \np{E-5} & \np{E-5} & \np{E-5} & \np{E-5} & \np{E-5} \\ + \hline + speedup + & 0.997 & 0.99 & 0.93 & 0.84 & 0.78 & 0.99 \\ + \hline + \end{mytable} +\end{table} In a final step, results of an execution attempt to scale up the three clustered -configuration but increasing by two hundreds hosts has been recorded in Table~\ref{tab.cluster.3x67}. +configuration but increasing by two hundreds hosts has been recorded in +Table~\ref{tab.cluster.3x67}. + +\begin{table}[!t] + \centering + \caption{3 clusters, each with 66 nodes} + \label{tab.cluster.3x67} + + \begin{mytable}{1} + \hline + bw & 1 \\ + \hline + lat & 0.02 \\ + \hline + power & 1 \\ + \hline + size & 62 \\ + \hline + Prec/Eprec & \np{E-5} \\ + \hline + speedup & 0.9 \\ + \hline + \end{mytable} +\end{table} Note that the program was run with the following parameters: @@ -319,92 +510,63 @@ lat latency, \dots{}). \item Description of the cluster architecture; \item Maximum number of internal and external iterations; \item Internal and external precisions; - \item Matrix size NX, NY and NZ; - \item Matrix diagonal value = 6.0; + \item Matrix size $N_x$, $N_y$ and $N_z$; + \item Matrix diagonal value: \np{6.0}; \item Execution Mode: synchronous or asynchronous. \end{itemize} -\begin{table} - \centering - \caption{2 clusters X 50 nodes} - \label{tab.cluster.2x50} - \AG{Ces tableaux (\ref{tab.cluster.2x50}, \ref{tab.cluster.3x33} et - \ref{tab.cluster.3x67}) sont affreux. Utiliser un format vectoriel (eps ou - pdf) ou, mieux, les réécrire en \LaTeX{}. Réécrire les légendes proprement - également (\texttt{\textbackslash{}times} au lieu de \texttt{X} par ex.)} - \includegraphics[width=209pt]{img1.jpg} -\end{table} - -\begin{table} - \centering - \caption{3 clusters X 33 nodes} - \label{tab.cluster.3x33} - \AG{Refaire le tableau.} - \includegraphics[width=209pt]{img2.jpg} -\end{table} - -\begin{table} - \centering - \caption{3 clusters X 67 nodes} - \label{tab.cluster.3x67} - \AG{Refaire le tableau.} -% \includegraphics[width=160pt]{img3.jpg} - \includegraphics[scale=0.5]{img3.jpg} -\end{table} - \paragraph*{Interpretations and comments} After analyzing the outputs, generally, for the configuration with two or three -clusters including one hundred hosts (Tables~\ref{tab.cluster.2x50} and~\ref{tab.cluster.3x33}), some combinations of the -used parameters affecting the results have given a speedup less than 1, showing -the effectiveness of the asynchronous performance compared to the synchronous -mode. - -In the case of a two clusters configuration, Table~\ref{tab.cluster.2x50} shows that with a -deterioration of inter cluster network set with \np[Mbits/s]{5} of bandwidth, a latency -in order of a hundredth of a millisecond and a system power of one GFlops, an -efficiency of about \np[\%]{40} in asynchronous mode is obtained for a matrix size of 62 -elements. It is noticed that the result remains stable even if we vary the -external precision from \np{E-5} to \np{E-9}. By increasing the problem size up to 100 -elements, it was necessary to increase the CPU power of \np[\%]{50} to \np[GFlops]{1.5} for a -convergence of the algorithm with the same order of asynchronous mode efficiency. -Maintaining such a system power but this time, increasing network throughput -inter cluster up to \np[Mbits/s]{50}, the result of efficiency of about \np[\%]{40} is -obtained with high external precision of \np{E-11} for a matrix size from 110 to 150 -side elements. - -For the 3 clusters architecture including a total of 100 hosts, Table~\ref{tab.cluster.3x33} shows -that it was difficult to have a combination which gives an efficiency of -asynchronous below \np[\%]{80}. Indeed, for a matrix size of 62 elements, equality -between the performance of the two modes (synchronous and asynchronous) is -achieved with an inter cluster of \np[Mbits/s]{10} and a latency of \np[ms]{E-1}. To -challenge an efficiency by \np[\%]{78} with a matrix size of 100 points, it was -necessary to degrade the inter cluster network bandwidth from 5 to 2 Mbit/s. +clusters including one hundred hosts (Tables~\ref{tab.cluster.2x50} +and~\ref{tab.cluster.3x33}), some combinations of the used parameters affecting +the results have given a speedup less than 1, showing the effectiveness of the +asynchronous performance compared to the synchronous mode. + +In the case of a two clusters configuration, Table~\ref{tab.cluster.2x50} shows +that with a deterioration of inter cluster network set with \np[Mbit/s]{5} of +bandwidth, a latency in order of a hundredth of a millisecond and a system power +of one GFlops, an efficiency of about \np[\%]{40} in asynchronous mode is +obtained for a matrix size of 62 elements. It is noticed that the result remains +stable even if we vary the external precision from \np{E-5} to \np{E-9}. By +increasing the problem size up to $100$ elements, it was necessary to increase the +CPU power of \np[\%]{50} to \np[GFlops]{1.5} for a convergence of the algorithm +with the same order of asynchronous mode efficiency. Maintaining such a system +power but this time, increasing network throughput inter cluster up to +\np[Mbit/s]{50}, the result of efficiency of about \np[\%]{40} is obtained with +high external precision of \np{E-11} for a matrix size from $110$ to $150$ side +elements. + +For the $3$ clusters architecture including a total of 100 hosts, +Table~\ref{tab.cluster.3x33} shows that it was difficult to have a combination +which gives an efficiency of asynchronous below \np[\%]{80}. Indeed, for a +matrix size of $62$ elements, equality between the performance of the two modes +(synchronous and asynchronous) is achieved with an inter cluster of +\np[Mbit/s]{10} and a latency of \np[ms]{E-1}. To challenge an efficiency by +\np[\%]{78} with a matrix size of $100$ points, it was necessary to degrade the +inter cluster network bandwidth from 5 to \np[Mbit/s]{2}. A last attempt was made for a configuration of three clusters but more powerful -with 200 nodes in total. The convergence with a speedup of \np[\%]{90} was obtained -with a bandwidth of \np[Mbits/s]{1} as shown in Table~\ref{tab.cluster.3x67}. +with 200 nodes in total. The convergence with a speedup of \np[\%]{90} was +obtained with a bandwidth of \np[Mbit/s]{1} as shown in +Table~\ref{tab.cluster.3x67}. \section{Conclusion} -CONCLUSION - The experimental results on executing a parallel iterative algorithm in asynchronous mode on an environment simulating a large scale of virtual computers organized with interconnected clusters have been presented. Our work has demonstrated that using such a simulation tool allow us to reach the following three objectives: -\newcounter{numberedCntD} \begin{enumerate} \item To have a flexible configurable execution platform resolving the hard exercise to access to very limited but so solicited physical resources; -\item to ensure the algorithm convergence with a raisonnable time and +\item to ensure the algorithm convergence with a reasonable time and iteration number ; \item and finally and more importantly, to find the correct combination of the cluster and network specifications permitting to save time in executing the algorithm in asynchronous mode. -\setcounter{numberedCntD}{\theenumi} \end{enumerate} Our results have shown that in certain conditions, asynchronous mode is speeder up to \np[\%]{40} than executing the algorithm in synchronous mode @@ -419,8 +581,8 @@ mode in a grid architecture. \section*{Acknowledgment} - -The authors would like to thank\dots{} +This work is partially funded by the Labex ACTION program (contract ANR-11-LABX-01-01). +\todo[inline]{The authors would like to thank\dots{}} % trigger a \newpage just before the given reference @@ -428,7 +590,7 @@ The authors would like to thank\dots{} % adjust value as needed - may need to be readjusted if % the document is modified later \bibliographystyle{IEEEtran} -\bibliography{hpccBib} +\bibliography{IEEEabrv,hpccBib} \end{document} @@ -438,3 +600,10 @@ The authors would like to thank\dots{} %%% fill-column: 80 %%% ispell-local-dictionary: "american" %%% End: + +% LocalWords: Ramamonjisoa Laiymani Arnaud Giersch Ziane Khodja Raphaël Femto +% LocalWords: Université Franche Comté IUT Montbéliard Maréchal Juin Inria Sud +% LocalWords: Ouest Vieille Talence cedex scalability experimentations HPC MPI +% LocalWords: Parallelization AIAC GMRES multi SMPI SISC SIAC SimDAG DAGs Lua +% LocalWords: Fortran GFlops priori Mbit de du fcomte multisplitting scalable +% LocalWords: SimGrid Belfort parallelize Labex ANR LABX IEEEabrv hpccBib