X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/hpcc2014.git/blobdiff_plain/0aa640af7dcb1e33350b4ade113575ce81bb0d81..64cf966afb9c408d8ab199f12a27f45e1c5ed82a:/hpcc.tex diff --git a/hpcc.tex b/hpcc.tex index 2c7d65d..d4dc19e 100644 --- a/hpcc.tex +++ b/hpcc.tex @@ -40,11 +40,6 @@ \newcommand{\MI}{\mathit{MaxIter}} -\usepackage{array} -\usepackage{color, colortbl} -\newcolumntype{M}[1]{>{\centering\arraybackslash}m{#1}} -\newcolumntype{Z}[1]{>{\raggedleft}m{#1}} - \begin{document} \title{Simulation of Asynchronous Iterative Numerical Algorithms Using SimGrid} @@ -115,10 +110,10 @@ suggests, these algorithm solves a given problem by successive iterations ($X_{n $X_{0}$ to find an approximate value $X^*$ of the solution with a very low residual error. Several well-known methods demonstrate the convergence of these algorithms \cite{}. -Parallelization of such algorithms generally involved the division of the problem into several \emph{pieces} that will +Parallelization of such algorithms generally involved the division of the problem into several \emph{blocks} that will be solved in parallel on multiple processing units. The latter will communicate each intermediate results before a new -iteration starts until the approximate solution is reached. These parallel computations can be performed either in -\emph{synchronous} communication mode where a new iteration begin only when all nodes communications are completed, +iteration starts and until the approximate solution is reached. These parallel computations can be performed either in +\emph{synchronous} mode where a new iteration begin only when all nodes communications are completed, either \emph{asynchronous} mode where processors can continue independently without or few synchronization points. For instance in the \textit{Asynchronous Iterations - Asynchronous Communications (AIAC)} model \cite{bcvc06:ij}, local computations do not need to wait for required data. Processors can then perform their iterations with the data present @@ -127,13 +122,13 @@ synchronous case, AIAC algorithms can significantly reduce overall execution tim synchronizations especially in a grid computing context (see \cite{bcvc06:ij} for more details). Parallel numerical applications (synchronous or asynchronous) may have different configuration and deployment -requirements. Quantifying their performance of resource allocation policies and application scheduling algorithms in -grid computing environments under varying load, CPU power and network speeds is very costly, labor intensive and time +requirements. Quantifying their resource allocation policies and application scheduling algorithms in +grid computing environments under varying load, CPU power and network speeds is very costly, very labor intensive and very time consuming \cite{BuRaCa}. The case of AIAC algorithms is even more problematic since they are very sensible to the execution environment context. For instance, variations in the network bandwith (intra and inter- clusters), in the number and the power of nodes, in the number of clusters... can lead to very different number of iterations and so to -very different execution times. In this context, it appears that the use of simulation tools to explore various platform -scenarios and to run enormous numbers of experiments quickly can be very promising. In this way, the use of a simulation +very different execution times. Then, it appears that the use of simulation tools to explore various platform +scenarios and to run large numbers of experiments quickly can be very promising. In this way, the use of a simulation environment to execute parallel iterative algorithms found some interests in reducing the highly cost of access to computing resources: (1) for the applications development life cycle and in code debugging (2) and in production to get results in a reasonable execution time with a simulated infrastructure not accessible with physical resources. Indeed, @@ -142,15 +137,17 @@ environment challenges to find optimal configurations giving the best results wi best of execution time. To our knowledge, there is no existing work on the large-scale simulation of a real AIAC application. The aim of this -paper is to give a first approach of the simulation of AIAC algorithms using the SimGrid toolkit \cite{SimGrid}. We had -in the scope of this work implemented a program for solving large non-symmetric linear system of equations by numerical -method GMRES (Generalized Minimal Residual). SimGrid had allowed us to launch the application from a modest computing -infrastructure by simulating different distributed architectures composed by clusters nodes interconnected by variable -speed networks. The simulated results we obtained are in line with real results exposed in ?? In addition, it has been -permitted to show the effectiveness of asynchronous mode algorithm by comparing its performance with the synchronous -mode time. With selected parameters on the network platforms (bandwidth, latency of inter cluster network) and on the -clusters architecture (number, capacity calculation power) in the simulated environment, the experimental results have -demonstrated not only the algorithm convergence within a reasonable time compared with the physical environment +paper is twofold. First we give a first approach of the simulation of AIAC algorithms using a simulation tool (i.e. the +SimGrid toolkit \cite{SimGrid}). Second, we confirm the effectiveness of asynchronous mode algorithms by comparing their +performance with the synchronous mode. More precisely, we had implemented a program for solving large non-symmetric +linear system of equations by numerical method GMRES (Generalized Minimal Residual) []. We show, that with minor +modifications of the initial MPI code, the SimGrid toolkit allows us to perform a test campain of a real AIAC +application on different computing architectures. The simulated results we obtained are in line with real results +exposed in ??. SimGrid had allowed us to launch the application from a modest computing infrastructure by simulating +different distributed architectures composed by clusters nodes interconnected by variable speed networks. It has been +permitted to show With selected parameters on the network platforms (bandwidth, latency of inter cluster network) and +on the clusters architecture (number, capacity calculation power) in the simulated environment, the experimental results +have demonstrated not only the algorithm convergence within a reasonable time compared with the physical environment performance, but also a time saving of up to \np[\%]{40} in asynchronous mode. This article is structured as follows: after this introduction, the next section will give a brief description of @@ -208,16 +205,18 @@ iterations and so to very different execution times. SimGrid~\cite{casanova+legrand+quinson.2008.simgrid,SimGrid} is a simulation framework to sudy the behavior of large-scale distributed systems. As its name says, it emanates from the grid computing community, but is nowadays used to -study grids, clouds, HPC or peer-to-peer systems. -%- open source, developped since 1999, one of the major solution in the field -% +study grids, clouds, HPC or peer-to-peer systems. The early versions of SimGrid +date from 1999, but it's still actively developped and distributed as an open +source software. Today, it's one of the major generic tools in the field of +simulation for large-scale distributed systems. + SimGrid provides several programming interfaces: MSG to simulate Concurrent Sequential Processes, SimDAG to simulate DAGs of (parallel) tasks, and SMPI to run real applications written in MPI~\cite{MPI}. Apart from the native C interface, SimGrid provides bindings for the C++, Java, Lua and Ruby programming languages. The SMPI interface supports applications written in C or Fortran, -with little or no modifications. -%- implements most of MPI-2 \cite{ref} standard [CHECK] +with little or no modifications. SMPI implements about \np[\%]{80} of the MPI +2.0 standard~\cite{bedaride:hal-00919507}. %%% explain simulation %- simulated processes folded in one real process @@ -231,8 +230,6 @@ with little or no modifications. %%% validation + refs -\AG{Décrire SimGrid~\cite{casanova+legrand+quinson.2008.simgrid,SimGrid} (Arnaud)} - %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \section{Simulation of the multisplitting method} %Décrire le problème (algo) traité ainsi que le processus d'adaptation à SimGrid. @@ -333,31 +330,30 @@ where $\MI$ is the maximum number of outer iterations and $\epsilon$ is the tole \section{Experimental results} -When the \emph{real} application runs in the simulation environment and produces -the expected results, varying the input parameters and the program arguments -allows us to compare outputs from the code execution. We have noticed from this -study that the results depend on the following parameters: (1) at the network -level, we found that the most critical values are the bandwidth (bw) and the -network latency (lat). (2) Hosts power (GFlops) can also influence on the -results. And finally, (3) when submitting job batches for execution, the -arguments values passed to the program like the maximum number of iterations or -the \emph{external} precision are critical to ensure not only the convergence of the -algorithm but also to get the main objective of the experimentation of the -simulation in having an execution time in asynchronous less than in synchronous -mode, in others words, in having a \emph{speedup} less than 1 -({speedup}${}={}${execution time in synchronous mode}${}/{}${execution time in -asynchronous mode}). - -A priori, obtaining a speedup less than 1 would be difficult in a local area +When the \emph{real} application runs in the simulation environment and produces the expected results, varying the input +parameters and the program arguments allows us to compare outputs from the code execution. We have noticed from this +study that the results depend on the following parameters: +\begin{itemize} +\item At the network level, we found that +the most critical values are the bandwidth (bw) and the network latency (lat). +\item Hosts power (GFlops) can also +influence on the results. +\item Finally, when submitting job batches for execution, the arguments values passed to the +program like the maximum number of iterations or the \emph{external} precision are critical. They allow to ensure not +only the convergence of the algorithm but also to get the main objective of the experimentation of the simulation in +having an execution time in asynchronous less than in synchronous mode (i.e. speed-up less than $1$). +\end{itemize} + +A priori, obtaining a speedup less than $1$ would be difficult in a local area network configuration where the synchronous mode will take advantage on the rapid exchange of information on such high-speed links. Thus, the methodology adopted was to launch the application on clustered network. In this last configuration, degrading the inter-cluster network performance will \emph{penalize} the synchronous -mode allowing to get a speedup lower than 1. This action simulates the case of +mode allowing to get a speedup lower than $1$. This action simulates the case of clusters linked with long distance network like Internet. As a first step, the algorithm was run on a network consisting of two clusters -containing fifty hosts each, totaling one hundred hosts. Various combinations of +containing $50$ hosts each, totaling $100$ hosts. Various combinations of the above factors have providing the results shown in Table~\ref{tab.cluster.2x50} with a matrix size ranging from $N_x = N_y = N_z = 62 \text{ to } 171$ elements or from $62^{3} = \np{238328}$ to $171^{3} = @@ -365,48 +361,91 @@ Table~\ref{tab.cluster.2x50} with a matrix size ranging from $N_x = N_y = N_z = \begin{table}[!t] \centering - \caption{2 clusters, each with 50 nodes} + \caption{$2$ clusters, each with $50$ nodes} \label{tab.cluster.2x50} - - \tiny - -\begin{tabular}{|Z{0.55cm}|Z{0.25cm}|Z{0.25cm}|M{0.25cm}|Z{0.25cm}|M{0.25cm}|M{0.25cm}|M{0.25cm}|M{0.25cm}|M{0.25cm}|M{0.25cm}|M{0.25cm}|M{0.25cm}|M{0.25cm}|} - \hline - \bf bw & 5 &5 & 5 & 5 & 5 & 50 & 50 & 50 & 50 & 50 & 10 & 10\\ - \hline - \bf lat & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.03 & 0.01\\ - \hline - \bf power & 1 & 1 & 1 & 1.5 & 1.5 & 1.5 & 1.5 & 1.5 & 1.5 & 1.5 & 1 & 1.5\\ \hline \bf size & 62 & 62 & 62 & 100 & 100 & 110 & 120& 130 & 140 & 150 & 171 & 171\\ \hline - \bf Prec/Eprec & 10$^{-5}$ & 10$^{-8}$ & 10$^{-9}$ & 10$^{-11}$ & 10$^{-11}$ & 10$^{-11}$ & 10$^{-11}$ & 10$^{-11}$ & 10$^{-11}$ & 10$^{-11}$ & 10$^{-5}$ & 10$^{-5}$\\ \hline - \bf speedup & 0.396 & 0.392 & 0.396 & 0.391 & 0.393 & 0.395 & 0.398 & 0.388 & 0.393 & 0.394 & 0.63 & 0.778\\ \hline - \end{tabular} -\end{table} + \renewcommand{\arraystretch}{1.3} + + \begin{tabular}{|>{\bfseries}r|*{12}{c|}} + \hline + bw + & 5 & 5 & 5 & 5 & 5 & 50 \\ + \hline + lat + & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 \\ + \hline + power + & 1 & 1 & 1 & 1.5 & 1.5 & 1.5 \\ + \hline + size + & 62 & 62 & 62 & 100 & 100 & 110 \\ + \hline + Prec/Eprec + & \np{E-5} & \np{E-8} & \np{E-9} & \np{E-11} & \np{E-11} & \np{E-11} \\ + \hline + speedup + & 0.396 & 0.392 & 0.396 & 0.391 & 0.393 & 0.395 \\ + \hline + \end{tabular} + + \smallskip + + \begin{tabular}{|>{\bfseries}r|*{12}{c|}} + \hline + bw + & 50 & 50 & 50 & 50 & 10 & 10 \\ + \hline + lat + & 0.02 & 0.02 & 0.02 & 0.02 & 0.03 & 0.01 \\ + \hline + power + & 1.5 & 1.5 & 1.5 & 1.5 & 1 & 1.5 \\ + \hline + size + & 120 & 130 & 140 & 150 & 171 & 171 \\ + \hline + Prec/Eprec + & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-5} & \np{E-5} \\ + \hline + speedup + & 0.398 & 0.388 & 0.393 & 0.394 & 0.63 & 0.778 \\ + \hline + \end{tabular} +\end{table} Then we have changed the network configuration using three clusters containing -respectively 33, 33 and 34 hosts, or again by on hundred hosts for all the +respectively $33$, $33$ and $34$ hosts, or again by on hundred hosts for all the clusters. In the same way as above, a judicious choice of key parameters has permitted to get the results in Table~\ref{tab.cluster.3x33} which shows the -speedups less than 1 with a matrix size from 62 to 100 elements. +speedups less than $1$ with a matrix size from $62$ to $100$ elements. \begin{table}[!t] \centering - \caption{3 clusters, each with 33 nodes} + \caption{$3$ clusters, each with $33$ nodes} \label{tab.cluster.3x33} - - \tiny - -\begin{tabular}{|Z{0.55cm}|Z{0.25cm}|Z{0.25cm}|M{0.25cm}|Z{0.25cm}|M{0.25cm}|M{0.25cm}|} - \hline - \bf bw & 10 &5 & 4 & 3 & 2 & 6\\ \hline - \bf lat & 0.01 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02\\ - \hline - \bf power & 1 & 1 & 1 & 1 & 1 & 1\\ \hline - \bf size & 62 & 100 & 100 & 100 & 100 & 171\\ \hline - \bf Prec/Eprec & 10$^{-5}$ & 10$^{-5}$ & 10$^{-5}$ & 10$^{-5}$ & 10$^{-5}$ & 10$^{-5}$\\ \hline - \bf speedup & 0.997 & 0.99 & 0.93 & 0.84 & 0.78 & 0.99\\ - \hline - \end{tabular} -\end{table} + \renewcommand{\arraystretch}{1.3} + + \begin{tabular}{|>{\bfseries}r|*{6}{c|}} + \hline + bw + & 10 & 5 & 4 & 3 & 2 & 6 \\ + \hline + lat + & 0.01 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 \\ + \hline + power + & 1 & 1 & 1 & 1 & 1 & 1 \\ + \hline + size + & 62 & 100 & 100 & 100 & 100 & 171 \\ + \hline + Prec/Eprec + & \np{E-5} & \np{E-5} & \np{E-5} & \np{E-5} & \np{E-5} & \np{E-5} \\ + \hline + speedup + & 0.997 & 0.99 & 0.93 & 0.84 & 0.78 & 0.99 \\ + \hline + \end{tabular} +\end{table} In a final step, results of an execution attempt to scale up the three clustered @@ -417,23 +456,24 @@ Table~\ref{tab.cluster.3x67}. \centering \caption{3 clusters, each with 66 nodes} \label{tab.cluster.3x67} - - \tiny -\begin{tabular}{|M{0.55cm}|M{0.25cm}|} - \hline - \bf bw & 1\\ \hline - \bf lat & 0.02\\ - \hline - \bf power & 1\\ - \hline - \bf size & 62\\ - \hline - \bf Prec/Eprec & 10$^{-5}$\\ - \hline - \bf speedup & 0.9\\ - \hline + \renewcommand{\arraystretch}{1.3} + + \begin{tabular}{|>{\bfseries}r|c|} + \hline + bw & 1 \\ + \hline + lat & 0.02 \\ + \hline + power & 1 \\ + \hline + size & 62 \\ + \hline + Prec/Eprec & \np{E-5} \\ + \hline + speedup & 0.9 \\ + \hline \end{tabular} -\end{table} +\end{table} Note that the program was run with the following parameters: @@ -472,21 +512,21 @@ bandwidth, a latency in order of a hundredth of a millisecond and a system power of one GFlops, an efficiency of about \np[\%]{40} in asynchronous mode is obtained for a matrix size of 62 elements. It is noticed that the result remains stable even if we vary the external precision from \np{E-5} to \np{E-9}. By -increasing the problem size up to 100 elements, it was necessary to increase the +increasing the problem size up to $100$ elements, it was necessary to increase the CPU power of \np[\%]{50} to \np[GFlops]{1.5} for a convergence of the algorithm with the same order of asynchronous mode efficiency. Maintaining such a system power but this time, increasing network throughput inter cluster up to \np[Mbits/s]{50}, the result of efficiency of about \np[\%]{40} is obtained with -high external precision of \np{E-11} for a matrix size from 110 to 150 side +high external precision of \np{E-11} for a matrix size from $110$ to $150$ side elements. -For the 3 clusters architecture including a total of 100 hosts, +For the $3$ clusters architecture including a total of 100 hosts, Table~\ref{tab.cluster.3x33} shows that it was difficult to have a combination which gives an efficiency of asynchronous below \np[\%]{80}. Indeed, for a -matrix size of 62 elements, equality between the performance of the two modes +matrix size of $62$ elements, equality between the performance of the two modes (synchronous and asynchronous) is achieved with an inter cluster of \np[Mbits/s]{10} and a latency of \np[ms]{E-1}. To challenge an efficiency by -\np[\%]{78} with a matrix size of 100 points, it was necessary to degrade the +\np[\%]{78} with a matrix size of $100$ points, it was necessary to degrade the inter cluster network bandwidth from 5 to 2 Mbit/s. A last attempt was made for a configuration of three clusters but more powerful