From: Arnaud Giersch
Date: Tue, 22 Apr 2014 13:05:55 +0000 (+0200)
Subject: Todo++, typos, and reindent.
X-Git-Tag: hpcc2014_submission~86
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/hpcc2014.git/commitdiff_plain/8e535991311d246854e7597c06cdee25b67c8419?ds=inline;hp=-c

Todo++, typos, and reindent.
--- 8e535991311d246854e7597c06cdee25b67c8419
diff --git a/hpcc.tex b/hpcc.tex
index c3a0b34..c9606c3 100644
--- a/hpcc.tex
+++ b/hpcc.tex
@@ -121,34 +121,48 @@ at that time. Even if the number of iterations required before the convergence i
 synchronous case, AIAC algorithms can significantly reduce overall execution times by suppressing idle times due to
 synchronizations especially in a grid computing context (see~\cite{Bahi07} for more details).
 
-Parallel numerical applications (synchronous or asynchronous) may have different configuration and deployment
-requirements. Quantifying their resource allocation policies and application scheduling algorithms in
-grid computing environments under varying load, CPU power and network speeds is very costly, very labor intensive and very time
-consuming~\cite{Calheiros:2011:CTM:1951445.1951450}. The case of AIAC algorithms is even more problematic since they are very sensible to the
-execution environment context. For instance, variations in the network bandwidth (intra and inter-clusters), in the
-number and the power of nodes, in the number of clusters... can lead to very different number of iterations and so to
-very different execution times. Then, it appears that the use of simulation tools to explore various platform
-scenarios and to run large numbers of experiments quickly can be very promising. In this way, the use of a simulation
-environment to execute parallel iterative algorithms found some interests in reducing the highly cost of access to
-computing resources: (1) for the applications development life cycle and in code debugging (2) and in production to get
-results in a reasonable execution time with a simulated infrastructure not accessible with physical resources. Indeed,
-the launch of distributed iterative asynchronous algorithms to solve a given problem on a large-scale simulated
-environment challenges to find optimal configurations giving the best results with a lowest residual error and in the
-best of execution time.
-
-To our knowledge, there is no existing work on the large-scale simulation of a real AIAC application. The aim of this
-paper is twofold. First we give a first approach of the simulation of AIAC algorithms using a simulation tool (i.e. the
-SimGrid toolkit~\cite{SimGrid}). Second, we confirm the effectiveness of asynchronous mode algorithms by comparing their
-performance with the synchronous mode. More precisely, we had implemented a program for solving large non-symmetric
-linear system of equations by numerical method GMRES (Generalized Minimal Residual) []. We show, that with minor
-modifications of the initial MPI code, the SimGrid toolkit allows us to perform a test campaign of a real AIAC
-application on different computing architectures. The simulated results we obtained are in line with real results
-exposed in ??\AG[]{??}. SimGrid had allowed us to launch the application from a modest computing infrastructure by simulating
-different distributed architectures composed by clusters nodes interconnected by variable speed networks.
-With selected parameters on the network platforms (bandwidth, latency of inter cluster network) and
-on the clusters architecture (number, capacity calculation power) in the simulated environment, the experimental results
-have demonstrated not only the algorithm convergence within a reasonable time compared with the physical environment
-performance, but also a time saving of up to \np[\%]{40} in asynchronous mode.
+Parallel numerical applications (synchronous or asynchronous) may have different
+configuration and deployment requirements. Quantifying their resource
+allocation policies and application scheduling algorithms in grid computing
+environments under varying load, CPU power and network speeds is very costly,
+very labor intensive and very time
+consuming~\cite{Calheiros:2011:CTM:1951445.1951450}. The case of AIAC
+algorithms is even more problematic since they are very sensible to the
+execution environment context. For instance, variations in the network bandwidth
+(intra and inter-clusters), in the number and the power of nodes, in the number
+of clusters\dots{} can lead to very different number of iterations and so to
+very different execution times. Then, it appears that the use of simulation
+tools to explore various platform scenarios and to run large numbers of
+experiments quickly can be very promising. In this way, the use of a simulation
+environment to execute parallel iterative algorithms found some interests in
+reducing the highly cost of access to computing resources: (1) for the
+applications development life cycle and in code debugging (2) and in production
+to get results in a reasonable execution time with a simulated infrastructure
+not accessible with physical resources. Indeed, the launch of distributed
+iterative asynchronous algorithms to solve a given problem on a large-scale
+simulated environment challenges to find optimal configurations giving the best
+results with a lowest residual error and in the best of execution time.
+
+To our knowledge, there is no existing work on the large-scale simulation of a
+real AIAC application. The aim of this paper is twofold. First we give a first
+approach of the simulation of AIAC algorithms using a simulation tool (i.e. the
+SimGrid toolkit~\cite{SimGrid}). Second, we confirm the effectiveness of
+asynchronous mode algorithms by comparing their performance with the synchronous
+mode. More precisely, we had implemented a program for solving large
+non-symmetric linear system of equations by numerical method GMRES (Generalized
+Minimal Residual) []\AG[]{[]?}. We show, that with minor modifications of the
+initial MPI code, the SimGrid toolkit allows us to perform a test campaign of a
+real AIAC application on different computing architectures. The simulated
+results we obtained are in line with real results exposed in ??\AG[]{??}.
+SimGrid had allowed us to launch the application from a modest computing
+infrastructure by simulating different distributed architectures composed by
+clusters nodes interconnected by variable speed networks. With selected
+parameters on the network platforms (bandwidth, latency of inter cluster
+network) and on the clusters architecture (number, capacity calculation power)
+in the simulated environment, the experimental results have demonstrated not
+only the algorithm convergence within a reasonable time compared with the
+physical environment performance, but also a time saving of up to \np[\%]{40} in
+asynchronous mode.
 
 This article is structured as follows: after this introduction, the next section will give a brief description of
 iterative asynchronous model. Then, the simulation framework SimGrid is presented with the settings to create various
@@ -184,18 +198,25 @@ in a grid computing context.
 \end{figure}
 
 
-It is very challenging to develop efficient applications for large scale, heterogeneous and distributed platforms such
-as computing grids. Researchers and engineers have to develop techniques for maximizing application performance of these
-multi-cluster platforms, by redesigning the applications and/or by using novel algorithms that can account for the
-composite and heterogeneous nature of the platform. Unfortunately, the deployment of such applications on these very
-large scale systems is very costly, labor intensive and time consuming. In this context, it appears that the use of
-simulation tools to explore various platform scenarios at will and to run enormous numbers of experiments quickly can be
-very promising. Several works...
+It is very challenging to develop efficient applications for large scale,
+heterogeneous and distributed platforms such as computing grids. Researchers and
+engineers have to develop techniques for maximizing application performance of
+these multi-cluster platforms, by redesigning the applications and/or by using
+novel algorithms that can account for the composite and heterogeneous nature of
+the platform. Unfortunately, the deployment of such applications on these very
+large scale systems is very costly, labor intensive and time consuming. In this
+context, it appears that the use of simulation tools to explore various platform
+scenarios at will and to run enormous numbers of experiments quickly can be very
+promising. Several works\dots{}
 
-In the context of AIAC algorithms, the use of simulation tools is even more relevant. Indeed, this class of applications
-is very sensible to the execution environment context. For instance, variations in the network bandwidth (intra and
-inter-clusters), in the number and the power of nodes, in the number of clusters... can lead to very different number of
-iterations and so to very different execution times.
+\AG{Several works\dots{} what?\\
+ Le paragraphe suivant se trouve déjà dans l'intro ?}
+In the context of AIAC algorithms, the use of simulation tools is even more
+relevant. Indeed, this class of applications is very sensible to the execution
+environment context. For instance, variations in the network bandwidth (intra
+and inter-clusters), in the number and the power of nodes, in the number of
+clusters\dots{} can lead to very different number of iterations and so to very
+different execution times.
 
 
 
@@ -340,25 +361,34 @@ where $\MI$ is the maximum number of outer iterations and $\epsilon$ is the tole
 
 \LZK{Description du processus d'adaptation de l'algo multisplitting à SimGrid}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-We did not encounter major blocking problems when adapting the multisplitting algorithm previously described to a simulation environment like SIMGRID unless some code
-debugging. Indeed, apart from the review of the program sequence for asynchronous exchanges between the six neighbors of each point in a submatrix within a cluster or
-between clusters, the algorithm was executed successfully with SMPI and provided identical outputs as those obtained with direct execution under MPI. In synchronous
-mode, the execution of the program raised no particular issue but in asynchronous mode, the review of the sequence of MPI\_Isend, MPI\_Irecv and MPI\_Waitall instructions
-and with the addition of the primitive MPI\_Test was needed to avoid a memory fault due to an infinite loop resulting from the non-convergence of the algorithm. Note here that the use of SMPI
-functions optimizer for memory footprint and CPU usage is not recommended knowing that one wants to get real results by simulation.
-As mentioned, upon this adaptation, the algorithm is executed as in the real life in the simulated environment after the following minor changes. First, all declared
-global variables have been moved to local variables for each subroutine. In fact, global variables generate side effects arising from the concurrent access of
-shared memory used by threads simulating each computing units in the SimGrid architecture. Second, the alignment of certain types of variables such as ``long int'' had
-also to be reviewed. Finally, some compilation errors on MPI\_Waitall and MPI\_Finalize primitives have been fixed with the latest version of SimGrid.
-In total, the initial MPI program running on the simulation environment SMPI gave after a very simple adaptation the same results as those obtained in a real
-environment. We have tested in synchronous mode with a simulated platform starting from a modest 2 or 3 clusters grid to a larger configuration like simulating
-Grid5000 with more than 1500 hosts with 5000 cores~\cite{bolze2006grid}. Once the code debugging and adaptation were complete, the next section shows our methodology and experimental
-results.
-
-
-
-
-
+We did not encounter major blocking problems when adapting the multisplitting
+algorithm previously described to a simulation environment like SimGrid unless
+some code debugging. Indeed, apart from the review of the program sequence for
+asynchronous exchanges between the six neighbors of each point in a submatrix
+within a cluster or between clusters, the algorithm was executed successfully
+with SMPI and provided identical outputs as those obtained with direct execution
+under MPI. In synchronous mode, the execution of the program raised no
+particular issue but in asynchronous mode, the review of the sequence of
+MPI\_Isend, MPI\_Irecv and MPI\_Waitall instructions and with the addition of
+the primitive MPI\_Test was needed to avoid a memory fault due to an infinite
+loop resulting from the non-convergence of the algorithm. Note here that the use
+of SMPI functions optimizer for memory footprint and CPU usage is not
+recommended knowing that one wants to get real results by simulation. As
+mentioned, upon this adaptation, the algorithm is executed as in the real life
+in the simulated environment after the following minor changes. First, all
+declared global variables have been moved to local variables for each
+subroutine. In fact, global variables generate side effects arising from the
+concurrent access of shared memory used by threads simulating each computing
+units in the SimGrid architecture. Second, the alignment of certain types of
+variables such as ``long int'' had also to be reviewed. Finally, some
+compilation errors on MPI\_Waitall and MPI\_Finalize primitives have been fixed
+with the latest version of SimGrid. In total, the initial MPI program running
+on the simulation environment SMPI gave after a very simple adaptation the same
+results as those obtained in a real environment. We have tested in synchronous
+mode with a simulated platform starting from a modest 2 or 3 clusters grid to a
+larger configuration like simulating Grid5000 with more than 1500 hosts with
+5000 cores~\cite{bolze2006grid}. Once the code debugging and adaptation were
+complete, the next section shows our methodology and experimental results.
 
 \section{Experimental results}
 
@@ -511,7 +541,7 @@ Table~\ref{tab.cluster.3x67}.
 \hline
 speedup & 0.9 \\
 \hline
- \end{mytable}
+ \end{mytable}
 \end{table}
 
 Note that the program was run with the following parameters: