From 8af7de742ea7a8327a7d2fb131d94fca23d31c28 Mon Sep 17 00:00:00 2001
From: RCE
Date: Tue, 8 Apr 2014 23:50:44 +0200
Subject: [PATCH] Section V: Experimental results. 1) Wrote the English version
 of the section. 2) Tables to be added later.

---
 hpcc.tex | 132 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 132 insertions(+)

diff --git a/hpcc.tex b/hpcc.tex
index 5fbeca1..c457958 100644
--- a/hpcc.tex
+++ b/hpcc.tex
@@ -433,6 +433,138 @@ Describe the problem (algorithm) treated as well as the process of adapting it to SimGrid.
 \section{Experimental results}
+{\raggedright
+When the ``real'' application runs in the simulation environment and produces
+the expected results, varying the input parameters and the program arguments
+allows us to compare the outputs of different executions. This study showed
+that the results depend on the following parameters: (1) at the network
+level, the most critical values are the bandwidth (bw) and the latency (lat);
+(2) the computing power of the hosts (in GFlops) also influences the results;
+and (3) when submitting job batches for execution, the argument values passed
+to the program, such as the maximum number of iterations or the ``external''
+precision, are critical not only to ensure the convergence of the algorithm
+but also to reach the main objective of the experiments: an execution time in
+asynchronous mode shorter than in synchronous mode, in other words a
+``speedup'' below 1, the speedup being taken here as the ratio of the
+asynchronous execution time to the synchronous execution time.
+}
+
+{\raggedright
+A priori, obtaining a speedup below 1 would be difficult in a local-area
+network configuration, where the synchronous mode benefits from the fast
+exchange of information over high-speed links. The methodology adopted was
+therefore to run the application on a clustered network.
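For reference, the criterion underlying all the comparisons in this section can be stated compactly. This is our restatement, not a formula from the experimental code, and it assumes the speedup is taken as the ratio of asynchronous to synchronous execution time, so that values below 1 favour the asynchronous mode:

```latex
+Writing $T_{\mathrm{sync}}$ and $T_{\mathrm{async}}$ for the execution times
+of the two modes on the same problem instance, the goal of the experiments
+is to find parameter combinations such that
+\begin{equation}
+  S = \frac{T_{\mathrm{async}}}{T_{\mathrm{sync}}} < 1 .
+\end{equation}
```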
In this last configuration, degrading the inter-cluster network performance
+``penalizes'' the synchronous mode, allowing us to obtain a speedup below 1.
+This setting simulates the case of clusters linked by a long-distance
+network such as the Internet.
+}
+
+{\raggedright
+As a first step, the algorithm was run on a network consisting of two
+clusters of fifty hosts each, one hundred hosts in total. Various
+combinations of the above factors provided the results shown in Table 1,
+with a matrix size ranging from Nx = Ny = Nz = 62 to 171 elements per side,
+i.e. from $62^{3} = 238{,}328$ to $171^{3} = 5{,}000{,}211$ entries.
+}
+
+{\raggedright
+Then the network configuration was changed to three clusters containing 33,
+33 and 34 hosts respectively, again one hundred hosts in total. In the same
+way as above, a judicious choice of the key parameters yielded the results
+in Table 2, which shows speedups below 1 for matrix sizes from 62 to 100
+elements.
+}
+
+{\raggedright
+In a final step, the three-cluster configuration was scaled up to two
+hundred hosts in total; the results of this execution are recorded in
+Table 3.
+}
+
+{\raggedright
+Note that the program was run with the following parameters:
+}
+
+%{\raggedright
+\textbullet{} \textbf{SMPI parameters:}
+%}
+
+\begin{itemize}
+  \item HOSTFILE: host description file.
+  \item PLATFORM: description file of the platform architecture: clusters
+(CPU power, ...), intra-cluster network, and inter-cluster network
+(bandwidth bw, latency lat, ...).
+\end{itemize}
+
+%{\raggedright
+\textbullet{} \textbf{Arguments of the program:}
+%}
+
+\begin{itemize}
+  \item Description of the cluster architecture;
+  \item Maximum number of internal and external iterations;
+  \item Internal and external precisions;
+  \item Matrix size NX, NY and NZ;
+  \item Matrix diagonal value = 6.0;
+  \item Execution mode: synchronous or asynchronous.
+\end{itemize}
+
+\textbf{Table 1}
+
+\textit{{\scriptsize 2 clusters $\times$ 50 nodes}}
+\includegraphics[width=209pt]{img-1.eps}
+
+\textbf{Table 2}
+
+\textit{{\scriptsize 3 clusters $\times$ 33 nodes}}
+\includegraphics[width=209pt]{img-1.eps}
+
+\textbf{Table 3}
+
+\textit{{\scriptsize 3 clusters $\times$ 67 nodes}}
+\includegraphics[width=128pt]{img-2.eps}
+
+{\raggedright
+\textbf{Interpretations and comments}
+}
+
+{\raggedright
+After analyzing the outputs, for the configurations with two or three
+clusters totaling one hundred hosts (Tables 1 and 2), some combinations of
+the parameters gave a speedup below 1, demonstrating the effectiveness of
+the asynchronous mode compared to the synchronous one.
+}
+
+{\raggedright
+In the two-cluster configuration, Table 1 shows that with the inter-cluster
+network degraded to a bandwidth of 5 Mbit/s, a latency on the order of a
+hundredth of a millisecond and a host power of one GFlops, an efficiency of
+about 40\% is obtained in asynchronous mode for a matrix size of 62
+elements. The result remains stable even when the external precision is
+varied from $10^{-5}$ to $10^{-9}$. When the problem size is increased to
+100 elements, it was necessary to raise the CPU power by 50\%, to
+1.5 GFlops, for the algorithm to converge with an asynchronous efficiency of
+the same order. Keeping that host power but increasing the inter-cluster
+network throughput to 50 Mbit/s, an efficiency of about 40\% is obtained
+with a high external precision of $10^{-11}$ for matrix sizes of 110 to 150
+elements per side.
+}
+
+{\raggedright
+For the three-cluster architecture totaling 100 hosts, Table 2 shows that it
+was difficult to find a combination giving an asynchronous efficiency below
+80\%.
Indeed, for a matrix size of 62 elements, parity between the
+performance of the two modes (synchronous and asynchronous) is reached with
+an inter-cluster bandwidth of 10 Mbit/s and a latency of $10^{-1}$ ms. To
+reach an efficiency of 78\% with a matrix size of 100 points, it was
+necessary to degrade the inter-cluster network bandwidth from 5 to 2 Mbit/s.
+}
+
+{\raggedright
+A last attempt was made with the three-cluster configuration but with more
+resources, 200 nodes in total. Convergence with a speedup of 0.9 was
+obtained with a bandwidth of 1 Mbit/s, as shown in Table 3.
+}
 
 \section{Conclusion}
-- 
2.39.5