1 \documentclass[conference]{IEEEtran}
7 \usepackage[T1]{fontenc}
8 \usepackage[utf8]{inputenc}
9 \usepackage{amsfonts,amssymb}
11 \usepackage{algorithm}
12 \usepackage{algpseudocode}
15 \usepackage[american]{babel}
% Package for intra-document links (tagged PDF)
% and correct display of URLs (command \url{http://example.com})
18 %\usepackage{hyperref}
21 \DeclareUrlCommand\email{\urlstyle{same}}
23 \usepackage[autolanguage,np]{numprint}
\AtBeginDocument{%
  \renewcommand*\npunitcommand[1]{\text{#1}}
  \npthousandthpartsep{}}
29 \usepackage[textsize=footnotesize]{todonotes}
30 \newcommand{\AG}[2][inline]{%
31 \todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace}
32 \newcommand{\RC}[2][inline]{%
33 \todo[color=red!10,#1]{\sffamily\textbf{RC:} #2}\xspace}
34 \newcommand{\LZK}[2][inline]{%
35 \todo[color=blue!10,#1]{\sffamily\textbf{LZK:} #2}\xspace}
36 \newcommand{\RCE}[2][inline]{%
37 \todo[color=yellow!10,#1]{\sffamily\textbf{RCE:} #2}\xspace}
39 \algnewcommand\algorithmicinput{\textbf{Input:}}
40 \algnewcommand\Input{\item[\algorithmicinput]}
42 \algnewcommand\algorithmicoutput{\textbf{Output:}}
43 \algnewcommand\Output{\item[\algorithmicoutput]}
45 \newcommand{\MI}{\mathit{MaxIter}}
48 \usepackage{color, colortbl}
49 \newcolumntype{M}[1]{>{\centering\arraybackslash}m{#1}}
\newcolumntype{Z}[1]{>{\raggedleft\arraybackslash}m{#1}}
52 \newcolumntype{g}{>{\columncolor{Gray}}c}
53 \definecolor{Gray}{gray}{0.9}
\RCE{Title to be confirmed.}
59 \title{Comparative performance analysis of simulated grid-enabled numerical iterative algorithms}
63 Charles Emile Ramamonjisoa and
66 Lilia Ziane Khodja and
70 Femto-ST Institute - DISC Department\\
71 Université de Franche-Comté\\
73 Email: \email{{raphael.couturier,arnaud.giersch,david.laiymani,charles.ramamonjisoa}@univ-fcomte.fr}
Keywords: distributed iterative algorithms, asynchronous iterations, simulation, SimGrid, performance
87 \section{Introduction}
89 \section{The asynchronous iteration model}
93 \section{Simulation of the multisplitting method}
\section{Experimental Results and Comments}
\textbf{V.1. Study setup and methodology}
To conduct our study, we have put in place the following methodology,
which can be reused for any grid-enabled application.
\textbf{Step 1}: Choose with the end users the class of algorithms or
the application to be tested. Numerical parallel iterative algorithms
have been chosen for the study in this paper.
\textbf{Step 2}: Collect the software materials needed for the
experimentation. In our case, we have three variants of algorithms for
solving a 3D Poisson problem: (1) the classical GMRES
\textit{(Generalized Minimal RESidual)} method, called Algo-1 in this
paper, (2) the multisplitting method, called Algo-2, and (3) an
enhanced version of the multisplitting method, called Algo-3. In addition,
the SimGrid simulator has been chosen to simulate the behavior of the
distributed applications. SimGrid runs on the Mesocentre
datacenter of the University of Franche-Comte [10] but also in a virtual
machine.
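
For reference, the 3D Poisson problem mentioned in Step 2 is assumed
here to take its standard form (a sketch, since the equation and
boundary conditions are not restated in this section):
\[
\nabla^{2}\phi = f \quad \mbox{in } \Omega, \qquad
\phi = 0 \quad \mbox{on } \partial\Omega,
\]
discretized with the classical 7-point finite-difference stencil on an
$N_x \times N_y \times N_z$ mesh of step $h$:
\[
\frac{1}{h^{2}}\bigl(\phi_{i-1,j,k}+\phi_{i+1,j,k}
+\phi_{i,j-1,k}+\phi_{i,j+1,k}
+\phi_{i,j,k-1}+\phi_{i,j,k+1}-6\,\phi_{i,j,k}\bigr)=f_{i,j,k}.
\]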
\textbf{Step 3}: Fix the criteria which will be used for the future
results comparison and analysis. In the scope of this study, we retain
on the one hand the algorithm execution mode (synchronous and asynchronous)
and on the other hand the execution time and the number of iterations of
the application before convergence is obtained.
\textbf{Step 4}: Set up the different grid testbed environments
which will be simulated in the simulator tool to run the program. The
following architectures have been configured in SimGrid: 2x16 (that is, a
grid containing 2 clusters with 16 hosts (processors/cores) each), 4x8,
4x16, 8x8 and 2x50. The network has been designed to operate with a
bandwidth of 10 Gbit/s (resp. 1 Gbit/s) and a latency of 8E-6 s
(resp. 5E-5 s) for the intra-cluster links (resp. the
inter-cluster backbone links).
\textbf{Step 5}: Conduct extensive and comprehensive testing
within these configurations, varying the key parameters, especially
the CPU power capacity, the network parameters and also the size of the
input matrix. Note that some parameters, like some program input
arguments, should be kept invariant to allow the comparison.
\textbf{Step 6}: Collect and analyze the output results.
\textbf{V.2. Factors impacting distributed application performance in a grid environment}
From our previous experience in running distributed applications on a
computational grid, many factors have been identified as having an impact
on the program behavior and performance in this specific environment.
First of all, the architecture of the grid itself can obviously
influence the performance results of the program. Theoretically, the
performance gain might be significant when the number of clusters and/or
the number of nodes (processors/cores) in each individual cluster increases.
Another important factor impacting the overall performance of the
application is the network configuration. Two main network parameters
can drastically modify the program output results: (i) the network
bandwidth (bw, in bit/s), also known as "the data-carrying capacity"
[13] of the network, defined as the maximum amount of data that can pass
from one point to another in a unit of time; (ii) the network latency
(lat, in seconds), defined as the delay between the time a source starts
sending data and the time the destination has finished receiving it.
Beyond the network characteristics, another impacting factor is the
application-dependent volume of data exchanged between the nodes of a
cluster and between distant clusters, since a large volume of data has to
transit between the clusters and nodes during the code execution.
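
These two parameters combine into the usual first-order cost model for
point-to-point communication, which we sketch here for illustration (it
is an assumption on our part, not a formula taken from the simulator):
sending a message of $S$ bits over a link takes approximately
\[
T_{\mathit{comm}} \approx \mathit{lat} + \frac{S}{\mathit{bw}},
\]
so latency dominates for small messages while bandwidth dominates for
large ones.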
In a grid environment, it is common to distinguish, on the one hand, the
"intra-network", which refers to the links between nodes within a
cluster, and, on the other hand, the "inter-network", which is the
backbone link between clusters. By design, these two networks perform
at different speeds. The intra-network generally works like a high-speed
local network with a high bandwidth and a very low latency. In
contrast, the inter-network connects clusters, sometimes via heterogeneous
network components, through the Internet at a lower speed. The network
between distant clusters might thus be a bottleneck for the global
performance of the application.
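
As an illustration with the link parameters used in this section (and
assuming the first-order model above), transferring a message of 1~MB
($8\cdot10^{6}$ bits) takes about
\[
8\cdot10^{-6} + \frac{8\cdot10^{6}}{10^{10}} \approx 0.8\ \mathrm{ms}
\]
over an intra-cluster link (bw=10 Gbit/s, lat=8E-6 s), against about
\[
5\cdot10^{-5} + \frac{8\cdot10^{6}}{10^{9}} \approx 8.1\ \mathrm{ms}
\]
over the inter-cluster backbone (bw=1 Gbit/s, lat=5E-5 s), that is,
roughly an order of magnitude slower between clusters.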
\textbf{V.3. Comparing GMRES and multisplitting algorithms in synchronous mode}
In the scope of this paper, our first objective is to demonstrate that
Algo-2 (the multisplitting method) shows a better performance on a grid
architecture than Algo-1 (classical GMRES), both running in
\textbf{\textit{synchronous mode}}. Better algorithm performance
here means a lower number of iterations and a shorter execution time
before reaching convergence. For a systematic study, the experiments
should show that, for various grid parameter values, the
simulator confirms the targeted outcomes, particularly for poor and
slow networks, focusing on the impact of the communication performance
on the chosen class of algorithms [12].
The following paragraphs present the test conditions, the output results
and our comments.
\textit{3.a Executing the algorithms on various computational grid
architectures while scaling up the input matrix size}
\begin{tabular}{r c }
 \hline
 Grid & 2x16, 4x8, 4x16 and 8x8\\ %\hline
 Network & N2 : bw=1Gbit/s, lat=5E-05 s \\ %\hline
 Input matrix size & N$_{x}$ = 150 x 150 x 150 and\\ %\hline
 - & N$_{x}$ = 170 x 170 x 170 \\ \hline
\end{tabular}

Table 1: Clusters x Nodes with N$_{x}$=150 or N$_{x}$=170
\RCE{I wanted to include the data tables but I think it is unnecessary and would overload the paper}
\begin{wrapfigure}{l}{50mm}
\centering
\includegraphics[width=50mm]{Cluster x Nodes NX=150 and NX=170.jpg}
\caption{Cluster x Nodes, N$_{x}$=150 and N$_{x}$=170\label{fig:01}}
\end{wrapfigure}
The results in Figure 1 show that the number of iterations of classical
GMRES does not vary for a given input matrix size, which is not
the case for the multisplitting method. Except for the 8x8 grid, the
execution time difference between the two algorithms is significant when
comparing different grid architectures, even with the same total number of
processors (for example 2x16 and 4x8, both totaling 32 processors). The
experiment shows the low sensitivity of the multisplitting method
(compared with classical GMRES) when scaling up to larger input
matrix sizes.
\textit{3.b Running on various computational grid architectures}
\begin{tabular}{r c }
 \hline
 Grid & 2x16, 4x8\\ %\hline
 Network & N1 : bw=10Gbit/s, lat=8E-06 s \\ %\hline
 - & N2 : bw=1Gbit/s, lat=5E-05 s \\
 Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline
\end{tabular}
244 %Table 2 : Clusters x Nodes - Networks N1 x N2
%\RCE{same for all the data tables}
\begin{wrapfigure}{l}{50mm}
\centering
\includegraphics[width=50mm]{Cluster x Nodes N1 x N2.jpg}
\caption{Cluster x Nodes, networks N1 vs N2\label{fig:02}}
\end{wrapfigure}
The experiments compare the behavior of the algorithms running first on
a high-speed inter-cluster network (N1) and then on a less performant
network (N2). Figure 2 shows that end users would reduce the execution
time for both algorithms by using a grid architecture like 4x16 or 8x8:
the performance was increased by a factor of 2. The results also show that
when the network speed drops, the difference between the execution
times can reach more than 25\%.
\textit{3.c Network latency impact on performance}
\begin{tabular}{r c }
 \hline
 Grid & 2x16\\ %\hline
 Network & N1 : bw=1Gbit/s \\ %\hline
 Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline
\end{tabular}

Table 3: Network latency impact
\begin{wrapfigure}{l}{60mm}
\centering
\includegraphics[width=60mm]{Network latency impact on execution time.jpg}
\caption{Network latency impact on execution time\label{fig:03}}
\end{wrapfigure}
According to the results in Table 3 and Figure 3, a degradation of the
network latency from 8.10$^{-6}$ s to 6.10$^{-5}$ s implies an absolute
execution time increase of more than 75\% (resp. 82\%) for the classical
GMRES (resp. multisplitting) algorithm. In addition, it appears that the
multisplitting method tolerates the network latency variation better, with
a lower rate of increase. Consequently, in the worst case
(lat=6.10$^{-5}$ s), the execution time for GMRES is almost the double of
that of the multisplitting method, even though the performance was of the
same order of magnitude with a latency of 8.10$^{-6}$ s.
\textit{3.d Network bandwidth impact on performance}
\begin{tabular}{r c }
 \hline
 Grid & 2x16\\ %\hline
 Network & N1 : bw=1Gbit/s, lat=5E-05 s \\ %\hline
 Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline
\end{tabular}

Table 4: Network bandwidth impact
\begin{wrapfigure}{l}{60mm}
\centering
\includegraphics[width=60mm]{Network bandwith impact on execution time.jpg}
\caption{Network bandwidth impact on execution time\label{fig:04}}
\end{wrapfigure}
The results show that increasing the network bandwidth improves the
performance by reducing the execution time for both algorithms. However,
here again the multisplitting method presents a better performance in the
considered bandwidth interval, with a gain of 40\%, compared with only
around 24\% for classical GMRES.
\textit{3.e Input matrix size impact on performance}
\begin{tabular}{r c }
 \hline
 Network & N2 : bw=1Gbit/s, lat=5E-05 s \\ %\hline
 Input matrix size & N$_{x}$ = from 40 to 200\\ \hline
\end{tabular}

Table 5: Input matrix size impact
\begin{wrapfigure}{l}{60mm}
\centering
\includegraphics[width=60mm]{Pb size impact on execution time.jpg}
\caption{Problem size impact on execution time\label{fig:05}}
\end{wrapfigure}
In this experiment, the input matrix size has been set from
Nx=Ny=Nz=40 to 200 side elements, that is from 40$^{3}$ = \np{64000} to
200$^{3}$ = \np{8000000} points. Obviously, as shown in Figure 5,
the execution time needed for convergence increases with the
input matrix size for both algorithms. But the interesting results here
are (i) the drastic increase (300 times) of the number of iterations
needed to reach convergence for the classical GMRES algorithm when the
matrix size goes beyond Nx=150, and (ii) the classical GMRES execution
time, which from Nx=140 is almost the double of the convergence time of
the multisplitting method. These findings may greatly help end users to
set up the best and optimal targeted environment for the application
deployment when focusing on the problem size scale-up. Note that the
same test has been done with the grid 2x16, yielding the same conclusion.
356 \textit{3.f CPU Power impact on performance}
\begin{tabular}{r c }
 \hline
 Grid & 2x16\\ %\hline
 Network & N2 : bw=1Gbit/s, lat=5E-05 s \\ %\hline
 Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline
\end{tabular}

Table 6: CPU power impact
\begin{wrapfigure}{l}{60mm}
\centering
\includegraphics[width=60mm]{CPU Power impact on execution time.jpg}
\caption{CPU power impact on execution time\label{fig:06}}
\end{wrapfigure}
Using the SimGrid simulator flexibility, we have tried to determine the
impact of the CPU power of the cluster nodes on the algorithm performance,
varying this power from 1 to 19 GFlops. The outputs depicted in Figure 6
confirm a performance gain of around 95\% for both methods when using
more powerful CPUs. Note that the execution time axis of the figure is on
a logarithmic scale.
\textbf{V.4. Comparing GMRES in native synchronous mode with the
multisplitting algorithm in asynchronous mode}
The previous paragraphs put in evidence the interest of simulating the
behavior of the application before any deployment in a real environment.
That study focused on analyzing the performance while varying the
key factors impacting the results. Along the same line, it compared
the performance of the two proposed methods in \textbf{synchronous
mode}. In this section, with the same methodology, the goal is to
demonstrate the efficiency of the multisplitting method in
\textbf{asynchronous mode} compared with classical GMRES staying in
synchronous mode.
Note that the main interest of the asynchronous mode for data exchange
is, in contrast with the synchronous mode, that the current computation
does not wait for a communication operation, such as sending
data between nodes, to complete. Each processor can continue its local
calculation without waiting for the end of the communication. Thus, the
asynchronous mode may theoretically reduce the overall execution time and
improve the algorithm performance.
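
To illustrate this principle, the following is a minimal C/MPI sketch of
an asynchronous exchange loop (our own illustration, not the
implementation evaluated in this paper; \texttt{local\_update},
\texttt{local\_convergence} and the buffer sizes are hypothetical
placeholders):

\begin{verbatim}
/* Minimal sketch of an asynchronous iteration loop (illustration
 * only, not the implementation evaluated in this paper).  The
 * non-blocking calls let each process keep computing with the
 * freshest neighbor data available instead of waiting. */
#include <mpi.h>
#include <string.h>

#define N_LOCAL 1000          /* hypothetical local problem size */

/* placeholder relaxation standing in for the real solver kernel */
static void local_update(double *x, const double *halo)
{
    for (int i = 0; i < N_LOCAL; i++)
        x[i] = 0.5 * (x[i] + halo[i]);
}

/* placeholder local test; a real code would combine residuals */
static int local_convergence(const double *x)
{
    return x[0] < 1e-9;
}

void async_iterate(double *x, double *halo, double *recvbuf,
                   double *sendbuf, int left, int right)
{
    MPI_Request sreq = MPI_REQUEST_NULL, rreq;
    int done = 0, flag;

    MPI_Irecv(recvbuf, N_LOCAL, MPI_DOUBLE, left, 0,
              MPI_COMM_WORLD, &rreq);
    while (!done) {
        /* if fresh neighbor data arrived, keep a copy and repost
         * the receive; otherwise reuse the last received values */
        MPI_Test(&rreq, &flag, MPI_STATUS_IGNORE);
        if (flag) {
            memcpy(halo, recvbuf, N_LOCAL * sizeof(double));
            MPI_Irecv(recvbuf, N_LOCAL, MPI_DOUBLE, left, 0,
                      MPI_COMM_WORLD, &rreq);
        }

        local_update(x, halo);        /* compute without waiting */

        /* post a new send only once the previous one completed */
        MPI_Test(&sreq, &flag, MPI_STATUS_IGNORE);
        if (flag) {
            memcpy(sendbuf, x, N_LOCAL * sizeof(double));
            MPI_Isend(sendbuf, N_LOCAL, MPI_DOUBLE, right, 0,
                      MPI_COMM_WORLD, &sreq);
        }

        done = local_convergence(x);  /* global test omitted */
    }
}
\end{verbatim}

In synchronous mode, the two \texttt{MPI\_Test} calls would become
blocking \texttt{MPI\_Wait} calls, forcing every processor to pause at
each iteration; removing this wait is where the theoretical gain of the
asynchronous mode comes from.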
As stated above, the SimGrid simulator tool has been used to prove the
efficiency of the multisplitting method in asynchronous mode and to find the
best combination of the grid resources (CPU, network, input matrix size,
\ldots) to get the highest "relative gain" in comparison with the
classical GMRES time.
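
Here the relative gain is taken as the ratio of the asynchronous
multisplitting execution time to the synchronous GMRES execution time
(an assumed convention, consistent with Table~\ref{tab.cluster.2x50},
where a value of about 0.4 corresponds to a convergence 2.5 times
faster):
\[
G_{\mathit{rel}} =
\frac{T^{\mathit{async}}_{\mathit{multisplitting}}}
     {T^{\mathit{sync}}_{\mathit{GMRES}}}.
\]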
The test conditions are summarized in the table below:

\begin{tabular}{r c }
 \hline
 Grid & 2x50 totaling 100 processors\\ %\hline
 Processors & 1 GFlops to 1.5 GFlops\\
 Intra-Network & bw=1.25 Gbit/s, lat=5E-05 s \\ %\hline
 Inter-Network & bw=5 Mbit/s, lat=2E-02 s\\
 Input matrix size & N$_{x}$ = from 62 to 150\\ %\hline
 Residual error precision & 10$^{-5}$ to 10$^{-9}$\\ \hline
\end{tabular}
Again, comprehensive and extensive tests have been conducted, varying the
CPU power and the network parameters (bandwidth and latency) in the
simulator tool with different problem sizes. The relative gains in favor
of the asynchronous multisplitting method have been captured after each
step of the test. Table~\ref{tab.cluster.2x50} below records the best grid
configurations, allowing the multisplitting method to converge in a time
more than 2.5 times lower than the classical GMRES execution and
convergence time. The finding of this experimentation is the tolerance of
the multisplitting method to the low-speed networks usually encountered
between distant clusters on the Internet.
437 % use the same column width for the following three tables
438 \newlength{\mytablew}\settowidth{\mytablew}{\footnotesize\np{E-11}}
\newenvironment{mytable}[1]{% #1: number of columns for data
\renewcommand{\arraystretch}{1.3}%
\begin{tabular}{|>{\bfseries}r%
|*{#1}{>{\centering\arraybackslash}p{\mytablew}|}}}{%
\end{tabular}}
\begin{table}[!t]
\centering
\caption{Relative gain of the multisplitting algorithm compared with
classical GMRES}
\label{tab.cluster.2x50}

\begin{mytable}{6}
\hline
bandwidth (Mbit/s)
& 5 & 5 & 5 & 5 & 5 & 50 \\
\hline
latency (s)
& 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 \\
\hline
power (GFlops)
& 1 & 1 & 1 & 1.5 & 1.5 & 1.5 \\
\hline
size (N$_x$)
& 62 & 62 & 62 & 100 & 100 & 110 \\
\hline
precision
& \np{E-5} & \np{E-8} & \np{E-9} & \np{E-11} & \np{E-11} & \np{E-11} \\
\hline
relative gain
& 0.396 & 0.392 & 0.396 & 0.391 & 0.393 & 0.395 \\
\hline
\end{mytable}

\bigskip

\begin{mytable}{6}
\hline
bandwidth (Mbit/s)
& 50 & 50 & 50 & 50 & 10 & 10 \\
\hline
latency (s)
& 0.02 & 0.02 & 0.02 & 0.02 & 0.03 & 0.01 \\
\hline
power (GFlops)
& 1.5 & 1.5 & 1.5 & 1.5 & 1 & 1.5 \\
\hline
size (N$_x$)
& 120 & 130 & 140 & 150 & 171 & 171 \\
\hline
precision
& \np{E-11} & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-5} & \np{E-5} \\
\hline
relative gain
& 0.398 & 0.388 & 0.393 & 0.394 & 0.63 & 0.778 \\
\hline
\end{mytable}
\end{table}
502 \section*{Acknowledgment}
505 The authors would like to thank\dots{}
508 % trigger a \newpage just before the given reference
509 % number - used to balance the columns on the last page
510 % adjust value as needed - may need to be readjusted if
511 % the document is modified later
512 \bibliographystyle{IEEEtran}
513 \bibliography{hpccBib}
521 %%% ispell-local-dictionary: "american"