\documentclass[times]{cpeauth}
%\usepackage[dvips,colorlinks,bookmarksopen,bookmarksnumbered,citecolor=red,urlcolor=red]{hyperref}
%\newcommand\BibTeX{{\rmfamily B\kern-.05em \textsc{i\kern-.025em b}\kern-.08em
%T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{amsfonts,amssymb}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage[american]{babel}
% Package for intra-document links (tagged PDF)
% and proper display of URLs (the \url{http://example.com} command)
%\usepackage{hyperref}
\DeclareUrlCommand\email{\urlstyle{same}}
\usepackage[autolanguage,np]{numprint}
\renewcommand*\npunitcommand[1]{\text{#1}}
\npthousandthpartsep{}
\usepackage[textsize=footnotesize]{todonotes}
\newcommand{\AG}[2][inline]{%
  \todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace}
\newcommand{\RC}[2][inline]{%
  \todo[color=red!10,#1]{\sffamily\textbf{RC:} #2}\xspace}
\newcommand{\LZK}[2][inline]{%
  \todo[color=blue!10,#1]{\sffamily\textbf{LZK:} #2}\xspace}
\newcommand{\RCE}[2][inline]{%
  \todo[color=yellow!10,#1]{\sffamily\textbf{RCE:} #2}\xspace}
\algnewcommand\algorithmicinput{\textbf{Input:}}
\algnewcommand\Input{\item[\algorithmicinput]}
\algnewcommand\algorithmicoutput{\textbf{Output:}}
\algnewcommand\Output{\item[\algorithmicoutput]}
\newcommand{\MI}{\mathit{MaxIter}}
\usepackage{color, colortbl}
\newcolumntype{M}[1]{>{\centering\arraybackslash}m{#1}}
\newcolumntype{Z}[1]{>{\raggedleft}m{#1}}
\newcolumntype{g}{>{\columncolor{Gray}}c}
\definecolor{Gray}{gray}{0.9}
\RCE{Title to be confirmed.}
\title{Comparative performance analysis of simulated grid-enabled numerical iterative algorithms}
%\itshape{\journalnamelc}\footnotemark[2]}
\author{Charles Emile Ramamonjisoa and
Lilia Ziane Khodja and
Femto-ST Institute - DISC Department\\
Université de Franche-Comté\\
Email: \email{{raphael.couturier,arnaud.giersch,david.laiymani,charles.ramamonjisoa}@univ-fcomte.fr}
\keywords{Algorithm; distributed; iterative; asynchronous; simulation; SimGrid; performance}
\section{Introduction}

\section{The asynchronous iteration model}

\section{Simulation of the multisplitting method}
\section{Experimental Results and Comments}
\subsection{Study setup and methodology}
To conduct our study, we have put in place the following methodology,
which can be reused with any grid-enabled application.
\textbf{Step 1}: Choose, together with the end users, the class of
algorithms or the application to be tested. Numerical parallel
iterative algorithms have been chosen for this study.
\textbf{Step 2}: Collect the software materials needed for the
experimentation. In our case, we have three algorithm variants for
solving a 3D Poisson problem: (1) the classical GMRES
\textit{(Generalized Minimal RESidual)} method, called Algo-1 in this
paper, (2) the multisplitting method, called Algo-2, and (3) an
enhanced version of the multisplitting method, called Algo-3. In
addition, the SimGrid simulator has been chosen to simulate the
behavior of the distributed applications. SimGrid runs on the
Mésocentre datacenter of the University of Franche-Comté [10], but also
on a virtual machine.
\textbf{Step 3}: Fix the criteria which will be used for the future
comparison and analysis of the results. Within the scope of this study,
we retain, on the one hand, the algorithm execution mode (synchronous
and asynchronous) and, on the other hand, the execution time and the
number of iterations of the application before reaching convergence.
\textbf{Step 4}: Set up the different grid testbed environments that
will be simulated in the simulator tool to run the program. The
following architectures have been configured in SimGrid: 2x16 (that is,
a grid of 2 clusters with 16 hosts (processors/cores) each), 4x8, 4x16,
8x8 and 2x50. The network has been designed to operate with a bandwidth
of 10 Gbit/s (resp. 1 Gbit/s) and a latency of $8 \times 10^{-6}$ s
(resp. $5 \times 10^{-5}$ s) for the intra-cluster links (resp. the
inter-cluster backbone links). A minimal sketch of how such a
configuration is driven in SimGrid is given after Step 6 below.
\textbf{Step 5}: Conduct extensive and comprehensive testing within
these configurations by varying the key parameters, especially the CPU
power capacity, the network parameters and the size of the input
matrix. Note that some parameters, like some program input arguments,
should be kept invariant to allow the comparison.
\textbf{Step 6}: Collect and analyze the output results.
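As an illustration of Step 4, a SimGrid simulation of this kind is
typically driven by a small C program that loads the platform and
deployment descriptions. The sketch below assumes SimGrid's (legacy)
MSG interface; the file names and the \texttt{solver} stub are
hypothetical placeholders, not the actual experimental code.

\begin{verbatim}
#include <simgrid/msg.h>

/* Stub for one simulated process; the real GMRES or
   multisplitting solver code would go here. */
static int solver(int argc, char *argv[]) { return 0; }

int main(int argc, char *argv[])
{
  MSG_init(&argc, argv);                    /* initialize the simulator */
  MSG_function_register("solver", solver);  /* name used in deployment.xml */
  MSG_create_environment("platform.xml");   /* grid: clusters, bw, lat */
  MSG_launch_application("deployment.xml"); /* map processes onto hosts */
  return MSG_main() == MSG_OK ? 0 : 1;      /* run the simulation */
}
\end{verbatim}

The platform file carries exactly the parameters listed in Step 4
(number of clusters, hosts per cluster, link bandwidths and latencies),
so moving to a new testbed only requires editing that description, not
the program.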
\subsection{Factors impacting distributed application performance in a grid environment}
From our previous experience in running distributed applications on a
computational grid, many factors have been identified as having an
impact on the program behavior and performance in this specific
environment. First of all, the architecture of the grid itself can
obviously influence the performance results of the program. The
performance gain might theoretically be important when the number of
clusters and/or the number of nodes (processors/cores) in each
individual cluster increases.
Another important factor impacting the overall performance of the
application is the network configuration. Two main network parameters
can drastically modify the program results: (i) the network bandwidth
(bw, in bit/s), also known as ``the data-carrying capacity'' [13] of
the network, defined as the maximum amount of data that can pass from
one point to another per unit of time; (ii) the network latency (lat,
in seconds), defined as the delay between the moment a source starts
sending data and the moment the destination has finished receiving it.
Besides the network characteristics, another impacting factor is the
application-dependent volume of data exchanged between the nodes of a
cluster and between distant clusters. Large volumes of data can be in
transit between the clusters and nodes during the code execution.
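To fix ideas, a common first-order model (which SimGrid's flow-level
network model also follows, up to contention and correction factors)
expresses the time to transfer a message of size $m$ over a link as
\begin{equation*}
T_{\mathrm{comm}}(m) \approx \mathit{lat} + \frac{m}{\mathit{bw}},
\end{equation*}
so the latency term dominates for the many small messages exchanged at
each iteration, while the bandwidth term dominates for large data
volumes.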
In a grid environment, it is common to distinguish, on the one hand,
the ``intra-network'', which refers to the links between nodes within a
cluster, and, on the other hand, the ``inter-network'', which is the
backbone link between clusters. By design, these two networks perform
at different speeds. The intra-network generally behaves like a
high-speed local network, with a high bandwidth and a very low latency.
On the contrary, the inter-network connects clusters, sometimes via
heterogeneous network components through the Internet, at a lower
speed. The network between distant clusters might be a bottleneck for
the global performance of the application.
\subsection{Comparing the GMRES and multisplitting algorithms in synchronous mode}
Within the scope of this paper, our first objective is to demonstrate
that Algo-2 (the multisplitting method) performs better on a grid
architecture than Algo-1 (the classical GMRES), both running in
\textbf{\textit{synchronous mode}}. Better algorithm performance here
means a lower number of iterations and a shorter execution time before
reaching convergence. For a systematic study, the experiments should
show that, for various values of the grid parameters, the simulator
confirms the targeted outcomes, particularly for poor and slow
networks, focusing on the impact of the communication performance on
the chosen class of algorithms [12].
The following paragraphs present the test conditions, the output
results and our comments.
\subsubsection{Executing the algorithms on various computational grid architectures, scaling up the input matrix size}
\begin{tabular}{r c }
Grid & 2x16, 4x8, 4x16 and 8x8\\ %\hline
Network & N2: bw=1 Gbit/s, lat=5E-05 s \\ %\hline
Input matrix size & $N_x \times N_y \times N_z = 150 \times 150 \times 150$ and\\
 & $N_x \times N_y \times N_z = 170 \times 170 \times 170$ \\ \hline
\end{tabular}

Table 1: Clusters x nodes with $N_x=150$ or $N_x=170$
\RCE{I wanted to include the data tables, but I think this is unnecessary and would clutter the paper.}
The results in Figure 1 show that the number of iterations of the
classical GMRES does not vary for a given input matrix size; this is
not the case for the multisplitting method.
%\begin{wrapfigure}{l}{60mm}
\begin{figure}[!t]
\centering
\includegraphics[width=60mm]{cluster_x_nodes_nx_150_and_nx_170.pdf}
\caption{Clusters x nodes with $N_x=150$ and $N_x=170$}
\end{figure}
Except for the 8x8 cluster, the execution time difference between the
two algorithms is important when comparing different grid
architectures, even with the same total number of processors (e.g. 2x16
and 4x8, i.e. 32 processors). The experiment shows the low sensitivity
of the multisplitting method (compared with the classical GMRES) when
scaling up to larger input matrix sizes.
\subsubsection{Running on various computational grid architectures}
\begin{tabular}{r c }
Grid & 2x16, 4x8\\ %\hline
Network & N1: bw=10 Gbit/s, lat=8E-06 s \\ %\hline
 & N2: bw=1 Gbit/s, lat=5E-05 s \\
Input matrix size & $N_x \times N_y \times N_z = 150 \times 150 \times 150$\\ \hline
\end{tabular}

Table 2: Clusters x nodes - networks N1 vs N2
%\RCE{same for all the data tables}
%\begin{wrapfigure}{l}{60mm}
\begin{figure}[!t]
\centering
\includegraphics[width=60mm]{cluster_x_nodes_n1_x_n2.pdf}
\caption{Clusters x nodes with networks N1 vs N2}
\end{figure}
The experiments compare the behavior of the algorithms running first on
a high-speed inter-cluster network (N1), then on a less performant
network (N2). Figure 2 shows that end users would benefit from using a
grid architecture like 4x16 or 8x8 to reduce the execution time of both
algorithms: the performance is increased by a factor of 2. The results
also show that, when the network speed drops, the difference between
the execution times can reach more than 25\%.
\subsubsection{Network latency impact on performance}
\begin{tabular}{r c }
Grid & 2x16\\ %\hline
Network & N1: bw=1 Gbit/s \\ %\hline
Input matrix size & $N_x \times N_y \times N_z = 150 \times 150 \times 150$\\ \hline
\end{tabular}

Table 3: Network latency impact
\begin{figure}[!t]
\centering
\includegraphics[width=60mm]{network_latency_impact_on_execution_time.pdf}
\caption{Network latency impact on execution time}
\end{figure}
According to the results in Table 3 and Figure 3, a degradation of the
network latency from $8\cdot10^{-6}$ to $6\cdot10^{-5}$ s implies an
absolute execution time increase of more than 75\% (resp. 82\%) for the
classical GMRES (resp. multisplitting) algorithm. In addition, it
appears that the multisplitting method tolerates the network latency
variation better, with a lower rate of increase. Consequently, in the
worst case (lat $= 6\cdot10^{-5}$ s), the execution time for GMRES is
almost the double of that of the multisplitting method, even though the
performance was of the same order of magnitude with a latency of
$8\cdot10^{-6}$ s.
\subsubsection{Network bandwidth impact on performance}
\begin{tabular}{r c }
Grid & 2x16\\ %\hline
Network & N1: bw=1 Gbit/s, lat=5E-05 s \\ %\hline
Input matrix size & $N_x \times N_y \times N_z = 150 \times 150 \times 150$\\ \hline
\end{tabular}

Table 4: Network bandwidth impact
\begin{figure}[!t]
\centering
\includegraphics[width=60mm]{network_bandwith_impact_on_execution_time.pdf}
\caption{Network bandwidth impact on execution time}
\end{figure}
The results show that increasing the network bandwidth improves the
performance by reducing the execution time of both algorithms. However,
here again, the multisplitting method performs better in the considered
bandwidth interval, with a gain of 40\%, against only about 24\% for
the classical GMRES.
\subsubsection{Input matrix size impact on performance}
\begin{tabular}{r c }
Network & N2: bw=1 Gbit/s, lat=5E-05 s \\ %\hline
Input matrix size & $N_x$ = from 40 to 200\\ \hline
\end{tabular}

Table 5: Input matrix size impact
\begin{figure}[!t]
\centering
\includegraphics[width=60mm]{pb_size_impact_on_execution_time.pdf}
\caption{Problem size impact on execution time}
\end{figure}
In this experiment, the input matrix size has been set from
$N_x = N_y = N_z = 40$ to 200 side elements, that is from
$40^3 = 64{,}000$ to $200^3 = 8{,}000{,}000$ points. Obviously, as
shown in Figure 5, the execution time needed to reach convergence
increases with the input matrix size. But the interesting results here
are (i) the drastic increase (300 times) of the number of iterations
needed before convergence for the classical GMRES algorithm when the
matrix size goes beyond $N_x = 150$, and (ii) the fact that, from
$N_x = 140$, the classical GMRES execution time is almost the double of
the convergence time of the multisplitting method. These findings may
greatly help end users to set up the best and optimal targeted
environment for the application deployment when focusing on the problem
size scale-up. Note that the same test has been done with the grid
2x16, leading to the same conclusion.
\subsubsection{CPU power impact on performance}
\begin{tabular}{r c }
Grid & 2x16\\ %\hline
Network & N2: bw=1 Gbit/s, lat=5E-05 s \\ %\hline
Input matrix size & $N_x \times N_y \times N_z = 150 \times 150 \times 150$\\ \hline
\end{tabular}

Table 6: CPU power impact
\begin{figure}[!t]
\centering
\includegraphics[width=60mm]{cpu_power_impact_on_execution_time.pdf}
\caption{CPU power impact on execution time}
\end{figure}
Taking advantage of the SimGrid simulator flexibility, we have tried to
determine the impact of the CPU power of the cluster nodes on the
algorithm performance, by varying it from 1 to 19 GFlops. The results
depicted in Figure 6 confirm the performance gain, around 95\% for both
methods, when using more powerful CPUs. Note that the execution time
axis in the figure uses a logarithmic scale.
\subsection{Comparing GMRES in native synchronous mode with the multisplitting algorithm in asynchronous mode}
The previous paragraphs have highlighted the interest of simulating the
behavior of the application before any deployment in a real
environment. We have focused the study on analyzing the performance
when varying the key factors impacting the results. Along the same
line, the study has compared the performance of the two proposed
methods in \textbf{synchronous mode}. In this section, with the same
methodology, the goal is to demonstrate the efficiency of the
multisplitting method in \textbf{asynchronous mode} compared with the
classical GMRES staying in \textbf{synchronous mode}.
Note that the main interest of using the asynchronous mode for data
exchanges, as opposed to the synchronous mode, is that the current
computation does not wait for the completion of a communication
operation such as sending some data between nodes. Each processor can
continue its local computation without waiting for the end of the
communication. Thus, the asynchronous mode may theoretically reduce the
overall execution time and improve the algorithm performance.
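As a minimal sketch of this difference, assuming an MPI implementation
with a single neighbor per process (the names \texttt{iterate} and
\texttt{local\_update} are illustrative, not the actual experimental
code), the two modes differ only in how the reception of the neighbor's
data is handled:

\begin{verbatim}
#include <mpi.h>

/* Hypothetical local solver step: updates x from the halo values
   and returns nonzero once the local residual is small enough. */
int local_update(double *x, const double *halo, int n);

void iterate(double *x, double *halo, int n, int neighbor, int async)
{
  MPI_Request recv_req = MPI_REQUEST_NULL;
  int converged = 0;

  while (!converged) {
    if (recv_req == MPI_REQUEST_NULL)   /* (re)arm the halo reception */
      MPI_Irecv(halo, n, MPI_DOUBLE, neighbor, 0,
                MPI_COMM_WORLD, &recv_req);

    if (async) {                        /* do not wait: keep computing
                                           with the latest data received */
      int done = 0;
      MPI_Test(&recv_req, &done, MPI_STATUS_IGNORE);
    } else {                            /* block until fresh data */
      MPI_Wait(&recv_req, MPI_STATUS_IGNORE);
    }
    /* both calls reset recv_req to MPI_REQUEST_NULL on completion;
       NB: a production code would double-buffer halo, since reading
       it while a reception is pending is not allowed by MPI */

    MPI_Send(x, n, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD);
    converged = local_update(x, halo, n);
  }
}
\end{verbatim}

In synchronous mode, every iteration pays the full communication delay;
in asynchronous mode, the processor never idles, at the price of
iterating on possibly outdated values (and of a more involved
convergence detection, not shown here).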
As stated above, the SimGrid simulator tool has been used to prove the
efficiency of the multisplitting method in asynchronous mode, and to
find the best combination of the grid resources (CPU, network, input
matrix size, \ldots) giving the highest ``relative gain'' in comparison
with the classical GMRES time.
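Here the relative gain is understood as the ratio of the execution time
of the classical GMRES in synchronous mode to that of the
multisplitting method in asynchronous mode (this definition is our
reading of the metric; a value greater than 1 means that the
multisplitting method is faster):
\begin{equation*}
\mathrm{gain} =
  \frac{T_{\mathrm{GMRES}}^{\mathrm{sync}}}
       {T_{\mathrm{multisplitting}}^{\mathrm{async}}}.
\end{equation*}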
The test conditions are summarized in the table below:
\begin{tabular}{r c }
Grid & 2x50 totaling 100 processors\\ %\hline
Processors & 1 GFlops to 1.5 GFlops\\
Intra-network & bw=1.25 Gbit/s, lat=5E-05 s \\ %\hline
Inter-network & bw=5 Mbit/s, lat=2E-02 s\\
Input matrix size & $N_x$ = from 62 to 150\\ %\hline
Residual error precision & $10^{-5}$ to $10^{-9}$\\ \hline
\end{tabular}
Again, comprehensive and extensive tests have been conducted, varying
the CPU power and the network parameters (bandwidth and latency) in the
simulator tool, together with different problem sizes. The relative
gains greater than 1 between the two algorithms have been recorded
after each step of the test. Table~\ref{tab.cluster.2x50} below records
the best grid configurations, for which the execution and convergence
time of the multisplitting method is more than 2.5 times lower than
that of the classical GMRES. The main finding of this experimentation
is the tolerance of the multisplitting method to the low-speed networks
usually encountered between distant clusters through the Internet.
% use the same column width for the following three tables
\newlength{\mytablew}\settowidth{\mytablew}{\footnotesize\np{E-11}}
\newenvironment{mytable}[1]{% #1: number of columns for data
  \renewcommand{\arraystretch}{1.3}%
  \begin{tabular}{|>{\bfseries}r%
    |*{#1}{>{\centering\arraybackslash}p{\mytablew}|}}}{%
\caption{Relative gain of the multisplitting algorithm compared with
the classical GMRES algorithm}
\label{tab.cluster.2x50}
& 5 & 5 & 5 & 5 & 5 & 50 \\
& 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 \\
& 1 & 1 & 1 & 1.5 & 1.5 & 1.5 \\
& 62 & 62 & 62 & 100 & 100 & 110 \\
& \np{E-5} & \np{E-8} & \np{E-9} & \np{E-11} & \np{E-11} & \np{E-11} \\
& 0.396 & 0.392 & 0.396 & 0.391 & 0.393 & 0.395 \\

& 50 & 50 & 50 & 50 & 10 & 10 \\
& 0.02 & 0.02 & 0.02 & 0.02 & 0.03 & 0.01 \\
& 1.5 & 1.5 & 1.5 & 1.5 & 1 & 1.5 \\
& 120 & 130 & 140 & 150 & 171 & 171 \\
& \np{E-11} & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-5} & \np{E-5} \\
& 0.398 & 0.388 & 0.393 & 0.394 & 0.63 & 0.778 \\
\section*{Acknowledgment}

The authors would like to thank\dots{}

% trigger a \newpage just before the given reference
% number - used to balance the columns on the last page
% adjust value as needed - may need to be readjusted if
% the document is modified later
\bibliographystyle{IEEEtran}
\bibliography{hpccBib}

%%% ispell-local-dictionary: "american"