1 %\documentclass[conference]{IEEEtran}
2 \documentclass[times]{cpeauth}
6 %\usepackage[dvips,colorlinks,bookmarksopen,bookmarksnumbered,citecolor=red,urlcolor=red]{hyperref}
8 %\newcommand\BibTeX{{\rmfamily B\kern-.05em \textsc{i\kern-.025em b}\kern-.08em
9 %T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
17 \usepackage[T1]{fontenc}
18 \usepackage[utf8]{inputenc}
19 \usepackage{amsfonts,amssymb}
21 \usepackage{algorithm}
22 \usepackage{algpseudocode}
25 \usepackage[american]{babel}
26 % Extension pour les liens intra-documents (tagged PDF)
27 % et l'affichage correct des URL (commande \url{http://example.com})
28 %\usepackage{hyperref}
31 \DeclareUrlCommand\email{\urlstyle{same}}
33 \usepackage[autolanguage,np]{numprint}
35 \renewcommand*\npunitcommand[1]{\text{#1}}
36 \npthousandthpartsep{}}
39 \usepackage[textsize=footnotesize]{todonotes}
41 \newcommand{\AG}[2][inline]{%
42 \todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace}
43 \newcommand{\RC}[2][inline]{%
44 \todo[color=red!10,#1]{\sffamily\textbf{RC:} #2}\xspace}
45 \newcommand{\LZK}[2][inline]{%
46 \todo[color=blue!10,#1]{\sffamily\textbf{LZK:} #2}\xspace}
47 \newcommand{\RCE}[2][inline]{%
48 \todo[color=yellow!10,#1]{\sffamily\textbf{RCE:} #2}\xspace}
50 \algnewcommand\algorithmicinput{\textbf{Input:}}
51 \algnewcommand\Input{\item[\algorithmicinput]}
53 \algnewcommand\algorithmicoutput{\textbf{Output:}}
54 \algnewcommand\Output{\item[\algorithmicoutput]}
56 \newcommand{\MI}{\mathit{MaxIter}}
59 \usepackage{color, colortbl}
60 \newcolumntype{M}[1]{>{\centering\arraybackslash}m{#1}}
61 \newcolumntype{Z}[1]{>{\raggedleft}m{#1}}
63 \newcolumntype{g}{>{\columncolor{Gray}}c}
64 \definecolor{Gray}{gray}{0.9}
69 \RCE{Titre a confirmer.}
70 \title{Comparative performance analysis of simulated grid-enabled numerical iterative algorithms}
71 %\itshape{\journalnamelc}\footnotemark[2]}
73 \author{ Charles Emile Ramamonjisoa and
76 Lilia Ziane Khodja and
82 Femto-ST Institute - DISC Department\\
83 Université de Franche-Comté\\
85 Email: \email{{raphael.couturier,arnaud.giersch,david.laiymani,charles.ramamonjisoa}@univ-fcomte.fr}
92 \keywords{Algorithm; distributed; iterative; asynchronous; simulation; simgrid; performance}
96 \section{Introduction}
98 \section{The asynchronous iteration model}
102 \section{Simulation of the multisplitting method}
104 \section{Experimental, Results and Comments}
107 \textbf{V.1. Setup study and Methodology}
109 To conduct our study, we have put in place the following methodology
110 which can be reused with any grid-enabled applications.
112 \textbf{Step 1} : Choose with the end users the class of algorithms or
113 the application to be tested. Numerical parallel iterative algorithms
114 have been chosen for the study in the paper.
116 \textbf{Step 2} : Collect the software materials needed for the
117 experimentation. In our case, we have three variants algorithms for the
118 resolution of three 3D-Poisson problem: (1) using the classical GMRES
119 \textit{(Generalized Minimal RESidual Method)} alias Algo-1 in this
120 paper, (2) using the multisplitting method alias Algo-2 and (3) an
121 enhanced version of the multisplitting method as Algo-3. In addition,
122 SIMGRID simulator has been chosen to simulate the behaviors of the
123 distributed applications. SIMGRID is running on the Mesocentre
124 datacenter in Franche-Comte University $[$10$]$ but also in a virtual
127 \textbf{Step 3} : Fix the criteria which will be used for the future
128 results comparison and analysis. In the scope of this study, we retain
129 in one hand the algorithm execution mode (synchronous and asynchronous)
130 and in the other hand the execution time and the number of iterations of
131 the application before obtaining the convergence.
133 \textbf{Step 4 }: Setup up the different grid testbeds environment
134 which will be simulated in the simulator tool to run the program. The
135 following architecture has been configured in Simgrid : 2x16 - that is a
136 grid containing 2 clusters with 16 hosts (processors/cores) each -, 4x8,
137 4x16, 8x8 and 2x50. The network has been designed to operate with a
138 bandwidth equals to 10Gbits (resp. 1Gbits/s) and a latency of 8E-6
139 microseconds (resp. 5E-5) for the intra-clusters links (resp.
140 inter-clusters backbone links).
142 \textbf{Step 5}: Process an extensive and comprehensive testings
143 within these configurations in varying the key parameters, especially
144 the CPU power capacity, the network parameters and also the size of the
145 input matrix. Note that some parameters should be invariant to allow the
146 comparison like some program input arguments.
148 \textbf{Step 6} : Collect and analyze the output results.
150 \textbf{ V.2. Factors impacting distributed applications performance in
153 From our previous experience on running distributed application in a
154 computational grid, many factors are identified to have an impact on the
155 program behavior and performance on this specific environment. Mainly,
156 first of all, the architecture of the grid itself can obviously
157 influence the performance results of the program. The performance gain
158 might be important theoretically when the number of clusters and/or the
159 number of nodes (processors/cores) in each individual cluster increase.
161 Another important factor impacting the overall performance of the
162 application is the network configuration. Two main network parameters
163 can modify drastically the program output results : (i) the network
164 bandwidth (bw=bits/s) also known as "the data-carrying capacity"
165 $[$13$]$ of the network is defined as the maximum of data that can pass
166 from one point to another in a unit of time. (ii) the network latency
167 (lat : microsecond) defined as the delay from the start time to send the
168 data from a source and the final time the destination have finished to
169 receive it. Upon the network characteristics, another impacting factor
170 is the application dependent volume of data exchanged between the nodes
171 in the cluster and between distant clusters. Large volume of data can be
172 transferred in transit between the clusters and nodes during the code
175 In a grid environment, it is common to distinguish in one hand, the
176 "\,intra-network" which refers to the links between nodes within a
177 cluster and in the other hand, the "\,inter-network" which is the
178 backbone link between clusters. By design, these two networks perform
179 with different speed. The intra-network generally works like a high
180 speed local network with a high bandwith and very low latency. In
181 opposite, the inter-network connects clusters sometime via heterogeneous
182 networks components thru internet with a lower speed. The network
183 between distant clusters might be a bottleneck for the global
184 performance of the application.
186 \textbf{V.3 Comparing GMRES and Multisplitting algorithms in
189 In the scope of this paper, our first objective is to demonstrate the
190 Algo-2 (Multisplitting method) shows a better performance in grid
191 architecture compared with Algo-1 (Classical GMRES) both running in
192 \textbf{\textit{synchronous mode}}. Better algorithm performance
193 should mean a less number of iterations output and a less execution time
194 before reaching the convergence. For a systematic study, the experiments
195 should figure out that, for various grid parameters values, the
196 simulator will confirm the targeted outcomes, particularly for poor and
197 slow networks, focusing on the impact on the communication performance
198 on the chosen class of algorithm $[$12$]$.
200 The following paragraphs present the test conditions, the output results
204 \textit{3.a Executing the algorithms on various computational grid
205 architecture scaling up the input matrix size}
210 \begin{tabular}{r c }
212 Grid & 2x16, 4x8, 4x16 and 8x8\\ %\hline
213 Network & N2 : bw=1Gbs-lat=5E-05 \\ %\hline
214 Input matrix size & N$_{x}$ =150 x 150 x 150 and\\ %\hline
215 - & N$_{x}$ =170 x 170 x 170 \\ \hline
220 Table 1 : Clusters x Nodes with NX=150 or NX=170
222 \RCE{J'ai voulu mettre les tableaux des données mais je pense que c'est inutile et ça va surcharger}
225 The results in figure 1 show the non-variation of the number of
226 iterations of classical GMRES for a given input matrix size; it is not
227 the case for the multisplitting method.
229 %%\begin{wrapfigure}{l}{60mm}
232 \includegraphics[width=60mm]{Cluster_x_Nodes_NX=150_and_NX=170.jpg}
233 \caption{Cluster x Nodes NX=150 and NX=170 \label{overflow}}
238 Unless the 8x8 cluster, the time
239 execution difference between the two algorithms is important when
240 comparing between different grid architectures, even with the same number of
241 processors (like 2x16 and 4x8 = 32 processors for example). The
242 experiment concludes the low sensitivity of the multisplitting method
243 (compared with the classical GMRES) when scaling up to higher input
246 \textit{3.b Running on various computational grid architecture}
250 \begin{tabular}{r c }
252 Grid & 2x16, 4x8\\ %\hline
253 Network & N1 : bw=10Gbs-lat=8E-06 \\ %\hline
254 - & N2 : bw=1Gbs-lat=5E-05 \\
255 Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline \\
259 %Table 2 : Clusters x Nodes - Networks N1 x N2
260 %\RCE{idem pour tous les tableaux de donnees}
265 \includegraphics[width=60mm]{Cluster_x_Nodes_N1_x_N2.jpg}
266 \caption{Cluster x Nodes N1 x N2\label{overflow}}
269 The experiments compare the behavior of the algorithms running first on
270 speed inter- cluster network (N1) and a less performant network (N2).
271 The figure 2 shows that end users will gain to reduce the execution time
272 for both algorithms in using a grid architecture like 4x16 or 8x8: the
273 performance was increased in a factor of 2. The results depict also that
274 when the network speed drops down, the difference between the execution
275 times can reach more than 25\%.
277 \textit{\\\\\\\\\\\\\\\\\\3.c Network latency impacts on performance}
281 \begin{tabular}{r c }
283 Grid & 2x16\\ %\hline
284 Network & N1 : bw=1Gbs \\ %\hline
285 Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline\\
289 Table 3 : Network latency impact
294 \includegraphics[width=60mm]{Network_latency_impact_on_execution_time.jpg}
295 \caption{Network latency impact on execution time\label{overflow}}
299 According the results in table and figure 3, degradation of the network
300 latency from 8.10$^{-6}$ to 6.10$^{-5}$ implies an absolute time
301 increase more than 75\% (resp. 82\%) of the execution for the classical
302 GMRES (resp. multisplitting) algorithm. In addition, it appears that the
303 multisplitting method tolerates more the network latency variation with
304 a less rate increase. Consequently, in the worst case (lat=6.10$^{-5
305 }$), the execution time for GMRES is almost the double of the time for
306 the multisplitting, even though, the performance was on the same order
307 of magnitude with a latency of 8.10$^{-6}$.
309 \textit{3.d Network bandwidth impacts on performance}
313 \begin{tabular}{r c }
315 Grid & 2x16\\ %\hline
316 Network & N1 : bw=1Gbs - lat=5E-05 \\ %\hline
317 Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline
321 Table 4 : Network bandwidth impact
325 \includegraphics[width=60mm]{Network_bandwith_impact_on_execution_time.jpg}
326 \caption{Network bandwith impact on execution time\label{overflow}}
331 The results of increasing the network bandwidth depict the improvement
332 of the performance by reducing the execution time for both of the two
333 algorithms. However, and again in this case, the multisplitting method
334 presents a better performance in the considered bandwidth interval with
335 a gain of 40\% which is only around 24\% for classical GMRES.
337 \textit{3.e Input matrix size impacts on performance}
341 \begin{tabular}{r c }
344 Network & N2 : bw=1Gbs - lat=5E-05 \\ %\hline
345 Input matrix size & N$_{x}$ = From 40 to 200\\ \hline
349 Table 5 : Input matrix size impact
353 \includegraphics[width=60mm]{Pb_size_impact_on_execution_time.jpg}
354 \caption{Pb size impact on execution time\label{overflow}}
357 In this experimentation, the input matrix size has been set from
358 Nx=Ny=Nz=40 to 200 side elements that is from 40$^{3}$ = 64.000 to
359 200$^{3}$ = 8.000.000 points. Obviously, as shown in the figure 5,
360 the execution time for the algorithms convergence increases with the
361 input matrix size. But the interesting result here direct on (i) the
362 drastic increase (300 times) of the number of iterations needed before
363 the convergence for the classical GMRES algorithm when the matrix size
364 go beyond Nx=150; (ii) the classical GMRES execution time also almost
365 the double from Nx=140 compared with the convergence time of the
366 multisplitting method. These findings may help a lot end users to setup
367 the best and the optimal targeted environment for the application
368 deployment when focusing on the problem size scale up. Note that the
369 same test has been done with the grid 2x16 getting the same conclusion.
371 \textit{3.f CPU Power impact on performance}
375 \begin{tabular}{r c }
377 Grid & 2x16\\ %\hline
378 Network & N2 : bw=1Gbs - lat=5E-05 \\ %\hline
379 Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline
383 Table 6 : CPU Power impact
387 \includegraphics[width=60mm]{CPU_Power_impact_on_execution_time.jpg}
388 \caption{CPU Power impact on execution time\label{overflow}}
391 Using the SIMGRID simulator flexibility, we have tried to determine the
392 impact on the algorithms performance in varying the CPU power of the
393 clusters nodes from 1 to 19 GFlops. The outputs depicted in the figure 6
394 confirm the performance gain, around 95\% for both of the two methods,
395 after adding more powerful CPU. Note that the execution time axis in the
396 figure is in logarithmic scale.
398 \textbf{V.4 Comparing GMRES in native synchronous mode and
399 Multisplitting algorithms in asynchronous mode}
401 The previous paragraphs put in evidence the interests to simulate the
402 behavior of the application before any deployment in a real environment.
403 We have focused the study on analyzing the performance in varying the
404 key factors impacting the results. In the same line, the study compares
405 the performance of the two proposed methods in \textbf{synchronous mode
406 }. In this section, with the same previous methodology, the goal is to
407 demonstrate the efficiency of the multisplitting method in \textbf{
408 asynchronous mode} compare with the classical GMRES staying in the
411 Note that the interest of using the asynchronous mode for data exchange
412 is mainly, in opposite of the synchronous mode, the non-wait aspects of
413 the current computation after a communication operation like sending
414 some data between nodes. Each processor can continue their local
415 calculation without waiting for the end of the communication. Thus, the
416 asynchronous may theoretically reduce the overall execution time and can
417 improve the algorithm performance.
419 As stated supra, SIMGRID simulator tool has been used to prove the
420 efficiency of the multisplitting in asynchronous mode and to find the
421 best combination of the grid resources (CPU, Network, input matrix size,
422 \ldots ) to get the highest "\,relative gain" in comparison with the
423 classical GMRES time.
426 The test conditions are summarized in the table below :
430 \begin{tabular}{r c }
432 Grid & 2x50 totaling 100 processors\\ %\hline
433 Processors & 1 GFlops to 1.5 GFlops\\
434 Intra-Network & bw=1.25 Gbits - lat=5E-05 \\ %\hline
435 Inter-Network & bw=5 Mbits - lat=2E-02\\
436 Input matrix size & N$_{x}$ = From 62 to 150\\ %\hline
437 Residual error precision: 10$^{-5}$ to 10$^{-9}$\\ \hline
441 Again, comprehensive and extensive tests have been conducted varying the
442 CPU power and the network parameters (bandwidth and latency) in the
443 simulator tool with different problem size. The relative gains greater
444 than 1 between the two algorithms have been captured after each step of
445 the test. Table I below has recorded the best grid configurations
446 allowing a multiplitting method time more than 2.5 times lower than
447 classical GMRES execution and convergence time. The finding thru this
448 experimentation is the tolerance of the multisplitting method under a
449 low speed network that we encounter usually with distant clusters thru the
452 % use the same column width for the following three tables
453 \newlength{\mytablew}\settowidth{\mytablew}{\footnotesize\np{E-11}}
454 \newenvironment{mytable}[1]{% #1: number of columns for data
455 \renewcommand{\arraystretch}{1.3}%
456 \begin{tabular}{|>{\bfseries}r%
457 |*{#1}{>{\centering\arraybackslash}p{\mytablew}|}}}{%
462 \caption{Relative gain of the multisplitting algorithm compared with
464 \label{tab.cluster.2x50}
469 & 5 & 5 & 5 & 5 & 5 & 50 \\
472 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 \\
475 & 1 & 1 & 1 & 1.5 & 1.5 & 1.5 \\
478 & 62 & 62 & 62 & 100 & 100 & 110 \\
481 & \np{E-5} & \np{E-8} & \np{E-9} & \np{E-11} & \np{E-11} & \np{E-11} \\
484 & 0.396 & 0.392 & 0.396 & 0.391 & 0.393 & 0.395 \\
493 & 50 & 50 & 50 & 50 & 10 & 10 \\
496 & 0.02 & 0.02 & 0.02 & 0.02 & 0.03 & 0.01 \\
499 & 1.5 & 1.5 & 1.5 & 1.5 & 1 & 1.5 \\
502 & 120 & 130 & 140 & 150 & 171 & 171 \\
505 & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-5} & \np{E-5} \\
508 & 0.398 & 0.388 & 0.393 & 0.394 & 0.63 & 0.778 \\
517 \section*{Acknowledgment}
520 The authors would like to thank\dots{}
523 % trigger a \newpage just before the given reference
524 % number - used to balance the columns on the last page
525 % adjust value as needed - may need to be readjusted if
526 % the document is modified later
527 \bibliographystyle{IEEEtran}
528 \bibliography{hpccBib}
536 %%% ispell-local-dictionary: "american"