paper.tex

   1 \documentclass[times]{cpeauth}
   2
   3 \usepackage{moreverb}
   4
   5 %\usepackage[dvips,colorlinks,bookmarksopen,bookmarksnumbered,citecolor=red,urlcolor=red]{hyperref}
   6
   7 %\newcommand\BibTeX{{\rmfamily B\kern-.05em \textsc{i\kern-.025em b}\kern-.08em
   8 %T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
   9
  10 \def\volumeyear{2015}
  11
  12 \usepackage{graphicx}
  13 \usepackage{wrapfig}
  14 \usepackage{grffile}
  15
  16 \usepackage[T1]{fontenc}
  17 \usepackage[utf8]{inputenc}
  18 \usepackage{amsfonts,amssymb}
  19 \usepackage{amsmath}
  20 \usepackage{algorithm}
  21 \usepackage{algpseudocode}
  22 %\usepackage{amsthm}
  23 \usepackage{graphicx}
  24 \usepackage[american]{babel}
  25 % Extension pour les liens intra-documents (tagged PDF)
  26 % et l'affichage correct des URL (commande \url{http://example.com})
  27 %\usepackage{hyperref}
  28
  29 \usepackage{url}
  30 \DeclareUrlCommand\email{\urlstyle{same}}
  31
  32 \usepackage[autolanguage,np]{numprint}
  33 \AtBeginDocument{%
  34   \renewcommand*\npunitcommand[1]{\text{#1}}
  35   \npthousandthpartsep{}}
  36
  37 \usepackage{xspace}
  38 \usepackage[textsize=footnotesize]{todonotes}
  39
  40 \newcommand{\AG}[2][inline]{%
  41   \todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace}
  42 \newcommand{\RC}[2][inline]{%
  43   \todo[color=red!10,#1]{\sffamily\textbf{RC:} #2}\xspace}
  44 \newcommand{\LZK}[2][inline]{%
  45   \todo[color=blue!10,#1]{\sffamily\textbf{LZK:} #2}\xspace}
  46 \newcommand{\RCE}[2][inline]{%
  47   \todo[color=yellow!10,#1]{\sffamily\textbf{RCE:} #2}\xspace}
  48
  49 \algnewcommand\algorithmicinput{\textbf{Input:}}
  50 \algnewcommand\Input{\item[\algorithmicinput]}
  51
  52 \algnewcommand\algorithmicoutput{\textbf{Output:}}
  53 \algnewcommand\Output{\item[\algorithmicoutput]}
  54
  55 \newcommand{\MI}{\mathit{MaxIter}}
  56
  57 \usepackage{array}
  58 \usepackage{color, colortbl}
  59 \newcolumntype{M}[1]{>{\centering\arraybackslash}m{#1}}
  60 \newcolumntype{Z}[1]{>{\raggedleft}m{#1}}
  61
  62 \newcolumntype{g}{>{\columncolor{Gray}}c}
  63 \definecolor{Gray}{gray}{0.9}
  64
  65
  66
  67 \begin{document}
  68 \RCE{Titre a confirmer.}
  69 \title{Comparative performance analysis of simulated grid-enabled numerical iterative algorithms}
  70 %\itshape{\journalnamelc}\footnotemark[2]}
  71
  72 \author{    Charles Emile Ramamonjisoa and
  73     David Laiymani and
  74     Arnaud Giersch and
  75     Lilia Ziane Khodja and
  76     Raphaël Couturier
  77 }
  78
  79 \address{
  80         \centering
  81     Femto-ST Institute - DISC Department\\
  82     Université de Franche-Comté\\
  83     Belfort\\
  84     Email: \email{{raphael.couturier,arnaud.giersch,david.laiymani,charles.ramamonjisoa}@univ-fcomte.fr}
  85 }
  86
  87 \begin{abstract}
  88 ABSTRACT
  89 \end{abstract}
  90
  91 \keywords{Algorithm; distributed; iterative; asynchronous; simulation; simgrid; performance}
  92
  93 \maketitle
  94
  95 \section{Introduction}
  96
  97 \section{The asynchronous iteration model}
  98
  99 \section{SimGrid}
 100
 101 \section{Simulation of the multisplitting method}
 102
 103 \section{Experimental, Results and Comments}
 104
 105
 106 \textbf{V.1. Setup study and Methodology}
 107
 108 To conduct our study, we have put in place the following methodology
 109 which can be reused with any grid-enabled applications.
 110
 111 \textbf{Step 1} : Choose with the end users the class of algorithms or
 112 the application to be tested. Numerical parallel iterative algorithms
 113 have been chosen for the study in the paper.
 114
 115 \textbf{Step 2} : Collect the software materials needed for the
 116 experimentation. In our case, we have three variants algorithms for the
 117 resolution of three 3D-Poisson problem: (1) using the classical GMRES
 118 \textit{(Generalized Minimal RESidual Method)} alias Algo-1 in this
 119 paper, (2) using the multisplitting method alias Algo-2 and (3) an
 120 enhanced version of the multisplitting method as Algo-3. In addition,
 121 SIMGRID simulator has been chosen to simulate the behaviors of the
 122 distributed applications. SIMGRID is running on the Mesocentre
 123 datacenter in Franche-Comte University $[$10$]$ but also in a virtual
 124 machine on a laptop.
 125
 126 \textbf{Step 3} : Fix the criteria which will be used for the future
 127 results comparison and analysis. In the scope of this study, we retain
 128 in one hand the algorithm execution mode (synchronous and asynchronous)
 129 and in the other hand the execution time and the number of iterations of
 130 the application before obtaining the convergence.
 131
 132 \textbf{Step 4 }: Setup up the different grid testbeds environment
 133 which will be simulated in the simulator tool to run the program. The
 134 following architecture has been configured in Simgrid : 2x16 - that is a
 135 grid containing 2 clusters with 16 hosts (processors/cores) each -, 4x8,
 136 4x16, 8x8 and 2x50. The network has been designed to operate with a
 137 bandwidth equals to 10Gbits (resp. 1Gbits/s) and a latency of 8E-6
 138 microseconds (resp. 5E-5) for the intra-clusters links (resp.
 139 inter-clusters backbone links).
 140
 141 \textbf{Step 5}: Process an extensive and comprehensive testings
 142 within these configurations in varying the key parameters, especially
 143 the CPU power capacity, the network parameters and also the size of the
 144 input matrix. Note that some parameters should be invariant to allow the
 145 comparison like some program input arguments.
 146
 147 \textbf{Step 6} : Collect and analyze the output results.
 148
 149 \textbf{ V.2. Factors impacting distributed applications performance in
 150 a grid environment}
 151
 152 From our previous experience on running distributed application in a
 153 computational grid, many factors are identified to have an impact on the
 154 program behavior and performance on this specific environment. Mainly,
 155 first of all, the architecture of the grid itself can obviously
 156 influence the performance results of the program. The performance gain
 157 might be important theoretically when the number of clusters and/or the
 158 number of nodes (processors/cores) in each individual cluster increase.
 159
 160 Another important factor impacting the overall performance of the
 161 application is the network configuration. Two main network parameters
 162 can modify drastically the program output results : (i) the network
 163 bandwidth (bw=bits/s) also known as "the data-carrying capacity"
 164 $[$13$]$ of the network is defined as the maximum of data that can pass
 165 from one point to another in a unit of time. (ii) the network latency
 166 (lat : microsecond) defined as the delay from the start time to send the
 167 data from a source and the final time the destination have finished to
 168 receive it. Upon the network characteristics, another impacting factor
 169 is the application dependent volume of data exchanged between the nodes
 170 in the cluster and between distant clusters. Large volume of data can be
 171 transferred in transit between the clusters and nodes during the code
 172 execution.
 173
 174  In a grid environment, it is common to distinguish in one hand, the
 175 "\,intra-network" which refers to the links between nodes within a
 176 cluster and in the other hand, the "\,inter-network" which is the
 177 backbone link between clusters. By design, these two networks perform
 178 with different speed. The intra-network generally works like a high
 179 speed local network with a high bandwith and very low latency. In
 180 opposite, the inter-network connects clusters sometime via heterogeneous
 181 networks components thru internet with a lower speed. The network
 182 between distant clusters might be a bottleneck for the global
 183 performance of the application.
 184
 185 \textbf{V.3 Comparing GMRES and Multisplitting algorithms in
 186 synchronous mode}
 187
 188 In the scope of this paper, our first objective is to demonstrate the
 189 Algo-2 (Multisplitting method) shows a better performance in grid
 190 architecture compared with Algo-1 (Classical GMRES) both running in
 191 \textbf{\textit{synchronous mode}}. Better algorithm performance
 192 should mean a less number of iterations output and a less execution time
 193 before reaching the convergence. For a systematic study, the experiments
 194 should figure out that, for various grid parameters values, the
 195 simulator will confirm the targeted outcomes, particularly for poor and
 196 slow networks, focusing on the impact on the communication performance
 197 on the chosen class of algorithm $[$12$]$.
 198
 199 The following paragraphs present the test conditions, the output results
 200 and our comments.
 201
 202
 203 \textit{3.a Executing the algorithms on various computational grid
 204 architecture scaling up the input matrix size}
 205 \\
 206
 207 % environment
 208 \begin{footnotesize}
 209 \begin{tabular}{r c }
 210  \hline
 211  Grid & 2x16, 4x8, 4x16 and 8x8\\ %\hline
 212  Network & N2 : bw=1Gbs-lat=5E-05 \\ %\hline
 213  Input matrix size & N$_{x}$ =150 x 150 x 150 and\\ %\hline
 214  - & N$_{x}$ =170 x 170 x 170    \\ \hline
 215  \end{tabular}
 216 \end{footnotesize}
 217
 218
 219  Table 1 : Clusters x Nodes with NX=150 or NX=170
 220
 221 \RCE{J'ai voulu mettre les tableaux des données mais je pense que c'est inutile et ça va surcharger}
 222
 223
 224 The results in figure 1 show the non-variation of the number of
 225 iterations of classical GMRES for a given input matrix size; it is not
 226 the case for the multisplitting method.
 227
 228 %\begin{wrapfigure}{l}{60mm}
 229 \begin{figure} [ht!]
 230 \centering
 231 \includegraphics[width=60mm]{cluster_x_nodes_nx_150_and_nx_170.pdf}
 232 \caption{Cluster x Nodes NX=150 and NX=170}
 233 %\label{overflow}}
 234 \end{figure}
 235 %\end{wrapfigure}
 236
 237 Unless the 8x8 cluster, the time
 238 execution difference between the two algorithms is important when
 239 comparing between different grid architectures, even with the same number of
 240 processors (like 2x16 and 4x8 = 32 processors for example). The
 241 experiment concludes the low sensitivity of the multisplitting method
 242 (compared with the classical GMRES) when scaling up to higher input
 243 matrix size.
 244
 245 \textit{3.b Running on various computational grid architecture}
 246
 247 % environment
 248 \begin{footnotesize}
 249 \begin{tabular}{r c }
 250  \hline
 251  Grid & 2x16, 4x8\\ %\hline
 252  Network & N1 : bw=10Gbs-lat=8E-06 \\ %\hline
 253  - & N2 : bw=1Gbs-lat=5E-05 \\
 254  Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline \\
 255  \end{tabular}
 256 \end{footnotesize}
 257
 258 %Table 2 : Clusters x Nodes - Networks N1 x N2
 259 %\RCE{idem pour tous les tableaux de donnees}
 260
 261
 262 %\begin{wrapfigure}{l}{60mm}
 263 \begin{figure} [ht!]
 264 \centering
 265 \includegraphics[width=60mm]{cluster_x_nodes_n1_x_n2.pdf}
 266 \caption{Cluster x Nodes N1 x N2}
 267 %\label{overflow}}
 268 \end{figure}
 269 %\end{wrapfigure}
 270
 271 The experiments compare the behavior of the algorithms running first on
 272 speed inter- cluster network (N1) and a less performant network (N2).
 273 The figure 2 shows that end users will gain to reduce the execution time
 274 for both algorithms in using a grid architecture like 4x16 or 8x8: the
 275 performance was increased in a factor of 2. The results depict also that
 276 when the network speed drops down, the difference between the execution
 277 times can reach more than 25\%.
 278
 279 \textit{\\\\\\\\\\\\\\\\\\3.c Network latency impacts on performance}
 280
 281 % environment
 282 \begin{footnotesize}
 283 \begin{tabular}{r c }
 284  \hline
 285  Grid & 2x16\\ %\hline
 286  Network & N1 : bw=1Gbs \\ %\hline
 287  Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline\\
 288  \end{tabular}
 289 \end{footnotesize}
 290
 291 Table 3 : Network latency impact
 292
 293
 294 \begin{figure} [ht!]
 295 \centering
 296 \includegraphics[width=60mm]{network_latency_impact_on_execution_time.pdf}
 297 \caption{Network latency impact on execution time}
 298 %\label{overflow}}
 299 \end{figure}
 300
 301
 302 According the results in table and figure 3, degradation of the network
 303 latency from 8.10$^{-6}$ to 6.10$^{-5}$ implies an absolute time
 304 increase more than 75\% (resp. 82\%) of the execution for the classical
 305 GMRES (resp. multisplitting) algorithm. In addition, it appears that the
 306 multisplitting method tolerates more the network latency variation with
 307 a less rate increase. Consequently, in the worst case (lat=6.10$^{-5
 308 }$), the execution time for GMRES is almost the double of the time for
 309 the multisplitting, even though, the performance was on the same order
 310 of magnitude with a latency of 8.10$^{-6}$.
 311
 312 \textit{3.d Network bandwidth impacts on performance}
 313
 314 % environment
 315 \begin{footnotesize}
 316 \begin{tabular}{r c }
 317  \hline
 318  Grid & 2x16\\ %\hline
 319  Network & N1 : bw=1Gbs - lat=5E-05 \\ %\hline
 320  Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline
 321  \end{tabular}
 322 \end{footnotesize}
 323
 324 Table 4 : Network bandwidth impact
 325
 326 \begin{figure} [ht!]
 327 \centering
 328 \includegraphics[width=60mm]{network_bandwith_impact_on_execution_time.pdf}
 329 \caption{Network bandwith impact on execution time}
 330 %\label{overflow}
 331 \end{figure}
 332
 333
 334
 335 The results of increasing the network bandwidth depict the improvement
 336 of the performance by reducing the execution time for both of the two
 337 algorithms. However, and again in this case, the multisplitting method
 338 presents a better performance in the considered bandwidth interval with
 339 a gain of 40\% which is only around 24\% for classical GMRES.
 340
 341 \textit{3.e Input matrix size impacts on performance}
 342
 343 % environment
 344 \begin{footnotesize}
 345 \begin{tabular}{r c }
 346  \hline
 347  Grid & 4x8\\ %\hline
 348  Network & N2 : bw=1Gbs - lat=5E-05 \\ %\hline
 349  Input matrix size & N$_{x}$ = From 40 to 200\\ \hline
 350  \end{tabular}
 351 \end{footnotesize}
 352
 353 Table 5 : Input matrix size impact
 354
 355 \begin{figure} [ht!]
 356 \centering
 357 \includegraphics[width=60mm]{pb_size_impact_on_execution_time.pdf}
 358 \caption{Pb size impact on execution time}
 359 %\label{overflow}}
 360 \end{figure}
 361
 362 In this experimentation, the input matrix size has been set from
 363 Nx=Ny=Nz=40 to 200 side elements that is from 40$^{3}$ = 64.000 to
 364 200$^{3}$ = 8.000.000 points. Obviously, as shown in the figure 5,
 365 the execution time for the algorithms convergence increases with the
 366 input matrix size. But the interesting result here direct on (i) the
 367 drastic increase (300 times) of the number of iterations needed before
 368 the convergence for the classical GMRES algorithm when the matrix size
 369 go beyond Nx=150; (ii) the classical GMRES execution time also almost
 370 the double from Nx=140 compared with the convergence time of the
 371 multisplitting method. These findings may help a lot end users to setup
 372 the best and the optimal targeted environment for the application
 373 deployment when focusing on the problem size scale up. Note that the
 374 same test has been done with the grid 2x16 getting the same conclusion.
 375
 376 \textit{3.f CPU Power impact on performance}
 377
 378 % environment
 379 \begin{footnotesize}
 380 \begin{tabular}{r c }
 381  \hline
 382  Grid & 2x16\\ %\hline
 383  Network & N2 : bw=1Gbs - lat=5E-05 \\ %\hline
 384  Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline
 385  \end{tabular}
 386 \end{footnotesize}
 387
 388 Table 6 : CPU Power impact
 389
 390 \begin{figure} [ht!]
 391 \centering
 392 \includegraphics[width=60mm]{cpu_power_impact_on_execution_time.pdf}
 393 \caption{CPU Power impact on execution time}
 394 %\label{overflow}}
 395 \end{figure}
 396
 397 Using the SIMGRID simulator flexibility, we have tried to determine the
 398 impact on the algorithms performance in varying the CPU power of the
 399 clusters nodes from 1 to 19 GFlops. The outputs depicted in the figure 6
 400 confirm the performance gain, around 95\% for both of the two methods,
 401 after adding more powerful CPU. Note that the execution time axis in the
 402 figure is in logarithmic scale.
 403
 404  \textbf{V.4 Comparing GMRES in native synchronous mode and
 405 Multisplitting algorithms in asynchronous mode}
 406
 407 The previous paragraphs put in evidence the interests to simulate the
 408 behavior of the application before any deployment in a real environment.
 409 We have focused the study on analyzing the performance in varying the
 410 key factors impacting the results. In the same line, the study compares
 411 the performance of the two proposed methods in \textbf{synchronous mode
 412 }. In this section, with the same previous methodology, the goal is to
 413 demonstrate the efficiency of the multisplitting method in \textbf{
 414 asynchronous mode} compare with the classical GMRES staying in the
 415 synchronous mode.
 416
 417 Note that the interest of using the asynchronous mode for data exchange
 418 is mainly, in opposite of the synchronous mode, the non-wait aspects of
 419 the current computation after a communication operation like sending
 420 some data between nodes. Each processor can continue their local
 421 calculation without waiting for the end of the communication. Thus, the
 422 asynchronous may theoretically reduce the overall execution time and can
 423 improve the algorithm performance.
 424
 425 As stated supra, SIMGRID simulator tool has been used to prove the
 426 efficiency of the multisplitting in asynchronous mode and to find the
 427 best combination of the grid resources (CPU, Network, input matrix size,
 428 \ldots ) to get the highest "\,relative gain" in comparison with the
 429 classical GMRES time.
 430
 431
 432 The test conditions are summarized in the table below :
 433
 434 % environment
 435 \begin{footnotesize}
 436 \begin{tabular}{r c }
 437  \hline
 438  Grid & 2x50 totaling 100 processors\\ %\hline
 439  Processors & 1 GFlops to 1.5 GFlops\\
 440    Intra-Network & bw=1.25 Gbits - lat=5E-05 \\ %\hline
 441    Inter-Network & bw=5 Mbits - lat=2E-02\\
 442  Input matrix size & N$_{x}$ = From 62 to 150\\ %\hline
 443  Residual error precision: 10$^{-5}$ to 10$^{-9}$\\ \hline
 444  \end{tabular}
 445 \end{footnotesize}
 446
 447 Again, comprehensive and extensive tests have been conducted varying the
 448 CPU power and the network parameters (bandwidth and latency) in the
 449 simulator tool with different problem size. The relative gains greater
 450 than 1 between the two algorithms have been captured after each step of
 451 the test. Table I below has recorded the best grid configurations
 452 allowing a multiplitting method time more than 2.5 times lower than
 453 classical GMRES execution and convergence time. The finding thru this
 454 experimentation is the tolerance of the multisplitting method under a
 455 low speed network that we encounter usually with distant clusters thru the
 456 internet.
 457
 458 % use the same column width for the following three tables
 459 \newlength{\mytablew}\settowidth{\mytablew}{\footnotesize\np{E-11}}
 460 \newenvironment{mytable}[1]{% #1: number of columns for data
 461   \renewcommand{\arraystretch}{1.3}%
 462   \begin{tabular}{|>{\bfseries}r%
 463                   |*{#1}{>{\centering\arraybackslash}p{\mytablew}|}}}{%
 464     \end{tabular}}
 465
 466 \begin{table}[!t]
 467   \centering
 468   \caption{Relative gain of the multisplitting algorithm compared with
 469 the classical GMRES}
 470   \label{tab.cluster.2x50}
 471
 472   \begin{mytable}{6}
 473     \hline
 474     bw
 475     & 5         & 5         & 5         & 5         & 5         & 50 \\
 476     \hline
 477     lat
 478     & 0.02      & 0.02      & 0.02      & 0.02      & 0.02      & 0.02 \\
 479     \hline
 480     power
 481     & 1         & 1         & 1         & 1.5       & 1.5       & 1.5 \\
 482     \hline
 483     size
 484     & 62        & 62        & 62        & 100       & 100       & 110 \\
 485     \hline
 486     Prec/Eprec
 487     & \np{E-5}  & \np{E-8}  & \np{E-9}  & \np{E-11} & \np{E-11} & \np{E-11} \\
 488     \hline
 489     speedup
 490     & 0.396     & 0.392     & 0.396     & 0.391     & 0.393     & 0.395 \\
 491     \hline
 492   \end{mytable}
 493
 494   \smallskip
 495
 496   \begin{mytable}{6}
 497     \hline
 498     bw
 499     & 50        & 50        & 50        & 50        & 10        & 10 \\
 500     \hline
 501     lat
 502     & 0.02      & 0.02      & 0.02      & 0.02      & 0.03      & 0.01 \\
 503     \hline
 504     power
 505     & 1.5       & 1.5       & 1.5       & 1.5       & 1         & 1.5 \\
 506     \hline
 507     size
 508     & 120       & 130       & 140       & 150       & 171       & 171 \\
 509     \hline
 510     Prec/Eprec
 511     & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-5}  & \np{E-5} \\
 512     \hline
 513     speedup
 514     & 0.398     & 0.388     & 0.393     & 0.394     & 0.63      & 0.778 \\
 515     \hline
 516   \end{mytable}
 517 \end{table}
 518
 519 \section{Conclusion}
 520 CONCLUSION
 521
 522
 523 \section*{Acknowledgment}
 524
 525
 526 The authors would like to thank\dots{}
 527
 528
 529 % trigger a \newpage just before the given reference
 530 % number - used to balance the columns on the last page
 531 % adjust value as needed - may need to be readjusted if
 532 % the document is modified later
 533 \bibliographystyle{IEEEtran}
 534 \bibliography{hpccBib}
 535
 536 \end{document}
 537
 538 %%% Local Variables:
 539 %%% mode: latex
 540 %%% TeX-master: t
 541 %%% fill-column: 80
 542 %%% ispell-local-dictionary: "american"
 543 %%% End: