paper.tex

   1 %\documentclass[conference]{IEEEtran}
   2 \documentclass[times]{cpeauth}
   3
   4 \usepackage{moreverb}
   5
   6 %\usepackage[dvips,colorlinks,bookmarksopen,bookmarksnumbered,citecolor=red,urlcolor=red]{hyperref}
   7
   8 %\newcommand\BibTeX{{\rmfamily B\kern-.05em \textsc{i\kern-.025em b}\kern-.08em
   9 %T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
  10
  11 \def\volumeyear{2015}
  12
  13 \usepackage{graphicx}
  14 \usepackage{wrapfig}
  15 \usepackage{grffile}
  16
  17 \usepackage[T1]{fontenc}
  18 \usepackage[utf8]{inputenc}
  19 \usepackage{amsfonts,amssymb}
  20 \usepackage{amsmath}
  21 \usepackage{algorithm}
  22 \usepackage{algpseudocode}
  23 %\usepackage{amsthm}
  24 \usepackage{graphicx}
  25 \usepackage[american]{babel}
  26 % Extension pour les liens intra-documents (tagged PDF)
  27 % et l'affichage correct des URL (commande \url{http://example.com})
  28 %\usepackage{hyperref}
  29
  30 \usepackage{url}
  31 \DeclareUrlCommand\email{\urlstyle{same}}
  32
  33 \usepackage[autolanguage,np]{numprint}
  34 \AtBeginDocument{%
  35   \renewcommand*\npunitcommand[1]{\text{#1}}
  36   \npthousandthpartsep{}}
  37
  38 \usepackage{xspace}
  39 \usepackage[textsize=footnotesize]{todonotes}
  40
  41 \newcommand{\AG}[2][inline]{%
  42   \todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace}
  43 \newcommand{\RC}[2][inline]{%
  44   \todo[color=red!10,#1]{\sffamily\textbf{RC:} #2}\xspace}
  45 \newcommand{\LZK}[2][inline]{%
  46   \todo[color=blue!10,#1]{\sffamily\textbf{LZK:} #2}\xspace}
  47 \newcommand{\RCE}[2][inline]{%
  48   \todo[color=yellow!10,#1]{\sffamily\textbf{RCE:} #2}\xspace}
  49
  50 \algnewcommand\algorithmicinput{\textbf{Input:}}
  51 \algnewcommand\Input{\item[\algorithmicinput]}
  52
  53 \algnewcommand\algorithmicoutput{\textbf{Output:}}
  54 \algnewcommand\Output{\item[\algorithmicoutput]}
  55
  56 \newcommand{\MI}{\mathit{MaxIter}}
  57
  58 \usepackage{array}
  59 \usepackage{color, colortbl}
  60 \newcolumntype{M}[1]{>{\centering\arraybackslash}m{#1}}
  61 \newcolumntype{Z}[1]{>{\raggedleft}m{#1}}
  62
  63 \newcolumntype{g}{>{\columncolor{Gray}}c}
  64 \definecolor{Gray}{gray}{0.9}
  65
  66
  67
  68 \begin{document}
  69 \RCE{Titre a confirmer.}
  70 \title{Comparative performance analysis of simulated grid-enabled numerical iterative algorithms}
  71 %\itshape{\journalnamelc}\footnotemark[2]}
  72
  73 \author{    Charles Emile Ramamonjisoa and
  74     David Laiymani and
  75     Arnaud Giersch and
  76     Lilia Ziane Khodja and
  77     Raphaël Couturier
  78 }
  79
  80 \address{
  81         \centering
  82     Femto-ST Institute - DISC Department\\
  83     Université de Franche-Comté\\
  84     Belfort\\
  85     Email: \email{{raphael.couturier,arnaud.giersch,david.laiymani,charles.ramamonjisoa}@univ-fcomte.fr}
  86 }
  87
  88 \begin{abstract}
  89 ABSTRACT
  90 \end{abstract}
  91
  92 \keywords{Algorithm; distributed; iterative; asynchronous; simulation; simgrid; performance}
  93
  94 \maketitle
  95
  96 \section{Introduction}
  97
  98 \section{The asynchronous iteration model}
  99
 100 \section{SimGrid}
 101
 102 \section{Simulation of the multisplitting method}
 103
 104 \section{Experimental, Results and Comments}
 105
 106
 107 \textbf{V.1. Setup study and Methodology}
 108
 109 To conduct our study, we have put in place the following methodology
 110 which can be reused with any grid-enabled applications.
 111
 112 \textbf{Step 1} : Choose with the end users the class of algorithms or
 113 the application to be tested. Numerical parallel iterative algorithms
 114 have been chosen for the study in the paper.
 115
 116 \textbf{Step 2} : Collect the software materials needed for the
 117 experimentation. In our case, we have three variants algorithms for the
 118 resolution of three 3D-Poisson problem: (1) using the classical GMRES
 119 \textit{(Generalized Minimal RESidual Method)} alias Algo-1 in this
 120 paper, (2) using the multisplitting method alias Algo-2 and (3) an
 121 enhanced version of the multisplitting method as Algo-3. In addition,
 122 SIMGRID simulator has been chosen to simulate the behaviors of the
 123 distributed applications. SIMGRID is running on the Mesocentre
 124 datacenter in Franche-Comte University $[$10$]$ but also in a virtual
 125 machine on a laptop.
 126
 127 \textbf{Step 3} : Fix the criteria which will be used for the future
 128 results comparison and analysis. In the scope of this study, we retain
 129 in one hand the algorithm execution mode (synchronous and asynchronous)
 130 and in the other hand the execution time and the number of iterations of
 131 the application before obtaining the convergence.
 132
 133 \textbf{Step 4 }: Setup up the different grid testbeds environment
 134 which will be simulated in the simulator tool to run the program. The
 135 following architecture has been configured in Simgrid : 2x16 - that is a
 136 grid containing 2 clusters with 16 hosts (processors/cores) each -, 4x8,
 137 4x16, 8x8 and 2x50. The network has been designed to operate with a
 138 bandwidth equals to 10Gbits (resp. 1Gbits/s) and a latency of 8E-6
 139 microseconds (resp. 5E-5) for the intra-clusters links (resp.
 140 inter-clusters backbone links).
 141
 142 \textbf{Step 5}: Process an extensive and comprehensive testings
 143 within these configurations in varying the key parameters, especially
 144 the CPU power capacity, the network parameters and also the size of the
 145 input matrix. Note that some parameters should be invariant to allow the
 146 comparison like some program input arguments.
 147
 148 \textbf{Step 6} : Collect and analyze the output results.
 149
 150 \textbf{ V.2. Factors impacting distributed applications performance in
 151 a grid environment}
 152
 153 From our previous experience on running distributed application in a
 154 computational grid, many factors are identified to have an impact on the
 155 program behavior and performance on this specific environment. Mainly,
 156 first of all, the architecture of the grid itself can obviously
 157 influence the performance results of the program. The performance gain
 158 might be important theoretically when the number of clusters and/or the
 159 number of nodes (processors/cores) in each individual cluster increase.
 160
 161 Another important factor impacting the overall performance of the
 162 application is the network configuration. Two main network parameters
 163 can modify drastically the program output results : (i) the network
 164 bandwidth (bw=bits/s) also known as "the data-carrying capacity"
 165 $[$13$]$ of the network is defined as the maximum of data that can pass
 166 from one point to another in a unit of time. (ii) the network latency
 167 (lat : microsecond) defined as the delay from the start time to send the
 168 data from a source and the final time the destination have finished to
 169 receive it. Upon the network characteristics, another impacting factor
 170 is the application dependent volume of data exchanged between the nodes
 171 in the cluster and between distant clusters. Large volume of data can be
 172 transferred in transit between the clusters and nodes during the code
 173 execution.
 174
 175  In a grid environment, it is common to distinguish in one hand, the
 176 "\,intra-network" which refers to the links between nodes within a
 177 cluster and in the other hand, the "\,inter-network" which is the
 178 backbone link between clusters. By design, these two networks perform
 179 with different speed. The intra-network generally works like a high
 180 speed local network with a high bandwith and very low latency. In
 181 opposite, the inter-network connects clusters sometime via heterogeneous
 182 networks components thru internet with a lower speed. The network
 183 between distant clusters might be a bottleneck for the global
 184 performance of the application.
 185
 186 \textbf{V.3 Comparing GMRES and Multisplitting algorithms in
 187 synchronous mode}
 188
 189 In the scope of this paper, our first objective is to demonstrate the
 190 Algo-2 (Multisplitting method) shows a better performance in grid
 191 architecture compared with Algo-1 (Classical GMRES) both running in
 192 \textbf{\textit{synchronous mode}}. Better algorithm performance
 193 should mean a less number of iterations output and a less execution time
 194 before reaching the convergence. For a systematic study, the experiments
 195 should figure out that, for various grid parameters values, the
 196 simulator will confirm the targeted outcomes, particularly for poor and
 197 slow networks, focusing on the impact on the communication performance
 198 on the chosen class of algorithm $[$12$]$.
 199
 200 The following paragraphs present the test conditions, the output results
 201 and our comments.
 202
 203
 204 \textit{3.a Executing the algorithms on various computational grid
 205 architecture scaling up the input matrix size}
 206 \\
 207
 208 % environment
 209 \begin{footnotesize}
 210 \begin{tabular}{r c }
 211  \hline
 212  Grid & 2x16, 4x8, 4x16 and 8x8\\ %\hline
 213  Network & N2 : bw=1Gbs-lat=5E-05 \\ %\hline
 214  Input matrix size & N$_{x}$ =150 x 150 x 150 and\\ %\hline
 215  - & N$_{x}$ =170 x 170 x 170    \\ \hline
 216  \end{tabular}
 217 \end{footnotesize}
 218
 219
 220  Table 1 : Clusters x Nodes with NX=150 or NX=170
 221
 222 \RCE{J'ai voulu mettre les tableaux des données mais je pense que c'est inutile et ça va surcharger}
 223
 224
 225 The results in figure 1 show the non-variation of the number of
 226 iterations of classical GMRES for a given input matrix size; it is not
 227 the case for the multisplitting method.
 228
 229 %%\begin{wrapfigure}{l}{60mm}
 230 \begin{figure} [ht!]
 231 \centering
 232 \includegraphics[width=60mm]{Cluster x Nodes NX=150 and NX=170.jpg}
 233 \caption{Cluster x Nodes NX=150 and NX=170 \label{overflow}}
 234 \end{figure}
 235 %%\end{wrapfigure}
 236
 237
 238 Unless the 8x8 cluster, the time
 239 execution difference between the two algorithms is important when
 240 comparing between different grid architectures, even with the same number of
 241 processors (like 2x16 and 4x8 = 32 processors for example). The
 242 experiment concludes the low sensitivity of the multisplitting method
 243 (compared with the classical GMRES) when scaling up to higher input
 244 matrix size.
 245
 246 \textit{3.b Running on various computational grid architecture}
 247
 248 % environment
 249 \begin{footnotesize}
 250 \begin{tabular}{r c }
 251  \hline
 252  Grid & 2x16, 4x8\\ %\hline
 253  Network & N1 : bw=10Gbs-lat=8E-06 \\ %\hline
 254  - & N2 : bw=1Gbs-lat=5E-05 \\
 255  Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline \\
 256  \end{tabular}
 257 \end{footnotesize}
 258
 259 %Table 2 : Clusters x Nodes - Networks N1 x N2
 260 %\RCE{idem pour tous les tableaux de donnees}
 261
 262
 263 \begin{figure} [ht!]
 264 \centering
 265 \includegraphics[width=60mm]{Cluster x Nodes N1 x N2.jpg}
 266 \caption{Cluster x Nodes N1 x N2\label{overflow}}
 267 \end{figure}
 268
 269 The experiments compare the behavior of the algorithms running first on
 270 speed inter- cluster network (N1) and a less performant network (N2).
 271 The figure 2 shows that end users will gain to reduce the execution time
 272 for both algorithms in using a grid architecture like 4x16 or 8x8: the
 273 performance was increased in a factor of 2. The results depict also that
 274 when the network speed drops down, the difference between the execution
 275 times can reach more than 25\%.
 276
 277 \textit{\\\\\\\\\\\\\\\\\\3.c Network latency impacts on performance}
 278
 279 % environment
 280 \begin{footnotesize}
 281 \begin{tabular}{r c }
 282  \hline
 283  Grid & 2x16\\ %\hline
 284  Network & N1 : bw=1Gbs \\ %\hline
 285  Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline\\
 286  \end{tabular}
 287 \end{footnotesize}
 288
 289 Table 3 : Network latency impact
 290
 291
 292 \begin{figure} [ht!]
 293 \centering
 294 \includegraphics[width=60mm]{Network latency impact on execution time.jpg}
 295 \caption{Network latency impact on execution time\label{overflow}}
 296 \end{figure}
 297
 298
 299 According the results in table and figure 3, degradation of the network
 300 latency from 8.10$^{-6}$ to 6.10$^{-5}$ implies an absolute time
 301 increase more than 75\% (resp. 82\%) of the execution for the classical
 302 GMRES (resp. multisplitting) algorithm. In addition, it appears that the
 303 multisplitting method tolerates more the network latency variation with
 304 a less rate increase. Consequently, in the worst case (lat=6.10$^{-5
 305 }$), the execution time for GMRES is almost the double of the time for
 306 the multisplitting, even though, the performance was on the same order
 307 of magnitude with a latency of 8.10$^{-6}$.
 308
 309 \textit{3.d Network bandwidth impacts on performance}
 310
 311 % environment
 312 \begin{footnotesize}
 313 \begin{tabular}{r c }
 314  \hline
 315  Grid & 2x16\\ %\hline
 316  Network & N1 : bw=1Gbs - lat=5E-05 \\ %\hline
 317  Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline
 318  \end{tabular}
 319 \end{footnotesize}
 320
 321 Table 4 : Network bandwidth impact
 322
 323 \begin{figure} [ht!]
 324 \centering
 325 \includegraphics[width=60mm]{Network bandwith impact on execution time.jpg}
 326 \caption{Network bandwith impact on execution time\label{overflow}}
 327 \end{figure}
 328
 329
 330
 331 The results of increasing the network bandwidth depict the improvement
 332 of the performance by reducing the execution time for both of the two
 333 algorithms. However, and again in this case, the multisplitting method
 334 presents a better performance in the considered bandwidth interval with
 335 a gain of 40\% which is only around 24\% for classical GMRES.
 336
 337 \textit{3.e Input matrix size impacts on performance}
 338
 339 % environment
 340 \begin{footnotesize}
 341 \begin{tabular}{r c }
 342  \hline
 343  Grid & 4x8\\ %\hline
 344  Network & N2 : bw=1Gbs - lat=5E-05 \\ %\hline
 345  Input matrix size & N$_{x}$ = From 40 to 200\\ \hline
 346  \end{tabular}
 347 \end{footnotesize}
 348
 349 Table 5 : Input matrix size impact
 350
 351 \begin{figure} [ht!]
 352 \centering
 353 \includegraphics[width=60mm]{Pb size impact on execution time.jpg}
 354 \caption{Pb size impact on execution time\label{overflow}}
 355 \end{figure}
 356
 357 In this experimentation, the input matrix size has been set from
 358 Nx=Ny=Nz=40 to 200 side elements that is from 40$^{3}$ = 64.000 to
 359 200$^{3}$ = 8.000.000 points. Obviously, as shown in the figure 5,
 360 the execution time for the algorithms convergence increases with the
 361 input matrix size. But the interesting result here direct on (i) the
 362 drastic increase (300 times) of the number of iterations needed before
 363 the convergence for the classical GMRES algorithm when the matrix size
 364 go beyond Nx=150; (ii) the classical GMRES execution time also almost
 365 the double from Nx=140 compared with the convergence time of the
 366 multisplitting method. These findings may help a lot end users to setup
 367 the best and the optimal targeted environment for the application
 368 deployment when focusing on the problem size scale up. Note that the
 369 same test has been done with the grid 2x16 getting the same conclusion.
 370
 371 \textit{3.f CPU Power impact on performance}
 372
 373 % environment
 374 \begin{footnotesize}
 375 \begin{tabular}{r c }
 376  \hline
 377  Grid & 2x16\\ %\hline
 378  Network & N2 : bw=1Gbs - lat=5E-05 \\ %\hline
 379  Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline
 380  \end{tabular}
 381 \end{footnotesize}
 382
 383 Table 6 : CPU Power impact
 384
 385 \begin{figure} [ht!]
 386 \centering
 387 \includegraphics[width=60mm]{CPU Power impact on execution time.jpg}
 388 \caption{CPU Power impact on execution time\label{overflow}}
 389 \end{figure}
 390
 391 Using the SIMGRID simulator flexibility, we have tried to determine the
 392 impact on the algorithms performance in varying the CPU power of the
 393 clusters nodes from 1 to 19 GFlops. The outputs depicted in the figure 6
 394 confirm the performance gain, around 95\% for both of the two methods,
 395 after adding more powerful CPU. Note that the execution time axis in the
 396 figure is in logarithmic scale.
 397
 398  \textbf{V.4 Comparing GMRES in native synchronous mode and
 399 Multisplitting algorithms in asynchronous mode}
 400
 401 The previous paragraphs put in evidence the interests to simulate the
 402 behavior of the application before any deployment in a real environment.
 403 We have focused the study on analyzing the performance in varying the
 404 key factors impacting the results. In the same line, the study compares
 405 the performance of the two proposed methods in \textbf{synchronous mode
 406 }. In this section, with the same previous methodology, the goal is to
 407 demonstrate the efficiency of the multisplitting method in \textbf{
 408 asynchronous mode} compare with the classical GMRES staying in the
 409 synchronous mode.
 410
 411 Note that the interest of using the asynchronous mode for data exchange
 412 is mainly, in opposite of the synchronous mode, the non-wait aspects of
 413 the current computation after a communication operation like sending
 414 some data between nodes. Each processor can continue their local
 415 calculation without waiting for the end of the communication. Thus, the
 416 asynchronous may theoretically reduce the overall execution time and can
 417 improve the algorithm performance.
 418
 419 As stated supra, SIMGRID simulator tool has been used to prove the
 420 efficiency of the multisplitting in asynchronous mode and to find the
 421 best combination of the grid resources (CPU, Network, input matrix size,
 422 \ldots ) to get the highest "\,relative gain" in comparison with the
 423 classical GMRES time.
 424
 425
 426 The test conditions are summarized in the table below :
 427
 428 % environment
 429 \begin{footnotesize}
 430 \begin{tabular}{r c }
 431  \hline
 432  Grid & 2x50 totaling 100 processors\\ %\hline
 433  Processors & 1 GFlops to 1.5 GFlops\\
 434    Intra-Network & bw=1.25 Gbits - lat=5E-05 \\ %\hline
 435    Inter-Network & bw=5 Mbits - lat=2E-02\\
 436  Input matrix size & N$_{x}$ = From 62 to 150\\ %\hline
 437  Residual error precision: 10$^{-5}$ to 10$^{-9}$\\ \hline
 438  \end{tabular}
 439 \end{footnotesize}
 440
 441 Again, comprehensive and extensive tests have been conducted varying the
 442 CPU power and the network parameters (bandwidth and latency) in the
 443 simulator tool with different problem size. The relative gains greater
 444 than 1 between the two algorithms have been captured after each step of
 445 the test. Table I below has recorded the best grid configurations
 446 allowing a multiplitting method time more than 2.5 times lower than
 447 classical GMRES execution and convergence time. The finding thru this
 448 experimentation is the tolerance of the multisplitting method under a
 449 low speed network that we encounter usually with distant clusters thru the
 450 internet.
 451
 452 % use the same column width for the following three tables
 453 \newlength{\mytablew}\settowidth{\mytablew}{\footnotesize\np{E-11}}
 454 \newenvironment{mytable}[1]{% #1: number of columns for data
 455   \renewcommand{\arraystretch}{1.3}%
 456   \begin{tabular}{|>{\bfseries}r%
 457                   |*{#1}{>{\centering\arraybackslash}p{\mytablew}|}}}{%
 458     \end{tabular}}
 459
 460 \begin{table}[!t]
 461   \centering
 462   \caption{Relative gain of the multisplitting algorithm compared with
 463 the classical GMRES}
 464   \label{tab.cluster.2x50}
 465
 466   \begin{mytable}{6}
 467     \hline
 468     bw
 469     & 5         & 5         & 5         & 5         & 5         & 50 \\
 470     \hline
 471     lat
 472     & 0.02      & 0.02      & 0.02      & 0.02      & 0.02      & 0.02 \\
 473     \hline
 474     power
 475     & 1         & 1         & 1         & 1.5       & 1.5       & 1.5 \\
 476     \hline
 477     size
 478     & 62        & 62        & 62        & 100       & 100       & 110 \\
 479     \hline
 480     Prec/Eprec
 481     & \np{E-5}  & \np{E-8}  & \np{E-9}  & \np{E-11} & \np{E-11} & \np{E-11} \\
 482     \hline
 483     speedup
 484     & 0.396     & 0.392     & 0.396     & 0.391     & 0.393     & 0.395 \\
 485     \hline
 486   \end{mytable}
 487
 488   \smallskip
 489
 490   \begin{mytable}{6}
 491     \hline
 492     bw
 493     & 50        & 50        & 50        & 50        & 10        & 10 \\
 494     \hline
 495     lat
 496     & 0.02      & 0.02      & 0.02      & 0.02      & 0.03      & 0.01 \\
 497     \hline
 498     power
 499     & 1.5       & 1.5       & 1.5       & 1.5       & 1         & 1.5 \\
 500     \hline
 501     size
 502     & 120       & 130       & 140       & 150       & 171       & 171 \\
 503     \hline
 504     Prec/Eprec
 505     & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-5}  & \np{E-5} \\
 506     \hline
 507     speedup
 508     & 0.398     & 0.388     & 0.393     & 0.394     & 0.63      & 0.778 \\
 509     \hline
 510   \end{mytable}
 511 \end{table}
 512
 513 \section{Conclusion}
 514 CONCLUSION
 515
 516
 517 \section*{Acknowledgment}
 518
 519
 520 The authors would like to thank\dots{}
 521
 522
 523 % trigger a \newpage just before the given reference
 524 % number - used to balance the columns on the last page
 525 % adjust value as needed - may need to be readjusted if
 526 % the document is modified later
 527 \bibliographystyle{IEEEtran}
 528 \bibliography{hpccBib}
 529
 530 \end{document}
 531
 532 %%% Local Variables:
 533 %%% mode: latex
 534 %%% TeX-master: t
 535 %%% fill-column: 80
 536 %%% ispell-local-dictionary: "american"
 537 %%% End: