X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/rce2015.git/blobdiff_plain/1162f45db29e22394d4564bea71aef816419cba2..b0694e2563db1ac6146d29848f55971b9b021fd2:/paper.tex?ds=inline

diff --git a/paper.tex b/paper.tex
index 1468528..8583e51 100644
--- a/paper.tex
+++ b/paper.tex
@@ -1,4 +1,17 @@
-\documentclass[conference]{IEEEtran}
+\documentclass[times]{cpeauth}
+
+\usepackage{moreverb}
+
+%\usepackage[dvips,colorlinks,bookmarksopen,bookmarksnumbered,citecolor=red,urlcolor=red]{hyperref}
+
+%\newcommand\BibTeX{{\rmfamily B\kern-.05em \textsc{i\kern-.025em b}\kern-.08em
+%T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
+
+\def\volumeyear{2015}
+
+\usepackage{graphicx}
+\usepackage{wrapfig}
+\usepackage{grffile}
 \usepackage[T1]{fontenc}
 \usepackage[utf8]{inputenc}
@@ -23,6 +36,7 @@
 \usepackage{xspace}
 \usepackage[textsize=footnotesize]{todonotes}
+
 \newcommand{\AG}[2][inline]{%
   \todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace}
 \newcommand{\RC}[2][inline]{%
@@ -48,35 +62,35 @@
 \newcolumntype{g}{>{\columncolor{Gray}}c}
 \definecolor{Gray}{gray}{0.9}
+
+
 \begin{document}
 \RCE{Title to be confirmed.}
 \title{Comparative performance analysis of simulated grid-enabled numerical iterative algorithms}
+%\itshape{\journalnamelc}\footnotemark[2]}
-\author{%
-  \IEEEauthorblockN{%
-    Charles Emile Ramamonjisoa and
+\author{ Charles Emile Ramamonjisoa and
     David Laiymani and
     Arnaud Giersch and
     Lilia Ziane Khodja and
     Raphaël Couturier
-  }
-  \IEEEauthorblockA{%
+}
+
+\address{
+  \centering
     Femto-ST Institute - DISC Department\\
     Université de Franche-Comté\\
     Belfort\\
     Email: \email{{raphael.couturier,arnaud.giersch,david.laiymani,charles.ramamonjisoa}@univ-fcomte.fr}
-  }
 }
-\maketitle
-
 \begin{abstract}
 ABSTRACT
+\end{abstract}
+\keywords{Algorithm; distributed; iterative; asynchronous; simulation; simgrid; performance}
-
-Keywords : Algorithm distributed iterative asynchronous simulation simgrid performance
-
-\end{abstract}
+\maketitle
 \section{Introduction}
@@ -86,15 +100,421 @@ Keywords : Algorithm distributed iterative asynchronous simulation simgrid perfo
 \section{Simulation of the multisplitting method}
-\section{Experiments and results}
-
-\paragraph*{1. Study setup and methodology}
-
-\paragraph*{2. Factors impacting distributed applications performance in a grid environment}
-
-\paragraph*{3. Comparing GMRES and Multisplitting algorithms in synchronous mode}
-
-\paragraph*{4. Comparing GMRES in synchronous mode and Multisplitting algorithms in asynchronous mode}
+\section{Experiments, Results and Comments}
+
+
+\textbf{V.1. Study setup and methodology}
+
+To conduct our study, we have put in place the following methodology,
+which can be reused with any grid-enabled application.
+
+\textbf{Step 1}: Choose with the end users the class of algorithms or
+the application to be tested. Numerical parallel iterative algorithms
+have been chosen for the study in this paper.
+
+\textbf{Step 2}: Collect the software materials needed for the
+experimentation. In our case, we have three algorithm variants for the
+resolution of the 3D Poisson problem: (1) the classical GMRES
+\textit{(Generalized Minimal RESidual Method)}, called Algo-1 in this
+paper, (2) the multisplitting method, called Algo-2, and (3) an
+enhanced version of the multisplitting method, called Algo-3. In addition,
+the SIMGRID simulator has been chosen to simulate the behavior of the
+distributed applications. SIMGRID runs on the Mesocentre
+datacenter at the University of Franche-Comté $[$10$]$ and also in a
+virtual machine on a laptop.
+
+\textbf{Step 3}: Fix the criteria which will be used for the future
+comparison and analysis of the results. In the scope of this study, we
+retain on the one hand the algorithm execution mode (synchronous and
+asynchronous) and on the other hand the execution time and the number of
+iterations of the application before reaching convergence.
+
+\textbf{Step 4}: Set up the different grid testbed environments that
+will be simulated in the simulator tool to run the program. The
+following architectures have been configured in SIMGRID: 2x16 (that is, a
+grid containing 2 clusters with 16 hosts (processors/cores) each), 4x8,
+4x16, 8x8 and 2x50. The network has been designed to operate with a
+bandwidth of 10 Gbit/s (resp. 1 Gbit/s) and a latency of 8E-6 s
+(resp. 5E-5 s) for the intra-cluster links (resp. inter-cluster
+backbone links).
+
+\textbf{Step 5}: Conduct extensive and comprehensive tests
+within these configurations by varying the key parameters, especially
+the CPU power capacity, the network parameters and the size of the
+input matrix. Note that some parameters, such as some program input
+arguments, should remain invariant to allow the comparison.
+
+\textbf{Step 6}: Collect and analyze the output results.
+
+\textbf{V.2. Factors impacting distributed applications performance in
+a grid environment}
+
+From our previous experience in running distributed applications on a
+computational grid, many factors have been identified as having an impact
+on the program behavior and performance in this specific environment.
+First of all, the architecture of the grid itself can obviously
+influence the performance results of the program. In theory, the
+performance gain can be significant when the number of clusters and/or the
+number of nodes (processors/cores) in each individual cluster increases.
+
+Another important factor impacting the overall performance of the
+application is the network configuration. Two main network parameters
+can drastically modify the program output results: (i) the network
+bandwidth (bw, in bits/s), also known as ``the data-carrying capacity''
+$[$13$]$ of the network, is defined as the maximum amount of data that
+can pass from one point to another in a unit of time.
+(ii) the network latency (lat, in microseconds) is defined as the delay
+between the time the source starts sending the data and the time the
+destination has finished receiving it. Besides the network
+characteristics, another impacting factor is the application-dependent
+volume of data exchanged between the nodes in a cluster and between
+distant clusters. Large volumes of data can be in transit between the
+clusters and nodes during the code execution.
+
+In a grid environment, it is common to distinguish, on the one hand, the
+``intra-network'', which refers to the links between nodes within a
+cluster and, on the other hand, the ``inter-network'', which is the
+backbone link between clusters. By design, these two networks operate
+at different speeds. The intra-network generally works like a high-speed
+local network with a high bandwidth and very low latency. In
+contrast, the inter-network connects clusters, sometimes via heterogeneous
+network components through the Internet, at a lower speed. The network
+between distant clusters might be a bottleneck for the global
+performance of the application.
+
+\textbf{V.3 Comparing GMRES and Multisplitting algorithms in
+synchronous mode}
+
+In the scope of this paper, our first objective is to demonstrate that
+Algo-2 (multisplitting method) shows better performance in a grid
+architecture than Algo-1 (classical GMRES), both running in
+\textbf{\textit{synchronous mode}}. Better algorithm performance
+should mean fewer iterations and a shorter execution time
+before reaching convergence. For a systematic study, the experiments
+should show that, for various grid parameter values, the
+simulator confirms the targeted outcomes, particularly for poor and
+slow networks, focusing on the impact of the communication performance
+on the chosen class of algorithms $[$12$]$.
+
+The following paragraphs present the test conditions, the output results
+and our comments.
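+
+As a rough illustration of the two network parameters introduced in
+Section V.2, the time needed to transfer a message of $S$ bits over a
+link can be approximated by a first-order model. This is only a sketch
+under our own assumptions (the symbols $S$ and $T_{\mathrm{comm}}$ are
+hypothetical), not necessarily the model implemented in SIMGRID:
+\[
+  T_{\mathrm{comm}} \approx \mathrm{lat} + \frac{S}{\mathrm{bw}} .
+\]
+Assuming that latencies are expressed in seconds and taking a
+hypothetical message of $S = 10^{6}$ bits, the two networks N1
+(bw = 10 Gbit/s, lat = 8E-06) and N2 (bw = 1 Gbit/s, lat = 5E-05) would give
+\[
+  T_{\mathrm{N1}} \approx 8\times10^{-6} + \frac{10^{6}}{10^{10}}
+                  = 1.08\times10^{-4}\ \mathrm{s},
+  \qquad
+  T_{\mathrm{N2}} \approx 5\times10^{-5} + \frac{10^{6}}{10^{9}}
+                  = 1.05\times10^{-3}\ \mathrm{s}.
+\]
+In this simplified picture, the bandwidth term dominates for large
+messages, whereas the latency term dominates for small and frequent
+messages.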
+
+
+\textit{3.a Executing the algorithms on various computational grid
+architectures while scaling up the input matrix size}
+\\
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x16, 4x8, 4x16 and 8x8\\ %\hline
+ Network & N2: bw = 1 Gbit/s, lat = 5E-05 \\ %\hline
+ Input matrix size & N$_{x}$ = 150 x 150 x 150 and\\ %\hline
+ - & N$_{x}$ = 170 x 170 x 170 \\ \hline
+ \end{tabular}
+\end{footnotesize}
+
+
+ Table 1 : Clusters x Nodes with NX=150 or NX=170
+
+\RCE{I wanted to include the data tables, but I think it is unnecessary and would overload the paper.}
+
+
+The results in Figure 1 show that the number of iterations of the
+classical GMRES does not vary for a given input matrix size; this is not
+the case for the multisplitting method.
+
+%\begin{wrapfigure}{l}{60mm}
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{cluster_x_nodes_nx_150_and_nx_170.pdf}
+\caption{Cluster x Nodes NX=150 and NX=170}
+%\label{overflow}}
+\end{figure}
+%\end{wrapfigure}
+
+Except for the 8x8 grid, the difference in execution time
+between the two algorithms is significant when
+comparing different grid architectures, even with the same number of
+processors (for example, 2x16 and 4x8, both with 32 processors). The
+experiment shows the low sensitivity of the multisplitting method
+(compared with the classical GMRES) when scaling up to a larger input
+matrix size.
+
+\textit{3.b Running on various computational grid architectures}
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x16, 4x8\\ %\hline
+ Network & N1: bw = 10 Gbit/s, lat = 8E-06 \\ %\hline
+ - & N2: bw = 1 Gbit/s, lat = 5E-05 \\
+ Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline \\
+ \end{tabular}
+\end{footnotesize}
+
+%Table 2 : Clusters x Nodes - Networks N1 x N2
+%\RCE{same for all the data tables}
+
+
+%\begin{wrapfigure}{l}{60mm}
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{cluster_x_nodes_n1_x_n2.pdf}
+\caption{Cluster x Nodes N1 x N2}
+%\label{overflow}}
+\end{figure}
+%\end{wrapfigure}
+
+The experiments compare the behavior of the algorithms running first on
+a fast inter-cluster network (N1) and then on a less performant network
+(N2). Figure 2 shows that end users can reduce the execution time
+of both algorithms by using a grid architecture like 4x16 or 8x8: the
+performance increased by a factor of 2. The results also show that
+when the network speed drops, the difference between the execution
+times can exceed 25\%.
+
+\textit{3.c Network latency impact on performance}
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x16\\ %\hline
+ Network & N1: bw = 1 Gbit/s \\ %\hline
+ Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline\\
+ \end{tabular}
+\end{footnotesize}
+
+Table 3 : Network latency impact
+
+
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{network_latency_impact_on_execution_time.pdf}
+\caption{Network latency impact on execution time}
+%\label{overflow}}
+\end{figure}
+
+
+According to the results in Table 3 and Figure 3, a degradation of the
+network latency from $8\times10^{-6}$ to $6\times10^{-5}$ implies an
+increase in execution time of more than 75\% (resp. 82\%) for the
+classical GMRES (resp. multisplitting) algorithm. In addition, it appears
+that the multisplitting method tolerates the network latency variation
+better, with a lower rate of increase. Consequently, in the worst case
+(lat=$6\times10^{-5}$), the execution time for GMRES is almost double
+that of the multisplitting method, even though the performance was of
+the same order of magnitude with a latency of $8\times10^{-6}$.
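+
+One possible way to read these percentages, assuming they are measured
+relative to the execution time obtained at the lowest latency (the
+notation $T(\mathrm{lat})$ and $\Delta$ is hypothetical and introduced
+here only for illustration), is
+\[
+  \Delta(\mathrm{lat}) =
+  \frac{T(\mathrm{lat}) - T(8\times10^{-6})}{T(8\times10^{-6})} ,
+\]
+so that, under this reading, an increase of 75\% corresponds to
+$T(6\times10^{-5}) = 1.75\,T(8\times10^{-6})$ and an increase of 82\%
+to $T(6\times10^{-5}) = 1.82\,T(8\times10^{-6})$.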
+
+\textit{3.d Network bandwidth impact on performance}
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x16\\ %\hline
+ Network & N1: bw = 1 Gbit/s, lat = 5E-05 \\ %\hline
+ Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline
+ \end{tabular}
+\end{footnotesize}
+
+Table 4 : Network bandwidth impact
+
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{network_bandwith_impact_on_execution_time.pdf}
+\caption{Network bandwidth impact on execution time}
+%\label{overflow}
+\end{figure}
+
+
+
+The results show that increasing the network bandwidth improves
+performance by reducing the execution time of both
+algorithms. However, in this case again, the multisplitting method
+performs better over the considered bandwidth interval, with
+a gain of 40\% compared with only around 24\% for the classical GMRES.
+
+\textit{3.e Input matrix size impact on performance}
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 4x8\\ %\hline
+ Network & N2: bw = 1 Gbit/s, lat = 5E-05 \\ %\hline
+ Input matrix size & N$_{x}$ = From 40 to 200\\ \hline
+ \end{tabular}
+\end{footnotesize}
+
+Table 5 : Input matrix size impact
+
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{pb_size_impact_on_execution_time.pdf}
+\caption{Problem size impact on execution time}
+%\label{overflow}}
+\end{figure}
+
+In this experiment, the input matrix size has been varied from
+Nx=Ny=Nz=40 to 200 side elements, that is, from 40$^{3}$ = 64,000 to
+200$^{3}$ = 8,000,000 points. Obviously, as shown in Figure 5,
+the execution time needed for the algorithms to converge increases with
+the input matrix size.
+But the interesting results here are (i) the
+drastic increase (300 times) in the number of iterations needed before
+convergence for the classical GMRES algorithm when the matrix size
+goes beyond Nx=150; and (ii) the classical GMRES execution time, which is
+almost double the convergence time of the multisplitting method from
+Nx=140 onwards. These findings may greatly help end users to set up
+the best targeted environment for the application
+deployment when focusing on scaling up the problem size. Note that the
+same test has been done with the 2x16 grid, leading to the same conclusion.
+
+\textit{3.f CPU Power impact on performance}
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x16\\ %\hline
+ Network & N2: bw = 1 Gbit/s, lat = 5E-05 \\ %\hline
+ Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline
+ \end{tabular}
+\end{footnotesize}
+
+Table 6 : CPU Power impact
+
+\begin{figure} [ht!]
+\centering
+\includegraphics[width=60mm]{cpu_power_impact_on_execution_time.pdf}
+\caption{CPU Power impact on execution time}
+%\label{overflow}}
+\end{figure}
+
+Using the flexibility of the SIMGRID simulator, we have tried to
+determine the impact on the algorithms' performance of varying the CPU
+power of the cluster nodes from 1 to 19 GFlops. The outputs depicted in
+Figure 6 confirm the performance gain, around 95\% for both methods,
+after adding more powerful CPUs. Note that the execution time axis in the
+figure is in logarithmic scale.
+
+\textbf{V.4 Comparing GMRES in native synchronous mode and
+Multisplitting algorithms in asynchronous mode}
+
+The previous paragraphs highlight the benefit of simulating the
+behavior of the application before any deployment in a real environment.
+We have focused the study on analyzing the performance by varying the
+key factors impacting the results. Along the same lines, the study
+compares the performance of the two proposed methods in
+\textbf{synchronous mode}.
+In this section, with the same methodology as before, the goal is to
+demonstrate the efficiency of the multisplitting method in
+\textbf{asynchronous mode} compared with the classical GMRES, which stays
+in synchronous mode.
+
+Note that the main interest of the asynchronous mode for data exchange,
+as opposed to the synchronous mode, is that the current computation does
+not wait after a communication operation such as sending
+data between nodes. Each processor can continue its local
+calculation without waiting for the end of the communication. Thus, the
+asynchronous mode may theoretically reduce the overall execution time and
+improve the algorithm performance.
+
+As stated above, the SIMGRID simulator has been used to demonstrate the
+efficiency of the multisplitting method in asynchronous mode and to find
+the best combination of grid resources (CPU, network, input matrix size,
+\ldots) to get the highest ``relative gain'' in comparison with the
+classical GMRES time.
+
+
+The test conditions are summarized in the table below:
+
+% environment
+\begin{footnotesize}
+\begin{tabular}{r c }
+ \hline
+ Grid & 2x50 totaling 100 processors\\ %\hline
+ Processors & 1 GFlops to 1.5 GFlops\\
+ Intra-Network & bw = 1.25 Gbit/s, lat = 5E-05 \\ %\hline
+ Inter-Network & bw = 5 Mbit/s, lat = 2E-02\\
+ Input matrix size & N$_{x}$ = From 62 to 150\\ %\hline
+ Residual error precision & 10$^{-5}$ to 10$^{-9}$\\ \hline
+ \end{tabular}
+\end{footnotesize}
+
+Again, comprehensive and extensive tests have been conducted, varying the
+CPU power and the network parameters (bandwidth and latency) in the
+simulator with different problem sizes. The relative gains greater
+than 1 between the two algorithms have been captured after each step of
+the test. Table~\ref{tab.cluster.2x50} records the best grid
+configurations, for which the execution and convergence time of the
+multisplitting method is more than 2.5 times lower than that of the
+classical GMRES.
+The finding of this experiment is the tolerance of the multisplitting
+method to the low-speed networks usually encountered between distant
+clusters connected through the Internet.
+
+% use the same column width for the following three tables
+\newlength{\mytablew}\settowidth{\mytablew}{\footnotesize\np{E-11}}
+\newenvironment{mytable}[1]{% #1: number of columns for data
+  \renewcommand{\arraystretch}{1.3}%
+  \begin{tabular}{|>{\bfseries}r%
+                  |*{#1}{>{\centering\arraybackslash}p{\mytablew}|}}}{%
+    \end{tabular}}
+
+\begin{table}[!t]
+  \centering
+  \caption{Relative gain of the multisplitting algorithm compared with
+the classical GMRES}
+  \label{tab.cluster.2x50}
+
+  \begin{mytable}{6}
+    \hline
+    bw
+    & 5 & 5 & 5 & 5 & 5 & 50 \\
+    \hline
+    lat
+    & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 \\
+    \hline
+    power
+    & 1 & 1 & 1 & 1.5 & 1.5 & 1.5 \\
+    \hline
+    size
+    & 62 & 62 & 62 & 100 & 100 & 110 \\
+    \hline
+    Prec/Eprec
+    & \np{E-5} & \np{E-8} & \np{E-9} & \np{E-11} & \np{E-11} & \np{E-11} \\
+    \hline
+    speedup
+    & 0.396 & 0.392 & 0.396 & 0.391 & 0.393 & 0.395 \\
+    \hline
+  \end{mytable}
+
+  \smallskip
+
+  \begin{mytable}{6}
+    \hline
+    bw
+    & 50 & 50 & 50 & 50 & 10 & 10 \\
+    \hline
+    lat
+    & 0.02 & 0.02 & 0.02 & 0.02 & 0.03 & 0.01 \\
+    \hline
+    power
+    & 1.5 & 1.5 & 1.5 & 1.5 & 1 & 1.5 \\
+    \hline
+    size
+    & 120 & 130 & 140 & 150 & 171 & 171 \\
+    \hline
+    Prec/Eprec
+    & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-5} & \np{E-5} \\
+    \hline
+    speedup
+    & 0.398 & 0.388 & 0.393 & 0.394 & 0.63 & 0.778 \\
+    \hline
+  \end{mytable}
+\end{table}
 \section{Conclusion}
 CONCLUSION