X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/rce2015.git/blobdiff_plain/508eb0199cb2b5f94ce4ce4878b8b8a3bea05431..42bfd8504dfc1ae1e1de017748fc27fb9a8e50e4:/paper.tex?ds=inline diff --git a/paper.tex b/paper.tex index e69de29..0ec97f3 100644 --- a/paper.tex +++ b/paper.tex @@ -0,0 +1,537 @@ +%\documentclass[conference]{IEEEtran} +\documentclass[times]{cpeauth} + +\usepackage{moreverb} + +%\usepackage[dvips,colorlinks,bookmarksopen,bookmarksnumbered,citecolor=red,urlcolor=red]{hyperref} + +%\newcommand\BibTeX{{\rmfamily B\kern-.05em \textsc{i\kern-.025em b}\kern-.08em +%T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}} + +\def\volumeyear{2015} + +\usepackage{graphicx} +\usepackage{wrapfig} +\usepackage{grffile} + +\usepackage[T1]{fontenc} +\usepackage[utf8]{inputenc} +\usepackage{amsfonts,amssymb} +\usepackage{amsmath} +\usepackage{algorithm} +\usepackage{algpseudocode} +%\usepackage{amsthm} +\usepackage{graphicx} +\usepackage[american]{babel} +% Extension pour les liens intra-documents (tagged PDF) +% et l'affichage correct des URL (commande \url{http://example.com}) +%\usepackage{hyperref} + +\usepackage{url} +\DeclareUrlCommand\email{\urlstyle{same}} + +\usepackage[autolanguage,np]{numprint} +\AtBeginDocument{% + \renewcommand*\npunitcommand[1]{\text{#1}} + \npthousandthpartsep{}} + +\usepackage{xspace} +\usepackage[textsize=footnotesize]{todonotes} + +\newcommand{\AG}[2][inline]{% + \todo[color=green!50,#1]{\sffamily\textbf{AG:} #2}\xspace} +\newcommand{\RC}[2][inline]{% + \todo[color=red!10,#1]{\sffamily\textbf{RC:} #2}\xspace} +\newcommand{\LZK}[2][inline]{% + \todo[color=blue!10,#1]{\sffamily\textbf{LZK:} #2}\xspace} +\newcommand{\RCE}[2][inline]{% + \todo[color=yellow!10,#1]{\sffamily\textbf{RCE:} #2}\xspace} + +\algnewcommand\algorithmicinput{\textbf{Input:}} +\algnewcommand\Input{\item[\algorithmicinput]} + +\algnewcommand\algorithmicoutput{\textbf{Output:}} +\algnewcommand\Output{\item[\algorithmicoutput]} + +\newcommand{\MI}{\mathit{MaxIter}} + +\usepackage{array} +\usepackage{color, colortbl} +\newcolumntype{M}[1]{>{\centering\arraybackslash}m{#1}} +\newcolumntype{Z}[1]{>{\raggedleft}m{#1}} + +\newcolumntype{g}{>{\columncolor{Gray}}c} +\definecolor{Gray}{gray}{0.9} + + + +\begin{document} +\RCE{Titre a confirmer.} +\title{Comparative performance analysis of simulated grid-enabled numerical iterative algorithms} +%\itshape{\journalnamelc}\footnotemark[2]} + +\author{ Charles Emile Ramamonjisoa and + David Laiymani and + Arnaud Giersch and + Lilia Ziane Khodja and + Raphaël Couturier +} + +\address{ + \centering + Femto-ST Institute - DISC Department\\ + Université de Franche-Comté\\ + Belfort\\ + Email: \email{{raphael.couturier,arnaud.giersch,david.laiymani,charles.ramamonjisoa}@univ-fcomte.fr} +} + +\begin{abstract} +ABSTRACT +\end{abstract} + +\keywords{Algorithm; distributed; iterative; asynchronous; simulation; simgrid; performance} + +\maketitle + +\section{Introduction} + +\section{The asynchronous iteration model} + +\section{SimGrid} + +\section{Simulation of the multisplitting method} + +\section{Experimental, Results and Comments} + + +\textbf{V.1. Setup study and Methodology} + +To conduct our study, we have put in place the following methodology +which can be reused with any grid-enabled applications. + +\textbf{Step 1} : Choose with the end users the class of algorithms or +the application to be tested. Numerical parallel iterative algorithms +have been chosen for the study in the paper. + +\textbf{Step 2} : Collect the software materials needed for the +experimentation. In our case, we have three variants algorithms for the +resolution of three 3D-Poisson problem: (1) using the classical GMRES +\textit{(Generalized Minimal RESidual Method)} alias Algo-1 in this +paper, (2) using the multisplitting method alias Algo-2 and (3) an +enhanced version of the multisplitting method as Algo-3. In addition, +SIMGRID simulator has been chosen to simulate the behaviors of the +distributed applications. SIMGRID is running on the Mesocentre +datacenter in Franche-Comte University $[$10$]$ but also in a virtual +machine on a laptop. + +\textbf{Step 3} : Fix the criteria which will be used for the future +results comparison and analysis. In the scope of this study, we retain +in one hand the algorithm execution mode (synchronous and asynchronous) +and in the other hand the execution time and the number of iterations of +the application before obtaining the convergence. + +\textbf{Step 4 }: Setup up the different grid testbeds environment +which will be simulated in the simulator tool to run the program. The +following architecture has been configured in Simgrid : 2x16 - that is a +grid containing 2 clusters with 16 hosts (processors/cores) each -, 4x8, +4x16, 8x8 and 2x50. The network has been designed to operate with a +bandwidth equals to 10Gbits (resp. 1Gbits/s) and a latency of 8E-6 +microseconds (resp. 5E-5) for the intra-clusters links (resp. +inter-clusters backbone links). + +\textbf{Step 5}: Process an extensive and comprehensive testings +within these configurations in varying the key parameters, especially +the CPU power capacity, the network parameters and also the size of the +input matrix. Note that some parameters should be invariant to allow the +comparison like some program input arguments. + +\textbf{Step 6} : Collect and analyze the output results. + +\textbf{ V.2. Factors impacting distributed applications performance in +a grid environment} + +From our previous experience on running distributed application in a +computational grid, many factors are identified to have an impact on the +program behavior and performance on this specific environment. Mainly, +first of all, the architecture of the grid itself can obviously +influence the performance results of the program. The performance gain +might be important theoretically when the number of clusters and/or the +number of nodes (processors/cores) in each individual cluster increase. + +Another important factor impacting the overall performance of the +application is the network configuration. Two main network parameters +can modify drastically the program output results : (i) the network +bandwidth (bw=bits/s) also known as "the data-carrying capacity" +$[$13$]$ of the network is defined as the maximum of data that can pass +from one point to another in a unit of time. (ii) the network latency +(lat : microsecond) defined as the delay from the start time to send the +data from a source and the final time the destination have finished to +receive it. Upon the network characteristics, another impacting factor +is the application dependent volume of data exchanged between the nodes +in the cluster and between distant clusters. Large volume of data can be +transferred in transit between the clusters and nodes during the code +execution. + + In a grid environment, it is common to distinguish in one hand, the +"\,intra-network" which refers to the links between nodes within a +cluster and in the other hand, the "\,inter-network" which is the +backbone link between clusters. By design, these two networks perform +with different speed. The intra-network generally works like a high +speed local network with a high bandwith and very low latency. In +opposite, the inter-network connects clusters sometime via heterogeneous +networks components thru internet with a lower speed. The network +between distant clusters might be a bottleneck for the global +performance of the application. + +\textbf{V.3 Comparing GMRES and Multisplitting algorithms in +synchronous mode} + +In the scope of this paper, our first objective is to demonstrate the +Algo-2 (Multisplitting method) shows a better performance in grid +architecture compared with Algo-1 (Classical GMRES) both running in +\textbf{\textit{synchronous mode}}. Better algorithm performance +should mean a less number of iterations output and a less execution time +before reaching the convergence. For a systematic study, the experiments +should figure out that, for various grid parameters values, the +simulator will confirm the targeted outcomes, particularly for poor and +slow networks, focusing on the impact on the communication performance +on the chosen class of algorithm $[$12$]$. + +The following paragraphs present the test conditions, the output results +and our comments. + + +\textit{3.a Executing the algorithms on various computational grid +architecture scaling up the input matrix size} +\\ + +% environment +\begin{footnotesize} +\begin{tabular}{r c } + \hline + Grid & 2x16, 4x8, 4x16 and 8x8\\ %\hline + Network & N2 : bw=1Gbs-lat=5E-05 \\ %\hline + Input matrix size & N$_{x}$ =150 x 150 x 150 and\\ %\hline + - & N$_{x}$ =170 x 170 x 170 \\ \hline + \end{tabular} +\end{footnotesize} + + + Table 1 : Clusters x Nodes with NX=150 or NX=170 + +\RCE{J'ai voulu mettre les tableaux des données mais je pense que c'est inutile et ça va surcharger} + + +The results in figure 1 show the non-variation of the number of +iterations of classical GMRES for a given input matrix size; it is not +the case for the multisplitting method. + +%%\begin{wrapfigure}{l}{60mm} +\begin{figure} [ht!] +\centering +\includegraphics[width=60mm]{Cluster_x_Nodes_NX=150_and_NX=170.jpg} +\caption{Cluster x Nodes NX=150 and NX=170 \label{overflow}} +\end{figure} +%%\end{wrapfigure} + + +Unless the 8x8 cluster, the time +execution difference between the two algorithms is important when +comparing between different grid architectures, even with the same number of +processors (like 2x16 and 4x8 = 32 processors for example). The +experiment concludes the low sensitivity of the multisplitting method +(compared with the classical GMRES) when scaling up to higher input +matrix size. + +\textit{3.b Running on various computational grid architecture} + +% environment +\begin{footnotesize} +\begin{tabular}{r c } + \hline + Grid & 2x16, 4x8\\ %\hline + Network & N1 : bw=10Gbs-lat=8E-06 \\ %\hline + - & N2 : bw=1Gbs-lat=5E-05 \\ + Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline \\ + \end{tabular} +\end{footnotesize} + +%Table 2 : Clusters x Nodes - Networks N1 x N2 +%\RCE{idem pour tous les tableaux de donnees} + + +\begin{figure} [ht!] +\centering +\includegraphics[width=60mm]{Cluster_x_Nodes_N1_x_N2.jpg} +\caption{Cluster x Nodes N1 x N2\label{overflow}} +\end{figure} + +The experiments compare the behavior of the algorithms running first on +speed inter- cluster network (N1) and a less performant network (N2). +The figure 2 shows that end users will gain to reduce the execution time +for both algorithms in using a grid architecture like 4x16 or 8x8: the +performance was increased in a factor of 2. The results depict also that +when the network speed drops down, the difference between the execution +times can reach more than 25\%. + +\textit{\\\\\\\\\\\\\\\\\\3.c Network latency impacts on performance} + +% environment +\begin{footnotesize} +\begin{tabular}{r c } + \hline + Grid & 2x16\\ %\hline + Network & N1 : bw=1Gbs \\ %\hline + Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline\\ + \end{tabular} +\end{footnotesize} + +Table 3 : Network latency impact + + +\begin{figure} [ht!] +\centering +\includegraphics[width=60mm]{Network_latency_impact_on_execution_time.jpg} +\caption{Network latency impact on execution time\label{overflow}} +\end{figure} + + +According the results in table and figure 3, degradation of the network +latency from 8.10$^{-6}$ to 6.10$^{-5}$ implies an absolute time +increase more than 75\% (resp. 82\%) of the execution for the classical +GMRES (resp. multisplitting) algorithm. In addition, it appears that the +multisplitting method tolerates more the network latency variation with +a less rate increase. Consequently, in the worst case (lat=6.10$^{-5 +}$), the execution time for GMRES is almost the double of the time for +the multisplitting, even though, the performance was on the same order +of magnitude with a latency of 8.10$^{-6}$. + +\textit{3.d Network bandwidth impacts on performance} + +% environment +\begin{footnotesize} +\begin{tabular}{r c } + \hline + Grid & 2x16\\ %\hline + Network & N1 : bw=1Gbs - lat=5E-05 \\ %\hline + Input matrix size & N$_{x}$ =150 x 150 x 150\\ \hline + \end{tabular} +\end{footnotesize} + +Table 4 : Network bandwidth impact + +\begin{figure} [ht!] +\centering +\includegraphics[width=60mm]{Network_bandwith_impact_on_execution_time.jpg} +\caption{Network bandwith impact on execution time\label{overflow}} +\end{figure} + + + +The results of increasing the network bandwidth depict the improvement +of the performance by reducing the execution time for both of the two +algorithms. However, and again in this case, the multisplitting method +presents a better performance in the considered bandwidth interval with +a gain of 40\% which is only around 24\% for classical GMRES. + +\textit{3.e Input matrix size impacts on performance} + +% environment +\begin{footnotesize} +\begin{tabular}{r c } + \hline + Grid & 4x8\\ %\hline + Network & N2 : bw=1Gbs - lat=5E-05 \\ %\hline + Input matrix size & N$_{x}$ = From 40 to 200\\ \hline + \end{tabular} +\end{footnotesize} + +Table 5 : Input matrix size impact + +\begin{figure} [ht!] +\centering +\includegraphics[width=60mm]{Pb_size_impact_on_execution_time.jpg} +\caption{Pb size impact on execution time\label{overflow}} +\end{figure} + +In this experimentation, the input matrix size has been set from +Nx=Ny=Nz=40 to 200 side elements that is from 40$^{3}$ = 64.000 to +200$^{3}$ = 8.000.000 points. Obviously, as shown in the figure 5, +the execution time for the algorithms convergence increases with the +input matrix size. But the interesting result here direct on (i) the +drastic increase (300 times) of the number of iterations needed before +the convergence for the classical GMRES algorithm when the matrix size +go beyond Nx=150; (ii) the classical GMRES execution time also almost +the double from Nx=140 compared with the convergence time of the +multisplitting method. These findings may help a lot end users to setup +the best and the optimal targeted environment for the application +deployment when focusing on the problem size scale up. Note that the +same test has been done with the grid 2x16 getting the same conclusion. + +\textit{3.f CPU Power impact on performance} + +% environment +\begin{footnotesize} +\begin{tabular}{r c } + \hline + Grid & 2x16\\ %\hline + Network & N2 : bw=1Gbs - lat=5E-05 \\ %\hline + Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline + \end{tabular} +\end{footnotesize} + +Table 6 : CPU Power impact + +\begin{figure} [ht!] +\centering +\includegraphics[width=60mm]{CPU_Power_impact_on_execution_time.jpg} +\caption{CPU Power impact on execution time\label{overflow}} +\end{figure} + +Using the SIMGRID simulator flexibility, we have tried to determine the +impact on the algorithms performance in varying the CPU power of the +clusters nodes from 1 to 19 GFlops. The outputs depicted in the figure 6 +confirm the performance gain, around 95\% for both of the two methods, +after adding more powerful CPU. Note that the execution time axis in the +figure is in logarithmic scale. + + \textbf{V.4 Comparing GMRES in native synchronous mode and +Multisplitting algorithms in asynchronous mode} + +The previous paragraphs put in evidence the interests to simulate the +behavior of the application before any deployment in a real environment. +We have focused the study on analyzing the performance in varying the +key factors impacting the results. In the same line, the study compares +the performance of the two proposed methods in \textbf{synchronous mode +}. In this section, with the same previous methodology, the goal is to +demonstrate the efficiency of the multisplitting method in \textbf{ +asynchronous mode} compare with the classical GMRES staying in the +synchronous mode. + +Note that the interest of using the asynchronous mode for data exchange +is mainly, in opposite of the synchronous mode, the non-wait aspects of +the current computation after a communication operation like sending +some data between nodes. Each processor can continue their local +calculation without waiting for the end of the communication. Thus, the +asynchronous may theoretically reduce the overall execution time and can +improve the algorithm performance. + +As stated supra, SIMGRID simulator tool has been used to prove the +efficiency of the multisplitting in asynchronous mode and to find the +best combination of the grid resources (CPU, Network, input matrix size, +\ldots ) to get the highest "\,relative gain" in comparison with the +classical GMRES time. + + +The test conditions are summarized in the table below : + +% environment +\begin{footnotesize} +\begin{tabular}{r c } + \hline + Grid & 2x50 totaling 100 processors\\ %\hline + Processors & 1 GFlops to 1.5 GFlops\\ + Intra-Network & bw=1.25 Gbits - lat=5E-05 \\ %\hline + Inter-Network & bw=5 Mbits - lat=2E-02\\ + Input matrix size & N$_{x}$ = From 62 to 150\\ %\hline + Residual error precision: 10$^{-5}$ to 10$^{-9}$\\ \hline + \end{tabular} +\end{footnotesize} + +Again, comprehensive and extensive tests have been conducted varying the +CPU power and the network parameters (bandwidth and latency) in the +simulator tool with different problem size. The relative gains greater +than 1 between the two algorithms have been captured after each step of +the test. Table I below has recorded the best grid configurations +allowing a multiplitting method time more than 2.5 times lower than +classical GMRES execution and convergence time. The finding thru this +experimentation is the tolerance of the multisplitting method under a +low speed network that we encounter usually with distant clusters thru the +internet. + +% use the same column width for the following three tables +\newlength{\mytablew}\settowidth{\mytablew}{\footnotesize\np{E-11}} +\newenvironment{mytable}[1]{% #1: number of columns for data + \renewcommand{\arraystretch}{1.3}% + \begin{tabular}{|>{\bfseries}r% + |*{#1}{>{\centering\arraybackslash}p{\mytablew}|}}}{% + \end{tabular}} + +\begin{table}[!t] + \centering + \caption{Relative gain of the multisplitting algorithm compared with +the classical GMRES} + \label{tab.cluster.2x50} + + \begin{mytable}{6} + \hline + bw + & 5 & 5 & 5 & 5 & 5 & 50 \\ + \hline + lat + & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 \\ + \hline + power + & 1 & 1 & 1 & 1.5 & 1.5 & 1.5 \\ + \hline + size + & 62 & 62 & 62 & 100 & 100 & 110 \\ + \hline + Prec/Eprec + & \np{E-5} & \np{E-8} & \np{E-9} & \np{E-11} & \np{E-11} & \np{E-11} \\ + \hline + speedup + & 0.396 & 0.392 & 0.396 & 0.391 & 0.393 & 0.395 \\ + \hline + \end{mytable} + + \smallskip + + \begin{mytable}{6} + \hline + bw + & 50 & 50 & 50 & 50 & 10 & 10 \\ + \hline + lat + & 0.02 & 0.02 & 0.02 & 0.02 & 0.03 & 0.01 \\ + \hline + power + & 1.5 & 1.5 & 1.5 & 1.5 & 1 & 1.5 \\ + \hline + size + & 120 & 130 & 140 & 150 & 171 & 171 \\ + \hline + Prec/Eprec + & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-5} & \np{E-5} \\ + \hline + speedup + & 0.398 & 0.388 & 0.393 & 0.394 & 0.63 & 0.778 \\ + \hline + \end{mytable} +\end{table} + +\section{Conclusion} +CONCLUSION + + +\section*{Acknowledgment} + + +The authors would like to thank\dots{} + + +% trigger a \newpage just before the given reference +% number - used to balance the columns on the last page +% adjust value as needed - may need to be readjusted if +% the document is modified later +\bibliographystyle{IEEEtran} +\bibliography{hpccBib} + +\end{document} + +%%% Local Variables: +%%% mode: latex +%%% TeX-master: t +%%% fill-column: 80 +%%% ispell-local-dictionary: "american" +%%% End: