X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/hpcc2014.git/blobdiff_plain/c9f1e655cef3e735867e6000202cb1f982f05d58..2367662d760a73d98ca68d37a3912c211972afe3:/hpcc.tex diff --git a/hpcc.tex b/hpcc.tex index 5fbeca1..968b235 100644 --- a/hpcc.tex +++ b/hpcc.tex @@ -319,38 +319,21 @@ \usepackage[T1]{fontenc} -\usepackage{ucs} -%\usepackage[utf8x]{inputenc} -\usepackage{lmodern} -\usepackage{color} -%% Jolis entetes %% -\usepackage[Glenn]{fncychap} +\usepackage[utf8]{inputenc} +\usepackage{amsfonts,amssymb} +\usepackage{amsmath} %\usepackage{amsmath} %\usepackage{amsthm} %\usepackage{amsfonts} %\usepackage{graphicx} %\usepackage{xspace} -% Definition des marges -\usepackage{vmargin} -\setpapersize[portrait]{A4} -\usepackage[francais]{babel} +\usepackage[american]{babel} % Extension pour les graphiques EPS %\usepackage[dvips]{graphicx} \usepackage[pdftex,final]{graphicx} % Extension pour les liens intra-documents (tagged PDF) % et l'affichage correct des URL (commande \url{http://example.com}) -\usepackage{hyperref} - -\ifCLASSINFOpdf - \usepackage[pdftex]{graphicx} - \DeclareGraphicsExtensions{.pdf,.jpeg,.png} -\else -\fi - - - -% correct bad hyphenation here -\hyphenation{op-tical net-works semi-conduc-tor} +%\usepackage{hyperref} \begin{document} @@ -363,9 +346,9 @@ % author names and affiliations % use a multiple column layout for up to three different % affiliations -\author{\IEEEauthorblockN{Raphaël Couturier and Arnaud Giersch and David Laiymani and Charles-Emile Ramamonjisoa} +\author{\IEEEauthorblockN{Raphaël Couturier and Arnaud Giersch and David Laiymani and Charles-Emile Ramamonjisoa} \IEEEauthorblockA{Femto-ST Institute - DISC Department\\ -Université de Franche-Comté\\ +Université de Franche-Comté\\ Belfort\\ Email: raphael.couturier@univ-fcomte.fr} %\and @@ -417,23 +400,183 @@ The abstract goes here. \section{Introduction} -Présenter un bref état de l'art sur la simulation d'algos parallèles. 
Présenter rapidement les algos itératifs asynchrones et leurs avantages. Parler de leurs inconvénients en particulier la difficulté de déploiement à grande échelle donc il serait bien de simuler. Dire qu'à notre connaissance il n'existe pas de simulation de ce type d'algo.
-Présenter les travaux et les résultats obtenus. Annoncer le plan.
+Give a brief state of the art of the simulation of parallel algorithms. Briefly present asynchronous iterative algorithms and their advantages. Discuss their drawbacks, in particular the difficulty of deploying them at large scale, which makes their simulation attractive. State that, to our knowledge, no simulation of this kind of algorithm exists.
+Present the work and the results obtained. Outline the paper.

\section{The asynchronous iteration model}

-Décrire le modèle asynchrone. Je m'en charge (DL)
+Describe the asynchronous model. I will take care of it (DL).

\section{SimGrid}

-Décrire SimGrid (Arnaud)
+Describe SimGrid (Arnaud).
+
+
+
+
+
+
+
+%%%%%
+\section{Simulation of the multisplitting method}
+% Describe the problem (algorithm) addressed and the process of adapting it to SimGrid.
+Let $Ax=b$ be a large sparse system of $n$ linear equations in $\mathbb{R}$, where $A$ is a sparse square nonsingular matrix, $x$ is the solution vector and $b$ is the right-hand side vector. We use a multisplitting method based on the block Jacobi partitioning to solve this linear system on a large-scale platform composed of $L$ clusters of processors. 
In this case, we apply a row-by-row splitting without overlapping
+\[
+\left(\begin{array}{ccc}
+A_{11} & \cdots & A_{1L} \\
+\vdots & \ddots & \vdots\\
+A_{L1} & \cdots & A_{LL}
+\end{array} \right)
+\times
+\left(\begin{array}{c}
+X_1 \\
+\vdots\\
+X_L
+\end{array} \right)
+=
+\left(\begin{array}{c}
+B_1 \\
+\vdots\\
+B_L
+\end{array} \right)\]
+in such a way that successive rows of matrix $A$ and both vectors $x$ and $b$ are assigned to one cluster, where for all $l,i\in\{1,\ldots,L\}$, $A_{li}$ is a rectangular block of $A$ of size $n_l\times n_i$, $X_l$ and $B_l$ are sub-vectors of $x$ and $b$, respectively, each of size $n_l$, and $\sum_{l} n_l=\sum_{i} n_i=n$.
+
+The multisplitting method proceeds by iterations to solve the linear system in parallel on $L$ clusters of processors, in such a way that each sub-system
+\begin{equation}
+\left\{
+\begin{array}{l}
+A_{ll}X_l = Y_l \mbox{,~such that}\\
+Y_l = B_l - \displaystyle\sum_{i=1,i\neq l}^{L}A_{li}X_i,
+\end{array}
+\right.
+\label{eq:4.1}
+\end{equation}
+is solved independently by a cluster, and communications are required to update the right-hand side sub-vectors $Y_l$, such that the sub-vectors $X_i$ represent the data dependencies between the clusters. As each sub-system (\ref{eq:4.1}) is solved in parallel by a cluster of processors, our multisplitting method uses an iterative method as an inner solver, which is easier to parallelize and more scalable than a direct method. In this work, we use the parallel GMRES method~\cite{ref1}, which is one of the most widely used iterative methods.
+%%%%%
+
+
+
+
+
-\section{Simulation of the multi-splitting method}
-Décrire le problème (algo) traité ainsi que le processus d'adaptation à SimGrid.

\section{Experimental results}

+Once the ``real'' application runs in the simulation environment and produces
+the expected results, varying the input parameters and the program arguments
+allows us to compare the outputs of different executions. 
We have noticed from this
+study that the results depend on the following parameters: (1) at the network
+level, the most critical values are the bandwidth (bw) and the network latency
+(lat); (2) the hosts' power (GFlops) can also influence the results; and (3)
+when submitting job batches for execution, the argument values passed to the
+program, such as the maximum number of iterations or the ``external''
+precision, are critical. They ensure not only the convergence of the algorithm
+but also the main objective of the experiment: obtaining an execution time in
+asynchronous mode that is less than in synchronous mode, in other words, a
+``speedup'' less than 1 (Speedup = Execution time in asynchronous mode /
+Execution time in synchronous mode).
+
+A priori, obtaining a speedup less than 1 would be difficult in a local area
+network configuration, where the synchronous mode takes advantage of the rapid
+exchange of information on such high-speed links. Thus, the methodology adopted
+was to launch the application on a clustered network. In this configuration,
+degrading the inter-cluster network performance ``penalizes'' the synchronous
+mode, making it possible to obtain a speedup lower than 1. This setup simulates
+clusters linked by a long-distance network such as the Internet.
+
+As a first step, the algorithm was run on a network consisting of two clusters
+containing fifty hosts each, one hundred hosts in total. Various combinations
+of the above factors have provided the results shown in
+Table~\ref{tab.cluster.2x50}, with a matrix size ranging from
+Nx = Ny = Nz = 62 to 171 elements, i.e. from 62$^{3}$ = 238,328 to
+171$^{3}$ = 5,000,211 entries.
+
+Then we changed the network configuration to three clusters containing
+respectively 33, 33 and 34 hosts, again one hundred hosts for all the
+clusters. 
In the same way as above, a judicious choice of key parameters
+allowed us to obtain the results in Table~\ref{tab.cluster.3x33}, which shows
+speedups less than 1 with a matrix size from 62 to 100 elements.
+
+In a final step, we attempted to scale up the three-cluster configuration to
+two hundred hosts in total; the results are recorded in
+Table~\ref{tab.cluster.3x67}.
+
+Note that the program was run with the following parameters:
+
+\paragraph*{SMPI parameters}
+
+\begin{itemize}
+ \item HOSTFILE: hosts description file.
+ \item PLATFORM: file describing the platform architecture: clusters (CPU
+power, \ldots), intra-cluster network description, inter-cluster network
+(bandwidth bw, latency lat, \ldots).
+\end{itemize}
+
+
+\paragraph*{Arguments of the program}
+
+\begin{itemize}
+ \item Description of the cluster architecture;
+ \item Maximum number of internal and external iterations;
+ \item Internal and external precisions;
+ \item Matrix size NX, NY and NZ;
+ \item Matrix diagonal value = 6.0;
+ \item Execution mode: synchronous or asynchronous.
+\end{itemize}
+
+\begin{table}
+ \centering
+ \caption{2 clusters $\times$ 50 nodes}
+ \label{tab.cluster.2x50}
+ \includegraphics[width=209pt]{img-1.eps}
+\end{table}
+
+\begin{table}
+ \centering
+ \caption{3 clusters $\times$ 33 nodes}
+ \label{tab.cluster.3x33}
+ \includegraphics[width=209pt]{img-1.eps}
+\end{table}
+
+\begin{table}
+ \centering
+ \caption{3 clusters $\times$ 67 nodes}
+ \label{tab.cluster.3x67}
+ \includegraphics[width=128pt]{img-2.eps}
+\end{table}
+
+\paragraph*{Interpretations and comments}
+
+After analyzing the outputs, generally, for the configurations with two or
+three clusters totaling one hundred hosts (Tables~\ref{tab.cluster.2x50}
+and~\ref{tab.cluster.3x33}), some combinations of the parameters have given a
+speedup less than 1, showing the effectiveness of the asynchronous mode
+compared to the synchronous one. 
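For reference, the synchronous outer iteration being compared in these experiments is the block Jacobi multisplitting of eq.~(\ref{eq:4.1}). The following is a minimal, hypothetical pure-Python sketch (not the SMPI code used in the experiments): a tiny $4\times 4$ system with two blocks standing in for two clusters, and a direct $2\times 2$ solve standing in for the inner parallel GMRES solver.

```python
# Hypothetical illustration (not the experimental SMPI code): synchronous
# block Jacobi multisplitting on a tiny 4x4 system A x = b with L = 2 blocks.

def solve2(M, r):
    # Solve the 2x2 system M t = r by Cramer's rule (stand-in for inner GMRES).
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(M[1][1] * r[0] - M[0][1] * r[1]) / det,
            (M[0][0] * r[1] - M[1][0] * r[0]) / det]

# A diagonally dominant matrix guarantees convergence of the outer iteration.
A = [[4.0, 1.0, 0.5, 0.0],
     [1.0, 4.0, 0.0, 0.5],
     [0.5, 0.0, 4.0, 1.0],
     [0.0, 0.5, 1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]

blocks = [(0, 2), (2, 4)]   # row ranges assigned to the two "clusters"
x = [0.0] * 4               # initial guess

for _ in range(50):         # external (outer) iterations
    x_new = x[:]
    for lo, hi in blocks:
        # Y_l = B_l - sum_{i != l} A_{li} X_i  (data from the other cluster)
        y = [b[r] - sum(A[r][c] * x[c] for c in range(4) if not lo <= c < hi)
             for r in range(lo, hi)]
        # Inner solve: A_{ll} X_l = Y_l
        M = [[A[r][c] for c in range(lo, hi)] for r in range(lo, hi)]
        x_new[lo:hi] = solve2(M, y)
    x = x_new   # synchronous mode: all clusters exchange before the next step

residual = max(abs(sum(A[r][c] * x[c] for c in range(4)) - b[r])
               for r in range(4))
print(residual < 1e-8)  # True: the outer iteration has converged
```

In asynchronous mode, each cluster would instead update its $Y_l$ with whatever sub-vectors $X_i$ it last received, without waiting for the other clusters to finish the outer step; the experiments above compare the execution times of the two modes.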
+
+In the case of the two-cluster configuration, Table~\ref{tab.cluster.2x50}
+shows that with an inter-cluster network degraded to a bandwidth of 5 Mbit/s,
+a latency on the order of a hundredth of a millisecond and a system power of
+one GFlops, an efficiency of about 40\% in asynchronous mode is obtained for a
+matrix size of 62 elements. It is noticed that the result remains stable even
+if we vary the external precision from E-05 to E-09. By increasing the problem
+size up to 100 elements, it was necessary to increase the CPU power by 50\% to
+1.5 GFlops for the algorithm to converge with the same order of asynchronous
+mode efficiency. Maintaining such a system power but increasing the
+inter-cluster network throughput up to 50 Mbit/s, an efficiency of about 40\%
+is obtained with a high external precision of E-11 for a matrix size from 110
+to 150 side elements.
+
+For the 3-cluster architecture including a total of 100 hosts,
+Table~\ref{tab.cluster.3x33} shows that it was difficult to find a combination
+giving an asynchronous efficiency below 80\%. Indeed, for a matrix size of 62
+elements, equality between the performance of the two modes (synchronous and
+asynchronous) is achieved with an inter-cluster bandwidth of 10 Mbit/s and a
+latency of E-01 ms. To reach an efficiency of 78\% with a matrix size of 100
+points, it was necessary to degrade the inter-cluster network bandwidth from 5
+to 2 Mbit/s.
+
+A last attempt was made with a three-cluster configuration but with more
+power, 200 nodes in total. Convergence with a speedup of 0.9 was obtained
+with a bandwidth of 1 Mbit/s, as shown in Table~\ref{tab.cluster.3x67}.
+

\section{Conclusion}

@@ -555,7 +698,7 @@ The authors would like to thank... 
% http://www.michaelshell.org/tex/ieeetran/bibtex/ \bibliographystyle{IEEEtran} % argument is your BibTeX string definitions and bibliography database(s) -\bibliography{bib/hpccBib} +\bibliography{hpccBib} % % manually copy in the resultant .bbl file % set second argument of \begin to the number of references