INTRODUCTION

[hpcc2014.git] / hpcc.tex
diff --git a/hpcc.tex b/hpcc.tex

index 5fbeca1a90c34ec0b69a85df8ada7a635d721d99..d44a7b4895cc52231169a64b43bf288da6712899 100644 (file)
--- a/hpcc.tex
+++ b/hpcc.tex
@@ -319,38 +319,29 @@
  
  
  \usepackage[T1]{fontenc}
-\usepackage{ucs}
-%\usepackage[utf8x]{inputenc}
-\usepackage{lmodern}
-\usepackage{color}
-%% Jolis entetes %%
-\usepackage[Glenn]{fncychap}
-%\usepackage{amsmath}
+\usepackage[utf8]{inputenc}
+\usepackage{amsfonts,amssymb}
+\usepackage{amsmath}
+\usepackage{algorithm}
+\usepackage{algpseudocode}
  %\usepackage{amsthm}
-%\usepackage{amsfonts}
-%\usepackage{graphicx}
+\usepackage{graphicx}
  %\usepackage{xspace}
-% Definition des marges
-\usepackage{vmargin}
-\setpapersize[portrait]{A4}
-\usepackage[francais]{babel}
+\usepackage[american]{babel}
  % Extension pour les graphiques EPS
  %\usepackage[dvips]{graphicx}
  \usepackage[pdftex,final]{graphicx}
  % Extension pour les liens intra-documents (tagged PDF)
  % et l'affichage correct des URL (commande \url{http://example.com})
-\usepackage{hyperref}
+%\usepackage{hyperref}
  
-\ifCLASSINFOpdf
-   \usepackage[pdftex]{graphicx}
-   \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
-\else
-\fi
+\algnewcommand\algorithmicinput{\textbf{Input:}}
+\algnewcommand\Input{\item[\algorithmicinput]}
  
+\algnewcommand\algorithmicoutput{\textbf{Output:}}
+\algnewcommand\Output{\item[\algorithmicoutput]}
  
  
-% correct bad hyphenation here
-\hyphenation{op-tical net-works semi-conduc-tor}
  
  
  \begin{document}
@@ -363,9 +354,9 @@
  % author names and affiliations
  % use a multiple column layout for up to three different
  % affiliations
-\author{\IEEEauthorblockN{Raphaël Couturier and Arnaud Giersch and David Laiymani and Charles-Emile Ramamonjisoa}
+\author{\IEEEauthorblockN{Raphaël Couturier and Arnaud Giersch and David Laiymani and Charles Emile Ramamonjisoa}
  \IEEEauthorblockA{Femto-ST Institute - DISC Department\\
-Université de Franche-Comté\\
+Université de Franche-Comté\\
  Belfort\\
  Email: raphael.couturier@univ-fcomte.fr}
  %\and
@@ -417,23 +408,271 @@ The abstract goes here.
  
  \section{Introduction}
  
-Présenter un bref état de l'art sur la simulation d'algos parallèles. Présenter rapidement les algos itératifs asynchrones et leurs avantages. Parler de leurs inconvénients en particulier la difficulté de déploiement à grande échelle donc il serait bien de simuler. Dire qu'à notre connaissance il n'existe pas de simulation de ce type d'algo.
-Présenter les travaux et les résultats obtenus. Annoncer le plan.
+Parallel computing and high performance computing (HPC) are becoming 
+more and more imperative for solving various problems raised by 
+researchers on various scientific disciplines but also by industrial in 
+the field. Indeed, the increasing complexity of these requested 
+applications combined with a continuous increase of their sizes lead to 
+write distributed and parallel algorithms requiring significant hardware 
+resources ( grid computing , clusters, broadband network ,etc... ) but 
+also a non- negligible CPU execution time. We consider in this paper a 
+class of highly efficient parallel algorithms called iterative executed 
+in a distributed environment. As their name suggests, these algorithm 
+solves a given problem that might be NP- complete complex by successive 
+iterations (X$_{n +1 }$= f (X$_{n}$) ) from an initial value X
+$_{0}$ to find an approximate value X* of the solution with a very low 
+residual error. Several well-known methods demonstrate the convergence 
+of these algorithms. Generally, to reduce the complexity and the 
+execution time, the problem is divided into several "pieces" that will 
+be solved in parallel on multiple processing units. The latter will 
+communicate each intermediate results before a new iteration starts 
+until the approximate solution is reached. These distributed parallel 
+computations can be performed either in "synchronous" communication mode 
+where a new iteration begin only when all nodes communications are 
+completed, either "asynchronous" mode where processors can continue 
+independently without or few synchronization points. Despite the 
+effectiveness of iterative approach, a major drawback of the method is 
+the requirement of huge resources in terms of computing capacity, 
+storage and high speed communication network. Indeed, limited physical 
+resources are blocking factors for large-scale deployment of parallel 
+algorithms. 
+
+In recent years, the use of a simulation environment to execute parallel 
+iterative algorithms found some interests in reducing the highly cost of 
+access to computing resources: (1) for the applications development life 
+cycle and in code debugging (2) and in production to get results in a 
+reasonable execution time with a simulated infrastructure not accessible 
+with physical resources. Indeed, the launch of distributed iterative 
+asynchronous algorithms to solve a given problem on a large-scale 
+simulated environment challenges to find optimal configurations giving 
+the best results with a lowest residual error and in the best of 
+execution time. According our knowledge, no testing of large-scale 
+simulation of the class of algorithm solving to achieve real results has 
+been undertaken to date. We had in the scope of this work implemented a 
+program for solving large non-symmetric linear system of equations by 
+numerical method GMRES (Generalized Minimal Residual ) in the simulation 
+environment Simgrid . The simulated platform had allowed us to launch 
+the application from a modest computing infrastructure by simulating 
+different distributed architectures composed by clusters nodes 
+interconnected by variable speed networks. In addition, it has been 
+permitted to show the effectiveness of asynchronous mode algorithm by 
+comparing its performance with the synchronous mode time. With selected 
+parameters on the network platforms (bandwidth, latency of inter cluster 
+network) and on the clusters architecture (number, capacity calculation 
+power) in the simulated environment , the experimental results have 
+demonstrated not only the algorithm convergence within a reasonable time 
+compared with the physical environment performance, but also a time 
+saving of up to 40 \% in asynchronous mode.
+
+This article is structured as follows: after this introduction, the next 
+section will give a brief description of iterative asynchronous model. 
+Then, the simulation framework SIMGRID will be presented with the 
+settings to create various distributed architectures. The algorithm of 
+the multi -splitting method used by GMRES written with MPI primitives 
+and its adaptation to Simgrid with SMPI (Simulation MPI ) will be in the 
+next section . At last, the experiments results carried out will be 
+presented before the conclusion which we will announce the opening of 
+our future work after the results.
   
  \section{The asynchronous iteration model}
  
-Décrire le modèle asynchrone. Je m'en charge (DL)
+Décrire le modèle asynchrone. Je m'en charge (DL)
  
  \section{SimGrid}
  
-Décrire SimGrid (Arnaud)
+Décrire SimGrid (Arnaud)
+
+
+
+
+
+
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Simulation of the multisplitting method}
+%Décrire le problème (algo) traité ainsi que le processus d'adaptation à SimGrid.
+Let $Ax=b$ be a large sparse system of $n$ linear equations in $\mathbb{R}$, where $A$ is a sparse square and nonsingular matrix, $x$ is the solution vector and $y$ is the right-hand side vector. We use a multisplitting method based on the block Jacobi partitioning to solve this linear system on a large scale platform composed of $L$ clusters of processors. In this case, we apply a row-by-row splitting without overlapping  
+\[
+\left(\begin{array}{ccc}
+A_{11} & \cdots & A_{1L} \\
+\vdots & \ddots & \vdots\\
+A_{L1} & \cdots & A_{LL}
+\end{array} \right)
+\times 
+\left(\begin{array}{c}
+X_1 \\
+\vdots\\
+X_L
+\end{array} \right)
+=
+\left(\begin{array}{c}
+Y_1 \\
+\vdots\\
+Y_L
+\end{array} \right)\] 
+in such a way that successive rows of matrix $A$ and both vectors $x$ and $b$ are assigned to one cluster, where for all $l,i\in\{1,\ldots,L\}$ $A_{li}$ is a rectangular block of $A$ of size $n_l\times n_i$, $X_l$ and $Y_l$ are sub-vectors of $x$ and $y$, respectively, each of size $n_l$ and $\sum_{l} n_l=\sum_{i} n_i=n$.
+
+The multisplitting method proceeds by iteration to solve in parallel the linear system by $L$ clusters of processors, in such a way each sub-system
+\begin{equation}
+\left\{
+\begin{array}{l}
+A_{ll}X_l = Y_l \mbox{,~such that}\\
+Y_l = B_l - \displaystyle\sum_{i=1,i\neq l}^{L}A_{li}X_i,
+\end{array}
+\right.
+\label{eq:4.1}
+\end{equation}
+is solved independently by a cluster and communication are required to update the right-hand side sub-vectors $Y_l$, such that the sub-vectors $X_i$ represent the data dependencies between the clusters. As each sub-system (\ref{eq:4.1}) is solved in parallel by a cluster of processors, our multisplitting method uses an iterative method as an inner solver which is easier to parallelize and more scalable than a direct method. In this work, we use the parallel GMRES method~\cite{ref1} which is one of the most used iterative method by many researchers. 
+
+\begin{algorithm}
+\caption{A multisplitting solver with inner iteration GMRES method}
+\begin{algorithmic}[1]
+\Input $A_l$ (local sparse matrix), $B_l$ (local right-hand side), $x^0$ (initial guess)
+\Output $X_l$ (local solution vector)\vspace{0.2cm}
+\State Load $A_l$, $B_l$, $x^0$
+\State Initialize the shared vector $\hat{x}=x^0$
+\For {$k=1,2,3,\ldots$ until the global convergence}
+\State $x^0=\hat{x}$
+\State Inner iteration solver: \Call{InnerSolver}{$x^0$, $k$}
+\State Exchange the local solution ${X}_l^k$ with the neighboring clusters and copy the shared vector elements in $\hat{x}$
+\EndFor
+
+\Statex
+
+\Function {InnerSolver}{$x^0$, $k$}
+\State Compute the local right-hand side: $Y_l = B_l - \sum^L_{i=1,i\neq l}A_{li}X_i^0$
+\State Solving the local splitting $A_{ll}X_l^k=Y_l$ using the parallel GMRES method, such that $X_l^0$ is the local initial guess
+\State \Return $X_l^k$
+\EndFunction
+\end{algorithmic}
+\label{algo:01}
+\end{algorithm}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+
+
+
+
  
-\section{Simulation of the multi-splitting method}
  
-Décrire le problème (algo) traité ainsi que le processus d'adaptation à SimGrid.
  
  \section{Experimental results}
  
+When the ``real'' application runs in the simulation environment and produces
+the expected results, varying the input parameters and the program arguments
+allows us to compare outputs from the code execution. We have noticed from this
+study that the results depend on the following parameters: (1) at the network
+level, we found that the most critical values are the bandwidth (bw) and the
+network latency (lat). (2) Hosts power (GFlops) can also influence on the
+results. And finally, (3) when submitting job batches for execution, the
+arguments values passed to the program like the maximum number of iterations or
+the ``external'' precision are critical to ensure not only the convergence of the
+algorithm but also to get the main objective of the experimentation of the
+simulation in having an execution time in asynchronous less than in synchronous
+mode, in others words, in having a ``speedup'' less than 1 (Speedup = Execution
+time in synchronous mode / Execution time in asynchronous mode).
+
+A priori, obtaining a speedup less than 1 would be difficult in a local area
+network configuration where the synchronous mode will take advantage on the rapid
+exchange of information on such high-speed links. Thus, the methodology adopted
+was to launch the application on clustered network. In this last configuration,
+degrading the inter-cluster network performance will "penalize" the synchronous
+mode allowing to get a speedup lower than 1. This action simulates the case of
+clusters linked with long distance network like Internet.
+
+As a first step, the algorithm was run on a network consisting of two clusters
+containing fifty hosts each, totaling one hundred hosts. Various combinations of
+the above factors have providing the results shown in Table~\ref{tab.cluster.2x50} with a matrix size
+ranging from Nx = Ny = Nz = 62 to 171 elements or from 62$^{3}$ = 238328 to
+171$^{3}$ = 5,211,000 entries.
+
+Then we have changed the network configuration using three clusters containing
+respectively 33, 33 and 34 hosts, or again by on hundred hosts for all the
+clusters. In the same way as above, a judicious choice of key parameters has
+permitted to get the results in Table~\ref{tab.cluster.3x33} which shows the speedups less than 1 with
+a matrix size from 62 to 100 elements.
+
+In a final step, results of an execution attempt to scale up the three clustered
+configuration but increasing by two hundreds hosts has been recorded in Table~\ref{tab.cluster.3x67}.
+
+Note that the program was run with the following parameters:
+
+\paragraph*{SMPI parameters}
+
+\begin{itemize}
+       \item HOSTFILE : Hosts file description.
+       \item PLATFORM: file description of the platform architecture : clusters (CPU power,
+... ) , intra cluster network description, inter cluster network (bandwidth bw ,
+lat latency , ... ).
+\end{itemize}
+
+
+\paragraph*{Arguments of the program}
+
+\begin{itemize}
+       \item Description of the cluster architecture;
+       \item Maximum number of internal and external iterations;
+       \item Internal and external precisions;
+       \item Matrix size NX , NY and NZ;
+       \item Matrix diagonal value = 6.0;
+       \item Execution Mode: synchronous or asynchronous.
+\end{itemize}
+
+\begin{table}
+  \centering
+  \caption{2 clusters X 50 nodes}
+  \label{tab.cluster.2x50}
+  \includegraphics[width=209pt]{img1.jpg}
+\end{table}
+
+\begin{table}
+  \centering
+  \caption{3 clusters X 33 nodes}
+  \label{tab.cluster.3x33}
+  \includegraphics[width=209pt]{img2.jpg}
+\end{table}
+
+\begin{table}
+  \centering
+  \caption{3 clusters X 67 nodes}
+  \label{tab.cluster.3x67}
+%  \includegraphics[width=160pt]{img3.jpg}
+  \includegraphics[scale=0.5]{img3.jpg}
+\end{table}
+
+\paragraph*{Interpretations and comments}
+
+After analyzing the outputs, generally, for the configuration with two or three
+clusters including one hundred hosts (Tables~\ref{tab.cluster.2x50} and~\ref{tab.cluster.3x33}), some combinations of the
+used parameters affecting the results have given a speedup less than 1, showing
+the effectiveness of the asynchronous performance compared to the synchronous
+mode.
+
+In the case of a two clusters configuration, Table~\ref{tab.cluster.2x50} shows that with a
+deterioration of inter cluster network set with 5 Mbits/s of bandwidth, a latency
+in order of a hundredth of a millisecond and a system power of one GFlops, an
+efficiency of about 40\% in asynchronous mode is obtained for a matrix size of 62
+elements . It is noticed that the result remains stable even if we vary the
+external precision from E -05 to E-09. By increasing the problem size up to 100
+elements, it was necessary to increase the CPU power of 50 \% to 1.5 GFlops for a
+convergence of the algorithm with the same order of asynchronous mode efficiency.
+Maintaining such a system power but this time, increasing network throughput
+inter cluster up to 50 Mbits /s, the result of efficiency of about 40\% is
+obtained with high external  precision of E-11 for a matrix size from 110 to 150
+side elements .
+
+For the 3 clusters architecture including a total of 100 hosts, Table~\ref{tab.cluster.3x33} shows
+that it was difficult to have a combination which gives an efficiency of
+asynchronous below 80 \%. Indeed, for a matrix size of 62 elements, equality
+between the performance of the two modes (synchronous and asynchronous) is
+achieved with an inter cluster of 10 Mbits/s and a latency of E- 01 ms. To
+challenge an efficiency by 78\% with a matrix size of 100 points, it was
+necessary to degrade the inter cluster network bandwidth from 5 to 2 Mbit/s.
+
+A last attempt was made for a configuration of three clusters but more power
+with 200 nodes in total. The convergence with a speedup of 90 \% was obtained
+with a bandwidth of 1 Mbits/s as shown in Table~\ref{tab.cluster.3x67}.
+
  \section{Conclusion}
  
  
@@ -555,7 +794,7 @@ The authors would like to thank...
  % http://www.michaelshell.org/tex/ieeetran/bibtex/
  \bibliographystyle{IEEEtran}
  % argument is your BibTeX string definitions and bibliography database(s)
-\bibliography{bib/hpccBib}
+\bibliography{hpccBib}
  %
  % <OR> manually copy in the resultant .bbl file
  % set second argument of \begin to the number of references