ajout de AIAC.pdf

[hpcc2014.git] / hpcc.tex
diff --git a/hpcc.tex b/hpcc.tex

index 99d8c3860af7fe5248aad9ab40355f87ffa6071b..161a02c6b67cfee72e1ba8c5ed93851d02b5dec8 100644 (file)
--- a/hpcc.tex
+++ b/hpcc.tex
@@ -4,7 +4,7 @@
  \usepackage[utf8]{inputenc}
  \usepackage{amsfonts,amssymb}
  \usepackage{amsmath}
-\usepackage{algorithm}
+%\usepackage{algorithm}
  \usepackage{algpseudocode}
  %\usepackage{amsthm}
  \usepackage{graphicx}
@@ -101,80 +101,125 @@ simulated large scale growing environment and with larger problem size.
  
  \section{Introduction}
  
-Parallel computing and high performance computing (HPC) are becoming 
-more and more imperative for solving various problems raised by 
-researchers on various scientific disciplines but also by industrial in 
-the field. Indeed, the increasing complexity of these requested 
-applications combined with a continuous increase of their sizes lead to 
-write distributed and parallel algorithms requiring significant hardware 
-resources (grid computing, clusters, broadband network, etc.) but
-also a non-negligible CPU execution time. We consider in this paper a
-class of highly efficient parallel algorithms called iterative executed 
-in a distributed environment. As their name suggests, these algorithm 
-solves a given problem that might be NP-complete complex by successive
-iterations ($X_{n +1} = f(X_{n})$) from an initial value $X_{0}$ to find
-an approximate value $X^*$ of the solution with a very low
-residual error. Several well-known methods demonstrate the convergence 
-of these algorithms. Generally, to reduce the complexity and the 
-execution time, the problem is divided into several \emph{pieces} that will
-be solved in parallel on multiple processing units. The latter will 
-communicate each intermediate results before a new iteration starts 
-until the approximate solution is reached. These distributed parallel 
-computations can be performed either in \emph{synchronous} communication mode
-where a new iteration begin only when all nodes communications are 
-completed, either \emph{asynchronous} mode where processors can continue
-independently without or few synchronization points. Despite the 
-effectiveness of iterative approach, a major drawback of the method is 
-the requirement of huge resources in terms of computing capacity, 
-storage and high speed communication network. Indeed, limited physical 
-resources are blocking factors for large-scale deployment of parallel 
-algorithms. 
-
-In recent years, the use of a simulation environment to execute parallel 
-iterative algorithms found some interests in reducing the highly cost of 
-access to computing resources: (1) for the applications development life 
-cycle and in code debugging (2) and in production to get results in a 
-reasonable execution time with a simulated infrastructure not accessible 
-with physical resources. Indeed, the launch of distributed iterative 
-asynchronous algorithms to solve a given problem on a large-scale 
-simulated environment challenges to find optimal configurations giving 
-the best results with a lowest residual error and in the best of 
-execution time. According our knowledge, no testing of large-scale 
-simulation of the class of algorithm solving to achieve real results has 
-been undertaken to date. We had in the scope of this work implemented a 
-program for solving large non-symmetric linear system of equations by 
-numerical method GMRES (Generalized Minimal Residual) in the simulation
-environment SimGrid. The simulated platform had allowed us to launch
-the application from a modest computing infrastructure by simulating 
-different distributed architectures composed by clusters nodes 
-interconnected by variable speed networks. In addition, it has been 
-permitted to show the effectiveness of asynchronous mode algorithm by 
-comparing its performance with the synchronous mode time. With selected 
-parameters on the network platforms (bandwidth, latency of inter cluster 
-network) and on the clusters architecture (number, capacity calculation 
-power) in the simulated environment, the experimental results have
-demonstrated not only the algorithm convergence within a reasonable time 
-compared with the physical environment performance, but also a time 
+Parallel computing and high performance computing (HPC) are becoming  more and more imperative for solving various
+problems raised by  researchers on various scientific disciplines but also by industrial in  the field. Indeed, the
+increasing complexity of these requested  applications combined with a continuous increase of their sizes lead to  write
+distributed and parallel algorithms requiring significant hardware  resources (grid computing, clusters, broadband
+network, etc.) but also a non-negligible CPU execution time. We consider in this paper a class of highly efficient
+parallel algorithms called \texttt{numerical iterative algorithms} executed in a distributed environment. As their name
+suggests, these algorithm solves a given problem by successive iterations ($X_{n +1} = f(X_{n})$) from an initial value
+$X_{0}$ to find an approximate value $X^*$ of the solution with a very low residual error. Several well-known methods
+demonstrate the convergence of these algorithms \cite{}. 
+
+Parallelization of such algorithms generally involved the division of the problem into several \emph{pieces} that will
+be solved in parallel on multiple processing units. The latter will communicate each intermediate results before a new
+iteration starts  until the approximate solution is reached. These parallel  computations can be performed
+either in \emph{synchronous} communication mode where a new iteration begin only when all nodes communications are
+completed, either \emph{asynchronous} mode where processors can continue independently without or few synchronization
+points. 
+
+% DL : reprendre correction ici 
+Despite the effectiveness of iterative approach, a major drawback of the method is  the requirement of huge
+resources in terms of computing capacity,  storage and high speed communication network. Indeed, limited physical
+resources are blocking factors for large-scale deployment of parallel algorithms.
+
+In recent years, the use of a simulation environment to execute parallel  iterative algorithms found some interests in
+reducing the highly cost of  access to computing resources: (1) for the applications development life  cycle and in code
+debugging (2) and in production to get results in a  reasonable execution time with a simulated infrastructure not
+accessible  with physical resources. Indeed, the launch of distributed iterative  asynchronous algorithms to solve a
+given problem on a large-scale  simulated environment challenges to find optimal configurations giving  the best results
+with a lowest residual error and in the best of  execution time. According our knowledge, no testing of large-scale
+simulation of the class of algorithm solving to achieve real results has  been undertaken to date. We had in the scope
+of this work implemented a  program for solving large non-symmetric linear system of equations by  numerical method
+GMRES (Generalized Minimal Residual) in the simulation environment SimGrid. The simulated platform had allowed us to
+launch the application from a modest computing infrastructure by simulating  different distributed architectures
+composed by clusters nodes  interconnected by variable speed networks. In addition, it has been  permitted to show the
+effectiveness of asynchronous mode algorithm by  comparing its performance with the synchronous mode time. With selected
+parameters on the network platforms (bandwidth, latency of inter cluster  network) and on the clusters architecture
+(number, capacity calculation  power) in the simulated environment, the experimental results have demonstrated not only
+the algorithm convergence within a reasonable time  compared with the physical environment performance, but also a time
  saving of up to \np[\%]{40} in asynchronous mode.
  
-This article is structured as follows: after this introduction, the next 
-section will give a brief description of iterative asynchronous model. 
-Then, the simulation framework SimGrid will be presented with the
-settings to create various distributed architectures. The algorithm of 
-the multi-splitting method used by GMRES written with MPI primitives
-and its adaptation to SimGrid with SMPI (Simulated MPI) will be in the
-next section. At last, the experiments results carried out will be
-presented before the conclusion which we will announce the opening of 
-our future work after the results.
+This article is structured as follows: after this introduction, the next  section will give a brief description of
+iterative asynchronous model.  Then, the simulation framework SimGrid will be presented with the settings to create
+various distributed architectures. The algorithm of  the multi-splitting method used by GMRES written with MPI
+primitives and its adaptation to SimGrid with SMPI (Simulated MPI) will be in the next section. At last, the experiments
+results carried out will be presented before the conclusion which we will announce the opening of  our future work after
+the results.
   
-\section{The asynchronous iteration model}
+\section{Motivations and scientific context}
+
+As exposed in the introduction, parallel iterative methods are now widely used in many scientific domains. They can be
+classified in three main classes depending on how iterations and communications are managed (for more details readers
+can refer to \cite{bcvc02:ip}). In the \textit{Synchronous Iterations - Synchronous Communications (SISC)} model data
+are exchanged at the end of each iteration. All the processors must begin the same iteration at the same time and
+important idle times on processors are generated. The \textit{Synchronous Iterations - Asynchronous Communications
+(SIAC)} model can be compared to the previous one except that data required on another processor are sent asynchronously
+i.e.  without stopping current computations. This technique allows to partially overlap communications by computations
+but unfortunately, the overlapping is only partial and important idle times remain.  It is clear that, in a grid
+computing context, where the number of computational nodes is large, heterogeneous and widely distributed, the idle
+times generated by synchronizations are very penalizing. One way to overcome this problem is to use the
+\textit{Asynchronous Iterations - Asynchronous   Communications (AIAC)} model. Here, local computations do not need to
+wait for required data. Processors can then perform their iterations with the data present at that time. Figure
+\ref{fig:aiac} illustrates this model where the grey blocks represent the computation phases, the white spaces the idle
+times and the arrows the communications. With this algorithmic model, the number of iterations required before the
+convergence is generally greater than for the two former classes. But, and as detailed in \cite{bcvc06:ij}, AIAC
+algorithms can significantly reduce overall execution times by suppressing idle times due to synchronizations especially
+in a grid computing context.
+
+\begin{figure}[htbp]
+  \centering
+    \includegraphics[width=8cm]{AIAC.pdf}
+  \caption{The Asynchronous Iterations - Asynchronous Communications model } 
+  \label{fig:aiac}
+\end{figure}
+
+
+It is very challenging to develop efficient applications for large scale, heterogeneous and distributed platforms such
+as computing grids. Researchers and engineers have to develop techniques for maximizing application performance of these
+multi-cluster platforms, by redesigning the applications and/or by using novel algorithms that can account for the
+composite and heterogeneous nature of the platform. Unfortunately, the deployment of such applications on these very
+large scale systems is very costly, labor intensive and time consuming. In this context, it appears that the use of
+simulation tools to explore various platform scenarios at will and to run enormous numbers of experiments quickly can be
+very promising. Several works...
+
+In the context of AIAC algorithms, the use of simulation tools is even more relevant. Indeed, this class of applications
+is very sensible to the execution environment context. For instance, variations in the network bandwith (intra and
+inter-clusters), in the number and the power of nodes, in the number of clusters... can lead to very different number of
+iterations and so to very different execution times.
  
-\DL{Décrire le modèle asynchrone. Je m'en charge}
  
-\section{SimGrid}
  
-\AG{Décrire SimGrid~\cite{casanova+legrand+quinson.2008.simgrid} (Arnaud)}
  
+\section{SimGrid}
+
+SimGrid~\cite{casanova+legrand+quinson.2008.simgrid,SimGrid} is a simulation
+framework to sudy the behavior of large-scale distributed systems.  As its name
+says, it emanates from the grid computing community, but is nowadays used to
+study grids, clouds, HPC or peer-to-peer systems.
+%- open source, developped since 1999, one of the major solution in the field
+%
+SimGrid provides several programming interfaces: MSG to simulate Concurrent
+Sequential Processes, SimDAG to simulate DAGs of (parallel) tasks, and SMPI to
+run real applications written in MPI~\cite{MPI}.  Apart from the native C
+interface, SimGrid provides bindings for the C++, Java, Lua and Ruby programming
+languages.  The SMPI interface supports applications written in C or Fortran,
+with little or no modifications.
+%- implements most of MPI-2 \cite{ref} standard [CHECK]
+
+%%% explain simulation
+%- simulated processes folded in one real process
+%- simulates interactions on the network, fluid model
+%- able to skip long-lasting computations
+%- traces + visu?
+
+%%% platforms
+%- describe resources and their interconnection, with their properties
+%- XML files
+
+%%% validation + refs
+
+\AG{Décrire SimGrid~\cite{casanova+legrand+quinson.2008.simgrid,SimGrid} (Arnaud)}
  
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  \section{Simulation of the multisplitting method}
@@ -212,8 +257,9 @@ Y_l = B_l - \displaystyle\sum_{\substack{m=1\\ m\neq l}}^{L}A_{lm}X_m
  \end{equation}
  is solved independently by a cluster and communications are required to update the right-hand side sub-vector $Y_l$, such that the sub-vectors $X_m$ represent the data dependencies between the clusters. As each sub-system (\ref{eq:4.1}) is solved in parallel by a cluster of processors, our multisplitting method uses an iterative method as an inner solver which is easier to parallelize and more scalable than a direct method. In this work, we use the parallel algorithm of GMRES method~\cite{ref1} which is one of the most used iterative method by many researchers. 
  
-\begin{algorithm}
-\caption{A multisplitting solver with GMRES method}
+\begin{figure}
+  %%% IEEE instructions forbid to use an algorithm environment here, use figure
+  %%% instead
  \begin{algorithmic}[1]
  \Input $A_l$ (sparse sub-matrix), $B_l$ (right-hand side sub-vector)
  \Output $X_l$ (solution sub-vector)\vspace{0.2cm}
@@ -234,10 +280,23 @@ is solved independently by a cluster and communications are required to update t
  \State \Return $X_l^k$
  \EndFunction
  \end{algorithmic}
+\caption{A multisplitting solver with GMRES method}
  \label{algo:01}
-\end{algorithm}
+\end{figure}
  
-Algorithm~\ref{algo:01} shows the main key points of the multisplitting method to solve a large sparse linear system. This algorithm is based on an outer-inner iteration method where the parallel synchronous GMRES method is used to solve the inner iteration. It is executed in parallel by each cluster of processors. For all $l,m\in\{1,\ldots,L\}$, the matrices and vectors with the subscript $l$ represent the local data for cluster $l$, while $\{A_{lm}\}_{m\neq l}$ are off-diagonal matrices of sparse matrix $A$ and $\{X_m\}_{m\neq l}$ contain vector elements of solution $x$ shared with neighboring clusters. At every outer iteration $k$, asynchronous communications are performed between processors of the local cluster and those of distant clusters (lines $6$ and $7$ in Algorithm~\ref{algo:01}). The shared vector elements of the solution $x$ are exchanged by message passing using MPI non-blocking communication routines. 
+Algorithm on Figure~\ref{algo:01} shows the main key points of the
+multisplitting method to solve a large sparse linear system. This algorithm is
+based on an outer-inner iteration method where the parallel synchronous GMRES
+method is used to solve the inner iteration. It is executed in parallel by each
+cluster of processors. For all $l,m\in\{1,\ldots,L\}$, the matrices and vectors
+with the subscript $l$ represent the local data for cluster $l$, while
+$\{A_{lm}\}_{m\neq l}$ are off-diagonal matrices of sparse matrix $A$ and
+$\{X_m\}_{m\neq l}$ contain vector elements of solution $x$ shared with
+neighboring clusters. At every outer iteration $k$, asynchronous communications
+are performed between processors of the local cluster and those of distant
+clusters (lines $6$ and $7$ in Figure~\ref{algo:01}). The shared vector
+elements of the solution $x$ are exchanged by message passing using MPI
+non-blocking communication routines.
  
  \begin{figure}
  \centering