X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/hpcc2014.git/blobdiff_plain/74faf54391ee09cf05d205964b861e91ee559d74..15c7337ca6150b7e463ce966f188445f2d55e95a:/hpcc.tex?ds=sidebyside diff --git a/hpcc.tex b/hpcc.tex index 1dc732c..161a02c 100644 --- a/hpcc.tex +++ b/hpcc.tex @@ -4,7 +4,7 @@ \usepackage[utf8]{inputenc} \usepackage{amsfonts,amssymb} \usepackage{amsmath} -\usepackage{algorithm} +%\usepackage{algorithm} \usepackage{algpseudocode} %\usepackage{amsthm} \usepackage{graphicx} @@ -101,84 +101,125 @@ simulated large scale growing environment and with larger problem size. \section{Introduction} -Parallel computing and high performance computing (HPC) are becoming -more and more imperative for solving various problems raised by -researchers on various scientific disciplines but also by industrial in -the field. Indeed, the increasing complexity of these requested -applications combined with a continuous increase of their sizes lead to -write distributed and parallel algorithms requiring significant hardware -resources (grid computing, clusters, broadband network, etc.) but -also a non-negligible CPU execution time. We consider in this paper a -class of highly efficient parallel algorithms called iterative executed -in a distributed environment. As their name suggests, these algorithm -solves a given problem that might be NP-complete complex by successive -iterations ($X_{n +1} = f(X_{n})$) from an initial value $X_{0}$ to find -an approximate value $X^*$ of the solution with a very low -residual error. Several well-known methods demonstrate the convergence -of these algorithms. Generally, to reduce the complexity and the -execution time, the problem is divided into several \emph{pieces} that will -be solved in parallel on multiple processing units. The latter will -communicate each intermediate results before a new iteration starts -until the approximate solution is reached. These distributed parallel -computations can be performed either in \emph{synchronous} communication mode -where a new iteration begin only when all nodes communications are -completed, either \emph{asynchronous} mode where processors can continue -independently without or few synchronization points. Despite the -effectiveness of iterative approach, a major drawback of the method is -the requirement of huge resources in terms of computing capacity, -storage and high speed communication network. Indeed, limited physical -resources are blocking factors for large-scale deployment of parallel -algorithms. - -In recent years, the use of a simulation environment to execute parallel -iterative algorithms found some interests in reducing the highly cost of -access to computing resources: (1) for the applications development life -cycle and in code debugging (2) and in production to get results in a -reasonable execution time with a simulated infrastructure not accessible -with physical resources. Indeed, the launch of distributed iterative -asynchronous algorithms to solve a given problem on a large-scale -simulated environment challenges to find optimal configurations giving -the best results with a lowest residual error and in the best of -execution time. According our knowledge, no testing of large-scale -simulation of the class of algorithm solving to achieve real results has -been undertaken to date. 
We had in the scope of this work implemented a
-program for solving large non-symmetric linear system of equations by
-numerical method GMRES (Generalized Minimal Residual) in the simulation
-environment SimGrid. The simulated platform had allowed us to launch
-the application from a modest computing infrastructure by simulating
-different distributed architectures composed by clusters nodes
-interconnected by variable speed networks. In addition, it has been
-permitted to show the effectiveness of asynchronous mode algorithm by
-comparing its performance with the synchronous mode time. With selected
-parameters on the network platforms (bandwidth, latency of inter cluster
-network) and on the clusters architecture (number, capacity calculation
-power) in the simulated environment, the experimental results have
-demonstrated not only the algorithm convergence within a reasonable time
-compared with the physical environment performance, but also a time
+Parallel computing and high performance computing (HPC) are becoming more and more essential for solving the various
+problems raised by researchers in many scientific disciplines, but also by practitioners in industry. Indeed, the
+increasing complexity of these applications, combined with the continuous growth of their sizes, leads to the design of
+distributed and parallel algorithms requiring significant hardware resources (grid computing, clusters, broadband
+networks, etc.) but also a non-negligible CPU execution time. We consider in this paper a class of highly efficient
+parallel algorithms called \texttt{numerical iterative algorithms} executed in a distributed environment. As their name
+suggests, these algorithms solve a given problem by successive iterations ($X_{n+1} = f(X_{n})$), starting from an
+initial value $X_{0}$, in order to find an approximate value $X^*$ of the solution with a very low residual error.
+Several well-known results establish the convergence of these algorithms \cite{}.
+
+Parallelizing such algorithms generally involves dividing the problem into several \emph{pieces} that are solved in
+parallel on multiple processing units. The processing units exchange their intermediate results before each new
+iteration starts, until the approximate solution is reached. These parallel computations can be performed either in
+\emph{synchronous} communication mode, where a new iteration begins only when all node communications are completed, or
+in \emph{asynchronous} mode, where processors can continue independently with few or no synchronization points.
+
+% DL: resume the corrections here
+Despite the effectiveness of the iterative approach, a major drawback of the method is its requirement of huge
+resources in terms of computing capacity, storage and high-speed communication networks. Indeed, limited physical
+resources are a blocking factor for the large-scale deployment of parallel algorithms.
+
+In recent years, the use of simulation environments to execute parallel iterative algorithms has attracted some
+interest, as it reduces the high cost of access to computing resources: (1) during the application development life
+cycle and for code debugging, and (2) in production, to obtain results in a reasonable execution time on a simulated
+infrastructure that is not accessible with physical resources. Indeed, running distributed asynchronous iterative
+algorithms on a large-scale simulated environment makes it possible to look for the optimal configurations, i.e. those
+giving the best results with the lowest residual error and the shortest execution time.
+To our knowledge, no large-scale simulation study of this class of algorithms aimed at obtaining real results has been
+undertaken to date. In the scope of this work, we have implemented a program that solves large non-symmetric linear
+systems of equations with the numerical method GMRES (Generalized Minimal RESidual) in the SimGrid simulation
+environment. The simulated platform allowed us to launch the application from a modest computing infrastructure by
+simulating different distributed architectures composed of cluster nodes interconnected by networks of variable speed.
+In addition, it allowed us to show the effectiveness of the asynchronous mode of the algorithm by comparing its
+performance with that of the synchronous mode. With selected parameters for the network platform (bandwidth and latency
+of the inter-cluster network) and for the cluster architecture (number of nodes and their computing power) in the
+simulated environment, the experimental results demonstrated not only the convergence of the algorithm within a
+reasonable time compared with the performance of a physical environment, but also a time
 saving of up to \np[\%]{40} in asynchronous mode.
 
-This article is structured as follows: after this introduction, the next
-section will give a brief description of iterative asynchronous model.
-Then, the simulation framework SimGrid will be presented with the
-settings to create various distributed architectures. The algorithm of
-the multi-splitting method used by GMRES written with MPI primitives
-and its adaptation to SimGrid with SMPI (Simulated MPI) will be in the
-next section. At last, the experiments results carried out will be
-presented before the conclusion which we will announce the opening of
-our future work after the results.
+This article is structured as follows: after this introduction, the next section gives a brief description of the
+asynchronous iteration model. Then, the SimGrid simulation framework is presented, together with the settings used to
+create various distributed architectures. The algorithm of the multisplitting method based on GMRES, written with MPI
+primitives, and its adaptation to SimGrid with SMPI (Simulated MPI) are described in the next section. Finally, the
+experimental results are presented before the conclusion, in which we outline our future work.
 
-\section{The asynchronous iteration model}
+\section{Motivations and scientific context}
+
+As explained in the introduction, parallel iterative methods are now widely used in many scientific domains. They can
+be classified into three main classes depending on how iterations and communications are managed (for more details,
+readers can refer to \cite{bcvc02:ip}). In the \textit{Synchronous Iterations - Synchronous Communications (SISC)}
+model, data are exchanged at the end of each iteration. All processors must begin the same iteration at the same time,
+which generates significant idle times on the processors. The \textit{Synchronous Iterations - Asynchronous
+Communications (SIAC)} model is similar to the previous one, except that the data required by another processor are
+sent asynchronously, i.e. without stopping the current computations. This technique makes it possible to partially
+overlap communications with computations but, unfortunately, the overlapping is only partial and significant idle times
+remain.
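+
+To make the cost of these synchronizations concrete, the following minimal sketch shows the typical structure of a
+SISC-style computation written with MPI. It is only an illustration and not part of our solver: the toy contracting
+fixed-point update and the single exchanged boundary value stand in for the real local computation and data
+dependencies, and the stopping threshold is arbitrary.
+
+\begin{verbatim}
+#include <mpi.h>
+#include <math.h>
+#include <stdio.h>
+
+/* SISC-style skeleton: the blocking exchange and the global reduction
+   force all processes to start iteration k+1 together, which is where
+   the idle times discussed above appear. */
+int main(int argc, char **argv)
+{
+  MPI_Init(&argc, &argv);
+  int rank, size;
+  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+  MPI_Comm_size(MPI_COMM_WORLD, &size);
+
+  int right = (rank + 1) % size, left = (rank + size - 1) % size;
+  double x = rank, from_left = 0.0, residual = 1.0;
+
+  while (residual > 1e-12) {
+    /* local computation, using the last value received from the left */
+    double x_new = cos(0.5 * (x + from_left));
+    /* blocking exchange of the boundary value with the neighbors */
+    MPI_Sendrecv(&x_new, 1, MPI_DOUBLE, right, 0,
+                 &from_left, 1, MPI_DOUBLE, left, 0,
+                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+    double local = fabs(x_new - x);
+    /* global synchronization point: every process waits here */
+    MPI_Allreduce(&local, &residual, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
+    x = x_new;
+  }
+  if (rank == 0) printf("converged to %f\n", x);
+  MPI_Finalize();
+  return 0;
+}
+\end{verbatim}
+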
+It is clear that, in a grid computing context, where the computational nodes are numerous, heterogeneous and widely
+distributed, the idle times generated by synchronizations are very penalizing. One way to overcome this problem is to
+use the \textit{Asynchronous Iterations - Asynchronous Communications (AIAC)} model. Here, local computations do not
+need to wait for the required data. Processors can perform their iterations with the data available at that time.
+Figure~\ref{fig:aiac} illustrates this model: the grey blocks represent the computation phases, the white spaces the
+idle times and the arrows the communications. With this algorithmic model, the number of iterations required before
+convergence is generally greater than for the two former classes. However, as detailed in \cite{bcvc06:ij}, AIAC
+algorithms can significantly reduce overall execution times by suppressing the idle times due to synchronizations,
+especially in a grid computing context.
+
+\begin{figure}[htbp]
+  \centering
+    \includegraphics[width=8cm]{AIAC.pdf}
+  \caption{The Asynchronous Iterations - Asynchronous Communications model}
+  \label{fig:aiac}
+\end{figure}
+
+It is very challenging to develop efficient applications for large-scale, heterogeneous and distributed platforms such
+as computing grids. Researchers and engineers have to develop techniques for maximizing the application performance on
+these multi-cluster platforms, by redesigning the applications and/or by using novel algorithms that can account for
+the composite and heterogeneous nature of the platform. Unfortunately, the deployment of such applications on these
+very large-scale systems is very costly, labor intensive and time consuming. In this context, the use of simulation
+tools to explore various platform scenarios at will and to run large numbers of experiments quickly appears very
+promising. Several works...
+
+In the context of AIAC algorithms, the use of simulation tools is even more relevant. Indeed, this class of
+applications is very sensitive to the execution environment. For instance, variations in the network bandwidth (intra-
+and inter-cluster), in the number and power of the nodes, in the number of clusters, etc., can lead to very different
+numbers of iterations and thus to very different execution times.
+
 
-\DL{Décrire le modèle asynchrone. Je m'en charge}
 \section{SimGrid}
 
-\AG{Décrire SimGrid~\cite{casanova+legrand+quinson.2008.simgrid,SimGrid} (Arnaud)}
+SimGrid~\cite{casanova+legrand+quinson.2008.simgrid,SimGrid} is a simulation
+framework to study the behavior of large-scale distributed systems. As its name
+suggests, it originates from the grid computing community, but it is nowadays
+used to study grids, clouds, HPC and peer-to-peer systems.
+%- open source, developed since 1999, one of the major solutions in the field
+%
+SimGrid provides several programming interfaces: MSG to simulate Concurrent
+Sequential Processes, SimDAG to simulate DAGs of (parallel) tasks, and SMPI to
+run real applications written in MPI~\cite{MPI}. Apart from the native C
+interface, SimGrid provides bindings for the C++, Java, Lua and Ruby programming
+languages. The SMPI interface supports applications written in C or Fortran,
+with little or no modification.
+%- implements most of MPI-2 \cite{ref} standard [CHECK]
+
+%%% explain simulation
+%- simulated processes folded in one real process
+%- simulates interactions on the network, fluid model
+%- able to skip long-lasting computations
+%- traces + visu?
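+
+As an illustration of the workflow, the commands below show how an MPI code such as ours can typically be compiled and
+run on top of SMPI with the \texttt{smpicc} and \texttt{smpirun} wrappers shipped with SimGrid. The file names
+(\texttt{gmres.c}, \texttt{platform.xml}, \texttt{hostfile}) and the number of processes are only placeholders, and the
+exact options may vary between SimGrid versions; the platform file is an XML description of the simulated clusters and
+of their interconnection.
+
+\begin{verbatim}
+smpicc -O3 gmres.c -o gmres
+smpirun -np 32 -platform platform.xml -hostfile hostfile ./gmres
+\end{verbatim}
+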
-%%% brief history?
-%%% programming interfaces: MSG, SimDAG, SMPI
 %%% platforms
-%%% validation?
+%- describe resources and their interconnection, with their properties
+%- XML files
+
+%%% validation + refs
+
+\AG{Describe SimGrid~\cite{casanova+legrand+quinson.2008.simgrid,SimGrid} (Arnaud)}
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Simulation of the multisplitting method}
@@ -216,8 +257,9 @@ Y_l = B_l - \displaystyle\sum_{\substack{m=1\\ m\neq l}}^{L}A_{lm}X_m
 \end{equation}
 is solved independently by a cluster and communications are required to update the right-hand side sub-vector $Y_l$, such that the sub-vectors $X_m$ represent the data dependencies between the clusters. As each sub-system (\ref{eq:4.1}) is solved in parallel by a cluster of processors, our multisplitting method uses an iterative method as an inner solver which is easier to parallelize and more scalable than a direct method. In this work, we use the parallel GMRES method~\cite{ref1}, which is one of the most used iterative methods by many researchers.
-\begin{algorithm}
-\caption{A multisplitting solver with GMRES method}
+\begin{figure}
+  %%% IEEE instructions forbid to use an algorithm environment here, use figure
+  %%% instead
 \begin{algorithmic}[1]
 \Input $A_l$ (sparse sub-matrix), $B_l$ (right-hand side sub-vector)
 \Output $X_l$ (solution sub-vector)\vspace{0.2cm}
@@ -238,10 +280,23 @@ is solved independently by a cluster and communications are required to update t
 \State \Return $X_l^k$
 \EndFunction
 \end{algorithmic}
+\caption{A multisplitting solver with GMRES method}
 \label{algo:01}
-\end{algorithm}
+\end{figure}
 
-Algorithm~\ref{algo:01} shows the main key points of the multisplitting method to solve a large sparse linear system. This algorithm is based on an outer-inner iteration method where the parallel synchronous GMRES method is used to solve the inner iteration. It is executed in parallel by each cluster of processors. For all $l,m\in\{1,\ldots,L\}$, the matrices and vectors with the subscript $l$ represent the local data for cluster $l$, while $\{A_{lm}\}_{m\neq l}$ are off-diagonal matrices of sparse matrix $A$ and $\{X_m\}_{m\neq l}$ contain vector elements of solution $x$ shared with neighboring clusters. At every outer iteration $k$, asynchronous communications are performed between processors of the local cluster and those of distant clusters (lines $6$ and $7$ in Algorithm~\ref{algo:01}). The shared vector elements of the solution $x$ are exchanged by message passing using MPI non-blocking communication routines.
+The algorithm in Figure~\ref{algo:01} shows the key points of the
+multisplitting method used to solve a large sparse linear system. This
+algorithm is based on an outer-inner iteration scheme in which the parallel
+synchronous GMRES method is used to solve the inner iteration. It is executed
+in parallel by each cluster of processors. For all $l,m\in\{1,\ldots,L\}$, the
+matrices and vectors with the subscript $l$ represent the local data for
+cluster $l$, while $\{A_{lm}\}_{m\neq l}$ are the off-diagonal sub-matrices of
+the sparse matrix $A$ and $\{X_m\}_{m\neq l}$ contain the vector elements of
+the solution $x$ shared with neighboring clusters. At every outer iteration
+$k$, asynchronous communications are performed between the processors of the
+local cluster and those of distant clusters (lines $6$ and $7$ in
+Figure~\ref{algo:01}). The shared vector elements of the solution $x$ are
+exchanged by message passing using MPI non-blocking communication routines.
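+
+For illustration, the fragment below sketches how such a non-blocking exchange can be written with standard MPI
+routines. It is a simplified view rather than the actual code of our solver: the buffers, the neighbor list and the
+message size are hypothetical, all shared elements are assumed to be sent in one contiguous block per neighbor, and
+error handling is omitted.
+
+\begin{verbatim}
+#include <mpi.h>
+
+/* Sketch of the exchange performed at lines 6 and 7 of the algorithm:
+   each cluster posts non-blocking sends of its shared vector elements
+   and non-blocking receives for the elements owned by distant clusters,
+   then carries on computing. */
+void exchange_shared(double *x_shared, double *x_recv, int n,
+                     const int *neighbors, int nb_neighbors,
+                     MPI_Request *requests)   /* 2*nb_neighbors slots */
+{
+  for (int i = 0; i < nb_neighbors; i++) {
+    MPI_Isend(x_shared, n, MPI_DOUBLE, neighbors[i], 0,
+              MPI_COMM_WORLD, &requests[2 * i]);
+    MPI_Irecv(&x_recv[i * n], n, MPI_DOUBLE, neighbors[i], 0,
+              MPI_COMM_WORLD, &requests[2 * i + 1]);
+  }
+  /* In synchronous mode the caller would then MPI_Waitall() on the
+     requests; in asynchronous mode it only tests them and keeps
+     iterating with the most recently received values. */
+}
+\end{verbatim}
+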
\begin{figure} \centering