coreections et améliorations

[hpcc2014.git] / hpcc.tex
diff --git a/hpcc.tex b/hpcc.tex

index 161a02c6b67cfee72e1ba8c5ed93851d02b5dec8..d4dc19e7e508a2a140f9fa7d39af01f284cad954 100644 (file)
--- a/hpcc.tex
+++ b/hpcc.tex
@@ -40,7 +40,6 @@
  
  \newcommand{\MI}{\mathit{MaxIter}}
  
-
  \begin{document}
  
  \title{Simulation of Asynchronous Iterative Numerical Algorithms Using SimGrid}
@@ -111,41 +110,51 @@ suggests, these algorithm solves a given problem by successive iterations ($X_{n
  $X_{0}$ to find an approximate value $X^*$ of the solution with a very low residual error. Several well-known methods
  demonstrate the convergence of these algorithms \cite{}. 
  
-Parallelization of such algorithms generally involved the division of the problem into several \emph{pieces} that will
+Parallelization of such algorithms generally involved the division of the problem into several \emph{blocks} that will
  be solved in parallel on multiple processing units. The latter will communicate each intermediate results before a new
-iteration starts  until the approximate solution is reached. These parallel  computations can be performed
-either in \emph{synchronous} communication mode where a new iteration begin only when all nodes communications are
-completed, either \emph{asynchronous} mode where processors can continue independently without or few synchronization
-points. 
-
-% DL : reprendre correction ici 
-Despite the effectiveness of iterative approach, a major drawback of the method is  the requirement of huge
-resources in terms of computing capacity,  storage and high speed communication network. Indeed, limited physical
-resources are blocking factors for large-scale deployment of parallel algorithms.
-
-In recent years, the use of a simulation environment to execute parallel  iterative algorithms found some interests in
-reducing the highly cost of  access to computing resources: (1) for the applications development life  cycle and in code
-debugging (2) and in production to get results in a  reasonable execution time with a simulated infrastructure not
-accessible  with physical resources. Indeed, the launch of distributed iterative  asynchronous algorithms to solve a
-given problem on a large-scale  simulated environment challenges to find optimal configurations giving  the best results
-with a lowest residual error and in the best of  execution time. According our knowledge, no testing of large-scale
-simulation of the class of algorithm solving to achieve real results has  been undertaken to date. We had in the scope
-of this work implemented a  program for solving large non-symmetric linear system of equations by  numerical method
-GMRES (Generalized Minimal Residual) in the simulation environment SimGrid. The simulated platform had allowed us to
-launch the application from a modest computing infrastructure by simulating  different distributed architectures
-composed by clusters nodes  interconnected by variable speed networks. In addition, it has been  permitted to show the
-effectiveness of asynchronous mode algorithm by  comparing its performance with the synchronous mode time. With selected
-parameters on the network platforms (bandwidth, latency of inter cluster  network) and on the clusters architecture
-(number, capacity calculation  power) in the simulated environment, the experimental results have demonstrated not only
-the algorithm convergence within a reasonable time  compared with the physical environment performance, but also a time
-saving of up to \np[\%]{40} in asynchronous mode.
+iteration starts and until the approximate solution is reached. These parallel  computations can be performed either in
+\emph{synchronous} mode where a new iteration begin only when all nodes communications are completed,
+either \emph{asynchronous} mode where processors can continue independently without or few synchronization points. For
+instance in the \textit{Asynchronous Iterations - Asynchronous   Communications (AIAC)} model \cite{bcvc06:ij}, local
+computations do not need to wait for required data. Processors can then perform their iterations with the data present
+at that time. Even if the number of iterations required before the convergence is generally greater than for the
+synchronous case, AIAC algorithms can significantly reduce overall execution times by suppressing idle times due to
+synchronizations especially in a grid computing context (see \cite{bcvc06:ij} for more details).
+
+Parallel numerical applications (synchronous or asynchronous) may have different configuration and deployment
+requirements.  Quantifying their resource allocation policies and application scheduling algorithms in
+grid computing environments under varying load, CPU power and network speeds is very costly, very labor intensive and very time
+consuming \cite{BuRaCa}. The case of AIAC algorithms is even more problematic since they are very sensible to the
+execution environment context. For instance, variations in the network bandwith (intra and inter- clusters), in the
+number and the power of nodes, in the number of clusters... can lead to very different number of iterations and so to
+very different execution times. Then, it appears that the use of simulation tools to explore various platform
+scenarios and to run large numbers of experiments quickly can be very promising. In this way, the use of a simulation
+environment to execute parallel  iterative algorithms found some interests in reducing the highly cost of  access to
+computing resources: (1) for the applications development life  cycle and in code debugging (2) and in production to get
+results in a reasonable execution time with a simulated infrastructure not accessible  with physical resources. Indeed,
+the launch of distributed iterative  asynchronous algorithms to solve a given problem on a large-scale  simulated
+environment challenges to find optimal configurations giving the best results with a lowest residual error and in the
+best of execution time. 
+
+To our knowledge, there is no existing work on the large-scale simulation of a real AIAC application. The aim of this
+paper is twofold. First we give a first approach of the simulation of AIAC algorithms using a simulation tool (i.e. the
+SimGrid toolkit \cite{SimGrid}). Second, we confirm the effectiveness of asynchronous mode algorithms by comparing their
+performance with the synchronous mode. More precisely, we had implemented a program for solving large non-symmetric
+linear system of equations by numerical method GMRES (Generalized Minimal Residual) []. We show, that with minor
+modifications of the initial MPI code, the SimGrid toolkit allows us to perform a test campain of a real AIAC
+application on different computing architectures. The simulated results we obtained are in line with real results
+exposed in ??. SimGrid had allowed us to launch the application from a modest computing infrastructure by simulating
+different distributed architectures composed by clusters nodes interconnected by variable speed networks. It has been
+permitted to show  With selected parameters on the network platforms (bandwidth, latency of inter  cluster network) and
+on the clusters architecture (number, capacity calculation power) in the simulated environment, the experimental results
+have demonstrated not only the algorithm convergence within a reasonable time compared with the physical environment
+performance, but also a time saving of up to \np[\%]{40} in asynchronous mode.
  
  This article is structured as follows: after this introduction, the next  section will give a brief description of
-iterative asynchronous model.  Then, the simulation framework SimGrid will be presented with the settings to create
-various distributed architectures. The algorithm of  the multi-splitting method used by GMRES written with MPI
-primitives and its adaptation to SimGrid with SMPI (Simulated MPI) will be in the next section. At last, the experiments
-results carried out will be presented before the conclusion which we will announce the opening of  our future work after
-the results.
+iterative asynchronous model.  Then, the simulation framework SimGrid is presented with the settings to create various
+distributed architectures. The algorithm of  the multi-splitting method used by GMRES written with MPI primitives and
+its adaptation to SimGrid with SMPI (Simulated MPI) is detailed in the next section. At last, the experiments results
+carried out will be presented before some concluding remarks and future works.
   
  \section{Motivations and scientific context}
  
@@ -167,7 +176,7 @@ convergence is generally greater than for the two former classes. But, and as de
  algorithms can significantly reduce overall execution times by suppressing idle times due to synchronizations especially
  in a grid computing context.
  
-\begin{figure}[htbp]
+\begin{figure}[!t]
    \centering
      \includegraphics[width=8cm]{AIAC.pdf}
    \caption{The Asynchronous Iterations - Asynchronous Communications model } 
@@ -196,16 +205,18 @@ iterations and so to very different execution times.
  SimGrid~\cite{casanova+legrand+quinson.2008.simgrid,SimGrid} is a simulation
  framework to sudy the behavior of large-scale distributed systems.  As its name
  says, it emanates from the grid computing community, but is nowadays used to
-study grids, clouds, HPC or peer-to-peer systems.
-%- open source, developped since 1999, one of the major solution in the field
-%
+study grids, clouds, HPC or peer-to-peer systems.  The early versions of SimGrid
+date from 1999, but it's still actively developped and distributed as an open
+source software.  Today, it's one of the major generic tools in the field of
+simulation for large-scale distributed systems.
+
  SimGrid provides several programming interfaces: MSG to simulate Concurrent
  Sequential Processes, SimDAG to simulate DAGs of (parallel) tasks, and SMPI to
  run real applications written in MPI~\cite{MPI}.  Apart from the native C
  interface, SimGrid provides bindings for the C++, Java, Lua and Ruby programming
  languages.  The SMPI interface supports applications written in C or Fortran,
-with little or no modifications.
-%- implements most of MPI-2 \cite{ref} standard [CHECK]
+with little or no modifications.  SMPI implements about \np[\%]{80} of the MPI
+2.0 standard~\cite{bedaride:hal-00919507}.
  
  %%% explain simulation
  %- simulated processes folded in one real process
@@ -219,8 +230,6 @@ with little or no modifications.
  
  %%% validation + refs
  
-\AG{Décrire SimGrid~\cite{casanova+legrand+quinson.2008.simgrid,SimGrid} (Arnaud)}
-
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  \section{Simulation of the multisplitting method}
  %Décrire le problème (algo) traité ainsi que le processus d'adaptation à SimGrid.
@@ -257,7 +266,7 @@ Y_l = B_l - \displaystyle\sum_{\substack{m=1\\ m\neq l}}^{L}A_{lm}X_m
  \end{equation}
  is solved independently by a cluster and communications are required to update the right-hand side sub-vector $Y_l$, such that the sub-vectors $X_m$ represent the data dependencies between the clusters. As each sub-system (\ref{eq:4.1}) is solved in parallel by a cluster of processors, our multisplitting method uses an iterative method as an inner solver which is easier to parallelize and more scalable than a direct method. In this work, we use the parallel algorithm of GMRES method~\cite{ref1} which is one of the most used iterative method by many researchers. 
  
-\begin{figure}
+\begin{figure}[!t]
    %%% IEEE instructions forbid to use an algorithm environment here, use figure
    %%% instead
  \begin{algorithmic}[1]
@@ -298,7 +307,7 @@ clusters (lines $6$ and $7$ in Figure~\ref{algo:01}). The shared vector
  elements of the solution $x$ are exchanged by message passing using MPI
  non-blocking communication routines.
  
-\begin{figure}
+\begin{figure}[!t]
  \centering
    \includegraphics[width=60mm,keepaspectratio]{clustering}
  \caption{Example of three clusters of processors interconnected by a virtual unidirectional ring network.}
@@ -321,43 +330,150 @@ where $\MI$ is the maximum number of outer iterations and $\epsilon$ is the tole
  
  \section{Experimental results}
  
-When the \emph{real} application runs in the simulation environment and produces
-the expected results, varying the input parameters and the program arguments
-allows us to compare outputs from the code execution. We have noticed from this
-study that the results depend on the following parameters: (1) at the network
-level, we found that the most critical values are the bandwidth (bw) and the
-network latency (lat). (2) Hosts power (GFlops) can also influence on the
-results. And finally, (3) when submitting job batches for execution, the
-arguments values passed to the program like the maximum number of iterations or
-the \emph{external} precision are critical to ensure not only the convergence of the
-algorithm but also to get the main objective of the experimentation of the
-simulation in having an execution time in asynchronous less than in synchronous
-mode, in others words, in having a \emph{speedup} less than 1
-({speedup}${}={}${execution time in synchronous mode}${}/{}${execution time in
-asynchronous mode}).
-
-A priori, obtaining a speedup less than 1 would be difficult in a local area
+When the \emph{real} application runs in the simulation environment and produces the expected results, varying the input
+parameters and the program arguments allows us to compare outputs from the code execution. We have noticed from this
+study that the results depend on the following parameters:  
+\begin{itemize} 
+\item At the network level, we found that
+the most critical values are the bandwidth (bw) and the network latency (lat).  
+\item Hosts power (GFlops) can also
+influence on the results.  
+\item Finally, when submitting job batches for execution, the arguments values passed to the
+program like the maximum number of iterations or the \emph{external} precision are critical. They allow to ensure not
+only the convergence of the algorithm but also to get the main objective of the experimentation of the simulation in
+having an execution time in asynchronous less than in synchronous mode (i.e. speed-up less than $1$).
+\end{itemize}
+
+A priori, obtaining a speedup less than $1$ would be difficult in a local area
  network configuration where the synchronous mode will take advantage on the rapid
  exchange of information on such high-speed links. Thus, the methodology adopted
  was to launch the application on clustered network. In this last configuration,
  degrading the inter-cluster network performance will \emph{penalize} the synchronous
-mode allowing to get a speedup lower than 1. This action simulates the case of
+mode allowing to get a speedup lower than $1$. This action simulates the case of
  clusters linked with long distance network like Internet.
  
  As a first step, the algorithm was run on a network consisting of two clusters
-containing fifty hosts each, totaling one hundred hosts. Various combinations of
-the above factors have providing the results shown in Table~\ref{tab.cluster.2x50} with a matrix size
-ranging from $N_x = N_y = N_z = 62 \text{ to } 171$ elements or from $62^{3} = \np{238328}$ to
-$171^{3} = \np{5211000}$ entries.
+containing $50$ hosts each, totaling $100$ hosts. Various combinations of
+the above factors have providing the results shown in
+Table~\ref{tab.cluster.2x50} with a matrix size ranging from $N_x = N_y = N_z =
+62 \text{ to } 171$ elements or from $62^{3} = \np{238328}$ to $171^{3} =
+\np{5211000}$ entries.
  
+\begin{table}[!t]
+  \centering
+  \caption{$2$ clusters, each with $50$ nodes}
+  \label{tab.cluster.2x50}
+  \renewcommand{\arraystretch}{1.3}
+
+  \begin{tabular}{|>{\bfseries}r|*{12}{c|}}
+    \hline
+    bw
+    & 5         & 5         & 5         & 5         & 5         & 50 \\
+    \hline
+    lat
+    & 0.02      & 0.02      & 0.02      & 0.02      & 0.02      & 0.02 \\
+    \hline
+    power
+    & 1         & 1         & 1         & 1.5       & 1.5       & 1.5 \\
+    \hline
+    size
+    & 62        & 62        & 62        & 100       & 100       & 110 \\
+    \hline
+    Prec/Eprec
+    & \np{E-5}  & \np{E-8}  & \np{E-9}  & \np{E-11} & \np{E-11} & \np{E-11} \\
+    \hline
+    speedup
+    & 0.396     & 0.392     & 0.396     & 0.391     & 0.393     & 0.395 \\
+    \hline
+  \end{tabular}
+
+  \smallskip
+
+  \begin{tabular}{|>{\bfseries}r|*{12}{c|}}
+    \hline
+    bw
+    & 50        & 50        & 50        & 50        & 10        & 10 \\
+    \hline
+    lat
+    & 0.02      & 0.02      & 0.02      & 0.02      & 0.03      & 0.01 \\
+    \hline
+    power
+    & 1.5       & 1.5       & 1.5       & 1.5       & 1         & 1.5 \\
+    \hline
+    size
+    & 120       & 130       & 140       & 150       & 171       & 171 \\
+    \hline
+    Prec/Eprec
+    & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-5}  & \np{E-5} \\
+    \hline
+    speedup
+    & 0.398     & 0.388     & 0.393     & 0.394     & 0.63      & 0.778 \\
+    \hline
+  \end{tabular}
+\end{table}
+  
  Then we have changed the network configuration using three clusters containing
-respectively 33, 33 and 34 hosts, or again by on hundred hosts for all the
+respectively $33$, $33$ and $34$ hosts, or again by on hundred hosts for all the
  clusters. In the same way as above, a judicious choice of key parameters has
-permitted to get the results in Table~\ref{tab.cluster.3x33} which shows the speedups less than 1 with
-a matrix size from 62 to 100 elements.
+permitted to get the results in Table~\ref{tab.cluster.3x33} which shows the
+speedups less than $1$ with a matrix size from $62$ to $100$ elements.
+
+\begin{table}[!t]
+  \centering
+  \caption{$3$ clusters, each with $33$ nodes}
+  \label{tab.cluster.3x33}
+  \renewcommand{\arraystretch}{1.3}
+
+  \begin{tabular}{|>{\bfseries}r|*{6}{c|}}
+    \hline
+    bw
+    & 10       & 5        & 4        & 3        & 2        & 6 \\
+    \hline
+    lat
+    & 0.01     & 0.02     & 0.02     & 0.02     & 0.02     & 0.02 \\
+    \hline
+    power
+    & 1        & 1        & 1        & 1        & 1        & 1 \\
+    \hline
+    size
+    & 62       & 100      & 100      & 100      & 100      & 171 \\
+    \hline
+    Prec/Eprec
+    & \np{E-5} & \np{E-5} & \np{E-5} & \np{E-5} & \np{E-5} & \np{E-5} \\
+    \hline
+    speedup
+    & 0.997    & 0.99     & 0.93     & 0.84     & 0.78     & 0.99 \\
+    \hline
+  \end{tabular}
+\end{table}
+
  
  In a final step, results of an execution attempt to scale up the three clustered
-configuration but increasing by two hundreds hosts has been recorded in Table~\ref{tab.cluster.3x67}.
+configuration but increasing by two hundreds hosts has been recorded in
+Table~\ref{tab.cluster.3x67}.
+
+\begin{table}[!t]
+  \centering
+  \caption{3 clusters, each with 66 nodes}
+  \label{tab.cluster.3x67}
+  \renewcommand{\arraystretch}{1.3}
+
+  \begin{tabular}{|>{\bfseries}r|c|}
+    \hline
+    bw         & 1 \\
+    \hline
+    lat        & 0.02 \\
+    \hline
+    power      & 1 \\
+    \hline
+    size       & 62 \\
+    \hline
+    Prec/Eprec & \np{E-5} \\
+    \hline
+    speedup    & 0.9 \\
+    \hline
+ \end{tabular}
+\end{table}
  
  Note that the program was run with the following parameters:
  
@@ -382,66 +498,41 @@ lat latency, \dots{}).
         \item Execution Mode: synchronous or asynchronous.
  \end{itemize}
  
-\begin{table}
-  \centering
-  \caption{2 clusters X 50 nodes}
-  \label{tab.cluster.2x50}
-  \AG{Ces tableaux (\ref{tab.cluster.2x50}, \ref{tab.cluster.3x33} et
-    \ref{tab.cluster.3x67}) sont affreux. Utiliser un format vectoriel (eps ou
-    pdf) ou, mieux, les réécrire en \LaTeX{}. Réécrire les légendes proprement
-    également (\texttt{\textbackslash{}times} au lieu de \texttt{X} par ex.)}
-  \includegraphics[width=209pt]{img1.jpg}
-\end{table}
-
-\begin{table}
-  \centering
-  \caption{3 clusters X 33 nodes}
-  \label{tab.cluster.3x33}
-  \AG{Refaire le tableau.}
-  \includegraphics[width=209pt]{img2.jpg}
-\end{table}
-
-\begin{table}
-  \centering
-  \caption{3 clusters X 67 nodes}
-  \label{tab.cluster.3x67}
-  \AG{Refaire le tableau.}
-%  \includegraphics[width=160pt]{img3.jpg}
-  \includegraphics[scale=0.5]{img3.jpg}
-\end{table}
-
  \paragraph*{Interpretations and comments}
  
  After analyzing the outputs, generally, for the configuration with two or three
-clusters including one hundred hosts (Tables~\ref{tab.cluster.2x50} and~\ref{tab.cluster.3x33}), some combinations of the
-used parameters affecting the results have given a speedup less than 1, showing
-the effectiveness of the asynchronous performance compared to the synchronous
-mode.
-
-In the case of a two clusters configuration, Table~\ref{tab.cluster.2x50} shows that with a
-deterioration of inter cluster network set with \np[Mbits/s]{5} of bandwidth, a latency
-in order of a hundredth of a millisecond and a system power of one GFlops, an
-efficiency of about \np[\%]{40} in asynchronous mode is obtained for a matrix size of 62
-elements. It is noticed that the result remains stable even if we vary the
-external precision from \np{E-5} to \np{E-9}. By increasing the problem size up to 100
-elements, it was necessary to increase the CPU power of \np[\%]{50} to \np[GFlops]{1.5} for a
-convergence of the algorithm with the same order of asynchronous mode efficiency.
-Maintaining such a system power but this time, increasing network throughput
-inter cluster up to \np[Mbits/s]{50}, the result of efficiency of about \np[\%]{40} is
-obtained with high external  precision of \np{E-11} for a matrix size from 110 to 150
-side elements.
-
-For the 3 clusters architecture including a total of 100 hosts, Table~\ref{tab.cluster.3x33} shows
-that it was difficult to have a combination which gives an efficiency of
-asynchronous below \np[\%]{80}. Indeed, for a matrix size of 62 elements, equality
-between the performance of the two modes (synchronous and asynchronous) is
-achieved with an inter cluster of \np[Mbits/s]{10} and a latency of \np[ms]{E-1}. To
-challenge an efficiency by \np[\%]{78} with a matrix size of 100 points, it was
-necessary to degrade the inter cluster network bandwidth from 5 to 2 Mbit/s.
+clusters including one hundred hosts (Tables~\ref{tab.cluster.2x50}
+and~\ref{tab.cluster.3x33}), some combinations of the used parameters affecting
+the results have given a speedup less than 1, showing the effectiveness of the
+asynchronous performance compared to the synchronous mode.
+
+In the case of a two clusters configuration, Table~\ref{tab.cluster.2x50} shows
+that with a deterioration of inter cluster network set with \np[Mbits/s]{5} of
+bandwidth, a latency in order of a hundredth of a millisecond and a system power
+of one GFlops, an efficiency of about \np[\%]{40} in asynchronous mode is
+obtained for a matrix size of 62 elements. It is noticed that the result remains
+stable even if we vary the external precision from \np{E-5} to \np{E-9}. By
+increasing the problem size up to $100$ elements, it was necessary to increase the
+CPU power of \np[\%]{50} to \np[GFlops]{1.5} for a convergence of the algorithm
+with the same order of asynchronous mode efficiency.  Maintaining such a system
+power but this time, increasing network throughput inter cluster up to
+\np[Mbits/s]{50}, the result of efficiency of about \np[\%]{40} is obtained with
+high external precision of \np{E-11} for a matrix size from $110$ to $150$ side
+elements.
+
+For the $3$ clusters architecture including a total of 100 hosts,
+Table~\ref{tab.cluster.3x33} shows that it was difficult to have a combination
+which gives an efficiency of asynchronous below \np[\%]{80}. Indeed, for a
+matrix size of $62$ elements, equality between the performance of the two modes
+(synchronous and asynchronous) is achieved with an inter cluster of
+\np[Mbits/s]{10} and a latency of \np[ms]{E-1}. To challenge an efficiency by
+\np[\%]{78} with a matrix size of $100$ points, it was necessary to degrade the
+inter cluster network bandwidth from 5 to 2 Mbit/s.
  
  A last attempt was made for a configuration of three clusters but more powerful
-with 200 nodes in total. The convergence with a speedup of \np[\%]{90} was obtained
-with a bandwidth of \np[Mbits/s]{1} as shown in Table~\ref{tab.cluster.3x67}.
+with 200 nodes in total. The convergence with a speedup of \np[\%]{90} was
+obtained with a bandwidth of \np[Mbits/s]{1} as shown in
+Table~\ref{tab.cluster.3x67}.
  
  \section{Conclusion}
  The experimental results on executing a parallel iterative algorithm in