X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/hpcc2014.git/blobdiff_plain/5f0dd22ada037b4a0657d93645f08e99403867e2..64cf966afb9c408d8ab199f12a27f45e1c5ed82a:/hpcc.tex

diff --git a/hpcc.tex b/hpcc.tex
index b9b1753..d4dc19e 100644
--- a/hpcc.tex
+++ b/hpcc.tex
@@ -40,11 +40,6 @@
 
 \newcommand{\MI}{\mathit{MaxIter}}
 
-\usepackage{array}
-\usepackage{color, colortbl}
-\newcolumntype{M}[1]{>{\centering\arraybackslash}m{#1}}
-\newcolumntype{Z}[1]{>{\raggedleft}m{#1}}
-
 \begin{document}
 
 \title{Simulation of Asynchronous Iterative Numerical Algorithms Using SimGrid}
@@ -115,10 +110,10 @@ suggests, these algorithm solves a given problem by successive iterations ($X_{n
 $X_{0}$ to find an approximate value $X^*$ of the solution with a very low residual error. Several well-known methods
 demonstrate the convergence of these algorithms \cite{}. 
 
-Parallelization of such algorithms generally involved the division of the problem into several \emph{pieces} that will
+Parallelization of such algorithms generally involved the division of the problem into several \emph{blocks} that will
 be solved in parallel on multiple processing units. The latter will communicate each intermediate results before a new
-iteration starts  until the approximate solution is reached. These parallel  computations can be performed either in
-\emph{synchronous} communication mode where a new iteration begin only when all nodes communications are completed,
+iteration starts and until the approximate solution is reached. These parallel  computations can be performed either in
+\emph{synchronous} mode where a new iteration begin only when all nodes communications are completed,
 either \emph{asynchronous} mode where processors can continue independently without or few synchronization points. For
 instance in the \textit{Asynchronous Iterations - Asynchronous   Communications (AIAC)} model \cite{bcvc06:ij}, local
 computations do not need to wait for required data. Processors can then perform their iterations with the data present
@@ -127,13 +122,13 @@ synchronous case, AIAC algorithms can significantly reduce overall execution tim
 synchronizations especially in a grid computing context (see \cite{bcvc06:ij} for more details).
 
 Parallel numerical applications (synchronous or asynchronous) may have different configuration and deployment
-requirements.  Quantifying their performance of resource allocation policies and application scheduling algorithms in
-grid computing environments under varying load, CPU power and network speeds is very costly, labor intensive and time
+requirements.  Quantifying their resource allocation policies and application scheduling algorithms in
+grid computing environments under varying load, CPU power and network speeds is very costly, very labor intensive and very time
 consuming \cite{BuRaCa}. The case of AIAC algorithms is even more problematic since they are very sensible to the
 execution environment context. For instance, variations in the network bandwith (intra and inter- clusters), in the
 number and the power of nodes, in the number of clusters... can lead to very different number of iterations and so to
-very different execution times. In this context, it appears that the use of simulation tools to explore various platform
-scenarios and to run enormous numbers of experiments quickly can be very promising. In this way, the use of a simulation
+very different execution times. Then, it appears that the use of simulation tools to explore various platform
+scenarios and to run large numbers of experiments quickly can be very promising. In this way, the use of a simulation
 environment to execute parallel  iterative algorithms found some interests in reducing the highly cost of  access to
 computing resources: (1) for the applications development life  cycle and in code debugging (2) and in production to get
 results in a reasonable execution time with a simulated infrastructure not accessible  with physical resources. Indeed,
@@ -142,15 +137,17 @@ environment challenges to find optimal configurations giving the best results wi
 best of execution time. 
 
 To our knowledge, there is no existing work on the large-scale simulation of a real AIAC application. The aim of this
-paper is to give a first approach of the simulation of AIAC algorithms using the SimGrid toolkit \cite{SimGrid}. We had
-in the scope of this work implemented a  program for solving large non-symmetric linear system of equations by numerical
-method GMRES (Generalized Minimal Residual). SimGrid had allowed us to launch the application from a modest computing
-infrastructure by simulating  different distributed architectures composed by clusters nodes interconnected by variable
-speed networks. The simulated results we obtained are in line with real results exposed in ?? In addition, it has been
-permitted to show the effectiveness of asynchronous mode algorithm by comparing its performance with the synchronous
-mode time. With selected parameters on the network platforms (bandwidth, latency of inter  cluster  network) and on the
-clusters architecture (number, capacity calculation power) in the simulated environment, the experimental results have
-demonstrated not only the algorithm convergence within a reasonable time compared with the physical environment
+paper is twofold. First we give a first approach of the simulation of AIAC algorithms using a simulation tool (i.e. the
+SimGrid toolkit \cite{SimGrid}). Second, we confirm the effectiveness of asynchronous mode algorithms by comparing their
+performance with the synchronous mode. More precisely, we had implemented a program for solving large non-symmetric
+linear system of equations by numerical method GMRES (Generalized Minimal Residual) []. We show, that with minor
+modifications of the initial MPI code, the SimGrid toolkit allows us to perform a test campain of a real AIAC
+application on different computing architectures. The simulated results we obtained are in line with real results
+exposed in ??. SimGrid had allowed us to launch the application from a modest computing infrastructure by simulating
+different distributed architectures composed by clusters nodes interconnected by variable speed networks. It has been
+permitted to show  With selected parameters on the network platforms (bandwidth, latency of inter  cluster network) and
+on the clusters architecture (number, capacity calculation power) in the simulated environment, the experimental results
+have demonstrated not only the algorithm convergence within a reasonable time compared with the physical environment
 performance, but also a time saving of up to \np[\%]{40} in asynchronous mode.
 
 This article is structured as follows: after this introduction, the next  section will give a brief description of
@@ -179,7 +176,7 @@ convergence is generally greater than for the two former classes. But, and as de
 algorithms can significantly reduce overall execution times by suppressing idle times due to synchronizations especially
 in a grid computing context.
 
-\begin{figure}[htbp]
+\begin{figure}[!t]
   \centering
     \includegraphics[width=8cm]{AIAC.pdf}
   \caption{The Asynchronous Iterations - Asynchronous Communications model } 
@@ -208,16 +205,18 @@ iterations and so to very different execution times.
 SimGrid~\cite{casanova+legrand+quinson.2008.simgrid,SimGrid} is a simulation
 framework to sudy the behavior of large-scale distributed systems.  As its name
 says, it emanates from the grid computing community, but is nowadays used to
-study grids, clouds, HPC or peer-to-peer systems.
-%- open source, developped since 1999, one of the major solution in the field
-%
+study grids, clouds, HPC or peer-to-peer systems.  The early versions of SimGrid
+date from 1999, but it's still actively developped and distributed as an open
+source software.  Today, it's one of the major generic tools in the field of
+simulation for large-scale distributed systems.
+
 SimGrid provides several programming interfaces: MSG to simulate Concurrent
 Sequential Processes, SimDAG to simulate DAGs of (parallel) tasks, and SMPI to
 run real applications written in MPI~\cite{MPI}.  Apart from the native C
 interface, SimGrid provides bindings for the C++, Java, Lua and Ruby programming
 languages.  The SMPI interface supports applications written in C or Fortran,
-with little or no modifications.
-%- implements most of MPI-2 \cite{ref} standard [CHECK]
+with little or no modifications.  SMPI implements about \np[\%]{80} of the MPI
+2.0 standard~\cite{bedaride:hal-00919507}.
 
 %%% explain simulation
 %- simulated processes folded in one real process
@@ -231,8 +230,6 @@ with little or no modifications.
 
 %%% validation + refs
 
-\AG{DÃ©crire SimGrid~\cite{casanova+legrand+quinson.2008.simgrid,SimGrid} (Arnaud)}
-
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \section{Simulation of the multisplitting method}
 %DÃ©crire le problÃ¨me (algo) traitÃ© ainsi que le processus d'adaptation Ã  SimGrid.
@@ -269,7 +266,7 @@ Y_l = B_l - \displaystyle\sum_{\substack{m=1\\ m\neq l}}^{L}A_{lm}X_m
 \end{equation}
 is solved independently by a cluster and communications are required to update the right-hand side sub-vector $Y_l$, such that the sub-vectors $X_m$ represent the data dependencies between the clusters. As each sub-system (\ref{eq:4.1}) is solved in parallel by a cluster of processors, our multisplitting method uses an iterative method as an inner solver which is easier to parallelize and more scalable than a direct method. In this work, we use the parallel algorithm of GMRES method~\cite{ref1} which is one of the most used iterative method by many researchers. 
 
-\begin{figure}
+\begin{figure}[!t]
   %%% IEEE instructions forbid to use an algorithm environment here, use figure
   %%% instead
 \begin{algorithmic}[1]
@@ -310,7 +307,7 @@ clusters (lines $6$ and $7$ in Figure~\ref{algo:01}). The shared vector
 elements of the solution $x$ are exchanged by message passing using MPI
 non-blocking communication routines.
 
-\begin{figure}
+\begin{figure}[!t]
 \centering
   \includegraphics[width=60mm,keepaspectratio]{clustering}
 \caption{Example of three clusters of processors interconnected by a virtual unidirectional ring network.}
@@ -333,104 +330,150 @@ where $\MI$ is the maximum number of outer iterations and $\epsilon$ is the tole
 
 \section{Experimental results}
 
-When the \emph{real} application runs in the simulation environment and produces
-the expected results, varying the input parameters and the program arguments
-allows us to compare outputs from the code execution. We have noticed from this
-study that the results depend on the following parameters: (1) at the network
-level, we found that the most critical values are the bandwidth (bw) and the
-network latency (lat). (2) Hosts power (GFlops) can also influence on the
-results. And finally, (3) when submitting job batches for execution, the
-arguments values passed to the program like the maximum number of iterations or
-the \emph{external} precision are critical to ensure not only the convergence of the
-algorithm but also to get the main objective of the experimentation of the
-simulation in having an execution time in asynchronous less than in synchronous
-mode, in others words, in having a \emph{speedup} less than 1
-({speedup}${}={}${execution time in synchronous mode}${}/{}${execution time in
-asynchronous mode}).
-
-A priori, obtaining a speedup less than 1 would be difficult in a local area
+When the \emph{real} application runs in the simulation environment and produces the expected results, varying the input
+parameters and the program arguments allows us to compare outputs from the code execution. We have noticed from this
+study that the results depend on the following parameters:  
+\begin{itemize} 
+\item At the network level, we found that
+the most critical values are the bandwidth (bw) and the network latency (lat).  
+\item Hosts power (GFlops) can also
+influence on the results.  
+\item Finally, when submitting job batches for execution, the arguments values passed to the
+program like the maximum number of iterations or the \emph{external} precision are critical. They allow to ensure not
+only the convergence of the algorithm but also to get the main objective of the experimentation of the simulation in
+having an execution time in asynchronous less than in synchronous mode (i.e. speed-up less than $1$).
+\end{itemize}
+
+A priori, obtaining a speedup less than $1$ would be difficult in a local area
 network configuration where the synchronous mode will take advantage on the rapid
 exchange of information on such high-speed links. Thus, the methodology adopted
 was to launch the application on clustered network. In this last configuration,
 degrading the inter-cluster network performance will \emph{penalize} the synchronous
-mode allowing to get a speedup lower than 1. This action simulates the case of
+mode allowing to get a speedup lower than $1$. This action simulates the case of
 clusters linked with long distance network like Internet.
 
 As a first step, the algorithm was run on a network consisting of two clusters
-containing fifty hosts each, totaling one hundred hosts. Various combinations of
-the above factors have providing the results shown in Table I with a matrix size
-ranging from $N_x = N_y = N_z = 62 \text{ to } 171$ elements or from $62^{3} = \np{238328}$ to
-$171^{3} = \np{5211000}$ entries.
-
-\begin{table}[h!]
- \centering
+containing $50$ hosts each, totaling $100$ hosts. Various combinations of
+the above factors have providing the results shown in
+Table~\ref{tab.cluster.2x50} with a matrix size ranging from $N_x = N_y = N_z =
+62 \text{ to } 171$ elements or from $62^{3} = \np{238328}$ to $171^{3} =
+\np{5211000}$ entries.
 
- \tiny
- 
-\begin{tabular}{|Z{0.55cm}|Z{0.25cm}|Z{0.25cm}|M{0.25cm}|Z{0.25cm}|M{0.25cm}|M{0.25cm}|M{0.25cm}|M{0.25cm}|M{0.25cm}|M{0.25cm}|M{0.25cm}|M{0.25cm}|M{0.25cm}|} 
- \hline 
- \bf bw & 5 &5 & 5 & 5 & 5 & 50 & 50 & 50 & 50 & 50 & 10 & 10\\ 
- \hline
- \bf lat & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02 & 0.03 & 0.01\\ 
- \hline 
- \bf power & 1 & 1 & 1 & 1.5 & 1.5 & 1.5 & 1.5 & 1.5 & 1.5 & 1.5 & 1 & 1.5\\ \hline    \bf size & 62 & 62 & 62 & 100 & 100 & 110 & 120& 130 & 140 & 150 & 171 & 171\\ \hline
- \bf Prec/Eprec & 10$^{-5}$ & 10$^{-8}$ & 10$^{-9}$ & 10$^{-11}$ & 10$^{-11}$ & 10$^{-11}$ & 10$^{-11}$ & 10$^{-11}$ & 10$^{-11}$ & 10$^{-11}$ & 10$^{-5}$ & 10$^{-5}$\\ \hline 
- \bf speedup & 0.396 & 0.392 & 0.396 & 0.391 & 0.393 & 0.395 & 0.398 & 0.388 & 0.393 & 0.394 & 0.63 & 0.778\\ \hline 
- \end{tabular}
- \smallskip 
- \caption{2 Clusters x 50 nodes each} \label{tab1} 
-\end{table} 
+\begin{table}[!t]
+  \centering
+  \caption{$2$ clusters, each with $50$ nodes}
+  \label{tab.cluster.2x50}
+  \renewcommand{\arraystretch}{1.3}
+
+  \begin{tabular}{|>{\bfseries}r|*{12}{c|}}
+    \hline
+    bw
+    & 5         & 5         & 5         & 5         & 5         & 50 \\
+    \hline
+    lat
+    & 0.02      & 0.02      & 0.02      & 0.02      & 0.02      & 0.02 \\
+    \hline
+    power
+    & 1         & 1         & 1         & 1.5       & 1.5       & 1.5 \\
+    \hline
+    size
+    & 62        & 62        & 62        & 100       & 100       & 110 \\
+    \hline
+    Prec/Eprec
+    & \np{E-5}  & \np{E-8}  & \np{E-9}  & \np{E-11} & \np{E-11} & \np{E-11} \\
+    \hline
+    speedup
+    & 0.396     & 0.392     & 0.396     & 0.391     & 0.393     & 0.395 \\
+    \hline
+  \end{tabular}
+
+  \smallskip
+
+  \begin{tabular}{|>{\bfseries}r|*{12}{c|}}
+    \hline
+    bw
+    & 50        & 50        & 50        & 50        & 10        & 10 \\
+    \hline
+    lat
+    & 0.02      & 0.02      & 0.02      & 0.02      & 0.03      & 0.01 \\
+    \hline
+    power
+    & 1.5       & 1.5       & 1.5       & 1.5       & 1         & 1.5 \\
+    \hline
+    size
+    & 120       & 130       & 140       & 150       & 171       & 171 \\
+    \hline
+    Prec/Eprec
+    & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-11} & \np{E-5}  & \np{E-5} \\
+    \hline
+    speedup
+    & 0.398     & 0.388     & 0.393     & 0.394     & 0.63      & 0.778 \\
+    \hline
+  \end{tabular}
+\end{table}
   
 Then we have changed the network configuration using three clusters containing
-respectively 33, 33 and 34 hosts, or again by on hundred hosts for all the
+respectively $33$, $33$ and $34$ hosts, or again by on hundred hosts for all the
 clusters. In the same way as above, a judicious choice of key parameters has
-permitted to get the results in Table II which shows the speedups less than 1 with
-a matrix size from 62 to 100 elements.
-
-\begin{table}[h!]
- \centering
+permitted to get the results in Table~\ref{tab.cluster.3x33} which shows the
+speedups less than $1$ with a matrix size from $62$ to $100$ elements.
 
- \tiny
- 
-\begin{tabular}{|Z{0.55cm}|Z{0.25cm}|Z{0.25cm}|M{0.25cm}|Z{0.25cm}|M{0.25cm}|M{0.25cm}|} 
- \hline 
- \bf bw & 10 &5 & 4 & 3 & 2 & 6\\ \hline
- \bf lat & 0.01 & 0.02 & 0.02 & 0.02 & 0.02 & 0.02\\ 
- \hline 
- \bf power & 1 & 1 & 1 & 1 & 1 & 1\\ \hline    
- \bf size & 62 & 100 & 100 & 100 & 100 & 171\\ \hline
- \bf Prec/Eprec & 10$^{-5}$ & 10$^{-5}$ & 10$^{-5}$ & 10$^{-5}$ & 10$^{-5}$ & 10$^{-5}$\\ \hline 
- \bf speedup & 0.997 & 0.99 & 0.93 & 0.84 & 0.78 & 0.99\\ 
- \hline 
- \end{tabular}
- \smallskip 
- \caption{3 Clusters x 33 nodes each} \label{tab2} 
-\end{table} 
+\begin{table}[!t]
+  \centering
+  \caption{$3$ clusters, each with $33$ nodes}
+  \label{tab.cluster.3x33}
+  \renewcommand{\arraystretch}{1.3}
+
+  \begin{tabular}{|>{\bfseries}r|*{6}{c|}}
+    \hline
+    bw
+    & 10       & 5        & 4        & 3        & 2        & 6 \\
+    \hline
+    lat
+    & 0.01     & 0.02     & 0.02     & 0.02     & 0.02     & 0.02 \\
+    \hline
+    power
+    & 1        & 1        & 1        & 1        & 1        & 1 \\
+    \hline
+    size
+    & 62       & 100      & 100      & 100      & 100      & 171 \\
+    \hline
+    Prec/Eprec
+    & \np{E-5} & \np{E-5} & \np{E-5} & \np{E-5} & \np{E-5} & \np{E-5} \\
+    \hline
+    speedup
+    & 0.997    & 0.99     & 0.93     & 0.84     & 0.78     & 0.99 \\
+    \hline
+  \end{tabular}
+\end{table}
 
 
 In a final step, results of an execution attempt to scale up the three clustered
-configuration but increasing by two hundreds hosts has been recorded in Table III.
-
-\begin{table}[h!]
- \centering
- \tiny
-\begin{tabular}{|M{0.55cm}|M{0.25cm}|} 
- \hline 
- \bf bw & 1\\ \hline
- \bf lat & 0.02\\ 
- \hline 
- \bf power & 1\\ 
- \hline    
- \bf size & 62\\ 
- \hline
- \bf Prec/Eprec & 10$^{-5}$\\ 
- \hline 
- \bf speedup & 0.9\\ 
- \hline 
+configuration but increasing by two hundreds hosts has been recorded in
+Table~\ref{tab.cluster.3x67}.
+
+\begin{table}[!t]
+  \centering
+  \caption{3 clusters, each with 66 nodes}
+  \label{tab.cluster.3x67}
+  \renewcommand{\arraystretch}{1.3}
+
+  \begin{tabular}{|>{\bfseries}r|c|}
+    \hline
+    bw         & 1 \\
+    \hline
+    lat        & 0.02 \\
+    \hline
+    power      & 1 \\
+    \hline
+    size       & 62 \\
+    \hline
+    Prec/Eprec & \np{E-5} \\
+    \hline
+    speedup    & 0.9 \\
+    \hline
  \end{tabular}
- \smallskip 
- \caption{3 Clusters x 66 nodes each} \label{tab3} 
-\end{table} 
+\end{table}
 
 Note that the program was run with the following parameters:
 
@@ -455,39 +498,41 @@ lat latency, \dots{}).
 	\item Execution Mode: synchronous or asynchronous.
 \end{itemize}
 
-
 \paragraph*{Interpretations and comments}
 
 After analyzing the outputs, generally, for the configuration with two or three
-clusters including one hundred hosts (Table I and II), some combinations of the
-used parameters affecting the results have given a speedup less than 1, showing
-the effectiveness of the asynchronous performance compared to the synchronous
-mode.
-
-In the case of a two clusters configuration, Table I shows that with a
-deterioration of inter cluster network set with \np[Mbits/s]{5} of bandwidth, a latency
-in order of a hundredth of a millisecond and a system power of one GFlops, an
-efficiency of about \np[\%]{40} in asynchronous mode is obtained for a matrix size of 62
-elements. It is noticed that the result remains stable even if we vary the
-external precision from \np{E-5} to \np{E-9}. By increasing the problem size up to 100
-elements, it was necessary to increase the CPU power of \np[\%]{50} to \np[GFlops]{1.5} for a
-convergence of the algorithm with the same order of asynchronous mode efficiency.
-Maintaining such a system power but this time, increasing network throughput
-inter cluster up to \np[Mbits/s]{50}, the result of efficiency of about \np[\%]{40} is
-obtained with high external  precision of \np{E-11} for a matrix size from 110 to 150
-side elements.
-
-For the 3 clusters architecture including a total of 100 hosts, Table II shows
-that it was difficult to have a combination which gives an efficiency of
-asynchronous below \np[\%]{80}. Indeed, for a matrix size of 62 elements, equality
-between the performance of the two modes (synchronous and asynchronous) is
-achieved with an inter cluster of \np[Mbits/s]{10} and a latency of \np[ms]{E-1}. To
-challenge an efficiency by \np[\%]{78} with a matrix size of 100 points, it was
-necessary to degrade the inter cluster network bandwidth from 5 to 2 Mbit/s.
+clusters including one hundred hosts (Tables~\ref{tab.cluster.2x50}
+and~\ref{tab.cluster.3x33}), some combinations of the used parameters affecting
+the results have given a speedup less than 1, showing the effectiveness of the
+asynchronous performance compared to the synchronous mode.
+
+In the case of a two clusters configuration, Table~\ref{tab.cluster.2x50} shows
+that with a deterioration of inter cluster network set with \np[Mbits/s]{5} of
+bandwidth, a latency in order of a hundredth of a millisecond and a system power
+of one GFlops, an efficiency of about \np[\%]{40} in asynchronous mode is
+obtained for a matrix size of 62 elements. It is noticed that the result remains
+stable even if we vary the external precision from \np{E-5} to \np{E-9}. By
+increasing the problem size up to $100$ elements, it was necessary to increase the
+CPU power of \np[\%]{50} to \np[GFlops]{1.5} for a convergence of the algorithm
+with the same order of asynchronous mode efficiency.  Maintaining such a system
+power but this time, increasing network throughput inter cluster up to
+\np[Mbits/s]{50}, the result of efficiency of about \np[\%]{40} is obtained with
+high external precision of \np{E-11} for a matrix size from $110$ to $150$ side
+elements.
+
+For the $3$ clusters architecture including a total of 100 hosts,
+Table~\ref{tab.cluster.3x33} shows that it was difficult to have a combination
+which gives an efficiency of asynchronous below \np[\%]{80}. Indeed, for a
+matrix size of $62$ elements, equality between the performance of the two modes
+(synchronous and asynchronous) is achieved with an inter cluster of
+\np[Mbits/s]{10} and a latency of \np[ms]{E-1}. To challenge an efficiency by
+\np[\%]{78} with a matrix size of $100$ points, it was necessary to degrade the
+inter cluster network bandwidth from 5 to 2 Mbit/s.
 
 A last attempt was made for a configuration of three clusters but more powerful
-with 200 nodes in total. The convergence with a speedup of \np[\%]{90} was obtained
-with a bandwidth of \np[Mbits/s]{1} as shown in Table III.
+with 200 nodes in total. The convergence with a speedup of \np[\%]{90} was
+obtained with a bandwidth of \np[Mbits/s]{1} as shown in
+Table~\ref{tab.cluster.3x67}.
 
 \section{Conclusion}
 The experimental results on executing a parallel iterative algorithm in