coreections et améliorations

[hpcc2014.git] / hpcc.tex
diff --git a/hpcc.tex b/hpcc.tex

index ebbdd4d41d4943acaa829dfe6d483dde45ab4502..d4dc19e7e508a2a140f9fa7d39af01f284cad954 100644 (file)
--- a/hpcc.tex
+++ b/hpcc.tex
@@ -110,10 +110,10 @@ suggests, these algorithm solves a given problem by successive iterations ($X_{n
  $X_{0}$ to find an approximate value $X^*$ of the solution with a very low residual error. Several well-known methods
  demonstrate the convergence of these algorithms \cite{}. 
  
-Parallelization of such algorithms generally involved the division of the problem into several \emph{pieces} that will
+Parallelization of such algorithms generally involved the division of the problem into several \emph{blocks} that will
  be solved in parallel on multiple processing units. The latter will communicate each intermediate results before a new
-iteration starts  until the approximate solution is reached. These parallel  computations can be performed either in
-\emph{synchronous} communication mode where a new iteration begin only when all nodes communications are completed,
+iteration starts and until the approximate solution is reached. These parallel  computations can be performed either in
+\emph{synchronous} mode where a new iteration begin only when all nodes communications are completed,
  either \emph{asynchronous} mode where processors can continue independently without or few synchronization points. For
  instance in the \textit{Asynchronous Iterations - Asynchronous   Communications (AIAC)} model \cite{bcvc06:ij}, local
  computations do not need to wait for required data. Processors can then perform their iterations with the data present
@@ -122,13 +122,13 @@ synchronous case, AIAC algorithms can significantly reduce overall execution tim
  synchronizations especially in a grid computing context (see \cite{bcvc06:ij} for more details).
  
  Parallel numerical applications (synchronous or asynchronous) may have different configuration and deployment
-requirements.  Quantifying their performance of resource allocation policies and application scheduling algorithms in
-grid computing environments under varying load, CPU power and network speeds is very costly, labor intensive and time
+requirements.  Quantifying their resource allocation policies and application scheduling algorithms in
+grid computing environments under varying load, CPU power and network speeds is very costly, very labor intensive and very time
  consuming \cite{BuRaCa}. The case of AIAC algorithms is even more problematic since they are very sensible to the
  execution environment context. For instance, variations in the network bandwith (intra and inter- clusters), in the
  number and the power of nodes, in the number of clusters... can lead to very different number of iterations and so to
-very different execution times. In this context, it appears that the use of simulation tools to explore various platform
-scenarios and to run enormous numbers of experiments quickly can be very promising. In this way, the use of a simulation
+very different execution times. Then, it appears that the use of simulation tools to explore various platform
+scenarios and to run large numbers of experiments quickly can be very promising. In this way, the use of a simulation
  environment to execute parallel  iterative algorithms found some interests in reducing the highly cost of  access to
  computing resources: (1) for the applications development life  cycle and in code debugging (2) and in production to get
  results in a reasonable execution time with a simulated infrastructure not accessible  with physical resources. Indeed,
@@ -137,15 +137,17 @@ environment challenges to find optimal configurations giving the best results wi
  best of execution time. 
  
  To our knowledge, there is no existing work on the large-scale simulation of a real AIAC application. The aim of this
-paper is to give a first approach of the simulation of AIAC algorithms using the SimGrid toolkit \cite{SimGrid}. We had
-in the scope of this work implemented a  program for solving large non-symmetric linear system of equations by numerical
-method GMRES (Generalized Minimal Residual). SimGrid had allowed us to launch the application from a modest computing
-infrastructure by simulating  different distributed architectures composed by clusters nodes interconnected by variable
-speed networks. The simulated results we obtained are in line with real results exposed in ?? In addition, it has been
-permitted to show the effectiveness of asynchronous mode algorithm by comparing its performance with the synchronous
-mode time. With selected parameters on the network platforms (bandwidth, latency of inter  cluster  network) and on the
-clusters architecture (number, capacity calculation power) in the simulated environment, the experimental results have
-demonstrated not only the algorithm convergence within a reasonable time compared with the physical environment
+paper is twofold. First we give a first approach of the simulation of AIAC algorithms using a simulation tool (i.e. the
+SimGrid toolkit \cite{SimGrid}). Second, we confirm the effectiveness of asynchronous mode algorithms by comparing their
+performance with the synchronous mode. More precisely, we had implemented a program for solving large non-symmetric
+linear system of equations by numerical method GMRES (Generalized Minimal Residual) []. We show, that with minor
+modifications of the initial MPI code, the SimGrid toolkit allows us to perform a test campain of a real AIAC
+application on different computing architectures. The simulated results we obtained are in line with real results
+exposed in ??. SimGrid had allowed us to launch the application from a modest computing infrastructure by simulating
+different distributed architectures composed by clusters nodes interconnected by variable speed networks. It has been
+permitted to show  With selected parameters on the network platforms (bandwidth, latency of inter  cluster network) and
+on the clusters architecture (number, capacity calculation power) in the simulated environment, the experimental results
+have demonstrated not only the algorithm convergence within a reasonable time compared with the physical environment
  performance, but also a time saving of up to \np[\%]{40} in asynchronous mode.
  
  This article is structured as follows: after this introduction, the next  section will give a brief description of
@@ -328,31 +330,30 @@ where $\MI$ is the maximum number of outer iterations and $\epsilon$ is the tole
  
  \section{Experimental results}
  
-When the \emph{real} application runs in the simulation environment and produces
-the expected results, varying the input parameters and the program arguments
-allows us to compare outputs from the code execution. We have noticed from this
-study that the results depend on the following parameters: (1) at the network
-level, we found that the most critical values are the bandwidth (bw) and the
-network latency (lat). (2) Hosts power (GFlops) can also influence on the
-results. And finally, (3) when submitting job batches for execution, the
-arguments values passed to the program like the maximum number of iterations or
-the \emph{external} precision are critical to ensure not only the convergence of the
-algorithm but also to get the main objective of the experimentation of the
-simulation in having an execution time in asynchronous less than in synchronous
-mode, in others words, in having a \emph{speedup} less than 1
-({speedup}${}={}${execution time in synchronous mode}${}/{}${execution time in
-asynchronous mode}).
-
-A priori, obtaining a speedup less than 1 would be difficult in a local area
+When the \emph{real} application runs in the simulation environment and produces the expected results, varying the input
+parameters and the program arguments allows us to compare outputs from the code execution. We have noticed from this
+study that the results depend on the following parameters:  
+\begin{itemize} 
+\item At the network level, we found that
+the most critical values are the bandwidth (bw) and the network latency (lat).  
+\item Hosts power (GFlops) can also
+influence on the results.  
+\item Finally, when submitting job batches for execution, the arguments values passed to the
+program like the maximum number of iterations or the \emph{external} precision are critical. They allow to ensure not
+only the convergence of the algorithm but also to get the main objective of the experimentation of the simulation in
+having an execution time in asynchronous less than in synchronous mode (i.e. speed-up less than $1$).
+\end{itemize}
+
+A priori, obtaining a speedup less than $1$ would be difficult in a local area
  network configuration where the synchronous mode will take advantage on the rapid
  exchange of information on such high-speed links. Thus, the methodology adopted
  was to launch the application on clustered network. In this last configuration,
  degrading the inter-cluster network performance will \emph{penalize} the synchronous
-mode allowing to get a speedup lower than 1. This action simulates the case of
+mode allowing to get a speedup lower than $1$. This action simulates the case of
  clusters linked with long distance network like Internet.
  
  As a first step, the algorithm was run on a network consisting of two clusters
-containing fifty hosts each, totaling one hundred hosts. Various combinations of
+containing $50$ hosts each, totaling $100$ hosts. Various combinations of
  the above factors have providing the results shown in
  Table~\ref{tab.cluster.2x50} with a matrix size ranging from $N_x = N_y = N_z =
  62 \text{ to } 171$ elements or from $62^{3} = \np{238328}$ to $171^{3} =
@@ -360,7 +361,7 @@ Table~\ref{tab.cluster.2x50} with a matrix size ranging from $N_x = N_y = N_z =
  
  \begin{table}[!t]
    \centering
-  \caption{2 clusters, each with 50 nodes}
+  \caption{$2$ clusters, each with $50$ nodes}
    \label{tab.cluster.2x50}
    \renewcommand{\arraystretch}{1.3}
  
@@ -412,14 +413,14 @@ Table~\ref{tab.cluster.2x50} with a matrix size ranging from $N_x = N_y = N_z =
  \end{table}
    
  Then we have changed the network configuration using three clusters containing
-respectively 33, 33 and 34 hosts, or again by on hundred hosts for all the
+respectively $33$, $33$ and $34$ hosts, or again by on hundred hosts for all the
  clusters. In the same way as above, a judicious choice of key parameters has
  permitted to get the results in Table~\ref{tab.cluster.3x33} which shows the
-speedups less than 1 with a matrix size from 62 to 100 elements.
+speedups less than $1$ with a matrix size from $62$ to $100$ elements.
  
  \begin{table}[!t]
    \centering
-  \caption{3 clusters, each with 33 nodes}
+  \caption{$3$ clusters, each with $33$ nodes}
    \label{tab.cluster.3x33}
    \renewcommand{\arraystretch}{1.3}
  
@@ -511,21 +512,21 @@ bandwidth, a latency in order of a hundredth of a millisecond and a system power
  of one GFlops, an efficiency of about \np[\%]{40} in asynchronous mode is
  obtained for a matrix size of 62 elements. It is noticed that the result remains
  stable even if we vary the external precision from \np{E-5} to \np{E-9}. By
-increasing the problem size up to 100 elements, it was necessary to increase the
+increasing the problem size up to $100$ elements, it was necessary to increase the
  CPU power of \np[\%]{50} to \np[GFlops]{1.5} for a convergence of the algorithm
  with the same order of asynchronous mode efficiency.  Maintaining such a system
  power but this time, increasing network throughput inter cluster up to
  \np[Mbits/s]{50}, the result of efficiency of about \np[\%]{40} is obtained with
-high external precision of \np{E-11} for a matrix size from 110 to 150 side
+high external precision of \np{E-11} for a matrix size from $110$ to $150$ side
  elements.
  
-For the 3 clusters architecture including a total of 100 hosts,
+For the $3$ clusters architecture including a total of 100 hosts,
  Table~\ref{tab.cluster.3x33} shows that it was difficult to have a combination
  which gives an efficiency of asynchronous below \np[\%]{80}. Indeed, for a
-matrix size of 62 elements, equality between the performance of the two modes
+matrix size of $62$ elements, equality between the performance of the two modes
  (synchronous and asynchronous) is achieved with an inter cluster of
  \np[Mbits/s]{10} and a latency of \np[ms]{E-1}. To challenge an efficiency by
-\np[\%]{78} with a matrix size of 100 points, it was necessary to degrade the
+\np[\%]{78} with a matrix size of $100$ points, it was necessary to degrade the
  inter cluster network bandwidth from 5 to 2 Mbit/s.
  
  A last attempt was made for a configuration of three clusters but more powerful