From: Arnaud Giersch <arnaud.giersch@univ-fcomte.fr>
Date: Tue, 22 Apr 2014 13:05:55 +0000 (+0200)
Subject: Todo++, typos, and reindent.
X-Git-Tag: hpcc2014_submission~86
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/hpcc2014.git/commitdiff_plain/8e535991311d246854e7597c06cdee25b67c8419?ds=sidebyside;hp=-c

Todo++, typos, and reindent.
---

8e535991311d246854e7597c06cdee25b67c8419
diff --git a/hpcc.tex b/hpcc.tex
index c3a0b34..c9606c3 100644
--- a/hpcc.tex
+++ b/hpcc.tex
@@ -121,34 +121,48 @@ at that time. Even if the number of iterations required before the convergence i
 synchronous case, AIAC algorithms can significantly reduce overall execution times by suppressing idle times due to
 synchronizations especially in a grid computing context (see~\cite{Bahi07} for more details).
 
-Parallel numerical applications (synchronous or asynchronous) may have different configuration and deployment
-requirements.  Quantifying their resource allocation policies and application scheduling algorithms in
-grid computing environments under varying load, CPU power and network speeds is very costly, very labor intensive and very time
-consuming~\cite{Calheiros:2011:CTM:1951445.1951450}. The case of AIAC algorithms is even more problematic since they are very sensible to the
-execution environment context. For instance, variations in the network bandwidth (intra and inter-clusters), in the
-number and the power of nodes, in the number of clusters... can lead to very different number of iterations and so to
-very different execution times. Then, it appears that the use of simulation tools to explore various platform
-scenarios and to run large numbers of experiments quickly can be very promising. In this way, the use of a simulation
-environment to execute parallel  iterative algorithms found some interests in reducing the highly cost of  access to
-computing resources: (1) for the applications development life  cycle and in code debugging (2) and in production to get
-results in a reasonable execution time with a simulated infrastructure not accessible  with physical resources. Indeed,
-the launch of distributed iterative  asynchronous algorithms to solve a given problem on a large-scale  simulated
-environment challenges to find optimal configurations giving the best results with a lowest residual error and in the
-best of execution time. 
-
-To our knowledge, there is no existing work on the large-scale simulation of a real AIAC application. The aim of this
-paper is twofold. First we give a first approach of the simulation of AIAC algorithms using a simulation tool (i.e. the
-SimGrid toolkit~\cite{SimGrid}). Second, we confirm the effectiveness of asynchronous mode algorithms by comparing their
-performance with the synchronous mode. More precisely, we had implemented a program for solving large non-symmetric
-linear system of equations by numerical method GMRES (Generalized Minimal Residual) []. We show, that with minor
-modifications of the initial MPI code, the SimGrid toolkit allows us to perform a test campaign of a real AIAC
-application on different computing architectures. The simulated results we obtained are in line with real results
-exposed in ??\AG[]{??}. SimGrid had allowed us to launch the application from a modest computing infrastructure by simulating
-different distributed architectures composed by clusters nodes interconnected by variable speed networks.
-With selected parameters on the network platforms (bandwidth, latency of inter  cluster network) and
-on the clusters architecture (number, capacity calculation power) in the simulated environment, the experimental results
-have demonstrated not only the algorithm convergence within a reasonable time compared with the physical environment
-performance, but also a time saving of up to \np[\%]{40} in asynchronous mode.
+Parallel numerical applications (synchronous or asynchronous) may have different
+configuration and deployment requirements.  Quantifying their resource
+allocation policies and application scheduling algorithms in grid computing
+environments under varying load, CPU power and network speeds is very costly,
+very labor-intensive and very
+time-consuming~\cite{Calheiros:2011:CTM:1951445.1951450}.  The case of AIAC
+algorithms is even more problematic since they are very sensitive to the
+execution environment context. For instance, variations in the network bandwidth
+(intra and inter-clusters), in the number and the power of nodes, in the number
+of clusters\dots{} can lead to very different numbers of iterations and so to
+very different execution times. It thus appears that the use of simulation
+tools to explore various platform scenarios and to run large numbers of
+experiments quickly can be very promising. In this way, using a simulation
+environment to execute parallel iterative algorithms is of interest for
+reducing the high cost of access to computing resources: (1) during the
+application development life cycle and code debugging, and (2) in production,
+to get results in a reasonable execution time with a simulated infrastructure
+not accessible with physical resources. Indeed, launching distributed
+iterative asynchronous algorithms to solve a given problem on a large-scale
+simulated environment raises the challenge of finding optimal configurations,
+giving the lowest residual error and the shortest execution time.
+
+To our knowledge, there is no existing work on the large-scale simulation of a
+real AIAC application. The aim of this paper is twofold. First, we give a first
+approach to the simulation of AIAC algorithms using a simulation tool (i.e. the
+SimGrid toolkit~\cite{SimGrid}). Second, we confirm the effectiveness of
+asynchronous mode algorithms by comparing their performance with the synchronous
+mode. More precisely, we implemented a program for solving large
+non-symmetric linear systems of equations with the GMRES (Generalized
+Minimal Residual) numerical method []\AG[]{[]?}. We show that, with minor modifications of the
+initial MPI code, the SimGrid toolkit allows us to perform a test campaign of a
+real AIAC application on different computing architectures. The simulated
+results we obtained are in line with real results presented in ??\AG[]{??}.
+SimGrid allowed us to launch the application from a modest computing
+infrastructure by simulating different distributed architectures composed of
+cluster nodes interconnected by variable-speed networks.  With selected
+parameters on the network platforms (bandwidth, latency of the inter-cluster
+network) and on the cluster architecture (number, computing power)
+in the simulated environment, the experimental results have demonstrated not
+only the algorithm convergence within a reasonable time compared with the
+physical environment performance, but also a time saving of up to \np[\%]{40} in
+asynchronous mode.
 
 This article is structured as follows: after this introduction, the next  section will give a brief description of
 iterative asynchronous model.  Then, the simulation framework SimGrid is presented with the settings to create various
@@ -184,18 +198,25 @@ in a grid computing context.
 \end{figure}
 
 
-It is very challenging to develop efficient applications for large scale, heterogeneous and distributed platforms such
-as computing grids. Researchers and engineers have to develop techniques for maximizing application performance of these
-multi-cluster platforms, by redesigning the applications and/or by using novel algorithms that can account for the
-composite and heterogeneous nature of the platform. Unfortunately, the deployment of such applications on these very
-large scale systems is very costly, labor intensive and time consuming. In this context, it appears that the use of
-simulation tools to explore various platform scenarios at will and to run enormous numbers of experiments quickly can be
-very promising. Several works...
+It is very challenging to develop efficient applications for large-scale,
+heterogeneous and distributed platforms such as computing grids. Researchers and
+engineers have to develop techniques for maximizing application performance on
+these multi-cluster platforms, by redesigning the applications and/or by using
+novel algorithms that can account for the composite and heterogeneous nature of
+the platform. Unfortunately, the deployment of such applications on these very
+large-scale systems is very costly, labor-intensive and time-consuming. In this
+context, it appears that the use of simulation tools to explore various platform
+scenarios at will and to run enormous numbers of experiments quickly can be very
+promising. Several works\dots{}
 
-In the context of AIAC algorithms, the use of simulation tools is even more relevant. Indeed, this class of applications
-is very sensible to the execution environment context. For instance, variations in the network bandwidth (intra and
-inter-clusters), in the number and the power of nodes, in the number of clusters... can lead to very different number of
-iterations and so to very different execution times.
+\AG{Several works\dots{} what?\\
+  Is the following paragraph already in the intro?}
+In the context of AIAC algorithms, the use of simulation tools is even more
+relevant. Indeed, this class of applications is very sensitive to the execution
+environment context. For instance, variations in the network bandwidth (intra
+and inter-clusters), in the number and the power of nodes, in the number of
+clusters\dots{} can lead to very different numbers of iterations and so to very
+different execution times.
 
 
 
@@ -340,25 +361,34 @@ where $\MI$ is the maximum number of outer iterations and $\epsilon$ is the tole
 
 \LZK{Description of the process of adapting the multisplitting algorithm to SimGrid}
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-We did not encounter major blocking problems when adapting the multisplitting algorithm previously described to a simulation environment like SIMGRID unless some code 
-debugging. Indeed, apart from the review of the program sequence for asynchronous exchanges between the six neighbors of each point in a submatrix within a cluster or 
-between clusters, the algorithm was executed successfully with SMPI and provided identical outputs as those obtained with direct execution under MPI. In synchronous 
-mode, the execution of the program raised no particular issue but in asynchronous mode, the review of the sequence of MPI\_Isend, MPI\_Irecv and MPI\_Waitall instructions
-and with the addition of the primitive MPI\_Test was needed to avoid a memory fault due to an infinite loop resulting from the non-convergence of the algorithm. Note here that the use of SMPI
-functions optimizer for memory footprint and CPU usage is not recommended knowing that one wants to get real results by simulation.
-As mentioned, upon this adaptation, the algorithm is executed as in the real life in the simulated environment after the following minor changes. First, all declared 
-global variables have been moved to local variables for each subroutine. In fact, global variables generate side effects arising from the concurrent access of 
-shared memory used by threads simulating each computing units in the SimGrid architecture. Second, the alignment of certain types of variables such as ``long int'' had
-also to be reviewed. Finally, some compilation errors on MPI\_Waitall and MPI\_Finalize primitives have been fixed with the latest version of SimGrid.
-In total, the initial MPI program running on the simulation environment SMPI gave after a very simple adaptation the same results as those obtained in a real 
-environment. We have tested in synchronous mode with a simulated platform starting from a modest 2 or 3 clusters grid to a larger configuration like simulating 
-Grid5000 with more than 1500 hosts with 5000 cores~\cite{bolze2006grid}. Once the code debugging and adaptation were complete, the next section shows our methodology and experimental 
-results.
-
-
-
-
-
+We did not encounter major blocking problems when adapting the multisplitting
+algorithm previously described to a simulation environment like SimGrid, except
+for some code debugging. Indeed, apart from the review of the program sequence for
+asynchronous exchanges between the six neighbors of each point in a submatrix
+within a cluster or between clusters, the algorithm was executed successfully
+with SMPI and provided outputs identical to those obtained with direct execution
+under MPI. In synchronous mode, the execution of the program raised no
+particular issue, but in asynchronous mode, the sequence of
+MPI\_Isend, MPI\_Irecv and MPI\_Waitall instructions had to be reviewed and
+the primitive MPI\_Test added to avoid a memory fault due to an infinite
+loop resulting from the non-convergence of the algorithm. Note here that the use
+of the SMPI functions that optimize memory footprint and CPU usage is not
+recommended when one wants to get real results by simulation.  As
+mentioned, upon this adaptation, the algorithm is executed as in real life
+in the simulated environment after the following minor changes. First, all
+declared global variables have been moved to local variables for each
+subroutine. In fact, global variables generate side effects arising from the
+concurrent access of shared memory used by threads simulating each computing
+unit in the SimGrid architecture. Second, the alignment of certain types of
+variables such as ``long int'' also had to be reviewed. Finally, some
+compilation errors on MPI\_Waitall and MPI\_Finalize primitives have been fixed
+with the latest version of SimGrid.  In total, the initial MPI program running
+on the simulation environment SMPI gave, after a very simple adaptation, the same
+results as those obtained in a real environment. We have tested in synchronous
+mode with a simulated platform starting from a modest 2- or 3-cluster grid to a
+larger configuration like simulating Grid5000 with more than 1500 hosts with
+5000 cores~\cite{bolze2006grid}. With the code debugging and adaptation
+complete, the next section presents our methodology and experimental results.
 
 
 \section{Experimental results}
@@ -511,7 +541,7 @@ Table~\ref{tab.cluster.3x67}.
     \hline
     speedup    & 0.9 \\
     \hline
- \end{mytable}
+  \end{mytable}
 \end{table}
 
 Note that the program was run with the following parameters: