From: David Laiymani <david.laiymani@univ-fcomte.fr>
Date: Wed, 6 May 2015 13:59:36 +0000 (+0200)
Subject: Merge branch 'master' of ssh://bilbo.iut-bm.univ-fcomte.fr/rce2015
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/rce2015.git/commitdiff_plain/f99a638c921635746c3df9d45c46fe345f57a074?ds=inline;hp=-c

Merge branch 'master' of ssh://bilbo.iut-bm.univ-fcomte.fr/rce2015
---

f99a638c921635746c3df9d45c46fe345f57a074
diff --combined paper.tex
index b11da8c,8fd9fb4..e60e242
--- a/paper.tex
+++ b/paper.tex
@@@ -96,21 -96,20 +96,21 @@@ performed. With some applications, it i
  accurate performance models. That is why another solution is to use a simulation
  tool which allows us to change many parameters of the architecture (network
  bandwidth, latency, number of processors) and to simulate the execution of such
 -applications. We have decided to use SimGrid as it enables to benchmark MPI
 -applications.
 +applications. The main contribution of this paper is to show that the use of a
 +simulation tool (here we have decided to use the SimGrid toolkit) can really
 +help developpers to better tune their applications for a given multi-core
 +architecture.
  
 -In this paper, we focus our attention on two parallel iterative algorithms based
 +In particular we focus our attention on two parallel iterative algorithms based
  on the  Multisplitting algorithm  and we  compare them  to the  GMRES algorithm.
 -These algorithms  are used to  solve libear  systems. Two different  variants of
 +These algorithms  are used to  solve linear  systems. Two different  variants of
  the Multisplitting are studied: one  using synchronoous  iterations and  another
 -one  with asynchronous iterations. For each algorithm we have  tested different
 -parameters to see their influence.  We strongly  recommend people  interested
 -by investing  into a  new expensive  hardware  architecture  to   benchmark
 -their  applications  using  a simulation tool before.
 -
 -
 -
 +one  with asynchronous iterations. For each algorithm we have simulated
 +different architecture parameters to evaluate their influence on the overall
 +execution time.  The obtain simulated results confirm the real results
 +previously obtained on different real multi-core architectures and also confirm
 +the efficiency of the asynchronous multisplitting algorithm compared to the
 +synchronous GMRES method.
  
  \end{abstract}
  
@@@ -519,26 -518,28 +519,28 @@@ multisplitting method
  \end{figure}
  
  
- The execution time difference between the two algorithms is important when
- comparing between different grid architectures, even with the same number of
- processors (like 2x16 and 4x8 = 32 processors for example). The
- experiment concludes the low sensitivity of the multisplitting method
- (compared with the classical GMRES) when scaling up the number of the processors in the grid: in average, the GMRES (resp. Multisplitting) algorithm performs 40\% better (resp. 48\%) less when running from 2x16=32 to 8x8=64 processors.
- 
- \textit{\\3.b Running on two different speed cluster inter-networks\\}
+ The execution  times between  the two algorithms  is significant  with different
+ grid architectures, even  with the same number of processors  (for example, 2x16
+ and  4x8). We  can  observ  the low  sensitivity  of  the Krylov multisplitting  method
+ (compared with the classical GMRES) when scaling up the number of the processors
+ in the  grid: in  average, the GMRES  (resp. Multisplitting)  algorithm performs
+ 40\% better (resp. 48\%) less when running from 2x16=32 to 8x8=64 processors.
+ 
+ \subsubsection{Running on two different speed cluster inter-networks}
+ \ \\
  
- % environment
- \begin{footnotesize}
+ \begin{figure} [ht!]
+ \begin{center}
  \begin{tabular}{r c }
   \hline
   Grid & 2x16, 4x8\\ %\hline
   Network & N1 : bw=10Gbs-lat=8.10$^{-6}$ \\ %\hline
   - & N2 : bw=1Gbs-lat=5.10$^{-5}$ \\
-  Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline \\
+  Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline 
   \end{tabular}
- Table 2 : Clusters x Nodes - Networks N1 x N2 \\
- 
-  \end{footnotesize}
+ \caption{Clusters x Nodes - Networks N1 x N2}
+ \end{center}
+ \end{figure}
  
  
  
@@@ -547,31 -548,30 +549,30 @@@
  \centering
  \includegraphics[width=100mm]{cluster_x_nodes_n1_x_n2.pdf}
  \caption{Cluster x Nodes N1 x N2}
- %\label{overflow}}
+ \label{fig:02}
  \end{figure}
  %\end{wrapfigure}
  
- The experiments compare the behavior of the algorithms running first on
- a speed inter- cluster network (N1) and also on a less performant network (N2).
- Figure 4 shows that end users will gain to reduce the execution time
- for both algorithms in using a grid architecture like 4x16 or 8x8: the
- performance was increased in a factor of 2. The results depict also that
- when the network speed drops down (12.5\%), the difference between the execution
- times can reach more than 25\%.
+ These experiments  compare the  behavior of  the algorithms  running first  on a
+ speed inter-cluster  network (N1) and  also on  a less performant  network (N2).
+ Figure~\ref{fig:02} shows that end users will  gain to reduce the execution time
+ for  both  algorithms  in using  a  grid  architecture  like  4x16 or  8x8:  the
+ performance was increased  in a factor of  2. The results depict  also that when
+ the  network speed  drops down  (12.5\%), the  difference between  the execution
+ times can reach more than 25\%. \RC{c'est pas clair : la diffÃ©rence entre quoi et quoi?}
  
- \textit{\\3.c Network latency impacts on performance\\}
- 
- % environment
- \begin{footnotesize}
+ \subsubsection{Network latency impacts on performance}
+ \ \\
+ \begin{figure} [ht!]
+ \centering
  \begin{tabular}{r c }
   \hline
   Grid & 2x16\\ %\hline
   Network & N1 : bw=1Gbs \\ %\hline
-  Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline\\
+  Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline
   \end{tabular}
- Table 3 : Network latency impact \\
- 
- \end{footnotesize}
+ \caption{Network latency impact}
+ \end{figure}
  
  
  
@@@ -579,123 -579,124 +580,124 @@@
  \centering
  \includegraphics[width=100mm]{network_latency_impact_on_execution_time.pdf}
  \caption{Network latency impact on execution time}
- %\label{overflow}}
+ \label{fig:03}
  \end{figure}
  
  
- According the results in figure 5, degradation of the network
- latency from 8.10$^{-6}$ to 6.10$^{-5}$ implies an absolute time
- increase more than 75\% (resp. 82\%) of the execution for the classical
- GMRES (resp. multisplitting) algorithm. In addition, it appears that the
- multisplitting method tolerates more the network latency variation with
- a less rate increase of the execution time. Consequently, in the worst case (lat=6.10$^{-5
- }$), the execution time for GMRES is almost the double of the time for
- the multisplitting, even though, the performance was on the same order
- of magnitude with a latency of 8.10$^{-6}$.
- 
- \textit{\\3.d Network bandwidth impacts on performance\\}
+ According  the results  in  Figure~\ref{fig:03}, a  degradation  of the  network
+ latency from 8.10$^{-6}$  to 6.10$^{-5}$ implies an absolute  time increase more
+ than 75\%  (resp. 82\%) of the  execution for the classical  GMRES (resp. Krylov
+ multisplitting)   algorithm.   In   addition,   it  appears   that  the   Krylov
+ multisplitting method tolerates  more the network latency variation  with a less
+ rate  increase  of  the  execution   time.   Consequently,  in  the  worst  case
+ (lat=6.10$^{-5 }$), the  execution time for GMRES is almost  the double than the
+ time of the Krylov multisplitting, even  though, the performance was on the same
+ order of magnitude with a latency of 8.10$^{-6}$.
  
- % environment
- \begin{footnotesize}
+ \subsubsection{Network bandwidth impacts on performance}
+ \ \\
+ \begin{figure} [ht!]
+ \centering
  \begin{tabular}{r c }
   \hline
   Grid & 2x16\\ %\hline
   Network & N1 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline
   Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline \\
   \end{tabular}
- Table 4 : Network bandwidth impact \\
- 
- \end{footnotesize}
+ \caption{Network bandwidth impact}
+ \end{figure}
  
  
  \begin{figure} [ht!]
  \centering
  \includegraphics[width=100mm]{network_bandwith_impact_on_execution_time.pdf}
  \caption{Network bandwith impact on execution time}
- %\label{overflow}
+ \label{fig:04}
  \end{figure}
  
  
  
- The results of increasing the network bandwidth show the improvement
- of the performance for both of the two algorithms by reducing the execution time (Figure 6). However, and again in this case, the multisplitting method presents a better performance in the considered bandwidth interval with a gain of 40\% which is only around 24\% for classical GMRES.
- 
- \textit{\\3.e Input matrix size impacts on performance\\}
+ The results  of increasing  the network  bandwidth show  the improvement  of the
+ performance  for   both  algorithms   by  reducing   the  execution   time  (see
+ Figure~\ref{fig:04}). However,  in this  case, the Krylov  multisplitting method
+ presents a better  performance in the considered bandwidth interval  with a gain
+ of 40\% which is only around 24\% for classical GMRES.
  
- % environment
- \begin{footnotesize}
+ \subsubsection{Input matrix size impacts on performance}
+ \ \\
+ \begin{figure} [ht!]
+ \centering
  \begin{tabular}{r c }
   \hline
   Grid & 4x8\\ %\hline
-  Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline
-  Input matrix size & N$_{x}$ = From 40 to 200\\ \hline \\
+  Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\ 
+  Input matrix size & N$_{x}$ = From 40 to 200\\ \hline
   \end{tabular}
- Table 5 : Input matrix size impact\\
- 
- \end{footnotesize}
+ \caption{Input matrix size impact}
+ \end{figure}
  
  
  \begin{figure} [ht!]
  \centering
  \includegraphics[width=100mm]{pb_size_impact_on_execution_time.pdf}
- \caption{Pb size impact on execution time}
- %\label{overflow}}
+ \caption{Problem size impact on execution time}
+ \label{fig:05}
  \end{figure}
  
- In this experimentation, the input matrix size has been set from
- N$_{x}$ = N$_{y}$ = N$_{z}$ = 40 to 200 side elements that is from 40$^{3}$ = 64.000 to
- 200$^{3}$ = 8.000.000 points. Obviously, as shown in the figure 7,
- the execution time for the two algorithms convergence increases with the
- iinput matrix size. But the interesting results here direct on (i) the
- drastic increase (300 times) of the number of iterations needed before
- the convergence for the classical GMRES algorithm when the matrix size
- go beyond N$_{x}$=150; (ii) the classical GMRES execution time also almost
- the double from N$_{x}$=140 compared with the convergence time of the
- multisplitting method. These findings may help a lot end users to setup
- the best and the optimal targeted environment for the application
- deployment when focusing on the problem size scale up. Note that the
- same test has been done with the grid 2x16 getting the same conclusion.
- 
- \textit{\\3.f CPU Power impact on performance\\}
+ In these experiments, the input matrix size  has been set from N$_{x}$ = N$_{y}$
+ = N$_{z}$ = 40 to 200 side elements  that is from 40$^{3}$ = 64.000 to 200$^{3}$
+ = 8,000,000  points. Obviously, as  shown in Figure~\ref{fig:05},  the execution
+ time for  both algorithms increases when  the input matrix size  also increases.
+ But the interesting results are:
+ \begin{enumerate}
+   \item the drastic increase (300 times) \RC{Je ne vois pas cela sur la figure}
+ of the  number of  iterations needed  to reach the  convergence for  the classical
+ GMRES algorithm when  the matrix size go beyond N$_{x}$=150;
+ \item the  classical GMRES execution time  is almost the double  for N$_{x}$=140
+   compared with the Krylov multisplitting method.
+ \end{enumerate}
  
- % environment
- \begin{footnotesize}
+ These  findings may  help a  lot end  users to  setup the  best and  the optimal
+ targeted environment for the application deployment when focusing on the problem
+ size scale up.  It  should be noticed that the same test has  been done with the
+ grid 2x16 leading to the same conclusion.
+ 
+ \subsubsection{CPU Power impact on performance}
+ 
+ \begin{figure} [ht!]
+ \centering
  \begin{tabular}{r c }
   \hline
   Grid & 2x16\\ %\hline
   Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline
   Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline
   \end{tabular}
- Table 6 : CPU Power impact \\
- 
- \end{footnotesize}
- 
+ \caption{CPU Power impact}
+ \end{figure}
  
  \begin{figure} [ht!]
  \centering
  \includegraphics[width=100mm]{cpu_power_impact_on_execution_time.pdf}
  \caption{CPU Power impact on execution time}
- %\label{overflow}}
- s\end{figure}
- 
- Using the Simgrid simulator flexibility, we have tried to determine the
- impact on the algorithms performance in varying the CPU power of the
- clusters nodes from 1 to 19 GFlops. The outputs depicted in the figure 6
- confirm the performance gain, around 95\% for both of the two methods,
- after adding more powerful CPU.
- 
- \subsection{Comparing GMRES in native synchronous mode and
- Multisplitting algorithms in asynchronous mode}
- 
- The previous paragraphs put in evidence the interests to simulate the
- behavior of the application before any deployment in a real environment.
- We have focused the study on analyzing the performance in varying the
- key factors impacting the results. The study compares
- the performance of the two proposed algorithms both in \textit{synchronous mode
- }. In this section, following the same previous methodology, the goal is to
- demonstrate the efficiency of the multisplitting method in \textit{
- asynchronous mode} compared with the classical GMRES staying in
- \textit{synchronous mode}.
+ \label{fig:06}
+ \end{figure}
+ 
+ Using the Simgrid  simulator flexibility, we have tried to  determine the impact
+ on the  algorithms performance in  varying the CPU  power of the  clusters nodes
+ from 1  to 19 GFlops.  The outputs  depicted in Figure~\ref{fig:06}  confirm the
+ performance gain,  around 95\% for  both of the  two methods, after  adding more
+ powerful CPU.
+ 
+ \subsection{Comparing GMRES in native synchronous mode and the multisplitting algorithm in asynchronous mode}
+ 
+ The previous paragraphs  put in evidence the interests to  simulate the behavior
+ of the application before any deployment in a real environment.  We have focused
+ the study on analyzing the performance  in varying the key factors impacting the
+ results. The study compares the performance  of the two proposed algorithms both
+ in  \textit{synchronous mode  }. In  this section,  following the  same previous
+ methodology, the  goal is  to demonstrate the  efficiency of  the multisplitting
+ method in \textit{ asynchronous mode}  compared with the classical GMRES staying
+ in \textit{synchronous mode}.
  
  Note that the interest of using the asynchronous mode for data exchange
  is mainly, in opposite of the synchronous mode, the non-wait aspects of