Merge branch 'master' of ssh://bilbo.iut-bm.univ-fcomte.fr/rce2015

[rce2015.git] / paper.tex
diff --git a/paper.tex b/paper.tex

index 2c9b309bc9661048b70b46078c0ea028777afef9..e60e2423d4147e0486b9a1aba879d472d691c322 100644 (file)
--- a/paper.tex
+++ b/paper.tex
@@ -96,20 +96,21 @@ performed. With some applications, it is difficult, if not impossible, to build
  accurate performance models. That is why another solution is to use a simulation
  tool which allows us to change many parameters of the architecture (network
  bandwidth, latency, number of processors) and to simulate the execution of such
-applications. We have decided to use SimGrid as it enables to benchmark MPI
-applications.
+applications. The main contribution of this paper is to show that the use of a
+simulation tool (here we have decided to use the SimGrid toolkit) can really
+help developpers to better tune their applications for a given multi-core
+architecture.
  
-In this paper, we focus our attention on two parallel iterative algorithms based
+In particular we focus our attention on two parallel iterative algorithms based
  on the  Multisplitting algorithm  and we  compare them  to the  GMRES algorithm.
-These algorithms  are used to  solve libear  systems. Two different  variants of
+These algorithms  are used to  solve linear  systems. Two different  variants of
  the Multisplitting are studied: one  using synchronoous  iterations and  another
-one  with asynchronous iterations. For each algorithm we have  tested different
-parameters to see their influence.  We strongly  recommend people  interested
-by investing  into a  new expensive  hardware  architecture  to   benchmark
-their  applications  using  a simulation tool before.
-
-
-
+one  with asynchronous iterations. For each algorithm we have simulated
+different architecture parameters to evaluate their influence on the overall
+execution time.  The obtain simulated results confirm the real results
+previously obtained on different real multi-core architectures and also confirm
+the efficiency of the asynchronous multisplitting algorithm compared to the
+synchronous GMRES method.
  
  \end{abstract}
  
@@ -518,26 +519,28 @@ multisplitting method.
  \end{figure}
  
  
-The execution time difference between the two algorithms is important when
-comparing between different grid architectures, even with the same number of
-processors (like 2x16 and 4x8 = 32 processors for example). The
-experiment concludes the low sensitivity of the multisplitting method
-(compared with the classical GMRES) when scaling up the number of the processors in the grid: in average, the GMRES (resp. Multisplitting) algorithm performs 40\% better (resp. 48\%) less when running from 2x16=32 to 8x8=64 processors.
+The execution  times between  the two algorithms  is significant  with different
+grid architectures, even  with the same number of processors  (for example, 2x16
+and  4x8). We  can  observ  the low  sensitivity  of  the Krylov multisplitting  method
+(compared with the classical GMRES) when scaling up the number of the processors
+in the  grid: in  average, the GMRES  (resp. Multisplitting)  algorithm performs
+40\% better (resp. 48\%) less when running from 2x16=32 to 8x8=64 processors.
  
-\textit{\\3.b Running on two different speed cluster inter-networks\\}
+\subsubsection{Running on two different speed cluster inter-networks}
+\ \\
  
-% environment
-\begin{footnotesize}
+\begin{figure} [ht!]
+\begin{center}
  \begin{tabular}{r c }
   \hline
   Grid & 2x16, 4x8\\ %\hline
   Network & N1 : bw=10Gbs-lat=8.10$^{-6}$ \\ %\hline
   - & N2 : bw=1Gbs-lat=5.10$^{-5}$ \\
- Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline \\
+ Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline 
   \end{tabular}
-Table 2 : Clusters x Nodes - Networks N1 x N2 \\
-
- \end{footnotesize}
+\caption{Clusters x Nodes - Networks N1 x N2}
+\end{center}
+\end{figure}
  
  
  
@@ -546,31 +549,30 @@ Table 2 : Clusters x Nodes - Networks N1 x N2 \\
  \centering
  \includegraphics[width=100mm]{cluster_x_nodes_n1_x_n2.pdf}
  \caption{Cluster x Nodes N1 x N2}
-%\label{overflow}}
+\label{fig:02}
  \end{figure}
  %\end{wrapfigure}
  
-The experiments compare the behavior of the algorithms running first on
-a speed inter- cluster network (N1) and also on a less performant network (N2).
-Figure 4 shows that end users will gain to reduce the execution time
-for both algorithms in using a grid architecture like 4x16 or 8x8: the
-performance was increased in a factor of 2. The results depict also that
-when the network speed drops down (12.5\%), the difference between the execution
-times can reach more than 25\%.
-
-\textit{\\3.c Network latency impacts on performance\\}
+These experiments  compare the  behavior of  the algorithms  running first  on a
+speed inter-cluster  network (N1) and  also on  a less performant  network (N2).
+Figure~\ref{fig:02} shows that end users will  gain to reduce the execution time
+for  both  algorithms  in using  a  grid  architecture  like  4x16 or  8x8:  the
+performance was increased  in a factor of  2. The results depict  also that when
+the  network speed  drops down  (12.5\%), the  difference between  the execution
+times can reach more than 25\%. \RC{c'est pas clair : la différence entre quoi et quoi?}
  
-% environment
-\begin{footnotesize}
+\subsubsection{Network latency impacts on performance}
+\ \\
+\begin{figure} [ht!]
+\centering
  \begin{tabular}{r c }
   \hline
   Grid & 2x16\\ %\hline
   Network & N1 : bw=1Gbs \\ %\hline
- Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline\\
+ Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline
   \end{tabular}
-Table 3 : Network latency impact \\
-
-\end{footnotesize}
+\caption{Network latency impact}
+\end{figure}
  
  
  
@@ -578,123 +580,124 @@ Table 3 : Network latency impact \\
  \centering
  \includegraphics[width=100mm]{network_latency_impact_on_execution_time.pdf}
  \caption{Network latency impact on execution time}
-%\label{overflow}}
+\label{fig:03}
  \end{figure}
  
  
-According the results in figure 5, degradation of the network
-latency from 8.10$^{-6}$ to 6.10$^{-5}$ implies an absolute time
-increase more than 75\% (resp. 82\%) of the execution for the classical
-GMRES (resp. multisplitting) algorithm. In addition, it appears that the
-multisplitting method tolerates more the network latency variation with
-a less rate increase of the execution time. Consequently, in the worst case (lat=6.10$^{-5
-}$), the execution time for GMRES is almost the double of the time for
-the multisplitting, even though, the performance was on the same order
-of magnitude with a latency of 8.10$^{-6}$.
+According  the results  in  Figure~\ref{fig:03}, a  degradation  of the  network
+latency from 8.10$^{-6}$  to 6.10$^{-5}$ implies an absolute  time increase more
+than 75\%  (resp. 82\%) of the  execution for the classical  GMRES (resp. Krylov
+multisplitting)   algorithm.   In   addition,   it  appears   that  the   Krylov
+multisplitting method tolerates  more the network latency variation  with a less
+rate  increase  of  the  execution   time.   Consequently,  in  the  worst  case
+(lat=6.10$^{-5 }$), the  execution time for GMRES is almost  the double than the
+time of the Krylov multisplitting, even  though, the performance was on the same
+order of magnitude with a latency of 8.10$^{-6}$.
  
-\textit{\\3.d Network bandwidth impacts on performance\\}
-
-% environment
-\begin{footnotesize}
+\subsubsection{Network bandwidth impacts on performance}
+\ \\
+\begin{figure} [ht!]
+\centering
  \begin{tabular}{r c }
   \hline
   Grid & 2x16\\ %\hline
   Network & N1 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline
   Input matrix size & N$_{x}$ x N$_{y}$ x N$_{z}$ =150 x 150 x 150\\ \hline \\
   \end{tabular}
-Table 4 : Network bandwidth impact \\
-
-\end{footnotesize}
+\caption{Network bandwidth impact}
+\end{figure}
  
  
  \begin{figure} [ht!]
  \centering
  \includegraphics[width=100mm]{network_bandwith_impact_on_execution_time.pdf}
  \caption{Network bandwith impact on execution time}
-%\label{overflow}
+\label{fig:04}
  \end{figure}
  
  
  
-The results of increasing the network bandwidth show the improvement
-of the performance for both of the two algorithms by reducing the execution time (Figure 6). However, and again in this case, the multisplitting method presents a better performance in the considered bandwidth interval with a gain of 40\% which is only around 24\% for classical GMRES.
-
-\textit{\\3.e Input matrix size impacts on performance\\}
+The results  of increasing  the network  bandwidth show  the improvement  of the
+performance  for   both  algorithms   by  reducing   the  execution   time  (see
+Figure~\ref{fig:04}). However,  in this  case, the Krylov  multisplitting method
+presents a better  performance in the considered bandwidth interval  with a gain
+of 40\% which is only around 24\% for classical GMRES.
  
-% environment
-\begin{footnotesize}
+\subsubsection{Input matrix size impacts on performance}
+\ \\
+\begin{figure} [ht!]
+\centering
  \begin{tabular}{r c }
   \hline
   Grid & 4x8\\ %\hline
- Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline
- Input matrix size & N$_{x}$ = From 40 to 200\\ \hline \\
+ Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\ 
+ Input matrix size & N$_{x}$ = From 40 to 200\\ \hline
   \end{tabular}
-Table 5 : Input matrix size impact\\
-
-\end{footnotesize}
+\caption{Input matrix size impact}
+\end{figure}
  
  
  \begin{figure} [ht!]
  \centering
  \includegraphics[width=100mm]{pb_size_impact_on_execution_time.pdf}
-\caption{Pb size impact on execution time}
-%\label{overflow}}
+\caption{Problem size impact on execution time}
+\label{fig:05}
  \end{figure}
  
-In this experimentation, the input matrix size has been set from
-N$_{x}$ = N$_{y}$ = N$_{z}$ = 40 to 200 side elements that is from 40$^{3}$ = 64.000 to
-200$^{3}$ = 8.000.000 points. Obviously, as shown in the figure 7,
-the execution time for the two algorithms convergence increases with the
-iinput matrix size. But the interesting results here direct on (i) the
-drastic increase (300 times) of the number of iterations needed before
-the convergence for the classical GMRES algorithm when the matrix size
-go beyond N$_{x}$=150; (ii) the classical GMRES execution time also almost
-the double from N$_{x}$=140 compared with the convergence time of the
-multisplitting method. These findings may help a lot end users to setup
-the best and the optimal targeted environment for the application
-deployment when focusing on the problem size scale up. Note that the
-same test has been done with the grid 2x16 getting the same conclusion.
-
-\textit{\\3.f CPU Power impact on performance\\}
+In these experiments, the input matrix size  has been set from N$_{x}$ = N$_{y}$
+= N$_{z}$ = 40 to 200 side elements  that is from 40$^{3}$ = 64.000 to 200$^{3}$
+= 8,000,000  points. Obviously, as  shown in Figure~\ref{fig:05},  the execution
+time for  both algorithms increases when  the input matrix size  also increases.
+But the interesting results are:
+\begin{enumerate}
+  \item the drastic increase (300 times) \RC{Je ne vois pas cela sur la figure}
+of the  number of  iterations needed  to reach the  convergence for  the classical
+GMRES algorithm when  the matrix size go beyond N$_{x}$=150;
+\item the  classical GMRES execution time  is almost the double  for N$_{x}$=140
+  compared with the Krylov multisplitting method.
+\end{enumerate}
  
-% environment
-\begin{footnotesize}
+These  findings may  help a  lot end  users to  setup the  best and  the optimal
+targeted environment for the application deployment when focusing on the problem
+size scale up.  It  should be noticed that the same test has  been done with the
+grid 2x16 leading to the same conclusion.
+
+\subsubsection{CPU Power impact on performance}
+
+\begin{figure} [ht!]
+\centering
  \begin{tabular}{r c }
   \hline
   Grid & 2x16\\ %\hline
   Network & N2 : bw=1Gbs - lat=5.10$^{-5}$ \\ %\hline
   Input matrix size & N$_{x}$ = 150 x 150 x 150\\ \hline
   \end{tabular}
-Table 6 : CPU Power impact \\
-
-\end{footnotesize}
-
+\caption{CPU Power impact}
+\end{figure}
  
  \begin{figure} [ht!]
  \centering
  \includegraphics[width=100mm]{cpu_power_impact_on_execution_time.pdf}
  \caption{CPU Power impact on execution time}
-%\label{overflow}}
-s\end{figure}
-
-Using the Simgrid simulator flexibility, we have tried to determine the
-impact on the algorithms performance in varying the CPU power of the
-clusters nodes from 1 to 19 GFlops. The outputs depicted in the figure 6
-confirm the performance gain, around 95\% for both of the two methods,
-after adding more powerful CPU.
-
-\subsection{Comparing GMRES in native synchronous mode and
-Multisplitting algorithms in asynchronous mode}
-
-The previous paragraphs put in evidence the interests to simulate the
-behavior of the application before any deployment in a real environment.
-We have focused the study on analyzing the performance in varying the
-key factors impacting the results. The study compares
-the performance of the two proposed algorithms both in \textit{synchronous mode
-}. In this section, following the same previous methodology, the goal is to
-demonstrate the efficiency of the multisplitting method in \textit{
-asynchronous mode} compared with the classical GMRES staying in
-\textit{synchronous mode}.
+\label{fig:06}
+\end{figure}
+
+Using the Simgrid  simulator flexibility, we have tried to  determine the impact
+on the  algorithms performance in  varying the CPU  power of the  clusters nodes
+from 1  to 19 GFlops.  The outputs  depicted in Figure~\ref{fig:06}  confirm the
+performance gain,  around 95\% for  both of the  two methods, after  adding more
+powerful CPU.
+
+\subsection{Comparing GMRES in native synchronous mode and the multisplitting algorithm in asynchronous mode}
+
+The previous paragraphs  put in evidence the interests to  simulate the behavior
+of the application before any deployment in a real environment.  We have focused
+the study on analyzing the performance  in varying the key factors impacting the
+results. The study compares the performance  of the two proposed algorithms both
+in  \textit{synchronous mode  }. In  this section,  following the  same previous
+methodology, the  goal is  to demonstrate the  efficiency of  the multisplitting
+method in \textit{ asynchronous mode}  compared with the classical GMRES staying
+in \textit{synchronous mode}.
  
  Note that the interest of using the asynchronous mode for data exchange
  is mainly, in opposite of the synchronous mode, the non-wait aspects of