X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/rce2015.git/blobdiff_plain/0ff5badea3e5156e9795cf5724f3dffed49ca7b5..34ef1c761f30fa1a22267a93d9aeaabb90869ada:/paper.tex

diff --git a/paper.tex b/paper.tex
index 523716f..beb3140 100644
--- a/paper.tex
+++ b/paper.tex
@@ -321,7 +321,7 @@ A_{\ell\ell} x_\ell = c_\ell,\mbox{~for~}\ell=1,\ldots,L,
 \end{equation}
 where right-hand sides $c_\ell=b_\ell-\sum_{m\neq\ell}A_{\ell m}x_m$ are computed using the shared vectors $x_m$. In this paper, we use the well-known iterative method GMRES~\cite{saad86} as an inner iteration to approximate the solutions of the different splittings arising from the block Jacobi multisplitting of matrix $A$. The algorithm in Figure~\ref{alg:01} shows the main key points of our block Jacobi two-stage method executed by a cluster of processors. In line~\ref{solve}, the linear sub-system~(\ref{eq:03}) is solved in parallel using GMRES method where $\MIG$ and $\TOLG$ are the maximum number of inner iterations and the tolerance threshold for GMRES respectively. The convergence of the two-stage multisplitting methods, based on synchronous or asynchronous iterations, has been studied by many authors for example~\cite{Bru95,bahi07}.
 
-\begin{figure}[t]
+\begin{figure}[htpb]
 %\begin{algorithm}[t]
 %\caption{Block Jacobi two-stage multisplitting method}
 \begin{algorithmic}[1]
@@ -359,7 +359,7 @@ At each $s$ outer iterations, the algorithm computes a new approximation $\tilde
 \end{equation}
 The algorithm in Figure~\ref{alg:02} includes the procedure of the residual minimization and the outer iteration is restarted with a new approximation $\tilde{x}$ at every $s$ iterations. The least-squares problem~(\ref{eq:06}) is solved in parallel by all clusters using CGLS method~\cite{Hestenes52} such that $\MIC$ is the maximum number of iterations and $\TOLC$ is the tolerance threshold for this method (line~\ref{cgls} in Figure~\ref{alg:02}).
 
-\begin{figure}[t]
+\begin{figure}[htbp]
 %\begin{algorithm}[t]
 %\caption{Krylov two-stage method using block Jacobi multisplitting}
 \begin{algorithmic}[1]
@@ -495,6 +495,7 @@ latency of 8.10$^{-6}$ seconds (resp.  5.10$^{-5}$) for the intra-clusters links
 (resp.  inter-clusters backbone links).  \\
 
 \LZK{Il me semble que le bw et lat des deux rÃ©seaux varient dans les expÃ©s d'une simu Ã  l'autre. On vire la derniÃ¨re phrase?}
+\RC{il me semble qu'on peut laisser ca}
 
 \textbf{Step 5}: Conduct an extensive and comprehensive testings
 within these configurations by varying the key parameters, especially
@@ -540,18 +541,23 @@ and  between distant  clusters.  This parameter is application dependent.
 In the scope  of this paper, our  first objective is to analyze  when the Krylov
 two-stage method has  better  performance  than   the  classical  GMRES method. With a synchronous  iterative method, better performance means a
 smaller number of iterations and execution time before reaching the convergence.
-For a systematic study,  the experiments  should figure  out  that, for  various
-grid  parameters values, the simulator will confirm Multisplitting method  better performance compared to classical GMRES, particularly on poor and slow networks.
-\LZK{Pas du tout claire la derniÃ¨re phrase (For a systematic...)!!}
-\RCE { Reformule autrement}
+In what follows, we will present the test conditions, the output results and our comments.
+
+%%RAPH : on vire ca, c'est pas clair et pas important
+%For a systematic study,  the experiments  should figure  out  that, for  various
+%grid  parameters values, the simulator will confirm Multisplitting method  better performance compared to classical GMRES, particularly on poor and slow networks.
+%\LZK{Pas du tout claire la derniÃ¨re phrase (For a systematic...)!!}
+%\RCE { Reformule autrement}
+
 
-In what follows, we will present the test conditions, the output results and our comments.\\
 
 %\subsubsection{Execution of the algorithms on various computational grid architectures and scaling up the input matrix size}
 \subsubsection{Simulations for various grid architectures and scaling-up matrix sizes}
 \ \\
 % environment
 
+\RC{Je ne comprends plus rien CE : pourquoi dans 5.4.1 il y a 2 network et aussi dans 5.4.2. Quelle est la diffÃ©rence? Dans la figure 3 de la section 5.4.1 pourquoi il n'y a pas N1 et N2?}
+
 \begin{table} [ht!]
 \begin{center}
 \begin{tabular}{ll }
@@ -563,33 +569,42 @@ In what follows, we will present the test conditions, the output results and our
   &  N$_{x}$ $\times$ N$_{y}$ $\times$ N$_{z}$  =170 $\times$ 170 $\times$ 170    \\ \hline
  \end{tabular}
 \caption{Test conditions: various grid configurations with the matrix sizes 150$^3$ or 170$^3$}
-\LZK{Ce sont les caractÃ©ristiques du rÃ©seau intra ou inter clusters? Ce n'est pas prÃ©cisÃ©...}
-\RCE{oui c est precise}
+%\LZK{Ce sont les caractÃ©ristiques du rÃ©seau intra ou inter clusters? Ce n'est pas prÃ©cisÃ©...}
+%\RCE{oui c est precise}
 \label{tab:01}
 \end{center}
 \end{table}
 
 
-In this section, we analyze the simulations conducted on various grid configurations presented in Table~\ref{tab:01}. Figure~\ref{fig:01} shows, for all grid configurations and a given matrix size, a non-variation in the number of iterations for the classical GMRES algorithm, which is not the case of the Krylov two-stage algorithm.
+In  this  section,  we  analyze   the  simulations  conducted  on  various  grid
+configurations presented  in Table~\ref{tab:01}. It  should be noticed  that two
+networks are considered: N1 is  the network between clusters (inter-cluster) and
+N2 is the network inside  a cluster (intra-cluster).  Figure~\ref{fig:01} shows,
+for all  grid configurations  and a  given matrix size,  a non-variation  in the
+number of iterations for the classical GMRES algorithm, which is not the case of
+the Krylov two-stage algorithm.
 %% First,  the results in  Figure~\ref{fig:01}
 %% show for all grid configurations the non-variation of the number of iterations of
 %% classical  GMRES for  a given  input matrix  size; it is not  the case  for the
 %% multisplitting method.
-\RC{CE attention tu n'as pas mis de label dans tes figures, donc c'est le bordel, j'en mets mais vÃ©rifie...}
-\RC{Les lÃ©gendes ne sont pas explicites...}
-\RCE{Corrige}
+%\RC{CE attention tu n'as pas mis de label dans tes figures, donc c'est le bordel, j'en mets mais vÃ©rifie...}
+%\RC{Les lÃ©gendes ne sont pas explicites...}
+%\RCE{Corrige}
 
-\begin{figure} [ht!]
+\begin{figure} [htbp]
   \begin{center}
     \includegraphics[width=100mm]{cluster_x_nodes_nx_150_and_nx_170.pdf}
   \end{center}
-  \caption{Various grid configurations with the matrix sizes 150$^3$ and 170$^3$
-\AG{Utiliser le point comme sÃ©parateur dÃ©cimal et non la virgule.  Idem dans les autres figures.}}
-\LZK{Pour quelle taille du problÃ¨me sont calculÃ©s les nombres d'itÃ©rations? Que reprÃ©sente le 2 Clusters x 16 Nodes with Nx=150 and Nx=170 en haut de la figure?}
-\RCE {Corrige}
+  \caption{Various grid configurations with the matrix sizes 150$^3$ and 170$^3$}
+%\AG{Utiliser le point comme sÃ©parateur dÃ©cimal et non la virgule.  Idem dans les autres figures.}
+%\LZK{Pour quelle taille du problÃ¨me sont calculÃ©s les nombres d'itÃ©rations? Que reprÃ©sente le 2 Clusters x 16 Nodes with Nx=150 and Nx=170 en haut de la figure?}
+  %\RCE {Corrige}
+    \RC{IdÃ©alement dans la lÃ©gende il faudrait insiquer Pb size=$150^3$ ou $170^3$  car pour l'instant Nx=150 ca n'indique rien concernant Ny et Nz}
   \label{fig:01}
 \end{figure}
 
+
+
 The execution  times between  the two algorithms  is significant  with different
 grid architectures, even  with the same number of processors  (for example, 2 $\times$ 16
 and  4 $\times  8$). We  can  observe  a better  sensitivity  of  the Krylov multisplitting  method
@@ -617,7 +632,8 @@ $40\%$ better (resp. $48\%$) when running from 32 (grid 2 $\times$ 16) to 64 pro
 \end{table}
 
 In this section, the experiments  compare the  behavior of  the algorithms  running on a
-speeder inter-cluster  network (N2) and  also on  a less performant  network (N1) respectively defined in the test conditions Table~\ref{tab:02}. \RC{Il faut dÃ©finir cela avant...}
+speeder inter-cluster  network (N2) and  also on  a less performant  network (N1) respectively defined in the test conditions Table~\ref{tab:02}.
+%\RC{Il faut dÃ©finir cela avant...}
 Figure~\ref{fig:02} shows that end users will reduce the execution time
 for  both  algorithms when using  a  grid  architecture  like  4 $\times$ 16 or  8 $\times$ 8: the reduction factor is around $2$. The results depict  also that when
 the  network speed  drops down (variation of 12.5\%), the  difference between  the two Multisplitting algorithms execution times can reach more than 25\%.
@@ -625,7 +641,7 @@ the  network speed  drops down (variation of 12.5\%), the  difference between  t
 
 
 %\begin{wrapfigure}{l}{100mm}
-\begin{figure} [ht!]
+\begin{figure} [htbp]
 \centering
 \includegraphics[width=100mm]{cluster_x_nodes_n1_x_n2.pdf}
 \caption{Various grid configurations with networks N1 vs N2
@@ -651,7 +667,7 @@ the  network speed  drops down (variation of 12.5\%), the  difference between  t
 \label{tab:03}
 \end{table}
 
-\begin{figure} [ht!]
+\begin{figure} [htbp]
 \centering
 \includegraphics[width=100mm]{network_latency_impact_on_execution_time.pdf}
 \caption{Network latency impacts on execution time
@@ -684,12 +700,12 @@ more  than $75\%$  (resp.  $82\%$)  of the  execution  for  the classical  GMRES
 \end{table}
 
 
-\begin{figure} [ht!]
+\begin{figure} [htbp]
 \centering
 \includegraphics[width=100mm]{network_bandwith_impact_on_execution_time.pdf}
-\caption{Network bandwith impacts on execution time
-\AG{``Execution time'' avec un 't' minuscule}. Idem autres figures.}
-\RCE{Corrige}
+\caption{Network bandwith impacts on execution time}
+%\AG{``Execution time'' avec un 't' minuscule}. Idem autres figures.}
+%\RCE{Corrige}
 \label{fig:04}
 \end{figure}
 
@@ -714,7 +730,7 @@ of $40\%$ which is only around $24\%$ for the classical GMRES.
 \end{table}
 
 
-\begin{figure} [ht!]
+\begin{figure} [htbp]
 \centering
 \includegraphics[width=100mm]{pb_size_impact_on_execution_time.pdf}
 \caption{Problem size impacts on execution time}
@@ -743,7 +759,7 @@ grid 2 $\times$ 16 leading to the same conclusion.
 
 \subsubsection{CPU Power impacts on performance}
 
-\begin{table} [ht!]
+\begin{table} [htbp]
 \centering
 \begin{tabular}{r c }
  \hline
@@ -795,18 +811,19 @@ synchronization  with   the  other   processors.  Thus,  the   asynchronous  may
 theoretically reduce  the overall execution  time and can improve  the algorithm
 performance.
 
-\RC{la phrase suivante est bizarre, je ne comprends pas pourquoi elle vient ici}
-\RCE{C est la description du dernier test sync/async avec l'introduction de la notion de relative gain}
-In this section, Simgrid simulator tool has been successfully used to show
-the efficiency of  the multisplitting in asynchronous mode and  to find the best
-combination of the grid resources (CPU,  Network, input matrix size, \ldots ) to
-get    the   highest    \textit{"relative    gain"}   (exec\_time$_{GMRES}$    /
-exec\_time$_{multisplitting}$) in comparison with the classical GMRES time.
+In this section,  the Simgrid simulator is  used to compare the  behavior of the
+multisplitting in  asynchronous mode  with GMRES  in synchronous  mode.  Several
+benchmarks have  been performed with  various combination of the  grid resources
+(CPU, Network, input  matrix size, \ldots ). The test  conditions are summarized
+in  Table~\ref{tab:07}. In  order to  compare  the execution  times, this  table
+reports the  relative gain between both  algorithms. It is defined  by the ratio
+between  the   execution  time  of   GMRES  and   the  execution  time   of  the
+multisplitting.  The  ration  is  greater  than  one  because  the  asynchronous
+multisplitting version is faster than GMRES.
 
 
-The test conditions are summarized in the table~\ref{tab:07}: \\
 
-\begin{table} [ht!]
+\begin{table} [htbp]
 \centering
 \begin{tabular}{r c }
  \hline
@@ -856,7 +873,7 @@ geographically distant clusters through the internet.
     power (GFlops)
     & 1    & 1    & 1    & 1.5       & 1.5  & 1.5         & 1.5         & 1         & 1.5       & 1.5 \\
     \hline
-    size (N)
+    size ($N^3$)
     & 62  & 62   & 62        & 100       & 100 & 110       & 120       & 130       & 140       & 150 \\
     \hline
     Precision