modif placement des figures

[rce2015.git] / paper.tex
diff --git a/paper.tex b/paper.tex

index 523716f980bd65ab09a7c6c3efee6d244714d3c3..beb3140bc7f69bf624ae6202e24f4452c6bd4c80 100644 (file)
--- a/paper.tex
+++ b/paper.tex
@@ -321,7 +321,7 @@ A_{\ell\ell} x_\ell = c_\ell,\mbox{~for~}\ell=1,\ldots,L,
  \end{equation}
  where right-hand sides $c_\ell=b_\ell-\sum_{m\neq\ell}A_{\ell m}x_m$ are computed using the shared vectors $x_m$. In this paper, we use the well-known iterative method GMRES~\cite{saad86} as an inner iteration to approximate the solutions of the different splittings arising from the block Jacobi multisplitting of matrix $A$. The algorithm in Figure~\ref{alg:01} shows the main key points of our block Jacobi two-stage method executed by a cluster of processors. In line~\ref{solve}, the linear sub-system~(\ref{eq:03}) is solved in parallel using GMRES method where $\MIG$ and $\TOLG$ are the maximum number of inner iterations and the tolerance threshold for GMRES respectively. The convergence of the two-stage multisplitting methods, based on synchronous or asynchronous iterations, has been studied by many authors for example~\cite{Bru95,bahi07}.
  
-\begin{figure}[t]
+\begin{figure}[htpb]
  %\begin{algorithm}[t]
  %\caption{Block Jacobi two-stage multisplitting method}
  \begin{algorithmic}[1]
@@ -359,7 +359,7 @@ At each $s$ outer iterations, the algorithm computes a new approximation $\tilde
  \end{equation}
  The algorithm in Figure~\ref{alg:02} includes the procedure of the residual minimization and the outer iteration is restarted with a new approximation $\tilde{x}$ at every $s$ iterations. The least-squares problem~(\ref{eq:06}) is solved in parallel by all clusters using CGLS method~\cite{Hestenes52} such that $\MIC$ is the maximum number of iterations and $\TOLC$ is the tolerance threshold for this method (line~\ref{cgls} in Figure~\ref{alg:02}).
  
-\begin{figure}[t]
+\begin{figure}[htbp]
  %\begin{algorithm}[t]
  %\caption{Krylov two-stage method using block Jacobi multisplitting}
  \begin{algorithmic}[1]
@@ -495,6 +495,7 @@ latency of 8.10$^{-6}$ seconds (resp.  5.10$^{-5}$) for the intra-clusters links
  (resp.  inter-clusters backbone links).  \\
  
  \LZK{Il me semble que le bw et lat des deux réseaux varient dans les expés d'une simu à l'autre. On vire la dernière phrase?}
+\RC{il me semble qu'on peut laisser ca}
  
  \textbf{Step 5}: Conduct an extensive and comprehensive testings
  within these configurations by varying the key parameters, especially
@@ -540,18 +541,23 @@ and  between distant  clusters.  This parameter is application dependent.
  In the scope  of this paper, our  first objective is to analyze  when the Krylov
  two-stage method has  better  performance  than   the  classical  GMRES method. With a synchronous  iterative method, better performance means a
  smaller number of iterations and execution time before reaching the convergence.
-For a systematic study,  the experiments  should figure  out  that, for  various
-grid  parameters values, the simulator will confirm Multisplitting method  better performance compared to classical GMRES, particularly on poor and slow networks.
-\LZK{Pas du tout claire la dernière phrase (For a systematic...)!!}
-\RCE { Reformule autrement}
+In what follows, we will present the test conditions, the output results and our comments.
+
+%%RAPH : on vire ca, c'est pas clair et pas important
+%For a systematic study,  the experiments  should figure  out  that, for  various
+%grid  parameters values, the simulator will confirm Multisplitting method  better performance compared to classical GMRES, particularly on poor and slow networks.
+%\LZK{Pas du tout claire la dernière phrase (For a systematic...)!!}
+%\RCE { Reformule autrement}
+
  
-In what follows, we will present the test conditions, the output results and our comments.\\
  
  %\subsubsection{Execution of the algorithms on various computational grid architectures and scaling up the input matrix size}
  \subsubsection{Simulations for various grid architectures and scaling-up matrix sizes}
  \ \\
  % environment
  
+\RC{Je ne comprends plus rien CE : pourquoi dans 5.4.1 il y a 2 network et aussi dans 5.4.2. Quelle est la différence? Dans la figure 3 de la section 5.4.1 pourquoi il n'y a pas N1 et N2?}
+
  \begin{table} [ht!]
  \begin{center}
  \begin{tabular}{ll }
@@ -563,33 +569,42 @@ In what follows, we will present the test conditions, the output results and our
    &  N$_{x}$ $\times$ N$_{y}$ $\times$ N$_{z}$  =170 $\times$ 170 $\times$ 170    \\ \hline
   \end{tabular}
  \caption{Test conditions: various grid configurations with the matrix sizes 150$^3$ or 170$^3$}
-\LZK{Ce sont les caractéristiques du réseau intra ou inter clusters? Ce n'est pas précisé...}
-\RCE{oui c est precise}
+%\LZK{Ce sont les caractéristiques du réseau intra ou inter clusters? Ce n'est pas précisé...}
+%\RCE{oui c est precise}
  \label{tab:01}
  \end{center}
  \end{table}
  
  
-In this section, we analyze the simulations conducted on various grid configurations presented in Table~\ref{tab:01}. Figure~\ref{fig:01} shows, for all grid configurations and a given matrix size, a non-variation in the number of iterations for the classical GMRES algorithm, which is not the case of the Krylov two-stage algorithm.
+In  this  section,  we  analyze   the  simulations  conducted  on  various  grid
+configurations presented  in Table~\ref{tab:01}. It  should be noticed  that two
+networks are considered: N1 is  the network between clusters (inter-cluster) and
+N2 is the network inside  a cluster (intra-cluster).  Figure~\ref{fig:01} shows,
+for all  grid configurations  and a  given matrix size,  a non-variation  in the
+number of iterations for the classical GMRES algorithm, which is not the case of
+the Krylov two-stage algorithm.
  %% First,  the results in  Figure~\ref{fig:01}
  %% show for all grid configurations the non-variation of the number of iterations of
  %% classical  GMRES for  a given  input matrix  size; it is not  the case  for the
  %% multisplitting method.
-\RC{CE attention tu n'as pas mis de label dans tes figures, donc c'est le bordel, j'en mets mais vérifie...}
-\RC{Les légendes ne sont pas explicites...}
-\RCE{Corrige}
+%\RC{CE attention tu n'as pas mis de label dans tes figures, donc c'est le bordel, j'en mets mais vérifie...}
+%\RC{Les légendes ne sont pas explicites...}
+%\RCE{Corrige}
  
-\begin{figure} [ht!]
+\begin{figure} [htbp]
    \begin{center}
      \includegraphics[width=100mm]{cluster_x_nodes_nx_150_and_nx_170.pdf}
    \end{center}
-  \caption{Various grid configurations with the matrix sizes 150$^3$ and 170$^3$
-\AG{Utiliser le point comme séparateur décimal et non la virgule.  Idem dans les autres figures.}}
-\LZK{Pour quelle taille du problème sont calculés les nombres d'itérations? Que représente le 2 Clusters x 16 Nodes with Nx=150 and Nx=170 en haut de la figure?}
-\RCE {Corrige}
+  \caption{Various grid configurations with the matrix sizes 150$^3$ and 170$^3$}
+%\AG{Utiliser le point comme séparateur décimal et non la virgule.  Idem dans les autres figures.}
+%\LZK{Pour quelle taille du problème sont calculés les nombres d'itérations? Que représente le 2 Clusters x 16 Nodes with Nx=150 and Nx=170 en haut de la figure?}
+  %\RCE {Corrige}
+    \RC{Idéalement dans la légende il faudrait insiquer Pb size=$150^3$ ou $170^3$  car pour l'instant Nx=150 ca n'indique rien concernant Ny et Nz}
    \label{fig:01}
  \end{figure}
  
+
+
  The execution  times between  the two algorithms  is significant  with different
  grid architectures, even  with the same number of processors  (for example, 2 $\times$ 16
  and  4 $\times  8$). We  can  observe  a better  sensitivity  of  the Krylov multisplitting  method
@@ -617,7 +632,8 @@ $40\%$ better (resp. $48\%$) when running from 32 (grid 2 $\times$ 16) to 64 pro
  \end{table}
  
  In this section, the experiments  compare the  behavior of  the algorithms  running on a
-speeder inter-cluster  network (N2) and  also on  a less performant  network (N1) respectively defined in the test conditions Table~\ref{tab:02}. \RC{Il faut définir cela avant...}
+speeder inter-cluster  network (N2) and  also on  a less performant  network (N1) respectively defined in the test conditions Table~\ref{tab:02}.
+%\RC{Il faut définir cela avant...}
  Figure~\ref{fig:02} shows that end users will reduce the execution time
  for  both  algorithms when using  a  grid  architecture  like  4 $\times$ 16 or  8 $\times$ 8: the reduction factor is around $2$. The results depict  also that when
  the  network speed  drops down (variation of 12.5\%), the  difference between  the two Multisplitting algorithms execution times can reach more than 25\%.
@@ -625,7 +641,7 @@ the  network speed  drops down (variation of 12.5\%), the  difference between  t
  
  
  %\begin{wrapfigure}{l}{100mm}
-\begin{figure} [ht!]
+\begin{figure} [htbp]
  \centering
  \includegraphics[width=100mm]{cluster_x_nodes_n1_x_n2.pdf}
  \caption{Various grid configurations with networks N1 vs N2
@@ -651,7 +667,7 @@ the  network speed  drops down (variation of 12.5\%), the  difference between  t
  \label{tab:03}
  \end{table}
  
-\begin{figure} [ht!]
+\begin{figure} [htbp]
  \centering
  \includegraphics[width=100mm]{network_latency_impact_on_execution_time.pdf}
  \caption{Network latency impacts on execution time
@@ -684,12 +700,12 @@ more  than $75\%$  (resp.  $82\%$)  of the  execution  for  the classical  GMRES
  \end{table}
  
  
-\begin{figure} [ht!]
+\begin{figure} [htbp]
  \centering
  \includegraphics[width=100mm]{network_bandwith_impact_on_execution_time.pdf}
-\caption{Network bandwith impacts on execution time
-\AG{``Execution time'' avec un 't' minuscule}. Idem autres figures.}
-\RCE{Corrige}
+\caption{Network bandwith impacts on execution time}
+%\AG{``Execution time'' avec un 't' minuscule}. Idem autres figures.}
+%\RCE{Corrige}
  \label{fig:04}
  \end{figure}
  
@@ -714,7 +730,7 @@ of $40\%$ which is only around $24\%$ for the classical GMRES.
  \end{table}
  
  
-\begin{figure} [ht!]
+\begin{figure} [htbp]
  \centering
  \includegraphics[width=100mm]{pb_size_impact_on_execution_time.pdf}
  \caption{Problem size impacts on execution time}
@@ -743,7 +759,7 @@ grid 2 $\times$ 16 leading to the same conclusion.
  
  \subsubsection{CPU Power impacts on performance}
  
-\begin{table} [ht!]
+\begin{table} [htbp]
  \centering
  \begin{tabular}{r c }
   \hline
@@ -795,18 +811,19 @@ synchronization  with   the  other   processors.  Thus,  the   asynchronous  may
  theoretically reduce  the overall execution  time and can improve  the algorithm
  performance.
  
-\RC{la phrase suivante est bizarre, je ne comprends pas pourquoi elle vient ici}
-\RCE{C est la description du dernier test sync/async avec l'introduction de la notion de relative gain}
-In this section, Simgrid simulator tool has been successfully used to show
-the efficiency of  the multisplitting in asynchronous mode and  to find the best
-combination of the grid resources (CPU,  Network, input matrix size, \ldots ) to
-get    the   highest    \textit{"relative    gain"}   (exec\_time$_{GMRES}$    /
-exec\_time$_{multisplitting}$) in comparison with the classical GMRES time.
+In this section,  the Simgrid simulator is  used to compare the  behavior of the
+multisplitting in  asynchronous mode  with GMRES  in synchronous  mode.  Several
+benchmarks have  been performed with  various combination of the  grid resources
+(CPU, Network, input  matrix size, \ldots ). The test  conditions are summarized
+in  Table~\ref{tab:07}. In  order to  compare  the execution  times, this  table
+reports the  relative gain between both  algorithms. It is defined  by the ratio
+between  the   execution  time  of   GMRES  and   the  execution  time   of  the
+multisplitting.  The  ration  is  greater  than  one  because  the  asynchronous
+multisplitting version is faster than GMRES.
  
  
-The test conditions are summarized in the table~\ref{tab:07}: \\
  
-\begin{table} [ht!]
+\begin{table} [htbp]
  \centering
  \begin{tabular}{r c }
   \hline
@@ -856,7 +873,7 @@ geographically distant clusters through the internet.
      power (GFlops)
      & 1    & 1    & 1    & 1.5       & 1.5  & 1.5         & 1.5         & 1         & 1.5       & 1.5 \\
      \hline
-    size (N)
+    size ($N^3$)
      & 62  & 62   & 62        & 100       & 100 & 110       & 120       & 130       & 140       & 150 \\
      \hline
      Precision