X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/rce2015.git/blobdiff_plain/0ff5badea3e5156e9795cf5724f3dffed49ca7b5..34ef1c761f30fa1a22267a93d9aeaabb90869ada:/paper.tex diff --git a/paper.tex b/paper.tex index 523716f..beb3140 100644 --- a/paper.tex +++ b/paper.tex @@ -321,7 +321,7 @@ A_{\ell\ell} x_\ell = c_\ell,\mbox{~for~}\ell=1,\ldots,L, \end{equation} where right-hand sides $c_\ell=b_\ell-\sum_{m\neq\ell}A_{\ell m}x_m$ are computed using the shared vectors $x_m$. In this paper, we use the well-known iterative method GMRES~\cite{saad86} as an inner iteration to approximate the solutions of the different splittings arising from the block Jacobi multisplitting of matrix $A$. The algorithm in Figure~\ref{alg:01} shows the main key points of our block Jacobi two-stage method executed by a cluster of processors. In line~\ref{solve}, the linear sub-system~(\ref{eq:03}) is solved in parallel using GMRES method where $\MIG$ and $\TOLG$ are the maximum number of inner iterations and the tolerance threshold for GMRES respectively. The convergence of the two-stage multisplitting methods, based on synchronous or asynchronous iterations, has been studied by many authors for example~\cite{Bru95,bahi07}. -\begin{figure}[t] +\begin{figure}[htpb] %\begin{algorithm}[t] %\caption{Block Jacobi two-stage multisplitting method} \begin{algorithmic}[1] @@ -359,7 +359,7 @@ At each $s$ outer iterations, the algorithm computes a new approximation $\tilde \end{equation} The algorithm in Figure~\ref{alg:02} includes the procedure of the residual minimization and the outer iteration is restarted with a new approximation $\tilde{x}$ at every $s$ iterations. The least-squares problem~(\ref{eq:06}) is solved in parallel by all clusters using CGLS method~\cite{Hestenes52} such that $\MIC$ is the maximum number of iterations and $\TOLC$ is the tolerance threshold for this method (line~\ref{cgls} in Figure~\ref{alg:02}). -\begin{figure}[t] +\begin{figure}[htbp] %\begin{algorithm}[t] %\caption{Krylov two-stage method using block Jacobi multisplitting} \begin{algorithmic}[1] @@ -495,6 +495,7 @@ latency of 8.10$^{-6}$ seconds (resp. 5.10$^{-5}$) for the intra-clusters links (resp. inter-clusters backbone links). \\ \LZK{Il me semble que le bw et lat des deux réseaux varient dans les expés d'une simu à l'autre. On vire la dernière phrase?} +\RC{il me semble qu'on peut laisser ca} \textbf{Step 5}: Conduct an extensive and comprehensive testings within these configurations by varying the key parameters, especially @@ -540,18 +541,23 @@ and between distant clusters. This parameter is application dependent. In the scope of this paper, our first objective is to analyze when the Krylov two-stage method has better performance than the classical GMRES method. With a synchronous iterative method, better performance means a smaller number of iterations and execution time before reaching the convergence. -For a systematic study, the experiments should figure out that, for various -grid parameters values, the simulator will confirm Multisplitting method better performance compared to classical GMRES, particularly on poor and slow networks. -\LZK{Pas du tout claire la dernière phrase (For a systematic...)!!} -\RCE { Reformule autrement} +In what follows, we will present the test conditions, the output results and our comments. + +%%RAPH : on vire ca, c'est pas clair et pas important +%For a systematic study, the experiments should figure out that, for various +%grid parameters values, the simulator will confirm Multisplitting method better performance compared to classical GMRES, particularly on poor and slow networks. +%\LZK{Pas du tout claire la dernière phrase (For a systematic...)!!} +%\RCE { Reformule autrement} + -In what follows, we will present the test conditions, the output results and our comments.\\ %\subsubsection{Execution of the algorithms on various computational grid architectures and scaling up the input matrix size} \subsubsection{Simulations for various grid architectures and scaling-up matrix sizes} \ \\ % environment +\RC{Je ne comprends plus rien CE : pourquoi dans 5.4.1 il y a 2 network et aussi dans 5.4.2. Quelle est la différence? Dans la figure 3 de la section 5.4.1 pourquoi il n'y a pas N1 et N2?} + \begin{table} [ht!] \begin{center} \begin{tabular}{ll } @@ -563,33 +569,42 @@ In what follows, we will present the test conditions, the output results and our & N$_{x}$ $\times$ N$_{y}$ $\times$ N$_{z}$ =170 $\times$ 170 $\times$ 170 \\ \hline \end{tabular} \caption{Test conditions: various grid configurations with the matrix sizes 150$^3$ or 170$^3$} -\LZK{Ce sont les caractéristiques du réseau intra ou inter clusters? Ce n'est pas précisé...} -\RCE{oui c est precise} +%\LZK{Ce sont les caractéristiques du réseau intra ou inter clusters? Ce n'est pas précisé...} +%\RCE{oui c est precise} \label{tab:01} \end{center} \end{table} -In this section, we analyze the simulations conducted on various grid configurations presented in Table~\ref{tab:01}. Figure~\ref{fig:01} shows, for all grid configurations and a given matrix size, a non-variation in the number of iterations for the classical GMRES algorithm, which is not the case of the Krylov two-stage algorithm. +In this section, we analyze the simulations conducted on various grid +configurations presented in Table~\ref{tab:01}. It should be noticed that two +networks are considered: N1 is the network between clusters (inter-cluster) and +N2 is the network inside a cluster (intra-cluster). Figure~\ref{fig:01} shows, +for all grid configurations and a given matrix size, a non-variation in the +number of iterations for the classical GMRES algorithm, which is not the case of +the Krylov two-stage algorithm. %% First, the results in Figure~\ref{fig:01} %% show for all grid configurations the non-variation of the number of iterations of %% classical GMRES for a given input matrix size; it is not the case for the %% multisplitting method. -\RC{CE attention tu n'as pas mis de label dans tes figures, donc c'est le bordel, j'en mets mais vérifie...} -\RC{Les légendes ne sont pas explicites...} -\RCE{Corrige} +%\RC{CE attention tu n'as pas mis de label dans tes figures, donc c'est le bordel, j'en mets mais vérifie...} +%\RC{Les légendes ne sont pas explicites...} +%\RCE{Corrige} -\begin{figure} [ht!] +\begin{figure} [htbp] \begin{center} \includegraphics[width=100mm]{cluster_x_nodes_nx_150_and_nx_170.pdf} \end{center} - \caption{Various grid configurations with the matrix sizes 150$^3$ and 170$^3$ -\AG{Utiliser le point comme séparateur décimal et non la virgule. Idem dans les autres figures.}} -\LZK{Pour quelle taille du problème sont calculés les nombres d'itérations? Que représente le 2 Clusters x 16 Nodes with Nx=150 and Nx=170 en haut de la figure?} -\RCE {Corrige} + \caption{Various grid configurations with the matrix sizes 150$^3$ and 170$^3$} +%\AG{Utiliser le point comme séparateur décimal et non la virgule. Idem dans les autres figures.} +%\LZK{Pour quelle taille du problème sont calculés les nombres d'itérations? Que représente le 2 Clusters x 16 Nodes with Nx=150 and Nx=170 en haut de la figure?} + %\RCE {Corrige} + \RC{Idéalement dans la légende il faudrait insiquer Pb size=$150^3$ ou $170^3$ car pour l'instant Nx=150 ca n'indique rien concernant Ny et Nz} \label{fig:01} \end{figure} + + The execution times between the two algorithms is significant with different grid architectures, even with the same number of processors (for example, 2 $\times$ 16 and 4 $\times 8$). We can observe a better sensitivity of the Krylov multisplitting method @@ -617,7 +632,8 @@ $40\%$ better (resp. $48\%$) when running from 32 (grid 2 $\times$ 16) to 64 pro \end{table} In this section, the experiments compare the behavior of the algorithms running on a -speeder inter-cluster network (N2) and also on a less performant network (N1) respectively defined in the test conditions Table~\ref{tab:02}. \RC{Il faut définir cela avant...} +speeder inter-cluster network (N2) and also on a less performant network (N1) respectively defined in the test conditions Table~\ref{tab:02}. +%\RC{Il faut définir cela avant...} Figure~\ref{fig:02} shows that end users will reduce the execution time for both algorithms when using a grid architecture like 4 $\times$ 16 or 8 $\times$ 8: the reduction factor is around $2$. The results depict also that when the network speed drops down (variation of 12.5\%), the difference between the two Multisplitting algorithms execution times can reach more than 25\%. @@ -625,7 +641,7 @@ the network speed drops down (variation of 12.5\%), the difference between t %\begin{wrapfigure}{l}{100mm} -\begin{figure} [ht!] +\begin{figure} [htbp] \centering \includegraphics[width=100mm]{cluster_x_nodes_n1_x_n2.pdf} \caption{Various grid configurations with networks N1 vs N2 @@ -651,7 +667,7 @@ the network speed drops down (variation of 12.5\%), the difference between t \label{tab:03} \end{table} -\begin{figure} [ht!] +\begin{figure} [htbp] \centering \includegraphics[width=100mm]{network_latency_impact_on_execution_time.pdf} \caption{Network latency impacts on execution time @@ -684,12 +700,12 @@ more than $75\%$ (resp. $82\%$) of the execution for the classical GMRES \end{table} -\begin{figure} [ht!] +\begin{figure} [htbp] \centering \includegraphics[width=100mm]{network_bandwith_impact_on_execution_time.pdf} -\caption{Network bandwith impacts on execution time -\AG{``Execution time'' avec un 't' minuscule}. Idem autres figures.} -\RCE{Corrige} +\caption{Network bandwith impacts on execution time} +%\AG{``Execution time'' avec un 't' minuscule}. Idem autres figures.} +%\RCE{Corrige} \label{fig:04} \end{figure} @@ -714,7 +730,7 @@ of $40\%$ which is only around $24\%$ for the classical GMRES. \end{table} -\begin{figure} [ht!] +\begin{figure} [htbp] \centering \includegraphics[width=100mm]{pb_size_impact_on_execution_time.pdf} \caption{Problem size impacts on execution time} @@ -743,7 +759,7 @@ grid 2 $\times$ 16 leading to the same conclusion. \subsubsection{CPU Power impacts on performance} -\begin{table} [ht!] +\begin{table} [htbp] \centering \begin{tabular}{r c } \hline @@ -795,18 +811,19 @@ synchronization with the other processors. Thus, the asynchronous may theoretically reduce the overall execution time and can improve the algorithm performance. -\RC{la phrase suivante est bizarre, je ne comprends pas pourquoi elle vient ici} -\RCE{C est la description du dernier test sync/async avec l'introduction de la notion de relative gain} -In this section, Simgrid simulator tool has been successfully used to show -the efficiency of the multisplitting in asynchronous mode and to find the best -combination of the grid resources (CPU, Network, input matrix size, \ldots ) to -get the highest \textit{"relative gain"} (exec\_time$_{GMRES}$ / -exec\_time$_{multisplitting}$) in comparison with the classical GMRES time. +In this section, the Simgrid simulator is used to compare the behavior of the +multisplitting in asynchronous mode with GMRES in synchronous mode. Several +benchmarks have been performed with various combination of the grid resources +(CPU, Network, input matrix size, \ldots ). The test conditions are summarized +in Table~\ref{tab:07}. In order to compare the execution times, this table +reports the relative gain between both algorithms. It is defined by the ratio +between the execution time of GMRES and the execution time of the +multisplitting. The ration is greater than one because the asynchronous +multisplitting version is faster than GMRES. -The test conditions are summarized in the table~\ref{tab:07}: \\ -\begin{table} [ht!] +\begin{table} [htbp] \centering \begin{tabular}{r c } \hline @@ -856,7 +873,7 @@ geographically distant clusters through the internet. power (GFlops) & 1 & 1 & 1 & 1.5 & 1.5 & 1.5 & 1.5 & 1 & 1.5 & 1.5 \\ \hline - size (N) + size ($N^3$) & 62 & 62 & 62 & 100 & 100 & 110 & 120 & 130 & 140 & 150 \\ \hline Precision