X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/rce2015.git/blobdiff_plain/976b793822811b4335b31747dc0e5542a41fc62b..2e01ef1240fcecca4313cea0cb013a8c0b09f9f5:/paper.tex diff --git a/paper.tex b/paper.tex index 0e69066..8126583 100644 --- a/paper.tex +++ b/paper.tex @@ -321,7 +321,7 @@ A_{\ell\ell} x_\ell = c_\ell,\mbox{~for~}\ell=1,\ldots,L, \end{equation} where right-hand sides $c_\ell=b_\ell-\sum_{m\neq\ell}A_{\ell m}x_m$ are computed using the shared vectors $x_m$. In this paper, we use the well-known iterative method GMRES~\cite{saad86} as an inner iteration to approximate the solutions of the different splittings arising from the block Jacobi multisplitting of matrix $A$. The algorithm in Figure~\ref{alg:01} shows the main key points of our block Jacobi two-stage method executed by a cluster of processors. In line~\ref{solve}, the linear sub-system~(\ref{eq:03}) is solved in parallel using GMRES method where $\MIG$ and $\TOLG$ are the maximum number of inner iterations and the tolerance threshold for GMRES respectively. The convergence of the two-stage multisplitting methods, based on synchronous or asynchronous iterations, has been studied by many authors for example~\cite{Bru95,bahi07}. -\begin{figure}[t] +\begin{figure}[htpb] %\begin{algorithm}[t] %\caption{Block Jacobi two-stage multisplitting method} \begin{algorithmic}[1] @@ -359,7 +359,7 @@ At each $s$ outer iterations, the algorithm computes a new approximation $\tilde \end{equation} The algorithm in Figure~\ref{alg:02} includes the procedure of the residual minimization and the outer iteration is restarted with a new approximation $\tilde{x}$ at every $s$ iterations. The least-squares problem~(\ref{eq:06}) is solved in parallel by all clusters using CGLS method~\cite{Hestenes52} such that $\MIC$ is the maximum number of iterations and $\TOLC$ is the tolerance threshold for this method (line~\ref{cgls} in Figure~\ref{alg:02}). -\begin{figure}[t] +\begin{figure}[htbp] %\begin{algorithm}[t] %\caption{Krylov two-stage method using block Jacobi multisplitting} \begin{algorithmic}[1] @@ -494,8 +494,8 @@ designed to operate with a bandwidth equals to 10Gbits (resp. 1Gbits/s) and a latency of 8.10$^{-6}$ seconds (resp. 5.10$^{-5}$) for the intra-clusters links (resp. inter-clusters backbone links). \\ -\LZK{Il me semble que le bw et lat des deux réseaux varient dans les expés d'une simu à l'autre. On vire la dernière phrase?} -\RC{il me semble qu'on peut laisser ca} +%\LZK{Il me semble que le bw et lat des deux réseaux varient dans les expés d'une simu à l'autre. On vire la dernière phrase?} +%\RC{il me semble qu'on peut laisser ca} \textbf{Step 5}: Conduct an extensive and comprehensive testings within these configurations by varying the key parameters, especially @@ -591,7 +591,7 @@ the Krylov two-stage algorithm. %\RC{Les légendes ne sont pas explicites...} %\RCE{Corrige} -\begin{figure} [ht!] +\begin{figure} [htbp] \begin{center} \includegraphics[width=100mm]{cluster_x_nodes_nx_150_and_nx_170.pdf} \end{center} @@ -641,12 +641,12 @@ the network speed drops down (variation of 12.5\%), the difference between t %\begin{wrapfigure}{l}{100mm} -\begin{figure} [ht!] +\begin{figure} [htbp] \centering \includegraphics[width=100mm]{cluster_x_nodes_n1_x_n2.pdf} -\caption{Various grid configurations with networks N1 vs N2 -\AG{\np{8E-6}, \np{5E-6} au lieu de 8E-6, 5E-6}} -\RCE{Corrige} +\caption{Various grid configurations with networks N1 vs N2} +%\AG{\np{8E-6}, \np{5E-6} au lieu de 8E-6, 5E-6}} +%\RCE{Corrige} \label{fig:02} \end{figure} %\end{wrapfigure} @@ -667,21 +667,22 @@ the network speed drops down (variation of 12.5\%), the difference between t \label{tab:03} \end{table} -\begin{figure} [ht!] +\begin{figure} [htbp] \centering \includegraphics[width=100mm]{network_latency_impact_on_execution_time.pdf} -\caption{Network latency impacts on execution time -\AG{\np{E-6}}} +\caption{Network latency impacts on execution time} +%\AG{\np{E-6}}} \label{fig:03} \end{figure} -According to the results of Figure~\ref{fig:03}, a degradation of the network -latency from $8.10^{-6}$ to $6.10^{-5}$ implies an absolute time increase of -more than $75\%$ (resp. $82\%$) of the execution for the classical GMRES -(resp. Krylov multisplitting) algorithm which means that the GMRES seems tolerate more the network latency variation with a less rate increase of the execution time. However, the execution time factor between the two algorithms varies from 2.2 to 1.5 times with a network latency decreasing from $8.10^{-6}$ to $6.10^{-5}$. +In Table~\ref{tab:03}, parameters for the influence of the network latency are +reported. According to the results of Figure~\ref{fig:03}, a degradation of the +network latency from $8.10^{-6}$ to $6.10^{-5}$ implies an absolute time +increase of more than $75\%$ (resp. $82\%$) of the execution for the classical +GMRES (resp. Krylov multisplitting) algorithm. The execution time factor +between the two algorithms varies from 2.2 to 1.5 times with a network latency +decreasing from $8.10^{-6}$ to $6.10^{-5}$. -\RC{Les 2 précédentes phrases me semblent en contradiction....} -\RCE{Reformule} \subsubsection{Network bandwidth impacts on performance} \ \\ @@ -694,18 +695,19 @@ more than $75\%$ (resp. $82\%$) of the execution for the classical GMRES & $lat$= 5.10$^{-5}$ second \\ Input matrix size & $N_{x} \times N_{y} \times N_{z} =150 \times 150 \times 150$\\ \hline \\ \end{tabular} -\caption{Test conditions: Network bandwidth impacts\RC{Qu'est ce qui varie ici? Il n'y a pas de variation dans le tableau}} -\RCE{C est le bw} +\caption{Test conditions: Network bandwidth impacts} +% \RC{Qu'est ce qui varie ici? Il n'y a pas de variation dans le tableau} +%\RCE{C est le bw} \label{tab:04} \end{table} -\begin{figure} [ht!] +\begin{figure} [htbp] \centering \includegraphics[width=100mm]{network_bandwith_impact_on_execution_time.pdf} -\caption{Network bandwith impacts on execution time -\AG{``Execution time'' avec un 't' minuscule}. Idem autres figures.} -\RCE{Corrige} +\caption{Network bandwith impacts on execution time} +%\AG{``Execution time'' avec un 't' minuscule}. Idem autres figures.} +%\RCE{Corrige} \label{fig:04} \end{figure} @@ -723,34 +725,29 @@ of $40\%$ which is only around $24\%$ for the classical GMRES. \hline Grid Architecture & 4 $\times$ 8\\ %\hline Inter Network & $bw$=1Gbs - $lat$=5.10$^{-5}$ \\ - Input matrix size & $N_{x} \times N_{y} \times N_{z}$ = From 40$^{3}$ to 200$^{3}$\\ \hline + Input matrix size & $N_{x} \times N_{y} \times N_{z}$ = From 50$^{3}$ to 190$^{3}$\\ \hline \end{tabular} \caption{Test conditions: Input matrix size impacts} \label{tab:05} \end{table} -\begin{figure} [ht!] +\begin{figure} [htbp] \centering \includegraphics[width=100mm]{pb_size_impact_on_execution_time.pdf} \caption{Problem size impacts on execution time} \label{fig:05} \end{figure} -In these experiments, the input matrix size has been set from $N_{x} = N_{y} -= N_{z} = 40$ to $200$ side elements that is from $40^{3} = 64.000$ to $200^{3} -= 8,000,000$ points. Obviously, as shown in Figure~\ref{fig:05}, the execution -time for both algorithms increases when the input matrix size also increases. -But the interesting results are: -\begin{enumerate} - \item the important increase ($10$ times) of the number of iterations needed to - reach the convergence for the classical GMRES algorithm particularly, when the matrix size - go beyond $N_{x}=150$; \RC{C'est toujours pas clair... ok le nommbre d'itérations est 10 fois plus long mais la suite de la phrase ne veut rien dire} - \RCE{Le nombre d'iterations augmente de 10 fois, cela surtout a partir de N=150} - -\item the classical GMRES execution time is almost the double for $N_{x}=140$ - compared with the Krylov multisplitting method. -\end{enumerate} +In these experiments, the input matrix size has been set from $50^3$ to +$190^3$. Obviously, as shown in Figure~\ref{fig:05}, the execution time for both +algorithms increases when the input matrix size also increases. For all problem +sizes, GMRES is always slower than the Krylov multisplitting. Moreover, for this +benchmark, it seems that the greater the problem size is, the bigger the ratio +between both algorithm execution times is. We can also observ that for some +problem sizes, the Krylov multisplitting convergence varies quite a +lot. Consequently the execution times in that cases also varies. + These findings may help a lot end users to setup the best and the optimal targeted environment for the application deployment when focusing on the problem @@ -759,7 +756,7 @@ grid 2 $\times$ 16 leading to the same conclusion. \subsubsection{CPU Power impacts on performance} -\begin{table} [ht!] +\begin{table} [htbp] \centering \begin{tabular}{r c } \hline @@ -811,18 +808,19 @@ synchronization with the other processors. Thus, the asynchronous may theoretically reduce the overall execution time and can improve the algorithm performance. -\RC{la phrase suivante est bizarre, je ne comprends pas pourquoi elle vient ici} -\RCE{C est la description du dernier test sync/async avec l'introduction de la notion de relative gain} -In this section, Simgrid simulator tool has been successfully used to show -the efficiency of the multisplitting in asynchronous mode and to find the best -combination of the grid resources (CPU, Network, input matrix size, \ldots ) to -get the highest \textit{"relative gain"} (exec\_time$_{GMRES}$ / -exec\_time$_{multisplitting}$) in comparison with the classical GMRES time. +In this section, the Simgrid simulator is used to compare the behavior of the +multisplitting in asynchronous mode with GMRES in synchronous mode. Several +benchmarks have been performed with various combination of the grid resources +(CPU, Network, input matrix size, \ldots ). The test conditions are summarized +in Table~\ref{tab:07}. In order to compare the execution times, this table +reports the relative gain between both algorithms. It is defined by the ratio +between the execution time of GMRES and the execution time of the +multisplitting. The ration is greater than one because the asynchronous +multisplitting version is faster than GMRES. -The test conditions are summarized in the table~\ref{tab:07}: \\ -\begin{table} [ht!] +\begin{table} [htbp] \centering \begin{tabular}{r c } \hline @@ -872,7 +870,7 @@ geographically distant clusters through the internet. power (GFlops) & 1 & 1 & 1 & 1.5 & 1.5 & 1.5 & 1.5 & 1 & 1.5 & 1.5 \\ \hline - size (N) + size ($N^3$) & 62 & 62 & 62 & 100 & 100 & 110 & 120 & 130 & 140 & 150 \\ \hline Precision