correc

diff --git a/GMRES_Journal.tex b/GMRES_Journal.tex

index be3eccf1303e69e6b2552dee1921d04ff5fb6e07..ee4f62ef12ea2581ac4cab2aabf69152a2f831a7 100644 (file)
--- a/GMRES_Journal.tex
+++ b/GMRES_Journal.tex
@@ -1,5 +1,6 @@
  \documentclass[11pt]{article}
  %\documentclass{acmconf}
+\usepackage{multicol}
  
  \usepackage[paper=a4paper,dvips,top=1.5cm,left=1.5cm,right=1.5cm,foot=1cm,bottom=1.5cm]{geometry}
  \usepackage{times}
@@ -126,7 +127,7 @@ volume. In addition, the performances of the parallel FEM algorithm are improved
  the communication with computation. 
  
  %%% MODIF %%%
-\textcolor{red}{ \bf Our main contribution in this work is to show the difficulties to implement the GMRES method for solving sparse linear systems on a cluster of GPUs. First, we show the main key points of the parallel GMRES algorithm on a GPU cluster. Then, we discuss the improvements of the algorithm which are mainly performed on the sparse matrix-vector multiplication when the matrix is distributed on several GPUs. In fact, on a cluster of GPUs the influence of the communications is greater than on clusters of CPUs due to the CPU/GPU communications between two GPUs that are not on the same machines. We propose to perform a hypergraph partitioning on the problem to be solved, then we reorder the matrix columns according to the partitioning scheme, and we use a compressed format for storing the vectors in such a way to minimize the communication overheads between two GPUs.}
+\textcolor{red}{ \bf Our main contribution in this work is to show the difficulties of implementing the GMRES method for solving sparse linear systems on a cluster of GPUs. First, we present the key points of the parallel GMRES algorithm on a GPU cluster. Then, we discuss the improvements of the algorithm, which mainly concern the sparse matrix-vector multiplication when the matrix is distributed over several GPUs. In fact, on a cluster of GPUs the influence of the communications is greater than on a cluster of CPUs, due to the CPU/GPU communications involved between two GPUs that are not on the same machine. We propose to perform a hypergraph partitioning of the problem to be solved, then to reorder the matrix columns according to the partitioning scheme, and to use a compressed format for storing the vectors in such a way as to minimize the communication overheads between two GPUs.}
  %%% END %%%
  
  %%--------------------%%
@@ -402,7 +403,7 @@ main diagonal of the sparse matrix $A$. Indeed, it allows us to easily compute t
  matrix $M^{-1}$ and it provides relatively good preconditioning in most cases. Finally, we set 
  the size of a thread-block in GPUs to $512$ threads. 
  %%% MODIF %%%
-\textcolor{red}{\bf It would be noted that the same optimizations done on the GPU version of the parallel GMRES algorithm are performed on the CPU version.}
+\textcolor{red}{\bf It should be noted that the same optimizations are performed on the CPU version and on the GPU version of the parallel GMRES algorithm.}
  %%% END %%%
  
  \begin{table}[!h]
@@ -853,39 +854,101 @@ torso3                  & 183 863 292      & 25 682 514       & 613 250
  
  
  
+
+
+
+
  %%% MODIF %%%
-\textcolor{red}{\bf In order to show the influence of the communications on a GPU cluster
-In tables, we compute the ratios of the computation time over the communication time to show the influence of the communications on a GPU cluster compared to a CPU cluster}
+\textcolor{red}{\bf Hereafter, we show the influence of the communications on a GPU cluster compared to a CPU cluster. In Tables~\ref{tab:10},~\ref{tab:11} and~\ref{tab:12}, we compute the ratios of the computation time over the communication time for three versions of the parallel GMRES algorithm solving the sparse linear systems associated with the matrices of Table~\ref{tab:06}. These tables show that the hypergraph partitioning and the compressed format of the vectors increase the ratios on both the GPU cluster and the CPU cluster. This means that the two optimization techniques reduce the total communication volume between the computing nodes. However, we can notice that the ratios obtained on the GPU cluster are, in most cases, lower than those obtained on the CPU cluster. Indeed, GPUs compute faster than CPUs, but with GPUs there are more communications due to the CPU/GPU data transfers, so the communications account for a larger share of the execution time.}
  
-\begin{table}%[!h]
+\begin{table}
  \begin{center}
-\begin{tabular}{|c|c|c|c|} 
+\begin{tabular}{|c||c|c|c||c|c|c|} 
  \hline
-Matrix            & Computation time & Communication time & Ratio \\ \hline \hline
-2cubes\_sphere    & 37.067 s         & 1434.512 s         & 0.026 \\
-ecology2          & 4.116 s          & 501.327 s          & 0.008 \\
-finan512          & 7.170 s          & 386.742 s          & 0.019 \\
-G3\_circuit       & 4.797 s          & 537.343 s          & 0.009 \\
-shallow\_water2   & 3.620 s          & 411.208 s          & 0.009 \\ 
-thermal2          & 6.902 s          & 511.618 s          & 0.013 \\ \hline \hline
-cage13            & 12.837 s         & 625.175 s          & 0.021 \\
-crashbasis        & 48.532 s         & 3195.183 s         & 0.015 \\
-FEM\_3D\_thermal2 & 37.211 s         & 1584.650 s         & 0.023 \\
-language          & 22.912 s         & 2242.897 s         & 0.010 \\
-poli\_large       & 13.618 s         & 1722.304 s         & 0.008 \\
-torso3            & 74.194 s         & 4454.936 s         & 0.017 \\ \hline       
+\multirow{2}{*}{Matrix} & \multicolumn{3}{c||}{GPU version} & \multicolumn{3}{c|}{CPU version}  \\ \cline{2-7}
+                  & $Time_{comput}$ & $Time_{comm}$ & $Ratio$ & $Time_{comput}$ & $Time_{comm}$ & $Ratio$ \\ \hline \hline
+2cubes\_sphere    & 37.067 s       & 1434.512 s   & {\bf 0.026}   & 312.061 s      & 3453.931 s   & {\bf 0.090}\\
+ecology2          & 4.116 s        & 501.327 s    & {\bf 0.008}   & 60.776 s       & 1216.607 s   & {\bf 0.050}\\
+finan512          & 7.170 s        & 386.742 s    & {\bf 0.019}   & 72.464 s       & 932.538 s    & {\bf 0.078}\\
+G3\_circuit       & 4.797 s        & 537.343 s    & {\bf 0.009}   & 66.011 s       & 1407.378 s   & {\bf 0.047}\\
+shallow\_water2   & 3.620 s        & 411.208 s    & {\bf 0.009}   & 51.294 s       & 973.446 s    & {\bf 0.053}\\ 
+thermal2          & 6.902 s        & 511.618 s    & {\bf 0.013}   & 77.255 s       & 1281.979 s   & {\bf 0.060}\\ \hline \hline
+cage13            & 12.837 s       & 625.175 s    & {\bf 0.021}   & 139.178 s      & 1518.349 s   & {\bf 0.092}\\
+crashbasis        & 48.532 s       & 3195.183 s   & {\bf 0.015}   & 623.686 s      & 7741.777 s   & {\bf 0.081}\\
+FEM\_3D\_thermal2 & 37.211 s       & 1584.650 s   & {\bf 0.023}   & 370.297 s      & 3810.255 s   & {\bf 0.097}\\
+language          & 22.912 s       & 2242.897 s   & {\bf 0.010}   & 286.682 s      & 5348.733 s   & {\bf 0.054}\\
+poli\_large       & 13.618 s       & 1722.304 s   & {\bf 0.008}   & 190.302 s      & 4059.642 s   & {\bf 0.047}\\
+torso3            & 74.194 s       & 4454.936 s   & {\bf 0.017}   & 190.302 s      & 10800.787 s  & {\bf 0.083}\\ \hline       
  \end{tabular}
-\caption{}
-\label{tab:09}
+\caption{Ratios of the computation time over the communication time obtained from the parallel GMRES algorithm using row-by-row partitioning on 12 GPUs and 24 CPUs.}
+\label{tab:10}
+\end{center}
+\end{table}
+
+
+\begin{table}
+\begin{center}
+\begin{tabular}{|c||c|c|c||c|c|c|} 
+\hline
+\multirow{2}{*}{Matrix} & \multicolumn{3}{c||}{GPU version} & \multicolumn{3}{c|}{CPU version}  \\ \cline{2-7}
+                  & $Time_{comput}$ & $Time_{comm}$ & $Ratio$ & $Time_{comput}$ & $Time_{comm}$ & $Ratio$ \\ \hline \hline
+2cubes\_sphere    & 27.386 s       & 154.861 s   & {\bf 0.177}   & 342.255 s      & 42.100 s   & {\bf 8.130}\\
+ecology2          & 3.822 s        & 53.131 s    & {\bf 0.072}   & 69.956 s       & 15.019 s   & {\bf 4.658}\\
+finan512          & 6.366 s        & 41.155 s    & {\bf 0.155}   & 79.592 s       & 8.604 s    & {\bf 9.251}\\
+G3\_circuit       & 4.543 s        & 63.132 s    & {\bf 0.072}   & 76.540 s       & 27.371 s   & {\bf 2.796}\\
+shallow\_water2   & 3.282 s        & 43.080 s    & {\bf 0.076}   & 58.348 s       & 8.088 s    & {\bf 7.214}\\ 
+thermal2          & 5.986 s        & 57.100 s    & {\bf 0.105}   & 87.682 s       & 28.544 s   & {\bf 3.072}\\ \hline \hline
+cage13            & 10.227 s       & 70.388 s    & {\bf 0.145}   & 152.718 s      & 30.785 s   & {\bf 4.961}\\
+crashbasis        & 41.527 s       & 369.071 s   & {\bf 0.113}   & 701.040 s      & 158.916 s  & {\bf 4.411}\\
+FEM\_3D\_thermal2 & 28.691 s       & 167.140 s   & {\bf 0.172}   & 403.510 s      & 50.935 s   & {\bf 7.922}\\
+language          & 22.408 s       & 242.589 s   & {\bf 0.092}   & 333.119 s      & 64.409 s   & {\bf 5.172}\\
+poli\_large       & 13.710 s       & 179.208 s   & {\bf 0.077}   & 215.934 s      & 30.903 s   & {\bf 6.987}\\
+torso3            & 58.455 s       & 480.315 s   & {\bf 0.122}   & 993.609 s      & 152.173 s  & {\bf 6.529}\\ \hline       
+\end{tabular}
+\caption{Ratios of the computation time over the communication time obtained from the parallel GMRES algorithm using row-by-row partitioning and compressed format for vectors on 12 GPUs and 24 CPUs.}
+\label{tab:11}
  \end{center}
  \end{table}
  
  
-\textcolor{red}{\bf Finally, the parallel solving of a linear system can be easy to optimize when the associated matrix is regular. This is unfortunately not the case of many real-world applications. When the matrix has an irregular structure, the amount of communication between processors is not the same. Another important parameter is the size of the matrix bandwidth which has a huge influence on the amount of communications. In this work, we have generated different kinds of matrices in order to analyze different difficulties. With as a large bandwidth as possible involving communications between all processors, which is the most difficult situation, we proposed to use two heuristics. Unfortunately, there is no fast method that optimizes the communication in any situation. For systems of non linear equations, there are different algorithms but most of them consist in linearizing the system of equations. In this case, a linear system needs to be solved. The big interest is that the matrix is the same at each step of the non linear system solving, so the partitioning method which is a time consuming step is performed once only.
+\begin{table}
+\begin{center}
+\begin{tabular}{|c||c|c|c||c|c|c|} 
+\hline
+\multirow{2}{*}{Matrix} & \multicolumn{3}{c||}{GPU version} & \multicolumn{3}{c|}{CPU version}  \\ \cline{2-7}
+                  & $Time_{comput}$ & $Time_{comm}$ & $Ratio$ & $Time_{comput}$ & $Time_{comm}$ & $Ratio$ \\ \hline \hline
+2cubes\_sphere    & 28.440 s       & 7.768 s      & {\bf 3.661}   & 327.109 s      & 63.788 s   & {\bf 5.128}\\
+ecology2          & 3.652 s        & 0.757 s      & {\bf 4.823}   & 63.632 s       & 13.520 s   & {\bf 4.707}\\
+finan512          & 7.579 s        & 4.569 s      & {\bf 1.659}   & 74.120 s       & 22.505 s   & {\bf 3.294}\\
+G3\_circuit       & 4.876 s        & 8.745 s      & {\bf 0.558}   & 72.280 s       & 28.395 s   & {\bf 2.546}\\
+shallow\_water2   & 3.146 s        & 0.606 s      & {\bf 5.191}   & 52.903 s       & 11.177 s   & {\bf 4.733}\\ 
+thermal2          & 6.473 s        & 4.325 s      & {\bf 1.497}   & 81.171 s       & 20.907 s   & {\bf 3.882}\\ \hline \hline
+cage13            & 11.676 s       & 7.723 s      & {\bf 1.512}   & 145.755 s      & 46.547 s   & {\bf 3.131}\\
+crashbasis        & 42.799 s       & 29.399 s     & {\bf 1.456}   & 650.386 s      & 203.918 s  & {\bf 3.189}\\
+FEM\_3D\_thermal2 & 29.875 s       & 8.915 s      & {\bf 3.351}   & 382.887 s      & 93.252 s   & {\bf 4.106}\\
+language          & 20.991 s       & 11.197 s     & {\bf 1.875}   & 310.679 s      & 82.480 s   & {\bf 3.767}\\
+poli\_large       & 13.817 s       & 102.760 s    & {\bf 0.134}   & 197.508 s      & 151.672 s  & {\bf 1.302}\\
+torso3            & 57.469 s       & 16.828 s     & {\bf 3.415}   & 926.588 s      & 242.721 s  & {\bf 3.817}\\ \hline       
+\end{tabular}
+\caption{Ratios of the computation time over the communication time obtained from the parallel GMRES algorithm using hypergraph partitioning and compressed format for vectors on 12 GPUs and 24 CPUs.}
+\label{tab:12}
+\end{center}
+\end{table}
+
+\begin{figure}
+\centering
+  \includegraphics[width=120mm,keepaspectratio]{weak}
+\caption{Weak scaling of the parallel GMRES algorithm on a GPU cluster.}
+\label{fig:09}
+\end{figure}
+
+\textcolor{red}{\bf Figure~\ref{fig:09} presents the weak scaling of four versions of the parallel GMRES algorithm on a GPU cluster. We fixed the size of a sub-matrix to 5 million rows per GPU computing node. We used matrices having five bands generated from the symmetric matrix thermal2. This figure shows that the parallel GMRES algorithm, in its naive version or using only one of the two optimizations (the compressed format for vectors or the hypergraph partitioning), is not scalable on a GPU cluster due to the large amount of communications between GPUs. In contrast, we can see that the algorithm using both optimization techniques is fairly scalable. This means that in this version the cost of communications remains relatively constant regardless of the number of computing nodes in the cluster.}
+
+\textcolor{red}{\bf Finally, from our point of view, the parallel solving of a linear system can be easy to optimize when the associated matrix is regular. This is unfortunately not the case for many real-world applications. When the matrix has an irregular structure, the amount of communication is not the same for all processors. Another important parameter is the size of the matrix bandwidth, which has a huge influence on the amount of communication. In this work, we have generated different kinds of matrices in order to analyze different difficulties. For the most difficult situation, a bandwidth as large as possible that involves communications between all processors, we proposed to use two heuristics. Unfortunately, there is no fast method that optimizes the communication in every situation. For systems of non-linear equations, there are different algorithms, but most of them consist in linearizing the system of equations. In this case, a linear system needs to be solved. The main advantage is that the matrix is the same at each step of the non-linear solver, so the partitioning, which is a time-consuming step, is performed only once.
  }
  
  \textcolor{red}{\bf 
-Another very important issue is that the communications have a greater influence on a cluster of GPUs than on a cluster of CPUs. There are two reasons for this. The first one comes from the fact that with a cluster of GPUs, the CPU/GPU data transfers slow down communications between two GPUs that are not on the same machines. The second one is due to the fact that with GPUs the ratio of the computation time over the communication time decreases since the computation time is reduced. So the impact of the communications between GPUs might be a very important issue that can limit the scalability of a parallel algorithm.}
+Another very important issue is that the communications have a greater influence on a cluster of GPUs than on a cluster of CPUs. There are two reasons for that. First, with a cluster of GPUs, the CPU/GPU data transfers slow down the communications between two GPUs that are not on the same machine. Second, with GPUs the ratio of the computation time over the communication time decreases since the computation time is reduced. So the communications between GPUs can become a major issue that limits the scalability of a parallel algorithm.}
  %%% END %%%
  
  %%--------------------%%
@@ -920,7 +983,7 @@ and to improve the performances of the parallel GMRES algorithm as the multispli
  The recent GPU hardware and software architectures provide the GPU-Direct system which allows
  two GPUs, placed in the same machine or in two remote machines, to exchange data without using
  CPUs. This improves the data transfers between GPUs. Finally, it would be interesting to implement
-other iterative methods on GPU clusters for solving large sparse linear or nonlinear systems.
+other iterative methods on GPU clusters for solving large sparse linear or non-linear systems.
  
  \paragraph{Acknowledgments}
  This paper is based upon work supported by the R\'egion de Franche-Comt\'e.