-\item ``The theoretical part of the paper devoted to GMRES method should be eliminated, since it is a well-known topic and the contributions of the paper are mainly related to the sparse matrix-vector product.'' \\ \\ Thank you for your comment. We have reduced the theoretical part devoted to the GMRES method.
+\medskip
+In this paper, we aim to investigate the parallelization of the GMRES method on a GPU cluster. We have compared different versions of the parallel GMRES algorithm on a cluster of GPUs (with and without the optimizations). We could certainly optimize the CPU version as well, but this is beyond the objectives of this paper.
+
+\item ``There is no comparison with proposals of other authors.''
+
+\medskip
+In the literature, there are a few GMRES implementations on multi-GPU platforms but, to the best of our knowledge, none on a GPU cluster, which involves the distributed-memory constraint.
+
+\item ``The only comparisons is the speedup with regard to the CPU version of the algorithm carried out by the authors. The GMRES algorithm it is not analyzed, since the paper focuses mainly on the sparse matrix-vector product.''
+
+\medskip
+As we previously mentioned, we have not only compared the CPU and GPU versions but also the different GPU versions with one another (with/\linebreak[0]without optimizations). The GMRES algorithm has already been analyzed in many papers (we gave some references). In this paper we have focused on its implementation on a GPU cluster and on how to improve the communication between the computing nodes.
+
+\item ``Preconditioning and its influence in the communication should be perhaps most interesting and should be deeply considered, as it limits substantially the performance of GMRES.''
+
+\medskip
+Indeed, if we use a preconditioning technique, it will affect both the CPU and the GPU solvers. With left preconditioning, the initial matrix-vector product is unchanged. In this case, the preconditioning process does not change the communication cost on a cluster of processors; it only reduces the number of iterations required for convergence. It would be interesting to study which preconditioning algorithms are best suited to GPU clusters, but this is beyond the scope of this paper.
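+As a brief sketch of this point (using a generic preconditioner $M$, not a specific method from the paper): with left preconditioning the system $Ax=b$ becomes
+
```latex
\[
  M^{-1} A x = M^{-1} b ,
\]
% Each GMRES iteration still computes the product A v with the original
% matrix A, so the sparse matrix-vector communication pattern on the
% cluster is unchanged; only a local solve with M is added per iteration.
```
+
+so the distributed sparse matrix-vector product, and hence the inter-node communication, is the same as in the unpreconditioned case.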
+
+\item ``The theoretical part of the paper devoted to GMRES method should be eliminated, since it is a well-known topic and the contributions of the paper are mainly related to the sparse matrix-vector product.''
+
+\medskip
+Thank you for your comment. We have reduced the theoretical part devoted to the GMRES method.