From 13bec903a949b4676e7210fadeb7ef16f8466101 Mon Sep 17 00:00:00 2001
From: Arnaud Giersch <arnaud.giersch@iut-bm.univ-fcomte.fr>
Date: Wed, 12 Feb 2014 13:40:42 +0100
Subject: [PATCH] Make Revision.tex fit on 4 pages (not change in text).

---
 Revision.tex | 59 +++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 47 insertions(+), 12 deletions(-)

diff --git a/Revision.tex b/Revision.tex
index 580c815..4d6c7e8 100644
--- a/Revision.tex
+++ b/Revision.tex
@@ -13,31 +13,63 @@
 
 \section*{Reviewer \#2:} 
 \begin{enumerate}
-\item ``The authors claim that this paper focuses on use the GMRES iterative method for solving large sparse linear systems on a cluster of GPUs. However, as the authors say, they focus, particularly, on improving the performances of the parallel sparse matrix-vector multiplication'' \\ \\ Our main contribution is to show the difficulties of implementing the GMRES method to solve sparse linear systems on a cluster of GPUs. First, we show the main key points of the parallel GMRES algorithm on a GPU cluster. Then, most of the improvements discussed in this paper are performed on the sparse matrix-vector multiplication when the matrix is distributed on several GPUs. This step is the most time-consuming.
+\item ``The authors claim that this paper focuses on use the GMRES iterative method for solving large sparse linear systems on a cluster of GPUs. However, as the authors say, they focus, particularly, on improving the performances of the parallel sparse matrix-vector multiplication''
 
-\item ``This is a very studied topic in the literature devoted to sparse matrices processing. Therefore, it is difficult to generate new approaches which could be relevant to the state-of-the-art of this topic. Still, the work is interesting from the implementation point of view. It may be noted the two solutions used to minimize communication in the GPU cluster. However, there are no new contributions from the archival point of view.'' \\ \\ We  have focused our work on the communications because they are expensive on a cluster of hardware accelerators. We have applied a hypergraph partitioning on the problem to solve (of course other partitioning methods may be used according to the structure of the sparse matrix and the computing environment), then we have reordered the matrix columns according to the partitioning scheme, and we have used a compressed format to store the vectors in order to minimize the communication overheads.
+\medskip
+Our main contribution is to show the difficulties of implementing the GMRES method to solve sparse linear systems on a cluster of GPUs. First, we show the main key points of the parallel GMRES algorithm on a GPU cluster. Then, most of the improvements discussed in this paper are performed on the sparse matrix-vector multiplication when the matrix is distributed on several GPUs. This step is the most time-consuming.
 
-\item ``Some shortcomings should be corrected for future versions: Only the versions for the GPU cluster of the algorithms are optimized. The implementation for the CPU cluster is not optimized.'' \\ \\ We have clarified in the paper that the same optimizations are performed on both GPU and CPU versions.
+\item ``This is a very studied topic in the literature devoted to sparse matrices processing. Therefore, it is difficult to generate new approaches which could be relevant to the state-of-the-art of this topic. Still, the work is interesting from the implementation point of view. It may be noted the two solutions used to minimize communication in the GPU cluster. However, there are no new contributions from the archival point of view.''
 
-\item ``What would happen if the algorithm would have been also optimized for the cluster of CPUs (eg using AVX instructions , or using hybrid MPI + OpenMP programming, etc)?'' \\ \\ In this paper, we aim to investigate the parallelization of the GMRES method on a GPU cluster. We have compared different versions of the parallel GMRES algorithm on a cluster of GPUs (with/without the optimizations). Obviously, we could optimize the CPU version but this would be beyond the objectives of this paper.
+\medskip
+We  have focused our work on the communications because they are expensive on a cluster of hardware accelerators. We have applied a hypergraph partitioning on the problem to solve (of course other partitioning methods may be used according to the structure of the sparse matrix and the computing environment), then we have reordered the matrix columns according to the partitioning scheme, and we have used a compressed format to store the vectors in order to minimize the communication overheads.
 
-\item ``There is no comparison with proposals of other authors.'' \\ \\ In the literature, there are a few GMRES implementations on a multi-GPUs but, to the best of our knowledge, not on a GPU cluster which involves the distributed memory constraint.
+\item ``Some shortcomings should be corrected for future versions: Only the versions for the GPU cluster of the algorithms are optimized. The implementation for the CPU cluster is not optimized.''
 
-\item ``The only comparisons is the  speedup with regard to the CPU version of the  algorithm carried out by the authors. The GMRES algorithm it is not analyzed, since the paper focuses mainly on the sparse matrix-vector product.'' \\ \\ As we previously mentioned, we have not only compared the CPU and GPU versions but also the different GPU versions between them (with/\linebreak[0]without optimizations). The GMRES algorithm has already been analyzed in many papers (we gave some references). In this paper we have focused on its implementation on a GPU cluster and on how to improve the communication between the computing nodes.
+\medskip
+We have clarified in the paper that the same optimizations are performed on both GPU and CPU versions.
 
-\item ``Preconditioning and its influence in the communication should be perhaps most interesting and should be deeply considered, as it limits substantially the performance of GMRES.'' \\ \\ 
+\item ``What would happen if the algorithm would have been also optimized for the cluster of CPUs (eg using AVX instructions , or using hybrid MPI + OpenMP programming, etc)?''
+
+\medskip
+In this paper, we aim to investigate the parallelization of the GMRES method on a GPU cluster. We have compared different versions of the parallel GMRES algorithm on a cluster of GPUs (with/without the optimizations). Obviously, we could optimize the CPU version but this would be beyond the objectives of this paper.
+
+\item ``There is no comparison with proposals of other authors.''
+
+\medskip
+In the literature, there are a few GMRES implementations on a multi-GPUs but, to the best of our knowledge, not on a GPU cluster which involves the distributed memory constraint.
+
+\item ``The only comparisons is the  speedup with regard to the CPU version of the  algorithm carried out by the authors. The GMRES algorithm it is not analyzed, since the paper focuses mainly on the sparse matrix-vector product.''
+
+\medskip
+As we previously mentioned, we have not only compared the CPU and GPU versions but also the different GPU versions between them (with/\linebreak[0]without optimizations). The GMRES algorithm has already been analyzed in many papers (we gave some references). In this paper we have focused on its implementation on a GPU cluster and on how to improve the communication between the computing nodes.
+
+\item ``Preconditioning and its influence in the communication should be perhaps most interesting and should be deeply considered, as it limits substantially the performance of GMRES.''
+
+\medskip
 In fact if we use preconditioning techniques, they will influence both the CPU and the GPU solvers. If we use a left preconditioning, the initial matrix vector product is not changed. In this case, the preconditioning process does not change the cost of the communication on a cluster of processors. It only reduces the number of iterations required for the convergence. What could be interesting to study is which preconditioning algorithm is more suited to GPU clusters, but this is out of the topic of this paper.
 
-\item ``The theoretical part of the paper devoted to GMRES method should be eliminated, since it is a well-known topic and  the contributions of the paper are mainly related to the sparse matrix-vector product.'' \\ \\ We have reduced the theoretical part devoted to the GMRES method.
+\item ``The theoretical part of the paper devoted to GMRES method should be eliminated, since it is a well-known topic and  the contributions of the paper are mainly related to the sparse matrix-vector product.''
+
+\medskip
+We have reduced the theoretical part devoted to the GMRES method.
 \end{enumerate}
 
 \section*{Reviewer \#3:} 
 \begin{enumerate}
-\item ``Right now the paper reads more like a technical report, with a lot of details on the linear solver and then on some of the optimizations. The key findings and the contributions have not been emphasized.'' \\ \\ To the best of our knowledge, this is the first parallel implementation of the GMRES algorithm on a GPU cluster. Obviously, in this kind of clusters, the GPUs accelerate the computations but the communication between computing nodes is  more time-consuming than on a cluster of CPUs (because there are communications between CPUs and GPUs). Hence, using a partitioning technique that minimizes the total communication volume is interesting. However, we have shown that the use of a partitioning method is not sufficient. In fact, the partitioning without the reordering of the sparse matrix according to the partitioning scheme and the use of the compressed storage format for the vectors does not have much interest in most cases. 
+\item ``Right now the paper reads more like a technical report, with a lot of details on the linear solver and then on some of the optimizations. The key findings and the contributions have not been emphasized.''
+
+\medskip
+To the best of our knowledge, this is the first parallel implementation of the GMRES algorithm on a GPU cluster. Obviously, in this kind of clusters, the GPUs accelerate the computations but the communication between computing nodes is  more time-consuming than on a cluster of CPUs (because there are communications between CPUs and GPUs). Hence, using a partitioning technique that minimizes the total communication volume is interesting. However, we have shown that the use of a partitioning method is not sufficient. In fact, the partitioning without the reordering of the sparse matrix according to the partitioning scheme and the use of the compressed storage format for the vectors does not have much interest in most cases. 
 
-\item ``That is to say, it is unclear that how much a researcher working on GPU computing but not specialized in parallel linear solver can appreciate the paper.'' \\ \\ The GMRES method is one of the most widely used iterative methods to solve large and sparse linear systems. As mentioned in the paper, the techniques and optimizations we have used for GMRES method may be applied and adapted to other iterative methods on GPU clusters.
+\item ``That is to say, it is unclear that how much a researcher working on GPU computing but not specialized in parallel linear solver can appreciate the paper.''
 
-\item ``It will be nice if the authors can emphasize the part of their experiences/optimizations that are generally applicable to other parallel algorithms.'' \\ \\ In fact, we have added the following paragraph on that subject page 14.
+\medskip
+The GMRES method is one of the most widely used iterative methods to solve large and sparse linear systems. As mentioned in the paper, the techniques and optimizations we have used for GMRES method may be applied and adapted to other iterative methods on GPU clusters.
+
+\item ``It will be nice if the authors can emphasize the part of their experiences/optimizations that are generally applicable to other parallel algorithms.''
+
+\medskip
+In fact, we have added the following paragraph on that subject page 14.
 
 %In fact, parallel linear system solving can be easy to optimized when the linear system is regular. This is the case for many applications. But for many other ones, this is not the case. When the matrix has not a regular structure, the amount of communication between processors is not the same. Another important parameter is the size bandwidth which has a huge influence on the amount of communications. In this work, we have generated matrices different kinds of matrices in order to analyze several difficulties. With the largest bandwidth as possible and with communications between all processors  which  is the most difficult situations, we propose to use two heuristics. Unfortunatly, there is no fast method that optimize the communication in any situation. For non linear systems of equations, there are differents algorithms but one of them consists in linearizing the systems. In this case, a linear system need to be solved. The big interest is that the matrix is the same at each step of the non linear system solving, so the partitioning method which is a time consuming step is performed only once.
 
@@ -48,7 +80,10 @@ In fact if we use preconditioning techniques, they will influence both the CPU a
 Another very important issue, which might be ignored by too many people, is that the communications have a greater influence on a cluster of GPUs than on a cluster of CPUs. There are two reasons for that. The first one comes from the fact that with a cluster of GPUs, the CPU/GPU data transfers slow down communications between two GPUs that are not on the same machines. The second one is due to the fact that with GPUs the ratio of the computation time over the communication time decreases since the computation time is reduced. So the impact of the communications between GPUs might be a very important issue that can limit the scalability of parallel algorithms.
 
 
-\item ``Follow up on point 1). The experiment section can be enhanced as well. The numbers presented are very specific to the input matrix workload, which is generated by the authors. So it is unclear how much other researchers can benefit from it. It will be nice to focus on more detailed measuring and metrics, i.e., how to evaluate if your algorithm/optimization has maximally exploited the system capacity based on the CPU/GPU power and bandwidth available? Or is your algorithm as presented is the optimal at all?'' \\ \\ The sparse matrices that we have found in the literature are very small for our experiments and they don't allow to exploit the computing power of a GPU cluster. This is why we used a generator of large sparse matrices based on the real-world matrices of the Davis collection of the University of Florida. Of course, we needed to choose the experiments that had to be performed. In this work, we have chosen different matrices with different patterns that induce either few or many communications. As explained previously, it is impossible to make an algorithm to solve linear system which is optimal in any situation. In this work we concentrated our efforts on the parallelization on a GPU cluster. Nevertheless there are many variants of the GMRES method. It would be surprising that quite old methods that seemed not very interesting may not have a new interest with GPU clusters. Moreover, according to the nature of the matrix some specific solvers have been built to take advantage of these specificities. For other kinds of computing architectures, researchers did not try to optimize all the parameters since this is too difficult and since the number of parameters is really important. So, a method that would try to optimize them would take more time than simply solving the linear system. 
+\item ``Follow up on point 1). The experiment section can be enhanced as well. The numbers presented are very specific to the input matrix workload, which is generated by the authors. So it is unclear how much other researchers can benefit from it. It will be nice to focus on more detailed measuring and metrics, i.e., how to evaluate if your algorithm/optimization has maximally exploited the system capacity based on the CPU/GPU power and bandwidth available? Or is your algorithm as presented is the optimal at all?''
+
+\medskip
+The sparse matrices that we have found in the literature are very small for our experiments and they don't allow to exploit the computing power of a GPU cluster. This is why we used a generator of large sparse matrices based on the real-world matrices of the Davis collection of the University of Florida. Of course, we needed to choose the experiments that had to be performed. In this work, we have chosen different matrices with different patterns that induce either few or many communications. As explained previously, it is impossible to make an algorithm to solve linear system which is optimal in any situation. In this work we concentrated our efforts on the parallelization on a GPU cluster. Nevertheless there are many variants of the GMRES method. It would be surprising that quite old methods that seemed not very interesting may not have a new interest with GPU clusters. Moreover, according to the nature of the matrix some specific solvers have been built to take advantage of these specificities. For other kinds of computing architectures, researchers did not try to optimize all the parameters since this is too difficult and since the number of parameters is really important. So, a method that would try to optimize them would take more time than simply solving the linear system. 
 
 %The sparse matrices that we have found in the literature are very small for our experiments and they don't allow to exploit the computing power of a GPU cluster. This is why we used a generator of large sparse matrices based on the real-world matrices of the Davis collection of the Florida university. Of course, you need to make a choice of the experiments to performed. However, with we have chosen different matrices with differents patterns that induce either few or many communications. As explained previously, it is not possible to make an algorithm for solving linear system which is optimal in any situation. In this work we concentred our effort on the parallelization on a GPU cluster. Nevertheless there are hundred of variants of the GMRES method. It would be surprising that quite old methods that seemed not very interesting may not have a new interest with GPU clusters. Moreover, according to the nature of the matrix some specific solvers have been build to take advantage of these specificities. For other kinds of architectures, researchers did not tried to optimize all the parameters since this is too difficult and since the number of parameters is really important. So, a method that would try to optimize them would take more time than simply solving the linear system. 
 \end{enumerate}
-- 
2.39.5