\item ``There is no comparison with proposals of other authors.'' \\ \\ In the literature, there are a few GMRES implementations on a multi-GPUs but, to the best of our knowledge, not on a GPU cluster which involves the distributed memory constraint.
-\item ``The only comparisons is the speedup with regard to the CPU version of the algorithm carried out by the authors. The GMRES algorithm it is not analyzed, since the paper focuses mainly on the sparse matrix-vector product.'' \\ \\ As we previously mentioned, we have not only compared the CPU and GPU versions but also the different GPU versions between them ( with/without optimizations). The GMRES algorithm has already been analyzed in many papers (we gave some references). In this paper we have focused on its implementation on a GPU cluster and on how to improve the communication between the computing nodes.
+\item ``The only comparisons is the speedup with regard to the CPU version of the algorithm carried out by the authors. The GMRES algorithm it is not analyzed, since the paper focuses mainly on the sparse matrix-vector product.'' \\ \\ As we previously mentioned, we have not only compared the CPU and GPU versions but also the different GPU versions between them (with/\linebreak[0]without optimizations). The GMRES algorithm has already been analyzed in many papers (we gave some references). In this paper we have focused on its implementation on a GPU cluster and on how to improve the communication between the computing nodes.
\item ``Preconditioning and its influence in the communication should be perhaps most interesting and should be deeply considered, as it limits substantially the performance of GMRES.'' \\ \\
In fact if we use preconditioning techniques, they will influence both the CPU and the GPU solvers. If we use a left preconditioning, the initial matrix vector product is not changed. In this case, the preconditioning process does not change the cost of the communication on a cluster of processors. It only reduces the number of iterations required for the convergence. What could be interesting to study is which preconditioning algorithm is more suited to GPU clusters, but this is out of the topic of this paper.
%Another very important issue, that maybe too many people ignore, is that on a cluster of GPUs the influence of the communications is greater than on clusters of CPUs. There are two reasons for this. The first one comes from the fact that with a cluster of GPUs, the CPU/GPU communications slow down communications between two GPUs not on the same machines. The second one is due to the fact that with GPUs the ratio of the computation time over the communication time decreases since the computation time are reduced. So the impact of the communications between GPUs might be a very important issue that can limit the scalability of an parallel algorithm.
-In fact, the parallel solving of a linear system can be easy to optimize when the associated matrix is regular. This is unfortunately not the case of many real-world applications. When the matrix does not have a regular structure, the amount of communication between processors is not the same. Another important parameter is the size of the matrix bandwidth which has a huge influence on the amount of communications. In this work, we have generated different kinds of matrices in order to analyze several difficulties. With a bandwidth as large as possible, involving communications between all processors, which is the most difficult situation, we proposed to use two heuristics. Unfortunately, there is no fast method that optimizes the communication in any situation. For systems of non linear equations, there are different algorithms but most of them consist in linearizing the system of equations. In this case, a linear system needs to be solved. The big interest is that the matrix is the same at each step of the non linear system solving, so the partitioning method which is a time consuming step is performed only once.
+ Finally, as far as we are concerned, the parallel solving of a linear system can be easy to optimize when the associated matrix is regular. This is unfortunately not the case for many real-world applications. When the matrix has an irregular structure, the amount of communications between processors is not the same. Another important parameter is the size of the matrix bandwidth which has a huge influence on the amount of communications. In this work, we have generated different kinds of matrices in order to analyze several difficulties. With a bandwidth as large as possible, involving communications between all processors, which is the most difficult situation, we proposed to use two heuristics. Unfortunately, there is no fast method that optimizes the communications in any situation. For systems of non linear equations, there are different algorithms but most of them consist in linearizing the system of equations. In this case, a linear system needs to be solved. The big interest is that the matrix is the same at each step of the non linear system solving, so the partitioning method which is a time consuming step is performed only once.
-Another very important issue, which might be ignored by too many people, is that on a cluster of GPUs the influence of the communications is greater than on clusters of CPUs. There are two reasons for this. The first one comes from the fact that with a cluster of GPUs, the CPU/GPU communications slow down communications between two GPUs that are not on the same machines. The second one is due to the fact that with GPUs the ratio of the computation time over the communication time decreases since the computation time is reduced. So the impact of the communications between GPUs might be a very important issue that can limit the scalability of a parallel algorithm.
+Another very important issue, which might be ignored by too many people, is that the communications have a greater influence on a cluster of GPUs than on a cluster of CPUs. There are two reasons for that. The first one comes from the fact that with a cluster of GPUs, the CPU/GPU data transfers slow down communications between two GPUs that are not on the same machines. The second one is due to the fact that with GPUs the ratio of the computation time over the communication time decreases since the computation time is reduced. So the impact of the communications between GPUs might be a very important issue that can limit the scalability of parallel algorithms.
-\item ``Follow up on point 1). The experiment section can be enhanced as well. The numbers presented are very specific to the input matrix workload, which is generated by the authors. So it is unclear how much other researchers can benefit from it. It will be nice to focus on more detailed measuring and metrics, i.e., how to evaluate if your algorithm/optimization has maximally exploited the system capacity based on the CPU/GPU power and bandwidth available? Or is your algorithm as presented is the optimal at all?'' \\ \\ The sparse matrices that we have found in the literature are very small for our experiments and they don't allow to exploit the computing power of a GPU cluster. This is why we used a generator of large sparse matrices based on the real-world matrices of the Davis collection of the Florida university. Of course, you needed to chose the experiments that had to be performed. In this work, we have chosen different matrices with different patterns that induce either few or many communications. As explained previously, this is impossible to make an algorithm to solve linear system which is optimal in any situation. In this work we concentrated our efforts on the parallelization on a GPU cluster. Nevertheless there are many variants of the GMRES method. It would be surprising that quite old methods that seemed not very interesting may not have a new interest with GPU clusters. Moreover, according to the nature of the matrix some specific solvers have been built to take advantage of these specificities. For other kinds of computing architectures, researchers did not tried to optimize all the parameters since this is too difficult and since the number of parameters is really important. So, a method that would try to optimize them would take more time than simply solving the linear system.
+\item ``Follow up on point 1). The experiment section can be enhanced as well. The numbers presented are very specific to the input matrix workload, which is generated by the authors. So it is unclear how much other researchers can benefit from it. It will be nice to focus on more detailed measuring and metrics, i.e., how to evaluate if your algorithm/optimization has maximally exploited the system capacity based on the CPU/GPU power and bandwidth available? Or is your algorithm as presented is the optimal at all?'' \\ \\ The sparse matrices that we have found in the literature are very small for our experiments and they don't allow to exploit the computing power of a GPU cluster. This is why we used a generator of large sparse matrices based on the real-world matrices of the Davis collection of the University of Florida. Of course, we needed to choose the experiments that had to be performed. In this work, we have chosen different matrices with different patterns that induce either few or many communications. As explained previously, it is impossible to make an algorithm to solve linear system which is optimal in any situation. In this work we concentrated our efforts on the parallelization on a GPU cluster. Nevertheless there are many variants of the GMRES method. It would be surprising that quite old methods that seemed not very interesting may not have a new interest with GPU clusters. Moreover, according to the nature of the matrix some specific solvers have been built to take advantage of these specificities. For other kinds of computing architectures, researchers did not try to optimize all the parameters since this is too difficult and since the number of parameters is really important. So, a method that would try to optimize them would take more time than simply solving the linear system.
%The sparse matrices that we have found in the literature are very small for our experiments and they don't allow to exploit the computing power of a GPU cluster. This is why we used a generator of large sparse matrices based on the real-world matrices of the Davis collection of the Florida university. Of course, you need to make a choice of the experiments to performed. However, with we have chosen different matrices with differents patterns that induce either few or many communications. As explained previously, it is not possible to make an algorithm for solving linear system which is optimal in any situation. In this work we concentred our effort on the parallelization on a GPU cluster. Nevertheless there are hundred of variants of the GMRES method. It would be surprising that quite old methods that seemed not very interesting may not have a new interest with GPU clusters. Moreover, according to the nature of the matrix some specific solvers have been build to take advantage of these specificities. For other kinds of architectures, researchers did not tried to optimize all the parameters since this is too difficult and since the number of parameters is really important. So, a method that would try to optimize them would take more time than simply solving the linear system.
\end{enumerate}