--- /dev/null
+\documentclass[a4paper]{article}
+
+
+
+\title{Parallel sparse linear solver with GMRES method using minimization techniques of communications for GPU clusters}
+\author{}
+\date{}
+
+\makeindex
+
+\begin{document}
+\maketitle
+
+\section*{Reviewer \#2:}
+\begin{enumerate}
+\item ``The authors claim that this paper focuses on use the GMRES iterative method for solving large sparse linear systems on a cluster of GPUs. However, as the authors say, they focus, particularly, on improving the performances of the parallel sparse matrix-vector multiplication'' \\ \\ Our main contribution is to show the difficulties of implementing the GMRES method for solving sparse linear systems on a cluster of GPUs. First, we presented the key points of the parallel GMRES algorithm on a GPU cluster. Then, most of the improvements discussed in this paper concern the sparse matrix-vector multiplication when the matrix is distributed over several GPUs, since this step is the most time-consuming.
+
+\item ``This is a very studied topic in the literature devoted to sparse matrices processing. Therefore, it is difficult to generate new approaches which could be relevant to the state-of-the-art of this topic. Still, the work is interesting from the implementation point of view. It may be noted the two solutions used to minimize communication in the GPU cluster. However, there are no new contributions from the archival point of view.'' \\ \\ We focused our work on the communications because they are expensive on a cluster of hardware accelerators. We applied hypergraph partitioning to the problem to be solved (other partitioning methods may of course be used, depending on the structure of the sparse matrix and the computing environment), then we reordered the matrix columns according to the partitioning scheme, and we used a compressed storage format for the vectors so as to minimize the communication overheads.
+
+\item ``Some shortcomings should be corrected for future versions: Only the versions for the GPU cluster of the algorithms are optimized. The implementation for the CPU cluster is not optimized.'' \\ \\ Thank you for your comment. We have clarified in the paper that the same optimizations are applied to both the GPU and CPU versions.
+
+\item ``What would happen if the algorithm would have been also optimized for the cluster of CPUs (eg using AVX instructions , or using hybrid MPI + OpenMP programming, etc)?'' \\ \\ In this paper, we aimed to investigate the parallelization of the GMRES method on a GPU cluster. We have compared different versions of the parallel GMRES algorithm on a cluster of GPUs (with/without the optimizations). Obviously, we could optimize the CPU version, but this is beyond the scope of this paper.
+
+\item ``There is no comparison with proposals of other authors.'' \\ \\ In the literature, there are a few GMRES implementations on multi-GPU systems but, to our knowledge, none on a GPU cluster, which involves the distributed-memory constraint.
+
+\item ``The only comparisons is the speedup with regard to the CPU version of the algorithm carried out by the authors. The GMRES algorithm it is not analyzed, since the paper focuses mainly on the sparse matrix-vector product.'' \\ \\ As we previously mentioned, we have compared not only the CPU and GPU versions but also the different GPU versions with one another (with/without optimizations). The GMRES algorithm has already been analyzed in many papers (we gave some references). In this paper, we focused on its implementation on a GPU cluster and on how to improve the communication between the computing nodes.
+
+\item ``Preconditioning and its influence in the communication should be perhaps most interesting and should be deeply considered, as it limits substantially the performance of GMRES.'' \\ \\
+In fact, if we use preconditioning techniques, they will influence both the CPU and the GPU solvers. If we use left preconditioning, the initial matrix-vector product is not changed. In this case, the preconditioning process does not change the cost of the communication on a cluster of processors. It only reduces the number of iterations required for convergence. What could be interesting to study is which preconditioning algorithm is best suited to GPU clusters, but this is beyond the scope of this paper.
+
+\item ``The theoretical part of the paper devoted to GMRES method should be eliminated, since it is a well-known topic and the contributions of the paper are mainly related to the sparse matrix-vector product.'' \\ \\ Thank you for your comment. We have reduced the theoretical part devoted to the GMRES method.
+\end{enumerate}
+
+\section*{Reviewer \#3:}
+\begin{enumerate}
+\item ``Right now the paper reads more like a technical report, with a lot of details on the linear solver and then on some of the optimizations. The key findings and the contributions have not been emphasized.'' \\ \\ To our knowledge, this is the first parallel implementation of the GMRES algorithm on a GPU cluster. Obviously, in this kind of cluster, the GPUs accelerate the computations, but the communication between computing nodes is more time-consuming than on a cluster of CPUs (because of the additional communications between CPUs and GPUs). Hence, using a partitioning technique that minimizes the total communication volume is of interest. However, we have shown that the use of a partitioning method alone is not sufficient. In fact, partitioning without reordering the sparse matrix according to the partitioning scheme and without using the compressed storage format for the vectors brings little benefit in most cases.
+
+\item ``That is to say, it is unclear that how much a researcher working on GPU computing but not specialized in parallel linear solver can appreciate the paper.'' \\ \\ The GMRES method is one of the widely used iterative methods for solving large sparse linear systems. As we mentioned in the paper, the techniques and optimizations that we have used for the GMRES method may be applied and adapted to other iterative methods on GPU clusters.
+
+\item ``It will be nice if the authors can emphasize the part of their experiences/optimizations that are generally applicable to other parallel algorithms.'' \\ \\ Thank you for your comment. {\bf I do not understand this point.} In fact, the main problem affecting the performance and the scalability of linear solvers on parallel computers is the communication. In future work, we plan to study other linear system solvers.
+
+\item ``Follow up on point 1). The experiment section can be enhanced as well. The numbers presented are very specific to the input matrix workload, which is generated by the authors. So it is unclear how much other researchers can benefit from it. It will be nice to focus on more detailed measuring and metrics, i.e., how to evaluate if your algorithm/optimization has maximally exploited the system capacity based on the CPU/GPU power and bandwidth available? Or is your algorithm as presented is the optimal at all?'' \\ \\ The sparse matrices that we have found in the literature are too small for our experiments and do not allow us to exploit the computing power of a GPU cluster. This is why we used a generator of large sparse matrices based on the real-world matrices of the Davis collection of the University of Florida.
+\end{enumerate}
+\end{document}