02-02-2014

[GMRES_For_Journal.git] / Revision.tex
diff --git a/Revision.tex b/Revision.tex

index 9c47fa16150d0b00e3d5cbdd42545d60c142cc3f..73923cb438d2bdf29889a7b33adb8cc1c8f1508b 100644 (file)
--- a/Revision.tex
+++ b/Revision.tex
@@ -37,7 +37,7 @@ In fact if we use preconditioning techniques, they will influence both the CPU a
  
  \item ``That is to say, it is unclear that how much a researcher working on GPU computing but not specialized in parallel linear solver can appreciate the paper.'' \\ \\ The GMRES method is one of the widely used iterative method for solving large and sparse linear systems. As we mentioned in the paper, the techniques and optimizations that we have used for GMRES method may be applied and adapted to other iterative methods on GPU clusters.
  
-\item ``It will be nice if the authors can emphasize the part of their experiences/optimizations that are generally applicable to other parallel algorithms.'' \\ \\ Thank you for your comment. In fact, we have added a paragraph on that in page {\bf A PRECISER, en gros ca serait une partie lesson learns, à voir si faut la faire apparaitre en tant que telle}. 
+\item ``It will be nice if the authors can emphasize the part of their experiences/optimizations that are generally applicable to other parallel algorithms.'' \\ \\ Thank you for your comment. In fact, we have added a paragraph on that in page 14. %{\bf A PRECISER, en gros ca serait une partie lesson learns, à voir si faut la faire apparaitre en tant que telle}. 
  
  %In fact, parallel linear system solving can be easy to optimized when the linear system is regular. This is the case for many applications. But for many other ones, this is not the case. When the matrix has not a regular structure, the amount of communication between processors is not the same. Another important parameter is the size bandwidth which has a huge influence on the amount of communications. In this work, we have generated matrices different kinds of matrices in order to analyze different difficulties. With the largest bandwidth as possible and with communications between all processors  which  is the most difficult situations, we propose to use two heuristics. Unfortunatly, there is no fast method that optimize the communication in any situation. For non linear systems of equations, there are differents algorithms but one of them consists in linearizing the systems. In this case, a linear system need to be solved. The big interest is that the matrix is the same at each step of the non linear system solving, so the partitioning method which is a time consuming step is performed once only.
  
@@ -51,6 +51,13 @@ Another very important issue, that maybe too many people ignore, is that on a cl
  \item ``Follow up on point 1). The experiment section can be enhanced as well. The numbers presented are very specific to the input matrix workload, which is generated by the authors. So it is unclear how much other researchers can benefit from it. It will be nice to focus on more detailed measuring and metrics, i.e., how to evaluate if your algorithm/optimization has maximally exploited the system capacity based on the CPU/GPU power and bandwidth available? Or is your algorithm as presented is the optimal at all?'' \\ \\ The sparse matrices that we have found in the literature are very small for our experiments and they don't allow to exploit the computing power of a GPU cluster. This is why we used a generator of large sparse matrices based on the real-world matrices of the Davis collection of the Florida university. Of course, you need to make a choice of the experiments to be performed. In this work, we have chosen different matrices with different patterns that induce either few or many communications. As explained previously, it is not possible to make an algorithm for solving linear system which is optimal in any situation. In this work we concentrated our effort on the parallelization on a GPU cluster. Nevertheless there are many variants of the GMRES method. It would be surprising that quite old methods that seemed not very interesting may not have a new interest with GPU clusters. Moreover, according to the nature of the matrix some specific solvers have been built to take advantage of these specificities. For other kinds of computing architectures, researchers did not tried to optimize all the parameters since this is too difficult and since the number of parameters is really important. So, a method that would try to optimize them would take more time than simply solving the linear system. 
  
  %The sparse matrices that we have found in the literature are very small for our experiments and they don't allow to exploit the computing power of a GPU cluster. This is why we used a generator of large sparse matrices based on the real-world matrices of the Davis collection of the Florida university. Of course, you need to make a choice of the experiments to performed. However, with we have chosen different matrices with differents patterns that induce either few or many communications. As explained previously, it is not possible to make an algorithm for solving linear system which is optimal in any situation. In this work we concentred our effort on the parallelization on a GPU cluster. Nevertheless there are hundred of variants of the GMRES method. It would be surprising that quite old methods that seemed not very interesting may not have a new interest with GPU clusters. Moreover, according to the nature of the matrix some specific solvers have been build to take advantage of these specificities. For other kinds of architectures, researchers did not tried to optimize all the parameters since this is too difficult and since the number of parameters is really important. So, a method that would try to optimize them would take more time than simply solving the linear system. 
+\end{enumerate}
  
+\section{Some modifications in the papier}
+\begin{enumerate}
+\item In Page 2, Section 2, we added a paragraph describing the contributions of our work.
+\item In Section 4, we reduced the theory part of the GMRES method as suggested by the Reviewer 2.
+\item In Section 5.2, we clarified that the same optimizations are used on the CPU version of the parallel GMRES algorithm too.  
+\item We added some results obtained from other experiments to show the influence of the communications on a GPU cluster. We computed the ratios between the compute time over the communication time and we showed the weak scaling of the different versions of the parallel GMRES algorithm.
  \end{enumerate}
  \end{document}