of the unused vector elements, the GPU computing nodes must exchange between them only the vector elements necessary
to perform their local sparse matrix-vector multiplications.
-\begin{figure}
+\begin{figure}[!h]
\centering
\includegraphics[width=120mm,keepaspectratio]{Figures/compress}
\caption{Example of data exchanges between node 1 and its neighbors 0, 2 and 3.}
\label{fig:05}
\end{figure}
+\begin{figure}[!h]
+\centering
+ \includegraphics[width=100mm,keepaspectratio]{Figures/reorder}
+\caption{Reordering of the columns of a local sparse matrix.}
+\label{fig:06}
+\end{figure}
+
We propose to use a compressed storage format of the sparse global vector. In Figure~\ref{fig:05}, we show an
example of the data exchanges between node 1 and its neighbors to construct the compressed global vector.
First, the neighboring nodes 0, 2 and 3 determine the vector elements needed by node 1 and, then, they send
by the GPU as a kernel. In addition, we use the MPI point-to-point communication routines: a blocking send routine
\verb+MPI_Send()+ and a nonblocking receive routine \verb+MPI_Irecv()+.
-\begin{figure}
-\centering
- \includegraphics[width=100mm,keepaspectratio]{Figures/reorder}
-\caption{Reordering of the columns of a local sparse matrix.}
-\label{fig:06}
-\end{figure}
-
Table~\ref{tab:05} shows the performances of the parallel GMRES algorithm using the compressed storage format
of the sparse global vector. The results are obtained from solving large linear systems associated to the sparse
band matrices presented in Table~\ref{tab:03}. We can see from Table~\ref{tab:05} that the execution times
\end{center}
\end{table}
-
\subsection{Hypergraph partitioning}
\label{sec:06.02}
In this section, we use another structure of the sparse matrices. We are interested in sparse matrices
of sparse matrices of size 25 million of rows and generated from those of the Davis collection. We can
see in the fourth column that the bandwidths of these matrices are as large as their sizes.
-\begin{figure}
+\begin{figure}[h!]
\centering
\includegraphics[width=100mm,keepaspectratio]{Figures/generation_1}
\caption{Example of the generation of a large sparse matrix having five bands by four computing nodes.}
\label{fig:07}
\end{figure}
-
\begin{table}[!h]
\begin{center}
\begin{tabular}{|c|c|r|r|}
\end{center}
\end{table}
-Table~\ref{tab:09} shows in the second, third and fourth columns the total communication volume on a cluster of 12 GPUs by using row-by-row partitioning or hypergraph partitioning and compressed format. The total communication volume defines the total number of the vector elements exchanged between the 12 GPUs. From these columns we can see that the two heuristics, compressed format for the vectors and the hypergraph partitioning, minimize the number the vector elements to be exchanged over the GPU cluster. The compressed format applied
-
-
+\textcolor{red}{\bf Table~\ref{tab:09} shows in the second, third and fourth columns the total communication volume on a cluster of 12 GPUs by using row-by-row partitioning or hypergraph partitioning and compressed format. The total communication volume defines the total number of the vector elements exchanged between the 12 GPUs. From these columns we can see that the two heuristics, compressed format for the vectors and the hypergraph partitioning, minimize the number of vector elements to be exchanged over the GPU cluster. The compressed format allow the GPUs to exchange the needed vector elements witout any communication overheads. The hypergraph partitioning allows to split the large sparse matrices so as to minimize data dependencies between the GPU computing nodes. However, we can notice in the fourth column that the hypergraph partitioning takes longer than the computation times. As we have mentioned before, the hypergraph partitioning method is less efficient in terms of memory consumption and partitioning time than its graph counterpart. So for the applications which often use the same sparse matrices, we can perform the hypergraph partitioning only once and, then, we save the traces in files to be reused several times. Therefore, this allows us to avoid the partitioning of the sparse matrices at each resolution of the linear systems.}
-This table shows that the hypergraph partitioning allows to split the large sparse matrices so as to minimize data
-dependencies between the GPU computing nodes. However, we can notice in the fourth column that the hypergraph
-partitioning takes longer than the computation times. As we have mentioned before, the hypergraph partitioning
-method is less efficient in terms of memory consumption and partitioning time than its graph counterpart.
-So for the applications which often use the same sparse matrices, we can perform the hypergraph partitioning
-only once and, then, we save the traces in files to be reused several times. Therefore, this allows us to
-avoid the partitioning of the sparse matrices at each resolution of the linear systems.
-
-\begin{table}[!h]
+\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
\item ``That is to say, it is unclear that how much a researcher working on GPU computing but not specialized in parallel linear solver can appreciate the paper.'' \\ \\ The GMRES method is one of the widely used iterative method for solving large and sparse linear systems. As we mentioned in the paper, the techniques and optimizations that we have used for GMRES method may be applied and adapted to other iterative methods on GPU clusters.
-\item ``It will be nice if the authors can emphasize the part of their experiences/optimizations that are generally applicable to other parallel algorithms.'' \\ \\ Thank you for your comment. In fact, we have added a paragraph on that in page {\bf A PRECISER, en gros ca serait une partie lesson learns, à voir si faut la faire apparaitre en tant que telle}.
+\item ``It will be nice if the authors can emphasize the part of their experiences/optimizations that are generally applicable to other parallel algorithms.'' \\ \\ Thank you for your comment. In fact, we have added a paragraph on that in page 14. %{\bf A PRECISER, en gros ca serait une partie lesson learns, à voir si faut la faire apparaitre en tant que telle}.
%In fact, parallel linear system solving can be easy to optimized when the linear system is regular. This is the case for many applications. But for many other ones, this is not the case. When the matrix has not a regular structure, the amount of communication between processors is not the same. Another important parameter is the size bandwidth which has a huge influence on the amount of communications. In this work, we have generated matrices different kinds of matrices in order to analyze different difficulties. With the largest bandwidth as possible and with communications between all processors which is the most difficult situations, we propose to use two heuristics. Unfortunatly, there is no fast method that optimize the communication in any situation. For non linear systems of equations, there are differents algorithms but one of them consists in linearizing the systems. In this case, a linear system need to be solved. The big interest is that the matrix is the same at each step of the non linear system solving, so the partitioning method which is a time consuming step is performed once only.
\item ``Follow up on point 1). The experiment section can be enhanced as well. The numbers presented are very specific to the input matrix workload, which is generated by the authors. So it is unclear how much other researchers can benefit from it. It will be nice to focus on more detailed measuring and metrics, i.e., how to evaluate if your algorithm/optimization has maximally exploited the system capacity based on the CPU/GPU power and bandwidth available? Or is your algorithm as presented is the optimal at all?'' \\ \\ The sparse matrices that we have found in the literature are very small for our experiments and they don't allow to exploit the computing power of a GPU cluster. This is why we used a generator of large sparse matrices based on the real-world matrices of the Davis collection of the Florida university. Of course, you need to make a choice of the experiments to be performed. In this work, we have chosen different matrices with different patterns that induce either few or many communications. As explained previously, it is not possible to make an algorithm for solving linear system which is optimal in any situation. In this work we concentrated our effort on the parallelization on a GPU cluster. Nevertheless there are many variants of the GMRES method. It would be surprising that quite old methods that seemed not very interesting may not have a new interest with GPU clusters. Moreover, according to the nature of the matrix some specific solvers have been built to take advantage of these specificities. For other kinds of computing architectures, researchers did not tried to optimize all the parameters since this is too difficult and since the number of parameters is really important. So, a method that would try to optimize them would take more time than simply solving the linear system.
%The sparse matrices that we have found in the literature are very small for our experiments and they don't allow to exploit the computing power of a GPU cluster. This is why we used a generator of large sparse matrices based on the real-world matrices of the Davis collection of the Florida university. Of course, you need to make a choice of the experiments to performed. However, with we have chosen different matrices with differents patterns that induce either few or many communications. As explained previously, it is not possible to make an algorithm for solving linear system which is optimal in any situation. In this work we concentred our effort on the parallelization on a GPU cluster. Nevertheless there are hundred of variants of the GMRES method. It would be surprising that quite old methods that seemed not very interesting may not have a new interest with GPU clusters. Moreover, according to the nature of the matrix some specific solvers have been build to take advantage of these specificities. For other kinds of architectures, researchers did not tried to optimize all the parameters since this is too difficult and since the number of parameters is really important. So, a method that would try to optimize them would take more time than simply solving the linear system.
+\end{enumerate}
+\section{Some modifications in the papier}
+\begin{enumerate}
+\item In Page 2, Section 2, we added a paragraph describing the contributions of our work.
+\item In Section 4, we reduced the theory part of the GMRES method as suggested by the Reviewer 2.
+\item In Section 5.2, we clarified that the same optimizations are used on the CPU version of the parallel GMRES algorithm too.
+\item We added some results obtained from other experiments to show the influence of the communications on a GPU cluster. We computed the ratios between the compute time over the communication time and we showed the weak scaling of the different versions of the parallel GMRES algorithm.
\end{enumerate}
\end{document}