last version

[GMRES_For_Journal.git] / GMRES_Journal.tex
diff --git a/GMRES_Journal.tex b/GMRES_Journal.tex

index 8382eeae1968c38e728a5afd742bc2ceb4ab6ee9..f5f05e40c36b578540135a19da8edf536f3898ed 100644 (file)
--- a/GMRES_Journal.tex
+++ b/GMRES_Journal.tex
@@ -1,5 +1,6 @@
  \documentclass[11pt]{article}
  %\documentclass{acmconf}
  \documentclass[11pt]{article}
  %\documentclass{acmconf}
+\usepackage{multicol}
  
  \usepackage[paper=a4paper,dvips,top=1.5cm,left=1.5cm,right=1.5cm,foot=1cm,bottom=1.5cm]{geometry}
  \usepackage{times}
  
  \usepackage[paper=a4paper,dvips,top=1.5cm,left=1.5cm,right=1.5cm,foot=1cm,bottom=1.5cm]{geometry}
  \usepackage{times}
@@ -18,6 +19,7 @@
  \usepackage{url}
  \usepackage{mdwlist}
  \usepackage{multirow}
  \usepackage{url}
  \usepackage{mdwlist}
  \usepackage{multirow}
+%\usepackage{color}
  
  \date{}
  
  
  \date{}
  
@@ -79,8 +81,8 @@ with GPUs introduce new programming difficulties to adapt  parallel algorithms t
  In this paper, we use the GMRES iterative method for solving large sparse linear systems on a cluster
  of GPUs. The parallel algorithm of this method is implemented using the CUDA programming language for
  the GPUs and the MPI parallel environment to distribute the computations between the different GPU nodes
  In this paper, we use the GMRES iterative method for solving large sparse linear systems on a cluster
  of GPUs. The parallel algorithm of this method is implemented using the CUDA programming language for
  the GPUs and the MPI parallel environment to distribute the computations between the different GPU nodes
-of the cluster. Particularly, we focus on improving the performances of the parallel sparse matrix-vector
-multiplication. Indeed, this operation is not only very time-consuming but it also requires communications
+of the cluster. Particularly, we focus on improving the performances of the parallel sparse matrix-vector multiplication. 
+Indeed, this operation is not only very time-consuming but it also requires communications
  between the GPU nodes. These communications are needed to build the global vector involved in
  the parallel sparse matrix-vector multiplication. It should be noted that a communication between two
  GPU nodes involves data transfers between the GPU and CPU memories in the same node and the MPI communications
  between the GPU nodes. These communications are needed to build the global vector involved in
  the parallel sparse matrix-vector multiplication. It should be noted that a communication between two
  GPU nodes involves data transfers between the GPU and CPU memories in the same node and the MPI communications
@@ -124,6 +126,8 @@ This method firstly uses the Reverse Cuthill-McKee reordering to reduce the tota
  volume. In addition, the performances of the parallel FEM algorithm are improved by overlapping
  the communication with computation. 
  
  volume. In addition, the performances of the parallel FEM algorithm are improved by overlapping
  the communication with computation. 
  
+ Our main contribution in this work is to show the difficulties of implementing the GMRES method to solve sparse linear systems on a cluster of GPUs. First, we show the main key points of the parallel GMRES algorithm on a GPU cluster. Then, we discuss the improvements of the algorithm which are mainly performed on the sparse matrix-vector multiplication when the matrix is distributed on several GPUs. In fact, on a cluster of GPUs the influence of the communications is greater than on clusters of CPUs due to the CPU/GPU communications between two GPUs that are not on the same machines. We propose to perform a hypergraph partitioning on the problem to be solved, then we reorder the matrix columns according to the partitioning scheme, and we use a compressed format to store the vectors in order to minimize the communication overheads between two GPUs.
+
  
  %%--------------------%%
  %%      SECTION 3     %%
  
  %%--------------------%%
  %%      SECTION 3     %%
@@ -174,20 +178,15 @@ the transfer of data between the GPU and its host.
  %%--------------------%%
  \section{{GMRES} method}
  \label{sec:04}
  %%--------------------%%
  \section{{GMRES} method}
  \label{sec:04}
-The generalized minimal residual method (GMRES) is an iterative method designed by Saad
-and Schultz in 1986~\cite{Saa86}. It is a generalization of the minimal residual method
-(MNRES)~\cite{Pai75} to deal with asymmetric and non Hermitian problems and indefinite
-symmetric problems. 
+
+The generalized minimal residual method (GMRES) is an iterative method designed by Saad and Schultz in 1986~\cite{Saa86}. It is a generalization of the minimal residual method (MNRES)~\cite{Pai75} to deal with asymmetric and non Hermitian problems and indefinite symmetric problems. 
  
  Let us consider the following sparse linear system of $n$ equations:
  \begin{equation}
    Ax = b,
  \label{eq:01}
  \end{equation}
  
  Let us consider the following sparse linear system of $n$ equations:
  \begin{equation}
    Ax = b,
  \label{eq:01}
  \end{equation}
-where $A\in\mathbb{R}^{n\times n}$ is a sparse square and nonsingular matrix, $x\in\mathbb{R}^n$
-is the solution vector and $b\in\mathbb{R}^n$ is the right-hand side vector. The main idea of
-the GMRES method is to find a sequence of solutions $\{x_k\}_{k\in\mathbb{N}}$ which minimizes at
-best the residual $r_k=b-Ax_k$. The solution $x_k$ is computed in a Krylov sub-space $\mathcal{K}_k(A,v_1)$:
+where $A\in\mathbb{R}^{n\times n}$ is a sparse square and nonsingular matrix, $x\in\mathbb{R}^n$ is the solution vector and $b\in\mathbb{R}^n$ is the right-hand side vector. The main idea of the GMRES method is to find a sequence of solutions $\{x_k\}_{k\in\mathbb{N}}$ which minimizes at best the residual $r_k=b-Ax_k$. The solution $x_k$ is computed in a Krylov sub-space $\mathcal{K}_k(A,v_1)$:
  \begin{equation}
  \begin{array}{ll}
    \mathcal{K}_{k}(A,v_{1}) \equiv \text{span}\{v_{1}, Av_{1}, A^{2}v_{1},..., A^{k-1}v_{1}\}, & v_{1}=\frac{r_{0}}{\|r_{0}\|_{2}},
  \begin{equation}
  \begin{array}{ll}
    \mathcal{K}_{k}(A,v_{1}) \equiv \text{span}\{v_{1}, Av_{1}, A^{2}v_{1},..., A^{k-1}v_{1}\}, & v_{1}=\frac{r_{0}}{\|r_{0}\|_{2}},
@@ -197,71 +196,12 @@ such that the Petrov-Galerkin condition is satisfied:
  \begin{equation}
    r_{k} \perp A\mathcal{K}_{k}(A, v_{1}).
  \end{equation}
  \begin{equation}
    r_{k} \perp A\mathcal{K}_{k}(A, v_{1}).
  \end{equation}
-The GMRES method uses the Arnoldi method~\cite{Arn51} to build an orthonormal basis $V_k$
-and a Hessenberg upper matrix $\bar{H}_k$ of order $(k+1)\times k$: 
-\begin{equation}
-\begin{array}{lll}
-  V_{k}=\{v_{1},v_{2},...,v_{k}\}, & \text{such that} & \forall k>1, v_{k}=A^{k-1}v_{1},
-\end{array}
-\end{equation} 
-and 
-\begin{equation}
-  AV_{k} = V_{k+1}\bar{H}_{k}.
-\label{eq:02}
-\end{equation}
-Then at each iteration $k$, a solution $x_k$ is computed in the Krylov sub-space $\mathcal{K}_{k}$
-spanned by $V_k$ as follows: 
-\begin{equation}
-\begin{array}{ll}
-  x_{k} = x_{0} + V_{k}y, & y\in\mathbb{R}^{k}.
-\end{array}
-\label{eq:03}
-\end{equation}
-From both equations~(\ref{eq:02}) and~(\ref{eq:03}), we can deduce:
-\begin{equation}
-\begin{array}{lll}
-  r_{k} & = & b - A (x_{0} + V_{k}y) \\
-        & = & r_{0} - AV_{k}y \\
-        & = & \beta v_{1} - V_{k+1}\bar{H}_{k}y \\
-        & = & V_{k+1}(\beta e_{1} - \bar{H}_{k}y),
-\end{array}
-\end{equation}
-where $\beta=\|r_{0}\|_{2}$ and $e_{1}=(1,0,...,0)$ is the first vector of the canonical basis of
-$\mathbb{R}^{k}$. The vector $y$ is computed in $\mathbb{R}^{k}$ so as to minimize at best the Euclidean
-norm of the residual $r_{k}$. This means that the following least-squares problem of size $k$ must
-be solved: 
-\begin{equation}
-  \underset{y\in\mathbb{R}^{k}}{min}\|r_{k}\|_{2}=\underset{y\in\mathbb{R}^{k}}{min}\|\beta e_{1}-\bar{H}_{k}y\|_{2}.
-\end{equation}
-The solution of this problem is computed thanks to the QR factorization of the Hessenberg matrix 
-$\bar{H}_{k}$ by using the Givens rotations~\cite{Saa03,Saa86} such that:
-\begin{equation}
-\begin{array}{lll}
-  \bar{H}_{k}=Q_{k}R_{k}, & Q_{k}\in\mathbb{R}^{(k+1)\times (k+1)}, & R_{k}\in\mathbb{R}^{(k+1)\times k},
-\end{array}
-\end{equation}
-where $Q_{k}Q_{k}^{T}=I_{k}$ and $R_{k}$ is a upper triangular matrix.
-
-The GMRES method finds a solution at most after $n$ iterations. So, the Krylov sub-space dimension
-can increase up-to the large size $n$ of the linear system. Then in order to avoid a large storage
-of the orthonormal basis $V_{k}$, the Arnoldi process is restricted at $m\ll n$ iterations and restarted
-with the last iterate $x_{m}$ as an initial guess to the next iteration. Moreover, GMRES sometimes
-does not converge or takes too many iterations to satisfy the convergence criteria. Therefore, in
-most cases, GMRES must contain a preconditioning step to improve its convergence. The preconditioning
-technique replaces the system~(\ref{eq:01}) with the modified systems:
-\begin{equation}M^{-1}Ax=M^{-1}b.\end{equation}
-
-Algorithm~\ref{alg:01} illustrates the main key points of the GMRES method with restarts.
-The linear system to be solved in this algorithm is left-preconditioned where $M$ is the preconditioning 
-matrix. At each iteration $k$, the Arnoldi process is used (from line~$7$ to line~$17$ of algorithm~\ref{alg:01})
-to construct an orthonormal basis $V_m$ and a Hessenberg matrix $\bar{H}_m$ of order $(m+1)\times m$.
-Then, the least-squares problem is solved (line~$18$) to find the vector $y\in\mathbb{R}^m$ which minimizes
-the residual. Finally, the solution $x_m$ is computed in the Krylov sub-space spanned by $V_m$ (line~$19$).
-In practice, the GMRES algorithm stops when the Euclidean norm of the residual is small enough and/or
-the maximum number of iterations is reached. 
+
+Algorithm~\ref{alg:01} illustrates the main key points of the GMRES method with restarts. The linear system to be solved in this algorithm is left-preconditioned where $M$ is the preconditioning matrix. The Arnoldi process~\cite{Arn51} is used (from line~$7$ to line~$17$ of algorithm~\ref{alg:01}) to construct an orthonormal basis $V_m$ and a Hessenberg matrix $\bar{H}_m$ of order $(m+1)\times m$ such that $m\ll n$. Then, the least-squares problem is solved (line~$18$) to find the vector $y\in\mathbb{R}^m$ which minimizes the residual. Finally, the solution $x_m$ is computed in the Krylov sub-space spanned by $V_m$ (line~$19$). In practice, the GMRES algorithm stops when the Euclidean norm of the residual is small enough and/or the maximum number of iterations is reached.
+%%% END %%%
  
  \begin{algorithm}[!h]
  
  \begin{algorithm}[!h]
-  \SetAlgoLined
+  %\SetAlgoLined
    \Entree{$A$ (matrix), $b$ (vector), $M$ (preconditioning matrix),
  $x_{0}$ (initial guess), $\varepsilon$ (tolerance threshold), $max$ (maximum number of iterations),
  $m$ (number of iterations of the Arnoldi process)}
    \Entree{$A$ (matrix), $b$ (vector), $M$ (preconditioning matrix),
  $x_{0}$ (initial guess), $\varepsilon$ (tolerance threshold), $max$ (maximum number of iterations),
  $m$ (number of iterations of the Arnoldi process)}
@@ -460,7 +400,8 @@ of the guess solution $x_0$ is initialized to $0$ and those of the right-hand si
  initialized to $1$. For simplicity sake, we chose the matrix preconditioning $M$ as the
  main diagonal of the sparse matrix $A$. Indeed, it allows us to easily compute the required inverse
  matrix $M^{-1}$ and it provides relatively good preconditioning in most cases. Finally, we set 
  initialized to $1$. For simplicity sake, we chose the matrix preconditioning $M$ as the
  main diagonal of the sparse matrix $A$. Indeed, it allows us to easily compute the required inverse
  matrix $M^{-1}$ and it provides relatively good preconditioning in most cases. Finally, we set 
-the size of a thread-block in GPUs to $512$ threads.
+the size of a thread-block in GPUs to $512$ threads. 
+ It should be noted that the same optimizations are performed on the CPU version and on the GPU version of the parallel GMRES algorithm.
  
  \begin{table}[!h]
  \begin{center}
  
  \begin{table}[!h]
  \begin{center}
@@ -627,13 +568,20 @@ and 9) and two elements from  node 3 (elements 13 and 14). Therefore to reduce t
  of the unused vector elements, the GPU computing nodes must exchange between them only the vector elements necessary
  to perform their local sparse matrix-vector multiplications.
  
  of the unused vector elements, the GPU computing nodes must exchange between them only the vector elements necessary
  to perform their local sparse matrix-vector multiplications.
  
-\begin{figure}
+\begin{figure}[!h]
  \centering
    \includegraphics[width=120mm,keepaspectratio]{Figures/compress}
  \caption{Example of data exchanges between node 1 and its neighbors 0, 2 and 3.}
  \label{fig:05}
  \end{figure}
  
  \centering
    \includegraphics[width=120mm,keepaspectratio]{Figures/compress}
  \caption{Example of data exchanges between node 1 and its neighbors 0, 2 and 3.}
  \label{fig:05}
  \end{figure}
  
+\begin{figure}[!h]
+\centering
+  \includegraphics[width=100mm,keepaspectratio]{Figures/reorder}
+\caption{Reordering of the columns of a local sparse matrix.}
+\label{fig:06}
+\end{figure}
+
  We propose to use a compressed storage format of the sparse global vector. In Figure~\ref{fig:05}, we show an
  example of the data exchanges between node 1 and its neighbors to construct the compressed global vector.
  First, the neighboring nodes 0, 2 and 3 determine the vector elements needed by node 1 and, then, they send
  We propose to use a compressed storage format of the sparse global vector. In Figure~\ref{fig:05}, we show an
  example of the data exchanges between node 1 and its neighbors to construct the compressed global vector.
  First, the neighboring nodes 0, 2 and 3 determine the vector elements needed by node 1 and, then, they send
@@ -645,13 +593,6 @@ For performance purposes, the computation of the shared data to send to the neig
  by the GPU as a kernel. In addition, we use the MPI point-to-point communication routines: a blocking send routine
  \verb+MPI_Send()+ and a nonblocking receive routine \verb+MPI_Irecv()+.
  
  by the GPU as a kernel. In addition, we use the MPI point-to-point communication routines: a blocking send routine
  \verb+MPI_Send()+ and a nonblocking receive routine \verb+MPI_Irecv()+.
  
-\begin{figure}
-\centering
-  \includegraphics[width=100mm,keepaspectratio]{Figures/reorder}
-\caption{Reordering of the columns of a local sparse matrix.}
-\label{fig:06}
-\end{figure}
-
  Table~\ref{tab:05} shows the performances of the parallel GMRES algorithm using the compressed storage format
  of the sparse global vector. The results are obtained from solving large linear systems associated to the sparse
  band matrices presented in Table~\ref{tab:03}. We can see from Table~\ref{tab:05} that the execution times 
  Table~\ref{tab:05} shows the performances of the parallel GMRES algorithm using the compressed storage format
  of the sparse global vector. The results are obtained from solving large linear systems associated to the sparse
  band matrices presented in Table~\ref{tab:03}. We can see from Table~\ref{tab:05} that the execution times 
@@ -686,7 +627,6 @@ a cluster of 12 GPUs.}
  \end{center}
  \end{table}
  
  \end{center}
  \end{table}
  
-
  \subsection{Hypergraph partitioning}
  \label{sec:06.02}
  In this section, we use another structure of the sparse matrices. We are interested in sparse matrices
  \subsection{Hypergraph partitioning}
  \label{sec:06.02}
  In this section, we use another structure of the sparse matrices. We are interested in sparse matrices
@@ -698,14 +638,13 @@ off-diagonals on the left and right of the main diagonal. Table~\ref{tab:06} sho
  of sparse matrices of size 25 million of rows and generated from those of the Davis collection. We can
  see in the fourth column that the bandwidths of these matrices are as large as their sizes.
  
  of sparse matrices of size 25 million of rows and generated from those of the Davis collection. We can
  see in the fourth column that the bandwidths of these matrices are as large as their sizes.
  
-\begin{figure}
+\begin{figure}[h!]
  \centering
    \includegraphics[width=100mm,keepaspectratio]{Figures/generation_1}
  \caption{Example of the generation of a large sparse matrix having five bands by four computing nodes.}
  \label{fig:07}
  \end{figure}
  
  \centering
    \includegraphics[width=100mm,keepaspectratio]{Figures/generation_1}
  \caption{Example of the generation of a large sparse matrix having five bands by four computing nodes.}
  \label{fig:07}
  \end{figure}
  
-
  \begin{table}[!h]
  \begin{center}
  \begin{tabular}{|c|c|r|r|} 
  \begin{table}[!h]
  \begin{center}
  \begin{tabular}{|c|c|r|r|} 
@@ -863,44 +802,138 @@ having five bands on a cluster of 24 CPU cores vs. a cluster of 12 GPUs.}
  \end{center}
  \end{table}
  
  \end{center}
  \end{table}
  
-Table~\ref{tab:09} shows in the second and third columns the total communication volume on a cluster
-of 12 GPUs by using row-by-row partitioning and hypergraph partitioning, respectively. The total communication
-volume defines the total number of the vector elements exchanged between the 12 GPUs. This table shows
-that the hypergraph partitioning allows to split the large sparse matrices so as to minimize  data
-dependencies between the GPU computing nodes. However, we can notice in the fourth column that the hypergraph
-partitioning takes longer than the computation times. As we have mentioned before, the hypergraph partitioning
-method is less efficient in terms of memory consumption and partitioning time than its graph counterpart.
-So for the applications which often use the same sparse matrices, we can perform the hypergraph partitioning
-only once and, then, we save the traces in files to be reused several times. Therefore, this allows us to
-avoid the partitioning of the sparse matrices at each resolution of the linear systems. 
+Table~\ref{tab:09} shows in the second, third and fourth columns the total communication volume on a cluster of 12 GPUs by using row-by-row partitioning or hypergraph partitioning and compressed format. The total communication volume defines the total number of the vector elements exchanged between the 12 GPUs. From these columns we can see that the two heuristics, compressed format for the vectors and the hypergraph partitioning, minimize the number of vector elements to be exchanged over the GPU cluster. The compressed format allows the GPUs to exchange the needed vector elements witout any communication overheads. The hypergraph partitioning allows to split the large sparse matrices so as to minimize  data dependencies between the GPU computing nodes. However, we can notice in the fourth column that the hypergraph partitioning takes longer than the computation times. As we have mentioned before, the hypergraph partitioning method is less efficient in terms of memory consumption and partitioning time than its graph counterpart. So for the applications which often use the same sparse matrices, we can perform the hypergraph partitioning only once and, then, we save the traces in files to be reused several times. Therefore, this allows us to avoid the partitioning of the sparse matrices at each resolution of the linear systems.
  
  
-\begin{table}[!h]
+\begin{table}
  \begin{center}
  \begin{center}
-\begin{tabular}{|c|c|c|c|} 
+\begin{tabular}{|c|c|c|c|c|} 
  \hline
  \hline
-\multirow{4}{*}{Matrix} & Total communication    & Total communication & Time of  \\
-                        & volume using           & volume using        & hypergraph     \\
-                        & row-by-row             & hypergraph          & partitioning      \\
-                        & partitioning           & partitioning        & in minutes                \\ \hline \hline
-2cubes\_sphere          & 25 360 543             & 240 679             & 68.98         \\
-ecology2                & 26 044 002             & 73 021              & 4.92          \\
-finan512                & 26 087 431             & 900 729             & 33.72         \\
-G3\_circuit             & 31 912 003             & 5 366 774           & 11.63         \\
-shallow\_water2         & 25 105 108             & 60 899              & 5.06          \\ 
-thermal2                & 30 012 846             & 1 077 921           & 17.88         \\ \hline \hline
-cage13                  & 28 254 282             & 3 845 440           & 196.45        \\
-crashbasis              & 29 020 060             & 2 401 876           & 33.39         \\
-FEM\_3D\_thermal2       & 25 263 767             & 250 105             & 49.89         \\
-language                & 27 291 486             & 1 537 835           & 9.07          \\
-poli\_large             & 25 053 554             & 7 388 883           & 5.92          \\
-torso3                  & 25 682 514             & 613 250             & 61.51        \\ \hline       
+\multirow{3}{*}{Matrix} & Total comm. vol. & Total comm. vol. & Total comm. vol.              & Time of hypergraph \\
+                        & using row-by row & using compressed & using hypergraph partitioning & partitioning \\
+                        & partitioning     & format           & and compressed format         & in minutes \\ \hline \hline
+2cubes\_sphere          & 182 061 791      & 25 360 543       & 240 679                       & 68.98         \\
+ecology2                & 181 267 000      & 26 044 002       & 73 021                        & 4.92          \\
+finan512                & 182 090 692      & 26 087 431       & 900 729                       & 33.72         \\
+G3\_circuit             & 192 244 835      & 31 912 003       & 5 366 774                     & 11.63         \\
+shallow\_water2         & 181 729 606      & 25 105 108       & 60 899                        & 5.06          \\ 
+thermal2                & 191 350 306      & 30 012 846       & 1 077 921                     & 17.88         \\ \hline \hline
+cage13                  & 183 970 606      & 28 254 282       & 3 845 440                     & 196.45        \\
+crashbasis              & 182 931 818      & 29 020 060       & 2 401 876                     & 33.39         \\
+FEM\_3D\_thermal2       & 182 503 894      & 25 263 767       & 250 105                       & 49.89         \\
+language                & 183 055 017      & 27 291 486       & 1 537 835                     & 9.07          \\
+poli\_large             & 181 381 470      & 25 053 554       & 7 388 883                     & 5.92          \\
+torso3                  & 183 863 292      & 25 682 514       & 613 250                       & 61.51        \\ \hline       
  \end{tabular}
  \end{tabular}
-\caption{Total communication volume on a cluster of 12 GPUs using row-by-row or hypergraph partitioning methods}
+\caption{Total communication volume on a cluster of 12 GPUs using row-by-row or hypergraph partitioning methods and compressed vectors. The total communication volume is defined as the total number of vector elements exchanged between all GPUs of the cluster.}
  \label{tab:09}
  \end{center}
  \end{table}
  
  
  \label{tab:09}
  \end{center}
  \end{table}
  
  
+
+
+
+
+
+
+
+
+
+
+
+
+
+Hereafter, we show the influence of the communications on a GPU cluster compared to a CPU cluster. In Tables~\ref{tab:10},~\ref{tab:11} and~\ref{tab:12}, we compute the ratios between the computation time over the communication time of three versions of the parallel GMRES algorithm to solve sparse linear systems associated to matrices of Table~\ref{tab:06}. These tables show that the hypergraph partitioning and the compressed format of the vectors increase the ratios either on the GPU cluster or on the CPU cluster. That means that the two optimization techniques allow the minimization of the total communication volume between the computing nodes. However, we can notice that the ratios obtained on the GPU cluster are lower than those obtained on the CPU cluster. Indeed, GPUs compute faster than CPUs but with GPUs there are more communications due to CPU/GPU communications, so communications are more time-consuming while the computation time remains unchanged.
+
+\begin{table}
+\begin{center}
+\begin{tabular}{|c||c|c|c||c|c|c|} 
+\hline
+\multirow{2}{*}{Matrix} & \multicolumn{3}{c||}{GPU version} & \multicolumn{3}{c|}{CPU version}  \\ \cline{2-7}
+                  & $Time_{comput}$ & $Time_{comm}$ & $Ratio$ & $Time_{comput}$ & $Time_{comm}$ & $Ratio$ \\ \hline \hline
+2cubes\_sphere    & 37.067 s       & 1434.512 s   & {\bf 0.026}   & 312.061 s      & 3453.931 s   & {\bf 0.090}\\
+ecology2          & 4.116 s        & 501.327 s    & {\bf 0.008}   & 60.776 s       & 1216.607 s   & {\bf 0.050}\\
+finan512          & 7.170 s        & 386.742 s    & {\bf 0.019}   & 72.464 s       & 932.538 s    & {\bf 0.078}\\
+G3\_circuit       & 4.797 s        & 537.343 s    & {\bf 0.009}   & 66.011 s       & 1407.378 s   & {\bf 0.047}\\
+shallow\_water2   & 3.620 s        & 411.208 s    & {\bf 0.009}   & 51.294 s       & 973.446 s    & {\bf 0.053}\\ 
+thermal2          & 6.902 s        & 511.618 s    & {\bf 0.013}   & 77.255 s       & 1281.979 s   & {\bf 0.060}\\ \hline \hline
+cage13            & 12.837 s       & 625.175 s    & {\bf 0.021}   & 139.178 s      & 1518.349 s   & {\bf 0.092}\\
+crashbasis        & 48.532 s       & 3195.183 s   & {\bf 0.015}   & 623.686 s      & 7741.777 s   & {\bf 0.081}\\
+FEM\_3D\_thermal2 & 37.211 s       & 1584.650 s   & {\bf 0.023}   & 370.297 s      & 3810.255 s   & {\bf 0.097}\\
+language          & 22.912 s       & 2242.897 s   & {\bf 0.010}   & 286.682 s      & 5348.733 s   & {\bf 0.054}\\
+poli\_large       & 13.618 s       & 1722.304 s   & {\bf 0.008}   & 190.302 s      & 4059.642 s   & {\bf 0.047}\\
+torso3            & 74.194 s       & 4454.936 s   & {\bf 0.017}   & 190.302 s      & 10800.787 s  & {\bf 0.083}\\ \hline       
+\end{tabular}
+\caption{Ratios of the computation time over the communication time obtained from the parallel GMRES algorithm using row-by-row partitioning on 12 GPUs and 24 CPUs.}
+\label{tab:10}
+\end{center}
+\end{table}
+
+
+\begin{table}
+\begin{center}
+\begin{tabular}{|c||c|c|c||c|c|c|} 
+\hline
+\multirow{2}{*}{Matrix} & \multicolumn{3}{c||}{GPU version} & \multicolumn{3}{c|}{CPU version}  \\ \cline{2-7}
+                  & $Time_{comput}$ & $Time_{comm}$ & $Ratio$ & $Time_{comput}$ & $Time_{comm}$ & $Ratio$ \\ \hline \hline
+2cubes\_sphere    & 27.386 s       & 154.861 s   & {\bf 0.177}   & 342.255 s      & 42.100 s   & {\bf 8.130}\\
+ecology2          & 3.822 s        & 53.131 s    & {\bf 0.072}   & 69.956 s       & 15.019 s   & {\bf 4.658}\\
+finan512          & 6.366 s        & 41.155 s    & {\bf 0.155}   & 79.592 s       & 8.604 s    & {\bf 9.251}\\
+G3\_circuit       & 4.543 s        & 63.132 s    & {\bf 0.072}   & 76.540 s       & 27.371 s   & {\bf 2.796}\\
+shallow\_water2   & 3.282 s        & 43.080 s    & {\bf 0.076}   & 58.348 s       & 8.088 s    & {\bf 7.214}\\ 
+thermal2          & 5.986 s        & 57.100 s    & {\bf 0.105}   & 87.682 s       & 28.544 s   & {\bf 3.072}\\ \hline \hline
+cage13            & 10.227 s       & 70.388 s    & {\bf 0.145}   & 152.718 s      & 30.785 s   & {\bf 4.961}\\
+crashbasis        & 41.527 s       & 369.071 s   & {\bf 0.113}   & 701.040 s      & 158.916 s  & {\bf 4.411}\\
+FEM\_3D\_thermal2 & 28.691 s       & 167.140 s   & {\bf 0.172}   & 403.510 s      & 50.935 s   & {\bf 7.922}\\
+language          & 22.408 s       & 242.589 s   & {\bf 0.092}   & 333.119 s      & 64.409 s   & {\bf 5.172}\\
+poli\_large       & 13.710 s       & 179.208 s   & {\bf 0.077}   & 215.934 s      & 30.903 s   & {\bf 6.987}\\
+torso3            & 58.455 s       & 480.315 s   & {\bf 0.122}   & 993.609 s      & 152.173 s  & {\bf 6.529}\\ \hline       
+\end{tabular}
+\caption{Ratios of the computation time over the communication time obtained from the parallel GMRES algorithm using row-by-row partitioning and compressed format for vectors on 12 GPUs and 24 CPUs.}
+\label{tab:11}
+\end{center}
+\end{table}
+
+
+\begin{table}
+\begin{center}
+\begin{tabular}{|c||c|c|c||c|c|c|} 
+\hline
+\multirow{2}{*}{Matrix} & \multicolumn{3}{c||}{GPU version} & \multicolumn{3}{c|}{CPU version}  \\ \cline{2-7}
+                  & $Time_{comput}$ & $Time_{comm}$ & $Ratio$ & $Time_{comput}$ & $Time_{comm}$ & $Ratio$ \\ \hline \hline
+2cubes\_sphere    & 28.440 s       & 7.768 s      & {\bf 3.661}   & 327.109 s      & 63.788 s   & {\bf 5.128}\\
+ecology2          & 3.652 s        & 0.757 s      & {\bf 4.823}   & 63.632 s       & 13.520 s   & {\bf 4.707}\\
+finan512          & 7.579 s        & 4.569 s      & {\bf 1.659}   & 74.120 s       & 22.505 s   & {\bf 3.294}\\
+G3\_circuit       & 4.876 s        & 8.745 s      & {\bf 0.558}   & 72.280 s       & 28.395 s   & {\bf 2.546}\\
+shallow\_water2   & 3.146 s        & 0.606 s      & {\bf 5.191}   & 52.903 s       & 11.177 s   & {\bf 4.733}\\ 
+thermal2          & 6.473 s        & 4.325 s      & {\bf 1.497}   & 81.171 s       & 20.907 s   & {\bf 3.882}\\ \hline \hline
+cage13            & 11.676 s       & 7.723 s      & {\bf 1.512}   & 145.755 s      & 46.547 s   & {\bf 3.131}\\
+crashbasis        & 42.799 s       & 29.399 s     & {\bf 1.456}   & 650.386 s      & 203.918 s  & {\bf 3.189}\\
+FEM\_3D\_thermal2 & 29.875 s       & 8.915 s      & {\bf 3.351}   & 382.887 s      & 93.252 s   & {\bf 4.106}\\
+language          & 20.991 s       & 11.197 s     & {\bf 1.875}   & 310.679 s      & 82.480 s   & {\bf 3.767}\\
+poli\_large       & 13.817 s       & 102.760 s    & {\bf 0.134}   & 197.508 s      & 151.672 s  & {\bf 1.302}\\
+torso3            & 57.469 s       & 16.828 s     & {\bf 3.415}   & 926.588 s      & 242.721 s  & {\bf 3.817}\\ \hline       
+\end{tabular}
+\caption{Ratios of the computation time over the communication time obtained from the parallel GMRES algorithm using hypergraph partitioning and compressed format for vectors on 12 GPUs and 24 CPUs.}
+\label{tab:12}
+\end{center}
+\end{table}
+
+\begin{figure}
+\centering
+  \includegraphics[width=120mm,keepaspectratio]{weak}
+\caption{Weak scaling of the parallel GMRES algorithm on a GPU cluster.}
+\label{fig:09}
+\end{figure}
+
+ Figure~\ref{fig:09} presents the weak scaling of four versions of the parallel GMRES algorithm on a GPU cluster. We fixed the size of a sub-matrix to 5 million of rows per GPU computing node. We used matrices having five bands generated from the symmetric matrix thermal2. This figure shows that the parallel GMRES algorithm, in its naive version or using either the compression format for vectors or the hypergraph partitioning, is not scalable on a GPU cluster due to the large amount of communications between GPUs. In contrast, we can see that the algorithm using both optimization techniques is fairly scalable. That means that in this version the cost of communications is relatively constant regardless the number of computing nodes in the cluster.\\
+
+ Finally, as far as we are concerned, the parallel solving of a linear system can be easy to optimize when the associated matrix is regular. This is unfortunately not the case for many real-world applications. When the matrix has an irregular structure, the amount of communication between processors is not the same. Another important parameter is the size of the matrix bandwidth which has a huge influence on the amount of communication. In this work, we have generated different kinds of matrices in order to analyze several difficulties. With a bandwidth as large as possible, involving communications between all processors, which is the most difficult situation, we proposed to use two heuristics. Unfortunately, there is no fast method that optimizes the communication in any situation. For systems of non linear equations, there are different algorithms but most of them consist in linearizing the system of equations. In this case, a linear system needs to be solved. The big interest is that the matrix is the same at each step of the non linear system solving, so the partitioning method which is a time consuming step is performed only once.
+
+
+ 
+Another very important issue, which might be ignored by too many people, is that the communications have a greater influence on a cluster of GPUs than on a cluster of CPUs. There are two reasons for that. The first one comes from the fact that with a cluster of GPUs, the CPU/GPU data transfers slow down communications between two GPUs that are not on the same machines. The second one is due to the fact that with GPUs the ratio of the computation time over the communication time decreases since the computation time is reduced. So the impact of the communications between GPUs might be a very important issue that can limit the scalability of a parallel algorithm.
+
  %%--------------------%%
  %%      SECTION 7     %%
  %%--------------------%%
  %%--------------------%%
  %%      SECTION 7     %%
  %%--------------------%%
@@ -933,7 +966,7 @@ and to improve the performances of the parallel GMRES algorithm as the multispli
  The recent GPU hardware and software architectures provide the GPU-Direct system which allows
  two GPUs, placed in the same machine or in two remote machines, to exchange data without using
  CPUs. This improves the data transfers between GPUs. Finally, it would be interesting to implement
  The recent GPU hardware and software architectures provide the GPU-Direct system which allows
  two GPUs, placed in the same machine or in two remote machines, to exchange data without using
  CPUs. This improves the data transfers between GPUs. Finally, it would be interesting to implement
-other iterative methods on GPU clusters for solving large sparse linear or nonlinear systems.
+other iterative methods on GPU clusters for solving large sparse linear or non linear systems.
  
  \paragraph{Acknowledgments}
  This paper is based upon work supported by the R\'egion de Franche-Comt\'e.
  
  \paragraph{Acknowledgments}
  This paper is based upon work supported by the R\'egion de Franche-Comt\'e.