Commit e-Offprint provided by Springer.

[GMRES_For_Journal.git] / GMRES_Journal.tex
diff --git a/GMRES_Journal.tex b/GMRES_Journal.tex

index 696ccf14bac3a1e474b107ee6551c67343848259..3add2567e3fcbc522d7296230a2f6da9121bca8b 100644 (file)
--- a/GMRES_Journal.tex
+++ b/GMRES_Journal.tex
@@ -151,7 +151,7 @@ read-only \emph{constant} and \emph{texture} caches per multiprocessor and a rea
  \emph{global memory} shared by all its multiprocessors. The new architectures (Fermi, Kepler,
  etc) have also L1 and L2 caches to improve the accesses to the global memory.
  
  \emph{global memory} shared by all its multiprocessors. The new architectures (Fermi, Kepler,
  etc) have also L1 and L2 caches to improve the accesses to the global memory.
  
-NVIDIA has released the CUDA platform (Compute Unified Device Architecture)~\cite{Nvi10}
+NVIDIA has released the CUDA platform (Compute Unified Device Architecture)~\cite{ref19}
  which provides a high level GPGPU-based programming language (General-Purpose computing
  on GPUs), allowing to program GPUs for general purpose computations. In CUDA programming
  environment, all data-parallel and compute intensive portions of an application running
  which provides a high level GPGPU-based programming language (General-Purpose computing
  on GPUs), allowing to program GPUs for general purpose computations. In CUDA programming
  environment, all data-parallel and compute intensive portions of an application running
@@ -209,7 +209,7 @@ such that the Petrov-Galerkin condition is satisfied:
  Algorithm~\ref{alg:01} illustrates the main key points of the GMRES method with restarts. The linear system to be solved in this algorithm is left-preconditioned where $M$ is the preconditioning matrix. The Arnoldi process~\cite{Arn51} is used (from line~$7$ to line~$17$ of algorithm~\ref{alg:01}) to construct an orthonormal basis $V_m$ and a Hessenberg matrix $\bar{H}_m$ of order $(m+1)\times m$ such that $m\ll n$. Then, the least-squares problem is solved (line~$18$) to find the vector $y\in\mathbb{R}^m$ which minimizes the residual. Finally, the solution $x_m$ is computed in the Krylov sub-space spanned by $V_m$ (line~$19$). In practice, the GMRES algorithm stops when the Euclidean norm of the residual is small enough and/or the maximum number of iterations is reached.
  %%% END %%%
  
  Algorithm~\ref{alg:01} illustrates the main key points of the GMRES method with restarts. The linear system to be solved in this algorithm is left-preconditioned where $M$ is the preconditioning matrix. The Arnoldi process~\cite{Arn51} is used (from line~$7$ to line~$17$ of algorithm~\ref{alg:01}) to construct an orthonormal basis $V_m$ and a Hessenberg matrix $\bar{H}_m$ of order $(m+1)\times m$ such that $m\ll n$. Then, the least-squares problem is solved (line~$18$) to find the vector $y\in\mathbb{R}^m$ which minimizes the residual. Finally, the solution $x_m$ is computed in the Krylov sub-space spanned by $V_m$ (line~$19$). In practice, the GMRES algorithm stops when the Euclidean norm of the residual is small enough and/or the maximum number of iterations is reached.
  %%% END %%%
  
-\begin{algorithm}[!h]
+\begin{algorithm}[!ht]
    \newcommand{\Convergence}{\mathit{convergence}}
    \newcommand{\False}{\mathit{false}}
    \newcommand{\True}{\mathit{true}}
    \newcommand{\Convergence}{\mathit{convergence}}
    \newcommand{\False}{\mathit{false}}
    \newcommand{\True}{\mathit{true}}
@@ -357,7 +357,7 @@ is composed of $240$ cores running at $1.3$ GHz and has $4$~GB of global memory
  of $102$~GB/s. The GPU is connected to the CPU via a PCI-Express 16x Gen2.0 with a throughput of $8$~GB/s.
  Figure~\ref{fig:02} shows the general scheme of the GPU cluster.
  
  of $102$~GB/s. The GPU is connected to the CPU via a PCI-Express 16x Gen2.0 with a throughput of $8$~GB/s.
  Figure~\ref{fig:02} shows the general scheme of the GPU cluster.
  
-\begin{figure}[!h]
+\begin{figure}[!ht]
  \centering
    \includegraphics[width=80mm,keepaspectratio]{Figures/clusterGPU}
  \caption{A cluster composed of six machines, each equipped with two Tesla C1060 GPUs}
  \centering
    \includegraphics[width=80mm,keepaspectratio]{Figures/clusterGPU}
  \caption{A cluster composed of six machines, each equipped with two Tesla C1060 GPUs}
@@ -415,7 +415,7 @@ matrix $M^{-1}$ and it provides relatively good preconditioning in most cases. F
  the size of a thread-block in GPUs to $512$ threads. 
   It should be noted that the same optimizations are performed on the CPU version and on the GPU version of the parallel GMRES algorithm.
  
  the size of a thread-block in GPUs to $512$ threads. 
   It should be noted that the same optimizations are performed on the CPU version and on the GPU version of the parallel GMRES algorithm.
  
-\begin{table}[!h]
+\begin{table}[!ht]
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
@@ -504,7 +504,7 @@ that for solving large sparse matrices the GPU cluster is more efficient (about
  cluster. The computing power of the GPUs allows to accelerate the computation of the functions operating
  on large vectors of the parallel GMRES algorithm.
  
  cluster. The computing power of the GPUs allows to accelerate the computation of the functions operating
  on large vectors of the parallel GMRES algorithm.
  
-\begin{table}[!h]
+\begin{table}[!ht]
  \begin{center}
  \begin{tabular}{|c|c|r|r|} 
  \hline
  \begin{center}
  \begin{tabular}{|c|c|r|r|} 
  \hline
@@ -528,7 +528,7 @@ Matrix type                 & Name              & \# nonzeros & Bandwidth \\ \hl
  \end{table}
  
  
  \end{table}
  
  
-\begin{table}[!h]
+\begin{table}[!ht]
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
@@ -579,14 +579,14 @@ and 9) and two elements from  node 3 (elements 13 and 14). Therefore to reduce t
  of the unused vector elements, the GPU computing nodes must exchange between them only the vector elements necessary
  to perform their local sparse matrix-vector multiplications.
  
  of the unused vector elements, the GPU computing nodes must exchange between them only the vector elements necessary
  to perform their local sparse matrix-vector multiplications.
  
-\begin{figure}[!h]
+\begin{figure}[!ht]
  \centering
    \includegraphics[width=120mm,keepaspectratio]{Figures/compress}
  \caption{Example of data exchanges between node 1 and its neighbors 0, 2 and 3.}
  \label{fig:05}
  \end{figure}
  
  \centering
    \includegraphics[width=120mm,keepaspectratio]{Figures/compress}
  \caption{Example of data exchanges between node 1 and its neighbors 0, 2 and 3.}
  \label{fig:05}
  \end{figure}
  
-\begin{figure}[!h]
+\begin{figure}[!ht]
  \centering
    \includegraphics[width=100mm,keepaspectratio]{Figures/reorder}
  \caption{Reordering of the columns of a local sparse matrix.}
  \centering
    \includegraphics[width=100mm,keepaspectratio]{Figures/reorder}
  \caption{Reordering of the columns of a local sparse matrix.}
@@ -613,7 +613,7 @@ and those spent on the cluster of 12 GPUs have increased. Indeed, the reordering
  the use of a compressed storage format for the sparse vectors minimize the communication overheads between the
  GPU computing nodes.
  
  the use of a compressed storage format for the sparse vectors minimize the communication overheads between the
  GPU computing nodes.
  
-\begin{table}[!h]
+\begin{table}[!ht]
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
@@ -649,14 +649,14 @@ off-diagonals on the left and right of the main diagonal. Table~\ref{tab:06} sho
  of sparse matrices of size 25 million of rows and generated from those of the Davis collection. We can
  see in the fourth column that the bandwidths of these matrices are as large as their sizes.
  
  of sparse matrices of size 25 million of rows and generated from those of the Davis collection. We can
  see in the fourth column that the bandwidths of these matrices are as large as their sizes.
  
-\begin{figure}[h!]
+\begin{figure}[!ht]
  \centering
    \includegraphics[width=100mm,keepaspectratio]{Figures/generation_1}
  \caption{Example of the generation of a large sparse matrix having five bands by four computing nodes.}
  \label{fig:07}
  \end{figure}
  
  \centering
    \includegraphics[width=100mm,keepaspectratio]{Figures/generation_1}
  \caption{Example of the generation of a large sparse matrix having five bands by four computing nodes.}
  \label{fig:07}
  \end{figure}
  
-\begin{table}[!h]
+\begin{table}[!ht]
  \begin{center}
  \begin{tabular}{|c|c|r|r|} 
  \hline
  \begin{center}
  \begin{tabular}{|c|c|r|r|} 
  \hline
@@ -688,7 +688,7 @@ In fact, the naive partitioning row-by-row or column-by-column of this type of s
  a GPU node to many neighboring nodes and produces a significant number of data dependencies between
  the different GPU nodes.   
  
  a GPU node to many neighboring nodes and produces a significant number of data dependencies between
  the different GPU nodes.   
  
-\begin{table}[!h]
+\begin{table}[!ht]
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
@@ -788,7 +788,7 @@ We can notice from Table~\ref{tab:08} that the execution times on the cluster of
  improved compared to those presented in Table~\ref{tab:07}. The hypergraph partitioning applied on the large
  sparse matrices having large bandwidths have improved the execution times on the GPU cluster by about 65\%.
  
  improved compared to those presented in Table~\ref{tab:07}. The hypergraph partitioning applied on the large
  sparse matrices having large bandwidths have improved the execution times on the GPU cluster by about 65\%.
  
-\begin{table}[!h]
+\begin{table}[!ht]
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
@@ -819,8 +819,8 @@ Table~\ref{tab:09} shows in the second, third and fourth columns the total commu
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|} 
  \hline
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|} 
  \hline
-\multirow{3}{*}{Matrix} & Total comm. vol. & Total comm. vol. & Total comm. vol.              & Time of hypergraph \\
-                        & using row-by row & using compressed & using hypergraph partitioning & partitioning \\
+\multirow{3}{*}{Matrix} & Total comm. vol. & Total comm. vol. & Total comm. vol. using        & Time of hypergraph \\
+                        & using row-by row & using compressed & hypergraph partitioning       & partitioning \\
                          & partitioning     & format           & and compressed format         & in minutes \\ \hline \hline
  2cubes\_sphere          & 182 061 791      & 25 360 543       & 240 679                       & 68.98         \\
  ecology2                & 181 267 000      & 26 044 002       & 73 021                        & 4.92          \\
                          & partitioning     & format           & and compressed format         & in minutes \\ \hline \hline
  2cubes\_sphere          & 182 061 791      & 25 360 543       & 240 679                       & 68.98         \\
  ecology2                & 181 267 000      & 26 044 002       & 73 021                        & 4.92          \\
@@ -856,7 +856,7 @@ torso3                  & 183 863 292      & 25 682 514       & 613 250
  
  Hereafter, we show the influence of the communications on a GPU cluster compared to a CPU cluster. In Tables~\ref{tab:10},~\ref{tab:11} and~\ref{tab:12}, we compute the ratios between the computation time over the communication time of three versions of the parallel GMRES algorithm to solve sparse linear systems associated to matrices of Table~\ref{tab:06}. These tables show that the hypergraph partitioning and the compressed format of the vectors increase the ratios either on the GPU cluster or on the CPU cluster. That means that the two optimization techniques allow the minimization of the total communication volume between the computing nodes. However, we can notice that the ratios obtained on the GPU cluster are lower than those obtained on the CPU cluster. Indeed, GPUs compute faster than CPUs but with GPUs there are more communications due to CPU/GPU communications, so communications are more time-consuming while the computation time remains unchanged. Furthermore, we can notice that the GPU computation times on Tables~\ref{tab:11} and~\ref{tab:12} are about 10\% lower than those on Table~\ref{tab:10}. Indeed, the compression of the vectors and the reordering of matrix columns allow to perform coalesced accesses to the GPU memory and thus accelerate the sparse matrix-vector multiplication.  
  
  
  Hereafter, we show the influence of the communications on a GPU cluster compared to a CPU cluster. In Tables~\ref{tab:10},~\ref{tab:11} and~\ref{tab:12}, we compute the ratios between the computation time over the communication time of three versions of the parallel GMRES algorithm to solve sparse linear systems associated to matrices of Table~\ref{tab:06}. These tables show that the hypergraph partitioning and the compressed format of the vectors increase the ratios either on the GPU cluster or on the CPU cluster. That means that the two optimization techniques allow the minimization of the total communication volume between the computing nodes. However, we can notice that the ratios obtained on the GPU cluster are lower than those obtained on the CPU cluster. Indeed, GPUs compute faster than CPUs but with GPUs there are more communications due to CPU/GPU communications, so communications are more time-consuming while the computation time remains unchanged. Furthermore, we can notice that the GPU computation times on Tables~\ref{tab:11} and~\ref{tab:12} are about 10\% lower than those on Table~\ref{tab:10}. Indeed, the compression of the vectors and the reordering of matrix columns allow to perform coalesced accesses to the GPU memory and thus accelerate the sparse matrix-vector multiplication.  
  
-\begin{table}
+\begin{table}[p]
  \begin{center}
  \begin{tabular}{|c||c|c|c||c|c|c|} 
  \hline
  \begin{center}
  \begin{tabular}{|c||c|c|c||c|c|c|} 
  \hline
@@ -881,7 +881,7 @@ torso3            & 74.194 s       & 4454.936 s   & {\bf 0.017}   & 897.440 s
  \end{table}
  
  
  \end{table}
  
  
-\begin{table}
+\begin{table}[p]
  \begin{center}
  \begin{tabular}{|c||c|c|c||c|c|c|} 
  \hline
  \begin{center}
  \begin{tabular}{|c||c|c|c||c|c|c|} 
  \hline
@@ -906,7 +906,7 @@ torso3            & 58.455 s       & 480.315 s   & {\bf 0.122}   & 993.609 s
  \end{table}
  
  
  \end{table}
  
  
-\begin{table}
+\begin{table}[p]
  \begin{center}
  \begin{tabular}{|c||c|c|c||c|c|c|} 
  \hline
  \begin{center}
  \begin{tabular}{|c||c|c|c||c|c|c|} 
  \hline
@@ -937,7 +937,9 @@ torso3            & 57.469 s       & 16.828 s     & {\bf 3.415}   & 926.588 s
  \label{fig:09}
  \end{figure}
  
  \label{fig:09}
  \end{figure}
  
- Figure~\ref{fig:09} presents the weak scaling of four versions of the parallel GMRES algorithm on a GPU cluster. We fixed the size of a sub-matrix to 5 million of rows per GPU computing node. We used matrices having five bands generated from the symmetric matrix thermal2. This figure shows that the parallel GMRES algorithm, in its naive version or using either the compression format for vectors or the hypergraph partitioning, is not scalable on a GPU cluster due to the large amount of communications between GPUs. In contrast, we can see that the algorithm using both optimization techniques is fairly scalable. That means that in this version the cost of communications is relatively constant regardless the number of computing nodes in the cluster.\\
+ Figure~\ref{fig:09} presents the weak scaling of four versions of the parallel GMRES algorithm on a GPU cluster. We fixed the size of a sub-matrix to 5 million of rows per GPU computing node. We used matrices having five bands generated from the symmetric matrix thermal2. This figure shows that the parallel GMRES algorithm, in its naive version or using either the compression format for vectors or the hypergraph partitioning, is not scalable on a GPU cluster due to the large amount of communications between GPUs. In contrast, we can see that the algorithm using both optimization techniques is fairly scalable. That means that in this version the cost of communications is relatively constant regardless the number of computing nodes in the cluster.
+
+\bigskip
  
   Finally, as far as we are concerned, the parallel solving of a linear system can be easy to optimize when the associated matrix is regular. This is unfortunately not the case for many real-world applications. When the matrix has an irregular structure, the amount of communications between processors is not the same. Another important parameter is the size of the matrix bandwidth which has a huge influence on the amount of communications. In this work, we have generated different kinds of matrices in order to analyze several difficulties. With a bandwidth as large as possible, involving communications between all processors, which is the most difficult situation, we proposed to use two heuristics. Unfortunately, there is no fast method that optimizes the communications in any situation. For systems of non linear equations, there are different algorithms but most of them consist in linearizing the system of equations. In this case, a linear system needs to be solved. The big interest is that the matrix is the same at each step of the non linear system solving, so the partitioning method which is a time consuming step is performed only once.
  
  
   Finally, as far as we are concerned, the parallel solving of a linear system can be easy to optimize when the associated matrix is regular. This is unfortunately not the case for many real-world applications. When the matrix has an irregular structure, the amount of communications between processors is not the same. Another important parameter is the size of the matrix bandwidth which has a huge influence on the amount of communications. In this work, we have generated different kinds of matrices in order to analyze several difficulties. With a bandwidth as large as possible, involving communications between all processors, which is the most difficult situation, we proposed to use two heuristics. Unfortunately, there is no fast method that optimizes the communications in any situation. For systems of non linear equations, there are different algorithms but most of them consist in linearizing the system of equations. In this case, a linear system needs to be solved. The big interest is that the matrix is the same at each step of the non linear system solving, so the partitioning method which is a time consuming step is performed only once.