X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/4c5e6c1725249ae02b156277ef750a43f5d6144b..c8b308b236018000b34d6d58d5bdc4cb8313111b:/BookGPU/Chapters/chapter12/ch12.tex?ds=sidebyside diff --git a/BookGPU/Chapters/chapter12/ch12.tex b/BookGPU/Chapters/chapter12/ch12.tex index eaa4f9c..7495859 100755 --- a/BookGPU/Chapters/chapter12/ch12.tex +++ b/BookGPU/Chapters/chapter12/ch12.tex @@ -38,11 +38,8 @@ traditional CPUs. In Section~\ref{ch12:sec:02}, we describe the general principle of two well-known iterative methods: the conjugate gradient method and the generalized minimal residual method. In Section~\ref{ch12:sec:03}, we give the main key points of the parallel implementation of both methods on a cluster of -GPUs. Then, in Section~\ref{ch12:sec:04}, we present the experimental results obtained on a -CPU cluster and on a GPU cluster, for solving sparse linear systems associated to matrices -of different structures. Finally, in Section~\ref{ch12:sec:05}, we apply the hypergraph partitioning -technique to reduce the total communication volume between the computing nodes and, thus, -to improve the execution times of the parallel algorithms of both iterative methods. +GPUs. Finally, in Section~\ref{ch12:sec:04}, we present the experimental results obtained on a +CPU cluster and on a GPU cluster, for solving large sparse linear systems. %%--------------------------%% @@ -200,7 +197,7 @@ $maxiter$ are reached. %%****************%% \subsection{GMRES method} \label{ch12:sec:02.02} -The iterative GMRES method is developed by Saad and Schultz in 1986~\cite{ch12:ref3} as a generalization +The iterative GMRES method was developed by Saad and Schultz in 1986~\cite{ch12:ref3} as a generalization of the minimum residual method MINRES~\cite{ch12:ref4}\index{Iterative~method!MINRES}. Indeed, GMRES can be applied for solving symmetric or nonsymmetric linear systems. @@ -231,7 +228,7 @@ V_k = \{v_1, v_2,\ldots,v_k\}, & \forall k>1, v_k=A^{k-1}v_1, \end{equation} and \begin{equation} -V_k A = V_{k+1} \bar{H}_k. +A V_k = V_{k+1} \bar{H}_k. \label{ch12:eq:15} \end{equation} @@ -512,12 +509,6 @@ Tesla C1060 GPU contains $240$ cores running at $1.3$GHz and providing a global a memory bandwidth of $102$GB/s. Figure~\ref{ch12:fig:04} shows the general scheme of the GPU cluster\index{GPU~cluster} that we used in the experimental tests. -\begin{figure} -\centerline{\includegraphics[scale=0.25]{Chapters/chapter12/figures/cluster}} -\caption{General scheme of the GPU cluster of tests composed of six machines, each with two GPUs.} -\label{ch12:fig:04} -\end{figure} - Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for coding the parallel algorithms of both methods on the GPU cluster. CUDA version 4.0~\cite{ch12:ref9} is used for programming GPUs, using CUBLAS library~\cite{ch12:ref6} to deal with vector operations @@ -525,6 +516,12 @@ in GPUs and, finally, MPI routines of OpenMPI 1.3.3 are used to carry out the co CPU cores. Indeed, the experiments are done on a cluster of $12$ computing nodes, where each node is managed by a MPI process and it is composed of one CPU core and one GPU card. +\begin{figure}[!h] +\centerline{\includegraphics[scale=0.25]{Chapters/chapter12/figures/cluster}} +\caption{General scheme of the GPU cluster of tests composed of six machines, each with two GPUs.} +\label{ch12:fig:04} +\end{figure} + All tests are made on double-precision floating point operations. The parameters of both linear solvers are initialized as follows: the residual tolerance threshold $\varepsilon=10^{-12}$, the maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$ and the @@ -536,17 +533,9 @@ not too ill-conditioned matrices. In the GPU computing, the size of thread block threads. Finally, the performance results, presented hereafter, are obtained from the mean value over $10$ executions of the same parallel linear solver and for the same input data. -To get more realistic results, we tested the CG and GMRES algorithms on sparse matrices of the Davis's -collection~\cite{ch12:ref10}, that arise in a wide spectrum of real-world applications. We chose six -symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{ch12:fig:05}, -we show structures of these matrices and in Table~\ref{ch12:tab:01} we present their main characteristics -which are the number of rows, the total number of nonzero values (nnz) and the maximal bandwidth. In -the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating -the first and the last nonzero value on a matrix row. - \begin{figure} \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/matrices}} -\caption{Sketches of sparse matrices chosen from the Davis's collection.} +\caption{Sketches of sparse matrices chosen from the Davis collection.} \label{ch12:fig:05} \end{figure} @@ -580,11 +569,18 @@ the first and the last nonzero value on a matrix row. & torso3 & $259,156$ & $4,429,042$ & $216,854$ \\ \hline \end{tabular} -\vspace{0.5cm} -\caption{Main characteristics of sparse matrices chosen from the Davis's collection.} +\caption{Main characteristics of sparse matrices chosen from the Davis collection.} \label{ch12:tab:01} \end{table} +To get more realistic results, we tested the CG and GMRES algorithms on sparse matrices of the Davis +collection~\cite{ch12:ref10}, that arise in a wide spectrum of real-world applications. We chose six +symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{ch12:fig:05}, +we show structures of these matrices and in Table~\ref{ch12:tab:01} we present their main characteristics +which are the number of rows, the total number of nonzero values (nnz) and the maximal bandwidth. In +the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating +the first and the last nonzero value on a matrix row. + \begin{table} \begin{center} \begin{tabular}{|c|c|c|c|c|c|c|} @@ -608,7 +604,7 @@ thermal2 & $1.172s$ & $0.622s$ & $1.88$ & $ \end{center} \end{table} -\begin{table}[!h] +\begin{table} \begin{center} \begin{tabular}{|c|c|c|c|c|c|c|} \hline @@ -675,19 +671,19 @@ of the solution $x^{gpu}$. Thus, we can see that the solutions obtained on the G were computed with a sufficient accuracy (about $10^{-10}$) and they are, more or less, equivalent to those computed on the CPU cluster with a small difference ranging from $10^{-10}$ to $10^{-26}$. However, we can notice from the relative gains $\tau$ that is not interesting to use multiple -GPUs for solving small sparse linear systems. in fact, a small sparse matrix does not allow to +GPUs for solving small sparse linear systems. In fact, a small sparse matrix does not allow to maximize utilization of GPU cores. In addition, the communications required to synchronize the computations over the cluster increase the idle times of GPUs and slow down further the parallel computations. Consequently, in order to test the performances of the parallel solvers, we developed in C programming -language a generator of large sparse matrices. This generator takes a matrix from the Davis's collection~\cite{ch12:ref10} +language a generator of large sparse matrices. This generator takes a matrix from the Davis collection~\cite{ch12:ref10} as an initial matrix to construct large sparse matrices exceeding ten million of rows. It must be executed in parallel by the MPI processes of the computing nodes, so that each process could construct its sparse sub-matrix. In first experimental tests, we are focused on sparse matrices having a banded structure, because they are those arise in the most of numerical problems. So to generate the global sparse matrix, each MPI process constructs its sub-matrix by performing several copies of an initial sparse matrix chosen -from the Davis's collection. Then, it puts all these copies on the main diagonal of the global matrix +from the Davis collection. Then, it puts all these copies on the main diagonal of the global matrix (see Figure~\ref{ch12:fig:06}). Moreover, the empty spaces between two successive copies in the main diagonal are filled with sub-copies (left-copy and right-copy in Figure~\ref{ch12:fig:06}) of the same initial matrix. @@ -729,7 +725,7 @@ initial matrix. & torso3 & $433,795,264$ & $328,757$ \\ \hline \end{tabular} \vspace{0.5cm} -\caption{Main characteristics of sparse banded matrices generated from those of the Davis's collection.} +\caption{Main characteristics of sparse banded matrices generated from those of the Davis collection.} \label{ch12:tab:04} \end{table} @@ -746,7 +742,7 @@ CG method is characterized by a better convergence\index{Convergence} rate and a time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES method requires more data exchanges between computing nodes compared to the parallel CG method. -\begin{table}[!h] +\begin{table} \begin{center} \begin{tabular}{|c|c|c|c|c|c|c|} \hline @@ -770,7 +766,7 @@ on a cluster of 12 GPUs.} \end{center} \end{table} -\begin{table}[!h] +\begin{table} \begin{center} \begin{tabular}{|c|c|c|c|c|c|c|} \hline @@ -806,390 +802,11 @@ on a cluster of 12 GPUs.} \end{center} \end{table} - %%--------------------------%% %% SECTION 5 %% %%--------------------------%% -\section{Hypergraph partitioning} -\label{ch12:sec:05} -In this section, we present the performances of both parallel CG and GMRES solvers for solving linear -systems associated to sparse matrices having large bandwidths. Indeed, we are interested on sparse -matrices having the nonzero values distributed along their bandwidths. - -\begin{figure} -\centerline{\includegraphics[scale=0.22]{Chapters/chapter12/figures/generation_1}} -\caption{Parallel generation of a large sparse five-bands matrix by four computing nodes.} -\label{ch12:fig:07} -\end{figure} - -\begin{table}[!h] -\begin{center} -\begin{tabular}{|c|c|c|c|} -\hline -{\bf Matrix type} & {\bf Matrix name} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline - -\multirow{6}{*}{Symmetric} & 2cubes\_sphere & $829,082,728$ & $24,999,999$ \\ - - & ecology2 & $254,892,056$ & $25,000,000$ \\ - - & finan512 & $556,982,339$ & $24,999,973$ \\ - - & G3\_circuit & $257,982,646$ & $25,000,000$ \\ - - & shallow\_water2 & $200,798,268$ & $25,000,000$ \\ - - & thermal2 & $359,340,179$ & $24,999,998$ \\ \hline \hline - -\multirow{6}{*}{Nonsymmetric} & cage13 & $879,063,379$ & $24,999,998$ \\ - - & crashbasis & $820,373,286$ & $24,999,803$ \\ - - & FEM\_3D\_thermal2 & $1,194,012,703$ & $24,999,998$ \\ - - & language & $155,261,826$ & $24,999,492$ \\ - - & poli\_large & $106,680,819$ & $25,000,000$ \\ - - & torso3 & $872,029,998$ & $25,000,000$\\ \hline -\end{tabular} -\caption{Main characteristics of sparse five-bands matrices generated from those of the Davis's collection.} -\label{ch12:tab:07} -\end{center} -\end{table} - -We have developed in C programming language a generator of large sparse matrices -having five bands distributed along their bandwidths (see Figure~\ref{ch12:fig:07}). -The principle of this generator is equivalent to that in Section~\ref{ch12:sec:04}. -However, the copies performed on the initial matrix (chosen from the Davis's collection) -are placed on the main diagonal and on four off-diagonals, two on the right and two -on the left of the main diagonal. Figure~\ref{ch12:fig:07} shows an example of a -generation of a sparse five-bands matrix by four computing nodes. Table~\ref{ch12:tab:07} -shows the main characteristics of sparse five-bands matrices generated from those -presented in Table~\ref{ch12:tab:01} and associated to linear systems of $25$ million -unknown values. - -\begin{table}[!h] -\begin{center} -\begin{tabular}{|c|c|c|c|c|c|c|} -\hline -{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline - -2cubes\_sphere & $6.041s$ & $3.338s$ & $1.81$ & $30$ & $6.77e$-$11$ & $3.25e$-$19$ \\ - -ecology2 & $1.404s$ & $1.301s$ & $1.08$ & $13$ & $5.22e$-$11$ & $2.17e$-$18$ \\ - -finan512 & $1.822s$ & $1.299s$ & $1.40$ & $12$ & $3.52e$-$11$ & $3.47e$-$18$ \\ - -G3\_circuit & $2.331s$ & $2.129s$ & $1.09$ & $15$ & $1.36e$-$11$ & $5.20e$-$18$ \\ - -shallow\_water2 & $0.541s$ & $0.504s$ & $1.07$ & $6$ & $2.12e$-$16$ & $5.05e$-$28$ \\ - -thermal2 & $2.549s$ & $1.705s$ & $1.49$ & $14$ & $2.36e$-$10$ & $5.20e$-$18$ \\ \hline -\end{tabular} -\caption{Performances of parallel CG solver for solving linear systems associated to sparse five-bands matrices -on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs} -\label{ch12:tab:08} -\end{center} -\end{table} - -\begin{table} -\begin{center} -\begin{tabular}{|c|c|c|c|c|c|c|} -\hline -{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline - -2cubes\_sphere & $15.963s$ & $7.250s$ & $2.20$ & $58$ & $6.23e$-$16$ & $3.25e$-$19$ \\ - -ecology2 & $3.549s$ & $2.176s$ & $1.63$ & $21$ & $4.78e$-$15$ & $1.06e$-$15$ \\ - -finan512 & $3.862s$ & $1.934s$ & $1.99$ & $17$ & $3.21e$-$14$ & $8.43e$-$17$ \\ - -G3\_circuit & $4.636s$ & $2.811s$ & $1.65$ & $22$ & $1.08e$-$14$ & $1.77e$-$16$ \\ - -shallow\_water2 & $2.738s$ & $1.539s$ & $1.78$ & $17$ & $5.54e$-$23$ & $3.82e$-$26$ \\ - -thermal2 & $5.017s$ & $2.587s$ & $1.94$ & $21$ & $8.25e$-$14$ & $4.34e$-$18$ \\ \hline \hline - -cage13 & $9.315s$ & $3.227s$ & $2.89$ & $26$ & $3.38e$-$13$ & $2.08e$-$16$ \\ - -crashbasis & $35.980s$ & $14.770s$ & $2.43$ & $127$ & $1.17e$-$12$ & $1.56e$-$17$ \\ - -FEM\_3D\_thermal2 & $24.611s$ & $7.749s$ & $3.17$ & $64$ & $3.87e$-$11$ & $2.84e$-$14$ \\ - -language & $16.859s$ & $9.697s$ & $1.74$ & $89$ & $2.17e$-$12$ & $1.70e$-$12$ \\ - -poli\_large & $10.200s$ & $6.534s$ & $1.56$ & $69$ & $5.14e$-$13$ & $1.63e$-$13$ \\ - -torso3 & $49.074s$ & $19.397s$ & $2.53$ & $175$ & $2.69e$-$12$ & $2.77e$-$16$ \\ \hline -\end{tabular} -\caption{Performances of parallel GMRES solver for solving linear systems associated to sparse five-bands matrices -on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs} -\label{ch12:tab:09} -\end{center} -\end{table} - -Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09} shows the performances of the parallel -CG and GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a -cluster of $12$ GPUs. The linear systems solved in these tables are associated to the -sparse five-bands matrices presented on Table~\ref{ch12:tab:07}. We can notice from -both Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09} that using a GPU cluster is not -efficient for solving these kind of sparse linear systems\index{Sparse~linear~system}. -We can see that the execution times obtained on the GPU cluster are almost equivalent -to those obtained on the CPU cluster (see the relative gains presented in column~$4$ -of each table). This is due to the large number of communications necessary to synchronize -the computations over the cluster. Indeed, the naive partitioning, row-by-row or column-by-column, -of sparse matrices having large bandwidths can link a computing node to many neighbors -and then generate a large number of data dependencies between these computing nodes in -the cluster. - -Therefore, we have chosen to use a hypergraph partitioning method\index{Hypergraph}, -which is well-suited to numerous kinds of sparse matrices~\cite{ch12:ref11}. Indeed, -it can well model the communications between the computing nodes, particularly in the -case of nonsymmetric and irregular matrices, and it gives good reduction of the total -communication volume. In contrast, it is an expensive operation in terms of execution -time and memory space. - -The sparse matrix $A$ of the linear system to be solved is modeled as a hypergraph -$\mathcal{H}=(\mathcal{V},\mathcal{E})$\index{Hypergraph} as follows: -\begin{itemize} -\item each matrix row $\{i\}_{0\leq i