modif lilia ch12

diff --git a/BookGPU/Chapters/chapter12/ch12.tex b/BookGPU/Chapters/chapter12/ch12.tex

index eaa4f9cf5463126e3e4102134007aa09f5797d3f..749585952548f74f180bca570c0db36d0e6f0a63 100755
--- a/BookGPU/Chapters/chapter12/ch12.tex
+++ b/BookGPU/Chapters/chapter12/ch12.tex
@@ -38,11 +38,8 @@ traditional CPUs.
  In Section~\ref{ch12:sec:02}, we describe the general principle of two well-known iterative
  methods: the conjugate gradient method and the generalized minimal residual method. In Section~\ref{ch12:sec:03},
  we give the main key points of the parallel implementation of both methods on a cluster of
-GPUs. Then, in Section~\ref{ch12:sec:04}, we present the experimental results obtained on a
-CPU cluster and on a GPU cluster, for solving sparse linear systems associated to matrices
-of different structures. Finally, in Section~\ref{ch12:sec:05}, we apply the hypergraph partitioning
-technique to reduce the total communication volume between the computing nodes and, thus,
-to improve the execution times of the parallel algorithms of both iterative methods.   
+GPUs. Finally, in Section~\ref{ch12:sec:04}, we present the experimental results obtained on a
+CPU cluster and on a GPU cluster, for solving large sparse linear systems.    
  
  
  %%--------------------------%%
@@ -200,7 +197,7 @@ $maxiter$ are reached.
  %%****************%%
  \subsection{GMRES method} 
  \label{ch12:sec:02.02}
-The iterative GMRES method is developed by Saad and Schultz in 1986~\cite{ch12:ref3} as a generalization
+The iterative GMRES method was developed by Saad and Schultz in 1986~\cite{ch12:ref3} as a generalization
  of the minimum residual method MINRES~\cite{ch12:ref4}\index{Iterative~method!MINRES}. Indeed, GMRES can
  be applied for solving symmetric or nonsymmetric linear systems. 
  
@@ -231,7 +228,7 @@ V_k = \{v_1, v_2,\ldots,v_k\}, & \forall k>1, v_k=A^{k-1}v_1,
  \end{equation}
  and
  \begin{equation}
-V_k A = V_{k+1} \bar{H}_k.
+A V_k = V_{k+1} \bar{H}_k.
  \label{ch12:eq:15}
  \end{equation}
  
@@ -512,12 +509,6 @@ Tesla C1060 GPU contains $240$ cores running at $1.3$GHz and providing a global
  a memory bandwidth of $102$GB/s. Figure~\ref{ch12:fig:04} shows the general scheme of the GPU cluster\index{GPU~cluster}
  that we used in the experimental tests.
  
-\begin{figure}
-\centerline{\includegraphics[scale=0.25]{Chapters/chapter12/figures/cluster}}
-\caption{General scheme of the GPU cluster of tests composed of six machines, each with two GPUs.}
-\label{ch12:fig:04}
-\end{figure}
-
  Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for coding
  the parallel algorithms of both methods on the GPU cluster. CUDA version 4.0~\cite{ch12:ref9}
  is used for programming GPUs, using CUBLAS library~\cite{ch12:ref6} to deal with vector operations
@@ -525,6 +516,12 @@ in GPUs and, finally, MPI routines of OpenMPI 1.3.3 are used to carry out the co
  CPU cores. Indeed, the experiments are done on a cluster of $12$ computing nodes, where each node
  is managed by a MPI process and it is composed of one CPU core and one GPU card.
  
+\begin{figure}[!h]
+\centerline{\includegraphics[scale=0.25]{Chapters/chapter12/figures/cluster}}
+\caption{General scheme of the GPU cluster of tests composed of six machines, each with two GPUs.}
+\label{ch12:fig:04}
+\end{figure}
+
  All tests are made on double-precision floating point operations. The parameters of both linear
  solvers are initialized as follows: the residual tolerance threshold $\varepsilon=10^{-12}$, the
  maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$ and the
@@ -536,17 +533,9 @@ not too ill-conditioned matrices. In the GPU computing, the size of thread block
  threads. Finally, the performance results, presented hereafter, are obtained from the mean value
  over $10$ executions of the same parallel linear solver and for the same input data.
  
-To get more realistic results, we tested the CG and GMRES algorithms on sparse matrices of the Davis's
-collection~\cite{ch12:ref10}, that arise in a wide spectrum of real-world applications. We chose six
-symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{ch12:fig:05},
-we show structures of these matrices and in Table~\ref{ch12:tab:01} we present their main characteristics
-which are the number of rows, the total number of nonzero values (nnz) and the maximal bandwidth. In
-the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating
-the first and the last nonzero value on a matrix row.
-
  \begin{figure}
  \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/matrices}}
-\caption{Sketches of sparse matrices chosen from the Davis's collection.}
+\caption{Sketches of sparse matrices chosen from the Davis collection.}
  \label{ch12:fig:05}
  \end{figure}
  
@@ -580,11 +569,18 @@ the first and the last nonzero value on a matrix row.
  
                                & torso3            & $259,156$     & $4,429,042$  & $216,854$  \\ \hline
  \end{tabular}
-\vspace{0.5cm}
-\caption{Main characteristics of sparse matrices chosen from the Davis's collection.}
+\caption{Main characteristics of sparse matrices chosen from the Davis collection.}
  \label{ch12:tab:01}
  \end{table}
  
+To get more realistic results, we tested the CG and GMRES algorithms on sparse matrices from the Davis
+collection~\cite{ch12:ref10}, which arise in a wide spectrum of real-world applications. We chose six
+symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{ch12:fig:05},
+we show the structures of these matrices and in Table~\ref{ch12:tab:01} we present their main characteristics,
+namely the number of rows, the total number of nonzero values (nnz) and the maximal bandwidth. In
+the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating
+the first and the last nonzero value on a matrix row.
+
  \begin{table}
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
@@ -608,7 +604,7 @@ thermal2          & $1.172s$           & $0.622s$            & $1.88$        & $
  \end{center}
  \end{table}
  
-\begin{table}[!h]
+\begin{table}
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
@@ -675,19 +671,19 @@ of the solution $x^{gpu}$. Thus, we can see that the solutions obtained on the G
  were computed with a sufficient accuracy (about $10^{-10}$) and they are, more or less, equivalent
  to those computed on the CPU cluster with a small difference ranging from $10^{-10}$ to $10^{-26}$.
  However, we can notice from the relative gains $\tau$ that is not interesting to use multiple
-GPUs for solving small sparse linear systems. in fact, a small sparse matrix does not allow to
+GPUs for solving small sparse linear systems. In fact, a small sparse matrix does not allow to
  maximize utilization of GPU cores. In addition, the communications required to synchronize the
  computations over the cluster increase the idle times of GPUs and slow down further the parallel
  computations.
  
  Consequently, in order to test the performances of the parallel solvers, we developed in C programming
-language a generator of large sparse matrices. This generator takes a matrix from the Davis's collection~\cite{ch12:ref10}
+language a generator of large sparse matrices. This generator takes a matrix from the Davis collection~\cite{ch12:ref10}
  as an initial matrix to construct large sparse matrices exceeding ten million of rows. It must be executed
  in parallel by the MPI processes of the computing nodes, so that each process could construct its sparse
  sub-matrix. In first experimental tests, we are focused on sparse matrices having a banded structure,
  because they are those arise in the most of numerical problems. So to generate the global sparse matrix,
  each MPI process constructs its sub-matrix by performing several copies of an initial sparse matrix chosen
-from the Davis's collection. Then, it puts all these copies on the main diagonal of the global matrix
+from the Davis collection. Then, it puts all these copies on the main diagonal of the global matrix
  (see Figure~\ref{ch12:fig:06}). Moreover, the empty spaces between two successive copies in the main
  diagonal are filled with sub-copies (left-copy and right-copy in Figure~\ref{ch12:fig:06}) of the same
  initial matrix.
@@ -729,7 +725,7 @@ initial matrix.
                                & torso3            & $433,795,264$ & $328,757$        \\ \hline
  \end{tabular}
  \vspace{0.5cm}
-\caption{Main characteristics of sparse banded matrices generated from those of the Davis's collection.}
+\caption{Main characteristics of sparse banded matrices generated from those of the Davis collection.}
  \label{ch12:tab:04}
  \end{table}
  
@@ -746,7 +742,7 @@ CG method is characterized by a better convergence\index{Convergence} rate and a
  time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES
  method requires more data exchanges between computing nodes compared to the parallel CG method.
   
-\begin{table}[!h]
+\begin{table}
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
@@ -770,7 +766,7 @@ on a cluster of 12 GPUs.}
  \end{center}
  \end{table}
  
-\begin{table}[!h]
+\begin{table}
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
@@ -806,390 +802,11 @@ on a cluster of 12 GPUs.}
  \end{center}
  \end{table}
  
-
  %%--------------------------%%
  %%       SECTION 5          %%
  %%--------------------------%%
-\section{Hypergraph partitioning}
-\label{ch12:sec:05}
-In this section, we present the performances of both parallel CG and GMRES solvers for solving linear
-systems associated to sparse matrices having large bandwidths. Indeed, we are interested on sparse
-matrices having the nonzero values distributed along their bandwidths. 
-
-\begin{figure}
-\centerline{\includegraphics[scale=0.22]{Chapters/chapter12/figures/generation_1}}
-\caption{Parallel generation of a large sparse five-bands matrix by four computing nodes.}
-\label{ch12:fig:07}
-\end{figure}
-
-\begin{table}[!h]
-\begin{center}
-\begin{tabular}{|c|c|c|c|} 
-\hline
-{\bf Matrix type}             & {\bf Matrix name} & {\bf \# nnz}  & {\bf Bandwidth} \\ \hline \hline
-
-\multirow{6}{*}{Symmetric}    & 2cubes\_sphere    & $829,082,728$ & $24,999,999$     \\
-
-                              & ecology2          & $254,892,056$ & $25,000,000$     \\ 
-
-                              & finan512          & $556,982,339$ & $24,999,973$     \\ 
-
-                              & G3\_circuit       & $257,982,646$ & $25,000,000$     \\
-            
-                              & shallow\_water2   & $200,798,268$ & $25,000,000$     \\
-
-                              & thermal2          & $359,340,179$ & $24,999,998$     \\ \hline \hline
-            
-\multirow{6}{*}{Nonsymmetric} & cage13            & $879,063,379$ & $24,999,998$     \\
-
-                              & crashbasis        & $820,373,286$ & $24,999,803$     \\
-
-                              & FEM\_3D\_thermal2 & $1,194,012,703$ & $24,999,998$     \\
-
-                              & language          & $155,261,826$ & $24,999,492$     \\
-
-                              & poli\_large       & $106,680,819$ & $25,000,000$    \\
-
-                              & torso3            & $872,029,998$ & $25,000,000$\\ \hline
-\end{tabular}
-\caption{Main characteristics of sparse five-bands matrices generated from those of the Davis's collection.}
-\label{ch12:tab:07}
-\end{center}
-\end{table}
-
-We have developed in C programming language a generator of large sparse matrices
-having five bands distributed along their bandwidths (see Figure~\ref{ch12:fig:07}).
-The principle of this generator is equivalent to that in Section~\ref{ch12:sec:04}.
-However, the copies performed on the initial matrix (chosen from the Davis's collection)
-are placed on the main diagonal and on four off-diagonals, two on the right and two
-on the left of the main diagonal. Figure~\ref{ch12:fig:07} shows an example of a
-generation of a sparse five-bands matrix by four computing nodes. Table~\ref{ch12:tab:07}
-shows the main characteristics of sparse five-bands matrices generated from those
-presented in Table~\ref{ch12:tab:01} and associated to linear systems of $25$ million
-unknown values.   
-
-\begin{table}[!h]
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|c|c|} 
-\hline
-{\bf Matrix}      & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$   \\ \hline \hline
-
-2cubes\_sphere    & $6.041s$     & $3.338s$      & $1.81$ & $30$ & $6.77e$-$11$ & $3.25e$-$19$ \\
-
-ecology2          & $1.404s$     & $1.301s$      & $1.08$ & $13$     & $5.22e$-$11$ & $2.17e$-$18$ \\
-
-finan512          & $1.822s$     & $1.299s$      & $1.40$ & $12$     & $3.52e$-$11$ & $3.47e$-$18$ \\
-
-G3\_circuit       & $2.331s$     & $2.129s$      & $1.09$ & $15$     & $1.36e$-$11$ & $5.20e$-$18$ \\
-
-shallow\_water2   & $0.541s$     & $0.504s$      & $1.07$ & $6$      & $2.12e$-$16$ & $5.05e$-$28$ \\
-
-thermal2          & $2.549s$     & $1.705s$      & $1.49$ & $14$     & $2.36e$-$10$ & $5.20e$-$18$ \\ \hline  
-\end{tabular}
-\caption{Performances of parallel CG solver for solving linear systems associated to sparse five-bands matrices
-on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs}
-\label{ch12:tab:08}
-\end{center}
-\end{table}
-
-\begin{table}
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|c|c|} 
-\hline
-{\bf Matrix}      & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$   \\ \hline \hline
-
-2cubes\_sphere    & $15.963s$    & $7.250s$      & $2.20$  & $58$     & $6.23e$-$16$ & $3.25e$-$19$ \\
-
-ecology2          & $3.549s$     & $2.176s$      & $1.63$  & $21$     & $4.78e$-$15$ & $1.06e$-$15$ \\
-
-finan512          & $3.862s$     & $1.934s$      & $1.99$  & $17$     & $3.21e$-$14$ & $8.43e$-$17$ \\
-
-G3\_circuit       & $4.636s$     & $2.811s$      & $1.65$  & $22$     & $1.08e$-$14$ & $1.77e$-$16$ \\
-
-shallow\_water2   & $2.738s$     & $1.539s$      & $1.78$  & $17$     & $5.54e$-$23$ & $3.82e$-$26$ \\
-
-thermal2          & $5.017s$     & $2.587s$      & $1.94$  & $21$     & $8.25e$-$14$ & $4.34e$-$18$ \\ \hline \hline
-
-cage13            & $9.315s$     & $3.227s$      & $2.89$  & $26$     & $3.38e$-$13$ & $2.08e$-$16$ \\
-
-crashbasis        & $35.980s$    & $14.770s$     & $2.43$  & $127$    & $1.17e$-$12$ & $1.56e$-$17$ \\
-
-FEM\_3D\_thermal2 & $24.611s$    & $7.749s$      & $3.17$  & $64$     & $3.87e$-$11$ & $2.84e$-$14$ \\
-
-language          & $16.859s$    & $9.697s$      & $1.74$  & $89$     & $2.17e$-$12$ & $1.70e$-$12$ \\
-
-poli\_large       & $10.200s$    & $6.534s$      & $1.56$  & $69$     & $5.14e$-$13$ & $1.63e$-$13$ \\
-
-torso3            & $49.074s$    & $19.397s$     & $2.53$  & $175$    & $2.69e$-$12$ & $2.77e$-$16$ \\ \hline
-\end{tabular}
-\caption{Performances of parallel GMRES solver for solving linear systems associated to sparse five-bands matrices
-on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs}
-\label{ch12:tab:09}
-\end{center}
-\end{table}
-
-Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09} shows the performances of the parallel
-CG and GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a
-cluster of $12$ GPUs. The linear systems solved in these tables are associated to the
-sparse five-bands matrices presented on Table~\ref{ch12:tab:07}. We can notice from
-both Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09} that using a GPU cluster is not
-efficient for solving these kind of sparse linear systems\index{Sparse~linear~system}.
-We can see that the execution times obtained on the GPU cluster are almost equivalent
-to those obtained on the CPU cluster (see the relative gains presented in column~$4$
-of each table). This is due to the large number of communications necessary to synchronize
-the computations over the cluster. Indeed, the naive partitioning, row-by-row or column-by-column,
-of sparse matrices having large bandwidths can link a computing node to many neighbors
-and then generate a large number of data dependencies between these computing nodes in
-the cluster. 
-
-Therefore, we have chosen to use a hypergraph partitioning method\index{Hypergraph},
-which is well-suited to numerous kinds of sparse matrices~\cite{ch12:ref11}. Indeed,
-it can well model the communications between the computing nodes, particularly in the
-case of nonsymmetric and irregular matrices, and it gives good reduction of the total
-communication volume. In contrast, it is an expensive operation in terms of execution
-time and memory space. 
-
-The sparse matrix $A$ of the linear system to be solved is modeled as a hypergraph
-$\mathcal{H}=(\mathcal{V},\mathcal{E})$\index{Hypergraph} as follows:
-\begin{itemize}
-\item each matrix row $\{i\}_{0\leq i<n}$ corresponds to a vertex $v_i\in\mathcal{V}$ and,
-\item each matrix column $\{j\}_{0\leq j<n}$ corresponds to a hyperedge $e_j\in\mathcal{E}$, where:
-\begin{equation}
-\forall a_{ij} \neq 0 \mbox{~is a nonzero value of matrix~} A \mbox{~:~} v_i \in pins[e_j],
-\end{equation} 
-\item $w_i$ is the weight of vertex $v_i$ and,
-\item $c_j$ is the cost of hyperedge $e_j$.
-\end{itemize}
-A $K$-way partitioning of a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is
-defined as $\mathcal{P}=\{\mathcal{V}_1,\ldots,\mathcal{V}_K\}$ a set of pairwise
-disjoint non-empty subsets (or parts) of the vertex set $\mathcal{V}$, so that each
-subset is attributed to a computing node. Figure~\ref{ch12:fig:08} shows an example
-of the hypergraph model of a  $(9\times 9)$ sparse matrix in three parts. The circles
-and squares correspond, respectively, to the vertices and hyperedges of the hypergraph.
-The solid squares define the cut hyperedges connecting at least two different parts. 
-The connectivity $\lambda_j$ of a cut hyperedge $e_j$ denotes the number of different
-parts spanned by $e_j$.
-
-\begin{figure}
-\centerline{\includegraphics[scale=0.5]{Chapters/chapter12/figures/hypergraph}}
-\caption{An example of the hypergraph partitioning of a sparse matrix decomposed between three computing nodes.}
-\label{ch12:fig:08}
-\end{figure}
-
-The cut hyperedges model the total communication volume between the different computing
-nodes in the cluster, necessary to perform the parallel SpMV multiplication\index{SpMV~multiplication}.
-Indeed, each hyperedge $e_j$ defines a set of atomic computations $b_i\leftarrow b_i+a_{ij}x_j$,
-$0\leq i,j<n$, of the SpMV multiplication $Ax=b$ that need the $j^{th}$ unknown value of
-solution vector $x$. Therefore, pins of hyperedge $e_j$, $pins[e_j]$, are the set of matrix
-rows sharing and requiring the same unknown value $x_j$. For example in Figure~\ref{ch12:fig:08},
-hyperedge $e_9$ whose pins are: $pins[e_9]=\{v_2,v_5,v_9\}$ represents the dependency of matrix
-rows $2$, $5$ and $9$ to unknown $x_9$ needed to perform in parallel the atomic operations:
-$b_2\leftarrow b_2+a_{29}x_9$, $b_5\leftarrow b_5+a_{59}x_9$ and $b_9\leftarrow b_9+a_{99}x_9$.
-However, unknown $x_9$ is the third entry of the sub-solution vector $x$ of part (or node) $3$.
-So the computing node $3$ must exchange this value with nodes $1$ and $2$, which leads to perform
-two communications.
-
-The hypergraph partitioning\index{Hypergraph} allows to reduce the total communication volume
-required to perform the parallel SpMV multiplication, while maintaining the load balancing between
-the computing nodes. In fact, it allows to minimize at best the following amount:
-\begin{equation}
-\mathcal{X}(\mathcal{P})=\sum_{e_{j}\in\mathcal{E}_{C}}c_{j}(\lambda_{j}-1),
-\end{equation}
-where $\mathcal{E}_{C}$ denotes the set of the cut hyperedges coming from the hypergraph partitioning
-$\mathcal{P}$ and $c_j$ and $\lambda_j$ are, respectively, the cost and the connectivity of cut hyperedge
-$e_j$. Moreover, it also ensures the load balancing between the $K$ parts as follows: 
-\begin{equation}
-  W_{k}\leq (1+\epsilon)W_{avg}, \hspace{0.2cm} (1\leq k\leq K) \hspace{0.2cm} \text{and} \hspace{0.2cm} (0<\epsilon<1),
-\end{equation} 
-where $W_{k}$ is the sum of all vertex weights ($w_{i}$) in part $\mathcal{V}_{k}$, $W_{avg}$ is the
-average weight of all $K$ parts and $\epsilon$ is the maximum allowed imbalanced ratio.
-
-The hypergraph partitioning is a NP-complete problem but software tools using heuristics are developed,
-for example: hMETIS~\cite{ch12:ref12}, PaToH~\cite{ch12:ref13} and Zoltan~\cite{ch12:ref14}. Since our
-objective is solving large sparse linear systems, we use the parallel hypergraph partitioning which must
-be performed by at least two MPI processes. It allows to accelerate the data partitioning of large sparse
-matrices. For this, the hypergraph $\mathcal{H}$ must be partitioned in $p$ (number of MPI processes)
-sub-hypergraphs $\mathcal{H}_k=(\mathcal{V}_k,\mathcal{E}_k)$, $0\leq k<p$, and then we performed the
-parallel hypergraph partitioning method using some functions of the MPI library between the $p$ processes.
-
-Tables~\ref{ch12:tab:10} and~\ref{ch12:tab:11} shows the performances of the parallel CG and GMRES solvers,
-respectively, using the hypergraph partitioning for solving large linear systems associated to the sparse
-five-bands matrices presented in Table~\ref{ch12:tab:07}. For these experimental tests, we have applied the
-parallel hypergraph partitioning~\cite{ch12:ref15} developed in Zoltan tool~\cite{ch12:ref14}. We have initialized
-the parameters of the partitioning operation as follows:
-\begin{itemize}
-\item the weight $w_{i}$ of each vertex $v_{j}\in\mathcal{V}$ is set to the number of nonzero values on matrix row $i$,
-\item for the sake of simplicity, the cost $c_{j}$ of each hyperedge $e_{j}\in\mathcal{E}$ is fixed to $1$,
-\item the maximum imbalanced load ratio $\epsilon$ is limited to $10\%$.\\
-\end{itemize}  
-
-\begin{table}
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|} 
-\hline
-{\bf Matrix}    & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{Gains \%}$ \\ \hline \hline
-
-2cubes\_sphere  & $5.935s$             & $1.213s$              & $4.89$          & $63.66\%$ \\
-
-ecology2        & $1.093s$             & $0.136s$              & $8.00$          & $89.55\%$ \\
-
-finan512        & $1.762s$             & $0.475s$              & $3.71$          & $63.43\%$ \\
-
-G3\_circuit     & $2.095s$             & $0.558s$              & $3.76$          & $73.79\%$ \\
-
-shallow\_water2 & $0.498s$             & $0.068s$              & $7.31$          & $86.51\%$ \\
-
-thermal2        & $1.889s$             & $0.348s$              & $5.43$          & $79.59\%$ \\ \hline  
-\end{tabular}
-\caption{Performances of the parallel CG solver using hypergraph partitioning for solving linear systems associated to
-sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPU.}
-\label{ch12:tab:10}
-\end{center}
-\end{table}
-
-\begin{table}
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|} 
-\hline
-{\bf Matrix}      & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{Gains \%}$ \\ \hline \hline
-
-2cubes\_sphere    & $16.430s$            & $2.840s$              & $5.78$          & $60.83\%$ \\
-
-ecology2          & $3.152s$             & $0.367s$              & $8.59$          & $83.13\%$ \\
-
-finan512          & $3.672s$             & $0.723s$              & $5.08$          & $62.62\%$ \\
-
-G3\_circuit       & $4.468s$             & $0.971s$              & $4.60$          & $65.46\%$ \\
-
-shallow\_water2   & $2.647s$             & $0.312s$              & $8.48$          & $79.73\%$ \\
-
-thermal2          & $4.190s$             & $0.666s$              & $6.29$          & $74.25\%$ \\ \hline \hline
-
-cage13            & $8.077s$             & $1.584s$              & $5.10$          & $50.91\%$ \\
-
-crashbasis        & $35.173s$            & $5.546s$              & $6.34$          & $62.43\%$ \\
-
-FEM\_3D\_thermal2 & $24.825s$            & $3.113s$              & $7.97$          & $59.83\%$ \\
-
-language          & $16.706s$            & $2.522s$              & $6.62$          & $73.99\%$ \\
-
-poli\_large       & $12.715s$            & $3.989s$              & $3.19$          & $38.95\%$ \\
-
-torso3            & $48.459s$            & $6.234s$              & $7.77$          & $67.86\%$ \\ \hline
-\end{tabular}
-\caption{Performances of the parallel GMRES solver using hypergraph partitioning for solving linear systems associated to
-sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPU.}
-\label{ch12:tab:11}
-\end{center}
-\end{table}
-
-We can notice from both Tables~\ref{ch12:tab:10} and~\ref{ch12:tab:11} that the
-hypergraph partitioning has improved the performances of both parallel CG and GMRES
-algorithms. The execution times on the GPU cluster of both parallel solvers are
-significantly improved compared to those obtained by using the partitioning row-by-row.
-For these examples of sparse matrices, the execution times of CG and GMRES solvers
-are reduced about $76\%$ and $65\%$ respectively (see column~$5$ of each table)
-compared to those obtained in Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09}.
-
-In fact, the hypergraph partitioning\index{Hypergraph} applied to sparse matrices
-having large bandwidths allows to reduce the total communication volume necessary
-to synchronize the computations between the computing nodes in the GPU cluster.
-Table~\ref{ch12:tab:12} presents, for each sparse matrix, the total communication
-volume between $12$ GPU computing nodes obtained by using the partitioning row-by-row
-(column~$2$), the total communication volume obtained by using the hypergraph partitioning
-(column~$3$) and the execution times in minutes of the hypergraph partitioning
-operation performed by $12$ MPI processes (column~$4$). The total communication
-volume defines the total number of the vector elements exchanged by the computing
-nodes. Then, Table~\ref{ch12:tab:12} shows that the hypergraph partitioning method
-can split the sparse matrix so as to minimize the data dependencies between the
-computing nodes and thus to reduce the total communication volume.
-
-\begin{table}
-\begin{center}
-\begin{tabular}{|c|c|c|c|} 
-\hline
-\multirow{4}{*}{\bf Matrix}  & {\bf Total comms.}      & {\bf Total comms.}      & {\bf Execution} \\
-                             & {\bf volume without}    & {\bf volume with}       & {\bf trime}  \\
-                             & {\bf hypergraph}        & {\bf hypergraph }       & {\bf of the parti.}  \\  
-                             & {\bf parti.}            & {\bf parti.}            & {\bf in minutes}\\ \hline \hline
-
-2cubes\_sphere               & $25,360,543$            & $240,679$               & $68.98$         \\
-
-ecology2                     & $26,044,002$            & $73,021$                & $4.92$          \\
-
-finan512                     & $26,087,431$            & $900,729$               & $33.72$         \\
-
-G3\_circuit                  & $31,912,003$            & $5,366,774$             & $11.63$         \\ 
-
-shallow\_water2              & $25,105,108$            & $60,899$                & $5.06$          \\ 
-
-thermal2                     & $30,012,846$            & $1,077,921$             & $17.88$         \\ \hline \hline
-
-cage13                       & $28,254,282$            & $3,845,440$             & $196.45$        \\
-
-crashbasis                   & $29,020,060$            & $2,401,876$             & $33.39$         \\
-
-FEM\_3D\_thermal2            & $25,263,767$            & $250,105$               & $49.89$         \\
-
-language                     & $27,291,486$            & $1,537,835$             & $9.07$          \\
-
-poli\_large                  & $25,053,554$            & $7,388,883$             & $5.92$          \\
-
-torso3                       & $25,682,514$            & $613,250$               & $61.51$         \\ \hline       
-\end{tabular}
-\caption{The total communication volume between 12 GPU computing nodes without and with the hypergraph partitioning method.}
-\label{ch12:tab:12}
-\end{center}
-\end{table}
-
-Nevertheless, as we can see from the fourth column of Table~\ref{ch12:tab:12},
-the hypergraph partitioning takes longer compared to the execution times of the
-resolutions. As previously mentioned, the hypergraph partitioning method is less
-efficient in terms of memory consumption and partitioning time than its graph
-counterpart, but the hypergraph well models the nonsymmetric and irregular problems.
-So for the applications which often use the same sparse matrices, we can perform
-the hypergraph partitioning on these matrices only once for each and then, we save
-the traces of these partitionings in files to be reused several times. Therefore,
-this allows to avoid the partitioning of the sparse matrices at each resolution
-of the linear systems.
-
-\begin{figure}[!h]
-\centering
-  \mbox{\subfigure[Sparse band matrices]{\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_band}\label{ch12:fig:09.01}}}
-\vfill 
-  \mbox{\subfigure[Sparse five-bands matrices]{\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_5band}\label{ch12:fig:09.02}}}
-\caption{Weak-scaling of the parallel CG and GMRES solvers on a GPU cluster for solving large sparse linear systems.}
-\label{ch12:fig:09}
-\end{figure}
-
-However, the most important performance parameter is the scalability of the parallel
-CG\index{Iterative~method!CG} and GMRES\index{Iterative~method!GMRES} solvers on a GPU
-cluster. Particularly, we have taken into account the weak-scaling of both parallel
-algorithms on a cluster of one to 12 GPU computing nodes. We have performed a set of
-experiments on both matrix structures: band matrices and five-bands matrices. The sparse
-matrices of tests are generated from the symmetric sparse matrix {\it thermal2} chosen
-from the Davis's collection. Figures~\ref{ch12:fig:09.01} and~\ref{ch12:fig:09.02}
-show the execution times of both parallel methods for solving large linear systems
-associated to band matrices and those associated to five-bands matrices, respectively.
-The size of a sparse sub-matrix per computing node, for each matrix structure, is fixed
-as follows:
-\begin{itemize}
-\item band matrix: $15$ million of rows and $105,166,557$ of nonzero values,
-\item five-bands matrix: $5$ million of rows and $78,714,492$ of nonzero values. 
-\end{itemize}
-We can see from these figures that both parallel solvers are quite scalable on a GPU
-cluster. Indeed, the execution times remains almost constant while the size of the
-sparse linear systems to be solved increases proportionally with the number of the
-GPU computing nodes. This means that the communication cost is relatively constant
-regardless of the number the computing nodes in the GPU cluster.
-
-
-
-%%--------------------------%%
-%%       SECTION 6          %%
-%%--------------------------%%
  \section{Conclusion}
-\label{ch12:sec:06}
+\label{ch12:sec:05}
  In this chapter, we have aimed at harnessing the computing power of a
  cluster of GPUs for solving large sparse linear systems. For this, we
  have used two Krylov subspace iterative methods: the CG and GMRES methods.
@@ -1211,22 +828,21 @@ for solving linear systems associated to very large sparse matrices. The experim
  results, obtained in the present chapter, showed that a cluster of $12$ GPUs is
  about $7$ times faster than a cluster of $24$ CPU cores for solving large sparse
  linear systems of $25$ million unknown values. This is due to the GPU ability to
-compute the data-parallel operations faster than the CPUs. However, we have shown
-that solving linear systems associated to matrices having large bandwidths uses
-many communications to synchronize the computations of GPUs, which slow down even
-more the resolution. Moreover, there are two kinds of communications: between a
-CPU and its GPU and between CPUs of the computing nodes, such that the first ones
-are the slowest communications on a GPU cluster. So, we have proposed to use the
-hypergraph partitioning instead of the row-by-row partitioning. This allows to
-minimize the data dependencies between the GPU computing nodes and thus to reduce
-the total communication volume. The experimental results showed that using the
-hypergraph partitioning technique improve the execution times on average of $76\%$
-to the CG method and of $65\%$ to the GMRES method on a cluster of $12$ GPUs. 
-
-In the recent GPU hardware and software architectures, the GPU-Direct system with
-CUDA version 5.0 is used so that two GPUs located on the same node or on distant
-nodes can communicate between them directly without CPUs. This allows to improve
-the data transfers between GPUs.          
+compute the data-parallel operations faster than the CPUs.
+
+As future work, we plan to test the parallel algorithms of the CG and GMRES methods, adapted
+to GPUs, for solving large linear systems associated with sparse matrices of different structures.
+One example is matrices having large bandwidths, which can lead to many data dependencies
+between the computing nodes and, thus, degrade the performances of both algorithms. In
+this case, it would be interesting to study different data partitioning techniques, in
+order to minimize the dependencies between the computing nodes and thus to reduce the total
+communication volume. This may improve the performances of both algorithms implemented on
+a GPU cluster. Moreover, in recent GPU hardware and software architectures, the GPU-Direct
+system with CUDA version 5.0 allows two GPUs located on the same node or on distant
+nodes to communicate with each other directly, without involving the CPUs. This improves the data
+transfers between GPUs.          
+   
+  
  
  \putbib[Chapters/chapter12/biblio12]