-%%--------------------------%%
-%% SECTION 5 %%
-%%--------------------------%%
-\section{Hypergraph partitioning}
-\label{sec:05}
-In this section, we present the performances of both parallel CG and GMRES solvers for solving linear systems associated to sparse matrices having
-large bandwidths. Indeed, we are interested on sparse matrices having the nonzero values distributed along their bandwidths.
-
-\begin{figure}
-\centerline{\includegraphics[scale=0.22]{Chapters/chapter12/figures/generation_1}}
-\caption{Parallel generation of a large sparse five-bands matrix by four computing nodes.}
-\label{fig:07}
-\end{figure}
-
-\begin{table}[!h]
-\begin{center}
-\begin{tabular}{|c|c|c|c|}
-\hline
-{\bf Matrix type} & {\bf Matrix name} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline
-
-\multirow{6}{*}{Symmetric} & 2cubes\_sphere & $829,082,728$ & $24,999,999$ \\
-
- & ecology2 & $254,892,056$ & $25,000,000$ \\
-
- & finan512 & $556,982,339$ & $24,999,973$ \\
-
- & G3\_circuit & $257,982,646$ & $25,000,000$ \\
-
- & shallow\_water2 & $200,798,268$ & $25,000,000$ \\
-
- & thermal2 & $359,340,179$ & $24,999,998$ \\ \hline \hline
-
-\multirow{6}{*}{Nonsymmetric} & cage13 & $879,063,379$ & $24,999,998$ \\
-
- & crashbasis & $820,373,286$ & $24,999,803$ \\
-
- & FEM\_3D\_thermal2 & $1,194,012,703$ & $24,999,998$ \\
-
- & language & $155,261,826$ & $24,999,492$ \\
-
- & poli\_large & $106,680,819$ & $25,000,000$ \\
-
- & torso3 & $872,029,998$ & $25,000,000$\\ \hline
-\end{tabular}
-\caption{Main characteristics of sparse five-bands matrices generated from those of the Davis's collection.}
-\label{tab:07}
-\end{center}
-\end{table}
-
-We have developed in C programming language a generator of large sparse matrices having five bands distributed along their bandwidths (see Figure~\ref{fig:07}).
-The principle of this generator is equivalent to that in Section~\ref{sec:04}. However, the copies performed on the initial matrix (chosen from the
-Davis's collection) are placed on the main diagonal and on four off-diagonals, two on the right and two on the left of the main diagonal. Figure~\ref{fig:07}
-shows an example of a generation of a sparse five-bands matrix by four computing nodes. Table~\ref{tab:07} shows the main characteristics of sparse
-five-bands matrices generated from those presented in Table~\ref{tab:01} and associated to linear systems of $25$ million unknown values.
-
-\begin{table}[!h]
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|c|c|}
-\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline
-
-2cubes\_sphere & $6.041s$ & $3.338s$ & $1.81$ & $30$ & $6.77e$-$11$ & $3.25e$-$19$ \\
-
-ecology2 & $1.404s$ & $1.301s$ & $1.08$ & $13$ & $5.22e$-$11$ & $2.17e$-$18$ \\
-
-finan512 & $1.822s$ & $1.299s$ & $1.40$ & $12$ & $3.52e$-$11$ & $3.47e$-$18$ \\
-
-G3\_circuit & $2.331s$ & $2.129s$ & $1.09$ & $15$ & $1.36e$-$11$ & $5.20e$-$18$ \\
-
-shallow\_water2 & $0.541s$ & $0.504s$ & $1.07$ & $6$ & $2.12e$-$16$ & $5.05e$-$28$ \\
-
-thermal2 & $2.549s$ & $1.705s$ & $1.49$ & $14$ & $2.36e$-$10$ & $5.20e$-$18$ \\ \hline
-\end{tabular}
-\caption{Performances of parallel CG solver for solving linear systems associated to sparse five-bands matrices
-on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs}
-\label{tab:08}
-\end{center}
-\end{table}
-
-\begin{table}
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|c|c|}
-\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline
-
-2cubes\_sphere & $15.963s$ & $7.250s$ & $2.20$ & $58$ & $6.23e$-$16$ & $3.25e$-$19$ \\
-
-ecology2 & $3.549s$ & $2.176s$ & $1.63$ & $21$ & $4.78e$-$15$ & $1.06e$-$15$ \\
-
-finan512 & $3.862s$ & $1.934s$ & $1.99$ & $17$ & $3.21e$-$14$ & $8.43e$-$17$ \\
-
-G3\_circuit & $4.636s$ & $2.811s$ & $1.65$ & $22$ & $1.08e$-$14$ & $1.77e$-$16$ \\
-
-shallow\_water2 & $2.738s$ & $1.539s$ & $1.78$ & $17$ & $5.54e$-$23$ & $3.82e$-$26$ \\
-
-thermal2 & $5.017s$ & $2.587s$ & $1.94$ & $21$ & $8.25e$-$14$ & $4.34e$-$18$ \\ \hline \hline
-
-cage13 & $9.315s$ & $3.227s$ & $2.89$ & $26$ & $3.38e$-$13$ & $2.08e$-$16$ \\
-
-crashbasis & $35.980s$ & $14.770s$ & $2.43$ & $127$ & $1.17e$-$12$ & $1.56e$-$17$ \\
-
-FEM\_3D\_thermal2 & $24.611s$ & $7.749s$ & $3.17$ & $64$ & $3.87e$-$11$ & $2.84e$-$14$ \\
-
-language & $16.859s$ & $9.697s$ & $1.74$ & $89$ & $2.17e$-$12$ & $1.70e$-$12$ \\
-
-poli\_large & $10.200s$ & $6.534s$ & $1.56$ & $69$ & $5.14e$-$13$ & $1.63e$-$13$ \\
-
-torso3 & $49.074s$ & $19.397s$ & $2.53$ & $175$ & $2.69e$-$12$ & $2.77e$-$16$ \\ \hline
-\end{tabular}
-\caption{Performances of parallel GMRES solver for solving linear systems associated to sparse five-bands matrices
-on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs}
-\label{tab:09}
-\end{center}
-\end{table}
-
-Tables~\ref{tab:08} and~\ref{tab:09} shows the performaces of the parallel CG and GMRES solvers, respectively, obtained on
-a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. The linear systems solved in these tables are associated to the
-sparse five-bands matrices presented on Table~\ref{tab:07}. We can notice from both Tables~\ref{tab:08} and~\ref{tab:09} that
-using a GPU cluster is not efficient for solving these kind of sparse linear systems. We can see that the execution times obtained
-on the GPU cluster are almost equivalent to those obtained on the CPU cluster (see the relative gains presented in column~$4$
-of each table). This is due to the large number of communications necessary to synchronize the computations over the cluster.
-Indeed, the naive partitioning, row-by-row or column-by-column, of sparse matrices having large bandwidths can link a computing
-node to many neighbors and then generate a large number of data dependencies between these computing nodes in the cluster.
-
-Therefore, we have chosen to use a hypergraph partitioning method, which is well-suited to numerous kinds of sparse matrices~\cite{ref11}.
-Indeed, it can well model the communications between the computing nodes, particularly in the case of nonsymmetric and irregular
-matrices, and it gives good reduction of the total communication volume. In contrast, it is an expensive operation in terms of
-execution time and memory space.
-
-The sparse matrix $A$ of the linear system to be solved is modeled as a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ as
-follows:
-\begin{itemize}
-\item each matrix row $\{i\}_{0\leq i<n}$ corresponds to a vertex $v_i\in\mathcal{V}$ and,
-\item each matrix column $\{j\}_{0\leq j<n}$ corresponds to a hyperedge $e_j\in\mathcal{E}$, where:
-\begin{equation}
-\forall a_{ij} \neq 0 \mbox{~is a nonzero value of matrix~} A \mbox{~:~} v_i \in pins[e_j],
-\end{equation}
-\item $w_i$ is the weight of vertex $v_i$ and,
-\item $c_j$ is the cost of hyperedge $e_j$.
-\end{itemize}
-A $K$-way partitioning of a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is defined as $\mathcal{P}=\{\mathcal{V}_1,\ldots,\mathcal{V}_K\}$
-a set of pairwise disjoint non-empty subsets (or parts) of the vertex set $\mathcal{V}$, so that each subset is attributed to a computing node.
-Figure~\ref{fig:08} shows an example of the hypergraph model of a $(9\times 9)$ sparse matrix in three parts. The circles and squares correspond,
-respectively, to the vertices and hyperedges of the hypergraph. The solid squares define the cut hyperedges connecting at least two different parts.
-The connectivity $\lambda_j$ of a cut hyperedge $e_j$ denotes the number of different parts spanned by $e_j$.
-
-\begin{figure}
-\centerline{\includegraphics[scale=0.5]{Chapters/chapter12/figures/hypergraph}}
-\caption{An example of the hypergraph partitioning of a sparse matrix decomposed between three computing nodes.}
-\label{fig:08}
-\end{figure}
-
-The cut hyperedges model the total communication volume between the different computing nodes in the cluster, necessary to perform the parallel SpMV
-multiplication. Indeed, each hyperedge $e_j$ defines a set of atomic computations $b_i\leftarrow b_i+a_{ij}x_j$, $0\leq i,j<n$, of the SpMV multiplication
-$Ax=b$ that need the $j^{th}$ unknown value of solution vector $x$. Therefore, pins of hyperedge $e_j$, $pins[e_j]$, are the set of matrix rows sharing
-and requiring the same unknown value $x_j$. For example in Figure~\ref{fig:08}, hyperedge $e_9$ whose pins are: $pins[e_9]=\{v_2,v_5,v_9\}$ represents the
-dependency of matrix rows $2$, $5$ and $9$ to unknown $x_9$ needed to perform in parallel the atomic operations: $b_2\leftarrow b_2+a_{29}x_9$,
-$b_5\leftarrow b_5+a_{59}x_9$ and $b_9\leftarrow b_9+a_{99}x_9$. However, unknown $x_9$ is the third entry of the sub-solution vector $x$ of part (or node) $3$.
-So the computing node $3$ must exchange this value with nodes $1$ and $2$, which leads to perform two communications.
-
-The hypergraph partitioning allows to reduce the total communication volume required to perform the parallel SpMV multiplication, while maintaining the
-load balancing between the computing nodes. In fact, it allows to minimize at best the following amount:
-\begin{equation}
-\mathcal{X}(\mathcal{P})=\sum_{e_{j}\in\mathcal{E}_{C}}c_{j}(\lambda_{j}-1),
-\end{equation}
-where $\mathcal{E}_{C}$ denotes the set of the cut hyperedges coming from the hypergraph partitioning $\mathcal{P}$ and $c_j$ and $\lambda_j$ are, respectively,
-the cost and the connectivity of cut hyperedge $e_j$. Moreover, it also ensures the load balancing between the $K$ parts as follows:
-\begin{equation}
- W_{k}\leq (1+\epsilon)W_{avg}, \hspace{0.2cm} (1\leq k\leq K) \hspace{0.2cm} \text{and} \hspace{0.2cm} (0<\epsilon<1),
-\end{equation}
-where $W_{k}$ is the sum of all vertex weights ($w_{i}$) in part $\mathcal{V}_{k}$, $W_{avg}$ is the average weight of all $K$ parts and $\epsilon$ is the
-maximum allowed imbalanced ratio.
-
-The hypergraph partitioning is a NP-complete problem but software tools using heuristics are developed, for example: hMETIS~\cite{ref12}, PaToH~\cite{ref13}
-and Zoltan~\cite{ref14}. Since our objective is solving large sparse linear systems, we use the parallel hypergraph partitioning which must be performed by
-at least two MPI processes. It allows to accelerate the data partitioning of large sparse matrices. For this, the hypergraph $\mathcal{H}$ must be partitioned
-in $p$ (number of MPI processes) sub-hypergraphs $\mathcal{H}_k=(\mathcal{V}_k,\mathcal{E}_k)$, $0\leq k<p$, and then we performed the parallel hypergraph
-partitioning method using some functions of the MPI library between the $p$ processes.
-
-Tables~\ref{tab:10} and~\ref{tab:11} shows the performances of the parallel CG and GMRES solvers, respectively, using the hypergraph partitioning for solving
-large linear systems associated to the sparse five-bands matrices presented in Table~\ref{tab:07}. For these experimental tests, we have applied the parallel
-hypergraph partitioning~\cite{ref15} developed in Zoltan tool~\cite{ref14}. We have initialized the parameters of the partitioning operation as follows:
-\begin{itemize}
-\item the weight $w_{i}$ of each vertex $v_{j}\in\mathcal{V}$ is set to the number of nonzero values on matrix row $i$,
-\item for the sake of simplicity, the cost $c_{j}$ of each hyperedge $e_{j}\in\mathcal{E}$ is fixed to $1$,
-\item the maximum imbalanced load ratio $\epsilon$ is limited to $10\%$.\\
-\end{itemize}
-
-\begin{table}[!h]
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|}
-\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{Gains \%}$ \\ \hline \hline
-
-2cubes\_sphere & $5.935s$ & $1.213s$ & $4.89$ & $63.66\%$ \\
-
-ecology2 & $1.093s$ & $0.136s$ & $8.00$ & $89.55\%$ \\
-
-finan512 & $1.762s$ & $0.475s$ & $3.71$ & $63.43\%$ \\
-
-G3\_circuit & $2.095s$ & $0.558s$ & $3.76$ & $73.79\%$ \\
-
-shallow\_water2 & $0.498s$ & $0.068s$ & $7.31$ & $86.51\%$ \\
-
-thermal2 & $1.889s$ & $0.348s$ & $5.43$ & $79.59\%$ \\ \hline
-\end{tabular}
-\caption{Performances of the parallel CG solver using hypergraph partitioning for solving linear systems associated to
-sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPU.}
-\label{tab:10}
-\end{center}
-\end{table}
-
-\begin{table}[!h]
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|}
-\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{Gains \%}$ \\ \hline \hline
-
-2cubes\_sphere & $16.430s$ & $2.840s$ & $5.78$ & $60.83\%$ \\
-
-ecology2 & $3.152s$ & $0.367s$ & $8.59$ & $83.13\%$ \\
-
-finan512 & $3.672s$ & $0.723s$ & $5.08$ & $62.62\%$ \\
-
-G3\_circuit & $4.468s$ & $0.971s$ & $4.60$ & $65.46\%$ \\
-
-shallow\_water2 & $2.647s$ & $0.312s$ & $8.48$ & $79.73\%$ \\
-
-thermal2 & $4.190s$ & $0.666s$ & $6.29$ & $74.25\%$ \\ \hline \hline
-
-cage13 & $8.077s$ & $1.584s$ & $5.10$ & $50.91\%$ \\
-
-crashbasis & $35.173s$ & $5.546s$ & $6.34$ & $62.43\%$ \\
-
-FEM\_3D\_thermal2 & $24.825s$ & $3.113s$ & $7.97$ & $59.83\%$ \\
-
-language & $16.706s$ & $2.522s$ & $6.62$ & $73.99\%$ \\
-
-poli\_large & $12.715s$ & $3.989s$ & $3.19$ & $38.95\%$ \\
-
-torso3 & $48.459s$ & $6.234s$ & $7.77$ & $67.86\%$ \\ \hline
-\end{tabular}
-\caption{Performances of the parallel GMRES solver using hypergraph partitioning for solving linear systems associated to
-sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPU.}
-\label{tab:11}
-\end{center}
-\end{table}
-
-We can notice from both Tables~\ref{tab:10} and~\ref{tab:11} that the hypergraph partitioning has improved the performances of both
-parallel CG and GMRES algorithms. The execution times
-on the GPU cluster of both parallel solvers are significantly improved compared to those obtained by using the partitioning row-by-row.
-For these examples of sparse matrices, the execution times of CG and GMRES solvers are reduced about $76\%$ and $65\%$ respectively
-(see column~$5$ of each table) compared to those obtained in Tables~\ref{tab:08} and~\ref{tab:09}.
-
-In fact, the hypergraph partitioning applied to sparse matrices having large bandwidths allows to reduce the total communication volume
-necessary to synchronize the computations between the computing nodes in the GPU cluster. Table~\ref{tab:12} presents, for each sparse
-matrix, the total communication volume between $12$ GPU computing nodes obtained by using the partitioning row-by-row (column~$2$), the
-total communication volume obtained by using the hypergraph partitioning (column~$3$) and the execution times in minutes of the hypergraph
-partitioning operation performed by $12$ MPI processes (column~$4$). The total communication volume defines the total number of the vector
-elements exchanged by the computing nodes. Then, Table~\ref{tab:12} shows that the hypergraph partitioning method can split the sparse
-matrix so as to minimize the data dependencies between the computing nodes and thus to reduce the total communication volume.
-
-\begin{table}
-\begin{center}
-\begin{tabular}{|c|c|c|c|}
-\hline
-\multirow{4}{*}{\bf Matrix} & {\bf Total comms.} & {\bf Total comms.} & {\bf Execution} \\
- & {\bf volume without} & {\bf volume with} & {\bf trime} \\
- & {\bf hypergraph} & {\bf hypergraph } & {\bf of the parti.} \\
- & {\bf parti.} & {\bf parti.} & {\bf in minutes}\\ \hline \hline
-
-2cubes\_sphere & $25,360,543$ & $240,679$ & $68.98$ \\
-
-ecology2 & $26,044,002$ & $73,021$ & $4.92$ \\
-
-finan512 & $26,087,431$ & $900,729$ & $33.72$ \\
-
-G3\_circuit & $31,912,003$ & $5,366,774$ & $11.63$ \\
-
-shallow\_water2 & $25,105,108$ & $60,899$ & $5.06$ \\
-
-thermal2 & $30,012,846$ & $1,077,921$ & $17.88$ \\ \hline \hline
-
-cage13 & $28,254,282$ & $3,845,440$ & $196.45$ \\
-
-crashbasis & $29,020,060$ & $2,401,876$ & $33.39$ \\
-
-FEM\_3D\_thermal2 & $25,263,767$ & $250,105$ & $49.89$ \\
-
-language & $27,291,486$ & $1,537,835$ & $9.07$ \\
-
-poli\_large & $25,053,554$ & $7,388,883$ & $5.92$ \\
-
-torso3 & $25,682,514$ & $613,250$ & $61.51$ \\ \hline
-\end{tabular}
-\caption{The total communication volume between 12 GPU computing nodes without and with the hypergraph partitioning method.}
-\label{tab:12}
-\end{center}
-\end{table}
-
-Nevertheless, as we can see from the fourth column of Table~\ref{tab:12}, the hypergraph partitioning takes longer compared
-to the execution times of the resolutions. As previously mentioned, the hypergraph partitioning method is less efficient in
-terms of memory consumption and partitioning time than its graph counterpart, but the hypergraph well models the nonsymmetric
-and irregular problems. So for the applications which often use the same sparse matrices, we can perform the hypergraph partitioning
-on these matrices only once for each and then, we save the traces of these partitionings in files to be reused several times.
-Therefore, this allows to avoid the partitioning of the sparse matrices at each resolution of the linear systems.
-
-\begin{figure}[!t]
-\centering
-\begin{tabular}{c}
-\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_band} \\
-\small{(a) Sparse band matrices} \\
-\\
-\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_5band} \\
-\small{(b) Sparse five-bands matrices}
-\end{tabular}
-\caption{Weak-scaling of the parallel CG and GMRES solvers on a GPU cluster for solving large sparse linear systems.}
-\label{fig:09}
-\end{figure}
-
-However, the most important performance parameter is the scalability of the parallel CG and GMRES solvers on a GPU cluster.
-Particularly, we have taken into account the weak-scaling of both parallel algorithms on a cluster of one to 12 GPU computing
-nodes. We have performed a set of experiments on both matrix structures: band matrices and five-bands matrices. The
-sparse matrices of tests are generated from the symmetric sparse matrix {\it thermal2} chosen from the Davis's collection.
-Figures~\ref{fig:09}-$(a)$ and~\ref{fig:09}-$(b)$ show the execution times of both parallel methods for solving large linear
-systems associated to band matrices and those associated to five-bands matrices, respectively. The size of a sparse sub-matrix
-per computing node, for each matrix structure, is fixed as follows:
-\begin{itemize}
-\item band matrix: $15$ million of rows and $105,166,557$ of nonzero values,
-\item five-bands matrix: $5$ million of rows and $78,714,492$ of nonzero values.
-\end{itemize}
-We can see from these figures that both parallel solvers are quite scalable on a GPU cluster. Indeed, the execution times remains
-almost constant while the size of the sparse linear systems to be solved increases proportionally with the number of
-the GPU computing nodes. This means that the communication cost is relatively constant regardless of the number the computing nodes
-in the GPU cluster.
-
-