elements necessary to compute this multiplication. First, each computing node determines, in its
local subvector, the vector elements needed by other nodes. Then, the neighboring nodes exchange
between them these shared vector elements. The data exchanges are implemented by using the MPI
point-to-point communication routines: blocking\index{MPI!blocking} sends with \verb+MPI_Send()+
and nonblocking\index{MPI!nonblocking} receives with \verb+MPI_Irecv()+. Figure~\ref{ch12:fig:02}
shows an example of data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2},
and \textit{Node 3}. In this example, the sparse matrix $A$ partitioned among these four computing
nodes is the one presented in Figure~\ref{ch12:fig:01}.
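
As an illustration, the sketch below shows how such an exchange can be programmed with the MPI routines
mentioned above. It assumes that the list of neighbors and the send and receive buffers of shared vector
elements have already been built from the matrix partitioning; all identifiers are illustrative and do not
correspond to those of our implementation.
\begin{verbatim}
/* Exchange of the shared vector elements between a node and its
 * neighbors (illustrative identifiers; buffers and neighbor list
 * are assumed to be built from the matrix partitioning).          */
#include <mpi.h>

void exchange_shared_elements(int nb_neighbors, const int *neighbors,
                              double **send_buf, const int *send_count,
                              double **recv_buf, const int *recv_count,
                              MPI_Request *requests)
{
  /* Post one nonblocking receive per neighbor. */
  for (int i = 0; i < nb_neighbors; i++)
    MPI_Irecv(recv_buf[i], recv_count[i], MPI_DOUBLE, neighbors[i],
              0, MPI_COMM_WORLD, &requests[i]);

  /* Send the locally owned shared elements with blocking sends. */
  for (int i = 0; i < nb_neighbors; i++)
    MPI_Send(send_buf[i], send_count[i], MPI_DOUBLE, neighbors[i],
             0, MPI_COMM_WORLD);

  /* Wait until all the expected elements have arrived. */
  MPI_Waitall(nb_neighbors, requests, MPI_STATUSES_IGNORE);
}
\end{verbatim}
Posting the nonblocking receives before issuing the blocking sends allows each pair of neighboring nodes
to exchange their shared elements without deadlock.
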
The shared vector elements must be copied from the GPU memory to the CPU memory,
and vice versa, before and after the synchronization operation between CPUs. We have used the CUBLAS\index{CUBLAS}
communication subroutines to perform the data transfers between a CPU core and its GPU: \verb+cublasGetVector()+
and \verb+cublasSetVector()+. Finally, in addition to the data exchanges, GPU nodes perform reduction operations
to compute in parallel the dot products and Euclidean norms. This is implemented by using the MPI global communication\index{MPI!global}
\verb+MPI_Allreduce()+.
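
A possible organization of these operations is sketched below, assuming that the shared vector elements to be
sent and received are packed in contiguous buffers on the GPU and on the CPU; the function and variable names
are illustrative only.
\begin{verbatim}
#include <mpi.h>
#include <cublas.h>  /* declares cublasGetVector() and cublasSetVector() */

/* CPU<->GPU copies surrounding one synchronization step:
 * gpu_send/cpu_send hold the local shared elements needed by the
 * neighbors, cpu_recv/gpu_recv the elements received from them.   */
void synchronize_shared_elements(int nb_to_send, const double *gpu_send,
                                 double *cpu_send, int nb_to_recv,
                                 const double *cpu_recv, double *gpu_recv)
{
  /* GPU -> CPU, before the MPI exchange between neighboring CPUs. */
  cublasGetVector(nb_to_send, sizeof(double), gpu_send, 1, cpu_send, 1);

  /* ... MPI_Send()/MPI_Irecv() exchange of the shared elements ... */

  /* CPU -> GPU, after the MPI exchange. */
  cublasSetVector(nb_to_recv, sizeof(double), cpu_recv, 1, gpu_recv, 1);
}

/* Global dot product or squared norm: every node contributes the
 * partial result computed on its GPU to a global sum.             */
double global_reduction(double local_result)
{
  double global_result = 0.0;
  MPI_Allreduce(&local_result, &global_result, 1, MPI_DOUBLE,
                MPI_SUM, MPI_COMM_WORLD);
  return global_result;
}
\end{verbatim}
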
In the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating
the first and the last nonzero value on a matrix row.
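
Assuming, for illustration, that the matrix is stored in CSR format with the column indices of each row sorted
in increasing order, this quantity can be computed as in the sketch below, which returns the maximum of the
per-row span over all rows (the array names are the usual CSR ones and are purely illustrative).
\begin{verbatim}
/* Bandwidth of a sparse matrix held in CSR format: row_ptr and
 * col_ind are the standard CSR arrays, with the column indices of
 * every row assumed to be sorted in increasing order.             */
int matrix_bandwidth(int nb_rows, const int *row_ptr, const int *col_ind)
{
  int bandwidth = 0;
  for (int i = 0; i < nb_rows; i++) {
    if (row_ptr[i] < row_ptr[i + 1]) {          /* skip empty rows      */
      int first = col_ind[row_ptr[i]];          /* first nonzero column */
      int last  = col_ind[row_ptr[i + 1] - 1];  /* last nonzero column  */
      if (last - first > bandwidth)
        bandwidth = last - first;               /* keep the largest span */
    }
  }
  return bandwidth;
}
\end{verbatim}
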
\begin{table}
\centering
\begin{small}
\begin{tabular}{|c|c|c|c|c|}
\hline
{\bf Matrix Type} & {\bf Matrix Name} & {\bf \# Rows} & {\bf \# Nonzeros} & {\bf Bandwidth} \\ \hline \hline
& torso3 & $259,156$ & $4,429,042$ & $216,854$ \\ \hline
\end{tabular}
\end{small}
\caption{Main characteristics of sparse matrices chosen from the University of Florida collection.}
\label{ch12:tab:01}
\end{table}
\begin{table}[!h]
\begin{center}
\begin{small}
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\#~Iter.}$ & $\mathbf{Prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline
torso3 & $4.242s$ & $2.030s$ & $2.09$ & $175$ & $2.69e$-$10$ & $1.78e$-$14$ \\ \hline
\end{tabular}
\end{small}
\caption{Performance of the parallel GMRES method on a cluster of 24 CPU cores versus on a cluster of 12 GPUs.}
\label{ch12:tab:03}
\end{center}
\end{table}
The CG method is characterized by a better convergence\index{convergence} rate and a shorter execution
time per iteration than the GMRES method. Moreover, an iteration of the parallel GMRES method requires
more data exchanges between computing nodes than an iteration of the parallel CG method, mainly because
the orthogonalization of the Krylov basis involves several dot products, and hence several global
reductions, per iteration.
\clearpage
\begin{table}[!h]
\begin{center}
\begin{small}
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\#~Iter.}$ & $\mathbf{Prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline
torso3 & $31.463s$ & $3.681s$ & $8.55$ & $175$ & $2.69e$-$10$ & $2.66e$-$14$ \\ \hline
\end{tabular}
\end{small}
\caption{Performance of the parallel GMRES method for solving linear systems associated with sparse banded matrices
on a cluster of 24 CPU cores versus on a cluster of 12 GPUs.}
\label{ch12:tab:06}