\label{ch12:tab:04}
\end{table}
-We have used the parallel CG and GMRES algorithms for solving sparse linear systems of $25$
-million unknown values. The sparse matrices associated to these linear systems are generated
-from those presented in Table~\ref{ch12:tab:01}. Their main characteristics are given in Table~\ref{ch12:tab:04}.
-Tables~\ref{ch12:tab:05} and~\ref{ch12:tab:06} shows the performances of the parallel CG and
-GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a cluster of $12$
-GPUs. Obviously, we can notice from these tables that solving large sparse linear systems on
-a GPU cluster is more efficient than on a CPU cluster (see relative gains $\tau$). We can also
-notice that the execution times of the CG method, whether in a CPU cluster or in a GPU cluster,
-are better than those of the GMRES method for solving large symmetric linear systems. In fact, the
-CG method is characterized by a better convergence\index{Convergence} rate and a shorter execution
-time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES
-method requires more data exchanges between computing nodes compared to the parallel CG method.
-
\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}
\end{center}
\end{table}
+
+We have used the parallel CG and GMRES algorithms for solving sparse linear systems of $25$
+million unknown values. The sparse matrices associated to these linear systems are generated
+from those presented in Table~\ref{ch12:tab:01}. Their main characteristics are given in Table~\ref{ch12:tab:04}.
+Tables~\ref{ch12:tab:05} and~\ref{ch12:tab:06} shows the performances of the parallel CG and
+GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a cluster of $12$
+GPUs. Obviously, we can notice from these tables that solving large sparse linear systems on
+a GPU cluster is more efficient than on a CPU cluster (see relative gains $\tau$). We can also
+notice that the execution times of the CG method, whether in a CPU cluster or in a GPU cluster,
+are better than those of the GMRES method for solving large symmetric linear systems. In fact, the
+CG method is characterized by a better convergence\index{Convergence} rate and a shorter execution
+time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES
+method requires more data exchanges between computing nodes compared to the parallel CG method.
+
+
%%--------------------------%%
%% SECTION 5 %%
%%--------------------------%%
-\chapterauthor{Bertil Schmidt, Hoang-Vu Dang}{Johannes Gutenberg University of Mainz}
+\chapterauthor{Bertil Schmidt}{Johannes Gutenberg University of Mainz}
+\chapterauthor{Hoang-Vu Dang}{Johannes Gutenberg University of Mainz}
\chapter{Solving large sparse linear systems for integer factorization on GPUs}
\label{chapter19}
\subsubsection*{Coordinate list (COO)\index{Compressed storage format!COO}}
For each non-zero, both its column and row indices are explicitly stored. The Cusp implementation \cite{ch19:cusp} stores elements in sorted order of row indices ensuring that entries with the same row index are stored contiguously.
- \begin{lstlisting}
+ \begin{lstlisting}[caption={}]
coo.row_index = {0, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5}
coo.col_index = {2, 1, 4, 1, 3, 0, 2, 4, 2, 0, 5}
coo.value = {3, 1, 5, 2, 4, 6, 8, 10, 9, 7, 11}
\subsubsection*{Compressed Sparse Row (CSR)\index{Compressed storage format!CSR}} Non-zeros are sorted by the row index, and only their column indices are explicitly stored in a column array. Additionally, the vector $row\_start$ stores indices of the first non-zero element of each row in the column array.
- \begin{lstlisting}
+ \begin{lstlisting}[caption={}]
csr.row_start = {0, 1, 3, 5, 8, 9, 12}
csr.col_index = {2, 1, 4, 1, 3, 0, 2, 4, 2, 0, 5}
csr.value = {3, 1, 5, 2, 4, 6, 8, 10, 9, 7, 11}
\subsubsection*{Ellpack (ELL)\index{Compressed storage format!ELL}} Let $K$ be the maximum number of non-zero elements in any row of the matrix. Then, for each row, ELL stores exactly $K$ elements (extra padding is required for rows that contain less than $K$ non-zero elements). Only column indices are required to store in an array, the row index can be implied since exactly $K$ elements are stored per row. The Cusp implementation store the column indices in a transposed manner so that consecutive threads can access consecutive memory addresses.
- \begin{lstlisting}
+ \begin{lstlisting}[caption={}]
ell.col_index = {
2, 1, 1, 0, 2, 0,
*, 4, 3, 2, *, 5,
\end{lstlisting}
\subsubsection*{Hybrid (HYB)\index{Compressed storage format!HYB}} The HYB format heuristically computes an value $K$ and store $K$ non-zeros per rows in the ELL format. When a row has more than $K$ non-zeros, the trailing non-zeros are stored in COO. This design decreases the storage overhead due to ELL padding elements and thus improve the overall performance.
- \begin{lstlisting}
+ \begin{lstlisting}[caption={}]
hyb.nnz_per_row = 2
hyb.ell.col_index = {2, 1, 1, 0, 2, 0, *, 4, 3, 2, *, 5}
hyb.ell.value = {3, 1, 2, 6, 9, 7, *, 5, 4, 8, *, 11}
\end{lstlisting}
\subsubsection*{Sliced Ellpack (SLE)\index{Compressed storage format!SLE}} This format partitions the matrix into horizontal slices of $S$ adjacent rows \cite{ch19:sle}. Each slice is stored in ELLPACK format. The maximum number of non-zeros may be different for each slice. An additional array $slice\_start$ is used to index the first element in each slice. The matrix rows are usually sorted by the number of non-zeros per row in order to move rows with similar number of non-zeros together.
- \begin{lstlisting}
+ \begin{lstlisting}[caption={}]
sle.slice_size = 2
sle.col_index = {
2, 1, *, 4,
\label{ch19:conclusion}
In this chapter, we have presented our implementation of iterative SpMV for NFS matrices on GPUs with the CUDA programming language. Our GPU implementation takes advantage of the variety of sparseness properties in NFS matrices to produce suitable formats for different parts. The GPU implementation shows promising improvement over an optimized CPU implementation. As the size of integers in factorization projects is expected to increase further, the linear algebrea step of NFS will become an even bigger bottleneck. The size and sparseness of matrices generated by the NFS sieving step are growing significantly with the size of the integer to be factored. Thus, a big GPU cluster is required to accelerate the linear algebra step. However, in order to achieve scalability for bigger problem sizes, the amount of GPU RAM and data transfer bandwidth need to be increased in addition to the number of GPUs.
-We further adapted the proposed Sliced COO format to single-precision floating-point numbers and evaluated it with large and sparse matrices derived from other computational science applications. We have published our code at [-to be updated-].
+We further adapted the proposed Sliced COO format to single-precision floating-point numbers and evaluated it with large and sparse matrices derived from other computational science applications. We have published our code at https://github.com/danghvu/cudaSpmv.
\putbib[Chapters/chapter19/biblio]