X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/1ac5b5a535d9154c4f080e94f2f9a49ab6e299b7..c76d2115cc2c37904fb4173b7b1e0288a6646c9c:/BookGPU/Chapters/chapter19/ch19.tex diff --git a/BookGPU/Chapters/chapter19/ch19.tex b/BookGPU/Chapters/chapter19/ch19.tex index 480e528..40912bd 100755 --- a/BookGPU/Chapters/chapter19/ch19.tex +++ b/BookGPU/Chapters/chapter19/ch19.tex @@ -7,7 +7,7 @@ \section{Introduction} \label{ch19:intro} -The Number Field Sieve (NFS)\index{iterative methods!Number Field Sieve} is the current state-of-the-art integer factorization method. It requires the solution of a large sparse linear system over Galois Field GF(2) (called the linear algebra step). The Block Wiedemann\index{Number Field Sieve!Block Wiedemann} (BW)\cite{ch19:bw} algorithm can be used to solve such a large sparse linear system efficiently using iterative sparse matrix vector multiplication (SpMV). +The Number Field Sieve (NFS)\index{iterative method!number field sieve} is the current state-of-the-art integer factorization method. It requires the solution of a large sparse linear system over Galois Field GF(2) (called the linear algebra step). The Block Wiedemann\index{number field sieve!block Wiedemann} (BW)\cite{ch19:bw} algorithm can be used to solve such a large sparse linear system efficiently using iterative sparse matrix vector multiplication (SpMV). Recent integer factorization efforts have been using CPU clusters to solve the large sparse linear system \cite{ch19:kilobit,ch19:rsa768}. The RSA-768 factorization \cite{ch19:rsa768}, for example, reported a runtime of 3 months for the linear algebra step on a cluster with 48 AMD dual hex-core CPUs. Previous work on parallelizing the linear algebra step focused on using CPU clusters and grids \cite{ch19:aoki,ch19:hwang,ch19:grid,ch19:hetero768}. In this chapter, we present a CUDA approach that can be used to accelerate the costly iterative SpMV operation for matrices derived from NFS. 
@@ -18,7 +18,7 @@ SpMV on the GPU has been explored previously in several papers \cite{ch19:nvidia \section{Block Wiedemann algorithm} \label{ch19:block-wiedemann} -The BW algorithm heuristically finds $n$ vectors in the kernel space \index{Number Field Sieve!kernel space}of a $d \times d$ binary matrix $B$; $n$ is one of two parameters $m, n$, called blocking factors\index{Number Field Sieve!blocking factors}. BW consists of the following steps: +The BW algorithm heuristically finds $n$ vectors in the kernel space \index{number field sieve!kernel space}of a $d \times d$ binary matrix $B$; $n$ is one of two parameters $m, n$, called blocking factors\index{number field sieve!blocking factors}. BW consists of the following steps: \begin{itemize} \item \textbf{Step 1 (BW1):} Compute the matrix sequence @@ -28,7 +28,7 @@ A_i = x \cdot B^i \cdot y, \forall i=1,...,{\frac{d}{m}}+{ \frac{d}{n} }+O(1), \end{equation} where $x,y$ are randomly chosen binary matrices of size $m \times d$ and $d \times n$, respectively. -\item \textbf{Step 2 (BW2):} The Berlekamp-Massey\index{Number Field Sieve!Berlekamp-Massey} algorithm \cite{ch19:Thome:subqad} is used to compute a generating polynomial of the matrix sequence $A$ from BW1 in the form +\item \textbf{Step 2 (BW2):} The Berlekamp-Massey\index{number field sieve!Berlekamp-Massey} algorithm \cite{ch19:Thome:subqad} is used to compute a generating polynomial of the matrix sequence $A$ from BW1 in the form \begin{equation} F(X)= \sum_{i=1}^{{ \frac{d}{n} }+O(1)} {C_i \cdot X^i}, \end{equation} @@ -100,7 +100,7 @@ We can treat $x$, $y$, and $S_i$ as vectors of block width $m$ or $n$. Assuming \label{fig:ex_matrix} \end{figure} - \subsubsection*{Coordinate list (COO)\index{Compressed storage format!COO}} + \subsubsection*{Coordinate list (COO)\index{compressed storage format!COO}} For each nonzero, both its column and row indices are explicitly stored. 
The Cusp implementation \cite{ch19:cusp} stores elements in sorted order of row indices ensuring that entries with the same row index are stored contiguously. \begin{lstlisting}[caption={}] @@ -109,7 +109,7 @@ For each nonzero, both its column and row indices are explicitly stored. The Cus coo.value = {3, 1, 5, 2, 4, 6, 8, 10, 9, 7, 11} \end{lstlisting} - \subsubsection*{Compressed sparse row (CSR)\index{Compressed storage format!CSR}} Nonzeros are sorted by the row index, and only their column indices are explicitly stored in a column array. Additionally, the vector $row\_start$ stores indices of the first nonzero element of each row in the column array. + \subsubsection*{Compressed sparse row (CSR)\index{compressed storage format!CSR}} Nonzeros are sorted by the row index, and only their column indices are explicitly stored in a column array. Additionally, the vector $row\_start$ stores indices of the first nonzero element of each row in the column array. \begin{lstlisting}[caption={}] csr.row_start = {0, 1, 3, 5, 8, 9, 12} @@ -117,7 +117,7 @@ For each nonzero, both its column and row indices are explicitly stored. The Cus csr.value = {3, 1, 5, 2, 4, 6, 8, 10, 9, 7, 11} \end{lstlisting} - \subsubsection*{Ellpack (ELL)\index{Compressed storage format!ELL}} Let $K$ be the maximum number of nonzero elements in any row of the matrix. Then, for each row, ELL stores exactly $K$ elements (extra padding is required for rows that contain fewer than $K$ nonzero elements). Only column indices are required to store in an array, the row index can be implied since exactly $K$ elements are stored per row. The Cusp implementation stores the column indices in a transposed manner so that consecutive threads can access consecutive memory addresses. + \subsubsection*{Ellpack (ELL)\index{compressed storage format!ELL}} Let $K$ be the maximum number of nonzero elements in any row of the matrix. 
Then, for each row, ELL stores exactly $K$ elements (extra padding is required for rows that contain fewer than $K$ nonzero elements). Only the column indices need to be stored in an array; the row index is implied because exactly $K$ elements are stored per row. The Cusp implementation stores the column indices in a transposed manner so that consecutive threads can access consecutive memory addresses. \begin{lstlisting}[caption={}] ell.col_index = { @@ -130,7 +130,7 @@ For each nonzero, both its column and row indices are explicitly stored. The Cus *, *, *, 10, *, *} \end{lstlisting} - \subsubsection*{Hybrid (HYB)\index{Compressed storage format!HYB}} The HYB format heuristically computes a value $K$ and stores $K$ nonzeros per rows in the ELL format. When a row has more than $K$ non-zeros, the trailing nonzeros are stored in COO. This design decreases the storage overhead due to ELL padding elements and thus improves the overall performance. + \subsubsection*{Hybrid (HYB)\index{compressed storage format!HYB}} The HYB format heuristically computes a value $K$ and stores $K$ nonzeros per row in the ELL format. When a row has more than $K$ nonzeros, the trailing nonzeros are stored in COO. This design decreases the storage overhead due to ELL padding elements and thus improves the overall performance. \begin{lstlisting}[caption={}] hyb.nnz_per_row = 2 hyb.ell.col_index = {2, 1, 1, 0, 2, 0, *, 4, 3, 2, *, 5} @@ -140,7 +140,7 @@ For each nonzero, both its column and row indices are explicitly stored. The Cus hyb.coo.value = {10} \end{lstlisting} - \subsubsection*{Sliced Ellpack (SLE)\index{Compressed storage format!SLE}} This format partitions the matrix into horizontal slices of $S$ adjacent rows \cite{ch19:sle}. Each slice is stored in ELLPACK format. The maximum number of nonzeros may be different for each slice. An additional array $slice\_start$ is used to index the first element in each slice. 
The matrix rows are usually sorted by the number of nonzeros per row in order to move rows with similar number of nonzeros together. + \subsubsection*{Sliced Ellpack (SLE)\index{compressed storage format!SLE}} This format partitions the matrix into horizontal slices of $S$ adjacent rows \cite{ch19:sle}. Each slice is stored in ELLPACK format. The maximum number of nonzeros may be different for each slice. An additional array $slice\_start$ is used to index the first element in each slice. The matrix rows are usually sorted by the number of nonzeros per row in order to move rows with similar number of nonzeros together. \begin{lstlisting}[caption={}] sle.slice_size = 2 sle.col_index = { @@ -177,7 +177,7 @@ The existing formats do not achieve good performance due to the special structur \section{A hybrid format for SpMV on GPUs} \label{Implementation} -As a preprocessing step, we reorder the rows of the matrix by their \emph{row weight}, in nonincreasing order. The row weight of row $j$ of $B$ is defined as the total number of nonzero elements in row $j$. We then partition the sorted matrix rows into at most four consecutive parts. Each part uses a different format. The different formats are optimized for the sparseness properties of each partition as shown in Figure \ref{fig:partitioning}. For the densest part, we use a dense format. When the matrix gets less dense, we switch to another format which we call \index{Compressed storage format!Sliced COO} Sliced COO (SCOO). SCOO has three variants, small, medium, and large. Our formats are now described in more detail. +As a preprocessing step, we reorder the rows of the matrix by their \emph{row weight}, in nonincreasing order. The row weight of row $j$ of $B$ is defined as the total number of nonzero elements in row $j$. We then partition the sorted matrix rows into at most four consecutive parts. Each part uses a different format. 
The different formats are optimized for the sparseness properties of each partition as shown in Figure \ref{fig:partitioning}. For the densest part, we use a dense format. When the matrix gets less dense, we switch to another format which we call \index{compressed storage format!sliced COO} Sliced COO (SCOO). SCOO has three variants: small, medium, and large. Our formats are now described in more detail. \begin{figure}[t] \centering @@ -421,13 +421,13 @@ We compare the SCOO format to the CSR, COO, and HYB format of Cusp 0.3.0. Other -The SCOO format achieves a stable performance for different matrices in single-precision mode. In most cases a performance of over 10 Gflop/s can be sustained. For some highly unstructured matrices such as \emph{GL7d19}, \emph{wikipedia-20070206}, \emph{rgg\_n\_2\_24\_s0} and \emph{kron\_g500-logn21} SCOO achieves high speedups ranging from 3 to 6 compared to the best performaning Cusp format. +The SCOO format achieves a stable performance for different matrices in single-precision mode. In most cases a performance of over 10 Gflop/s can be sustained. For some highly unstructured matrices such as \emph{GL7d19}, \emph{wikipedia-20070206}, \emph{rgg\_n\_2\_24\_s0}, and \emph{kron\_g500-logn21}, SCOO achieves high speedups ranging from 3 to 6 compared to the best performing Cusp format. -For most matrices, HYB produces the best performance among the tested Cusp formats. HYB is able to outperform SCOO only for two matrices: \emph{nlpkkt120} and \emph{nlpkkt160}. Both matrices have a similar structure i.e. they consist of consecutive rows that have a very similar number of nonzero coefficients which is suitable to be stored in the ELL section of the HYB format. Moreover the nonzeros are close to each other facilitating coaleasing and cache-friendly access patterns by nature. SCOO is able to outperform COO and CSR for all tested matrices. +For most matrices, HYB produces the best performance among the tested Cusp formats. 
HYB is able to outperform SCOO for only two matrices: \emph{nlpkkt120} and \emph{nlpkkt160}. Both matrices have a similar structure, i.e., they consist of consecutive rows that have a very similar number of nonzero coefficients which are suitable to be stored in the ELL section of the HYB format. Moreover, the nonzeros are close to each other, facilitating coalescing and cache-friendly access patterns by nature. SCOO is able to outperform COO and CSR for all tested matrices. In matrix $Relat9$ we observe some patterns but the matrix is still generally unstructured, thus SCOO is able to achieve about 2 times speed up compared to HYB which is the best among tested Cusp formats in this case. The average speedup of SCOO for the tested matrices is 3.0 compared to CSR, 5.02 compared to COO, 2.15 compared to HYB. -We show the visualization of sparse matrices \emph{nlpkkt120}, \emph{Relat9}, \emph{GL7d19} in Figure \ref{fig:mat-str}, \ref{fig:mat-mid}, \ref{fig:mat-unstr} using MatView \cite{ch19:matview}. The white color represents zero entries, gray color represents nonzero entries. +We show the visualization of sparse matrices \emph{nlpkkt120}, \emph{relat9}, \emph{GL7d19} in Figure \ref{fig:mat-str}, \ref{fig:mat-mid}, \ref{fig:mat-unstr} using MatView \cite{ch19:matview}. The white color represents zero entries, gray color represents nonzero entries. 
\begin{figure}[htbp] \centering @@ -435,16 +435,16 @@ We show the visualization of sparse matrices \emph{nlpkkt120}, \emph{Relat9}, \e \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-str.pdf} \label{fig:mat-str} } - \subfigure[Relat9 - first 10000 rows] { + \subfigure[relat9--first 10000 rows] { \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-mid.pdf} \label{fig:mat-mid} } - \subfigure[GL7d19 - first 500 rows and columns] { + \subfigure[GL7d19--first 500 rows and columns] { \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-uns.pdf} \label{fig:mat-unstr} } - \caption{Visualization of \emph{nlpkkt120}, \emph{Relat9}, and \emph{GL7d19} matrix.} + \caption{Visualization of \emph{nlpkkt120}, \emph{relat9}, and \emph{GL7d19} matrix.} % \label{fig:mat-visual} \end{figure} @@ -458,13 +458,13 @@ We show the visualization of sparse matrices \emph{nlpkkt120}, \emph{Relat9}, \e \label{fig:scoo-vs-cpu} \end{figure} -We use the Intel MKL library 10.3 in order to compare SCOO performance to an optimized CPU implementation. MKL SpMV receives the input matrices in CSR format. The results are shown in Figure \ref{fig:scoo-vs-cpu}. Using a GTX-580, we achieve speedups ranging between 5.5 and 18 over MKL on a 4-core CPU with hyper-threading using 8 threads. Also note that the SCOO performance on a GTX-580 is around 1.5 times faster than on the C2075 due to the increased memory bandwidth and clock speed. The storage requirement for the \emph{rgg\_n\_2\_24\_s0} and \emph{uk-2002} matrices and associated input/output vectors slightly exceeds the 3 GB global memory of the GTX-580 and thus are not included. +We used the Intel MKL library 10.3 in order to compare SCOO performance to an optimized CPU implementation. MKL SpMV receives the input matrices in CSR format. The results are shown in Figure \ref{fig:scoo-vs-cpu}. Using a GTX-580, we achieved speedups ranging between 5.5 and 18 over MKL on a 4-core CPU with hyper-threading using 8 threads. 
Also note that the SCOO performance on a GTX-580 is around 1.5 times faster than on the C2075 due to the increased memory bandwidth and clock speed. The storage requirement for the \emph{rgg\_n\_2\_24\_s0} and \emph{uk-2002} matrices and associated input/output vectors slightly exceeds the 3 GB global memory of the GTX-580, and thus these matrices are not included. \section{Conclusion} \label{ch19:conclusion} -In this chapter, we have presented our implementation of iterative SpMV for NFS matrices on GPUs with the CUDA programming language. Our GPU implementation takes advantage of the variety of sparseness properties in NFS matrices to produce suitable formats for different parts. The GPU implementation shows promising improvement over an optimized CPU implementation. As the size of integers in factorization projects is expected to increase further, the linear algebrea step of NFS will become an even bigger bottleneck. The size and sparseness of matrices generated by the NFS sieving step are growing significantly with the size of the integer to be factored. Thus, a big GPU cluster is required to accelerate the linear algebra step. However, in order to achieve scalability for bigger problem sizes, the amount of GPU RAM and data transfer bandwidth need to be increased in addition to the number of GPUs. +In this chapter, we have presented our implementation of iterative SpMV for NFS matrices on GPUs with the CUDA programming language. Our GPU implementation takes advantage of the variety of sparseness properties in NFS matrices to produce suitable formats for different parts. The GPU implementation shows promising improvement over an optimized CPU implementation. As the size of integers in factorization projects is expected to increase further, the linear algebra step of NFS will become an even bigger bottleneck. The size and sparseness of matrices generated by the NFS sieving step are growing significantly with the size of the integer to be factored. 
Thus, a large GPU cluster is required to accelerate the linear algebra step. However, in order to achieve scalability for larger problem sizes, the amounts of GPU RAM and data transfer bandwidth need to be increased in addition to the number of GPUs. -We further adapted the proposed Sliced COO format to single-precision floating-point numbers and evaluated it with large and sparse matrices derived from other computational science applications. We have published our code at https://github.com/danghvu/cudaSpmv. +We further adapted the proposed Sliced COO format to single-precision floating-point numbers and evaluated it with large and sparse matrices derived from other computational science applications. We have published our code at https://github.com/danghvu/cudaSpmv \putbib[Chapters/chapter19/biblio]