X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/5efdabb8714996a643bae4305878c2f277eaf679..0d39f3bfb1736ae41805f75a779e0bb01f4f5139:/BookGPU/Chapters/chapter19/ch19.tex diff --git a/BookGPU/Chapters/chapter19/ch19.tex b/BookGPU/Chapters/chapter19/ch19.tex index 90dee23..40912bd 100755 --- a/BookGPU/Chapters/chapter19/ch19.tex +++ b/BookGPU/Chapters/chapter19/ch19.tex @@ -7,7 +7,7 @@ \section{Introduction} \label{ch19:intro} -The Number Field Sieve (NFS)\index{iterative methods!Number Field Sieve} is the current state-of-the-art integer factorization method. It requires the solution of a large sparse linear system over Galois Field GF(2) (called the linear algebra step). The Block Wiedemann\index{Number Field Sieve!Block Wiedemann} (BW)\cite{ch19:bw} algorithm can be used to solve such a large sparse linear system efficiently using iterative sparse matrix vector multiplication (SpMV). +The Number Field Sieve (NFS)\index{iterative method!number field sieve} is the current state-of-the-art integer factorization method. It requires the solution of a large sparse linear system over Galois Field GF(2) (called the linear algebra step). The Block Wiedemann\index{number field sieve!block Wiedemann} (BW)\cite{ch19:bw} algorithm can be used to solve such a large sparse linear system efficiently using iterative sparse matrix vector multiplication (SpMV). Recent integer factorization efforts have been using CPU clusters to solve the large sparse linear system \cite{ch19:kilobit,ch19:rsa768}. The RSA-768 factorization \cite{ch19:rsa768}, for example, reported a runtime of 3 months for the linear algebra step on a cluster with 48 AMD dual hex-core CPUs. Previous work on parallelizing the linear algebra step focused on using CPU clusters and grids \cite{ch19:aoki,ch19:hwang,ch19:grid,ch19:hetero768}. In this chapter, we present a CUDA approach that can be used to accelerate the costly iterative SpMV operation for matrices derived from NFS. @@ -18,7 +18,7 @@ SpMV on the GPU has been explored previously in several papers \cite{ch19:nvidia \section{Block Wiedemann algorithm} \label{ch19:block-wiedemann} -The BW algorithm heuristically finds $n$ vectors in the kernel space \index{Number Field Sieve!kernel space}of a $d \times d$ binary matrix $B$; $n$ is one of two parameters $m, n$, called blocking factors\index{Number Field Sieve!blocking factors}. BW consists of the following steps: +The BW algorithm heuristically finds $n$ vectors in the kernel space \index{number field sieve!kernel space}of a $d \times d$ binary matrix $B$; $n$ is one of two parameters $m, n$, called blocking factors\index{number field sieve!blocking factors}. BW consists of the following steps: \begin{itemize} \item \textbf{Step 1 (BW1):} Compute the matrix sequence @@ -28,7 +28,7 @@ A_i = x \cdot B^i \cdot y, \forall i=1,...,{\frac{d}{m}}+{ \frac{d}{n} }+O(1), \end{equation} where $x,y$ are randomly chosen binary matrices of size $m \times d$ and $d \times n$, respectively. -\item \textbf{Step 2 (BW2):} The Berlekamp-Massey\index{Number Field Sieve!Berlekamp-Massey} algorithm \cite{ch19:Thome:subqad} is used to compute a generating polynomial of the matrix sequence $A$ from BW1 in the form +\item \textbf{Step 2 (BW2):} The Berlekamp-Massey\index{number field sieve!Berlekamp-Massey} algorithm \cite{ch19:Thome:subqad} is used to compute a generating polynomial of the matrix sequence $A$ from BW1 in the form \begin{equation} F(X)= \sum_{i=1}^{{ \frac{d}{n} }+O(1)} {C_i \cdot X^i}, \end{equation} @@ -100,7 +100,7 @@ We can treat $x$, $y$, and $S_i$ as vectors of block width $m$ or $n$. Assuming \label{fig:ex_matrix} \end{figure} - \subsubsection*{Coordinate list (COO)\index{Compressed storage format!COO}} + \subsubsection*{Coordinate list (COO)\index{compressed storage format!COO}} For each nonzero, both its column and row indices are explicitly stored. The Cusp implementation \cite{ch19:cusp} stores elements in sorted order of row indices ensuring that entries with the same row index are stored contiguously. \begin{lstlisting}[caption={}] @@ -109,7 +109,7 @@ For each nonzero, both its column and row indices are explicitly stored. The Cus coo.value = {3, 1, 5, 2, 4, 6, 8, 10, 9, 7, 11} \end{lstlisting} - \subsubsection*{Compressed sparse row (CSR)\index{Compressed storage format!CSR}} Nonzeros are sorted by the row index, and only their column indices are explicitly stored in a column array. Additionally, the vector $row\_start$ stores indices of the first nonzero element of each row in the column array. + \subsubsection*{Compressed sparse row (CSR)\index{compressed storage format!CSR}} Nonzeros are sorted by the row index, and only their column indices are explicitly stored in a column array. Additionally, the vector $row\_start$ stores indices of the first nonzero element of each row in the column array. \begin{lstlisting}[caption={}] csr.row_start = {0, 1, 3, 5, 8, 9, 12} @@ -117,7 +117,7 @@ For each nonzero, both its column and row indices are explicitly stored. The Cus csr.value = {3, 1, 5, 2, 4, 6, 8, 10, 9, 7, 11} \end{lstlisting} - \subsubsection*{Ellpack (ELL)\index{Compressed storage format!ELL}} Let $K$ be the maximum number of nonzero elements in any row of the matrix. Then, for each row, ELL stores exactly $K$ elements (extra padding is required for rows that contain fewer than $K$ nonzero elements). Only column indices are required to store in an array, the row index can be implied since exactly $K$ elements are stored per row. The Cusp implementation stores the column indices in a transposed manner so that consecutive threads can access consecutive memory addresses. + \subsubsection*{Ellpack (ELL)\index{compressed storage format!ELL}} Let $K$ be the maximum number of nonzero elements in any row of the matrix. Then, for each row, ELL stores exactly $K$ elements (extra padding is required for rows that contain fewer than $K$ nonzero elements). Only column indices are required to store in an array, the row index can be implied since exactly $K$ elements are stored per row. The Cusp implementation stores the column indices in a transposed manner so that consecutive threads can access consecutive memory addresses. \begin{lstlisting}[caption={}] ell.col_index = { @@ -130,7 +130,7 @@ For each nonzero, both its column and row indices are explicitly stored. The Cus *, *, *, 10, *, *} \end{lstlisting} - \subsubsection*{Hybrid (HYB)\index{Compressed storage format!HYB}} The HYB format heuristically computes a value $K$ and stores $K$ nonzeros per rows in the ELL format. When a row has more than $K$ non-zeros, the trailing nonzeros are stored in COO. This design decreases the storage overhead due to ELL padding elements and thus improves the overall performance. + \subsubsection*{Hybrid (HYB)\index{compressed storage format!HYB}} The HYB format heuristically computes a value $K$ and stores $K$ nonzeros per rows in the ELL format. When a row has more than $K$ non-zeros, the trailing nonzeros are stored in COO. This design decreases the storage overhead due to ELL padding elements and thus improves the overall performance. \begin{lstlisting}[caption={}] hyb.nnz_per_row = 2 hyb.ell.col_index = {2, 1, 1, 0, 2, 0, *, 4, 3, 2, *, 5} @@ -140,7 +140,7 @@ For each nonzero, both its column and row indices are explicitly stored. The Cus hyb.coo.value = {10} \end{lstlisting} - \subsubsection*{Sliced Ellpack (SLE)\index{Compressed storage format!SLE}} This format partitions the matrix into horizontal slices of $S$ adjacent rows \cite{ch19:sle}. Each slice is stored in ELLPACK format. The maximum number of nonzeros may be different for each slice. An additional array $slice\_start$ is used to index the first element in each slice. The matrix rows are usually sorted by the number of nonzeros per row in order to move rows with similar number of nonzeros together. + \subsubsection*{Sliced Ellpack (SLE)\index{compressed storage format!SLE}} This format partitions the matrix into horizontal slices of $S$ adjacent rows \cite{ch19:sle}. Each slice is stored in ELLPACK format. The maximum number of nonzeros may be different for each slice. An additional array $slice\_start$ is used to index the first element in each slice. The matrix rows are usually sorted by the number of nonzeros per row in order to move rows with similar number of nonzeros together. \begin{lstlisting}[caption={}] sle.slice_size = 2 sle.col_index = { @@ -177,7 +177,7 @@ The existing formats do not achieve good performance due to the special structur \section{A hybrid format for SpMV on GPUs} \label{Implementation} -As a preprocessing step, we reorder the rows of the matrix by their \emph{row weight}, in nonincreasing order. The row weight of row $j$ of $B$ is defined as the total number of nonzero elements in row $j$. We then partition the sorted matrix rows into at most four consecutive parts. Each part uses a different format. The different formats are optimized for the sparseness properties of each partition as shown in Figure \ref{fig:partitioning}. For the densest part, we use a dense format. When the matrix gets less dense, we switch to another format which we call \index{Compressed storage format!Sliced COO} Sliced COO (SCOO). SCOO has three variants, small, medium, and large. Our formats are now described in more detail. +As a preprocessing step, we reorder the rows of the matrix by their \emph{row weight}, in nonincreasing order. The row weight of row $j$ of $B$ is defined as the total number of nonzero elements in row $j$. We then partition the sorted matrix rows into at most four consecutive parts. Each part uses a different format. The different formats are optimized for the sparseness properties of each partition as shown in Figure \ref{fig:partitioning}. For the densest part, we use a dense format. When the matrix gets less dense, we switch to another format which we call \index{compressed storage format!sliced COO} Sliced COO (SCOO). SCOO has three variants, small, medium, and large. Our formats are now described in more detail. \begin{figure}[t] \centering