BookGPU/Chapters/chapter19/ch19.tex

   1 \chapterauthor{Bertil Schmidt and Hoang-Vu Dang}{Johannes Gutenberg University of Mainz, Germany}
   2 %\chapterauthor{Hoang-Vu Dang}{Johannes Gutenberg University of Mainz}
   3
   4 \chapter{Solving large sparse linear systems for integer factorization on GPUs}
   5 \label{chapter19}
   6
   7 \section{Introduction}
   8 \label{ch19:intro}
   9
  10 The Number Field Sieve (NFS)\index{iterative method!number field sieve} is the current state-of-the-art integer factorization method. It requires the solution of a large sparse linear system over Galois Field GF(2) (called the linear algebra step). The Block Wiedemann\index{number field sieve!block Wiedemann} (BW)\cite{ch19:bw} algorithm can be used to solve such a large sparse linear system efficiently using iterative sparse matrix vector multiplication (SpMV).
  11
  12 Recent integer factorization efforts have been using CPU clusters to solve the large sparse linear system \cite{ch19:kilobit,ch19:rsa768}. The RSA-768 factorization \cite{ch19:rsa768}, for example, reported a runtime of 3 months for the linear algebra step on a cluster with 48 AMD dual hex-core CPUs. Previous work on parallelizing the linear algebra step focused on using CPU clusters and grids \cite{ch19:aoki,ch19:hwang,ch19:grid,ch19:hetero768}. In this chapter, we present a CUDA approach that can be used to accelerate the costly iterative SpMV operation for matrices derived from NFS.
  13
  14 The memory access pattern in the SpMV operation generally consists of regular access patterns over the matrix and irregular access patterns over the vector. The irregular access pattern over the vector is a challenge that is more pronounced  on the GPU than on the CPU, because of the smaller cache and the restrictive memory access pattern requirement to achieve maximum performance. However, a high-end GPU has an order-of-magnitude higher bandwidth than a high-end CPU; e.g., a GeForce GTX 580 has 192.4 GB/s memory bandwidth, while an Intel Core-i7 has a maximum of 25.6 GB/s memory bandwidth.
  15
  16 SpMV on the GPU has been explored previously in several papers \cite{ch19:nvidia,ch19:exactspmv,ch19:bellpack,ch19:sle} for matrices derived from scientific computing applications. However, sparse matrices derived from NFS have generally different properties, i.e. they are larger, have a few dense rows, and have many extremely sparse rows. The large size of the matrix causes the BL and BW algorithms to require a large number of SpMV iterations. This means that the time spent for matrix preprocessing and matrix data transfer to the GPU memory is negligible compared to the total runtime. Thus, approaches to SpMV on GPUs for NFS matrices may be different from previously published GPU SpMV approaches. We further present an extension of our SpMV method for binary-valued NFS matrices to single-precision floating-point matrices.
  17
  18 \section{Block Wiedemann algorithm}
  19 \label{ch19:block-wiedemann}
  20
  21 The BW algorithm heuristically finds $n$ vectors in the kernel space \index{number field sieve!kernel space}of a $d \times d$ binary matrix $B$; $n$ is one of two parameters $m, n$, called blocking factors\index{number field sieve!blocking factors}. BW consists of the following steps:
  22
  23 \begin{itemize}
  24 \item \textbf{Step 1 (BW1):} Compute the matrix sequence
  25
  26 \begin{equation}
  27 A_i = x \cdot B^i \cdot y, \forall i=1,...,{\frac{d}{m}}+{ \frac{d}{n} }+O(1),
  28 \end{equation}
  29
  30  where $x,y$ are randomly chosen binary matrices of size $m \times d$ and $d \times n$, respectively.
  31 \item \textbf{Step 2 (BW2):} The Berlekamp-Massey\index{number field sieve!Berlekamp-Massey} algorithm \cite{ch19:Thome:subqad} is used to compute a generating polynomial of the matrix sequence $A$ from BW1 in the form
  32 \begin{equation}
  33 F(X)= \sum_{i=1}^{{ \frac{d}{n} }+O(1)} {C_i \cdot X^i},
  34 \end{equation}
  35 where $C_i$ is an $n \times n$ binary matrix
  36
  37 \item \textbf{Step 3 (BW3):} Compute the sequence of matrices $S$ of size $d \times n$ such that
  38 \begin{equation}
  39 S_i = B^i \cdot y \cdot C_i, \forall i=1,...,{\frac{d}{n}}+O(1).
  40 \end{equation}
  41
  42 With high probability, $B \cdot \sum{S_i} = 0$, for which $\sum{S_i}$ is output as a solution.
  43 \end{itemize}
  44
  45 We can treat $x$, $y$, and $S_i$ as vectors of block width $m$ or $n$. Assuming $B$ is a sparse matrix with $\gamma$ nonzeros per row on average and ignoring $m$ and $n$ as constant, the complexity of BW1 and BW3 is $O(\gamma d^2)$. Using the subquadratic algorithm by Thomé \cite{ch19:Thome:subqad}, BW2 has a complexity of $O(d \log^2{d})$. Thus our approach to accelerate BW is based on accelerating BW1 and BW3 on a GPU while BW2 is still done on a CPU.
  46
  47 \section{SpMV over GF(2) for NFS matrices using existing formats on GPUs}
  48 \label{ch19:existingspmv}
  49         In this section, we review a few relevant previously published sparse matrix formats on GPUs and study their performance when applied to sparse matrices over GF(2) derived from integer factorization with NFS.
  50
  51         In Algorithm~\ref{ch19:algo1}, we consider the sparse binary matrix $B$ of size $d \times d$ and a dense vector $X$ of size $d \times n$ bit, where $n$ is called the \emph{blocking factor}. Modern processors generally support 64-bit operations. Thus, typical blocking factors are of the form of $64 \cdot k, \; k \in \mathbb{N}$. Note that doubling the blocking factor roughly halves the number of SpMV iterations required but doubles the input vector size.
  52
  53         For all $0 \leq i \leq d-1$ let $c\_index[i]$ column index array of $B[i]$ which contains the indices of the nonzero entries of row $i$. Then, the following pseudocode shows a single SpMV iteration of $B$ with input vector $X$ and result vector $Y$.
  54
  55 %% \begin{algorithm}
  56 %% \caption{SpMV-COLUMN-INDEX}
  57 %%      \begin{algorithmic}
  58 %%      \REQUIRE $c\_index[i]$: column array of rows $i$ of $B$;\\
  59 %%                       $X$: input vector
  60 %%      \ENSURE  $Y=B \cdot X$
  61 %%      \FOR{$i=0$ to $d-1$}
  62 %%      \STATE $Y[i]=0$
  63 %%              \FORALL{$ind$ $\in$ $c\_index[i]$}
  64 %%                      \STATE $Y[i] = Y[i] \oplus X[ind]$
  65 %%                      \COMMENT{$\oplus:$ bitwise XOR operation}
  66 %%              \ENDFOR
  67 %%      \ENDFOR
  68 %%      \end{algorithmic}
  69 %% \end{algorithm}
  70
  71
  72 \begin{algorithm}
  73 \caption{SpMV-column-index}
  74 \label{ch19:algo1}
  75 %       \begin{algorithmic}
  76         \KwIn{$c\_index[i]$: column array of rows $i$ of $B$}
  77         \KwIn{ $X$: input vector}
  78         \KwOut{  $Y=B \cdot X$}
  79         \For{$i=0$ to $d-1$}{
  80         $Y[i]=0$\;
  81    \ForAll {$ind$ $\in$ $c\_index[i]$} {
  82                 $Y[i] = Y[i] \oplus X[ind]$\;
  83                 \CommentSty{$\oplus:$ bitwise xor operation}
  84 }
  85 }
  86 %       \end{algorithmic}
  87 \end{algorithm}
  88
  89
  90         The costly operations in the SpMV pseudocode are the memory accesses for loading $ind$, $X[ind]$, $Y[i]$ and storing $Y[i]$. To speed up those operations on any architecture a common approach is to design a cache-friendly order of accessing the memory. The order is especially important for the vectors $X$ and $Y$, since their memory locations might be accessed multiple times. Accesses to $Y$ can be minimized by storing the intermediate results in fast memory and only write the accumulated result to $Y$.
  91
  92         CUDA implementations of SpMV generally store both matrix and vectors in \emph{global memory}. Memory accesses to $B$ and $Y$ can usually be coalesced. Memory accesses to $X$ are random and noncoalesced, thus having higher latency, but \emph{texture memory} can be used to take advantage of the \emph{texture cache}. Intermediate results could utilize \emph{shared memory}, which has low latency but is very limited in size and only shared between threads of the same block. Shared memory is further divided into \emph{banks}. When threads in a warp access the shared memory, \emph{bank conflicts} should be avoided, otherwise the access will be serialized.
  93
  94         We now briefly review the CUDA implementation of a number of SpMV matrix formats published in previous papers \cite{ch19:nvidia,ch19:sle}. We also describe the data structure in which the matrix in Figure \ref{fig:ex_matrix} is stored using different formats as examples. Please note that when storing a binary matrix, the $value$ array can be removed.
  95
  96 \begin{figure}[t]
  97         \centering
  98         \includegraphics[width=4cm]{Chapters/chapter19/fig/ex_matrix.pdf}
  99         \caption{An example square matrix of size $6 \times 6$ (zero values are not shown).}
 100         \label{fig:ex_matrix}
 101 \end{figure}
 102
 103         \subsubsection*{Coordinate list (COO)\index{compressed storage format!COO}}
 104 For each nonzero, both its column and row indices are explicitly stored. The Cusp implementation \cite{ch19:cusp} stores elements in sorted order of row indices ensuring that entries with the same row index are stored contiguously.
 105
 106         \begin{lstlisting}[caption={}]
 107         coo.row_index = {0, 1, 1, 2, 2, 3, 3,  3, 4, 5,  5}
 108         coo.col_index = {2, 1, 4, 1, 3, 0, 2,  4, 2, 0,  5}
 109         coo.value     = {3, 1, 5, 2, 4, 6, 8, 10, 9, 7, 11}
 110         \end{lstlisting}
 111
 112         \subsubsection*{Compressed sparse row (CSR)\index{compressed storage format!CSR}} Nonzeros are sorted by the row index, and only their column indices are explicitly stored in a column array. Additionally, the vector $row\_start$ stores indices of the first nonzero element of each row in the column array.
 113
 114         \begin{lstlisting}[caption={}]
 115         csr.row_start = {0, 1, 3, 5, 8, 9, 12}
 116         csr.col_index = {2, 1, 4, 1, 3, 0, 2,  4, 2, 0,  5}
 117         csr.value     = {3, 1, 5, 2, 4, 6, 8, 10, 9, 7, 11}
 118         \end{lstlisting}
 119
 120         \subsubsection*{Ellpack (ELL)\index{compressed storage format!ELL}} Let $K$ be the maximum number of nonzero elements in any row of the matrix. Then, for each row, ELL stores exactly $K$ elements (extra padding is required for rows that contain fewer than $K$ nonzero elements). Only column indices are required to store in an array, the row index can be implied since exactly $K$ elements are stored per row. The Cusp implementation stores the column indices in a transposed manner so that consecutive threads can access consecutive memory addresses.
 121
 122         \begin{lstlisting}[caption={}]
 123         ell.col_index = {
 124                                         2, 1, 1, 0, 2, 0,
 125                                         *, 4, 3, 2, *, 5,
 126                                         *, *, *, 4, *, *}
 127         ell.value     = {
 128                                         3, 1, 2,  6, 9,  7,
 129                                         *, 5, 4,  8, *, 11,
 130                                         *, *, *, 10, *,  *}
 131         \end{lstlisting}
 132
 133         \subsubsection*{Hybrid (HYB)\index{compressed storage format!HYB}} The HYB format heuristically computes a value $K$ and stores $K$ nonzeros per rows in the ELL format. When a row has more than $K$ non-zeros, the trailing nonzeros are stored in COO. This design decreases the storage overhead due to ELL padding elements and thus improves the overall performance.
 134         \begin{lstlisting}[caption={}]
 135         hyb.nnz_per_row   = 2
 136         hyb.ell.col_index = {2, 1, 1, 0, 2, 0, *, 4, 3, 2, *,  5}
 137         hyb.ell.value     = {3, 1, 2, 6, 9, 7, *, 5, 4, 8, *, 11}
 138         hyb.coo.row_index = {3}
 139         hyb.coo.col_index = {4}
 140         hyb.coo.value     = {10}
 141         \end{lstlisting}
 142
 143         \subsubsection*{Sliced Ellpack (SLE)\index{compressed storage format!SLE}} This format partitions the matrix into horizontal slices of $S$ adjacent rows \cite{ch19:sle}. Each slice is stored in ELLPACK format. The maximum number of nonzeros may be different for each slice. An additional array $slice\_start$ is used to index the first element in each slice. The matrix rows are usually sorted by the number of nonzeros per row in order to move rows with similar number of nonzeros together.
 144         \begin{lstlisting}[caption={}]
 145         sle.slice_size  = 2
 146         sle.col_index   = {
 147                                         2, 1, *, 4,
 148                                         1, 0, 3, 2, *, 4,
 149                                         2, 0, *, 5}
 150         sle.value       = {
 151                                         3, 1, *, 5,
 152                                         2, 6, 4, 8, *, 10
 153                                         9, 7, *, 11}
 154         sle.slice_start = {0, 4, 10, 14}
 155         \end{lstlisting}
 156
 157         \begin{table}[htbp]
 158         \centering
 159         \begin{tabular}
 160         {| l || c | c | c | c | c | c |}
 161         \hline
 162         &   RSA-170 & RSA-190 & KILOBIT & RSA-768 \\
 163         \hline
 164           Max dimension & 10.4M & 26.1M & 66.7M & 192M \\
 165           Nonzeros & 994.7M & 2,451M & 9,538M & 27,797M\\
 166           Max row weight & 5.5M & 14M & 28.2M & 82.6M\\
 167           Min row weight & 3 & 3 & 2 & 2\\
 168           Avg row weight & 95.08 & 93.6 & 143 & 144\\
 169         \hline
 170         \end{tabular}
 171         \caption{Properties of some NFS matrices}
 172         \label{table:rsa_matrix}
 173         \end{table}
 174
 175 The existing formats do not achieve good performance due to the special structure of NFS matrices. Row weights of an NFS matrix have a very wide range (see Table \ref{table:rsa_matrix}). Hence using one warp or one thread per row (as in CSR or ELLPACK) results in an unbalanced workload. Moreover, the NFS matrices are highly unstructured, which does not facilitate cache-friendly vector access patterns using existing formats. These reasons have motivated our design of a new format.
 176
 177 \section{A hybrid format for SpMV on GPUs}
 178 \label{Implementation}
 179
 180 As a preprocessing step, we reorder the rows of the matrix by their \emph{row weight}, in nonincreasing order. The row weight of row $j$ of $B$ is defined as the total number of nonzero elements in row $j$. We then partition the sorted matrix rows into at most four consecutive parts. Each part uses a different format. The different formats are optimized for the sparseness properties of each partition as shown in Figure \ref{fig:partitioning}. For the densest part, we use a dense format. When the matrix gets less dense, we switch to another format which we call \index{compressed storage format!sliced COO} Sliced COO (SCOO). SCOO has three variants, small, medium, and large. Our formats are now described in more detail.
 181
 182 \begin{figure}[t]
 183         \centering
 184         \includegraphics[scale=1.3]{Chapters/chapter19/fig/partitioning.pdf}
 185         \caption{Partitioning of a row-sorted NFS matrix into four formats.}
 186         \label{fig:partitioning}
 187 \end{figure}
 188
 189     \subsection{Dense format}
 190     The dense format is used for the dense part of the matrix. This format uses 1 bit per matrix entry. Within a column, 32 matrix entries are stored as a 32-bit integer. Thus, 32 rows are stored as $N$ consecutive integers.
 191
 192     Each CUDA thread works on a column. Each thread fetches one element from the input vector in coalesced fashion. Then, each thread checks the 32 matrix entries one by one. When the matrix entry is a nonzero, the thread performs an xor operation between the element from the input vector and the partial result for the row. This means each thread accesses the input vector only once to do work on up to 32 nonzeros. The partial result from each thread needs to be stored and combined to get the final result for the 32 rows. These operations are performed in CUDA shared memory.
 193
 194     The 32 threads in a warp share 32 shared memory entries to store the partial results from the 32 rows. Since all threads in a warp execute a common instruction at the same time, access to these 32 entries can be made exclusive. The result from each warp in a thread block is combined using p-reduction on shared memory. The result from each thread block is combined using an atomic xor operation on global memory.
 195
 196     When the blocking factor is larger than 64, access to the shared memory needs to be reorganized to avoid bank conflicts. Each thread can read/write up to 64-bit data at a time to the shared memory. If a thread is accessing 128-bit data for example, two read/write operations need to be performed. Thus, there will be bank conflicts if we store 128-bit data on contiguous addresses. We can avoid bank conflicts by having the threads in a warp first access consecutive 64-bit elements representing the first halves of the 128-bit elements. Then, the threads again access consecutive 64-bit elements representing the second halves of the 128-bit elements. The same modification can be applied to other formats as well.
 197
 198         \subsection{Sliced COO}
 199         The SCOO format is adapted from the CADO-NFS software for CPUs \cite{ch19:cadonfs}. The aim is to reduce irregular accesses to the input vector and increase the texture cache hit rate. SCOO stores the column index and the row index of each nonzero. A number of consecutive rows form a slice. Nonzeros within a slice are sorted by their column index. We give an example of SCOO in Figure \ref{fig:scoo-1}, threads working on a slice can access contiguous elements from the input vector. In the figure, S1, S2 are two slices. Entries in the input vector denote the corresponding nonzero elements with which the input vector element is multiplied.
 200
 201         For each nonzero, two bytes are used to store the column index. However, two bytes is not enough for large RSA matrices. Thus, we further divide a slice into groups. Group $i$ contains nonzeros with column index between $i \times 2^{16}$ and $(i+1) \times 2^{16} - 1$. An additional array stores the starting position of each group in the slice.
 202
 203         \begin{figure}[t]
 204         \centering
 205         \includegraphics[height=5cm]{Chapters/chapter19/fig/scoo.pdf}
 206         \caption{Example of the memory access pattern for a $6 \times 6$ matrix stored in Sliced COO format (Slice Size = 3 rows).}
 207         \label{fig:scoo-1}
 208         \end{figure}
 209
 210
 211 \lstinputlisting[label=ch19:lst:spmv,caption=SpMV with SCOO format]{Chapters/chapter19/code.cu}
 212
 213         One thread block works on a slice, one group at a time. Neighboring threads work on neighboring nonzeros in the group in parallel. Each thread works on more than one row of the matrix. Thus, each thread needs some storage to store the partial results and combine them with the results from the other threads to generate the final output. Each entry costs equally as the element of the vector, i.e., the blocking factor in bytes. Shared memory is used as intermediate storage as it has low latency however it is limited to 48 KB per SM in Fermi. The global memory is accessed only once at the end to write the final output. The CUDA kernel for SpMV with Sliced COO format is shown in Listing \ref{ch19:lst:spmv}.
 214
 215         Since neighboring nonzeros may or may not come from the same row (recall that we sorted them by column index), many threads may access the same entry in the shared memory if it is used to store result of the same rows. Thus, shared memory entries are either assigned exclusively to a single thread or shared by using atomic xor operations.     Based on the way we allocate the shared memory, we further divide the Sliced COO format into three different subformats: small, medium, and large.
 216
 217         \begin{table}
 218         \centering
 219         \begin{tabular}
 220         { | c || c | c | c |}
 221         \hline
 222         Sliced COO subformats & Small & Medium & Large \\
 223         \hline
 224           Memory sharing & No sharing & Among warp & Among block\\
 225           Access method & Direct & Atomic xor & Atomic xor\\
 226           Bank conflict & No & No & Yes\\
 227           \# Rows per Slice & 12 & 192 & 6144 \\
 228         \hline
 229         \end{tabular}
 230         \caption{Sliced COO subformat comparison (\# rows per slices is based on $n=64$).}
 231         \label{table:scoo}
 232         \end{table}
 233
 234         \subsubsection*{Small sliced (SS) COO}
 235         In this subformat, each thread has one exclusive entry in shared memory to store the partial result for each row. The assignment of the shared memory is organized such that each thread in a warp accesses only one bank and there is no bank accessed by more than one thread. Thus, there is no bank conflict. A p-reduction operation on shared memory is required to combine partial results from each thread.
 236
 237         The maximum number of rows per slice is calculated as \emph{size of shared memory per SM in bits} / \emph{(number of threads per block * blocking factor)}. We use 512 threads per block for 64-bit blocking factor which gives 12 rows, and 256 threads per block for 128 and 256-bit blocking factor, which gives 12 and 6 rows, respectively. Hence, one byte per row index is sufficient for this subformat.
 238
 239         \subsubsection*{Medium sliced (MS) COO}
 240         In this subformat, each thread in a warp gets an entry in the shared memory to store the partial result for each row. However, this entry is shared with the threads in other warps. Access to the shared memory uses an atomic xor operation. Each thread in a warp accesses only one bank, avoiding bank conflicts. A p-reduction operation on shared memory is required to combine the 32 partial results.
 241
 242         The maximum number of rows per slice is calculated as \emph{size of shared memory per SM in bits} / \emph{(32 * blocking factor)} where 32 is the number of threads in a warp. This translates to 192, 96, and 48 rows per slice for blocking factors of 64, 128, and 256-bits, respectively. Hence, one byte per row index is sufficient for this format.
 243
 244         \subsubsection*{Large sliced (LS) COO}
 245         In this subformat, the result for each row gets one entry in shared memory, which is shared among all threads in the thread block. Access to shared memory uses an atomic xor operation. Thus, there will be bank conflicts. However, this drawback can be compensated for by a higher texture cache hit rate. There is no p-reduction on shared memory required.
 246
 247         The maximum number of rows per slice is calculated as \emph{size of shared memory per SM in bits} / \emph{blocking factor}. This translates to 6144, 3072, and 1536 rows per slice for blocking factors of 64, 128, and 256-bits, respectively. We need two bytes for the row index.
 248
 249         \noindent Table \ref{table:scoo} summarizes the differences between the three subformats.
 250         \begin{itemize}
 251         \item[a)] Each thread in a block is allocated one memory entry for the result of a row and works on consecutive 12 rows. A p-reduction to Thread\#1 is needed to get the final result of one row.
 252         \item[b)] Each warp in a block is allocated one memory entry per row and works on 192 consecutive rows, threads in each warp use atomic xor to update the value in a given entry. A p-reduction for each row to memory in Warp\#1 is needed to obtain the final result of that row.
 253         \item[c)] A thread block works on 6144 consecutive rows and shares the memory entry of each row. Any update to the memory has to use atomic xor.
 254         \end{itemize}
 255
 256         \subsection{Determining the cut-off point of each format}
 257         \label{ch19:mat-partition-1}
 258         To determine which format to use, we compare the performance of two consecutive formats in terms of giga nonzeros (\emph{gnnz}) per second for a given matrix, starting with the dense format and the SS-COO format. The two formats start from the same row (the first row) and work on the minimum number of rows possible. For the dense format, the minimum number of rows is 32. For the SS-COO format (and its variants), the minimum number of rows is \emph{the number of rows in a slice} times \emph{the number of multiprocessors in the GPU}, since one thread block works on one slice and one thread block is assigned to one multiprocessor.
 259
 260         The next comparison depends on the result of the current comparison. If the dense format performs better, we decide to use it for rows 1 to 32, and we continue comparing the dense format and the SS-COO format starting from row 33. However, if the SS-COO format performs better, we compare its performance with the next format, MS-COO, starting from the same row, and so on. The idea is to stop considering the denser format once the sparser format outperforms it. Once we get to the comparison between MS-COO and LS-COO, and LS-COO performs better, we don't need to do any further comparisons. LS-COO should be used for the rest of the matrix.
 261
 262         For Sliced COO format, it is essential to note that when one slice is assigned to each multiprocessor, the load for one multiprocessor may be much higher than that of the other multiprocessors. This is because the matrix rows have been reordered by their weight in a nonincreasing order, so the first slice contains more nonzero entries than the rest. Thus, we need to further reorder the rows so that each multiprocessor gets the same level of load.
 263
 264         The cut-off point determination is performed once per matrix and as a preprocessing step. Its runtime is equivalent to less than five iterations of SpMV.
 265
 266
 267 \section{SCOO for single-precision floating-point matrices}
 268 The design of SCOO depends on performing atomic operations to combine the partial results. Atomic operations for double-precision are not available on current CUDA-enabled devices. Thus, we present the SCOO extension for only single-precision floating-point matrices.
 269
 270 To extend SCOO for general matrices in real numbers, we first adapt to the variety of sparsity in real number matrices. Unlike NFS matrices, the sparsity is not predictable. Hence, instead of using three subformats (SS-COO, MS-COO, LS-COO) with a fixed slice size (12, 192, 6144), we make the slice size a variable. Let $h$ denote the slice size, $S$ denote the size in bytes of shared memory per thread block ($S=49152$ for Fermi), and $b$ denote the size of each matrix value in bytes ($b=4$ for single-precision floating point). $h$ must be a positive integer value of the form of $h = \frac{S}{2^{i} b}$, $\forall{0 \leq i \leq T}$ where $T=\log_2{\textit{(maximum number of thread per block)}}$.
 271
 272 For example in Fermi GPUs, single-precision floating point, $h \in \{12288, 6144, 3072, 1536, 768, 384, 192, 96, 48, 24, 12\}$ for $S=49152, b=4, T=10$. The algorithm described in Section \ref{ch19:mat-partition-1} can be used to determine the cut-off point for each subformat. Figure \ref{fig:final-scoo} illustrates the final matrix in SCOO format assuming the number of multiprocessors in the GPU is $5$ and the row index  ranges from $0..d-1$. $H_i$ and $r_i$ are the slice size and starting row index of the sub-matrix $i$.
 273
 274 \begin{figure}[t]
 275         \centering
 276         \includegraphics[width=8cm]{Chapters/chapter19/fig/final-scoo-book.pdf}
 277         \caption{Partitioning of a row-sorted floating-point matrix into SCOO format.}
 278         \label{fig:final-scoo}
 279 \end{figure}
 280
 281 More details of the SCOO format for floating-point matrices can be found in \cite{ch19:spmv-iccs}.
 282
 283 \section{Performance evaluation}
 284 \label{ch19:result}
 285
 286 \subsection{Experimental setup}
 287 \begin{table}
 288 \begin{center}
 289 \begin{tabular}{|l|l|l|l|}
 290 \hline
 291 Hardware & C2075 & GTX-580 & Core-i7 2700K\\
 292 \hline
 293 \# Cores &448& 512 & 4\\
 294 \hline
 295 Clock speed (Ghz)&1.15& 1.57 &3.5\\
 296 \hline
 297 Memory type & GDDR5 & GDDR5 & DDR3-1600 \\
 298 \hline
 299 Memory size (GB)&6 & 3 &16\\
 300 \hline
 301 Max Memory bandwidth (GB/s)& 144 & 192 & 21 \\
 302 \hline
 303 \end{tabular}
 304 \end{center}
 305 \caption{Overview of hardware used in the experiments}
 306 \label{table:hardware}
 307 \end{table}
 308
 309 Table \ref{table:hardware} specifies the GPUs and the CPU workstation used for performance evaluation. The performance is measured in terms of $gnnz$ and $Gflop/s$ for binary and floating-point matrices. Measured GPU performance includes neither PCIe data transfers nor matrix preprocessing. These are realistic assumptions since SpMV applications usually consist of a large number of iterations where the sparse matrix is iteratively multiplied by the input/output vectors.
 310
 311 Our source code is compiled and executed using CUDA toolkit and driver version $4.2$ under Linux Ubuntu 12.04.
 312
 313 \subsection{Experimental results of SpMV on NFS matrix}
 314 We have evaluated our implementation on an NVIDIA Tesla C2075 (ECC disabled) with 6 GB RAM. We have compared the GPU performance with the open-source CADO-NFS \cite{ch19:cadonfs} program running on Intel Core i7-920 CPU with 12 GB DDR3-1066 memory. The RSA-170 matrix (see Table \ref{table:rsa_matrix}) is used for performance evaluation.
 315
 316 The speedups compared to the multithreaded CADO-NFS bucket implementation on an Intel Core i7-920 are given in parentheses in Table~\ref{table:hardware}. The GPU memory required to store the sparse matrix and the corresponding bytes per nonzero (nnz) are also reported. CADO-NFS contains several CPU optimized SpMV implementations using multithreading and SSE instructions. In this experiment, we compare the performance of our implementation to the CPU cache-optimized bucket format of CADO-NFS using 8 threads.
 317
 318 \begin{table}[t]
 319         \centering
 320         \begin{tabular}
 321         { | c || c | c | c | c |}
 322         \hline
 323         Blocking & C2075     & 2 x C2075 & Core i7-920 & GPU memory \\
 324         factor   & (speedup) & (speedup) &  CADO-NFS  & (bytes/nnz) \\
 325         \hline
 326           64 & 4.28 (4.9) & 8.38 (9.5) & 0.88 & 2748 MB (2.90)\\
 327           128 & 2.78 (5.6) & 5.37 (10.7) & 0.5 & 2967 MB (3.13)\\
 328           256 & 1.88 (7.0) & 3.68 (13.6) & 0.27 & 3000 MB (3.16)\\
 329
 330         \hline
 331         \end{tabular}
 332         \caption{Performance of SpMV on RSA-170 matrix}
 333         \label{table:rsa170}
 334 \end{table}
 335
 336         Table \ref{table:rsa170} shows the result. The dual-GPU implementation achieves speedups between 1.93 and 1.96 compared to the single-GPU performance.
 337
 338         Table \ref{table:individual} shows the individual performance of each subformat (in \emph{gnnz/s}), the percentage of nonzeros are included in parentheses. The results show that the performance decreases when the matrix gets sparser. The MS-COO and LS-COO performance degrade when the blocking factor is increased from 64 to 128 and 256-bit. This is caused by the increased number of bank conflicts and serialization of atomic xor operations on larger blocking factors. Thus, the SS-COO format gets a higher percentage of nonzeros with 128 and 256-bit blocking factor.
 339
 340         More detail results of our full block Wiedemann CUDA implementation as well as a multi-GPU implementation can be found in \cite{ch19:spmv-ccpe}
 341
 342 \begin{table}[h]
 343         \centering
 344         \begin{tabular}
 345         { | c || c | c | c | c | c |}
 346         \hline
 347          Blocking factor & Dense & SS-COO & MS-COO & LS-COO \\
 348         \hline
 349           64     & 13.66 (24\%) & 9.66 (11\%) & 8.13 (13\%) & 2.77 (52\%)  \\
 350           128    & 9.66 (15\%) & 7.53 (24\%) & 3.23 (6\%) & 1.86 (55\%) \\
 351           256    & 7.00 (15\%) & 4.21 (21\%) & 2.34 (6\%) & 1.33 (58\%) \\
 352         \hline
 353         \end{tabular}
 354         \caption{Performance for each of the four subformat partitions of the RSA-170 matrix on a C2075.}
 355         \label{table:individual}
 356 \end{table}
 357
 358 \subsection{Experimental results of SpMV on floating-point matrices}
 359 \label{ch19:result-2}
 360 \npthousandsep{,}
 361
 362
 363
 364
 365
 366 \subsubsection{Performance comparison to existing GPU formats}
 367 We compare the performance of our SCOO format to available SpMV implementations on both GPU and CPU. The set of selected test matrices are collected from the University of Florida Sparse Matrix Collection \cite{ch19:matrix-collection}. We have chosen the biggest matrices from different areas that with their corresponding input and output vectors can still fit into the 6GB global memory of a C2075 GPU. Table \ref{table:matrices} gives an overview of those matrices.
 368
 369
 370
 371 \begin{table}
 372 \begin{center}
 373 \begin{tabular}{|l|n{8}{0}|n{9}{0}|n{3}{2}|l|}
 374 \hline
 375 {Name} & row & column & nz/row & Description \\
 376 \hline
 377 GL7d19 &1911130&1955309&19.53& combinatorial problem \\
 378 \hline
 379 relat9 &12360060&549336&3.15& combinatorial problem \\
 380 \hline
 381 wikipedia-20070206 &3566907&3566907&12.62& directed graph \\
 382 \hline
 383 wb-edu &9845725&9845725&5.81& directed graph\\
 384 \hline
 385 road\_usa &23947347&23947347&2.41& undirected graph \\
 386 \hline
 387 hugebubbles-00010 &19458087&19458087&3.00& undirected graph \\
 388 \hline
 389 circuit5M &5558326&5558326&10.71&circuit simulation\\
 390 \hline
 391 nlpkkt120 &3542400&3542400&26.85&optimization problem\\
 392 \hline
 393 cage15 &5154859&5154859&19.24&directed weighted\\
 394 \hline
 395 kron\_g500-logn21 &2097152&2097152&86.82&undirected multigraph\\
 396 \hline
 397 indochina-2004&7414866&7414866&26.18& directed graph \\
 398 \hline
 399 nlpkkt160 &8345600&8345600&27.01&optimization problem\\
 400 \hline
 401 rgg\_n\_2\_24\_s0 &16777216&16777216&15.80&undirected random\\
 402 \hline
 403 uk-2002 &18520486&18520486&16.10&directed graph\\
 404 \hline
 405 \end{tabular}
 406 \end{center}
 407 \caption{Overview of sparse matrices used for performance evaluation.}
 408 \label{table:matrices}
 409 \end{table}
 410
 411 We compare the SCOO format to the CSR, COO, and HYB format of Cusp 0.3.0. Other Cusp formats are not able to run on the large tested matrices that we selected. The results are shown in Figure \ref{fig:scoo-vs-gpu}. The performances are in terms of Gflop/s which is based on the assumption of two flops per nonzero entry of the matrix \cite{ch19:nvidia-spmv, ch19:bellpack}.
 412
 413
 414
 415 \begin{figure}[htbp]
 416 \centering
 417 \includegraphics[height=5cm]{Chapters/chapter19/fig/scoo-a.pdf}
 418 \caption{Performance comparison of SCOO and other GPU formats for each test matrix on a Fermi Tesla C2075 (ECC disabled).}
 419 \label{fig:scoo-vs-gpu}
 420 \end{figure}
 421
 422
 423
 424 The SCOO format achieves a stable performance for different matrices in single-precision mode. In most cases a performance of over 10 Gflop/s can be sustained. For some highly unstructured matrices such as \emph{GL7d19}, \emph{wikipedia-20070206}, \emph{rgg\_n\_2\_24\_s0}, and \emph{kron\_g500-logn21}, SCOO achieves high speedups ranging from 3 to 6 compared to the best performaning Cusp format.
 425
 426 For most matrices, HYB produces the best performance among the tested Cusp formats. HYB is able to outperform SCOO for only two matrices: \emph{nlpkkt120} and \emph{nlpkkt160}. Both matrices have a similar structure, i.e., they consist of consecutive rows that have a very similar number of nonzero coefficients which are suitable to be stored in the ELL section of the HYB format. Moreover the nonzeros are close to each other facilitating coaleasing and cache-friendly access patterns by nature. SCOO is able to outperform COO and CSR for all tested matrices.
 427
 428 In matrix $Relat9$ we observe some patterns but the matrix is still generally unstructured, thus SCOO is able to achieve about 2 times speed up compared to HYB which is the best among tested Cusp formats in this case. The average speedup of SCOO for the tested matrices is 3.0 compared to CSR, 5.02 compared to COO, 2.15 compared to HYB.
 429
 430 We show the visualization of sparse matrices \emph{nlpkkt120}, \emph{relat9}, \emph{GL7d19} in Figure \ref{fig:mat-str}, \ref{fig:mat-mid}, \ref{fig:mat-unstr} using MatView \cite{ch19:matview}. The white color represents zero entries, gray color represents nonzero entries.
 431
 432 \begin{figure}[htbp]
 433 \centering
 434         \subfigure[nlpkkt120] {
 435                 \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-str.pdf}
 436                 \label{fig:mat-str}
 437         }
 438         \subfigure[relat9--first 10000 rows] {
 439                 \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-mid.pdf}
 440                 \label{fig:mat-mid}
 441         }
 442         \subfigure[GL7d19--first 500 rows and columns] {
 443                 \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-uns.pdf}
 444                 \label{fig:mat-unstr}
 445         }
 446
 447   \caption{Visualization of \emph{nlpkkt120}, \emph{relat9}, and \emph{GL7d19} matrix.}
 448 %  \label{fig:mat-visual}
 449 \end{figure}
 450
 451
 452 \subsubsection{Performance comparison to a CPU implementation}
 453 \begin{figure}
 454 \centering
 455 \includegraphics[width=340pt]{Chapters/chapter19/fig/gpu-cpu.pdf}
 456
 457 \caption{Performance of the SCOO on a GTX-580 and a CPU implementation using MKL performed on a Core-i7 2700K using 8 threads.}
 458 \label{fig:scoo-vs-cpu}
 459 \end{figure}
 460
 461 We used the Intel MKL library 10.3 in order to compare SCOO performance to an optimized CPU implementation. MKL SpMV receives the input matrices in CSR format. The results are shown in Figure \ref{fig:scoo-vs-cpu}. Using a GTX-580, we achieved speedups ranging between 5.5 and 18 over MKL on a 4-core CPU with hyper-threading using 8 threads. Also note that the SCOO performance on a GTX-580 is around 1.5 times faster than on the C2075 due to the increased memory bandwidth and clock speed. The storage requirement for the \emph{rgg\_n\_2\_24\_s0} and \emph{uk-2002} matrices and associated input/output vectors slightly exceeds the 3 GB global memory of the GTX-580 and thus are not included.
 462
 463 \section{Conclusion}
 464 \label{ch19:conclusion}
 465 In this chapter, we have presented our implementation of iterative SpMV for NFS matrices on GPUs with the CUDA programming language. Our GPU implementation takes advantage of the variety of sparseness properties in NFS matrices to produce suitable formats for different parts. The GPU implementation shows promising improvement over an optimized CPU implementation. As the size of integers in factorization projects is expected to increase further, the linear algebrea step of NFS will become an even bigger bottleneck. The size and sparseness of matrices generated by the NFS sieving step are growing significantly with the size of the integer to be factored. Thus, a large GPU cluster is required to accelerate the linear algebra step. However, in order to achieve scalability for larger problem sizes, the amounts of GPU RAM and data transfer bandwidth need to be increased in addition to the number of GPUs.
 466
 467 We further adapted the proposed Sliced COO format to single-precision floating-point numbers and evaluated it with large and sparse matrices derived from other computational science applications. We have published our code at https://github.com/danghvu/cudaSpmv
 468
 469 \putbib[Chapters/chapter19/biblio]
 470