new

[book_gpu.git] / BookGPU / Chapters / chapter19 / ch19.tex
diff --git a/BookGPU/Chapters/chapter19/ch19.tex b/BookGPU/Chapters/chapter19/ch19.tex

index 480e5287722dc2512f362a7e45e3dde375fc1f49..307b85158369587435bc683f010da49784ce500d 100755 (executable)
--- a/BookGPU/Chapters/chapter19/ch19.tex
+++ b/BookGPU/Chapters/chapter19/ch19.tex
@@ -7,7 +7,7 @@
  \section{Introduction}
  \label{ch19:intro}
  
  \section{Introduction}
  \label{ch19:intro}
  
-The Number Field Sieve (NFS)\index{iterative methods!Number Field Sieve} is the current state-of-the-art integer factorization method. It requires the solution of a large sparse linear system over Galois Field GF(2) (called the linear algebra step). The Block Wiedemann\index{Number Field Sieve!Block Wiedemann} (BW)\cite{ch19:bw} algorithm can be used to solve such a large sparse linear system efficiently using iterative sparse matrix vector multiplication (SpMV).
+The Number Field Sieve (NFS)\index{iterative method!number field sieve} is the current state-of-the-art integer factorization method. It requires the solution of a large sparse linear system over Galois Field GF(2) (called the linear algebra step). The Block Wiedemann\index{number field sieve!block Wiedemann} (BW)\cite{ch19:bw} algorithm can be used to solve such a large sparse linear system efficiently using iterative sparse matrix vector multiplication (SpMV).
  
  Recent integer factorization efforts have been using CPU clusters to solve the large sparse linear system \cite{ch19:kilobit,ch19:rsa768}. The RSA-768 factorization \cite{ch19:rsa768}, for example, reported a runtime of 3 months for the linear algebra step on a cluster with 48 AMD dual hex-core CPUs. Previous work on parallelizing the linear algebra step focused on using CPU clusters and grids \cite{ch19:aoki,ch19:hwang,ch19:grid,ch19:hetero768}. In this chapter, we present a CUDA approach that can be used to accelerate the costly iterative SpMV operation for matrices derived from NFS.
  
  
  Recent integer factorization efforts have been using CPU clusters to solve the large sparse linear system \cite{ch19:kilobit,ch19:rsa768}. The RSA-768 factorization \cite{ch19:rsa768}, for example, reported a runtime of 3 months for the linear algebra step on a cluster with 48 AMD dual hex-core CPUs. Previous work on parallelizing the linear algebra step focused on using CPU clusters and grids \cite{ch19:aoki,ch19:hwang,ch19:grid,ch19:hetero768}. In this chapter, we present a CUDA approach that can be used to accelerate the costly iterative SpMV operation for matrices derived from NFS.
  
@@ -18,7 +18,7 @@ SpMV on the GPU has been explored previously in several papers \cite{ch19:nvidia
  \section{Block Wiedemann algorithm}
  \label{ch19:block-wiedemann}
  
  \section{Block Wiedemann algorithm}
  \label{ch19:block-wiedemann}
  
-The BW algorithm heuristically finds $n$ vectors in the kernel space \index{Number Field Sieve!kernel space}of a $d \times d$ binary matrix $B$; $n$ is one of two parameters $m, n$, called blocking factors\index{Number Field Sieve!blocking factors}. BW consists of the following steps:
+The BW algorithm heuristically finds $n$ vectors in the kernel space \index{number field sieve!kernel space}of a $d \times d$ binary matrix $B$; $n$ is one of two parameters $m, n$, called blocking factors\index{number field sieve!blocking factors}. BW consists of the following steps:
  
  \begin{itemize}
  \item \textbf{Step 1 (BW1):} Compute the matrix sequence 
  
  \begin{itemize}
  \item \textbf{Step 1 (BW1):} Compute the matrix sequence 
@@ -28,7 +28,7 @@ A_i = x \cdot B^i \cdot y, \forall i=1,...,{\frac{d}{m}}+{ \frac{d}{n} }+O(1),
  \end{equation}
  
   where $x,y$ are randomly chosen binary matrices of size $m \times d$ and $d \times n$, respectively.
  \end{equation}
  
   where $x,y$ are randomly chosen binary matrices of size $m \times d$ and $d \times n$, respectively.
-\item \textbf{Step 2 (BW2):} The Berlekamp-Massey\index{Number Field Sieve!Berlekamp-Massey} algorithm \cite{ch19:Thome:subqad} is used to compute a generating polynomial of the matrix sequence $A$ from BW1 in the form
+\item \textbf{Step 2 (BW2):} The Berlekamp-Massey\index{number field sieve!Berlekamp-Massey} algorithm \cite{ch19:Thome:subqad} is used to compute a generating polynomial of the matrix sequence $A$ from BW1 in the form
  \begin{equation}
  F(X)= \sum_{i=1}^{{ \frac{d}{n} }+O(1)} {C_i \cdot X^i},
  \end{equation}
  \begin{equation}
  F(X)= \sum_{i=1}^{{ \frac{d}{n} }+O(1)} {C_i \cdot X^i},
  \end{equation}
@@ -80,7 +80,7 @@ We can treat $x$, $y$, and $S_i$ as vectors of block width $m$ or $n$. Assuming
         $Y[i]=0$\;   
     \ForAll {$ind$ $\in$ $c\_index[i]$} {
                 $Y[i] = Y[i] \oplus X[ind]$\;
         $Y[i]=0$\;   
     \ForAll {$ind$ $\in$ $c\_index[i]$} {
                 $Y[i] = Y[i] \oplus X[ind]$\;
-               \CommentSty{$\oplus:$ bitwise xor operation}
+               \CommentSty{$\oplus:$ bitwise XOR operation}
  }
  }
  %      \end{algorithmic}
  }
  }
  %      \end{algorithmic}
@@ -100,7 +100,7 @@ We can treat $x$, $y$, and $S_i$ as vectors of block width $m$ or $n$. Assuming
         \label{fig:ex_matrix}
  \end{figure}
  
         \label{fig:ex_matrix}
  \end{figure}
  
-       \subsubsection*{Coordinate list (COO)\index{Compressed storage format!COO}}
+       \subsubsection*{Coordinate list (COO)\index{compressed storage format!COO}}
  For each nonzero, both its column and row indices are explicitly stored. The Cusp implementation \cite{ch19:cusp} stores elements in sorted order of row indices ensuring that entries with the same row index are stored contiguously. 
  
         \begin{lstlisting}[caption={}]
  For each nonzero, both its column and row indices are explicitly stored. The Cusp implementation \cite{ch19:cusp} stores elements in sorted order of row indices ensuring that entries with the same row index are stored contiguously. 
  
         \begin{lstlisting}[caption={}]
@@ -109,7 +109,7 @@ For each nonzero, both its column and row indices are explicitly stored. The Cus
         coo.value     = {3, 1, 5, 2, 4, 6, 8, 10, 9, 7, 11}
         \end{lstlisting}
  
         coo.value     = {3, 1, 5, 2, 4, 6, 8, 10, 9, 7, 11}
         \end{lstlisting}
  
-       \subsubsection*{Compressed sparse row (CSR)\index{Compressed storage format!CSR}} Nonzeros are sorted by the row index, and only their column indices are explicitly stored in a column array. Additionally, the vector $row\_start$ stores indices of the first nonzero element of each row in the column array.
+       \subsubsection*{Compressed sparse row (CSR)\index{compressed storage format!CSR}} Nonzeros are sorted by the row index, and only their column indices are explicitly stored in a column array. Additionally, the vector $row\_start$ stores indices of the first nonzero element of each row in the column array.
         
         \begin{lstlisting}[caption={}]
         csr.row_start = {0, 1, 3, 5, 8, 9, 12}
         
         \begin{lstlisting}[caption={}]
         csr.row_start = {0, 1, 3, 5, 8, 9, 12}
@@ -117,7 +117,7 @@ For each nonzero, both its column and row indices are explicitly stored. The Cus
         csr.value     = {3, 1, 5, 2, 4, 6, 8, 10, 9, 7, 11}
         \end{lstlisting}
  
         csr.value     = {3, 1, 5, 2, 4, 6, 8, 10, 9, 7, 11}
         \end{lstlisting}
  
-       \subsubsection*{Ellpack (ELL)\index{Compressed storage format!ELL}} Let $K$ be the maximum number of nonzero elements in any row of the matrix. Then, for each row, ELL stores exactly $K$ elements (extra padding is required for rows that contain fewer than $K$ nonzero elements). Only column indices are required to store in an array, the row index can be implied since exactly $K$ elements are stored per row. The Cusp implementation stores the column indices in a transposed manner so that consecutive threads can access consecutive memory addresses.
+       \subsubsection*{Ellpack (ELL)\index{compressed storage format!ELL}} Let $K$ be the maximum number of nonzero elements in any row of the matrix. Then, for each row, ELL stores exactly $K$ elements (extra padding is required for rows that contain fewer than $K$ nonzero elements). Only column indices are required to store in an array, the row index can be implied since exactly $K$ elements are stored per row. The Cusp implementation stores the column indices in a transposed manner so that consecutive threads can access consecutive memory addresses.
         
         \begin{lstlisting}[caption={}]
         ell.col_index = {
         
         \begin{lstlisting}[caption={}]
         ell.col_index = {
@@ -130,7 +130,7 @@ For each nonzero, both its column and row indices are explicitly stored. The Cus
                                         *, *, *, 10, *,  *}
         \end{lstlisting}
  
                                         *, *, *, 10, *,  *}
         \end{lstlisting}
  
-       \subsubsection*{Hybrid (HYB)\index{Compressed storage format!HYB}} The HYB format heuristically computes a value $K$ and stores $K$ nonzeros per rows in the ELL format. When a row has more than $K$ non-zeros, the trailing nonzeros are stored in COO. This design decreases the storage overhead due to ELL padding elements and thus improves the overall performance.
+       \subsubsection*{Hybrid (HYB)\index{compressed storage format!HYB}} The HYB format heuristically computes a value $K$ and stores $K$ nonzeros per rows in the ELL format. When a row has more than $K$ non-zeros, the trailing nonzeros are stored in COO. This design decreases the storage overhead due to ELL padding elements and thus improves the overall performance.
         \begin{lstlisting}[caption={}]
         hyb.nnz_per_row   = 2
         hyb.ell.col_index = {2, 1, 1, 0, 2, 0, *, 4, 3, 2, *,  5}
         \begin{lstlisting}[caption={}]
         hyb.nnz_per_row   = 2
         hyb.ell.col_index = {2, 1, 1, 0, 2, 0, *, 4, 3, 2, *,  5}
@@ -140,7 +140,7 @@ For each nonzero, both its column and row indices are explicitly stored. The Cus
         hyb.coo.value     = {10}
         \end{lstlisting}
  
         hyb.coo.value     = {10}
         \end{lstlisting}
  
-       \subsubsection*{Sliced Ellpack (SLE)\index{Compressed storage format!SLE}} This format partitions the matrix into horizontal slices of $S$ adjacent rows \cite{ch19:sle}. Each slice is stored in ELLPACK format. The maximum number of nonzeros may be different for each slice. An additional array $slice\_start$ is used to index the first element in each slice. The matrix rows are usually sorted by the number of nonzeros per row in order to move rows with similar number of nonzeros together. 
+       \subsubsection*{Sliced Ellpack (SLE)\index{compressed storage format!SLE}} This format partitions the matrix into horizontal slices of $S$ adjacent rows \cite{ch19:sle}. Each slice is stored in ELLPACK format. The maximum number of nonzeros may be different for each slice. An additional array $slice\_start$ is used to index the first element in each slice. The matrix rows are usually sorted by the number of nonzeros per row in order to move rows with similar number of nonzeros together. 
         \begin{lstlisting}[caption={}]
         sle.slice_size  = 2
         sle.col_index   = {
         \begin{lstlisting}[caption={}]
         sle.slice_size  = 2
         sle.col_index   = {
@@ -168,7 +168,7 @@ For each nonzero, both its column and row indices are explicitly stored. The Cus
           Avg row weight & 95.08 & 93.6 & 143 & 144\\
         \hline
         \end{tabular}
           Avg row weight & 95.08 & 93.6 & 143 & 144\\
         \hline
         \end{tabular}
-       \caption{Properties of some NFS matrices}
+       \caption{Properties of some NFS matrices.}
         \label{table:rsa_matrix}        
         \end{table}
         
         \label{table:rsa_matrix}        
         \end{table}
         
@@ -177,7 +177,7 @@ The existing formats do not achieve good performance due to the special structur
  \section{A hybrid format for SpMV on GPUs}
  \label{Implementation}
  
  \section{A hybrid format for SpMV on GPUs}
  \label{Implementation}
  
-As a preprocessing step, we reorder the rows of the matrix by their \emph{row weight}, in nonincreasing order. The row weight of row $j$ of $B$ is defined as the total number of nonzero elements in row $j$. We then partition the sorted matrix rows into at most four consecutive parts. Each part uses a different format. The different formats are optimized for the sparseness properties of each partition as shown in Figure \ref{fig:partitioning}. For the densest part, we use a dense format. When the matrix gets less dense, we switch to another format which we call \index{Compressed storage format!Sliced COO} Sliced COO (SCOO). SCOO has three variants, small, medium, and large. Our formats are now described in more detail.
+As a preprocessing step, we reorder the rows of the matrix by their \emph{row weight}, in nonincreasing order. The row weight of row $j$ of $B$ is defined as the total number of nonzero elements in row $j$. We then partition the sorted matrix rows into at most four consecutive parts. Each part uses a different format. The different formats are optimized for the sparseness properties of each partition as shown in Figure \ref{fig:partitioning}. For the densest part, we use a dense format. When the matrix gets less dense, we switch to another format which we call \index{compressed storage format!sliced COO} Sliced COO (SCOO). SCOO has three variants, small, medium, and large. Our formats are now described in more detail.
  
  \begin{figure}[t]
         \centering
  
  \begin{figure}[t]
         \centering
@@ -189,9 +189,9 @@ As a preprocessing step, we reorder the rows of the matrix by their \emph{row we
      \subsection{Dense format}
      The dense format is used for the dense part of the matrix. This format uses 1 bit per matrix entry. Within a column, 32 matrix entries are stored as a 32-bit integer. Thus, 32 rows are stored as $N$ consecutive integers.
         
      \subsection{Dense format}
      The dense format is used for the dense part of the matrix. This format uses 1 bit per matrix entry. Within a column, 32 matrix entries are stored as a 32-bit integer. Thus, 32 rows are stored as $N$ consecutive integers.
         
-    Each CUDA thread works on a column. Each thread fetches one element from the input vector in coalesced fashion. Then, each thread checks the 32 matrix entries one by one. When the matrix entry is a nonzero, the thread performs an xor operation between the element from the input vector and the partial result for the row. This means each thread accesses the input vector only once to do work on up to 32 nonzeros. The partial result from each thread needs to be stored and combined to get the final result for the 32 rows. These operations are performed in CUDA shared memory.
+    Each CUDA thread works on a column. Each thread fetches one element from the input vector in coalesced fashion. Then, each thread checks the 32 matrix entries one by one. When the matrix entry is a nonzero, the thread performs an XOR operation between the element from the input vector and the partial result for the row. This means each thread accesses the input vector only once to do work on up to 32 nonzeros. The partial result from each thread needs to be stored and combined to get the final result for the 32 rows. These operations are performed in CUDA shared memory.
  
  
-    The 32 threads in a warp share 32 shared memory entries to store the partial results from the 32 rows. Since all threads in a warp execute a common instruction at the same time, access to these 32 entries can be made exclusive. The result from each warp in a thread block is combined using p-reduction on shared memory. The result from each thread block is combined using an atomic xor operation on global memory.
+    The 32 threads in a warp share 32 shared memory entries to store the partial results from the 32 rows. Since all threads in a warp execute a common instruction at the same time, access to these 32 entries can be made exclusive. The result from each warp in a thread block is combined using p-reduction on shared memory. The result from each thread block is combined using an atomic XOR operation on global memory.
  
      When the blocking factor is larger than 64, access to the shared memory needs to be reorganized to avoid bank conflicts. Each thread can read/write up to 64-bit data at a time to the shared memory. If a thread is accessing 128-bit data for example, two read/write operations need to be performed. Thus, there will be bank conflicts if we store 128-bit data on contiguous addresses. We can avoid bank conflicts by having the threads in a warp first access consecutive 64-bit elements representing the first halves of the 128-bit elements. Then, the threads again access consecutive 64-bit elements representing the second halves of the 128-bit elements. The same modification can be applied to other formats as well.
  
  
      When the blocking factor is larger than 64, access to the shared memory needs to be reorganized to avoid bank conflicts. Each thread can read/write up to 64-bit data at a time to the shared memory. If a thread is accessing 128-bit data for example, two read/write operations need to be performed. Thus, there will be bank conflicts if we store 128-bit data on contiguous addresses. We can avoid bank conflicts by having the threads in a warp first access consecutive 64-bit elements representing the first halves of the 128-bit elements. Then, the threads again access consecutive 64-bit elements representing the second halves of the 128-bit elements. The same modification can be applied to other formats as well.
  
@@ -212,7 +212,7 @@ As a preprocessing step, we reorder the rows of the matrix by their \emph{row we
  
         One thread block works on a slice, one group at a time. Neighboring threads work on neighboring nonzeros in the group in parallel. Each thread works on more than one row of the matrix. Thus, each thread needs some storage to store the partial results and combine them with the results from the other threads to generate the final output. Each entry costs equally as the element of the vector, i.e., the blocking factor in bytes. Shared memory is used as intermediate storage as it has low latency however it is limited to 48 KB per SM in Fermi. The global memory is accessed only once at the end to write the final output. The CUDA kernel for SpMV with Sliced COO format is shown in Listing \ref{ch19:lst:spmv}. 
         
  
         One thread block works on a slice, one group at a time. Neighboring threads work on neighboring nonzeros in the group in parallel. Each thread works on more than one row of the matrix. Thus, each thread needs some storage to store the partial results and combine them with the results from the other threads to generate the final output. Each entry costs equally as the element of the vector, i.e., the blocking factor in bytes. Shared memory is used as intermediate storage as it has low latency however it is limited to 48 KB per SM in Fermi. The global memory is accessed only once at the end to write the final output. The CUDA kernel for SpMV with Sliced COO format is shown in Listing \ref{ch19:lst:spmv}. 
         
-       Since neighboring nonzeros may or may not come from the same row (recall that we sorted them by column index), many threads may access the same entry in the shared memory if it is used to store result of the same rows. Thus, shared memory entries are either assigned exclusively to a single thread or shared by using atomic xor operations.     Based on the way we allocate the shared memory, we further divide the Sliced COO format into three different subformats: small, medium, and large. 
+       Since neighboring nonzeros may or may not come from the same row (recall that we sorted them by column index), many threads may access the same entry in the shared memory if it is used to store result of the same rows. Thus, shared memory entries are either assigned exclusively to a single thread or shared by using atomic XOR operations.     Based on the way we allocate the shared memory, we further divide the Sliced COO format into three different subformats: small, medium, and large. 
         
         \begin{table}
         \centering
         
         \begin{table}
         \centering
@@ -222,7 +222,7 @@ As a preprocessing step, we reorder the rows of the matrix by their \emph{row we
         Sliced COO subformats & Small & Medium & Large \\
         \hline            
           Memory sharing & No sharing & Among warp & Among block\\
         Sliced COO subformats & Small & Medium & Large \\
         \hline            
           Memory sharing & No sharing & Among warp & Among block\\
-         Access method & Direct & Atomic xor & Atomic xor\\
+         Access method & Direct & Atomic XOR & Atomic XOR\\
           Bank conflict & No & No & Yes\\
           \# Rows per Slice & 12 & 192 & 6144 \\
         \hline
           Bank conflict & No & No & Yes\\
           \# Rows per Slice & 12 & 192 & 6144 \\
         \hline
@@ -237,20 +237,20 @@ As a preprocessing step, we reorder the rows of the matrix by their \emph{row we
         The maximum number of rows per slice is calculated as \emph{size of shared memory per SM in bits} / \emph{(number of threads per block * blocking factor)}. We use 512 threads per block for 64-bit blocking factor which gives 12 rows, and 256 threads per block for 128 and 256-bit blocking factor, which gives 12 and 6 rows, respectively. Hence, one byte per row index is sufficient for this subformat.
         
         \subsubsection*{Medium sliced (MS) COO}
         The maximum number of rows per slice is calculated as \emph{size of shared memory per SM in bits} / \emph{(number of threads per block * blocking factor)}. We use 512 threads per block for 64-bit blocking factor which gives 12 rows, and 256 threads per block for 128 and 256-bit blocking factor, which gives 12 and 6 rows, respectively. Hence, one byte per row index is sufficient for this subformat.
         
         \subsubsection*{Medium sliced (MS) COO}
-       In this subformat, each thread in a warp gets an entry in the shared memory to store the partial result for each row. However, this entry is shared with the threads in other warps. Access to the shared memory uses an atomic xor operation. Each thread in a warp accesses only one bank, avoiding bank conflicts. A p-reduction operation on shared memory is required to combine the 32 partial results.
+       In this subformat, each thread in a warp gets an entry in the shared memory to store the partial result for each row. However, this entry is shared with the threads in other warps. Access to the shared memory uses an atomic XOR operation. Each thread in a warp accesses only one bank, avoiding bank conflicts. A p-reduction operation on shared memory is required to combine the 32 partial results.
         
         The maximum number of rows per slice is calculated as \emph{size of shared memory per SM in bits} / \emph{(32 * blocking factor)} where 32 is the number of threads in a warp. This translates to 192, 96, and 48 rows per slice for blocking factors of 64, 128, and 256-bits, respectively. Hence, one byte per row index is sufficient for this format.      
         
         \subsubsection*{Large sliced (LS) COO}
         
         The maximum number of rows per slice is calculated as \emph{size of shared memory per SM in bits} / \emph{(32 * blocking factor)} where 32 is the number of threads in a warp. This translates to 192, 96, and 48 rows per slice for blocking factors of 64, 128, and 256-bits, respectively. Hence, one byte per row index is sufficient for this format.      
         
         \subsubsection*{Large sliced (LS) COO}
-       In this subformat, the result for each row gets one entry in shared memory, which is shared among all threads in the thread block. Access to shared memory uses an atomic xor operation. Thus, there will be bank conflicts. However, this drawback can be compensated for by a higher texture cache hit rate. There is no p-reduction on shared memory required.
+       In this subformat, the result for each row gets one entry in shared memory, which is shared among all threads in the thread block. Access to shared memory uses an atomic XOR operation. Thus, there will be bank conflicts. However, this drawback can be compensated for by a higher texture cache hit rate. There is no p-reduction on shared memory required.
  
         The maximum number of rows per slice is calculated as \emph{size of shared memory per SM in bits} / \emph{blocking factor}. This translates to 6144, 3072, and 1536 rows per slice for blocking factors of 64, 128, and 256-bits, respectively. We need two bytes for the row index.
  
         \noindent Table \ref{table:scoo} summarizes the differences between the three subformats.
         \begin{itemize}
         \item[a)] Each thread in a block is allocated one memory entry for the result of a row and works on consecutive 12 rows. A p-reduction to Thread\#1 is needed to get the final result of one row.
  
         The maximum number of rows per slice is calculated as \emph{size of shared memory per SM in bits} / \emph{blocking factor}. This translates to 6144, 3072, and 1536 rows per slice for blocking factors of 64, 128, and 256-bits, respectively. We need two bytes for the row index.
  
         \noindent Table \ref{table:scoo} summarizes the differences between the three subformats.
         \begin{itemize}
         \item[a)] Each thread in a block is allocated one memory entry for the result of a row and works on consecutive 12 rows. A p-reduction to Thread\#1 is needed to get the final result of one row.
-       \item[b)] Each warp in a block is allocated one memory entry per row and works on 192 consecutive rows, threads in each warp use atomic xor to update the value in a given entry. A p-reduction for each row to memory in Warp\#1 is needed to obtain the final result of that row.
-       \item[c)] A thread block works on 6144 consecutive rows and shares the memory entry of each row. Any update to the memory has to use atomic xor.
+       \item[b)] Each warp in a block is allocated one memory entry per row and works on 192 consecutive rows, threads in each warp use atomic XOR to update the value in a given entry. A p-reduction for each row to memory in Warp\#1 is needed to obtain the final result of that row.
+       \item[c)] A thread block works on 6144 consecutive rows and shares the memory entry of each row. Any update to the memory has to use atomic XOR.
         \end{itemize}
  
         \subsection{Determining the cut-off point of each format}
         \end{itemize}
  
         \subsection{Determining the cut-off point of each format}
@@ -335,7 +335,7 @@ The speedups compared to the multithreaded CADO-NFS bucket implementation on an
                         
         Table \ref{table:rsa170} shows the result. The dual-GPU implementation achieves speedups between 1.93 and 1.96 compared to the single-GPU performance.
         
                         
         Table \ref{table:rsa170} shows the result. The dual-GPU implementation achieves speedups between 1.93 and 1.96 compared to the single-GPU performance.
         
-       Table \ref{table:individual} shows the individual performance of each subformat (in \emph{gnnz/s}), the percentage of nonzeros are included in parentheses. The results show that the performance decreases when the matrix gets sparser. The MS-COO and LS-COO performance degrade when the blocking factor is increased from 64 to 128 and 256-bit. This is caused by the increased number of bank conflicts and serialization of atomic xor operations on larger blocking factors. Thus, the SS-COO format gets a higher percentage of nonzeros with 128 and 256-bit blocking factor. 
+       Table \ref{table:individual} shows the individual performance of each subformat (in \emph{gnnz/s}), the percentage of nonzeros are included in parentheses. The results show that the performance decreases when the matrix gets sparser. The MS-COO and LS-COO performance degrade when the blocking factor is increased from 64 to 128 and 256-bit. This is caused by the increased number of bank conflicts and serialization of atomic XOR operations on larger blocking factors. Thus, the SS-COO format gets a higher percentage of nonzeros with 128 and 256-bit blocking factor. 
         
         More detail results of our full block Wiedemann CUDA implementation as well as a multi-GPU implementation can be found in \cite{ch19:spmv-ccpe}
  
         
         More detail results of our full block Wiedemann CUDA implementation as well as a multi-GPU implementation can be found in \cite{ch19:spmv-ccpe}
  
@@ -370,6 +370,7 @@ We compare the performance of our SCOO format to available SpMV implementations
  
  \begin{table}
  \begin{center}
  
  \begin{table}
  \begin{center}
+\begin{small}
  \begin{tabular}{|l|n{8}{0}|n{9}{0}|n{3}{2}|l|}
  \hline
  {Name} & row & column & nz/row & Description \\
  \begin{tabular}{|l|n{8}{0}|n{9}{0}|n{3}{2}|l|}
  \hline
  {Name} & row & column & nz/row & Description \\
@@ -403,6 +404,7 @@ rgg\_n\_2\_24\_s0 &16777216&16777216&15.80&undirected random\\
  uk-2002 &18520486&18520486&16.10&directed graph\\
  \hline
  \end{tabular}
  uk-2002 &18520486&18520486&16.10&directed graph\\
  \hline
  \end{tabular}
+\end{small}
  \end{center}
  \caption{Overview of sparse matrices used for performance evaluation.}
  \label{table:matrices}
  \end{center}
  \caption{Overview of sparse matrices used for performance evaluation.}
  \label{table:matrices}
@@ -421,13 +423,13 @@ We compare the SCOO format to the CSR, COO, and HYB format of Cusp 0.3.0. Other
  
  
  
  
  
  
-The SCOO format achieves a stable performance for different matrices in single-precision mode. In most cases a performance of over 10 Gflop/s can be sustained. For some highly unstructured matrices such as \emph{GL7d19}, \emph{wikipedia-20070206}, \emph{rgg\_n\_2\_24\_s0} and \emph{kron\_g500-logn21} SCOO achieves high speedups ranging from 3 to 6 compared to the best performaning Cusp format. 
+The SCOO format achieves a stable performance for different matrices in single-precision mode. In most cases a performance of over 10 Gflop/s can be sustained. For some highly unstructured matrices such as \emph{GL7d19}, \emph{wikipedia-20070206}, \emph{rgg\_n\_2\_24\_s0}, and \emph{kron\_g500-logn21}, SCOO achieves high speedups ranging from 3 to 6 compared to the best performaning Cusp format. 
  
  
-For most matrices, HYB produces the best performance among the tested Cusp formats. HYB is able to outperform SCOO only for two matrices: \emph{nlpkkt120} and \emph{nlpkkt160}. Both matrices have a similar structure i.e. they consist of consecutive rows that have a very similar number of nonzero coefficients which is suitable to be stored in the ELL section of the HYB format. Moreover the nonzeros are close to each other facilitating coaleasing and cache-friendly access patterns by nature. SCOO is able to outperform COO and CSR for all tested matrices.
+For most matrices, HYB produces the best performance among the tested Cusp formats. HYB is able to outperform SCOO for only two matrices: \emph{nlpkkt120} and \emph{nlpkkt160}. Both matrices have a similar structure, i.e., they consist of consecutive rows that have a very similar number of nonzero coefficients which are suitable to be stored in the ELL section of the HYB format. Moreover the nonzeros are close to each other facilitating coaleasing and cache-friendly access patterns by nature. SCOO is able to outperform COO and CSR for all tested matrices.
  
  In matrix $Relat9$ we observe some patterns but the matrix is still generally unstructured, thus SCOO is able to achieve about 2 times speed up compared to HYB which is the best among tested Cusp formats in this case. The average speedup of SCOO for the tested matrices is 3.0 compared to CSR, 5.02 compared to COO, 2.15 compared to HYB.
  
  
  In matrix $Relat9$ we observe some patterns but the matrix is still generally unstructured, thus SCOO is able to achieve about 2 times speed up compared to HYB which is the best among tested Cusp formats in this case. The average speedup of SCOO for the tested matrices is 3.0 compared to CSR, 5.02 compared to COO, 2.15 compared to HYB.
  
-We show the visualization of sparse matrices \emph{nlpkkt120}, \emph{Relat9}, \emph{GL7d19} in Figure \ref{fig:mat-str}, \ref{fig:mat-mid}, \ref{fig:mat-unstr} using MatView \cite{ch19:matview}. The white color represents zero entries, gray color represents nonzero entries.
+We show the visualization of sparse matrices \emph{nlpkkt120}, \emph{relat9}, \emph{GL7d19} in Figure \ref{fig:mat-str}, \ref{fig:mat-mid}, \ref{fig:mat-unstr} using MatView \cite{ch19:matview}. The white color represents zero entries, gray color represents nonzero entries.
  
  \begin{figure}[htbp]
  \centering
  
  \begin{figure}[htbp]
  \centering
@@ -435,16 +437,16 @@ We show the visualization of sparse matrices \emph{nlpkkt120}, \emph{Relat9}, \e
                 \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-str.pdf}
                 \label{fig:mat-str}
         }
                 \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-str.pdf}
                 \label{fig:mat-str}
         }
-       \subfigure[Relat9 - first 10000 rows] {
+       \subfigure[relat9--first 10000 rows] {
                 \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-mid.pdf}
                 \label{fig:mat-mid}
         }
                 \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-mid.pdf}
                 \label{fig:mat-mid}
         }
-       \subfigure[GL7d19 - first 500 rows and columns] {
+       \subfigure[GL7d19--first 500 rows and columns] {
                 \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-uns.pdf}
                 \label{fig:mat-unstr}
         }
    
                 \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-uns.pdf}
                 \label{fig:mat-unstr}
         }
    
-  \caption{Visualization of \emph{nlpkkt120}, \emph{Relat9}, and \emph{GL7d19} matrix.}
+  \caption{Visualization of \emph{nlpkkt120}, \emph{relat9}, and \emph{GL7d19} matrix.}
  %  \label{fig:mat-visual}
  \end{figure}
  
  %  \label{fig:mat-visual}
  \end{figure}
  
@@ -458,13 +460,13 @@ We show the visualization of sparse matrices \emph{nlpkkt120}, \emph{Relat9}, \e
  \label{fig:scoo-vs-cpu}
  \end{figure}
  
  \label{fig:scoo-vs-cpu}
  \end{figure}
  
-We use the Intel MKL library 10.3 in order to compare SCOO performance to an optimized CPU implementation. MKL SpMV receives the input matrices in CSR format. The results are shown in Figure \ref{fig:scoo-vs-cpu}. Using a GTX-580, we achieve speedups ranging between 5.5 and 18 over MKL on a 4-core CPU with hyper-threading using 8 threads. Also note that the SCOO performance on a GTX-580 is around 1.5 times faster than on the C2075 due to the increased memory bandwidth and clock speed. The storage requirement for the \emph{rgg\_n\_2\_24\_s0} and \emph{uk-2002} matrices and associated input/output vectors slightly exceeds the 3 GB global memory of the GTX-580 and thus are not included.
+We used the Intel MKL library 10.3 in order to compare SCOO performance to an optimized CPU implementation. MKL SpMV receives the input matrices in CSR format. The results are shown in Figure \ref{fig:scoo-vs-cpu}. Using a GTX-580, we achieved speedups ranging between 5.5 and 18 over MKL on a 4-core CPU with hyper-threading using 8 threads. Also note that the SCOO performance on a GTX-580 is around 1.5 times faster than on the C2075 due to the increased memory bandwidth and clock speed. The storage requirement for the \emph{rgg\_n\_2\_24\_s0} and \emph{uk-2002} matrices and associated input/output vectors slightly exceeds the 3 GB global memory of the GTX-580 and thus are not included.
  
  \section{Conclusion}
  \label{ch19:conclusion}
  
  \section{Conclusion}
  \label{ch19:conclusion}
-In this chapter, we have presented our implementation of iterative SpMV for NFS matrices on GPUs with the CUDA programming language. Our GPU implementation takes advantage of the variety of sparseness properties in NFS matrices to produce suitable formats for different parts. The GPU implementation shows promising improvement over an optimized CPU implementation. As the size of integers in factorization projects is expected to increase further, the linear algebrea step of NFS will become an even bigger bottleneck. The size and sparseness of matrices generated by the NFS sieving step are growing significantly with the size of the integer to be factored. Thus, a big GPU cluster is required to accelerate the linear algebra step. However, in order to achieve scalability for bigger problem sizes, the amount of GPU RAM and data transfer bandwidth need to be increased in addition to the number of GPUs. 
+In this chapter, we have presented our implementation of iterative SpMV for NFS matrices on GPUs with the CUDA programming language. Our GPU implementation takes advantage of the variety of sparseness properties in NFS matrices to produce suitable formats for different parts. The GPU implementation shows promising improvement over an optimized CPU implementation. As the size of integers in factorization projects is expected to increase further, the linear algebrea step of NFS will become an even bigger bottleneck. The size and sparseness of matrices generated by the NFS sieving step are growing significantly with the size of the integer to be factored. Thus, a large GPU cluster is required to accelerate the linear algebra step. However, in order to achieve scalability for larger problem sizes, the amounts of GPU RAM and data transfer bandwidth need to be increased in addition to the number of GPUs. 
  
  
-We further adapted the proposed Sliced COO format to single-precision floating-point numbers and evaluated it with large and sparse matrices derived from other computational science applications. We have published our code at https://github.com/danghvu/cudaSpmv.
+We further adapted the proposed Sliced COO format to single-precision floating-point numbers and evaluated it with large and sparse matrices derived from other computational science applications. We have published our code at https://github.com/danghvu/cudaSpmv
  
  \putbib[Chapters/chapter19/biblio]
  
  
  \putbib[Chapters/chapter19/biblio]