ch1

[book_gpu.git] / BookGPU / Chapters / chapter19 / ch19.tex
diff --git a/BookGPU/Chapters/chapter19/ch19.tex b/BookGPU/Chapters/chapter19/ch19.tex

index ea65a11d523c439cceee2cd915cfd1c5821e7417..fdf533d46bb0d01853770689c12ec0ccea95e697 100755 (executable)
--- a/BookGPU/Chapters/chapter19/ch19.tex
+++ b/BookGPU/Chapters/chapter19/ch19.tex
@@ -1,4 +1,5 @@
-\chapterauthor{Bertil Schmidt, Hoang-Vu Dang}{Johannes Gutenberg University of Mainz}
+\chapterauthor{Bertil Schmidt and Hoang-Vu Dang}{Johannes Gutenberg University of Mainz, Germany}
+%\chapterauthor{Hoang-Vu Dang}{Johannes Gutenberg University of Mainz}
  
  \chapter{Solving large sparse linear systems for integer factorization on GPUs}
  \label{chapter19}
  
  \chapter{Solving large sparse linear systems for integer factorization on GPUs}
  \label{chapter19}
@@ -101,7 +102,7 @@ We can treat $x$, $y$ and $S_i$ as vectors of block width $m$ or $n$. Assuming $
         \subsubsection*{Coordinate list (COO)\index{Compressed storage format!COO}}
  For each non-zero, both its column and row indices are explicitly stored. The Cusp implementation \cite{ch19:cusp} stores elements in sorted order of row indices ensuring that entries with the same row index are stored contiguously. 
  
         \subsubsection*{Coordinate list (COO)\index{Compressed storage format!COO}}
  For each non-zero, both its column and row indices are explicitly stored. The Cusp implementation \cite{ch19:cusp} stores elements in sorted order of row indices ensuring that entries with the same row index are stored contiguously. 
  
-       \begin{lstlisting}
+       \begin{lstlisting}[caption={}]
         coo.row_index = {0, 1, 1, 2, 2, 3, 3,  3, 4, 5,  5}
         coo.col_index = {2, 1, 4, 1, 3, 0, 2,  4, 2, 0,  5}
         coo.value     = {3, 1, 5, 2, 4, 6, 8, 10, 9, 7, 11}
         coo.row_index = {0, 1, 1, 2, 2, 3, 3,  3, 4, 5,  5}
         coo.col_index = {2, 1, 4, 1, 3, 0, 2,  4, 2, 0,  5}
         coo.value     = {3, 1, 5, 2, 4, 6, 8, 10, 9, 7, 11}
@@ -109,7 +110,7 @@ For each non-zero, both its column and row indices are explicitly stored. The Cu
  
         \subsubsection*{Compressed Sparse Row (CSR)\index{Compressed storage format!CSR}} Non-zeros are sorted by the row index, and only their column indices are explicitly stored in a column array. Additionally, the vector $row\_start$ stores indices of the first non-zero element of each row in the column array.
         
  
         \subsubsection*{Compressed Sparse Row (CSR)\index{Compressed storage format!CSR}} Non-zeros are sorted by the row index, and only their column indices are explicitly stored in a column array. Additionally, the vector $row\_start$ stores indices of the first non-zero element of each row in the column array.
         
-       \begin{lstlisting}
+       \begin{lstlisting}[caption={}]
         csr.row_start = {0, 1, 3, 5, 8, 9, 12}
         csr.col_index = {2, 1, 4, 1, 3, 0, 2,  4, 2, 0,  5}
         csr.value     = {3, 1, 5, 2, 4, 6, 8, 10, 9, 7, 11}
         csr.row_start = {0, 1, 3, 5, 8, 9, 12}
         csr.col_index = {2, 1, 4, 1, 3, 0, 2,  4, 2, 0,  5}
         csr.value     = {3, 1, 5, 2, 4, 6, 8, 10, 9, 7, 11}
@@ -117,7 +118,7 @@ For each non-zero, both its column and row indices are explicitly stored. The Cu
  
         \subsubsection*{Ellpack (ELL)\index{Compressed storage format!ELL}} Let $K$ be the maximum number of non-zero elements in any row of the matrix. Then, for each row, ELL stores exactly $K$ elements (extra padding is required for rows that contain less than $K$ non-zero elements). Only column indices are required to store in an array, the row index can be implied since exactly $K$ elements are stored per row. The Cusp implementation store the column indices in a transposed manner so that consecutive threads can access consecutive memory addresses.
         
  
         \subsubsection*{Ellpack (ELL)\index{Compressed storage format!ELL}} Let $K$ be the maximum number of non-zero elements in any row of the matrix. Then, for each row, ELL stores exactly $K$ elements (extra padding is required for rows that contain less than $K$ non-zero elements). Only column indices are required to store in an array, the row index can be implied since exactly $K$ elements are stored per row. The Cusp implementation store the column indices in a transposed manner so that consecutive threads can access consecutive memory addresses.
         
-       \begin{lstlisting}
+       \begin{lstlisting}[caption={}]
         ell.col_index = {
                                         2, 1, 1, 0, 2, 0, 
                                         *, 4, 3, 2, *, 5, 
         ell.col_index = {
                                         2, 1, 1, 0, 2, 0, 
                                         *, 4, 3, 2, *, 5, 
@@ -129,7 +130,7 @@ For each non-zero, both its column and row indices are explicitly stored. The Cu
         \end{lstlisting}
  
         \subsubsection*{Hybrid (HYB)\index{Compressed storage format!HYB}} The HYB format heuristically computes an value $K$ and store $K$ non-zeros per rows in the ELL format. When a row has more than $K$ non-zeros, the trailing non-zeros are stored in COO. This design decreases the storage overhead due to ELL padding elements and thus improve the overall performance.
         \end{lstlisting}
  
         \subsubsection*{Hybrid (HYB)\index{Compressed storage format!HYB}} The HYB format heuristically computes an value $K$ and store $K$ non-zeros per rows in the ELL format. When a row has more than $K$ non-zeros, the trailing non-zeros are stored in COO. This design decreases the storage overhead due to ELL padding elements and thus improve the overall performance.
-       \begin{lstlisting}
+       \begin{lstlisting}[caption={}]
         hyb.nnz_per_row   = 2
         hyb.ell.col_index = {2, 1, 1, 0, 2, 0, *, 4, 3, 2, *,  5}
         hyb.ell.value     = {3, 1, 2, 6, 9, 7, *, 5, 4, 8, *, 11}
         hyb.nnz_per_row   = 2
         hyb.ell.col_index = {2, 1, 1, 0, 2, 0, *, 4, 3, 2, *,  5}
         hyb.ell.value     = {3, 1, 2, 6, 9, 7, *, 5, 4, 8, *, 11}
@@ -139,7 +140,7 @@ For each non-zero, both its column and row indices are explicitly stored. The Cu
         \end{lstlisting}
  
         \subsubsection*{Sliced Ellpack (SLE)\index{Compressed storage format!SLE}} This format partitions the matrix into horizontal slices of $S$ adjacent rows \cite{ch19:sle}. Each slice is stored in ELLPACK format. The maximum number of non-zeros may be different for each slice. An additional array $slice\_start$ is used to index the first element in each slice. The matrix rows are usually sorted by the number of non-zeros per row in order to move rows with similar number of non-zeros together. 
         \end{lstlisting}
  
         \subsubsection*{Sliced Ellpack (SLE)\index{Compressed storage format!SLE}} This format partitions the matrix into horizontal slices of $S$ adjacent rows \cite{ch19:sle}. Each slice is stored in ELLPACK format. The maximum number of non-zeros may be different for each slice. An additional array $slice\_start$ is used to index the first element in each slice. The matrix rows are usually sorted by the number of non-zeros per row in order to move rows with similar number of non-zeros together. 
-       \begin{lstlisting}
+       \begin{lstlisting}[caption={}]
         sle.slice_size  = 2
         sle.col_index   = {
                                         2, 1, *, 4,
         sle.slice_size  = 2
         sle.col_index   = {
                                         2, 1, *, 4,
@@ -453,7 +454,7 @@ We use the Intel MKL library 10.3 in order to compare SCOO performance to an opt
  \label{ch19:conclusion}
  In this chapter, we have presented our implementation of iterative SpMV for NFS matrices on GPUs with the CUDA programming language. Our GPU implementation takes advantage of the variety of sparseness properties in NFS matrices to produce suitable formats for different parts. The GPU implementation shows promising improvement over an optimized CPU implementation. As the size of integers in factorization projects is expected to increase further, the linear algebrea step of NFS will become an even bigger bottleneck. The size and sparseness of matrices generated by the NFS sieving step are growing significantly with the size of the integer to be factored. Thus, a big GPU cluster is required to accelerate the linear algebra step. However, in order to achieve scalability for bigger problem sizes, the amount of GPU RAM and data transfer bandwidth need to be increased in addition to the number of GPUs. 
  
  \label{ch19:conclusion}
  In this chapter, we have presented our implementation of iterative SpMV for NFS matrices on GPUs with the CUDA programming language. Our GPU implementation takes advantage of the variety of sparseness properties in NFS matrices to produce suitable formats for different parts. The GPU implementation shows promising improvement over an optimized CPU implementation. As the size of integers in factorization projects is expected to increase further, the linear algebrea step of NFS will become an even bigger bottleneck. The size and sparseness of matrices generated by the NFS sieving step are growing significantly with the size of the integer to be factored. Thus, a big GPU cluster is required to accelerate the linear algebra step. However, in order to achieve scalability for bigger problem sizes, the amount of GPU RAM and data transfer bandwidth need to be increased in addition to the number of GPUs. 
  
-We further adapted the proposed Sliced COO format to single-precision floating-point numbers and evaluated it with large and sparse matrices derived from other computational science applications. We have published our code at [-to be updated-].
+We further adapted the proposed Sliced COO format to single-precision floating-point numbers and evaluated it with large and sparse matrices derived from other computational science applications. We have published our code at https://github.com/danghvu/cudaSpmv.
  
  \putbib[Chapters/chapter19/biblio]
  
  
  \putbib[Chapters/chapter19/biblio]