new

[book_gpu.git] / BookGPU / Chapters / chapter19 / ch19.tex
diff --git a/BookGPU/Chapters/chapter19/ch19.tex b/BookGPU/Chapters/chapter19/ch19.tex

index 480e5287722dc2512f362a7e45e3dde375fc1f49..90dee2315c421a36ae71c68d4ac65ffa3fcba753 100755 (executable)
--- a/BookGPU/Chapters/chapter19/ch19.tex
+++ b/BookGPU/Chapters/chapter19/ch19.tex
@@ -421,13 +421,13 @@ We compare the SCOO format to the CSR, COO, and HYB format of Cusp 0.3.0. Other
  
  
  
  
  
  
-The SCOO format achieves a stable performance for different matrices in single-precision mode. In most cases a performance of over 10 Gflop/s can be sustained. For some highly unstructured matrices such as \emph{GL7d19}, \emph{wikipedia-20070206}, \emph{rgg\_n\_2\_24\_s0} and \emph{kron\_g500-logn21} SCOO achieves high speedups ranging from 3 to 6 compared to the best performaning Cusp format. 
+The SCOO format achieves a stable performance for different matrices in single-precision mode. In most cases a performance of over 10 Gflop/s can be sustained. For some highly unstructured matrices such as \emph{GL7d19}, \emph{wikipedia-20070206}, \emph{rgg\_n\_2\_24\_s0}, and \emph{kron\_g500-logn21}, SCOO achieves high speedups ranging from 3 to 6 compared to the best performaning Cusp format. 
  
  
-For most matrices, HYB produces the best performance among the tested Cusp formats. HYB is able to outperform SCOO only for two matrices: \emph{nlpkkt120} and \emph{nlpkkt160}. Both matrices have a similar structure i.e. they consist of consecutive rows that have a very similar number of nonzero coefficients which is suitable to be stored in the ELL section of the HYB format. Moreover the nonzeros are close to each other facilitating coaleasing and cache-friendly access patterns by nature. SCOO is able to outperform COO and CSR for all tested matrices.
+For most matrices, HYB produces the best performance among the tested Cusp formats. HYB is able to outperform SCOO for only two matrices: \emph{nlpkkt120} and \emph{nlpkkt160}. Both matrices have a similar structure, i.e., they consist of consecutive rows that have a very similar number of nonzero coefficients which are suitable to be stored in the ELL section of the HYB format. Moreover the nonzeros are close to each other facilitating coaleasing and cache-friendly access patterns by nature. SCOO is able to outperform COO and CSR for all tested matrices.
  
  In matrix $Relat9$ we observe some patterns but the matrix is still generally unstructured, thus SCOO is able to achieve about 2 times speed up compared to HYB which is the best among tested Cusp formats in this case. The average speedup of SCOO for the tested matrices is 3.0 compared to CSR, 5.02 compared to COO, 2.15 compared to HYB.
  
  
  In matrix $Relat9$ we observe some patterns but the matrix is still generally unstructured, thus SCOO is able to achieve about 2 times speed up compared to HYB which is the best among tested Cusp formats in this case. The average speedup of SCOO for the tested matrices is 3.0 compared to CSR, 5.02 compared to COO, 2.15 compared to HYB.
  
-We show the visualization of sparse matrices \emph{nlpkkt120}, \emph{Relat9}, \emph{GL7d19} in Figure \ref{fig:mat-str}, \ref{fig:mat-mid}, \ref{fig:mat-unstr} using MatView \cite{ch19:matview}. The white color represents zero entries, gray color represents nonzero entries.
+We show the visualization of sparse matrices \emph{nlpkkt120}, \emph{relat9}, \emph{GL7d19} in Figure \ref{fig:mat-str}, \ref{fig:mat-mid}, \ref{fig:mat-unstr} using MatView \cite{ch19:matview}. The white color represents zero entries, gray color represents nonzero entries.
  
  \begin{figure}[htbp]
  \centering
  
  \begin{figure}[htbp]
  \centering
@@ -435,16 +435,16 @@ We show the visualization of sparse matrices \emph{nlpkkt120}, \emph{Relat9}, \e
                 \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-str.pdf}
                 \label{fig:mat-str}
         }
                 \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-str.pdf}
                 \label{fig:mat-str}
         }
-       \subfigure[Relat9 - first 10000 rows] {
+       \subfigure[relat9--first 10000 rows] {
                 \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-mid.pdf}
                 \label{fig:mat-mid}
         }
                 \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-mid.pdf}
                 \label{fig:mat-mid}
         }
-       \subfigure[GL7d19 - first 500 rows and columns] {
+       \subfigure[GL7d19--first 500 rows and columns] {
                 \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-uns.pdf}
                 \label{fig:mat-unstr}
         }
    
                 \includegraphics[width=100pt]{Chapters/chapter19/fig/matrix-uns.pdf}
                 \label{fig:mat-unstr}
         }
    
-  \caption{Visualization of \emph{nlpkkt120}, \emph{Relat9}, and \emph{GL7d19} matrix.}
+  \caption{Visualization of \emph{nlpkkt120}, \emph{relat9}, and \emph{GL7d19} matrix.}
  %  \label{fig:mat-visual}
  \end{figure}
  
  %  \label{fig:mat-visual}
  \end{figure}
  
@@ -458,13 +458,13 @@ We show the visualization of sparse matrices \emph{nlpkkt120}, \emph{Relat9}, \e
  \label{fig:scoo-vs-cpu}
  \end{figure}
  
  \label{fig:scoo-vs-cpu}
  \end{figure}
  
-We use the Intel MKL library 10.3 in order to compare SCOO performance to an optimized CPU implementation. MKL SpMV receives the input matrices in CSR format. The results are shown in Figure \ref{fig:scoo-vs-cpu}. Using a GTX-580, we achieve speedups ranging between 5.5 and 18 over MKL on a 4-core CPU with hyper-threading using 8 threads. Also note that the SCOO performance on a GTX-580 is around 1.5 times faster than on the C2075 due to the increased memory bandwidth and clock speed. The storage requirement for the \emph{rgg\_n\_2\_24\_s0} and \emph{uk-2002} matrices and associated input/output vectors slightly exceeds the 3 GB global memory of the GTX-580 and thus are not included.
+We used the Intel MKL library 10.3 in order to compare SCOO performance to an optimized CPU implementation. MKL SpMV receives the input matrices in CSR format. The results are shown in Figure \ref{fig:scoo-vs-cpu}. Using a GTX-580, we achieved speedups ranging between 5.5 and 18 over MKL on a 4-core CPU with hyper-threading using 8 threads. Also note that the SCOO performance on a GTX-580 is around 1.5 times faster than on the C2075 due to the increased memory bandwidth and clock speed. The storage requirement for the \emph{rgg\_n\_2\_24\_s0} and \emph{uk-2002} matrices and associated input/output vectors slightly exceeds the 3 GB global memory of the GTX-580 and thus are not included.
  
  \section{Conclusion}
  \label{ch19:conclusion}
  
  \section{Conclusion}
  \label{ch19:conclusion}
-In this chapter, we have presented our implementation of iterative SpMV for NFS matrices on GPUs with the CUDA programming language. Our GPU implementation takes advantage of the variety of sparseness properties in NFS matrices to produce suitable formats for different parts. The GPU implementation shows promising improvement over an optimized CPU implementation. As the size of integers in factorization projects is expected to increase further, the linear algebrea step of NFS will become an even bigger bottleneck. The size and sparseness of matrices generated by the NFS sieving step are growing significantly with the size of the integer to be factored. Thus, a big GPU cluster is required to accelerate the linear algebra step. However, in order to achieve scalability for bigger problem sizes, the amount of GPU RAM and data transfer bandwidth need to be increased in addition to the number of GPUs. 
+In this chapter, we have presented our implementation of iterative SpMV for NFS matrices on GPUs with the CUDA programming language. Our GPU implementation takes advantage of the variety of sparseness properties in NFS matrices to produce suitable formats for different parts. The GPU implementation shows promising improvement over an optimized CPU implementation. As the size of integers in factorization projects is expected to increase further, the linear algebrea step of NFS will become an even bigger bottleneck. The size and sparseness of matrices generated by the NFS sieving step are growing significantly with the size of the integer to be factored. Thus, a large GPU cluster is required to accelerate the linear algebra step. However, in order to achieve scalability for larger problem sizes, the amounts of GPU RAM and data transfer bandwidth need to be increased in addition to the number of GPUs. 
  
  
-We further adapted the proposed Sliced COO format to single-precision floating-point numbers and evaluated it with large and sparse matrices derived from other computational science applications. We have published our code at https://github.com/danghvu/cudaSpmv.
+We further adapted the proposed Sliced COO format to single-precision floating-point numbers and evaluated it with large and sparse matrices derived from other computational science applications. We have published our code at https://github.com/danghvu/cudaSpmv
  
  \putbib[Chapters/chapter19/biblio]
  
  
  \putbib[Chapters/chapter19/biblio]