new

[book_gpu.git] / BookGPU / Chapters / chapter11 / ch11.tex
diff --git a/BookGPU/Chapters/chapter11/ch11.tex b/BookGPU/Chapters/chapter11/ch11.tex

index 5f81d1f6019ef840683392d014f1cee9f6f22430..85d5304209bbbecbe046f49d62d96f12f4509a80 100644 (file)
--- a/BookGPU/Chapters/chapter11/ch11.tex
+++ b/BookGPU/Chapters/chapter11/ch11.tex
@@ -1,6 +1,6 @@
  
  
-\chapterauthor{Gleb Beliakov}{School of Information Technology, Deakin University, Burwood 3125, Australia}
-\chapterauthor{Shaowu Liu}{School of Information Technology, Deakin University, Burwood 3125, Australia}
+\chapterauthor{Gleb Beliakov and Shaowu Liu}{School of Information Technology, Deakin University, Burwood 3125, Australia}
+%\chapterauthor{Shaowu Liu}{School of Information Technology, Deakin University, Burwood 3125, Australia}
  
  
  \chapter{Parallel Monotone Spline Interpolation and Approximation on GPUs}
  
  
  \chapter{Parallel Monotone Spline Interpolation and Approximation on GPUs}
@@ -104,9 +104,13 @@ $$
  
  It is almost straightforward to parallelise this scheme for GPUs, by processing each subinterval $[x_i,x_{i+1}]$ independently in a separate thread. However, it is not known in advance whether an extra knot $t_i$ needs to be inserted or not, and therefore calculation of the position of the knot in the output sequence of knots ${t_i}$ is problematic for parallel implementation (for a sequential algorithm no such issue arises). To avoid serialisation, we decided to insert an additional knot in every interval $[x_i,x_{i+1}]$, but set $t_i=x_i$ when the extra knot is not actually needed. This way we know in advance the position of the output knots and the length of this sequence is $2(n-1)$, and therefore all calculations can now be performed independently. The price we pay is that some of the spline knots can coincide. However, this does not affect spline evaluation, as one of the coinciding knots is simply disregarded, and the spline coefficients are replicated (so for a double knot $t_i=t_{i+1}$, we have $\alpha_i=\alpha_{i+1}$, $\beta_i=\beta_{i+1}$, $\gamma_i=\gamma_{i+1}$). Our implementation is presented in Figures \ref{ch11:algcoef}-\ref{ch11:algcoef1}.
  
  
  It is almost straightforward to parallelise this scheme for GPUs, by processing each subinterval $[x_i,x_{i+1}]$ independently in a separate thread. However, it is not known in advance whether an extra knot $t_i$ needs to be inserted or not, and therefore calculation of the position of the knot in the output sequence of knots ${t_i}$ is problematic for parallel implementation (for a sequential algorithm no such issue arises). To avoid serialisation, we decided to insert an additional knot in every interval $[x_i,x_{i+1}]$, but set $t_i=x_i$ when the extra knot is not actually needed. This way we know in advance the position of the output knots and the length of this sequence is $2(n-1)$, and therefore all calculations can now be performed independently. The price we pay is that some of the spline knots can coincide. However, this does not affect spline evaluation, as one of the coinciding knots is simply disregarded, and the spline coefficients are replicated (so for a double knot $t_i=t_{i+1}$, we have $\alpha_i=\alpha_{i+1}$, $\beta_i=\beta_{i+1}$, $\gamma_i=\gamma_{i+1}$). Our implementation is presented in Figures \ref{ch11:algcoef}-\ref{ch11:algcoef1}.
  
+\lstinputlisting[label=ch11:algcoef1,caption=Calculation of monotone spline knots and coefficients.]{Chapters/chapter11/code2.cu}
+
+
  At the spline evaluation stage we need to compute $s(z_k)$ for a sequence of query values ${z_k}, k=1,\ldots,K$. For each $z_k$ we locate the interval $[t_i,t_{i+1}]$ containing $z_k$, using bisection algorithm presented in Figure \ref{ch11:algeval}, and then apply the appropriate coefficients of the quadratic function. This is also  done in parallel.
  The bisection algorithm could be implemented using texture memory (to cache the array \texttt{z}), but this is not shown in Figure \ref{ch11:algeval}.
  
  At the spline evaluation stage we need to compute $s(z_k)$ for a sequence of query values ${z_k}, k=1,\ldots,K$. For each $z_k$ we locate the interval $[t_i,t_{i+1}]$ containing $z_k$, using bisection algorithm presented in Figure \ref{ch11:algeval}, and then apply the appropriate coefficients of the quadratic function. This is also  done in parallel.
  The bisection algorithm could be implemented using texture memory (to cache the array \texttt{z}), but this is not shown in Figure \ref{ch11:algeval}.
  
+\pagebreak
  \lstinputlisting[label=ch11:algcoef,caption=Implementation of the kernel for calculating spline knots and coefficients. Function fmax is used to avoid division by zero for data with coinciding abscissae.]{Chapters/chapter11/code1.cu}
  
  
  \lstinputlisting[label=ch11:algcoef,caption=Implementation of the kernel for calculating spline knots and coefficients. Function fmax is used to avoid division by zero for data with coinciding abscissae.]{Chapters/chapter11/code1.cu}
  
  
@@ -167,7 +171,6 @@ The bisection algorithm could be implemented using texture memory (to cache the
  %% \end{figure}
  
  
  %% \end{figure}
  
  
-\lstinputlisting[label=ch11:algcoef1,caption=Calculation of monotone spline knots and coefficients.]{Chapters/chapter11/code2.cu}
  
  %% \begin{figure}[!hp]
  %% \renewcommand{\baselinestretch}{1}
  
  %% \begin{figure}[!hp]
  %% \renewcommand{\baselinestretch}{1}
@@ -409,9 +412,8 @@ with $\hat y(k,l)$ being the unrestricted maximum likelihood estimator of $y_k\l
  %% %\renewcommand{\baselinestretch}{1}
  \begin{table}[!h]
  \begin{center}
  %% %\renewcommand{\baselinestretch}{1}
  \begin{table}[!h]
  \begin{center}
-\caption{The average CPU time (sec) of the serial PAVA, MLS and parallel MLS algorithms.  } \label{ch11:table1}
  \begin{tabular}{|r|r|r|r|}
  \begin{tabular}{|r|r|r|r|}
-
+\hline
  Data  & PAVA & MLS & GPU MLS \\ \hline
  
  monotone increasing $f$ & & & \\
  Data  & PAVA & MLS & GPU MLS \\ \hline
  
  monotone increasing $f$ & & & \\
@@ -430,9 +432,11 @@ $n=10^6$ &0.2&0.1& 38\\
  $n=10 \times 10^6$ &1.9& 1.9& 3500 \\
  $n=20 \times 10^6$ &3.5& 4.0&-- \\
  $n=50 \times 10^6$ &11& 11& -- \\
  $n=10 \times 10^6$ &1.9& 1.9& 3500 \\
  $n=20 \times 10^6$ &3.5& 4.0&-- \\
  $n=50 \times 10^6$ &11& 11& -- \\
-
+\hline
  \end{tabular}
  \end{center}
  \end{tabular}
  \end{center}
+\caption{The average CPU time (sec) of the serial PAVA, MLS and parallel MLS algorithms.  } 
+\label{ch11:table1}
  \end{table}
  %% %\renewcommand{\baselinestretch}{2}
  
  \end{table}
  %% %\renewcommand{\baselinestretch}{2}