diff --git a/BookGPU/Chapters/chapter11/ch11.tex b/BookGPU/Chapters/chapter11/ch11.tex

index 8a8d54c5d379ca23758588deb423ae6220998654..0aa6e8cc8aec787a1721a3fdd4e4b66db81ba197 100644 (file)
--- a/BookGPU/Chapters/chapter11/ch11.tex
+++ b/BookGPU/Chapters/chapter11/ch11.tex
@@ -1,56 +1,64 @@
  
  
-\chapterauthor{Gleb Beliakov}{School of Information Technology, Deakin University, Burwood 3125, Australia}
-\chapterauthor{Shaowu Liu}{School of Information Technology, Deakin University, Burwood 3125, Australia}
+\chapterauthor{Gleb Beliakov and Shaowu Liu}{School of Information Technology, Deakin University, Burwood 3125, Australia}
+%\chapterauthor{Shaowu Liu}{School of Information Technology, Deakin University, Burwood 3125, Australia}
  
  
  
  
-\chapter{Parallel Monotone Spline Interpolation and Approximation on GPUs}
+\chapter{Parallel monotone spline interpolation and approximation on GPUs}
  
  \section{Introduction} \label{ch11:Introduction}
  
  
-Monotonicity preserving interpolation and approximation have received substantial attention in the last thirty years because of their numerous applications in computer aided design, statistics and machine learning \cite{Dierckx1995_book,Kvasov2000_book,deboor2001_book}. Constrained splines are particularly popular because of their flexibility in modelling different geometrical shapes, sound theoretical properties and availability of numerically stable algorithms \cite{Dierckx1995_book,Schumaker1981_book,deboor2001_book}. 
+Monotonicity-preserving interpolation and approximation have received substantial attention in the last thirty years because of their numerous applications in computer-aided design, statistics, and machine learning \cite{Dierckx1995_book,Kvasov2000_book,deboor2001_book}. Constrained splines \index{spline}\index{constrained splines}\index{monotonicity} are particularly popular because of their flexibility in modeling different geometrical shapes, sound theoretical properties, and availability of numerically stable algorithms \cite{Dierckx1995_book,Schumaker1981_book,deboor2001_book}.
  % It is surprising though that few parallel spline algorithms are available.
-In this work we examine parallelisation and adaptation for GPUs of a few algorithms of monotone spline interpolation and data smoothing, which arose in the context of estimating probability distributions.
+In this work we examine parallelization and adaptation for GPUs of a few algorithms of monotone spline interpolation and data smoothing, which arose in the context of estimating probability distributions.
  
  
-Estimating cumulative probability distribution functions (cdf) from data is quite common in data analysis. In our particular case we faced this problem in the context of partitioning univariate data with the purpose of efficient sorting. It was required to partition large data sets into chunks of  approximately equal size, so that these chunks could be sorted independently and subsequently concatenated. In order to do that, empirical cdf of the data was used to find the quantiles, which served to partition the data. Cdf was estimated from the data based on a number of pairs $(x_i,y_i), i=1,\ldots,n$, where $y_i$ was the proportion of data no larger than $x_i$. As data could come from a variety of distributions, a distribution-free nonparametric fitting procedure was required to interpolate the above pairs. Needless to say that the whole process was aimed at GPU, and hence the use of CPU for invoking serial algorithms had to be minimised.
+Estimating cumulative probability distribution functions (CDFs) from data is quite common in data analysis. In our particular case we faced this problem in the context of partitioning univariate data for efficient sorting. It was necessary to partition large data sets into chunks of approximately equal size, so that these chunks could be sorted independently and subsequently concatenated. To do that, the empirical CDF of the data was used to find the quantiles, which served to partition the data. The CDF was estimated from a number of pairs $(x_i,y_i), i=1,\ldots,n$, where $y_i$ was the proportion of data no larger than $x_i$. As the data could come from a variety of distributions, a distribution-free nonparametric fitting procedure was required to interpolate the above pairs. The whole process was aimed at the GPU, and hence the use of the CPU for invoking serial algorithms had to be minimized.
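+
+As a sketch of this partitioning step (the symbols $N$, $w_k$, $m$, $F$, and $q_j$ are introduced here for illustration only and are not part of the original notation), suppose the data set consists of $N$ values $w_1,\ldots,w_N$ and is to be split into $m$ chunks. The ordinates of the pairs and the chunk boundaries can then be written as
+$$
+y_i=\frac{1}{N}\,\#\{k : w_k\leq x_i\}, \qquad q_j=F^{-1}\left(\frac{j}{m}\right),\quad j=1,\ldots,m-1,
+$$
+where $F$ is a monotone interpolant of the pairs $(x_i,y_i)$, $q_0=-\infty$, and $q_m=+\infty$. Chunk $j$ collects the values in $(q_{j-1},q_j]$ and holds roughly $N/m$ of them, provided that $F$ is close to the true CDF and strictly increasing, so that $F^{-1}$ is well defined.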
  
  
-The above mentioned application is one of many examples (e.g. mass spectrography \cite{Kearsley_2006}, global warming data \cite{Yohai} and so on) where univariate data needs to be fitted by monotonicity preserving interpolants.  Of course, cdf is a monotone increasing function, whose inverse, called quantile function, can be used to calculate the quantiles. Spline interpolation would be the most suitable nonparametric method to fit the cdf, except that polynomial splines do not preserve monotonicity of the data, as illustrated on Figure \ref{ch11:fig1}.
  
  
-The failure of splines to preserve monotonicity has prompted fundamental research in this area since 1960s. One of the first methods to remedy this problem were splines in tension by Schweikert \cite{Sch}, where a tension parameter controlled the shape of exponential splines \cite{Spath1969}. Later on several monotonicity preserving polynomial spline algorithms were proposed \cite{Schumaker1983,PasRoul1977,AndElf1987,Andersson1991_JAT,McAllister1981_ACM,PasRoul1977}. These algorithms typically rely on introducing additional spline knots between the abscissae of the data. Algorithmic developments are active to this day, see for example \cite{Kvasov2000_book,Abbas2011}.
  
  
-When  in addition to the pairs $(x_i, y_i)$ the slopes of the function are available, i.e., the data comes in triples $(x_i, y_i, p_i)$, the interpolation problem is called Hermite, and the Hermite splines are used. However, even when the sequence $y_i$ is increasing and the slopes $p_i$ are non-negative, cubic Hermite splines may still fail to be monotone, as illustrated in Figure \ref{ch11:fig2}. Thus monotone Hermite splines are needed \cite{Gregory1982}.
  
  
-Another issue with monotone approximation is noisy data. In this case, inaccuracies in the data make the input sequence $y_i$ itself non-monotone, and hence monotone spline interpolation algorithms will fail.  Monotone spline smoothing algorithms are available, e.g. \cite{Andersson1991_JAT,Elfving1989_NM}. Such algorithms are based on solving a quadratic (or another convex) programming problem numerically, and have not been yet adapted to parallel processing.
  
  
-In this work we examined several monotone spline fitting algorithms, and selected the ones that we believe are most suitable for parallelisation on GPUs. We paid attention to numerical efficiency in terms of numerical calculations and memory access pattern, and favoured one-pass algorithms. We also looked at smoothing noisy data, and developed a parallel version of the Minimum Lower Sets algorithm for isotonic regression problem \cite{Best1990, Robertson_book}.
-
-The rest of the chapter is organised as follows. Section \ref{ch11:splines} discusses monotone spline interpolation methods and presents two parallel algorithms. Section \ref{ch11:smoothing} deals with smoothing problem. It presents isotonic regression problem and discusses the Pool Adjacent Violators (PAV) and Minimum Lower Sets (MLS) algorithms. Combined with monotone spline interpolation, the parallel MLS method makes it possible to build a monotone spline approximation to noisy data entirely on GPU. Section \ref{ch11:conc} concludes.
-
-\begin{figure}[h]
+\begin{figure}[!b]
  \centering
-\includegraphics[angle=0,width=8cm]{Chapters/chapter11/gregory1_plot1.pdf}
-\caption{Cubic spline (solid) and monotone quadratic spline (dashed) interpolating monotone data from \cite{Gregory1982}. Cubic spline fails to preserve monotonicity of the data.}
+\includegraphics[angle=0,width=9cm]{Chapters/chapter11/gregory1_plot1.pdf}
+\caption[Cubic spline (solid) and monotone quadratic spline (dashed) interpolating monotone data]{Cubic spline (solid) and monotone quadratic spline (dashed) interpolating monotone data from \cite{Gregory1982}. Cubic spline fails to preserve monotonicity of the data.}
  \label{ch11:fig1}
  \end{figure}
-
-\begin{figure}[h]
+\begin{figure}[!b]
  \centering
-\includegraphics[angle=00,width=8cm]{Chapters/chapter11/gregory1_plot2_b.pdf}
-\caption{Hermite cubic spline (solid) and Hermite rational spline interpolating monotone data from \cite{Gregory1982} with non-negative prescribed slopes. Despite non-negative slopes, Hermite cubic spline is not monotone.}
+\includegraphics[angle=00,width=9cm]{Chapters/chapter11/gregory1_plot2_b.pdf}
+\caption[Hermite cubic spline (solid) and Hermite rational spline interpolating monotone data]{Hermite cubic spline (solid) and Hermite rational spline interpolating monotone data from \cite{Gregory1982} with nonnegative prescribed slopes. Despite nonnegative slopes, the Hermite cubic spline is not monotone.}
  \label{ch11:fig2}
  \end{figure}
  
+
+The above mentioned application is one of many examples (e.g., mass spectrography \cite{Kearsley_2006} and global warming data \cite{Yohai}) where univariate data needs to be fitted by monotonicity-preserving interpolants. The CDF is a monotone increasing function, whose inverse, called the quantile function, can be used to calculate the quantiles. Spline interpolation would be the most suitable nonparametric method to fit the CDF, except that polynomial splines do not necessarily preserve monotonicity of the data, as illustrated in Figure \ref{ch11:fig1}.
+
+
+The failure of splines to preserve monotonicity has prompted fundamental research in this area since the 1960s. One of the first remedies was the splines in tension proposed by Schweikert \cite{Sch}, where a tension parameter controlled the shape of exponential splines \cite{Spath1969}. Later on, several monotonicity-preserving polynomial spline algorithms were proposed \cite{Schumaker1983,PasRoul1977,AndElf1987,Andersson1991_JAT,McAllister1981_ACM}. These algorithms typically rely on introducing additional spline knots between the abscissae of the data. Algorithmic developments are active to this day; see, for example, \cite{Kvasov2000_book,Abbas2011}.
+
+When, in addition to the pairs $(x_i, y_i)$, the slopes of the function are available, i.e., the data comes in triples $(x_i, y_i, p_i)$, the interpolation problem is called Hermite, and Hermite splines are used. However, even when the sequence $y_i$ is increasing and the slopes $p_i$ are nonnegative, cubic Hermite splines may still fail to be monotone, as illustrated in Figure \ref{ch11:fig2}. Thus, monotone Hermite splines are needed \cite{Gregory1982}. \index{Hermite splines}
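+
+A small worked example, with numbers chosen here purely for illustration (they are not the data of Figure \ref{ch11:fig2}), shows how this can happen. Take the two triples $(0,0,4)$ and $(1,1,4)$. The cubic Hermite interpolant on $[0,1]$ is $s(x)=6x^3-9x^2+4x$, with the derivative
+$$
+s'(x)=18x^2-18x+4, \qquad s'(0)=s'(1)=4, \qquad s'\left(\frac{1}{2}\right)=-\frac{1}{2},
+$$
+so $s$ decreases near the midpoint even though the data are increasing and both prescribed slopes are positive.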
+
+Another issue with monotone approximation is noisy data. In this case inaccuracies in the data make the input sequence $y_i$ itself nonmonotone, and hence monotone spline interpolation algorithms will fail. Monotone spline smoothing algorithms are available, e.g., \cite{Andersson1991_JAT,Elfving1989_NM}. Such algorithms are based on solving a quadratic (or another convex) programming problem numerically, and have not yet been adapted to parallel processing.
+
+In this work we examine several monotone spline fitting algorithms, and select the ones that we believe are most suitable for parallelization on GPUs. We pay attention to numerical efficiency in terms of numerical calculations and memory access pattern, and favor one-pass algorithms. We also look at smoothing noisy data and develop a parallel version of the Minimum Lower Sets (MLS) algorithm for the isotonic regression problem \cite{Best1990, Robertson_book}.
+\index{isotone regression}
+
+The rest of the chapter is organized as follows. Section \ref{ch11:splines} discusses monotone spline interpolation methods and presents two parallel algorithms. Section \ref{ch11:smoothing} deals with the smoothing problem. It presents the isotonic regression problem and discusses the Pool Adjacent Violators (PAV) and MLS algorithms. Combined with monotone spline interpolation, the parallel MLS method makes it possible to build a monotone spline approximation to noisy data entirely on GPU. Section \ref{ch11:conc} concludes.
+
+
  \section{Monotone splines} \label{ch11:splines}
  
-Splines are piecewise continuous functions very popular in numerical approximation and computer aided design \cite{deboor2001_book,Dierckx1995_book}. An example of a spline is broken line interpolation. Typically polynomial splines are used, and the first (and often second) derivatives of the polynomial pieces are required to match at the knots. The knots of the splines are usually the abscissae of the input data, although this condition is not always required (e.g., splines with free knots \cite{Jupp_1978,Dierckx1995_book,Beliakov2003_amc}).
+\index{constrained splines} \index{monotonicity}
+Splines are piecewise continuous functions very popular in numerical approximation and computer-aided design \cite{deboor2001_book,Dierckx1995_book}. The simplest example of a spline is broken line (piecewise linear) interpolation. Typically, polynomial splines are used, and the first (and often second) derivatives of the polynomial pieces are required to match at the knots. The knots of the splines are usually the abscissae of the input data, although this condition is not always required (e.g., splines with free knots \cite{Jupp_1978,Dierckx1995_book,Beliakov2003_amc}).
  
  
-Polynomial splines are often represented in the B-spline basis, in which case their coefficients are computed from the input data by solving a banded system of linear equations \cite{Lyche1973, Dierckx1995_book, deboor2001_book}. Tridiagonal systems arise in cubic spline interpolation, while pentadiagonal systems arise in cubic spline smoothing \cite{Lyche1973}. Spline possess important extremal properties \cite{Holladay1957,Lyche1973}, in particular splines of degree $2m-1$ are the most ``smooth" functions that interpolate (or approximate, in the least squares sense) the data. The smoothness term is Tihkonov regularisation functional, the $L_2$ norm of the $m$-th derivative of the interpolant \cite{Lyche1973}.
+Polynomial splines are often represented in the B-spline basis, in which case their coefficients are computed from the input data by solving a banded system of linear equations \cite{Lyche1973, Dierckx1995_book, deboor2001_book}. Tridiagonal systems arise in cubic spline interpolation, while pentadiagonal systems arise in cubic spline smoothing \cite{Lyche1973}. Splines possess important extremal properties \cite{Holladay1957,Lyche1973}; in particular, splines of degree $2m-1$ are the most ``smooth'' functions that interpolate (or approximate, in the least squares sense) the data. The smoothness term is the Tikhonov regularization functional, the $L_2$ norm of the $m$th derivative of the interpolant \cite{Lyche1973}.
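+
+For interpolation, this extremal property can be sketched as follows (a standard formulation; the symbols $J_m$ and $f$ are introduced here for illustration only). Among all sufficiently smooth functions $f$ interpolating the data, the natural spline of degree $2m-1$ minimizes
+$$
+J_m(f)=\int_{x_1}^{x_n}\left(f^{(m)}(x)\right)^2 dx ,
+$$
+so in the cubic case $m=2$ the natural cubic spline minimizes $\int_{x_1}^{x_n} (f''(x))^2 dx$.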
  
  
-When the data are known to come from a monotone function, the interpolant needs to be monotone as well. Even if the sequence of data ordinates $y_i, i=1,\ldots,n$ is non-decreasing, cubic (and higher degree) interpolating splines are not necessarily monotone, an example is shown in Figure \ref{ch11:fig1}. To deal with the problem of extraneous inflection points, Schweikert \cite{Sch}  proposed splines in tension, which are piecewise exponential functions. Splines in tension were further explored in \cite{Spath1969, SapKak1988, SapKakLouk1988} and many subsequent works.
+When the data are known to come from a monotone function, the interpolant needs to be monotone as well. Even if the sequence of data ordinates $y_i, i=1,\ldots,n$ is nondecreasing, cubic (and higher degree) interpolating splines are not necessarily monotone; an example is shown in Figure \ref{ch11:fig1}. To deal with the problem of extraneous inflection points, Schweikert \cite{Sch}  proposed splines in tension, which are piecewise exponential functions. Splines in tension have been further explored in \cite{Spath1969, SapKak1988, SapKakLouk1988} and many subsequent works.
  
  \subsection{Monotone quadratic splines}
  
  
-For polynomial splines, monotone or otherwise constrained splines were developed in \cite{Schumaker1983,AndElf1987,Andersson1991_JAT,McAllister1981_ACM,PasRoul1977}. Two monotone quadratic spline algorithms were published in the early 1980s \cite{McAllister1981_ACM, Schumaker1983}. Both algorithms are based on introducing additional interpolation knots under certain conditions, to facilitate preservation of monotonicity of the data. McAllister and Roulier's algorithm \cite{McAllister1981_ACM} introduces at most two extra knots between two neighbouring data, while Schumaker's algorithm  \cite{Schumaker1983} introduces only one extra knot. In addition, Schumaker's algorithm is one pass, which is particularly suited for parallelisation, as no system of equations needs to be solved. While parallel tridiagonal linear systems solvers have been developed for GPUs \cite{tridiag_GPU}, the obvious advantage of a one-pass algorithm is the speed.
-Because of that, we chose Schumaker's algorithm for GPU parallelisation.
+For polynomial splines, monotone or otherwise constrained splines were developed in \cite{Schumaker1983,AndElf1987,Andersson1991_JAT,McAllister1981_ACM,PasRoul1977}. Two monotone quadratic spline algorithms were published in the early 1980s \cite{McAllister1981_ACM, Schumaker1983}. Both algorithms are based on introducing additional interpolation knots under certain conditions, to facilitate preservation of monotonicity of the data. McAllister and Roulier's algorithm \cite{McAllister1981_ACM} introduces at most two extra knots between two neighboring data points, while Schumaker's algorithm \cite{Schumaker1983} introduces only one extra knot. In addition, Schumaker's algorithm is one pass, which is particularly suited for parallelization, as no system of equations needs to be solved. While parallel tridiagonal linear system solvers have been developed for GPUs \cite{tridiag_GPU}, the obvious advantage of a one-pass algorithm is the speed.
+Because of that, we chose Schumaker's algorithm for GPU parallelization.
  
  Let us formally describe Schumaker's algorithm, with Butland's slopes \cite{Butland1980}.
  The spline is a piecewise quadratic polynomial in the form
  
@@ -71,7 +79,9 @@ d_1=\left\{\begin{array}{ll}
              2\delta_{1}-d_2, & \mbox{if } \delta_{1}(2\delta_1-d_2)>0, \\
              0 & \mbox{otherwise},
            \end{array}
- \right. \;
+ \right.
+ $$
+ $$
   d_n=\left\{\begin{array}{ll}
              2\delta_{n-1}-d_{n-1}, & \mbox{if } \delta_{n-1}(2\delta_{n-1}-d_{n-1})>0,\\
              0 & \mbox{otherwise}.
@@ -79,7 +89,7 @@ d_1=\left\{\begin{array}{ll}
   \right.
  $$
  
-When $d_i+d_{i+1}=2\delta_i$, then a single quadratic polynomial interpolates the data on $[x_i,x_{i+1}]$ and  $t_i=x_i$ $\alpha_i=y_i, \beta_i=d_i, \gamma_i=\frac{d_{i+1}-d_i}{2(x_{i+1}-x_i)}$. otherwise an additional knot $t_i$ is required, and
+When $d_i+d_{i+1}=2\delta_i$, then a single quadratic polynomial interpolates the data on $[x_i,x_{i+1}]$ and  $t_i=x_i$, $\alpha_i=y_i, \beta_i=d_i$, and $ \gamma_i=\frac{d_{i+1}-d_i}{2(x_{i+1}-x_i)}$. Otherwise an additional knot $t_i$ is required, and
  \begin{eqnarray*}
  \alpha_{i}&=&y_{i}, \beta_{i}=d_{i}, \gamma_{i}=\frac{(\bar{d}_{i}-d_{i})}{2(t_{i}-x_{i})}, x\in\left [ x_{i},t_{i} \right ],\\
  \bar{\alpha}_{i}&=&y_{i}+d_{i}(t_{i}-x_{i})+\frac{(\bar{d}_{i}-d_{i})(t_{i}-x_{i})}{2}, \bar{\beta}_{i}=\bar{d}_{i}, \bar{\gamma}_{i}=\frac{(d_{i+1}-\bar{d}_{i})}{2(x_{i+1}-t_{i})}, x\in\left [ t_{i},x_{i+1} \right ],
@@ -98,190 +108,199 @@ t_{i}=
  \end{cases}
  $$
  
-It is almost straightforward to parallelise this scheme for GPUs, by processing each subinterval $[x_i,x_{i+1}]$ independently in a separate thread. However, it is not known in advance whether an extra knot $t_i$ needs to be inserted or not, and therefore calculation of the position of the knot in the output sequence of knots ${t_i}$ is problematic for parallel implementation (for a sequential algorithm no such issue arises). To avoid serialisation, we decided to insert an additional knot in every interval $[x_i,x_{i+1}]$, but set $t_i=x_i$ when the extra knot is not actually needed. This way we know in advance the position of the output knots and the length of this sequence is $2(n-1)$, and therefore all calculations can now be performed independently. The price we pay is that some of the spline knots can coincide. However, this does not affect spline evaluation, as one of the coinciding knots is simply disregarded, and the spline coefficients are replicated (so for a double knot $t_i=t_{i+1}$, we have $\alpha_i=\alpha_{i+1}$, $\beta_i=\beta_{i+1}$, $\gamma_i=\gamma_{i+1}$). Our implementation is presented in Figures \ref{ch11:algcoef}-\ref{ch11:algcoef1}.
-
-At the spline evaluation stage we need to compute $s(z_k)$ for a sequence of query values ${z_k}, k=1,\ldots,K$. For each $z_k$ we locate the interval $[t_i,t_{i+1}]$ containing $z_k$, using bisection algorithm presented in Figure \ref{ch11:algeval}, and then apply the appropriate coefficients of the quadratic function. This is also  done in parallel.
-The bisection algorithm could be implemented using texture memory (to cache the array \texttt{z}), but this is not shown in Figure \ref{ch11:algeval}.
-
-\begin{figure}[!hp]
-\renewcommand{\baselinestretch}{1}
- \begin{alltt}
-\begin{center}
-\begin{minipage}{13cm}\small
-
-template<typename Tx, typename Ty>
-\_\_global\_\_ void CalculateCoefficientsKnots( Tx *u, Ty *v, double *b, double *c,
-  double *t, double *alpha, double *beta, double *gamma, int N )
-\{
-  int tid = threadIdx.x + blockIdx.x * blockDim.x;
-  int s = tid*2;
-  while(tid<=(N-2))
-  \{
-   // decide whether an additional knot is necessary
-   if(fabs(c[tid]+c[tid+1]- 2*b[tid])<=0.1e-5) // tolerance
-   \{  //no additional knot
-      h[s]=h[s+1]=u[tid];
-      alpha[s]=alpha[s+1]=v[tid];
-      beta[s]=beta[s+1]=c[tid];
-      gamma[s]=gamma[s+1]=(c[tid+1]-c[tid])/(2*(fmax(1e-10,u[tid+1]-u[tid])));
-   \} else  \{  //adding a knot
-      h[s]=u[tid];
-      //determine the position of the knot
-      if((c[tid+1] - b[tid])*(c[tid] - b[tid])<0)
-        h[s+1]=u[tid+1] + (c[tid] - b[tid])*(fmax(1e-10,u[tid+1]-u[tid]))/
-               fmax(1e-10,(c[tid+1] - c[tid]));
-      else
-        h[s+1]=0.5*(u[tid+1] + u[tid]);
-   //calculate coefficients
-      double dtemp = (2*b[tid] - c[tid+1])+((c[tid+1] - c[tid])*(h[s+1] - u[tid]))/
-             fmax(1e-10,(u[tid+1] - u[tid]));
-      alpha[s]=v[tid];   beta[s]=c[tid];
-      gamma[s]=(dtemp - c[tid])/(2*fmax(1e-10,(h[s+1] - u[tid])));
-      alpha[s+1]=v[tid] + c[tid]*(h[s+1] - u[tid]) +
-                (dtemp - c[tid])*(h[s+1] - u[tid])/2;
-      gamma[s+1]=(c[tid+1] - dtemp)/(2*fmax(1e-10,(u[tid+1] - h[s+1])));
-      beta[s+1]=dtemp;
-    \}
-    tid += blockDim.x * gridDim.x;   s = tid*2;
-  \}
-  \_\_syncthreads();
-   // Select a single thread  to perform the last operation
-  if((threadIdx.x  ) == 0)  \{
-   s = (N-1) * 2;   h[s]=u[N-1];
-  \}
-  \_\_syncthreads();
-\}
-\end{minipage}
-\end{center}
-\end{alltt}
-\caption{Implementation of the kernel for calcuating spline knots and coefficients. Function fmax is used to avoid division by zero for data with coinciding abscissae.}
-\label{ch11:algcoef}
-\renewcommand{\baselinestretch}{2}
-\end{figure}
-
-\begin{figure}[!hp]
-\renewcommand{\baselinestretch}{1}
- \begin{alltt}
-\begin{center}
-\begin{minipage}{13cm}\small
-
-template<typename Tx, typename Ty>
-\_\_global\_\_ void CalculateBeta(Tx *u, Ty *v, double *b, int N)
-\{
-   int tid = threadIdx.x + blockIdx.x * blockDim.x;
-   while(tid<=(N-2)) \{
-     b[tid]=(v[tid+1]-v[tid])/fmax(1e-20,double(u[tid+1]-u[tid]));
-     tid += blockDim.x * gridDim.x;
-  \}
- \_\_syncthreads();
-\}
-\_\_global\_\_ void CalculateDGeneral( double *b, double *c, int N)
-\{
-   int tid = threadIdx.x + blockIdx.x * blockDim.x;
-   while(tid<=(N-2)) \{
-     if((b[tid-1]*b[tid])<=0) c[tid]=0;
-       else c[tid]=(2*b[tid-1]*b[tid])/(b[tid-1]+b[tid]);
-     \}
-     tid += blockDim.x * gridDim.x;
-  \}
-  \_\_syncthreads();
-\}
-\_\_global\_\_ void CalculateD( double *b, double *c, int N )
-\{
-   if((b[0]*(2*b[0]-c[1]))<=0) c[0]=0;
-     else  c[0]=2*b[0] - c[1];
-   if((b[N-2]*(2*b[N-2]-c[N-2]))<=0) c[N-1]=0;
-     else c[N-1]=2*b[N-2] - c[N-2];
-   \_\_syncthreads();
-\}
-template<typename Tx, typename Ty>             
-int BuildMonotonSpline(Tx *d_X, Ty *d_Y, int N,
-   double *t, double *alpha, double *beta, double *gamma)
-\{
-  int T =  (N-1)*2+1; // length of the output array
-  double *b, *c; // temp variables
-  cudaMalloc( (void**)&b, 1*N*sizeof(double) );
-  cudaMalloc( (void**)&c, 2*N*sizeof(double) );
-  int threads=256;
-  int blocks = (N-1)/threads + 1;
-  CalculateBeta<<<blocks,threads>>>(d_X,d_Y,b,N);
-  CalculateDGeneral<<<blocks,threads>>>(b,c,N);
-  CalculateD<<<1,1>>>(b,c,NN);  // calculate d_1 and d_N
-  CalculateCoefficientsKnots<<<blocks,threads>>>(d_X,
-                          d_Y,b,c,h,alpha,beta,gamma,N);
-  cudaFree(b); cudaFree(c);
-  return T;
-\}
-\end{minipage}
-\end{center}
-\end{alltt}
-\caption{Calculation of monotone spline knots and coefficients.}
-\label{ch11:algcoef1}
-\renewcommand{\baselinestretch}{2}
-\end{figure}
-
-\begin{figure}[!hp]
-\renewcommand{\baselinestretch}{1}
- \begin{alltt}
-\begin{center}
-\begin{minipage}{13cm}\small
-
-template<typename T>
-\_\_device\_\_ void Bisection\_device(T z, T* t, int mi, int ma,  int* l)
-\{
-  int i; ma--;
-  while(1) \{
-    i=(mi+ma)/2;
-    if(z >= t[i]) mi=i+1;
-             else ma=i;
-    if(mi>=ma) break;
-  \}           
-  *l = mi-1;
-\}
-
-/* Kernel to evaluates monotone spline for a sequence of query points
-   residing in the array z of size m
-*/
-template<typename Tx, typename Ty>     
-\_\_global\_\_ void d\_MonSplineValue(Tx* z, int K, double* t,
-  double * alpha, double * beta, double * gamma, int T, Ty *value)
-\{
-  int tid = threadIdx.x + blockIdx.x * blockDim.x;
-  int mi=0, ma=T, i=0;
-  Ty r;
-  while(tid<K)
-  \{
-     Bisection\_device(z[tid], t, mi, ma,  &i);
-     r= z[tid]-t[i];
-     r= alpha[i] + r*(beta[i] + gamma[i]*r);
-     value[tid]=r;
-     tid += blockDim.x * gridDim.x;
-   \}
-   \_\_syncthreads();
-\}
-
-template<typename Tx, typename Ty>     
-void MonotoneSplineValue(Tx *z, int K, double* t,
-   double * alpha, double * beta, double * gamma, int T, Ty* result)
-\{     
-  int blocks,threads=256;
-  blocks=(K-1)/threads+1;
-  d\_MonSplineValue<<<blocks,threads>>>(z,K,t,alpha,beta,gamma,T,result);
-\}
-\end{minipage}
-\end{center}
-\end{alltt}
-\caption{Implementation of the spline evaluation algorithm for GPU.}
-\label{ch11:algeval}
-\renewcommand{\baselinestretch}{2}
-\end{figure}
-
-
+It is almost straightforward to parallelize this scheme for GPUs, by processing each subinterval $[x_i,x_{i+1}]$ independently in a separate thread. However, it is not known in advance whether an extra knot $t_i$ needs to be inserted, and therefore calculation of the position of the knot in the output sequence of knots $\{t_i\}$ is problematic for parallel implementation (for a sequential algorithm no such issue arises). To avoid serialization, we decided to insert an additional knot in every interval $[x_i,x_{i+1}]$, but set $t_i=x_i$ when the extra knot is not actually needed. This way we know in advance the position of the output knots and the length of this sequence is $2(n-1)$, and therefore all calculations can now be performed independently. The price we pay is that some of the spline knots can coincide. However, this does not affect spline evaluation, as one of the coinciding knots is simply disregarded, and the spline coefficients are replicated (so for a double knot $t_i=t_{i+1}$, we have $\alpha_i=\alpha_{i+1}$, $\beta_i=\beta_{i+1}$, $\gamma_i=\gamma_{i+1}$). Our implementation is presented in Listings \ref{ch11:algcoef1} and \ref{ch11:algcoef}.
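+
+The resulting fixed layout can be summarized as follows (a sketch; the symbol $\tau$ for the merged output knot sequence is introduced here for illustration only). The thread processing the interval $[x_i,x_{i+1}]$, $i=1,\ldots,n-1$, writes to two output positions that depend only on $i$:
+$$
+\tau_{2i-1}=x_i, \qquad \tau_{2i}=t_i ,
+$$
+with $t_i=x_i$ when no extra knot is needed. The coefficients $(\alpha_i,\beta_i,\gamma_i)$ are stored at position $2i-1$, and $(\bar{\alpha}_i,\bar{\beta}_i,\bar{\gamma}_i)$, or a copy of the former when $t_i=x_i$, at position $2i$. No thread needs to know how many knots the other threads insert.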
+
+\lstinputlisting[label=ch11:algcoef1,caption=calculation of monotone spline knots and coefficients.]{Chapters/chapter11/code2.cu}
+
+
+At the spline evaluation stage we need to compute $s(z_k)$ for a sequence of query values $\{z_k\}, k=1,\ldots,K$. For each $z_k$ we locate the interval $[t_i,t_{i+1}]$ containing $z_k$, using the bisection algorithm presented in Listing \ref{ch11:algeval}, and then apply the appropriate coefficients of the quadratic function. This is also done in parallel, with one query value per thread.
+The bisection algorithm could be implemented using texture memory (to cache the array \texttt{z}), but this is not shown in Listing \ref{ch11:algeval}.
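+
+In terms of a merged knot sequence (denoted here, for illustration only, by $\tau_1\leq\tau_2\leq\ldots$, with coefficients $a_j,b_j,c_j$ attached to the piece starting at $\tau_j$), the evaluation of one query value $z$ amounts to
+$$
+j=\max\{k : \tau_k\leq z\}, \qquad r=z-\tau_j, \qquad s(z)=a_j+r\,(b_j+c_j\,r),
+$$
+where the index $j$ is found by bisection in a number of steps logarithmic in the number of knots, and the quadratic can be evaluated in nested (Horner) form. When several knots coincide, any of them may be selected, because their coefficients are identical.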
+
+\pagebreak
+\lstinputlisting[label=ch11:algcoef,caption=implementation of the kernel for calculating spline knots and coefficients; function fmax is used to avoid division by zero for data with coinciding abscissae.]{Chapters/chapter11/code1.cu}
+
+
+%% \begin{figure}[!hp]
+%% \renewcommand{\baselinestretch}{1}
+%%  \begin{alltt}
+%% \begin{center}
+%% \begin{minipage}{13cm}\small
+
+%% template<typename Tx, typename Ty>
+%% \_\_global\_\_ void CalculateCoefficientsKnots( Tx *u, Ty *v, double *b, double *c,
+%%   double *t, double *alpha, double *beta, double *gamma, int N )
+%% \{
+%%   int tid = threadIdx.x + blockIdx.x * blockDim.x;
+%%   int s = tid*2;
+%%   while(tid<=(N-2))
+%%   \{
+%%    // decide whether an additional knot is necessary
+%%    if(fabs(c[tid]+c[tid+1]- 2*b[tid])<=0.1e-5) // tolerance
+%%    \{  //no additional knot
+%%       h[s]=h[s+1]=u[tid];
+%%       alpha[s]=alpha[s+1]=v[tid];
+%%       beta[s]=beta[s+1]=c[tid];
+%%       gamma[s]=gamma[s+1]=(c[tid+1]-c[tid])/(2*(fmax(1e-10,u[tid+1]-u[tid])));
+%%    \} else  \{  //adding a knot
+%%       h[s]=u[tid];
+%%       //determine the position of the knot
+%%       if((c[tid+1] - b[tid])*(c[tid] - b[tid])<0)
+%%         h[s+1]=u[tid+1] + (c[tid] - b[tid])*(fmax(1e-10,u[tid+1]-u[tid]))/
+%%                fmax(1e-10,(c[tid+1] - c[tid]));
+%%       else
+%%         h[s+1]=0.5*(u[tid+1] + u[tid]);
+%%    //calculate coefficients
+%%       double dtemp = (2*b[tid] - c[tid+1])+((c[tid+1] - c[tid])*(h[s+1] - u[tid]))/
+%%              fmax(1e-10,(u[tid+1] - u[tid]));
+%%       alpha[s]=v[tid];   beta[s]=c[tid];
+%%       gamma[s]=(dtemp - c[tid])/(2*fmax(1e-10,(h[s+1] - u[tid])));
+%%       alpha[s+1]=v[tid] + c[tid]*(h[s+1] - u[tid]) +
+%%                 (dtemp - c[tid])*(h[s+1] - u[tid])/2;
+%%       gamma[s+1]=(c[tid+1] - dtemp)/(2*fmax(1e-10,(u[tid+1] - h[s+1])));
+%%       beta[s+1]=dtemp;
+%%     \}
+%%     tid += blockDim.x * gridDim.x;   s = tid*2;
+%%   \}
+%%   \_\_syncthreads();
+%%    // Select a single thread  to perform the last operation
+%%   if((threadIdx.x  ) == 0)  \{
+%%    s = (N-1) * 2;   h[s]=u[N-1];
+%%   \}
+%%   \_\_syncthreads();
+%% \}
+%% \end{minipage}
+%% \end{center}
+%% \end{alltt}
+%% \caption{Implementation of the kernel for calculating spline knots and coefficients. Function fmax is used to avoid division by zero for data with coinciding abscissae.}
+%% \label{ch11:algcoef}
+%% \renewcommand{\baselinestretch}{2}
+%% \end{figure}
+
+
+
+%% \begin{figure}[!hp]
+%% \renewcommand{\baselinestretch}{1}
+%%  \begin{alltt}
+%% \begin{center}
+%% \begin{minipage}{13cm}\small
+
+%% template<typename Tx, typename Ty>
+%% \_\_global\_\_ void CalculateBeta(Tx *u, Ty *v, double *b, int N)
+%% \{
+%%    int tid = threadIdx.x + blockIdx.x * blockDim.x;
+%%    while(tid<=(N-2)) \{
+%%      b[tid]=(v[tid+1]-v[tid])/fmax(1e-20,double(u[tid+1]-u[tid]));
+%%      tid += blockDim.x * gridDim.x;
+%%   \}
+%%  \_\_syncthreads();
+%% \}
+%% \_\_global\_\_ void CalculateDGeneral( double *b, double *c, int N)
+%% \{
+%%    int tid = threadIdx.x + blockIdx.x * blockDim.x;
+%%    while(tid<=(N-2)) \{
+%%      if(tid>=1) \{  // b[tid-1] is out of bounds for tid=0; c[0] is set in CalculateD
+%%        if((b[tid-1]*b[tid])<=0) c[tid]=0;
+%%          else c[tid]=(2*b[tid-1]*b[tid])/(b[tid-1]+b[tid]);
+%%      \}
+%%      tid += blockDim.x * gridDim.x;
+%%   \}
+%%   \_\_syncthreads();
+%% \}
+%% \_\_global\_\_ void CalculateD( double *b, double *c, int N )
+%% \{
+%%    if((b[0]*(2*b[0]-c[1]))<=0)      c[0]=0;
+%%      else  c[0]=2*b[0] - c[1];
+%%    if((b[N-2]*(2*b[N-2]-c[N-2]))<=0) c[N-1]=0;
+%%      else c[N-1]=2*b[N-2] - c[N-2];
+%%    \_\_syncthreads();
+%% \}
+%% template<typename Tx, typename Ty>          
+%% int BuildMonotonSpline(Tx *d_X, Ty *d_Y, int N,
+%%    double *t, double *alpha, double *beta, double *gamma)
+%% \{
+%%   int T =  (N-1)*2+1; // length of the output array
+%%   double *b, *c; // temp variables
+%%   cudaMalloc( (void**)&b, 1*N*sizeof(double) );
+%%   cudaMalloc( (void**)&c, 2*N*sizeof(double) );
+%%   int threads=256;
+%%   int blocks = (N-1)/threads + 1;
+%%   CalculateBeta<<<blocks,threads>>>(d_X,d_Y,b,N);
+%%   CalculateDGeneral<<<blocks,threads>>>(b,c,N);
+%%   CalculateD<<<1,1>>>(b,c,N);  // calculate d_1 and d_N
+%%   CalculateCoefficientsKnots<<<blocks,threads>>>(d_X,
+%%                           d_Y,b,c,t,alpha,beta,gamma,N);
+%%   cudaFree(b); cudaFree(c);
+%%   return T;
+%% \}
+%% \end{minipage}
+%% \end{center}
+%% \end{alltt}
+%% \caption{Calculation of monotone spline knots and coefficients.}
+%% \label{ch11:algcoef1}
+%% \renewcommand{\baselinestretch}{2}
+%% \end{figure}
+
+%% \begin{figure}[!hp]
+%% \renewcommand{\baselinestretch}{1}
+%%  \begin{alltt}
+%% \begin{center}
+%% \begin{minipage}{13cm}\small
+
+%% template<typename T>
+%% \_\_device\_\_ void Bisection\_device(T z, T* t, int mi, int ma,  int* l)
+%% \{
+%%   int i; ma--;
+%%   while(1) \{
+%%     i=(mi+ma)/2;
+%%     if(z >= t[i]) mi=i+1;
+%%              else ma=i;
+%%     if(mi>=ma) break;
+%%   \}                
+%%   *l = mi-1;
+%% \}
+
+%% /* Kernel to evaluate the monotone spline for a sequence of query points
+%%    residing in the array z of size K
+%% */
+%% template<typename Tx, typename Ty>  
+%% \_\_global\_\_ void d\_MonSplineValue(Tx* z, int K, double* t,
+%%   double * alpha, double * beta, double * gamma, int T, Ty *value)
+%% \{
+%%   int tid = threadIdx.x + blockIdx.x * blockDim.x;
+%%   int mi=0, ma=T, i=0;
+%%   Ty r;
+%%   while(tid<K)
+%%   \{
+%%      Bisection\_device(z[tid], t, mi, ma,  &i);
+%%      r= z[tid]-t[i];
+%%      r= alpha[i] + r*(beta[i] + gamma[i]*r);
+%%      value[tid]=r;
+%%      tid += blockDim.x * gridDim.x;
+%%    \}
+%%    \_\_syncthreads();
+%% \}
+
+%% template<typename Tx, typename Ty>  
+%% void MonotoneSplineValue(Tx *z, int K, double* t,
+%%    double * alpha, double * beta, double * gamma, int T, Ty* result)
+%% \{  
+%%   int blocks,threads=256;
+%%   blocks=(K-1)/threads+1;
+%%   d\_MonSplineValue<<<blocks,threads>>>(z,K,t,alpha,beta,gamma,T,result);
+%% \}
+%% \end{minipage}
+%% \end{center}
+%% \end{alltt}
+%% \caption{Implementation of the spline evaluation algorithm for GPU.}
+%% \label{ch11:algeval}
+%% \renewcommand{\baselinestretch}{2}
+%% \end{figure}
+
+\lstinputlisting[label=ch11:algeval,caption=implementation of the spline evaluation algorithm for GPU.]{Chapters/chapter11/code3.cu}
  
  \subsection{Monotone Hermite splines}
  
-
-In this section, in addition to the points $(x_i,y_i)$ we have the  slopes $p_i$, and hence we consider monotone Hermite interpolation. In our motivating application of cdf estimation, the values $p_i$ are easily obtained together with $y_i$, and their use may help to build a more accurate interpolant.
-Of course, for monotone non-decreasing functions we must have $p_i\geq 0$. However this does not guarantee that the spline interpolant is monotone, as can be seen in Figure \ref{ch11:fig2}. Fritsch and Carlson \cite{Fritsch1980} show that non-negative $p_i$ is not a sufficient condition to guarantee monotonicity, and design a process for modification of derivatives, so that the necessary and sufficient conditions for monotonicity of a piecewise cubic are met. Hence the values $p_i$ are not matched exactly. In contrast, Gregory and Delbourgo \cite{Gregory1982} design piecewise rational quadratic spline, for which the non-negativity of $p_i$ is both necessary and sufficient condition.
+\index{Hermite splines} \index{monotonicity}
+In this section, in addition to the points $(x_i,y_i)$ we have the  slopes $p_i$, and hence, we consider monotone Hermite interpolation. In our motivating application of CDF estimation, the values $p_i$ are easily obtained together with $y_i$, and their use may help to build a more accurate interpolant.
+Of course, for monotone nondecreasing functions we must have $p_i\geq 0$. However, this does not guarantee that the spline interpolant is monotone, as can be seen in Figure \ref{ch11:fig2}. Fritsch and Carlson \cite{Fritsch1980} showed that nonnegative $p_i$ is not a sufficient condition to guarantee monotonicity, and designed a process for modification of derivatives, so that the necessary and sufficient conditions for monotonicity of a piecewise cubic are met. Hence, the values $p_i$ are not matched exactly. In contrast, Gregory and Delbourgo \cite{Gregory1982} designed a piecewise rational quadratic spline, for which the nonnegativity of $p_i$ is both a necessary and sufficient condition.
  
  The rational quadratic spline in \cite{Gregory1982} is constructed as
  $$
  
  The rational quadratic spline in \cite{Gregory1982} is constructed as
  $$
@@ -308,37 +327,42 @@ with
  $$
  Q_i(\theta)= \Delta_i+(p_{i+1}+p_i-2\Delta_i)\theta(1-\theta),
  $$
-provided $\Delta_i \neq 0$ ( $s'(x)=0$ otherwise), and this expression is non-negative.
+provided $\Delta_i \neq 0$ ($s'(x)=0$ otherwise), and this expression is nonnegative.
  
  
-It is clear that  Gregory and Delbourgo's Hermite interpolant is trivially parallel, and the parameters $h_i=x_{i+1}-x_i$ and $\Delta_i$ are easily computed in a simple kernel. Evaluation of the spline and its derivative is accomplished by locating the interval containing the query point $x$ using bisection, as in Figure \ref{ch11:algeval}, and applying the above mentioned formulas.
+It is clear that  Gregory and Delbourgo's Hermite interpolant \cite{Gregory1982} is trivially parallel, and the parameters $h_i=x_{i+1}-x_i$ and $\Delta_i$ are easily computed in a simple kernel. Evaluation of the spline and its derivative is accomplished by locating the interval containing the query point $x$ using bisection, as in Listing \ref{ch11:algeval}, and applying the above-mentioned formulas.
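As a host-side illustration of the evaluation step, the following C++ sketch locates the interval by bisection and evaluates the rational quadratic interpolant. The formula for $s$ itself is not reproduced in this excerpt, so the sketch assumes the standard Gregory and Delbourgo form $s(x)=P_i(\theta)/Q_i(\theta)$ with $\theta=(x-x_i)/h_i$, $P_i(\theta)=\Delta_i y_{i+1}\theta^2+(y_i p_{i+1}+y_{i+1}p_i)\theta(1-\theta)+\Delta_i y_i(1-\theta)^2$, and $Q_i$ as given above. The names \texttt{find\_interval} and \texttt{rq\_spline\_value} are hypothetical and do not come from the chapter's listings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bisection: index i with x[i] <= q < x[i+1], clamped to the first/last interval.
std::size_t find_interval(const std::vector<double>& x, double q) {
    assert(x.size() >= 2);
    std::size_t lo = 0, hi = x.size() - 1;
    while (hi - lo > 1) {
        const std::size_t mid = lo + (hi - lo) / 2;
        if (q >= x[mid]) lo = mid; else hi = mid;
    }
    return lo;
}

// Rational quadratic Hermite interpolant s(q) = P_i(theta) / Q_i(theta)
// (assumed standard Gregory-Delbourgo form; slopes p[i] must be nonnegative).
double rq_spline_value(const std::vector<double>& x, const std::vector<double>& y,
                       const std::vector<double>& p, double q) {
    const std::size_t i = find_interval(x, q);
    const double h = x[i + 1] - x[i];
    const double delta = (y[i + 1] - y[i]) / h;
    if (delta == 0.0) return y[i];              // flat segment: s'(x) = 0, s is constant
    const double theta = (q - x[i]) / h;
    const double w = theta * (1.0 - theta);
    const double P = delta * y[i + 1] * theta * theta
                   + (y[i] * p[i + 1] + y[i + 1] * p[i]) * w
                   + delta * y[i] * (1.0 - theta) * (1.0 - theta);
    const double Q = delta + (p[i + 1] + p[i] - 2.0 * delta) * w;
    return P / Q;
}
```

Each query point is processed independently, which is why one GPU thread per query point is a natural mapping. For the data $y=x^2$ with exact slopes at $x=0,1,2$, this sketch returns $0.25$ at $x=0.5$.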
  
  
  \section{Smoothing noisy data via parallel isotone regression} \label{ch11:smoothing}
  
  
  
  
-Inaccuracies in the data are common in practice, and need to be accounted for during spline approximation process. Smoothing polynomial splines were presented in \cite{Lyche1973}, where the data are fitted in the least squares sense while also minimising the $L_2$ norm of the $m-$th derivative of the spline. Monotone smoothing splines were dealt with in several works, in particular we mention \cite{Andersson1991_JAT,Elfving1989_NM}. The presented algorithms rely on solving  quadratic programming problems. Monotone approximating splines with fixed knots distinct form the data have been presented in \cite{Beliakov2000_ata}, where an instance of a quadrating programming problem is solved as well.
  
  
-Another approach consists in monotonising the data, so that the sequence $y_i$ becomes monotone. This approach is known as isotone regression \cite{Best1990, Robertson_book}. It is different from monotone spline smoothing, as the regularisation term controlling the $L_2$ norm of the $m-$the derivative is not taken into account. Usually the data is monotonised by minimising the squared differences to the inputs. It becomes  a quadratic programming problem, usually solved by active sets methods \cite{Best1990}.
+Inaccuracies in the data are common in practice and need to be accounted for during the spline approximation process. Smoothing polynomial splines were presented in \cite{Lyche1973}, where the data were fitted in the least squares sense while also minimizing the $L_2$ norm of the $m$th derivative of the spline. Monotone smoothing splines have been dealt with in several works; in particular, we mention \cite{Andersson1991_JAT,Elfving1989_NM}. The algorithms presented there rely on solving quadratic programming problems. Monotone approximating splines with fixed knots distinct from the data have been presented in \cite{Beliakov2000_ata}, where an instance of a quadratic programming problem is solved as well.
+
+\index{isotone regression} \index{monotonicity}
+Another approach consists of monotonizing the data, so that the sequence $y_i$ becomes monotone. This approach is known as isotone regression \cite{Best1990, Robertson_book}. It is different from monotone spline smoothing, as the regularization term controlling the $L_2$ norm of the $m$th derivative is not taken into account. Usually the data are monotonized by minimizing the squared differences to the inputs. This becomes a quadratic programming problem, usually solved by active set methods \cite{Best1990}.
  The popular PAV algorithm (PAVA) is one method that provides an efficient numerical solution.
+\index{PAV algorithm}
  
  
- PAVA  consists of the following steps. The sequence ${y_i}$ is processed form the start. If violation of monotonicity $y_i>y_{i+1}$ is found, both values $y_i$ and $y_{i+1}$ are replaced with their average $y'_i$, and both values form a block. Since the new value $y'_i$ is smaller than $y_i$, monotonicity may become violated with respect to the datum $y_{i-1}$. If this is the case, the $i-1$st, $i$th and $i+1$st data are merged into a block and their values are replaced with their average. We continue back-average as needed to get monotonicity.
+PAVA consists of the following steps. The sequence $\{y_i\}$ is processed from the start. If a violation of monotonicity $y_i>y_{i+1}$ is found, both values $y_i$ and $y_{i+1}$ are replaced with their average $y'_i$, and both values form a block. Since the new value $y'_i$ is smaller than $y_i$, monotonicity may become violated with respect to $y_{i-1}$. If this is the case, the $(i-1)$st, $i$th, and $(i+1)$st data are merged into a block and their values are replaced with their average. We continue to back-average as needed to restore monotonicity.
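The back-averaging steps can be sketched serially in C++ as follows. This is a minimal reference version that keeps a stack of blocks (sum and count), not the implementation benchmarked in this chapter; the function name \texttt{pava} and the block representation are choices made for this illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial pool-adjacent-violators sketch: returns the nondecreasing
// least squares fit to y, assuming unit weights.
std::vector<double> pava(const std::vector<double>& y) {
    struct Block { double sum; std::size_t count; };
    std::vector<Block> blocks;
    for (double v : y) {
        blocks.push_back({v, 1});
        // Back-average while the last two blocks violate monotonicity.
        while (blocks.size() > 1) {
            Block& a = blocks[blocks.size() - 2];
            const Block& b = blocks.back();
            if (a.sum / a.count <= b.sum / b.count) break;
            a.sum += b.sum;
            a.count += b.count;
            blocks.pop_back();
        }
    }
    std::vector<double> out;
    out.reserve(y.size());
    for (const Block& b : blocks) out.insert(out.end(), b.count, b.sum / b.count);
    assert(out.size() == y.size());
    return out;
}
```

For the input $(1,3,2,4)$ the violation $3>2$ is averaged to $2.5$, giving $(1,2.5,2.5,4)$. Each datum is pushed once and merged at most once, which is where the linear running time comes from.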
  
  
-Various serial implementations of the PAVA exist. It is noted \cite{Kearsley_2006} that in PAVA, which is based on the ideas from convex analysis, a decomposition theorem holds, namely performing PAVA separately on two contiguous subsets of data, and then performing PAVA on the result produces isotonic regression on the whole data set. Thus isotonic regression is parallelisable, and divide-and-conquer approach, decomposing the original problem into two smaller subproblems, can be implemented on multiple processors. However, to our knowledge, no parallel PAVA for many-core systems such as GPUs exist.
+Various serial implementations of the PAVA exist. It is noted in \cite{Kearsley_2006} that for PAVA, which is based on ideas from convex analysis, a decomposition theorem holds: performing PAVA separately on two contiguous subsets of data and then performing PAVA on the result produces the isotonic regression on the whole data set. Thus, isotonic regression is parallelizable, and the divide-and-conquer approach, decomposing the original problem into two smaller subproblems, can be implemented on multiple processors. However, to our knowledge, no parallel PAVA for many-core systems such as GPUs exists.
  
  
-Another approach to isotonic regression is called the Minimum Lower Sets algorithm (MLS) \cite{Best1990, Robertson_book}. It provides the same solution as the PAVA, but works differently.  For each datum (or block), MLS selects the largest contiguous block of subsequent data with the smallest average. If this average is smaller than that of the preceding block, the blocks are merged, and the data in the block are replaced with their average. MLS is also an active set method \cite{Best1990}, but its complexity is $O(n^2)$ as opposed to $O(n)$ of the PAVA, and of another active set algorithm proposed in \cite{Best1990} under the name Algorithm A.
+\index{MLS (minimum lower sets) algorithm} 
+Another approach to isotonic regression is called the minimum lower sets (MLS) algorithm
+\cite{Best1990, Robertson_book}. It provides the same solution as the PAVA, but works differently.  For each datum (or block), MLS selects the largest contiguous block of subsequent data with the smallest average. If this average is smaller than that of the preceding block, the blocks are merged, and the data in the block are replaced with their average. MLS is also an active set method \cite{Best1990}, but its complexity is $O(n^2)$ as opposed to $O(n)$ of the PAVA, and of another active set algorithm proposed in \cite{Best1990} under the name Algorithm A.
  
  
-In terms of GPU parallelisation, neither PAVA nor Algorithm A appear to be suitable, as the techniques that achieve $O(n)$ complexity are inheritably serial.
-In this work we focused on parallelising MLS. First, we precompute the values
+In terms of GPU parallelization, neither PAVA nor Algorithm A appears to be suitable, as the techniques that achieve $O(n)$ complexity are inherently serial.
+In this work we focus on parallelizing MLS. First, we precompute the values
  $$
  z_i=\sum_{j=i}^n y_j
  $$
  and $z_{n+1}=0$
-using parallel partial sum algorithm (\texttt{scan} algorithm from Thrust \cite{Thrust} library).
+using the parallel partial sum algorithm (the \texttt{scan} algorithm from the Thrust library \cite{Thrust}).
  From these values we can compute the averages of the blocks of data with the indices $\{i,i+1,\ldots,j\}$
  \begin{equation} \label{ch11:eq1}
-P_{ij}=\frac{1}{j-i+1}\sum_{k=i}^j y_k = \frac{1}{j-i+1} (z_i-z_{j+1})
+P_{ij}=\frac{1}{j-i+1}\sum_{k=i}^j y_k = \frac{1}{j-i+1} (z_i-z_{j+1}).
  \end{equation}
  
-As per MLS algorithm, for each fixed $i$ from 1 to $n$ we compute the smallest $P_{ij}$ starting from $j=i+1$ and fix the index $j^*$. If $y_i>P_{ij^*}$, we replace the values $y_i,\ldots,y_{j^*}$ with their average $P_{ij^*}$ otherwise we keep the value $y_i$. In case of replacement, we advance $i$ to position $j^*+1$. We check the condition $y_i>P_{i,j^*}$ to form a block, which is equivalent to $y_i>P_{i+1,j^*}$ as $P_{ij}=\frac{1}{j-i+1}((j-i) P_{i+1,j}+y_i)$, from which we deduce both inequalities hold simultaneously.
+As per the MLS algorithm, for each fixed $i$ from 1 to $n$, we compute the smallest $P_{ij}$ starting from $j=i+1$ and fix the index $j^*$. If $y_i>P_{ij^*}$, we replace the values $y_i,\ldots,y_{j^*}$ with their average $P_{ij^*}$; otherwise we keep the value $y_i$. In case of replacement, we advance $i$ to position $j^*+1$. We check the condition $y_i>P_{i,j^*}$ to form a block, which is equivalent to $y_i>P_{i+1,j^*}$ as $P_{ij}=\frac{1}{j-i+1}((j-i) P_{i+1,j}+y_i)$, from which we deduce that both inequalities hold simultaneously.
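The serial procedure just described can be sketched in C++ as follows, using 0-based indices, so that \texttt{z[i]} holds $y_i+\cdots+y_{n-1}$ and \texttt{z[n]} is zero. The names \texttt{suffix\_sums}, \texttt{block\_average}, and \texttt{mls\_serial} are hypothetical, and the tie-breaking rule (keeping the largest minimizing block) is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// z[i] = y[i] + ... + y[n-1], with z[n] = 0 (0-based analogue of z_i).
std::vector<double> suffix_sums(const std::vector<double>& y) {
    std::vector<double> z(y.size() + 1, 0.0);
    for (std::size_t i = y.size(); i-- > 0; ) z[i] = z[i + 1] + y[i];
    return z;
}

// P_ij = (z_i - z_{j+1}) / (j - i + 1): average of y[i..j] in O(1).
double block_average(const std::vector<double>& z, std::size_t i, std::size_t j) {
    assert(i <= j && j + 1 < z.size());
    return (z[i] - z[j + 1]) / static_cast<double>(j - i + 1);
}

// Serial minimum lower sets sketch: O(n^2) comparisons in the worst case.
std::vector<double> mls_serial(const std::vector<double>& y) {
    const std::size_t n = y.size();
    const std::vector<double> z = suffix_sums(y);
    std::vector<double> u(y);
    std::size_t i = 0;
    while (i < n) {
        std::size_t jbest = i;
        double pbest = y[i];
        for (std::size_t j = i + 1; j < n; ++j) {
            const double p = block_average(z, i, j);
            if (p <= pbest) { pbest = p; jbest = j; }  // <= keeps the largest block
        }
        if (y[i] > pbest) {                            // form a block and skip past it
            for (std::size_t k = i; k <= jbest; ++k) u[k] = pbest;
            i = jbest + 1;
        } else {
            ++i;                                       // keep y[i] unchanged
        }
    }
    return u;
}
```

On the input $(1,3,2,4)$ the suffix sums are $(10,9,6,4,0)$, the only block formed is $\{3,2\}$ with average $(9-4)/2=2.5$, and the output is $(1,2.5,2.5,4)$.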
  
  %Also we note that $y_{j^*}\leq P_{ij^*}$, which means we
  
  
  
@@ -349,10 +373,10 @@ As per MLS algorithm, for each fixed $i$ from 1 to $n$ we compute the smallest $
  %Finally, in PAVA the value $y_k, i \leq k\leq j$ can be replaced several times at different back-averaging steps, the latest being $P_{ij}$ corresponding to the smallest $i$ for which $y_i>P_{i+1,j}$. In our version we start with the smallest $i$, and replace $y_k$ only once, and then advance to position $j$, so they are not overwritten. Therefore, our algorithm performs the same replacements as PAVA but in a different order, and the outputs are the same. Below we show that the order does not really matter if we perform replacements with the max operation.
  
  
-Now the presented algorithm can be parallelised for GPUs: each datum $y_i$ is treated in its own thread. Calculation of the smallest $P_{ij}$ is performed serially within the $i$-th thread, or in parallel, by starting children threads.
-Replacing the values $y_i,\ldots,y_{j^*}$ with  $P_{ij^*}$ leads to potential clashes, as several threads can perform this operation on the same elements $y_k$. This can be circumvented by using  max operation, i.e., $y_k\leftarrow \max(y_k,P_{ij})$. Note that thread $i$ replaces the value $y_k$, $k\geq i$ if $P_{ij}<y_i$. Now, if two threads $i_1$ and $i_2$ need to replace $y_k$, and $i_1<i_2$, we must have $P_{i_1 j_1}\geq P_{i_2 j_2}$, as formalised in the following
+Now the presented algorithm can be parallelized for GPUs: each datum $y_i$ is treated in its own thread. Calculation of the smallest $P_{ij}$ is performed serially within the $i$th thread, or in parallel by starting child threads.
+Replacing the values $y_i,\ldots,y_{j^*}$ with  $P_{ij^*}$ leads to potential clashes, as several threads can perform this operation on the same elements $y_k$. This can be circumvented by using the max operation, i.e., $y_k\leftarrow \max(y_k,P_{ij})$. Note that thread $i$ replaces the value $y_k$, $k\geq i$ if $P_{ij}<y_i$. Now, if two threads $i_1$ and $i_2$ need to replace $y_k$, and $i_1<i_2$, we must have $P_{i_1 j_1}\geq P_{i_2 j_2}$, as formalized in the following proposition.
  
  
-\begin{proposition} If partial averages $P_{ij}$ are defined by (\ref{ch11:eq1}) and $i_1<i_2$, $j_1,j_2 \geq i_1,i_2$, where $j_1, j_2$ denote the minimisers of $P_{i_1 j}$ over $j\geq i_1$ (resp.$P_{i_2 j}$  over $j\geq i_2$ ), then $P_{i_1 j_1}\geq P_{i_2 j_2}$.
+\begin{proposition} If partial averages $P_{ij}$ are defined by (\ref{ch11:eq1}) and $i_1<i_2$, $j_1,j_2 \geq i_1,i_2$, where $j_1, j_2$ denote the minimizers of $P_{i_1 j}$ over $j\geq i_1$ (respectively $P_{i_2 j}$  over $j\geq i_2$ ), then $P_{i_1 j_1}\geq P_{i_2 j_2}$.
  \end{proposition}
  
  \begin{proof}
  \end{proposition}
  
  \begin{proof}
@@ -368,33 +392,25 @@ $P_{i_2 j_1} \geq P_{i_2 j_2}$, which implies $P_{i_1 j_1}\geq P_{i_2 j_2}$.
  %In the serial PAVA, step $i_1$ is executed after $i_2$, so $P_{i_1 j_1}$ overrides $P_{i_2 j_2}$, but this order is not preserved in parallel computations.
  
  The order in which the steps are performed is not guaranteed in parallel computations.
-By the proposition above, $P_{i_2 j_2}\leq P_{i_1 j_1}<y_{i_1}$ whenever the value $y_{i_1}$ needs replacement by the average of its block, which leads to overriding all $y_k, i_1 \leq k \leq j_1$ with  $P_{i_1 j_1}$, which is no smaller than $P_{i_2 j_2}$. Thus in the serial algorithm $y_k$ may only replaced with a larger value as the algorithm progresses. Therefore
-the max operation in the parallel algorithm ensures that $y_k$ is replaced with the same value as in the serial algorithm, regardless the order of the steps.
+By the proposition above, $P_{i_2 j_2}\leq P_{i_1 j_1}<y_{i_1}$ whenever the value $y_{i_1}$ needs replacement by the average of its block, which leads to overriding all $y_k, i_1 \leq k \leq j_1$ with  $P_{i_1 j_1}$, which is no smaller than $P_{i_2 j_2}$. Thus, in the serial algorithm $y_k$ may only be replaced with a larger value as the algorithm progresses. Therefore,
+the max operation in the parallel algorithm ensures that $y_k$ is replaced with the same value as in the serial algorithm, regardless of the order of the steps.
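The order independence can be illustrated with a small serial emulation in C++, in which each loop iteration plays the role of one thread and the iterations are executed in an arbitrary, caller-supplied order. This is only a sketch under stated assumptions: the output array is initialized to $-\infty$ (mirroring the fill with a large negative value used with the maximum operator), block averages are accumulated as running sums rather than from $z_i$, and the name \texttt{mls\_max\_merge} is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Emulates the per-datum "threads" of the parallel MLS in a given order;
// clashes on the same element are resolved with max.
std::vector<double> mls_max_merge(const std::vector<double>& y,
                                  const std::vector<std::size_t>& order) {
    const std::size_t n = y.size();
    assert(order.size() == n);
    std::vector<double> u(n, -std::numeric_limits<double>::infinity());
    for (std::size_t i : order) {
        double sum = y[i], best = y[i];
        std::size_t jbest = i;
        for (std::size_t j = i + 1; j < n; ++j) {
            sum += y[j];
            const double avg = sum / static_cast<double>(j - i + 1);
            if (avg < best) { best = avg; jbest = j; }
        }
        // If no average is below y[i], then jbest == i and best == y[i],
        // so the thread only writes its own value.
        for (std::size_t k = i; k <= jbest; ++k) u[k] = std::max(u[k], best);
    }
    return u;
}
```

For $y=(4,5,0)$, thread 0 proposes the value $3$ for all three elements and thread 1 proposes $2.5$ for the last two; taking the maximum yields $(3,3,3)$ whether the threads run in the order $0,1,2$ or $2,1,0$.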
  
  
- We present the source code of the parallel MLS in Figure \ref{ch11:algMLS}. Here we reduced the number of writes to the global memory by using an indexing array \texttt{keys\_d} to encode blocks, and subsequently performing \texttt{scan} operation with the maximum operator and indexed by \texttt{keys\_d}, so that maximum is taken within each block.
+We present the source code of the parallel MLS in Listing \ref{ch11:algMLS}. Here we reduce the number of writes to the global memory by using an indexing array \texttt{keys\_d} to encode blocks and subsequently performing a \texttt{scan} operation with the maximum operator, indexed by \texttt{keys\_d}, so that the maximum is taken within each block.
  
  
-As we mentioned, the complexity of the MLS algorithm is $O(n^2)$, due to the fact that for each datum, the smallest average $P_{ij}$ of the blocks of subsequent data is needed. Thus each thread needs to perform $O(n)$ comparisons (the averages themselves are precomputed in $O(n)$ operations using partial sum algorithm). It is interesting to compare the runtime of the PAVA algorithm on CPU and parallel MLS on GPU to establish up to which $n$ parallel MLS is preferable. We performed such experiments on Tesla 2050 device connected to a four-core Intel i7 CPU with 4 GB RAM clocked at 2.8 GHz, running Linux (Fedora 16).
+\lstinputlisting[label=ch11:algMLS,caption=fragments of implementation of a parallel version of the MLS algorithm using Thrust library.]{Chapters/chapter11/code4.cu}
  
  
-First we compared the serial versions of PAV and MLS algorithms. For this we used two packages in R environment, \texttt{stats} and \texttt{fdrtool}.  The package \texttt{stats} offers function \texttt{isoreg}, which implements the MLS algorithm in C language, whereas package \texttt{fdrtool} offers PAVA, also implemented in C. Overheads of R environment can be neglected, as the input data are simply passed to C code, so we can compare the running time of both algorithms head to head. We generated input data of varying length $n$ from $10^4$ to $ 5 \times 10^7$ randomly, using $y_i=f(x_i)+\varepsilon_i$, where $f$ is a monotone test function and $\varepsilon$ is random noise. We also tried completely ordered isotone data, and antitone data, to check the performance for adversary inputs. Subsequently, we measured the runtime of our parallel version of MLS algorithm on Tesla 2050 GPU. The results are  presented in Table \ref{ch11:table1}.
  
  
-As expected,  the runtimes of both methods differed significantly, as shown in Table \ref{ch11:table1}, and clearly linear PAVA was superior to serial MLS algorithm. Even though for some special cases, e.g., test function $f=const$ both serial methods gave the same running time, which can be explained by the fact that large blocks of data allowed MLS to skip the majority of tests. This did not happen in the parallel version of MLS, where for each datum the smallest value of $P_{ij^*}$ was computed (in parallel), so the average CPU times were the same for all data.
  
  
-From the results in Table \ref{ch11:table1} we conclude that serial PAVA is superior to MLS for $n>10^4$. While it is possible to transfer data from GPU to CPU and run PAVA there, it is warranted only for sufficiently large data $n\geq 5 \times 10^5$ , for otherwise the data transfer overheads will dominate CPU time. For smaller $n$, isotone regression is best performed on GPU.
  
  
-We also see that the use of GPU accelerated MLS by a factor of at least 100. The cost of serial MLS is prohibitive for  $n>10^6$. 
+As we mentioned, the complexity of the MLS algorithm is $O(n^2)$, due to the fact that for each datum, the smallest average $P_{ij}$ of the blocks of subsequent data is needed. Thus, each thread needs to perform $O(n)$ comparisons (the averages themselves are precomputed in $O(n)$ operations using the partial sum algorithm). It is interesting to compare the runtime of the PAVA algorithm on CPU and parallel MLS on GPU to establish for which $n$ parallel MLS is preferable. We performed such experiments on a Tesla 2050 device connected to a four-core Intel i7 CPU clocked at 2.8 GHz with 4 GB RAM, running Linux (Fedora 16).
  
  
-We should mention that not all isotone regression problems allow a PAV-like algorithm linear in time. When the data may contain large outliers, monotonizing the data is better done not in the least squares sense, but using other cost functionals, such as by minimizing the sum of absolute deviations \cite{Wang} or using M-estimators \cite{Yohai}, which are less sensitive to outliers. It is interesting than in all such cases the solution to isotone regression problem can be found by solving maximin problem
-$$
-u_i=\max_{k\leq i} \min_{l \geq i} \hat y(k,l), 
-$$
-with $\hat y(k,l)$ being the unrestricted maximum likelihood estimator of $y_k\ldots,y_l$. For quadratic cost function $\hat y(k,l)$ is the mean, as in PAV and MLS algorithms, for the absolute deviations it becomes the median, and for other cost functions an M-estimator of location. The MLS algorithm can be applied to such isotone regression problems with very little modification, while linear in time algorithm may not be available. Our parallel MLS algorithm will be valuable in such cases.
+First we compared the serial versions of the PAV and MLS algorithms. For this we used two packages in the R environment, \texttt{stats} and \texttt{fdrtool}.  The package \texttt{stats} offers the function \texttt{isoreg}, which implements the MLS algorithm in C, whereas the package \texttt{fdrtool} offers PAVA, also implemented in C. Overheads of the R environment can be neglected, as the input data are simply passed to C code, so we can compare the running time of both algorithms head to head. We generated input data of varying lengths $n$ from $10^4$ to $ 5 \times 10^7$ randomly, using $y_i=f(x_i)+\varepsilon_i$, where $f$ is a monotone test function and $\varepsilon$ is random noise. We also tried completely ordered isotone data, and antitone data, to check the performance for adversarial inputs. Subsequently, we measured the runtime of our parallel version of the MLS algorithm on a Tesla 2050 GPU. The results are presented in Table \ref{ch11:table1}.
  
  
-%\renewcommand{\baselinestretch}{1}
-\begin{table}[!h]
+%% %\renewcommand{\baselinestretch}{1}
+\begin{table}[htbp]
  \begin{center}
-\caption{The average CPU time (sec) of the serial PAVA, MLS and parallel MLS algorithms.  } \label{ch11:table1}
  \begin{tabular}{|r|r|r|r|}
-
+\hline
  Data  & PAVA & MLS & GPU MLS \\ \hline
  
  monotone increasing $f$ & & & \\
@@ -413,74 +429,96 @@ $n=10^6$ &0.2&0.1& 38\\
  $n=10 \times 10^6$ &1.9& 1.9& 3500 \\
  $n=20 \times 10^6$ &3.5& 4.0&-- \\
  $n=50 \times 10^6$ &11& 11& -- \\
-
+\hline
  \end{tabular}
  \end{center}
+\caption{The average CPU time (sec) of the serial PAVA, MLS, and parallel MLS algorithms.  }
+\label{ch11:table1}
  \end{table}
  \end{table}
-%\renewcommand{\baselinestretch}{2}
  
  
  
  
-\begin{figure}[!hp]
- \begin{alltt}
-\begin{center}
-\begin{minipage}{13cm}\small
-template<typename Tx>   
-__device__ Tx Aver(Tx z,int i,int j, Tx *z) \{return (z-z[j+1])/(j-i+1);\}
-
-template<typename Tx>
-__global__ void monotonizekernel(Tx *y, Tx *z, Tx *u, int *key, int n)  
-\{ int i = threadIdx.x + blockIdx.x * blockDim.x;
-   if(i<n) \{
-      int smallestJ = i;
-      Tx curP, smallestP, curz=z[i];
-      smallestP=Aver(curz,i,i,z);
-      for(int j = i+1; j < n; j++) \{
-          curP=Aver(curz,i,j,z);
-          if(smallestP>curP) \{
-               smallestJ = j;
-               smallestP = curP;
-          \}   
-      \}
-      curP=y[i];
-      if(curP > smallestP) t=smallestP;
-                      else smallestJ=i;
-      key[i]=smallestJ;
-      u[i]=t;
-   \}
-\}
-
-template< typename Tx >
-void MonotonizeData(Tx *y, int n, Tx *u) \{
-    thrust::less_equal<int> binary_pred;
-    thrust::maximum<Tx>     binary_op2;
-    thrust::device_vector<Tx> z_d(n+1);
-    thrust::device_vector<int> keys_d(n);      
-    thrust::device_ptr<Tx> y_d(y), u_d(u);
-    thrust::fill(u_d, u_d+n, -1e100);
-    thrust::fill(keys_d.begin(), keys_d.end(), 0);
-
-    thrust::reverse_iterator< typename thrust::device_vector<Tx>::iterator >
-            y_reverse_b(y_d+n), y_reverse_end(y_d), z_reverse_b(z_d.end());
+As expected, the runtimes of both methods differed significantly, as shown in Table \ref{ch11:table1}, and the linear PAVA was clearly superior to the serial MLS algorithm. For some special cases, however, e.g., the test function $f=\mathrm{const}$, both serial methods gave the same running time, which can be explained by the fact that large blocks of data allowed MLS to skip the majority of tests. This did not happen in the parallel version of MLS, where for each datum the smallest value of $P_{ij^*}$ was computed (in parallel), so the average CPU times were the same for all data.
+
+From the results in Table \ref{ch11:table1} we conclude that serial PAVA is superior to MLS for $n>10^4$. While it is possible to transfer data from GPU to CPU and run PAVA there, it is warranted only for sufficiently large data $n\geq 5 \times 10^5$, for otherwise the data transfer overheads will dominate CPU time. For smaller $n$, isotone regression is best performed on GPU.
+
+
+We also see that the use of GPU accelerated MLS by a factor of at least 100, except for antitone data. The cost of serial MLS is prohibitive for  $n>10^6$.
+
+We should mention that not all isotone regression problems allow a PAV-like algorithm linear in time. When the data may contain large outliers, monotonizing the data is better done not in the least squares sense, but using other cost functionals, such as by minimizing the sum of absolute deviations \cite{Wang} or using M-estimators \cite{Yohai}, which are less sensitive to outliers. It is interesting that in all such cases the solution to an isotone regression problem can be found by solving the maximin problem
+$$
+u_i=\max_{k\leq i} \min_{l \geq i} \hat y(k,l),
+$$
+with $\hat y(k,l)$ being the unrestricted maximum likelihood estimator of $y_k,\ldots,y_l$. For the quadratic cost function, $\hat y(k,l)$ corresponds to the mean of these data (as in the PAV and MLS algorithms); for the absolute deviations, $\hat y(k,l)$ corresponds to the median; and for other cost functions, it corresponds to an M-estimator of location. The MLS algorithm can be applied to such isotone regression problems with very little modification. However, we are unaware of other algorithms for solving the modified problem that are linear in time. Our parallel MLS algorithm will be valuable in such cases.
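
The maximin formula can also be evaluated directly, which makes the role of $\hat y(k,l)$ explicit. The serial C++ sketch below is only an illustration and is not the chapter's implementation: it is a brute-force evaluation that costs far more than PAVA or MLS. The names maximin_isotone, block_mean, and block_median are hypothetical, and the median of an even-sized block is assumed to be the midpoint of its two middle values.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <limits>
#include <numeric>
#include <vector>

// Location estimator \hat y(k,l) over the closed index range [k, l] of y.
// (Hypothetical interface, introduced only for this illustration.)
using Estimator =
    std::function<double(const std::vector<double>&, std::size_t, std::size_t)>;

// Quadratic cost: \hat y(k,l) is the mean of y[k..l].
double block_mean(const std::vector<double>& y, std::size_t k, std::size_t l) {
    double s = std::accumulate(y.begin() + k, y.begin() + l + 1, 0.0);
    return s / static_cast<double>(l - k + 1);
}

// Absolute deviations: \hat y(k,l) is the median of y[k..l].
// Assumption: for an even count, take the midpoint of the two middle values.
double block_median(const std::vector<double>& y, std::size_t k, std::size_t l) {
    std::vector<double> w(y.begin() + k, y.begin() + l + 1);
    std::sort(w.begin(), w.end());
    const std::size_t m = w.size();
    return (m % 2 == 1) ? w[m / 2] : 0.5 * (w[m / 2 - 1] + w[m / 2]);
}

// Brute-force evaluation of u_i = max_{k <= i} min_{l >= i} \hat y(k,l).
std::vector<double> maximin_isotone(const std::vector<double>& y,
                                    const Estimator& est) {
    const std::size_t n = y.size();
    std::vector<double> u(n);
    for (std::size_t i = 0; i < n; ++i) {
        double outer = -std::numeric_limits<double>::infinity();
        for (std::size_t k = 0; k <= i; ++k) {
            double inner = std::numeric_limits<double>::infinity();
            for (std::size_t l = i; l < n; ++l) {
                inner = std::min(inner, est(y, k, l));  // min over l >= i
            }
            outer = std::max(outer, inner);             // max over k <= i
        }
        u[i] = outer;
    }
    return u;
}
```

Calling maximin_isotone(y, block_mean) on the data $(1, 3, 2, 4)$ returns the least squares fit $(1, 2.5, 2.5, 4)$; passing block_median instead switches to the absolute-deviation variant without changing the outer loops.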
+
+
+%% %\renewcommand{\baselinestretch}{2}
+
+
+%% \begin{figure}[!hp]
+%%  \begin{alltt}
+%% \begin{center}
+%% \begin{minipage}{13cm}\small
+%% template<typename Tx>       
+%% __device__ Tx Aver(Tx zi,int i,int j, Tx *z) \{return (zi-z[j+1])/(j-i+1);\}
+
+%% template<typename Tx>
+%% __global__ void monotonizekernel(Tx *y, Tx *z, Tx *u, int *key, int n)
+%% \{ int i = threadIdx.x + blockIdx.x * blockDim.x;
+%%    if(i<n) \{
+%%       int smallestJ = i;
+%%       Tx curP, smallestP, curz=z[i];
+%%       smallestP=Aver(curz,i,i,z);
+%%       for(int j = i+1; j < n; j++) \{
+%%           curP=Aver(curz,i,j,z);
+%%           if(smallestP>curP) \{
+%%                smallestJ = j;
+%%                smallestP = curP;
+%%           \}        
+%%       \}
+%%       curP=y[i];
+%%       if(curP > smallestP) t=smallestP;
+%%                       else smallestJ=i;
+%%       key[i]=smallestJ;
+%%       u[i]=t;
+%%    \}
+%% \}
+
+%% template< typename Tx >
+%% void MonotonizeData(Tx *y, int n, Tx *u) \{
+%%     thrust::less_equal<int> binary_pred;
+%%     thrust::maximum<Tx>     binary_op2;
+%%     thrust::device_vector<Tx> z_d(n+1);
+%%     thrust::device_vector<int> keys_d(n);   
+%%     thrust::device_ptr<Tx> y_d(y), u_d(u);
+%%     thrust::fill(u_d, u_d+n, -1e100);
+%%     thrust::fill(keys_d.begin(), keys_d.end(), 0);
+
+%%     thrust::reverse_iterator< typename thrust::device_vector<Tx>::iterator >
+%%             y_reverse_b(y_d+n), y_reverse_end(y_d), z_reverse_b(z_d.end());
         
         
-    thrust::inclusive_scan(y_reverse_b, y_reverse_end, z_reverse_b+1);
+%%     thrust::inclusive_scan(y_reverse_b, y_reverse_end, z_reverse_b+1);
+
+%%     monotonizekernel<<<grid, block>>>(y, thrust::raw_pointer_cast(&z_d[0]),
+%%                                u, thrust::raw_pointer_cast(&keys_d[0]), n );
+
+%%     thrust::sort(keys_d.begin(), keys_d.end());
+%%     thrust::inclusive_scan_by_key(keys_d.begin(), keys_d.end(),
+%%                                   u_d, u_d, binary_pred, binary_op2);
+%% \}
+%% \end{minipage}
+%% \end{center}
+%% \end{alltt}
+%% \caption{Fragments of implementation of a parallel version of the MLS algorithm using Thrust library.}
+%% \label{ch11:algMLS}
+%% \end{figure}
  
  
-    monotonizekernel<<<grid, block>>>(y, thrust::raw_pointer_cast(&z_d[0]), 
-                               u, thrust::raw_pointer_cast(&keys_d[0]), n );
  
  
-    thrust::sort(keys_d.begin(), keys_d.end());
-    thrust::inclusive_scan_by_key(keys_d.begin(), keys_d.end(), 
-                                  u_d, u_d, binary_pred, binary_op2);
-\}
-\end{minipage}
-\end{center}
-\end{alltt}
-\caption{Fragments of implementation of a parallel version of the MLS algorithm using Thrust library.}
-\label{ch11:algMLS}
-\end{figure}
  
  \section{Conclusion} \label{ch11:conc}
  
  
-We presented three GPU-based parallel algorithms for approximating monotone data: monotone quadratic spline, monotone Hermite rational spline and minimum lower sets algorithm for monotonizing noisy data. These tools are valuable in a number of applications that involve large data sets modeled by monotone nonlinear functions.
-The source code of the package monospline is available from \texttt{www.deakin.edu.au/$\sim$ gleb/monospline.html }
+We presented three GPU-based parallel algorithms for approximating monotone data: monotone quadratic spline, monotone Hermite rational spline, and minimum lower sets algorithm for monotonizing noisy data. These tools are valuable in a number of applications that involve large data sets modeled by monotone nonlinear functions.
+The source code of the package monospline is available from \texttt{www.deakin.edu.au/$\sim$gleb/monospline.html}.
+
+
  
  \putbib[Chapters/chapter11/biblio11]
  