BookGPU/Chapters/chapter16/gpu.tex

   1 \section{New parallel envelope-following method}
   2 \label{sec:gmres}
   3
   4 In this section, we explain how to efficiently
   5 use matrix-free GMRES to solve
   6 the Newton update problems with implicit sensitivity calculation,
   7 i.e., the steps enclosed by the double dashed block
   8 in Fig.~\ref{fig:ef_flow}.
   9 Then implementation issues of GPU acceleration
  10 will be discussed in detail.
  11 Finally,  the Gear-2 integration is briefly introduced.
  12
  13 \subsection{GMRES solver for Newton update equation}
  14
  15 \underline{G}eneralized \underline{M}inimum \underline{Res}idual,
  16 or GMRES method is an iterative method for solving
  17 systems of linear equations ($A x=b$) with dense matrix $A$.
  18 The standard GMRES\index{GMRES} is given in Algorithm~\ref{alg:GMRES}.
  19 It constructs a Krylov subspace\index{Krylov subspace} with order $m$,
  20 \[ \mathcal{K}_m = \mathrm{span}( b, A^{} b, A^2 b,\ldots, A^{m-1} b ),\]
  21 where the approximate solution $x_m$ resides.
  22 In practice, an orthonormal basis $V_m$ that spans the
  23 subspace $\mathcal{K}_{m}$ can be generated by
  24 the Arnoldi iteration\index{Arnoldi iterations}.
  25 The goal of GMRES is to search for an optimal coefficient $y$
  26 such that the linear combination $x_m = V_m y$ will minimize
  27 its residual $\| b-Ax_m \|_2$.
  28 The Arnoldi iteration also creates a by-product,
  29 an upper Hessenberg matrix\index{Hessenberg matrix} $\tilde{H}_m \in R^{(m+1)\times m}$.
  30 Thus, the projection of $A$ on the orthonormal basis $V_m$ is described by
  31 the Arnoldi decomposition
  32 $
  33  AV_m = V_{m+1} \tilde{H}_{m},
  34 $
  35 which is useful to check the residual at each iteration
  36 without forming $x_m$,
  37 and to solve for coefficient $y$ when residual is smaller than
  38 a preset tolerance~\cite{Golub:Book'96}.
  39
  40 % Givens rotations are generated to triangularize
  41 % the Hessenberg matrix $\tilde{H}_m$ in order to solve the overdetermined system.
  42 % These same Givens rotations are also applied to the right-hand-side $h_{1,0}e_1$,
  43 % whose last element provides residual information on whether the tolerance
  44 % has been met in the current iteration.
  45
  46 %$\left\| h_{1,0}e_1-\tilde{H}_m y_m \right\|_2$.
  47
  48
  49 %% \begin{algorithm}
  50 %% \caption{Standard GMRES\index{GMRES} algorithm.} \label{alg:GMRES}
  51 %% \begin{algorithmic}[1]
  52 %%   \REQUIRE $ A \in \mathbb{R}^{N \times N}$, $b \in \mathbb{R}^N$,
  53 %%       and initial guess $x_0 \in \mathbb{R}^N$
  54 %%   \ENSURE $x \in \mathbb{R}^N$: $\| b - A x\|_2 < tol$
  55
  56 %%   \STATE $r = b - A x_0$
  57 %%   \STATE $h_{1,0}=\left \| r \right \|_2$
  58 %%   \STATE $m=0$
  59
  60 %%   \WHILE{$m < max\_iter$}
  61 %%     \STATE $m = m+1$
  62 %%     \STATE $v_{m} = r / h_{m,m-1}$
  63 %%     \STATE \label{line:mvp} $r = A v_m$
  64 %%     \FOR{$i = 1\ldots m$}
  65 %%       \STATE $h_{i,m} = \langle v_i, r \rangle$
  66 %%       \STATE $r = r - h_{i,m} v_i$
  67 %%     \ENDFOR
  68 %%     \STATE $h_{m+1,m} = \left\| r \right\|_2$\label{line:newnorm}
  69 %%     %\STATE Generate Givens rotations to triangularize $\tilde{H}_m$
  70 %%     %\STATE Apply Givens rotations on $h_{1,0}e_1$ to get residual $\epsilon$
  71 %%     \STATE Compute the residual $\epsilon$
  72 %%     \IF{$\epsilon < tol$}
  73 %%     \STATE Solve the problem: minimize $\|b-Ax_m\|_2$
  74 %%         \STATE Return $x_m = x_0 + V_m y_m$
  75 %%     \ENDIF
  76 %%   \ENDWHILE
  77 %% \end{algorithmic}
  78 %% \end{algorithm}
  79
  80 \begin{algorithm}
  81 \caption{Standard GMRES\index{GMRES} algorithm.} \label{alg:GMRES}
  82   \KwIn{ $ A \in \mathbb{R}^{N \times N}$, $b \in \mathbb{R}^N$,
  83       and initial guess $x_0 \in \mathbb{R}^N$}
  84   \KwOut{ $x \in \mathbb{R}^N$: $\| b - A x\|_2 < tol$}
  85
  86   $r = b - A x_0$\;
  87   $h_{1,0}=\left \| r \right \|_2$\;
  88   $m=0$\;
  89
  90   \While{$m < max\_iter$} {
  91     $m = m+1$;
  92     $v_{m} = r / h_{m,m-1}$\;
  93     \label{line:mvp} $r = A v_m$\;
  94     \For{$i = 1\ldots m$} {
  95       $h_{i,m} = \langle v_i, r \rangle$\;
  96       $r = r - h_{i,m} v_i$\;
  97     }
  98     $h_{m+1,m} = \left\| r \right\|_2$\label{line:newnorm} \;
  99     %\STATE Generate Givens rotations to triangularize $\tilde{H}_m$
 100     %\STATE Apply Givens rotations on $h_{1,0}e_1$ to get residual $\epsilon$
 101     Compute the residual $\epsilon$\;
 102     \If{$\epsilon < tol$} {
 103       Solve the problem: minimize $\|b-Ax_m\|_2$\;
 104       Return $x_m = x_0 + V_m y_m$\;
 105     }
 106   }
 107 \end{algorithm}
 108
 109
 110 At a first glance, the cost of using standard GMRES directly to
 111 solve the Newton update in Eq.~\eqref{eq:Newton}
 112 seems to come mainly from two parts: the
 113 formulation of the sensitivity matrix $J = \ud x_M/\ud x_0$
 114 in Eq.~\eqref{eq:sensM} in Section~\ref{sec:gear},
 115 and the iteration inside the standard GMRES,
 116 especially the matrix-vector multiplication and the orthonormal
 117 basis construction (Line~\ref{line:mvp} through
 118 Line~\ref{line:newnorm} in Algorithm~\ref{alg:GMRES}).
 119 Based on the observation that only the matrix-vector product is
 120 required in GMRES, the work in~\cite{Telichevesky:DAC'95} introduces
 121 an efficient matrix-free algorithm in the shooting-Newton method,
 122 where the equation solving part also involves with a sensitivity
 123 matrix.
 124 The matrix-free method does not take an explicit matrix as
 125 input, but directly passes the saved capacitance matrices $C_i$ and
 126 the LU factorizations of $J_i$, $i=0,\ldots,M$, into the Arnoldi
 127 iteration for Krylov subspace generation.
 128 Therefore, it avoids the
 129 cost of forming the dense sensitivity matrix and focuses on
 130 subspace construction.
 131 Briefly speaking, Line~\ref{line:mvp} will be replaced
 132 by a procedure without explict $A$, and we will talk
 133 about the flow of matrix-free generation of new basis vectors
 134 in later sections.
 135
 136
 137 \subsection{Parallelization on GPU platforms}
 138 \label{sec:gpu}
 139 % In this subsection, we describe how to parallelize
 140 % the proposed matrix-free GMRES solver with Gear-2 sensitivity
 141 % on GPU platforms.
 142 There exist many GPU-friendly computing operations
 143 in GMRES, such as the vector addition (\verb|axpy|),
 144 2-norm of vector (\verb|nrm2|), and
 145 sparse matrix-vector multiplication (\verb|csrmv|),
 146 which have been parallelized in the CUDA Basic Linear
 147 Algebra Subroutine (CUBLAS) Library and the CUDA Sparse Linear Algebra
 148 Library (CUSPARSE)~\cite{CUDASDK}.
 149
 150 GPU programming is typically limited by the data transfer bandwidth as
 151 GPU favors computationally intensive algorithms~\cite{Kirk:Book'10}.
 152 So how to efficiently transfer the data and wisely partition
 153 the data between CPU memory and GPU memory
 154 is crucial for GPU programming.
 155 In the following, we discuss these two issues in our implementation.
 156
 157 As noted in Section~\ref{sec:ef}, the envelope-following method
 158 requires the matrices gathered from all the time steps in a
 159 period in order to solve a Newton update.
 160 At each time step, SPICE\index{SPICE} has
 161 to linearize device models, stamp matrix elements
 162 into MNA (short for modified nodal analysis\index{modified nodal analysis, or MNA}) matrices,
 163 and solve circuit equations in its inner Newton iteration\index{Newton iteration}.
 164 When convergence is attained,
 165 circuit states are saved and then next time step begins.
 166 This is also the time when we store the needed matrices
 167 for the envelope-following computation.
 168 Since these data are involved in the calculation of Gear-2 sensitivity
 169 matrix in the generation of Krylov subspace vectors in
 170 Algorithm~\ref{alg:mf_Gear}, it is desirable that all of these
 171 matrices are transferred to GPU for its data parallel capability.
 172
 173 To efficiently transfer those large data, we explore asynchronous
 174 memory copy between host and device in the recent GPUs (Fermi architecture),
 175 so that the copying overlaps with the host's computing of
 176 the next time step's circuit state. The implementation of asynchronous matrices
 177 copy includes two parts: allocating page-locked memory, also known as
 178 pinned memory, where we save matrices for one time step, and using
 179 asynchronous memory transfer to copy these data to GPU memory.  While
 180 it is known that page-locked host memory is a scarce resource and
 181 should not be overused, the demand of memory size of the data
 182 generated at one time step can be generously accommodated by today's
 183 mainstream server configurations.
 184
 185 \begin{figure}[htbp]
 186   \centering
 187   \resizebox{.9\textwidth}{!}{\input{./Chapters/chapter16/figures/gmres_flow.pdf_t}}
 188   \caption{GPU parallel solver for envelope-following update.}
 189   \label{fig:gmres}
 190 \end{figure}
 191
 192
 193 %When SPICE simulation reaches the end of the period, all data needed by
 194 %Newton update equation are ready in GPU memory, and the matrix-free
 195 %GMRES can be applied directly on them to solve for the update vector
 196 %in \eqref{eq:Newton}.
 197
 198 % Knowing the dimensions or the sizes of the
 199 %matrices in a computation helps to boost GPU computing efficiency.
 200 The second issue is to decide the location of data between CPU and GPU memories.
 201 Therefore let us first make a rough sketch of the quantities in the
 202 GMRES Algorithm~\ref{alg:GMRES}.  Although GMRES tends to converge
 203 quickly for most circuit examples, i.e., the iteration number $m \ll
 204 N$, the space for storing all the subspace basis $V_m$ of $N$-by-$m$,
 205 i.e., $m$ column vectors with $N$-length, is still big.  In addition,
 206 every newly generated matrix-vector product needs to be orthogonalized
 207 with respect to all its previous basis vectors in
 208 the Arnoldi iterations\index{Arnoldi iterations}.
 209 Hence, keeping all the vectors of $V_m$ in GPU global
 210 memory allows GPU to handle those operations, such as inner-product of
 211 basis vectors (\verb|dot|) and vector subtraction (\verb|axpy|), in
 212 parallel.
 213
 214 On the other hand, it is better to keep the Hessenberg matrix
 215 $\tilde{H}$, where intermediate results of
 216 the orthogonalization are stored, at the host side.
 217 This comes with the following reasons.
 218 First, its size is $(m+1)$-by-$m$ at most, rather small
 219 if compared to circuit matrices and Krylov basis.
 220 Besides, it is also necessary to triangularize $\tilde{H}$
 221 and check the residual regularly in
 222 each iteration so the GMRES can return the approximate solution as
 223 soon as the residual is below a preset tolerance.
 224 Hence, in consideration of the serial nature of the trianularization,
 225 the small size of Hessenberg matrix,
 226 and the frequent inspection of values by host, it is
 227 preferable to allocate $\tilde{H}$ in CPU (host) memory.
 228 As shown in Fig.~\ref{fig:gmres}, the memory copy from device to host
 229 is called each time when Arnoldi iteration generates a new vector
 230 and the orthogonalization produces the vector $h$.
 231
 232 % \begin{figure}[!tb]
 233 % \centering
 234 % \resizebox{.45\textwidth}{!}{\input{./matlab/drawing.ps_tex}}
 235 % \caption{The flow of envelope-following method.}
 236 % \label{fig:gmres}
 237 % %\end{center}
 238 % \end{figure}
 239
 240 % talk about gear 2 method here.
 241 \input{Chapters/chapter16/bdf.tex}
 242