BookGPU/Chapters/chapter16/gpu.tex

   1 \section{New parallel envelope-following method}
   2 \label{sec:gmres}
   3
   4 In this section, we explain how to efficiently
   5 use matrix-free GMRES to solve
   6 the Newton update problems with implicit sensitivity calculation,
   7 i.e., the steps enclosed by the double dashed block
   8 in Fig.~\ref{fig:ef_flow}.
   9 Then implementation issues of GPU acceleration
  10 will be discussed in detail.
  11 Finally,  the Gear-2 integration is briefly introduced.
  12
  13 \subsection{GMRES solver for Newton update equation}
  14
  15 \underline{G}eneralized \underline{M}inimum \underline{Res}idual,
  16 or GMRES method is an iterative method for solving
  17 systems of linear equations ($A x=b$) with dense matrix $A$.
  18 The standard GMRES\index{GMRES} is given in Algorithm~\ref{alg:GMRES}.
  19 It constructs a Krylov subspace\index{Krylov subspace} with order $m$,
  20 \[ \mathcal{K}_m = \mathrm{span}( b, A^{} b, A^2 b,\ldots, A^{m-1} b ),\]
  21 where the approximate solution $x_m$ resides.
  22 In practice, an orthonormal basis $V_m$ that spans the
  23 subspace $\mathcal{K}_{m}$ can be generated by
  24 the Arnoldi iteration\index{Arnoldi iterations}.
  25 The goal of GMRES is to search for an optimal coefficient $y$
  26 such that the linear combination $x_m = V_m y$ will minimize
  27 its residual $\| b-Ax_m \|_2$.
  28 The Arnoldi iteration also creates a by-product,
  29 an upper Hessenberg matrix\index{Hessenberg matrix} $\tilde{H}_m \in R^{(m+1)\times m}$.
  30 Thus, the projection of $A$ on the orthonormal basis $V_m$ is described by
  31 the Arnoldi decomposition
  32 $
  33  AV_m = V_{m+1} \tilde{H}_{m},
  34 $
  35 which is useful to check the residual at each iteration
  36 without forming $x_m$,
  37 and to solve for coefficient $y$ when residual is smaller than
  38 a preset tolerance~\cite{Golub:Book'96}.
  39
  40 % Givens rotations are generated to triangularize
  41 % the Hessenberg matrix $\tilde{H}_m$ in order to solve the overdetermined system.
  42 % These same Givens rotations are also applied to the right-hand-side $h_{1,0}e_1$,
  43 % whose last element provides residual information on whether the tolerance
  44 % has been met in the current iteration.
  45
  46 %$\left\| h_{1,0}e_1-\tilde{H}_m y_m \right\|_2$.
  47
  48
  49 %% \begin{algorithm}
  50 %% \caption{Standard GMRES\index{GMRES} algorithm.} \label{alg:GMRES}
  51 %% \begin{algorithmic}[1]
  52 %%   \REQUIRE $ A \in \mathbb{R}^{N \times N}$, $b \in \mathbb{R}^N$,
  53 %%       and initial guess $x_0 \in \mathbb{R}^N$
  54 %%   \ENSURE $x \in \mathbb{R}^N$: $\| b - A x\|_2 < tol$
  55
  56 %%   \STATE $r = b - A x_0$
  57 %%   \STATE $h_{1,0}=\left \| r \right \|_2$
  58 %%   \STATE $m=0$
  59
  60 %%   \WHILE{$m < max\_iter$}
  61 %%     \STATE $m = m+1$
  62 %%     \STATE $v_{m} = r / h_{m,m-1}$
  63 %%     \STATE \label{line:mvp} $r = A v_m$
  64 %%     \FOR{$i = 1\ldots m$}
  65 %%       \STATE $h_{i,m} = \langle v_i, r \rangle$
  66 %%       \STATE $r = r - h_{i,m} v_i$
  67 %%     \ENDFOR
  68 %%     \STATE $h_{m+1,m} = \left\| r \right\|_2$\label{line:newnorm}
  69 %%     %\STATE Generate Givens rotations to triangularize $\tilde{H}_m$
  70 %%     %\STATE Apply Givens rotations on $h_{1,0}e_1$ to get residual $\epsilon$
  71 %%     \STATE Compute the residual $\epsilon$
  72 %%     \IF{$\epsilon < tol$}
  73 %%     \STATE Solve the problem: minimize $\|b-Ax_m\|_2$
  74 %%         \STATE Return $x_m = x_0 + V_m y_m$
  75 %%     \ENDIF
  76 %%   \ENDWHILE
  77 %% \end{algorithmic}
  78 %% \end{algorithm}
  79
  80 At a first glance, the cost of using standard GMRES directly to
  81 solve the Newton update in Eq.~\eqref{eq:Newton}
  82 seems to come mainly from two parts: the
  83 formulation of the sensitivity matrix $J = \ud x_M/\ud x_0$
  84 in Eq.~\eqref{eq:sensM} in Section~\ref{sec:gear},
  85 and the iteration inside the standard GMRES,
  86 especially the matrix-vector multiplication and the orthonormal
  87 basis construction (Line~\ref{line:mvp} through
  88 Line~\ref{line:newnorm} in Algorithm~\ref{alg:GMRES}).
  89 Based on the observation that only the matrix-vector product is
  90 required in GMRES, the work in~\cite{Telichevesky:DAC'95} introduces
  91 an efficient matrix-free algorithm in the shooting-Newton method,
  92 where the equation solving part also involves with a sensitivity
  93 matrix.
  94 The matrix-free method does not take an explicit matrix as
  95 input, but directly passes the saved capacitance matrices $C_i$ and
  96 the LU factorizations of $J_i$, $i=0,\ldots,M$, into the Arnoldi
  97 iteration for Krylov subspace generation.
  98 Therefore, it avoids the
  99 cost of forming the dense sensitivity matrix and focuses on
 100 subspace construction.
 101 Briefly speaking, Line~\ref{line:mvp} will be replaced
 102 by a procedure without explict $A$, and we will talk
 103 about the flow of matrix-free generation of new basis vectors
 104 in later sections.
 105
 106
 107 \subsection{Parallelization on GPU platforms}
 108 \label{sec:gpu}
 109 % In this subsection, we describe how to parallelize
 110 % the proposed matrix-free GMRES solver with Gear-2 sensitivity
 111 % on GPU platforms.
 112 There exist many GPU-friendly computing operations
 113 in GMRES, such as the vector addition (\verb|axpy|),
 114 2-norm of vector (\verb|nrm2|), and
 115 sparse matrix-vector multiplication (\verb|csrmv|),
 116 which have been parallelized in the CUDA Basic Linear
 117 Algebra Subroutine (CUBLAS) Library and the CUDA Sparse Linear Algebra
 118 Library (CUSPARSE)~\cite{CUDASDK}.
 119
 120 GPU programming is typically limited by the data transfer bandwidth as
 121 GPU favors computationally intensive algorithms~\cite{Kirk:Book'10}.
 122 So how to efficiently transfer the data and wisely partition
 123 the data between CPU memory and GPU memory
 124 is crucial for GPU programming.
 125 In the following, we discuss these two issues in our implementation.
 126
 127 As noted in Section~\ref{sec:ef}, the envelope-following method
 128 requires the matrices gathered from all the time steps in a
 129 period in order to solve a Newton update.
 130 At each time step, SPICE\index{SPICE} has
 131 to linearize device models, stamp matrix elements
 132 into MNA (short for modified nodal analysis\index{modified nodal analysis, or MNA}) matrices,
 133 and solve circuit equations in its inner Newton iteration\index{Newton iteration}.
 134 When convergence is attained,
 135 circuit states are saved and then next time step begins.
 136 This is also the time when we store the needed matrices
 137 for the envelope-following computation.
 138 Since these data are involved in the calculation of Gear-2 sensitivity
 139 matrix in the generation of Krylov subspace vectors in
 140 Algorithm~\ref{alg:mf_Gear}, it is desirable that all of these
 141 matrices are transferred to GPU for its data parallel capability.
 142
 143 To efficiently transfer those large data, we explore asynchronous
 144 memory copy between host and device in the recent GPUs (Fermi architecture),
 145 so that the copying overlaps with the host's computing of
 146 the next time step's circuit state. The implementation of asynchronous matrices
 147 copy includes two parts: allocating page-locked memory, also known as
 148 pinned memory, where we save matrices for one time step, and using
 149 asynchronous memory transfer to copy these data to GPU memory.  While
 150 it is known that page-locked host memory is a scarce resource and
 151 should not be overused, the demand of memory size of the data
 152 generated at one time step can be generously accommodated by today's
 153 mainstream server configurations.
 154
 155 \begin{figure}[htbp]
 156   \centering
 157   \resizebox{.9\textwidth}{!}{\input{./Chapters/chapter16/figures/gmres_flow.pdf_t}}
 158   \caption{GPU parallel solver for envelope-following update.}
 159   \label{fig:gmres}
 160 \end{figure}
 161
 162
 163 %When SPICE simulation reaches the end of the period, all data needed by
 164 %Newton update equation are ready in GPU memory, and the matrix-free
 165 %GMRES can be applied directly on them to solve for the update vector
 166 %in \eqref{eq:Newton}.
 167
 168 % Knowing the dimensions or the sizes of the
 169 %matrices in a computation helps to boost GPU computing efficiency.
 170 The second issue is to decide the location of data between CPU and GPU memories.
 171 Therefore let us first make a rough sketch of the quantities in the
 172 GMRES Algorithm~\ref{alg:GMRES}.  Although GMRES tends to converge
 173 quickly for most circuit examples, i.e., the iteration number $m \ll
 174 N$, the space for storing all the subspace basis $V_m$ of $N$-by-$m$,
 175 i.e., $m$ column vectors with $N$-length, is still big.  In addition,
 176 every newly generated matrix-vector product needs to be orthogonalized
 177 with respect to all its previous basis vectors in
 178 the Arnoldi iterations\index{Arnoldi iterations}.
 179 Hence, keeping all the vectors of $V_m$ in GPU global
 180 memory allows GPU to handle those operations, such as inner-product of
 181 basis vectors (\verb|dot|) and vector subtraction (\verb|axpy|), in
 182 parallel.
 183
 184 On the other hand, it is better to keep the Hessenberg matrix
 185 $\tilde{H}$, where intermediate results of
 186 the orthogonalization are stored, at the host side.
 187 This comes with the following reasons.
 188 First, its size is $(m+1)$-by-$m$ at most, rather small
 189 if compared to circuit matrices and Krylov basis.
 190 Besides, it is also necessary to triangularize $\tilde{H}$
 191 and check the residual regularly in
 192 each iteration so the GMRES can return the approximate solution as
 193 soon as the residual is below a preset tolerance.
 194 Hence, in consideration of the serial nature of the trianularization,
 195 the small size of Hessenberg matrix,
 196 and the frequent inspection of values by host, it is
 197 preferable to allocate $\tilde{H}$ in CPU (host) memory.
 198 As shown in Fig.~\ref{fig:gmres}, the memory copy from device to host
 199 is called each time when Arnoldi iteration generates a new vector
 200 and the orthogonalization produces the vector $h$.
 201
 202 % \begin{figure}[!tb]
 203 % \centering
 204 % \resizebox{.45\textwidth}{!}{\input{./matlab/drawing.ps_tex}}
 205 % \caption{The flow of envelope-following method.}
 206 % \label{fig:gmres}
 207 % %\end{center}
 208 % \end{figure}
 209
 210 % talk about gear 2 method here.
 211 \input{Chapters/chapter16/bdf.tex}
 212