1 \section{New parallel envelope-following method}
4 In this section, we explain how to efficiently
5 use matrix-free GMRES to solve
6 the Newton update problems with implicit sensitivity calculation,
7 i.e., the steps enclosed by the double dashed block
8 in Fig.~\ref{fig:ef_flow}.
9 Then implementation issues of GPU acceleration
10 will be discussed in detail.
11 Finally, the Gear-2 integration is briefly introduced.
13 \subsection{GMRES solver for Newton update equation}
15 \underline{G}eneralized \underline{M}inimum \underline{Res}idual,
16 or GMRES method is an iterative method for solving
17 systems of linear equations ($A x=b$) with dense matrix $A$.
18 The standard GMRES\index{GMRES} is given in Algorithm~\ref{alg:GMRES}.
19 It constructs a Krylov subspace\index{Krylov subspace} with order $m$,
20 \[ \mathcal{K}_m = \mathrm{span}( b, A^{} b, A^2 b,\ldots, A^{m-1} b ),\]
21 where the approximate solution $x_m$ resides.
22 In practice, an orthonormal basis $V_m$ that spans the
23 subspace $\mathcal{K}_{m}$ can be generated by
24 the Arnoldi iteration\index{Arnoldi iterations}.
25 The goal of GMRES is to search for an optimal coefficient $y$
26 such that the linear combination $x_m = V_m y$ will minimize
27 its residual $\| b-Ax_m \|_2$.
28 The Arnoldi iteration also creates a by-product,
29 an upper Hessenberg matrix\index{Hessenberg matrix} $\tilde{H}_m \in R^{(m+1)\times m}$.
30 Thus, the projection of $A$ on the orthonormal basis $V_m$ is described by
31 the Arnoldi decomposition
33 AV_m = V_{m+1} \tilde{H}_{m},
35 which is useful to check the residual at each iteration
36 without forming $x_m$,
37 and to solve for coefficient $y$ when residual is smaller than
38 a preset tolerance~\cite{Golub:Book'96}.
40 % Givens rotations are generated to triangularize
41 % the Hessenberg matrix $\tilde{H}_m$ in order to solve the overdetermined system.
42 % These same Givens rotations are also applied to the right-hand-side $h_{1,0}e_1$,
43 % whose last element provides residual information on whether the tolerance
44 % has been met in the current iteration.
46 %$\left\| h_{1,0}e_1-\tilde{H}_m y_m \right\|_2$.
50 %% \caption{Standard GMRES\index{GMRES} algorithm.} \label{alg:GMRES}
51 %% \begin{algorithmic}[1]
52 %% \REQUIRE $ A \in \mathbb{R}^{N \times N}$, $b \in \mathbb{R}^N$,
53 %% and initial guess $x_0 \in \mathbb{R}^N$
54 %% \ENSURE $x \in \mathbb{R}^N$: $\| b - A x\|_2 < tol$
56 %% \STATE $r = b - A x_0$
57 %% \STATE $h_{1,0}=\left \| r \right \|_2$
60 %% \WHILE{$m < max\_iter$}
62 %% \STATE $v_{m} = r / h_{m,m-1}$
63 %% \STATE \label{line:mvp} $r = A v_m$
64 %% \FOR{$i = 1\ldots m$}
65 %% \STATE $h_{i,m} = \langle v_i, r \rangle$
66 %% \STATE $r = r - h_{i,m} v_i$
68 %% \STATE $h_{m+1,m} = \left\| r \right\|_2$\label{line:newnorm}
69 %% %\STATE Generate Givens rotations to triangularize $\tilde{H}_m$
70 %% %\STATE Apply Givens rotations on $h_{1,0}e_1$ to get residual $\epsilon$
71 %% \STATE Compute the residual $\epsilon$
72 %% \IF{$\epsilon < tol$}
73 %% \STATE Solve the problem: minimize $\|b-Ax_m\|_2$
74 %% \STATE Return $x_m = x_0 + V_m y_m$
80 At a first glance, the cost of using standard GMRES directly to
81 solve the Newton update in Eq.~\eqref{eq:Newton}
82 seems to come mainly from two parts: the
83 formulation of the sensitivity matrix $J = \ud x_M/\ud x_0$
84 in Eq.~\eqref{eq:sensM} in Section~\ref{sec:gear},
85 and the iteration inside the standard GMRES,
86 especially the matrix-vector multiplication and the orthonormal
87 basis construction (Line~\ref{line:mvp} through
88 Line~\ref{line:newnorm} in Algorithm~\ref{alg:GMRES}).
89 Based on the observation that only the matrix-vector product is
90 required in GMRES, the work in~\cite{Telichevesky:DAC'95} introduces
91 an efficient matrix-free algorithm in the shooting-Newton method,
92 where the equation solving part also involves with a sensitivity
94 The matrix-free method does not take an explicit matrix as
95 input, but directly passes the saved capacitance matrices $C_i$ and
96 the LU factorizations of $J_i$, $i=0,\ldots,M$, into the Arnoldi
97 iteration for Krylov subspace generation.
98 Therefore, it avoids the
99 cost of forming the dense sensitivity matrix and focuses on
100 subspace construction.
101 Briefly speaking, Line~\ref{line:mvp} will be replaced
102 by a procedure without explict $A$, and we will talk
103 about the flow of matrix-free generation of new basis vectors
107 \subsection{Parallelization on GPU platforms}
109 % In this subsection, we describe how to parallelize
110 % the proposed matrix-free GMRES solver with Gear-2 sensitivity
112 There exist many GPU-friendly computing operations
113 in GMRES, such as the vector addition (\verb|axpy|),
114 2-norm of vector (\verb|nrm2|), and
115 sparse matrix-vector multiplication (\verb|csrmv|),
116 which have been parallelized in the CUDA Basic Linear
117 Algebra Subroutine (CUBLAS) Library and the CUDA Sparse Linear Algebra
118 Library (CUSPARSE)~\cite{CUDASDK}.
120 GPU programming is typically limited by the data transfer bandwidth as
121 GPU favors computationally intensive algorithms~\cite{Kirk:Book'10}.
122 So how to efficiently transfer the data and wisely partition
123 the data between CPU memory and GPU memory
124 is crucial for GPU programming.
125 In the following, we discuss these two issues in our implementation.
127 As noted in Section~\ref{sec:ef}, the envelope-following method
128 requires the matrices gathered from all the time steps in a
129 period in order to solve a Newton update.
130 At each time step, SPICE\index{SPICE} has
131 to linearize device models, stamp matrix elements
132 into MNA (short for modified nodal analysis\index{modified nodal analysis, or MNA}) matrices,
133 and solve circuit equations in its inner Newton iteration\index{Newton iteration}.
134 When convergence is attained,
135 circuit states are saved and then next time step begins.
136 This is also the time when we store the needed matrices
137 for the envelope-following computation.
138 Since these data are involved in the calculation of Gear-2 sensitivity
139 matrix in the generation of Krylov subspace vectors in
140 Algorithm~\ref{alg:mf_Gear}, it is desirable that all of these
141 matrices are transferred to GPU for its data parallel capability.
143 To efficiently transfer those large data, we explore asynchronous
144 memory copy between host and device in the recent GPUs (Fermi architecture),
145 so that the copying overlaps with the host's computing of
146 the next time step's circuit state. The implementation of asynchronous matrices
147 copy includes two parts: allocating page-locked memory, also known as
148 pinned memory, where we save matrices for one time step, and using
149 asynchronous memory transfer to copy these data to GPU memory. While
150 it is known that page-locked host memory is a scarce resource and
151 should not be overused, the demand of memory size of the data
152 generated at one time step can be generously accommodated by today's
153 mainstream server configurations.
157 \resizebox{.9\textwidth}{!}{\input{./Chapters/chapter16/figures/gmres_flow.pdf_t}}
158 \caption{GPU parallel solver for envelope-following update.}
163 %When SPICE simulation reaches the end of the period, all data needed by
164 %Newton update equation are ready in GPU memory, and the matrix-free
165 %GMRES can be applied directly on them to solve for the update vector
166 %in \eqref{eq:Newton}.
168 % Knowing the dimensions or the sizes of the
169 %matrices in a computation helps to boost GPU computing efficiency.
170 The second issue is to decide the location of data between CPU and GPU memories.
171 Therefore let us first make a rough sketch of the quantities in the
172 GMRES Algorithm~\ref{alg:GMRES}. Although GMRES tends to converge
173 quickly for most circuit examples, i.e., the iteration number $m \ll
174 N$, the space for storing all the subspace basis $V_m$ of $N$-by-$m$,
175 i.e., $m$ column vectors with $N$-length, is still big. In addition,
176 every newly generated matrix-vector product needs to be orthogonalized
177 with respect to all its previous basis vectors in
178 the Arnoldi iterations\index{Arnoldi iterations}.
179 Hence, keeping all the vectors of $V_m$ in GPU global
180 memory allows GPU to handle those operations, such as inner-product of
181 basis vectors (\verb|dot|) and vector subtraction (\verb|axpy|), in
184 On the other hand, it is better to keep the Hessenberg matrix
185 $\tilde{H}$, where intermediate results of
186 the orthogonalization are stored, at the host side.
187 This comes with the following reasons.
188 First, its size is $(m+1)$-by-$m$ at most, rather small
189 if compared to circuit matrices and Krylov basis.
190 Besides, it is also necessary to triangularize $\tilde{H}$
191 and check the residual regularly in
192 each iteration so the GMRES can return the approximate solution as
193 soon as the residual is below a preset tolerance.
194 Hence, in consideration of the serial nature of the trianularization,
195 the small size of Hessenberg matrix,
196 and the frequent inspection of values by host, it is
197 preferable to allocate $\tilde{H}$ in CPU (host) memory.
198 As shown in Fig.~\ref{fig:gmres}, the memory copy from device to host
199 is called each time when Arnoldi iteration generates a new vector
200 and the orthogonalization produces the vector $h$.
202 % \begin{figure}[!tb]
204 % \resizebox{.45\textwidth}{!}{\input{./matlab/drawing.ps_tex}}
205 % \caption{The flow of envelope-following method.}
210 % talk about gear 2 method here.
211 \input{Chapters/chapter16/bdf.tex}