last version

[book_gpu.git] / BookGPU / Chapters / chapter16 / gpu.tex
diff --git a/BookGPU/Chapters/chapter16/gpu.tex b/BookGPU/Chapters/chapter16/gpu.tex

index bcfc686529272c248c47bcd49b7f609553c839eb..4d4d6ef365353105617ca6b9bde55f864738f997 100644 (file)
--- a/BookGPU/Chapters/chapter16/gpu.tex
+++ b/BookGPU/Chapters/chapter16/gpu.tex
@@ -5,7 +5,7 @@ In this section, we explain how to efficiently
  use matrix-free GMRES to solve
  the Newton update problems with implicit sensitivity calculation,
  i.e., the steps enclosed by the double dashed block
-in Fig.~\ref{fig:ef_flow}.
+in Figure~\ref{fig:ef_flow}.
  Then implementation issues of GPU acceleration
  will be discussed in detail. 
  Finally,  the Gear-2 integration is briefly introduced.
@@ -15,13 +15,13 @@ Finally,  the Gear-2 integration is briefly introduced.
  \underline{G}eneralized \underline{M}inimum \underline{Res}idual,
  or GMRES method is an iterative method for solving
  systems of linear equations ($A x=b$) with dense matrix $A$.
-The standard GMRES\index{GMRES} is given in Algorithm~\ref{alg:GMRES}.
-It constructs a Krylov subspace\index{Krylov subspace} with order $m$,
+The standard GMRES\index{iterative method!GMRES} is given in Algorithm~\ref{alg:GMRES}.
+It constructs a Krylov subspace\index{iterative method!Krylov subspace} with order $m$,
  \[ \mathcal{K}_m = \mathrm{span}( b, A^{} b, A^2 b,\ldots, A^{m-1} b ),\]
  where the approximate solution $x_m$ resides.
  In practice, an orthonormal basis $V_m$ that spans the
  subspace $\mathcal{K}_{m}$ can be generated by
-the Arnoldi iteration\index{Arnoldi iterations}.
+the Arnoldi iterations\index{iterative method!Arnoldi iterations}.
  The goal of GMRES is to search for an optimal coefficient $y$
  such that the linear combination $x_m = V_m y$ will minimize
  its residual $\| b-Ax_m \|_2$.
@@ -77,6 +77,36 @@ a preset tolerance~\cite{Golub:Book'96}.
  %% \end{algorithmic}
  %% \end{algorithm}
  
+\begin{algorithm}
+\caption{standard GMRES\index{iterative method!GMRES} algorithm} \label{alg:GMRES}
+  \KwIn{ $ A \in \mathbb{R}^{N \times N}$, $b \in \mathbb{R}^N$,
+      and initial guess $x_0 \in \mathbb{R}^N$}
+  \KwOut{ $x \in \mathbb{R}^N$: $\| b - A x\|_2 < tol$}
+
+  $r = b - A x_0$\;
+  $h_{1,0}=\left \| r \right \|_2$\;
+  $m=0$\;
+
+  \While{$m < max\_iter$} {
+    $m = m+1$;
+    $v_{m} = r / h_{m,m-1}$\;
+    \label{line:mvp} $r = A v_m$\;
+    \For{$i = 1\ldots m$} {
+      $h_{i,m} = \langle v_i, r \rangle$\;
+      $r = r - h_{i,m} v_i$\;
+    }
+    $h_{m+1,m} = \left\| r \right\|_2$\label{line:newnorm} \;
+    %\STATE Generate Givens rotations to triangularize $\tilde{H}_m$
+    %\STATE Apply Givens rotations on $h_{1,0}e_1$ to get residual $\epsilon$
+    Compute the residual $\epsilon$\;
+    \If{$\epsilon < tol$} {
+      Solve the problem: minimize $\|b-Ax_m\|_2$\;
+      Return $x_m = x_0 + V_m y_m$\;
+    }
+  }
+\end{algorithm}
+
+
  At a first glance, the cost of using standard GMRES directly to
  solve the Newton update in Eq.~\eqref{eq:Newton}
  seems to come mainly from two parts: the
@@ -130,7 +160,7 @@ period in order to solve a Newton update.
  At each time step, SPICE\index{SPICE} has
  to linearize device models, stamp matrix elements
  into MNA (short for modified nodal analysis\index{modified nodal analysis, or MNA}) matrices,
-and solve circuit equations in its inner Newton iteration\index{Newton iteration}.
+and solve circuit equations in its inner Newton iteration\index{iterative method!Newton iteration}.
  When convergence is attained,
  circuit states are saved and then next time step begins.
  This is also the time when we store the needed matrices
@@ -195,7 +225,7 @@ Hence, in consideration of the serial nature of the trianularization,
  the small size of Hessenberg matrix,
  and the frequent inspection of values by host, it is
  preferable to allocate $\tilde{H}$ in CPU (host) memory.
-As shown in Fig.~\ref{fig:gmres}, the memory copy from device to host
+As shown in Figure~\ref{fig:gmres}, the memory copy from device to host
  is called each time when Arnoldi iteration generates a new vector
  and the orthogonalization produces the vector $h$.