From 17e9a5fd8b2b7e56319c80f3a67767cbbedd00d5 Mon Sep 17 00:00:00 2001
From: Raphael Couturier <raphael.couturier@univ-fcomte.fr>
Date: Fri, 28 Oct 2011 17:17:01 +0200
Subject: [PATCH] new

---
 dmems12.tex | 150 ++++++++++++++++++++++++++++------------------------
 1 file changed, 80 insertions(+), 70 deletions(-)

diff --git a/dmems12.tex b/dmems12.tex
index 0c00ee0..ef49def 100644
--- a/dmems12.tex
+++ b/dmems12.tex
@@ -92,24 +92,30 @@ In~\cite{CantiCapacitive03}, authors have presented a cantilever mechanism
 based on capacitive sensing. These techniques require cantilever
 instrumentation resulting in\ complex fabrication processes.
 
-In this paper  our attention is focused on a method  based on interferometry for
-cantilever  displacement  measurement in  quasi-static  regime. Cantilevers  are
-illuminated  by an  optical  source.  Interferometry  produces fringes  enabling
-cantilever displacement computation. A high  speed camera is used to analyze the
-fringes. In view of real time  applications, images need to be processed quickly
-and then a  fast estimation method is required to  determine the displacement of
-each  cantilever. In~\cite{AFMCSEM11},  an algorithm  based on  spline  has been
-introduced  for  cantilever  position  estimation.  The  overall  process  gives
-accurate results  but computations  are performed on  a standard  computer using
-LabView      \textsuperscript{\textregistered}     \textsuperscript{\copyright}.
-Consequently, the main drawback of this implementation is that the computer is a
-bottleneck. In this  paper we pose the problem  of real-time cantilever position
-estimation and  bring a  hardware/software solution. It  includes a  fast method
+In this paper our attention is focused on a method based on
+interferometry for cantilever displacement measurement in quasi-static
+regime. Cantilevers are illuminated by an optical source.
+Interferometry produces fringes enabling cantilever displacement
+computation. A high speed camera is used to analyze the fringes. In
+view of real time applications, images need to be processed quickly
+and then a fast estimation method is required to determine the
+displacement of each cantilever. In~\cite{AFMCSEM11}, an algorithm
+based on spline has been introduced for cantilever position
+estimation.  The overall process gives accurate results but
+computations are performed on a standard computer using LabView
+\textsuperscript{\textregistered} .  Consequently, the main drawback
+of this implementation is that the computer is a bottleneck. In this
+paper we pose the problem of real-time cantilever position estimation
+and bring a hardware/software solution. It includes a fast method
 based on least squares and its FPGA implementation.
 
-The remainder of the paper is organized as follows. Section~\ref{sec:measure}
-describes the measurement process. Our solution based on the least square
-method and its implementation on a FPGA is presented in Section~\ref{sec:solus}. Numerical experimentations are described in Section~\ref{sec:results}. Finally a conclusion and some perspectives are drawn.
+The remainder of the paper is organized as
+follows. Section~\ref{sec:measure} describes the measurement
+process. Our solution based on the least square method and its
+implementation on a FPGA is presented in
+Section~\ref{sec:solus}. Numerical experimentations are described in
+Section~\ref{sec:results}. Finally a conclusion and some perspectives
+are drawn.
 
 \section{Architecture and goals}
 
@@ -123,7 +129,7 @@ interferometry.
 
 \label{sec:archi}
 
-In opposition to other optical based system\textbf{s u}sing a laser beam
+In opposition to other optical based systems using a laser beam
 deflection scheme and sensitive to the angular displacement of the
 cantilever, interferometry is sensitive to the optical path difference
 induced by the vertical displacement of the cantilever.
@@ -197,7 +203,7 @@ with respect to the light source.
 The method consists in two main sequences.  In the first one
 corresponding to precomputation, the frequency $f$ of each profile is
 determined using a spline interpolation (see section \ref%
-{sec:algo-spline}) and the coefficient used for phase unwrapping is
+{sec:algo-spline}) and the coefficients used for phase unwrapping are
 computed. The second one, that we call the \textit{acquisition loop,}
 is done after images have been taken at regular time steps. For each
 image, the phase $\theta $ of all profiles is computed to obtain,
@@ -205,7 +211,8 @@ after unwrapping, the cantilever deflection. The phase determination
 in \cite{AFMCSEM11} is achieved by a spline based algorithm which is
 the most consuming part of the computation. In this article, we
 propose an alternate version based on the least square method which is
-faster and better suited for FPGA implementation.
+faster and better suited for FPGA implementation. Moreover, it can be
+used in real-time, i.e. after each image picked by the camera.
 
 \subsection{Computation design goals}
 
@@ -227,9 +234,9 @@ between the camera and the computation unit, the time to deserialize
 pixels and to store them in memory, the phase computation is the
 bottleneck of the whole process. For example, the camera in the setup
 of \cite{AFMCSEM11} provides $%
-1024\times 1204$ pixels with an exposition time of 2.5ms. Thus, if we
-the pixel extraction time is neglected, each phase calculation of a
-100-cantilever array should take no more than 12.5$\mu$s. 
+1024\times 1204$ pixels with an exposition time of 2.5ms. Thus, if the
+pixel extraction time is neglected, each phase calculation of a
+100-cantilever array should take no more than 12.5$\mu$s.
 
 In fact, this timing is a very hard constraint. To illustrate this point, we
 consider a very small program that initializes twenty million of doubles in
@@ -241,16 +248,17 @@ computations can drastically increase these performances: peak efficiency is
 about 2.5Gflops for the considered CPU. But this is not the case for phase
 computation that is using only a few tenth of values.
 
-In order to evaluate the original algorithm, we translated it in C language.
-As stated before, for 20 pixels, it does about 1,550 operations, thus an
-estimated execution time of $1,550/155=$10$\mu $s. For a more realistic
-evaluation, we constructed a file of 1Mo containing 200 profiles of 20
-pixels, equally scattered. This file is equivalent to an image stored in a
-device file representing the camera. We obtained an average of 10.5$\mu$s
-by profile (including I/O accesses). It is under our requirements but close
-to the limit. In case of an occasional load of the system, it could be
-largely overtaken. Solutions would be to use a real-time operating system or
-to search for a more efficient algorithm.
+In order to evaluate the original algorithm, we translated it in C
+language.  As stated in section \ref{sec:algo-comp}, for 20 pixels, it
+does about 1,550 operations, thus an estimated execution time of
+$1,550/155=$10$\mu $s. For a more realistic evaluation, we constructed
+a file of 1Mo containing 200 profiles of 20 pixels, equally
+scattered. This file is equivalent to an image stored in a device file
+representing the camera. We obtained an average of 10.5$\mu$s by
+profile (including I/O accesses). It is under our requirements but
+close to the limit. In case of an occasional load of the system, it
+could be largely overtaken. Solutions would be to use a real-time
+operating system or to search for a more efficient algorithm.
 
 However, the main drawback is the latency of such a solution because each
 profile must be treated one after another and the deflection of 100
@@ -336,7 +344,7 @@ data streams in parallel.
 The architecture we use is designed by the Armadeus Systems
 company. It consists in a development board called APF27 \textsuperscript{\textregistered}, hosting a
 i.MX27 ARM processor (from Freescale) and a Spartan3A (from
-XIlinx). This board includes all classical connectors as USB and
+Xilinx). This board includes all classical connectors as USB and
 Ethernet for instance. A Flash memory contains a Linux kernel that can
 be launched after booting the board via u-Boot. The processor is
 directly connected to the Spartan3A via its special interface called
@@ -367,18 +375,19 @@ we focus on its computation.
 
 We denote by $M$ the number of pixels in a segment used for phase
 computation. For the sake of simplicity of the notations, we consider
-the light intensity $I$ to be a mapping of the physical segment in the
-interval $[0,M[$. The pixels are assumed to be regularly spaced and
-centered at the positions $x^{p}\in\{0,1,\ldots,M-1\}.$ We use the simplest
-definition of a pixel, namely the value of $I$ at its center. The
-pixel intensities are considered as pre-normalized so that their
-minimum and maximum have been resized to $-1$ and $1$. 
+the light intensity I a function on the interval [0,M] which is the
+range of a one-to-one mapping defined on the physical segment. The
+pixels are assumed to be regularly spaced and centered at the
+positions $x^{p}\in\{0,1,\ldots,M-1\}.$ We use the simplest definition
+of a pixel, namely the value of $I$ at its center. The pixel
+intensities are considered as pre-normalized so that their minimum and
+maximum have been resized to $-1$ and $1$.
 
 The first step consists in computing the cubic spline interpolation of
 the intensities. This allows for interpolating $I$ at a larger number
 $L=k\times M$ of points (typically $k=4$ is sufficient) $%
 x^{s}$ in the interval $[0,M[$. During the precomputation sequence,
-the second step is to determin the afine part $a.x+b$ of $I$. It is
+the second step is to determin the affine part $a.x+b$ of $I$. It is
 found with an ordinary least square method, taking account the $L$
 points. Values of $I$ in $x^s$ are used to compute its intersections
 with $a.x+b$. The period of $I$ (and thus its frequency) is deduced
@@ -409,10 +418,10 @@ yields higher errors.
 Assuming that we compute the phase during the acquisition loop, equation \ref%
 {equ:profile} has only 4 parameters: $a,b,A$, and $\theta $, $f$ and $x$
 being already known. Since $I$ is non-linear, a least square method based on
-a Gauss-newton algorithm can be used to determine these four parameters.
+a Gauss-Newton algorithm can be used to determine these four parameters.
 This kind of iterative process ends with a convergence criterion, so it is
 not suited to our design goals. Fortunately, it is quite simple to reduce
-the number of parameters to only $\theta $. Firstly, the afine part $ax+b$
+the number of parameters to only $\theta $. Firstly, the affine part $ax+b$
 is estimated from the $M$ values $I(x^{p})$ to determine the rectified
 intensities,%
 \begin{equation*}
@@ -474,7 +483,7 @@ log_{2}(nb_{s}).$
 \end{itemize}
 
 The overall method is synthetized in an algorithm (called LSQ in the
-following) divided into the precomputing part and the acquisition loop:
+following) divided into the precomputing part and the acquisition loop.
 
 \begin{algorithm}[htbp]
 \caption{LSQ algorithm - before acquisition loop.}
@@ -512,7 +521,7 @@ following) divided into the precomputing part and the acquisition loop:
      $xy_{covar} \leftarrow xy_{covar} + (i-\bar{x}) \times (I[i]-\bar{y})$\\
    }
    $slope \leftarrow \frac{xy_{covar}}{x_{var}}$\\
-   $start \leftarrow y_{moy} - slope\times \bar{x}$\\
+   $start \leftarrow \bar{y} - slope\times \bar{x}$\\
    \For{$i=0$ to $M-1$}{
      $I[i] \leftarrow I[i] - start - slope\times i$\\
    }
@@ -561,7 +570,7 @@ following) divided into the precomputing part and the acquisition loop:
 \end{algorithm}
 
 \subsubsection{Algorithm comparison}
-
+\label{sec:algo-comp}
 We compared the two algorithms on the base of three criteria:
 
 \begin{itemize}
@@ -575,11 +584,11 @@ We compared the two algorithms on the base of three criteria:
 For the first item, we produced a matlab version of each algorithm,
 running in double precision. The profile was generated for about
 34,000 different quadruplets of periods ($\in \lbrack 3.1,6.1]$, step
-= 0.1), phases ($\in \lbrack -3.1,3.1]$, steps = 0.062) and slope
+= 0.1), phases ($\in \lbrack -3.1,3.1]$, steps = 0.062) and slopes
 ($\in \lbrack -2,2]$, step = 0.4). Obviously, the discretization of
 $[-\pi ,\pi ]$ introduces an error in the phase estimation. It is at
 most equal to $\frac{\pi}{nb_s}$. From some experiments on a $17\times
-4$ array, authors of \cite{AFMCSEM11} noticed a average ratio of 50
+4$ array, authors of \cite{AFMCSEM11} noticed an average ratio of 50
 between phase variation in radians and lever end position in
 nanometers. Assuming such a ratio and $nb_s = 1024$, the maximum lever
 deflection error would be 0.15nm which is smaller than 0.3nm, the best
@@ -632,8 +641,8 @@ distorted, with pikes and straight/rounded portions. In fact, it is
 very close to some of the worst experimental profiles. Figure
 \ref{fig:noise60} shows a sample of worst profile for $N=30$. It is
 completely distorted, largely beyond any experimental ones. Obviously,
-these comparisons are a bit subjectives and experimental profiles
-could also be completly distorted on other experiments. Nevertheless,
+these comparisons are a bit subjective and experimental profiles
+could also be more distorted on other experiments. Nevertheless,
 they give an idea about the possible error.
 
 \begin{figure}[ht]
@@ -654,7 +663,7 @@ they give an idea about the possible error.
 
 The second criterion is relatively easy to estimate for LSQ and harder for
 SPL because of the use of the arctangent function. In both cases, the number
-of operation is proportional to $M$ the numbers of pixels. For LSQ, it also
+of operation is proportional to $M$ the number of pixels. For LSQ, it also
 depends on $nb_{s}$ and for SPL on $L=k\times M$ the number of interpolated
 points. We assume that $M=20$, $nb_{s}=1024$ and $k=4$, that all possible
 parts are already in lookup tables and that a limited set of operations (+,
@@ -662,7 +671,7 @@ parts are already in lookup tables and that a limited set of operations (+,
 code, we obtain about 430 operations for LSQ and 1,550 (plus a few tenth for 
 $atan$) for SPL. This result is largely in favor of LSQ. Nevertheless,
 considering the total number of operations is not fully relevant for FPGA
-implementation which time and space consumption depends not only on the type
+implementation for which time and space consumption depends not only on the type
 of operations but also of their ordering. The final evaluation is thus very
 much driven by the third criterion.
 
@@ -711,7 +720,7 @@ division in the computation of $a$ the slope coefficient, we also
 scaled the pixel intensities by another power of two. The main problem
 was to determin these factors. Most of the time, they are chosen to
 minimize the error induced by the quantization. But in our case, we
-also have some hardware constraints, for example the size and depth of
+also have some hardware constraints, for example the width and depth of
 RAMs or the input size of DSPs. Thus, having a maximum of values that
 fit in these sizes is a very important criterion to choose the scaling
 factors.
@@ -721,14 +730,14 @@ a function of the scale factors and the profile size involved in the
 algorithm. It gave us the the maximum number of bits necessary to code
 them. We have chosen the scale factors so that any variable (except
 the covariance) fits in 18 bits, which is the maximum input size of
-DSPs. In this way, all multiplications, except one, could be done with
-a single DSP, in a single clock cycle. Moreover, assuming that $nb_s =
-1024$, all LUTs could fit in the 18Kbits RAMs. Finally, we compared
-the double and integer versions of LSQ and found a nearly perfect
-agreement between their results.
+DSPs. In this way, all multiplications (except one with covariance)
+could be done with a single DSP, in a single clock cycle. Moreover,
+assuming that $nb_s = 1024$, all LUTs could fit in the 18Kbits
+RAMs. Finally, we compared the double and integer versions of LSQ and
+found a nearly perfect agreement between their results.
 
 As mentionned above, some operations like divisions must be
-avoided. But when the denominator is fixed, a division can be replaced
+avoided. But when the divisor is fixed, a division can be replaced
 by its multiplication/shift counterpart. This is always the case in
 LSQ. For example, assuming that $M$ is fixed, $x_{var}$ is known and
 fixed. Thus, $\frac{xy_{covar}}{x_{var}}$ can be replaced by
@@ -748,7 +757,7 @@ coding and the other with Matlab using the Simulink HDL coder feature~\cite%
 {HDLCoder}. Although the approaches are completely different we obtained
 quite comparable VHDL codes. Each approach has advantages and drawbacks.
 Roughly speaking, hand coding provides beautiful and much better structured
-code while Simulind HDL coder produces allows for fast code production. In
+code while Simulink HDL coder allows for fast code production. In
 terms of throughput and latency, simulations show that the two approaches
 yield close results with a slight advantage for hand coding.
 
@@ -770,17 +779,18 @@ second.
 In order to test our code on the SP Vision board, the design was
 extended with a component that keeps profiles in RAM, flushes them in
 the phase computation component and stores its output in another
-RAM. We also added a wishbone, a component that can "drive" signals to
-communicate between i.MX and other components. It is mainly used to
-start to flush profiles and to retrieve the computed phases in
-RAM. Unfortunately, the first designs could not be placed and routed
-with ISE on the Spartan6 with a 100MHz clock. The main problems were
-encountered with series of arthmetic operations and more especially
-with RAM outputs used in DSPs. So, we needed to decompose some parts
-of the pipeline, which added few clock cycles. Finally, we obtained a
-bitstream that has been successfully tested on the board.
-
-Its latency is of 112 cycles and computes a new phase every 40
+RAM. We also added components that implement the wishbone protocol,
+in order to "drive" signals to communicate between i.MX and other
+components. It is mainly used to start to flush profiles and to
+retrieve the computed phases in RAM. Unfortunately, the first designs
+could not be placed and routed with ISE on the Spartan6 with a 100MHz
+clock. The main problems were encountered with series of arthmetic
+operations and more especially with RAM outputs used in DSPs. So, we
+needed to decompose some parts of the pipeline, which added few clock
+cycles. Finally, we obtained a bitstream that has been successfully
+tested on the board.
+
+Its latency is of 112 cycles and it computes a new phase every 40
 cycles. For 100 cantilevers, it takes $(112+200\times 40).10=81.12\mu
 $s to compute their deflection. It corresponds to about 12300 images
 per second, which is largely beyond the camera capacities and the
-- 
2.39.5