dmems12.tex

   1 %\usepackage{latex8}
   2 %\usepackage{times}
   3 %\usepackage[cyr]{aeguill}
   4 %\usepackage{pstricks,pst-node,pst-text,pst-3d}
   5 %\usepackage{babel}
   6 %\input{psfig.sty}
   7 %%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
   8
   9
  10 \documentclass[10pt, peerreview, compsocconf]{IEEEtran}
  11 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  12 \usepackage[utf8]{inputenc}
  13 \usepackage{amsmath}
  14 \usepackage{url}
  15 \usepackage{graphicx}
  16 \usepackage{thumbpdf}
  17 \usepackage{color}
  18 \usepackage{moreverb}
  19 \usepackage{commath}
  20 \usepackage{subfigure}
  21 \usepackage{fullpage}
  22 \usepackage{fancybox}
  23 \usepackage[ruled,lined,linesnumbered]{algorithm2e}
  24
  25 \setcounter{MaxMatrixCols}{10}
  26 %TCIDATA{OutputFilter=LATEX.DLL}
  27 %TCIDATA{Version=5.50.0.2953}
  28 %TCIDATA{<META NAME="SaveForMode" CONTENT="1">}
  29 %TCIDATA{BibliographyScheme=BibTeX}
  30 %TCIDATA{LastRevised=Wednesday, October 26, 2011 09:49:54}
  31 %TCIDATA{<META NAME="GraphicsSave" CONTENT="32">}
  32
  33 \newcommand{\noun}[1]{\textsc{#1}}
  34 \newcommand{\tab}{\ \ \ }
  35
  36
  37 \begin{document}
  38
  39 \title{A new approach based on a least square method for real-time estimation of cantilever array deflections with a FPGA}
  40 \author{\IEEEauthorblockN{Raphaël Couturier\IEEEauthorrefmark{1}, Stéphane
  41 Domas\IEEEauthorrefmark{1}, Gwenhaël Goavec-Merou\IEEEauthorrefmark{2} and
  42 Michel Lenczner\IEEEauthorrefmark{2}}
  43 \IEEEauthorblockA{\IEEEauthorrefmark{1}FEMTO-ST, DISC, University of Franche-Comte, Belfort, France \and
  44 \{raphael.couturier,stephane.domas\}@univ-fcomte.fr}
  45 \IEEEauthorblockA{\IEEEauthorrefmark{2}FEMTO-ST, Time-Frequency, University of Franche-Comte, Besançon, France \and
  46 \{michel.lenczner@utbm.fr,gwenhael.goavec@trabucayre.com} }
  47
  48 \begin{abstract}
  49 Atomic force microscopes (AFM) provide high resolution images of surfaces.
  50 In this paper, we focus our attention on an interferometry method for
  51 deflection estimation of cantilever arrays in quasi-static regime. In its
  52 original form, spline interpolation was used to determine interference
  53 fringe phase, and thus the deflections. Computations were performed on a PC.
  54 Here, we propose a new complete solution with a least square based algorithm
  55 and an optimized FPGA implementation. Simulations and real tests showed very
  56 good results and open perspective for real-time estimation and control of
  57 cantilever arrays in the dynamic regime.
  58 \end{abstract}
  59
  60 %% \author{\IEEEauthorblockN{Authors Name/s per 1st Affiliation (Author)}
  61 %% \IEEEauthorblockA{line 1 (of Affiliation): dept. name of organization\\
  62 %% line 2: name of organization, acronyms acceptable\\
  63 %% line 3: City, Country\\
  64 %% line 4: Email: name@xyz.com}
  65 %% \and
  66 %% \IEEEauthorblockN{Authors Name/s per 2nd Affiliation (Author)}
  67 %% \IEEEauthorblockA{line 1 (of Affiliation): dept. name of organization\\
  68 %% line 2: name of organization, acronyms acceptable\\
  69 %% line 3: City, Country\\
  70 %% line 4: Email: name@xyz.com}
  71 %% }
  72
  73 %\maketitle
  74
  75 \thispagestyle{empty}
  76
  77 \begin{IEEEkeywords}
  78 FPGA, cantilever arrays, interferometry.
  79 \end{IEEEkeywords}
  80
  81 \IEEEpeerreviewmaketitle
  82
  83 \section{Introduction}
  84
  85 Cantilevers are used in atomic force microscopes (AFM) which provide high
  86 resolution surface images. Several techniques have been reported in
  87 literature for cantilever displacement measurement. In~\cite{CantiPiezzo01},
  88 authors have shown how a piezoresistor can be integrated into a cantilever
  89 for deflection measurement. Nevertheless this approach suffers from the
  90 complexity of the microfabrication process needed to implement the sensor.
  91 In~\cite{CantiCapacitive03}, authors have presented a cantilever mechanism
  92 based on capacitive sensing. These techniques require cantilever
  93 instrumentation resulting in\ complex fabrication processes.
  94
  95 In this paper our attention is focused on a method based on
  96 interferometry for cantilever displacement measurement in quasi-static
  97 regime. Cantilevers are illuminated by an optical source.
  98 Interferometry produces fringes enabling cantilever displacement
  99 computation. A high speed camera is used to analyze the fringes. In
 100 view of real time applications, images need to be processed quickly
 101 and then a fast estimation method is required to determine the
 102 displacement of each cantilever. In~\cite{AFMCSEM11}, an algorithm
 103 based on spline has been introduced for cantilever position
 104 estimation.  The overall process gives accurate results but
 105 computations are performed on a standard computer using LabView
 106 \textsuperscript{\textregistered} .  Consequently, the main drawback
 107 of this implementation is that the computer is a bottleneck. In this
 108 paper we pose the problem of real-time cantilever position estimation
 109 and bring a hardware/software solution. It includes a fast method
 110 based on least squares and its FPGA implementation.
 111
 112 The remainder of the paper is organized as
 113 follows. Section~\ref{sec:measure} describes the measurement
 114 process. Our solution based on the least square method and its
 115 implementation on a FPGA is presented in
 116 Section~\ref{sec:solus}. Numerical experimentations are described in
 117 Section~\ref{sec:results}. Finally a conclusion and some perspectives
 118 are drawn.
 119
 120 \section{Architecture and goals}
 121
 122 \label{sec:measure}
 123
 124 In order to build simple, cost effective and user-friendly cantilever
 125 arrays, authors of ~\cite{AFMCSEM11} have developed a system based on
 126 interferometry.
 127
 128 \subsection{Experimental setup}
 129
 130 \label{sec:archi}
 131
 132 In opposition to other optical based systems using a laser beam
 133 deflection scheme and sensitive to the angular displacement of the
 134 cantilever, interferometry is sensitive to the optical path difference
 135 induced by the vertical displacement of the cantilever.
 136
 137 The system is based on a Linnick interferometer~\cite{Sinclair:05}.
 138 It is illustrated in Figure~\ref{fig:AFM} \footnote{by courtesy of
 139   CSEM}. A laser diode is first split (by the splitter) into a
 140 reference beam and a sample beam both reaching the cantilever array.
 141 The complete system including a cantilever array\ and the optical
 142 system can be moved thanks to a translation and rotational hexapod
 143 stage with five degrees of freedom. Thus, the cantilever array is
 144 centered in the optical system which can be adjusted accurately.  The
 145 beam illuminates the array by a microscope objective and the light
 146 reflects on the cantilevers. Likewise the reference beam reflects on a
 147 movable mirror. A CMOS camera chip records the reference and sample
 148 beams which are recombined in the beam splitter and the
 149 interferogram. At the beginning of each experiment, the movable mirror
 150 is fitted manually in order to align the interferometric fringes
 151 approximately parallel to the cantilevers. Then, cantilever motion in
 152 the transverse direction produces movements in the fringes. They are
 153 detected with the CMOS camera which images are analyzed by a Labview
 154 program to recover the cantilever deflections.
 155
 156 \begin{figure}[tbp]
 157 \begin{center}
 158 \includegraphics[width=\columnwidth]{AFM}
 159 \end{center}
 160 \caption{AFM Setup}
 161 \label{fig:AFM}
 162 \end{figure}
 163
 164 %% image tirée des expériences.
 165
 166 \subsection{Inteferometric based cantilever deflection estimation}
 167
 168 \label{sec:deflest}
 169
 170 \begin{figure}[tbp]
 171 \begin{center}
 172 \includegraphics[width=\columnwidth]{lever-xp}
 173 \end{center}
 174 \caption{Portion of a camera image showing moving interferometric fringes in
 175 cantilevers}
 176 \label{fig:img-xp}
 177 \end{figure}
 178
 179 As shown in Figure \ref{fig:img-xp} \footnote{by courtesy of CSEM}, each
 180 cantilever is covered by several interferometric fringes. The fringes
 181 distort when cantilevers are deflected.  In \cite{AFMCSEM11}, a novel
 182 method for interferometric based cantilever deflection measurement was
 183 reported. For each cantilever, the method uses three segments of pixels,
 184 parallel to its section, to determine phase shifts.  The first is
 185 located just above the AFM tip (tip profile), it provides the phase
 186 shift modulo $2\pi $. The second one is close to the base junction
 187 (base profile) and is used to determine the exact multiple of $2\pi $
 188 through an operation called unwrapping where it is assumed that the
 189 deflection means along the two measurement segments are linearly
 190 dependent.  The third is on the base and provides a reference for
 191 noise suppression.  Finally, deflections are simply derived from phase
 192 shifts.
 193
 194 The pixel gray-level intensity $I$ of each profile is modelized by%
 195 \begin{equation}
 196 I(x)=A\text{ }\cos (2\pi fx+\theta )+ax+b  \label{equ:profile}
 197 \end{equation}%
 198 where $x$ denotes the position of a pixel in a segment, $A$, $f$ and $\theta
 199 $ are the amplitude, the frequency and the phase of the light signal when
 200 the affine function $ax+b$ corresponds to the cantilever array surface tilt
 201 with respect to the light source.
 202
 203 The method consists in two main sequences.  In the first one
 204 corresponding to precomputation, the frequency $f$ of each profile is
 205 determined using a spline interpolation (see section \ref%
 206 {sec:algo-spline}) and the coefficients used for phase unwrapping are
 207 computed. The second one, that we call the \textit{acquisition loop,}
 208 is done after images have been taken at regular time steps. For each
 209 image, the phase $\theta $ of all profiles is computed to obtain,
 210 after unwrapping, the cantilever deflection. The phase determination
 211 in \cite{AFMCSEM11} is achieved by a spline based algorithm which is
 212 the most consuming part of the computation. In this article, we
 213 propose an alternate version based on the least square method which is
 214 faster and better suited for FPGA implementation. Moreover, it can be
 215 used in real-time, i.e. after each image picked by the camera.
 216
 217 \subsection{Computation design goals}
 218
 219 \label{sec:goals}
 220
 221 To evaluate the solution performances, we choose a goal which consists
 222 in designing a computing unit able to estimate the deflections of
 223 a $10\times 10$%
 224 -cantilever array, faster than the camera image stream. In addition,
 225 the result accuracy must be close to 0.3nm, the maximum precision
 226 reached in \cite{AFMCSEM11}. Finally, the latency between the entrance
 227 of the first pixel of an image and the end of deflection computation
 228 must be as small as possible. All these requirement are
 229 stated in the perspective of implementing real-time active control for
 230 each cantilever, see~\cite{LencznerChap10,Hui11}.
 231
 232 If we put aside other hardware issues like the speed of the link
 233 between the camera and the computation unit, the time to deserialize
 234 pixels and to store them in memory, the phase computation is the
 235 bottleneck of the whole process. For example, the camera in the setup
 236 of \cite{AFMCSEM11} provides $%
 237 1024\times 1204$ pixels with an exposition time of 2.5ms. Thus, if the
 238 pixel extraction time is neglected, each phase calculation of a
 239 100-cantilever array should take no more than 12.5$\mu$s.
 240
 241 In fact, this timing is a very hard constraint. To illustrate this point, we
 242 consider a very small program that initializes twenty million of doubles in
 243 memory and then does 1,000,000 cumulated sums on 20 contiguous values
 244 (experimental profiles have about this size). On an intel Core 2 Duo E6650
 245 at 2.33GHz, this program reaches an average of 155Mflops.
 246 Obviously, some cache effects and optimizations on huge amount of
 247 computations can drastically increase these performances: peak efficiency is
 248 about 2.5Gflops for the considered CPU. But this is not the case for phase
 249 computation that is using only a few tenth of values.
 250
 251 In order to evaluate the original algorithm, we translated it in C
 252 language.  As stated in section \ref{sec:algo-comp}, for 20 pixels, it
 253 does about 1,550 operations, thus an estimated execution time of
 254 $1,550/155=$10$\mu $s. For a more realistic evaluation, we constructed
 255 a file of 1Mo containing 200 profiles of 20 pixels, equally
 256 scattered. This file is equivalent to an image stored in a device file
 257 representing the camera. We obtained an average of 10.5$\mu$s by
 258 profile (including I/O accesses). It is under our requirements but
 259 close to the limit. In case of an occasional load of the system, it
 260 could be largely overtaken. Solutions would be to use a real-time
 261 operating system or to search for a more efficient algorithm.
 262
 263 However, the main drawback is the latency of such a solution because each
 264 profile must be treated one after another and the deflection of 100
 265 cantilevers takes about $200\times 10.5=2.1$ms. This would be inadequate
 266 for real-time requirements as for individual cantilever active control. An
 267 obvious solution is to parallelize the computations, for example on a GPU.
 268 Nevertheless, the cost of transferring profile in GPU memory and of taking
 269 back results would be prohibitive compared to computation time.
 270
 271 We remark that when possible, it is more efficient to pipeline the
 272 computation. For example, supposing that 200 profiles of 20 pixels
 273 could be pushed sequentially in a pipelined unit cadenced at a 100MHz
 274 (i.e. a pixel enters in the unit each 10ns), all profiles would be
 275 treated in $200\times 20\times 10.10^{-9}=$ 40$\mu$s plus the latency
 276 of the pipeline. Such a solution would be meeting our requirements and
 277 would be 50 times faster than our C code, and even more compared to
 278 the LabView version use in \cite{AFMCSEM11}. FPGAs are appropriate for
 279 such implementation, so they turn out to be the computation units of
 280 choice to reach our performance requirements. Nevertheless, passing
 281 from a C code to a pipelined version in VHDL is not obvious at all. It
 282 can even be impossible because of FPGA hardware constraints. All these
 283 points are discussed in the following sections.
 284
 285 \section{An hardware/software solution}
 286
 287 \label{sec:solus}
 288
 289 In  this  section we  present  parts  of the  computing  solution  to the  above
 290 requirements. The  hardware part consists in  a high-speed camera,  linked on an
 291 embedded board hosting  two FPGAs. In this way, the camera  output stream can be
 292 pushed directly  into the FPGA. The software  part is mostly the  VHDL code that
 293 deserializes the camera stream, extracts profiles and computes the deflection.
 294
 295 We first give some general information about FPGAs, then we
 296 describe the FPGA board we use for implementation and finally the two
 297 algorithms for phase computation are detailed. Presentation of VHDL
 298 implementations is postponned until Section \ref{Experimental tests}.
 299
 300
 301
 302 \subsection{Elements of FPGA architecture and programming}
 303
 304 A field-programmable gate array (FPGA) is an integrated circuit designed to
 305 be configured by the customer. FGPAs are composed of programmable logic
 306 components, called configurable logic blocks (CLB). These blocks mainly
 307 contain look-up tables (LUT), flip/flops (F/F) and latches, organized in one
 308 or more slices connected together. Each CLB can be configured to perform
 309 simple (AND, XOR, ...) or complex combinational functions. They are
 310 interconnected by reconfigurable links. Modern FPGAs contain memory elements
 311 and multipliers which enable to simplify the design and to increase the
 312 performance. Nevertheless, all other complex operations like division and
 313 other functions like trigonometric functions are not available and must be
 314 built by configuring a set of CLBs. Since this is not an obvious task at
 315 all, tools like ISE~\cite{ISE} have been built to do this operation. Such a
 316 software can synthetize a design written in a hardware description language
 317 (HDL), maps it onto CLBs, place/route them for a specific FPGA, and finally
 318 produces a bitstream that is used to configure the FPGA. Thus, from the
 319 developer's point of view, the main difficulty is to translate an algorithm
 320 into HDL code, taking into account FPGA resources and constraints like clock
 321 signals and I/O values that drive the FPGA.
 322
 323 Indeed, HDL programming is very different from classic languages like
 324 C. A program can be seen as a state-machine, manipulating signals that
 325 evolve from state to state. Moreover, HDL instructions can be executed
 326 concurrently. Signals may be combined with basic logic operations to
 327 produce new states that are assigned to another signal. States are mainly expressed as
 328 arrays of bits.  Fortunately, libraries propose some higher levels
 329 representations like signed integers, and arithmetic operations.
 330
 331 Furthermore, even if FPGAs are cadenced more slowly than classic processors,
 332 they can perform pipelines as well as parallel operations. A pipeline
 333 consists in cutting a process in a sequence of small tasks, taking the same
 334 execution time. It accepts a new data at each clock top, thus, after a known
 335 latency, it also provides a result at each clock top. We observe that the
 336 components of a task are not reusable by another one. Nevertheless, this is
 337 the most efficient technique on FPGAs. Because of their architecture, it is
 338 also very easy to process several data concurrently. Finally, the best
 339 performance can be reached when several pipelines are operating on multiple
 340 data streams in parallel.
 341
 342 \subsection{The FPGA board}
 343
 344 The architecture we use is designed by the Armadeus Systems
 345 company. It consists in a development board called APF27 \textsuperscript{\textregistered}, hosting a
 346 i.MX27 ARM processor (from Freescale) and a Spartan3A (from
 347 Xilinx). This board includes all classical connectors as USB and
 348 Ethernet for instance. A Flash memory contains a Linux kernel that can
 349 be launched after booting the board via u-Boot. The processor is
 350 directly connected to the Spartan3A via its special interface called
 351 WEIM. The Spartan3A is itself connected to an extension board called
 352 SP Vision \textsuperscript{\textregistered}, that hosts a Spartan6 FPGA. Thus, it is
 353 possible to develop programs that communicate between i.MX and
 354 Spartan6, using Spartan3 as a tunnel. A clock signal at 100MHz (by
 355 default) is delivered to dedicated FPGA pins. The Spartan6 of our
 356 board is an LX100 version. It has 15,822 slices, each slice containing
 357 4 LUTs and 8 flip/flops. It is equivalent to 101,261 logic
 358 cells. There are 268 internal block RAM of 18Kbits, and 180 dedicated
 359 multiply-adders (named DSP48), which is largely enough for our
 360 project. Some I/O pins of Spartan6 are connected to two $2\times 17$
 361 headers that can be used for any purpose as to be connected to the
 362 interface of a camera.
 363
 364 \subsection{Two algorithms for phase computation}
 365
 366 In \cite{AFMCSEM11}, $f$ the frequency and $\theta $\ the phase of the
 367 light wave are computed thanks to spline interpolation. As said in
 368 section \ref{sec:deflest}, $f$ is computed only once time but the
 369 phase needs to be computed for each image. This is why, in this paper,
 370 we focus on its computation.
 371
 372 \subsubsection{Spline algorithm (SPL)}
 373
 374 \label{sec:algo-spline}
 375
 376 We denote by $M$ the number of pixels in a segment used for phase
 377 computation. For the sake of simplicity of the notations, we consider
 378 the light intensity I a function on the interval [0,M] which is the
 379 range of a one-to-one mapping defined on the physical segment. The
 380 pixels are assumed to be regularly spaced and centered at the
 381 positions $x^{p}\in\{0,1,\ldots,M-1\}.$ We use the simplest definition
 382 of a pixel, namely the value of $I$ at its center. The pixel
 383 intensities are considered as pre-normalized so that their minimum and
 384 maximum have been resized to $-1$ and $1$.
 385
 386 The first step consists in computing the cubic spline interpolation of
 387 the intensities. This allows for interpolating $I$ at a larger number
 388 $L=k\times M$ of points (typically $k=4$ is sufficient) $%
 389 x^{s}$ in the interval $[0,M[$. During the precomputation sequence,
 390 the second step is to determin the affine part $a.x+b$ of $I$. It is
 391 found with an ordinary least square method, taking account the $L$
 392 points. Values of $I$ in $x^s$ are used to compute its intersections
 393 with $a.x+b$. The period of $I$ (and thus its frequency) is deduced
 394 from the number of intersections and the distance between the first
 395 and last.
 396
 397 During the acquisition loop, the second step is the phase computation, with
 398 \begin{equation}
 399 \theta =atan\left[ \frac{\sum_{i=0}^{N-1}\text{sin}(2\pi fx_{i}^{s})\times
 400 I(x_{i}^{s})}{\sum_{i=0}^{N-1}\text{cos}(2\pi fx_{i}^{s})\times I(x_{i}^{s})}%
 401 \right] .
 402 \end{equation}
 403
 404 \textit{Remarks: }
 405
 406 \begin{itemize}
 407 \item The frequency could also be obtained using the derivates of spline
 408 equations, which only implies to solve quadratic equations but certainly
 409 yields higher errors.
 410
 411 \item Profile frequency are computed during the precomputation step,
 412   thus the values sin$(2\pi fx_{i}^{s})$ and cos$(2\pi fx_{i}^{s})$
 413   can be determined once for all.
 414 \end{itemize}
 415
 416 \subsubsection{Least square algorithm (LSQ)}
 417
 418 Assuming that we compute the phase during the acquisition loop, equation \ref%
 419 {equ:profile} has only 4 parameters: $a,b,A$, and $\theta $, $f$ and $x$
 420 being already known. Since $I$ is non-linear, a least square method based on
 421 a Gauss-Newton algorithm can be used to determine these four parameters.
 422 This kind of iterative process ends with a convergence criterion, so it is
 423 not suited to our design goals. Fortunately, it is quite simple to reduce
 424 the number of parameters to only $\theta $. Firstly, the affine part $ax+b$
 425 is estimated from the $M$ values $I(x^{p})$ to determine the rectified
 426 intensities,%
 427 \begin{equation*}
 428 I^{corr}(x^{p})\approx I(x^{p})-a.x^{p}-b.
 429 \end{equation*}%
 430 To find $a$ and $b$ we apply an ordinary least square method (as in SPL but on $M$ points)%
 431 \begin{equation*}
 432 a=\frac{covar(x^{p},I(x^{p}))}{\text{var}(x^{p})}\text{ and }b=\overline{%
 433 I(x^{p})}-a.\overline{{x^{p}}}
 434 \end{equation*}%
 435 where overlined symbols represent average. Then the amplitude $A$ is
 436 approximated by%
 437 \begin{equation*}
 438 A\approx \frac{\text{max}(I^{corr})-\text{min}(I^{corr})}{2}.
 439 \end{equation*}%
 440 Finally, the problem of approximating $\theta $ is reduced to minimizing%
 441 \begin{equation*}
 442 \min_{\theta \in \lbrack -\pi ,\pi ]}\sum_{i=0}^{M-1}\left[ \text{cos}(2\pi
 443 f.i+\theta )-\frac{I^{corr}(i)}{A}\right] ^{2}.
 444 \end{equation*}%
 445 An optimal value $\theta ^{\ast }$ of the minimization problem is a zero of
 446 the first derivative of the above argument,%\begin{eqnarray*}{l}
 447 \begin{equation*}
 448 2\left[ \text{cos}\theta ^{\ast }\sum_{i=0}^{M-1}I^{corr}(i).\text{sin}(2\pi
 449 f.i)\right.
 450 \end{equation*}%
 451 \begin{equation*}
 452 \left. +\text{sin}\theta ^{\ast }\sum_{i=0}^{M-1}I^{corr}(i).\text{cos}(2\pi
 453 f.i)\right] -
 454 \end{equation*}%
 455 \begin{equation*}
 456 A\left[ \text{cos}2\theta ^{\ast }\sum_{i=0}^{M-1}\sin (4\pi f.i)+\text{sin}%
 457 2\theta ^{\ast }\sum_{i=0}^{M-1}\cos (4\pi f.i)\right] =0
 458 \end{equation*}%
 459 %
 460 %\end{eqnarray*}
 461
 462 Several points can be noticed:
 463
 464 \begin{itemize}
 465 \item The terms $\sum_{i=0}^{M-1}$sin$(4\pi f.i)$ and$\sum_{i=0}^{M-1}$cos$%
 466 (4\pi f.i)$ are independent of $\theta $, they can be precomputed.
 467
 468 \item Lookup tables (namely lut$_{sfi}$ and lut$_{cfi}$ in the following algorithms) can be
 469   set with the $2.M$ values $\sin (2\pi f.i)$ and $\cos (2\pi f.i)$.
 470
 471 \item A simple method to find a zero $\theta ^{\ast }$ of the optimality
 472 condition is to discretize the range $[-\pi ,\pi ]$ with a large number $%
 473 nb_{s}$ of nodes and to find which one is a minimizer in the absolute value
 474 sense. Hence, three other lookup tables (lut$_{s}$, lut$_{c}$ and lut$_{A}$) can be set with the $%
 475 3\times nb_{s}$ values $\sin \theta ,$ $\cos \theta ,$
 476 \begin{equation*}
 477 \left[ cos2\theta \sum_{i=0}^{M-1}sin(4\pi f.i)+sin2\theta
 478 \sum_{i=0}^{M-1}cos(4\pi f.i)\right] .
 479 \end{equation*}
 480
 481 \item The search algorithm can be very fast using a dichotomous process in $%
 482 log_{2}(nb_{s}).$
 483 \end{itemize}
 484
 485 The overall method is synthetized in an algorithm (called LSQ in the
 486 following) divided into the precomputing part and the acquisition loop.
 487
 488 \begin{algorithm}[htbp]
 489 \caption{LSQ algorithm - before acquisition loop.}
 490 \label{alg:lsq-before}
 491
 492    $M \leftarrow $ number of pixels of the profile\\
 493    I[] $\leftarrow $ intensity of pixels\\
 494    $f \leftarrow $ frequency of the profile\\
 495    $s4i \leftarrow \sum_{i=0}^{M-1} sin(4\pi f.i)$\\
 496    $c4i \leftarrow \sum_{i=0}^{M-1} cos(4\pi f.i)$\\
 497    $nb_s \leftarrow $ number of discretization steps of $[-\pi,\pi]$\\
 498
 499    \For{$i=0$ to $nb_s $}{
 500      $\theta  \leftarrow -\pi + 2\pi\times \frac{i}{nb_s}$\\
 501      lut$_s$[$i$] $\leftarrow sin \theta$\\
 502      lut$_c$[$i$] $\leftarrow cos \theta$\\
 503      lut$_A$[$i$] $\leftarrow cos 2 \theta \times s4i + sin 2 \theta \times c4i$\\
 504      lut$_{sfi}$[$i$] $\leftarrow sin (2\pi f.i)$\\
 505      lut$_{cfi}$[$i$] $\leftarrow cos (2\pi f.i)$\\
 506    }
 507 \end{algorithm}
 508
 509 \begin{algorithm}[htbp]
 510 \caption{LSQ algorithm - during acquisition loop.}
 511 \label{alg:lsq-during}
 512
 513    $\bar{x} \leftarrow \frac{M-1}{2}$\\
 514    $\bar{y} \leftarrow 0$, $x_{var} \leftarrow 0$, $xy_{covar} \leftarrow 0$\\
 515    \For{$i=0$ to $M-1$}{
 516      $\bar{y} \leftarrow \bar{y} + $ I[$i$]\\
 517      $x_{var} \leftarrow x_{var} + (i-\bar{x})^2$\\
 518    }
 519    $\bar{y} \leftarrow \frac{\bar{y}}{M}$\\
 520    \For{$i=0$ to $M-1$}{
 521      $xy_{covar} \leftarrow xy_{covar} + (i-\bar{x}) \times (I[i]-\bar{y})$\\
 522    }
 523    $slope \leftarrow \frac{xy_{covar}}{x_{var}}$\\
 524    $start \leftarrow \bar{y} - slope\times \bar{x}$\\
 525    \For{$i=0$ to $M-1$}{
 526      $I[i] \leftarrow I[i] - start - slope\times i$\\
 527    }
 528
 529    $I_{max} \leftarrow max_i(I[i])$, $I_{min} \leftarrow min_i(I[i])$\\
 530    $amp \leftarrow \frac{I_{max}-I_{min}}{2}$\\
 531
 532    $Is \leftarrow 0$, $Ic \leftarrow 0$\\
 533    \For{$i=0$ to $M-1$}{
 534      $Is \leftarrow Is + I[i]\times $ lut$_{sfi}$[$i$]\\
 535      $Ic \leftarrow Ic + I[i]\times $ lut$_{cfi}$[$i$]\\
 536    }
 537
 538    $\delta \leftarrow \frac{nb_s}{2}$, $b_l \leftarrow 0$, $b_r \leftarrow \delta$\\
 539    $v_l \leftarrow -2.I_s - amp.$lut$_A$[$b_l$]\\
 540
 541    \While{$\delta >= 1$}{
 542
 543      $v_r \leftarrow 2.[ Is.$lut$_c$[$b_r$]$ + Ic.$lut$_s$[$b_r$]$ ] - amp.$lut$_A$[$b_r$]\\
 544
 545      \If{$!(v_l < 0$ and $v_r >= 0)$}{
 546        $v_l \leftarrow v_r$ \\
 547        $b_l \leftarrow b_r$ \\
 548      }
 549      $\delta \leftarrow \frac{\delta}{2}$\\
 550      $b_r \leftarrow b_l + \delta$\\
 551    }
 552    \uIf{$!(v_l < 0$ and $v_r >= 0)$}{
 553      $v_l \leftarrow v_r$ \\
 554      $b_l \leftarrow b_r$ \\
 555      $b_r \leftarrow b_l + 1$\\
 556      $v_r \leftarrow 2.[ Is.$lut$_c$[$b_r$]$ + Ic.$lut$_s$[$b_r$]$ ] - amp.$lut$_A$[$b_r$]\\
 557    }
 558    \Else {
 559      $b_r \leftarrow b_l + 1$\\
 560    }
 561
 562    \uIf{$ abs(v_l) < v_r$}{
 563      $b_{\theta} \leftarrow b_l$ \\
 564    }
 565    \Else {
 566      $b_{\theta} \leftarrow b_r$ \\
 567    }
 568    $\theta \leftarrow \pi\times \left[\frac{2.b_{ref}}{nb_s}-1\right]$\\
 569
 570 \end{algorithm}
 571
 572 \subsubsection{Algorithm comparison}
 573 \label{sec:algo-comp}
 574 We compared the two algorithms on the base of three criteria:
 575
 576 \begin{itemize}
 577 \item precision of results on a cosines profile, distorted by noise,
 578
 579 \item number of operations,
 580
 581 \item complexity of FPGA implementation
 582 \end{itemize}
 583
 584 For the first item, we produced a matlab version of each algorithm,
 585 running in double precision. The profile was generated for about
 586 34,000 different quadruplets of periods ($\in \lbrack 3.1,6.1]$, step
 587 = 0.1), phases ($\in \lbrack -3.1,3.1]$, steps = 0.062) and slopes
 588 ($\in \lbrack -2,2]$, step = 0.4). Obviously, the discretization of
 589 $[-\pi ,\pi ]$ introduces an error in the phase estimation. It is at
 590 most equal to $\frac{\pi}{nb_s}$. From some experiments on a $17\times
 591 4$ array, authors of \cite{AFMCSEM11} noticed an average ratio of 50
 592 between phase variation in radians and lever end position in
 593 nanometers. Assuming such a ratio and $nb_s = 1024$, the maximum lever
 594 deflection error would be 0.15nm which is smaller than 0.3nm, the best
 595 precision achieved with the setup used in \cite{AFMCSEM11}.
 596
 597 Moreover, pixels have been paired and the paired intensities have been
 598 perturbed by addition of a random number uniformly picked in
 599 $[-N,N]$. Notice that we have observed that perturbing each pixel
 600 independently yields too weak profile distortion. We report
 601 percentages of errors between the reference and the computed phases
 602 out of $2\pi ,$%
 603 \begin{equation*}
 604 err=100\times \frac{|\theta _{ref}-\theta _{comp}|}{2\pi }.
 605 \end{equation*}%
 606 Table \ref{tab:algo_prec} gives the maximum and the average errors for both
 607 algorithms and for increasing values of $N$ the noise parameter.
 608
 609 \begin{table}[ht]
 610 \begin{center}
 611 \begin{tabular}{|c|c|c|c|c|}
 612 \hline
 613 & \multicolumn{2}{c|}{SPL} & \multicolumn{2}{c|}{LSQ} \\ \cline{2-5}
 614 noise (N)& max. err. & aver. err. & max. err. & aver. err. \\ \hline
 615 0 & 2.46 & 0.58 & 0.49 & 0.1 \\ \hline
 616 2.5 & 2.75 & 0.62 & 1.16 & 0.22 \\ \hline
 617 5 & 3.77 & 0.72 & 2.47 & 0.41 \\ \hline
 618 7.5 & 4.72 & 0.86 & 3.33 & 0.62 \\ \hline
 619 10 & 5.62 & 1.03 & 4.29 & 0.81 \\ \hline
 620 15 & 7.96 & 1.38 & 6.35 & 1.21 \\ \hline
 621 30 & 17.06 & 2.6 & 13.94 & 2.45 \\ \hline
 622 \end{tabular}%
 623 \end{center}
 624 \caption{Error (in \%) for cosines profiles, with noise.}
 625 \label{tab:algo_prec}
 626 \end{table}
 627
 628 The results show that the two algorithms yield close results, with a slight
 629 advantage for LSQ. Furthermore, both behave very well against noise.
 630 Assuming an average ratio of 50 (see above), an error of 1 percent on
 631 the phase corresponds to an error of 0.5nm on the lever deflection, which is
 632 very close to the best precision.
 633
 634 It is very hard to predict which level of noise will be present in
 635 real experiments and how it will distort the profiles. Authors of
 636 \cite{AFMCSEM11} gave us the authorization to exploit some of their
 637 results on a $17\times 4$ array. It allowed us to compare experimental
 638 profiles to simulated ones. We can see on figure \ref{fig:noise20} the
 639 profile with $N=10$ that leads to the biggest error. It is a bit
 640 distorted, with pikes and straight/rounded portions. In fact, it is
 641 very close to some of the worst experimental profiles. Figure
 642 \ref{fig:noise60} shows a sample of worst profile for $N=30$. It is
 643 completely distorted, largely beyond any experimental ones. Obviously,
 644 these comparisons are a bit subjective and experimental profiles
 645 could also be more distorted on other experiments. Nevertheless,
 646 they give an idea about the possible error.
 647
 648 \begin{figure}[ht]
 649 \begin{center}
 650 \includegraphics[width=\columnwidth]{intens-noise20}
 651 \end{center}
 652 \caption{Sample of worst profile for N=10}
 653 \label{fig:noise20}
 654 \end{figure}
 655
 656 \begin{figure}[ht]
 657 \begin{center}
 658 \includegraphics[width=\columnwidth]{intens-noise60}
 659 \end{center}
 660 \caption{Sample of worst profile for N=30}
 661 \label{fig:noise60}
 662 \end{figure}
 663
 664 The second criterion is relatively easy to estimate for LSQ and harder for
 665 SPL because of the use of the arctangent function. In both cases, the number
 666 of operation is proportional to $M$ the number of pixels. For LSQ, it also
 667 depends on $nb_{s}$ and for SPL on $L=k\times M$ the number of interpolated
 668 points. We assume that $M=20$, $nb_{s}=1024$ and $k=4$, that all possible
 669 parts are already in lookup tables and that a limited set of operations (+,
 670 -, *, /, $<$, $>$) is taken into account. Translating both algorithms in C
 671 code, we obtain about 430 operations for LSQ and 1,550 (plus a few tenth for
 672 $atan$) for SPL. This result is largely in favor of LSQ. Nevertheless,
 673 considering the total number of operations is not fully relevant for FPGA
 674 implementation for which time and space consumption depends not only on the type
 675 of operations but also of their ordering. The final evaluation is thus very
 676 much driven by the third criterion.
 677
 678 The Spartan 6 used in our architecture has a hard constraint since it
 679 has no built-in floating point units. Obviously, it is possible to use
 680 some existing "black-boxes" for double precision operations. But they
 681 require a lot of clock cycles to complete. It is much simpler to
 682 exclusively use integers, with a quantization of all double precision
 683 values. It should be chosen in a manner that does not alterate result
 684 precision. Furthermore, it should not lead to a design with a huge
 685 latency because of operations that could not complete during a single
 686 or few clock cycles. Divisions fall into that category and, moreover,
 687 they need a varying number of clock cycles to complete. Even
 688 multiplications can be a problem since a DSP48 takes inputs of 18 bits
 689 maximum. So, for larger multiplications, several DSP must be combined
 690 which increases the overall latency.
 691
 692 In the present algorithms, the hardest constraint does not come from the
 693 FPGA characteristics but from the algorithms. Their VHDL implementation can
 694 be efficient only if they can be fully (or near) pipelined. We observe that
 695 only a small part of SPL can be pipelined, indeed, the computation of spline
 696 coefficients implies to solve a linear tridiagonal system which matrix and
 697 right-hand side are computed from incoming pixels intensity but after, the
 698 back-solve starts with the latest values, which breaks the pipeline.
 699 Moreover, SPL relies on interpolating far more points than profile size.
 700 Thus, the end of SPL works on a larger amount of data than at the beginning,
 701 which also breaks the pipeline.
 702
 703 LSQ has not this problem since all parts, except the dichotomic search, work
 704 on the same amount of data, i.e. the profile size. Furthermore, LSQ requires
 705 less operations than SPL, implying a smaller output latency. In total, LSQ
 706 turns out to be the best candidate for phase computation on any architecture
 707 including FPGA.
 708
 709 \section{VHDL implementation and experimental tests}
 710
 711 \label{Experimental tests}
 712
 713 \subsection{VHDL implementation}
 714
 715 From the LSQ algorithm, we have written a C program that uses only
 716 integer values. We used a very simple quantization which consists in
 717 multiplying each double precision value by a factor power of two and
 718 by keeping the integer part. For an accurate evaluation of the
 719 division in the computation of $a$ the slope coefficient, we also
 720 scaled the pixel intensities by another power of two. The main problem
 721 was to determin these factors. Most of the time, they are chosen to
 722 minimize the error induced by the quantization. But in our case, we
 723 also have some hardware constraints, for example the width and depth of
 724 RAMs or the input size of DSPs. Thus, having a maximum of values that
 725 fit in these sizes is a very important criterion to choose the scaling
 726 factors.
 727
 728 Consequently, we have determined the maximum value of each variable as
 729 a function of the scale factors and the profile size involved in the
 730 algorithm. It gave us the the maximum number of bits necessary to code
 731 them. We have chosen the scale factors so that any variable (except
 732 the covariance) fits in 18 bits, which is the maximum input size of
 733 DSPs. In this way, all multiplications (except one with covariance)
 734 could be done with a single DSP, in a single clock cycle. Moreover,
 735 assuming that $nb_s = 1024$, all LUTs could fit in the 18Kbits
 736 RAMs. Finally, we compared the double and integer versions of LSQ and
 737 found a nearly perfect agreement between their results.
 738
 739 As mentionned above, some operations like divisions must be
 740 avoided. But when the divisor is fixed, a division can be replaced
 741 by its multiplication/shift counterpart. This is always the case in
 742 LSQ. For example, assuming that $M$ is fixed, $x_{var}$ is known and
 743 fixed. Thus, $\frac{xy_{covar}}{x_{var}}$ can be replaced by
 744
 745 \[ (xy_{covar}\times \left \lfloor\frac{2^n}{x_{var}} \right \rfloor) \gg n\]
 746
 747 where $n$ depends on the desired precision (in our case $n=24$).
 748
 749 Obviously, multiplications and divisions by a power of two can be
 750 replaced by left or right bit shifts. Finally, the code only contains
 751 shifts, additions, subtractions and multiplications of signed integers, which
 752 are perfectly adapted to FGPAs.
 753
 754
 755 We built two versions of VHDL codes, namely one directly by hand
 756 coding and the other with Matlab using the Simulink HDL coder feature~\cite%
 757 {HDLCoder}. Although the approaches are completely different we obtained
 758 quite comparable VHDL codes. Each approach has advantages and drawbacks.
 759 Roughly speaking, hand coding provides beautiful and much better structured
 760 code while Simulink HDL coder allows for fast code production. In
 761 terms of throughput and latency, simulations show that the two approaches
 762 yield close results with a slight advantage for hand coding.
 763
 764 \subsection{Simulation}
 765
 766 Before experimental tests on the FPGA board, we simulated our two VHDL
 767 codes with GHDL and GTKWave (two free tools with linux). We built a
 768 testbench based on experimental profiles and compared the results to
 769 values given by the SPL algorithm. Both versions lead to correct
 770 results. Our first codes were highly optimized, indeed the pipeline
 771 could compute a new phase each 33 cycles and its latency was equal to
 772 95 cycles. Since the Spartan6 is clocked at 100MHz, estimating the
 773 deflection of 100 cantilevers would take about $%
 774 (95+200\times 33).10=66.95\mu $s, i.e. nearly 15,000 estimations by
 775 second.
 776
 777 \subsection{Bitstream creation}
 778
 779 In order to test our code on the SP Vision board, the design was
 780 extended with a component that keeps profiles in RAM, flushes them in
 781 the phase computation component and stores its output in another
 782 RAM. We also added components that implement the wishbone protocol,
 783 in order to "drive" signals to communicate between i.MX and other
 784 components. It is mainly used to start to flush profiles and to
 785 retrieve the computed phases in RAM. Unfortunately, the first designs
 786 could not be placed and routed with ISE on the Spartan6 with a 100MHz
 787 clock. The main problems were encountered with series of arthmetic
 788 operations and more especially with RAM outputs used in DSPs. So, we
 789 needed to decompose some parts of the pipeline, which added few clock
 790 cycles. Finally, we obtained a bitstream that has been successfully
 791 tested on the board.
 792
 793 Its latency is of 112 cycles and it computes a new phase every 40
 794 cycles. For 100 cantilevers, it takes $(112+200\times 40).10=81.12\mu
 795 $s to compute their deflection. It corresponds to about 12300 images
 796 per second, which is largely beyond the camera capacities and the
 797 possibility to extract a new profile from an image every 40
 798 cycles. Nevertheless, it also largely fits our design goals.
 799
 800 \label{sec:results}
 801
 802 \section{Conclusion and perspectives}
 803
 804 In this paper we have presented a full hardware/software solution for
 805 real-time cantilever deflection computation from interferometry images.
 806 Phases are computed thanks to a new algorithm based on the least square
 807 method. It has been quantized and pipelined to be mapped into a FPGA, the
 808 architecture of our solution. Performances have been analyzed through
 809 simulations and real experiments on a Spartan6 FPGA. The results meet our
 810 initial requirements. In future work, the algorithm quantization will be
 811 better analyzed and an high speed camera will be introduced in the
 812 processing chain so that to process real images. Finally, we will address
 813 real-time filtering and control problems for AFM arrays in dynamic regime.
 814
 815 \section{Acknowledgments}
 816 We would like to thank A. Meister and M. Favre, from CSEM, for sharing all the
 817 material we used to write this article and for the time they spent to
 818 explain us their approach.
 819
 820 \bibliographystyle{plain}
 821 \bibliography{biblio}
 822
 823 \end{document}