dmems12.tex

   1 %\usepackage{latex8}
   2 %\usepackage{times}
   3 %\usepackage[cyr]{aeguill}
   4 %\usepackage{pstricks,pst-node,pst-text,pst-3d}
   5 %\usepackage{babel}
   6 %\input{psfig.sty}
   7 %%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
   8
   9
  10 \documentclass[10pt, peerreview, compsocconf]{IEEEtran}
  11 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  12 \usepackage[utf8]{inputenc}
  13 \usepackage{amsmath}
  14 \usepackage{url}
  15 \usepackage{graphicx}
  16 \usepackage{thumbpdf}
  17 \usepackage{color}
  18 \usepackage{moreverb}
  19 \usepackage{commath}
  20 \usepackage{subfigure}
  21 \usepackage{fullpage}
  22 \usepackage{fancybox}
  23 \usepackage[ruled,lined,linesnumbered]{algorithm2e}
  24
  25 \setcounter{MaxMatrixCols}{10}
  26 %TCIDATA{OutputFilter=LATEX.DLL}
  27 %TCIDATA{Version=5.50.0.2953}
  28 %TCIDATA{<META NAME="SaveForMode" CONTENT="1">}
  29 %TCIDATA{BibliographyScheme=BibTeX}
  30 %TCIDATA{LastRevised=Wednesday, October 26, 2011 09:49:54}
  31 %TCIDATA{<META NAME="GraphicsSave" CONTENT="32">}
  32
  33 \newcommand{\noun}[1]{\textsc{#1}}
  34 \newcommand{\tab}{\ \ \ }
  35
  36
  37 \begin{document}
  38
  39 \title{A new approach based on a least square method for real-time estimation of cantilever array deflections with a FPGA}
  40 \author{\IEEEauthorblockN{Raphaël Couturier\IEEEauthorrefmark{1}, Stéphane
  41 Domas\IEEEauthorrefmark{1}, Gwenhaël Goavec-Merou\IEEEauthorrefmark{2}, Mélanie Favre\IEEEauthorrefmark{3},
  42 Michel Lenczner\IEEEauthorrefmark{2} and André Meister\IEEEauthorrefmark{3}} \\
  43 \IEEEauthorblockA{\IEEEauthorrefmark{1}FEMTO-ST, DISC, University of Franche-Comte, Belfort, France \and
  44 \{raphael.couturier,stephane.domas\}@univ-fcomte.fr} \\
  45 \IEEEauthorblockA{\IEEEauthorrefmark{2}FEMTO-ST, Time-Frequency, University of Franche-Comte, Besançon, France \and
  46 michel.lenczner@utbm.fr,gwenhael.goavec@trabucayre.com}\\
  47 \IEEEauthorblockA{\IEEEauthorrefmark{3}CSEM, Centre Suisse d’Electronique et de Microtechnique, Neuchatel, Switzerland \and
  48 \{melanie.favre,andre.meister\}@csem.ch}
  49  }
  50
  51 \begin{abstract}
  52 Atomic force microscopes (AFM) provide high resolution images of surfaces.
  53 In this paper, we focus our attention on an interferometry method for
  54 deflection estimation of cantilever arrays in quasi-static regime. In its
  55 original form, spline interpolation was used to determine interference
  56 fringe phase, and thus the deflections. Computations were performed on a PC.
  57 Here, we propose a new complete solution with a least square based algorithm
  58 and an optimized FPGA implementation. Simulations and real tests showed very
  59 good results and open perspective for real-time estimation and control of
  60 cantilever arrays in the dynamic regime.
  61 \end{abstract}
  62
  63 %% \author{\IEEEauthorblockN{Authors Name/s per 1st Affiliation (Author)}
  64 %% \IEEEauthorblockA{line 1 (of Affiliation): dept. name of organization\\
  65 %% line 2: name of organization, acronyms acceptable\\
  66 %% line 3: City, Country\\
  67 %% line 4: Email: name@xyz.com}
  68 %% \and
  69 %% \IEEEauthorblockN{Authors Name/s per 2nd Affiliation (Author)}
  70 %% \IEEEauthorblockA{line 1 (of Affiliation): dept. name of organization\\
  71 %% line 2: name of organization, acronyms acceptable\\
  72 %% line 3: City, Country\\
  73 %% line 4: Email: name@xyz.com}
  74 %% }
  75
  76 %\maketitle
  77
  78 \thispagestyle{empty}
  79
  80 \begin{IEEEkeywords}
  81 FPGA, cantilever arrays, interferometry.
  82 \end{IEEEkeywords}
  83
  84 \IEEEpeerreviewmaketitle
  85 %\maketitle
  86
  87 \section{Introduction}
  88
  89 Cantilevers are used in atomic force microscopes (AFM) which provide high
  90 resolution surface images. Several techniques have been reported in
  91 literature for cantilever displacement measurement. In~\cite{CantiPiezzo01},
  92 authors have shown how a piezoresistor can be integrated into a cantilever
  93 for deflection measurement. Nevertheless this approach suffers from the
  94 complexity of the microfabrication process needed to implement the sensor.
  95 In~\cite{CantiCapacitive03}, authors have presented a cantilever mechanism
  96 based on capacitive sensing. These techniques require cantilever
  97 instrumentation resulting in\ complex fabrication processes.
  98
  99 In this paper our attention is focused on a method based on
 100 interferometry for cantilever displacement measurement in quasi-static
 101 regime. Cantilevers are illuminated by an optical source.
 102 Interferometry produces fringes enabling cantilever displacement
 103 computation. A high speed camera is used to analyze the fringes. In
 104 view of real time applications, images need to be processed quickly
 105 and then a fast estimation method is required to determine the
 106 displacement of each cantilever. In~\cite{AFMCSEM11}, an algorithm
 107 based on spline has been introduced for cantilever position
 108 estimation.  The overall process gives accurate results but
 109 computations are performed on a standard computer using LabView
 110 \textsuperscript{\textregistered}. Consequently, the main drawback
 111 of this implementation is that the computer is a bottleneck. In this
 112 paper we pose the problem of real-time cantilever position estimation
 113 and bring a hardware/software solution. It includes a fast method
 114 based on least squares and its FPGA implementation.
 115
 116 The remainder of the paper is organized as
 117 follows. Section~\ref{sec:measure} describes the measurement
 118 process. Our solution based on the least square method and its
 119 implementation on a FPGA is presented in
 120 Section~\ref{sec:solus}. Numerical experimentations are described in
 121 Section~\ref{sec:xp-test}. Finally a conclusion and some perspectives
 122 are drawn.
 123
 124 \section{Architecture and goals}
 125
 126 \label{sec:measure}
 127
 128 In order to build simple, cost effective and user-friendly cantilever
 129 arrays, we use a system based on interferometry. The two following
 130 sections summarize the original characteristics of its architecture
 131 and computation method.
 132
 133 \subsection{Experimental setup}
 134
 135 \label{sec:archi}
 136
 137 In opposition to other optical based systems using a laser beam
 138 deflection scheme and sensitive to the angular displacement of the
 139 cantilever, interferometry is sensitive to the optical path difference
 140 induced by the vertical displacement of the cantilever.
 141
 142 The system is based on a Linnick interferometer~\cite{Sinclair:05}.
 143 It is illustrated in Figure~\ref{fig:AFM}. A laser diode is first
 144 split (by the splitter) into a reference beam and a sample beam both
 145 reaching the cantilever array.  The complete system including a
 146 cantilever array and the optical system can be moved thanks to a
 147 translation and rotational hexapod stage with five degrees of
 148 freedom. Thus, the cantilever array is centered in the optical system
 149 which can be adjusted accurately.  The beam illuminates the array by a
 150 microscope objective and the light reflects on the
 151 cantilevers. Likewise the reference beam reflects on a movable
 152 mirror. A CMOS camera chip records the reference and sample beams
 153 which are recombined in the beam splitter and the interferogram. At
 154 the beginning of each experiment, the movable mirror is fitted
 155 manually in order to align the interferometric fringes approximately
 156 parallel to the cantilevers. Then, cantilever motion in the transverse
 157 direction produces movements in the fringes. They are detected with
 158 the CMOS camera which images are analyzed by a Labview program to
 159 recover the cantilever deflections.
 160
 161 \begin{figure}[tbp]
 162 \begin{center}
 163 \includegraphics[width=\columnwidth]{AFM}
 164 \end{center}
 165 \caption{AFM Setup}
 166 \label{fig:AFM}
 167 \end{figure}
 168
 169 %% image tirée des expériences.
 170
 171 \subsection{Inteferometric based cantilever deflection estimation}
 172
 173 \label{sec:deflest}
 174
 175 \begin{figure}[tbp]
 176 \begin{center}
 177 \includegraphics[width=\columnwidth]{lever-xp}
 178 \end{center}
 179 \caption{Portion of a camera image showing moving interferometric fringes in
 180 cantilevers}
 181 \label{fig:img-xp}
 182 \end{figure}
 183
 184 As shown in Figure \ref{fig:img-xp}, each cantilever is covered by
 185 several interferometric fringes. The fringes distort when cantilevers
 186 are deflected. For each cantilever, the method uses three segments of
 187 pixels, parallel to its section, to determine phase shifts.  The first
 188 is located just above the AFM tip (tip profile), it provides the phase
 189 shift modulo $2\pi $. The second one is close to the base junction
 190 (base profile) and is used to determine the exact multiple of $2\pi $
 191 through an operation called unwrapping where it is assumed that the
 192 deflection means along the two measurement segments are linearly
 193 dependent.  The third is on the base and provides a reference for
 194 noise suppression.  Finally, deflections are simply derived from phase
 195 shifts.
 196
 197 The pixel gray-level intensity $I$ of each profile is modelized by%
 198 \begin{equation}
 199 I(x)=A\text{ }\cos (2\pi fx+\theta )+ax+b  \label{equ:profile}
 200 \end{equation}%
 201 where $x$ denotes the position of a pixel in a segment, $A$, $f$ and $\theta
 202 $ are the amplitude, the frequency and the phase of the light signal when
 203 the affine function $ax+b$ corresponds to the cantilever array surface tilt
 204 with respect to the light source.
 205
 206 The method consists in two main sequences.  In the first one
 207 corresponding to precomputation, the frequency $f$ of each profile is
 208 determined using a spline interpolation (see section \ref%
 209 {sec:algo-spline}) and the coefficients used for phase unwrapping are
 210 computed. The second one, that we call the \textit{acquisition loop,}
 211 is done after images have been taken at regular time steps. For each
 212 image, the phase $\theta $ of all profiles is computed to obtain,
 213 after unwrapping, the cantilever deflection. The phase determination
 214 is achieved by a spline based algorithm,  which is the most consuming
 215 part of the computation. In this article, we propose an alternate
 216 version based on the least square method which is faster and better
 217 suited for FPGA implementation. Moreover, it can be used in real-time,
 218 i.e. after each image is picked by the camera.
 219
 220 \subsection{Computation design goals}
 221
 222 \label{sec:goals}
 223
 224 To evaluate the solution performances, we choose a goal which consists
 225 in designing a computing unit able to estimate the deflections of a
 226 $10\times 10$-cantilever array, faster than the camera image
 227 stream. In addition, the result accuracy must be close to 0.3nm, the
 228 maximum precision reached in~\cite{AFMCSEM11}. Finally, the latency
 229 between the entrance of the first pixel of an image and the end of
 230 deflection computation must be as small as possible. All these
 231 requirement are stated in the perspective of implementing real-time
 232 active control for each cantilever, see~\cite{LencznerChap10,Hui11}.
 233
 234 If we put aside other hardware issues like the speed of the link
 235 between the camera and the computation unit, the time to deserialize
 236 pixels and to store them in memory, the phase computation is the
 237 bottleneck of the whole process. For example, the camera in the setup
 238 of \cite{AFMCSEM11} provides $1024\times 1204$ pixels with an
 239 exposition time of 2.5ms. Thus, if the pixel extraction time is
 240 neglected, each phase calculation of a 100-cantilever array should
 241 take no more than 12.5$\mu$s.
 242
 243 In fact, this timing is a very hard constraint. To illustrate this point, we
 244 consider a very small program that initializes twenty million of doubles in
 245 memory and then does 1,000,000 cumulated sums on 20 contiguous values
 246 (experimental profiles have about this size). On an intel Core 2 Duo E6650
 247 at 2.33GHz, this program reaches an average of 155Mflops.
 248 Obviously, some cache effects and optimizations on huge amount of
 249 computations can drastically increase these performances: peak efficiency is
 250 about 2.5Gflops for the considered CPU. But this is not the case for phase
 251 computation that is using only a few tenth of values.
 252
 253 In order to evaluate the original algorithm, we translated it in C
 254 language.  As stated in section \ref{sec:algo-comp}, for 20 pixels, it
 255 does about 1,550 operations, thus an estimated execution time of
 256 $1,550/155=$10$\mu$s. For a more realistic evaluation, we constructed
 257 a file of 1Mo containing 200 profiles of 20 pixels, equally
 258 scattered. This file is equivalent to an image stored in a device file
 259 representing the camera. We obtained an average of 10.5$\mu$s by
 260 profile (including I/O accesses). It is under our requirements but
 261 close to the limit. In case of an occasional load of the system, it
 262 could be largely overtaken. Solutions would be to use a real-time
 263 operating system or to search for a more efficient algorithm.
 264
 265 However, the main drawback is the latency of such a solution because each
 266 profile must be treated one after another and the deflection of 100
 267 cantilevers takes about $200\times 10.5=2.1$ms. This would be inadequate
 268 for real-time requirements as for individual cantilever active control. An
 269 obvious solution is to parallelize the computations, for example on a GPU.
 270 Nevertheless, the cost of transferring profile in GPU memory and of taking
 271 back results would be prohibitive compared to computation time.
 272
 273 It should be noticed that when possible, it is more efficient to
 274 pipeline the computation. For example, supposing that 200 profiles of
 275 20 pixels could be pushed sequentially in a pipelined unit cadenced at
 276 a 100MHz (i.e. a pixel enters in the unit each 10ns), all profiles
 277 would be treated in $200\times 20\times 10.10^{-9}=$ 40$\mu$s plus the
 278 latency of the pipeline. Such a solution would be meeting our
 279 requirements and would be 50 times faster than our C code, and even
 280 more compared to the LabView version. FPGAs are appropriate for such
 281 implementation, so they turn out to be the computation units of choice
 282 to reach our performance requirements. Nevertheless, passing from a C
 283 code to a pipelined version in VHDL is not obvious at all. It can even
 284 be impossible because of FPGA hardware constraints. All these points
 285 are discussed in the following sections.
 286
 287 \section{An hardware/software solution}
 288
 289 \label{sec:solus}
 290
 291 In  this  section we  present  parts  of the  computing  solution  to the  above
 292 requirements. The  hardware part consists in  a high-speed camera linked on an
 293 embedded board hosting  two FPGAs. In this way, the camera  output stream can be
 294 pushed directly  into the FPGA. The software  part is mostly the  VHDL code that
 295 deserializes the camera stream, extracts profiles and computes the deflection.
 296
 297 We first give some general information about FPGAs, then we
 298 describe the FPGA board we use for implementation and finally the two
 299 algorithms for phase computation are detailed. Presentation of VHDL
 300 implementations is postponned until Section \ref{sec:xp-test}.
 301
 302
 303
 304 \subsection{Elements of FPGA architecture and programming}
 305
 306 A field-programmable gate array (FPGA) is an integrated circuit designed to
 307 be configured by the customer. FGPAs are composed of programmable logic
 308 components, called configurable logic blocks (CLB). These blocks mainly
 309 contain look-up tables (LUT), flip/flops (F/F) and latches, organized in one
 310 or more slices connected together. Each CLB can be configured to perform
 311 simple (AND, XOR, ...) or complex combinational functions. They are
 312 interconnected by reconfigurable links. Modern FPGAs contain memory elements
 313 and multipliers which enable to simplify the design and to increase the
 314 performance. Nevertheless, all other complex operations like division and
 315 other functions like trigonometric functions are not available and must be
 316 built by configuring a set of CLBs. Since this is not an obvious task at
 317 all, tools like ISE~\cite{ISE} have been built to do this operation. Such a
 318 software can synthetize a design written in a hardware description language
 319 (HDL), maps it onto CLBs, place/route them for a specific FPGA, and finally
 320 produces a bitstream that is used to configure the FPGA. Thus, from the
 321 developer's point of view, the main difficulty is to translate an algorithm
 322 into HDL code, taking into account FPGA resources and constraints like clock
 323 signals and I/O values that drive the FPGA.
 324
 325 Indeed, HDL programming is very different from classic languages like
 326 C. A program can be seen as a state-machine, manipulating signals that
 327 evolve from state to state. Moreover, HDL instructions can be executed
 328 concurrently. Signals may be combined with basic logic operations to
 329 produce new states that are assigned to another signal. States are mainly expressed as
 330 arrays of bits.  Fortunately, libraries propose higher levels
 331 representations like signed integers, and arithmetic operations.
 332
 333 Furthermore, even if FPGAs are cadenced more slowly than classic processors,
 334 they can perform pipelines as well as parallel operations. A pipeline
 335 consists in cutting a process in a sequence of small tasks, taking the same
 336 execution time. It accepts a new data at each clock top, thus, after a known
 337 latency, it also provides a result at each clock top. The drawback is that the
 338 components of a task are not reusable by another one. Nevertheless, this is
 339 the most efficient technique on FPGAs. Because of their architecture, it is
 340 also very easy to process several data concurrently. Finally, the best
 341 performance can be reached when several pipelines are operating on multiple
 342 data streams in parallel.
 343
 344 \subsection{The FPGA board}
 345
 346 The architecture we use is designed by the Armadeus Systems
 347 company. It consists in a development board called APF27 \textsuperscript{\textregistered}, hosting a
 348 i.MX27 ARM processor (from Freescale) and a Spartan3A (from
 349 Xilinx). This board includes all classical connectors as USB and
 350 Ethernet for instance. A Flash memory contains a Linux kernel that can
 351 be launched after booting the board via u-Boot. The processor is
 352 directly connected to the Spartan3A via its special interface called
 353 WEIM. The Spartan3A is itself connected to an extension board called
 354 SP Vision \textsuperscript{\textregistered}, that hosts a Spartan6 FPGA. Thus, it is
 355 possible to develop programs that communicate between i.MX and
 356 Spartan6, using Spartan3 as a tunnel. A clock signal at 100MHz (by
 357 default) is delivered to dedicated FPGA pins. The Spartan6 of our
 358 board is an LX100 version. It has 15,822 slices, each slice containing
 359 4 LUTs and 8 flip/flops. It is equivalent to 101,261 logic
 360 cells. There are 268 internal block RAM of 18Kbits, and 180 dedicated
 361 multiply-adders (named DSP48), which is largely enough for our
 362 project. Some I/O pins of Spartan6 are connected to two $2\times 17$
 363 headers that can be used for any purpose as to be connected to the
 364 interface of a camera.
 365
 366 \subsection{Two algorithms for phase computation}
 367
 368 As said in section \ref{sec:deflest}, $f$ is computed only once but
 369 the phase needs to be computed for each image. This is why, in this
 370 paper, we focus on its computation. The next section describes the
 371 original method, based on spline interpolation, and section
 372 \ref{sec:algo-square} presents the new one based on least
 373 squares. Finally, in section \ref{sec:algo-comp}, we compare the two
 374 algorithms from their FPGA implementation point of view.
 375
 376 \subsubsection{Spline algorithm (SPL)}
 377
 378 \label{sec:algo-spline}
 379
 380 We denote by $M$ the number of pixels in a segment used for phase
 381 computation. For the sake of simplicity of the notations, we consider
 382 the light intensity $I$ a function on the interval [0,M] which itself
 383 is the range of a one-to-one mapping defined on the physical
 384 segment. The pixels are assumed to be regularly spaced and centered at
 385 the positions $x^{p}\in\{0,1,\ldots,M-1\}.$ We use the simplest
 386 definition of a pixel, namely the value of $I$ at its center. The
 387 pixel intensities are considered as pre-normalized so that their
 388 minimum and maximum have been resized to $-1$ and $1$.
 389
 390 The first step consists in computing the cubic spline interpolation of
 391 the intensities. This allows for interpolating $I$ at a larger number
 392 $L=k\times M$ of points (typically $k=4$ is sufficient) $%
 393 x^{s}$ in the interval $[0,M[$. During the precomputation sequence,
 394 the second step is to determine the affine part $a.x+b$ of $I$. It is
 395 found with an ordinary least square method, taking account the $L$
 396 points. Values of $I$ in $x^s$ are used to compute its intersections
 397 with $a.x+b$. The period of $I$ (and thus its frequency) is deduced
 398 from the number of intersections and the distance between the first
 399 and last.
 400
 401 During the acquisition loop, the second step is the phase computation, with
 402 \begin{equation}
 403 \theta =atan\left[ \frac{\sum_{i=0}^{N-1}\text{sin}(2\pi fx_{i}^{s})\times
 404 I(x_{i}^{s})}{\sum_{i=0}^{N-1}\text{cos}(2\pi fx_{i}^{s})\times I(x_{i}^{s})}%
 405 \right] .
 406 \end{equation}
 407
 408 \textit{Remarks: }
 409
 410 \begin{itemize}
 411 \item The frequency could also be obtained using the derivative of spline
 412 equations, which only implies to solve quadratic equations but certainly
 413 yields higher errors.
 414
 415 \item Profile frequency are computed during the precomputation step,
 416   thus the values sin$(2\pi fx_{i}^{s})$ and cos$(2\pi fx_{i}^{s})$
 417   can be determined once for all.
 418 \end{itemize}
 419
 420 \subsubsection{Least square algorithm (LSQ)}
 421 \label{sec:algo-square}
 422 Assuming that we compute the phase during the acquisition loop,
 423 equation \ref{equ:profile} has only 4 parameters: $a,b,A$, and $\theta
 424 $, $f$ and $x$ being already known. A least
 425 square method based on a Gauss-Newton algorithm can be used to
 426 determine these four parameters.  This kind of iterative process ends
 427 with a convergence criterion, so it is not suited to our design
 428 goals. Fortunately, it is quite simple to reduce the number of
 429 parameters to $\theta$ only. Firstly, the affine part $ax+b$ is
 430 estimated from the $M$ values $I(x^{p})$ to determine the rectified
 431 intensities,%
 432 \begin{equation*}
 433 I^{corr}(x^{p})\approx I(x^{p})-a.x^{p}-b.
 434 \end{equation*}%
 435 To find $a$ and $b$ we apply an ordinary least square method (as in SPL but on $M$ points)%
 436 \begin{equation*}
 437 a=\frac{covar(x^{p},I(x^{p}))}{\text{var}(x^{p})}\text{ and }b=\overline{%
 438 I(x^{p})}-a.\overline{{x^{p}}}
 439 \end{equation*}%
 440 where overlined symbols represent average. Then the amplitude $A$ is
 441 approximated by%
 442 \begin{equation*}
 443 A\approx \frac{\text{max}(I^{corr})-\text{min}(I^{corr})}{2}.
 444 \end{equation*}%
 445 Finally, the problem of approximating $\theta $ is reduced to minimizing%
 446 \begin{equation*}
 447 \min_{\theta \in \lbrack -\pi ,\pi ]}\sum_{i=0}^{M-1}\left[ \text{cos}(2\pi
 448 f.i+\theta )-\frac{I^{corr}(i)}{A}\right] ^{2}.
 449 \end{equation*}%
 450 An optimal value $\theta ^{\ast }$ of the minimization problem is a zero of
 451 the first derivative of the above argument,%\begin{eqnarray*}{l}
 452 \begin{equation*}
 453 2\left[ \text{cos}\theta ^{\ast }\sum_{i=0}^{M-1}I^{corr}(i).\text{sin}(2\pi
 454 f.i)\right.
 455 \end{equation*}%
 456 \begin{equation*}
 457 \left. +\text{sin}\theta ^{\ast }\sum_{i=0}^{M-1}I^{corr}(i).\text{cos}(2\pi
 458 f.i)\right] -
 459 \end{equation*}%
 460 \begin{equation*}
 461 A\left[ \text{cos}2\theta ^{\ast }\sum_{i=0}^{M-1}\sin (4\pi f.i)+\text{sin}%
 462 2\theta ^{\ast }\sum_{i=0}^{M-1}\cos (4\pi f.i)\right] =0
 463 \end{equation*}%
 464 %
 465 %\end{eqnarray*}
 466
 467 Several points can be noticed:
 468
 469 \begin{itemize}
 470 \item The terms $\sum_{i=0}^{M-1}$sin$(4\pi f.i)$ and$\sum_{i=0}^{M-1}$cos$%
 471 (4\pi f.i)$ are independent of $\theta $, they can be precomputed.
 472
 473 \item Lookup tables (namely lut$_{sfi}$ and lut$_{cfi}$ in the following algorithms) can be
 474   set with the $2.M$ values $\sin (2\pi f.i)$ and $\cos (2\pi f.i)$.
 475
 476 \item A simple method to find a zero $\theta ^{\ast }$ of the
 477   optimality condition is to discretize the range $[-\pi ,\pi ]$ with
 478   a large number $%
 479   nb_{s}$ of nodes and to find which one is a minimizer in the
 480   absolute value sense. Hence, three other lookup tables (lut$_{s}$,
 481   lut$_{c}$ and lut$_{A}$) can be set with the $3\times nb_{s}$ values
 482   $\sin \theta$, $\cos \theta$, and
 483 \begin{equation*}
 484 \left[ cos2\theta \sum_{i=0}^{M-1}sin(4\pi f.i)+sin2\theta
 485 \sum_{i=0}^{M-1}cos(4\pi f.i)\right] .
 486 \end{equation*}
 487
 488 \item The search algorithm can be very fast using a dichotomous process in $%
 489 log_{2}(nb_{s}).$
 490 \end{itemize}
 491
 492 The overall method is synthetized in an algorithm (called LSQ in the
 493 following) divided into the precomputing part and the acquisition loop.
 494
 495 \begin{algorithm}[htbp]
 496 \caption{LSQ algorithm - before acquisition loop.}
 497 \label{alg:lsq-before}
 498
 499    $M \leftarrow $ number of pixels of the profile\\
 500    I[] $\leftarrow $ intensity of pixels\\
 501    $f \leftarrow $ frequency of the profile\\
 502    $s4i \leftarrow \sum_{i=0}^{M-1} sin(4\pi f.i)$\\
 503    $c4i \leftarrow \sum_{i=0}^{M-1} cos(4\pi f.i)$\\
 504    $nb_s \leftarrow $ number of discretization steps of $[-\pi,\pi]$\\
 505
 506    \For{$i=0$ to $nb_s $}{
 507      $\theta  \leftarrow -\pi + 2\pi\times \frac{i}{nb_s}$\\
 508      lut$_s$[$i$] $\leftarrow sin \theta$\\
 509      lut$_c$[$i$] $\leftarrow cos \theta$\\
 510      lut$_A$[$i$] $\leftarrow cos 2 \theta \times s4i + sin 2 \theta \times c4i$\\
 511      lut$_{sfi}$[$i$] $\leftarrow sin (2\pi f.i)$\\
 512      lut$_{cfi}$[$i$] $\leftarrow cos (2\pi f.i)$\\
 513    }
 514 \end{algorithm}
 515
 516 \begin{algorithm}[htbp]
 517 \caption{LSQ algorithm - during acquisition loop.}
 518 \label{alg:lsq-during}
 519
 520    $\bar{x} \leftarrow \frac{M-1}{2}$\\
 521    $\bar{y} \leftarrow 0$, $x_{var} \leftarrow 0$, $xy_{covar} \leftarrow 0$\\
 522    \For{$i=0$ to $M-1$}{
 523      $\bar{y} \leftarrow \bar{y} + $ I[$i$]\\
 524      $x_{var} \leftarrow x_{var} + (i-\bar{x})^2$\\
 525    }
 526    $\bar{y} \leftarrow \frac{\bar{y}}{M}$\\
 527    \For{$i=0$ to $M-1$}{
 528      $xy_{covar} \leftarrow xy_{covar} + (i-\bar{x}) \times (I[i]-\bar{y})$\\
 529    }
 530    $slope \leftarrow \frac{xy_{covar}}{x_{var}}$\\
 531    $start \leftarrow \bar{y} - slope\times \bar{x}$\\
 532    \For{$i=0$ to $M-1$}{
 533      $I[i] \leftarrow I[i] - start - slope\times i$\\
 534    }
 535
 536    $I_{max} \leftarrow max_i(I[i])$, $I_{min} \leftarrow min_i(I[i])$\\
 537    $amp \leftarrow \frac{I_{max}-I_{min}}{2}$\\
 538
 539    $Is \leftarrow 0$, $Ic \leftarrow 0$\\
 540    \For{$i=0$ to $M-1$}{
 541      $Is \leftarrow Is + I[i]\times $ lut$_{sfi}$[$i$]\\
 542      $Ic \leftarrow Ic + I[i]\times $ lut$_{cfi}$[$i$]\\
 543    }
 544
 545    $\delta \leftarrow \frac{nb_s}{2}$, $b_l \leftarrow 0$, $b_r \leftarrow \delta$\\
 546    $v_l \leftarrow -2.I_s - amp.$lut$_A$[$b_l$]\\
 547
 548    \While{$\delta >= 1$}{
 549
 550      $v_r \leftarrow 2.[ Is.$lut$_c$[$b_r$]$ + Ic.$lut$_s$[$b_r$]$ ] - amp.$lut$_A$[$b_r$]\\
 551
 552      \If{$!(v_l < 0$ and $v_r >= 0)$}{
 553        $v_l \leftarrow v_r$ \\
 554        $b_l \leftarrow b_r$ \\
 555      }
 556      $\delta \leftarrow \frac{\delta}{2}$\\
 557      $b_r \leftarrow b_l + \delta$\\
 558    }
 559    \uIf{$!(v_l < 0$ and $v_r >= 0)$}{
 560      $v_l \leftarrow v_r$ \\
 561      $b_l \leftarrow b_r$ \\
 562      $b_r \leftarrow b_l + 1$\\
 563      $v_r \leftarrow 2.[ Is.$lut$_c$[$b_r$]$ + Ic.$lut$_s$[$b_r$]$ ] - amp.$lut$_A$[$b_r$]\\
 564    }
 565    \Else {
 566      $b_r \leftarrow b_l + 1$\\
 567    }
 568
 569    \uIf{$ abs(v_l) < v_r$}{
 570      $b_{\theta} \leftarrow b_l$ \\
 571    }
 572    \Else {
 573      $b_{\theta} \leftarrow b_r$ \\
 574    }
 575    $\theta \leftarrow \pi\times \left[\frac{2.b_{ref}}{nb_s}-1\right]$\\
 576
 577 \end{algorithm}
 578
 579 \subsubsection{Algorithm comparison}
 580 \label{sec:algo-comp}
 581 We compared the two algorithms regarding three criteria:
 582
 583 \begin{itemize}
 584 \item precision of results on a cosines profile distorted by noise,
 585
 586 \item number of operations,
 587
 588 \item complexity of FPGA implementation.
 589 \end{itemize}
 590
 591 For the first item, we produced a Matlab version of each algorithm,
 592 running in double precision. The profile was generated for about
 593 34,000 different quadruplets of periods ($\in \lbrack 3.1,6.1]$, step
 594 = 0.1), phases ($\in \lbrack -3.1,3.1]$, steps = 0.062) and slopes
 595 ($\in \lbrack -2,2]$, step = 0.4). Obviously, the discretization of
 596 $[-\pi ,\pi ]$ introduces an error in the phase estimation. It is at
 597 most equal to $\frac{\pi}{nb_s}$. From some experiments on a $17\times
 598 4$ array, we noticed an average ratio of 50
 599 between phase variation in radians and lever end position in
 600 nanometers. Assuming such a ratio and $nb_s = 1024$, the maximum lever
 601 deflection error would be 0.15nm which is smaller than 0.3nm, the best
 602 precision achieved with the setup used.
 603
 604 Moreover, pixels have been paired and the paired intensities have been
 605 perturbed by addition of a random number uniformly picked in
 606 $[-N,N]$. Notice that we have observed that perturbing each pixel
 607 independently yields too weak profile distortion. We report
 608 percentages of errors between the reference and the computed phases
 609 out of $2\pi ,$%
 610 \begin{equation*}
 611 err=100\times \frac{|\theta _{ref}-\theta _{comp}|}{2\pi }.
 612 \end{equation*}%
 613 Table \ref{tab:algo_prec} gives the maximum and the average errors for both
 614 algorithms and for increasing values of $N$ the noise parameter.
 615
 616 \begin{table}[ht]
 617 \begin{center}
 618 \begin{tabular}{|c|c|c|c|c|}
 619 \hline
 620 & \multicolumn{2}{c|}{SPL} & \multicolumn{2}{c|}{LSQ} \\ \cline{2-5}
 621 noise (N)& max. err. & aver. err. & max. err. & aver. err. \\ \hline
 622 0 & 2.46 & 0.58 & 0.49 & 0.1 \\ \hline
 623 2.5 & 2.75 & 0.62 & 1.16 & 0.22 \\ \hline
 624 5 & 3.77 & 0.72 & 2.47 & 0.41 \\ \hline
 625 7.5 & 4.72 & 0.86 & 3.33 & 0.62 \\ \hline
 626 10 & 5.62 & 1.03 & 4.29 & 0.81 \\ \hline
 627 15 & 7.96 & 1.38 & 6.35 & 1.21 \\ \hline
 628 30 & 17.06 & 2.6 & 13.94 & 2.45 \\ \hline
 629 \end{tabular}%
 630 \end{center}
 631 \caption{Error (in \%) for cosines profiles, with noise.}
 632 \label{tab:algo_prec}
 633 \end{table}
 634
 635 The results show that the two algorithms yield close results, with a slight
 636 advantage for LSQ. Furthermore, both behave very well against noise.
 637 Assuming an average ratio of 50 (see above), an error of 1 percent on
 638 the phase corresponds to an error of 0.5nm on the lever deflection, which is
 639 very close to the best precision.
 640
 641 It is very hard to predict which level of noise will be present in
 642 real experiments and how it will distort the profiles. Results on
 643 a $17\times 4$ array allowed us to compare experimental profiles to
 644 simulated ones. We can see on figure \ref{fig:noise20} the profile
 645 with $N=10$ that leads to the biggest error. It is a bit distorted,
 646 with pikes and straight/rounded portions. In fact, it is very close to
 647 some of the worst experimental profiles. Figure \ref{fig:noise60}
 648 shows a sample of worst profile for $N=30$. It is completely
 649 distorted, largely beyond any experimental ones. Obviously, these
 650 comparisons are a bit subjective and experimental profiles could also
 651 be more distorted on other experiments. Nevertheless, they give an
 652 idea about the possible error.
 653
 654 \begin{figure}[ht]
 655 \begin{center}
 656 \includegraphics[width=\columnwidth]{intens-noise20}
 657 \end{center}
 658 \caption{Sample of worst profile for N=10}
 659 \label{fig:noise20}
 660 \end{figure}
 661
 662 \begin{figure}[ht]
 663 \begin{center}
 664 \includegraphics[width=\columnwidth]{intens-noise60}
 665 \end{center}
 666 \caption{Sample of worst profile for N=30}
 667 \label{fig:noise60}
 668 \end{figure}
 669
 670 The second criterion is relatively easy to estimate for LSQ and harder for
 671 SPL because of the use of the arctangent function. In both cases, the number
 672 of operation is proportional to $M$ the number of pixels. For LSQ, it also
 673 depends on $nb_{s}$ and for SPL on $L=k\times M$ the number of interpolated
 674 points. We assume that $M=20$, $nb_{s}=1024$ and $k=4$, that all possible
 675 parts are already in lookup tables and that a limited set of operations (+,
 676 -, *, /, $<$, $>$) is taken into account. Translating both algorithms in C
 677 code, we obtain about 430 operations for LSQ and 1,550 (plus a few tenth for
 678 $atan$) for SPL. This result is largely in favor of LSQ. Nevertheless,
 679 considering the total number of operations is not fully relevant for FPGA
 680 implementation for which time and space consumption depends not only on the type
 681 of operations but also of their ordering. The final evaluation is thus very
 682 much driven by the third criterion.
 683
 684 The Spartan 6 used in our architecture has a hard constraint since it
 685 has no built-in floating point units. Obviously, it is possible to use
 686 some existing "black-boxes" for double precision operations. But they
 687 require a lot of clock cycles to complete. It is much simpler to
 688 exclusively use integers, with a quantization of all double precision
 689 values. It should be chosen in a manner that does not alterate result
 690 precision. Furthermore, it should not lead to a design with a huge
 691 latency because of operations that could not complete during a single
 692 or few clock cycles. Divisions fall into that category and, moreover,
 693 they need a varying number of clock cycles to complete. Even
 694 multiplications can be a problem since a DSP48 takes inputs of 18 bits
 695 maximum. So, for larger multiplications, several DSP must be combined
 696 which increases the overall latency.
 697
 698 Nevertheless, in the present algorithms, the hardest constraint does
 699 not come from the FPGA characteristics but from the algorithms
 700 themselves. Their VHDL implementation can be efficient only if they
 701 can be fully (or near) pipelined. We observe that only a small part of
 702 SPL can be pipelined, indeed, the computation of spline coefficients
 703 implies to solve a linear tridiagonal system which matrix and
 704 right-hand side are computed from incoming pixels intensity but after,
 705 the back-solve starts with the latest values, which breaks the
 706 pipeline.  Moreover, SPL relies on interpolating far more points than
 707 profile size.  Thus, the end of SPL works on a larger amount of data
 708 than at the beginning, which also breaks the pipeline.
 709
 710 LSQ has not this problem since all parts, except the dichotomic search, work
 711 on the same amount of data, i.e. the profile size. Furthermore, LSQ requires
 712 less operations than SPL, implying a smaller output latency. In total, LSQ
 713 turns out to be the best candidate for phase computation on any architecture
 714 including FPGA.
 715
 716 \section{VHDL implementation and experimental tests}
 717
 718 \label{sec:xp-test}
 719
 720 \subsection{VHDL implementation}
 721
 722 From the LSQ algorithm, we have written a C program that uses only
 723 integer values. We used a very simple quantization which consists in
 724 multiplying each double precision value by a factor power of two and
 725 by keeping the integer part. For an accurate evaluation of the
 726 division in the computation of $a$ the slope coefficient, we also
 727 scaled the pixel intensities by another power of two. The main problem
 728 was to determine these factors. Most of the time, they are chosen to
 729 minimize the error induced by the quantization. But in our case, we
 730 also have some hardware constraints, for example the width and depth of
 731 RAMs or the input size of DSPs. Thus, having a maximum of values that
 732 fit in these sizes is a very important criterion to choose the scaling
 733 factors.
 734
 735 Consequently, we have determined the maximum value of each variable as
 736 a function of the scale factors and the profile size involved in the
 737 algorithm. It gave us the maximum number of bits necessary to code
 738 them. We have chosen the scale factors so that any variable (except
 739 the covariance) fits in 18 bits, which is the maximum input size of
 740 DSPs. In this way, all multiplications (except one with covariance)
 741 could be done with a single DSP, in a single clock cycle. Moreover,
 742 assuming that $nb_s = 1024$, all LUTs could fit in the 18Kbits
 743 RAMs. Finally, we compared the double and integer versions of LSQ and
 744 found a nearly perfect agreement between their results.
 745
 746 As mentionned above, some operations like divisions must be
 747 avoided. But when the divisor is fixed, a division can be replaced
 748 by its multiplication/shift counterpart. This is always the case in
 749 LSQ. For example, assuming that $M$ is fixed, $x_{var}$ is known and
 750 fixed. Thus, $\frac{xy_{covar}}{x_{var}}$ can be replaced by
 751
 752 \[ (xy_{covar}\times \left \lfloor\frac{2^n}{x_{var}} \right \rfloor) \gg n\]
 753
 754 where $n$ depends on the desired precision (in our case $n=24$).
 755
 756 Obviously, multiplications and divisions by a power of two can be
 757 replaced by left or right bit shifts. Finally, the code only contains
 758 shifts, additions, subtractions and multiplications of signed integers, which
 759 are perfectly adapted to FGPAs.
 760
 761
 762 We built two versions of VHDL codes, namely one directly by hand
 763 coding and the other with Matlab using the Simulink HDL coder feature~\cite%
 764 {HDLCoder}. Although the approaches are completely different we obtained
 765 quite comparable VHDL codes. Each approach has advantages and drawbacks.
 766 Roughly speaking, hand coding provides beautiful and much better structured
 767 code while Simulink HDL coder allows  fast code production. In
 768 terms of throughput and latency, simulations show that the two approaches
 769 yield close results with a slight advantage for hand coding.
 770
 771 \subsection{Simulation}
 772
 773 Before experimental tests on the FPGA board, we simulated our two VHDL
 774 codes with GHDL and GTKWave (two free tools with linux). We built a
 775 testbench based on experimental profiles and compared the results to
 776 values given by the SPL algorithm. Both versions lead to correct
 777 results. Our first codes were highly optimized, indeed the pipeline
 778 could compute a new phase each 33 cycles and its latency was equal to
 779 95 cycles. Since the Spartan6 is clocked at 100MHz, estimating the
 780 deflection of 100 cantilevers would take about $%
 781 (95+200\times 33).10=66.95\mu $s, i.e. nearly 15,000 estimations by
 782 second.
 783
 784 \subsection{Bitstream creation}
 785
 786 In order to test our code on the SP Vision board, the design was
 787 extended with a component that keeps profiles in RAM, flushes them in
 788 the phase computation component and stores its output in another
 789 RAM. We also added components that implement the wishbone protocol,
 790 in order to "drive" signals to communicate between i.MX and other
 791 components. It is mainly used to start to flush profiles and to
 792 retrieve the computed phases in RAM. Unfortunately, the first designs
 793 could not be placed and routed with ISE on the Spartan6 with a 100MHz
 794 clock. The main problems were encountered with series of arithmetic
 795 operations and more especially with RAM outputs used in DSPs. So, we
 796 needed to decompose some parts of the pipeline, which added few clock
 797 cycles. Finally, we obtained a bitstream that has been successfully
 798 tested on the board.
 799
 800 Its latency is of 112 cycles and it computes a new phase every 40
 801 cycles. For 100 cantilevers, it takes $(112+200\times 40)\times 10ns =81.12\mu
 802 $s to compute their deflection. It corresponds to about 12300 images
 803 per second, which is largely beyond the camera capacities and the
 804 possibility to extract a new profile from an image every 40
 805 cycles. Nevertheless, it also largely fits our design goals.
 806
 807 \section{Conclusion and perspectives}
 808
 809 In this paper we have presented a full hardware/software solution for
 810 real-time cantilever deflection computation from interferometry images.
 811 Phases are computed thanks to a new algorithm based on the least square
 812 method. It has been quantized and pipelined to be mapped into a FPGA, the
 813 architecture of our solution. Performances have been analyzed through
 814 simulations and real experiments on a Spartan6 FPGA. The results meet our
 815 initial requirements. In future work, the algorithm quantization will be
 816 better analyzed and an high speed camera will be introduced in the
 817 processing chain so that to process real images. Finally, we will address
 818 real-time filtering and control problems for AFM arrays in dynamic regime.
 819
 820 %\section{Acknowledgments}
 821 %We would like to thank A. Meister and M. Favre, from CSEM, for sharing all the
 822 %material we used to write this article and for the time they spent to
 823 %explain us their approach.
 824
 825 \bibliographystyle{plain}
 826 \bibliography{biblio}
 827
 828 \end{document}