dmems12.tex

   1 %\usepackage{latex8}
   2 %\usepackage{times}
   3 %\usepackage[cyr]{aeguill}
   4 %\usepackage{pstricks,pst-node,pst-text,pst-3d}
   5 %\usepackage{babel}
   6 %\input{psfig.sty}
   7 %%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
   8
   9
  10 \documentclass[10pt, peerreview, compsocconf]{IEEEtran}
  11 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  12 \usepackage[utf8]{inputenc}
  13 \usepackage{amsmath}
  14 \usepackage{url}
  15 \usepackage{graphicx}
  16 \usepackage{thumbpdf}
  17 \usepackage{color}
  18 \usepackage{moreverb}
  19 \usepackage{commath}
  20 \usepackage{subfigure}
  21 \usepackage{fullpage}
  22 \usepackage{fancybox}
  23 \usepackage[ruled,lined,linesnumbered]{algorithm2e}
  24
  25 \setcounter{MaxMatrixCols}{10}
  26 %TCIDATA{OutputFilter=LATEX.DLL}
  27 %TCIDATA{Version=5.50.0.2953}
  28 %TCIDATA{<META NAME="SaveForMode" CONTENT="1">}
  29 %TCIDATA{BibliographyScheme=BibTeX}
  30 %TCIDATA{LastRevised=Wednesday, October 26, 2011 09:49:54}
  31 %TCIDATA{<META NAME="GraphicsSave" CONTENT="32">}
  32
  33 \newcommand{\noun}[1]{\textsc{#1}}
  34 \newcommand{\tab}{\ \ \ }
  35
  36
  37 \begin{document}
  38
  39 \title{A new approach based on a least square method for real-time estimation of cantilever array deflections with a FPGA}
  40 \author{\IEEEauthorblockN{Raphaël Couturier\IEEEauthorrefmark{1}, Stéphane
  41 Domas\IEEEauthorrefmark{1}, Gwenhaël Goavec-Merou\IEEEauthorrefmark{2} and
  42 Michel Lenczner\IEEEauthorrefmark{2}}
  43 \IEEEauthorblockA{\IEEEauthorrefmark{1}FEMTO-ST, DISC, University of Franche-Comte, Belfort, France \and
  44 \{raphael.couturier,stephane.domas\}@univ-fcomte.fr}
  45 \IEEEauthorblockA{\IEEEauthorrefmark{2}FEMTO-ST, Time-Frequency, University of Franche-Comte, Besançon, France \and
  46 \{michel.lenczner@utbm.fr,gwenhael.goavec@trabucayre.com} }
  47
  48 \begin{abstract}
  49 Atomic force microscopes (AFM) provide high resolution images of surfaces.
  50 In this paper, we focus our attention on an interferometry method for
  51 deflection estimation of cantilever arrays in quasi-static regime. In its
  52 original form, spline interpolation was used to determine interference
  53 fringe phase, and thus the deflections. Computations were performed on a PC.
  54 Here, we propose a new complete solution with a least square based algorithm
  55 and an optimized FPGA implementation. Simulations and real tests showed very
  56 good results and open perspective for real-time estimation and control of
  57 cantilever arrays in the dynamic regime.
  58 \end{abstract}
  59
  60 %% \author{\IEEEauthorblockN{Authors Name/s per 1st Affiliation (Author)}
  61 %% \IEEEauthorblockA{line 1 (of Affiliation): dept. name of organization\\
  62 %% line 2: name of organization, acronyms acceptable\\
  63 %% line 3: City, Country\\
  64 %% line 4: Email: name@xyz.com}
  65 %% \and
  66 %% \IEEEauthorblockN{Authors Name/s per 2nd Affiliation (Author)}
  67 %% \IEEEauthorblockA{line 1 (of Affiliation): dept. name of organization\\
  68 %% line 2: name of organization, acronyms acceptable\\
  69 %% line 3: City, Country\\
  70 %% line 4: Email: name@xyz.com}
  71 %% }
  72
  73 %\maketitle
  74
  75 \thispagestyle{empty}
  76
  77 \begin{IEEEkeywords}
  78 FPGA, cantilever arrays, interferometry.
  79 \end{IEEEkeywords}
  80
  81 \IEEEpeerreviewmaketitle
  82
  83 \section{Introduction}
  84
  85 Cantilevers are used in atomic force microscopes (AFM) which provide high
  86 resolution surface images. Several techniques have been reported in
  87 literature for cantilever displacement measurement. In~\cite{CantiPiezzo01},
  88 authors have shown how a piezoresistor can be integrated into a cantilever
  89 for deflection measurement. Nevertheless this approach suffers from the
  90 complexity of the microfabrication process needed to implement the sensor.
  91 In~\cite{CantiCapacitive03}, authors have presented a cantilever mechanism
  92 based on capacitive sensing. These techniques require cantilever
  93 instrumentation resulting in\ complex fabrication processes.
  94
  95 In this paper  our attention is focused on a method  based on interferometry for
  96 cantilever  displacement  measurement in  quasi-static  regime. Cantilevers  are
  97 illuminated  by an  optical  source.  Interferometry  produces fringes  enabling
  98 cantilever displacement computation. A high  speed camera is used to analyze the
  99 fringes. In view of real time  applications, images need to be processed quickly
 100 and then a  fast estimation method is required to  determine the displacement of
 101 each  cantilever. In~\cite{AFMCSEM11},  an algorithm  based on  spline  has been
 102 introduced  for  cantilever  position  estimation.  The  overall  process  gives
 103 accurate results  but computations  are performed on  a standard  computer using
 104 LabView      \textsuperscript{\textregistered}     \textsuperscript{\copyright}.
 105 Consequently, the main drawback of this implementation is that the computer is a
 106 bottleneck. In this  paper we pose the problem  of real-time cantilever position
 107 estimation and  bring a  hardware/software solution. It  includes a  fast method
 108 based on least squares and its FPGA implementation.
 109
 110 The remainder of the paper is organized as follows. Section~\ref{sec:measure}
 111 describes the measurement process. Our solution based on the least square
 112 method and its implementation on a FPGA is presented in Section~\ref{sec:solus}. Numerical experimentations are described in Section~\ref{sec:results}. Finally a conclusion and some perspectives are drawn.
 113
 114 \section{Architecture and goals}
 115
 116 \label{sec:measure}
 117
 118 In order to develop simple, cost effective and user-friendly cantilever
 119 arrays, authors of ~\cite{AFMCSEM11} have developed a system based on
 120 interferometry.
 121
 122 \subsection{Experimental setup}
 123
 124 \label{sec:archi}
 125
 126 In opposition to other optical based system\textbf{s u}sing a laser beam
 127 deflection scheme and sensitive to the angular displacement of the
 128 cantilever, interferometry is sensitive to the optical path difference
 129 induced by the vertical displacement of the cantilever.
 130
 131 The system is based on a Linnick interferometer~\cite{Sinclair:05}.
 132 It is illustrated in Figure~\ref{fig:AFM} \footnote{by courtesy of
 133   CSEM}. A laser diode is first split (by the splitter) into a
 134 reference beam and a sample beam both reaching the cantilever array.
 135 The complete system including a cantilever array\ and the optical
 136 system can be moved thanks to a translation and rotational hexapod
 137 stage with five degrees of freedom. Thus, the cantilever array is
 138 centered in the optical system which can be adjusted accurately.  The
 139 beam illuminates the array by a microscope objective and the light
 140 reflects on the cantilevers. Likewise the reference beam reflects on a
 141 movable mirror. A CMOS camera chip records the reference and sample
 142 beams which are recombined in the beam splitter and the
 143 interferogram. At the beginning of each experiment, the movable mirror
 144 is fitted manually in order to align the interferometric fringes
 145 approximately parallel to the cantilevers. Then, cantilever motion in
 146 the transverse direction produces movements in the fringes. They are
 147 detected with the CMOS camera which images are analyzed by a Labview
 148 program to recover the cantilever deflections.
 149
 150 \begin{figure}[tbp]
 151 \begin{center}
 152 \includegraphics[width=\columnwidth]{AFM}
 153 \end{center}
 154 \caption{AFM Setup}
 155 \label{fig:AFM}
 156 \end{figure}
 157
 158 %% image tirée des expériences.
 159
 160 \subsection{Inteferometric based cantilever deflection estimation}
 161
 162 \label{sec:deflest}
 163
 164 \begin{figure}[tbp]
 165 \begin{center}
 166 \includegraphics[width=\columnwidth]{lever-xp}
 167 \end{center}
 168 \caption{Portion of a camera image showing moving interferometric fringes in
 169 cantilevers}
 170 \label{fig:img-xp}
 171 \end{figure}
 172
 173 As shown in Figure \ref{fig:img-xp} \footnote{by courtesy of CSEM}, each
 174 cantilever is covered by several interferometric fringes. The fringes
 175 distort when cantilevers are deflected.  In \cite{AFMCSEM11}, a novel
 176 method for interferometric based cantilever deflection measurement was
 177 reported. For each cantilever, the method uses three segments of pixels,
 178 parallel to its section, to determine phase shifts.  The first is
 179 located just above the AFM tip (tip profile), it provides the phase
 180 shift modulo $2\pi $. The second one is close to the base junction
 181 (base profile) and is used to determine the exact multiple of $2\pi $
 182 through an operation called unwrapping where it is assumed that the
 183 deflection means along the two measurement segments are linearly
 184 dependent.  The third is on the base and provides a reference for
 185 noise suppression.  Finally, deflections are simply derived from phase
 186 shifts.\\
 187
 188 The pixel gray-level intensity $I$ of each profile is modelized by%
 189 \begin{equation}
 190 I(x)=A\text{ }\cos (2\pi fx+\theta )+ax+b  \label{equ:profile}
 191 \end{equation}%
 192 where $x$ denotes the position of a pixel in a segment, $A$, $f$ and $\theta
 193 $ are the amplitude, the frequency and the phase of the light signal when
 194 the affine function $ax+b$ corresponds to the cantilever array surface tilt
 195 with respect to the light source.
 196
 197 The method consists in two main sequences.  In the first one
 198 corresponding to precomputation, the frequency $f$ of each profile is
 199 determined using a spline interpolation (see section \ref%
 200 {sec:algo-spline}) and the coefficient used for phase unwrapping is
 201 computed. The second one, that we call the \textit{acquisition loop,}
 202 is done after images have been taken at regular time steps. For each
 203 image, the phase $\theta $ of all profiles is computed to obtain,
 204 after unwrapping, the cantilever deflection. The phase determination
 205 in \cite{AFMCSEM11} is achieved by a spline based algorithm which is
 206 the most consuming part of the computation. In this article, we
 207 propose an alternate version based on the least square method which is
 208 faster and better suited for FPGA implementation.
 209
 210 \subsection{Computation design goals}
 211
 212 \label{sec:goals}
 213
 214 To evaluate the solution performances, we choose a goal which consists
 215 in designing a computing unit able to estimate the deflections of
 216 a $10\times 10$%
 217 -cantilever array, faster than the camera image stream. In addition,
 218 the result accuracy must be close to 0.3nm, the maximum precision
 219 reached in \cite{AFMCSEM11}. Finally, the latency between the entrance
 220 of the first pixel of an image and the end of deflection computation
 221 must be as small as possible. All these requirement are
 222 stated in the perspective of implementing real-time active control for
 223 each cantilever, see~\cite{LencznerChap10,Hui11}.
 224
 225 If we put aside other hardware issues like the speed of the link
 226 between the camera and the computation unit, the time to deserialize
 227 pixels and to store them in memory, the phase computation is the
 228 bottleneck of the whole process. For example, the camera in the setup
 229 of \cite{AFMCSEM11} provides $%
 230 1024\times 1204$ pixels with an exposition time of 2.5ms. Thus, if we
 231 the pixel extraction time is neglected, each phase calculation of a
 232 100-cantilever array should take no more than 12.5$\mu$s. \newline
 233
 234 In fact, this timing is a very hard constraint. To illustrate this point, we
 235 consider a very small program that initializes twenty million of doubles in
 236 memory and then does 1,000,000 cumulated sums on 20 contiguous values
 237 (experimental profiles have about this size). On an intel Core 2 Duo E6650
 238 at 2.33GHz, this program reaches an average of 155Mflops.
 239 Obviously, some cache effects and optimizations on huge amount of
 240 computations can drastically increase these performances: peak efficiency is
 241 about 2.5Gflops for the considered CPU. But this is not the case for phase
 242 computation that is using only a few tenth of values.\newline
 243
 244 In order to evaluate the original algorithm, we translated it in C language.
 245 As stated before, for 20 pixels, it does about 1,550 operations, thus an
 246 estimated execution time of $1,550/155=$10$\mu $s. For a more realistic
 247 evaluation, we constructed a file of 1Mo containing 200 profiles of 20
 248 pixels, equally scattered. This file is equivalent to an image stored in a
 249 device file representing the camera. We obtained an average of 10.5$\mu$s
 250 by profile (including I/O accesses). It is under our requirements but close
 251 to the limit. In case of an occasional load of the system, it could be
 252 largely overtaken. Solutions would be to use a real-time operating system or
 253 to search for a more efficient algorithm.\newline
 254
 255 However, the main drawback is the latency of such a solution because each
 256 profile must be treated one after another and the deflection of 100
 257 cantilevers takes about $200\times 10.5=2.1$ms. This would be inadequate
 258 for real-time requirements as for individual cantilever active control. An
 259 obvious solution is to parallelize the computations, for example on a GPU.
 260 Nevertheless, the cost of transferring profile in GPU memory and of taking
 261 back results would be prohibitive compared to computation time.\newline
 262
 263 We remark that when possible, it is more efficient to pipeline the
 264 computation. For example, supposing that 200 profiles of 20 pixels
 265 could be pushed sequentially in a pipelined unit cadenced at a 100MHz
 266 (i.e. a pixel enters in the unit each 10ns), all profiles would be
 267 treated in $200\times 20\times 10.10^{-9}=$ 40$\mu$s plus the latency
 268 of the pipeline. Such a solution would be meeting our requirements and
 269 would be 50 times faster than our C code, and even more compared to
 270 the LabView version use in \cite{AFMCSEM11}. FPGAs are appropriate for
 271 such implementation, so they turn out to be the computation units of
 272 choice to reach our performance requirements. Nevertheless, passing
 273 from a C code to a pipelined version in VHDL is not obvious at all. It
 274 can even be impossible because of FPGA hardware constraints. All these
 275 points are discussed in the following sections.
 276
 277 \section{An hardware/software solution}
 278
 279 \label{sec:solus}
 280
 281 In this section we present part of the computing solution to the above
 282 requirements. We first give some general information about FPGAs, then we
 283 describe the FPGA board we use for implementation and finally the two
 284 algorithms for phase computation are detailed. Presentation of VHDL
 285 implementations is postponned until Section \ref{Experimental tests}.
 286 \newline
 287
 288 The hardware part consists in a high-speed camera, linked on an embedded
 289 board hosting two FPGAs. In this way, the camera output stream can be pushed
 290 directly into the FPGA. The software part is mostly the VHDL code that
 291 deserializes the camera stream, extracts profiles and computes the
 292 deflection. Before to present the board we use, we give some general
 293 information about FPGAs.
 294
 295 \subsection{Elements of FPGA architecture and programming}
 296
 297 A field-programmable gate array (FPGA) is an integrated circuit designed to
 298 be configured by the customer. FGPAs are composed of programmable logic
 299 components, called configurable logic blocks (CLB). These blocks mainly
 300 contain look-up tables (LUT), flip/flops (F/F) and latches, organized in one
 301 or more slices connected together. Each CLB can be configured to perform
 302 simple (AND, XOR, ...) or complex combinational functions. They are
 303 interconnected by reconfigurable links. Modern FPGAs contain memory elements
 304 and multipliers which enable to simplify the design and to increase the
 305 performance. Nevertheless, all other complex operations like division and
 306 other functions like trigonometric functions are not available and must be
 307 built by configuring a set of CLBs. Since this is not an obvious task at
 308 all, tools like ISE~\cite{ISE} have been built to do this operation. Such a
 309 software can synthetize a design written in a hardware description language
 310 (HDL), maps it onto CLBs, place/route them for a specific FPGA, and finally
 311 produces a bitstream that is used to configure the FPGA. Thus, from the
 312 developer's point of view, the main difficulty is to translate an algorithm
 313 into HDL code, taking into account FPGA resources and constraints like clock
 314 signals and I/O values that drive the FPGA.
 315
 316 Indeed, HDL programming is very different from classic languages like
 317 C. A program can be seen as a state-machine, manipulating signals that
 318 evolve from state to state. Moreover, HDL instructions can be executed
 319 concurrently. Signals may be combined with basic logic operations to
 320 produce new states that are assigned to another signal. States are mainly expressed as
 321 arrays of bits.  Fortunately, libraries propose some higher levels
 322 representations like signed integers, and arithmetic operations.
 323
 324 Furthermore, even if FPGAs are cadenced more slowly than classic processors,
 325 they can perform pipelines as well as parallel operations. A pipeline
 326 consists in cutting a process in a sequence of small tasks, taking the same
 327 execution time. It accepts a new data at each clock top, thus, after a known
 328 latency, it also provides a result at each clock top. We observe that the
 329 components of a task are not reusable by another one. Nevertheless, this is
 330 the most efficient technique on FPGAs. Because of their architecture, it is
 331 also very easy to process several data concurrently. Finally, the best
 332 performance can be reached when several pipelines are operating on multiple
 333 data streams in parallel.
 334
 335 \subsection{The FPGA board}
 336
 337 The architecture we use is designed by the Armadeus Systems
 338 company. It consists in a development board called APF27 \textsuperscript{\textregistered}, hosting a
 339 i.MX27 ARM processor (from Freescale) and a Spartan3A (from
 340 XIlinx). This board includes all classical connectors as USB and
 341 Ethernet for instance. A Flash memory contains a Linux kernel that can
 342 be launched after booting the board via u-Boot. The processor is
 343 directly connected to the Spartan3A via its special interface called
 344 WEIM. The Spartan3A is itself connected to an extension board called
 345 SP Vision \textsuperscript{\textregistered}, that hosts a Spartan6 FPGA. Thus, it is
 346 possible to develop programs that communicate between i.MX and
 347 Spartan6, using Spartan3 as a tunnel. A clock signal at 100MHz (by
 348 default) is delivered to dedicated FPGA pins. The Spartan6 of our
 349 board is an LX100 version. It has 15,822 slices, each slice containing
 350 4 LUTs and 8 flip/flops. It is equivalent to 101,261 logic
 351 cells. There are 268 internal block RAM of 18Kbits, and 180 dedicated
 352 multiply-adders (named DSP48), which is largely enough for our
 353 project. Some I/O pins of Spartan6 are connected to two $2\times 17$
 354 headers that can be used for any purpose as to be connected to the
 355 interface of a camera.
 356
 357 \subsection{Two algorithms for phase computation}
 358
 359 In \cite{AFMCSEM11}, $f$ the frequency and $\theta $\ the phase of the
 360 light wave are computed thanks to spline interpolation. As said in
 361 section \ref{sec:deflest}, $f$ is computed only once time but the
 362 phase needs to be computed for each image. This is why, in this paper,
 363 we focus on its computation.
 364
 365 \subsubsection{Spline algorithm (SPL)}
 366
 367 \label{sec:algo-spline}
 368
 369 We denote by $M$ the number of pixels in a segment used for phase
 370 computation. For the sake of simplicity of the notations, we consider
 371 the light intensity $I$ to be a mapping of the physical segment in the
 372 interval $[0,M[$. The pixels are assumed to be regularly spaced and
 373 centered at the positions $x^{p}\in\{0,1,\ldots,M-1\}.$ We use the simplest
 374 definition of a pixel, namely the value of $I$ at its center. The
 375 pixel intensities are considered as pre-normalized so that their
 376 minimum and maximum have been resized to $-1$ and $1$.
 377
 378 The first step consists in computing the cubic spline interpolation of
 379 the intensities. This allows for interpolating $I$ at a larger number
 380 $L=k\times M$ of points (typically $k=4$ is sufficient) $%
 381 x^{s}$ in the interval $[0,M[$. During the precomputation sequence,
 382 the second step is to determin the afine part $a.x+b$ of $I$. It is
 383 found with an ordinary least square method, taking account the $L$
 384 points. Values of $I$ in $x^s$ are used to compute its intersections
 385 with $a.x+b$. The period of $I$ (and thus its frequency) is deduced
 386 from the number of intersections and the distance between the first
 387 and last.
 388
 389 During the acquisition loop, the second step is the phase computation, with
 390 \begin{equation}
 391 \theta =atan\left[ \frac{\sum_{i=0}^{N-1}\text{sin}(2\pi fx_{i}^{s})\times
 392 I(x_{i}^{s})}{\sum_{i=0}^{N-1}\text{cos}(2\pi fx_{i}^{s})\times I(x_{i}^{s})}%
 393 \right] .
 394 \end{equation}
 395
 396 \textit{Remarks: }
 397
 398 \begin{itemize}
 399 \item The frequency could also be obtained using the derivates of spline
 400 equations, which only implies to solve quadratic equations but certainly
 401 yields higher errors.
 402
 403 \item Profile frequency are computed during the precomputation step,
 404   thus the values sin$(2\pi fx_{i}^{s})$ and cos$(2\pi fx_{i}^{s})$
 405   can be determined once for all.
 406 \end{itemize}
 407
 408 \subsubsection{Least square algorithm (LSQ)}
 409
 410 Assuming that we compute the phase during the acquisition loop, equation \ref%
 411 {equ:profile} has only 4 parameters: $a,b,A$, and $\theta $, $f$ and $x$
 412 being already known. Since $I$ is non-linear, a least square method based on
 413 a Gauss-newton algorithm can be used to determine these four parameters.
 414 This kind of iterative process ends with a convergence criterion, so it is
 415 not suited to our design goals. Fortunately, it is quite simple to reduce
 416 the number of parameters to only $\theta $. Firstly, the afine part $ax+b$
 417 is estimated from the $M$ values $I(x^{p})$ to determine the rectified
 418 intensities,%
 419 \begin{equation*}
 420 I^{corr}(x^{p})\approx I(x^{p})-a.x^{p}-b.
 421 \end{equation*}%
 422 To find $a$ and $b$ we apply an ordinary least square method (as in SPL but on $M$ points)%
 423 \begin{equation*}
 424 a=\frac{covar(x^{p},I(x^{p}))}{\text{var}(x^{p})}\text{ and }b=\overline{%
 425 I(x^{p})}-a.\overline{{x^{p}}}
 426 \end{equation*}%
 427 where overlined symbols represent average. Then the amplitude $A$ is
 428 approximated by%
 429 \begin{equation*}
 430 A\approx \frac{\text{max}(I^{corr})-\text{min}(I^{corr})}{2}.
 431 \end{equation*}%
 432 Finally, the problem of approximating $\theta $ is reduced to minimizing%
 433 \begin{equation*}
 434 \min_{\theta \in \lbrack -\pi ,\pi ]}\sum_{i=0}^{M-1}\left[ \text{cos}(2\pi
 435 f.i+\theta )-\frac{I^{corr}(i)}{A}\right] ^{2}.
 436 \end{equation*}%
 437 An optimal value $\theta ^{\ast }$ of the minimization problem is a zero of
 438 the first derivative of the above argument,%\begin{eqnarray*}{l}
 439 \begin{equation*}
 440 2\left[ \text{cos}\theta ^{\ast }\sum_{i=0}^{M-1}I^{corr}(i).\text{sin}(2\pi
 441 f.i)\right.
 442 \end{equation*}%
 443 \begin{equation*}
 444 \left. +\text{sin}\theta ^{\ast }\sum_{i=0}^{M-1}I^{corr}(i).\text{cos}(2\pi
 445 f.i)\right] -
 446 \end{equation*}%
 447 \begin{equation*}
 448 A\left[ \text{cos}2\theta ^{\ast }\sum_{i=0}^{M-1}\sin (4\pi f.i)+\text{sin}%
 449 2\theta ^{\ast }\sum_{i=0}^{M-1}\cos (4\pi f.i)\right] =0
 450 \end{equation*}%
 451 %
 452 %\end{eqnarray*}
 453
 454 Several points can be noticed:
 455
 456 \begin{itemize}
 457 \item The terms $\sum_{i=0}^{M-1}$sin$(4\pi f.i)$ and$\sum_{i=0}^{M-1}$cos$%
 458 (4\pi f.i)$ are independent of $\theta $, they can be precomputed.
 459
 460 \item Lookup tables (namely lut$_{sfi}$ and lut$_{cfi}$ in the following algorithms) can be
 461   set with the $2.M$ values $\sin (2\pi f.i)$ and $\cos (2\pi f.i)$.
 462
 463 \item A simple method to find a zero $\theta ^{\ast }$ of the optimality
 464 condition is to discretize the range $[-\pi ,\pi ]$ with a large number $%
 465 nb_{s}$ of nodes and to find which one is a minimizer in the absolute value
 466 sense. Hence, three other lookup tables (lut$_{s}$, lut$_{c}$ and lut$_{A}$) can be set with the $%
 467 3\times nb_{s}$ values $\sin \theta ,$ $\cos \theta ,$
 468 \begin{equation*}
 469 \left[ cos2\theta \sum_{i=0}^{M-1}sin(4\pi f.i)+sin2\theta
 470 \sum_{i=0}^{M-1}cos(4\pi f.i)\right] .
 471 \end{equation*}
 472
 473 \item The search algorithm can be very fast using a dichotomous process in $%
 474 log_{2}(nb_{s}).$
 475 \end{itemize}
 476
 477 The overall method is synthetized in an algorithm (called LSQ in the
 478 following) divided into the precomputing part and the acquisition loop:
 479
 480 \begin{algorithm}[htbp]
 481 \caption{LSQ algorithm - before acquisition loop.}
 482 \label{alg:lsq-before}
 483
 484    $M \leftarrow $ number of pixels of the profile\\
 485    I[] $\leftarrow $ intensity of pixels\\
 486    $f \leftarrow $ frequency of the profile\\
 487    $s4i \leftarrow \sum_{i=0}^{M-1} sin(4\pi f.i)$\\
 488    $c4i \leftarrow \sum_{i=0}^{M-1} cos(4\pi f.i)$\\
 489    $nb_s \leftarrow $ number of discretization steps of $[-\pi,\pi]$\\
 490
 491    \For{$i=0$ to $nb_s $}{
 492      $\theta  \leftarrow -\pi + 2\pi\times \frac{i}{nb_s}$\\
 493      lut$_s$[$i$] $\leftarrow sin \theta$\\
 494      lut$_c$[$i$] $\leftarrow cos \theta$\\
 495      lut$_A$[$i$] $\leftarrow cos 2 \theta \times s4i + sin 2 \theta \times c4i$\\
 496      lut$_{sfi}$[$i$] $\leftarrow sin (2\pi f.i)$\\
 497      lut$_{cfi}$[$i$] $\leftarrow cos (2\pi f.i)$\\
 498    }
 499 \end{algorithm}
 500
 501 \begin{algorithm}[htbp]
 502 \caption{LSQ algorithm - during acquisition loop.}
 503 \label{alg:lsq-during}
 504
 505    $\bar{x} \leftarrow \frac{M-1}{2}$\\
 506    $\bar{y} \leftarrow 0$, $x_{var} \leftarrow 0$, $xy_{covar} \leftarrow 0$\\
 507    \For{$i=0$ to $M-1$}{
 508      $\bar{y} \leftarrow \bar{y} + $ I[$i$]\\
 509      $x_{var} \leftarrow x_{var} + (i-\bar{x})^2$\\
 510    }
 511    $\bar{y} \leftarrow \frac{\bar{y}}{M}$\\
 512    \For{$i=0$ to $M-1$}{
 513      $xy_{covar} \leftarrow xy_{covar} + (i-\bar{x}) \times (I[i]-\bar{y})$\\
 514    }
 515    $slope \leftarrow \frac{xy_{covar}}{x_{var}}$\\
 516    $start \leftarrow y_{moy} - slope\times \bar{x}$\\
 517    \For{$i=0$ to $M-1$}{
 518      $I[i] \leftarrow I[i] - start - slope\times i$\\
 519    }
 520
 521    $I_{max} \leftarrow max_i(I[i])$, $I_{min} \leftarrow min_i(I[i])$\\
 522    $amp \leftarrow \frac{I_{max}-I_{min}}{2}$\\
 523
 524    $Is \leftarrow 0$, $Ic \leftarrow 0$\\
 525    \For{$i=0$ to $M-1$}{
 526      $Is \leftarrow Is + I[i]\times $ lut$_{sfi}$[$i$]\\
 527      $Ic \leftarrow Ic + I[i]\times $ lut$_{cfi}$[$i$]\\
 528    }
 529
 530    $\delta \leftarrow \frac{nb_s}{2}$, $b_l \leftarrow 0$, $b_r \leftarrow \delta$\\
 531    $v_l \leftarrow -2.I_s - amp.$lut$_A$[$b_l$]\\
 532
 533    \While{$\delta >= 1$}{
 534
 535      $v_r \leftarrow 2.[ Is.$lut$_c$[$b_r$]$ + Ic.$lut$_s$[$b_r$]$ ] - amp.$lut$_A$[$b_r$]\\
 536
 537      \If{$!(v_l < 0$ and $v_r >= 0)$}{
 538        $v_l \leftarrow v_r$ \\
 539        $b_l \leftarrow b_r$ \\
 540      }
 541      $\delta \leftarrow \frac{\delta}{2}$\\
 542      $b_r \leftarrow b_l + \delta$\\
 543    }
 544    \uIf{$!(v_l < 0$ and $v_r >= 0)$}{
 545      $v_l \leftarrow v_r$ \\
 546      $b_l \leftarrow b_r$ \\
 547      $b_r \leftarrow b_l + 1$\\
 548      $v_r \leftarrow 2.[ Is.$lut$_c$[$b_r$]$ + Ic.$lut$_s$[$b_r$]$ ] - amp.$lut$_A$[$b_r$]\\
 549    }
 550    \Else {
 551      $b_r \leftarrow b_l + 1$\\
 552    }
 553
 554    \uIf{$ abs(v_l) < v_r$}{
 555      $b_{\theta} \leftarrow b_l$ \\
 556    }
 557    \Else {
 558      $b_{\theta} \leftarrow b_r$ \\
 559    }
 560    $\theta \leftarrow \pi\times \left[\frac{2.b_{ref}}{nb_s}-1\right]$\\
 561
 562 \end{algorithm}
 563
 564 \subsubsection{Algorithm comparison}
 565
 566 We compared the two algorithms on the base of three criteria:
 567
 568 \begin{itemize}
 569 \item precision of results on a cosines profile, distorted by noise,
 570
 571 \item number of operations,
 572
 573 \item complexity of FPGA implementation
 574 \end{itemize}
 575
 576 For the first item, we produced a matlab version of each algorithm,
 577 running in double precision. The profile was generated for about
 578 34,000 different quadruplets of periods ($\in \lbrack 3.1,6.1]$, step
 579 = 0.1), phases ($\in \lbrack -3.1,3.1]$, steps = 0.062) and slope
 580 ($\in \lbrack -2,2]$, step = 0.4). Obviously, the discretization of
 581 $[-\pi ,\pi ]$ introduces an error in the phase estimation. It is at
 582 most equal to $\frac{\pi}{nb_s}$. From some experiments on a $17\times
 583 4$ array, authors of \cite{AFMCSEM11} noticed a average ratio of 50
 584 between phase variation in radians and lever end position in
 585 nanometers. Assuming such a ratio and $nb_s = 1024$, the maximum lever
 586 deflection error would be 0.15nm which is smaller than 0.3nm, the best
 587 precision achieved with the setup used in \cite{AFMCSEM11}.
 588
 589 Moreover, pixels have been paired and the paired intensities have been
 590 perturbed by addition of a random number uniformly picked in
 591 $[-N,N]$. Notice that we have observed that perturbing each pixel
 592 independently yields too weak profile distortion. We report
 593 percentages of errors between the reference and the computed phases
 594 out of $2\pi ,$%
 595 \begin{equation*}
 596 err=100\times \frac{|\theta _{ref}-\theta _{comp}|}{2\pi }.
 597 \end{equation*}%
 598 Table \ref{tab:algo_prec} gives the maximum and the average errors for both
 599 algorithms and for increasing values of $N$ the noise parameter.
 600
 601 \begin{table}[ht]
 602 \begin{center}
 603 \begin{tabular}{|c|c|c|c|c|}
 604 \hline
 605 & \multicolumn{2}{c|}{SPL} & \multicolumn{2}{c|}{LSQ} \\ \cline{2-5}
 606 noise (N)& max. err. & aver. err. & max. err. & aver. err. \\ \hline
 607 0 & 2.46 & 0.58 & 0.49 & 0.1 \\ \hline
 608 2.5 & 2.75 & 0.62 & 1.16 & 0.22 \\ \hline
 609 5 & 3.77 & 0.72 & 2.47 & 0.41 \\ \hline
 610 7.5 & 4.72 & 0.86 & 3.33 & 0.62 \\ \hline
 611 10 & 5.62 & 1.03 & 4.29 & 0.81 \\ \hline
 612 15 & 7.96 & 1.38 & 6.35 & 1.21 \\ \hline
 613 30 & 17.06 & 2.6 & 13.94 & 2.45 \\ \hline
 614 \end{tabular}%
 615 \end{center}
 616 \caption{Error (in \%) for cosines profiles, with noise.}
 617 \label{tab:algo_prec}
 618 \end{table}
 619
 620 The results show that the two algorithms yield close results, with a slight
 621 advantage for LSQ. Furthermore, both behave very well against noise.
 622 Assuming an average ratio of 50 (see above), an error of 1 percent on
 623 the phase corresponds to an error of 0.5nm on the lever deflection, which is
 624 very close to the best precision.
 625
 626 It is very hard to predict which level of noise will be present in
 627 real experiments and how it will distort the profiles. Authors of
 628 \cite{AFMCSEM11} gave us the authorization to exploit some of their
 629 results on a $17\times 4$ array. It allowed us to compare experimental
 630 profiles to simulated ones. We can see on figure \ref{fig:noise20} the
 631 profile with $N=10$ that leads to the biggest error. It is a bit
 632 distorted, with pikes and straight/rounded portions. In fact, it is
 633 very close to some of the worst experimental profiles. Figure
 634 \ref{fig:noise60} shows a sample of worst profile for $N=30$. It is
 635 completely distorted, largely beyond any experimental ones. Obviously,
 636 these comparisons are a bit subjectives and experimental profiles
 637 could also be completly distorted on other experiments. Nevertheless,
 638 they give an idea about the possible error.
 639
 640 \begin{figure}[ht]
 641 \begin{center}
 642 \includegraphics[width=\columnwidth]{intens-noise20}
 643 \end{center}
 644 \caption{Sample of worst profile for N=10}
 645 \label{fig:noise20}
 646 \end{figure}
 647
 648 \begin{figure}[ht]
 649 \begin{center}
 650 \includegraphics[width=\columnwidth]{intens-noise60}
 651 \end{center}
 652 \caption{Sample of worst profile for N=30}
 653 \label{fig:noise60}
 654 \end{figure}
 655
 656 The second criterion is relatively easy to estimate for LSQ and harder for
 657 SPL because of the use of the arctangent function. In both cases, the number
 658 of operation is proportional to $M$ the numbers of pixels. For LSQ, it also
 659 depends on $nb_{s}$ and for SPL on $L=k\times M$ the number of interpolated
 660 points. We assume that $M=20$, $nb_{s}=1024$ and $k=4$, that all possible
 661 parts are already in lookup tables and that a limited set of operations (+,
 662 -, *, /, $<$, $>$) is taken into account. Translating both algorithms in C
 663 code, we obtain about 430 operations for LSQ and 1,550 (plus a few tenth for
 664 $atan$) for SPL. This result is largely in favor of LSQ. Nevertheless,
 665 considering the total number of operations is not fully relevant for FPGA
 666 implementation which time and space consumption depends not only on the type
 667 of operations but also of their ordering. The final evaluation is thus very
 668 much driven by the third criterion.\newline
 669
 670 The Spartan 6 used in our architecture has a hard constraint since it
 671 has no built-in floating point units. Obviously, it is possible to use
 672 some existing "black-boxes" for double precision operations. But they
 673 require a lot of clock cycles to complete. It is much simpler to
 674 exclusively use integers, with a quantization of all double precision
 675 values. It should be chosen in a manner that does not alterate result
 676 precision. Furthermore, it should not lead to a design with a huge
 677 latency because of operations that could not complete during a single
 678 or few clock cycles. Divisions fall into that category and, moreover,
 679 they need a varying number of clock cycles to complete. Even
 680 multiplications can be a problem since a DSP48 takes inputs of 18 bits
 681 maximum. So, for larger multiplications, several DSP must be combined
 682 which increases the overall latency.
 683
 684 In the present algorithms, the hardest constraint does not come from the
 685 FPGA characteristics but from the algorithms. Their VHDL implementation can
 686 be efficient only if they can be fully (or near) pipelined. We observe that
 687 only a small part of SPL can be pipelined, indeed, the computation of spline
 688 coefficients implies to solve a linear tridiagonal system which matrix and
 689 right-hand side are computed from incoming pixels intensity but after, the
 690 back-solve starts with the latest values, which breaks the pipeline.
 691 Moreover, SPL relies on interpolating far more points than profile size.
 692 Thus, the end of SPL works on a larger amount of data than at the beginning,
 693 which also breaks the pipeline.
 694
 695 LSQ has not this problem since all parts, except the dichotomic search, work
 696 on the same amount of data, i.e. the profile size. Furthermore, LSQ requires
 697 less operations than SPL, implying a smaller output latency. In total, LSQ
 698 turns out to be the best candidate for phase computation on any architecture
 699 including FPGA.
 700
 701 \section{VHDL implementation and experimental tests}
 702
 703 \label{Experimental tests}
 704
 705 \subsection{VHDL implementation}
 706
 707 From the LSQ algorithm, we have written a C program that uses only
 708 integer values. We used a very simple quantization which consists in
 709 multiplying each double precision value by a factor power of two and
 710 by keeping the integer part. For an accurate evaluation of the
 711 division in the computation of $a$ the slope coefficient, we also
 712 scaled the pixel intensities by another power of two. The main problem
 713 was to determin these factors. Most of the time, they are chosen to
 714 minimize the error induced by the quantization. But in our case, we
 715 also have some hardware constraints, for example the size and depth of
 716 RAMs or the input size of DSPs. Thus, having a maximum of values that
 717 fit in these sizes is a very important criterion to choose the scaling
 718 factors.
 719
 720 Consequently, we have determined the maximum value of each variable as
 721 a function of the scale factors and the profile size involved in the
 722 algorithm. It gave us the the maximum number of bits necessary to code
 723 them. We have chosen the scale factors so that any variable (except
 724 the covariance) fits in 18 bits, which is the maximum input size of
 725 DSPs. In this way, all multiplications, except one, could be done with
 726 a single DSP, in a single clock cycle. Moreover, assuming that $nb_s =
 727 1024$, all LUTs could fit in the 18Kbits RAMs. Finally, we compared
 728 the double and integer versions of LSQ and found a nearly perfect
 729 agreement between their results.
 730
 731 As mentionned above, some operations like divisions must be
 732 avoided. But when the denominator is fixed, a division can be replaced
 733 by its multiplication/shift counterpart. This is always the case in
 734 LSQ. For example, assuming that $M$ is fixed, $x_{var}$ is known and
 735 fixed. Thus, $\frac{xy_{covar}}{x_{var}}$ can be replaced by
 736
 737 \[ (xy_{covar}\times \left \lfloor\frac{2^n}{x_{var}} \right \rfloor) \gg n\]
 738
 739 where $n$ depends on the desired precision (in our case $n=24$).
 740
 741 Obviously, multiplications and divisions by a power of two can be
 742 replaced by left or right bit shifts. Finally, the code only contains
 743 shifts, additions, subtractions and multiplications of signed integers, which
 744 are perfectly adapted to FGPAs.
 745
 746
 747 We built two versions of VHDL codes, namely one directly by hand
 748 coding and the other with Matlab using the Simulink HDL coder feature~\cite%
 749 {HDLCoder}. Although the approaches are completely different we obtained
 750 quite comparable VHDL codes. Each approach has advantages and drawbacks.
 751 Roughly speaking, hand coding provides beautiful and much better structured
 752 code while Simulind HDL coder produces allows for fast code production. In
 753 terms of throughput and latency, simulations show that the two approaches
 754 yield close results with a slight advantage for hand coding.
 755
 756 \subsection{Simulation}
 757
 758 Before experimental tests on the FPGA board, we simulated our two VHDL
 759 codes with GHDL and GTKWave (two free tools with linux). We built a
 760 testbench based on experimental profiles and compared the results to
 761 values given by the SPL algorithm. Both versions lead to correct
 762 results. Our first codes were highly optimized, indeed the pipeline
 763 could compute a new phase each 33 cycles and its latency was equal to
 764 95 cycles. Since the Spartan6 is clocked at 100MHz, estimating the
 765 deflection of 100 cantilevers would take about $%
 766 (95+200\times 33).10=66.95\mu $s, i.e. nearly 15,000 estimations by
 767 second.
 768
 769 \subsection{Bitstream creation}
 770
 771 In order to test our code on the SP Vision board, the design was
 772 extended with a component that keeps profiles in RAM, flushes them in
 773 the phase computation component and stores its output in another
 774 RAM. We also added a wishbone, a component that can "drive" signals to
 775 communicate between i.MX and other components. It is mainly used to
 776 start to flush profiles and to retrieve the computed phases in
 777 RAM. Unfortunately, the first designs could not be placed and routed
 778 with ISE on the Spartan6 with a 100MHz clock. The main problems were
 779 encountered with series of arthmetic operations and more especially
 780 with RAM outputs used in DSPs. So, we needed to decompose some parts
 781 of the pipeline, which added few clock cycles. Finally, we obtained a
 782 bitstream that has been successfully tested on the board.
 783
 784 Its latency is of 112 cycles and computes a new phase every 40
 785 cycles. For 100 cantilevers, it takes $(112+200\times 40).10=81.12\mu
 786 $s to compute their deflection. It corresponds to about 12300 images
 787 per second, which is largely beyond the camera capacities and the
 788 possibility to extract a new profile from an image every 40
 789 cycles. Nevertheless, it also largely fits our design goals.
 790
 791 \label{sec:results}
 792
 793 \section{Conclusion and perspectives}
 794
 795 In this paper we have presented a full hardware/software solution for
 796 real-time cantilever deflection computation from interferometry images.
 797 Phases are computed thanks to a new algorithm based on the least square
 798 method. It has been quantized and pipelined to be mapped into a FPGA, the
 799 architecture of our solution. Performances have been analyzed through
 800 simulations and real experiments on a Spartan6 FPGA. The results meet our
 801 initial requirements. In future work, the algorithm quantization will be
 802 better analyzed and an high speed camera will be introduced in the
 803 processing chain so that to process real images. Finally, we will address
 804 real-time filtering and control problems for AFM arrays in dynamic regime.
 805
 806 \section{Acknowledgments}
 807 We would like to thank A. Meister and M. Favre, from CSEM, for sharing all the
 808 material we used to write this article and for the time they spent to
 809 explain us their approach.
 810
 811 \bibliographystyle{plain}
 812 \bibliography{biblio}
 813
 814 \end{document}