dmems12.tex

   1
   2 \documentclass[10pt, peerreview, compsocconf]{IEEEtran}
   3 %\usepackage{latex8}
   4 %\usepackage{times}
   5 \usepackage[utf8]{inputenc}
   6 %\usepackage[cyr]{aeguill}
   7 %\usepackage{pstricks,pst-node,pst-text,pst-3d}
   8 %\usepackage{babel}
   9 \usepackage{amsmath}
  10 \usepackage{url}
  11 \usepackage{graphicx}
  12 \usepackage{thumbpdf}
  13 \usepackage{color}
  14 \usepackage{moreverb}
  15 \usepackage{commath}
  16 \usepackage{subfigure}
  17 %\input{psfig.sty}
  18 \usepackage{fullpage}
  19 \usepackage{fancybox}
  20
  21 \usepackage[ruled,lined,linesnumbered]{algorithm2e}
  22
  23 %%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
  24 \newcommand{\noun}[1]{\textsc{#1}}
  25
  26 \newcommand{\tab}{\ \ \ }
  27
  28
  29 \begin{document}
  30
  31
  32 %% \author{\IEEEauthorblockN{Authors Name/s per 1st Affiliation (Author)}
  33 %% \IEEEauthorblockA{line 1 (of Affiliation): dept. name of organization\\
  34 %% line 2: name of organization, acronyms acceptable\\
  35 %% line 3: City, Country\\
  36 %% line 4: Email: name@xyz.com}
  37 %% \and
  38 %% \IEEEauthorblockN{Authors Name/s per 2nd Affiliation (Author)}
  39 %% \IEEEauthorblockA{line 1 (of Affiliation): dept. name of organization\\
  40 %% line 2: name of organization, acronyms acceptable\\
  41 %% line 3: City, Country\\
  42 %% line 4: Email: name@xyz.com}
  43 %% }
  44
  45
  46
   47 \title{A new approach based on a least squares method to estimate cantilever deflection in real time with an FPGA}
  48 \author{\IEEEauthorblockN{Raphaël Couturier\IEEEauthorrefmark{1}, Stéphane Domas\IEEEauthorrefmark{1}, Gwenhaël Goavec-Merou\IEEEauthorrefmark{2} and Michel Lenczner\IEEEauthorrefmark{2}}
  49 \IEEEauthorblockA{\IEEEauthorrefmark{1}FEMTO-ST, DISC, University of Franche-Comte, Belfort, France\\
  50 \{raphael.couturier,stephane.domas\}@univ-fcomte.fr}
  51 \IEEEauthorblockA{\IEEEauthorrefmark{2}FEMTO-ST, Time-Frequency, University of Franche-Comte, Besançon, France\\
   52 michel.lenczner@utbm.fr, gwenhael.goavec@trabucayre.com}
  53 }
  54
  55
  56
  57
  58
  59
  60 %\maketitle
  61
  62 \thispagestyle{empty}
  63
  64 \begin{abstract}
  65
   66 Atomic force microscopes (AFM) provide high resolution images of surfaces. We
   67 focus our attention on an interferometry method to estimate the deflection of
   68 cantilevers. Previously, this method relied on spline interpolation to estimate
   69 the deflection, and the computations were performed on a PC with LabView. In this
   70 paper, we propose a new method based on the least squares method and we present
   71 the implementation that we developed on an FPGA. Our method can be pipelined on
   72 an FPGA in order to process image profiles very quickly. The simulations and real
   73 tests we have performed show that this implementation is very efficient and
   74 should allow us to control a cantilever array in real time.
  75
  76
  77 \end{abstract}
  78
  79 \begin{IEEEkeywords}
  80 FPGA, cantilever, interferometry.
  81 \end{IEEEkeywords}
  82
  83
  84 \IEEEpeerreviewmaketitle
  85
  86 \section{Introduction}
  87
   88 Cantilevers are used inside atomic force microscopes (AFM), which provide high
   89 resolution images of surfaces. Several techniques have been used in the
   90 literature to measure the displacement of cantilevers. For example, it is
   91 possible to determine the deflection accurately with different mechanisms.
   92 In~\cite{CantiPiezzo01}, the authors used a piezoresistor integrated into the
   93 cantilever. Nevertheless, this approach suffers from the complexity of the
   94 microfabrication process needed to implement the sensor in the cantilever.
   95 In~\cite{CantiCapacitive03}, the authors presented a cantilever mechanism
   96 based on capacitive sensing. This kind of technique also requires instrumenting
   97 the cantilever, which results in a complex fabrication process.
  98
   99 In this paper our attention is focused on a method based on interferometry to
  100 measure cantilevers' displacements. In this method cantilevers are illuminated
  101 by an optical source. The interferometry produces fringes on each cantilever,
  102 which makes it possible to compute the cantilever displacement. In order to analyze the
  103 fringes, a high speed camera is used. Images need to be processed quickly, and
  104 an estimation method is then required to determine the displacement of each
  105 cantilever. In~\cite{AFMCSEM11}, the authors used an algorithm based on
  106 splines to estimate the cantilevers' positions.
 107
 108 The overall process gives accurate results but all the computations
 109 are performed on a standard computer using LabView.  Consequently, the
 110 main drawback of this implementation is that the computer is a
  111 bottleneck. In this paper we propose to use a method based on least
  112 squares and to implement all the computations on an FPGA.
 113
 114 The remainder  of the paper  is organized as  follows. Section~\ref{sec:measure}
 115 describes  more precisely  the measurement  process. Our  solution based  on the
 116 least  square   method  and   the  implementation  on   FPGA  is   presented  in
  117 Section~\ref{sec:solus}. Experiments are described in
  118 Section~\ref{sec:results}. Finally, a conclusion and some perspectives are
  119 presented.
 120
 121
 122
  123 %% a few annotated references on computations based on interferometry
 124
 125 \section{Measurement principles}
 126 \label{sec:measure}
 127
 128 \subsection{Architecture}
 129 \label{sec:archi}
  130 %% description of the general architecture of the image acquisition,
  131 %% with a processing unit in the middle whose nature is not
  132 %% specified.
 133
  134 In order to develop simple, cost effective and user-friendly cantilever arrays,
  135 the authors of~\cite{AFMCSEM11} have developed a system based on
  136 interferometry. In contrast to other optical systems, which use a laser beam
  137 deflection scheme and are sensitive to the angular displacement of the cantilever,
  138 interferometry is sensitive to the optical path difference induced by the
  139 vertical displacement of the cantilever.
 140
  141 The system built by these authors is based on a Linnik
  142 interferometer~\cite{Sinclair:05}. It is illustrated in
  143 Figure~\ref{fig:AFM}. The beam of a laser diode is first split (by the splitter)
  144 into a reference beam and a sample beam that reaches the cantilever
  145 array. In order to be able to move the cantilever array, it is
  146 mounted on a translation and rotational hexapod stage with five
  147 degrees of freedom. The optical system is also fixed to the stage.
  148 Thus, the cantilever array is centered in the optical system, which can
  149 be adjusted accurately. The beam illuminates the array through a
  150 microscope objective and the light reflects on the cantilevers.
  151 Likewise, the reference beam reflects on a movable mirror. A CMOS
  152 camera chip records the interferogram formed by the reference and
  153 sample beams, which are recombined in the beam splitter. At the
  154 beginning of each experiment, the movable mirror is adjusted manually in
  155 order to align the interferometric fringes approximately parallel to
  156 the cantilevers. When cantilevers move due to the surface, the
  157 bending of the cantilevers produces movements in the fringes that can be
  158 detected with the CMOS camera. Finally, the fringes need to be
  159 analyzed. In~\cite{AFMCSEM11}, the authors used a LabView program to
  160 compute the cantilevers' deflections from the fringes.
 161
 162 \begin{figure}
 163 \begin{center}
 164 \includegraphics[width=\columnwidth]{AFM}
 165 \end{center}
  166 \caption{Schematic of the AFM}
 167 \label{fig:AFM}
 168 \end{figure}
 169
 170
  171 %% image taken from the experiments.
 172
 173 \subsection{Cantilever deflection estimation}
 174 \label{sec:deflest}
 175
 176 \begin{figure}
 177 \begin{center}
 178 \includegraphics[width=\columnwidth]{lever-xp}
 179 \end{center}
 180 \caption{Portion of an image picked by the camera}
 181 \label{fig:img-xp}
 182 \end{figure}
 183
  184 As shown in Figure~\ref{fig:img-xp}, each cantilever is covered by
  185 several interferometric fringes. The fringes distort when
  186 cantilevers are deflected. The deflection is estimated by
  187 computing this distortion. To this end, the authors of~\cite{AFMCSEM11}
  188 proposed a method based on computing the phase of the fringes at the
  189 base of each cantilever, near the tip, and on the base of the
  190 array. They assume that a linear relation binds these phases, which
  191 can be used to ``unwrap'' the phase at the tip and to determine the deflection.
 192
  193 More precisely, segments of pixels are extracted from images taken by
 194 the camera. These segments are large enough to cover several
 195 interferometric fringes. As said above, they are placed at the base
 196 and near the tip of the cantilevers. They are called base profile and
 197 tip profile in the following. Furthermore, a reference profile is
 198 taken on the base of the cantilever array.
 199
  200 The pixel intensity $I$ (in gray levels) of each profile is modeled by:
 201
 202 \begin{equation}
 203 \label{equ:profile}
  204 I(x) = ax+b+A\cos(2\pi f x + \theta)
 205 \end{equation}
 206
 207 where $x$ is the position of a pixel in its associated segment.
 208
  209 The global method consists of two main sequences. The first one aims
  210 to determine the frequency $f$ of each profile with an algorithm based
  211 on spline interpolation (see Section~\ref{sec:algo-spline}). It also
  212 computes the coefficient used for unwrapping the phase. The second one
  213 is the acquisition loop, during which images are taken at regular time
  214 steps. For each image, the phase $\theta$ of all profiles is computed
  215 to obtain, after unwrapping, the deflection of the
  216 cantilevers. Originally, this computation was also done with an
  217 algorithm based on splines. This article proposes a new version based
  218 on a least squares method.
 219
 220 \subsection{Design goals}
 221 \label{sec:goals}
 222
  223 The main goal is to implement a computing unit that estimates the
  224 deflection of about $10\times10$ cantilevers faster than the stream of
  225 images coming from the camera. The accuracy of the results must be close
  226 to the maximum precision ever obtained experimentally on the
  227 architecture, i.e. 0.3nm. Finally, the latency between an image
  228 entering the unit and the deflections being available must be as small as possible
  229 (NB: future work plans to add some control on the cantilevers).
 230
  231 If we put aside some hardware issues like the speed of the link
  232 between the camera and the computation unit, or the time to deserialize
  233 pixels and to store them in memory, the phase computation is
  234 obviously the bottleneck of the whole process. For example, if we
  235 consider the camera currently in use, an exposure time of 2.5ms for
  236 $1024\times 1204$ pixels seems the minimum that can be reached. For
  237 100 cantilevers, if we neglect the time to extract pixels, it implies
  238 that computing the deflection of a single
  239 cantilever should take less than 25$\mu$s, thus 12.5$\mu$s per phase.
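
The time budget above can be checked with a short worked computation. It is only a sketch under the assumptions stated in the text: one image every 2.5ms, 100 cantilevers, and (our reading) two phases per cantilever, one for the base profile and one for the tip profile. The symbols $t_{cant}$ and $t_{phase}$ are introduced here for illustration only.

\begin{align*}
t_{cant} &= \frac{2.5\,\mathrm{ms}}{100} = 25\,\mu\mathrm{s}, \\
t_{phase} &= \frac{t_{cant}}{2} = 12.5\,\mu\mathrm{s}.
\end{align*}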
 240
  241 In fact, this timing is a very hard constraint. Let us consider a very
  242 small program that initializes twenty million doubles in memory
  243 and then does 1000000 cumulative sums on 20 contiguous values
  244 (experimental profiles have about this size). On an Intel Core 2 Duo
  245 E6650 at 2.33GHz, this program reaches an average of 155Mflops.
 246
 247 %%Itimplies that the phase computation algorithm should not take more than
 248 %%$155\times 12.5 = 1937$ floating operations. For integers, it gives $3000$ operations.
 249
  250 Obviously, some cache effects and optimizations on
  251 huge amounts of computation can drastically increase this
  252 performance: peak efficiency is about 2.5Gflops for the considered
  253 CPU. But this is not the case for phase computation, which uses only a few
  254 tens of values.
 255
  256 In order to evaluate the original algorithm, we translated it into
  257 C. As explained below, for 20 pixels, it does about 1550
  258 operations, thus an estimated execution time of $1550/155
  259 =$10$\mu$s. For a more realistic evaluation, we constructed a file of
  260 1MB containing 200 profiles of 20 pixels, equally scattered. This file
  261 is equivalent to an image stored in a device file representing the
  262 camera. We obtained an average of 10.5$\mu$s per profile (including I/O
  263 accesses). It is under our requirements but close to the limit. In
  264 case of an occasional load of the system, it could be largely
  265 exceeded. A solution would be to use a real-time operating system, but
  266 another one is to search for a more efficient algorithm.
 267
  268 But the main drawback is the latency of such a solution: since
  269 profiles must be processed one after another, computing the deflection of 100
  270 cantilevers takes about $200\times 10.5 = 2100\mu$s, i.e. 2.1ms, which is inadequate
  271 for an efficient control. An obvious solution is to parallelize the
  272 computations, for example on a GPU. Nevertheless, the cost of transferring
  273 profiles into GPU memory and retrieving the results would be prohibitive
  274 compared to the computation time. It is certainly more efficient to
  275 pipeline the computation. For example, supposing that 200 profiles of
  276 20 pixels can be pushed sequentially into a pipelined unit clocked at
  277 100MHz (i.e. a pixel enters the unit every 10ns), all profiles
  278 would be processed in $200\times 20\times 10\times 10^{-9}$s $=$ 40$\mu$s plus
  279 the latency of the pipeline. This is about 50 times faster than
  280 the current results.
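
As a sanity check of these orders of magnitude, the two durations can be compared directly. This is only a sketch: the latency of the pipeline is neglected, and the symbols $t_{seq}$ and $t_{pipe}$ are introduced here for illustration.

\begin{align*}
t_{seq} &= 200 \times 10.5\,\mu\mathrm{s} = 2100\,\mu\mathrm{s} = 2.1\,\mathrm{ms}, \\
t_{pipe} &= 200 \times 20 \times 10\,\mathrm{ns} = 40000\,\mathrm{ns} = 40\,\mu\mathrm{s}, \\
\frac{t_{seq}}{t_{pipe}} &= \frac{2100}{40} = 52.5 .
\end{align*}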
 281
 282 For these reasons, an FPGA as the computation unit is the best choice
  283 to achieve the required performance. Nevertheless, moving from
  284 C code to a pipelined version in VHDL is not obvious at all. As
 285 explained in the next section, it can even be impossible because of
 286 some hardware constraints specific to FPGAs.
 287
 288
 289 \section{Proposed solution}
 290 \label{sec:solus}
 291
  292 Project Oscar aims to provide a hardware and software architecture to estimate
  293 and control the deflection of cantilevers. The hardware part consists of a
  294 high-speed camera linked to an embedded board hosting FPGAs. In this way, the
  295 camera output stream can be pushed directly into the FPGA. The software part is
  296 mostly the VHDL code that deserializes the camera stream, extracts profiles and
  297 computes the deflection. Before focusing on our work to implement the phase
  298 computation, we give some general information about FPGAs and the board we use.
 299
 300 \subsection{FPGAs}
 301
  302 A field-programmable gate array (FPGA) is an integrated circuit designed to be
  303 configured by the customer. FPGAs are composed of programmable logic components,
  304 called configurable logic blocks (CLB). These blocks mainly contain look-up
  305 tables (LUT), flip-flops (F/F) and latches, organized in one or more slices
  306 connected together. Each CLB can be configured to perform simple (AND, XOR, ...)
  307 or complex combinational functions. They are interconnected by reconfigurable
  308 links. Modern FPGAs contain memory elements and multipliers, which
  309 simplify the design and increase the performance. Nevertheless, all other
  310 complex operations, like division or trigonometric functions, are not
  311 available and must be implemented by configuring a set of CLBs. Since this
  312 configuration is not obvious at all, it can be done via a framework like
  313 ISE~\cite{ISE}. Such software can synthesize a design written in a hardware
  314 description language (HDL), map it onto CLBs, place/route them for a specific
  315 FPGA, and finally produce a bitstream that is used to configure the FPGA. Thus,
  316 from the developer's point of view, the main difficulty is to translate an
  317 algorithm into HDL code, taking into account FPGA resources and constraints like clock
  318 signals and I/O values that drive the FPGA.
 319
  320 Indeed, HDL programming is very different from classic languages like
  321 C. A program can be seen as a state machine, manipulating signals that
  322 evolve from state to state. Moreover, HDL instructions can execute
  323 concurrently. Basic logic operations are used to aggregate signals to
  324 produce new states and assign them to other signals. States are mainly
  325 expressed as arrays of bits. Fortunately, libraries provide some
  326 higher-level representations like signed integers, and arithmetic
  327 operations.
 328
  329 Furthermore, even if FPGAs are clocked more slowly than classic
  330 processors, they can perform pipelined as well as parallel
  331 operations. A pipeline consists in cutting a process into a sequence of
  332 small tasks, each taking the same execution time. It accepts a new data item at
  333 each clock tick; thus, after a known latency, it also provides a result
  334 at each clock tick. However, using a pipeline consumes more logic
  335 since the components of a task are not reusable by another
  336 one. Nevertheless, it is probably the most efficient technique on
  337 an FPGA. Because of its architecture, it is also very easy to process
  338 several data concurrently. When it is possible, the best performance
  339 is reached using parallelism to run several
  340 pipelines simultaneously in order to handle multiple data streams.
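
The behaviour of a pipeline described above can be summarized by a simple timing relation. It is a generic sketch, not a measurement of our design: $d$ (number of stages), $T$ (clock period) and $n$ (number of data items) are symbols introduced here, and the value $d = 50$ used in the example is purely hypothetical.

\begin{align*}
t_{first} &= d \times T, \\
t_{all} &= (d + n - 1) \times T .
\end{align*}

For instance, with $T = 10$ns, $n = 200 \times 20 = 4000$ pixels and a hypothetical depth $d = 50$, we get $t_{all} = 4049 \times 10\,\mathrm{ns} = 40.49\,\mu\mathrm{s}$: as long as $d \ll n$, the total time is dominated by the number of data items and not by the depth of the pipeline.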
 341
 342 \subsection{The board}
 343
  344 The board we use is designed by the Armadeus company, under the name
  345 SP Vision. It consists of a development board hosting an i.MX27 ARM
  346 processor (from Freescale). The board includes all classical
  347 connectors: USB, Ethernet, ... A Flash memory contains a Linux kernel
  348 that can be launched after booting the board via U-Boot.
 349
 350 The processor is directly connected to a Spartan3A FPGA (from Xilinx)
 351 via its special interface called WEIM. The Spartan3A is itself
  352 connected to a Spartan6 FPGA. Thus, it is possible to develop programs
  353 that communicate between the i.MX and the Spartan6, using the Spartan3A as a
  354 tunnel. By default, the WEIM interface provides a clock signal at
 355 100MHz that is connected to dedicated FPGA pins.
 356
 357 The Spartan6 is an LX100 version. It has 15822 slices, each slice
 358 containing 4 LUTs and 8 flip/flops. It is equivalent to 101261 logic
  359 cells. There are 268 internal block RAMs of 18Kbits, and 180 dedicated
  360 multiply-adders (named DSP48), which is more than enough for our
  361 project.
 362
  363 Some I/O pins of the Spartan6 are connected to two $2\times 17$ headers
  364 that can be used freely by the user. For the project, they will be
  365 connected to the interface card of the camera.
 366
 367 \subsection{Considered algorithms}
 368
 369 Two solutions have been studied to achieve phase computation. The
 370 original one, proposed by A. Meister and M. Favre, is based on
  371 interpolation by splines. It computes both frequency and
  372 phase. The second one, detailed in this article, is based on a
  373 classical least squares method but supposes that the frequency is already
  374 known.
 375
 376 \subsubsection{Spline algorithm (SPL)}
 377 \label{sec:algo-spline}
  378 Let us consider a profile $P$, that is, a segment of $M$ pixels with an
  379 intensity in gray levels. Let $I(x)$ denote the intensity of the profile at $x
  380 \in [0,M[$.
 381
  382 At first, only $M$ values of $I$ are known, for $x = 0, 1,
  383 \ldots,M-1$. A normalization scales the known intensities into
  384 $[-1,1]$. We compute splines that best fit these normalized
  385 intensities. Splines are used to interpolate $N = k\times M$ points
  386 (typically $k=4$ is sufficient) within $[0,M[$. Let $x^s$ denote the
  387 coordinates of these $N$ points and $I^s$ their intensities.
 388
  389 In order to obtain the frequency, the mean line $a x+b$ (see equation~\ref{equ:profile}) of $I^s$ is
  390 computed. Finding the intersections of $I^s$ with this line gives
  391 the period, and thus the frequency.
 392
 393 The phase is computed via the equation:
 394 \begin{equation}
  395 \theta = \arctan \left[ \frac{\sum_{i=0}^{N-1} \sin(2\pi f x^s_i) \times I^s(x^s_i)}{\sum_{i=0}^{N-1} \cos(2\pi f x^s_i) \times I^s(x^s_i)} \right]
 396 \end{equation}
 397
 398 Two things can be noticed:
 399 \begin{itemize}
  400 \item the frequency could also be obtained using the derivatives of
  401   the spline equations, which only requires solving quadratic equations.
  402 \item the frequency of each profile is computed only once, before the
  403   acquisition loop. Thus, $\sin(2\pi f x^s_i)$ and $\cos(2\pi f x^s_i)$
  404   could also be computed before the loop, which leads to a much faster
  405   computation of $\theta$.
 406 \end{itemize}
 407
  408 \subsubsection{Least squares algorithm (LSQ)}
 409
  410 Assuming that we compute the phase during the acquisition loop,
  411 equation~\ref{equ:profile} has only 4 parameters: $a, b, A$, and
  412 $\theta$, $f$ and $x$ being already known. Since $I$ is non-linear, a
  413 least squares method based on a Gauss-Newton algorithm can be used to
  414 determine these four parameters. Since it is an iterative process
  415 ending with a convergence criterion, it is clearly not
  416 well suited to our design goals.
 417
  418 Fortunately, it is quite simple to reduce the number of parameters to
  419 only $\theta$. Let $x^p$ be the coordinates of pixels in a segment of
  420 size $M$. Thus, $x^p = 0, 1, \ldots, M-1$. Let $I(x^p)$ be their
  421 intensity. Firstly, we ``remove'' the slope by computing:
 422
  423 \[I^{corr}(x^p) = I(x^p) - a x^p - b\]
  424
  425 Since the coefficients of a linear equation are sought, a classical least
  426 squares method can be used to determine $a$ and $b$:
  427
  428 \[a = \frac{\mathrm{covar}(x^p,I(x^p))}{\mathrm{var}(x^p)} \]
  429
  430 Denoting an average by an overlined symbol, we have:
  431
  432 \[b = \overline{I(x^p)} - a\,\overline{x^p}\]
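
Since $x^p$ takes the values $0, 1, \ldots, M-1$, the quantities that depend only on $x^p$ have closed forms. The following worked equations are an observation added for illustration, not a step stated in the original method; they use the unnormalized sums, the factor $\frac{1}{M}$ cancelling in the ratio that defines $a$.

\begin{align*}
\overline{x^p} &= \frac{1}{M}\sum_{i=0}^{M-1} i = \frac{M-1}{2}, \\
\sum_{i=0}^{M-1} \left(i - \overline{x^p}\right)^2 &= \frac{(M-1)M(2M-1)}{6} - M\left(\frac{M-1}{2}\right)^2 = \frac{M(M^2-1)}{12} .
\end{align*}

For a profile of $M = 20$ pixels, this gives $\overline{x^p} = 9.5$ and $\sum_i (i - \overline{x^p})^2 = \frac{20 \times 399}{12} = 665$. Both values depend only on $M$, so they can be known before the acquisition loop.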
 433
 434 Let $A$ be the amplitude of $I^{corr}$, i.e.
 435
  436 \[A = \frac{\max(I^{corr}) - \min(I^{corr})}{2}\]
  437
  438 Then, the least squares method to find $\theta$ reduces to searching for the minimum of:
  439
  440 \[\sum_{i=0}^{M-1} \left[ \cos(2\pi f i + \theta) - \frac{I^{corr}(i)}{A} \right]^2\]
  441
  442 This is equivalent to differentiating this expression with respect to $\theta$ and solving the following equation:
  443
  444 \begin{eqnarray*}
  445 2\left[ \cos\theta \sum_{i=0}^{M-1} I^{corr}(i) \sin(2\pi f i) + \sin\theta \sum_{i=0}^{M-1} I^{corr}(i) \cos(2\pi f i)\right] \\
  446 - A\left[ \cos 2\theta \sum_{i=0}^{M-1} \sin(4\pi f i) + \sin 2\theta \sum_{i=0}^{M-1} \cos(4\pi f i)\right]   = 0
 447 \end{eqnarray*}
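
The intermediate steps leading to this equation can be sketched as follows. The shorthand $\varphi_i = 2\pi f i$ and $I_i = I^{corr}(i)$ is introduced here only to keep the lines short.

\begin{align*}
\frac{d}{d\theta} \sum_{i=0}^{M-1} \left[ \cos(\varphi_i + \theta) - \frac{I_i}{A} \right]^2
  &= -2 \sum_{i=0}^{M-1} \left[ \cos(\varphi_i + \theta) - \frac{I_i}{A} \right] \sin(\varphi_i + \theta) = 0 \\
\Longleftrightarrow \quad
  2 \sum_{i=0}^{M-1} I_i \sin(\varphi_i + \theta) &- A \sum_{i=0}^{M-1} \sin(2\varphi_i + 2\theta) = 0 .
\end{align*}

The second line is obtained by multiplying by $A$ and using $2\sin u \cos u = \sin 2u$. Expanding both sines with $\sin(u+v) = \sin u \cos v + \cos u \sin v$ then separates the terms that depend on $\theta$ from the sums over $i$, which gives the equation above.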
 448
  449 Several points can be noticed:
  450 \begin{itemize}
  451 \item As in the spline method, some parts of this equation can be
  452   computed before the acquisition loop. This is the case for the sums that do
  453   not depend on $\theta$:
  454
  455 \[ \sum_{i=0}^{M-1} \sin(4\pi f i), \quad \sum_{i=0}^{M-1} \cos(4\pi f i) \]
  456
  457 \item Lookup tables for $\sin(2\pi f i)$ and $\cos(2\pi f i)$ can also be
  458 computed.
  459
  460 \item The simplest method to find the right $\theta$ is to discretize
  461   $[-\pi,\pi]$ in $nb_s$ steps, and to search which step leads to the
  462   result closest to zero. Consequently, three other lookup tables can
  463   also be computed before the loop:
  464
  465 \[ \sin \theta, \cos \theta, \]
  466
  467 \[ \left[ \cos 2\theta \sum_{i=0}^{M-1} \sin(4\pi f i) + \sin 2\theta \sum_{i=0}^{M-1} \cos(4\pi f i)\right] \]
  468
  469 \item This search can be very fast using a dichotomous process in $\log_2(nb_s)$ steps.
 470
 471 \end{itemize}
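
To give an order of magnitude for the last point, the cost of the dichotomous search can be compared with an exhaustive scan of the discretized interval. The value $nb_s = 1024$ is the one used in the comparison of the two algorithms; the rest is a simple worked computation.

\begin{align*}
\text{step size:} \quad & \frac{2\pi}{nb_s} = \frac{2\pi}{1024} \approx 6.14 \times 10^{-3}\,\mathrm{rad}, \\
\text{dichotomous search:} \quad & \log_2(1024) = 10 \text{ halvings of the interval}, \\
\text{exhaustive scan:} \quad & 1024 \text{ evaluations of the expression}.
\end{align*}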
 472
  473 Finally, the whole method is summarized by an algorithm (called LSQ in the following) in two parts, one before and one during the acquisition loop:
 474 \begin{algorithm}[htbp]
 475 \caption{LSQ algorithm - before acquisition loop.}
 476 \label{alg:lsq-before}
 477
 478    $M \leftarrow $ number of pixels of the profile\\
 479    I[] $\leftarrow $ intensities of pixels\\
 480    $f \leftarrow $ frequency of the profile\\
  481    $s4i \leftarrow \sum_{i=0}^{M-1} \sin(4\pi f i)$\\
  482    $c4i \leftarrow \sum_{i=0}^{M-1} \cos(4\pi f i)$\\
  483    $nb_s \leftarrow $ number of discretization steps of $[-\pi,\pi]$\\
  484
  485    \For{$i=0$ to $nb_s $}{
  486      $\theta  \leftarrow -\pi + 2\pi\times \frac{i}{nb_s}$\\
  487      lut$_s$[$i$] $\leftarrow \sin \theta$, lut$_c$[$i$] $\leftarrow \cos \theta$\\
  488      lut$_A$[$i$] $\leftarrow \cos 2 \theta \times s4i + \sin 2 \theta \times c4i$\\
  489    }
  490    \For{$i=0$ to $M-1$}{
  491      lut$_{sfi}$[$i$] $\leftarrow \sin (2\pi f i)$, lut$_{cfi}$[$i$] $\leftarrow \cos (2\pi f i)$\\
  492    }
 493 \end{algorithm}
 494
 495 \begin{algorithm}[htbp]
 496 \caption{LSQ algorithm - during acquisition loop.}
 497 \label{alg:lsq-during}
 498
 499    $\bar{x} \leftarrow \frac{M-1}{2}$\\
 500    $\bar{y} \leftarrow 0$, $x_{var} \leftarrow 0$, $xy_{covar} \leftarrow 0$\\
 501    \For{$i=0$ to $M-1$}{
 502      $\bar{y} \leftarrow \bar{y} + $ I[$i$]\\
 503      $x_{var} \leftarrow x_{var} + (i-\bar{x})^2$\\
 504    }
 505    $\bar{y} \leftarrow \frac{\bar{y}}{M}$\\
 506    \For{$i=0$ to $M-1$}{
 507      $xy_{covar} \leftarrow xy_{covar} + (i-\bar{x}) \times (I[i]-\bar{y})$\\
 508    }
 509    $slope \leftarrow \frac{xy_{covar}}{x_{var}}$\\
  510    $start \leftarrow \bar{y} - slope\times \bar{x}$\\
 511    \For{$i=0$ to $M-1$}{
 512      $I[i] \leftarrow I[i] - start - slope\times i$\\
 513    }
 514
 515    $I_{max} \leftarrow max_i(I[i])$, $I_{min} \leftarrow min_i(I[i])$\\
 516    $amp \leftarrow \frac{I_{max}-I_{min}}{2}$\\
 517
 518    $Is \leftarrow 0$, $Ic \leftarrow 0$\\
 519    \For{$i=0$ to $M-1$}{
 520      $Is \leftarrow Is + I[i]\times $ lut$_{sfi}$[$i$]\\
 521      $Ic \leftarrow Ic + I[i]\times $ lut$_{cfi}$[$i$]\\
 522    }
 523
 524    $\delta \leftarrow \frac{nb_s}{2}$, $b_l \leftarrow 0$, $b_r \leftarrow \delta$\\
  525    $v_l \leftarrow -2.Is - amp.$lut$_A$[$b_l$]\\
 526
 527    \While{$\delta >= 1$}{
 528
 529      $v_r \leftarrow 2.[ Is.$lut$_c$[$b_r$]$ + Ic.$lut$_s$[$b_r$]$ ] - amp.$lut$_A$[$b_r$]\\
 530
 531      \If{$!(v_l < 0$ and $v_r >= 0)$}{
 532        $v_l \leftarrow v_r$ \\
 533        $b_l \leftarrow b_r$ \\
 534      }
 535      $\delta \leftarrow \frac{\delta}{2}$\\
 536      $b_r \leftarrow b_l + \delta$\\
 537    }
 538    \uIf{$!(v_l < 0$ and $v_r >= 0)$}{
 539      $v_l \leftarrow v_r$ \\
 540      $b_l \leftarrow b_r$ \\
 541      $b_r \leftarrow b_l + 1$\\
 542      $v_r \leftarrow 2.[ Is.$lut$_c$[$b_r$]$ + Ic.$lut$_s$[$b_r$]$ ] - amp.$lut$_A$[$b_r$]\\
 543    }
 544    \Else {
 545      $b_r \leftarrow b_l + 1$\\
 546    }
 547
 548    \uIf{$ abs(v_l) < v_r$}{
 549      $b_{\theta} \leftarrow b_l$ \\
 550    }
 551    \Else {
 552      $b_{\theta} \leftarrow b_r$ \\
 553    }
  554    $\theta \leftarrow \pi\times \left[\frac{2.b_{\theta}}{nb_s}-1\right]$\\
 555
 556 \end{algorithm}
 557
 558 \subsubsection{Comparison}
 559
  560 We compared the two algorithms on the basis of three criteria:
  561 \begin{itemize}
  562 \item precision of results on a cosine profile, distorted with noise,
 563 \item number of operations,
 564 \item complexity to implement an FPGA version.
 565 \end{itemize}
 566
  567 For the first item, we produced a Matlab version of each algorithm,
  568 running with double precision values. The profile was generated for
  569 about 34000 different values of period ($\in [3.1, 6.1]$, step = 0.1),
  570 phase ($\in [-3.1 , 3.1]$, step = 0.062) and slope ($\in [-2 , 2]$,
  571 step = 0.4). For LSQ, $nb_s = 1024$, which leads to a maximal error of
  572 $\frac{\pi}{1024}$ on phase computation. Current experiments by A. Meister and
  573 M. Favre show a ratio of 50 between the variation of phase and
  574 the deflection of a lever. Thus, the maximal error due to
  575 discretization corresponds to an error of 0.15nm on the lever
  576 deflection, which is smaller than the best precision they achieved,
  577 i.e. 0.3nm.
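
The figures of this paragraph can be reproduced with a short worked computation. Two assumptions are made here and are not stated explicitly in the text: each parameter range includes both of its bounds, and the ratio of 50 is expressed in nanometers of deflection per radian of phase, which is the reading consistent with the 0.15nm value.

\begin{align*}
\text{number of profiles:} \quad & \underbrace{31}_{\text{periods}} \times \underbrace{101}_{\text{phases}} \times \underbrace{11}_{\text{slopes}} = 34441, \\
\text{maximal phase error:} \quad & \frac{1}{2} \times \frac{2\pi}{1024} = \frac{\pi}{1024} \approx 3.07 \times 10^{-3}\,\mathrm{rad}, \\
\text{maximal deflection error:} \quad & 50 \times 3.07 \times 10^{-3} \approx 0.15\,\mathrm{nm} .
\end{align*}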
 578
  579 For each test, we add some noise to the profile: a random number
  580 picked in $[-N,N]$ is added to the intensity of each group of two pixels
  581 (NB: it should be noticed that picking a new value for each pixel does
  582 not distort the profile enough). The absolute error on the result is
  583 evaluated by computing the difference between the reference and
  584 computed phase, out of $2\pi$, expressed as a percentage. That is: $err =
  585 100\times \frac{|\theta_{ref} - \theta_{comp}|}{2\pi}$.
 586
  587 Table~\ref{tab:algo_prec} gives the maximum and average errors for the two algorithms and increasing values of $N$.
 588
 589 \begin{table}[ht]
 590   \begin{center}
 591     \begin{tabular}{|c|c|c|c|c|}
 592       \hline
 593   & \multicolumn{2}{c|}{SPL} & \multicolumn{2}{c|}{LSQ} \\ \cline{2-5}
 594   noise & max. err. & aver. err. & max. err. & aver. err. \\ \hline
 595   0 & 2.46 & 0.58 & 0.49 & 0.1 \\ \hline
 596   2.5 & 2.75 & 0.62 & 1.16 & 0.22 \\ \hline
 597   5 & 3.77 & 0.72 & 2.47 & 0.41 \\ \hline
 598   7.5 & 4.72 & 0.86 & 3.33 & 0.62 \\ \hline
 599   10 & 5.62 & 1.03 & 4.29 & 0.81 \\ \hline
 600   15 & 7.96 & 1.38 & 6.35 & 1.21 \\ \hline
 601   30 & 17.06 & 2.6 & 13.94 & 2.45 \\ \hline
 602
 603 \end{tabular}
  604 \caption{Error (in \%) for cosine profiles, with noise.}
 605 \label{tab:algo_prec}
 606 \end{center}
 607 \end{table}
 608
 609 These results show that the two algorithms are very close, with a
  610 slight advantage for LSQ. Furthermore, both behave very well against
  611 noise. Assuming the experimental ratio of 50 (see above), an error of
  612 1 percent on phase corresponds to an error of 0.5nm on the lever
 613 deflection, which is very close to the best precision.
 614
 615 Obviously, it is very hard to predict which level of noise will be
 616 present in real experiments and how it will distort the
  617 profiles. Nevertheless, Figure~\ref{fig:noise20} shows the
  618 profile with $N=10$ that leads to the biggest error. It is a bit
  619 distorted, with spikes and straight/rounded portions, and relatively
  620 close to most of those that come from experiments. Figure~\ref{fig:noise60}
  621 shows a sample of the worst profile for $N=30$. It is completely distorted,
  622 largely beyond the worst experimental ones.
 623
 624 \begin{figure}[ht]
 625 \begin{center}
 626   \includegraphics[width=\columnwidth]{intens-noise20}
 627 \end{center}
 628 \caption{Sample of worst profile for N=10}
 629 \label{fig:noise20}
 630 \end{figure}
 631
 632 \begin{figure}[ht]
 633 \begin{center}
 634   \includegraphics[width=\columnwidth]{intens-noise60}
 635 \end{center}
 636 \caption{Sample of worst profile for N=30}
 637 \label{fig:noise60}
 638 \end{figure}
 639
 640 The second criterion is relatively easy to estimate for LSQ and harder
  641 for SPL because of the $\arctan$ operation. In both cases, it is proportional
  642 to the number of pixels $M$. For LSQ, it also depends on $nb_s$ and for
 643 SPL on $N = k\times M$, i.e. the number of interpolated points.
 644
  645 We assume that $M=20$, $nb_s=1024$, $k=4$, that all possible parts are
  646 already in lookup tables, and that a limited set of operations (+, -, *, /,
  647 $<$, $>$) is taken into account. Translating the two algorithms into C code, we
  648 obtain about 430 operations for LSQ and 1550 (plus a few tens for
  649 $\arctan$) for SPL. This result is largely in favor of LSQ. Nevertheless,
  650 considering the total number of operations is not really pertinent for
  651 an FPGA implementation: its performance mainly depends on the type of operations
  652 and their
  653 ordering. The final decision is thus driven by the third criterion.
 654
  655 The Spartan 6 used in our architecture has a hard constraint: it has no built-in
  656 floating point units. Obviously, it is possible to use some existing
  657 ``black boxes'' for double precision operations, but they have a quite long
  658 latency. It is much simpler to exclusively use integers, with a quantization of
  659 all double precision values. Obviously, this quantization should not decrease
  660 the precision of results too much. Furthermore, it should not lead to a design
  661 with a huge latency because of operations that cannot complete within a
  662 single or a few clock cycles. Divisions fall into this category and, moreover, they need
  663 a varying number of clock cycles to complete. Even multiplications can be a
  664 problem: a DSP48 takes inputs of 18 bits maximum. For larger multiplications,
  665 several DSPs must be combined, increasing the latency.
 666
 667 Nevertheless, the hardest constraint does not come from the FPGA characteristics
 668 but from the algorithms. Their VHDL implementation will be efficient only if they
 669 can be fully (or nearly fully) pipelined. On this basis, the choice is quickly made: only a
 670 small part of SPL can be. Indeed, the computation of spline coefficients
 671 requires solving a tridiagonal system $A m = b$. Values in $A$ and $b$ can be
 672 computed from incoming pixel intensities, but afterwards the back-solve starts with
 673 the last values, which breaks the pipeline. Moreover, SPL relies on
 674 interpolating far more points than the profile size. Thus, the end of SPL works on a
 675 larger amount of data than the beginning, which also breaks the pipeline.
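
The pipeline break caused by the tridiagonal solve can be seen in a minimal C
sketch of the classical Thomas algorithm. It is given only as an illustration:
it is a generic solver and not the SPL code of this work, and the function name
and argument layout are assumptions. The forward sweep consumes rows in arrival
order, whereas the back substitution can only start once the last row is known
and then runs from the last unknown to the first.

\begin{verbatim}
#include <stddef.h>

/* Solve a tridiagonal system with sub-diagonal a[1..n-1], diagonal
 * b[0..n-1], super-diagonal c[0..n-2] and right-hand side d[0..n-1].
 * b and d are overwritten; the solution is written to x.
 * No pivoting: assumes n >= 1 and a diagonally dominant matrix. */
static void thomas_solve(size_t n, const double *a, double *b,
                         const double *c, double *d, double *x)
{
    /* Forward sweep: row i only needs row i-1 (input order). */
    for (size_t i = 1; i < n; i++) {
        double w = a[i] / b[i - 1];
        b[i] -= w * c[i - 1];
        d[i] -= w * d[i - 1];
    }
    /* Back substitution: starts from the last row, walks backwards. */
    x[n - 1] = d[n - 1] / b[n - 1];
    for (size_t i = n - 1; i-- > 0; ) {
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    }
}
\end{verbatim}

In a streaming design, the second loop cannot begin before the whole profile has
been received, which is the dependency described above.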
 676
 677 LSQ does not have this problem: all parts except the dichotomic search work on the
 678 same amount of data, i.e. the profile size. Furthermore, LSQ needs fewer
 679 operations than SPL, implying a smaller output latency. Consequently, it is the
 680 best candidate for phase computation. Nevertheless, obtaining a fully pipelined
 681 version requires that the operations of each part complete in a single clock
 682 cycle. This holds in simulation but completely fails when mapping and
 683 routing the design on the Spartan 6. As a result, extra latency is generated and
 684 there must be idle times between two profiles entering the pipeline.
 685
 686 %%Before obtaining the least bitstream, the crucial question is: how to
 687 %%translate the C code the LSQ into VHDL ?
 688
 689
 690 %\subsection{VHDL design paradigms}
 691
 692 \section{Experimental tests}
 693
 694 In this section we explain what we have done so far. Until now, we could not perform
 695 real experiments since we have only just received the FPGA board. Nevertheless, we
 696 will include real experiments in the final version of this paper.
 697
 698 \subsection{VHDL implementation}
 699
 700
 701
 702 % - writing a C code with integers
 703 % - computing the maximum size in bits of each variable as a function of the quantization.
 704 % - quantization tests: balance between precision and FPGA constraints
 705 % - in parallel: Simulink and hand-written VHDL
 706
 707
 708 From the LSQ algorithm, we have written a C program that uses only
 709 integer values. We use a very simple quantization by multiplying
 710 double precision values by a power of two, keeping the integer
 711 part. For example, all values stored in lut$_s$, lut$_c$, $\ldots$ are
 712 scaled by 1024. Since LSQ also computes average, variance, $\ldots$ to
 713 remove the slope, the result of the implied Euclidean divisions may be
 714 relatively inaccurate. To avoid that, we also scale the pixel intensities
 715 by a power of two. Furthermore, assuming $nb_s$ is fixed, these
 716 divisions have a known denominator. Thus, they can be replaced by
 717 their multiplication/shift counterpart. Finally, all other
 718 multiplications or divisions by a power of two have been replaced by
 719 left or right bit shifts. As a result, the code only contains
 720 additions, subtractions and multiplications of signed integers, which
 721 is perfectly suited to FPGAs.
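
The quantization and the multiplication/shift replacement described above can be
sketched in C as follows. This is a minimal illustration and not the code used
in this work: the function names, the 16-bit precision of the reciprocal and the
divisor used in the usage note are assumptions chosen for the example.

\begin{verbatim}
#include <stdint.h>

#define SCALE_SHIFT 10  /* scale factor 2^10 = 1024, as for the LUTs */
#define RECIP_SHIFT 16  /* assumed precision of the reciprocal */

/* Quantize a double: multiply by 2^SCALE_SHIFT, keep the integer part. */
static int32_t quantize(double x)
{
    return (int32_t)(x * (double)(1 << SCALE_SHIFT));
}

/* Precompute round(2^RECIP_SHIFT / d) once for a fixed divisor d > 0. */
static int32_t reciprocal(int32_t d)
{
    return (int32_t)(((1 << RECIP_SHIFT) + d / 2) / d);
}

/* Approximate x / d for x >= 0 with one multiplication and one shift. */
static int32_t div_by_const(int32_t x, int32_t recip)
{
    return (int32_t)(((int64_t)x * recip) >> RECIP_SHIFT);
}
\end{verbatim}

For instance, \verb|div_by_const(1000, reciprocal(20))| evaluates to 50. The
result is only an approximation of the exact quotient for large operands, so the
reciprocal precision has to be chosen according to the range of the numerator.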
 722
 723 As said above, hardware constraints have a great influence on the VHDL
 724 implementation. Consequently, we searched for the maximum value of each
 725 variable as a function of the different scale factors and the size of
 726 profiles, which gives their maximum size in bits. That size determines
 727 the maximum scale factors that allow the use of as few RAMs
 728 and DSPs as possible. Actually, we implemented our algorithm with this maximum
 729 size, but ongoing work studies the impact of quantization on the precision
 730 of the results and on design complexity. We have compared the results of the
 731 LSQ versions using integers and doubles and observed that the precision
 732 of both was similar.
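
As a worked example of this sizing, under assumptions that are ours and not
taken from the implementation: a signed variable bounded by $|x| \le V$ needs
$w = \lceil \log_2(V+1) \rceil + 1$ bits in two's complement. A lookup table
value scaled by 1024 satisfies $|x| \le 1024$ if the unscaled value lies in
$[-1,1]$. If a scaled pixel intensity were hypothetically bounded by 4095, a sum
of $M=20$ products of such values would be bounded by $V_{\mathrm{sum}}$ below:
\begin{align*}
w_{\mathrm{lut}} &= \lceil \log_2(1025) \rceil + 1 = 12,\\
V_{\mathrm{sum}} &= 20 \times 1024 \times 4095 = 83\,865\,600,\\
w_{\mathrm{sum}} &= \lceil \log_2(83\,865\,601) \rceil + 1 = 28.
\end{align*}
With these illustrative bounds, the accumulated sum no longer fits in an 18-bit
multiplier input, which is the kind of situation where several DSPs would have
to be combined.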
 733
 734 Then we built two versions of the VHDL code: one directly by hand coding
 735 and the other with Matlab using the Simulink HDL coder
 736 feature~\cite{HDLCoder}. Although the approaches are completely different,
 737 we obtained VHDL codes that are quite comparable. Each approach has
 738 advantages and drawbacks. Roughly speaking, hand coding provides
 739 cleaner and much better structured code while Simulink allows
 740 code to be produced faster. In terms of throughput and latency,
 741 simulations show that the two approaches are close, with a slight
 742 advantage for hand coding. We hope that real experiments will confirm
 743 that.
 744
 745 \subsection{Simulation}
 746
 747 Currently, we have only simulated our VHDL codes with GHDL and GTKWave (two free
 748 tools available on Linux). Both approaches led to correct results. At the beginning of
 749 our simulations, our pipeline could compute a new phase every 33 cycles and the
 750 length of the pipeline was equal to 95 cycles. When we tried to generate the
 751 corresponding bitstream with the ISE environment, we had many problems because many
 752 stages required more than the 10~ns allowed by the clock period. So we
 753 needed to split some parts of the pipeline in order to add some cycles and
 754 simplify the logic between two clock ticks.
 755 % ghdl + gtkwave
 756 % at best: one phase every 33 cycles, latency of 95 cycles.
 757 % but routing/placement impossible.
 758 \subsection{Bitstream creation}
 759
 760 Currently both approaches provide synthesizable bitstreams with ISE. We expect
 761 that the pipeline will have a latency of 112 cycles, i.e. 1.12~$\mu$s, and that it
 762 can accept a new pixel profile every 48 cycles, i.e. every 480~ns.
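
These figures follow directly from the 10~ns clock period mentioned above. The
symbol names below are introduced only for this illustration, and the throughput
assumes that the pipeline is kept full:
\begin{align*}
T_{\mathrm{latency}} &= 112 \times 10\,\text{ns} = 1.12\,\mu\text{s},\\
T_{\mathrm{interval}} &= 48 \times 10\,\text{ns} = 480\,\text{ns},\\
f_{\mathrm{profiles}} &= \frac{1}{T_{\mathrm{interval}}} \approx 2.08 \times 10^{6}\ \text{profiles/s}.
\end{align*}
For comparison, the initial simulated pipeline (33 and 95 cycles) corresponds to
an interval of 330~ns and a latency of 0.95~$\mu$s.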
 763
 764 % not done yet, but one output every 480ns is expected, with a latency of 1120
 765
 766 \label{sec:results}
 767
 768
 769
 770
 771 \section{Conclusion and perspectives}
 772 In this paper we have presented a new method to estimate cantilever
 773 deflection in an AFM. This method is based on least squares methods. We have
 774 studied the quantization of this algorithm and have implemented it on an
 775 FPGA. Our method gives results comparable to those of the initial version based
 776 on splines. Our solution has been implemented with a pipeline technique.
 777 Consequently, it can handle a new profile image very quickly. Currently,
 778 we have performed simulations and generated bitstreams for a Spartan 6 FPGA; real experiments remain to be done.
 779
 780 In future  work, we want to couple  our algorithm with a  high speed camera
 781 and we plan to control the whole AFM system.
 782
 783 \bibliographystyle{plain}
 784 \bibliography{biblio}
 785
 786 \end{document}