relecture anglais

[dmems12.git] / dmems12.tex
diff --git a/dmems12.tex b/dmems12.tex

index 61bbdbabfeeae8328ac4488f49d376330800ab13..fb6d7d0a61c8bd5006735c3b81936e6ef0805b1f 100644 (file)
--- a/dmems12.tex
+++ b/dmems12.tex
@@ -1,5 +1,793 @@
-\documentclass{article}
+
+\documentclass[10pt, peerreview, compsocconf]{IEEEtran}
+%\usepackage{latex8}
+%\usepackage{times}
+\usepackage[utf8]{inputenc}
+%\usepackage[cyr]{aeguill}
+%\usepackage{pstricks,pst-node,pst-text,pst-3d}
+%\usepackage{babel}
+\usepackage{amsmath}
+\usepackage{url}
+\usepackage{graphicx}
+\usepackage{thumbpdf}
+\usepackage{color}
+\usepackage{moreverb}
+\usepackage{commath}
+\usepackage{subfigure}
+%\input{psfig.sty}
+\usepackage{fullpage}
+\usepackage{fancybox}
+
+\usepackage[ruled,lined,linesnumbered]{algorithm2e}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
+\newcommand{\noun}[1]{\textsc{#1}}
+
+\newcommand{\tab}{\ \ \ }
+
  
  \begin{document}
-blabla
+
+
+%% \author{\IEEEauthorblockN{Authors Name/s per 1st Affiliation (Author)}
+%% \IEEEauthorblockA{line 1 (of Affiliation): dept. name of organization\\
+%% line 2: name of organization, acronyms acceptable\\
+%% line 3: City, Country\\
+%% line 4: Email: name@xyz.com}
+%% \and
+%% \IEEEauthorblockN{Authors Name/s per 2nd Affiliation (Author)}
+%% \IEEEauthorblockA{line 1 (of Affiliation): dept. name of organization\\
+%% line 2: name of organization, acronyms acceptable\\
+%% line 3: City, Country\\
+%% line 4: Email: name@xyz.com}
+%% }
+
+
+
+\title{A new approach based on least square methods to estimate in real time cantilevers deflection with a FPGA}
+\author{\IEEEauthorblockN{Raphaël Couturier\IEEEauthorrefmark{1}, Stéphane Domas\IEEEauthorrefmark{1}, Gwenhaël Goavec-Merou\IEEEauthorrefmark{2} and Michel Lenczner\IEEEauthorrefmark{2}}
+\IEEEauthorblockA{\IEEEauthorrefmark{1}FEMTO-ST, DISC, University of Franche-Comte, Belfort, France\\
+\{raphael.couturier,stephane.domas\}@univ-fcomte.fr}
+\IEEEauthorblockA{\IEEEauthorrefmark{2}FEMTO-ST, Time-Frequency, University of Franche-Comte, Besançon, France\\
+\{michel.lenczner@utbm.fr,gwenhael.goavec@trabucayre.com}
+}
+
+
+
+
+
+
+%\maketitle
+
+\thispagestyle{empty}
+
+\begin{abstract}
+
+Atomics force  microscope (AFM) provide  high resolution images of  surfaces. In
+this paper, we  focus our attention on an interferometry  method to estimate the
+cantilevers deflection.   The initial method  was based on splines  to determine
+the phase  of interference fringes,  and thus the deflection.  Computations were
+performed on a  PC with LabView.  Here,  we propose a new approach  based on the
+least square  methods and its implementation  that we have developed  on a FPGA,
+using the pipelining  technique. Simulations and real tests  showed us that this
+implementation is  very efficient and should  allow us to  control a cantilevers
+array in real time.
+
+
+\end{abstract}
+
+\begin{IEEEkeywords}
+FPGA, cantilever, interferometry.
+\end{IEEEkeywords}
+
+
+\IEEEpeerreviewmaketitle
+
+\section{Introduction}
+
+Cantilevers are  used inside atomic  force microscopes (AFM) which  provide high
+resolution images of surfaces.  Several techniques have been used to measure the
+displacement  of cantilevers  in literature.   For  example, it  is possible  to
+determine    accurately    the    deflection    with    different    mechanisms.
+In~\cite{CantiPiezzo01},   authors  used   piezoresistor  integrated   into  the
+cantilever.   Nevertheless this  approach  suffers from  the  complexity of  the
+microfabrication  process needed  to  implement the  sensor  in the  cantilever.
+In~\cite{CantiCapacitive03},  authors  have  presented a  cantilever  mechanism
+based on capacitive sensing. This  kind of technique also involves to instrument
+the cantilever which results in a complex fabrication process.
+
+In this  paper our attention is focused  on a method based  on interferometry to
+measure cantilevers' displacements.  In  this method cantilevers are illuminated
+by an optic source. The interferometry produces fringes on each cantilever which
+enables  us to compute  the cantilever  displacement.  In  order to  analyze the
+fringes a  high speed camera  is used. Images  need to be processed  quickly and
+then  an estimation method  is required  to determine  the displacement  of each
+cantilever.  In~\cite{AFMCSEM11}, authors have used an algorithm based on spline
+to estimate the cantilevers' positions.
+
+The  overall  process  gives  accurate  results but  all  the  computations  are
+performed on a standard computer using LabView.  Consequently, the main drawback
+of this  implementation is that the computer  is a bottleneck. In  this paper we
+propose  to use  a  method  based on  least  squares and  to  implement all  the
+computation on a FPGA.
+
+The remainder  of the paper  is organized as  follows. Section~\ref{sec:measure}
+describes  more precisely  the measurement  process. Our  solution based  on the
+least  square   method  and   the  implementation  on   FPGA  is   presented  in
+Section~\ref{sec:solus}.       Experimentations      are       described      in
+Section~\ref{sec:results}.  Finally  a  conclusion  and  some  perspectives  are
+presented.
+
+
+
+
+
+\section{Measurement principles}
+\label{sec:measure}
+
+\subsection{Architecture}
+\label{sec:archi}
+
+
+In order to develop simple,  cost effective and user-friendly cantilever arrays,
+authors   of    ~\cite{AFMCSEM11}   have   developed   a    system   based   on
+interferometry. In opposition to other optical based systems, using a laser beam
+deflection scheme and  sensitive to the angular displacement  of the cantilever,
+interferometry  is sensitive  to  the  optical path  difference  induced by  the
+vertical displacement of the cantilever.
+
+The    system   built    by   these    authors   is    based   on    a   Linnick
+interferometer~\cite{Sinclair:05}.   It is illustrated  in Figure~\ref{fig:AFM}.
+A laser  diode is  first split  (by the splitter)  into a  reference beam  and a
+sample beam that reach the cantilever array.   In order to be able to move the
+cantilever array,  it is mounted on  a translation and  rotational hexapod stage
+with five  degrees of freedom.  The optical system  is also fixed to  the stage.
+Thus,  the cantilever  array is  centered  in the  optical system  which can  be
+adjusted accurately.  The  beam illuminates the array by  a microscope objective
+and the light reflects on the cantilevers.  Likewise the reference beam reflects
+on a movable mirror.  A CMOS  camera chip records the reference and sample beams
+which  are  recombined in  the  beam splitter  and  the  interferogram.  At  the
+beginning of each experiment, the movable  mirror is fitted manually in order to
+align  the interferometric  fringes approximately  parallel to  the cantilevers.
+When cantilevers  move due  to the surface,  the bending of  cantilevers produce
+movements in the fringes that can be detected with the CMOS camera.  Finally the
+fringes need to be analyzed. In~\cite{AFMCSEM11}, authors used a LabView program
+to compute the cantilevers' deflections from the fringes.
+
+\begin{figure}    
+\begin{center}
+\includegraphics[width=\columnwidth]{AFM}
+\end{center}
+\caption{schema of the AFM}
+\label{fig:AFM}   
+\end{figure}
+
+
+%% image tirée des expériences.
+
+\subsection{Cantilever deflection estimation}
+\label{sec:deflest}
+
+\begin{figure}    
+\begin{center}
+\includegraphics[width=\columnwidth]{lever-xp}
+\end{center}
+\caption{Portion of an image picked by the camera}
+\label{fig:img-xp}   
+\end{figure}
+
+As shown on image \ref{fig:img-xp}, each cantilever is covered by
+several interferometric fringes. The fringes will distort when
+cantilevers are deflected. Estimating the deflection is done by
+computing this distortion. For that, authors of \cite{AFMCSEM11}
+proposed a method based on computing the phase of the fringes, at the
+base of each cantilever, near the tip, and on the base of the
+array. They assume that a linear relation binds these phases, which
+can be used to "unwrap" the phase at the tip and to determine the deflection.\\
+
+More precisely, segments of pixels are extracted from images taken by
+the camera. These segments are large enough to cover several
+interferometric fringes. As said above, they are placed at the base
+and near the tip of the cantilevers. They are called base profile and
+tip profile in the following. Furthermore, a reference profile is
+taken on the base of the cantilever array.
+
+The pixels intensity $I$ (in gray level) of each profile is modelized by:
+
+\begin{equation}
+\label{equ:profile}
+I(x) = ax+b+A.cos(2\pi f.x + \theta)
+\end{equation}
+
+where $x$ is the position of a pixel in its associated segment.
+
+The global method consists in two main sequences. The first one aims
+to determine the frequency $f$ of each profile with an algorithm based
+on spline interpolation (see section \ref{sec:algo-spline}). It also
+computes the coefficient used for unwrapping the phase. The second one
+is the acquisition loop, during which images are taken at regular time
+steps. For each image, the phase $\theta$ of all profiles is computed
+to obtain, after unwrapping, the deflection of
+cantilevers. Originally, this computation was also done with an
+algorithm based on spline. This article proposes a new version based
+on a least square method.
+
+\subsection{Design goals}
+\label{sec:goals}
+
+The main goal is to implement a computing unit to estimate the
+deflection of about $10\times10$ cantilevers, faster than the stream of
+images coming from the camera. The accuracy of results must be close
+to the maximum precision ever obtained experimentally on the
+architecture, i.e. 0.3nm. Finally, the latency between an image
+entering in the unit and the deflections must be as small as possible
+(NB: future works plan to add some control on the cantilevers).\\
+
+If we put aside some hardware issues like the speed of the link
+between the camera and the computation unit, the time to deserialize
+pixels and to store them in memory, ... the phase computation is
+obviously the bottle-neck of the whole process. For example, if we
+consider the camera actually in use, an exposition time of 2.5ms for
+$1024\times 1204$ pixels seems the minimum that can be reached. For
+100 cantilevers, if we neglect the time to extract pixels, it implies
+that computing the deflection of a single
+cantilever should take less than 25$\mu$s, thus 12.5$\mu$s by phase.\\
+
+In fact, this timing is a very hard constraint. Let us consider a very
+small program that initializes twenty million of doubles in memory
+and then does 1,000,000 cumulated sums on 20 contiguous values
+(experimental profiles have about this size). On an intel Core 2 Duo
+E6650 at 2.33GHz, this program reaches an average of 155Mflops. 
+
+%%Itimplies that the phase computation algorithm should not take more than
+%%$155\times 12.5 = 1937$ floating operations. For integers, it gives $3000$ operations. 
+
+Obviously, some cache effects and optimizations on
+huge amount of computations can drastically increase these
+performances: peak efficiency is about 2.5Gflops for the considered
+CPU. But this is not the case for phase computation that used only a few
+tenth of values.\\
+
+In order to evaluate the original algorithm, we translated it in C
+language. As stated before, for 20 pixels, it does about 1,550
+operations, thus an estimated execution time of $1,550/155
+=$10$\mu$s. For a more realistic evaluation, we constructed a file of
+1Mo containing 200 profiles of 20 pixels, equally scattered. This file
+is equivalent to an image stored in a device file representing the
+camera. We obtained an average of 10.5$\mu$s by profile (including I/O
+accesses). It is under our requirements but close to the limit. In
+case of an occasional load of the system, it could be largely
+overtaken. A solution would be to use a real-time operating system but
+another one to search for a more efficient algorithm.
+
+But the main drawback is the latency of such a solution: since each
+profile must be treated one after another, the deflection of 100
+cantilevers takes about $200\times 10.5 = 2.1$ms, which is inadequate
+for an efficient control. An obvious solution is to parallelize the
+computations, for example on a GPU. Nevertheless, the cost of transferring
+profile in GPU memory and of taking back results would be prohibitive
+compared to computation time. It is certainly more efficient to
+pipeline the computation. For example, supposing that 200 profiles of
+20 pixels can be pushed sequentially in the pipelined unit cadenced at
+a 100MHz (i.e. a pixel enters in the unit each 10ns), all profiles
+would be treated in $200\times 20\times 10.10^{-9} =$ 40$\mu$s plus
+the latency of the pipeline. This is about 500 times faster than
+actual results.\\
+
+For these reasons, a FPGA as the computation unit is the best choice
+to achieve the required performance. Nevertheless, passing from
+a C code to a pipelined version in VHDL is not obvious at all. As
+explained in the next section, it can even be impossible because of
+some hardware constraints specific to FPGAs.
+
+
+\section{Proposed solution}
+\label{sec:solus}
+
+Project Oscar aims  to provide a hardware and  software architecture to estimate
+and  control the  deflection of  cantilevers. The  hardware part  consists  in a
+high-speed camera, linked  on an embedded board hosting FPGAs.  In this way, the
+camera output stream can be pushed  directly into the FPGA. The software part is
+mostly the VHDL  code that deserializes the camera  stream, extracts profile and
+computes  the deflection. Before  focusing on  our work  to implement  the phase
+computation, we give some general information about FPGAs and the board we use.
+
+\subsection{FPGAs}
+
+A field-programmable gate  array (FPGA) is an integrated  circuit designed to be
+configured by the customer. FGPAs are composed of programmable logic components,
+called  configurable logic  blocks (CLB).  These blocks  mainly  contain look-up
+tables  (LUT), flip/flops (F/F)  and latches,  organized in  one or  more slices
+connected together. Each CLB can be configured to perform simple (AND, XOR, ...)
+or complex  combinational functions.  They are  interconnected by reconfigurable
+links.  Modern  FPGAs contain  memory elements and  multipliers which  enable to
+simplify the  design and  to increase the  performance. Nevertheless,  all other
+complex  operations, like  division, trigonometric  functions, $\ldots$  are not
+available  and  must  be  done  by  configuring  a  set  of  CLBs.   Since  this
+configuration  is not  obvious at  all, it  can be  done via  a  framework, like
+ISE~\cite{ISE}. Such  a software can synthetize  a design written  in a hardware
+description language  (HDL), map it onto  CLBs, place/route them  for a specific
+FPGA, and finally produce a bitstream  that is used to configure the FPGA. Thus,
+from  the developer's  point of  view, the  main difficulty  is to  translate an
+algorithm in HDL  code, taking into account FPGA  resources and constraints like
+clock signals and I/O values that drive the FPGA.
+
+Indeed, HDL programming is very different from classic languages like
+C. A program can be seen as a state-machine, manipulating signals that
+evolve from state to state. Moreover, HDL instructions can executed
+concurrently. Basic logic operations are used to aggregate signals to
+produce new states and assign it to another signal. States are mainly
+expressed as arrays of bits. Fortunately, libraries propose some
+higher levels representations like signed integers, and arithmetic
+operations.
+
+Furthermore, even if FPGAs are cadenced more slowly than classic
+processors, they can perform pipeline as well as parallel
+operations. A pipeline consists in cutting a process in a sequence of
+small tasks, taking the same execution time. It accepts a new data at
+each clock top, thus, after a known latency, it also provides a result
+at each clock top. However, using a pipeline consumes more logics
+since the components of a task are not reusable by another
+one. Nevertheless it is probably the most efficient technique on
+FPGA. Because of its architecture, it is also very easy to process
+several data concurrently. Whenever possible, the best performance
+is reached using parallelism to handle simultaneously several
+pipelines in order to handle multiple data streams.
+
+\subsection{The board}
+
+The board we use is designed by the Armadeus company, under the name
+SP Vision. It consists in a development board hosting a i.MX27 ARM
+processor (from Freescale). The board includes all classical
+connectors: USB, Ethernet, ... A Flash memory contains a Linux kernel
+that can be launched after booting the board via u-Boot.
+
+The processor is directly connected to a Spartan3A FPGA (from Xilinx)
+via its special interface called WEIM. The Spartan3A is itself
+connected to a Spartan6 FPGA. Thus, it is possible to develop programs
+that communicate between i.MX and Spartan6, using Spartan3 as a
+tunnel. By default, the WEIM interface provides a clock signal at
+100MHz that is connected to dedicated FPGA pins.
+
+The Spartan6 is an LX100 version. It has 15822 slices, each slice
+containing 4 LUTs and 8 flip/flops. It is equivalent to 101261 logic
+cells. There are 268 internal block RAM of 18Kbits, and 180 dedicated
+multiply-adders (named DSP48), which is largely enough for our
+project.
+
+Some I/O pins of Spartan6 are connected to two $2\times 17$ headers
+that can be used as user wants. For the project, they will be
+connected to the interface card of the camera.
+
+\subsection{Considered algorithms}
+
+Two solutions have been studied to achieve phase computation. The
+original one, proposed by A. Meister and M. Favre, is based on
+interpolation by splines. It allows to compute frequency and
+phase. The second one, detailed in this article, is based on a
+classical least square method which supposes that the frequency is already
+known.
+
+\subsubsection{Spline algorithm (SPL)}
+\label{sec:algo-spline}
+Let us consider a profile $P$, that is a segment of $M$ pixels with an
+intensity in gray levels. Let us call $I(x)$ the intensity of profile in $x
+\in [0,M[$. 
+
+At first, only $M$ values of $I$ are known, for $x = 0, 1,
+\ldots,M-1$. A normalization allows to scale known intensities into
+$[-1,1]$. We compute splines that fit at best these normalized
+intensities. Splines are used to interpolate $N = k\times M$ points
+(typically $k=4$ is sufficient), within $[0,M[$. Let  $x^s$ be the
+coordinates of these $N$ points and $I^s$ their intensities.
+
+In order to have the frequency, the mean line $a.x+b$ (see equation \ref{equ:profile}) of $I^s$ is
+computed. Finding intersections of $I^s$ and this line allows us to obtain
+the period and thus the frequency.
+
+The phase is computed via the equation:
+\begin{equation}
+\theta = atan \left[ \frac{\sum_{i=0}^{N-1} sin(2\pi f x^s_i) \times I^s(x^s_i)}{\sum_{i=0}^{N-1} cos(2\pi f x^s_i) \times I^s(x^s_i)} \right]
+\end{equation}
+
+Two things can be noticed:
+\begin{itemize}
+\item the frequency could also be obtained using the derivates of
+  spline equations, which only implies to solve quadratic equations.
+\item frequency of each profile is computed only once, before the
+  acquisition loop. Thus, $sin(2\pi f x^s_i)$ and $cos(2\pi f x^s_i)$
+  could also be computed before the loop, which would lead to a much faster
+  computation of $\theta$.
+\end{itemize}
+
+\subsubsection{Least square algorithm (LSQ)}
+
+Assuming that we compute the phase during the acquisition loop,
+equation \ref{equ:profile} has only 4 parameters: $a, b, A$, and
+$\theta$, $f$ and $x$ being already known. Since $I$ is non-linear, a
+least square method based on a Gauss-newton algorithm can be used to
+determine these four parameters. Since it is an iterative process
+ending with a convergence criterion, it is obvious that it is not
+particularly adapted to our design goals.
+
+Fortunately, it is quite simple to reduce the number of parameters to
+only $\theta$. Let $x^p$ be the coordinates of pixels in a segment of
+size $M$. Thus, $x^p = 0, 1, \ldots, M-1$. Let $I(x^p)$ be their
+intensity. Firstly, we "remove" the slope by computing:
+
+\[I^{corr}(x^p) = I(x^p) - a.x^p - b\]
+
+Since linear equation coefficients are searched, a classical least
+square method can be used to determine $a$ and $b$:
+
+\[a = \frac{covar(x^p,I(x^p))}{var(x^p)} \]
+
+Assuming an overlined symbol means an average, then:
+
+\[b = \overline{I(x^p)} - a.\overline{{x^p}}\]
+
+Let $A$ be the amplitude of $I^{corr}$, i.e. 
+
+\[A = \frac{max(I^{corr}) - min(I^{corr})}{2}\]
+
+Then, the least square method to find $\theta$ is reduced to search the minimum of:
+
+\[\sum_{i=0}^{M-1} \left[ cos(2\pi f.i + \theta) - \frac{I^{corr}(i)}{A} \right]^2\]
+
+It is equivalent to derivating this expression and to solving the following equation:
+
+
+%\begin{eqnarray*}{l}
+$$2\left[ cos\theta \sum_{i=0}^{M-1} I^{corr}(i).sin(2\pi f.i) \right.$$
+$$\left. + sin\theta \sum_{i=0}^{M-1} I^{corr}(i).cos(2\pi f.i)\right]- $$
+$$ A\left[ cos2\theta \sum_{i=0}^{M-1} sin(4\pi f.i) + sin2\theta \sum_{i=0}^{M-1} cos(4\pi f.i)\right]   = 0$$
+%\end{eqnarray*}
+
+
+Several points can be noticed:
+\begin{itemize}
+\item As in the spline method, some parts of this equation can be
+  computed before the acquisition loop. It is the case of sums that do
+  not depend on $\theta$:
+
+\[ \sum_{i=0}^{M-1} sin(4\pi f.i), \sum_{i=0}^{M-1} cos(4\pi f.i) \] 
+
+\item Lookup tables for $sin(2\pi f.i)$ and $cos(2\pi f.i)$ can also be
+computed.
+
+\item The simplest method to find the good $\theta$ is to discretize
+  $[-\pi,\pi]$ in $nb_s$ steps, and to search which step leads to the
+  result closest to zero. Hence, three other lookup tables can
+  also be computed before the loop:
+
+\[ sin \theta, cos \theta, \]
+
+\[ \left[ cos 2\theta \sum_{i=0}^{M-1} sin(4\pi f.i) + sin 2\theta \sum_{i=0}^{M-1} cos(4\pi f.i)\right] \]
+
+\item This search can be very fast using a dichotomous process in $log_2(nb_s)$ 
+
+\end{itemize}
+
+Finally, this is synthetized in an algorithm (called LSQ in the following) in two parts, one before and one during the acquisition loop:
+\begin{algorithm}[htbp]
+\caption{LSQ algorithm - before acquisition loop.}
+\label{alg:lsq-before}
+
+   $M \leftarrow $ number of pixels of the profile\\
+   I[] $\leftarrow $ intensity of pixels\\
+   $f \leftarrow $ frequency of the profile\\
+   $s4i \leftarrow \sum_{i=0}^{M-1} sin(4\pi f.i)$\\
+   $c4i \leftarrow \sum_{i=0}^{M-1} cos(4\pi f.i)$\\
+   $nb_s \leftarrow $ number of discretization steps of $[-\pi,\pi]$\\
+
+   \For{$i=0$ to $nb_s $}{
+     $\theta  \leftarrow -\pi + 2\pi\times \frac{i}{nb_s}$\\
+     lut$_s$[$i$] $\leftarrow sin \theta$\\
+     lut$_c$[$i$] $\leftarrow cos \theta$\\
+     lut$_A$[$i$] $\leftarrow cos 2 \theta \times s4i + sin 2 \theta \times c4i$\\
+     lut$_{sfi}$[$i$] $\leftarrow sin (2\pi f.i)$\\
+     lut$_{cfi}$[$i$] $\leftarrow cos (2\pi f.i)$\\
+   }
+\end{algorithm}
+
+\begin{algorithm}[htbp]
+\caption{LSQ algorithm - during acquisition loop.}
+\label{alg:lsq-during}
+
+   $\bar{x} \leftarrow \frac{M-1}{2}$\\
+   $\bar{y} \leftarrow 0$, $x_{var} \leftarrow 0$, $xy_{covar} \leftarrow 0$\\
+   \For{$i=0$ to $M-1$}{
+     $\bar{y} \leftarrow \bar{y} + $ I[$i$]\\
+     $x_{var} \leftarrow x_{var} + (i-\bar{x})^2$\\
+   }
+   $\bar{y} \leftarrow \frac{\bar{y}}{M}$\\
+   \For{$i=0$ to $M-1$}{
+     $xy_{covar} \leftarrow xy_{covar} + (i-\bar{x}) \times (I[i]-\bar{y})$\\
+   }
+   $slope \leftarrow \frac{xy_{covar}}{x_{var}}$\\
+   $start \leftarrow y_{moy} - slope\times \bar{x}$\\
+   \For{$i=0$ to $M-1$}{
+     $I[i] \leftarrow I[i] - start - slope\times i$\\
+   }
+   
+   $I_{max} \leftarrow max_i(I[i])$, $I_{min} \leftarrow min_i(I[i])$\\
+   $amp \leftarrow \frac{I_{max}-I_{min}}{2}$\\
+
+   $Is \leftarrow 0$, $Ic \leftarrow 0$\\
+   \For{$i=0$ to $M-1$}{
+     $Is \leftarrow Is + I[i]\times $ lut$_{sfi}$[$i$]\\
+     $Ic \leftarrow Ic + I[i]\times $ lut$_{cfi}$[$i$]\\
+   }
+
+   $\delta \leftarrow \frac{nb_s}{2}$, $b_l \leftarrow 0$, $b_r \leftarrow \delta$\\
+   $v_l \leftarrow -2.I_s - amp.$lut$_A$[$b_l$]\\
+
+   \While{$\delta >= 1$}{
+
+     $v_r \leftarrow 2.[ Is.$lut$_c$[$b_r$]$ + Ic.$lut$_s$[$b_r$]$ ] - amp.$lut$_A$[$b_r$]\\
+
+     \If{$!(v_l < 0$ and $v_r >= 0)$}{
+       $v_l \leftarrow v_r$ \\
+       $b_l \leftarrow b_r$ \\
+     }
+     $\delta \leftarrow \frac{\delta}{2}$\\
+     $b_r \leftarrow b_l + \delta$\\
+   }
+   \uIf{$!(v_l < 0$ and $v_r >= 0)$}{
+     $v_l \leftarrow v_r$ \\
+     $b_l \leftarrow b_r$ \\
+     $b_r \leftarrow b_l + 1$\\
+     $v_r \leftarrow 2.[ Is.$lut$_c$[$b_r$]$ + Ic.$lut$_s$[$b_r$]$ ] - amp.$lut$_A$[$b_r$]\\
+   }
+   \Else {
+     $b_r \leftarrow b_l + 1$\\
+   }
+
+   \uIf{$ abs(v_l) < v_r$}{
+     $b_{\theta} \leftarrow b_l$ \\
+   }
+   \Else {
+     $b_{\theta} \leftarrow b_r$ \\
+   }
+   $\theta \leftarrow \pi\times \left[\frac{2.b_{ref}}{nb_s}-1\right]$\\
+
+\end{algorithm}
+
+\subsubsection{Comparison}
+
+We compared the two algorithms on the base of three criteria:
+\begin{itemize}
+\item precision of results on a cosines profile, distorted by noise,
+\item number of operations,
+\item complexity of implementating an FPGA version.
+\end{itemize}
+
+For the first item, we produced a matlab version of each algorithm,
+running with double precision values. The profile was generated for
+about 34000 different values of period ($\in [3.1, 6.1]$, step = 0.1),
+phase ($\in [-3.1 , 3.1]$, step = 0.062) and slope ($\in [-2 , 2]$,
+step = 0.4). For LSQ, $nb_s = 1024$, which leads to a maximal error of
+$\frac{\pi}{1024}$ on phase computation. Current A. Meister and
+M. Favre's experiments show a ratio of 50 between the variation of a phase and
+the deflection of a lever. Thus, the maximal error due to
+discretization corresponds to an error of 0.15nm on the lever
+deflection, which is smaller than the best precision they achieved,
+i.e. 0.3nm.
+
+For each test, we add some noise to the profile: each group of two
+pixels has its intensity added to a random number picked in $[-N,N]$
+(NB: it should be noticed that picking a new value for each pixel does
+not distort enough the profile). The absolute error on the result is
+evaluated by comparing the difference between the reference and
+computed phase, out of $2\pi$, expressed in percentage. That is: $err =
+100\times \frac{|\theta_{ref} - \theta_{comp}|}{2\pi}$.
+
+Table  \ref{tab:algo_prec}  gives  the   maximum  and  average  error  for  both
+algorithms and increasing values of $N$.
+
+\begin{table}[ht]
+  \begin{center}
+    \begin{tabular}{|c|c|c|c|c|}
+      \hline
+  & \multicolumn{2}{c|}{SPL} & \multicolumn{2}{c|}{LSQ} \\ \cline{2-5}
+  noise & max. err. & aver. err. & max. err. & aver. err. \\ \hline
+  0 & 2.46 & 0.58 & 0.49 & 0.1 \\ \hline
+  2.5 & 2.75 & 0.62 & 1.16 & 0.22 \\ \hline
+  5 & 3.77 & 0.72 & 2.47 & 0.41 \\ \hline
+  7.5 & 4.72 & 0.86 & 3.33 & 0.62 \\ \hline
+  10 & 5.62 & 1.03 & 4.29 & 0.81 \\ \hline
+  15 & 7.96 & 1.38 & 6.35 & 1.21 \\ \hline
+  30 & 17.06 & 2.6 & 13.94 & 2.45 \\ \hline
+
+\end{tabular}
+\caption{Error (in \%) for cosines profiles, with noise.}
+\label{tab:algo_prec}
+\end{center}
+\end{table}
+
+These results show that the two algorithms are very close, with a
+slight advantage for LSQ. Furthermore, both behave very well against
+noise. Assuming the experimental ratio of 50 (see above), an error of
+1 percent on the phase corresponds to an error of 0.5nm on the lever
+deflection, which is very close to the best precision.
+
+Obviously, it is very hard to predict which level of noise will be
+present in real experiments and how it will distort the
+profiles. Nevertheless, we can see on figure \ref{fig:noise20} the
+profile with $N=10$ that leads to the biggest error. It is a bit
+distorted, with pikes and straight/rounded portions, and relatively
+close to experiments. Figure \ref{fig:noise60}
+shows a sample of worst profile for $N=30$. It is completely distorted,
+largely beyond the worst experimental ones. 
+
+\begin{figure}[ht]
+\begin{center}
+  \includegraphics[width=\columnwidth]{intens-noise20}
+\end{center}
+\caption{Sample of worst profile for N=10}
+\label{fig:noise20}
+\end{figure}
+
+\begin{figure}[ht]
+\begin{center}
+  \includegraphics[width=\columnwidth]{intens-noise60}
+\end{center}
+\caption{Sample of worst profile for N=30}
+\label{fig:noise60}
+\end{figure}
+
+The second criterion is relatively easy to estimate for LSQ and harder
+for SPL because of $atan$ operation. In both cases, it is proportional
+to the numbers of pixels $M$. For LSQ, it also depends on $nb_s$ and for
+SPL on $N = k\times M$, i.e. the number of interpolated points. 
+
+We assume that $M=20$, $nb_s=1024$, $k=4$, all possible parts are
+already in lookup tables and a limited set of operations (+, -, *, /,
+$<$, $>$) is taken into account. Translating both algorithms in C code, we
+obtain about 430 operations for LSQ and 1,550 (plus a few tenth for
+$atan$) for SPL. This result is largely in favor of LSQ. Nevertheless,
+considering the total number of operations is not really pertinent for
+an FPGA implementation: it mainly depends on the type of operations
+and their
+ordering. The final decision is thus driven by the third criterion.\\
+
+The Spartan 6 used in our architecture has a hard constraint: it has no built-in
+floating  point  units.   Obviously,  it   is  possible  to  use  some  existing
+"black-boxes"  for double  precision operations.   But  they have  quite a  long
+latency. It is much simpler to  exclusively use integers, with a quantization of
+all double  precision values. Obviously,  this quantization should  not decrease
+too much the  precision of results. Furthermore, it should not  lead to a design
+with  a huge  latency because  of operations  that could  not complete  during a
+single or  few clock  cycles. Divisions fall  into that category  and, moreover,
+they need a varying number of clock cycles to complete. Even multiplications can
+be a problem: a DSP48 takes inputs of 18 bits maximum.  For larger multiplications,
+several DSP must be combined, increasing the latency.
+
+Nevertheless, the hardest constraint does not come from the FPGA characteristics
+but from  the algorithms.  Their VHDL implementation  will be efficient  only if
+they can be fully (or near) pipelined.  Thus, the choice is quickly made: only a
+small  part  of  SPL  can  be  pipelined.  Indeed,  the  computation  of  spline
+coefficients implies to solve a tridiagonal  system $A.m = b$. Values in $A$ and
+$b$ can  be computed  from incoming pixels  intensity but after,  the back-solve
+starts with the latest values,  which breaks the pipeline.  Moreover, SPL relies
+on interpolating far  more points than profile size. Thus, the  end of SPL works
+on a larger amount of data than at the beginning, which also breaks the pipeline.
+
+LSQ has  not this problem: all parts  except the dichotomial search  work on the
+same  amount  of  data, i.e.  the  profile  size.  Furthermore, LSQ  needs  less
+operations than SPL, implying a  smaller output latency. Consequently, it is the
+best candidate for phase  computation. Nevertheless, obtaining a fully pipelined
+version supposes that  operations of different parts complete  in a single clock
+cycle. It is  the case for simulations but it completely  fails when mapping and
+routing the design  on the Spartan6. Thus,  extra-latency is generated and
+there must be idle times between two profiles entering into the pipeline.
+
+%%Before obtaining the least bitstream, the crucial question is: how to
+%%translate the C code the LSQ into VHDL ?
+
+
+%\subsection{VHDL design paradigms}
+
+\section{Experimental tests}
+
+%In this section we explain what  we have done yet. Until now, we could not perform
+%real experiments  since we just have  received the FGPA  board. Nevertheless, we
+%will include real experiments in the final version of this paper.
+
+\subsection{VHDL implementation}
+
+From the LSQ algorithm, we have written a C program that uses only
+integer values. We use a very simple quantization by multiplying
+double precision values by a power of two, keeping the integer
+part. For example, all values stored in lut$_s$, lut$_c$, $\ldots$ are
+scaled by 1,024. Since LSQ also computes average, variance, ... to
+remove the slope, the result of implied Euclidean divisions may be
+relatively wrong. To avoid that, we also scale the pixel intensities
+by a power of two. Furthermore, assuming $nb_s$ is fixed, these
+divisions have a known denominator. Thus, they can be replaced by
+their multiplication/shift counterpart. Finally, all other
+multiplications or divisions by a power of two have been replaced by
+left or right bit shifts. Thus, the code only contains
+additions, subtractions and multiplications of signed integers, which
+are perfectly adapted to FGPAs.
+
+As  mentioned above, hardware  constraints have  a great  influence on  the VHDL
+implementation. Consequently, we searched the  maximum value of each variable as
+a function of the different scale  factors and the size of profiles, which gives
+their maximum size in bits. That  size determines the maximum scale factors that
+allow to  use the  least possible  RAMs and DSPs.  Actually, we  implemented our
+algorithm  with  this  maximum  size  but  current works  study  the  impact  of
+quantization on  the results precision  and design complexity. We  have compared
+the result of  the LSQ version using integers and doubles  and observed that the
+precision of both were similar.
+
+Then we built two versions of VHDL codes: one directly by hand coding
+and the other with Matlab using the Simulink HDL coder
+feature~\cite{HDLCoder}. Although the approach is completely different
+we obtained VHDL codes that are quite comparable. Each approach has
+advantages and drawbacks.  Roughly speaking, hand coding provides
+beautiful and much better structured code while Simulink enables us to
+produce a code faster.  In terms of throughput and latency,
+simulations show that the two approaches are close with a slight
+advantage for hand coding.  We hope that real experiments will confirm
+that.
+
+\subsection{Simulation}
+
+Before experimental tests on the board, we simulated our two VHDL
+codes with GHDL and GTKWave (two free tools with linux). For that, we
+built a testbench based on profiles taken from experimentations and
+compared the results to values given by the SPL algorithm. Both
+versions lead to correct results.
+
+Our first codes were highly optimized : the pipeline could compute a
+new phase each 33 cycles and its latency was equal to 95 cycles. Since
+the Spartan6 is clocked at 100MHz, it implies that estimating the
+deflection of 100 cantilevers would take about $(95 + 200\times 33).10
+= 66.95\mu$s, i.e. nearly 15,000 estimations by second.
+
+\subsection{Bitstream creation}
+
+In order to test our code on the SP Vision board, the design was
+extended with a component that keeps profiles in RAM, flushes them in
+the phase computation component and stores its output in another
+RAM. We also added a wishbone : a component that can "drive" signals
+to communicate between i.MX and other components. It is mainly used
+to start to flush profiles and to retrieve the computed phases in RAM.
+
+Unfortunately, the first  designs could not be placed and route  with ISE on the
+Spartan6 with  a 100MHz clock. The  main problems came from  routing values from
+RAMs to DSPs and obtaining a result  under 10ns. So, we needed to decompose some
+parts of  the pipeline, which  adds some cycles.  For example, some  delays have
+been introduced between  RAMs output and DSPs. Finally,  we obtained a bitstream
+that has a latency  of 112 cycles and computes a new  phase every 40 cycles. For
+100 cantilevers, it takes $(112 + 200\times 40).10 = 81.12\mu$s to compute their
+deflection.
+
+This bitstream has been successfully tested on the board.
+
+
+
+\label{sec:results}
+
+
+
+
+\section{Conclusion and perspectives}
+In this paper we have presented a new method to estimate the
+cantilevers deflection in an AFM.  This method is based on least
+square methods.  We have used quantization to produce an algorithm
+based exclusively on integer values, which is adapted to a FPGA
+implementation. We obtained a precision on results similar to the
+initial version based on splines.  Our solution has been implemented
+with a pipeline technique.  Consequently, it enables to handle a new
+profile image very quickly. Currently we have performed simulations
+and real tests on a Spartan6 FPGA.
+
+In future work,  we plan to study  the quantization. Then we want  to couple our
+algorithm with a high speed camera and we plan to control the whole AFM system.
+
+\bibliographystyle{plain}
+\bibliography{biblio}
+
  \end{document}