new

[dmems12.git] / dmems12.tex
diff --git a/dmems12.tex b/dmems12.tex

index 94e96e4203f948817f5a9ac856345e4ef3cb7537..0c00ee0fdbaa7dfde3bde6414f23d19261524caf 100644 (file)
--- a/dmems12.tex
+++ b/dmems12.tex
@@ -1,11 +1,15 @@
-
-\documentclass[10pt, peerreview, compsocconf]{IEEEtran}
  %\usepackage{latex8}
  %\usepackage{times}
-\usepackage[utf8]{inputenc}
  %\usepackage[cyr]{aeguill}
  %\usepackage{pstricks,pst-node,pst-text,pst-3d}
  %\usepackage{babel}
+%\input{psfig.sty}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
+
+
+\documentclass[10pt, peerreview, compsocconf]{IEEEtran}
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\usepackage[utf8]{inputenc}
  \usepackage{amsmath}
  \usepackage{url}
  \usepackage{graphicx}
@@ -14,20 +18,44 @@
  \usepackage{moreverb}
  \usepackage{commath}
  \usepackage{subfigure}
-%\input{psfig.sty}
  \usepackage{fullpage}
  \usepackage{fancybox}
-
  \usepackage[ruled,lined,linesnumbered]{algorithm2e}
  
-%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
-\newcommand{\noun}[1]{\textsc{#1}}
+\setcounter{MaxMatrixCols}{10}
+%TCIDATA{OutputFilter=LATEX.DLL}
+%TCIDATA{Version=5.50.0.2953}
+%TCIDATA{<META NAME="SaveForMode" CONTENT="1">}
+%TCIDATA{BibliographyScheme=BibTeX}
+%TCIDATA{LastRevised=Wednesday, October 26, 2011 09:49:54}
+%TCIDATA{<META NAME="GraphicsSave" CONTENT="32">}
  
+\newcommand{\noun}[1]{\textsc{#1}}
  \newcommand{\tab}{\ \ \ }
  
  
  \begin{document}
  
+\title{A new approach based on a least square method for real-time estimation of cantilever array deflections with a FPGA}
+\author{\IEEEauthorblockN{Raphaël Couturier\IEEEauthorrefmark{1}, Stéphane
+Domas\IEEEauthorrefmark{1}, Gwenhaël Goavec-Merou\IEEEauthorrefmark{2} and
+Michel Lenczner\IEEEauthorrefmark{2}} 
+\IEEEauthorblockA{\IEEEauthorrefmark{1}FEMTO-ST, DISC, University of Franche-Comte, Belfort, France \and 
+\{raphael.couturier,stephane.domas\}@univ-fcomte.fr} 
+\IEEEauthorblockA{\IEEEauthorrefmark{2}FEMTO-ST, Time-Frequency, University of Franche-Comte, Besançon, France \and 
+\{michel.lenczner@utbm.fr,gwenhael.goavec@trabucayre.com} }
+
+\begin{abstract}
+Atomic force microscopes (AFM) provide high resolution images of surfaces.
+In this paper, we focus our attention on an interferometry method for
+deflection estimation of cantilever arrays in quasi-static regime. In its
+original form, spline interpolation was used to determine interference
+fringe phase, and thus the deflections. Computations were performed on a PC.
+Here, we propose a new complete solution with a least square based algorithm
+and an optimized FPGA implementation. Simulations and real tests showed very
+good results and open perspective for real-time estimation and control of
+cantilever arrays in the dynamic regime.
+\end{abstract}
  
  %% \author{\IEEEauthorblockN{Authors Name/s per 1st Affiliation (Author)}
  %% \IEEEauthorblockA{line 1 (of Affiliation): dept. name of organization\\
@@ -42,432 +70,418 @@
  %% line 4: Email: name@xyz.com}
  %% }
  
-
-
-\title{Using FPGAs for high speed and real time cantilever deflection estimation}
-\author{\IEEEauthorblockN{Raphaël Couturier\IEEEauthorrefmark{1}, Stéphane Domas\IEEEauthorrefmark{1}, Gwenhaël Goavec-Merou\IEEEauthorrefmark{2} and Michel Lenczner\IEEEauthorrefmark{2}}
-\IEEEauthorblockA{\IEEEauthorrefmark{1}FEMTO-ST, DISC, University of Franche-Comte, Belfort, France\\
-\{raphael.couturier,stephane.domas\}@univ-fcomte.fr}
-\IEEEauthorblockA{\IEEEauthorrefmark{2}FEMTO-ST, Time-Frequency, University of Franche-Comte, Besançon, France\\
-\{michel.lenczner@utbm.fr,gwenhael.goavec@trabucayre.com}
-}
-
-
-
-
-
-
  %\maketitle
  
  \thispagestyle{empty}
  
-\begin{abstract}
-
-  
-
-
-\end{abstract}
-
  \begin{IEEEkeywords}
-FPGA, cantilever, interferometry.
+FPGA, cantilever arrays, interferometry.
  \end{IEEEkeywords}
  
-
  \IEEEpeerreviewmaketitle
  
  \section{Introduction}
  
-Cantilevers  are  used  inside  atomic  force  microscope (AFM) which  provides  high
-resolution images of  surfaces.  Several technics have been  used to measure the
-displacement  of cantilevers  in litterature.   For example,  it is  possible to
-determine  accurately  the  deflection  with different  mechanisms. 
-In~\cite{CantiPiezzo01},   authors  used   piezoresistor  integrated   into  the
-cantilever.   Nevertheless this  approach  suffers from  the  complexity of  the
-microfabrication  process needed  to  implement the  sensor  in the  cantilever.
-In~\cite{CantiCapacitive03},  authors  have  presented an  cantilever  mechanism
-based on  capacitive sensing. This kind  of technic also  involves to instrument
-the cantiliver which result in a complex fabrication process.
-
-In this  paper our attention is focused  on a method based  on interferometry to
-measure cantilevers' displacements.  In  this method cantilevers are illuminated
-by  an optic  source. The  interferometry produces  fringes on  each cantilevers
-which enables to  compute the cantilever displacement.  In  order to analyze the
-fringes a  high speed camera  is used. Images  need to be processed  quickly and
-then  a estimation  method is  required to  determine the  displacement  of each
-cantilever.  In~\cite{AFMCSEM11},  the authors have  used an algorithm  based on
-spline to estimate the cantilevers' positions.
-
-   The overall  process gives
-accurate results  but all the computation  are performed on  a standard computer
-using labview.  Consequently,  the main drawback of this  implementation is that
-the computer is a bootleneck in the overall process. In this paper we propose to
-use a  method based on least  square and to  implement all the computation  on a
-FGPA.
-
-The remainder  of the paper  is organized as  follows. Section~\ref{sec:measure}
-describes  more precisely  the measurement  process. Our  solution based  on the
-least  square   method  and   the  implementation  on   FPGA  is   presented  in
-Section~\ref{sec:solus}.       Experimentations      are       described      in
-Section~\ref{sec:results}.  Finally  a  conclusion  and  some  perspectives  are
-presented.
-
-
-
-%% quelques ref commentées sur les calculs basés sur l'interférométrie
-
-\section{Measurement principles}
-\label{sec:measure}
-
-
-
-
+Cantilevers are used in atomic force microscopes (AFM) which provide high
+resolution surface images. Several techniques have been reported in
+literature for cantilever displacement measurement. In~\cite{CantiPiezzo01},
+authors have shown how a piezoresistor can be integrated into a cantilever
+for deflection measurement. Nevertheless this approach suffers from the
+complexity of the microfabrication process needed to implement the sensor.
+In~\cite{CantiCapacitive03}, authors have presented a cantilever mechanism
+based on capacitive sensing. These techniques require cantilever
+instrumentation resulting in\ complex fabrication processes.
+
+In this paper  our attention is focused on a method  based on interferometry for
+cantilever  displacement  measurement in  quasi-static  regime. Cantilevers  are
+illuminated  by an  optical  source.  Interferometry  produces fringes  enabling
+cantilever displacement computation. A high  speed camera is used to analyze the
+fringes. In view of real time  applications, images need to be processed quickly
+and then a  fast estimation method is required to  determine the displacement of
+each  cantilever. In~\cite{AFMCSEM11},  an algorithm  based on  spline  has been
+introduced  for  cantilever  position  estimation.  The  overall  process  gives
+accurate results  but computations  are performed on  a standard  computer using
+LabView      \textsuperscript{\textregistered}     \textsuperscript{\copyright}.
+Consequently, the main drawback of this implementation is that the computer is a
+bottleneck. In this  paper we pose the problem  of real-time cantilever position
+estimation and  bring a  hardware/software solution. It  includes a  fast method
+based on least squares and its FPGA implementation.
+
+The remainder of the paper is organized as follows. Section~\ref{sec:measure}
+describes the measurement process. Our solution based on the least square
+method and its implementation on a FPGA is presented in Section~\ref{sec:solus}. Numerical experimentations are described in Section~\ref{sec:results}. Finally a conclusion and some perspectives are drawn.
+
+\section{Architecture and goals}
  
+\label{sec:measure}
  
+In order to build simple, cost effective and user-friendly cantilever
+arrays, authors of ~\cite{AFMCSEM11} have developed a system based on
+interferometry.
  
+\subsection{Experimental setup}
  
-\subsection{Architecture}
  \label{sec:archi}
-%% description de l'architecture générale de l'acquisition d'images
-%% avec au milieu une unité de traitement dont on ne précise pas ce
-%% qu'elle est.
-
-In order to develop simple,  cost effective and user-friendly cantilever arrays,
-authors   of    ~\cite{AFMCSEM11}   have   developped   a    system   based   of
-interferometry. In opposition to other optical based systems, using a laser beam
-deflection scheme and  sentitive to the angular displacement  of the cantilever,
-interferometry  is sensitive  to  the  optical path  difference  induced by  the
-vertical displacement of the cantilever.
-
-The system build  by authors of~\cite{AFMCSEM11} has been  developped based on a
-Linnick     interferomter~\cite{Sinclair:05}.    It     is     illustrated    in
-Figure~\ref{fig:AFM}.  A  laser diode  is first split  (by the splitter)  into a
-reference beam and a sample beam  that reachs the cantilever array.  In order to
-be  able to  move  the cantilever  array, it  is  mounted on  a translation  and
-rotational hexapod  stage with  five degrees of  freedom. The optical  system is
-also fixed to the stage.  Thus,  the cantilever array is centered in the optical
-system which  can be adjusted accurately.   The beam illuminates the  array by a
-microscope objective  and the  light reflects on  the cantilevers.  Likewise the
-reference beam  reflects on a  movable mirror.  A  CMOS camera chip  records the
-reference and  sample beams which  are recombined in  the beam splitter  and the
-interferogram.   At the  beginning of  each  experiment, the  movable mirror  is
-fitted  manually in  order to  align the  interferometric  fringes approximately
-parallel  to the cantilevers.   When cantilevers  move due  to the  surface, the
-bending of  cantilevers produce  movements in the  fringes that can  be detected
-with    the    CMOS    camera.     Finally    the    fringes    need    to    be
-analyzed. In~\cite{AFMCSEM11}, the authors used a LabView program to compute the
-cantilevers' movements from the fringes.
-
-\begin{figure}    
+
+In opposition to other optical based system\textbf{s u}sing a laser beam
+deflection scheme and sensitive to the angular displacement of the
+cantilever, interferometry is sensitive to the optical path difference
+induced by the vertical displacement of the cantilever.
+
+The system is based on a Linnick interferometer~\cite{Sinclair:05}.
+It is illustrated in Figure~\ref{fig:AFM} \footnote{by courtesy of
+  CSEM}. A laser diode is first split (by the splitter) into a
+reference beam and a sample beam both reaching the cantilever array.
+The complete system including a cantilever array\ and the optical
+system can be moved thanks to a translation and rotational hexapod
+stage with five degrees of freedom. Thus, the cantilever array is
+centered in the optical system which can be adjusted accurately.  The
+beam illuminates the array by a microscope objective and the light
+reflects on the cantilevers. Likewise the reference beam reflects on a
+movable mirror. A CMOS camera chip records the reference and sample
+beams which are recombined in the beam splitter and the
+interferogram. At the beginning of each experiment, the movable mirror
+is fitted manually in order to align the interferometric fringes
+approximately parallel to the cantilevers. Then, cantilever motion in
+the transverse direction produces movements in the fringes. They are
+detected with the CMOS camera which images are analyzed by a Labview
+program to recover the cantilever deflections.
+
+\begin{figure}[tbp]
  \begin{center}
  \includegraphics[width=\columnwidth]{AFM}
  \end{center}
-\caption{schema of the AFM}
-\label{fig:AFM}   
+\caption{AFM Setup}
+\label{fig:AFM}
  \end{figure}
  
-
  %% image tirée des expériences.
  
-\subsection{Cantilever deflection estimation}
-\label{sec:deflest}
-
-As shown on image \ref{img:img-xp}, each cantilever is covered by
-interferometric fringes. The fringes will distort when cantilevers are
-deflected. Estimating the deflection is done by computing this
-distortion. For that, (ref A. Meister + M Favre) proposed a method
-based on computing the phase of the fringes, at the base of each
-cantilever, near the tip, and on the base of the array. They assume
-that a linear relation binds these phases, which can be use to
-"unwrap" the phase at the tip and to determine the deflection.\\
+\subsection{Inteferometric based cantilever deflection estimation}
  
-More precisely, segment of pixels are extracted from images taken by a
-high-speed camera. These segments are large enough to cover several
-interferometric fringes and are placed at the base and near the tip of
-the cantilevers. They are called base profile and tip profile in the
-following. Furthermore, a reference profile is taken on the base of
-the cantilever array.
+\label{sec:deflest}
  
-The pixels intensity $I$ (in gray level) of each profile is modelized by:
+\begin{figure}[tbp]
+\begin{center}
+\includegraphics[width=\columnwidth]{lever-xp}
+\end{center}
+\caption{Portion of a camera image showing moving interferometric fringes in
+cantilevers}
+\label{fig:img-xp}
+\end{figure}
  
+As shown in Figure \ref{fig:img-xp} \footnote{by courtesy of CSEM}, each
+cantilever is covered by several interferometric fringes. The fringes
+distort when cantilevers are deflected.  In \cite{AFMCSEM11}, a novel
+method for interferometric based cantilever deflection measurement was
+reported. For each cantilever, the method uses three segments of pixels,
+parallel to its section, to determine phase shifts.  The first is
+located just above the AFM tip (tip profile), it provides the phase
+shift modulo $2\pi $. The second one is close to the base junction
+(base profile) and is used to determine the exact multiple of $2\pi $
+through an operation called unwrapping where it is assumed that the
+deflection means along the two measurement segments are linearly
+dependent.  The third is on the base and provides a reference for
+noise suppression.  Finally, deflections are simply derived from phase
+shifts.
+
+The pixel gray-level intensity $I$ of each profile is modelized by%
  \begin{equation}
-\label{equ:profile}
-I(x) = ax+b+A.cos(2\pi f.x + \theta)
-\end{equation}
-
-where $x$ is the position of a pixel in its associated segment.
+I(x)=A\text{ }\cos (2\pi fx+\theta )+ax+b  \label{equ:profile}
+\end{equation}%
+where $x$ denotes the position of a pixel in a segment, $A$, $f$ and $\theta 
+$ are the amplitude, the frequency and the phase of the light signal when
+the affine function $ax+b$ corresponds to the cantilever array surface tilt
+with respect to the light source. 
+
+The method consists in two main sequences.  In the first one
+corresponding to precomputation, the frequency $f$ of each profile is
+determined using a spline interpolation (see section \ref%
+{sec:algo-spline}) and the coefficient used for phase unwrapping is
+computed. The second one, that we call the \textit{acquisition loop,}
+is done after images have been taken at regular time steps. For each
+image, the phase $\theta $ of all profiles is computed to obtain,
+after unwrapping, the cantilever deflection. The phase determination
+in \cite{AFMCSEM11} is achieved by a spline based algorithm which is
+the most consuming part of the computation. In this article, we
+propose an alternate version based on the least square method which is
+faster and better suited for FPGA implementation.
+
+\subsection{Computation design goals}
  
-The global method consists in two main sequences. The first one aims
-to determin the frequency $f$ of each profile with an algorithm based
-on spline interpolation (see section \ref{algo-spline}). It also
-computes the coefficient used for unwrapping the phase. The second one
-is the acquisition loop, while which images are taken at regular time
-steps. For each image, the phase $\theta$ of all profiles is computed
-to obtain, after unwrapping, the deflection of
-cantilevers. Originally, this computation was also done with an
-algorithm based on spline. This article proposes a new version based
-on a least square method.
-
-\subsection{Design goals}
  \label{sec:goals}
  
-The main goal is to implement a computing unit to estimate the
-deflection of about $10\times10$ cantilevers, faster than the stream of
-images coming from the camera. The accuracy of results must be close
-to the maximum precision ever obtained experimentally on the
-architecture, i.e. 0.3nm. Finally, the latency between an image
-entering in the unit and the deflections must be as small as possible
-(NB: future works plan to add some control on the cantilevers).\\
-
-If we put aside some hardware issues like the speed of the link
+To evaluate the solution performances, we choose a goal which consists
+in designing a computing unit able to estimate the deflections of
+a $10\times 10$%
+-cantilever array, faster than the camera image stream. In addition,
+the result accuracy must be close to 0.3nm, the maximum precision
+reached in \cite{AFMCSEM11}. Finally, the latency between the entrance
+of the first pixel of an image and the end of deflection computation
+must be as small as possible. All these requirement are
+stated in the perspective of implementing real-time active control for
+each cantilever, see~\cite{LencznerChap10,Hui11}.
+
+If we put aside other hardware issues like the speed of the link
  between the camera and the computation unit, the time to deserialize
-pixels and to store them in memory, ... the phase computation is
-obviously the bottle-neck of the whole process. For example, if we
-consider the camera actually in use, an exposition time of 2.5ms for
-$1024\times 1204$ pixels seems the minimum that can be reached. For
-100 cantilevers, if we neglect the time to extract pixels, it implies
-that computing the deflection of a single
-cantilever should take less than 25$\mu$s, thus 12.5$\mu$s by phase.\\
-
-In fact, this timing is a very hard constraint. Let consider a very
-small programm that initializes twenty million of doubles in memory
-and then does 1000000 cumulated sums on 20 contiguous values
-(experimental profiles have about this size). On an intel Core 2 Duo
-E6650 at 2.33GHz, this program reaches an average of 155Mflops. 
-
-%%Itimplies that the phase computation algorithm should not take more than
-%%$155\times 12.5 = 1937$ floating operations. For integers, it gives $3000$ operations. 
-
-Obviously, some cache effects and optimizations on
-huge amount of computations can drastically increase these
-performances: peak efficiency is about 2.5Gflops for the considered
-CPU. But this is not the case for phase computation that used only few
-tenth of values.\\
-
-In order to evaluate the original algorithm, we translated it in C
-language. As said further, for 20 pixels, it does about 1550
-operations, thus an estimated execution time of $1550/155
-=$10$\mu$s. For a more realistic evaluation, we constructed a file of
-1Mo containing 200 profiles of 20 pixels, equally scattered. This file
-is equivalent to an image stored in a device file representing the
-camera. We obtained an average of 10.5$\mu$s by profile (including I/O
-accesses). It is under are requirements but close to the limit. In
-case of an occasional load of the system, it could be largely
-overtaken. A solution would be to use a real-time operating system but
-another one to search for a more efficient algorithm.
-
-But the main drawback is the latency of such a solution: since each
-profile must be treated one after another, the deflection of 100
-cantilevers takes about $200\times 10.5 = 2.1$ms, which is inadequate
-for an efficient control. An obvious solution is to parallelize the
-computations, for example on a GPU. Nevertheless, the cost to transfer
-profile in GPU memory and to take back results would be prohibitive
-compared to computation time. It is certainly more efficient to
-pipeline the computation. For example, supposing that 200 profiles of
-20 pixels can be pushed sequentially in the pipelined unit cadenced at
-a 100MHz (i.e. a pixel enters in the unit each 10ns), all profiles
-would be treated in $200\times 20\times 10.10^{-9} =$ 40$\mu$s plus
-the latency of the pipeline. This is about 500 times faster than
-actual results.\\
-
-For these reasons, an FPGA as the computation unit is the best choice
-to achieve the required performance. Nevertheless, passing from
-a C code to a pipelined version in VHDL is not obvious at all. As
-explained in the next section, it can even be impossible because of
-some hardware constraints specific to FPGAs.
-
-
-\section{Proposed solution}
+pixels and to store them in memory, the phase computation is the
+bottleneck of the whole process. For example, the camera in the setup
+of \cite{AFMCSEM11} provides $%
+1024\times 1204$ pixels with an exposition time of 2.5ms. Thus, if we
+the pixel extraction time is neglected, each phase calculation of a
+100-cantilever array should take no more than 12.5$\mu$s. 
+
+In fact, this timing is a very hard constraint. To illustrate this point, we
+consider a very small program that initializes twenty million of doubles in
+memory and then does 1,000,000 cumulated sums on 20 contiguous values
+(experimental profiles have about this size). On an intel Core 2 Duo E6650
+at 2.33GHz, this program reaches an average of 155Mflops. 
+Obviously, some cache effects and optimizations on huge amount of
+computations can drastically increase these performances: peak efficiency is
+about 2.5Gflops for the considered CPU. But this is not the case for phase
+computation that is using only a few tenth of values.
+
+In order to evaluate the original algorithm, we translated it in C language.
+As stated before, for 20 pixels, it does about 1,550 operations, thus an
+estimated execution time of $1,550/155=$10$\mu $s. For a more realistic
+evaluation, we constructed a file of 1Mo containing 200 profiles of 20
+pixels, equally scattered. This file is equivalent to an image stored in a
+device file representing the camera. We obtained an average of 10.5$\mu$s
+by profile (including I/O accesses). It is under our requirements but close
+to the limit. In case of an occasional load of the system, it could be
+largely overtaken. Solutions would be to use a real-time operating system or
+to search for a more efficient algorithm.
+
+However, the main drawback is the latency of such a solution because each
+profile must be treated one after another and the deflection of 100
+cantilevers takes about $200\times 10.5=2.1$ms. This would be inadequate
+for real-time requirements as for individual cantilever active control. An
+obvious solution is to parallelize the computations, for example on a GPU.
+Nevertheless, the cost of transferring profile in GPU memory and of taking
+back results would be prohibitive compared to computation time.
+
+We remark that when possible, it is more efficient to pipeline the
+computation. For example, supposing that 200 profiles of 20 pixels
+could be pushed sequentially in a pipelined unit cadenced at a 100MHz
+(i.e. a pixel enters in the unit each 10ns), all profiles would be
+treated in $200\times 20\times 10.10^{-9}=$ 40$\mu$s plus the latency
+of the pipeline. Such a solution would be meeting our requirements and
+would be 50 times faster than our C code, and even more compared to
+the LabView version use in \cite{AFMCSEM11}. FPGAs are appropriate for
+such implementation, so they turn out to be the computation units of
+choice to reach our performance requirements. Nevertheless, passing
+from a C code to a pipelined version in VHDL is not obvious at all. It
+can even be impossible because of FPGA hardware constraints. All these
+points are discussed in the following sections.
+
+\section{An hardware/software solution}
+
  \label{sec:solus}
  
-Project Oscar aims  to provide a hardware and  software architecture to estimate
-and  control the  deflection of  cantilevers. The  hardware part  consists  in a
-high-speed camera,  linked on an embedded  board hosting FPGAs. By  the way, the
-camera output stream can be pushed  directly into the FPGA. The software part is
-mostly the VHDL  code that deserializes the camera  stream, extracts profile and
-computes  the deflection. Before  focusing on  our work  to implement  the phase
-computation, we give some general information about FPGAs and the board we use.
-
-\subsection{FPGAs}
-
-A field-programmable gate array (FPGA) is an integrated circuit
-designed to be configured by the customer. FGPAs are composed of
-programmable logic components, called configurable logic blocks
-(CLB). These blocks mainly contains look-up tables (LUT), flip/flops
-(F/F) and latches, organized in one or more slices connected
-together. Each CLB can be configured to perform simple (AND, XOR, ...)
-or complex combinational functions. They are interconnected by
-reconfigurable links. Modern FPGAs contain memory elements and
-multipliers which enable to simplify the design and to increase the
-performance. Nevertheless, all other complex operations, like
-division, trigonometric functions, $\ldots$ are not available and must
-be done by configuring a set of CLBs. Since this configuration is not
-obvious at all, it can be done via a framework, like ISE. Such a
-software can synthetize a design written in a hardware description
-language (HDL), map it onto CLBs, place/route them for a specific
-FPGA, and finally produce a bitstream that is used to configre the
-FPGA. Thus, from the developper point of view, the main difficulty is
-to translate an algorithm in HDL code, taking account FPGA resources
-and constraints like clock signals and I/O values that drive the FPGA.
+In  this  section we  present  parts  of the  computing  solution  to the  above
+requirements. The  hardware part consists in  a high-speed camera,  linked on an
+embedded board hosting  two FPGAs. In this way, the camera  output stream can be
+pushed directly  into the FPGA. The software  part is mostly the  VHDL code that
+deserializes the camera stream, extracts profiles and computes the deflection.
+
+We first give some general information about FPGAs, then we
+describe the FPGA board we use for implementation and finally the two
+algorithms for phase computation are detailed. Presentation of VHDL
+implementations is postponned until Section \ref{Experimental tests}. 
+
+
+
+\subsection{Elements of FPGA architecture and programming}
+
+A field-programmable gate array (FPGA) is an integrated circuit designed to
+be configured by the customer. FGPAs are composed of programmable logic
+components, called configurable logic blocks (CLB). These blocks mainly
+contain look-up tables (LUT), flip/flops (F/F) and latches, organized in one
+or more slices connected together. Each CLB can be configured to perform
+simple (AND, XOR, ...) or complex combinational functions. They are
+interconnected by reconfigurable links. Modern FPGAs contain memory elements
+and multipliers which enable to simplify the design and to increase the
+performance. Nevertheless, all other complex operations like division and
+other functions like trigonometric functions are not available and must be
+built by configuring a set of CLBs. Since this is not an obvious task at
+all, tools like ISE~\cite{ISE} have been built to do this operation. Such a
+software can synthetize a design written in a hardware description language
+(HDL), maps it onto CLBs, place/route them for a specific FPGA, and finally
+produces a bitstream that is used to configure the FPGA. Thus, from the
+developer's point of view, the main difficulty is to translate an algorithm
+into HDL code, taking into account FPGA resources and constraints like clock
+signals and I/O values that drive the FPGA.
  
  Indeed, HDL programming is very different from classic languages like
  C. A program can be seen as a state-machine, manipulating signals that
-evolve from state to state. By the way, HDL instructions can execute
-concurrently. Basic logic operations are used to agregate signals to
-produce new states and assign it to another signal. States are mainly
-expressed as arrays of bits. Fortunaltely, libraries propose some
-higher levels representations like signed integers, and arithmetic
-operations.
-
-Furthermore, even if FPGAs are cadenced more slowly than classic
-processors, they can perform pipeline as well as parallel
-operations. A pipeline consists in cutting a process in sequence of
-small tasks, taking the same execution time. It accepts a new data at
-each clock top, thus, after a known latency, it also provides a result
-at each clock top. However, using a pipeline consumes more logics
-since the components of a task are not reusable by another
-one. Nevertheless it is probably the most efficient technique on
-FPGA. Because of its architecture, it is also very easy to process
-several data concurrently. When it is possible, the best performance
-is reached using parallelism to handle simultaneously several
-pipelines in order to handle multiple data streams.
-
-\subsection{The board}
-
-The board we use is designed by the Armadeus compagny, under the name
-SP Vision. It consists in a development board hosting a i.MX27 ARM
-processor (from Freescale). The board includes all classical
-connectors: USB, Ethernet, ... A Flash memory contains a Linux kernel
-that can be launched after booting the board via u-Boot.
-
-The processor is directly connected to a Spartan3A FPGA (from Xilinx)
-via its special interface called WEIM. The Spartan3A is itself
-connected to a Spartan6 FPGA. Thus, it is possible to develop programs
-that communicate between i.MX and Spartan6, using Spartan3 as a
-tunnel. By default, the WEIM interface provides a clock signal at
-100MHz that is connected to dedicated FPGA pins.
-
-The Spartan6 is an LX100 version. It has 15822 slices, equivalent to
-101261 logic cells. There are 268 internal block RAM of 18Kbits, and
-180 dedicated multiply-adders (named DSP48), which is largely enough
-for our project.
-
-Some I/O pins of Spartan6 are connected to two $2\times 17$ headers
-that can be used as user wants. For the project, they will be
-connected to the interface card of the camera.
-
-\subsection{Considered algorithms}
-
-Two solutions have been studied to achieve phase computation. The
-original one, proposed by A. Meister and M. Favre, is based on
-interpolation by splines. It allows to compute frequency and
-phase. The second one, detailed in this article, is based on a
-classical least square method but suppose that frequency is already
-known.
-
-\subsubsection{Spline algorithm}
+evolve from state to state. Moreover, HDL instructions can be executed
+concurrently. Signals may be combined with basic logic operations to
+produce new states that are assigned to another signal. States are mainly expressed as
+arrays of bits.  Fortunately, libraries propose some higher levels
+representations like signed integers, and arithmetic operations.
+
+Furthermore, even if FPGAs are cadenced more slowly than classic processors,
+they can perform pipelines as well as parallel operations. A pipeline
+consists in cutting a process in a sequence of small tasks, taking the same
+execution time. It accepts a new data at each clock top, thus, after a known
+latency, it also provides a result at each clock top. We observe that the
+components of a task are not reusable by another one. Nevertheless, this is
+the most efficient technique on FPGAs. Because of their architecture, it is
+also very easy to process several data concurrently. Finally, the best
+performance can be reached when several pipelines are operating on multiple
+data streams in parallel.
+
+\subsection{The FPGA board}
+
+The architecture we use is designed by the Armadeus Systems
+company. It consists in a development board called APF27 \textsuperscript{\textregistered}, hosting a
+i.MX27 ARM processor (from Freescale) and a Spartan3A (from
+XIlinx). This board includes all classical connectors as USB and
+Ethernet for instance. A Flash memory contains a Linux kernel that can
+be launched after booting the board via u-Boot. The processor is
+directly connected to the Spartan3A via its special interface called
+WEIM. The Spartan3A is itself connected to an extension board called
+SP Vision \textsuperscript{\textregistered}, that hosts a Spartan6 FPGA. Thus, it is
+possible to develop programs that communicate between i.MX and
+Spartan6, using Spartan3 as a tunnel. A clock signal at 100MHz (by
+default) is delivered to dedicated FPGA pins. The Spartan6 of our
+board is an LX100 version. It has 15,822 slices, each slice containing
+4 LUTs and 8 flip/flops. It is equivalent to 101,261 logic
+cells. There are 268 internal block RAM of 18Kbits, and 180 dedicated
+multiply-adders (named DSP48), which is largely enough for our
+project. Some I/O pins of Spartan6 are connected to two $2\times 17$
+headers that can be used for any purpose as to be connected to the
+interface of a camera.
+
+\subsection{Two algorithms for phase computation}
+
+In \cite{AFMCSEM11}, $f$ the frequency and $\theta $\ the phase of the
+light wave are computed thanks to spline interpolation. As said in
+section \ref{sec:deflest}, $f$ is computed only once time but the
+phase needs to be computed for each image. This is why, in this paper,
+we focus on its computation.
+
+\subsubsection{Spline algorithm (SPL)}
+
  \label{sec:algo-spline}
-Let consider a profile $P$, that is a segment of $M$ pixels with an
-intensity in gray levels. Let call $I(x)$ the intensity of profile in $x
-\in [0,M[$. 
-
-At  first, only $M$  values of  $I$ are  known, for  $x =  0, 1,  \ldots,M-1$. A
-normalisation  allows  to scale  known  intensities  into  $[-1,1]$. We  compute
-splines  that fit  at best  these normalised  intensities. Splines  (SPL  in the
-following) are  used to interpolate $N  = k\times M$ points  (typically $k=4$ is
-sufficient), within $[0,M[$. Let call  $x^s$ the coordinates of these $N$ points
-    and $I^s$ their intensities.
-
-In order to have the frequency, the mean line $a.x+b$ (see equation \ref{equ:profile}) of $I^s$ is
-computed. Finding intersections of $I^s$ and this line allow to obtain
-the period thus the frequency.
-
-The phase is computed via the equation:
+
+We denote by $M$ the number of pixels in a segment used for phase
+computation. For the sake of simplicity of the notations, we consider
+the light intensity $I$ to be a mapping of the physical segment in the
+interval $[0,M[$. The pixels are assumed to be regularly spaced and
+centered at the positions $x^{p}\in\{0,1,\ldots,M-1\}.$ We use the simplest
+definition of a pixel, namely the value of $I$ at its center. The
+pixel intensities are considered as pre-normalized so that their
+minimum and maximum have been resized to $-1$ and $1$. 
+
+The first step consists in computing the cubic spline interpolation of
+the intensities. This allows for interpolating $I$ at a larger number
+$L=k\times M$ of points (typically $k=4$ is sufficient) $%
+x^{s}$ in the interval $[0,M[$. During the precomputation sequence,
+the second step is to determin the afine part $a.x+b$ of $I$. It is
+found with an ordinary least square method, taking account the $L$
+points. Values of $I$ in $x^s$ are used to compute its intersections
+with $a.x+b$. The period of $I$ (and thus its frequency) is deduced
+from the number of intersections and the distance between the first
+and last.
+
+During the acquisition loop, the second step is the phase computation, with
  \begin{equation}
-\theta = atan \left[ \frac{\sum_{i=0}^{N-1} sin(2\pi f x^s_i) \times I^s(x^s_i)}{\sum_{i=0}^{N-1} cos(2\pi f x^s_i) \times I^s(x^s_i)} \right]
+\theta =atan\left[ \frac{\sum_{i=0}^{N-1}\text{sin}(2\pi fx_{i}^{s})\times
+I(x_{i}^{s})}{\sum_{i=0}^{N-1}\text{cos}(2\pi fx_{i}^{s})\times I(x_{i}^{s})}%
+\right] .
  \end{equation}
  
-Two things can be noticed:
-\begin{itemize}
-\item the frequency could also be obtained using the derivates of
-  spline equations, which only implies to solve quadratic equations.
-\item frequency of each profile is computed a single time, before the
-  acquisition loop. Thus, $sin(2\pi f x^s_i)$ and $cos(2\pi f x^s_i)$
-  could also be computed before the loop, which leads to a much faster
-  computation of $\theta$.
-\end{itemize}
-
-\subsubsection{Least square algorithm}
-
-Assuming that we compute the phase during the acquisition loop,
-equation \ref{equ:profile} has only 4 parameters: $a, b, A$, and
-$\theta$, $f$ and $x$ being already known. Since $I$ is non-linear, a
-least square method based on a Gauss-newton algorithm can be used to
-determine these four parameters. Since it is an iterative process
-ending with a convergence criterion, it is obvious that it is not
-particularly adapted to our design goals.
-
-Fortunatly, it is quite simple to reduce the number of parameters to
-only $\theta$. Let $x^p$ be the coordinates of pixels in a segment of
-size $M$. Thus, $x^p = 0, 1, \ldots, M-1$. Let $I(x^p)$ be their
-intensity. Firstly, we "remove" the slope by computing:
-
-\[I^{corr}(x^p) = I(x^p) - a.x^p - b\]
-
-Since linear equation coefficients are searched, a classical least
-square method can be used to determine $a$ and $b$:
+\textit{Remarks: }
  
-\[a = \frac{covar(x^p,I(x^p))}{var(x^p)} \]
-
-Assuming an overlined symbol means an average, then:
-
-\[b = \overline{I(x^p)} - a.\overline{{x^p}}\]
-
-Let $A$ be the amplitude of $I^{corr}$, i.e. 
-
-\[A = \frac{max(I^{corr}) - min(I^{corr})}{2}\]
-
-Then, the least square method to find $\theta$ is reduced to search the minimum of:
-
-\[\sum_{i=0}^{M-1} \left[ cos(2\pi f.i + \theta) - \frac{I^{corr}(i)}{A} \right]^2\]
-
-It is equivalent to derivate this expression and to solve the following equation:
-
-\begin{eqnarray*}
-2\left[ cos\theta \sum_{i=0}^{M-1} I^{corr}(i).sin(2\pi f.i) + sin\theta \sum_{i=0}^{M-1} I^{corr}(i).cos(2\pi f.i)\right] \\
-- A\left[ cos2\theta \sum_{i=0}^{M-1} sin(4\pi f.i) + sin2\theta \sum_{i=0}^{M-1} cos(4\pi f.i)\right]   = 0
-\end{eqnarray*}
-
-Several points can be noticed:
  \begin{itemize}
-\item As in the spline method, some parts of this equation can be
-  computed before the acquisition loop. It is the case of sums that do
-  not depend on $\theta$:
-
-\[ \sum_{i=0}^{M-1} sin(4\pi f.i), \sum_{i=0}^{M-1} cos(4\pi f.i) \] 
-
-\item Lookup tables for $sin(2\pi f.i)$ and $cos(2\pi f.i)$ can also be
-computed.
+\item The frequency could also be obtained using the derivates of spline
+equations, which only implies to solve quadratic equations but certainly
+yields higher errors.
  
-\item The simplest method to find the good $\theta$ is to discretize
-  $[-\pi,\pi]$ in $nb_s$ steps, and to search which step leads to the
-  result closest to zero. By the way, three other lookup tables can
-  also be computed before the loop:
-
-\[ sin \theta, cos \theta, \]
+\item Profile frequency are computed during the precomputation step,
+  thus the values sin$(2\pi fx_{i}^{s})$ and cos$(2\pi fx_{i}^{s})$
+  can be determined once for all.
+\end{itemize}
  
-\[ \left[ cos 2\theta \sum_{i=0}^{M-1} sin(4\pi f.i) + sin 2\theta \sum_{i=0}^{M-1} cos(4\pi f.i)\right] \]
+\subsubsection{Least square algorithm (LSQ)}
+
+Assuming that we compute the phase during the acquisition loop, equation \ref%
+{equ:profile} has only 4 parameters: $a,b,A$, and $\theta $, $f$ and $x$
+being already known. Since $I$ is non-linear, a least square method based on
+a Gauss-newton algorithm can be used to determine these four parameters.
+This kind of iterative process ends with a convergence criterion, so it is
+not suited to our design goals. Fortunately, it is quite simple to reduce
+the number of parameters to only $\theta $. Firstly, the afine part $ax+b$
+is estimated from the $M$ values $I(x^{p})$ to determine the rectified
+intensities,%
+\begin{equation*}
+I^{corr}(x^{p})\approx I(x^{p})-a.x^{p}-b.
+\end{equation*}%
+To find $a$ and $b$ we apply an ordinary least square method (as in SPL but on $M$ points)%
+\begin{equation*}
+a=\frac{covar(x^{p},I(x^{p}))}{\text{var}(x^{p})}\text{ and }b=\overline{%
+I(x^{p})}-a.\overline{{x^{p}}}
+\end{equation*}%
+where overlined symbols represent average. Then the amplitude $A$ is
+approximated by%
+\begin{equation*}
+A\approx \frac{\text{max}(I^{corr})-\text{min}(I^{corr})}{2}.
+\end{equation*}%
+Finally, the problem of approximating $\theta $ is reduced to minimizing%
+\begin{equation*}
+\min_{\theta \in \lbrack -\pi ,\pi ]}\sum_{i=0}^{M-1}\left[ \text{cos}(2\pi
+f.i+\theta )-\frac{I^{corr}(i)}{A}\right] ^{2}.
+\end{equation*}%
+An optimal value $\theta ^{\ast }$ of the minimization problem is a zero of
+the first derivative of the above argument,%\begin{eqnarray*}{l}
+\begin{equation*}
+2\left[ \text{cos}\theta ^{\ast }\sum_{i=0}^{M-1}I^{corr}(i).\text{sin}(2\pi
+f.i)\right.
+\end{equation*}%
+\begin{equation*}
+\left. +\text{sin}\theta ^{\ast }\sum_{i=0}^{M-1}I^{corr}(i).\text{cos}(2\pi
+f.i)\right] -
+\end{equation*}%
+\begin{equation*}
+A\left[ \text{cos}2\theta ^{\ast }\sum_{i=0}^{M-1}\sin (4\pi f.i)+\text{sin}%
+2\theta ^{\ast }\sum_{i=0}^{M-1}\cos (4\pi f.i)\right] =0
+\end{equation*}%
+%
+%\end{eqnarray*}
  
-\item This search can be very fast using a dichotomous process in $log_2(nb_s)$ 
+Several points can be noticed:
  
+\begin{itemize}
+\item The terms $\sum_{i=0}^{M-1}$sin$(4\pi f.i)$ and$\sum_{i=0}^{M-1}$cos$%
+(4\pi f.i)$ are independent of $\theta $, they can be precomputed.
+
+\item Lookup tables (namely lut$_{sfi}$ and lut$_{cfi}$ in the following algorithms) can be
+  set with the $2.M$ values $\sin (2\pi f.i)$ and $\cos (2\pi f.i)$.
+
+\item A simple method to find a zero $\theta ^{\ast }$ of the optimality
+condition is to discretize the range $[-\pi ,\pi ]$ with a large number $%
+nb_{s}$ of nodes and to find which one is a minimizer in the absolute value
+sense. Hence, three other lookup tables (lut$_{s}$, lut$_{c}$ and lut$_{A}$) can be set with the $%
+3\times nb_{s}$ values $\sin \theta ,$ $\cos \theta ,$ 
+\begin{equation*}
+\left[ cos2\theta \sum_{i=0}^{M-1}sin(4\pi f.i)+sin2\theta
+\sum_{i=0}^{M-1}cos(4\pi f.i)\right] .
+\end{equation*}
+
+\item The search algorithm can be very fast using a dichotomous process in $%
+log_{2}(nb_{s}).$
  \end{itemize}
  
-Finally, the whole summarizes in an algorithm (called LSQ in the following) in two parts, one before and one during the acquisition loop:
-\begin{algorithm}[h]
+The overall method is synthetized in an algorithm (called LSQ in the
+following) divided into the precomputing part and the acquisition loop:
+
+\begin{algorithm}[htbp]
  \caption{LSQ algorithm - before acquisition loop.}
  \label{alg:lsq-before}
  
     $M \leftarrow $ number of pixels of the profile\\
-   I[] $\leftarrow $ intensities of pixels\\
+   I[] $\leftarrow $ intensity of pixels\\
     $f \leftarrow $ frequency of the profile\\
     $s4i \leftarrow \sum_{i=0}^{M-1} sin(4\pi f.i)$\\
     $c4i \leftarrow \sum_{i=0}^{M-1} cos(4\pi f.i)$\\
@@ -483,7 +497,7 @@ Finally, the whole summarizes in an algorithm (called LSQ in the following) in t
     }
  \end{algorithm}
  
-\begin{algorithm}[ht]
+\begin{algorithm}[htbp]
  \caption{LSQ algorithm - during acquisition loop.}
  \label{alg:lsq-during}
  
@@ -502,7 +516,7 @@ Finally, the whole summarizes in an algorithm (called LSQ in the following) in t
     \For{$i=0$ to $M-1$}{
       $I[i] \leftarrow I[i] - start - slope\times i$\\
     }
-   
+
     $I_{max} \leftarrow max_i(I[i])$, $I_{min} \leftarrow min_i(I[i])$\\
     $amp \leftarrow \frac{I_{max}-I_{min}}{2}$\\
  
@@ -546,75 +560,85 @@ Finally, the whole summarizes in an algorithm (called LSQ in the following) in t
  
  \end{algorithm}
  
-\subsubsection{Comparison}
+\subsubsection{Algorithm comparison}
  
  We compared the two algorithms on the base of three criteria:
+
  \begin{itemize}
-\item precision of results on a cosinus profile, distorted with noise,
+\item precision of results on a cosines profile, distorted by noise,
+
  \item number of operations,
-\item complexity to implement an FPGA version.
+
+\item complexity of FPGA implementation
  \end{itemize}
  
  For the first item, we produced a matlab version of each algorithm,
-running with double precision values. The profile was generated for
-about 34000 different values of period ($\in [3.1, 6.1]$, step = 0.1),
-phase ($\in [-3.1 , 3.1]$, step = 0.062) and slope ($\in [-2 , 2]$,
-step = 0.4). For LSQ, $nb_s = 1024$, which leads to a maximal error of
-$\frac{\pi}{1024}$ on phase computation. Current A. Meister and
-M. Favre experiments show a ratio of 50 between variation of phase and
-the deflection of a lever. Thus, the maximal error due to
-discretization correspond to an error of 0.15nm on the lever
-deflection, which is smaller than the best precision they achieved,
-i.e. 0.3nm.
-
-For each test, we add some noise to the profile: each group of two
-pixels has its intensity added to a random number picked in $[-N,N]$
-(NB: it should be noticed that picking a new value for each pixel does
-not distort enough the profile). The absolute error on the result is
-evaluated by comparing the difference between the reference and
-computed phase, out of $2\pi$, expressed in percents. That is: $err =
-100\times \frac{|\theta_{ref} - \theta_{comp}|}{2\pi}$.
-
-Table \ref{tab:algo_prec} gives the maximum and average error for the two algorithms and increasing values of $N$.
+running in double precision. The profile was generated for about
+34,000 different quadruplets of periods ($\in \lbrack 3.1,6.1]$, step
+= 0.1), phases ($\in \lbrack -3.1,3.1]$, steps = 0.062) and slope
+($\in \lbrack -2,2]$, step = 0.4). Obviously, the discretization of
+$[-\pi ,\pi ]$ introduces an error in the phase estimation. It is at
+most equal to $\frac{\pi}{nb_s}$. From some experiments on a $17\times
+4$ array, authors of \cite{AFMCSEM11} noticed a average ratio of 50
+between phase variation in radians and lever end position in
+nanometers. Assuming such a ratio and $nb_s = 1024$, the maximum lever
+deflection error would be 0.15nm which is smaller than 0.3nm, the best
+precision achieved with the setup used in \cite{AFMCSEM11}. 
+
+Moreover, pixels have been paired and the paired intensities have been
+perturbed by addition of a random number uniformly picked in
+$[-N,N]$. Notice that we have observed that perturbing each pixel
+independently yields too weak profile distortion. We report
+percentages of errors between the reference and the computed phases
+out of $2\pi ,$%
+\begin{equation*}
+err=100\times \frac{|\theta _{ref}-\theta _{comp}|}{2\pi }.
+\end{equation*}%
+Table \ref{tab:algo_prec} gives the maximum and the average errors for both
+algorithms and for increasing values of $N$ the noise parameter.
  
  \begin{table}[ht]
-  \begin{center}
-    \begin{tabular}{|c|c|c|c|c|}
-      \hline
-  & \multicolumn{2}{c|}{SPL} & \multicolumn{2}{c|}{LSQ} \\ \cline{2-5}
-  noise & max. err. & aver. err. & max. err. & aver. err. \\ \hline
-  0 & 2.46 & 0.58 & 0.49 & 0.1 \\ \hline
-  2.5 & 2.75 & 0.62 & 1.16 & 0.22 \\ \hline
-  5 & 3.77 & 0.72 & 2.47 & 0.41 \\ \hline
-  7.5 & 4.72 & 0.86 & 3.33 & 0.62 \\ \hline
-  10 & 5.62 & 1.03 & 4.29 & 0.81 \\ \hline
-  15 & 7.96 & 1.38 & 6.35 & 1.21 \\ \hline
-  30 & 17.06 & 2.6 & 13.94 & 2.45 \\ \hline
-
-\end{tabular}
-\caption{Error (in \%) for cosinus profiles, with noise.}
-\label{tab:algo_prec}
+\begin{center}
+\begin{tabular}{|c|c|c|c|c|}
+\hline
+& \multicolumn{2}{c|}{SPL} & \multicolumn{2}{c|}{LSQ} \\ \cline{2-5}
+noise (N)& max. err. & aver. err. & max. err. & aver. err. \\ \hline
+0 & 2.46 & 0.58 & 0.49 & 0.1 \\ \hline
+2.5 & 2.75 & 0.62 & 1.16 & 0.22 \\ \hline
+5 & 3.77 & 0.72 & 2.47 & 0.41 \\ \hline
+7.5 & 4.72 & 0.86 & 3.33 & 0.62 \\ \hline
+10 & 5.62 & 1.03 & 4.29 & 0.81 \\ \hline
+15 & 7.96 & 1.38 & 6.35 & 1.21 \\ \hline
+30 & 17.06 & 2.6 & 13.94 & 2.45 \\ \hline
+\end{tabular}%
  \end{center}
+\caption{Error (in \%) for cosines profiles, with noise.}
+\label{tab:algo_prec}
  \end{table}
  
-These results show that the two algorithms are very close, with a
-slight advantage for LSQ. Furthemore, both behave very well against
-noise. Assuming the experimental ratio of 50 (see above), an error of
-1 percent on phase correspond to an error of 0.5nm on the lever
-deflection, which is very close to the best precision.
-
-Obviously, it is very hard to predict which level of noise will be
-present in real experiments and how it will distort the
-profiles. Nevertheless, we can see on figure \ref{fig:noise20} the
+The results show that the two algorithms yield close results, with a slight
+advantage for LSQ. Furthermore, both behave very well against noise.
+Assuming an average ratio of 50 (see above), an error of 1 percent on
+the phase corresponds to an error of 0.5nm on the lever deflection, which is
+very close to the best precision.
+
+It is very hard to predict which level of noise will be present in
+real experiments and how it will distort the profiles. Authors of
+\cite{AFMCSEM11} gave us the authorization to exploit some of their
+results on a $17\times 4$ array. It allowed us to compare experimental
+profiles to simulated ones. We can see on figure \ref{fig:noise20} the
  profile with $N=10$ that leads to the biggest error. It is a bit
-distorted, with pikes and straight/rounded portions, and relatively
-close to most of that come from experiments. Figure \ref{fig:noise60}
-shows a sample of worst profile for $N=30$. It is completly distorted,
-largely beyond the worst experimental ones. 
+distorted, with pikes and straight/rounded portions. In fact, it is
+very close to some of the worst experimental profiles. Figure
+\ref{fig:noise60} shows a sample of worst profile for $N=30$. It is
+completely distorted, largely beyond any experimental ones. Obviously,
+these comparisons are a bit subjectives and experimental profiles
+could also be completly distorted on other experiments. Nevertheless,
+they give an idea about the possible error.
  
  \begin{figure}[ht]
  \begin{center}
-  \includegraphics[width=\columnwidth]{intens-noise20}
+\includegraphics[width=\columnwidth]{intens-noise20}
  \end{center}
  \caption{Sample of worst profile for N=10}
  \label{fig:noise20}
@@ -622,125 +646,166 @@ largely beyond the worst experimental ones.
  
  \begin{figure}[ht]
  \begin{center}
-  \includegraphics[width=\columnwidth]{intens-noise60}
+\includegraphics[width=\columnwidth]{intens-noise60}
  \end{center}
  \caption{Sample of worst profile for N=30}
  \label{fig:noise60}
  \end{figure}
  
-The second criterion is relatively easy to estimate for LSQ and harder
-for SPL because of $atan$ operation. In both cases, it is proportional
-to numbers of pixels $M$. For LSQ, it also depends on $nb_s$ and for
-SPL on $N = k\times M$, i.e. the number of interpolated points. 
-
-We assume that $M=20$, $nb_s=1024$, $k=4$, all possible parts are
-already in lookup tables and a limited set of operations (+, -, *, /,
-$<$, $>$) is taken account. Translating the two algorithms in C code, we
-obtain about 430 operations for LSQ and 1550 (plus few tenth for
+The second criterion is relatively easy to estimate for LSQ and harder for
+SPL because of the use of the arctangent function. In both cases, the number
+of operation is proportional to $M$ the numbers of pixels. For LSQ, it also
+depends on $nb_{s}$ and for SPL on $L=k\times M$ the number of interpolated
+points. We assume that $M=20$, $nb_{s}=1024$ and $k=4$, that all possible
+parts are already in lookup tables and that a limited set of operations (+,
+-, *, /, $<$, $>$) is taken into account. Translating both algorithms in C
+code, we obtain about 430 operations for LSQ and 1,550 (plus a few tenth for 
  $atan$) for SPL. This result is largely in favor of LSQ. Nevertheless,
-considering the total number of operations is not really pertinent for
-an FPGA implementation: it mainly depends on the type of operations
-and their
-ordering. The final decision is thus driven by the third criterion.\\
-
-The Spartan 6 used in our architecture has a hard constraint: it has no built-in
-floating  point  units.   Obviously,  it  is  possible  to   use  some  existing
-"black-boxes"  for double  precision  operations.  But they  have  a quite  long
-latency. It is much simpler to  exclusively use integers, with a quantization of
-all double  precision values. Obviously,  this quantization should  not decrease
-too much the  precision of results. Furthermore, it should not  lead to a design
-with  a huge  latency because  of operations  that could  not complete  during a
-single or few clock cycles. Divisions  are in this case and, moreover, they need
-a varying  number of  clock cycles  to complete. Even  multiplications can  be a
-problem:  DSP48 take  inputs of  18  bits maximum.  For larger  multiplications,
-several DSP must be combined, increasing the latency.
-
-Nevertheless, the hardest constraint does not come from the FPGA characteristics
-but from the algorithms. Their VHDL  implentation will be efficient only if they
-can be fully (or near) pipelined. By the way, the choice is quickly done: only a
-small  part of  SPL  can be.   Indeed,  the computation  of spline  coefficients
-implies to solve  a tridiagonal system $A.m =  b$. Values in $A$ and  $b$ can be
-computed from  incoming pixels intensity  but after, the back-solve  starts with
-the  lastest  values,  which  breaks  the  pipeline.  Moreover,  SPL  relies  on
-interpolating far more points than profile size. Thus, the end of SPL works on a
-larger amount of data than the beginning, which also breaks the pipeline.
-
-LSQ has  not this problem: all parts  except the dichotomial search  work on the
-same  amount  of  data, i.e.  the  profile  size.  Furthermore, LSQ  needs  less
-operations than SPL, implying a  smaller output latency. Consequently, it is the
-best candidate for phase  computation. Nevertheless, obtaining a fully pipelined
-version supposes that  operations of different parts complete  in a single clock
-cycle. It is  the case for simulations but it completely  fails when mapping and
-routing the design  on the Spartan6. By the way,  extra-latency is generated and
-there must be idle times between two profiles entering into the pipeline.
-
-%%Before obtaining the least bitstream, the crucial question is: how to
-%%translate the C code the LSQ into VHDL ?
-
-
-%\subsection{VHDL design paradigms}
-
-\section{Experimental tests}
-
-In this section we explain what  we have done yet. Until now, we could not perform
-real experiments  since we just have  received the FGPA  board. Nevertheless, we
-will include real experiments in the final version of this paper.
+considering the total number of operations is not fully relevant for FPGA
+implementation which time and space consumption depends not only on the type
+of operations but also of their ordering. The final evaluation is thus very
+much driven by the third criterion.
+
+The Spartan 6 used in our architecture has a hard constraint since it
+has no built-in floating point units. Obviously, it is possible to use
+some existing "black-boxes" for double precision operations. But they
+require a lot of clock cycles to complete. It is much simpler to
+exclusively use integers, with a quantization of all double precision
+values. It should be chosen in a manner that does not alterate result
+precision. Furthermore, it should not lead to a design with a huge
+latency because of operations that could not complete during a single
+or few clock cycles. Divisions fall into that category and, moreover,
+they need a varying number of clock cycles to complete. Even
+multiplications can be a problem since a DSP48 takes inputs of 18 bits
+maximum. So, for larger multiplications, several DSP must be combined
+which increases the overall latency.
+
+In the present algorithms, the hardest constraint does not come from the
+FPGA characteristics but from the algorithms. Their VHDL implementation can
+be efficient only if they can be fully (or near) pipelined. We observe that
+only a small part of SPL can be pipelined, indeed, the computation of spline
+coefficients implies to solve a linear tridiagonal system which matrix and
+right-hand side are computed from incoming pixels intensity but after, the
+back-solve starts with the latest values, which breaks the pipeline.
+Moreover, SPL relies on interpolating far more points than profile size.
+Thus, the end of SPL works on a larger amount of data than at the beginning,
+which also breaks the pipeline.
+
+LSQ has not this problem since all parts, except the dichotomic search, work
+on the same amount of data, i.e. the profile size. Furthermore, LSQ requires
+less operations than SPL, implying a smaller output latency. In total, LSQ
+turns out to be the best candidate for phase computation on any architecture
+including FPGA.
+
+\section{VHDL implementation and experimental tests}
+
+\label{Experimental tests} 
  
  \subsection{VHDL implementation}
  
-
-
-% - ecriture d'un code en C avec integer
-% - calcul de la taille max en bit de chaque variable en fonction de la quantization.
-% - tests de quantization : équilibre entre précision et contraintes FPGA
-% - en parallèle : simulink et VHDL à la main
-
-
-From the  LSQ algorithm,  we have written  a C  program which uses  only integer
-values  that have  been  previously  scaled. The  quantization  of doubles  into
-integers has  been performed  in order  to obtain a  good trade-off  between the
-number of bits  used and the precision. Finally, we have  compared the result of
-the LSQ version  using integer and double. We have observed  that the results of
-both versions were similar.
-
-Then we have built  two versions of VHDL codes: one directly  by hand coding and
-the other with Matlab using simulink HDL coder feature. Although the approach is
-completely different we  have obtain VHDL codes that  are quite comparable. Each
-approach has  advantages and drawbacks.  Roughly speaking,  hand coding provides
-beautiful and much better structures  code while HDL coder provides code faster.
-In  terms  of speed  of  code,  we think  that  both  approaches  will be  quite
-comparable. Real experiments  will confirm that.  In the  LSQ algorithm, we have
-replaced all the divisions by  multiplications by a constant since divisions are
-performed  with  constants depending  of  the number  of  pixels  in the  profile
-(i.e. $M$).
+From the LSQ algorithm, we have written a C program that uses only
+integer values. We used a very simple quantization which consists in
+multiplying each double precision value by a factor power of two and
+by keeping the integer part. For an accurate evaluation of the
+division in the computation of $a$ the slope coefficient, we also
+scaled the pixel intensities by another power of two. The main problem
+was to determin these factors. Most of the time, they are chosen to
+minimize the error induced by the quantization. But in our case, we
+also have some hardware constraints, for example the size and depth of
+RAMs or the input size of DSPs. Thus, having a maximum of values that
+fit in these sizes is a very important criterion to choose the scaling
+factors.
+
+Consequently, we have determined the maximum value of each variable as
+a function of the scale factors and the profile size involved in the
+algorithm. It gave us the the maximum number of bits necessary to code
+them. We have chosen the scale factors so that any variable (except
+the covariance) fits in 18 bits, which is the maximum input size of
+DSPs. In this way, all multiplications, except one, could be done with
+a single DSP, in a single clock cycle. Moreover, assuming that $nb_s =
+1024$, all LUTs could fit in the 18Kbits RAMs. Finally, we compared
+the double and integer versions of LSQ and found a nearly perfect
+agreement between their results.
+
+As mentionned above, some operations like divisions must be
+avoided. But when the denominator is fixed, a division can be replaced
+by its multiplication/shift counterpart. This is always the case in
+LSQ. For example, assuming that $M$ is fixed, $x_{var}$ is known and
+fixed. Thus, $\frac{xy_{covar}}{x_{var}}$ can be replaced by
+
+\[ (xy_{covar}\times \left \lfloor\frac{2^n}{x_{var}} \right \rfloor) \gg n\]
+
+where $n$ depends on the desired precision (in our case $n=24$).
+
+Obviously, multiplications and divisions by a power of two can be
+replaced by left or right bit shifts. Finally, the code only contains
+shifts, additions, subtractions and multiplications of signed integers, which
+are perfectly adapted to FGPAs.
+
+
+We built two versions of VHDL codes, namely one directly by hand
+coding and the other with Matlab using the Simulink HDL coder feature~\cite%
+{HDLCoder}. Although the approaches are completely different we obtained
+quite comparable VHDL codes. Each approach has advantages and drawbacks.
+Roughly speaking, hand coding provides beautiful and much better structured
+code while Simulind HDL coder produces allows for fast code production. In
+terms of throughput and latency, simulations show that the two approaches
+yield close results with a slight advantage for hand coding.
  
  \subsection{Simulation}
  
-Currently, we only have simulated our VHDL codes with GHDL and GTKWave (two free
-tools with linux). Both approaches led to correct results. At the beginning with
-simulations our pipiline could compute a new phase each 33 cycles and the length
-of the pipeline was  equal to 95 cycles. When we tried  to generate the bitsream
-with ISE environment we had many problems because many stages required more than
-the 10$n$s availabe. So we needed to  decompose some part of the pipeline in order
-to add some cycles and siplify some parts.
-% ghdl + gtkwave
-% au mieux : une phase tous les 33 cycles, latence de 95 cycles.
-% mais routage/placement impossible.
-\subsection{Bitstream creation}
+Before experimental tests on the FPGA board, we simulated our two VHDL
+codes with GHDL and GTKWave (two free tools with linux). We built a
+testbench based on experimental profiles and compared the results to
+values given by the SPL algorithm. Both versions lead to correct
+results. Our first codes were highly optimized, indeed the pipeline
+could compute a new phase each 33 cycles and its latency was equal to
+95 cycles. Since the Spartan6 is clocked at 100MHz, estimating the
+deflection of 100 cantilevers would take about $%
+(95+200\times 33).10=66.95\mu $s, i.e. nearly 15,000 estimations by
+second.
  
-Currently both  approaches provide synthesable  bitstreams with ISE.   We expect
-that the  pipeline will  have a latency  of 112  cycles, i.e. 1.12$\mu$s  and it
-could accept new line of pixel each 48 cycles, i.e. 480$n$s.
+\subsection{Bitstream creation}
  
-% pas fait mais prévision d'une sortie tous les 480ns avec une latence de 1120
+In order to test our code on the SP Vision board, the design was
+extended with a component that keeps profiles in RAM, flushes them in
+the phase computation component and stores its output in another
+RAM. We also added a wishbone, a component that can "drive" signals to
+communicate between i.MX and other components. It is mainly used to
+start to flush profiles and to retrieve the computed phases in
+RAM. Unfortunately, the first designs could not be placed and routed
+with ISE on the Spartan6 with a 100MHz clock. The main problems were
+encountered with series of arthmetic operations and more especially
+with RAM outputs used in DSPs. So, we needed to decompose some parts
+of the pipeline, which added few clock cycles. Finally, we obtained a
+bitstream that has been successfully tested on the board.
+
+Its latency is of 112 cycles and computes a new phase every 40
+cycles. For 100 cantilevers, it takes $(112+200\times 40).10=81.12\mu
+$s to compute their deflection. It corresponds to about 12300 images
+per second, which is largely beyond the camera capacities and the
+possibility to extract a new profile from an image every 40
+cycles. Nevertheless, it also largely fits our design goals.
  
  \label{sec:results}
  
-
-
-
  \section{Conclusion and perspectives}
  
+In this paper we have presented a full hardware/software solution for
+real-time cantilever deflection computation from interferometry images.
+Phases are computed thanks to a new algorithm based on the least square
+method. It has been quantized and pipelined to be mapped into a FPGA, the
+architecture of our solution. Performances have been analyzed through
+simulations and real experiments on a Spartan6 FPGA. The results meet our
+initial requirements. In future work, the algorithm quantization will be
+better analyzed and an high speed camera will be introduced in the
+processing chain so that to process real images. Finally, we will address
+real-time filtering and control problems for AFM arrays in dynamic regime.
+
+\section{Acknowledgments}
+We would like to thank A. Meister and M. Favre, from CSEM, for sharing all the
+material we used to write this article and for the time they spent to
+explain us their approach.
  
  \bibliographystyle{plain}
  \bibliography{biblio}