X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/dmems12.git/blobdiff_plain/85ccdc9bbf8d6b30f5737245a16d277a7713908c..b5dcb332822aed6879619b36c36ad800b7672e2f:/dmems12.tex?ds=inline diff --git a/dmems12.tex b/dmems12.tex index c5ee6cf..0c00ee0 100644 --- a/dmems12.tex +++ b/dmems12.tex @@ -1,11 +1,15 @@ - -\documentclass[10pt, peerreview, compsocconf]{IEEEtran} %\usepackage{latex8} %\usepackage{times} -\usepackage[utf8]{inputenc} %\usepackage[cyr]{aeguill} %\usepackage{pstricks,pst-node,pst-text,pst-3d} %\usepackage{babel} +%\input{psfig.sty} +%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands. + + +\documentclass[10pt, peerreview, compsocconf]{IEEEtran} +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +\usepackage[utf8]{inputenc} \usepackage{amsmath} \usepackage{url} \usepackage{graphicx} @@ -14,20 +18,44 @@ \usepackage{moreverb} \usepackage{commath} \usepackage{subfigure} -%\input{psfig.sty} \usepackage{fullpage} \usepackage{fancybox} - \usepackage[ruled,lined,linesnumbered]{algorithm2e} -%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands. -\newcommand{\noun}[1]{\textsc{#1}} +\setcounter{MaxMatrixCols}{10} +%TCIDATA{OutputFilter=LATEX.DLL} +%TCIDATA{Version=5.50.0.2953} +%TCIDATA{} +%TCIDATA{BibliographyScheme=BibTeX} +%TCIDATA{LastRevised=Wednesday, October 26, 2011 09:49:54} +%TCIDATA{} +\newcommand{\noun}[1]{\textsc{#1}} \newcommand{\tab}{\ \ \ } \begin{document} +\title{A new approach based on a least square method for real-time estimation of cantilever array deflections with a FPGA} +\author{\IEEEauthorblockN{Raphaël Couturier\IEEEauthorrefmark{1}, Stéphane +Domas\IEEEauthorrefmark{1}, Gwenhaël Goavec-Merou\IEEEauthorrefmark{2} and +Michel Lenczner\IEEEauthorrefmark{2}} +\IEEEauthorblockA{\IEEEauthorrefmark{1}FEMTO-ST, DISC, University of Franche-Comte, Belfort, France \and +\{raphael.couturier,stephane.domas\}@univ-fcomte.fr} +\IEEEauthorblockA{\IEEEauthorrefmark{2}FEMTO-ST, Time-Frequency, University of Franche-Comte, Besançon, France \and +\{michel.lenczner@utbm.fr,gwenhael.goavec@trabucayre.com} } + +\begin{abstract} +Atomic force microscopes (AFM) provide high resolution images of surfaces. +In this paper, we focus our attention on an interferometry method for +deflection estimation of cantilever arrays in quasi-static regime. In its +original form, spline interpolation was used to determine interference +fringe phase, and thus the deflections. Computations were performed on a PC. +Here, we propose a new complete solution with a least square based algorithm +and an optimized FPGA implementation. Simulations and real tests showed very +good results and open perspective for real-time estimation and control of +cantilever arrays in the dynamic regime. +\end{abstract} %% \author{\IEEEauthorblockN{Authors Name/s per 1st Affiliation (Author)} %% \IEEEauthorblockA{line 1 (of Affiliation): dept. name of organization\\ @@ -42,441 +70,418 @@ %% line 4: Email: name@xyz.com} %% } - - -\title{A new approach based on least square methods to estimate in real time cantilevers deflection with a FPGA} -\author{\IEEEauthorblockN{Raphaël Couturier\IEEEauthorrefmark{1}, Stéphane Domas\IEEEauthorrefmark{1}, Gwenhaël Goavec-Merou\IEEEauthorrefmark{2} and Michel Lenczner\IEEEauthorrefmark{2}} -\IEEEauthorblockA{\IEEEauthorrefmark{1}FEMTO-ST, DISC, University of Franche-Comte, Belfort, France\\ -\{raphael.couturier,stephane.domas\}@univ-fcomte.fr} -\IEEEauthorblockA{\IEEEauthorrefmark{2}FEMTO-ST, Time-Frequency, University of Franche-Comte, Besançon, France\\ -\{michel.lenczner@utbm.fr,gwenhael.goavec@trabucayre.com} -} - - - - - - %\maketitle \thispagestyle{empty} -\begin{abstract} - -Atomic force microscope (AFM) provides high resolution images of surfaces. We -focus our attention on an interferometry method to estimate the cantilevers -deflection. This method was based on the spline method to interpolate the -deflection and the computations were performed on a PC with LabView. In this -paper, we propose a new method based on the least square method and we present -the implementation that we developped on a FPGA. Our method can be pipelined on -a FPGA in order to manipulate image profiles very quickly. Simulations and real -tests we have performed showed us that this implementation is very efficient and -should allow us to control of a cantilevers array in real time. - - -\end{abstract} - \begin{IEEEkeywords} -FPGA, cantilever, interferometry. +FPGA, cantilever arrays, interferometry. \end{IEEEkeywords} - \IEEEpeerreviewmaketitle \section{Introduction} -Cantilevers are used inside atomic force microscope (AFM) which provides high -resolution images of surfaces. Several technics have been used to measure the -displacement of cantilevers in litterature. For example, it is possible to -determine accurately the deflection with different mechanisms. -In~\cite{CantiPiezzo01}, authors used piezoresistor integrated into the -cantilever. Nevertheless this approach suffers from the complexity of the -microfabrication process needed to implement the sensor in the cantilever. -In~\cite{CantiCapacitive03}, authors have presented an cantilever mechanism -based on capacitive sensing. This kind of technic also involves to instrument -the cantiliver which result in a complex fabrication process. - -In this paper our attention is focused on a method based on interferometry to -measure cantilevers' displacements. In this method cantilevers are illuminated -by an optic source. The interferometry produces fringes on each cantilever -which enables to compute the cantilever displacement. In order to analyze the -fringes a high speed camera is used. Images need to be processed quickly and -then a estimation method is required to determine the displacement of each -cantilever. In~\cite{AFMCSEM11}, authors have used an algorithm based on -spline to estimate the cantilevers' positions. - -The overall process gives accurate results but all the computations -are performed on a standard computer using LabView. Consequently, the -main drawback of this implementation is that the computer is a -bootleneck. In this paper we propose to use a method based on least -square and to implement all the computation on a FGPA. - -The remainder of the paper is organized as follows. Section~\ref{sec:measure} -describes more precisely the measurement process. Our solution based on the -least square method and the implementation on FPGA is presented in -Section~\ref{sec:solus}. Experimentations are described in -Section~\ref{sec:results}. Finally a conclusion and some perspectives are -presented. - - - -%% quelques ref commentées sur les calculs basés sur l'interférométrie - -\section{Measurement principles} +Cantilevers are used in atomic force microscopes (AFM) which provide high +resolution surface images. Several techniques have been reported in +literature for cantilever displacement measurement. In~\cite{CantiPiezzo01}, +authors have shown how a piezoresistor can be integrated into a cantilever +for deflection measurement. Nevertheless this approach suffers from the +complexity of the microfabrication process needed to implement the sensor. +In~\cite{CantiCapacitive03}, authors have presented a cantilever mechanism +based on capacitive sensing. These techniques require cantilever +instrumentation resulting in\ complex fabrication processes. + +In this paper our attention is focused on a method based on interferometry for +cantilever displacement measurement in quasi-static regime. Cantilevers are +illuminated by an optical source. Interferometry produces fringes enabling +cantilever displacement computation. A high speed camera is used to analyze the +fringes. In view of real time applications, images need to be processed quickly +and then a fast estimation method is required to determine the displacement of +each cantilever. In~\cite{AFMCSEM11}, an algorithm based on spline has been +introduced for cantilever position estimation. The overall process gives +accurate results but computations are performed on a standard computer using +LabView \textsuperscript{\textregistered} \textsuperscript{\copyright}. +Consequently, the main drawback of this implementation is that the computer is a +bottleneck. In this paper we pose the problem of real-time cantilever position +estimation and bring a hardware/software solution. It includes a fast method +based on least squares and its FPGA implementation. + +The remainder of the paper is organized as follows. Section~\ref{sec:measure} +describes the measurement process. Our solution based on the least square +method and its implementation on a FPGA is presented in Section~\ref{sec:solus}. Numerical experimentations are described in Section~\ref{sec:results}. Finally a conclusion and some perspectives are drawn. + +\section{Architecture and goals} + \label{sec:measure} -\subsection{Architecture} +In order to build simple, cost effective and user-friendly cantilever +arrays, authors of ~\cite{AFMCSEM11} have developed a system based on +interferometry. + +\subsection{Experimental setup} + \label{sec:archi} -%% description de l'architecture générale de l'acquisition d'images -%% avec au milieu une unité de traitement dont on ne précise pas ce -%% qu'elle est. - -In order to develop simple, cost effective and user-friendly cantilever arrays, -authors of ~\cite{AFMCSEM11} have developped a system based of -interferometry. In opposition to other optical based systems, using a laser beam -deflection scheme and sentitive to the angular displacement of the cantilever, -interferometry is sensitive to the optical path difference induced by the -vertical displacement of the cantilever. - -The system build by these authors is based on a Linnick -interferomter~\cite{Sinclair:05}. It is illustrated in -Figure~\ref{fig:AFM}. A laser diode is first split (by the splitter) -into a reference beam and a sample beam that reachs the cantilever -array. In order to be able to move the cantilever array, it is -mounted on a translation and rotational hexapod stage with five -degrees of freedom. The optical system is also fixed to the stage. -Thus, the cantilever array is centered in the optical system which can -be adjusted accurately. The beam illuminates the array by a -microscope objective and the light reflects on the cantilevers. -Likewise the reference beam reflects on a movable mirror. A CMOS -camera chip records the reference and sample beams which are -recombined in the beam splitter and the interferogram. At the -beginning of each experiment, the movable mirror is fitted manually in -order to align the interferometric fringes approximately parallel to -the cantilevers. When cantilevers move due to the surface, the -bending of cantilevers produce movements in the fringes that can be -detected with the CMOS camera. Finally the fringes need to be -analyzed. In~\cite{AFMCSEM11}, authors used a LabView program to -compute the cantilevers' deflections from the fringes. - -\begin{figure} + +In opposition to other optical based system\textbf{s u}sing a laser beam +deflection scheme and sensitive to the angular displacement of the +cantilever, interferometry is sensitive to the optical path difference +induced by the vertical displacement of the cantilever. + +The system is based on a Linnick interferometer~\cite{Sinclair:05}. +It is illustrated in Figure~\ref{fig:AFM} \footnote{by courtesy of + CSEM}. A laser diode is first split (by the splitter) into a +reference beam and a sample beam both reaching the cantilever array. +The complete system including a cantilever array\ and the optical +system can be moved thanks to a translation and rotational hexapod +stage with five degrees of freedom. Thus, the cantilever array is +centered in the optical system which can be adjusted accurately. The +beam illuminates the array by a microscope objective and the light +reflects on the cantilevers. Likewise the reference beam reflects on a +movable mirror. A CMOS camera chip records the reference and sample +beams which are recombined in the beam splitter and the +interferogram. At the beginning of each experiment, the movable mirror +is fitted manually in order to align the interferometric fringes +approximately parallel to the cantilevers. Then, cantilever motion in +the transverse direction produces movements in the fringes. They are +detected with the CMOS camera which images are analyzed by a Labview +program to recover the cantilever deflections. + +\begin{figure}[tbp] \begin{center} \includegraphics[width=\columnwidth]{AFM} \end{center} -\caption{schema of the AFM} -\label{fig:AFM} +\caption{AFM Setup} +\label{fig:AFM} \end{figure} - %% image tirée des expériences. -\subsection{Cantilever deflection estimation} +\subsection{Inteferometric based cantilever deflection estimation} + \label{sec:deflest} -\begin{figure} +\begin{figure}[tbp] \begin{center} \includegraphics[width=\columnwidth]{lever-xp} \end{center} -\caption{Portion of an image picked by the camera} -\label{fig:img-xp} +\caption{Portion of a camera image showing moving interferometric fringes in +cantilevers} +\label{fig:img-xp} \end{figure} -As shown on image \ref{fig:img-xp}, each cantilever is covered by -several interferometric fringes. The fringes will distort when -cantilevers are deflected. Estimating the deflection is done by -computing this distortion. For that, authors of \cite{AFMCSEM11} -proposed a method based on computing the phase of the fringes, at the -base of each cantilever, near the tip, and on the base of the -array. They assume that a linear relation binds these phases, which -can be use to "unwrap" the phase at the tip and to determine the deflection.\\ - -More precisely, segment of pixels are extracted from images taken by -the camera. These segments are large enough to cover several -interferometric fringes. As said above, they are placed at the base -and near the tip of the cantilevers. They are called base profile and -tip profile in the following. Furthermore, a reference profile is -taken on the base of the cantilever array. - -The pixels intensity $I$ (in gray level) of each profile is modelized by: - +As shown in Figure \ref{fig:img-xp} \footnote{by courtesy of CSEM}, each +cantilever is covered by several interferometric fringes. The fringes +distort when cantilevers are deflected. In \cite{AFMCSEM11}, a novel +method for interferometric based cantilever deflection measurement was +reported. For each cantilever, the method uses three segments of pixels, +parallel to its section, to determine phase shifts. The first is +located just above the AFM tip (tip profile), it provides the phase +shift modulo $2\pi $. The second one is close to the base junction +(base profile) and is used to determine the exact multiple of $2\pi $ +through an operation called unwrapping where it is assumed that the +deflection means along the two measurement segments are linearly +dependent. The third is on the base and provides a reference for +noise suppression. Finally, deflections are simply derived from phase +shifts. + +The pixel gray-level intensity $I$ of each profile is modelized by% \begin{equation} -\label{equ:profile} -I(x) = ax+b+A.cos(2\pi f.x + \theta) -\end{equation} - -where $x$ is the position of a pixel in its associated segment. - -The global method consists in two main sequences. The first one aims -to determin the frequency $f$ of each profile with an algorithm based -on spline interpolation (see section \ref{algo-spline}). It also -computes the coefficient used for unwrapping the phase. The second one -is the acquisition loop, while which images are taken at regular time -steps. For each image, the phase $\theta$ of all profiles is computed -to obtain, after unwrapping, the deflection of -cantilevers. Originally, this computation was also done with an -algorithm based on spline. This article proposes a new version based -on a least square method. +I(x)=A\text{ }\cos (2\pi fx+\theta )+ax+b \label{equ:profile} +\end{equation}% +where $x$ denotes the position of a pixel in a segment, $A$, $f$ and $\theta +$ are the amplitude, the frequency and the phase of the light signal when +the affine function $ax+b$ corresponds to the cantilever array surface tilt +with respect to the light source. + +The method consists in two main sequences. In the first one +corresponding to precomputation, the frequency $f$ of each profile is +determined using a spline interpolation (see section \ref% +{sec:algo-spline}) and the coefficient used for phase unwrapping is +computed. The second one, that we call the \textit{acquisition loop,} +is done after images have been taken at regular time steps. For each +image, the phase $\theta $ of all profiles is computed to obtain, +after unwrapping, the cantilever deflection. The phase determination +in \cite{AFMCSEM11} is achieved by a spline based algorithm which is +the most consuming part of the computation. In this article, we +propose an alternate version based on the least square method which is +faster and better suited for FPGA implementation. + +\subsection{Computation design goals} -\subsection{Design goals} \label{sec:goals} -The main goal is to implement a computing unit to estimate the -deflection of about $10\times10$ cantilevers, faster than the stream of -images coming from the camera. The accuracy of results must be close -to the maximum precision ever obtained experimentally on the -architecture, i.e. 0.3nm. Finally, the latency between an image -entering in the unit and the deflections must be as small as possible -(NB: future works plan to add some control on the cantilevers).\\ - -If we put aside some hardware issues like the speed of the link +To evaluate the solution performances, we choose a goal which consists +in designing a computing unit able to estimate the deflections of +a $10\times 10$% +-cantilever array, faster than the camera image stream. In addition, +the result accuracy must be close to 0.3nm, the maximum precision +reached in \cite{AFMCSEM11}. Finally, the latency between the entrance +of the first pixel of an image and the end of deflection computation +must be as small as possible. All these requirement are +stated in the perspective of implementing real-time active control for +each cantilever, see~\cite{LencznerChap10,Hui11}. + +If we put aside other hardware issues like the speed of the link between the camera and the computation unit, the time to deserialize -pixels and to store them in memory, ... the phase computation is -obviously the bottle-neck of the whole process. For example, if we -consider the camera actually in use, an exposition time of 2.5ms for -$1024\times 1204$ pixels seems the minimum that can be reached. For -100 cantilevers, if we neglect the time to extract pixels, it implies -that computing the deflection of a single -cantilever should take less than 25$\mu$s, thus 12.5$\mu$s by phase.\\ - -In fact, this timing is a very hard constraint. Let consider a very -small programm that initializes twenty million of doubles in memory -and then does 1000000 cumulated sums on 20 contiguous values -(experimental profiles have about this size). On an intel Core 2 Duo -E6650 at 2.33GHz, this program reaches an average of 155Mflops. - -%%Itimplies that the phase computation algorithm should not take more than -%%$155\times 12.5 = 1937$ floating operations. For integers, it gives $3000$ operations. - -Obviously, some cache effects and optimizations on -huge amount of computations can drastically increase these -performances: peak efficiency is about 2.5Gflops for the considered -CPU. But this is not the case for phase computation that used only few -tenth of values.\\ - -In order to evaluate the original algorithm, we translated it in C -language. As said further, for 20 pixels, it does about 1550 -operations, thus an estimated execution time of $1550/155 -=$10$\mu$s. For a more realistic evaluation, we constructed a file of -1Mo containing 200 profiles of 20 pixels, equally scattered. This file -is equivalent to an image stored in a device file representing the -camera. We obtained an average of 10.5$\mu$s by profile (including I/O -accesses). It is under are requirements but close to the limit. In -case of an occasional load of the system, it could be largely -overtaken. A solution would be to use a real-time operating system but -another one to search for a more efficient algorithm. - -But the main drawback is the latency of such a solution: since each -profile must be treated one after another, the deflection of 100 -cantilevers takes about $200\times 10.5 = 2.1$ms, which is inadequate -for an efficient control. An obvious solution is to parallelize the -computations, for example on a GPU. Nevertheless, the cost to transfer -profile in GPU memory and to take back results would be prohibitive -compared to computation time. It is certainly more efficient to -pipeline the computation. For example, supposing that 200 profiles of -20 pixels can be pushed sequentially in the pipelined unit cadenced at -a 100MHz (i.e. a pixel enters in the unit each 10ns), all profiles -would be treated in $200\times 20\times 10.10^{-9} =$ 40$\mu$s plus -the latency of the pipeline. This is about 500 times faster than -actual results.\\ - -For these reasons, an FPGA as the computation unit is the best choice -to achieve the required performance. Nevertheless, passing from -a C code to a pipelined version in VHDL is not obvious at all. As -explained in the next section, it can even be impossible because of -some hardware constraints specific to FPGAs. - - -\section{Proposed solution} +pixels and to store them in memory, the phase computation is the +bottleneck of the whole process. For example, the camera in the setup +of \cite{AFMCSEM11} provides $% +1024\times 1204$ pixels with an exposition time of 2.5ms. Thus, if we +the pixel extraction time is neglected, each phase calculation of a +100-cantilever array should take no more than 12.5$\mu$s. + +In fact, this timing is a very hard constraint. To illustrate this point, we +consider a very small program that initializes twenty million of doubles in +memory and then does 1,000,000 cumulated sums on 20 contiguous values +(experimental profiles have about this size). On an intel Core 2 Duo E6650 +at 2.33GHz, this program reaches an average of 155Mflops. +Obviously, some cache effects and optimizations on huge amount of +computations can drastically increase these performances: peak efficiency is +about 2.5Gflops for the considered CPU. But this is not the case for phase +computation that is using only a few tenth of values. + +In order to evaluate the original algorithm, we translated it in C language. +As stated before, for 20 pixels, it does about 1,550 operations, thus an +estimated execution time of $1,550/155=$10$\mu $s. For a more realistic +evaluation, we constructed a file of 1Mo containing 200 profiles of 20 +pixels, equally scattered. This file is equivalent to an image stored in a +device file representing the camera. We obtained an average of 10.5$\mu$s +by profile (including I/O accesses). It is under our requirements but close +to the limit. In case of an occasional load of the system, it could be +largely overtaken. Solutions would be to use a real-time operating system or +to search for a more efficient algorithm. + +However, the main drawback is the latency of such a solution because each +profile must be treated one after another and the deflection of 100 +cantilevers takes about $200\times 10.5=2.1$ms. This would be inadequate +for real-time requirements as for individual cantilever active control. An +obvious solution is to parallelize the computations, for example on a GPU. +Nevertheless, the cost of transferring profile in GPU memory and of taking +back results would be prohibitive compared to computation time. + +We remark that when possible, it is more efficient to pipeline the +computation. For example, supposing that 200 profiles of 20 pixels +could be pushed sequentially in a pipelined unit cadenced at a 100MHz +(i.e. a pixel enters in the unit each 10ns), all profiles would be +treated in $200\times 20\times 10.10^{-9}=$ 40$\mu$s plus the latency +of the pipeline. Such a solution would be meeting our requirements and +would be 50 times faster than our C code, and even more compared to +the LabView version use in \cite{AFMCSEM11}. FPGAs are appropriate for +such implementation, so they turn out to be the computation units of +choice to reach our performance requirements. Nevertheless, passing +from a C code to a pipelined version in VHDL is not obvious at all. It +can even be impossible because of FPGA hardware constraints. All these +points are discussed in the following sections. + +\section{An hardware/software solution} + \label{sec:solus} -Project Oscar aims to provide a hardware and software architecture to estimate -and control the deflection of cantilevers. The hardware part consists in a -high-speed camera, linked on an embedded board hosting FPGAs. By the way, the -camera output stream can be pushed directly into the FPGA. The software part is -mostly the VHDL code that deserializes the camera stream, extracts profile and -computes the deflection. Before focusing on our work to implement the phase -computation, we give some general information about FPGAs and the board we use. - -\subsection{FPGAs} - -A field-programmable gate array (FPGA) is an integrated circuit designed to be -configured by the customer. FGPAs are composed of programmable logic components, -called configurable logic blocks (CLB). These blocks mainly contains look-up -tables (LUT), flip/flops (F/F) and latches, organized in one or more slices -connected together. Each CLB can be configured to perform simple (AND, XOR, ...) -or complex combinational functions. They are interconnected by reconfigurable -links. Modern FPGAs contain memory elements and multipliers which enable to -simplify the design and to increase the performance. Nevertheless, all other -complex operations, like division, trigonometric functions, $\ldots$ are not -available and must be done by configuring a set of CLBs. Since this -configuration is not obvious at all, it can be done via a framework, like -ISE~\cite{ISE}. Such a software can synthetize a design written in a hardware -description language (HDL), map it onto CLBs, place/route them for a specific -FPGA, and finally produce a bitstream that is used to configre the FPGA. Thus, -from the developper point of view, the main difficulty is to translate an -algorithm in HDL code, taking account FPGA resources and constraints like clock +In this section we present parts of the computing solution to the above +requirements. The hardware part consists in a high-speed camera, linked on an +embedded board hosting two FPGAs. In this way, the camera output stream can be +pushed directly into the FPGA. The software part is mostly the VHDL code that +deserializes the camera stream, extracts profiles and computes the deflection. + +We first give some general information about FPGAs, then we +describe the FPGA board we use for implementation and finally the two +algorithms for phase computation are detailed. Presentation of VHDL +implementations is postponned until Section \ref{Experimental tests}. + + + +\subsection{Elements of FPGA architecture and programming} + +A field-programmable gate array (FPGA) is an integrated circuit designed to +be configured by the customer. FGPAs are composed of programmable logic +components, called configurable logic blocks (CLB). These blocks mainly +contain look-up tables (LUT), flip/flops (F/F) and latches, organized in one +or more slices connected together. Each CLB can be configured to perform +simple (AND, XOR, ...) or complex combinational functions. They are +interconnected by reconfigurable links. Modern FPGAs contain memory elements +and multipliers which enable to simplify the design and to increase the +performance. Nevertheless, all other complex operations like division and +other functions like trigonometric functions are not available and must be +built by configuring a set of CLBs. Since this is not an obvious task at +all, tools like ISE~\cite{ISE} have been built to do this operation. Such a +software can synthetize a design written in a hardware description language +(HDL), maps it onto CLBs, place/route them for a specific FPGA, and finally +produces a bitstream that is used to configure the FPGA. Thus, from the +developer's point of view, the main difficulty is to translate an algorithm +into HDL code, taking into account FPGA resources and constraints like clock signals and I/O values that drive the FPGA. Indeed, HDL programming is very different from classic languages like C. A program can be seen as a state-machine, manipulating signals that -evolve from state to state. By the way, HDL instructions can execute -concurrently. Basic logic operations are used to agregate signals to -produce new states and assign it to another signal. States are mainly -expressed as arrays of bits. Fortunaltely, libraries propose some -higher levels representations like signed integers, and arithmetic -operations. - -Furthermore, even if FPGAs are cadenced more slowly than classic -processors, they can perform pipeline as well as parallel -operations. A pipeline consists in cutting a process in sequence of -small tasks, taking the same execution time. It accepts a new data at -each clock top, thus, after a known latency, it also provides a result -at each clock top. However, using a pipeline consumes more logics -since the components of a task are not reusable by another -one. Nevertheless it is probably the most efficient technique on -FPGA. Because of its architecture, it is also very easy to process -several data concurrently. When it is possible, the best performance -is reached using parallelism to handle simultaneously several -pipelines in order to handle multiple data streams. - -\subsection{The board} - -The board we use is designed by the Armadeus compagny, under the name -SP Vision. It consists in a development board hosting a i.MX27 ARM -processor (from Freescale). The board includes all classical -connectors: USB, Ethernet, ... A Flash memory contains a Linux kernel -that can be launched after booting the board via u-Boot. - -The processor is directly connected to a Spartan3A FPGA (from Xilinx) -via its special interface called WEIM. The Spartan3A is itself -connected to a Spartan6 FPGA. Thus, it is possible to develop programs -that communicate between i.MX and Spartan6, using Spartan3 as a -tunnel. By default, the WEIM interface provides a clock signal at -100MHz that is connected to dedicated FPGA pins. - -The Spartan6 is an LX100 version. It has 15822 slices, each slice -containing 4 LUTs and 8 flip/flops. It is equivalent to 101261 logic +evolve from state to state. Moreover, HDL instructions can be executed +concurrently. Signals may be combined with basic logic operations to +produce new states that are assigned to another signal. States are mainly expressed as +arrays of bits. Fortunately, libraries propose some higher levels +representations like signed integers, and arithmetic operations. + +Furthermore, even if FPGAs are cadenced more slowly than classic processors, +they can perform pipelines as well as parallel operations. A pipeline +consists in cutting a process in a sequence of small tasks, taking the same +execution time. It accepts a new data at each clock top, thus, after a known +latency, it also provides a result at each clock top. We observe that the +components of a task are not reusable by another one. Nevertheless, this is +the most efficient technique on FPGAs. Because of their architecture, it is +also very easy to process several data concurrently. Finally, the best +performance can be reached when several pipelines are operating on multiple +data streams in parallel. + +\subsection{The FPGA board} + +The architecture we use is designed by the Armadeus Systems +company. It consists in a development board called APF27 \textsuperscript{\textregistered}, hosting a +i.MX27 ARM processor (from Freescale) and a Spartan3A (from +XIlinx). This board includes all classical connectors as USB and +Ethernet for instance. A Flash memory contains a Linux kernel that can +be launched after booting the board via u-Boot. The processor is +directly connected to the Spartan3A via its special interface called +WEIM. The Spartan3A is itself connected to an extension board called +SP Vision \textsuperscript{\textregistered}, that hosts a Spartan6 FPGA. Thus, it is +possible to develop programs that communicate between i.MX and +Spartan6, using Spartan3 as a tunnel. A clock signal at 100MHz (by +default) is delivered to dedicated FPGA pins. The Spartan6 of our +board is an LX100 version. It has 15,822 slices, each slice containing +4 LUTs and 8 flip/flops. It is equivalent to 101,261 logic cells. There are 268 internal block RAM of 18Kbits, and 180 dedicated multiply-adders (named DSP48), which is largely enough for our -project. +project. Some I/O pins of Spartan6 are connected to two $2\times 17$ +headers that can be used for any purpose as to be connected to the +interface of a camera. -Some I/O pins of Spartan6 are connected to two $2\times 17$ headers -that can be used as user wants. For the project, they will be -connected to the interface card of the camera. +\subsection{Two algorithms for phase computation} -\subsection{Considered algorithms} - -Two solutions have been studied to achieve phase computation. The -original one, proposed by A. Meister and M. Favre, is based on -interpolation by splines. It allows to compute frequency and -phase. The second one, detailed in this article, is based on a -classical least square method but suppose that frequency is already -known. +In \cite{AFMCSEM11}, $f$ the frequency and $\theta $\ the phase of the +light wave are computed thanks to spline interpolation. As said in +section \ref{sec:deflest}, $f$ is computed only once time but the +phase needs to be computed for each image. This is why, in this paper, +we focus on its computation. \subsubsection{Spline algorithm (SPL)} + \label{sec:algo-spline} -Let consider a profile $P$, that is a segment of $M$ pixels with an -intensity in gray levels. Let call $I(x)$ the intensity of profile in $x -\in [0,M[$. - -At first, only $M$ values of $I$ are known, for $x = 0, 1, -\ldots,M-1$. A normalisation allows to scale known intensities into -$[-1,1]$. We compute splines that fit at best these normalised -intensities. Splines are used to interpolate $N = k\times M$ points -(typically $k=4$ is sufficient), within $[0,M[$. Let call $x^s$ the -coordinates of these $N$ points and $I^s$ their intensities. - -In order to have the frequency, the mean line $a.x+b$ (see equation \ref{equ:profile}) of $I^s$ is -computed. Finding intersections of $I^s$ and this line allow to obtain -the period thus the frequency. - -The phase is computed via the equation: + +We denote by $M$ the number of pixels in a segment used for phase +computation. For the sake of simplicity of the notations, we consider +the light intensity $I$ to be a mapping of the physical segment in the +interval $[0,M[$. The pixels are assumed to be regularly spaced and +centered at the positions $x^{p}\in\{0,1,\ldots,M-1\}.$ We use the simplest +definition of a pixel, namely the value of $I$ at its center. The +pixel intensities are considered as pre-normalized so that their +minimum and maximum have been resized to $-1$ and $1$. + +The first step consists in computing the cubic spline interpolation of +the intensities. This allows for interpolating $I$ at a larger number +$L=k\times M$ of points (typically $k=4$ is sufficient) $% +x^{s}$ in the interval $[0,M[$. During the precomputation sequence, +the second step is to determin the afine part $a.x+b$ of $I$. It is +found with an ordinary least square method, taking account the $L$ +points. Values of $I$ in $x^s$ are used to compute its intersections +with $a.x+b$. The period of $I$ (and thus its frequency) is deduced +from the number of intersections and the distance between the first +and last. + +During the acquisition loop, the second step is the phase computation, with \begin{equation} -\theta = atan \left[ \frac{\sum_{i=0}^{N-1} sin(2\pi f x^s_i) \times I^s(x^s_i)}{\sum_{i=0}^{N-1} cos(2\pi f x^s_i) \times I^s(x^s_i)} \right] +\theta =atan\left[ \frac{\sum_{i=0}^{N-1}\text{sin}(2\pi fx_{i}^{s})\times +I(x_{i}^{s})}{\sum_{i=0}^{N-1}\text{cos}(2\pi fx_{i}^{s})\times I(x_{i}^{s})}% +\right] . \end{equation} -Two things can be noticed: +\textit{Remarks: } + \begin{itemize} -\item the frequency could also be obtained using the derivates of - spline equations, which only implies to solve quadratic equations. -\item frequency of each profile is computed a single time, before the - acquisition loop. Thus, $sin(2\pi f x^s_i)$ and $cos(2\pi f x^s_i)$ - could also be computed before the loop, which leads to a much faster - computation of $\theta$. +\item The frequency could also be obtained using the derivates of spline +equations, which only implies to solve quadratic equations but certainly +yields higher errors. + +\item Profile frequency are computed during the precomputation step, + thus the values sin$(2\pi fx_{i}^{s})$ and cos$(2\pi fx_{i}^{s})$ + can be determined once for all. \end{itemize} \subsubsection{Least square algorithm (LSQ)} -Assuming that we compute the phase during the acquisition loop, -equation \ref{equ:profile} has only 4 parameters: $a, b, A$, and -$\theta$, $f$ and $x$ being already known. Since $I$ is non-linear, a -least square method based on a Gauss-newton algorithm can be used to -determine these four parameters. Since it is an iterative process -ending with a convergence criterion, it is obvious that it is not -particularly adapted to our design goals. - -Fortunatly, it is quite simple to reduce the number of parameters to -only $\theta$. Let $x^p$ be the coordinates of pixels in a segment of -size $M$. Thus, $x^p = 0, 1, \ldots, M-1$. Let $I(x^p)$ be their -intensity. Firstly, we "remove" the slope by computing: - -\[I^{corr}(x^p) = I(x^p) - a.x^p - b\] - -Since linear equation coefficients are searched, a classical least -square method can be used to determine $a$ and $b$: - -\[a = \frac{covar(x^p,I(x^p))}{var(x^p)} \] - -Assuming an overlined symbol means an average, then: - -\[b = \overline{I(x^p)} - a.\overline{{x^p}}\] - -Let $A$ be the amplitude of $I^{corr}$, i.e. - -\[A = \frac{max(I^{corr}) - min(I^{corr})}{2}\] - -Then, the least square method to find $\theta$ is reduced to search the minimum of: - -\[\sum_{i=0}^{M-1} \left[ cos(2\pi f.i + \theta) - \frac{I^{corr}(i)}{A} \right]^2\] - -It is equivalent to derivate this expression and to solve the following equation: - -\begin{eqnarray*} -2\left[ cos\theta \sum_{i=0}^{M-1} I^{corr}(i).sin(2\pi f.i) + sin\theta \sum_{i=0}^{M-1} I^{corr}(i).cos(2\pi f.i)\right] \\ -- A\left[ cos2\theta \sum_{i=0}^{M-1} sin(4\pi f.i) + sin2\theta \sum_{i=0}^{M-1} cos(4\pi f.i)\right] = 0 -\end{eqnarray*} +Assuming that we compute the phase during the acquisition loop, equation \ref% +{equ:profile} has only 4 parameters: $a,b,A$, and $\theta $, $f$ and $x$ +being already known. Since $I$ is non-linear, a least square method based on +a Gauss-newton algorithm can be used to determine these four parameters. +This kind of iterative process ends with a convergence criterion, so it is +not suited to our design goals. Fortunately, it is quite simple to reduce +the number of parameters to only $\theta $. Firstly, the afine part $ax+b$ +is estimated from the $M$ values $I(x^{p})$ to determine the rectified +intensities,% +\begin{equation*} +I^{corr}(x^{p})\approx I(x^{p})-a.x^{p}-b. +\end{equation*}% +To find $a$ and $b$ we apply an ordinary least square method (as in SPL but on $M$ points)% +\begin{equation*} +a=\frac{covar(x^{p},I(x^{p}))}{\text{var}(x^{p})}\text{ and }b=\overline{% +I(x^{p})}-a.\overline{{x^{p}}} +\end{equation*}% +where overlined symbols represent average. Then the amplitude $A$ is +approximated by% +\begin{equation*} +A\approx \frac{\text{max}(I^{corr})-\text{min}(I^{corr})}{2}. +\end{equation*}% +Finally, the problem of approximating $\theta $ is reduced to minimizing% +\begin{equation*} +\min_{\theta \in \lbrack -\pi ,\pi ]}\sum_{i=0}^{M-1}\left[ \text{cos}(2\pi +f.i+\theta )-\frac{I^{corr}(i)}{A}\right] ^{2}. +\end{equation*}% +An optimal value $\theta ^{\ast }$ of the minimization problem is a zero of +the first derivative of the above argument,%\begin{eqnarray*}{l} +\begin{equation*} +2\left[ \text{cos}\theta ^{\ast }\sum_{i=0}^{M-1}I^{corr}(i).\text{sin}(2\pi +f.i)\right. +\end{equation*}% +\begin{equation*} +\left. +\text{sin}\theta ^{\ast }\sum_{i=0}^{M-1}I^{corr}(i).\text{cos}(2\pi +f.i)\right] - +\end{equation*}% +\begin{equation*} +A\left[ \text{cos}2\theta ^{\ast }\sum_{i=0}^{M-1}\sin (4\pi f.i)+\text{sin}% +2\theta ^{\ast }\sum_{i=0}^{M-1}\cos (4\pi f.i)\right] =0 +\end{equation*}% +% +%\end{eqnarray*} Several points can be noticed: -\begin{itemize} -\item As in the spline method, some parts of this equation can be - computed before the acquisition loop. It is the case of sums that do - not depend on $\theta$: - -\[ \sum_{i=0}^{M-1} sin(4\pi f.i), \sum_{i=0}^{M-1} cos(4\pi f.i) \] - -\item Lookup tables for $sin(2\pi f.i)$ and $cos(2\pi f.i)$ can also be -computed. - -\item The simplest method to find the good $\theta$ is to discretize - $[-\pi,\pi]$ in $nb_s$ steps, and to search which step leads to the - result closest to zero. By the way, three other lookup tables can - also be computed before the loop: - -\[ sin \theta, cos \theta, \] - -\[ \left[ cos 2\theta \sum_{i=0}^{M-1} sin(4\pi f.i) + sin 2\theta \sum_{i=0}^{M-1} cos(4\pi f.i)\right] \] - -\item This search can be very fast using a dichotomous process in $log_2(nb_s)$ +\begin{itemize} +\item The terms $\sum_{i=0}^{M-1}$sin$(4\pi f.i)$ and$\sum_{i=0}^{M-1}$cos$% +(4\pi f.i)$ are independent of $\theta $, they can be precomputed. + +\item Lookup tables (namely lut$_{sfi}$ and lut$_{cfi}$ in the following algorithms) can be + set with the $2.M$ values $\sin (2\pi f.i)$ and $\cos (2\pi f.i)$. + +\item A simple method to find a zero $\theta ^{\ast }$ of the optimality +condition is to discretize the range $[-\pi ,\pi ]$ with a large number $% +nb_{s}$ of nodes and to find which one is a minimizer in the absolute value +sense. Hence, three other lookup tables (lut$_{s}$, lut$_{c}$ and lut$_{A}$) can be set with the $% +3\times nb_{s}$ values $\sin \theta ,$ $\cos \theta ,$ +\begin{equation*} +\left[ cos2\theta \sum_{i=0}^{M-1}sin(4\pi f.i)+sin2\theta +\sum_{i=0}^{M-1}cos(4\pi f.i)\right] . +\end{equation*} + +\item The search algorithm can be very fast using a dichotomous process in $% +log_{2}(nb_{s}).$ \end{itemize} -Finally, the whole summarizes in an algorithm (called LSQ in the following) in two parts, one before and one during the acquisition loop: +The overall method is synthetized in an algorithm (called LSQ in the +following) divided into the precomputing part and the acquisition loop: + \begin{algorithm}[htbp] \caption{LSQ algorithm - before acquisition loop.} \label{alg:lsq-before} $M \leftarrow $ number of pixels of the profile\\ - I[] $\leftarrow $ intensities of pixels\\ + I[] $\leftarrow $ intensity of pixels\\ $f \leftarrow $ frequency of the profile\\ $s4i \leftarrow \sum_{i=0}^{M-1} sin(4\pi f.i)$\\ $c4i \leftarrow \sum_{i=0}^{M-1} cos(4\pi f.i)$\\ @@ -511,7 +516,7 @@ Finally, the whole summarizes in an algorithm (called LSQ in the following) in t \For{$i=0$ to $M-1$}{ $I[i] \leftarrow I[i] - start - slope\times i$\\ } - + $I_{max} \leftarrow max_i(I[i])$, $I_{min} \leftarrow min_i(I[i])$\\ $amp \leftarrow \frac{I_{max}-I_{min}}{2}$\\ @@ -555,75 +560,85 @@ Finally, the whole summarizes in an algorithm (called LSQ in the following) in t \end{algorithm} -\subsubsection{Comparison} +\subsubsection{Algorithm comparison} We compared the two algorithms on the base of three criteria: + \begin{itemize} -\item precision of results on a cosinus profile, distorted with noise, +\item precision of results on a cosines profile, distorted by noise, + \item number of operations, -\item complexity to implement an FPGA version. + +\item complexity of FPGA implementation \end{itemize} For the first item, we produced a matlab version of each algorithm, -running with double precision values. The profile was generated for -about 34000 different values of period ($\in [3.1, 6.1]$, step = 0.1), -phase ($\in [-3.1 , 3.1]$, step = 0.062) and slope ($\in [-2 , 2]$, -step = 0.4). For LSQ, $nb_s = 1024$, which leads to a maximal error of -$\frac{\pi}{1024}$ on phase computation. Current A. Meister and -M. Favre experiments show a ratio of 50 between variation of phase and -the deflection of a lever. Thus, the maximal error due to -discretization correspond to an error of 0.15nm on the lever -deflection, which is smaller than the best precision they achieved, -i.e. 0.3nm. - -For each test, we add some noise to the profile: each group of two -pixels has its intensity added to a random number picked in $[-N,N]$ -(NB: it should be noticed that picking a new value for each pixel does -not distort enough the profile). The absolute error on the result is -evaluated by comparing the difference between the reference and -computed phase, out of $2\pi$, expressed in percents. That is: $err = -100\times \frac{|\theta_{ref} - \theta_{comp}|}{2\pi}$. - -Table \ref{tab:algo_prec} gives the maximum and average error for the two algorithms and increasing values of $N$. +running in double precision. The profile was generated for about +34,000 different quadruplets of periods ($\in \lbrack 3.1,6.1]$, step += 0.1), phases ($\in \lbrack -3.1,3.1]$, steps = 0.062) and slope +($\in \lbrack -2,2]$, step = 0.4). Obviously, the discretization of +$[-\pi ,\pi ]$ introduces an error in the phase estimation. It is at +most equal to $\frac{\pi}{nb_s}$. From some experiments on a $17\times +4$ array, authors of \cite{AFMCSEM11} noticed a average ratio of 50 +between phase variation in radians and lever end position in +nanometers. Assuming such a ratio and $nb_s = 1024$, the maximum lever +deflection error would be 0.15nm which is smaller than 0.3nm, the best +precision achieved with the setup used in \cite{AFMCSEM11}. + +Moreover, pixels have been paired and the paired intensities have been +perturbed by addition of a random number uniformly picked in +$[-N,N]$. Notice that we have observed that perturbing each pixel +independently yields too weak profile distortion. We report +percentages of errors between the reference and the computed phases +out of $2\pi ,$% +\begin{equation*} +err=100\times \frac{|\theta _{ref}-\theta _{comp}|}{2\pi }. +\end{equation*}% +Table \ref{tab:algo_prec} gives the maximum and the average errors for both +algorithms and for increasing values of $N$ the noise parameter. \begin{table}[ht] - \begin{center} - \begin{tabular}{|c|c|c|c|c|} - \hline - & \multicolumn{2}{c|}{SPL} & \multicolumn{2}{c|}{LSQ} \\ \cline{2-5} - noise & max. err. & aver. err. & max. err. & aver. err. \\ \hline - 0 & 2.46 & 0.58 & 0.49 & 0.1 \\ \hline - 2.5 & 2.75 & 0.62 & 1.16 & 0.22 \\ \hline - 5 & 3.77 & 0.72 & 2.47 & 0.41 \\ \hline - 7.5 & 4.72 & 0.86 & 3.33 & 0.62 \\ \hline - 10 & 5.62 & 1.03 & 4.29 & 0.81 \\ \hline - 15 & 7.96 & 1.38 & 6.35 & 1.21 \\ \hline - 30 & 17.06 & 2.6 & 13.94 & 2.45 \\ \hline - -\end{tabular} -\caption{Error (in \%) for cosinus profiles, with noise.} -\label{tab:algo_prec} +\begin{center} +\begin{tabular}{|c|c|c|c|c|} +\hline +& \multicolumn{2}{c|}{SPL} & \multicolumn{2}{c|}{LSQ} \\ \cline{2-5} +noise (N)& max. err. & aver. err. & max. err. & aver. err. \\ \hline +0 & 2.46 & 0.58 & 0.49 & 0.1 \\ \hline +2.5 & 2.75 & 0.62 & 1.16 & 0.22 \\ \hline +5 & 3.77 & 0.72 & 2.47 & 0.41 \\ \hline +7.5 & 4.72 & 0.86 & 3.33 & 0.62 \\ \hline +10 & 5.62 & 1.03 & 4.29 & 0.81 \\ \hline +15 & 7.96 & 1.38 & 6.35 & 1.21 \\ \hline +30 & 17.06 & 2.6 & 13.94 & 2.45 \\ \hline +\end{tabular}% \end{center} +\caption{Error (in \%) for cosines profiles, with noise.} +\label{tab:algo_prec} \end{table} -These results show that the two algorithms are very close, with a -slight advantage for LSQ. Furthemore, both behave very well against -noise. Assuming the experimental ratio of 50 (see above), an error of -1 percent on phase correspond to an error of 0.5nm on the lever -deflection, which is very close to the best precision. - -Obviously, it is very hard to predict which level of noise will be -present in real experiments and how it will distort the -profiles. Nevertheless, we can see on figure \ref{fig:noise20} the +The results show that the two algorithms yield close results, with a slight +advantage for LSQ. Furthermore, both behave very well against noise. +Assuming an average ratio of 50 (see above), an error of 1 percent on +the phase corresponds to an error of 0.5nm on the lever deflection, which is +very close to the best precision. + +It is very hard to predict which level of noise will be present in +real experiments and how it will distort the profiles. Authors of +\cite{AFMCSEM11} gave us the authorization to exploit some of their +results on a $17\times 4$ array. It allowed us to compare experimental +profiles to simulated ones. We can see on figure \ref{fig:noise20} the profile with $N=10$ that leads to the biggest error. It is a bit -distorted, with pikes and straight/rounded portions, and relatively -close to most of that come from experiments. Figure \ref{fig:noise60} -shows a sample of worst profile for $N=30$. It is completly distorted, -largely beyond the worst experimental ones. +distorted, with pikes and straight/rounded portions. In fact, it is +very close to some of the worst experimental profiles. Figure +\ref{fig:noise60} shows a sample of worst profile for $N=30$. It is +completely distorted, largely beyond any experimental ones. Obviously, +these comparisons are a bit subjectives and experimental profiles +could also be completly distorted on other experiments. Nevertheless, +they give an idea about the possible error. \begin{figure}[ht] \begin{center} - \includegraphics[width=\columnwidth]{intens-noise20} +\includegraphics[width=\columnwidth]{intens-noise20} \end{center} \caption{Sample of worst profile for N=10} \label{fig:noise20} @@ -631,154 +646,166 @@ largely beyond the worst experimental ones. \begin{figure}[ht] \begin{center} - \includegraphics[width=\columnwidth]{intens-noise60} +\includegraphics[width=\columnwidth]{intens-noise60} \end{center} \caption{Sample of worst profile for N=30} \label{fig:noise60} \end{figure} -The second criterion is relatively easy to estimate for LSQ and harder -for SPL because of $atan$ operation. In both cases, it is proportional -to numbers of pixels $M$. For LSQ, it also depends on $nb_s$ and for -SPL on $N = k\times M$, i.e. the number of interpolated points. - -We assume that $M=20$, $nb_s=1024$, $k=4$, all possible parts are -already in lookup tables and a limited set of operations (+, -, *, /, -$<$, $>$) is taken account. Translating the two algorithms in C code, we -obtain about 430 operations for LSQ and 1550 (plus few tenth for +The second criterion is relatively easy to estimate for LSQ and harder for +SPL because of the use of the arctangent function. In both cases, the number +of operation is proportional to $M$ the numbers of pixels. For LSQ, it also +depends on $nb_{s}$ and for SPL on $L=k\times M$ the number of interpolated +points. We assume that $M=20$, $nb_{s}=1024$ and $k=4$, that all possible +parts are already in lookup tables and that a limited set of operations (+, +-, *, /, $<$, $>$) is taken into account. Translating both algorithms in C +code, we obtain about 430 operations for LSQ and 1,550 (plus a few tenth for $atan$) for SPL. This result is largely in favor of LSQ. Nevertheless, -considering the total number of operations is not really pertinent for -an FPGA implementation: it mainly depends on the type of operations -and their -ordering. The final decision is thus driven by the third criterion.\\ - -The Spartan 6 used in our architecture has a hard constraint: it has no built-in -floating point units. Obviously, it is possible to use some existing -"black-boxes" for double precision operations. But they have a quite long -latency. It is much simpler to exclusively use integers, with a quantization of -all double precision values. Obviously, this quantization should not decrease -too much the precision of results. Furthermore, it should not lead to a design -with a huge latency because of operations that could not complete during a -single or few clock cycles. Divisions are in this case and, moreover, they need -a varying number of clock cycles to complete. Even multiplications can be a -problem: DSP48 take inputs of 18 bits maximum. For larger multiplications, -several DSP must be combined, increasing the latency. - -Nevertheless, the hardest constraint does not come from the FPGA characteristics -but from the algorithms. Their VHDL implentation will be efficient only if they -can be fully (or near) pipelined. By the way, the choice is quickly done: only a -small part of SPL can be. Indeed, the computation of spline coefficients -implies to solve a tridiagonal system $A.m = b$. Values in $A$ and $b$ can be -computed from incoming pixels intensity but after, the back-solve starts with -the lastest values, which breaks the pipeline. Moreover, SPL relies on -interpolating far more points than profile size. Thus, the end of SPL works on a -larger amount of data than the beginning, which also breaks the pipeline. - -LSQ has not this problem: all parts except the dichotomial search work on the -same amount of data, i.e. the profile size. Furthermore, LSQ needs less -operations than SPL, implying a smaller output latency. Consequently, it is the -best candidate for phase computation. Nevertheless, obtaining a fully pipelined -version supposes that operations of different parts complete in a single clock -cycle. It is the case for simulations but it completely fails when mapping and -routing the design on the Spartan6. By the way, extra-latency is generated and -there must be idle times between two profiles entering into the pipeline. - -%%Before obtaining the least bitstream, the crucial question is: how to -%%translate the C code the LSQ into VHDL ? - - -%\subsection{VHDL design paradigms} - -\section{Experimental tests} - -In this section we explain what we have done yet. Until now, we could not perform -real experiments since we just have received the FGPA board. Nevertheless, we -will include real experiments in the final version of this paper. +considering the total number of operations is not fully relevant for FPGA +implementation which time and space consumption depends not only on the type +of operations but also of their ordering. The final evaluation is thus very +much driven by the third criterion. + +The Spartan 6 used in our architecture has a hard constraint since it +has no built-in floating point units. Obviously, it is possible to use +some existing "black-boxes" for double precision operations. But they +require a lot of clock cycles to complete. It is much simpler to +exclusively use integers, with a quantization of all double precision +values. It should be chosen in a manner that does not alterate result +precision. Furthermore, it should not lead to a design with a huge +latency because of operations that could not complete during a single +or few clock cycles. Divisions fall into that category and, moreover, +they need a varying number of clock cycles to complete. Even +multiplications can be a problem since a DSP48 takes inputs of 18 bits +maximum. So, for larger multiplications, several DSP must be combined +which increases the overall latency. + +In the present algorithms, the hardest constraint does not come from the +FPGA characteristics but from the algorithms. Their VHDL implementation can +be efficient only if they can be fully (or near) pipelined. We observe that +only a small part of SPL can be pipelined, indeed, the computation of spline +coefficients implies to solve a linear tridiagonal system which matrix and +right-hand side are computed from incoming pixels intensity but after, the +back-solve starts with the latest values, which breaks the pipeline. +Moreover, SPL relies on interpolating far more points than profile size. +Thus, the end of SPL works on a larger amount of data than at the beginning, +which also breaks the pipeline. + +LSQ has not this problem since all parts, except the dichotomic search, work +on the same amount of data, i.e. the profile size. Furthermore, LSQ requires +less operations than SPL, implying a smaller output latency. In total, LSQ +turns out to be the best candidate for phase computation on any architecture +including FPGA. + +\section{VHDL implementation and experimental tests} + +\label{Experimental tests} \subsection{VHDL implementation} - - -% - ecriture d'un code en C avec integer -% - calcul de la taille max en bit de chaque variable en fonction de la quantization. -% - tests de quantization : équilibre entre précision et contraintes FPGA -% - en parallèle : simulink et VHDL à la main - - From the LSQ algorithm, we have written a C program that uses only -integer values. We use a very simple quantization by multiplying -double precision values by a power of two, keeping the integer -part. For example, all values stored in lut$_s$, lut$_c$, $\ldots$ are -scaled by 1024. Since LSQ also computes average, variance, ... to -remove the slope, the result of implied euclidian divisions may be -relatively wrong. To avoid that, we also scale the pixel intensities -by a power of two. Futhermore, assuming $nb_s$ is fixed, these -divisions have a knonw denominator. Thus, they can be replaced by -their multiplication/shift counterpart. Finally, all other -multiplications or divisions by a power of two have been replaced by -left or right bit shifts. By the way, the code only contains -additions, substractions and multiplications of signed integers, which -is perfectly adapted to FGPAs. - -As said above, hardware constraints have a great influence on the VHDL -implementation. Consequently, we searched the maximum value of each -variable as a function of the different scale factors and the size of -profiles, which gives their maximum size in bits. That size determines -the maximum scale factors that allow to use the least possible RAMs -and DSPs. Actually, we implemented our algorithm with this maximum -size but current works study the impact of quantization on the results -precision and design complexity. We have compared the result of the -LSQ version using integers and doubles and observed that the precision -of both were similar. - -Then we built two versions of VHDL codes: one directly by hand coding -and the other with Matlab using the Simulink HDL coder -feature~\cite{HDLCoder}. Although the approach is completely different -we obtained VHDL codes that are quite comparable. Each approach has -advantages and drawbacks. Roughly speaking, hand coding provides -beautiful and much better structured code while Simulink allows to -produce a code faster. In terms of throughput and latency, -simulations shows that the two approaches are close with a slight -advantage for hand coding. We hope that real experiments will confirm -that. +integer values. We used a very simple quantization which consists in +multiplying each double precision value by a factor power of two and +by keeping the integer part. For an accurate evaluation of the +division in the computation of $a$ the slope coefficient, we also +scaled the pixel intensities by another power of two. The main problem +was to determin these factors. Most of the time, they are chosen to +minimize the error induced by the quantization. But in our case, we +also have some hardware constraints, for example the size and depth of +RAMs or the input size of DSPs. Thus, having a maximum of values that +fit in these sizes is a very important criterion to choose the scaling +factors. + +Consequently, we have determined the maximum value of each variable as +a function of the scale factors and the profile size involved in the +algorithm. It gave us the the maximum number of bits necessary to code +them. We have chosen the scale factors so that any variable (except +the covariance) fits in 18 bits, which is the maximum input size of +DSPs. In this way, all multiplications, except one, could be done with +a single DSP, in a single clock cycle. Moreover, assuming that $nb_s = +1024$, all LUTs could fit in the 18Kbits RAMs. Finally, we compared +the double and integer versions of LSQ and found a nearly perfect +agreement between their results. + +As mentionned above, some operations like divisions must be +avoided. But when the denominator is fixed, a division can be replaced +by its multiplication/shift counterpart. This is always the case in +LSQ. For example, assuming that $M$ is fixed, $x_{var}$ is known and +fixed. Thus, $\frac{xy_{covar}}{x_{var}}$ can be replaced by + +\[ (xy_{covar}\times \left \lfloor\frac{2^n}{x_{var}} \right \rfloor) \gg n\] + +where $n$ depends on the desired precision (in our case $n=24$). + +Obviously, multiplications and divisions by a power of two can be +replaced by left or right bit shifts. Finally, the code only contains +shifts, additions, subtractions and multiplications of signed integers, which +are perfectly adapted to FGPAs. + + +We built two versions of VHDL codes, namely one directly by hand +coding and the other with Matlab using the Simulink HDL coder feature~\cite% +{HDLCoder}. Although the approaches are completely different we obtained +quite comparable VHDL codes. Each approach has advantages and drawbacks. +Roughly speaking, hand coding provides beautiful and much better structured +code while Simulind HDL coder produces allows for fast code production. In +terms of throughput and latency, simulations show that the two approaches +yield close results with a slight advantage for hand coding. \subsection{Simulation} -Currently, we have only simulated our VHDL codes with GHDL and GTKWave (two free -tools with linux). Both approaches led to correct results. At the beginning of -our simulations, our pipiline could compute a new phase each 33 cycles and the -length of the pipeline was equal to 95 cycles. When we tried to generate the -corresponding bitsream with ISE environment we had many problems because many -stages required more than the 10$n$s required by the clock frequency. So we -needed to decompose some part of the pipeline in order to add some cycles and -simplify some parts between a clock top. -% ghdl + gtkwave -% au mieux : une phase tous les 33 cycles, latence de 95 cycles. -% mais routage/placement impossible. -\subsection{Bitstream creation} +Before experimental tests on the FPGA board, we simulated our two VHDL +codes with GHDL and GTKWave (two free tools with linux). We built a +testbench based on experimental profiles and compared the results to +values given by the SPL algorithm. Both versions lead to correct +results. Our first codes were highly optimized, indeed the pipeline +could compute a new phase each 33 cycles and its latency was equal to +95 cycles. Since the Spartan6 is clocked at 100MHz, estimating the +deflection of 100 cantilevers would take about $% +(95+200\times 33).10=66.95\mu $s, i.e. nearly 15,000 estimations by +second. -Currently both approaches provide synthesable bitstreams with ISE. We expect -that the pipeline will have a latency of 112 cycles, i.e. 1.12$\mu$s and it -could accept new profiles of pixel each 48 cycles, i.e. 480$n$s. +\subsection{Bitstream creation} -% pas fait mais prévision d'une sortie tous les 480ns avec une latence de 1120 +In order to test our code on the SP Vision board, the design was +extended with a component that keeps profiles in RAM, flushes them in +the phase computation component and stores its output in another +RAM. We also added a wishbone, a component that can "drive" signals to +communicate between i.MX and other components. It is mainly used to +start to flush profiles and to retrieve the computed phases in +RAM. Unfortunately, the first designs could not be placed and routed +with ISE on the Spartan6 with a 100MHz clock. The main problems were +encountered with series of arthmetic operations and more especially +with RAM outputs used in DSPs. So, we needed to decompose some parts +of the pipeline, which added few clock cycles. Finally, we obtained a +bitstream that has been successfully tested on the board. + +Its latency is of 112 cycles and computes a new phase every 40 +cycles. For 100 cantilevers, it takes $(112+200\times 40).10=81.12\mu +$s to compute their deflection. It corresponds to about 12300 images +per second, which is largely beyond the camera capacities and the +possibility to extract a new profile from an image every 40 +cycles. Nevertheless, it also largely fits our design goals. \label{sec:results} - - - \section{Conclusion and perspectives} -In this paper we have presented a new method to estimate the cantilevers -deflection in an AFM. This method is based on least square methods. We have -studied the quantization of this algorithm and have implemented it on a -FPGA. Our method gives comparable results compared to the initial version based -on splines. Our solution has been be implemented with a pipeline technique. -Consequently, it enables to handle a new profile image very quickly. Currently -we have performed simulations and real tests on a Spartan6 FPGA. - -In future work, we want to couple our algorithm with a high speed camera -and we plan to control the whole AFM system. + +In this paper we have presented a full hardware/software solution for +real-time cantilever deflection computation from interferometry images. +Phases are computed thanks to a new algorithm based on the least square +method. It has been quantized and pipelined to be mapped into a FPGA, the +architecture of our solution. Performances have been analyzed through +simulations and real experiments on a Spartan6 FPGA. The results meet our +initial requirements. In future work, the algorithm quantization will be +better analyzed and an high speed camera will be introduced in the +processing chain so that to process real images. Finally, we will address +real-time filtering and control problems for AFM arrays in dynamic regime. + +\section{Acknowledgments} +We would like to thank A. Meister and M. Favre, from CSEM, for sharing all the +material we used to write this article and for the time they spent to +explain us their approach. \bibliographystyle{plain} \bibliography{biblio}