-\title{Using FPGAs for high speed and real time cantilever deflection estimation}
+\title{A new approach based on least square methods to estimate in real time cantilevers deflection with a FPGA}
\author{\IEEEauthorblockN{Raphaël Couturier\IEEEauthorrefmark{1}, Stéphane Domas\IEEEauthorrefmark{1}, Gwenhaël Goavec-Merou\IEEEauthorrefmark{2} and Michel Lenczner\IEEEauthorrefmark{2}}
\IEEEauthorblockA{\IEEEauthorrefmark{1}FEMTO-ST, DISC, University of Franche-Comte, Belfort, France\\
\{raphael.couturier,stephane.domas\}@univ-fcomte.fr}
\begin{abstract}
-
+ Atomic force microscope (AFM) provides high resolution images of
+ surfaces. We focus our attention on an interferometry method to
+ estimate the cantilevers deflection. The initial method was based
+ on splines to determine the phase of interference fringes, and thus
+ the deflection. Computations were performed on a PC with LabView.
+ In this paper, we propose a new approach based on the least square
+ methods and its implementation that we developed on a FPGA, using
+ the pipelining technique. Simulations and real tests showed us that
+ this implementation is very efficient and should allow us to control
+ a cantilevers array in real time.
\end{abstract}
\section{Introduction}
Cantilevers are used inside atomic force microscope (AFM) which provides high
-resolution images of surfaces. Several technics have been used to measure the
-displacement of cantilevers in litterature. For example, it is possible to
+resolution images of surfaces. Several techniques have been used to measure the
+displacement of cantilevers in literature. For example, it is possible to
determine accurately the deflection with different mechanisms.
In~\cite{CantiPiezzo01}, authors used piezoresistor integrated into the
cantilever. Nevertheless this approach suffers from the complexity of the
microfabrication process needed to implement the sensor in the cantilever.
In~\cite{CantiCapacitive03}, authors have presented an cantilever mechanism
-based on capacitive sensing. This kind of technic also involves to instrument
-the cantiliver which result in a complex fabrication process.
+based on capacitive sensing. This kind of technique also involves to instrument
+the cantilever which result in a complex fabrication process.
In this paper our attention is focused on a method based on interferometry to
measure cantilevers' displacements. In this method cantilevers are illuminated
-by an optic source. The interferometry produces fringes on each cantilevers
+by an optic source. The interferometry produces fringes on each cantilever
which enables to compute the cantilever displacement. In order to analyze the
fringes a high speed camera is used. Images need to be processed quickly and
then a estimation method is required to determine the displacement of each
-cantilever. In~\cite{AFMCSEM11}, the authors have used an algorithm based on
+cantilever. In~\cite{AFMCSEM11}, authors have used an algorithm based on
spline to estimate the cantilevers' positions.
- The overall process gives
-accurate results but all the computation are performed on a standard computer
-using labview. Consequently, the main drawback of this implementation is that
-the computer is a bootleneck in the overall process. In this paper we propose to
-use a method based on least square and to implement all the computation on a
-FGPA.
+The overall process gives accurate results but all the computations
+are performed on a standard computer using LabView. Consequently, the
+main drawback of this implementation is that the computer is a
+bottleneck. In this paper we propose to use a method based on least
+square and to implement all the computation on a FPGA.
The remainder of the paper is organized as follows. Section~\ref{sec:measure}
describes more precisely the measurement process. Our solution based on the
\section{Measurement principles}
\label{sec:measure}
-
-
-
-
-
-
-
\subsection{Architecture}
\label{sec:archi}
%% description de l'architecture générale de l'acquisition d'images
%% qu'elle est.
In order to develop simple, cost effective and user-friendly cantilever arrays,
-authors of ~\cite{AFMCSEM11} have developped a system based of
+authors of ~\cite{AFMCSEM11} have developed a system based of
interferometry. In opposition to other optical based systems, using a laser beam
-deflection scheme and sentitive to the angular displacement of the cantilever,
+deflection scheme and sensitive to the angular displacement of the cantilever,
interferometry is sensitive to the optical path difference induced by the
vertical displacement of the cantilever.
-The system build by authors of~\cite{AFMCSEM11} has been developped based on a
-Linnick interferomter~\cite{Sinclair:05}. It is illustrated in
-Figure~\ref{fig:AFM}. A laser diode is first split (by the splitter) into a
-reference beam and a sample beam that reachs the cantilever array. In order to
-be able to move the cantilever array, it is mounted on a translation and
-rotational hexapod stage with five degrees of freedom. The optical system is
-also fixed to the stage. Thus, the cantilever array is centered in the optical
-system which can be adjusted accurately. The beam illuminates the array by a
-microscope objective and the light reflects on the cantilevers. Likewise the
-reference beam reflects on a movable mirror. A CMOS camera chip records the
-reference and sample beams which are recombined in the beam splitter and the
-interferogram. At the beginning of each experiment, the movable mirror is
-fitted manually in order to align the interferometric fringes approximately
-parallel to the cantilevers. When cantilevers move due to the surface, the
-bending of cantilevers produce movements in the fringes that can be detected
-with the CMOS camera. Finally the fringes need to be
-analyzed. In~\cite{AFMCSEM11}, the authors used a LabView program to compute the
-cantilevers' movements from the fringes.
+The system build by these authors is based on a Linnick
+interferometer~\cite{Sinclair:05}. It is illustrated in
+Figure~\ref{fig:AFM}. A laser diode is first split (by the splitter)
+into a reference beam and a sample beam that reaches the cantilever
+array. In order to be able to move the cantilever array, it is
+mounted on a translation and rotational hexapod stage with five
+degrees of freedom. The optical system is also fixed to the stage.
+Thus, the cantilever array is centered in the optical system which can
+be adjusted accurately. The beam illuminates the array by a
+microscope objective and the light reflects on the cantilevers.
+Likewise the reference beam reflects on a movable mirror. A CMOS
+camera chip records the reference and sample beams which are
+recombined in the beam splitter and the interferogram. At the
+beginning of each experiment, the movable mirror is fitted manually in
+order to align the interferometric fringes approximately parallel to
+the cantilevers. When cantilevers move due to the surface, the
+bending of cantilevers produce movements in the fringes that can be
+detected with the CMOS camera. Finally the fringes need to be
+analyzed. In~\cite{AFMCSEM11}, authors used a LabView program to
+compute the cantilevers' deflections from the fringes.
\begin{figure}
\begin{center}
\subsection{Cantilever deflection estimation}
\label{sec:deflest}
-As shown on image \ref{img:img-xp}, each cantilever is covered by
-interferometric fringes. The fringes will distort when cantilevers are
-deflected. Estimating the deflection is done by computing this
-distortion. For that, (ref A. Meister + M Favre) proposed a method
-based on computing the phase of the fringes, at the base of each
-cantilever, near the tip, and on the base of the array. They assume
-that a linear relation binds these phases, which can be use to
-"unwrap" the phase at the tip and to determine the deflection.\\
-
-More precisely, segment of pixels are extracted from images taken by a
-high-speed camera. These segments are large enough to cover several
-interferometric fringes and are placed at the base and near the tip of
-the cantilevers. They are called base profile and tip profile in the
-following. Furthermore, a reference profile is taken on the base of
-the cantilever array.
+\begin{figure}
+\begin{center}
+\includegraphics[width=\columnwidth]{lever-xp}
+\end{center}
+\caption{Portion of an image picked by the camera}
+\label{fig:img-xp}
+\end{figure}
+
+As shown on image \ref{fig:img-xp}, each cantilever is covered by
+several interferometric fringes. The fringes will distort when
+cantilevers are deflected. Estimating the deflection is done by
+computing this distortion. For that, authors of \cite{AFMCSEM11}
+proposed a method based on computing the phase of the fringes, at the
+base of each cantilever, near the tip, and on the base of the
+array. They assume that a linear relation binds these phases, which
+can be use to "unwrap" the phase at the tip and to determine the deflection.\\
+
+More precisely, segment of pixels are extracted from images taken by
+the camera. These segments are large enough to cover several
+interferometric fringes. As said above, they are placed at the base
+and near the tip of the cantilevers. They are called base profile and
+tip profile in the following. Furthermore, a reference profile is
+taken on the base of the cantilever array.
The pixels intensity $I$ (in gray level) of each profile is modelized by:
where $x$ is the position of a pixel in its associated segment.
The global method consists in two main sequences. The first one aims
-to determin the frequency $f$ of each profile with an algorithm based
+to determine the frequency $f$ of each profile with an algorithm based
on spline interpolation (see section \ref{algo-spline}). It also
computes the coefficient used for unwrapping the phase. The second one
is the acquisition loop, while which images are taken at regular time
cantilever should take less than 25$\mu$s, thus 12.5$\mu$s by phase.\\
In fact, this timing is a very hard constraint. Let consider a very
-small programm that initializes twenty million of doubles in memory
+small program that initializes twenty million of doubles in memory
and then does 1000000 cumulated sums on 20 contiguous values
(experimental profiles have about this size). On an intel Core 2 Duo
E6650 at 2.33GHz, this program reaches an average of 155Mflops.
\subsection{FPGAs}
-A field-programmable gate array (FPGA) is an integrated circuit
-designed to be configured by the customer. FGPAs are composed of
-programmable logic components, called configurable logic blocks
-(CLB). These blocks mainly contains look-up tables (LUT), flip/flops
-(F/F) and latches, organized in one or more slices connected
-together. Each CLB can be configured to perform simple (AND, XOR, ...)
-or complex combinational functions. They are interconnected by
-reconfigurable links. Modern FPGAs contain memory elements and
-multipliers which enable to simplify the design and to increase the
-performance. Nevertheless, all other complex operations, like
-division, trigonometric functions, $\ldots$ are not available and must
-be done by configuring a set of CLBs. Since this configuration is not
-obvious at all, it can be done via a framework, like ISE. Such a
-software can synthetize a design written in a hardware description
-language (HDL), map it onto CLBs, place/route them for a specific
-FPGA, and finally produce a bitstream that is used to configre the
-FPGA. Thus, from the developper point of view, the main difficulty is
-to translate an algorithm in HDL code, taking account FPGA resources
-and constraints like clock signals and I/O values that drive the FPGA.
+A field-programmable gate array (FPGA) is an integrated circuit designed to be
+configured by the customer. FGPAs are composed of programmable logic components,
+called configurable logic blocks (CLB). These blocks mainly contains look-up
+tables (LUT), flip/flops (F/F) and latches, organized in one or more slices
+connected together. Each CLB can be configured to perform simple (AND, XOR, ...)
+or complex combinational functions. They are interconnected by reconfigurable
+links. Modern FPGAs contain memory elements and multipliers which enable to
+simplify the design and to increase the performance. Nevertheless, all other
+complex operations, like division, trigonometric functions, $\ldots$ are not
+available and must be done by configuring a set of CLBs. Since this
+configuration is not obvious at all, it can be done via a framework, like
+ISE~\cite{ISE}. Such a software can synthetize a design written in a hardware
+description language (HDL), map it onto CLBs, place/route them for a specific
+FPGA, and finally produce a bitstream that is used to configure the FPGA. Thus,
+from the developer point of view, the main difficulty is to translate an
+algorithm in HDL code, taking account FPGA resources and constraints like clock
+signals and I/O values that drive the FPGA.
Indeed, HDL programming is very different from classic languages like
C. A program can be seen as a state-machine, manipulating signals that
evolve from state to state. By the way, HDL instructions can execute
-concurrently. Basic logic operations are used to agregate signals to
+concurrently. Basic logic operations are used to aggregate signals to
produce new states and assign it to another signal. States are mainly
-expressed as arrays of bits. Fortunaltely, libraries propose some
+expressed as arrays of bits. Fortunately, libraries propose some
higher levels representations like signed integers, and arithmetic
operations.
\subsection{The board}
-The board we use is designed by the Armadeus compagny, under the name
+The board we use is designed by the Armadeus company, under the name
SP Vision. It consists in a development board hosting a i.MX27 ARM
processor (from Freescale). The board includes all classical
connectors: USB, Ethernet, ... A Flash memory contains a Linux kernel
tunnel. By default, the WEIM interface provides a clock signal at
100MHz that is connected to dedicated FPGA pins.
-The Spartan6 is an LX100 version. It has 15822 slices, equivalent to
-101261 logic cells. There are 268 internal block RAM of 18Kbits, and
-180 dedicated multiply-adders (named DSP48), which is largely enough
-for our project.
+The Spartan6 is an LX100 version. It has 15822 slices, each slice
+containing 4 LUTs and 8 flip/flops. It is equivalent to 101261 logic
+cells. There are 268 internal block RAM of 18Kbits, and 180 dedicated
+multiply-adders (named DSP48), which is largely enough for our
+project.
Some I/O pins of Spartan6 are connected to two $2\times 17$ headers
that can be used as user wants. For the project, they will be
classical least square method but suppose that frequency is already
known.
-\subsubsection{Spline algorithm}
+\subsubsection{Spline algorithm (SPL)}
\label{sec:algo-spline}
Let consider a profile $P$, that is a segment of $M$ pixels with an
intensity in gray levels. Let call $I(x)$ the intensity of profile in $x
\in [0,M[$.
-At first, only $M$ values of $I$ are known, for $x = 0, 1, \ldots,M-1$. A
-normalisation allows to scale known intensities into $[-1,1]$. We compute
-splines that fit at best these normalised intensities. Splines (SPL in the
-following) are used to interpolate $N = k\times M$ points (typically $k=4$ is
-sufficient), within $[0,M[$. Let call $x^s$ the coordinates of these $N$ points
- and $I^s$ their intensities.
+At first, only $M$ values of $I$ are known, for $x = 0, 1,
+\ldots,M-1$. A normalization allows to scale known intensities into
+$[-1,1]$. We compute splines that fit at best these normalized
+intensities. Splines are used to interpolate $N = k\times M$ points
+(typically $k=4$ is sufficient), within $[0,M[$. Let call $x^s$ the
+coordinates of these $N$ points and $I^s$ their intensities.
In order to have the frequency, the mean line $a.x+b$ (see equation \ref{equ:profile}) of $I^s$ is
computed. Finding intersections of $I^s$ and this line allow to obtain
computation of $\theta$.
\end{itemize}
-\subsubsection{Least square algorithm}
+\subsubsection{Least square algorithm (LSQ)}
Assuming that we compute the phase during the acquisition loop,
equation \ref{equ:profile} has only 4 parameters: $a, b, A$, and
ending with a convergence criterion, it is obvious that it is not
particularly adapted to our design goals.
-Fortunatly, it is quite simple to reduce the number of parameters to
+Fortunately, it is quite simple to reduce the number of parameters to
only $\theta$. Let $x^p$ be the coordinates of pixels in a segment of
size $M$. Thus, $x^p = 0, 1, \ldots, M-1$. Let $I(x^p)$ be their
intensity. Firstly, we "remove" the slope by computing:
\end{itemize}
Finally, the whole summarizes in an algorithm (called LSQ in the following) in two parts, one before and one during the acquisition loop:
-\begin{algorithm}[h]
+\begin{algorithm}[htbp]
\caption{LSQ algorithm - before acquisition loop.}
\label{alg:lsq-before}
}
\end{algorithm}
-\begin{algorithm}[ht]
+\begin{algorithm}[htbp]
\caption{LSQ algorithm - during acquisition loop.}
\label{alg:lsq-during}
We compared the two algorithms on the base of three criteria:
\begin{itemize}
-\item precision of results on a cosinus profile, distorted with noise,
+\item precision of results on a cosines profile, distorted with noise,
\item number of operations,
\item complexity to implement an FPGA version.
\end{itemize}
30 & 17.06 & 2.6 & 13.94 & 2.45 \\ \hline
\end{tabular}
-\caption{Error (in \%) for cosinus profiles, with noise.}
+\caption{Error (in \%) for cosines profiles, with noise.}
\label{tab:algo_prec}
\end{center}
\end{table}
These results show that the two algorithms are very close, with a
-slight advantage for LSQ. Furthemore, both behave very well against
+slight advantage for LSQ. Furthermore, both behave very well against
noise. Assuming the experimental ratio of 50 (see above), an error of
1 percent on phase correspond to an error of 0.5nm on the lever
deflection, which is very close to the best precision.
profile with $N=10$ that leads to the biggest error. It is a bit
distorted, with pikes and straight/rounded portions, and relatively
close to most of that come from experiments. Figure \ref{fig:noise60}
-shows a sample of worst profile for $N=30$. It is completly distorted,
+shows a sample of worst profile for $N=30$. It is completely distorted,
largely beyond the worst experimental ones.
\begin{figure}[ht]
several DSP must be combined, increasing the latency.
Nevertheless, the hardest constraint does not come from the FPGA characteristics
-but from the algorithms. Their VHDL implentation will be efficient only if they
+but from the algorithms. Their VHDL implementation will be efficient only if they
can be fully (or near) pipelined. By the way, the choice is quickly done: only a
small part of SPL can be. Indeed, the computation of spline coefficients
implies to solve a tridiagonal system $A.m = b$. Values in $A$ and $b$ can be
computed from incoming pixels intensity but after, the back-solve starts with
-the lastest values, which breaks the pipeline. Moreover, SPL relies on
+the latest values, which breaks the pipeline. Moreover, SPL relies on
interpolating far more points than profile size. Thus, the end of SPL works on a
larger amount of data than the beginning, which also breaks the pipeline.
\subsection{VHDL implementation}
+From the LSQ algorithm, we have written a C program that uses only
+integer values. We use a very simple quantization by multiplying
+double precision values by a power of two, keeping the integer
+part. For example, all values stored in lut$_s$, lut$_c$, $\ldots$ are
+scaled by 1024. Since LSQ also computes average, variance, ... to
+remove the slope, the result of implied Euclidean divisions may be
+relatively wrong. To avoid that, we also scale the pixel intensities
+by a power of two. Furthermore, assuming $nb_s$ is fixed, these
+divisions have a known denominator. Thus, they can be replaced by
+their multiplication/shift counterpart. Finally, all other
+multiplications or divisions by a power of two have been replaced by
+left or right bit shifts. By the way, the code only contains
+additions, subtractions and multiplications of signed integers, which
+is perfectly adapted to FGPAs.
+
+As said above, hardware constraints have a great influence on the VHDL
+implementation. Consequently, we searched the maximum value of each
+variable as a function of the different scale factors and the size of
+profiles, which gives their maximum size in bits. That size determines
+the maximum scale factors that allow to use the least possible RAMs
+and DSPs. Actually, we implemented our algorithm with this maximum
+size but current works study the impact of quantization on the results
+precision and design complexity. We have compared the result of the
+LSQ version using integers and doubles and observed that the precision
+of both were similar.
+
+Then we built two versions of VHDL codes: one directly by hand coding
+and the other with Matlab using the Simulink HDL coder
+feature~\cite{HDLCoder}. Although the approach is completely different
+we obtained VHDL codes that are quite comparable. Each approach has
+advantages and drawbacks. Roughly speaking, hand coding provides
+beautiful and much better structured code while Simulink allows to
+produce a code faster. In terms of throughput and latency,
+simulations shows that the two approaches are close with a slight
+advantage for hand coding. We hope that real experiments will confirm
+that.
+\subsection{Simulation}
-% - ecriture d'un code en C avec integer
-% - calcul de la taille max en bit de chaque variable en fonction de la quantization.
-% - tests de quantization : équilibre entre précision et contraintes FPGA
-% - en parallèle : simulink et VHDL à la main
+Before experimental tests on the board, we simulated our two VHDL
+codes with GHDL and GTKWave (two free tools with linux). For that, we
+build a testbench based on profiles taken from experimentations and
+compare the results to values given by the SPL algorithm. Both
+versions lead to correct results.
+Our first code were highly optimized : the pipeline could compute a
+new phase each 33 cycles and its latency was equal to 95 cycles. Since
+the Spartan6 is clocked at 100MHz, it implies that estimating the
+deflection of 100 cantilevers would take about $(95 + 200\times 33).10
+= 66.95\mu$s, i.e. nearly 15000 estimations by second.
-From the LSQ algorithm, we have written a C program which uses only integer
-values that have been previously scaled. The quantization of doubles into
-integers has been performed in order to obtain a good trade-off between the
-number of bits used and the precision. Finally, we have compared the result of
-the LSQ version using integer and double. We have observed that the results of
-both versions were similar.
+\subsection{Bitstream creation}
-Then we have built two versions of VHDL codes: one directly by hand coding and
-the other with Matlab using simulink HDL coder feature. Although the approach is
-completely different we have obtain VHDL codes that are quite comparable. Each
-approach has advantages and drawbacks. Roughly speaking, hand coding provides
-beautiful and much better structures code while HDL coder provides code faster.
-In terms of speed of code, we think that both approaches will be quite
-comparable. Real experiments will confirm that. In the LSQ algorithm, we have
-replaced all the divisions by multiplications by a constant since divisions are
-performed with constants depending of the number of pixels in the profile
-(i.e. $M$).
+In order to test our code on the SP Vision board, the design was
+extended with a component that keeps profiles in RAM, flushes them in
+the phase computation component and stores its output in another
+RAM. We also added a wishbone : a component that can "drive" signals
+to communicate between i.MX and others components. It is mainly used
+to start to flush profiles and to retrieve the computed phases in RAM.
-\subsection{Simulation}
+Unfortunately, the first designs could not be placed and route with ISE
+on the Spartan6 with a 100MHz clock. The main problems came from
+routing values from RAMs to DSPs and obtaining a result under 10ns. By
+the way, we needed to decompose some parts of the pipeline, which adds
+some cycles. For example, some delays have been introduced between
+RAMs output and DSPs. Finally, we obtained a bitstream that has a
+latency of 112 cycles and computes a new phase every 40 cycles. For
+100 cantilevers, it takes $(112 + 200\times 40).10 = 81.12\mu$s to
+compute their deflection.
-Currently, we only have simulated our VHDL codes with GHDL and GTKWave (two free
-tools with linux). Both approaches led to correct results. At the beginning with
-simulations our pipiline could compute a new phase each 33 cycles and the length
-of the pipeline was equal to 95 cycles. When we tried to generate the bitsream
-with ISE environment we had many problems because many stages required more than
-the 10$n$s availabe. So we needed to decompose some part of the pipeline in order
-to add some cycles and siplify some parts.
-% ghdl + gtkwave
-% au mieux : une phase tous les 33 cycles, latence de 95 cycles.
-% mais routage/placement impossible.
-\subsection{Bitstream creation}
+This bitstream has been successfully tested on the board TODAY ! YEAAHHHHH
-Currently both approaches provide synthesable bitstreams with ISE. We expect
-that the pipeline will have a latency of 112 cycles, i.e. 1.12$\mu$s and it
-could accept new line of pixel each 48 cycles, i.e. 480$n$s.
-% pas fait mais prévision d'une sortie tous les 480ns avec une latence de 1120
\label{sec:results}
\section{Conclusion and perspectives}
-
+In this paper we have presented a new method to estimate the
+cantilevers deflection in an AFM. This method is based on least
+square methods. We have used quantization to produce an algorithm
+based exclusively on integer values, which is adapted to a FPGA
+implementation. We obtained a precision on results similar to the
+initial version based on splines. Our solution has been implemented
+with a pipeline technique. Consequently, it enables to handle a new
+profile image very quickly. Currently we have performed simulations
+and real tests on a Spartan6 FPGA.
+
+In future work, we plan to study the quantization. Then we want to couple our
+algorithm with a high speed camera and we plan to control the whole AFM system.
\bibliographystyle{plain}
\bibliography{biblio}