-\title{Using FPGAs for high speed and real time cantilever deflection estimation}
+\title{A new approach based on least square methods to estimate in real time cantilevers deflection with a FPGA}
\author{\IEEEauthorblockN{Raphaël Couturier\IEEEauthorrefmark{1}, Stéphane Domas\IEEEauthorrefmark{1}, Gwenhaël Goavec-Merou\IEEEauthorrefmark{2} and Michel Lenczner\IEEEauthorrefmark{2}}
\IEEEauthorblockA{\IEEEauthorrefmark{1}FEMTO-ST, DISC, University of Franche-Comte, Belfort, France\\
\{raphael.couturier,stephane.domas\}@univ-fcomte.fr}
In this paper our attention is focused on a method based on interferometry to
measure cantilevers' displacements. In this method cantilevers are illuminated
-by an optic source. The interferometry produces fringes on each cantilevers
+by an optic source. The interferometry produces fringes on each cantilever
which enables to compute the cantilever displacement. In order to analyze the
fringes a high speed camera is used. Images need to be processed quickly and
then a estimation method is required to determine the displacement of each
-cantilever. In~\cite{AFMCSEM11}, the authors have used an algorithm based on
+cantilever. In~\cite{AFMCSEM11}, authors have used an algorithm based on
spline to estimate the cantilevers' positions.
- The overall process gives
-accurate results but all the computation are performed on a standard computer
-using labview. Consequently, the main drawback of this implementation is that
-the computer is a bootleneck in the overall process. In this paper we propose to
-use a method based on least square and to implement all the computation on a
-FGPA.
+The overall process gives accurate results but all the computations
+are performed on a standard computer using LabView. Consequently, the
+main drawback of this implementation is that the computer is a
+bootleneck. In this paper we propose to use a method based on least
+square and to implement all the computation on a FGPA.
The remainder of the paper is organized as follows. Section~\ref{sec:measure}
describes more precisely the measurement process. Our solution based on the
\section{Measurement principles}
\label{sec:measure}
-
-
-
-
-
-
-
\subsection{Architecture}
\label{sec:archi}
%% description de l'architecture générale de l'acquisition d'images
interferometry is sensitive to the optical path difference induced by the
vertical displacement of the cantilever.
-The system build by authors of~\cite{AFMCSEM11} has been developped based on a
-Linnick interferomter~\cite{Sinclair:05}. It is illustrated in
-Figure~\ref{fig:AFM}. A laser diode is first split (by the splitter) into a
-reference beam and a sample beam that reachs the cantilever array. In order to
-be able to move the cantilever array, it is mounted on a translation and
-rotational hexapod stage with five degrees of freedom. The optical system is
-also fixed to the stage. Thus, the cantilever array is centered in the optical
-system which can be adjusted accurately. The beam illuminates the array by a
-microscope objective and the light reflects on the cantilevers. Likewise the
-reference beam reflects on a movable mirror. A CMOS camera chip records the
-reference and sample beams which are recombined in the beam splitter and the
-interferogram. At the beginning of each experiment, the movable mirror is
-fitted manually in order to align the interferometric fringes approximately
-parallel to the cantilevers. When cantilevers move due to the surface, the
-bending of cantilevers produce movements in the fringes that can be detected
-with the CMOS camera. Finally the fringes need to be
-analyzed. In~\cite{AFMCSEM11}, the authors used a LabView program to compute the
-cantilevers' movements from the fringes.
+The system build by these authors is based on a Linnick
+interferomter~\cite{Sinclair:05}. It is illustrated in
+Figure~\ref{fig:AFM}. A laser diode is first split (by the splitter)
+into a reference beam and a sample beam that reachs the cantilever
+array. In order to be able to move the cantilever array, it is
+mounted on a translation and rotational hexapod stage with five
+degrees of freedom. The optical system is also fixed to the stage.
+Thus, the cantilever array is centered in the optical system which can
+be adjusted accurately. The beam illuminates the array by a
+microscope objective and the light reflects on the cantilevers.
+Likewise the reference beam reflects on a movable mirror. A CMOS
+camera chip records the reference and sample beams which are
+recombined in the beam splitter and the interferogram. At the
+beginning of each experiment, the movable mirror is fitted manually in
+order to align the interferometric fringes approximately parallel to
+the cantilevers. When cantilevers move due to the surface, the
+bending of cantilevers produce movements in the fringes that can be
+detected with the CMOS camera. Finally the fringes need to be
+analyzed. In~\cite{AFMCSEM11}, authors used a LabView program to
+compute the cantilevers' deflections from the fringes.
\begin{figure}
\begin{center}
\subsection{Cantilever deflection estimation}
\label{sec:deflest}
-As shown on image \ref{img:img-xp}, each cantilever is covered by
-interferometric fringes. The fringes will distort when cantilevers are
-deflected. Estimating the deflection is done by computing this
-distortion. For that, (ref A. Meister + M Favre) proposed a method
-based on computing the phase of the fringes, at the base of each
-cantilever, near the tip, and on the base of the array. They assume
-that a linear relation binds these phases, which can be use to
-"unwrap" the phase at the tip and to determine the deflection.\\
-
-More precisely, segment of pixels are extracted from images taken by a
-high-speed camera. These segments are large enough to cover several
-interferometric fringes and are placed at the base and near the tip of
-the cantilevers. They are called base profile and tip profile in the
-following. Furthermore, a reference profile is taken on the base of
-the cantilever array.
+\begin{figure}
+\begin{center}
+\includegraphics[width=\columnwidth]{lever-xp}
+\end{center}
+\caption{Portion of an image picked by the camera}
+\label{fig:img-xp}
+\end{figure}
+
+As shown on image \ref{fig:img-xp}, each cantilever is covered by
+several interferometric fringes. The fringes will distort when
+cantilevers are deflected. Estimating the deflection is done by
+computing this distortion. For that, authors of \cite{AFMCSEM11}
+proposed a method based on computing the phase of the fringes, at the
+base of each cantilever, near the tip, and on the base of the
+array. They assume that a linear relation binds these phases, which
+can be use to "unwrap" the phase at the tip and to determine the deflection.\\
+
+More precisely, segment of pixels are extracted from images taken by
+the camera. These segments are large enough to cover several
+interferometric fringes. As said above, they are placed at the base
+and near the tip of the cantilevers. They are called base profile and
+tip profile in the following. Furthermore, a reference profile is
+taken on the base of the cantilever array.
The pixels intensity $I$ (in gray level) of each profile is modelized by:
\subsection{FPGAs}
-A field-programmable gate array (FPGA) is an integrated circuit
-designed to be configured by the customer. FGPAs are composed of
-programmable logic components, called configurable logic blocks
-(CLB). These blocks mainly contains look-up tables (LUT), flip/flops
-(F/F) and latches, organized in one or more slices connected
-together. Each CLB can be configured to perform simple (AND, XOR, ...)
-or complex combinational functions. They are interconnected by
-reconfigurable links. Modern FPGAs contain memory elements and
-multipliers which enable to simplify the design and to increase the
-performance. Nevertheless, all other complex operations, like
-division, trigonometric functions, $\ldots$ are not available and must
-be done by configuring a set of CLBs. Since this configuration is not
-obvious at all, it can be done via a framework, like ISE. Such a
-software can synthetize a design written in an hardware description
-language (HDL), map it onto CLBs, place/route them for a specific
-FPGA, and finally produce a bitstream that is used to configre the
-FPGA. Thus, from the developper point of view, the main difficulty is
-to translate an algorithm in HDL code, taking account FPGA resources
-and constraints like clock signals and I/O values that drive the FPGA.
+A field-programmable gate array (FPGA) is an integrated circuit designed to be
+configured by the customer. FGPAs are composed of programmable logic components,
+called configurable logic blocks (CLB). These blocks mainly contains look-up
+tables (LUT), flip/flops (F/F) and latches, organized in one or more slices
+connected together. Each CLB can be configured to perform simple (AND, XOR, ...)
+or complex combinational functions. They are interconnected by reconfigurable
+links. Modern FPGAs contain memory elements and multipliers which enable to
+simplify the design and to increase the performance. Nevertheless, all other
+complex operations, like division, trigonometric functions, $\ldots$ are not
+available and must be done by configuring a set of CLBs. Since this
+configuration is not obvious at all, it can be done via a framework, like
+ISE~\cite{ISE}. Such a software can synthetize a design written in a hardware
+description language (HDL), map it onto CLBs, place/route them for a specific
+FPGA, and finally produce a bitstream that is used to configre the FPGA. Thus,
+from the developper point of view, the main difficulty is to translate an
+algorithm in HDL code, taking account FPGA resources and constraints like clock
+signals and I/O values that drive the FPGA.
Indeed, HDL programming is very different from classic languages like
C. A program can be seen as a state-machine, manipulating signals that
tunnel. By default, the WEIM interface provides a clock signal at
100MHz that is connected to dedicated FPGA pins.
-The Spartan6 is an LX100 version. It has 15822 slices, equivalent to
-101261 logic cells. There are 268 internal block RAM of 18Kbits, and
-180 dedicated multiply-adders (named DSP48), which is largely enough
-for our project.
+The Spartan6 is an LX100 version. It has 15822 slices, each slice
+containing 4 LUTs and 8 flip/flops. It is equivalent to 101261 logic
+cells. There are 268 internal block RAM of 18Kbits, and 180 dedicated
+multiply-adders (named DSP48), which is largely enough for our
+project.
Some I/O pins of Spartan6 are connected to two $2\times 17$ headers
that can be used as user wants. For the project, they will be
classical least square method but suppose that frequency is already
known.
-\subsubsection{Spline algorithm}
+\subsubsection{Spline algorithm (SPL)}
\label{sec:algo-spline}
Let consider a profile $P$, that is a segment of $M$ pixels with an
intensity in gray levels. Let call $I(x)$ the intensity of profile in $x
\in [0,M[$.
-At first, only $M$ values of $I$ are known, for $x = 0, 1, \ldots,M-1$. A
-normalisation allows to scale known intensities into $[-1,1]$. We compute
-splines that fit at best these normalised intensities. Splines (SPL in the
-following) are used to interpolate $N = k\times M$ points (typically $k=4$ is
-sufficient), within $[0,M[$. Let call $x^s$ the coordinates of these $N$ points
- and $I^s$ their intensities.
+At first, only $M$ values of $I$ are known, for $x = 0, 1,
+\ldots,M-1$. A normalisation allows to scale known intensities into
+$[-1,1]$. We compute splines that fit at best these normalised
+intensities. Splines are used to interpolate $N = k\times M$ points
+(typically $k=4$ is sufficient), within $[0,M[$. Let call $x^s$ the
+coordinates of these $N$ points and $I^s$ their intensities.
In order to have the frequency, the mean line $a.x+b$ (see equation \ref{equ:profile}) of $I^s$ is
computed. Finding intersections of $I^s$ and this line allow to obtain
computation of $\theta$.
\end{itemize}
-\subsubsection{Least square algorithm}
+\subsubsection{Least square algorithm (LSQ)}
Assuming that we compute the phase during the acquisition loop,
equation \ref{equ:profile} has only 4 parameters: $a, b, A$, and
\end{itemize}
Finally, the whole summarizes in an algorithm (called LSQ in the following) in two parts, one before and one during the acquisition loop:
-\begin{algorithm}[h]
+\begin{algorithm}[htbp]
\caption{LSQ algorithm - before acquisition loop.}
\label{alg:lsq-before}
}
\end{algorithm}
-\begin{algorithm}[ht]
+\begin{algorithm}[htbp]
\caption{LSQ algorithm - during acquisition loop.}
\label{alg:lsq-during}
and their
ordering. The final decision is thus driven by the third criterion.\\
-The Spartan 6 used in our architecture has hard constraint: it has no
-built-in floating point units. Obviously, it is possible to use some
-existing "black-boxes" for double precision operations. But they have
-a quite long latency. It is much simpler to exclusively use integers,
-with a quantization of all double precision values. Obviously, this
-quantization should not decrease too much the precision of
-results. Furthermore, it should not lead to a design with a huge
-latency because of operations that could not complete during a single
-or few clock cycles. Divisions are in this case and, moreover, they
-need an varying number of clock cycles to complete. Even
-multiplications can be a problem: DSP48 take inputs of 18 bits
-maximum. For larger multiplications, several DSP must be combined,
-increasing the latency.
-
-Nevertheless, the hardest constraint does not come from the FPGA
-characteristics but from the algorithms. Their VHDL implentation will
-be efficient only if they can be fully (or near) pipelined. By the
-way, the choice is quickly done: only a small part of SPL can be.
-Indeed, the computation of spline coefficients implies to solve a
-tridiagonal system $A.m = b$. Values in $A$ and $b$ can be computed
-from incoming pixels intensity but after, the back-solve starts with
-the lastest values, which breaks the pipeline. Moreover, SPL relies on
-interpolating far more points than profile size. Thus, the end
-of SPL works on a larger amount of data than the beginning, which
-also breaks the pipeline.
-
-LSQ has not this problem: all parts except the dichotomial search
-work on the same amount of data, i.e. the profile size. Furthermore,
-LSQ needs less operations than SPL, implying a smaller output
-latency. Consequently, it is the best candidate for phase
-computation. Nevertheless, obtaining a fully pipelined version
-supposes that operations of different parts complete in a single clock
-cycle. It is the case for simulations but it completely fails when
-mapping and routing the design on the Spartan6. By the way,
-extra-latency is generated and there must be idle times between two
-profiles entering into the pipeline.
+The Spartan 6 used in our architecture has a hard constraint: it has no built-in
+floating point units. Obviously, it is possible to use some existing
+"black-boxes" for double precision operations. But they have a quite long
+latency. It is much simpler to exclusively use integers, with a quantization of
+all double precision values. Obviously, this quantization should not decrease
+too much the precision of results. Furthermore, it should not lead to a design
+with a huge latency because of operations that could not complete during a
+single or few clock cycles. Divisions are in this case and, moreover, they need
+a varying number of clock cycles to complete. Even multiplications can be a
+problem: DSP48 take inputs of 18 bits maximum. For larger multiplications,
+several DSP must be combined, increasing the latency.
+
+Nevertheless, the hardest constraint does not come from the FPGA characteristics
+but from the algorithms. Their VHDL implentation will be efficient only if they
+can be fully (or near) pipelined. By the way, the choice is quickly done: only a
+small part of SPL can be. Indeed, the computation of spline coefficients
+implies to solve a tridiagonal system $A.m = b$. Values in $A$ and $b$ can be
+computed from incoming pixels intensity but after, the back-solve starts with
+the lastest values, which breaks the pipeline. Moreover, SPL relies on
+interpolating far more points than profile size. Thus, the end of SPL works on a
+larger amount of data than the beginning, which also breaks the pipeline.
+
+LSQ has not this problem: all parts except the dichotomial search work on the
+same amount of data, i.e. the profile size. Furthermore, LSQ needs less
+operations than SPL, implying a smaller output latency. Consequently, it is the
+best candidate for phase computation. Nevertheless, obtaining a fully pipelined
+version supposes that operations of different parts complete in a single clock
+cycle. It is the case for simulations but it completely fails when mapping and
+routing the design on the Spartan6. By the way, extra-latency is generated and
+there must be idle times between two profiles entering into the pipeline.
%%Before obtaining the least bitstream, the crucial question is: how to
%%translate the C code the LSQ into VHDL ?
\section{Experimental tests}
+In this section we explain what we have done yet. Until now, we could not perform
+real experiments since we just have received the FGPA board. Nevertheless, we
+will include real experiments in the final version of this paper.
+
\subsection{VHDL implementation}
+
+
% - ecriture d'un code en C avec integer
% - calcul de la taille max en bit de chaque variable en fonction de la quantization.
% - tests de quantization : équilibre entre précision et contraintes FPGA
% - en parallèle : simulink et VHDL à la main
-%
+
+
+From the LSQ algorithm, we have written a C program that uses only
+integer values. We use a very simple quantization by multiplying
+double precision values by a power of two, keeping the integer
+part. For example, all values stored in lut$_s$, lut$_c$, $\ldots$ are
+scaled by 1024. Since LSQ also computes average, variance, ... to
+remove the slope, the result of implied euclidian divisions may be
+relatively wrong. To avoid that, we also scale the pixel intensities
+by a power of two. Futhermore, assuming $nb_s$ is fixed, these
+divisions have a knonw denominator. Thus, they can be replaced by
+their multiplication/shift counterpart. Finally, all other
+multiplications or divisions by a power of two have been replaced by
+left or right bit shifts. By the way, the code only contains
+additions, substractions and multiplications of signed integers, which
+is perfectly adapted to FGPAs.
+
+As said above, hardware constraints have a great influence on the VHDL
+implementation. Consequently, we searched the maximum value of each
+variable as a function of the different scale factors and the size of
+profiles, which gives their maximum size in bits. That size determines
+the maximum scale factors that allow to use the least possible RAMs
+and DSPs. Actually, we implemented our algorithm with this maximum
+size but current works study the impact of quantization on the results
+precision and design complexity. We have compared the result of the
+LSQ version using integers and doubles and observed that the precision
+of both were similar.
+
+Then we built two versions of VHDL codes: one directly by hand coding
+and the other with Matlab using the Simulink HDL coder
+feature~\cite{HDLCoder}. Although the approach is completely different
+we obtained VHDL codes that are quite comparable. Each approach has
+advantages and drawbacks. Roughly speaking, hand coding provides
+beautiful and much better structured code while Simulink allows to
+produce a code faster. In terms of throughput and latency,
+simulations shows that the two approaches are close with a slight
+advantage for hand coding. We hope that real experiments will confirm
+that.
+
\subsection{Simulation}
+Currently, we have only simulated our VHDL codes with GHDL and GTKWave (two free
+tools with linux). Both approaches led to correct results. At the beginning of
+our simulations, our pipiline could compute a new phase each 33 cycles and the
+length of the pipeline was equal to 95 cycles. When we tried to generate the
+corresponding bitsream with ISE environment we had many problems because many
+stages required more than the 10$n$s required by the clock frequency. So we
+needed to decompose some part of the pipeline in order to add some cycles and
+simplify some parts between a clock top.
% ghdl + gtkwave
% au mieux : une phase tous les 33 cycles, latence de 95 cycles.
% mais routage/placement impossible.
\subsection{Bitstream creation}
+Currently both approaches provide synthesable bitstreams with ISE. We expect
+that the pipeline will have a latency of 112 cycles, i.e. 1.12$\mu$s and it
+could accept new profiles of pixel each 48 cycles, i.e. 480$n$s.
+
% pas fait mais prévision d'une sortie tous les 480ns avec une latence de 1120
\label{sec:results}