X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/dmems12.git/blobdiff_plain/015bc351b18995b7727145984fd39381bcce9a5a..fbdb2fc4a1cd5f66f4a094641303b58c423f2e16:/dmems12.tex?ds=sidebyside diff --git a/dmems12.tex b/dmems12.tex index 94e96e4..1dd3b4a 100644 --- a/dmems12.tex +++ b/dmems12.tex @@ -44,7 +44,7 @@ -\title{Using FPGAs for high speed and real time cantilever deflection estimation} +\title{A new approach based on least square methods to estimate in real time cantilevers deflection with a FPGA} \author{\IEEEauthorblockN{Raphaël Couturier\IEEEauthorrefmark{1}, Stéphane Domas\IEEEauthorrefmark{1}, Gwenhaël Goavec-Merou\IEEEauthorrefmark{2} and Michel Lenczner\IEEEauthorrefmark{2}} \IEEEauthorblockA{\IEEEauthorrefmark{1}FEMTO-ST, DISC, University of Franche-Comte, Belfort, France\\ \{raphael.couturier,stephane.domas\}@univ-fcomte.fr} @@ -63,7 +63,16 @@ \begin{abstract} - + Atomic force microscope (AFM) provides high resolution images of + surfaces. We focus our attention on an interferometry method to + estimate the cantilevers deflection. The initial method was based + on splines to determine the phase of interference fringes, and thus + the deflection. Computations were performed on a PC with LabView. + In this paper, we propose a new approach based on the least square + methods and its implementation that we developed on a FPGA, using + the pipelining technique. Simulations and real tests showed us that + this implementation is very efficient and should allow us to control + a cantilevers array in real time. \end{abstract} @@ -78,31 +87,30 @@ FPGA, cantilever, interferometry. \section{Introduction} Cantilevers are used inside atomic force microscope (AFM) which provides high -resolution images of surfaces. Several technics have been used to measure the -displacement of cantilevers in litterature. For example, it is possible to +resolution images of surfaces. Several techniques have been used to measure the +displacement of cantilevers in literature. For example, it is possible to determine accurately the deflection with different mechanisms. In~\cite{CantiPiezzo01}, authors used piezoresistor integrated into the cantilever. Nevertheless this approach suffers from the complexity of the microfabrication process needed to implement the sensor in the cantilever. In~\cite{CantiCapacitive03}, authors have presented an cantilever mechanism -based on capacitive sensing. This kind of technic also involves to instrument -the cantiliver which result in a complex fabrication process. +based on capacitive sensing. This kind of technique also involves to instrument +the cantilever which result in a complex fabrication process. In this paper our attention is focused on a method based on interferometry to measure cantilevers' displacements. In this method cantilevers are illuminated -by an optic source. The interferometry produces fringes on each cantilevers +by an optic source. The interferometry produces fringes on each cantilever which enables to compute the cantilever displacement. In order to analyze the fringes a high speed camera is used. Images need to be processed quickly and then a estimation method is required to determine the displacement of each -cantilever. In~\cite{AFMCSEM11}, the authors have used an algorithm based on +cantilever. In~\cite{AFMCSEM11}, authors have used an algorithm based on spline to estimate the cantilevers' positions. - The overall process gives -accurate results but all the computation are performed on a standard computer -using labview. Consequently, the main drawback of this implementation is that -the computer is a bootleneck in the overall process. In this paper we propose to -use a method based on least square and to implement all the computation on a -FGPA. +The overall process gives accurate results but all the computations +are performed on a standard computer using LabView. Consequently, the +main drawback of this implementation is that the computer is a +bottleneck. In this paper we propose to use a method based on least +square and to implement all the computation on a FPGA. The remainder of the paper is organized as follows. Section~\ref{sec:measure} describes more precisely the measurement process. Our solution based on the @@ -118,13 +126,6 @@ presented. \section{Measurement principles} \label{sec:measure} - - - - - - - \subsection{Architecture} \label{sec:archi} %% description de l'architecture générale de l'acquisition d'images @@ -132,30 +133,32 @@ presented. %% qu'elle est. In order to develop simple, cost effective and user-friendly cantilever arrays, -authors of ~\cite{AFMCSEM11} have developped a system based of +authors of ~\cite{AFMCSEM11} have developed a system based of interferometry. In opposition to other optical based systems, using a laser beam -deflection scheme and sentitive to the angular displacement of the cantilever, +deflection scheme and sensitive to the angular displacement of the cantilever, interferometry is sensitive to the optical path difference induced by the vertical displacement of the cantilever. -The system build by authors of~\cite{AFMCSEM11} has been developped based on a -Linnick interferomter~\cite{Sinclair:05}. It is illustrated in -Figure~\ref{fig:AFM}. A laser diode is first split (by the splitter) into a -reference beam and a sample beam that reachs the cantilever array. In order to -be able to move the cantilever array, it is mounted on a translation and -rotational hexapod stage with five degrees of freedom. The optical system is -also fixed to the stage. Thus, the cantilever array is centered in the optical -system which can be adjusted accurately. The beam illuminates the array by a -microscope objective and the light reflects on the cantilevers. Likewise the -reference beam reflects on a movable mirror. A CMOS camera chip records the -reference and sample beams which are recombined in the beam splitter and the -interferogram. At the beginning of each experiment, the movable mirror is -fitted manually in order to align the interferometric fringes approximately -parallel to the cantilevers. When cantilevers move due to the surface, the -bending of cantilevers produce movements in the fringes that can be detected -with the CMOS camera. Finally the fringes need to be -analyzed. In~\cite{AFMCSEM11}, the authors used a LabView program to compute the -cantilevers' movements from the fringes. +The system build by these authors is based on a Linnick +interferometer~\cite{Sinclair:05}. It is illustrated in +Figure~\ref{fig:AFM}. A laser diode is first split (by the splitter) +into a reference beam and a sample beam that reaches the cantilever +array. In order to be able to move the cantilever array, it is +mounted on a translation and rotational hexapod stage with five +degrees of freedom. The optical system is also fixed to the stage. +Thus, the cantilever array is centered in the optical system which can +be adjusted accurately. The beam illuminates the array by a +microscope objective and the light reflects on the cantilevers. +Likewise the reference beam reflects on a movable mirror. A CMOS +camera chip records the reference and sample beams which are +recombined in the beam splitter and the interferogram. At the +beginning of each experiment, the movable mirror is fitted manually in +order to align the interferometric fringes approximately parallel to +the cantilevers. When cantilevers move due to the surface, the +bending of cantilevers produce movements in the fringes that can be +detected with the CMOS camera. Finally the fringes need to be +analyzed. In~\cite{AFMCSEM11}, authors used a LabView program to +compute the cantilevers' deflections from the fringes. \begin{figure} \begin{center} @@ -171,21 +174,29 @@ cantilevers' movements from the fringes. \subsection{Cantilever deflection estimation} \label{sec:deflest} -As shown on image \ref{img:img-xp}, each cantilever is covered by -interferometric fringes. The fringes will distort when cantilevers are -deflected. Estimating the deflection is done by computing this -distortion. For that, (ref A. Meister + M Favre) proposed a method -based on computing the phase of the fringes, at the base of each -cantilever, near the tip, and on the base of the array. They assume -that a linear relation binds these phases, which can be use to -"unwrap" the phase at the tip and to determine the deflection.\\ - -More precisely, segment of pixels are extracted from images taken by a -high-speed camera. These segments are large enough to cover several -interferometric fringes and are placed at the base and near the tip of -the cantilevers. They are called base profile and tip profile in the -following. Furthermore, a reference profile is taken on the base of -the cantilever array. +\begin{figure} +\begin{center} +\includegraphics[width=\columnwidth]{lever-xp} +\end{center} +\caption{Portion of an image picked by the camera} +\label{fig:img-xp} +\end{figure} + +As shown on image \ref{fig:img-xp}, each cantilever is covered by +several interferometric fringes. The fringes will distort when +cantilevers are deflected. Estimating the deflection is done by +computing this distortion. For that, authors of \cite{AFMCSEM11} +proposed a method based on computing the phase of the fringes, at the +base of each cantilever, near the tip, and on the base of the +array. They assume that a linear relation binds these phases, which +can be use to "unwrap" the phase at the tip and to determine the deflection.\\ + +More precisely, segment of pixels are extracted from images taken by +the camera. These segments are large enough to cover several +interferometric fringes. As said above, they are placed at the base +and near the tip of the cantilevers. They are called base profile and +tip profile in the following. Furthermore, a reference profile is +taken on the base of the cantilever array. The pixels intensity $I$ (in gray level) of each profile is modelized by: @@ -197,7 +208,7 @@ I(x) = ax+b+A.cos(2\pi f.x + \theta) where $x$ is the position of a pixel in its associated segment. The global method consists in two main sequences. The first one aims -to determin the frequency $f$ of each profile with an algorithm based +to determine the frequency $f$ of each profile with an algorithm based on spline interpolation (see section \ref{algo-spline}). It also computes the coefficient used for unwrapping the phase. The second one is the acquisition loop, while which images are taken at regular time @@ -229,7 +240,7 @@ that computing the deflection of a single cantilever should take less than 25$\mu$s, thus 12.5$\mu$s by phase.\\ In fact, this timing is a very hard constraint. Let consider a very -small programm that initializes twenty million of doubles in memory +small program that initializes twenty million of doubles in memory and then does 1000000 cumulated sums on 20 contiguous values (experimental profiles have about this size). On an intel Core 2 Duo E6650 at 2.33GHz, this program reaches an average of 155Mflops. @@ -289,32 +300,30 @@ computation, we give some general information about FPGAs and the board we use. \subsection{FPGAs} -A field-programmable gate array (FPGA) is an integrated circuit -designed to be configured by the customer. FGPAs are composed of -programmable logic components, called configurable logic blocks -(CLB). These blocks mainly contains look-up tables (LUT), flip/flops -(F/F) and latches, organized in one or more slices connected -together. Each CLB can be configured to perform simple (AND, XOR, ...) -or complex combinational functions. They are interconnected by -reconfigurable links. Modern FPGAs contain memory elements and -multipliers which enable to simplify the design and to increase the -performance. Nevertheless, all other complex operations, like -division, trigonometric functions, $\ldots$ are not available and must -be done by configuring a set of CLBs. Since this configuration is not -obvious at all, it can be done via a framework, like ISE. Such a -software can synthetize a design written in a hardware description -language (HDL), map it onto CLBs, place/route them for a specific -FPGA, and finally produce a bitstream that is used to configre the -FPGA. Thus, from the developper point of view, the main difficulty is -to translate an algorithm in HDL code, taking account FPGA resources -and constraints like clock signals and I/O values that drive the FPGA. +A field-programmable gate array (FPGA) is an integrated circuit designed to be +configured by the customer. FGPAs are composed of programmable logic components, +called configurable logic blocks (CLB). These blocks mainly contains look-up +tables (LUT), flip/flops (F/F) and latches, organized in one or more slices +connected together. Each CLB can be configured to perform simple (AND, XOR, ...) +or complex combinational functions. They are interconnected by reconfigurable +links. Modern FPGAs contain memory elements and multipliers which enable to +simplify the design and to increase the performance. Nevertheless, all other +complex operations, like division, trigonometric functions, $\ldots$ are not +available and must be done by configuring a set of CLBs. Since this +configuration is not obvious at all, it can be done via a framework, like +ISE~\cite{ISE}. Such a software can synthetize a design written in a hardware +description language (HDL), map it onto CLBs, place/route them for a specific +FPGA, and finally produce a bitstream that is used to configure the FPGA. Thus, +from the developer point of view, the main difficulty is to translate an +algorithm in HDL code, taking account FPGA resources and constraints like clock +signals and I/O values that drive the FPGA. Indeed, HDL programming is very different from classic languages like C. A program can be seen as a state-machine, manipulating signals that evolve from state to state. By the way, HDL instructions can execute -concurrently. Basic logic operations are used to agregate signals to +concurrently. Basic logic operations are used to aggregate signals to produce new states and assign it to another signal. States are mainly -expressed as arrays of bits. Fortunaltely, libraries propose some +expressed as arrays of bits. Fortunately, libraries propose some higher levels representations like signed integers, and arithmetic operations. @@ -333,7 +342,7 @@ pipelines in order to handle multiple data streams. \subsection{The board} -The board we use is designed by the Armadeus compagny, under the name +The board we use is designed by the Armadeus company, under the name SP Vision. It consists in a development board hosting a i.MX27 ARM processor (from Freescale). The board includes all classical connectors: USB, Ethernet, ... A Flash memory contains a Linux kernel @@ -346,10 +355,11 @@ that communicate between i.MX and Spartan6, using Spartan3 as a tunnel. By default, the WEIM interface provides a clock signal at 100MHz that is connected to dedicated FPGA pins. -The Spartan6 is an LX100 version. It has 15822 slices, equivalent to -101261 logic cells. There are 268 internal block RAM of 18Kbits, and -180 dedicated multiply-adders (named DSP48), which is largely enough -for our project. +The Spartan6 is an LX100 version. It has 15822 slices, each slice +containing 4 LUTs and 8 flip/flops. It is equivalent to 101261 logic +cells. There are 268 internal block RAM of 18Kbits, and 180 dedicated +multiply-adders (named DSP48), which is largely enough for our +project. Some I/O pins of Spartan6 are connected to two $2\times 17$ headers that can be used as user wants. For the project, they will be @@ -364,18 +374,18 @@ phase. The second one, detailed in this article, is based on a classical least square method but suppose that frequency is already known. -\subsubsection{Spline algorithm} +\subsubsection{Spline algorithm (SPL)} \label{sec:algo-spline} Let consider a profile $P$, that is a segment of $M$ pixels with an intensity in gray levels. Let call $I(x)$ the intensity of profile in $x \in [0,M[$. -At first, only $M$ values of $I$ are known, for $x = 0, 1, \ldots,M-1$. A -normalisation allows to scale known intensities into $[-1,1]$. We compute -splines that fit at best these normalised intensities. Splines (SPL in the -following) are used to interpolate $N = k\times M$ points (typically $k=4$ is -sufficient), within $[0,M[$. Let call $x^s$ the coordinates of these $N$ points - and $I^s$ their intensities. +At first, only $M$ values of $I$ are known, for $x = 0, 1, +\ldots,M-1$. A normalization allows to scale known intensities into +$[-1,1]$. We compute splines that fit at best these normalized +intensities. Splines are used to interpolate $N = k\times M$ points +(typically $k=4$ is sufficient), within $[0,M[$. Let call $x^s$ the +coordinates of these $N$ points and $I^s$ their intensities. In order to have the frequency, the mean line $a.x+b$ (see equation \ref{equ:profile}) of $I^s$ is computed. Finding intersections of $I^s$ and this line allow to obtain @@ -396,7 +406,7 @@ Two things can be noticed: computation of $\theta$. \end{itemize} -\subsubsection{Least square algorithm} +\subsubsection{Least square algorithm (LSQ)} Assuming that we compute the phase during the acquisition loop, equation \ref{equ:profile} has only 4 parameters: $a, b, A$, and @@ -406,7 +416,7 @@ determine these four parameters. Since it is an iterative process ending with a convergence criterion, it is obvious that it is not particularly adapted to our design goals. -Fortunatly, it is quite simple to reduce the number of parameters to +Fortunately, it is quite simple to reduce the number of parameters to only $\theta$. Let $x^p$ be the coordinates of pixels in a segment of size $M$. Thus, $x^p = 0, 1, \ldots, M-1$. Let $I(x^p)$ be their intensity. Firstly, we "remove" the slope by computing: @@ -462,7 +472,7 @@ computed. \end{itemize} Finally, the whole summarizes in an algorithm (called LSQ in the following) in two parts, one before and one during the acquisition loop: -\begin{algorithm}[h] +\begin{algorithm}[htbp] \caption{LSQ algorithm - before acquisition loop.} \label{alg:lsq-before} @@ -483,7 +493,7 @@ Finally, the whole summarizes in an algorithm (called LSQ in the following) in t } \end{algorithm} -\begin{algorithm}[ht] +\begin{algorithm}[htbp] \caption{LSQ algorithm - during acquisition loop.} \label{alg:lsq-during} @@ -550,7 +560,7 @@ Finally, the whole summarizes in an algorithm (called LSQ in the following) in t We compared the two algorithms on the base of three criteria: \begin{itemize} -\item precision of results on a cosinus profile, distorted with noise, +\item precision of results on a cosines profile, distorted with noise, \item number of operations, \item complexity to implement an FPGA version. \end{itemize} @@ -592,13 +602,13 @@ Table \ref{tab:algo_prec} gives the maximum and average error for the two algori 30 & 17.06 & 2.6 & 13.94 & 2.45 \\ \hline \end{tabular} -\caption{Error (in \%) for cosinus profiles, with noise.} +\caption{Error (in \%) for cosines profiles, with noise.} \label{tab:algo_prec} \end{center} \end{table} These results show that the two algorithms are very close, with a -slight advantage for LSQ. Furthemore, both behave very well against +slight advantage for LSQ. Furthermore, both behave very well against noise. Assuming the experimental ratio of 50 (see above), an error of 1 percent on phase correspond to an error of 0.5nm on the lever deflection, which is very close to the best precision. @@ -609,7 +619,7 @@ profiles. Nevertheless, we can see on figure \ref{fig:noise20} the profile with $N=10$ that leads to the biggest error. It is a bit distorted, with pikes and straight/rounded portions, and relatively close to most of that come from experiments. Figure \ref{fig:noise60} -shows a sample of worst profile for $N=30$. It is completly distorted, +shows a sample of worst profile for $N=30$. It is completely distorted, largely beyond the worst experimental ones. \begin{figure}[ht] @@ -656,12 +666,12 @@ problem: DSP48 take inputs of 18 bits maximum. For larger multiplications, several DSP must be combined, increasing the latency. Nevertheless, the hardest constraint does not come from the FPGA characteristics -but from the algorithms. Their VHDL implentation will be efficient only if they +but from the algorithms. Their VHDL implementation will be efficient only if they can be fully (or near) pipelined. By the way, the choice is quickly done: only a small part of SPL can be. Indeed, the computation of spline coefficients implies to solve a tridiagonal system $A.m = b$. Values in $A$ and $b$ can be computed from incoming pixels intensity but after, the back-solve starts with -the lastest values, which breaks the pipeline. Moreover, SPL relies on +the latest values, which breaks the pipeline. Moreover, SPL relies on interpolating far more points than profile size. Thus, the end of SPL works on a larger amount of data than the beginning, which also breaks the pipeline. @@ -688,51 +698,79 @@ will include real experiments in the final version of this paper. \subsection{VHDL implementation} +From the LSQ algorithm, we have written a C program that uses only +integer values. We use a very simple quantization by multiplying +double precision values by a power of two, keeping the integer +part. For example, all values stored in lut$_s$, lut$_c$, $\ldots$ are +scaled by 1024. Since LSQ also computes average, variance, ... to +remove the slope, the result of implied Euclidean divisions may be +relatively wrong. To avoid that, we also scale the pixel intensities +by a power of two. Furthermore, assuming $nb_s$ is fixed, these +divisions have a known denominator. Thus, they can be replaced by +their multiplication/shift counterpart. Finally, all other +multiplications or divisions by a power of two have been replaced by +left or right bit shifts. By the way, the code only contains +additions, subtractions and multiplications of signed integers, which +is perfectly adapted to FGPAs. + +As said above, hardware constraints have a great influence on the VHDL +implementation. Consequently, we searched the maximum value of each +variable as a function of the different scale factors and the size of +profiles, which gives their maximum size in bits. That size determines +the maximum scale factors that allow to use the least possible RAMs +and DSPs. Actually, we implemented our algorithm with this maximum +size but current works study the impact of quantization on the results +precision and design complexity. We have compared the result of the +LSQ version using integers and doubles and observed that the precision +of both were similar. + +Then we built two versions of VHDL codes: one directly by hand coding +and the other with Matlab using the Simulink HDL coder +feature~\cite{HDLCoder}. Although the approach is completely different +we obtained VHDL codes that are quite comparable. Each approach has +advantages and drawbacks. Roughly speaking, hand coding provides +beautiful and much better structured code while Simulink allows to +produce a code faster. In terms of throughput and latency, +simulations shows that the two approaches are close with a slight +advantage for hand coding. We hope that real experiments will confirm +that. +\subsection{Simulation} -% - ecriture d'un code en C avec integer -% - calcul de la taille max en bit de chaque variable en fonction de la quantization. -% - tests de quantization : équilibre entre précision et contraintes FPGA -% - en parallèle : simulink et VHDL à la main +Before experimental tests on the board, we simulated our two VHDL +codes with GHDL and GTKWave (two free tools with linux). For that, we +build a testbench based on profiles taken from experimentations and +compare the results to values given by the SPL algorithm. Both +versions lead to correct results. +Our first code were highly optimized : the pipeline could compute a +new phase each 33 cycles and its latency was equal to 95 cycles. Since +the Spartan6 is clocked at 100MHz, it implies that estimating the +deflection of 100 cantilevers would take about $(95 + 200\times 33).10 += 66.95\mu$s, i.e. nearly 15000 estimations by second. -From the LSQ algorithm, we have written a C program which uses only integer -values that have been previously scaled. The quantization of doubles into -integers has been performed in order to obtain a good trade-off between the -number of bits used and the precision. Finally, we have compared the result of -the LSQ version using integer and double. We have observed that the results of -both versions were similar. +\subsection{Bitstream creation} -Then we have built two versions of VHDL codes: one directly by hand coding and -the other with Matlab using simulink HDL coder feature. Although the approach is -completely different we have obtain VHDL codes that are quite comparable. Each -approach has advantages and drawbacks. Roughly speaking, hand coding provides -beautiful and much better structures code while HDL coder provides code faster. -In terms of speed of code, we think that both approaches will be quite -comparable. Real experiments will confirm that. In the LSQ algorithm, we have -replaced all the divisions by multiplications by a constant since divisions are -performed with constants depending of the number of pixels in the profile -(i.e. $M$). +In order to test our code on the SP Vision board, the design was +extended with a component that keeps profiles in RAM, flushes them in +the phase computation component and stores its output in another +RAM. We also added a wishbone : a component that can "drive" signals +to communicate between i.MX and others components. It is mainly used +to start to flush profiles and to retrieve the computed phases in RAM. -\subsection{Simulation} +Unfortunately, the first designs could not be placed and route with ISE +on the Spartan6 with a 100MHz clock. The main problems came from +routing values from RAMs to DSPs and obtaining a result under 10ns. By +the way, we needed to decompose some parts of the pipeline, which adds +some cycles. For example, some delays have been introduced between +RAMs output and DSPs. Finally, we obtained a bitstream that has a +latency of 112 cycles and computes a new phase every 40 cycles. For +100 cantilevers, it takes $(112 + 200\times 40).10 = 81.12\mu$s to +compute their deflection. -Currently, we only have simulated our VHDL codes with GHDL and GTKWave (two free -tools with linux). Both approaches led to correct results. At the beginning with -simulations our pipiline could compute a new phase each 33 cycles and the length -of the pipeline was equal to 95 cycles. When we tried to generate the bitsream -with ISE environment we had many problems because many stages required more than -the 10$n$s availabe. So we needed to decompose some part of the pipeline in order -to add some cycles and siplify some parts. -% ghdl + gtkwave -% au mieux : une phase tous les 33 cycles, latence de 95 cycles. -% mais routage/placement impossible. -\subsection{Bitstream creation} +This bitstream has been successfully tested on the board TODAY ! YEAAHHHHH -Currently both approaches provide synthesable bitstreams with ISE. We expect -that the pipeline will have a latency of 112 cycles, i.e. 1.12$\mu$s and it -could accept new line of pixel each 48 cycles, i.e. 480$n$s. -% pas fait mais prévision d'une sortie tous les 480ns avec une latence de 1120 \label{sec:results} @@ -740,7 +778,18 @@ could accept new line of pixel each 48 cycles, i.e. 480$n$s. \section{Conclusion and perspectives} - +In this paper we have presented a new method to estimate the +cantilevers deflection in an AFM. This method is based on least +square methods. We have used quantization to produce an algorithm +based exclusively on integer values, which is adapted to a FPGA +implementation. We obtained a precision on results similar to the +initial version based on splines. Our solution has been implemented +with a pipeline technique. Consequently, it enables to handle a new +profile image very quickly. Currently we have performed simulations +and real tests on a Spartan6 FPGA. + +In future work, we plan to study the quantization. Then we want to couple our +algorithm with a high speed camera and we plan to control the whole AFM system. \bibliographystyle{plain} \bibliography{biblio}