X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/dmems12.git/blobdiff_plain/961aed44358c71a04ebcb5aace3d3be4cff962f4..fbdb2fc4a1cd5f66f4a094641303b58c423f2e16:/dmems12.tex diff --git a/dmems12.tex b/dmems12.tex index 10db624..1dd3b4a 100644 --- a/dmems12.tex +++ b/dmems12.tex @@ -63,7 +63,16 @@ \begin{abstract} - + Atomic force microscopes (AFMs) provide high-resolution images of + surfaces. We focus our attention on an interferometry method to + estimate the cantilever deflection. The initial method was based + on splines to determine the phase of interference fringes, and thus + the deflection. Computations were performed on a PC with LabView. + In this paper, we propose a new approach based on the least squares + method and its implementation on an FPGA, using + the pipelining technique. Simulations and real tests show that + this implementation is very efficient and should allow us to control + a cantilever array in real time. \end{abstract} @@ -78,15 +87,15 @@ FPGA, cantilever, interferometry. \section{Introduction} Cantilevers are used inside atomic force microscopes (AFMs), which provide high -resolution images of surfaces. Several technics have been used to measure the -displacement of cantilevers in litterature. For example, it is possible to +resolution images of surfaces. Several techniques have been used to measure the +displacement of cantilevers in the literature. For example, it is possible to determine the deflection accurately with different mechanisms. In~\cite{CantiPiezzo01}, the authors used a piezoresistor integrated into the cantilever. Nevertheless, this approach suffers from the complexity of the microfabrication process needed to implement the sensor in the cantilever. In~\cite{CantiCapacitive03}, the authors presented a cantilever mechanism -based on capacitive sensing. This kind of technic also involves to instrument -the cantiliver which result in a complex fabrication process. +based on capacitive sensing. 
This kind of technique also requires instrumenting +the cantilever, which results in a complex fabrication process. In this paper our attention is focused on a method based on interferometry to measure cantilevers' displacements. In this method cantilevers are illuminated @@ -100,8 +109,8 @@ spline to estimate the cantilevers' positions. The overall process gives accurate results but all the computations are performed on a standard computer using LabView. Consequently, the main drawback of this implementation is that the computer is a -bootleneck. In this paper we propose to use a method based on least -square and to implement all the computation on a FGPA. +bottleneck. In this paper we propose to use a method based on least +squares and to implement all the computation on an FPGA. The remainder of the paper is organized as follows. Section~\ref{sec:measure} describes more precisely the measurement process. Our solution based on the @@ -124,16 +133,16 @@ presented. %% qu'elle est. In order to develop simple, cost-effective and user-friendly cantilever arrays, -authors of ~\cite{AFMCSEM11} have developped a system based of +the authors of~\cite{AFMCSEM11} have developed a system based on interferometry. In contrast to other optical systems, which use a laser beam -deflection scheme and sentitive to the angular displacement of the cantilever, +deflection scheme and are sensitive to the angular displacement of the cantilever, interferometry is sensitive to the optical path difference induced by the vertical displacement of the cantilever. The system built by these authors is based on a Linnik -interferomter~\cite{Sinclair:05}. It is illustrated in +interferometer~\cite{Sinclair:05}. It is illustrated in Figure~\ref{fig:AFM}. The beam of a laser diode is first split (by the splitter) -into a reference beam and a sample beam that reachs the cantilever +into a reference beam and a sample beam that reaches the cantilever array. 
In order to be able to move the cantilever array, it is mounted on a translational and rotational hexapod stage with five degrees of freedom. The optical system is also fixed to the stage. @@ -199,7 +208,7 @@ I(x) = ax+b+A.cos(2\pi f.x + \theta) where $x$ is the position of a pixel in its associated segment. The global method consists of two main sequences. The first one aims -to determin the frequency $f$ of each profile with an algorithm based +to determine the frequency $f$ of each profile with an algorithm based on spline interpolation (see section \ref{algo-spline}). It also computes the coefficient used for unwrapping the phase. The second one is the acquisition loop, during which images are taken at regular time @@ -231,7 +240,7 @@ that computing the deflection of a single cantilever should take less than 25$\mu$s, thus 12.5$\mu$s per phase.\\ In fact, this timing is a very hard constraint. Let us consider a very -small programm that initializes twenty million of doubles in memory +small program that initializes twenty million doubles in memory and then does 1000000 cumulative sums on 20 contiguous values (experimental profiles have about this size). On an Intel Core 2 Duo E6650 at 2.33GHz, this program reaches an average of 155Mflops. @@ -304,17 +313,17 @@ available and must be done by configuring a set of CLBs. Since this configuration is not obvious at all, it can be done via a framework, like ISE~\cite{ISE}. Such software can synthesize a design written in a hardware description language (HDL), map it onto CLBs, place/route them for a specific -FPGA, and finally produce a bitstream that is used to configre the FPGA. Thus, -from the developper point of view, the main difficulty is to translate an +FPGA, and finally produce a bitstream that is used to configure the FPGA. 
Thus, +from the developer's point of view, the main difficulty is to translate an algorithm into HDL code, taking into account FPGA resources and constraints like clock signals and I/O values that drive the FPGA. Indeed, HDL programming is very different from classic languages like C. A program can be seen as a state-machine, manipulating signals that evolve from state to state. Moreover, HDL instructions can execute -concurrently. Basic logic operations are used to agregate signals to +concurrently. Basic logic operations are used to aggregate signals to produce new states and assign them to another signal. States are mainly -expressed as arrays of bits. Fortunaltely, libraries propose some +expressed as arrays of bits. Fortunately, libraries provide some higher-level representations like signed integers, and arithmetic operations. @@ -333,7 +342,7 @@ pipelines in order to handle multiple data streams. \subsection{The board} -The board we use is designed by the Armadeus compagny, under the name +The board we use is designed by the Armadeus company, under the name SP Vision. It consists of a development board hosting an i.MX27 ARM processor (from Freescale). The board includes all classical connectors: USB, Ethernet, ... A Flash memory contains a Linux kernel @@ -372,8 +381,8 @@ intensity in gray levels. Let us call $I(x)$ the intensity of the profile in $x \in [0,M[$. At first, only $M$ values of $I$ are known, for $x = 0, 1, -\ldots,M-1$. A normalisation allows to scale known intensities into -$[-1,1]$. We compute splines that fit at best these normalised +\ldots,M-1$. A normalization scales the known intensities into +$[-1,1]$. We compute splines that best fit these normalized intensities. Splines are used to interpolate $N = k\times M$ points (typically $k=4$ is sufficient), within $[0,M[$. Let us call $x^s$ the coordinates of these $N$ points and $I^s$ their intensities. @@ -407,7 +416,7 @@ determine these four parameters. 
Since it is an iterative process ending with a convergence criterion, it is clearly not well suited to our design goals. -Fortunatly, it is quite simple to reduce the number of parameters to +Fortunately, it is quite simple to reduce the number of parameters to only $\theta$. Let $x^p$ be the coordinates of pixels in a segment of size $M$. Thus, $x^p = 0, 1, \ldots, M-1$. Let $I(x^p)$ be their intensity. Firstly, we "remove" the slope by computing: @@ -551,7 +560,7 @@ Finally, the whole summarizes in an algorithm (called LSQ in the following) in t We compared the two algorithms on the basis of three criteria: \begin{itemize} -\item precision of results on a cosinus profile, distorted with noise, +\item precision of results on a cosine profile, distorted with noise, \item number of operations, \item complexity to implement an FPGA version. \end{itemize} @@ -593,13 +602,13 @@ Table \ref{tab:algo_prec} gives the maximum and average error for the two algori 30 & 17.06 & 2.6 & 13.94 & 2.45 \\ \hline \end{tabular} -\caption{Error (in \%) for cosinus profiles, with noise.} +\caption{Error (in \%) for cosine profiles, with noise.} \label{tab:algo_prec} \end{center} \end{table} These results show that the two algorithms are very close, with a -slight advantage for LSQ. Furthemore, both behave very well against +slight advantage for LSQ. Furthermore, both behave very well against noise. Assuming the experimental ratio of 50 (see above), an error of 1 percent on phase corresponds to an error of 0.5nm on the lever deflection, which is very close to the best precision. @@ -610,7 +619,7 @@ profiles. Nevertheless, we can see in Figure \ref{fig:noise20} the profile with $N=10$ that leads to the biggest error. It is a bit distorted, with spikes and straight/rounded portions, and relatively close to most of those that come from experiments. Figure \ref{fig:noise60} -shows a sample of worst profile for $N=30$. 
It is completly distorted, +shows a sample of the worst profile for $N=30$. It is completely distorted, largely beyond the worst experimental ones. \begin{figure}[ht] @@ -657,12 +666,12 @@ problem: DSP48 take inputs of 18 bits maximum. For larger multiplications, several DSPs must be combined, increasing the latency. Nevertheless, the hardest constraint does not come from the FPGA characteristics -but from the algorithms. Their VHDL implentation will be efficient only if they +but from the algorithms. Their VHDL implementation will be efficient only if they can be fully (or nearly) pipelined. Consequently, the choice is quickly made: only a small part of SPL can be. Indeed, the computation of spline coefficients requires solving a tridiagonal system $A.m = b$. Values in $A$ and $b$ can be computed from incoming pixel intensities, but afterwards the back-solve starts with -the lastest values, which breaks the pipeline. Moreover, SPL relies on +the last values, which breaks the pipeline. Moreover, SPL relies on interpolating far more points than the profile size. Thus, the end of SPL works on a larger amount of data than the beginning, which also breaks the pipeline. @@ -689,27 +698,19 @@ will include real experiments in the final version of this paper. \subsection{VHDL implementation} - - -% - ecriture d'un code en C avec integer -% - calcul de la taille max en bit de chaque variable en fonction de la quantization. -% - tests de quantization : équilibre entre précision et contraintes FPGA -% - en parallèle : simulink et VHDL à la main - - From the LSQ algorithm, we have written a C program that uses only integer values. We use a very simple quantization by multiplying double-precision values by a power of two, keeping the integer part. For example, all values stored in lut$_s$, lut$_c$, $\ldots$ are -scaled by 1024. Since LSQ also computes average, variance, ... to -remove the slope, the result of implied euclidian divisions may be +scaled by 1024. 
Since LSQ also computes averages, variances, etc. to +remove the slope, the results of the implied Euclidean divisions may be relatively inaccurate. To avoid that, we also scale the pixel intensities -by a power of two. Futhermore, assuming $nb_s$ is fixed, these -divisions have a knonw denominator. Thus, they can be replaced by +by a power of two. Furthermore, assuming $nb_s$ is fixed, these +divisions have a known denominator. Thus, they can be replaced by their multiplication/shift counterpart. Finally, all other multiplications or divisions by a power of two have been replaced by left or right bit shifts. As a result, the code only contains -additions, substractions and multiplications of signed integers, which +additions, subtractions and multiplications of signed integers, which is perfectly suited to FPGAs. As said above, hardware constraints have a great influence on the VHDL @@ -736,24 +737,40 @@ that. \subsection{Simulation} -Currently, we have only simulated our VHDL codes with GHDL and GTKWave (two free -tools with linux). Both approaches led to correct results. At the beginning of -our simulations, our pipiline could compute a new phase each 33 cycles and the -length of the pipeline was equal to 95 cycles. When we tried to generate the -corresponding bitsream with ISE environment we had many problems because many -stages required more than the 10$n$s required by the clock frequency. So we -needed to decompose some part of the pipeline in order to add some cycles and -simplify some parts between a clock top. -% ghdl + gtkwave -% au mieux : une phase tous les 33 cycles, latence de 95 cycles. -% mais routage/placement impossible. +Before experimental tests on the board, we simulated our two VHDL +codes with GHDL and GTKWave (two free tools available on Linux). For that, we +built a testbench based on profiles taken from experiments and +compared the results to values given by the SPL algorithm. Both +versions led to correct results. 
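The integer-only arithmetic described in the VHDL implementation subsection can be illustrated with a minimal Python sketch. It is not the authors' C program: the function names, the profile size `NB_S = 20` and the 16-bit reciprocal precision `RECIP_SHIFT` are assumptions chosen for illustration. Only the scaling by 1024 and the idea of replacing a division by a known denominator with a multiplication and a shift come from the text.

```python
SCALE_BITS = 10            # values are scaled by 2**10 = 1024, as in the text
SCALE = 1 << SCALE_BITS
NB_S = 20                  # assumed fixed profile size (hypothetical value)
RECIP_SHIFT = 16           # assumed precision of the precomputed reciprocal


def quantize(value: float) -> int:
    """Scale a real value by 1024 and keep the integer part."""
    return int(value * SCALE)


def reciprocal(denominator: int, shift: int = RECIP_SHIFT) -> int:
    """Precompute round(2**shift / denominator) for a known denominator."""
    return ((1 << shift) + denominator // 2) // denominator


def divide_by_constant(numerator: int, recip: int, shift: int = RECIP_SHIFT) -> int:
    """Approximate numerator / denominator with one multiply and one right shift.

    The right shift floors the result, so negative numerators round
    towards minus infinity rather than towards zero.
    """
    return (numerator * recip) >> shift


def quantized_mean(intensities: list[float]) -> int:
    """Mean of NB_S intensities, computed with integer operations only."""
    total = sum(quantize(v) for v in intensities)
    return divide_by_constant(total, reciprocal(NB_S))
```

With these assumptions, the mean of twenty intensities equal to 0.5 is returned as 512, i.e. 0.5 in the scaled representation. A wider `RECIP_SHIFT` reduces the approximation error of the division at the cost of wider multiplications.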
+ +Our first code was highly optimized: the pipeline could compute a +new phase every 33 cycles and its latency was equal to 95 cycles. Since +the Spartan6 is clocked at 100MHz (a 10ns cycle), estimating the +deflection of 100 cantilevers, i.e. 200 phases, would take about $(95 + 200\times 33)\times 10\,\mathrm{ns} += 66.95\,\mu\mathrm{s}$, i.e. nearly 15000 estimations per second. + \subsection{Bitstream creation} -Currently both approaches provide synthesable bitstreams with ISE. We expect -that the pipeline will have a latency of 112 cycles, i.e. 1.12$\mu$s and it -could accept new profiles of pixel each 48 cycles, i.e. 480$n$s. +In order to test our code on the SP Vision board, the design was +extended with a component that keeps profiles in RAM, flushes them into +the phase computation component and stores its output in another +RAM. We also added a Wishbone component, which can "drive" signals +to communicate between the i.MX and other components. It is mainly used +to start flushing profiles and to retrieve the computed phases from RAM. + +Unfortunately, the first designs could not be placed and routed with ISE +on the Spartan6 with a 100MHz clock. The main problems came from +routing values from RAMs to DSPs and obtaining a result under 10ns. +Therefore, we needed to decompose some parts of the pipeline, which adds +some cycles. For example, some delays have been introduced between +RAM outputs and DSPs. Finally, we obtained a bitstream that has a +latency of 112 cycles and computes a new phase every 40 cycles. For +100 cantilevers, it takes $(112 + 200\times 40)\times 10\,\mathrm{ns} = 81.12\,\mu\mathrm{s}$ to +compute their deflection. + +This bitstream has been successfully tested on the board. + -% pas fait mais prévision d'une sortie tous les 480ns avec une latence de 1120 \label{sec:results} @@ -761,7 +778,18 @@ could accept new profiles of pixel each 48 cycles, i.e. 480$n$s. \section{Conclusion and perspectives} - +In this paper we have presented a new method to estimate the +cantilever deflection in an AFM. 
This method is based on a least +squares approach. We have used quantization to produce an algorithm +based exclusively on integer values, which is well suited to an FPGA +implementation. We obtained a precision similar to that of the +initial version based on splines. Our solution has been implemented +with a pipeline technique. Consequently, it can process a new +profile image very quickly. So far, we have performed simulations +and real tests on a Spartan6 FPGA. + +In future work, we plan to study the quantization. Then we want to couple our +algorithm with a high speed camera and we plan to control the whole AFM system. \bibliographystyle{plain} \bibliography{biblio}
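The timing figures quoted in the Simulation and Bitstream creation subsections can be reproduced with a short Python sketch. It simply applies the formula used in the text, (latency + number of phases × interval) × clock period. The function names are hypothetical, and the count of 200 phases for 100 cantilevers is taken from the formulas above.

```python
CLOCK_PERIOD_NS = 10  # 100 MHz clock, as stated for the Spartan6


def pipeline_time_us(latency_cycles: int, interval_cycles: int, n_phases: int) -> float:
    """Time in microseconds to compute n_phases phases with a pipeline."""
    total_cycles = latency_cycles + n_phases * interval_cycles
    return total_cycles * CLOCK_PERIOD_NS / 1000


def estimations_per_second(time_us: float) -> float:
    """Number of complete estimations that fit in one second."""
    return 1e6 / time_us


# Simulated design: latency of 95 cycles, one phase every 33 cycles.
simulated_us = pipeline_time_us(95, 33, 200)
# Placed and routed design: latency of 112 cycles, one phase every 40 cycles.
routed_us = pipeline_time_us(112, 40, 200)

print(f"simulated: {simulated_us:.2f} us, routed: {routed_us:.2f} us")
```

The sketch makes explicit that the interval between two phases dominates the total time: the latency is paid only once, whereas the interval is paid for each of the 200 phases.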