From: Raphael Couturier
Date: Fri, 28 Oct 2011 08:55:54 +0000 (+0200)
Subject: new
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/dmems12.git/commitdiff_plain/b5dcb332822aed6879619b36c36ad800b7672e2f?ds=inline

new
---

diff --git a/dmems12.tex b/dmems12.tex
index 24460dd..0c00ee0 100644
--- a/dmems12.tex
+++ b/dmems12.tex
@@ -115,7 +115,7 @@ method and its implementation on a FPGA is
 presented in Section~\ref{sec:solus}.
 
 \label{sec:measure}
 
-In order to develop simple, cost effective and user-friendly cantilever
+In order to build simple, cost effective and user-friendly cantilever
 arrays, authors of ~\cite{AFMCSEM11} have developed a system based on
 interferometry.
@@ -183,7 +183,7 @@ through an operation called unwrapping where it is assumed
 that the deflection means along the two measurement segments are linearly
 dependent. The third is on the base and provides a reference for noise
 suppression. Finally, deflections are simply derived from phase
-shifts.\\
+shifts.
 
 The pixel gray-level intensity $I$ of each profile is modelized by%
 \begin{equation}
@@ -229,7 +229,7 @@ bottleneck of the whole process. For example, the camera
 in the setup of \cite{AFMCSEM11} provides $%
 1024\times 1204$ pixels with an exposition time of 2.5ms. Thus, if we
 the pixel extraction time is neglected, each phase calculation of a
-100-cantilever array should take no more than 12.5$\mu$s. \newline
+100-cantilever array should take no more than 12.5$\mu$s.
 
 In fact, this timing is a very hard constraint. To illustrate this point, we
 consider a very small program that initializes twenty million of doubles in
@@ -239,7 +239,7 @@ at 2.33GHz, this program reaches an average of 155Mflops.
 Obviously, some cache effects and optimizations on huge amount of
 computations can drastically increase these performances: peak efficiency is
 about 2.5Gflops for the considered CPU. But this is not the case for phase
-computation that is using only a few tenth of values.\newline
+computation that is using only a few tenth of values.
 
 In order to evaluate the original algorithm, we translated it in C
 language. As stated before, for 20 pixels, it does about 1,550 operations, thus an
@@ -250,7 +250,7 @@ device file representing the camera. We obtained an average
 of 10.5$\mu$s by profile (including I/O accesses). It is under our
 requirements but close to the limit. In case of an occasional load of the
 system, it could be largely overtaken. Solutions would be to use a
 real-time operating system or
-to search for a more efficient algorithm.\newline
+to search for a more efficient algorithm.
 
 However, the main drawback is the latency of such a solution because each
 profile must be treated one after another and the deflection of 100
@@ -258,7 +258,7 @@ cantilevers takes about $200\times 10.5=2.1$ms. This would
 be inadequate for real-time requirements as for individual cantilever
 active control. An obvious solution is to parallelize the computations, for
 example on a GPU. Nevertheless, the cost of transferring profile in GPU
 memory and of taking
-back results would be prohibitive compared to computation time.\newline
+back results would be prohibitive compared to computation time.
 
 We remark that when possible, it is more efficient to pipeline the
 computation. For example, supposing that 200 profiles of 20 pixels
@@ -278,19 +278,18 @@ points are discussed in the following sections.
 
 \label{sec:solus}
 
-In this section we present part of the computing solution to the above
-requirements. We first give some general information about FPGAs, then we
+In this section we present parts of the computing solution to the above
+requirements. The hardware part consists in a high-speed camera, linked on an
+embedded board hosting two FPGAs. In this way, the camera output stream can be
+pushed directly into the FPGA. The software part is mostly the VHDL code that
+deserializes the camera stream, extracts profiles and computes the deflection.
+
+We first give some general information about FPGAs, then we
 describe the FPGA board we use for implementation and finally the two
 algorithms for phase computation are detailed. Presentation of VHDL
 implementations is postponned until Section \ref{Experimental tests}.
-\newline
-The hardware part consists in a high-speed camera, linked on an embedded
-board hosting two FPGAs. In this way, the camera output stream can be pushed
-directly into the FPGA. The software part is mostly the VHDL code that
-deserializes the camera stream, extracts profiles and computes the
-deflection. Before to present the board we use, we give some general
-information about FPGAs.
+
 
 \subsection{Elements of FPGA architecture and programming}
@@ -665,7 +664,7 @@ $atan$) for SPL. This result is largely in favor of LSQ. Nevertheless,
 considering the total number of operations is not fully relevant for FPGA
 implementation which time and space consumption depends not only on the type
 of operations but also of their ordering. The final evaluation is thus very
-much driven by the third criterion.\newline
+much driven by the third criterion.
 
 The Spartan 6 used in our architecture has a hard constraint since it has no
 built-in floating point units. Obviously, it is possible to use
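
A note on the 155Mflops figure quoted in the hunk at @@ -239,7 +239,7 @@ above: the text motivates the 12.5$\mu$s per-profile budget with a small program that initializes twenty million doubles. A minimal C sketch of that kind of micro-benchmark is given below; the loop body, the timing method and the two-operations-per-element count are illustrative assumptions, not the program actually used by the authors.

/* Minimal sketch of a micro-benchmark in the spirit of the one described
 * in the diff above: fill twenty million doubles and derive a rough Mflops
 * figure. Loop body, timing method and flop count are illustrative
 * assumptions, not the authors' original program. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L

int main(void)
{
    double *v = malloc(N * sizeof *v);      /* size defined during execution */
    if (v == NULL)
        return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        v[i] = 0.5 * (double)i + 1.0;       /* one multiplication + one addition */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%.1f Mflops\n", 2.0 * N / sec / 1e6);   /* 2 flops per element */

    free(v);
    return 0;
}

A loop of this kind is dominated by memory traffic rather than arithmetic, which is presumably why the measured rate stays far below the 2.5Gflops peak mentioned for the same CPU.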