\label{sec:measure}
-In order to develop simple, cost effective and user-friendly cantilever
+In order to build simple, cost-effective and user-friendly cantilever
arrays, the authors of~\cite{AFMCSEM11} have developed a system based on
interferometry.
deflection means along the two measurement segments are linearly
dependent. The third is on the base and provides a reference for
noise suppression. Finally, deflections are simply derived from phase
-shifts.\\
+shifts.
The pixel gray-level intensity $I$ of each profile is modeled by%
\begin{equation}
of \cite{AFMCSEM11} provides $%
1024\times 1204$ pixels with an exposure time of 2.5ms. Thus, if
the pixel extraction time is neglected, each phase calculation of a
-100-cantilever array should take no more than 12.5$\mu$s. \newline
+100-cantilever array should take no more than 12.5$\mu$s.
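As a rough check of this bound, and assuming that each cantilever contributes
two profiles (one per measurement segment) so that a 100-cantilever array
yields 200 profiles per frame, the time budget per phase computation is
\[
\frac{2.5\,\mathrm{ms}}{2 \times 100} = 12.5\,\mu\mathrm{s}.
\]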
In fact, this timing is a very hard constraint. To illustrate this point, we
consider a very small program that initializes twenty million doubles in
Obviously, some cache effects and optimizations on huge amounts of
computation can drastically increase this performance: peak efficiency is
about 2.5Gflops for the considered CPU. But this is not the case for phase
-computation that is using only a few tenth of values.\newline
+computation, which uses only a few tens of values.
In order to evaluate the original algorithm, we translated it into C.
As stated before, for 20 pixels, it does about 1,550 operations, thus an
per profile (including I/O accesses). This meets our requirement but is close
to the limit. Under an occasional system load, the limit could be
largely exceeded. Solutions would be to use a real-time operating system or
-to search for a more efficient algorithm.\newline
+to search for a more efficient algorithm.
However, the main drawback is the latency of such a solution because each
profile must be treated one after another and the deflection of 100
for real-time requirements as for individual cantilever active control. An
obvious solution is to parallelize the computations, for example on a GPU.
Nevertheless, the cost of transferring profiles to GPU memory and of taking
-back results would be prohibitive compared to computation time.\newline
+back results would be prohibitive compared to computation time.
We remark that when possible, it is more efficient to pipeline the
computation. For example, supposing that 200 profiles of 20 pixels
\label{sec:solus}
-In this section we present part of the computing solution to the above
-requirements. We first give some general information about FPGAs, then we
+In this section we present parts of the computing solution to the above
+requirements. The hardware part consists of a high-speed camera linked to an
+embedded board hosting two FPGAs. In this way, the camera output stream can be
+pushed directly into the FPGA. The software part is mostly the VHDL code that
+deserializes the camera stream, extracts profiles and computes the deflection.
+
+We first give some general information about FPGAs, then we
describe the FPGA board we use for implementation and finally the two
algorithms for phase computation are detailed. Presentation of VHDL
implementations is postponed until Section~\ref{Experimental tests}.
-\newline
-The hardware part consists in a high-speed camera, linked on an embedded
-board hosting two FPGAs. In this way, the camera output stream can be pushed
-directly into the FPGA. The software part is mostly the VHDL code that
-deserializes the camera stream, extracts profiles and computes the
-deflection. Before to present the board we use, we give some general
-information about FPGAs.
+
\subsection{Elements of FPGA architecture and programming}
considering the total number of operations is not fully relevant for FPGA
implementation, whose time and space consumption depends not only on the type
of operations but also on their ordering. The final evaluation is thus very
-much driven by the third criterion.\newline
+much driven by the third criterion.
The Spartan 6 used in our architecture has a hard constraint since it
has no built-in floating point units. Obviously, it is possible to use