+on spline interpolation (see section \ref{algo-spline}). It also
+computes the coefficient used for unwrapping the phase. The second one
+is the acquisition loop, during which images are taken at regular time
+steps. For each image, the phase $\theta$ of all profiles is computed
+to obtain, after unwrapping, the deflection of the
+cantilevers. Originally, this computation was also done with an
+algorithm based on splines. This article proposes a new version based
+on a least-squares method.
+
+\subsection{Design goals}
+\label{sec:goals}
+
+The main goal is to implement a computing unit that estimates the
+deflection of about $10\times10$ cantilevers faster than the stream of
+images coming from the camera. The accuracy of the results must be close
+to the best precision obtained experimentally on the
+architecture, i.e. 0.3~nm. Finally, the latency between an image
+entering the unit and the availability of the deflections must be as
+small as possible (NB: future work plans to add some control of the
+cantilevers).\\
+
+If we put aside some hardware issues, such as the speed of the link
+between the camera and the computation unit or the time needed to
+deserialize pixels and store them in memory, the phase computation is
+clearly the bottleneck of the whole process. For example, if we
+consider the camera currently in use, an exposure time of 2.5~ms for
+$1024\times 1204$ pixels seems to be the minimum that can be reached. For
+100 cantilevers, if we neglect the time to extract pixels, it implies
+that computing the deflection of a single
+cantilever should take less than 25~$\mu$s, thus 12.5~$\mu$s per phase
+(two phases per cantilever).\\
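+
+As a quick check of this budget, and assuming two profiles, hence two
+phases, per cantilever (consistent with the 200 profiles counted for
+100 cantilevers later in this section), the figures follow from
+\[
+  t_{\mathrm{cant}} = \frac{2.5\,\mathrm{ms}}{100} = 25\,\mu\mathrm{s},
+  \qquad
+  t_{\mathrm{phase}} = \frac{t_{\mathrm{cant}}}{2} = 12.5\,\mu\mathrm{s},
+\]
+where $t_{\mathrm{cant}}$ and $t_{\mathrm{phase}}$ are notations
+introduced here only for this illustration.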
+
+In fact, this timing is a very hard constraint. Let us consider a very
+small program that initializes twenty million doubles in memory
+and then computes one million cumulative sums over 20 contiguous values
+(experimental profiles have about this size). On an Intel Core 2 Duo
+E6650 at 2.33~GHz, this program reaches an average of 155~Mflops.
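+
+To give an order of magnitude, and under the rough assumption that the
+phase computation sustains the same rate as this benchmark, the
+12.5~$\mu$s budget per phase corresponds to about
+\[
+  155\times 10^{6}\,\mathrm{flop/s} \times 12.5\times 10^{-6}\,\mathrm{s}
+  \approx 1.9\times 10^{3}
+\]
+floating-point operations per phase.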
+
+%%Itimplies that the phase computation algorithm should not take more than
+%%$155\times 12.5 = 1937$ floating operations. For integers, it gives $3000$ operations.
+
+Of course, cache effects and optimizations on
+large amounts of computation can drastically increase this
+performance: peak efficiency is about 2.5~Gflops for the considered
+CPU. But this is not the case for the phase computation, which uses only a few
+tens of values.\\
+
+In order to evaluate the original algorithm, we translated it into
+C. Profiles are read from a 1~MB file, as if it were an image
+stored in a device file representing the camera. The file contains 100
+profiles of 21 pixels, evenly distributed in the file. We obtained an
+average of 10.5~$\mu$s per profile (including I/O accesses). This is within
+our requirements but close to the limit. In case of an occasional load
+on the system, the limit could be largely exceeded. One solution would be to
+use a real-time operating system; another is to search for a more
+efficient algorithm.
+
+But the main drawback is the latency of such a solution: since the
+profiles must be processed one after another, computing the deflection of 100
+cantilevers takes about $200\times 10.5\,\mu\mathrm{s} = 2.1$~ms, which is inadequate
+for efficient control. An obvious solution is to parallelize the
+computations, for example on a GPU. Nevertheless, the cost of transferring
+the profiles to GPU memory and retrieving the results would be prohibitive
+compared to the computation time. It is certainly more efficient to
+pipeline the computation. For example, supposing that 200 profiles of
+20 pixels can be pushed sequentially into a pipelined unit clocked at
+100~MHz (i.e. a pixel enters the unit every 10~ns), all profiles
+would be processed in $200\times 20\times 10\times 10^{-9}\,\mathrm{s} = 40\,\mu$s plus
+the latency of the pipeline. This is about 50 times faster than
+the current results.\\
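+
+The comparison above can be written compactly. This is only a sketch:
+$N_p$, $n$, $f$ and $L$ are notations introduced here for the number of
+profiles, the number of pixels per profile, the clock frequency and the
+(unspecified) latency of the pipeline.
+\[
+  T_{\mathrm{seq}} = N_p \times 10.5\,\mu\mathrm{s}
+  = 200 \times 10.5\,\mu\mathrm{s} = 2.1\,\mathrm{ms},
+  \qquad
+  T_{\mathrm{pipe}} = \frac{N_p\, n}{f} + L
+  = \frac{200\times 20}{100\,\mathrm{MHz}} + L = 40\,\mu\mathrm{s} + L .
+\]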
+
+For these reasons, an FPGA as the computation unit is the best choice
+to achieve the required performance. Nevertheless, moving from
+C code to a pipelined version in VHDL is far from straightforward. As
+explained in the next section, it can even be impossible because of
+some hardware constraints specific to FPGAs.
+
+
+\section{Proposed solution}
+\label{sec:solus}
+
+Project Oscar aims to provide a hardware and software architecture to
+estimate and control the deflection of cantilevers. The hardware part
+consists of a high-speed camera connected to an embedded board hosting
+FPGAs. In this way, the camera output stream can be pushed directly
+into the FPGA. The software part is mostly the VHDL code that
+deserializes the camera stream, extracts the profiles and computes the
+deflection. Before focusing on our work to implement the phase
+computation, we give some general information about FPGAs and the
+board we use.
+
+\subsection{FPGAs}
+
+A field-programmable gate array (FPGA) is an integrated circuit designed to be
+configured by the customer. A hardware description language (HDL) is used to
+configure an FPGA. FPGAs are composed of programmable logic components, called
+logic blocks. These blocks can be configured to perform simple (AND, XOR, ...)
+or complex combinational functions. Logic blocks are interconnected by
+reconfigurable links. Modern FPGAs contain memory elements and multipliers,
+which simplify the design and increase the speed. As the most complex
+operation available on FPGAs is the multiplier, FPGA designs should avoid
+complex operations. For example, a divider is not an available operation and
+must be built from simple components.
+
+Programming FPGAs is very different from programming classic processors. Once
+logic blocks are programmed and linked to perform an operation, they cannot be
+reused for another one. FPGAs are clocked more slowly than classic processors,
+but they can perform pipelined as well as parallel operations. A pipeline
+provides a way to process data quickly, since it can accept a new datum at each
+clock cycle. However, a pipeline consumes more logic and components since they
+are not reusable; nevertheless, it is probably the most efficient technique on
+FPGAs. Parallel operations can be used to process several data
+simultaneously. When possible, a good solution is to use a pipeline to
+accept new data at each clock cycle and parallelism to handle
+several data streams simultaneously.
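+
+To make this trade-off concrete, consider a simple idealized model
+(our assumption, not a property of a particular FPGA): a pipeline of
+$d$ stages that accepts one datum per clock cycle of period $T$, and
+$P$ identical pipelines working in parallel on separate streams.
+Processing $N$ data then takes about
+\[
+  T_{\mathrm{total}} \approx
+  \left( \left\lceil \frac{N}{P} \right\rceil + d - 1 \right) T ,
+\]
+so that, for $N \gg d$, the throughput approaches $P$ data per clock
+cycle, whereas the amount of logic grows with both $d$ and $P$.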
+
+%% discuss VHDL, synthesis and bitstream
+\subsection{The board}
+
+The board we use is designed by the Armadeus company, under the name
+SP Vision. It consists of a development board hosting an i.MX27 ARM
+processor (from Freescale). The board includes all classical
+connectors: USB, Ethernet, etc. A Flash memory contains a Linux kernel
+that can be launched after booting the board via U-Boot.
+
+The processor is directly connected to a Spartan3A FPGA (from Xilinx)
+via its special interface called WEIM. The Spartan3A is itself
+connected to a Spartan6 FPGA. Thus, it is possible to develop programs
+in which the i.MX and the Spartan6 communicate, using the Spartan3A as a
+tunnel. By default, the WEIM interface provides a clock signal at
+100~MHz that is connected to dedicated FPGA pins.
+
+The Spartan6 is an LX100 version. It has 15822 slices, equivalent to
+101261 logic cells. There are 268 internal block RAMs of 18~Kbit each, and
+180 dedicated multiply-adders (named DSP48), which is more than enough
+for our project.
+
+Some I/O pins of the Spartan6 are connected to two $2\times 17$ headers
+that the user can use freely. For the project, they will be
+connected to the interface card of the camera.