-\documentclass[10pt, conference, compsocconf]{IEEEtran}
+\documentclass[10pt, peerreview, compsocconf]{IEEEtran}
%\usepackage{latex8}
%\usepackage{times}
\usepackage[utf8]{inputenc}
-\maketitle
+%\maketitle
\thispagestyle{empty}
-{\it keywords}: FPGA, cantilever, interferometry.
+
\end{abstract}
+\begin{IEEEkeywords}
+FPGA, cantilever, interferometry.
+\end{IEEEkeywords}
+
+
+\IEEEpeerreviewmaketitle
+
\section{Introduction}
Cantilevers are used inside atomic force microscopes (AFM), which provide high
tenth of values.\\
In order to evaluate the original algorithm, we translated it into C
-language. Profiles are read from a 1Mo file, as if it was an image
-stored in a device file representing the camera. The file contains 100
-profiles of 21 pixels, equally scattered in the file. We obtained an
-average of 10.5$\mu$s by profile (including I/O accesses). It is under
-are requirements but close to the limit. In case of an occasional load
-of the system, it could be largely overtaken. A solution would be to
-use a real-time operating system but another one to search for a more
-efficient algorithm.
+language. As explained further on, for 20 pixels, it performs about 1550
+operations, hence an estimated execution time of $1550/155
+=10\,\mu$s. For a more realistic evaluation, we constructed a 1\,MB file
+containing 200 profiles of 20 pixels, equally scattered. This file
+is equivalent to an image stored in a device file representing the
+camera. We obtained an average of 10.5\,$\mu$s per profile (including I/O
+accesses). It is under our requirements but close to the limit. In
+case of an occasional load on the system, the limit could be largely
+exceeded. One solution would be to use a real-time operating system, but
+another is to search for a more efficient algorithm.
But the main drawback is the latency of such a solution: since each
profile must be treated one after another, the deflection of 100
\section{Proposed solution}
\label{sec:solus}
-Project Oscar aims to provide an hardware and software architecture to
-estimate and control the deflection of cantilevers. The hardware part
-consists in a high-speed camera, linked on an embedded board hosting
-FPGAs. By the way, the camera output stream can be pushed directly
-into the FPGA. The software part is mostly the VHDL code that
-deserializes the camera stream, extracts profile and computes the
-deflection. Before focusing on our work to implement the phase
-computation, we give some general informations about FPGAs and the
-board we use.
+Project Oscar aims to provide a hardware and software architecture to estimate
+and control the deflection of cantilevers. The hardware part consists of a
+high-speed camera linked to an embedded board hosting FPGAs. In this way, the
+camera output stream can be pushed directly into the FPGA. The software part is
+mostly the VHDL code that deserializes the camera stream, extracts the profiles and
+computes the deflection. Before focusing on our work to implement the phase
+computation, we give some general information about FPGAs and the board we use.
\subsection{FPGAs}
-A field-programmable gate array (FPGA) is an integrated circuit designed to be
-configured by the customer. A hardware description language (HDL) is used to
-configure a FPGA. FGPAs are composed of programmable logic components, called
-logic blocks. These blocks can be configured to perform simple (AND, XOR, ...)
-or complex combinational functions. Logic blocks are interconnected by
-reconfigurable links. Modern FPGAs contains memory elements and multipliers
-which enables to simplify the design and increase the speed. As the most complex
-operation operation on FGPAs is the multiplier, design of FGPAs should not used
-complex operations. For example, a divider is not an available operation and it
-should be programmed using simple components.
-
+A field-programmable gate array (FPGA) is an integrated circuit
+designed to be configured by the customer. FPGAs are composed of
+programmable logic components, called configurable logic blocks
+(CLBs). These blocks mainly contain look-up tables (LUTs), flip-flops
+(F/Fs) and latches, organized in one or more slices connected
+together. Each CLB can be configured to perform simple (AND, XOR, \ldots)
+or complex combinational functions. They are interconnected by
+reconfigurable links. Modern FPGAs contain memory elements and
+multipliers, which simplify the design and increase
+performance. Nevertheless, all other complex operations, such as
+division or trigonometric functions, are not available and must
+be implemented by configuring a set of CLBs.
+
+Since this configuration is far from obvious, it can be done via a
+framework that synthesizes a design written in a hardware description
+language (HDL), and then places and routes it. The result
+is used to configure an FPGA.
FPGA programming is very different from classic processor programming. When
-logic block are programmed and linked to performed an operation, they cannot be
-reused anymore. FPGA are cadenced more slowly than classic processors but they can
-performed pipelined as well as parallel operations. A pipeline provides a way
-manipulate data quickly since at each clock top to handle a new data. However,
-using a pipeline consomes more logics and components since they are not
-reusable, nevertheless it is probably the most efficient technique on FPGA.
-Parallel operations can be used in order to manipulate several data
+logic blocks are programmed and linked to perform an operation, they cannot be
+reused anymore. FPGAs are clocked more slowly than classic processors but they
+can perform pipelined as well as parallel operations. A pipeline provides a way
+to manipulate data quickly since at each clock tick it handles a new
+data item. However, using a pipeline consumes more logic and components since they
+are not reusable. Nevertheless, it is probably the most efficient technique on
+FPGAs. Parallel operations can be used in order to manipulate several data items
simultaneously. When it is possible, a good solution is to use a pipeline to
manipulate new data at each clock tick and to use parallelism to run
-simultaneously several data streams.
+several pipelines simultaneously in order to handle multiple data streams.
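+
+As a rough sketch of this trade-off (the symbols $d$, $n$ and $p$ are
+introduced here for illustration only and are not measurements on the
+board), assume a computation split into $d$ stages of one clock cycle
+each, applied to $n$ data items, with $p$ identical pipelines working
+in parallel. The number of clock cycles needed is then approximately
+\[
+T_{\mathrm{seq}} = n\,d, \qquad
+T_{\mathrm{pipe}} = d + n - 1, \qquad
+T_{\mathrm{par}} = d + \lceil n/p \rceil - 1,
+\]
+for a non-pipelined unit, a single pipeline and $p$ parallel pipelines
+respectively. For large $n$, a single pipeline thus delivers close to
+one result per clock cycle, at the cost of $d$ non-reusable stages.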
%% discuss VHDL, synthesis and bitstream
\subsection{The board}
\begin{figure}[ht]
\begin{center}
- \includegraphics[width=9cm]{intens-noise20-spl}
+ \includegraphics[width=9cm]{intens-noise20}
\end{center}
\caption{Sample of worst profile for N=10}
\label{fig:noise20}
\begin{figure}[ht]
\begin{center}
- \includegraphics[width=9cm]{intens-noise60-lsq}
+ \includegraphics[width=9cm]{intens-noise60}
\end{center}
\caption{Sample of worst profile for N=30}
\label{fig:noise60}
We assume that $M=20$, $nb_s=1024$, $k=4$, all possible parts are
already in lookup tables and a limited set of operations (+, -, *, /,
-<, >) is taken account. Translating the two algorithms in C code, we
+$<$, $>$) is taken into account. Translating the two algorithms into C code, we
obtain about 430 operations for LSQ and 1550 (plus a few tens for
$atan$) for SPL. This result is largely in favor of LSQ. Nevertheless,
considering the total number of operations is not really pertinent for