X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/dmems12.git/blobdiff_plain/76b46095d99d4f535bfac76daa38cf3b31a477b3..5643f354956645b7d3592ac0e32af50cdf351155:/dmems12.tex?ds=sidebyside

diff --git a/dmems12.tex b/dmems12.tex
index 647111e..314368e 100644
--- a/dmems12.tex
+++ b/dmems12.tex
@@ -1,5 +1,5 @@
-\documentclass[10pt, conference, compsocconf]{IEEEtran}
+\documentclass[10pt, peerreview, compsocconf]{IEEEtran}
 %\usepackage{latex8}
 %\usepackage{times}
 \usepackage[utf8]{inputenc}
@@ -58,7 +58,7 @@
-\maketitle
+%\maketitle
 \thispagestyle{empty}
@@ -66,9 +66,16 @@
-{\it keywords}: FPGA, cantilever, interferometry.
+
 \end{abstract}
+\begin{IEEEkeywords}
+FPGA, cantilever, interferometry.
+\end{IEEEkeywords}
+
+
+\IEEEpeerreviewmaketitle
+
 \section{Introduction}
 Cantilevers are used inside atomic force microscope (AFM) which provides high
@@ -238,14 +245,16 @@ CPU. But this is not the case for phase computation that used only few
 tenth of values.\\
 In order to evaluate the original algorithm, we translated it in C
-language. Profiles are read from a 1Mo file, as if it was an image
-stored in a device file representing the camera. The file contains 100
-profiles of 21 pixels, equally scattered in the file. We obtained an
-average of 10.5$\mu$s by profile (including I/O accesses). It is under
-are requirements but close to the limit. In case of an occasional load
-of the system, it could be largely overtaken. A solution would be to
-use a real-time operating system but another one to search for a more
-efficient algorithm.
+language. As said further, for 20 pixels, it does about 1550
+operations, thus an estimated execution time of $1550/155
+=$10$\mu$s. For a more realistic evaluation, we constructed a file of
+1Mo containing 200 profiles of 20 pixels, equally scattered. This file
+is equivalent to an image stored in a device file representing the
+camera. We obtained an average of 10.5$\mu$s by profile (including I/O
+accesses).
+It is under are requirements but close to the limit. In
+case of an occasional load of the system, it could be largely
+overtaken. A solution would be to use a real-time operating system but
+another one to search for a more efficient algorithm.
 But the main drawback is the latency of such a solution : since each
 profile must be treated one after another, the deflection of 100
@@ -271,40 +280,45 @@ some hardware constraints specific to FPGAs.
 \section{Proposed solution}
 \label{sec:solus}
-Project Oscar aims to provide an hardware and software architecture to
-estimate and control the deflection of cantilevers. The hardware part
-consists in a high-speed camera, linked on an embedded board hosting
-FPGAs. By the way, the camera output stream can be pushed directly
-into the FPGA. The software part is mostly the VHDL code that
-deserializes the camera stream, extracts profile and computes the
-deflection. Before focusing on our work to implement the phase
-computation, we give some general informations about FPGAs and the
-board we use.
+Project Oscar aims to provide a hardware and software architecture to estimate
+and control the deflection of cantilevers. The hardware part consists in a
+high-speed camera, linked on an embedded board hosting FPGAs. By the way, the
+camera output stream can be pushed directly into the FPGA. The software part is
+mostly the VHDL code that deserializes the camera stream, extracts profile and
+computes the deflection. Before focusing on our work to implement the phase
+computation, we give some general information about FPGAs and the board we use.
 \subsection{FPGAs}
-A field-programmable gate array (FPGA) is an integrated circuit designed to be
-configured by the customer. A hardware description language (HDL) is used to
-configure a FPGA. FGPAs are composed of programmable logic components, called
-logic blocks. These blocks can be configured to perform simple (AND, XOR, ...)
-or complex combinational functions.
-Logic blocks are interconnected by
-reconfigurable links. Modern FPGAs contains memory elements and multipliers
-which enables to simplify the design and increase the speed. As the most complex
-operation operation on FGPAs is the multiplier, design of FGPAs should not used
-complex operations. For example, a divider is not an available operation and it
-should be programmed using simple components.
-
+A field-programmable gate array (FPGA) is an integrated circuit
+designed to be configured by the customer. FGPAs are composed of
+programmable logic components, called configurable logic blocks
+(CLB). These blocks mainly contains look-up tables (LUT), flip/flops
+(F/F) and latches, organized in one or more slices connected
+together. Each CLB can be configured to perform simple (AND, XOR, ...)
+or complex combinational functions. They are interconnected by
+reconfigurable links. Modern FPGAs contain memory elements and
+multipliers which enable to simplify the design and to increase the
+performance. Nevertheless, all other complex operations, like
+division, trigonometric functions, $\ldots$ are not available and must
+be done by configuring a set of CLBs.
+
+Since this configuration is not obvious at all, it can be done via a
+framework that synthetize a design written in an hardware description
+language (HDL), and after, that place and route
+
+ is used to configure a FPGA.
 FGPAs programming is very different from classic processors programming. When
-logic block are programmed and linked to performed an operation, they cannot be
-reused anymore. FPGA are cadenced more slowly than classic processors but they can
-performed pipelined as well as parallel operations. A pipeline provides a way
-manipulate data quickly since at each clock top to handle a new data. However,
-using a pipeline consomes more logics and components since they are not
-reusable, nevertheless it is probably the most efficient technique on FPGA.
-Parallel operations can be used in order to manipulate several data
+logic blocks are programmed and linked to perform an operation, they cannot be
+reused anymore. FPGAs are cadenced more slowly than classic processors but they
+can perform pipeline as well as parallel operations. A pipeline provides a way
+to manipulate data quickly since at each clock top it handles a new
+data. However, using a pipeline consumes more logics and components since they
+are not reusable. Nevertheless it is probably the most efficient technique on
+FPGA. Parallel operations can be used in order to manipulate several data
 simultaneously. When it is possible, using a pipeline is a good solution to
 manipulate new data at each clock top and using parallelism to handle
-simultaneously several data streams.
+simultaneously several pipelines in order to handle multiple data streams.
 %% parler du VHDL, synthèse et bitstream
 \subsection{The board}
@@ -590,7 +604,7 @@ largely beyond the worst experimental ones.
 \begin{figure}[ht]
 \begin{center}
- \includegraphics[width=9cm]{intens-noise20-spl}
+ \includegraphics[width=9cm]{intens-noise20}
 \end{center}
 \caption{Sample of worst profile for N=10}
 \label{fig:noise20}
@@ -598,7 +612,7 @@ largely beyond the worst experimental ones.
 \begin{figure}[ht]
 \begin{center}
- \includegraphics[width=9cm]{intens-noise60-lsq}
+ \includegraphics[width=9cm]{intens-noise60}
 \end{center}
 \caption{Sample of worst profile for N=30}
 \label{fig:noise60}
@@ -611,7 +625,7 @@ SPL on $N = k\times M$, i.e. the number of interpolated points.
 We assume that $M=20$, $nb_s=1024$, $k=4$, all possible parts are
 already in lookup tables and a limited set of operations (+, -, *, /,
-<, >) is taken account. Translating the two algorithms in C code, we
+$<$, $>$) is taken account. Translating the two algorithms in C code, we
 obtain about 430 operations for LSQ and 1550 (plus few tenth for
 $atan$) for SPL. This result is largely in favor of LSQ.
 Nevertheless, considering the total number of operations is not really pertinent for