-A field-programmable gate array (FPGA) is an integrated circuit
-designed to be configured by the customer. FGPAs are composed of
-programmable logic components, called configurable logic blocks
-(CLB). These blocks mainly contains look-up tables (LUT), flip/flops
-(F/F) and latches, organized in one or more slices connected
-together. Each CLB can be configured to perform simple (AND, XOR, ...)
-or complex combinational functions. They are interconnected by
-reconfigurable links. Modern FPGAs contain memory elements and
-multipliers which enable to simplify the design and to increase the
-performance. Nevertheless, all other complex operations, like
-division, trigonometric functions, $\ldots$ are not available and must
-be done by configuring a set of CLBs. Since this configuration is not
-obvious at all, it can be done via a framework, like ISE. Such a
-software can synthetize a design written in an hardware description
-language (HDL), map it onto CLBs, place/route them for a specific
-FPGA, and finally produce a bitstream that is used to configre the
-FPGA. Thus, from the developper point of view, the main difficulty is
-to translate an algorithm in HDL code, taking account FPGA resources
-and constraints like clock signals and I/O values that drive the FPGA.
+A field-programmable gate array (FPGA) is an integrated circuit designed to be
+configured by the customer. FPGAs are composed of programmable logic components,
+called configurable logic blocks (CLBs). These blocks mainly contain look-up
+tables (LUTs), flip-flops (F/Fs) and latches, organized in one or more slices
+connected together. Each CLB can be configured to perform simple (AND, XOR,
+\ldots) or complex combinational functions. CLBs are interconnected by
+reconfigurable links. Modern FPGAs also contain memory elements and multipliers,
+which simplify the design and increase performance. Nevertheless, all other
+complex operations, such as division or trigonometric functions, are not
+available and must be built by configuring a set of CLBs. Since this
+configuration is far from obvious, it is usually done with a framework such as
+ISE~\cite{ISE}. Such software can synthesize a design written in a hardware
+description language (HDL), map it onto CLBs, place and route them for a
+specific FPGA, and finally produce a bitstream that is used to configure the
+FPGA. Thus, from the developer's point of view, the main difficulty is to
+translate an algorithm into HDL code, taking into account FPGA resources and
+constraints such as clock signals and I/O values that drive the FPGA.
Indeed, HDL programming is very different from classic languages like
C. A program can be seen as a state-machine, manipulating signals that
and their
ordering. The final decision is thus driven by the third criterion.\\
-The Spartan 6 used in our architecture has hard constraint: it has no
-built-in floating point units. Obviously, it is possible to use some
-existing "black-boxes" for double precision operations. But they have
-a quite long latency. It is much simpler to exclusively use integers,
-with a quantization of all double precision values. Obviously, this
-quantization should not decrease too much the precision of
-results. Furthermore, it should not lead to a design with a huge
-latency because of operations that could not complete during a single
-or few clock cycles. Divisions are in this case and, moreover, they
-need an varying number of clock cycles to complete. Even
-multiplications can be a problem: DSP48 take inputs of 18 bits
-maximum. For larger multiplications, several DSP must be combined,
-increasing the latency.
-Nevertheless, the hardest constraint does not come from the FPGA
-characteristics but from the algorithms. Their VHDL implentation will
-be efficient only if they can be fully (or near) pipelined. By the
-way, the choice is quickly done: only a small part of SPL can be.
-Indeed, the computation of spline coefficients implies to solve a
-tridiagonal system $A.m = b$. Values in $A$ and $b$ can be computed
-from incoming pixels intensity but after, the back-solve starts with
-the lastest values, which breaks the pipeline. Moreover, SPL relies on
-interpolating far more points than profile size. Thus, the end
-of SPL works on a larger amount of data than the beginning, which
-also breaks the pipeline.
-LSQ has not this problem: all parts except the dichotomial search
-work on the same amount of data, i.e. the profile size. Furthermore,
-LSQ needs less operations than SPL, implying a smaller output
-latency. Consequently, it is the best candidate for phase
-computation. Nevertheless, obtaining a fully pipelined version
-supposes that operations of different parts complete in a single clock
-cycle. It is the case for simulations but it completely fails when
-mapping and routing the design on the Spartan6. By the way,
-extra-latency is generated and there must be idle times between two
-profiles entering into the pipeline.
+The Spartan 6 used in our architecture has a hard constraint: it has no built-in
+floating point units. It is of course possible to use existing ``black boxes''
+for double precision operations, but they have a rather long latency. It is
+much simpler to use integers exclusively, with a quantization of all double
+precision values. Obviously, this quantization should not degrade the precision
+of the results too much. Furthermore, it should not lead to a design with a
+huge latency because of operations that cannot complete within one or a few
+clock cycles. Divisions fall into this category and, moreover, they need a
+varying number of clock cycles to complete. Even multiplications can be a
+problem: DSP48 slices take inputs of at most 18 bits. For larger
+multiplications, several DSP slices must be combined, which increases the
+latency.
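To illustrate why wide multiplications cost more than narrow ones, the sketch
below splits a 32-bit by 32-bit product into four 16-bit by 16-bit partial
products. It is only a software model, not the design used in this work: the
function name \texttt{mul32\_by\_parts} and the 16-bit split (chosen to stay
below the 18-bit input limit) are assumptions made for the example.

```c
#include <stdint.h>

/* Multiply two unsigned 32-bit values using only 16x16-bit products,
 * mimicking how a wide multiplication is spread over several narrow
 * hardware multipliers. The 16-bit split is an illustrative assumption. */
uint64_t mul32_by_parts(uint32_t a, uint32_t b)
{
    uint32_t a_lo = a & 0xFFFFu, a_hi = a >> 16;
    uint32_t b_lo = b & 0xFFFFu, b_hi = b >> 16;

    /* Four partial products; each one fits in 32 bits. */
    uint64_t p_ll = (uint64_t)a_lo * b_lo;
    uint64_t p_lh = (uint64_t)a_lo * b_hi;
    uint64_t p_hl = (uint64_t)a_hi * b_lo;
    uint64_t p_hh = (uint64_t)a_hi * b_hi;

    /* Recombine with shifts; in hardware these additions form extra stages. */
    return p_ll + ((p_lh + p_hl) << 16) + (p_hh << 32);
}
```

Each partial product would occupy one narrow multiplier, and the final
additions are what lengthen the latency compared with a single multiplication.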
+Nevertheless, the hardest constraint does not come from the FPGA characteristics
+but from the algorithms. Their VHDL implementation will be efficient only if
+they can be fully (or nearly) pipelined. From this standpoint, the choice is
+quickly made: only a small part of SPL can be. Indeed, computing the spline
+coefficients requires solving a tridiagonal system $A.m = b$. The values in $A$
+and $b$ can be computed from the incoming pixel intensities, but the back-solve
+then starts with the latest values, which breaks the pipeline. Moreover, SPL
+relies on interpolating far more points than the profile size. Thus, the end of
+SPL works on a larger amount of data than the beginning, which also breaks the
+pipeline.
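The data dependency that breaks the pipeline can be seen in a generic
tridiagonal solver. The sketch below uses the standard Thomas algorithm in
double precision; it is not the SPL code itself, and the function name and
array layout are assumptions made for the example. It also assumes a matrix
for which no pivoting is needed (e.g. diagonally dominant).

```c
#include <assert.h>
#include <stddef.h>

/* Solve the tridiagonal system A.m = b with the Thomas algorithm.
 * Row i of A is (lower[i], diag[i], upper[i]); lower[0] and upper[n-1]
 * are unused. diag and b are overwritten. No pivoting is performed. */
void solve_tridiagonal(size_t n, const double *lower, double *diag,
                       const double *upper, double *b, double *m)
{
    assert(n > 0);

    /* Forward sweep: row i only needs row i-1, so it can follow the
     * order in which the input values arrive. */
    for (size_t i = 1; i < n; i++) {
        double w = lower[i] / diag[i - 1];
        diag[i] -= w * upper[i - 1];
        b[i]    -= w * b[i - 1];
    }

    /* Back substitution: starts from the LAST row and walks backwards,
     * so nothing can be output before the whole input has been seen. */
    m[n - 1] = b[n - 1] / diag[n - 1];
    for (size_t i = n - 1; i-- > 0; ) {
        m[i] = (b[i] - upper[i] * m[i + 1]) / diag[i];
    }
}
```

The forward sweep is compatible with a streaming design, whereas the backward
loop needs the last row first, which is the behaviour described above.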
+LSQ does not have this problem: all parts except the dichotomic search work on
+the same amount of data, i.e. the profile size. Furthermore, LSQ needs fewer
+operations than SPL, implying a smaller output latency. Consequently, it is the
+best candidate for phase computation. Nevertheless, obtaining a fully pipelined
+version assumes that the operations of the different parts each complete in a
+single clock cycle. This holds in simulation but fails completely when mapping
+and routing the design on the Spartan 6. As a result, extra latency is
+generated and there must be idle times between two profiles entering the
+pipeline.
%%Before obtaining the least bitstream, the crucial question is: how to
%%translate the C code the LSQ into VHDL ?
\section{Experimental tests}
+In this section we explain what we have done so far. Until now, we could not
+perform real experiments since we have only just received the FPGA board.
+Nevertheless, we will include real experiments in the final version of this
+paper.
\subsection{VHDL implementation}
% - write a C code using integers
% - compute the maximum bit width of each variable as a function of the quantization.
% - quantization tests: balance between precision and FPGA constraints
% - in parallel: Simulink and hand-written VHDL
+From the LSQ algorithm, we have written a C program which uses only integer
+values that have been previously scaled. The quantization of doubles into
+integers has been chosen to obtain a good trade-off between the number of bits
+used and the precision. We have compared the results of the LSQ versions using
+integers and doubles, and observed that the results of both versions were
+similar.
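A minimal sketch of such a scaling is given below. It is not the program used
in this work: the helper names \texttt{quantize} and \texttt{dequantize} and
the power-of-two scale factor are assumptions made for the example, and the
caller is assumed to pick \texttt{frac\_bits} so that the scaled value fits in
32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a real value to a scaled integer holding frac_bits fractional
 * bits, rounding to the nearest integer. The caller must ensure that
 * x * 2^frac_bits fits in an int32_t. */
int32_t quantize(double x, unsigned frac_bits)
{
    assert(frac_bits < 31);
    double scaled = x * (double)((int64_t)1 << frac_bits);
    return (int32_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Convert a scaled integer back to a real value, e.g. to compare the
 * integer version of a computation with its double precision version. */
double dequantize(int32_t q, unsigned frac_bits)
{
    assert(frac_bits < 31);
    return (double)q / (double)((int64_t)1 << frac_bits);
}
```

With this scheme the rounding error on a single value is at most half a unit
of the last fractional bit, so adding one fractional bit halves the error at
the cost of one more bit in every signal that carries the value.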
+Then we have built two versions of the VHDL code: one written directly by hand
+and the other generated with Matlab using the Simulink HDL Coder
+feature~\cite{HDLCoder}. Although the approaches are completely different, we
+have obtained VHDL codes that are quite comparable. Each approach has
+advantages and drawbacks. Roughly speaking, hand coding yields cleaner and much
+better structured code, while HDL Coder produces code faster. In terms of
+execution speed, we expect both approaches to be quite comparable, with a
+slight advantage for hand coding. We hope that real experiments will confirm
+this. In the LSQ algorithm, we have replaced all the divisions by
+multiplications by constants, since the divisors are constants that depend on
+the number of pixels in the profile (i.e. $M$).
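Replacing a division by a constant with a multiplication can be sketched as
follows. This is an illustration under assumptions, not the code used in this
work: the names \texttt{reciprocal\_q16} and \texttt{div\_by\_const} and the
16 fractional bits of the reciprocal are hypothetical. The result is an
approximation and may differ from the exact quotient by one for some inputs.

```c
#include <assert.h>
#include <stdint.h>

#define RECIP_SHIFT 16  /* hypothetical number of fractional bits */

/* Precompute round(2^RECIP_SHIFT / m). Since m is a constant of the
 * design, this is done once, outside the pipeline. */
uint32_t reciprocal_q16(uint32_t m)
{
    assert(m > 0);
    return (uint32_t)((((uint64_t)1 << RECIP_SHIFT) + m / 2) / m);
}

/* Approximate x / m as (x * recip) >> RECIP_SHIFT: one multiplication
 * and one shift, both with a fixed latency, instead of a division. */
uint32_t div_by_const(uint32_t x, uint32_t recip)
{
    return (uint32_t)(((uint64_t)x * recip) >> RECIP_SHIFT);
}
```

For example, with a divisor of 20 the precomputed constant is 3277, and
\texttt{div\_by\_const(1000, 3277)} returns 50.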
+Currently, we have only simulated our VHDL codes with GHDL and GTKWave (two free
+tools available on Linux). Both approaches led to correct results. At the
+beginning of our simulations, our pipeline could compute a new phase every 33
+cycles and the length of the pipeline was equal to 95 cycles. When we tried to
+generate the corresponding bitstream with the ISE environment, we ran into many
+problems because many stages required more than the 10~ns allowed by the clock
+period. So we needed to decompose some parts of the pipeline, adding cycles so
+that less work is done between two clock ticks.
% ghdl + gtkwave
% at best: one phase every 33 cycles, latency of 95 cycles.
% but routing/placement impossible.
\subsection{Bitstream creation}
+Currently, both approaches provide designs that can be synthesized into a
+bitstream with ISE. We expect that the pipeline will have a latency of 112
+cycles, i.e. 1.12~$\mu$s, and that it could accept a new pixel profile every 48
+cycles, i.e. every 480~ns.
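These figures follow directly from the 10~ns clock period mentioned in the
previous subsection. Writing $T_{\mathrm{clk}}$ for the clock period,
$T_{\mathrm{lat}}$ for the latency and $T_{\mathrm{in}}$ for the interval
between two profiles (symbols introduced here only for this worked example):
\[
T_{\mathrm{lat}} = 112 \times T_{\mathrm{clk}} = 112 \times 10\,\mathrm{ns}
                 = 1.12\,\mu\mathrm{s},
\qquad
T_{\mathrm{in}} = 48 \times T_{\mathrm{clk}} = 48 \times 10\,\mathrm{ns}
                = 480\,\mathrm{ns}.
\]
If these expectations hold, the pipeline would process about
$1 / T_{\mathrm{in}} \approx 2.08 \times 10^{6}$ profiles per second.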
% not done yet, but an output is expected every 480ns with a latency of 1120