-considering the total number of operations is not really pertinent for
-an FPGA implementation: it mainly depends on the type of operations
-and their
-ordering. The final decision is thus driven by the third criterion.\\
-
-The Spartan 6 used in our architecture has a hard constraint: it has no built-in
-floating point units. Obviously, it is possible to use some existing
-"black-boxes" for double precision operations. But they have quite a long
-latency. It is much simpler to exclusively use integers, with a quantization of
-all double precision values. Obviously, this quantization should not decrease
-too much the precision of results. Furthermore, it should not lead to a design
-with a huge latency because of operations that could not complete during a
-single or few clock cycles. Divisions fall into that category and, moreover,
-they need a varying number of clock cycles to complete. Even multiplications can
-be a problem: a DSP48 takes inputs of 18 bits maximum. For larger multiplications,
-several DSP must be combined, increasing the latency.
-
-Nevertheless, the hardest constraint does not come from the FPGA characteristics
-but from the algorithms. Their VHDL implementation will be efficient only if
-they can be fully (or near) pipelined. Thus, the choice is quickly made: only a
-small part of SPL can be pipelined. Indeed, the computation of spline
-coefficients implies to solve a tridiagonal system $A.m = b$. Values in $A$ and
-$b$ can be computed from incoming pixels intensity but after, the back-solve
-starts with the latest values, which breaks the pipeline. Moreover, SPL relies
-on interpolating far more points than profile size. Thus, the end of SPL works
-on a larger amount of data than at the beginning, which also breaks the pipeline.
-
-LSQ has not this problem: all parts except the dichotomial search work on the
-same amount of data, i.e. the profile size. Furthermore, LSQ needs less
-operations than SPL, implying a smaller output latency. Consequently, it is the
-best candidate for phase computation. Nevertheless, obtaining a fully pipelined
-version supposes that operations of different parts complete in a single clock
-cycle. It is the case for simulations but it completely fails when mapping and
-routing the design on the Spartan6. Thus, extra-latency is generated and
-there must be idle times between two profiles entering into the pipeline.
-
-%%Before obtaining the least bitstream, the crucial question is: how to
-%%translate the C code the LSQ into VHDL ?
-
-
-%\subsection{VHDL design paradigms}
-
-\section{Experimental tests}
-
-%In this section we explain what we have done yet. Until now, we could not perform
-%real experiments since we just have received the FGPA board. Nevertheless, we
-%will include real experiments in the final version of this paper.
+considering the total number of operations is not fully relevant for an FPGA
+implementation, whose time and space consumption depend not only on the type
+of operations but also on their ordering. The final evaluation is thus
+largely driven by the third criterion.
+
+The Spartan 6 used in our architecture has a hard constraint: it has no
+built-in floating-point units. It is of course possible to use existing
+``black-boxes'' for double precision operations, but they require many
+clock cycles to complete. It is much simpler to use integers exclusively,
+with a quantization of all double precision values. This quantization must
+be chosen so that it does not significantly degrade the precision of the
+results. Furthermore, it should not lead to a design with a huge latency
+because of operations that cannot complete within one or a few clock
+cycles. Divisions fall into that category and, moreover, they need a
+varying number of clock cycles to complete. Even multiplications can be a
+problem since a DSP48 takes inputs of at most 18 bits. For larger
+multiplications, several DSP48 blocks must therefore be combined, which
+increases the overall latency.
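+
+As an illustration of the trade-off involved, and assuming a power-of-two
+fixed-point scheme (the scale $2^{f}$ and the symbols $q$ and $f$ are
+illustrative choices, not necessarily the quantization retained in the
+design), a real value $x$ can be mapped to the integer
+\[
+  q(x) = \mathrm{round}\left(x \cdot 2^{f}\right),
+  \qquad
+  \left| x - q(x)\, 2^{-f} \right| \le 2^{-(f+1)} .
+\]
+Under this assumption, the product $q(x)\,q(y)$ approximates
+$x\,y \cdot 2^{2f}$ and its width can reach the sum of the operand widths,
+so increasing $f$ improves precision but quickly pushes operands beyond the
+18-bit multiplier inputs.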
+
+For the algorithms considered here, the hardest constraint does not come
+from the FPGA characteristics but from the algorithms themselves. Their VHDL
+implementation can be efficient only if they can be fully (or nearly)
+pipelined. We observe that only a small part of SPL can be pipelined. Indeed,
+the computation of spline coefficients requires solving a tridiagonal linear
+system whose matrix and right-hand side are computed from the intensity of
+incoming pixels. The back-solve, however, starts with the last values
+computed, which breaks the pipeline. Moreover, SPL relies on interpolating
+far more points than the profile size. Thus, the end of SPL works on a
+larger amount of data than the beginning, which also breaks the pipeline.
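+
+To make the dependency explicit, consider the classical elimination scheme
+for a tridiagonal system $A m = b$ of size $n$ (a sketch only: the notation
+$a_i$, $d_i$, $c_i$ for the sub-, main and super-diagonal of $A$ is assumed
+here, and the actual SPL solver may differ in its details). The forward
+sweep, for $i = 2, \dots, n$, only uses data that has already been received:
+\[
+  d'_1 = d_1, \quad b'_1 = b_1, \qquad
+  d'_i = d_i - \frac{a_i}{d'_{i-1}}\, c_{i-1}, \qquad
+  b'_i = b_i - \frac{a_i}{d'_{i-1}}\, b'_{i-1},
+\]
+whereas the back-substitution runs in the opposite direction, for
+$i = n-1, \dots, 1$:
+\[
+  m_n = \frac{b'_n}{d'_n}, \qquad
+  m_i = \frac{b'_i - c_i\, m_{i+1}}{d'_i}.
+\]
+Hence $m_1$ cannot be produced before the last row has been processed.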
+
+LSQ does not have this problem since all parts, except the dichotomic search,
+work on the same amount of data, i.e. the profile size. Furthermore, LSQ
+requires fewer operations than SPL, implying a smaller output latency.
+Overall, LSQ turns out to be the best candidate for phase computation on any
+architecture, including FPGA.
+
+\section{VHDL implementation and experimental tests}
+
+\label{Experimental tests}