+From the LSQ algorithm, we have written a C program that uses only
+integer values. We use a very simple quantization by multiplying
+double precision values by a power of two, keeping the integer
+part. For example, all values stored in lut$_s$, lut$_c$, $\ldots$ are
+scaled by 1,024. Since LSQ also computes average, variance, ... to
+remove the slope, the result of implied Euclidean divisions may be
+relatively wrong. To avoid that, we also scale the pixel intensities
+by a power of two. Furthermore, assuming $nb_s$ is fixed, these
+divisions have a known denominator. Thus, they can be replaced by
+their multiplication/shift counterpart. Finally, all other
+multiplications or divisions by a power of two have been replaced by
+left or right bit shifts. Thus, the code only contains
+additions, subtractions and multiplications of signed integers, which
+are perfectly adapted to FGPAs.
+
+As mentioned above, hardware constraints have a great influence on the VHDL
+implementation. Consequently, we searched the maximum value of each variable as
+a function of the different scale factors and the size of profiles, which gives
+their maximum size in bits. That size determines the maximum scale factors that
+allow to use the least possible RAMs and DSPs. Actually, we implemented our
+algorithm with this maximum size but current works study the impact of
+quantization on the results precision and design complexity. We have compared
+the result of the LSQ version using integers and doubles and observed that the
+precision of both were similar.
+
+Then we built two versions of VHDL codes: one directly by hand coding
+and the other with Matlab using the Simulink HDL coder
+feature~\cite{HDLCoder}. Although the approach is completely different
+we obtained VHDL codes that are quite comparable. Each approach has
+advantages and drawbacks. Roughly speaking, hand coding provides
+beautiful and much better structured code while Simulink enables us to
+produce a code faster. In terms of throughput and latency,
+simulations show that the two approaches are close with a slight
+advantage for hand coding. We hope that real experiments will confirm
+that.
+
+\subsection{Simulation}
+
+Before experimental tests on the board, we simulated our two VHDL
+codes with GHDL and GTKWave (two free tools with linux). For that, we
+built a testbench based on profiles taken from experimentations and
+compared the results to values given by the SPL algorithm. Both
+versions lead to correct results.
+
+Our first codes were highly optimized : the pipeline could compute a
+new phase each 33 cycles and its latency was equal to 95 cycles. Since
+the Spartan6 is clocked at 100MHz, it implies that estimating the
+deflection of 100 cantilevers would take about $(95 + 200\times 33).10
+= 66.95\mu$s, i.e. nearly 15,000 estimations by second.
+
+\subsection{Bitstream creation}
+
+In order to test our code on the SP Vision board, the design was
+extended with a component that keeps profiles in RAM, flushes them in
+the phase computation component and stores its output in another
+RAM. We also added a wishbone : a component that can "drive" signals
+to communicate between i.MX and other components. It is mainly used
+to start to flush profiles and to retrieve the computed phases in RAM.
+
+Unfortunately, the first designs could not be placed and route with ISE on the
+Spartan6 with a 100MHz clock. The main problems came from routing values from
+RAMs to DSPs and obtaining a result under 10ns. So, we needed to decompose some
+parts of the pipeline, which adds some cycles. For example, some delays have
+been introduced between RAMs output and DSPs. Finally, we obtained a bitstream
+that has a latency of 112 cycles and computes a new phase every 40 cycles. For
+100 cantilevers, it takes $(112 + 200\times 40).10 = 81.12\mu$s to compute their
+deflection.
+
+This bitstream has been successfully tested on the board.