-integer values. We use a very simple quantization by multiplying
-double precision values by a power of two, keeping the integer
-part. For example, all values stored in lut$_s$, lut$_c$, $\ldots$ are
-scaled by 1024. Since LSQ also computes average, variance, ... to
-remove the slope, the result of implied euclidian divisions may be
-relatively wrong. To avoid that, we also scale the pixel intensities
-by a power of two. Futhermore, assuming $nb_s$ is fixed, these
-divisions have a knonw denominator. Thus, they can be replaced by
-their multiplication/shift counterpart. Finally, all other
-multiplications or divisions by a power of two have been replaced by
-left or right bit shifts. By the way, the code only contains
-additions, substractions and multiplications of signed integers, which
-is perfectly adapted to FGPAs.
-
-As said above, hardware constraints have a great influence on the VHDL
-implementation. Consequently, we searched the maximum value of each
-variable as a function of the different scale factors and the size of
-profiles, which gives their maximum size in bits. That size determines
-the maximum scale factors that allow to use the least possible RAMs
-and DSPs. Actually, we implemented our algorithm with this maximum
-size but current works study the impact of quantization on the results
-precision and design complexity. We have compared the result of the
-LSQ version using integers and doubles and observed that the precision
-of both were similar.
-
-Then we built two versions of VHDL codes: one directly by hand coding
-and the other with Matlab using the Simulink HDL coder
-feature~\cite{HDLCoder}. Although the approach is completely different
-we obtained VHDL codes that are quite comparable. Each approach has
-advantages and drawbacks. Roughly speaking, hand coding provides
-beautiful and much better structured code while Simulink allows to
-produce a code faster. In terms of throughput and latency,
-simulations shows that the two approaches are close with a slight
-advantage for hand coding. We hope that real experiments will confirm
-that.
+integer values. We used a very simple quantization, which consists in
+multiplying each double-precision value by a power of two and keeping
+the integer part. To evaluate accurately the division involved in the
+computation of the slope coefficient $a$, we also scaled the pixel
+intensities by another power of two. The main problem was to determine
+these factors. Most of the time, they are chosen to minimize the error
+induced by the quantization. In our case, however, we also have
+hardware constraints, for example the width and depth of RAMs or the
+input size of DSPs. Thus, an important criterion for choosing the
+scaling factors is that as many values as possible fit within these
+sizes.
+
+Consequently, we determined the maximum value of each variable as a
+function of the scale factors and of the profile size involved in the
+algorithm, which gave us the maximum number of bits needed to encode
+them. We chose the scale factors so that every variable (except the
+covariance) fits in 18 bits, which is the maximum input size of DSPs.
+In this way, all multiplications (except the one involving the
+covariance) can be done with a single DSP, in a single clock cycle.
+Moreover, assuming that $nb_s = 1024$, all LUTs fit in the 18~Kbit
+RAMs. Finally, we compared the double and integer versions of LSQ and
+found a nearly perfect agreement between their results.
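+
+The quantization and the bit-width check described above can be
+sketched as follows. This is only an illustration under stated
+assumptions: the helper names, the sine content of the table and the
+scale factor $2^{10}$ are hypothetical choices, not the actual values
+of our implementation.
+
```python
import math


def quantize(value: float, shift: int) -> int:
    """Scale a real value by 2**shift and keep the integer part."""
    # int() truncates toward zero, i.e. it keeps the integer part.
    return int(value * (1 << shift))


def signed_bits(v: int) -> int:
    """Minimum two's-complement width (sign bit included) able to hold v."""
    if v >= 0:
        return v.bit_length() + 1
    return (-v - 1).bit_length() + 1


def build_sine_lut(nb_s: int = 1024, shift: int = 10) -> list[int]:
    """Hypothetical LUT: one sine period sampled nb_s times, scaled by 2**shift."""
    return [quantize(math.sin(2 * math.pi * i / nb_s), shift) for i in range(nb_s)]


DSP_INPUT_BITS = 18          # maximum DSP input size stated in the text
RAM_BITS = 18 * 1024         # one 18 Kbit RAM, assuming 1 Kbit = 1024 bits

lut = build_sine_lut()
widest = max(signed_bits(v) for v in lut)
fits_dsp = widest <= DSP_INPUT_BITS
fits_ram = len(lut) * DSP_INPUT_BITS <= RAM_BITS
```
+
+With 1024 entries stored on 18 bits each, the table occupies 18432
+bits, which is exactly the capacity assumed here for one RAM.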
+
+As mentioned above, some operations like divisions must be
+avoided. But when the divisor is fixed, a division can be replaced
+by its multiplication/shift counterpart. This is always the case in
+LSQ. For example, assuming that $M$ is fixed, $x_{var}$ is known and
+fixed. Thus, $\frac{xy_{covar}}{x_{var}}$ can be replaced by
+
+\[ (xy_{covar}\times \left \lfloor\frac{2^n}{x_{var}} \right \rfloor) \gg n\]
+
+where $n$ depends on the desired precision (in our case $n=24$).
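+
+A minimal sketch of this replacement is given below. The function
+names and the divisor value 341 are hypothetical and serve only as an
+illustration; the shift $n=24$ is the one given above. Python's right
+shift on negative integers is arithmetic, which we assume matches the
+behaviour of a signed right shift in hardware.
+
```python
def reciprocal_constant(divisor: int, n: int = 24) -> int:
    """Precompute floor(2**n / divisor) once, for a fixed positive divisor."""
    return (1 << n) // divisor


def divide_by_constant(x: int, recip: int, n: int = 24) -> int:
    """Approximate x / divisor with one multiplication and one right shift."""
    return (x * recip) >> n


X_VAR = 341                      # hypothetical fixed divisor
RECIP = reciprocal_constant(X_VAR)

# Largest deviation from exact floor division over a range of signed inputs.
max_error = max(
    abs(divide_by_constant(x, RECIP) - x // X_VAR)
    for x in range(-100_000, 100_000, 37)
)
```
+
+Since the reciprocal is truncated, the result may differ by one unit
+from the exact integer quotient. For instance, with a divisor of 10
+and $n=24$, the input 1000 gives 99 instead of 100.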
+
+Obviously, multiplications and divisions by a power of two can be
+replaced by left or right bit shifts. As a result, the code only
+contains shifts, additions, subtractions and multiplications of signed
+integers, which are well suited to FPGAs.
+
+
+We built two versions of the VHDL code, one written directly by hand
+and the other generated with Matlab using the Simulink HDL Coder
+feature~\cite{HDLCoder}. Although the approaches are completely
+different, we obtained quite comparable VHDL codes. Each approach has
+advantages and drawbacks. Roughly speaking, hand coding provides
+cleaner and much better structured code, while Simulink HDL Coder
+allows fast code production. In terms of throughput and latency,
+simulations show that the two approaches yield close results, with a
+slight advantage for hand coding.