From: Raphael Couturier
Date: Thu, 20 Oct 2011 14:41:53 +0000 (+0200)
Subject: new
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/dmems12.git/commitdiff_plain/015bc351b18995b7727145984fd39381bcce9a5a

new
---

diff --git a/dmems12.tex b/dmems12.tex
index 701ce92..94e96e4 100644
--- a/dmems12.tex
+++ b/dmems12.tex
@@ -302,7 +302,7 @@ performance. Nevertheless, all other complex operations, like division,
 trigonometric functions, $\ldots$ are not available and must be done by
 configuring a set of CLBs. Since this configuration is not obvious at
 all, it can be done via a framework, like ISE. Such a
-software can synthetize a design written in an hardware description
+software can synthesize a design written in a hardware description
 language (HDL), map it onto CLBs, place/route them for a specific
 FPGA, and finally produce a bitstream that is used to configre the
 FPGA. Thus, from the developper point of view, the main difficulty is
@@ -643,42 +643,36 @@ an FPGA implementation: it mainly depends on the type of operations and
 their ordering. The final decision is thus driven by the third
 criterion.\\
-The Spartan 6 used in our architecture has hard constraint: it has no
-built-in floating point units. Obviously, it is possible to use some
-existing "black-boxes" for double precision operations. But they have
-a quite long latency. It is much simpler to exclusively use integers,
-with a quantization of all double precision values. Obviously, this
-quantization should not decrease too much the precision of
-results. Furthermore, it should not lead to a design with a huge
-latency because of operations that could not complete during a single
-or few clock cycles. Divisions are in this case and, moreover, they
-need an varying number of clock cycles to complete. Even
-multiplications can be a problem: DSP48 take inputs of 18 bits
-maximum. For larger multiplications, several DSP must be combined,
-increasing the latency.
-
-Nevertheless, the hardest constraint does not come from the FPGA
-characteristics but from the algorithms. Their VHDL implentation will
-be efficient only if they can be fully (or near) pipelined. By the
-way, the choice is quickly done: only a small part of SPL can be.
-Indeed, the computation of spline coefficients implies to solve a
-tridiagonal system $A.m = b$. Values in $A$ and $b$ can be computed
-from incoming pixels intensity but after, the back-solve starts with
-the lastest values, which breaks the pipeline. Moreover, SPL relies on
-interpolating far more points than profile size. Thus, the end
-of SPL works on a larger amount of data than the beginning, which
-also breaks the pipeline.
-
-LSQ has not this problem: all parts except the dichotomial search
-work on the same amount of data, i.e. the profile size. Furthermore,
-LSQ needs less operations than SPL, implying a smaller output
-latency. Consequently, it is the best candidate for phase
-computation. Nevertheless, obtaining a fully pipelined version
-supposes that operations of different parts complete in a single clock
-cycle. It is the case for simulations but it completely fails when
-mapping and routing the design on the Spartan6. By the way,
-extra-latency is generated and there must be idle times between two
-profiles entering into the pipeline.
+The Spartan 6 used in our architecture has a hard constraint: it has no
+built-in floating point units. Obviously, it is possible to use some existing
+"black-boxes" for double precision operations, but they have quite a long
+latency.
+It is much simpler to use integers exclusively, with a quantization of all
+double precision values. Obviously, this quantization should not decrease the
+precision of the results too much. Furthermore, it should not lead to a design
+with a huge latency because of operations that cannot complete within one or a
+few clock cycles. Divisions fall into this category and, moreover, they need a
+varying number of clock cycles to complete. Even multiplications can be a
+problem: DSP48 blocks take inputs of 18 bits at most. For larger
+multiplications, several DSPs must be combined, which increases the latency.
+
+Nevertheless, the hardest constraint does not come from the FPGA
+characteristics but from the algorithms. Their VHDL implementation will be
+efficient only if they can be fully (or nearly) pipelined. Consequently, the
+choice is quickly made: only a small part of SPL can be. Indeed, computing the
+spline coefficients requires solving a tridiagonal system $A.m = b$. The
+values in $A$ and $b$ can be computed from the incoming pixel intensities, but
+then the back-substitution starts with the last values, which breaks the
+pipeline. Moreover, SPL interpolates far more points than the profile size.
+Thus, the end of SPL works on a larger amount of data than the beginning,
+which also breaks the pipeline.
+
+LSQ does not have this problem: all parts except the dichotomic search work on
+the same amount of data, i.e. the profile size. Furthermore, LSQ needs fewer
+operations than SPL, implying a smaller output latency. Consequently, it is
+the best candidate for the phase computation. Nevertheless, obtaining a fully
+pipelined version requires that the operations of the different parts complete
+in a single clock cycle. This is the case in simulation, but it completely
+fails when mapping and routing the design on the Spartan 6. As a result, extra
+latency is generated and there must be idle times between two profiles
+entering the pipeline.
 
 %%Before obtaining the least bitstream, the crucial question is: how to
 %%translate the C code the LSQ into VHDL ?
 
@@ -688,20 +682,56 @@ profiles entering into the pipeline.
 \section{Experimental tests}
 
+In this section we describe what has been done so far. Until now, we could not
+perform real experiments since we have only just received the FPGA board.
+Nevertheless, we will include real experiments in the final version of this
+paper.
+
 \subsection{VHDL implementation}
+
+
 % - writing a C program with integers
 % - computing the maximum bit width of each variable as a function of the quantization.
 % - quantization tests: trade-off between precision and FPGA constraints
 % - in parallel: Simulink and hand-written VHDL
-%
+
+
+From the LSQ algorithm, we have written a C program which uses only integer
+values that have been previously scaled. The quantization of doubles into
+integers has been performed in order to obtain a good trade-off between the
+number of bits used and the precision. Finally, we have compared the results
+of the integer and double precision versions of LSQ and observed that they
+were similar.
+
+Then we have built two versions of the VHDL code: one directly by hand coding
+and the other with Matlab, using the Simulink HDL Coder feature. Although the
+approaches are completely different, we have obtained VHDL codes that are
+quite comparable. Each approach has advantages and drawbacks. Roughly
+speaking, hand coding provides cleaner and much better structured code, while
+HDL Coder produces code faster.
+In terms of execution speed, we think that both versions will be quite
+comparable. Real experiments will confirm that. In the LSQ algorithm, we have
+replaced all the divisions by multiplications by a constant, since the
+divisions are performed with constants depending on the number of pixels in
+the profile (i.e. $M$).
+
 \subsection{Simulation}
 
+Currently, we have only simulated our VHDL codes with GHDL and GTKWave (two
+free tools under Linux). Both approaches led to correct results. In the first
+simulations, our pipeline could compute a new phase every 33 cycles and the
+length of the pipeline was equal to 95 cycles. When we tried to generate the
+bitstream with the ISE environment, we had many problems because many stages
+required more than the 10~ns available. So we needed to decompose some parts
+of the pipeline in order to add some cycles and simplify some stages.
 
 % ghdl + gtkwave
 % at best: one phase every 33 cycles, latency of 95 cycles.
 % but place/route impossible.
 
 \subsection{Bitstream creation}
 
+Currently, both versions can be synthesized into bitstreams with ISE. We
+expect that the pipeline will have a latency of 112 cycles, i.e. 1.12~$\mu$s,
+and that it will accept a new line of pixels every 48 cycles, i.e. every
+480~ns.
+
 % not done yet, but we expect an output every 480ns with a latency of 1120
 
 \label{sec:results}
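
The commit refers to a C reference program in which every double precision value
of LSQ is quantized into a scaled integer, but that program is not part of the
diff. The snippet below is only a minimal sketch of such a quantization under a
power-of-two scale factor; the constant QUANT_BITS, the helper names and the
example value are illustrative and do not come from the authors' code.

#include <stdint.h>
#include <stdio.h>

/* Number of fractional bits kept after quantization.  The value actually used
   in the paper is not given in this commit; 11 is only an example, chosen so
   that the scaled values still fit easily in an 18-bit signed DSP48 input.   */
#define QUANT_BITS 11
#define SCALE      (1 << QUANT_BITS)

/* Quantize a double into a scaled integer (round to nearest). */
static int32_t quantize(double x)
{
    return (int32_t)(x * SCALE + (x >= 0.0 ? 0.5 : -0.5));
}

/* Convert a scaled integer back to a double, e.g. to compare the integer
   version of LSQ with the double precision reference.                        */
static double dequantize(int32_t q)
{
    return (double)q / SCALE;
}

int main(void)
{
    double  coef = -0.837162;   /* arbitrary example value */
    int32_t q    = quantize(coef);

    /* DSP48 multipliers take 18-bit signed inputs at most; values outside
       [-2^17, 2^17 - 1] would force several DSPs to be combined.             */
    if (q < -(1 << 17) || q > (1 << 17) - 1)
        printf("warning: %d does not fit in 18 bits\n", q);

    printf("double = %f, quantized = %d, back = %f\n", coef, q, dequantize(q));
    return 0;
}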
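The diff also states that, in the LSQ pipeline, all divisions are replaced by
multiplications because the divisors are constants derived from the profile
size $M$. A common way to do this with integers is to multiply by a precomputed
fixed-point reciprocal and shift, as sketched below; the values of M and
RECIP_SHIFT are hypothetical, and the actual constants and bit widths of the
design are not given in the commit.

#include <stdint.h>
#include <stdio.h>

#define M           256   /* profile size: hypothetical value, not given here */
#define RECIP_SHIFT 20    /* precision of the fixed-point reciprocal          */

/* Reciprocal of M scaled by 2^RECIP_SHIFT.  In a pipelined design this is a
   constant, so no divider ever needs to be instantiated.                     */
static const int64_t RECIP_M = ((int64_t)1 << RECIP_SHIFT) / M;

/* x / M rewritten as a multiplication and a shift (assuming x >= 0), which
   both map onto a fully pipelined datapath.                                  */
static int32_t div_by_M(int32_t x)
{
    return (int32_t)(((int64_t)x * RECIP_M) >> RECIP_SHIFT);
}

int main(void)
{
    int32_t sum = 123456;   /* e.g. a sum over the M pixels of a profile */
    printf("exact: %d, mul+shift: %d\n", sum / M, div_by_M(sum));
    return 0;
}

When $M$ is not a power of two, the truncated reciprocal can differ from the
exact integer division by one unit for some inputs, so the shift width has to
be chosen according to the precision actually required.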
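Finally, assuming the 10~ns clock period implied by the timing constraint
mentioned in the simulation subsection, the latency and output rate quoted in
the bitstream subsection follow from a small calculation:

\[
  \text{latency} = 112 \times 10\,\text{ns} = 1.12\,\mu\text{s},
  \qquad
  \text{output rate} = 48 \times 10\,\text{ns} = 480\,\text{ns per line of pixels}.
\]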