Before describing the implementation of the phase computation, we give some
general information about FPGAs and the board we use.

\subsection{FPGAs}

A field-programmable gate array (FPGA) is an integrated circuit designed to be
configured by the customer. FPGAs are composed of programmable logic components,
called configurable logic blocks (CLBs). These blocks mainly contain look-up
tables (LUTs), flip-flops (F/F) and latches, organized in one or more slices
connected together. Each CLB can be configured to perform simple (AND, XOR,
\ldots) or complex combinational functions. The CLBs are interconnected by
reconfigurable links. Modern FPGAs also contain memory elements and multipliers,
which simplify the design and increase performance. Nevertheless, more complex
operations, like divisions or trigonometric functions, are not available and
must be built by configuring a set of CLBs. Since this configuration is far from
obvious, it is usually done with a framework like ISE~\cite{ISE}. Such software
can synthesize a design written in a hardware description language (HDL), map it
onto CLBs, place and route them for a specific FPGA, and finally produce a
bitstream that is used to configure the FPGA. Thus, from the developer's point
of view, the main difficulty is to translate an algorithm into HDL code, taking
into account the FPGA resources and constraints like the clock signals and the
I/O values that drive the FPGA.

Indeed, HDL programming is very different from classic languages like C. A
program can be seen as a state machine manipulating signals that all evolve in
parallel, synchronized by the clock.
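To give a concrete idea of what has to be translated, the following C fragment
is a hypothetical example of the kind of loop found in the phase computation (a
simple multiply-accumulate over a pixel profile); the function name, the
coefficients and the profile length are illustrative only and are not taken
from our code. In C, the loop runs sequentially, whereas an efficient VHDL
version turns it into registered pipeline stages so that a new profile can
enter the circuit at each clock cycle.

\begin{verbatim}
#include <stdint.h>

#define M 200   /* hypothetical profile length (number of pixels) */

/* Hypothetical example: weighted sum over one pixel profile.
 * In C the loop body executes M times sequentially; in a pipelined
 * VHDL design the same computation is spread over registered stages,
 * so that a new profile can enter the circuit at every clock cycle
 * once the pipeline is filled. */
int64_t weighted_sum(const int16_t pixel[M], const int16_t coef[M])
{
    int64_t acc = 0;
    for (int i = 0; i < M; i++) {
        /* one 16x16-bit product per pixel: it fits within the
         * 18-bit inputs of a DSP48 block */
        acc += (int32_t)pixel[i] * coef[i];
    }
    return acc;
}
\end{verbatim}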
The computation time of the two algorithms (SPL and LSQ) is hard to predict for
an FPGA implementation: it mainly depends on the type of operations and their
ordering. The final decision is thus driven by the third criterion.\\

The Spartan 6 used in our architecture has a hard constraint: it has no built-in
floating point unit. It is possible to use existing ``black boxes'' for
double-precision operations, but they have a quite long latency. It is much
simpler to use integers exclusively, with a quantization of all double-precision
values. Obviously, this quantization must not degrade the precision of the
results too much. Furthermore, it must not lead to a design with a huge latency
because of operations that cannot complete within one or a few clock cycles.
Divisions fall into this category and, moreover, they need a varying number of
clock cycles to complete. Even multiplications can be a problem: DSP48 blocks
take inputs of at most 18 bits. For larger multiplications, several DSPs must be
combined, which increases the latency.
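As an illustration of these two points, the following C sketch shows a possible
quantization of a double into a fixed-point integer and the replacement of a
division by the constant $M$ with a multiplication and a shift; the bit widths,
the scale factor and the value of $M$ below are hypothetical and do not
correspond to our actual design.

\begin{verbatim}
#include <stdint.h>
#include <stdio.h>

/* Hypothetical quantization: a double in [-1,1) is scaled to a
 * 16-bit signed integer (Q1.15 fixed point). */
#define QBITS 15

static inline int16_t quantize(double x)
{
    return (int16_t)(x * (1 << QBITS));
}

/* Division by the constant M (number of pixels in a profile) replaced
 * by a multiplication and a shift: x / M ~ (x * R) >> K, with
 * R = round(2^K / M) precomputed. On the FPGA this maps to a DSP48
 * multiplication instead of a multi-cycle divider. */
#define M 200
#define K 16
#define RECIP (((1 << K) + M / 2) / M)   /* rounded reciprocal: 328 */

static inline int32_t div_by_M(int32_t x)   /* for non-negative x */
{
    return (int32_t)(((int64_t)x * RECIP) >> K);
}

int main(void)
{
    printf("quantize(0.7071) = %d\n", quantize(0.7071)); /* 23170 */
    printf("div_by_M(20000)  = %d\n", div_by_M(20000));  /* 100  */
    return 0;
}
\end{verbatim}

The shift amount $K$ directly controls the trade-off mentioned above: a larger
$K$ gives a more precise reciprocal but a wider multiplication, which may
require combining several DSP48 blocks.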
Nevertheless, the hardest constraint does not come from the FPGA characteristics
but from the algorithms. Their VHDL implementation will be efficient only if
they can be fully (or nearly fully) pipelined. In that respect, the choice is
quickly made: only a small part of SPL can be. Indeed, the computation of the
spline coefficients implies solving a tridiagonal system $A.m = b$. The values
in $A$ and $b$ can be computed from the incoming pixel intensities, but then the
back-substitution starts from the last values, which breaks the pipeline.
Moreover, SPL relies on interpolating far more points than the profile size.
Thus, the end of SPL works on a larger amount of data than its beginning, which
also breaks the pipeline.

LSQ does not have this problem: all its parts except the dichotomic search work
on the same amount of data, i.e. the profile size. Furthermore, LSQ needs fewer
operations than SPL, implying a smaller output latency. Consequently, it is the
best candidate for the phase computation. Nevertheless, obtaining a fully
pipelined version assumes that the operations of its different parts complete in
a single clock cycle. This is the case in simulation but it completely fails
when mapping and routing the design on the Spartan 6. As a result, extra latency
is generated and there must be idle time between two profiles entering the
pipeline.

%% Before obtaining the final bitstream, the crucial question is: how to
%% translate the C code of LSQ into VHDL?

\section{Experimental tests}
\label{sec:results}

In this section we describe the work done so far. Until now, we could not
perform real experiments since we have just received the FPGA board.
Nevertheless, real experiments will be included in the final version of this
paper.

\subsection{VHDL implementation}

% - writing of a C code with integers
% - computation of the maximum bit width of each variable according to the quantization
% - quantization tests: trade-off between precision and FPGA constraints
% - in parallel: Simulink and hand-written VHDL

From the LSQ algorithm, we have written a C program which uses only integer
values that have been previously scaled. The quantization of doubles into
integers has been chosen so as to obtain a good trade-off between the number of
bits used and the precision. We have compared the results of the integer and
double versions of LSQ and observed that they were similar. In the LSQ
algorithm, we have also replaced all the divisions by multiplications by
constants, since the divisors are constants that only depend on the number of
pixels in the profile (i.e. $M$).

Then we have built two VHDL versions of the code: one directly by hand coding
and the other with Matlab, using the Simulink HDL Coder feature~\cite{HDLCoder}.
Although the approaches are completely different, we have obtained VHDL codes
that are quite comparable. Each approach has advantages and drawbacks. Roughly
speaking, hand coding provides cleaner and much better structured code, while
HDL Coder produces code faster. In terms of execution speed, we think that both
approaches will be quite comparable, with a slight advantage for hand coding. We
hope that real experiments will confirm this.

\subsection{Simulation}

Currently, we have only simulated our VHDL codes with GHDL and GTKWave (two free
tools available under Linux). Both approaches led to correct results. At the
beginning of our simulations, our pipeline could compute a new phase every 33
cycles and the length of the pipeline was equal to 95 cycles. When we tried to
generate the corresponding bitstream with the ISE environment, we ran into many
problems because several stages required more than the 10~ns imposed by the
clock frequency. So we had to decompose some parts of the pipeline in order to
add some cycles and to reduce the amount of logic between two clock edges.

% ghdl + gtkwave
% at best: one phase every 33 cycles, latency of 95 cycles.
% but place/route impossible.

\subsection{Bitstream creation}

Currently, both approaches provide synthesizable bitstreams with ISE. We expect
the pipeline to have a latency of 112 cycles, i.e. 1.12~$\mu$s, and to accept a
new profile of pixels every 48 cycles, i.e. 480~ns.
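For reference, assuming the 10~ns clock period mentioned above, these cycle
counts translate into the following latency and throughput:
\[
112 \times 10\,\mathrm{ns} = 1.12\,\mu\mathrm{s},
\qquad
\frac{1}{48 \times 10\,\mathrm{ns}} \approx 2.08 \times 10^{6}\;\mathrm{profiles/s}.
\]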
% not done yet, but we expect an output every 480 ns with a latency of 1120 ns