From: Raphael Couturier
Date: Thu, 20 Oct 2011 14:41:53 +0000 (+0200)
Subject: new
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/dmems12.git/commitdiff_plain/015bc351b18995b7727145984fd39381bcce9a5a

new
---

diff --git a/dmems12.tex b/dmems12.tex
index 701ce92..94e96e4 100644
--- a/dmems12.tex
+++ b/dmems12.tex
@@ -302,7 +302,7 @@ performance. Nevertheless, all other complex operations, like division,
 trigonometric functions, $\ldots$ are not available and must be done by
 configuring a set of CLBs. Since this configuration is not obvious at
 all, it can be done via a framework, like ISE. Such a
-software can synthetize a design written in an hardware description
+software can synthesize a design written in a hardware description
 language (HDL), map it onto CLBs, place/route them for a specific
 FPGA, and finally produce a bitstream that is used to configre the
 FPGA. Thus, from the developper point of view, the main difficulty is
@@ -643,42 +643,36 @@ an FPGA implementation: it mainly depends on the type of operations and
 their ordering. The final decision is thus driven by the third
 criterion.\\
-The Spartan 6 used in our architecture has hard constraint: it has no
-built-in floating point units. Obviously, it is possible to use some
-existing "black-boxes" for double precision operations. But they have
-a quite long latency. It is much simpler to exclusively use integers,
-with a quantization of all double precision values. Obviously, this
-quantization should not decrease too much the precision of
-results. Furthermore, it should not lead to a design with a huge
-latency because of operations that could not complete during a single
-or few clock cycles. Divisions are in this case and, moreover, they
-need an varying number of clock cycles to complete. Even
-multiplications can be a problem: DSP48 take inputs of 18 bits
-maximum. For larger multiplications, several DSP must be combined,
-increasing the latency.
-
-Nevertheless, the hardest constraint does not come from the FPGA
-characteristics but from the algorithms. Their VHDL implentation will
-be efficient only if they can be fully (or near) pipelined. By the
-way, the choice is quickly done: only a small part of SPL can be.
-Indeed, the computation of spline coefficients implies to solve a
-tridiagonal system $A.m = b$. Values in $A$ and $b$ can be computed
-from incoming pixels intensity but after, the back-solve starts with
-the lastest values, which breaks the pipeline. Moreover, SPL relies on
-interpolating far more points than profile size. Thus, the end
-of SPL works on a larger amount of data than the beginning, which
-also breaks the pipeline.
-
-LSQ has not this problem: all parts except the dichotomial search
-work on the same amount of data, i.e. the profile size. Furthermore,
-LSQ needs less operations than SPL, implying a smaller output
-latency. Consequently, it is the best candidate for phase
-computation. Nevertheless, obtaining a fully pipelined version
-supposes that operations of different parts complete in a single clock
-cycle. It is the case for simulations but it completely fails when
-mapping and routing the design on the Spartan6. By the way,
-extra-latency is generated and there must be idle times between two
-profiles entering into the pipeline.
+The Spartan 6 used in our architecture has a hard constraint: it has no
+built-in floating point units. Obviously, it is possible to use some existing
+"black-boxes" for double precision operations, but they have quite a long
+latency.
+It is much simpler to use integers exclusively, with a quantization of all
+double precision values. Obviously, this quantization should not decrease the
+precision of the results too much. Furthermore, it should not lead to a design
+with a huge latency because of operations that cannot complete within one or a
+few clock cycles. Divisions fall into this category and, moreover, they need a
+varying number of clock cycles to complete. Even multiplications can be a
+problem: DSP48 blocks take inputs of 18 bits at most. For larger
+multiplications, several DSPs must be combined, which increases the latency.
+
+Nevertheless, the hardest constraint does not come from the FPGA
+characteristics but from the algorithms. Their VHDL implementation will be
+efficient only if they can be fully (or nearly) pipelined. Consequently, the
+choice is quickly made: only a small part of SPL can be. Indeed, computing the
+spline coefficients requires solving a tridiagonal system $A.m = b$. The
+values in $A$ and $b$ can be computed from the incoming pixel intensities, but
+then the back-substitution starts with the last values, which breaks the
+pipeline. Moreover, SPL interpolates far more points than the profile size.
+Thus, the end of SPL works on a larger amount of data than the beginning,
+which also breaks the pipeline.
+
+LSQ does not have this problem: all parts except the dichotomic search work on
+the same amount of data, i.e. the profile size. Furthermore, LSQ needs fewer
+operations than SPL, implying a smaller output latency. Consequently, it is
+the best candidate for the phase computation. Nevertheless, obtaining a fully
+pipelined version requires that the operations of the different parts complete
+in a single clock cycle. This is the case in simulation, but it completely
+fails when mapping and routing the design on the Spartan 6. As a result, extra
+latency is generated and there must be idle times between two profiles
+entering the pipeline.
 
 %%Before obtaining the least bitstream, the crucial question is: how to
 %%translate the C code the LSQ into VHDL ?
 
@@ -688,20 +682,56 @@ profiles entering into the pipeline.
 \section{Experimental tests}
 
+In this section we describe what has been done so far. Until now, we could not
+perform real experiments since we have only just received the FPGA board.
+Nevertheless, we will include real experiments in the final version of this
+paper.
+
 \subsection{VHDL implementation}
+
+
 % - writing a C program with integers
 % - computing the maximum bit width of each variable as a function of the quantization.
 % - quantization tests: trade-off between precision and FPGA constraints
 % - in parallel: Simulink and hand-written VHDL
-%
+
+
+From the LSQ algorithm, we have written a C program which uses only integer
+values that have been previously scaled. The quantization of doubles into
+integers has been performed in order to obtain a good trade-off between the
+number of bits used and the precision. Finally, we have compared the results
+of the integer and double precision versions of LSQ and observed that they
+were similar.
+
+Then we have built two versions of the VHDL code: one directly by hand coding
+and the other with Matlab, using the Simulink HDL Coder feature. Although the
+approaches are completely different, we have obtained VHDL codes that are
+quite comparable. Each approach has advantages and drawbacks. Roughly
+speaking, hand coding provides cleaner and much better structured code, while
+HDL Coder produces code faster.
+In terms of execution speed, we think that both versions will be quite
+comparable. Real experiments will confirm that. In the LSQ algorithm, we have
+replaced all the divisions by multiplications by a constant, since the
+divisions are performed with constants depending on the number of pixels in
+the profile (i.e. $M$).
+
 \subsection{Simulation}
 
+Currently, we have only simulated our VHDL codes with GHDL and GTKWave (two
+free tools under Linux). Both approaches led to correct results. In the first
+simulations, our pipeline could compute a new phase every 33 cycles and the
+length of the pipeline was equal to 95 cycles. When we tried to generate the
+bitstream with the ISE environment, we had many problems because many stages
+required more than the 10~ns available. So we needed to decompose some parts
+of the pipeline in order to add some cycles and simplify some stages.
 
 % ghdl + gtkwave
 % at best: one phase every 33 cycles, latency of 95 cycles.
 % but place/route impossible.
 
 \subsection{Bitstream creation}
 
+Currently, both versions can be synthesized into bitstreams with ISE. We
+expect that the pipeline will have a latency of 112 cycles, i.e. 1.12~$\mu$s,
+and that it will accept a new line of pixels every 48 cycles, i.e. every
+480~ns.
+
 % not done yet, but we expect an output every 480ns with a latency of 1120
 
 \label{sec:results}
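
The commit refers to a C reference program in which every double precision value
of LSQ is quantized into a scaled integer, but that program is not part of the
diff. The snippet below is only a minimal sketch of such a quantization under a
power-of-two scale factor; the constant QUANT_BITS, the helper names and the
example value are illustrative and do not come from the authors' code.

#include <stdint.h>
#include <stdio.h>

/* Number of fractional bits kept after quantization.  The value actually used
   in the paper is not given in this commit; 11 is only an example, chosen so
   that the scaled values still fit easily in an 18-bit signed DSP48 input.   */
#define QUANT_BITS 11
#define SCALE      (1 << QUANT_BITS)

/* Quantize a double into a scaled integer (round to nearest). */
static int32_t quantize(double x)
{
    return (int32_t)(x * SCALE + (x >= 0.0 ? 0.5 : -0.5));
}

/* Convert a scaled integer back to a double, e.g. to compare the integer
   version of LSQ with the double precision reference.                        */
static double dequantize(int32_t q)
{
    return (double)q / SCALE;
}

int main(void)
{
    double  coef = -0.837162;   /* arbitrary example value */
    int32_t q    = quantize(coef);

    /* DSP48 multipliers take 18-bit signed inputs at most; values outside
       [-2^17, 2^17 - 1] would force several DSPs to be combined.             */
    if (q < -(1 << 17) || q > (1 << 17) - 1)
        printf("warning: %d does not fit in 18 bits\n", q);

    printf("double = %f, quantized = %d, back = %f\n", coef, q, dequantize(q));
    return 0;
}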
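The diff also states that, in the LSQ pipeline, all divisions are replaced by
multiplications because the divisors are constants derived from the profile
size $M$. A common way to do this with integers is to multiply by a precomputed
fixed-point reciprocal and shift, as sketched below; the values of M and
RECIP_SHIFT are hypothetical, and the actual constants and bit widths of the
design are not given in the commit.

#include <stdint.h>
#include <stdio.h>

#define M           256   /* profile size: hypothetical value, not given here */
#define RECIP_SHIFT 20    /* precision of the fixed-point reciprocal          */

/* Reciprocal of M scaled by 2^RECIP_SHIFT.  In a pipelined design this is a
   constant, so no divider ever needs to be instantiated.                     */
static const int64_t RECIP_M = ((int64_t)1 << RECIP_SHIFT) / M;

/* x / M rewritten as a multiplication and a shift (assuming x >= 0), which
   both map onto a fully pipelined datapath.                                  */
static int32_t div_by_M(int32_t x)
{
    return (int32_t)(((int64_t)x * RECIP_M) >> RECIP_SHIFT);
}

int main(void)
{
    int32_t sum = 123456;   /* e.g. a sum over the M pixels of a profile */
    printf("exact: %d, mul+shift: %d\n", sum / M, div_by_M(sum));
    return 0;
}

When $M$ is not a power of two, the truncated reciprocal can differ from the
exact integer division by one unit for some inputs, so the shift width has to
be chosen according to the precision actually required.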
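Finally, assuming the 10~ns clock period implied by the timing constraint
mentioned in the simulation subsection, the latency and output rate quoted in
the bitstream subsection follow from a small calculation:

\[
  \text{latency} = 112 \times 10\,\text{ns} = 1.12\,\mu\text{s},
  \qquad
  \text{output rate} = 48 \times 10\,\text{ns} = 480\,\text{ns per line of pixels}.
\]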