From: couturie
Date: Fri, 28 Oct 2011 15:45:26 +0000 (+0200)
Subject: qlq modifs
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/dmems12.git/commitdiff_plain/06332ca3be475a4ea52097c17c17f0998628c699?ds=sidebyside

qlq modifs
---

diff --git a/dmems12.tex b/dmems12.tex
index ef49def..698f99e 100644
--- a/dmems12.tex
+++ b/dmems12.tex
@@ -268,7 +268,7 @@
 obvious solution is to parallelize the computations, for example on a
 GPU. Nevertheless, the cost of transferring profile in GPU memory and of
 taking back results would be prohibitive compared to computation time.
-We remark that when possible, it is more efficient to pipeline the
+It should be noticed that when possible, it is more efficient to pipeline the
 computation. For example, supposing that 200 profiles of 20 pixels
 could be pushed sequentially in a pipelined unit cadenced at a 100MHz
 (i.e. a pixel enters in the unit each 10ns), all profiles would be
@@ -332,7 +332,7 @@ Furthermore, even if FPGAs are cadenced more slowly than classic processors,
 they can perform pipelines as well as parallel operations. A pipeline consists
 in cutting a process in a sequence of small tasks, taking the same execution
 time. It accepts a new data at each clock top, thus, after a known
-latency, it also provides a result at each clock top. We observe that the
+latency, it also provides a result at each clock top. The drawback is that the
 components of a task are not reusable by another one. Nevertheless, this is
 the most efficient technique on FPGAs. Because of their architecture, it is
 also very easy to process several data concurrently. Finally, the best
@@ -727,7 +727,7 @@
 factors. Consequently, we have determined the maximum value of each
 variable as a function of the scale factors and the profile size
 involved in the
-algorithm. It gave us the the maximum number of bits necessary to code
+algorithm. It gave us the maximum number of bits necessary to code
 them.
 We have chosen the scale factors so that any variable (except the covariance)
 fits in 18 bits, which is the maximum input size of DSPs. In this way,
 all multiplications (except one with covariance)
@@ -757,7 +757,7 @@
 coding and the other with Matlab using the Simulink HDL coder feature~\cite%
 {HDLCoder}. Although the approaches are completely different we obtained
 quite comparable VHDL codes. Each approach has advantages and drawbacks.
 Roughly speaking, hand coding provides beautiful and much better structured
-code while Simulink HDL coder allows for fast code production. In
+code while Simulink HDL coder allows fast code production. In
 terms of throughput and latency, simulations show that the two approaches
 yield close results with a slight advantage for hand coding.
@@ -784,14 +784,14 @@
 in order to "drive" signals to communicate between i.MX and other
 components. It is mainly used to start to flush profiles and to retrieve
 the computed phases in RAM. Unfortunately, the first designs could
 not be placed and routed with ISE on the Spartan6 with a 100MHz
-clock. The main problems were encountered with series of arthmetic
+clock. The main problems were encountered with series of arithmetic
 operations and more especially with RAM outputs used in DSPs. So, we needed to
 decompose some parts of the pipeline, which added few clock cycles. Finally, we
 obtained a bitstream that has been successfully tested on the board.

 Its latency is of 112 cycles and it computes a new phase every 40
-cycles. For 100 cantilevers, it takes $(112+200\times 40).10=81.12\mu
+cycles. For 100 cantilevers, it takes $(112+200\times 40)\times 10ns =81.12\mu
 $s to compute their deflection. It corresponds to about 12300 images per
 second, which is largely beyond the camera capacities and the possibility
 to extract a new profile from an image every 40
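As a sanity check of the timing formula corrected in the last hunk, the figures can be reproduced with a short Python sketch. The constants (112-cycle latency, one new phase every 40 cycles, a 10 ns clock period at 100 MHz) come from the paper's text; that 200 phases are computed for the 100 cantilevers is read off the $200\times 40$ term in the formula as written, not stated elsewhere:

```python
# Constants taken from the paper's last paragraph (hedged: the count of
# 200 phases for 100 cantilevers is inferred from the formula itself).
LATENCY_CYCLES = 112     # pipeline latency
CYCLES_PER_PHASE = 40    # a new phase is produced every 40 cycles
NUM_PHASES = 200         # phases computed per image
CLOCK_PERIOD_NS = 10     # 100 MHz clock -> 10 ns per cycle

total_cycles = LATENCY_CYCLES + NUM_PHASES * CYCLES_PER_PHASE
total_us = total_cycles * CLOCK_PERIOD_NS / 1000   # convert ns to microseconds
images_per_second = 1e6 / total_us

print(total_cycles)              # 8112 cycles
print(total_us)                  # 81.12 µs, matching the corrected formula
print(round(images_per_second))  # 12327, i.e. "about 12300 images per second"
```

The result confirms that replacing the garbled `.10=` with `\times 10ns =` in the diff yields a consistent figure: 8112 cycles at 10 ns each is 81.12 µs, or roughly 12300 deflection computations per second.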