Nevertheless, the cost of transferring profile in GPU memory and of taking
back results would be prohibitive compared to computation time.
-We remark that when possible, it is more efficient to pipeline the
+It should be noted that, when possible, it is more efficient to pipeline the
computation. For example, supposing that 200 profiles of 20 pixels
could be pushed sequentially in a pipelined unit cadenced at a 100MHz
(i.e. a pixel enters in the unit each 10ns), all profiles would be
they can perform pipelines as well as parallel operations. A pipeline
consists in cutting a process in a sequence of small tasks, taking the same
execution time. It accepts a new data at each clock top, thus, after a known
-latency, it also provides a result at each clock top. We observe that the
+latency, it also provides a result at each clock top. The drawback is that the
components of a task are not reusable by another one. Nevertheless, this is
the most efficient technique on FPGAs. Because of their architecture, it is
also very easy to process several data concurrently. Finally, the best
Consequently, we have determined the maximum value of each variable as
a function of the scale factors and the profile size involved in the
-algorithm. It gave us the the maximum number of bits necessary to code
+algorithm. It gave us the maximum number of bits necessary to code
them. We have chosen the scale factors so that any variable (except
the covariance) fits in 18 bits, which is the maximum input size of
DSPs. In this way, all multiplications (except one with covariance)
{HDLCoder}. Although the approaches are completely different we obtained
quite comparable VHDL codes. Each approach has advantages and drawbacks.
Roughly speaking, hand coding provides beautiful and much better structured
-code while Simulink HDL coder allows for fast code production. In
+code while Simulink HDL coder allows fast code production. In
terms of throughput and latency, simulations show that the two approaches
yield close results with a slight advantage for hand coding.
components. It is mainly used to start to flush profiles and to
retrieve the computed phases in RAM. Unfortunately, the first designs
could not be placed and routed with ISE on the Spartan6 with a 100MHz
-clock. The main problems were encountered with series of arthmetic
+clock. The main problems were encountered with series of arithmetic
operations and more especially with RAM outputs used in DSPs. So, we
needed to decompose some parts of the pipeline, which added few clock
cycles. Finally, we obtained a bitstream that has been successfully
tested on the board.
Its latency is of 112 cycles and it computes a new phase every 40
-cycles. For 100 cantilevers, it takes $(112+200\times 40).10=81.12\mu
+cycles. For 100 cantilevers, it takes $(112+200\times 40)\times 10\,\mathrm{ns}=81.12\,\mu
$s to compute their deflection. It corresponds to about 12300 images
per second, which is largely beyond the camera capacities and the
possibility to extract a new profile from an image every 40