From b5dcb332822aed6879619b36c36ad800b7672e2f Mon Sep 17 00:00:00 2001
From: Raphael Couturier
Date: Fri, 28 Oct 2011 10:55:54 +0200
Subject: [PATCH] new

---
 dmems12.tex | 31 +++++++++++++++----------------
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/dmems12.tex b/dmems12.tex
index 24460dd..0c00ee0 100644
--- a/dmems12.tex
+++ b/dmems12.tex
@@ -115,7 +115,7 @@ method and its implementation on a FPGA is presented in Section~\ref{sec:solus}.
 
 \label{sec:measure}
 
-In order to develop simple, cost effective and user-friendly cantilever
+In order to build simple, cost effective and user-friendly cantilever
 arrays, authors of ~\cite{AFMCSEM11} have developed a system based on
 interferometry.
 
@@ -183,7 +183,7 @@ through an operation called unwrapping where it is assumed that the deflection
 means along the two measurement segments are linearly dependent. The third
 is on the base and provides a reference for noise suppression. Finally,
 deflections are simply derived from phase
-shifts.\\
+shifts.
 
 The pixel gray-level intensity $I$ of each profile is modelized by%
 \begin{equation}
@@ -229,7 +229,7 @@ bottleneck of the whole process. For example, the camera in the setup of
 \cite{AFMCSEM11} provides $%
 1024\times 1204$ pixels with an exposition time of 2.5ms. Thus, if we
 the pixel extraction time is neglected, each phase calculation of a
-100-cantilever array should take no more than 12.5$\mu$s. \newline
+100-cantilever array should take no more than 12.5$\mu$s.
 
 In fact, this timing is a very hard constraint. To illustrate this point, we
 consider a very small program that initializes twenty million of doubles in
@@ -239,7 +239,7 @@ at 2.33GHz, this program reaches an average of 155Mflops. Obviously, some
 cache effects and optimizations on huge amount of computations can
 drastically increase these performances: peak efficiency is about
 2.5Gflops for the considered CPU. But this is not the case for phase
-computation that is using only a few tenth of values.\newline
+computation that is using only a few tenth of values.
 
 In order to evaluate the original algorithm, we translated it in C language.
 As stated before, for 20 pixels, it does about 1,550 operations, thus an
@@ -250,7 +250,7 @@ device file representing the camera. We obtained an average of 10.5$\mu$s by
 profile (including I/O accesses). It is under our requirements but close to
 the limit. In case of an occasional load of the system, it could be largely
 overtaken. Solutions would be to use a real-time operating system or
-to search for a more efficient algorithm.\newline
+to search for a more efficient algorithm.
 
 However, the main drawback is the latency of such a solution because each
 profile must be treated one after another and the deflection of 100
@@ -258,7 +258,7 @@ cantilevers takes about $200\times 10.5=2.1$ms. This would be inadequate
 for real-time requirements as for individual cantilever active control. An
 obvious solution is to parallelize the computations, for example on a GPU.
 Nevertheless, the cost of transferring profile in GPU memory and of taking
-back results would be prohibitive compared to computation time.\newline
+back results would be prohibitive compared to computation time.
 
 We remark that when possible, it is more efficient to pipeline the
 computation. For example, supposing that 200 profiles of 20 pixels
@@ -278,19 +278,18 @@ points are discussed in the following sections.
 
 \label{sec:solus}
 
-In this section we present part of the computing solution to the above
-requirements. We first give some general information about FPGAs, then we
+In this section we present parts of the computing solution to the above
+requirements. The hardware part consists in a high-speed camera, linked on an
+embedded board hosting two FPGAs. In this way, the camera output stream can be
+pushed directly into the FPGA. The software part is mostly the VHDL code that
+deserializes the camera stream, extracts profiles and computes the deflection.
+
+We first give some general information about FPGAs, then we
 describe the FPGA board we use for implementation and finally the two
 algorithms for phase computation are detailed. Presentation of VHDL
 implementations is postponned until Section \ref{Experimental tests}.
-\newline
-The hardware part consists in a high-speed camera, linked on an embedded
-board hosting two FPGAs. In this way, the camera output stream can be pushed
-directly into the FPGA. The software part is mostly the VHDL code that
-deserializes the camera stream, extracts profiles and computes the
-deflection. Before to present the board we use, we give some general
-information about FPGAs.
+
 
 \subsection{Elements of FPGA architecture and programming}
 
 
@@ -665,7 +664,7 @@ $atan$) for SPL. This result is largely in favor of LSQ. Nevertheless,
 considering the total number of operations is not fully relevant for FPGA
 implementation which time and space consumption depends not only on the type
 of operations but also of their ordering. The final evaluation is thus very
-much driven by the third criterion.\newline
+much driven by the third criterion.
 
 The Spartan 6 used in our architecture has a hard constraint since it has
 no built-in floating point units. Obviously, it is possible to use
-- 
2.39.5
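
The hunk around line 229 mentions a small throughput test that fills twenty million doubles and reaches about 155 Mflops on a 2.33 GHz CPU. Since the patch shows only fragments of that description, the C sketch below (C being the language the authors say they used for their evaluation) is an assumed illustration of how such a Mflops figure can be measured; the accumulation kernel and timing method are guesses, not the authors' actual program.

/* Assumed illustration: fill twenty million doubles, run one
 * floating-point pass over them and report millions of flops per
 * second. The kernel (a single accumulation) is a guess; only the
 * array size and the Mflops reporting come from the text. */
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000               /* twenty million doubles, as in the text */

int main(void)
{
    double *v = malloc(N * sizeof *v);
    if (v == NULL)
        return 1;

    for (long i = 0; i < N; i++)         /* initialization pass */
        v[i] = (double)i * 0.5;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    double acc = 0.0;
    for (long i = 0; i < N; i++)         /* one addition per element */
        acc += v[i];

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;

    /* N additions were performed; report millions of flops per second */
    printf("acc = %g, %.1f Mflops\n", acc, N / secs / 1e6);
    free(v);
    return 0;
}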