texture space, which reside in external DRAM, and are accessed via
read-only caches.
-\section{ The implementation of Ehrlich-Aberth method on GPU}
+\section{Implementation of the Ehrlich-Aberth method on GPU}
\label{sec5}
%%\subsection{A CUDA implementation of the Aberth's method }
%%\subsection{A GPU implementation of the Aberth's method }
-\subsection{A sequential Ehrlich-Aberth algorithm}
+\subsection{Sequential Ehrlich-Aberth algorithm}
The main steps of the Ehrlich-Aberth method are shown in Algorithm~\ref{alg1-seq}:
%\LinesNumbered
\begin{algorithm}[H]
\caption{A sequential algorithm to find roots with the Ehrlich-Aberth method}
-\KwIn{$Z^{0}$(Initial root's vector),$\varepsilon$ (error tolerance threshold), P(Polynomial to solve),$\Delta z_{max}$ (maximum value of stop condition),k (number of iteration),n(Polynomial's degrees)}
-\KwOut {Z (The solution root's vector),ZPrec (the previous solution root's vector)}
+\KwIn{$Z^{0}$ (initial vector of roots), $\varepsilon$ (error tolerance
+ threshold), P (polynomial to solve), $\Delta z_{max}$ (maximum value
+ of the stopping condition), k (number of iterations), n (degree of the polynomial)}
+\KwOut{Z (solution vector of roots), ZPrec (solution vector of roots from the previous iteration)}
\BlankLine
EAGS: z^{k+1}_{i}=z^{k}_{i}-\frac{p(z^{k}_{i})}{p'(z^{k}_{i})-p(z^{k}_{i})\left(\sum^{i-1}_{j=1}\frac{1}{z^{k}_{i}-z^{k+1}_{j}}+\sum^{n}_{j=i+1}\frac{1}{z^{k}_{i}-z^{k}_{j}}\right)}, \quad i=1,\ldots,n.
\end{equation}
%%Here a finiched my revision %%
-Using Equation.~\ref{eq:Aberth-H-GS} to update the vector solution \textit{Z}, we expect the Gauss-Seidel iteration to converge more quickly because, just as its ancestor (for solving linear systems of equations), it uses the most fresh computed roots $z^{k+1}_{i}$.
+Using Equation~\ref{eq:Aberth-H-GS} to update the solution vector
+\textit{Z}, we expect the Gauss-Seidel iteration to converge more
+quickly because, just like the Gauss-Seidel method for solving linear systems of equations, it uses the most recently computed roots $z^{k+1}_{j}$, $j<i$.
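To make the link with Newton's method explicit, the fraction in Equation~\ref{eq:Aberth-H-GS} can be rewritten by dividing its numerator and denominator by $p'(z^{k}_{i})$, assumed nonzero. The symbols $N_i$ and $S_i$ below are shorthand introduced only for this remark:
\[
\frac{p(z^{k}_{i})}{p'(z^{k}_{i})-p(z^{k}_{i})\,S_i}=\frac{N_i}{1-N_i S_i},
\qquad N_i=\frac{p(z^{k}_{i})}{p'(z^{k}_{i})},
\qquad S_i=\sum^{i-1}_{j=1}\frac{1}{z^{k}_{i}-z^{k+1}_{j}}+\sum^{n}_{j=i+1}\frac{1}{z^{k}_{i}-z^{k}_{j}}.
\]
Here $N_i$ is the Newton correction at $z^{k}_{i}$, and $S_i$ is the only term that couples root $i$ to the other roots.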
The fourth step of the algorithm checks the convergence condition using Equation~\ref{eq:Aberth-Conv-Cond}.
On the CPU, steps 3 and 4 both use a single thread to compute all $n$ roots, which severely limits performance for polynomials of large degree.
-\subsection{A Parallel implementation with CUDA }
+\subsection{Parallel implementation with CUDA }
On the CPU, steps 3 and 4 each contain a \verb=for= loop, and a single thread executes the loop body $n$ times. In this subsection, we explain how the GPU architecture can execute this loop in parallel and reduce the execution time.
On the GPU, the scheduler assigns the execution of this loop to a group of threads organised as a grid of blocks, each block containing a number of threads. All threads within a block are executed concurrently. The instructions run on the GPU are grouped in special functions called kernels. It is up to the programmer to describe the execution context, that is, the size of the grid (number of blocks) and the number of threads per block, when calling a given kernel, using a special syntax defined by CUDA.
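As an illustration of this execution model, the following is a minimal sketch, not the implementation described in this paper. The kernel name \verb=ea_update=, the coefficient layout, the block size of 256 and the test polynomial $z^2-1$ are all assumptions made for the example. Each thread updates one root, so the CPU \verb=for= loop over $i$ disappears; because every thread reads the previous vector, the sketch performs a Jacobi-style update rather than the Gauss-Seidel one.
\begin{verbatim}
#include <cstdio>
#include <cuComplex.h>
#include <cuda_runtime.h>

// Hypothetical kernel: thread i updates root i, replacing the CPU for-loop.
// coef[0..n] holds the polynomial coefficients, highest degree first.
__global__ void ea_update(const cuDoubleComplex *coef, int n,
                          const cuDoubleComplex *zprec, cuDoubleComplex *z)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i >= n) return;                             // surplus threads idle

    cuDoubleComplex zi = zprec[i];
    // Horner scheme for p(zi) and p'(zi).
    cuDoubleComplex p  = coef[0];
    cuDoubleComplex dp = make_cuDoubleComplex(0.0, 0.0);
    for (int k = 1; k <= n; ++k) {
        dp = cuCadd(cuCmul(dp, zi), p);
        p  = cuCadd(cuCmul(p, zi), coef[k]);
    }
    // Sum of 1/(z_i - z_j), j != i, read from the previous iterate.
    cuDoubleComplex one = make_cuDoubleComplex(1.0, 0.0);
    cuDoubleComplex s   = make_cuDoubleComplex(0.0, 0.0);
    for (int j = 0; j < n; ++j)
        if (j != i) s = cuCadd(s, cuCdiv(one, cuCsub(zi, zprec[j])));

    // z_i <- z_i - p / (p' - p * s)
    cuDoubleComplex denom = cuCsub(dp, cuCmul(p, s));
    z[i] = cuCsub(zi, cuCdiv(p, denom));
}

int main()
{
    const int n = 2;  // assumed test polynomial p(z) = z^2 - 1
    cuDoubleComplex h_coef[n + 1] = { make_cuDoubleComplex(1.0, 0.0),
                                      make_cuDoubleComplex(0.0, 0.0),
                                      make_cuDoubleComplex(-1.0, 0.0) };
    cuDoubleComplex h_z[n] = { make_cuDoubleComplex(0.5, 0.5),
                               make_cuDoubleComplex(-0.3, 0.2) };

    cuDoubleComplex *d_coef, *d_zprec, *d_z;
    cudaMalloc(&d_coef, (n + 1) * sizeof(cuDoubleComplex));
    cudaMalloc(&d_zprec, n * sizeof(cuDoubleComplex));
    cudaMalloc(&d_z, n * sizeof(cuDoubleComplex));
    cudaMemcpy(d_coef, h_coef, (n + 1) * sizeof(cuDoubleComplex),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_zprec, h_z, n * sizeof(cuDoubleComplex),
               cudaMemcpyHostToDevice);

    // Execution context: threads per block and number of blocks in the grid.
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;  // round up to cover n roots

    for (int it = 0; it < 20; ++it) {           // fixed iteration count
        ea_update<<<blocks, threads>>>(d_coef, n, d_zprec, d_z);
        cuDoubleComplex *tmp = d_zprec;         // new roots become ZPrec
        d_zprec = d_z;
        d_z = tmp;
    }

    cudaMemcpy(h_z, d_zprec, n * sizeof(cuDoubleComplex),
               cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        printf("z[%d] = %f + %fi\n", i, cuCreal(h_z[i]), cuCimag(h_z[i]));

    cudaFree(d_coef);
    cudaFree(d_zprec);
    cudaFree(d_z);
    return 0;
}
\end{verbatim}
The launch syntax \verb=<<<blocks, threads>>>= is where the programmer fixes the grid size and the number of threads per block. Rounding \verb=blocks= up ensures that at least $n$ threads are created, and the bounds test at the top of the kernel idles the surplus ones.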