%% In this sequential algorithm, one CPU thread executes all the steps. Let us look to the $3^{rd}$ step i.e. the execution of the iterative function, 2 sub-steps are needed. The first sub-step \textit{save}s the solution vector of the previous iteration, the second sub-step \textit{update}s or computes the new values of the roots vector.
\subsection{Parallel implementation with CUDA }
%In order to implement the Ehrlich-Aberth method in CUDA, it is
%possible to use the Jacobi scheme or the Gauss-Seidel one. With the
%Jacobi iteration, at iteration $k+1$ we need all the previous values
%\item Initialization of the solution vector $Z^{0}$\;
\item Initialization of the parameters of the root-finding problem (P, Pu, $Z^{0}$)\;
\item Allocate and copy initial data to the GPU global memory\;
If the modulus of the current complex value is less than or equal to a given value called the
radius, i.e. $|z^{k}_{i}| \leq R$, then the classical form of the EA
function Eq.~\ref{Eq:Hi} is executed, else the EA.EL function Eq.~\ref{Log_H2} is executed
(with Eq.~\ref{deflncomplex} and Eq.~\ref{defexpcomplex}). The radius $R$ is evaluated as in Eq.~\ref{R.EL}. \KG{Experimentally, it is difficult to solve high degree polynomials with the classical Ehrlich-Aberth method; therefore, when a root lies outside the circle of radius $R$, the kernel \textit{update} calls the EA.EL function Eq.~\ref{Log_H2} (with Eq.~\ref{deflncomplex} and Eq.~\ref{defexpcomplex}) at line 3, in order to take into account the limited magnitude of the floating-point numbers handled by processors and thus to compute more roots}. We notice that we use \verb=cuDoubleComplex= to handle complex numbers in CUDA, and the functions of the CUBLAS library to implement some vector operations on the GPU. We use the following functions:
code (some parameters in the kernels have been simplified in order to
increase the readability). As can be seen, this
kernel calls multiple kernels; all the kernels for complex numbers and for the evaluation of a polynomial are not detailed here.
\KG{}The kernel \verb=__global__EA_update()= is called from the host and executed on the device. It calls \verb=__device__FirstH_EA()= on the device to compute the roots that lie inside the circle of radius $R$, while \verb=__device__NewH_EA()= is called to compute the roots with the exp-log version. The Horner scheme is used to evaluate the polynomial and its derivative at $Z[i]$, with \verb=__device__Fonction()= and \verb=__device__FonctionD()= respectively; their exp-log versions are implemented in \verb=__device__LogFonction()= and \verb=__device__LogFonctionD()=:
\item \verb=__device__LogFonction()= to evaluate the polynomial $P$ when the modulus of $Z[i]$ is greater than the radius $R$.
\item \verb=__device__LogFonctionD()= to evaluate the derivative of the polynomial $P$ when the modulus of $Z[i]$ is greater than the radius $R$.
\end{itemize}
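To make the structure of the \textit{update} kernel more concrete, the following is a minimal sketch of the dispatch between the classical and the exp-log forms, together with a Horner evaluation. The signatures, the coefficient ordering (\verb=P[j]= holding the coefficient of $z^{j}$) and the bodies of \verb=FirstH_EA= and \verb=NewH_EA= (only declared here) are illustrative assumptions, not the exact code of our implementation.
\begin{verbatim}
#include <cuComplex.h>

// Horner scheme (sketch): P[j] is assumed to hold the coefficient of z^j.
__device__ cuDoubleComplex Fonction(const cuDoubleComplex *P, int degree,
                                    cuDoubleComplex z)
{
  cuDoubleComplex val = P[degree];
  for (int j = degree - 1; j >= 0; j--)
    val = cuCadd(cuCmul(val, z), P[j]);
  return val;
}

// Classical and exp-log updates of one root (declarations only in this sketch).
__device__ cuDoubleComplex FirstH_EA(const cuDoubleComplex *Z,
                                     const cuDoubleComplex *P,
                                     const cuDoubleComplex *Pu, int i, int n);
__device__ cuDoubleComplex NewH_EA(const cuDoubleComplex *Z,
                                   const cuDoubleComplex *P,
                                   const cuDoubleComplex *Pu, int i, int n);

// Update kernel: each thread updates one component of Z and selects the
// classical or the exp-log form according to the modulus of Z[i].
__global__ void EA_update(cuDoubleComplex *Z, const cuDoubleComplex *P,
                          const cuDoubleComplex *Pu, int n, double R)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    if (cuCabs(Z[i]) <= R)              // root inside the circle of radius R
      Z[i] = FirstH_EA(Z, P, Pu, i, n); // classical EA form (Eq. Hi)
    else                                // root outside the circle
      Z[i] = NewH_EA(Z, P, Pu, i, n);   // exp-log EA form (Eq. Log_H2)
  }
}
\end{verbatim}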
%need to call some function on device
%The Kernel \verb= EA_update= is a code implemented to executed on GPU and lanced in CPU,
This kernel is executed by a large number of GPU threads such that each thread is in charge of the computation of one component of the iterate vector $Z$. We set the size of a thread block, \textit{Threads}, to 512 threads, and the number of thread blocks of a kernel, \textit{Blocks}, is computed so that each GPU thread is in charge of one vector element, i.e. $Blocks = \lceil n / Threads \rceil$ for a polynomial of degree $n$.
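A possible host-side sketch of this configuration is given below; the variable names, the degree $n$ and the kernel arguments are illustrative and follow the simplified signatures of the previous sketch.
\begin{verbatim}
// Host-side sketch: allocate device memory, copy the initial data to the GPU
// global memory, and launch the update kernel with one thread per root.
void launch_update(const cuDoubleComplex *P, const cuDoubleComplex *Pu,
                   const cuDoubleComplex *Z0, int n, double R)
{
  cuDoubleComplex *d_P, *d_Pu, *d_Z;
  cudaMalloc(&d_P,  (n + 1) * sizeof(cuDoubleComplex)); // n+1 coefficients of P
  cudaMalloc(&d_Pu,  n      * sizeof(cuDoubleComplex)); // coefficients of P'
  cudaMalloc(&d_Z,   n      * sizeof(cuDoubleComplex)); // iterate vector Z
  cudaMemcpy(d_P,  P,  (n + 1) * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);
  cudaMemcpy(d_Pu, Pu,  n      * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);
  cudaMemcpy(d_Z,  Z0,  n      * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);

  int Threads = 512;                          // threads per block
  int Blocks  = (n + Threads - 1) / Threads;  // one GPU thread per vector element
  EA_update<<<Blocks, Threads>>>(d_Z, d_P, d_Pu, n, R);
}
\end{verbatim}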
Each GPU thread in the grid computes one root in parallel. If the polynomial degree exceeds the capacity of the grid, a Gauss-Seidel-like scheme is used: since the grid can only compute $Blocks \times Threads$ roots at the same time, when more roots have to be computed the grid reuses the roots already updated during the current iteration to compute the remaining ones, as in the following scheme:
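One way to express this scheme is a grid-stride loop, sketched below with the same illustrative names as before: each thread sweeps several components of $Z$, and the components updated earlier in the pass are read by the following updates of the same iteration, which gives the Gauss-Seidel behaviour described above.
\begin{verbatim}
// Sketch: when n exceeds Blocks*Threads, each thread handles several roots
// through a grid-stride loop; roots updated earlier in this pass are read by
// the later updates of the same iteration (Gauss-Seidel-like behaviour).
__global__ void EA_update_stride(cuDoubleComplex *Z, const cuDoubleComplex *P,
                                 const cuDoubleComplex *Pu, int n, double R)
{
  int stride = gridDim.x * blockDim.x;  // total number of threads in the grid
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
    if (cuCabs(Z[i]) <= R)
      Z[i] = FirstH_EA(Z, P, Pu, i, n); // classical form
    else
      Z[i] = NewH_EA(Z, P, Pu, i, n);   // exp-log form
  }
}
\end{verbatim}
No synchronization is enforced between the successive passes of the loop, so a later update may read components that have just been refreshed, which is precisely the behaviour described above.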
The last kernel checks the convergence of the roots after each update
of $Z^{k}$, according to formula Eq.~\ref{eq:Aberth-Conv-Cond}. We use the functions of the CUBLAS library (CUDA Basic Linear Algebra Subroutines) to implement this kernel.
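As an illustration only (the exact criterion is the one of Eq.~\ref{eq:Aberth-Conv-Cond}), such a convergence test can combine a small kernel that stores the per-root change with a CUBLAS reduction; all names below are assumptions of the sketch, and the absolute change is used for simplicity.
\begin{verbatim}
#include <cuComplex.h>
#include <cublas_v2.h>

// Stores the modulus of the change of each root between two iterations.
__global__ void delta_kernel(const cuDoubleComplex *Z,
                             const cuDoubleComplex *Zprev,
                             double *delta, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    delta[i] = cuCabs(cuCsub(Z[i], Zprev[i]));   // |z_i^k - z_i^{k-1}|
}

// Host-side test: the iteration stops when the largest change is below eps.
bool has_converged(cublasHandle_t handle, const cuDoubleComplex *d_Z,
                   const cuDoubleComplex *d_Zprev, double *d_delta,
                   int n, double eps)
{
  int Threads = 512, Blocks = (n + Threads - 1) / Threads;
  delta_kernel<<<Blocks, Threads>>>(d_Z, d_Zprev, d_delta, n);

  int imax = 0;                          // 1-based index of the largest change
  cublasIdamax(handle, n, d_delta, 1, &imax);
  double maxDelta = 0.0;
  cudaMemcpy(&maxDelta, d_delta + imax - 1, sizeof(double),
             cudaMemcpyDeviceToHost);
  return maxDelta < eps;
}
\end{verbatim}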
The kernel terminates its computations when all the roots have
converged. It should be noticed that, as blocks of threads are