+\section{Aberth method and difficulties}\r
+A cubically convergent iteration method for finding zeros of\r
+polynomials was proposed by O.Aberth~\cite{Aberth73}.The Aberth\r
+method is a purely algebraic derivation.To illustrate the\r
+derivation, we let $w_{i}(z)$ be the product of linear factor $\r
+w_{i}(z)=\prod_{j=1,j \neq i}^{n} (z-x_{j})$\r
+\r
+and rational function $R_{i}(z)$ be the correction term of\r
+Weistrass method~\cite{Weierstrass03}:\r
+\r
+\begin{equation}\r
+R_{i}(z)=\dfrac{p(z)}{w_{i}(Z)} , i=1,2,...,n.\r
+\end{equation}\r
+\r
+Differentiating the rational function $R_{i}(z)$ and applying the\r
+Newton method, we have\r
+\r
+\begin{equation}\r
+\dfrac{R_{i}(z)}{R_{i}^{'}(z)}=\r
+\dfrac{p(z)}{p^{'}(z)-p(z)\dfrac{w_{i}(z)}{w_{i}^{'}(z)}}=\r
+\dfrac{p(z)}{p^{'}(z)-p(z) \sum _{j=1,j \neq\r
+i}^{n}\dfrac{1}{z-x_{i}}}, i=1,2,...,n\r
+\end{equation}\r
+\r
+Substituting $x_{j}$ for z we obtain the Aberth iteration method\r
+\r
+Let present the means stages of Aberth's method.\r
+\r
+\subsection{Polynomials Initialization}\r
+ The initialization of polynomial P(z) with complex coefficients\r
+ are given by:\r
+\r
+\begin{equation}\r
+ p(z)=\sum{a_{i}z^{n-i}}. where a_{n} \neq 0,a_{0}=1, a_{i}\subset C\r
+\end{equation}\r
+\r
+\r
+\subsection{Vector $Z^{0)}$ Initialization}\r
+\r
+The choice of the initial points $z^{(0)}_{i} , i = 1, . . . , n,$\r
+from which starting the iteration (2) or (3), is rather delicate\r
+since the number of steps needed by the iterative method to reach\r
+a given approximation strongly depends on it.\r
+In~\cite{Aberth73}the Aberth iteration is started by selecting n\r
+equispaced points on a circle of center 0 and radius r, where r is\r
+an upper bound to the moduli of the zeros. After,~\cite{Bini96}\r
+performs this choice by selecting complex numbers along different\r
+circles and relies on the result of~\cite{Ostrowski41}.\r
+\r
+\begin{equation}\r
+\sigma_{0}=\frac{u+v}{2};u=\frac{\sum_{i=1}^{n}u_{i}}{n.max_{i=1}^{n}u_{i}};\r
+v=\frac{\sum_{i=0}^{n-1}v_{i}}{n.min_{i=0}^{n-1}v_{i}};u_{i}=2.|a_{i}|^{\frac{1}{i}};\r
+v_{i}=\frac{|\frac{a_{n}}{a_{i}}|^{\frac{1}{n-i}}}{2}\r
+\end{equation}\r
+\r
+\subsection{Iterative Function Hi}\r
+The operator used with Aberth's method is corresponding to the\r
+following equation which will enable the convergence towards\r
+polynomial solutions, provided all the roots are distinct.\r
+\r
+\begin{equation}\r
+H_{i}(z)=z_{i}-\frac{1}{\frac{P^{'}(z_{i})}{P(z_{i})}-\sum_{j\neq\r
+i}{\frac{1}{z_{i}-z_{j}}}}\r
+\end{equation}\r
+\r
+\subsection{Convergence condition}\r
+determines the success of the termination. It consists in stopping\r
+the iterative function $H_{i}(z)$ when the are stable,the method\r
+converge sufficiently:\r
+\r
+\begin{equation}\r
+\forall i \in\r
+[1,n];\frac{z_{i}^{(k)}-z_{i}^{(k-1)}}{z_{i}^{(k)}}<\xi\r
+\end{equation}\r
+\r
+\r
+\section{Difficulties and amelioration}\r
+the Aberth method implementation suffer of overflow problems. This\r
+situation occurs, for instance, in the case where a polynomial\r
+having positive coefficients and large degree is computed at a\r
+point $\xi$ where $|\xi| > 1$.Indeed the limited number in the\r
+mantissa of floating takings the computation of P(z) wrong when z\r
+is large. for example $(10^{50}) +1+ (- 10_{50})$ will give result\r
+0 instead of 1 in reality.consequently we can't compute the roots\r
+for large polynomial's degree. This problem was discuss in\r
+~\cite{Karimall98} for the Durand-Kerner method, the authors\r
+propose to use the logratihm and the exponential of a complex:\r
+\r
+\begin{equation}\r
+ \forall(x,y)\in R^{*2}; \ln (x+i.y)=\ln(x^{2}+y^{2})\r
+2+i.\arcsin(y\sqrt{x^{2}+y^{2}})_{\left] -\pi, \pi\right[ }\r
+\end{equation}\r
+%%\begin{equation}\r
+\begin{align}\r
+ \forall(x,y)\in R^{*2}; \exp(x+i.y)&= \exp(x).\exp(i.y)\\\r
+ &=\exp(x).\cos(y)+i.\exp(x).\sin(y)\r
+\end{align}\r
+%%\end{equation}\r
+\r
+The application of logarithm can replace any multiplications and\r
+divisions with additions and subtractions; consequently it\r
+manipulates lower absolute values and can be compute the roots for\r
+large polynomial's degree exceed~\cite{Karimall98}.\r
+\r
+Applying this solution for the Aberth method we obtain the\r
+iteration function with logarithm:\r
+%%$$ \exp \bigl( \ln(p(z)_{k})-ln(\ln(p(z)_{k}^{'}))- \ln(1- \exp(\ln(p(z)_{k})-ln(\ln(p(z)_{k}^{'})+\ln\sum_{i\neq j}^{n}\frac{1}{z_{k}-z_{j}})$$\r
+\begin{equation}\r
+H_{i}(z)=z_{i}^{k}-\exp \left(\ln \left(\r
+p(z_{k})\right)-\ln\left(p(z_{k}^{'})\right)- \ln\r
+\left(1-Q(z_{k})\right)\right)\r
+\end{equation}\r
+where:\r
+\r
+\begin{equation}\r
+Q(z_{k})=\exp\left( \ln (p(z_{k}))-\ln(p(z_{k}^{'}))+\ln \left(\r
+\sum_{k\neq j}^{n}\frac{1}{z_{k}-z_{j}}\right)\right)\r
+\end{equation}\r
+\r
+\r
+this solution is applying when it is necessary\r
+\r
+\section{The implementation of simultaneous methods in a parallel computer}\r
+ The main problem of the simultaneous methods is that the necessary\r
+time needed for the convergence is increased with the increasing\r
+of the degree of the polynomial. The parallelization of these\r
+algorithms will improve the convergence time. Researchers usually\r
+adopt one of the two following approaches to parallelize root\r
+finding algorithms. One approach is to reduce the total number of\r
+iterations as implemented by Miranker\r
+~\cite{Mirankar68,Mirankar71}, Schedler~\cite{Schedler72} and\r
+Winogard~\cite{Winogard72}. Another approach is to reduce the\r
+computation time per iteration, as reported\r
+in~\cite{Benall68,Jana06,Janall99,Riceall06}. There are many\r
+schemes for simultaneous approximations of all roots of a given\r
+polynomial. Several works on different methods and issues of root\r
+finding have been reported in~\cite{Azad07,Gemignani07,Kalantari08\r
+,Skachek08,Zhancall08,Zhuall08}. However, Durand-Kerner and\r
+Ehrlich methods are the most practical choices among\r
+them~\cite{Bini04}. These two methods have been extensively\r
+studied for parallelization due to their following advantages. The\r
+computation involved in these methods has some inherent\r
+parallelism that can be suitably exploited by SIMD machines.\r
+Moreover, they have fast rate of convergence (quadratic for the\r
+Durand-Kerner method and cubic for the Ehrlich). Various parallel\r
+algorithms reported for these methods can be found\r
+in~\cite{Cosnard90, Freeman89,Freemanall90,,Jana99,Janall99}.\r
+Freeman and Bane~\cite{Freemanall90} presented two parallel\r
+algorithms on a local memory MIMD computer with the compute-to\r
+communication time ratio O(n). However, their algorithms require\r
+each processor to communicate its current approximation to all\r
+other processors at the end of each iteration. Therefore they\r
+cause a high degree of memory conflict. Recently the author\r
+in~\cite{Mirankar71} proposed two versions of parallel algorithm\r
+for the Durand-Kerner method, and Aberth method on an on model of\r
+Optoelectronic Transpose Interconnection System (OTIS).The\r
+algorithms are mapped on an OTIS-2D torus using N processors. This\r
+solution need N processors to compute N roots, that it is not\r
+practical (is not suitable to compute large polynomial's degrees).\r
+Until then, the related works are not able to compute the root of\r
+the large polynomial's degrees (higher then 1000) and with small\r
+time.\r
+\r
+ Finding polynomial roots rapidly and accurately it is our\r
+objective, with the apparition of the CUDA(Compute Unified Device\r
+Architecture), finding the roots of polynomials becomes rewarding\r
+and very interesting, CUDA adopts a totally new computing\r
+architecture to use the hardware resources provided by GPU in\r
+order to offer a stronger computing ability to the massive data\r
+computing.in~\cite{Kahinall14} we proposed the first implantation\r
+of the root finding polynomials method on GPU (Graphics Processing\r
+Unit),which is the Durand-Kerner method. The main result prove\r
+that a parallel implementation is 10 times as fast as the\r
+sequential implementation on a single CPU for high degree\r
+polynomials that is greater than about 48000. Indeed, in this\r
+paper we present a parallel implementation of Aberth's method on\r
+GPU, more details are discussed in the following of this paper.\r
+\r
+\section {A parallel implementation of Aberth's method}\r
+\subsection{Background on the GPU architecture}\r
+A GPU is viewed as an accelerator for the data-parallel and\r
+intensive arithmetic computations. It draws its computing power\r
+from the parallel nature of its hardware and software\r
+architectures. A GPU is composed of hundreds of Streaming\r
+Processors (SPs) organized in several blocks called Streaming\r
+Multiprocessors (SMs). It also has a memory hierarchy. It has a\r
+private read-write local memory per SP, fast shared memory and\r
+read-only constant and texture caches per SM and a read-write\r
+global memory shared by all its SPs~\cite{NVIDIA10}\r
+\r
+ On a CPU equipped with a GPU, all the data-parallel and intensive\r
+functions of an application running on the CPU are off-loaded onto\r
+the GPU in order to accelerate their computations. A similar\r
+data-parallel function is executed on a GPU as a kernel by\r
+thousands or even millions of parallel threads, grouped together\r
+as a grid of thread blocks. Therefore, each SM of the GPU executes\r
+one or more thread blocks in SIMD fashion (Single Instruction,\r
+Multiple Data) and in turn each SP of a GPU SM runs one or more\r
+threads within a block in SIMT fashion (Single Instruction,\r
+Multiple threads). Indeed at any given clock cycle, the threads\r
+execute the same instruction of a kernel, but each of them\r
+operates on different data.\r
+ GPUs only work on data filled in their\r
+global memories and the final results of their kernel executions\r
+must be communicated to their CPUs. Hence, the data must be\r
+transferred in and out of the GPU. However, the speed of memory\r
+copy between the GPU and the CPU is slower than the memory\r
+bandwidths of the GPU memories and, thus, it dramatically affects\r
+the performances of GPU computations. Accordingly, it is necessary\r
+to limit data transfers between the GPU and its CPU during the\r
+computations.\r
+\subsection{Background on the CUDA Programming Model}\r
+\r
+The CUDA programming model is similar in style to a single program\r
+multiple-data (SPMD) softwaremodel. The GPU is treated as a\r
+coprocessor that executes data-parallel kernel functions. CUDA\r
+provides three key abstractions, a hierarchy of thread groups,\r
+shared memories, and barrier synchronization. Threads have a three\r
+level hierarchy. A grid is a set of thread blocks that execute a\r
+kernel function. Each grid consists of blocks of threads. Each\r
+block is composed of hundreds of threads. Threads within one block\r
+can share data using shared memory and can be synchronized at a\r
+barrier. All threads within a block are executed concurrently on a\r
+multithreaded architecture.The programmer specifies the number of\r
+threads per block, and the number of blocks per grid. A thread in\r
+the CUDA programming language is much lighter weight than a thread\r
+in traditional operating systems. A thread in CUDA typically\r
+processes one data element at a time. The CUDA programming model\r
+has two shared read-write memory spaces, the shared memory space\r
+and the global memory space. The shared memory is local to a block\r
+and the global memory space is accessible by all blocks. CUDA also\r
+provides two read-only memory spaces, the constant space and the\r
+texture space, which reside in external DRAM, and are accessed via\r
+read-only caches\r
+\r
+\subsection{A parallel implementation of the Aberth's method }\r
+%%\subsection{A CUDA implementation of the Aberth's method }\r
+%%\subsection{A GPU implementation of the Aberth's method }\r
+\r
+\r
+\r
+\subsubsection{A sequential Aberth algorithm}\r
+The means steps of Aberth's method can expressed as an algorithm\r
+like:\r
+ \r
+\begin{algorithm}[H]\r
+\LinesNumbered\r
+\caption{Algorithm to find root polynomial with Aberth method}\r
+\r
+\KwIn{$Z^{0}$(Initial root's vector),$\varepsilon$ (error\r
+tolerance threshold),P(Polynomial to solve)}\r
+\r
+\KwOut {Z(The solution root's vector)}\r
+\r
+\BlankLine\r
+\r
+Initialization of the parameter of the polynomial to solve\;\r
+Initialization of the solution vector $Z^{0}$\;\r
+\r
+\While {$\Delta z_{max}\succ \epsilon$}{\r
+ Let $\Delta z_{max}=0$\;\r
+\For{$j \gets 0 $ \KwTo $n$}{\r
+$ZPrec\left[j\right]=Z\left[j\right]$\;\r
+$Z\left[j\right]=H\left(j,Z\right)$\;\r
+}\r
+\r
+\For{$i \gets 0 $ \KwTo $n-1$}{\r
+$c=\frac{\left|Z\left[i\right]-ZPrec\left[i\right]\right|}{Z\left[i\right]}$\;\r
+\If{$c\succ\Delta z_{max}$ }{\r
+$\Delta z_{max}$=c\;}\r
+}\r
+}\r
+\end{algorithm}\r
+~\\ \r
+~\\ \r
+In this sequential algorithm one thread CPU execute all steps, let see the step 3 the execution of the iterative function , 2 instructions are needed, the first instruction \textit{save} the solution vector for the previous iteration, the second instruction \textit{update} or compute a new values of the roots.\r
+We have two manner to execute the iterative function, taking a Jacobi iteration who need all the previous value $z^{(k)}_{i}$ to compute the new value $z^{(k+1)}_{i}$we have:\r
+\r
+\begin{equation}\r
+H(i,z^{k+1})=\frac{p(z^{(k)}_{i})}{p'(z^{(k)}_{i})-p(z^{(k)}_{i})\sum^{n}_{j=1 j\neq i}\frac{1}{z^{(k)}_{i}-z^{(k)}_{j}}}, i=1,...,n.\r
+\end{equation}\r
+\r
+Or with the Gauss-seidel iteration, we have:\r
+\begin{equation}\r
+H(i,z^{k+1})=\frac{p(z^{(k)}_{i})}{p'(z^{(k)}_{i})-p(z^{(k)}_{i})\sum^{i-1}_{j=1}\frac{1}{z^{(k)}_{i}-z^{(k)}_{j}}+\sum^{n}_{j=i+1}\frac{1}{z^{(k)}_{i}-z^{(k)}_{j}}}, i=1,...,n.\r
+\end{equation}\r
+\r
+In formula(16) the iteration function use the $z^{k+1}_{i}$ computed in the current iteration to compute the rest of the roots, which take him to converge more quickly compare to the jacobi iteration (it's well now that the Gauss-seidel iteration converge more quickly because they used the most fresh computed root, so we used Gauss-seidel iteration.)\r
+\r
+The steps 4 of the Aberth's method compute the convergence of the roots, using(9) formula.\r
+Both steps 3 and 4 use 1 thread to compute N roots on CPU, which is faster and hard, it make the algorithm faster and hard for the large polynomial's roots finding.\r