+ Finding polynomial roots rapidly and accurately is our
+objective. With the advent of CUDA (Compute Unified Device
+Architecture), finding the roots of polynomials has become both
+rewarding and very interesting: CUDA adopts a new computing
+architecture that exploits the hardware resources provided by the
+GPU in order to offer greater computing power for massive
+data-parallel computations. In~\cite{Kahinall14} we proposed the
+first implementation on a GPU (Graphics Processing Unit) of a
+polynomial root finding method, namely the Durand-Kerner method.
+The main result showed that the parallel implementation is about
+10 times faster than the sequential implementation on a single CPU
+for polynomials of high degree (greater than about 48,000). In
+this paper we present a parallel implementation of Aberth's method
+on a GPU; more details are given in the remainder of this paper.
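+For reference, Aberth's method updates all $n$ root approximations
+simultaneously; writing $w_i^{(k)} = p(z_i^{(k)})/p'(z_i^{(k)})$ for the
+Newton correction, one standard form of the iteration is
+\begin{equation}
+z_i^{(k+1)} = z_i^{(k)} - \frac{w_i^{(k)}}{1 - w_i^{(k)} \sum_{j \neq i} \dfrac{1}{z_i^{(k)} - z_j^{(k)}}}, \qquad i = 1, \dots, n.
+\end{equation}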
+\r
+\section{A parallel implementation of Aberth's method}
+\subsection{Background on the GPU architecture}\r
+A GPU can be viewed as an accelerator for data-parallel and
+arithmetic-intensive computations. It draws its computing power
+from the parallel nature of its hardware and software\r
+architectures. A GPU is composed of hundreds of Streaming\r
+Processors (SPs) organized in several blocks called Streaming\r
+Multiprocessors (SMs). It also has a memory hierarchy: a private
+read-write local memory per SP; fast shared memory and read-only
+constant and texture caches per SM; and a read-write global memory
+shared by all its SPs~\cite{NVIDIA10}.
+\r
+ On a CPU equipped with a GPU, all the data-parallel and
+compute-intensive functions of an application running on the CPU
+are off-loaded onto the GPU in order to accelerate their
+computation. Such a data-parallel function is executed on the GPU
+as a kernel by thousands or even millions of parallel threads,
+grouped together as a grid of thread blocks. Each SM of the GPU executes
+one or more thread blocks in SIMD fashion (Single Instruction,\r
+Multiple Data) and in turn each SP of a GPU SM runs one or more\r
+threads within a block in SIMT fashion (Single Instruction,\r
+Multiple Threads). At any given clock cycle, the threads execute
+the same instruction of a kernel, but each of them operates on
+different data.
+ GPUs only work on data filled in their\r
+global memories and the final results of their kernel executions\r
+must be communicated to their CPUs. Hence, the data must be\r
+transferred in and out of the GPU. However, the speed of memory\r
+copy between the GPU and the CPU is slower than the memory\r
+bandwidths of the GPU memories and, thus, it dramatically affects\r
+the performances of GPU computations. Accordingly, it is necessary\r
+to limit data transfers between the GPU and its CPU during the\r
+computations.\r
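+As an illustrative sketch (with hypothetical names, not the
+implementation evaluated in this paper), the host-side pattern
+below copies the input once, runs many kernel iterations on the
+device, and copies the result back once, keeping all CPU-GPU
+transfers out of the main compute loop:
+\begin{verbatim}
+// Sketch only: hypothetical kernel `iterate`; error checks omitted.
+double *d_z;
+cudaMalloc(&d_z, n * sizeof(double));
+// one host-to-device copy before the compute loop
+cudaMemcpy(d_z, h_z, n * sizeof(double), cudaMemcpyHostToDevice);
+for (int k = 0; k < maxIter; k++) {
+    // every iteration stays on the GPU: no intermediate transfers
+    iterate<<<numBlocks, threadsPerBlock>>>(d_z, n);
+}
+// one device-to-host copy after the loop finishes
+cudaMemcpy(h_z, d_z, n * sizeof(double), cudaMemcpyDeviceToHost);
+cudaFree(d_z);
+\end{verbatim}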
+\subsection{Background on the CUDA Programming Model}\r
+\r
+The CUDA programming model is similar in style to a single-program
+multiple-data (SPMD) software model. The GPU is treated as a
+coprocessor that executes data-parallel kernel functions. CUDA\r
+provides three key abstractions: a hierarchy of thread groups,
+shared memories, and barrier synchronization. Threads have a
+three-level hierarchy: a grid is a set of thread blocks that
+execute a kernel function, and each block is composed of hundreds
+of threads. Threads within one block
+can share data using shared memory and can be synchronized at a\r
+barrier. All threads within a block are executed concurrently on a\r
+multithreaded architecture. The programmer specifies the number of
+threads per block, and the number of blocks per grid. A thread in\r
+the CUDA programming language is much lighter weight than a thread\r
+in traditional operating systems. A thread in CUDA typically\r
+processes one data element at a time. The CUDA programming model\r
+has two shared read-write memory spaces, the shared memory space\r
+and the global memory space. The shared memory is local to a block\r
+and the global memory space is accessible by all blocks. CUDA also\r
+provides two read-only memory spaces, the constant space and the\r
+texture space, which reside in external DRAM, and are accessed via\r
+read-only caches.
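+
+A minimal sketch of these abstractions (hypothetical names, not
+the kernels of this paper): each thread computes its global index
+from its block and thread coordinates and processes one data
+element, and the programmer chooses the launch configuration:
+\begin{verbatim}
+// Sketch only: a kernel in which one thread handles one element.
+__global__ void scale(double *data, double alpha, int n) {
+    int i = blockIdx.x * blockDim.x + threadIdx.x;
+    if (i < n)
+        data[i] = alpha * data[i];
+}
+// The programmer specifies threads per block and blocks per grid.
+int threadsPerBlock = 256;
+int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
+scale<<<numBlocks, threadsPerBlock>>>(d_data, 2.0, n);
+\end{verbatim}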
+\r
+\subsection{A GPU implementation of Aberth's method}
+\subsubsection{The parallelized step}
+\subsubsection{The corresponding kernel}
+\subsubsection{Comparison between the sequential and GPU algorithms}