-The main problem of the simultaneous methods is that the necessary
-time needed for the convergence increases with the increasing of the
-polynomial's degree. Many authors have treated the problem of
-implementing simultaneous methods in
-parallel. Freeman~\cite{Freeman89} implemented and compared
-Durand-Kerner method, Ehrlich-Aberth method and another method of the
-fourth order of convergence proposed by Farmer and
-Loizou~\cite{Loizou83} on a 8-processor linear chain, for polynomials
-of degree up-to 8. The method of Farmer and Loizou~\cite{Loizou83}
-often diverges, but the first two methods (Durand-Kerner and
-Ehrlich-Aberth methods) have a speed-up equals to 5.5. Later, Freeman
-and Bane~\cite{Freemanall90} considered asynchronous algorithms in
-which each processor continues to update its approximations even
-though the latest values of other approximations $z^{k}_{i}$ have not
-been received from the other processors, in contrast with synchronous
-algorithms where it would wait those values before making a new
-iteration. Couturier and al.~\cite{cs01:nj} proposed two methods
-of parallelization for a shared memory architecture with OpenMP and
-for a distributed memory one with MPI. They are able to compute the
-roots of sparse polynomials of degree 10,000. The authors showed an interesting
-speedup that is 20 times as fast as the sequential implementation.
-
-
-Very few work had been performed since then until the appearing of the Compute Unified Device Architecture (CUDA)~\cite{CUDA15}, a parallel computing platform and a programming model invented by NVIDIA. The computing power of GPUs (Graphics Processing Units) has exceeded that of traditional processors CPUs. However, CUDA adopts a totally new computing architecture to use the hardware resources provided by the GPU in order to offer a stronger computing ability to the massive data computing. Ghidouche et al.~\cite{Kahinall14} proposed an implementation of the Durand-Kerner method on a single GPU. Their main results showed that a parallel CUDA implementation is about 10 times faster than the sequential implementation on a single CPU for sparse polynomials of degree 48,000. %\LZK{Y a pas d'autres travaux pour la résolution de polynômes sur GPUs?}
-
-In this paper we propose the parallelization of Ehrlich-Aberth method which has a good convergence and it is suitable to be implemented in parallel computers. We use two parallel programming paradigms OpenMP and MPI on CUDA multi-GPU platforms. Our CUDA-MPI and CUDA-OpenMP codes are the first implementations of Ehrlich-Aberth method with multiple GPUs for finding roots of polynomials. Our major contributions include:
+
+The convergence time of simultaneous methods drastically increases with the increasing of the polynomial's degree. The great challenge with simultaneous methods is to parallelize them and to improve their convergence. Many authors have proposed parallel simultaneous methods~\cite{Freeman89,Loizou83,Freemanall90,bini96,cs01:nj,Couturier02}, using several paradigms of parallelization (synchronous or asynchronous computations, mechanism of shared or distributed memory, etc). However, they have solved only polynomials not exceeding degrees of 20,000.
+
+%The main problem of the simultaneous methods is that the necessary
+%time needed for the convergence increases with the increasing of the
+%polynomial's degree. Many authors have treated the problem of
+%implementing simultaneous methods in
+%parallel. Freeman~\cite{Freeman89} implemented and compared
+%Durand-Kerner method, Ehrlich-Aberth method and another method of the
+%fourth order of convergence proposed by Farmer and
+%Loizou~\cite{Loizou83} on a 8-processor linear chain, for polynomials
+%of degree up-to 8. The method of Farmer and Loizou~\cite{Loizou83}
+%often diverges, but the first two methods (Durand-Kerner and
+%Ehrlich-Aberth methods) have a speed-up equals to 5.5. Later, Freeman
+%and Bane~\cite{Freemanall90} considered asynchronous algorithms in
+%which each processor continues to update its approximations even
+%though the latest values of other approximations $z^{k}_{i}$ have not
+%been received from the other processors, in contrast with synchronous
+%algorithms where it would wait those values before making a new
+%iteration. Couturier and al.~\cite{cs01:nj} proposed two methods
+%of parallelization for a shared memory architecture with OpenMP and
+%for a distributed memory one with MPI. They are able to compute the
+%roots of sparse polynomials of degree 10,000. The authors showed an interesting
+%speedup that is 20 times as fast as the sequential implementation.
+
+But the recent advent of the Compute Unified Device Architecture (CUDA)~\cite{CUDA15}, a parallel computing platform and a programming model invented by NVIDIA had revived parallel programming interest for this problem. Indeed, the computing power of GPUs (Graphics Processing Units) has exceeded that of traditional processors CPUs, which makes it very appealing to the research community to investigate new parallel implementations for a whole set of scientific problems in the reasonable hope to solve bigger instances of well known computationally demanding issues such as the one beforehand. However, CUDA adopts a totally new computing architecture to use the hardware resources provided by the GPU in order to offer a stronger computing ability to the massive data computing. Ghidouche et al.~\cite{Kahinall14} proposed an implementation of the Durand-Kerner method on a single GPU. Their main results showed that a parallel CUDA implementation is about 10 times faster than the sequential implementation on a single CPU for sparse polynomials of degree 48,000.
+
+In this paper we propose the parallelization of Ehrlich-Aberth (EA) method which has a cubic convergence rate which is much better than the quadratic rate of the Durand-Kerner method which has already been investigated in \cite{Kahinall14}. In the other hand, EA is suitable to be implemented in parallel computers according to the data-parallel paradigm. In this model, computing elements carry computations on the data they are assigned and communicate with other computing elements in order to get fresh data or to synchronise. Classically, two parallel programming paradigms OpenMP and MPI are used to code such solutions. But in our case, computing elements are CUDA multi-GPU platforms. This architectural setting poses new programming challenges but offers also new opportunities to efficiently solve huge problems, otherwise considered intractable until recently. To the best of our knowledge, our CUDA-MPI and CUDA-OpenMP codes are the first implementations of EA method with multiple GPUs for finding roots of polynomials. Our major contributions include: