-Most of the numerical methods that deal with the polynomial root-finding problem are simultaneous methods, \textit{i.e.} the iterative methods to find simultaneous approximations of the $n$ polynomial roots. These methods start from the initial approximations of all $n$ polynomial roots and give a sequence of approximations that converge to the roots of the polynomial. Two examples of well-known simultaneous methods for root-finding problem of polynomials are Durand-Kerner method~\cite{Durand60,Kerner66} and Ehrlich-Aberth method~\cite{Ehrlich67,Aberth73}.
-
-The main problem of the simultaneous methods is that the necessary
-time needed for the convergence increases with the increasing of the
-polynomial's degree. Many authors have treated the problem of
-implementing simultaneous methods in
-parallel. Freeman~\cite{Freeman89} implemented and compared
-Durand-Kerner method, Ehrlich-Aberth method and another method of the
-fourth order of convergence proposed by Farmer and
-Loizou~\cite{Loizou83} on a 8-processor linear chain, for polynomials
-of degree up-to 8. The method of Farmer and Loizou~\cite{Loizou83}
-often diverges, but the first two methods (Durand-Kerner and
-Ehrlich-Aberth methods) have a speed-up equals to 5.5. Later, Freeman
-and Bane~\cite{Freemanall90} considered asynchronous algorithms in
-which each processor continues to update its approximations even
-though the latest values of other approximations $z^{k}_{i}$ have not
-been received from the other processors, in contrast with synchronous
-algorithms where it would wait those values before making a new
-iteration. Couturier and al.~\cite{cs01:nj} proposed two methods
-of parallelization for a shared memory architecture with OpenMP and
-for a distributed memory one with MPI. They are able to compute the
-roots of sparse polynomials of degree 10,000. The authors showed an interesting
-speedup that is 20 times as fast as the sequential implementation.
-%which takes up-to 3,300 seconds to obtain same results.
-\RC{si on donne des temps faut donner le proc, comme c'est vieux à mon avis faut supprimer ca, votre avis?}
-\LZK{Supprimons ces détails et mettons une référence s'il y en a une}
-\KG{Je viens de supprimer les détails, la référence existe déja, a reverifier\LZK{Elle est où la référence?}}
-Very few work had been performed since then until the appearing of the Compute Unified Device Architecture (CUDA)~\cite{CUDA15}, a parallel computing platform and a programming model invented by NVIDIA. The computing power of GPUs (Graphics Processing Units) has exceeded that of traditional processors CPUs. However, CUDA adopts a totally new computing architecture to use the hardware resources provided by the GPU in order to offer a stronger computing ability to the massive data computing. Ghidouche et al.~\cite{Kahinall14} proposed an implementation of the Durand-Kerner method on a single GPU. Their main results showed that a parallel CUDA implementation is about 10 times faster than the sequential implementation on a single CPU for sparse polynomials of degree 48,000. \LZK{Y a pas d'autres travaux pour la résolution de polynômes sur GPUs?}
-
-In this paper we propose the parallelization of Ehrlich-Aberth method which has a good convergence and it is suitable to be implemented in parallel computers. We use two parallel programming paradigms OpenMP and MPI on CUDA multi-GPU platforms. Our CUDA-MPI and CUDA-OpenMP codes are the first implementations of Ehrlich-Aberth method with multiple GPUs for finding roots of polynomials. Our major contributions include:
+Most of the numerical methods that deal with the polynomial
+root-finding problems are simultaneous methods, \textit{i.e.} the
+iterative methods to find simultaneous approximations of the $n$
+polynomial roots. These methods start from the initial approximation
+of all $n$ polynomial roots and give a sequence of approximations that
+converge to the roots of the polynomial. Two examples of well-known
+simultaneous methods for root-finding problem of polynomials are
+the Durand-Kerner method~\cite{Durand60,Kerner66} and the Ehrlich-Aberth method~\cite{Ehrlich67,Aberth73}.
+
+
+The convergence time of simultaneous methods drastically increases
+with the increasing of the polynomial's degree. The great challenge
+with simultaneous methods is to parallelize them and to improve their
+convergence. Many authors have proposed parallel simultaneous
+methods~\cite{Freeman89,Loizou83,Freemanall90,bini96,cs01:nj,Couturier02},
+using several paradigms of parallelization (synchronous or
+asynchronous computations, mechanism of shared or distributed memory,
+etc). However, so far until now, only polynomials not exceeding
+degrees of less than 100,000 have been solved.
+
+%The main problem of the simultaneous methods is that the necessary
+%time needed for the convergence increases with the increasing of the
+%polynomial's degree. Many authors have treated the problem of
+%implementing simultaneous methods in
+%parallel. Freeman~\cite{Freeman89} implemented and compared
+%Durand-Kerner method, Ehrlich-Aberth method and another method of the
+%fourth order of convergence proposed by Farmer and
+%Loizou~\cite{Loizou83} on a 8-processor linear chain, for polynomials
+%of degree up-to 8. The method of Farmer and Loizou~\cite{Loizou83}
+%often diverges, but the first two methods (Durand-Kerner and
+%Ehrlich-Aberth methods) have a speed-up equals to 5.5. Later, Freeman
+%and Bane~\cite{Freemanall90} considered asynchronous algorithms in
+%which each processor continues to update its approximations even
+%though the latest values of other approximations $z^{k}_{i}$ have not
+%been received from the other processors, in contrast with synchronous
+%algorithms where it would wait those values before making a new
+%iteration. Couturier and al.~\cite{cs01:nj} proposed two methods
+%of parallelization for a shared memory architecture with OpenMP and
+%for a distributed memory one with MPI. They are able to compute the
+%roots of sparse polynomials of degree 10,000. The authors showed an interesting
+%speedup that is 20 times as fast as the sequential implementation.
+
+With the recent advent of the Compute Unified Device Architecture
+(CUDA)~\cite{CUDA15}, a parallel computing platform and a programming
+model invented by NVIDIA had revived parallel programming interest for
+this problem. Indeed, the computing power of GPUs (Graphics Processing
+Units) has exceeded that of traditional CPUs processors, which makes it very appealing to the research community to investigate new parallel implementations for a whole set of scientific problems in the reasonable hope to solve bigger instances of well known computationally demanding issues such as the one beforehand. However, CUDA adopts a totally new computing architecture to use the hardware resources provided by the GPU in order to offer a stronger computing ability to the massive data computing. Ghidouche et al.~\cite{Kahinall14} proposed an implementation of the Durand-Kerner method on a single GPU. Their main results showed that a parallel CUDA implementation is about 10 times faster than the sequential implementation on a single CPU for sparse polynomials of degree 48,000.
+
+In this paper we propose the parallelization of the Ehrlich-Aberth (EA) method which has a cubic convergence rate which is much better than the quadratic rate of the Durand-Kerner method which has already been investigated in \cite{Kahinall14}. In the other hand, EA is suitable to be implemented in parallel computers according to the data-parallel paradigm. In this model, computing elements carry computations on the data they are assigned and communicate with other computing elements in order to get fresh data or to synchronize. Classically, two parallel programming paradigms OpenMP and MPI are used to code such solutions. But in our case, computing elements are CUDA multi-GPU platforms. This architectural setting poses new programming challenges but offers also new opportunities to efficiently solve huge problems, otherwise considered intractable until recently. To the best of our knowledge, our CUDA-MPI and CUDA-OpenMP codes are the first implementations of EA method with multiple GPUs for finding roots of polynomials. Our major contributions include: