Most numerical methods for the polynomial root-finding problem are simultaneous methods, \textit{i.e.}, iterative methods that compute approximations of all $n$ roots of a polynomial at once. These methods start from initial approximations of all $n$ roots and produce a sequence of approximations that converges to the roots of the polynomial. Two well-known simultaneous methods for polynomial root-finding are the Durand-Kerner method~\cite{Durand60,Kerner66} and the Ehrlich-Aberth method~\cite{Ehrlich67,Aberth73}.
-The convergence time of simultaneous methods drastically increases with the increasing of the polynomial's degree. The great challenge with simultaneous methods is to parallelize them and to improve their convergence. Many authors have proposed parallel simultaneous methods~\cite{Freeman89,Loizou83,Freemanall90,bini96,cs01:nj,Couturier02}, using several paradigms of parallelization (synchronous or asynchronous computations, mechanism of shared or distributed memory, etc). However, they have treated only polynomials not exceeding degrees of 20,000.
+The convergence time of simultaneous methods increases drastically with the degree of the polynomial. The great challenge with simultaneous methods is to parallelize them and to improve their convergence. Many authors have proposed parallel simultaneous methods~\cite{Freeman89,Loizou83,Freemanall90,bini96,cs01:nj,Couturier02}, using several parallelization paradigms (synchronous or asynchronous computations, shared or distributed memory mechanisms, etc.). However, they have only solved polynomials of degree not exceeding 20,000.
%The main problem of the simultaneous methods is that the necessary
%time needed for the convergence increases with the increasing of the
%roots of sparse polynomials of degree 10,000. The authors showed an interesting
%speedup that is 20 times as fast as the sequential implementation.
-Very few work had been performed since then until the appearing of the Compute Unified Device Architecture (CUDA)~\cite{CUDA15}, a parallel computing platform and a programming model invented by NVIDIA. The computing power of GPUs (Graphics Processing Units) has exceeded that of traditional processors CPUs. However, CUDA adopts a totally new computing architecture to use the hardware resources provided by the GPU in order to offer a stronger computing ability to the massive data computing. Ghidouche et al.~\cite{Kahinall14} proposed an implementation of the Durand-Kerner method on a single GPU. Their main results showed that a parallel CUDA implementation is about 10 times faster than the sequential implementation on a single CPU for sparse polynomials of degree 48,000.
+However, the recent advent of the Compute Unified Device Architecture (CUDA)~\cite{CUDA15}, a parallel computing platform and programming model invented by NVIDIA, has revived interest in parallel programming for this problem. Indeed, the computing power of GPUs (Graphics Processing Units) has exceeded that of traditional processors (CPUs), which makes it very appealing for the research community to investigate new parallel implementations for a whole range of scientific problems, in the reasonable hope of solving bigger instances of well-known computationally demanding problems such as the one at hand. CUDA adopts a totally new computing architecture to use the hardware resources provided by the GPU in order to offer stronger computing power for massive data processing. Ghidouche et al.~\cite{Kahinall14} proposed an implementation of the Durand-Kerner method on a single GPU. Their main results showed that a parallel CUDA implementation is about 10 times faster than the sequential implementation on a single CPU for sparse polynomials of degree 48,000.
-In this paper we propose the parallelization of Ehrlich-Aberth method which has a good convergence and it is suitable to be implemented in parallel computers. We use two parallel programming paradigms OpenMP and MPI on CUDA multi-GPU platforms. Our CUDA-MPI and CUDA-OpenMP codes are the first implementations of Ehrlich-Aberth method with multiple GPUs for finding roots of polynomials. Our major contributions include:
+In this paper we propose the parallelization of the Ehrlich-Aberth (EA) method, which has a cubic convergence rate, much better than the quadratic rate of the Durand-Kerner method already investigated in~\cite{Kahinall14}. Moreover, EA is well suited to implementation on parallel computers according to the data-parallel paradigm. In this model, computing elements carry out computations on the data they are assigned and communicate with other computing elements in order to get fresh data or to synchronize. Classically, two parallel programming paradigms, OpenMP and MPI, are used to code such solutions. In our case, however, the computing elements are CUDA multi-GPU platforms. This architectural setting poses new programming challenges but also offers new opportunities to efficiently solve huge problems that were considered intractable until recently. To the best of our knowledge, our CUDA-MPI and CUDA-OpenMP codes are the first implementations of the EA method with multiple GPUs for finding roots of polynomials. Our major contributions include:
\begin{itemize}
-\item The parallel implementation of Ehrlich-Aberth algorithm on a multi-GPU platform with a shared memory using OpenMP API. It is based on threads created from the same system process, such that each thread is attached to one GPU. In this case the communications between GPUs are done by OpenMP threads through shared memory.
-\item The parallel implementation of Ehrlich-Aberth algorithm on a
+\item The parallel implementation of the EA algorithm on a multi-GPU platform with shared memory using the OpenMP API. It is based on threads created from the same system process, such that each thread is attached to one GPU. In this case, the communications between GPUs are carried out by the OpenMP threads through shared memory.
+\item The parallel implementation of the EA algorithm on a
multi-GPU platform with distributed memory using the MPI API, such
that each GPU is attached to and managed by an MPI process. The GPUs
exchange their data by message-passing communications. This approach is more commonly used on clusters to solve very complex problems that are too large for traditional supercomputers, which are very expensive to build and run.