+\section{The Ehrlich-Aberth algorithm on multiple GPUs}
+\label{sec4}
+\subsection{An OpenMP-CUDA approach}
+Our OpenMP-CUDA implementation of EA algorithm is based on the hybrid
+OpenMP and CUDA programming model. This algorithm is presented in
+Algorithm~\ref{alg2-cuda-openmp}. All the data are shared with OpenMP
+among all the OpenMP threads. The shared data are the solution vector
+$Z$, the polynomial to solve $P$, its derivative $P'$, and the error
+vector $error$. The number of OpenMP threads is equal to the number of
+GPUs, each OpenMP thread binds to one GPU, and it controls a part of
+the shared memory. More precisely each OpenMP thread will be
+responsible to update its owns part of the vector $Z$. This part is
+called $Z_{loc}$ in the following. Then all GPUs will have a grid of
+computation organized according to the device performance and the size
+of data on which it runs the computation kernels.
+
+To compute one iteration of the EA method each GPU performs the
+followings steps. First roots are shared with OpenMP and the
+computation of the local size for each GPU is performed (line 4). Each
+thread starts by copying all the previous roots inside its GPU (line
+5). At each iteration, the following operations are performed. First
+the vector $Z$ is transferred from the CPU to the GPU (line 7). Each
+GPU copies the previous roots (line 8) and it computes an iteration of
+the EA method on its own roots (line 9). For that all the other roots
+are used. The local error is computed on the new roots (line 10) and
+the max of the local errors is computed on all OpenMP threads (lien 11). At
+the end of an iteration, the updated roots are copied from the GPU to
+the CPU (line 12) by directly updating its own roots in the shared
+memory arrays containing all the roots.
+
+
+
+\begin{algorithm}[htpb]
+\label{alg2-cuda-openmp}
+\LinesNumbered
+\SetAlgoNoLine
+\caption{Finding roots of polynomials with the Ehrlich-Aberth method on multiple GPUs using OpenMP}
+\KwIn{ $\epsilon$ (tolerance threshold)}
+\KwOut{$Z$ (solution vector of roots)}
+Initialize the polynomial $P$ and its derivative $P'$\;
+Set the initial values of vector $Z$\;
+Start of a parallel part with OpenMP ($Z$, $error$, $P$, $P'$ are shared variables)\;
+Determine the local part of the OpenMP thread\;
+Copy $P$, $P'$ from CPU to GPU\;
+\While{$error > \epsilon$}{
+ Copy $Z$ from CPU to GPU\;
+ $Z^{prev}_{loc}$ = KernelSave($Z_{loc}$)\;
+ $Z_{loc}$ = KernelUpdate($P,P',Z$)\;
+ $error_{loc}$ = KernelComputeError($Z_{loc},Z^{prev}_{loc}$)\;
+ $error = max(error_{loc})$\;
+ Copy $Z_{loc}$ from GPU to $Z$ in CPU\;
+}
+\end{algorithm}
+
+
+
+
+
+\subsection{A MPI-CUDA approach}
+
+Our parallel implementation of EA to find roots of polynomials using a
+CUDA-MPI approach follows a similar approach to the one used in
+CUDA-OpenMP. Each process is responsible to compute its own part of
+roots using all the roots computed by other processors at the previous
+iteration. The difference between both approaches lies in the way
+processes communicate and exchange data. With MPI, processors need to
+send and receive data explicitly. So in
+Algorithm~\ref{alg2-cuda-mpi}, after the initialization all the
+processors have the same $Z$ vector. Then they need to compute the
+parameters used by the $MPI\_AlltoAll$ routines (line 4). In practice,
+each processor needs to compute its offset and its local
+size. Processors need to allocate memory on their GPU and need to copy
+their data on the GPU (line 5). At the beginning of each iteration, a
+processor starts by transferring the whole vector $Z$ from the CPU to the
+GPU (line 7). Only the local part of $Z^{prev}$ is saved (line
+8). After that, a processor is able to compute an updated version of
+its own roots (line 9) with the EA method. The local error is computed
+(line 10) and the global error using $MPI\_Reduce$ (line 11). Then
+the local roots are transferred from the GPU memory to the CPU memory
+(line 12) before being exchanged between all processors (line 13) in
+order to give to all processors the last version of the roots (with
+the MPI\_AlltoAll routine). If the convergence is not satisfied, an
+new iteration is executed.
+
+\begin{algorithm}[htpb]
+\label{alg2-cuda-mpi}
+\LinesNumbered
+\SetAlgoNoLine
+\caption{Finding roots of polynomials with the Ehrlich-Aberth method on multiple GPUs using MPI}
+
+\KwIn{ $\epsilon$ (tolerance threshold)}
+
+\KwOut {$Z$ (solution vector of roots)}
+
+\BlankLine
+Initialize the polynomial $P$ and its derivative $P'$\;
+Set the initial values of vector $Z$\;
+Determine the local part of the MPI process\;
+Computation of the parameters for the $MPI\_AlltoAll$\;
+Copy $P$, $P'$ from CPU to GPU\;
+\While {$error > \epsilon$}{
+ Copy $Z$ from CPU to GPU\;
+ $Z^{prev}_{loc}$ = KernelSave($Z_{loc}$)\;
+ $Z_{loc}$ = KernelUpdate($P,P',Z$)\;
+ $error_{loc}$ = KernelComputeError($Z_{loc},Z^{prev}_{loc}$)\;
+ $error=MPI\_Reduce(error_{loc})$\;
+ Copy $Z_{loc}$ from GPU to CPU\;
+ $Z=MPI\_AlltoAll(Z_{loc})$\;
+}
+\end{algorithm}
+
+
+\section{Experiments}
+\label{sec5}
+We study two categories of polynomials: sparse polynomials and full polynomials.\\
+{\it A sparse polynomial} is a polynomial for which only some coefficients are not null. In this paper, we consider sparse polynomials for which the roots are distributed on 2 distinct circles:
+\begin{equation}
+ \forall \alpha_{1} \alpha_{2} \in \mathbb{C},\forall n_{1},n_{2} \in \mathbb{N}^{*}; p(z)= (z^{n_{1}}-\alpha_{1})(z^{n_{2}}-\alpha_{2})
+\end{equation}\noindent
+{\it A full polynomial} is, in contrast, a polynomial for which all the coefficients are not null. A full polynomial is defined by:
+
+\begin{equation}
+ {\Large \forall \alpha_{i} \in \mathbb{C}, i\in \mathbb{N}; p(x)=\sum^{n}_{i=0} \alpha_{i}.x^{i}}
+\end{equation}
+
+For our tests, 4 cards GPU Tesla Kepler K40 are used. In order to evaluate both the GPU and Multi-GPU approaches, we performed a set of experiments on a single GPU and multiple GPUs using OpenMP or MPI with the EA algorithm, for both sparse and full polynomials of different sizes. All experimental results obtained are performed with double precision float data and the convergence threshold of the EA method is set to $10^{-7}$. The initialization values of the vector solution of the methods are given by Guggenheimer method~\cite{Gugg86}.
+
+\subsection{Evaluation of the multi-GPUs approaches}
+Here we evaluate the performances of the CUDA-OpenMP and CUDA-MPI approaches of the EA algorithm on different GPU platforms composed each of 1, 2, 3 or 4 GPUs. In this experiments we report the experimental results of the EA algorithms to find roots of different sparse and full polynomials of high degrees ranging from 100,000 to 1,400,000. Figures~\ref{fig:01} and~\ref{fig:02} show the execution times to solve, respectively, sparse and full polynomials with the CUDA-OpenMP algorithm, and Figures~\ref{fig:03} and~\ref{fig:04} show those to solve, respectively, sparse and full polynomials with the CUDA-MPI algorithm.
+
+All these figures show that the CUDA-OpenMP and CUDA-MPI approaches of the EA algorithm, compared to the single GPU version, are efficient and scale well with multiple GPUs. Both approaches allow us to solve sparse and full polynomials of very high degrees. Using 4 GPUs allows us to achieve a quasi-linear speedup.