+\section{The EA algorithm on Multiple GPUs}
+\subsection{An OpenMP-CUDA approach}
+Our OpenMP-CUDA implementation of the EA algorithm is based on the hybrid
+OpenMP and CUDA programming model. All the data are shared among
+the OpenMP threads. The shared data are the solution
+vector $Z$, the polynomial to solve $P$, and the error vector $\Delta
+Z$. The number of OpenMP threads is equal to the number of GPUs; each
+OpenMP thread is bound to one GPU and controls a part of the shared
+memory. More precisely, each OpenMP thread is responsible for
+updating its own part of the vector $Z$. This part is called $Z_{loc}$ in
+the following. Each GPU then uses a computation grid organized
+according to the device performance and the size of the data on which it
+runs the computation kernels.
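+As an illustration of this partitioning, assume that $ngpu$ divides $n$
+and that the roots are indexed from $0$ (both are simplifying assumptions
+made only for this sketch, as is the superscript $(g)$). The part owned by
+the thread bound to GPU $g$, with $0 \le g < ngpu$, can then be written as
+\[
+Z_{loc}^{(g)} = \left( z_{g\,n_{loc}},\; z_{g\,n_{loc}+1},\; \ldots,\; z_{(g+1)\,n_{loc}-1} \right),
+\qquad n_{loc} = \frac{n}{ngpu}.
+\]
+For example, with $n = 8$ and $ngpu = 2$, GPU $0$ updates $z_0,\ldots,z_3$
+and GPU $1$ updates $z_4,\ldots,z_7$.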
+To compute one iteration of the EA method, each GPU performs the
+following steps. First, the roots are shared with OpenMP and the
+local size for each GPU is computed (lines 5-7 in
+Algo.~\ref{alg2-cuda-openmp}). Each thread starts by copying all the
+previous roots to its GPU (line 9). Then each GPU saves a copy of the
+previous roots (line 10) and computes an iteration of the EA
+method on its own roots (line 11). For that, all the other roots are
+used. The convergence is checked on the new roots (line 12). At the end
+of an iteration, the updated roots are copied from the GPU to the
+CPU (line 14): each thread directly updates its own roots in the shared
+memory array containing all the roots.
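+For reference, the update applied to each local root can be sketched with
+the standard Ehrlich-Aberth iteration. The iteration index $k$ and the root
+indices $i$, $j$ are notation introduced only for this sketch. For every
+root $z_i$ belonging to $Z_{loc}$,
+\[
+z_i^{k+1} = z_i^{k} -
+\frac{\frac{P(z_i^{k})}{P'(z_i^{k})}}
+{1 - \frac{P(z_i^{k})}{P'(z_i^{k})} \displaystyle\sum_{j=0,\, j \neq i}^{n-1} \frac{1}{z_i^{k} - z_j^{k}}}.
+\]
+The sum runs over all $n$ roots, not only the local ones, which is why each
+GPU needs the complete vector $Z^{prev}$ before it updates $Z_{loc}$.
+Assuming the error is measured as the absolute change of each root (an
+assumption, since the error kernel is not detailed here), the iteration
+stops when $\max_i |z_i^{k+1} - z_i^{k}| < \epsilon$.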
+\caption{Finding roots of polynomials with the Ehrlich-Aberth method on multiple GPUs using OpenMP}
+\KwIn{$n$ (polynomial's degree), $\epsilon$ (tolerance threshold), $ngpu$ (number of GPUs)}
+\KwOut{$Z$ (solution vector of roots)}
+Initialize the polynomial $P$ and its derivative $P'$\;
+Set the initial values of vector $Z$\;
+Start of a parallel part with OpenMP ($Z$, $\Delta Z$, $\Delta Z_{max}$, $P$ are shared variables)\;
+$id_{gpu}$ = cudaGetDevice()\;
+$n_{loc}$ = $n/ngpu$ (local size)\;
+%$idx$ = $id_{gpu}\times n_{loc}$ (local offset)\;
+Copy $P$, $P'$ from CPU to GPU\;
+\While{\emph{not converged}}{
+ Copy $Z$ from CPU to GPU\;
+ $Z^{prev}$ = KernelSave($Z,n$)\;
+ $Z_{loc}$ = KernelUpdate($P,P',Z^{prev},n_{loc}$)\;
+ $\Delta Z_{loc}$ = KernelComputeError($Z_{loc},Z^{prev}_{loc},n_{loc}$)\;
+ $\Delta Z_{max}[id_{gpu}]$ = CudaMaxFunction($\Delta Z_{loc},n_{loc}$)\;
+ Copy $Z_{loc}$ from GPU to $Z$ in CPU\;
+ $max$ = MaxFunction($\Delta Z_{max},ngpu$)\;
+ TestConvergence($max,\epsilon$)\;