+
+\section{The EA algorithm on Multiple GPUs}
+\label{sec4}
+\subsection{an OpenMP-CUDA approach}
+Our OpenMP-CUDA implementation of EA algorithm is based on the hybrid
+OpenMP and CUDA programming model. All the data are shared with
+OpenMP amoung all the OpenMP threads. The shared data are the solution
+vector $Z$, the polynomial to solve $P$, and the error vector $\Delta
+z$. The number of OpenMP threads is equal to the number of GPUs, each
+OpenMP thread binds to one GPU, and it controls a part of the shared
+memory. More precisely each OpenMP thread will be responsible to
+update its owns part of the vector Z. This part is call $Z_{loc}$ in
+the following. Then all GPUs will have a grid of computation organized
+according to the device performance and the size of data on which it
+runs the computation kernels.
+
+To compute one iteration of the EA method each GPU performs the
+followings steps. First roots are shared with OpenMP and the
+computation of the local size for each GPU is performed (lines 5-7 in
+Algo\ref{alg2-cuda-openmp}). Each thread starts by copying all the
+previous roots inside its GPU (line 9). Then each GPU will copy the
+previous roots (line 10) and it will compute an iteration of the EA
+method on its own roots (line 11). For that all the other roots are
+used. The convergence is checked on the new roots (line 12). At the end
+of an iteration, the updated roots are copied from the GPU to the
+CPU (line 14) by direcly updating its own roots in the shared memory
+arrays containing all the roots.
+
+%In principle a grid is set by two parameter DimGrid, the number of block per grid, DimBloc: the number of threads per block. The following schema shows the architecture of (CUDA,OpenMP).
+
+%\begin{figure}[htbp]
+%\centering
+ % \includegraphics[angle=-90,width=0.5\textwidth]{OpenMP-CUDA}
+%\caption{The OpenMP-CUDA architecture}
+%\label{fig:03}
+%\end{figure}
+%Each thread OpenMP compute the kernels on GPUs,than after each iteration they copy out the data from GPU memory to CPU shared memory. The kernels are re-runs is up to the roots converge sufficiently. Here are below the corresponding algorithm:
+
+%% \RC{Surement à virer ou réécrire pour etre compris sans algo}
+%% $num\_gpus$ OpenMP threads are created using
+%% \verb=omp_set_num_threads();=function (step $3$, Algorithm
+%% \ref{alg2-cuda-openmp}), the shared memory is created using
+%% \verb=#pragma omp parallel shared()= OpenMP function (line $5$,
+%% Algorithm\ref{alg2-cuda-openmp}), then each OpenMP thread allocates
+%% memory and copies initial data from CPU memory to GPU global memory,
+%% executes the kernels on GPU, but computes only his portion of roots
+%% indicated with variable \textit{index} initialized in (line 5,
+%% Algorithm \ref{alg2-cuda-openmp}), used as input data in the
+%% $kernel\_update$ (line 10, Algorithm \ref{alg2-cuda-openmp}). After
+%% each iteration, all OpenMP threads synchronize using
+%% \verb=#pragma omp barrier;= to gather all the correct values of
+%% $\Delta z$, thus allowing the computation the maximum stop condition
+%% on vector $\Delta z$ (line 12, Algorithm
+%% \ref{alg2-cuda-openmp}). Finally, threads copy the results from GPU
+%% memories to CPU memory. The OpenMP threads execute kernels until the
+%% roots sufficiently converge.
+
+
+%% \begin{algorithm}[h]
+%% \label{alg2-cuda-openmp}
+%% \LinesNumbered
+%% \SetAlgoNoLine
+%% \caption{CUDA-OpenMP Algorithm to find roots with the Ehrlich-Aberth method}
+
+%% \KwIn{$Z^{0}$ (Initial root's vector), $\varepsilon$ (Error tolerance
+%% threshold), P (Polynomial to solve), Pu (Derivative of P), $n$ (Polynomial degree), $\Delta z$ ( Vector of errors for stop condition), $num\_gpus$ (number of OpenMP threads/ Number of GPUs), $Size$ (number of roots)}
+
+%% \KwOut {$Z$ ( Root's vector), $ZPrec$ (Previous root's vector)}
+
+%% \BlankLine
+
+%% Initialization of P\;
+%% Initialization of Pu\;
+%% Initialization of the solution vector $Z^{0}$\;
+%% Start of a parallel part with OpenMP (Z, $\Delta z$, P are shared variables)\;
+%% gpu\_id=cudaGetDevice()\;
+%% Allocate memory on GPU\;
+%% Compute local size and offet according to gpu\_id\;
+%% \While {$error > \epsilon$}{
+%% copy Z from CPU to GPU\;
+%% $ ZPrec_{loc}=kernel\_save(Z_{loc})$\;
+%% $ Z_{loc}=kernel\_update(Z,P,Pu)$\;
+%% $\Delta z[gpu\_id] = kernel\_testConv(Z_{loc},ZPrec_{loc})$\;
+%% $ error= Max(\Delta z)$\;
+%% copy $Z_{loc}$ from GPU to Z in CPU
+%% }
+%%\end{algorithm}
+
+\begin{algorithm}[htpb]
+\LinesNumbered
+\SetAlgoNoLine
+\caption{Finding roots of polynomials with the Ehrlich-Aberth method on multiple GPUs using OpenMP}
+\KwIn{$n$ (polynomial's degree), $\epsilon$ (tolerance threshold), $ngpu$ (number of GPUs)}
+\KwOut{$Z$ (solution vector of roots)}
+Initialize the polynomial $P$ and its derivative $P'$\;
+Set the initial values of vector $Z$\;
+Start of a parallel part with OpenMP ($Z$, $\Delta Z$, $\Delta Z_{max}$, $P$ are shared variables)\;
+$id_{gpu}$ = cudaGetDevice()\;
+$n_{loc}$ = $n/ngpu$ (local size)\;
+%$idx$ = $id_{gpu}\times n_{loc}$ (local offset)\;
+Copy $P$, $P'$ from CPU to GPU\;
+\While{\emph{not convergence}}{
+ Copy $Z$ from CPU to GPU\;
+ $Z^{prev}$ = KernelSave($Z,n$)\;
+ $Z_{loc}$ = KernelUpdate($P,P',Z^{prev},n_{loc}$)\;
+ $\Delta Z_{loc}$ = KernelComputeError($Z_{loc},Z^{prev}_{loc},n_{loc}$)\;
+ $\Delta Z_{max}[id_{gpu}]$ = CudaMaxFunction($\Delta Z_{loc},n_{loc}$)\;
+ Copy $Z_{loc}$ from GPU to $Z$ in CPU\;
+ $max$ = MaxFunction($\Delta Z_{max},ngpu$)\;
+ TestConvergence($max,\epsilon$)\;