+
+\subsection{Multi-GPU (MPI-CUDA) approach}
+%\begin{figure}[htbp]
+%\centering
+ % \includegraphics[angle=-90,width=0.2\textwidth]{MPI-CUDA}
+%\caption{The MPI-CUDA architecture }
+%\label{fig:03}
+%\end{figure}
+Our parallel implementation of the Ehrlich-Aberth method for finding polynomial roots with the CUDA-MPI approach splits the input data of the polynomial to be solved among the MPI processes. From Algorithm 3, the input data are the polynomial to solve $P$, the solution vector $Z$, the previous solution vector $zPrev$, and the error values of the stopping condition $\Delta z$. Let $p$ denote the number of MPI processes and $n$ the size of the polynomial to be solved. The algorithm performs a simple data partitioning by creating, for each of the elements mentioned above, $p$ portions of at most $\lceil n/p \rceil$ roots to find per MPI process. Consequently, each MPI process $k$ has its own solution vector $Z_{k}$, polynomial to be solved $p_{k}$, and stopping-condition error $\Delta z_{k}$, and computes at most $\lceil n/p \rceil$ roots.
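+
+As an illustration of this partitioning, the following worked example assumes a contiguous block layout with processes numbered $k=0,\dots,p-1$, as MPI ranks are. The symbols $c$, $s_{k}$ and $e_{k}$ and the values $n=10$, $p=4$ are hypothetical and chosen only for the example. With $c=\lceil n/p \rceil$, process $k$ would be assigned the roots with indices $s_{k},\dots,e_{k}-1$, where
+\[
+s_{k}=\min(kc,\,n), \qquad e_{k}=\min((k+1)c,\,n).
+\]
+For $n=10$ and $p=4$ this gives $c=\lceil 10/4 \rceil=3$, so the four processes receive $3$, $3$, $3$ and $1$ roots (indices $0$--$2$, $3$--$5$, $6$--$8$ and $9$), which sums to $n=10$ with no portion exceeding $\lceil n/p \rceil$.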
+
+Since a GPU works only on data in its own memory, all local input data, $Z_{k}$, $p_{k}$ and $\Delta z_{k}$, must be transferred from CPU memory to the corresponding GPU memory. Afterward, the same EA algorithm (Algorithm 1) is run by all processes, each on a different sub-polynomial $p_{k}(x)=\sum_{i=k\frac{n}{p}}^{(k+1)\frac{n}{p}} a_{i}x^{i}$, $k=0,\dots,p-1$. Each MPI process executes the loop \verb=(While(...)...do)= containing the kernels, but computes only its own portion of the roots, indicated by the variable \textit{index}, which is initialized in line 5 of Algorithm~\ref{alg2-cuda-mpi} and used as input data in $kernel\_update$ (line 10, Algorithm~\ref{alg2-cuda-mpi}). After each iteration, the MPI processes synchronize using the \verb=MPI_Allreduce= function in order to compute the maximum of the stopping-condition errors $\Delta z_{k}$ computed by each MPI process (Algorithm~\ref{alg2-cuda-mpi}). Each process then copies its newly computed roots from GPU memory to CPU memory and communicates its results to the neighboring processes using \verb=MPI_Alltoallv=. While the maximum error satisfies $error > \varepsilon$, the processes continue to execute the loop \verb= while(...)...do= until all the roots have converged sufficiently.
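+
+The stopping test described above can be written compactly. This is a sketch under the assumptions that \verb=MPI_Allreduce= is called with the maximum reduction operation (\verb=MPI_MAX=), that $\Delta z_{k}$ is taken as the largest error over the roots of process $k$, and that processes are numbered $k=0,\dots,p-1$; the numerical values are hypothetical:
+\[
+error=\max_{0\le k<p}\Delta z_{k}, \qquad \mbox{iterate while } error>\varepsilon .
+\]
+For instance, with $p=3$, local errors $\Delta z_{0}=10^{-3}$, $\Delta z_{1}=4\times10^{-7}$, $\Delta z_{2}=2\times10^{-5}$ and $\varepsilon=10^{-6}$, the reduction yields $error=10^{-3}>\varepsilon$, so all three processes perform another iteration, including process $1$, whose own portion already satisfies the tolerance. Because every process receives the same reduced value, they all leave the loop at the same iteration.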
+
+\begin{enumerate}
+\begin{algorithm}[htpb]
+%\LinesNumbered
+\caption{CUDA-MPI Algorithm to find roots with the Ehrlich-Aberth method}
+\label{alg2-cuda-mpi}
+
+\KwIn{$Z^{0}$ (Initial vector of roots), $\varepsilon$ (Error tolerance
+ threshold), $P$ (Polynomial to solve), $Pu$ (Derivative of $P$), $n$ (Polynomial degree), $\Delta z$ (error of the stopping condition), $num\_gpus$ (number of MPI processes/number of GPUs), $Size$ (number of roots)}
+
+\KwOut {$Z$ (Solution vector of roots), $ZPrec$ (Previous solution vector of roots)}
+
+\BlankLine
+\item Initialize the polynomial $P$\;
+\item Initialize the derivative $Pu$\;
+\item Initialize the solution vector $Z^{0}$\;
+\item Allocate and copy the initial data from CPU memory to GPU global memory\;
+\item $index= Size/num\_gpus$\;
+\item $k=0$\;
+\While {$error > \varepsilon$}{
+\item Let $\Delta z=0$\;
+\item $ kernel\_save(ZPrec,Z)$\;
+\item $k=k+1$\;
+\item $ kernel\_update(Z,P,Pu,index)$\;
+\item $kernel\_testConverge(\Delta z,Z,ZPrec)$\;
+\item ComputeMaxError($\Delta z$,error)\;
+\item Copy results from GPU memories to CPU memories\;
+\item Send $Z[id]$ to all neighboring processes\;
+\item Receive $Z[j]$ from neighboring process j\;
+
+
+}
+\end{algorithm}
+\end{enumerate}
+~\\