+\subsection{CUDA}
+%CUDA (is an acronym of the Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA~\cite{CUDA10}.The unit of execution in CUDA is called a thread. Each thread executes a kernel by the streaming processors in parallel. In CUDA, a group of threads that are executed together is called a thread block, and the computational grid consists of a grid of thread blocks. Additionally, a thread block can use the shared memory on a single multiprocessor while the grid executes a single CUDA program logically in parallel. Thus in CUDA programming, it is necessary to design carefully the arrangement of the thread blocks in order to ensure low latency and a proper usage of shared memory, since it can be shared only in a thread block scope. The effective bandwidth of each memory space depends on the memory access pattern. Since the global memory has lower bandwidth than the shared memory, the global memory accesses should be minimized.
+
+CUDA (Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA~\cite{CUDA15} for GPUs. It provides a high-level GPGPU programming model for using GPUs in general-purpose computations and non-graphics applications. The GPU is viewed as an accelerator: the data-parallel operations of a CUDA program running on a CPU are off-loaded onto the GPU and executed by the latter. The data-parallel operations executed by GPUs are called kernels. The same kernel is executed in parallel by a large number of threads organized in grids of thread blocks, such that each GPU multiprocessor executes one or more thread blocks in SIMD fashion (Single Instruction, Multiple Data) and, in turn, each core of the multiprocessor executes one or more threads within a block. Threads within a block can cooperate by sharing data through a fast shared memory and coordinate their execution through synchronization points. In contrast, within a grid of thread blocks, there is no synchronization at all between blocks. The GPU only works on data stored in its global memory, and the final results of the kernel executions must be transferred out of the GPU. In the GPU, the global memory has lower bandwidth than the shared memory associated with each multiprocessor. Thus, in CUDA programming, it is necessary to carefully design the arrangement of the thread blocks in order to ensure low latency and proper usage of the shared memory, and global memory accesses should be minimized.
+
+%We introduced three paradigms of parallel programming. Our objective consists in implementing a root finding polynomial algorithm on multiple GPUs. To this end, it is primordial to know how to manage CUDA contexts of different GPUs. A direct method for controlling the various GPUs is to use as many threads or processes as GPU devices. We can choose the GPU index based on the identifier of OpenMP thread or the rank of the MPI process. Both approaches will be investigated.
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\section{The Ehrlich-Aberth algorithm on a GPU}
+\label{sec3}
+
+\subsection{The EA method}
+%A cubically convergent iteration method to find zeros of
+%polynomials was proposed by O. Aberth~\cite{Aberth73}. The
+%Ehrlich-Aberth (EA is short) method contains 4 main steps, presented in what
+%follows.
+
+%The Aberth method is a purely algebraic derivation.
+%To illustrate the derivation, we let $w_{i}(z)$ be the product of linear factors
+
+%\begin{equation}
+%w_{i}(z)=\prod_{j=1,j \neq i}^{n} (z-x_{j})
+%\end{equation}
+
+%And let a rational function $R_{i}(z)$ be the correction term of the
+%Weistrass method~\cite{Weierstrass03}
+
+%\begin{equation}
+%R_{i}(z)=\frac{p(z)}{w_{i}(z)} , i=1,2,...,n.
+%\end{equation}
+
+%Differentiating the rational function $R_{i}(z)$ and applying the
+%Newton method, we have:
+
+%\begin{equation}
+%\frac{R_{i}(z)}{R_{i}^{'}(z)}= \frac{p(z)}{p^{'}(z)-p(z)\frac{w_{i}(z)}{w_{i}^{'}(z)}}= \frac{p(z)}{p^{'}(z)-p(z) \sum _{j=1,j \neq i}^{n}\frac{1}{z-x_{j}}}, i=1,2,...,n
+%\end{equation}
+%where R_{i}^{'}(z)is the rational function derivative of F evaluated in the point z
+%Substituting $x_{j}$ for $z_{j}$ we obtain the Aberth iteration method.%
+
+
+%\subsubsection{Polynomials Initialization}
+%The initialization of a polynomial $p(z)$ is done by setting each of the $n$ complex coefficients %$a_{i}$:
+
+%\begin{equation}
+%\label{eq:SimplePolynome}
+% p(z)=\sum{a_{i}z^{n-i}} , a_{n} \neq 0,a_{0}=1, a_{i}\subset C
+%\end{equation}
+
+
+%\subsubsection{Vector $Z^{(0)}$ Initialization}
+%\label{sec:vec_initialization}
+%As for any iterative method, we need to choose $n$ initial guess points $z^{0}_{i}, i = 1, . . . , %n.$
+%The initial guess is very important since the number of steps needed by the iterative method to %reach
+%a given approximation strongly depends on it.
+%In~\cite{Aberth73} the Ehrlich-Aberth iteration is started by selecting $n$
+%equi-distant points on a circle of center 0 and radius r, where r is
+%an upper bound to the moduli of the zeros. Later, Bini and al.~\cite{Bini96}
+%performed this choice by selecting complex numbers along different
+%circles which relies on the result of~\cite{Ostrowski41}.
+
+%\begin{equation}
+%\label{eq:radiusR}
+%%\begin{align}
+%\sigma_{0}=\frac{u+v}{2};u=\frac{\sum_{i=1}^{n}u_{i}}{n.max_{i=1}^{n}u_{i}};
+%v=\frac{\sum_{i=0}^{n-1}v_{i}}{n.min_{i=0}^{n-1}v_{i}};\\
+%%\end{align}
+%\end{equation}
+%Where:
+%\begin{equation}
+%u_{i}=2.|a_{i}|^{\frac{1}{i}};
+%v_{i}=\frac{|\frac{a_{n}}{a_{i}}|^{\frac{1}{n-i}}}{2}.
+%\end{equation}
+
+%\subsubsection{Iterative Function}
+%The operator used by the Aberth method corresponds to the
+%equation~\ref{Eq:EA1}, it enables the convergence towards
+%the polynomials zeros, provided all the roots are distinct.
+
+%Here we give a second form of the iterative function used by the Ehrlich-Aberth method:
+
+%\begin{equation}
+%\label{Eq:EA1}
+%EA: z^{k+1}_{i}=z_{i}^{k}-\frac{\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}}
+%{1-\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}\sum_{j=1,j\neq i}^{j=n}{\frac{1}{(z_{i}^{k}-z_{j}^{k})}}}, %i=1,. . . .,n
+%\end{equation}
+
+%\subsubsection{Convergence Condition}
+%The convergence condition determines the termination of the algorithm. It consists in stopping the %iterative function when the roots are sufficiently stable. We consider that the method converges %sufficiently when:
+
+%\begin{equation}
+%\label{eq:Aberth-Conv-Cond}
+%\forall i \in [1,n];\vert\frac{z_{i}^{k}-z_{i}^{k-1}}{z_{i}^{k}}\vert<\xi
+%\end{equation}
+
+
+%\begin{figure}[htbp]
+%\centering
+ % \includegraphics[angle=-90,width=0.5\textwidth]{EA-Algorithm}
+%\caption{The Ehrlich-Aberth algorithm on single GPU}
+%\label{fig:03}
+%\end{figure}
+
+The Ehrlich-Aberth method is an iterative method that contains 4 steps. It starts from initial approximations of all the roots of the polynomial. The second step initializes the solution vector $Z$ using the Guggenheimer method to ensure that the initial root approximations are distinct. Then, in step 3, we apply the iterative function based on Newton's method and the Weierstrass operator~\cite{,}, which makes it possible to converge to the roots, provided that all the roots are distinct:
+
+\begin{equation}
+\label{Eq:EA1}
+EA:\ z^{k+1}_{i}=z_{i}^{k}-\frac{\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}}
+{1-\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}\sum_{j=1,j\neq i}^{n}{\frac{1}{z_{i}^{k}-z_{j}^{k}}}}, \quad i=1,\ldots,n
+\end{equation}
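+
+As a purely illustrative worked example (the polynomial and starting values are hypothetical choices made here, not data from the experiments), consider one Jacobi-mode step of Equation~\ref{Eq:EA1} for $p(z)=z^{2}-1$, whose roots are $\pm 1$, starting from $z_{1}^{0}=2$ and $z_{2}^{0}=-2$. With $p(2)=3$ and $p'(2)=4$ we obtain
+\[
+\frac{p(z_{1}^{0})}{p'(z_{1}^{0})}=\frac{3}{4},\qquad
+\sum_{j\neq 1}\frac{1}{z_{1}^{0}-z_{j}^{0}}=\frac{1}{4},\qquad
+z_{1}^{1}=2-\frac{3/4}{1-\frac{3}{4}\cdot\frac{1}{4}}=2-\frac{12}{13}=\frac{14}{13}\approx 1.077 .
+\]
+By symmetry, $z_{2}^{1}=-14/13$, so both approximations move toward the roots while remaining distinct.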
+
+After each application of the iterative function, a stopping condition is checked: the iterative process stops when the relative change of every root between two successive iterations is lower than a fixed value $\xi$:
+
+\begin{equation}
+\label{eq:Aberth-Conv-Cond}
+\forall i \in [1,n],\quad \left\vert\frac{z_{i}^{k}-z_{i}^{k-1}}{z_{i}^{k}}\right\vert<\xi
+\end{equation}
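+
+For illustration only, with the hypothetical values $z_{i}^{k-1}=2$, $z_{i}^{k}=14/13$ and $\xi=10^{-6}$ (none of which come from the experiments), the test of Equation~\ref{eq:Aberth-Conv-Cond} gives
+\[
+\left\vert\frac{z_{i}^{k}-z_{i}^{k-1}}{z_{i}^{k}}\right\vert
+=\frac{12/13}{14/13}=\frac{6}{7}\approx 0.857 > \xi ,
+\]
+so the condition fails for this root and the iteration continues.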
+
+\subsection{EA parallel implementation on CUDA}
+We introduced three paradigms of parallel programming. Our objective is to implement a polynomial root-finding algorithm on multiple GPUs. To this end, it is essential to know how to manage the CUDA contexts of the different GPUs. A direct method for controlling the various GPUs is to use as many threads or processes as GPU devices. The GPU index can then be chosen from the identifier of the OpenMP thread or the rank of the MPI process. Both approaches will be investigated.
+
+
+
+
+Like any parallel code, a GPU parallel implementation first
+requires identifying the sequential tasks and the
+parallelizable parts of the sequential version of the
+program/algorithm. In our case, all the operations that are easy
+to execute in parallel, such as step 3 and step 4, must be performed by the GPU to accelerate
+the execution of the application. On the other hand, all the
+sequential operations and the operations that have data
+dependencies between threads or recursive computations must
+be executed by only one CUDA or CPU thread (step 1 and step 2). Initially, we specify the organization of the parallel threads by setting the dimension of the grid Dimgrid (the number of blocks per grid) and the dimension of the blocks DimBlock (the number of threads per block).
+
+The code is organized into kernels, which are portions of code that run on GPU devices. For step 3, there are two kernels: the
+first, named \textit{save}, is used to save the vector $Z^{k-1}$, and the second, named
+\textit{update}, is used to update the vector $Z^{k}$. For step 4, a kernel
+tests the convergence of the method. In order to
+compute the function H, we have two possibilities: either to use
+the Jacobi mode, or the Gauss-Seidel mode of iteration, which uses the
+most recently computed roots. It is well known that the Gauss-Seidel
+mode converges more quickly, so we used the Gauss-Seidel mode of iteration. To
+parallelize the code, we created kernels and many functions to
+be executed on the GPU for all the operations dealing with
+computations on complex numbers and the evaluation of the
+polynomials. As said previously, we implemented both methods
+of evaluating a polynomial: the normal method, based on
+Horner's method, and the method based on the logarithm
+of the polynomial. All these methods took rather long to
+implement, as the development of the corresponding kernels with
+CUDA takes longer than on a CPU host. This comes in particular
+from the fact that it is much more difficult to debug running CUDA
+threads than threads on a CPU host.
+Algorithm~\ref{alg1-cuda} shows the GPU parallel implementation of the Ehrlich-Aberth method.
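+
+To make the distinction between the two modes concrete, they can be written in their usual sequential formulation. This is stated here as an assumption about what is meant, and the symbols $N_{i}^{k}$, $S_{i}^{J}$ and $S_{i}^{GS}$ are notation introduced only for this illustration. With $N_{i}^{k}=p(z_{i}^{k})/p'(z_{i}^{k})$, the Jacobi mode uses only the roots of iteration $k$ in the sum of Equation~\ref{Eq:EA1},
+\[
+S_{i}^{J}=\sum_{j=1,j\neq i}^{n}\frac{1}{z_{i}^{k}-z_{j}^{k}},
+\]
+whereas the Gauss-Seidel mode uses the roots that have already been updated during the current iteration,
+\[
+S_{i}^{GS}=\sum_{j=1}^{i-1}\frac{1}{z_{i}^{k}-z_{j}^{k+1}}+\sum_{j=i+1}^{n}\frac{1}{z_{i}^{k}-z_{j}^{k}} .
+\]
+In both cases the update reads $z_{i}^{k+1}=z_{i}^{k}-N_{i}^{k}/(1-N_{i}^{k}S_{i})$, with $S_{i}$ standing for $S_{i}^{J}$ or $S_{i}^{GS}$.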
+
+\begin{enumerate}
+\begin{algorithm}[htpb]
+%\LinesNumbered
+\caption{CUDA Algorithm to find roots with the Ehrlich-Aberth method}
+\label{alg1-cuda}
+
+\KwIn{$Z^{0}$ (initial root vector), $\varepsilon$ (error tolerance
+ threshold), P (polynomial to solve), Pu (derivative of P), $n$ (polynomial degree), $\Delta z_{max}$ (maximum value of the stop condition)}
+
+\KwOut {$Z$ (solution root vector), $ZPrec$ (previous solution root vector)}
+
+%\BlankLine
+
+\item Initialization of P\;
+\item Initialization of Pu\;
+\item Initialization of the solution vector $Z^{0}$\;
+\item Allocate and copy initial data to the GPU global memory\;
+\item k=0\;
+\While {$\Delta z_{max} > \varepsilon$}{
+\item Let $\Delta z_{max}=0$\;
+\item $ kernel\_save(ZPrec,Z)$\;
+\item k=k+1\;
+\item $ kernel\_update(Z,P,Pu)$\;
+\item $kernel\_testConverge(\Delta z_{max},Z,ZPrec)$\;
+
+}
+\item Copy results from GPU memory to CPU memory\;
+\end{algorithm}
+\end{enumerate}
+~\\