X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/kahina_paper1.git/blobdiff_plain/d7b78423e1604bb5a80c776db126e0288ae639fb..7f2978c0d220516decb65faf2b8ba2da34df8db2:/paper.tex diff --git a/paper.tex b/paper.tex index a40df2f..e3dde6e 100644 --- a/paper.tex +++ b/paper.tex @@ -229,7 +229,7 @@ and experimental study results. Finally, Section~\ref{sec7} 6 concludes this paper and gives some hints for future research directions in this topic. -\section{The Sequential Ehrlich-Aberth method} +\section{Ehrlich-Aberth method} \label{sec1} A cubically convergent iteration method for finding zeros of polynomials was proposed by O. Aberth~\cite{Aberth73}. In the @@ -496,48 +496,54 @@ polynomials of 48000. -\subsection{Sequential Ehrlich-Aberth algorithm} -The main steps of Ehrlich-Aberth method are shown in Algorithm.~\ref{alg1-seq} : -%\LinesNumbered -\begin{algorithm}[H] -\label{alg1-seq} +%% \subsection{Sequential Ehrlich-Aberth algorithm} +%% The main steps of Ehrlich-Aberth method are shown in Algorithm.~\ref{alg1-seq} : +%% %\LinesNumbered +%% \begin{algorithm}[H] +%% \label{alg1-seq} -\caption{A sequential algorithm to find roots with the Ehrlich-Aberth method} +%% \caption{A sequential algorithm to find roots with the Ehrlich-Aberth method} -\KwIn{$Z^{0}$ (Initial root's vector), $\varepsilon$ (error tolerance - threshold), $P$ (Polynomial to solve),$Pu$ (the derivative of P) $\Delta z_{max}$ (maximum value - of stop condition), k (number of iteration), n (Polynomial's degrees)} -\KwOut {$Z$ (The solution root's vector), $ZPrec$ (the previous solution root's vector)} +%% \KwIn{$Z^{0}$ (Initial root's vector), $\varepsilon$ (error tolerance +%% threshold), $P$ (Polynomial to solve),$Pu$ (the derivative of P) $\Delta z_{max}$ (maximum value +%% of stop condition), k (number of iteration), n (Polynomial's degrees)} +%% \KwOut {$Z$ (The solution root's vector), $ZPrec$ (the previous solution root's vector)} -\BlankLine +%% \BlankLine -Initialization of $P$\; -Initialization of $Pu$\; -Initialization of the solution vector $Z^{0}$\; -$\Delta z_{max}=0$\; - k=0\; +%% Initialization of $P$\; +%% Initialization of $Pu$\; +%% Initialization of the solution vector $Z^{0}$\; +%% $\Delta z_{max}=0$\; +%% k=0\; -\While {$\Delta z_{max} > \varepsilon$}{ - Let $\Delta z_{max}=0$\; -\For{$j \gets 0 $ \KwTo $n$}{ -$ZPrec\left[j\right]=Z\left[j\right]$;// save Z at the iteration k.\ +%% \While {$\Delta z_{max} > \varepsilon$}{ +%% Let $\Delta z_{max}=0$\; +%% \For{$j \gets 0 $ \KwTo $n$}{ +%% $ZPrec\left[j\right]=Z\left[j\right]$;// save Z at the iteration k.\ -$Z\left[j\right]=H\left(j, Z, P, Pu\right)$;//update Z with the iterative function.\ -} -k=k+1\; +%% $Z\left[j\right]=H\left(j, Z, P, Pu\right)$;//update Z with the iterative function.\ +%% } +%% k=k+1\; -\For{$i \gets 0 $ \KwTo $n-1$}{ -$c= testConverge(\Delta z_{max},ZPrec\left[j\right],Z\left[j\right])$\; -\If{$c > \Delta z_{max}$ }{ -$\Delta z_{max}$=c\;} -} +%% \For{$i \gets 0 $ \KwTo $n-1$}{ +%% $c= testConverge(\Delta z_{max},ZPrec\left[j\right],Z\left[j\right])$\; +%% \If{$c > \Delta z_{max}$ }{ +%% $\Delta z_{max}$=c\;} +%% } -} -\end{algorithm} +%% } +%% \end{algorithm} -~\\ -In this sequential algorithm, one CPU thread executes all the steps. Let us look to the $3^{rd}$ step i.e. the execution of the iterative function, 2 sub-steps are needed. The first sub-step \textit{save}s the solution vector of the previous iteration, the second sub-step \textit{update}s or computes the new values of the roots vector. -There exists two ways to execute the iterative function that we call a Jacobi one and a Gauss-Seidel one. With the Jacobi iteration, at iteration $k+1$ we need all the previous values $z^{(k)}_{i}$ to compute the new values $z^{(k+1)}_{i}$, that is : +%% ~\\ +%% In this sequential algorithm, one CPU thread executes all the steps. Let us look to the $3^{rd}$ step i.e. the execution of the iterative function, 2 sub-steps are needed. The first sub-step \textit{save}s the solution vector of the previous iteration, the second sub-step \textit{update}s or computes the new values of the roots vector. + +\subsection{Parallel implementation with CUDA } + +In order to implement the Ehrlich-Aberth method in CUDA, it is +possible to use the Jacobi scheme or the Gauss Seidel one. With the +Jacobi iteration, at iteration $k+1$ we need all the previous values +$z^{(k)}_{i}$ to compute the new values $z^{(k+1)}_{i}$, that is : \begin{equation} EAJ: z^{k+1}_{i}=\frac{p(z^{k}_{i})}{p'(z^{k}_{i})-p(z^{k}_{i})\sum^{n}_{j=1 j\neq i}\frac{1}{z^{k}_{i}-z^{k}_{j}}}, i=1,...,n. @@ -553,18 +559,18 @@ Using Eq.~\ref{eq:Aberth-H-GS} to update the vector solution \textit{Z}, we expect the Gauss-Seidel iteration to converge more quickly because, just as any Jacobi algorithm (for solving linear systems of equations), it uses the most fresh computed roots $z^{k+1}_{i}$. -The $4^{th}$ step of the algorithm checks the convergence condition using Eq.~\ref{eq:Aberth-Conv-Cond}. -Both steps 3 and 4 use 1 thread to compute all the $n$ roots on CPU, which is very harmful for performance in case of the large degree polynomials. +%The $4^{th}$ step of the algorithm checks the convergence condition using Eq.~\ref{eq:Aberth-Conv-Cond}. +%Both steps 3 and 4 use 1 thread to compute all the $n$ roots on CPU, which is very harmful for performance in case of the large degree polynomials. -\subsection{Parallel implementation with CUDA } -On the CPU, both steps 3 and 4 contain the loop \verb=for= and a single thread executes all the instructions in the loop $n$ times. In this subsection, we explain how the GPU architecture can compute this loop and reduce the execution time. -In the GPU, the scheduler assigns the execution of this loop to a -group of threads organised as a grid of blocks with block containing a -number of threads. All threads within a block are executed -concurrently in parallel. The instructions run on the GPU are grouped -in special function called kernels. With CUDA, a programmer must -describe the kernel execution context: the size of the Grid, the number of blocks and the number of threads per block. + +%On the CPU, both steps 3 and 4 contain the loop \verb=for= and a single thread executes all the instructions in the loop $n$ times. In this subsection, we explain how the GPU architecture can compute this loop and reduce the execution time. +%In the GPU, the scheduler assigns the execution of this loop to a +%group of threads organised as a grid of blocks with block containing a +%number of threads. All threads within a block are executed +%concurrently in parallel. The instructions run on the GPU are grouped +%in special function called kernels. With CUDA, a programmer must +%describe the kernel execution context: the size of the Grid, the number of blocks and the number of threads per block. %In CUDA programming, all the instructions of the \verb=for= loop are executed by the GPU as a kernel. A kernel is a function written in CUDA and defined by the \verb=__global__= qualifier added before a usual \verb=C= function, which instructs the compiler to generate appropriate code to pass it to the CUDA runtime in order to be executed on the GPU. @@ -575,7 +581,8 @@ Algorithm~\ref{alg2-cuda} shows a sketch of the Ehrlich-Aberth algorithm using C %\LinesNumbered \caption{CUDA Algorithm to find roots with the Ehrlich-Aberth method} -\KwIn{$Z^{0}$ (Initial root's vector), $\varepsilon$ (error tolerance threshold), P(Polynomial to solve),Pu (the derivative of P), $n$ (Polynomial's degrees),$\Delta z_{max}$ (maximum value of stop condition)} +\KwIn{$Z^{0}$ (Initial root's vector), $\varepsilon$ (error tolerance + threshold), P(Polynomial to solve), Pu (the derivative of P), $n$ (Polynomial's degrees),$\Delta z_{max}$ (maximum value of stop condition)} \KwOut {$Z$ (The solution root's vector), $ZPrec$ (the previous solution root's vector)} @@ -612,13 +619,13 @@ exponential logarithm algorithm. \caption{Kernel update} \eIf{$(\left|d\_Z\right|<= R)$}{ -$kernel\_update((d\_Z,d\_Pcoef,d\_Pdegres,d\_Pucoef,d\_Pudegres)$\;} +$kernel\_update((d\_Z,d\_P,d\_Pu)$\;} { -$kernel\_update\_ExpoLog((d\_Z,d\_Pcoef,d\_Pdegres,d\_Pucoef,d\_Pudegres))$\; +$kernel\_update\_ExpoLog((d\_Z,d\_P,\_Pu))$\; } \end{algorithm} -The first form executes formula \ref{eq:SimplePolynome} if the modulus +The first form executes formula the EA function Eq.~\ref{Eq:Hi} if the modulus of the current complex is less than the a certain value called the radius i.e. ($ |z^{k}_{i}|<= R$), else the kernel executes the EA.EL function Eq.~\ref{Log_H2}