X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/kahina_paper1.git/blobdiff_plain/abc776be5595380b3d192f54e5a7b84da33b02ec..10844d40f853b6fad58b5f366618e3b2aec1066c:/paper.tex?ds=sidebyside diff --git a/paper.tex b/paper.tex index 788db9a..8f03506 100644 --- a/paper.tex +++ b/paper.tex @@ -300,7 +300,7 @@ Here we give a second form of the iterative function used by Ehrlich-Aberth meth \begin{equation} \label{Eq:Hi} EA2: z^{k+1}_{i}=z_{i}^{k}-\frac{\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}} -{1-\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}\sum_{j=1,j\neq i}^{j=n}{\frac{1}{(z_{i}^{k}-z_{j}^{k})}}}, i=0,. . . .,n +{1-\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}\sum_{j=1,j\neq i}^{j=n}{\frac{1}{(z_{i}^{k}-z_{j}^{k})}}}, i=1,. . . .,n \end{equation} It can be noticed that this equation is equivalent to Eq.~\ref{Eq:EA}, but we prefer the latter one because we can use it to improve the @@ -363,11 +363,13 @@ Q(z^{k}_{i})=\exp\left( \ln (p(z^{k}_{i}))-\ln(p'(z^{k}_{i}))+\ln \left( \sum_{i\neq j}^{n}\frac{1}{z^{k}_{i}-z^{k}_{j}}\right)\right)i=1,...,n, \end{equation} -This solution is applied when the root except the circle unit, represented by the radius $R$ evaluated in C language as: - +This solution is applied when the root except the circle unit, represented by the radius $R$ evaluated in C language as : +\label{R.EL} +\begin{center} \begin{verbatim} R = exp(log(DBL_MAX)/(2*n) ); -\end{verbatim} +\end{verbatim} +\end{center} %\begin{equation} @@ -385,7 +387,7 @@ Authors usually adopt one of the two following approaches to parallelize root finding algorithms. The first approach aims at reducing the total number of iterations as by Miranker ~\cite{Mirankar68,Mirankar71}, Schedler~\cite{Schedler72} and -Winogard~\cite{Winogard72}. The second approach aims at reducing the +Winograd~\cite{Winogard72}. The second approach aims at reducing the computation time per iteration, as reported in~\cite{Benall68,Jana06,Janall99,Riceall06}. @@ -409,8 +411,8 @@ cause a high degree of memory conflict. Recently the author in~\cite{Mirankar71} proposed two versions of parallel algorithm for the Durand-Kerner method, and Ehrlich-Aberth method on a model of Optoelectronic Transpose Interconnection System (OTIS).The -algorithms are mapped on an OTIS-2D torus using N processors. This -solution needs N processors to compute N roots, which is not +algorithms are mapped on an OTIS-2D torus using $N$ processors. This +solution needs $N$ processors to compute $N$ roots, which is not practical for solving polynomials with large degrees. %Until very recently, the literature did not mention implementations %able to compute the roots of large degree polynomials (higher then @@ -423,7 +425,7 @@ In~\cite{Kahinall14} we already proposed the first implementation of a root finding method on GPUs, that of the Durand-Kerner method. The main result showed that a parallel CUDA implementation is 10 times as fast as the sequential implementation on a single CPU for high degree -polynomials of 48000. +polynomials of 48,000. %In this paper we present a parallel implementation of Ehrlich-Aberth %method on GPUs for sparse and full polynomials with high degree (up %to $1,000,000$). @@ -543,18 +545,25 @@ polynomials of 48000. In order to implement the Ehrlich-Aberth method in CUDA, it is possible to use the Jacobi scheme or the Gauss Seidel one. With the Jacobi iteration, at iteration $k+1$ we need all the previous values -$z^{(k)}_{i}$ to compute the new values $z^{(k+1)}_{i}$, that is : +$z^{k}_{i}$ to compute the new values $z^{k+1}_{i}$, that is : \begin{equation} -EAJ: z^{k+1}_{i}=\frac{p(z^{k}_{i})}{p'(z^{k}_{i})-p(z^{k}_{i})\sum^{n}_{j=1 j\neq i}\frac{1}{z^{k}_{i}-z^{k}_{j}}}, i=1,...,n. +EAJ: z^{k+1}_{i}=z_{i}^{k}-\frac{\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}} +{1-\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}\sum_{j=1,j\neq i}^{j=n}{\frac{1}{(z_{i}^{k}-z_{j}^{k})}}}, i=1,. . . .,n. \end{equation} With the Gauss-Seidel iteration, we have: +%\begin{equation} +%\label{eq:Aberth-H-GS} +%EAGS: z^{k+1}_{i}=\frac{p(z^{k}_{i})}{p'(z^{k}_{i})-p(z^{k}_{i})(\sum^{i-1}_{j=1}\frac{1}{z^{k}_{i}-z^{k+1}_{j}}+\sum^{n}_{j=i+1}\frac{1}{z^{k}_{i}-z^{k}_{j}})}, i=1,...,n. +%\end{equation} + \begin{equation} \label{eq:Aberth-H-GS} -EAGS: z^{k+1}_{i}=\frac{p(z^{k}_{i})}{p'(z^{k}_{i})-p(z^{k}_{i})(\sum^{i-1}_{j=1}\frac{1}{z^{k}_{i}-z^{k+1}_{j}}+\sum^{n}_{j=i+1}\frac{1}{z^{k}_{i}-z^{k}_{j}})}, i=1,...,n. +EAGS: z^{k+1}_{i}=z_{i}^{k}-\frac{\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}} +{1-\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}(\sum^{i-1}_{j=1}\frac{1}{z^{k}_{i}-z^{k+1}_{j}}+\sum_{j=1,j\neq i}^{j=n}{\frac{1}{(z_{i}^{k}-z_{j}^{k})}})}, i=1,. . . .,n \end{equation} -%%Here a finiched my revision %% + Using Eq.~\ref{eq:Aberth-H-GS} to update the vector solution \textit{Z}, we expect the Gauss-Seidel iteration to converge more quickly because, just as any Jacobi algorithm (for solving linear systems of equations), it uses the most fresh computed roots $z^{k+1}_{i}$. @@ -574,8 +583,9 @@ quickly because, just as any Jacobi algorithm (for solving linear systems of equ %In CUDA programming, all the instructions of the \verb=for= loop are executed by the GPU as a kernel. A kernel is a function written in CUDA and defined by the \verb=__global__= qualifier added before a usual \verb=C= function, which instructs the compiler to generate appropriate code to pass it to the CUDA runtime in order to be executed on the GPU. -Algorithm~\ref{alg2-cuda} shows a sketch of the Ehrlich-Aberth algorithm using CUDA. +Algorithm~\ref{alg2-cuda} shows steps of the Ehrlich-Aberth algorithm using CUDA. +\begin{enumerate} \begin{algorithm}[H] \label{alg2-cuda} %\LinesNumbered @@ -588,21 +598,22 @@ Algorithm~\ref{alg2-cuda} shows a sketch of the Ehrlich-Aberth algorithm using C \BlankLine -Initialization of the of P\; -Initialization of the of Pu\; -Initialization of the solution vector $Z^{0}$\; -Allocate and copy initial data to the GPU global memory\; -k=0\; +\item Initialization of the of P\; +\item Initialization of the of Pu\; +\item Initialization of the solution vector $Z^{0}$\; +\item Allocate and copy initial data to the GPU global memory\; +\item k=0\; \While {$\Delta z_{max} > \epsilon$}{ - Let $\Delta z_{max}=0$\; -$ kernel\_save(ZPrec,Z)$\; -k=k+1\; -$ kernel\_update(Z,P,Pu)$\; -$kernel\_testConverge(\Delta z_{max},Z,ZPrec)$\; +\item Let $\Delta z_{max}=0$\; +\item $ kernel\_save(ZPrec,Z)$\; +\item k=k+1\; +\item $ kernel\_update(Z,P,Pu)$\; +\item $kernel\_testConverge(\Delta z_{max},Z,ZPrec)$\; } -Copy results from GPU memory to CPU memory\; +\item Copy results from GPU memory to CPU memory\; \end{algorithm} +\end{enumerate} ~\\ After the initialization step, all data of the root finding problem @@ -615,8 +626,8 @@ polynomial's root found at the previous time-step in GPU memory, in order to check the convergence of the roots after each iteration (line 8, Algorithm~\ref{alg2-cuda}). -The second kernel executes the iterative function $H$ and updates -Z, according to Algorithm~\ref{alg3-update}. We notice that the +The second kernel executes the iterative function and updates +$Z$, according to Algorithm~\ref{alg3-update}. We notice that the update kernel is called in two forms, according to the value \emph{R} which determines the radius beyond which we apply the exponential logarithm algorithm. @@ -637,7 +648,7 @@ The first form executes formula the EA function Eq.~\ref{Eq:Hi} if the modulus of the current complex is less than the a certain value called the radius i.e. ($ |z^{k}_{i}|<= R$), else the kernel executes the EA.EL function Eq.~\ref{Log_H2} -(with Eq.~\ref{deflncomplex}, Eq.~\ref{defexpcomplex}). The radius $R$ is evaluated as : +(with Eq.~\ref{deflncomplex}, Eq.~\ref{defexpcomplex}). The radius $R$ is evaluated as in ~\ref{R.EL} : $$R = \exp( \log(DBL\_MAX) / (2*n) )$$ where $DBL\_MAX$ stands for the maximum representable double value. @@ -843,7 +854,8 @@ numerical applications on GPU. In future works, we plan to investigate the possibility of using several multiple GPUs simultaneously, either with multi-GPU machine or -with cluster of GPUs. +with cluster of GPUs. It may also be interesting to study the +implementation of other root finding polynomial methods on GPU.