X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/kahina_paper2.git/blobdiff_plain/e357a167b1353d31a235c2800076102191d0d0fb..ddad0602649366bd9341613ee440f46e5fa8beba:/paper.tex?ds=inline diff --git a/paper.tex b/paper.tex index 223dfd4..41bdf69 100644 --- a/paper.tex +++ b/paper.tex @@ -462,7 +462,7 @@ where $p'(z)$ is the polynomial derivative of $p$ evaluated in the point $z$. The main problem of the simultaneous methods is that the necessary time needed for the convergence increases with the increasing of the polynomial's degree. Many authors have treated the problem of implementing simultaneous methods in parallel. Freeman~\cite{Freeman89} implemented and compared DK, EA and another method of the fourth order of convergence proposed by Farmer and Loizou~\cite{Loizou83} on a 8-processor linear chain, for polynomials of degree up-to 8. The method of Farmer and Loizou~\cite{Loizou83} often diverges, but the first two methods (DK and EA) have a speed-up equals to 5.5. Later, Freeman and Bane~\cite{Freemanall90} considered asynchronous algorithms in which each processor continues to update its approximations even though the latest values of other $z^{k}_{i}$ have not been received from the other processors, in contrast with synchronous algorithms where it would wait those values before making a new iteration. Couturier et al.~\cite{Raphaelall01} proposed two methods of parallelization for a shared memory architecture with \textit{OpenMP} and for a distributed memory one with \textit{MPI}. They are able to compute the roots of sparse polynomials of degree 10,000 in 116 seconds with \textit{OpenMP} and 135 seconds with \textit{MPI} only by using 8 personal computers and 2 communications per iteration. \LZK{je suppose que c'est pour la version mpi (only by using 8 personal computers and 2 communications per iteration). A t on utilisé le même nombre de procs pour les deux versions openmp et mpi} The authors showed an interesting speedup comparing to the sequential implementation that takes up-to 3,300 seconds to obtain same results. -Very few work had been performed since then until the appearing of the Compute Unified Device Architecture (CUDA)~\cite{CUDA10}, a parallel computing platform and a programming model invented by NVIDIA. The computing power of GPUs (Graphics Processing Units) has exceeded that of CPUs. However, CUDA adopts a totally new computing architecture to use the hardware resources provided by the GPU in order to offer a stronger computing ability to the massive data computing. Ghidouche and al~\cite{Kahinall14} proposed an implementation of the Durand-Kerner method on a single GPU. Their main results showed that a parallel CUDA implementation is about 10 times faster than the sequential implementation on a single CPU for sparse polynomials of degree 48,000. +Very few work had been performed since then until the appearing of the Compute Unified Device Architecture (CUDA)~\cite{CUDA15}, a parallel computing platform and a programming model invented by NVIDIA. The computing power of GPUs (Graphics Processing Units) has exceeded that of CPUs. However, CUDA adopts a totally new computing architecture to use the hardware resources provided by the GPU in order to offer a stronger computing ability to the massive data computing. Ghidouche and al~\cite{Kahinall14} proposed an implementation of the Durand-Kerner method on a single GPU. Their main results showed that a parallel CUDA implementation is about 10 times faster than the sequential implementation on a single CPU for sparse polynomials of degree 48,000. %Finding polynomial roots rapidly and accurately is the main objective of our work. In this paper we propose the parallelization of Ehrlich-Aberth method using two parallel programming paradigms OpenMP and MPI on multi-GPU platforms. We consider two architectures: shared memory and distributed memory computers. The first parallel algorithm is implemented on shared memory computers by using OpenMP API. It is based on threads created from the same system process, such that each thread is attached to one GPU. In this case the communications between GPUs are done by OpenMP threads through shared memory. The second parallel algorithm uses the MPI API, such that each GPU is attached and managed by a MPI process. The GPUs exchange their data by message-passing communications. This latter approach is more used on distributed memory clusters to solve very complex problems that are too large for traditional supercomputers, which are very expensive to build and run. %\LZK{Cette partie est réécrite. \\ Sinon qu'est ce qui a été fait pour l'accuracy dans ce papier (Finding polynomial roots rapidly and accurately is the main objective of our work.)?} @@ -526,10 +526,10 @@ CUDA (Compute Unified Device Architecture) is a parallel computing architecture \label{sec3} \subsection{The EA method} -A cubically convergent iteration method to find zeros of -polynomials was proposed by O. Aberth~\cite{Aberth73}. The -Ehrlich-Aberth (EA is short) method contains 4 main steps, presented in what -follows. +%A cubically convergent iteration method to find zeros of +%polynomials was proposed by O. Aberth~\cite{Aberth73}. The +%Ehrlich-Aberth (EA is short) method contains 4 main steps, presented in what +%follows. %The Aberth method is a purely algebraic derivation. %To illustrate the derivation, we let $w_{i}(z)$ be the product of linear factors @@ -555,59 +555,59 @@ follows. %Substituting $x_{j}$ for $z_{j}$ we obtain the Aberth iteration method.% -\subsubsection{Polynomials Initialization} -The initialization of a polynomial $p(z)$ is done by setting each of the $n$ complex coefficients $a_{i}$: +%\subsubsection{Polynomials Initialization} +%The initialization of a polynomial $p(z)$ is done by setting each of the $n$ complex coefficients %$a_{i}$: -\begin{equation} -\label{eq:SimplePolynome} - p(z)=\sum{a_{i}z^{n-i}} , a_{n} \neq 0,a_{0}=1, a_{i}\subset C -\end{equation} +%\begin{equation} +%\label{eq:SimplePolynome} +% p(z)=\sum{a_{i}z^{n-i}} , a_{n} \neq 0,a_{0}=1, a_{i}\subset C +%\end{equation} -\subsubsection{Vector $Z^{(0)}$ Initialization} -\label{sec:vec_initialization} -As for any iterative method, we need to choose $n$ initial guess points $z^{0}_{i}, i = 1, . . . , n.$ -The initial guess is very important since the number of steps needed by the iterative method to reach -a given approximation strongly depends on it. -In~\cite{Aberth73} the Ehrlich-Aberth iteration is started by selecting $n$ -equi-distant points on a circle of center 0 and radius r, where r is -an upper bound to the moduli of the zeros. Later, Bini and al.~\cite{Bini96} -performed this choice by selecting complex numbers along different -circles which relies on the result of~\cite{Ostrowski41}. +%\subsubsection{Vector $Z^{(0)}$ Initialization} +%\label{sec:vec_initialization} +%As for any iterative method, we need to choose $n$ initial guess points $z^{0}_{i}, i = 1, . . . , %n.$ +%The initial guess is very important since the number of steps needed by the iterative method to %reach +%a given approximation strongly depends on it. +%In~\cite{Aberth73} the Ehrlich-Aberth iteration is started by selecting $n$ +%equi-distant points on a circle of center 0 and radius r, where r is +%an upper bound to the moduli of the zeros. Later, Bini and al.~\cite{Bini96} +%performed this choice by selecting complex numbers along different +%circles which relies on the result of~\cite{Ostrowski41}. -\begin{equation} -\label{eq:radiusR} +%\begin{equation} +%\label{eq:radiusR} %%\begin{align} -\sigma_{0}=\frac{u+v}{2};u=\frac{\sum_{i=1}^{n}u_{i}}{n.max_{i=1}^{n}u_{i}}; -v=\frac{\sum_{i=0}^{n-1}v_{i}}{n.min_{i=0}^{n-1}v_{i}};\\ +%\sigma_{0}=\frac{u+v}{2};u=\frac{\sum_{i=1}^{n}u_{i}}{n.max_{i=1}^{n}u_{i}}; +%v=\frac{\sum_{i=0}^{n-1}v_{i}}{n.min_{i=0}^{n-1}v_{i}};\\ %%\end{align} -\end{equation} -Where: -\begin{equation} -u_{i}=2.|a_{i}|^{\frac{1}{i}}; -v_{i}=\frac{|\frac{a_{n}}{a_{i}}|^{\frac{1}{n-i}}}{2}. -\end{equation} +%\end{equation} +%Where: +%\begin{equation} +%u_{i}=2.|a_{i}|^{\frac{1}{i}}; +%v_{i}=\frac{|\frac{a_{n}}{a_{i}}|^{\frac{1}{n-i}}}{2}. +%\end{equation} -\subsubsection{Iterative Function} -The operator used by the Aberth method corresponds to the -equation~\ref{Eq:EA1}, it enables the convergence towards -the polynomials zeros, provided all the roots are distinct. +%\subsubsection{Iterative Function} +%The operator used by the Aberth method corresponds to the +%equation~\ref{Eq:EA1}, it enables the convergence towards +%the polynomials zeros, provided all the roots are distinct. %Here we give a second form of the iterative function used by the Ehrlich-Aberth method: -\begin{equation} -\label{Eq:EA1} -EA: z^{k+1}_{i}=z_{i}^{k}-\frac{\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}} -{1-\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}\sum_{j=1,j\neq i}^{j=n}{\frac{1}{(z_{i}^{k}-z_{j}^{k})}}}, i=1,. . . .,n -\end{equation} +%\begin{equation} +%\label{Eq:EA1} +%EA: z^{k+1}_{i}=z_{i}^{k}-\frac{\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}} +%{1-\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}\sum_{j=1,j\neq i}^{j=n}{\frac{1}{(z_{i}^{k}-z_{j}^{k})}}}, %i=1,. . . .,n +%\end{equation} -\subsubsection{Convergence Condition} -The convergence condition determines the termination of the algorithm. It consists in stopping the iterative function when the roots are sufficiently stable. We consider that the method converges sufficiently when: +%\subsubsection{Convergence Condition} +%The convergence condition determines the termination of the algorithm. It consists in stopping the %iterative function when the roots are sufficiently stable. We consider that the method converges %sufficiently when: -\begin{equation} -\label{eq:Aberth-Conv-Cond} -\forall i \in [1,n];\vert\frac{z_{i}^{k}-z_{i}^{k-1}}{z_{i}^{k}}\vert<\xi -\end{equation} +%\begin{equation} +%\label{eq:Aberth-Conv-Cond} +%\forall i \in [1,n];\vert\frac{z_{i}^{k}-z_{i}^{k-1}}{z_{i}^{k}}\vert<\xi +%\end{equation} %\begin{figure}[htbp] @@ -617,10 +617,20 @@ The convergence condition determines the termination of the algorithm. It consis %\label{fig:03} %\end{figure} -%the Ehrlich-Aberth method is an iterative method, contain 4 steps, start from the initial approximations of all the -%roots of the polynomial,the second step initialize the solution vector $Z$ using the Guggenheimer method to assure the distinction of the initial vector roots, than in step 3 we apply the the iterative function based on the Newton's method and Weiestrass operator[...,...], wich will make it possible to converge to the roots solution, provided that all the root are different. At the end of each application of the iterative function, a stop condition is verified consists in stopping the iterative process when the whole of the modules of the roots -%are lower than a fixed value $ε$ +the Ehrlich-Aberth method is an iterative method, contain 4 steps, start from the initial approximations of all the roots of the polynomial,the second step initialize the solution vector $Z$ using the Guggenheimer method to assure the distinction of the initial vector roots, than in step 3 we apply the the iterative function based on the Newton's method and Weiestrass operator~\cite{,}, wich will make it possible to converge to the roots solution, provided that all the root are different. + +\begin{equation} +\label{Eq:EA1} +EA: z^{k+1}_{i}=z_{i}^{k}-\frac{\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}} +{1-\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}\sum_{j=1,j\neq i}^{j=n}{\frac{1}{(z_{i}^{k}-z_{j}^{k})}}}, i=1,. . . .,n +\end{equation} + + At the end of each application of the iterative function, a stop condition is verified consists in stopping the iterative process when the whole of the modules of the roots are lower than a fixed value $\xi$ +\begin{equation} +\label{eq:Aberth-Conv-Cond} +\forall i \in [1,n];\vert\frac{z_{i}^{k}-z_{i}^{k-1}}{z_{i}^{k}}\vert<\xi +\end{equation} \subsection{EA parallel implementation on CUDA} We introduced three paradigms of parallel programming. Our objective consists in implementing a root finding polynomial algorithm on multiple GPUs. To this end, it is primordial to know how to manage CUDA contexts of different GPUs. A direct method for controlling the various GPUs is to use as many threads or processes as GPU devices. We can choose the GPU index based on the identifier of OpenMP thread or the rank of the MPI process. Both approaches will be investigated. @@ -845,7 +855,7 @@ The experiments shows the execution time of the EA algorithm, on a single GPU an Results with full polynomials show very important savings in execution time. For a polynomial of degree 1,4 million, the CUDA-OpenMP approach with 4 GPUs solves it 4 times as fast as single GPU, thus achieving a quasi-linear speedup. -\subsection{Evaluting the Multi-GPU (CUDA-MPI) approach} +\subsection{Evaluating the Multi-GPU (CUDA-MPI) approach} In this part we perform a set of experiments to compare Multi-GPU (CUDA MPI) approach with single GPU, for solving full and sparse polynomials of degrees ranging from 100,000 to 1,400,000. \subsubsection{Execution times in seconds of the Ehrlich-Aberth method for solving sparse polynomials on GPUs using distributed memory paradigm with MPI} @@ -860,6 +870,7 @@ In this part we perform a set of experiments to compare Multi-GPU (CUDA MPI) app This figure shows 4 curves of execution time of EA algorithm, a curve with single GPU, 3 curves with multiple GPUs (2, 3, 4). We can clearly see that the curve with single GPU is above the other curves, which shows consumption in execution time compared to the Multi-GPU. We can see also that the CUDA-MPI approach reduces the execution time by a factor of 100 for polynomials of degree more than 1,000,000 whereas a single GPU is of the scale 1000. %%SIDER : Je n'ai pas reformuler car je n'ai pas compris la phrase, merci de l'ecrire ici en fran\cais. \\cette figure montre 4 courbes de temps d'exécution pour l'algorithme EA, une courbe avec un seul GPU, 3 courbes pour multiple GPUs(2, 3, 4), on peut constaté clairement que la courbe à un seul GPU est au-dessus des autres courbes, vue sa consomation en temps d'exècution. On peut voir aussi qu'avec l'approche Multi-GPU (CUDA-MPI) reduit le temps d'exècution jusqu'à l'echelle 100 pour le polynômes qui dépasse 1,000,000 tandis que Single GPU est de l'echelle 1000. + \subsubsection{Execution times in seconds of the Ehrlich-Aberth method for solving full polynomials on GPUs using distributed memory paradigm with MPI} \begin{figure}[htbp] @@ -876,42 +887,59 @@ This is due to the use of MPI parallel paradigm that divides the problem computa +\subsection{Comparative between (CUDA-OpenMP) approach and (CUDA-MPI) approach} +In this part we present some experiment comparing the two Multi-GPU approach (OpenMP versus MPI) for solving sparse polynomial, full polynomials than we compare the execution time of the Ehrlich-Aberth method for solving sparse and full polynomials on GPUs with MPI and with OpenMP. +\subsubsection{Comparison between MPI and OpenMP versions of the Ehrlich-Aberth method for solving sparse polynomials on GPUs} +In this experiment we chose three polynomials of different size like (200K, 800K, 1,4M). We compare their execution time according to the number of the GPUs. \begin{figure}[htbp] \centering \includegraphics[angle=-90,width=0.5\textwidth]{Sparse} -\caption{Comparaison between MPI and OpenMP versions of the Ehrlich-Aberth method for solving sparse plynomials on GPUs.} +\caption{Comparison between MPI and OpenMP versions of the Ehrlich-Aberth method for solving sparse polynomials on GPUs.} \label{fig:05} \end{figure} +in figure ~\ref{fig:05} we have two curves: MPI curve and OpenMP curve for each polynomials size. We can see that the results are similar between OpenMP curves and MPI curves for the polynomials size (200K, 1,4M), but there is a slight different between MPI curve and OpenMP curve for the polynomial of size 800K. ... +\subsubsection{Comparison between MPI and OpenMP versions of the Ehrlich-Aberth method for solving full polynomials on GPUs} \begin{figure}[htbp] \centering \includegraphics[angle=-90,width=0.5\textwidth]{Full} -\caption{Comparaison between MPI and OpenMP versions of the Ehrlich-Aberth method for solving full polynomials on GPUs.} +\caption{Comparison between MPI and OpenMP versions of the Ehrlich-Aberth method for solving full polynomials on GPUs.} \label{fig:06} \end{figure} +in figure ~\ref{fig:06}, we can see that the two paradigm MPI and OpenMP give the same result for solving full polynomials with EA algorithm. +% size (200k,800K, 1,4M) are very similar for solving full polynomials with the EA algorithm. +\subsubsection{Comparison of execution times of the Ehrlich-Aberth method for solving sparse and full polynomials on GPUs with distributed memory paradigm using MPI} +in this experiment we compare the execution time of EA algorithm according to the number of the GPU for solving sparse and full polynomials on Multi-GPU using MPI. We chose three sparse and full polynomials of different size like (200K, 800K, 1,4M). \begin{figure}[htbp] \centering \includegraphics[angle=-90,width=0.5\textwidth]{MPI} -\caption{Comparaison of execution times of the Ehrlich-Aberth method for solving sparse and full polynomials on GPUs with distributed memory paradigm using MPI.} +\caption{Comparison of execution times of the Ehrlich-Aberth method for solving sparse and full polynomials on GPUs with distributed memory paradigm using MPI.} \label{fig:07} \end{figure} +in figure ~\ref{fig:07} we can see that (CUDA-MPI) can solve sparse and full polynomials of high degrees, the execution time with sparse polynomial are very low comparing to full polynomials. with sparse polynomials the number of monomial are reduce, consequently the number of operation are reduce than the execution time decrease. + +\subsubsection{Comparison of execution times of the Ehrlich-Aberth method for solving sparse and full polynomials on GPUs with shared memory paradigm using OpenMP} \begin{figure}[htbp] \centering \includegraphics[angle=-90,width=0.5\textwidth]{OMP} -\caption{Comparaison of execution times of the Ehrlich-Aberth method for solving sparse and full polynomials on GPUs with shared memory paradigm using OpenMP.} +\caption{Comparison of execution times of the Ehrlich-Aberth method for solving sparse and full polynomials on GPUs with shared memory paradigm using OpenMP.} \label{fig:08} \end{figure} +in figure ~\ref{fig:08} +\subsection{Scalability of the EA method on Multi-GPU to solve very high polynomials degrees} + This experiment we report the execution time according to the degrees polynomials ranging from 1,000,000 to 5,000,000 for both approaches (cuda-OpenMP) and (CUDA-MPI) with sparse and full polynomials. \begin{figure}[htbp] \centering \includegraphics[angle=-90,width=0.5\textwidth]{big} -\caption{Execution times in seconds of the Ehrlich-Aberth method for solving full polynomials of high degrees on 4 GPUs.} + \caption{Execution times in seconds of the Ehrlich-Aberth method for solving full polynomials of high degrees on 4 GPUs.} \label{fig:09} \end{figure} - +in figure ~\ref{fig:09} we can see that both (cuda-OpenMP) and (CUDA-MPI) approaches are scalable can solve very high polynomials degrees. with full polynomial the both approaches give very interesting ans similar results for polynomials of 5,000,000 degrees we not reach 30,000 s +%for sparse and full polynomials % An example of a floating figure using the graphicx package. % Note that \label must occur AFTER (or within) \caption. % For figures, \caption should occur after the \includegraphics. @@ -1010,7 +1038,7 @@ This is due to the use of MPI parallel paradigm that divides the problem computa \section{Conclusion} -\label{sec5} +\label{sec6} In this paper, we have presented a parallel implementation of Ehrlich-Aberth algorithm for solving full and sparse polynomials, on single GPU with CUDA and on multiple GPUs using two parallel paradigms : shared memory with OpenMP and distributed memory with MPI. These architectures were addressed by a CUDA-OpenMP approach and CUDA-MPI approach, respectively. The experiments show that, using parallel programming model like (OpenMP, MPI), we can efficiently manage multiple graphics cards to work together to solve the same problem and accelerate the parallel execution with 4 GPUs and solve a polynomial of degree 1,000,000, four times faster than on single GPU, that is a quasi-linear speedup.