From: Kahina Date: Sat, 2 Jan 2016 17:19:32 +0000 (+0100) Subject: MAJ X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/kahina_paper2.git/commitdiff_plain/e128b20edaa616004667fb4e35f332f23b1b1512?ds=sidebyside;hp=7428d8e83fd6d03716256f2d3d66cbb2cac57694 MAJ --- diff --git a/paper.tex b/paper.tex index 8e5ba4d..6d33e40 100644 --- a/paper.tex +++ b/paper.tex @@ -422,7 +422,7 @@ Finding roots of polynomials is a very important part of solving real-life probl \section{Introduction} -Polynomials are mathematical algebraic structures that play an important role in science and engineering by capturing physical phenomena and expressing any outcome as a function of some unknown variables. Formally speaking, a polynomial $p(x)$ of degree $n$ having $n$ coefficients in the complex plane $\mathbb{C}$ is: \begin{equation}p(x)=\sum_{i=0}^{n-1}{a_ix^i}.\end{equation} +Polynomials are mathematical algebraic structures that play an important role in science and engineering by capturing physical phenomena and expressing any outcome as a function of some unknown variables. Formally speaking, a polynomial $p(x)$ of degree $n$ having $n+1$ coefficients in the complex plane $\mathbb{C}$ is: \begin{equation}p(x)=\sum_{i=0}^{n}{a_ix^i}.\end{equation} \LZK{In that case the polynomial is of degree $n-1$!} The root-finding problem consists in finding the $n$ different values of the unknown variable $x$ for which $p(x)=0$. Such values are called zeros of $p$ (\textit{i.e.} roots). If zeros are $\alpha_{i}$, $i=1,\ldots,n$, then $p(x)$ can be written as: @@ -449,16 +449,16 @@ point $z$. %Aberth, Ehrlich and Farmer-Loizou~\cite{Loizou83} have proved that %the Ehrlich-Aberth method (EA) has a cubic order of convergence for simple roots whereas the Durand-Kerner has a quadratic order of %convergence. -The main problem of the simultaneous methods is that the necessary time needed for the convergence increases with the increasing of the polynomial's degree. 
Many authors have treated the problem of implementing simultaneous methods in parallel. Freeman~\cite{Freeman89} implemented and compared DK, EA and another method of the fourth order proposed by Farmer and Loizou~\cite{Loizou83} \LZK{of the fourth order ?? \\ Otherwise, can we give and cite the name of the third method?} on a 8-processor linear chain, for polynomials of degree up-to 8. +The main problem of the simultaneous methods is that the necessary time needed for the convergence increases with the increasing of the polynomial's degree. Many authors have treated the problem of implementing simultaneous methods in parallel. Freeman~\cite{Freeman89} implemented and compared DK, EA and another method of the fourth order proposed by Farmer and Loizou~\cite{Loizou83} \LZK{of the fourth order ?? \color{red}{of convergence} \\ Otherwise, can we give and cite the name of the third method?\color{red}{Farmer-Loizou method}} on a 8-processor linear chain, for polynomials of degree up-to 8. The third method often diverges, \LZK{It would be better to give the name of this third method} but the first two methods have a speed-up equal to 5.5. Later, Freeman and Bane~\cite{Freemanall90} considered asynchronous algorithms, in which each processor continues to update its approximations even though the latest values of other $z^{k}_{i}$ have not been received from the other processors, in contrast with synchronous algorithms where it would wait for those values before making a new iteration. Couturier et al.~\cite{Raphaelall01} proposed two methods of parallelization for a shared memory architecture with \textit{OpenMP} and for a distributed memory one with \textit{MPI}. They are able to compute the roots of sparse polynomials of degree 10,000 in 116 seconds with \textit{OpenMP} and 135 seconds with \textit{MPI} only by using 8 personal computers and 2 communications per iteration. 
\LZK{I assume this applies to the MPI version (only by using 8 personal computers and 2 communications per iteration). Was the same number of processors used for both the OpenMP and MPI versions?} The authors showed an interesting speedup compared to the sequential implementation that takes up-to 3,300 seconds to obtain the same results. -Very few work had been performed since then until the appearing of the Compute Unified Device Architecture (CUDA)~\cite{CUDA10}, a parallel computing platform and a programming model invented by NVIDIA. The computing power of GPUs (Graphics Processing Unit) has exceeded that of CPUs. However, CUDA adopts a totally new computing architecture to use the hardware resources provided by the GPU in order to offer a stronger computing ability to the massive data computing. Ghidouche et al~\cite{Kahinall14} proposed an implementation of the Durand-Kerner method on a single GPU. Their main results showed that a parallel CUDA implementation is about 10 times faster than the sequential implementation on a single CPU for sparse polynomials of degree 48,000. +Very few work had been performed since then until the appearing of the Compute Unified Device Architecture (CUDA)~\cite{CUDA10}, a parallel computing platform and a programming model invented by NVIDIA. The computing power of GPUs (Graphics Processing Unit) has exceeded that of CPUs. However, CUDA adopts a totally new computing architecture to use the hardware resources provided by the GPU in order to offer a stronger computing ability to the massive data computing. Ghidouche and al~\cite{Kahinall14} proposed an implementation of the Durand-Kerner method on a single GPU. Their main results showed that a parallel CUDA implementation is about 10 times faster than the sequential implementation on a single CPU for sparse polynomials of degree 48,000. Finding polynomial roots rapidly and accurately is the main objective of our work. 
In this paper we propose the parallelization of Ehrlich-Aberth method using two parallel programming paradigms OpenMP and MPI on multi-GPU platforms. {\color{red}{We consider two architectures: shared-memory computers with OpenMP API and distributed-memory computers with MPI API. The first approach is based on threads from the same system process, with each thread attached to one GPU and after the various memory allocations, each thread launches its part of computations. To do this we must first load on the GPU required data and after the computations are carried, repatriate the result on the host. The second approach i.e distributed memory with MPI relies on the MPI library which is often used for parallel programming~\cite{Peter96} in cluster systems because it is a message-passing programming language. Each GPU is attached to one MPI process, and a loop is in charge of the distribution of tasks between the MPI processes. This solution can be used on one GPU, or executed on a distributed cluster of GPUs, employing the Message Passing Interface (MPI) to communicate between separate CUDA cards. This solution permits scaling of the problem size to larger classes than would be possible on a single device and demonstrates the performance which users might expect from future HPC architectures where accelerators are deployed.}} \LZK{Trop détaillé et mal expliqué. \\ We consider two architectures: shared-memory and distributed-memory computers. The first parallel algorithm is implemented on shared-memory computers by using OpenMP API. It is based on threads created from the same system process, such that each thread is attached to one GPU. In this case the communications between GPUs are done by OpenMP threads through shared memory. The second parallel algorithm uses the MPI API, such that each GPU is attached and managed by a MPI process. The GPUs exchange their data by message-passing communications. 
This latter approach is more used on distributed-memory clusters to solve very complex problems that are too large for traditional supercomputers, which are very expensive to build and run.} -{\color{red}{This paper is organized as follows. In Section~\ref{sec2} we recall the Ehrlich-Aberth method. In section 3 we present EA algorithm on single GPU. In section 4 we propose the EA algorithm implementation on Multi-GPU for (OpenMP-CUDA) approach and (MPI-CUDA) approach. In section 5 we present our experiments and discus it. Finally, Section~\ref{sec6} concludes this paper and gives some hints for future research directions in this topic.}}\LZK{This whole organization needs to be revised} +{\color{red}{This paper is organized as follows. In Section~\ref{sec2} we recall the Ehrlich-Aberth method. In section~\ref{sec3} we present EA algorithm on single GPU. In section~\ref{sec4} we propose the EA algorithm implementation on Multi-GPU for (OpenMP-CUDA) approach and (MPI-CUDA) approach. In section~\ref{sec5} we present our experiments and discus it. Finally, Section~\ref{sec6} concludes this paper and gives some hints for future research directions in this topic.}}\LZK{This whole organization needs to be revised} \section{Parallel Programmings Model} @@ -759,6 +759,7 @@ Since a GPU works only on data already allocated in its memory, all local input ~\\ \section{Experiments} +\label{sec5} We study two categories of polynomials: sparse polynomials and full polynomials.\\ {\it A sparse polynomial} is a polynomial for which only some coefficients are not null. In this paper, we consider sparse polynomials for which the roots are distributed on 2 distinct circles: \begin{equation} @@ -826,7 +827,7 @@ In this part we perform a set of experiments to compare Multi-GPU (CUDA MPI) app ~\\ This figure shows 4 curves of execution time of EA algorithm, a curve with single GPU, 3 curves with multiple GPUs (2, 3, 4). 
We can clearly see that the curve with single GPU is above the other curves, which shows consumption in execution time compared to the Multi-GPU. We can see also that the CUDA-MPI approach reduces the execution time by a factor of 100 for polynomials of degree more than 1,000,000 whereas a single GPU is of the scale 1000. %%SIDER : I did not rephrase this because I did not understand the sentence; please write it here in French. -\\ +\\This figure shows 4 execution-time curves for the EA algorithm: one curve with a single GPU and 3 curves for multiple GPUs (2, 3, 4). We can clearly see that the single-GPU curve lies above the other curves, given its execution-time consumption. We can also see that the Multi-GPU (CUDA-MPI) approach reduces the execution time down to the scale of 100 for polynomials of degree exceeding 1,000,000, whereas the single GPU is at the scale of 1000. \subsubsection{Execution times in seconds of the Ehrlich-Aberth method for solving full polynomials on GPUs using distributed memory paradigm with MPI} \begin{figure}[htbp]