X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/kahina_paper2.git/blobdiff_plain/0036086ac3b16dd8e27667e166acbaae2b5703a8..2c8490c6d1f44464682f315ff67bcb96b7e8296b:/paper.tex?ds=sidebyside diff --git a/paper.tex b/paper.tex index 2fd0993..8203ed0 100644 --- a/paper.tex +++ b/paper.tex @@ -50,7 +50,7 @@ of polynomials of degree up-to 5 millions. \end{abstract} % no keywords -\LZK{Faut pas mettre des keywords?} +\LZK{Faut pas mettre des keywords?\KG{Oui d'après ça: "no keywords" qui se trouve dans leur fichier source!!, mais c'est Bizzard!!!}} \IEEEpeerreviewmaketitle @@ -88,17 +88,15 @@ which each processor continues to update its approximations even though the latest values of other approximations $z^{k}_{i}$ have not been received from the other processors, in contrast with synchronous algorithms where it would wait those values before making a new -iteration. Couturier et al.~\cite{Raphaelall01} proposed two methods +iteration. Couturier and al.~\cite{Raphaelall01} proposed two methods of parallelization for a shared memory architecture with OpenMP and for a distributed memory one with MPI. They are able to compute the -roots of sparse polynomials of degree 10,000 in 116 seconds with -OpenMP and 135 seconds with MPI only by using 8 personal computers and -2 communications per iteration. The authors showed an interesting -speedup comparing to the sequential implementation which takes up-to -3,300 seconds to obtain same results. +roots of sparse polynomials of degree 10,000. The authors showed an interesting +speedup that is 20 times as fast as the sequential implementation. +%which takes up-to 3,300 seconds to obtain same results. \RC{si on donne des temps faut donner le proc, comme c'est vieux à mon avis faut supprimer ca, votre avis?} \LZK{Supprimons ces détails et mettons une référence s'il y en a une} - +\KG{Je viens de supprimer les détails, la référence existe déja, a reverifier} Very few work had been performed since then until the appearing of the Compute Unified Device Architecture (CUDA)~\cite{CUDA15}, a parallel computing platform and a programming model invented by NVIDIA. The computing power of GPUs (Graphics Processing Units) has exceeded that of traditional processors CPUs. However, CUDA adopts a totally new computing architecture to use the hardware resources provided by the GPU in order to offer a stronger computing ability to the massive data computing. Ghidouche et al.~\cite{Kahinall14} proposed an implementation of the Durand-Kerner method on a single GPU. Their main results showed that a parallel CUDA implementation is about 10 times faster than the sequential implementation on a single CPU for sparse polynomials of degree 48,000. In this paper we propose the parallelization of Ehrlich-Aberth method which has a good convergence and it is suitable to be implemented in parallel computers. We use two parallel programming paradigms OpenMP and MPI on CUDA multi-GPU platforms. Our CUDA-MPI and CUDA-OpenMP codes are the first implementations of Ehrlich-Aberth method with multiple GPUs for finding roots of polynomials. Our major contributions include: @@ -109,7 +107,8 @@ In this paper we propose the parallelization of Ehrlich-Aberth method which has \item The parallel implementation of Ehrlich-Aberth algorithm on a multi-GPU platform with a distributed memory using MPI API, such that each GPU is attached and managed by a MPI process. The GPUs exchange their data by message-passing communications. \end{itemize} This latter approach is more used on clusters to solve very complex problems that are too large for traditional supercomputers, which are very expensive to build and run. -\LZK{Pas d'autres contributions possibles? J'ai supprimé les deux premiers points proposés précédemment.} +\LZK{Pas d'autres contributions possibles? J'ai supprimé les deux premiers points proposés précédemment. +\AS{La résolution du problème pour des polynomes pleins de degré 6M est une contribution aussi à mon avis}} The paper is organized as follows. In Section~\ref{sec2} we present three different parallel programming models OpenMP, MPI and CUDA. In Section~\ref{sec3} we present the implementation of the Ehrlich-Aberth algorithm on a single GPU. In Section~\ref{sec4} we present the parallel implementations of the Ehrlich-Aberth algorithm on multiple GPUs using the OpenMP and MPI approaches. In section~\ref{sec5} we present our experiments and discuss them. Finally, Section~\ref{sec6} concludes this paper and gives some hints for future research directions in this topic.