From: zianekhodja
Date: Sun, 3 Jan 2016 20:46:48 +0000 (+0100)
Subject: Proofreading of the CUDA section
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/kahina_paper2.git/commitdiff_plain/acc9e54f340b556928943d478703538ac5843299
Proofreading of the CUDA section
---
diff --git a/mybibfile.bib b/mybibfile.bib
index dc57751..7513da0 100644
--- a/mybibfile.bib
+++ b/mybibfile.bib
@@ -184,20 +184,16 @@ OPTannote = {•}
 pages = "742-746",
 year = "1941",
 author = "A. Ostrowski",
-}x
+}
-@Manual{CUDA10,
-title = {Compute Unified Device Architecture Programming Guide Version 3.0},
+@Manual{CUDA15,
+title = {{CUDA} {C} programming guide},
 OPTkey = {NVIDIA CUDA},
-OPTauthor = {•},
-OPTorganization = {NVIDIA CUDA},
-OPTaddress = {•},
-OPTedition = {•},
-OPTmonth = {March},
-OPTyear = {2010},
-OPTnote = {http://www.nvidia.com/object/cuda_develop.html},
-OPTannote = {•}
+OPTorganization = {{NVIDIA}},
+OPTmonth = {September},
+OPTyear = {2015},
+URL = {{http://docs.nvidia.com/cuda/pdf/CUDA\_C\_Programming\_Guide.pdf}}
 }
 @Article{Kahinall14,
diff --git a/paper.tex b/paper.tex
index b0e76b5..24fea51 100644
--- a/paper.tex
+++ b/paper.tex
@@ -339,8 +339,7 @@
 % not capitalized unless they are the first or last word of the title.
 % Linebreaks \\ can be used within to get better formatting as desired.
 % Do not put math or special symbols in the title.
-\title{Two parallel implementations of Ehrlich-Aberth algorithm for root finding of polynomials
-on multiple GPUs with OpenMP and MPI}
+\title{Two parallel implementations of Ehrlich-Aberth algorithm for root-finding of polynomials on multiple GPUs with OpenMP and MPI}
 % author names and affiliations
@@ -454,7 +453,8 @@ where $p'(z)$ is the polynomial derivative of $p$ evaluated in the point $z$.
 %Aberth, Ehrlich and Farmer-Loizou~\cite{Loizou83} have proved that
 %the Ehrlich-Aberth method (EA) has a cubic order of convergence for simple roots whereas the Durand-Kerner has a quadratic order of
 %convergence.
-The main problem of the simultaneous methods is that the necessary time needed for the convergence increases with the increasing of the polynomial's degree. Many authors have treated the problem of implementing simultaneous methods in parallel. Freeman~\cite{Freeman89} implemented and compared DK, EA and another method of the fourth order of convergence proposed by Farmer and Loizou~\cite{Loizou83} on a 8-processor linear chain, for polynomials of degree up-to 8. The method of Farmer and Loizou~\cite{Loizou83} often diverges, but the first two methods (DK and EA) have a speed-up equals to 5.5. Later, Freeman and Bane~\cite{Freemanall90} considered asynchronous algorithms in which each processor continues to update its approximations even though the latest values of other $z^{k}_{i}$ have not been received from the other processors, in contrast with synchronous algorithms where it would wait those values before making a new iteration. Couturier et al.~\cite{Raphaelall01} proposed two methods of parallelization for a shared memory architecture with \textit{OpenMP} and for a distributed memory one with \textit{MPI}. They are able to compute the roots of sparse polynomials of degree 10,000 in 116 seconds with \textit{OpenMP} and 135 seconds with \textit{MPI} only by using 8 personal computers and 2 communications per iteration. \LZK{I assume this refers to the MPI version (only by using 8 personal computers and 2 communications per iteration). Was the same number of processors used for both the OpenMP and MPI versions?} The authors showed an interesting speedup comparing to the sequential implementation that takes up-to 3,300 seconds to obtain same results.
+The main problem of the simultaneous methods is that the necessary time needed for the convergence increases with the increasing of the polynomial's degree. Many authors have treated the problem of implementing simultaneous methods in parallel.
Freeman~\cite{Freeman89} implemented and compared DK, EA and another method of the fourth order of convergence proposed by Farmer and Loizou~\cite{Loizou83} on a 8-processor linear chain, for polynomials of degree up-to 8. The method of Farmer and Loizou~\cite{Loizou83} often diverges, but the first two methods (DK and EA) have a speed-up equals to 5.5. Later, Freeman and Bane~\cite{Freemanall90} considered asynchronous algorithms in which each processor continues to update its approximations even though the latest values of other $z^{k}_{i}$ have not been received from the other processors, in contrast with synchronous algorithms where it would wait those values before making a new iteration. Couturier et al.~\cite{Raphaelall01} proposed two methods of parallelization for a shared memory architecture with \textit{OpenMP} and for a distributed memory one with \textit{MPI}. They are able to compute the roots of sparse polynomials of degree 10,000 in 116 seconds with \textit{OpenMP} and 135 seconds with \textit{MPI} only by using 8 personal computers and 2 communications per iteration.
+\LZK{I assume this refers to the MPI version (only by using 8 personal computers and 2 communications per iteration). Was the same number of processors used for both the OpenMP and MPI versions?} The authors showed an interesting speedup comparing to the sequential implementation that takes up-to 3,300 seconds to obtain same results.
 Very few work had been performed since then until the appearing of the Compute Unified Device Architecture (CUDA)~\cite{CUDA10}, a parallel computing platform and a programming model invented by NVIDIA. The computing power of GPUs (Graphics Processing Units) has exceeded that of CPUs. However, CUDA adopts a totally new computing architecture to use the hardware resources provided by the GPU in order to offer a stronger computing ability to the massive data computing.
Ghidouche and al~\cite{Kahinall14} proposed an implementation of the Durand-Kerner method on a single GPU. Their main results showed that a parallel CUDA implementation is about 10 times faster than the sequential implementation on a single CPU for sparse polynomials of degree 48,000.
@@ -462,7 +462,6 @@ Very few work had been performed since then until the appearing of the Compute U
 Finding polynomial roots rapidly and accurately is the main objective of our work. In this paper we propose the parallelization of Ehrlich-Aberth method using two parallel programming paradigms OpenMP and MPI on multi-GPU platforms. We consider two architectures: shared memory and distributed memory computers. The first parallel algorithm is implemented on shared memory computers by using OpenMP API. It is based on threads created from the same system process, such that each thread is attached to one GPU. In this case the communications between GPUs are done by OpenMP threads through shared memory. The second parallel algorithm uses the MPI API, such that each GPU is attached and managed by a MPI process. The GPUs exchange their data by message-passing communications. This latter approach is more used on distributed memory clusters to solve very complex problems that are too large for traditional supercomputers, which are very expensive to build and run.
 \LZK{This part has been rewritten. \\ Otherwise, what has been done about accuracy in this paper (Finding polynomial roots rapidly and accurately is the main objective of our work.)?}
- \LZK{The contributions are not defined !!}
 %This paper is organized as follows. In Section~\ref{sec2} we recall the Ehrlich-Aberth method. In section~\ref{sec3} we present EA algorithm on single GPU. In section~\ref{sec4} we propose the EA algorithm implementation on Multi-GPU for (OpenMP-CUDA) approach and (MPI-CUDA) approach. In sectioné\ref{sec5} we present our experiments and discus it.
Finally, Section~\ref{sec6} concludes this paper and gives some hints for future research directions in this topic.}
@@ -475,6 +474,8 @@ The paper is organized as follows. In Section~\ref{sec2} we present three differ
 \section{Parallel programming models}
 \label{sec2}
+\LZK{This whole section has been rewritten. It should therefore be proofread and improved if possible.}
+
 In this section we present three different parallel programming models: OpenMP, MPI and CUDA.
 \subsection{OpenMP}
@@ -490,24 +491,21 @@ In this section we present three different parallel programming models: OpenMP,
 %Sequential natively. Threads share some or all of the available memory and can
 %have private memory areas [6].
-OpenMP (Open Multi-processing) is an application programming interface for shared memory parallel programming~\cite{openmp13}. It is a portable approach based on the multithreading designed for shared memory computers, where a master thread forks a number of slave threads which execute blocks of code in parallel. An OpenMP program alternates sequential regions and parallel regions of code, where the sequential regions are executed by the master thread and the parallel ones may be executed by multiple threads. During the execution of an OpenMP program the threads communicate their data (read and modified) in the shared memory. One advantage of OpenMP is the global view of the memory address space of an application. This allows relatively a fast development of parallel applications with easier maintenance. However, it is often difficult to get high rates of performances in large scale-applications.
-\LZK{This part has been rewritten. To be proofread and improved if possible.}
+OpenMP (Open Multi-processing) is an application programming interface for shared memory parallel programming~\cite{openmp13}. It is a portable approach based on the multithreading designed for shared memory computers, where a master thread forks a number of slave threads which execute blocks of code in parallel.
An OpenMP program alternates sequential regions and parallel regions of code, where the sequential regions are executed by the master thread and the parallel ones may be executed by multiple threads. During the execution of an OpenMP program the threads communicate their data (read and modified) in the shared memory. One advantage of OpenMP is the global view of the memory address space of an application. This allows relatively a fast development of parallel applications with easier maintenance. However, it is often difficult to get high rates of performances in large scale-applications.
 \subsection{MPI}
 %The MPI (Message Passing Interface) library allows to create computer programs that run on a distributed memory architecture. The various processes have their own environment of execution and execute their code in a asynchronous way, according to the MIMD model (Multiple Instruction streams, Multiple Data streams); they communicate and synchronize by exchanging messages~\cite{Peter96}. MPI messages are explicitly sent, while the exchanges are implicit within the framework of a multi-thread programming environment like OpenMP or Pthreads.
 MPI (Message Passing Interface) is a portable message passing style of the parallel programming designed especially for the distributed memory architectures~\cite{Peter96}. In most MPI implementations, a computation contains a fixed set of processes created at the initialization of the program in such way one process is created per processor. The processes synchronize their computations and communicate by sending/receiving messages to/from other processes. In this case, the data are explicitly exchanged by message passing while the data exchanges are implicit in a multithread programming model like OpenMP and Pthreads. However in the MPI programming model, the processes may either execute different programs referred to as multiple program multiple data (MPMD) or every process executes the same program (SPMD).
The MPI approach is one of most used HPC programming model to solve large scale and complex applications.
-\LZK{This part has been rewritten. To be proofread and improved if possible.}
 \subsection{CUDA}
-CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA~\cite{CUDA10}. The
-unit of execution in CUDA is called a thread. Each thread executes a kernel by the streaming processors in parallel. In CUDA,
-a group of threads that are executed together is called a thread block, and the computational grid consists of a grid of thread
-blocks. Additionally, a thread block can use the shared memory on a single multiprocessor while the grid executes a single
-CUDA program logically in parallel. Thus in CUDA programming, it is necessary to design carefully the arrangement of the thread
-blocks in order to ensure low latency and a proper usage of shared memory, since it can be shared only in a thread block
-scope. The effective bandwidth of each memory space depends on the memory access pattern. Since the global memory has lower
-bandwidth than the shared memory, the global memory accesses should be minimized.
+%CUDA (is an acronym of the Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA~\cite{CUDA10}.The unit of execution in CUDA is called a thread. Each thread executes a kernel by the streaming processors in parallel. In CUDA, a group of threads that are executed together is called a thread block, and the computational grid consists of a grid of thread blocks. Additionally, a thread block can use the shared memory on a single multiprocessor while the grid executes a single CUDA program logically in parallel. Thus in CUDA programming, it is necessary to design carefully the arrangement of the thread blocks in order to ensure low latency and a proper usage of shared memory, since it can be shared only in a thread block scope.
The effective bandwidth of each memory space depends on the memory access pattern. Since the global memory has lower bandwidth than the shared memory, the global memory accesses should be minimized.
+
+CUDA (Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA~\cite{CUDA15} for GPUs. It provides a high level GPGPU-based programming model to program GPUs for general purpose computations and non-graphic applications. The GPU is viewed as an accelerator such that data-parallel operations of a CUDA program running on a CPU are off-loaded onto GPU and executed by this later. The data-parallel operations executed by GPUs are called kernels. The same kernel is executed in parallel by a large number of threads organized in grids of thread blocks, such that each GPU multiprocessor executes one or more thread blocks in SIMD fashion (Single Instruction, Multiple Data) and in turn each core of the multiprocessor executes one or more threads within a block. Threads within a block can cooperate by sharing data through a fast shared memory and coordinate their execution through synchronization points. In contrast, within a grid of thread blocks, there is no synchronization at all between blocks. The GPU only works on data filled in the global memory and the final results of the kernel executions must be transferred out of the GPU. In the GPU, the global memory has lower bandwidth than the shared memory associated to each multiprocessor. Thus in the CUDA programming, it is necessary to design carefully the arrangement of the thread blocks in order to ensure low latency, a proper usage of the shared memory and the global memory accesses should be minimized.
+
+
+
+
 We introduced three paradigms of parallel programming. Our objective consists in implementing a root finding polynomial algorithm on multiple GPUs. To this end, it is primordial to know how to manage CUDA contexts of different GPUs.
A direct method for controlling the various GPUs is to use as many threads or processes as GPU devices. We can choose the GPU index based on the identifier of OpenMP thread or the rank of the MPI process. Both approaches will be investigated.
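The GPU selection described in the last context paragraph (one thread or process per device, with the GPU index derived from the OpenMP thread identifier or the MPI rank) could be sketched as follows. This is only an illustration, not code from the commit or the paper: the helper names `select_gpu` and `assign_gpus` are hypothetical, and the modulo wrap-around is an added assumption for the case where there are more workers than devices. No CUDA, OpenMP or MPI call is made, so the sketch compiles with a plain C compiler.

```c
#include <assert.h>

/*
 * Hypothetical helper (not taken from the paper): map a worker identifier
 * to a GPU index.
 *
 * worker_id    : in an OpenMP version this would be omp_get_thread_num();
 *                in an MPI version, the rank obtained from MPI_Comm_rank().
 * device_count : in a real program this would come from cudaGetDeviceCount().
 *
 * The returned index is what the worker would pass to cudaSetDevice().
 * Assumption: when there are more workers than devices, workers wrap
 * around with a modulo. The paper itself uses as many workers as GPUs,
 * in which case the mapping is simply the identity.
 */
int select_gpu(int worker_id, int device_count)
{
    assert(worker_id >= 0);
    assert(device_count > 0);
    return worker_id % device_count;
}

/* Fill gpu_of[w] with the GPU index that worker w would bind to. */
void assign_gpus(int worker_count, int device_count, int *gpu_of)
{
    for (int w = 0; w < worker_count; ++w)
        gpu_of[w] = select_gpu(w, device_count);
}
```

In an OpenMP-CUDA program each thread of the parallel region would presumably call `cudaSetDevice(select_gpu(omp_get_thread_num(), n))` once before launching kernels, and in an MPI-CUDA program each process would do the same with its rank. For MPI on a cluster, using the global rank this way additionally assumes that ranks are placed consecutively on each node.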