-Finding polynomial roots rapidly and accurately is the main objective of our work. In this paper we propose the parallelization of the Ehrlich-Aberth method using two parallel programming paradigms (OpenMP and MPI) on GPUs. We consider two architectures. The first is shared memory with the OpenMP API, based on threads from the same system process: each thread is attached to one GPU and, after the various memory allocations, each thread launches its part of the computation (to do this, the required data must first be loaded onto the GPU, and the result is then copied back to the host). The second is distributed memory with MPI: the MPI library is often used for parallel programming [11] in
-cluster systems because it is a message-passing library. Each GPU is attached to one MPI process, and a loop is in charge of distributing the tasks between the MPI processes. This solution can be used on one GPU, or executed on a distributed cluster of GPUs, employing the Message Passing Interface (MPI) to communicate between separate CUDA cards. It permits scaling of the problem size to larger classes than would be possible on a single device and demonstrates the performance which users might expect from future
-HPC architectures where accelerators are deployed.
-
-This paper is organized as follows. In section 2 we recall the Ehrlich-Aberth method. In section 3 we present the EA algorithm on a single GPU. In section 4 we propose the EA algorithm implementation on multiple GPUs (MGPU) for the OpenMP-CUDA approach and the MPI-CUDA approach. In section 5 we present our experiments and discuss them. Finally, Section~\ref{sec6} concludes this paper and gives some hints for future research directions in this topic.
-
-
-\section{Parallel Programming Models}
-
-\subsection{OpenMP}% From the English article: Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications
-Open Multi-Processing (OpenMP) is a shared memory architecture API that provides multi-thread capacity [22]. OpenMP is
-a portable approach for parallel programming on shared memory systems based on compiler directives that can be included in order
-to parallelize a loop. In this way, a set of loops can be distributed among the different threads, which access different data allocated
-in local shared memory. One of the advantages of OpenMP is its global view of the application memory address space, which allows relatively fast development of parallel applications with easier maintenance. However, it is often difficult to obtain high
-performance in large-scale applications. Although thread ids can be used and data managed explicitly in OpenMP, as is done in an MPI
-code, doing so defeats the advantages of OpenMP.
-
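As a minimal sketch of the directive-based style described above, the loop below is parallelized with a single \texttt{parallel for} directive. The function name \texttt{scale\_add} and its arguments are hypothetical and chosen only for illustration. If OpenMP support is not enabled at compile time, the directive is ignored and the loop runs sequentially with the same result.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: y[i] += a * x[i] for every i.
// Assumes y has at least as many elements as x.
// The iterations are independent, so the loop can be split across threads.
void scale_add(double a, const std::vector<double>& x, std::vector<double>& y) {
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

Only the directive line is specific to OpenMP; removing it gives back the sequential loop, which is what makes incremental parallelization possible.
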
-\subsection{OpenMP} % From the French article: Programmation multiGPU – OpenMP versus MPI
-OpenMP is a shared memory programming API based on threads from
-the same system process. Designed for shared memory multiprocessors, UMA or
-NUMA [10], it relies on the SPMD (Single Program, Multiple Data) execution model,
-in which the ``master'' thread and the ``slave'' threads execute their code asynchronously
-and communicate and synchronize via shared memory [7]. It also makes it easy to express
-loop parallelism and is well suited to the incremental parallelization of
-natively sequential code. Threads share some or all of the available memory and can
-have private memory areas [6].
-
-\subsection{MPI} % From the French article: Programmation multiGPU – OpenMP versus MPI
-The MPI library makes it possible to use a distributed memory architecture. The various processes have their own execution environment and execute their code asynchronously, according to the MIMD (Multiple Instruction streams, Multiple Data streams) model; they communicate and synchronize by exchanging messages [17]. MPI messages are sent explicitly, whereas exchanges are implicit in multi-thread programming (OpenMP/Pthreads).
-
-\subsection{CUDA}% From the English article: Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications
-CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA [28]. The
-unit of execution in CUDA is called a thread. Each thread executes the kernel on the streaming processors in parallel. In CUDA,
-a group of threads that are executed together is called a thread block, and the computational grid consists of a grid of thread
-blocks. Additionally, a thread block can use the shared memory of a single multiprocessor, while the grid executes a single
-CUDA program logically in parallel. Thus in CUDA programming, it is necessary to carefully design the arrangement of the thread
-blocks in order to ensure low latency and proper usage of shared memory, since it can be shared only within the scope of a
-thread block. The effective bandwidth of each memory space depends on the memory access pattern. Since the global memory has lower
-bandwidth than the shared memory, global memory accesses should be minimized.
-
-
-We have introduced three paradigms of parallel programming. Our objective is to implement a polynomial root-finding algorithm on multiple GPUs. To do so, it is essential to know how to manage the CUDA contexts of the different GPUs. A direct method for controlling the various GPUs is to use as many threads or processes as there are GPUs. We can then choose the GPU index based on the identifier of the OpenMP thread or the rank of the MPI process. Both approaches will be implemented.
-
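The mapping from a worker identifier to a GPU index can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: the names \texttt{Assignment} and \texttt{assign\_work} are hypothetical, and the contiguous splitting of the roots is one possible distribution, not necessarily the one used in our implementation. In real code the identifier would come from \texttt{omp\_get\_thread\_num()} or \texttt{MPI\_Comm\_rank}, and the device would be selected with \texttt{cudaSetDevice}; these calls are left out so that the sketch stays self-contained.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical description of the work given to one worker
// (an OpenMP thread or an MPI process).
struct Assignment {
    int device;         // index of the GPU driven by this worker
    std::size_t begin;  // first root index handled (inclusive)
    std::size_t end;    // last root index handled (exclusive)
};

// Assumption: one worker per GPU, so the worker id doubles as the device index.
// The n roots are split into contiguous chunks whose sizes differ by at most one.
Assignment assign_work(int worker_id, int num_workers, std::size_t n) {
    const std::size_t w = static_cast<std::size_t>(worker_id);
    const std::size_t p = static_cast<std::size_t>(num_workers);
    const std::size_t base = n / p;
    const std::size_t extra = n % p;  // the first `extra` workers get one more root
    const std::size_t begin = w * base + std::min(w, extra);
    const std::size_t end = begin + base + (w < extra ? 1 : 0);
    return Assignment{worker_id, begin, end};
}
```

With 10 roots and 4 workers, for example, the chunks are $[0,3)$, $[3,6)$, $[6,8)$ and $[8,10)$.
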
-\section{The EA algorithm on a single GPU}
-\subsection{The EA method}
-The Ehrlich-Aberth method is an iterative method made up of four steps. It starts from initial approximations of all the
-roots of the polynomial. The second step initializes the solution vector $Z$ using the Guggenheimer method to ensure that the initial root approximations are distinct. Then, in step 3, we apply the iterative function based on Newton's method and the Weierstrass operator[...,...], which makes it possible to converge to the roots, provided that all the roots are distinct. At the end of each application of the iterative function, a stopping condition is checked (step 4): the iterative process stops when the moduli of the differences between two successive approximations of all the roots
-are lower than a fixed value $\varepsilon$.
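
For reference, the iterative function of step 3 is usually written as follows. The notation is an assumption made here for illustration: $p$ is the polynomial of degree $n$, $p'$ its derivative, and $z_i^{K}$ the $i$-th component of the vector $Z^{K}$ at iteration $K$.
\[
z_i^{K} = z_i^{K-1} - \frac{\displaystyle\frac{p(z_i^{K-1})}{p'(z_i^{K-1})}}{\displaystyle 1 - \frac{p(z_i^{K-1})}{p'(z_i^{K-1})} \sum_{j=1,\, j\neq i}^{n} \frac{1}{z_i^{K-1} - z_j^{K-1}}}, \qquad i = 1,\dots,n.
\]
The ratio $p/p'$ is the Newton correction, and the sum over the other approximations acts as a repulsion term that discourages two iterates from converging to the same root.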
-\subsection{EA parallel implementation on CUDA}
-Like any parallel code, a GPU parallel implementation first
-requires determining the sequential tasks and the
-parallelizable parts of the sequential version of the
-program/algorithm. In our case, all the operations that are easy
-to execute in parallel must be performed by the GPU to accelerate
-the execution of the application, such as step 3 and step 4. On the other hand, all the
-sequential operations and the operations that have data
-dependencies between threads or recursive computations must
-be executed by only one CUDA or CPU thread (step 1 and step 2). We first specify the organization of the parallel threads by setting the grid dimension Dimgrid, the number of blocks per grid, and the block dimension DimBlock, the number of threads per block required to process a given task.
-
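A common way to derive these two parameters, assuming one thread per root, is to fix the number of threads per block and round the number of blocks up. The sketch below is host-side C++ only; the names \texttt{LaunchConfig} and \texttt{make\_launch\_config} are hypothetical. In CUDA the two values would then be passed to the kernel launch as \texttt{kernel<<<dim\_grid, dim\_block>>>(...)}.

```cpp
#include <cstddef>

// Hypothetical launch configuration, assuming one thread per root.
struct LaunchConfig {
    std::size_t dim_grid;   // number of blocks per grid
    std::size_t dim_block;  // number of threads per block
};

// Rounds up so that dim_grid * dim_block >= n. The kernel must then
// ignore the threads whose global index is >= n.
LaunchConfig make_launch_config(std::size_t n, std::size_t threads_per_block) {
    const std::size_t blocks = (n + threads_per_block - 1) / threads_per_block;
    return LaunchConfig{blocks, threads_per_block};
}
```

For 1000 roots and 256 threads per block this gives 4 blocks, so 24 threads of the last block have no root to process.
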
-We then create the kernels. For step 3 we have two kernels: the
-first, named \textit{save}, is used to save the vector $Z^{K-1}$, and the kernel
-\textit{update} is used to update the vector $Z^{K}$. In step 4 a kernel is
-created to test the convergence of the method. In order to
-compute the function H, we have two possibilities: either to use
-the Jacobi method, or the Gauss-Seidel method, which uses the
-most recently computed roots. It is well known that the Gauss-Seidel
-mode converges more quickly than the Jacobi mode, so we used the Gauss-Seidel mode of iteration. To
-parallelize the code, we created kernels and many functions to
-be executed on the GPU for all the operations dealing with
-computations on complex numbers and the evaluation of the
-polynomials. As said previously, we implemented both methods
-of evaluating a polynomial: the normal method, based on
-Horner's method, and the method based on the logarithm
-of the polynomial. All these methods took rather long to
-implement, as developing the corresponding kernels with
-CUDA takes longer than on a CPU host. This comes in particular
-from the fact that debugging running CUDA threads
-is much more difficult than debugging threads on a CPU host.
-
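To make the difference between the two modes concrete, here is a sequential host-side sketch of one Ehrlich-Aberth sweep with Horner evaluation. It is not the CUDA code of this paper: the names \texttt{horner} and \texttt{ea\_sweep} are hypothetical, the logarithm-based evaluation is left out, and the case $p'(z)=0$ is not handled.

```cpp
#include <algorithm>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Horner evaluation of p and p' at z.
// coeffs holds the coefficients from highest degree to lowest.
void horner(const std::vector<cplx>& coeffs, cplx z, cplx& p, cplx& dp) {
    p = coeffs[0];
    dp = cplx(0.0, 0.0);
    for (std::size_t k = 1; k < coeffs.size(); ++k) {
        dp = dp * z + p;
        p = p * z + coeffs[k];
    }
}

// One Ehrlich-Aberth sweep over all roots, updating z in place.
// Because z[j] for j < i has already been overwritten when z[i] is
// processed, this is the Gauss-Seidel mode; a Jacobi sweep would read
// every z[j] from a saved copy of the previous iterate instead.
// Returns the largest modulus of the corrections applied.
double ea_sweep(const std::vector<cplx>& coeffs, std::vector<cplx>& z) {
    double max_delta = 0.0;
    for (std::size_t i = 0; i < z.size(); ++i) {
        cplx p, dp;
        horner(coeffs, z[i], p, dp);
        const cplx newton = p / dp;  // Newton correction p / p'
        cplx sum(0.0, 0.0);
        for (std::size_t j = 0; j < z.size(); ++j) {
            if (j != i) sum += cplx(1.0, 0.0) / (z[i] - z[j]);
        }
        const cplx delta = newton / (cplx(1.0, 0.0) - newton * sum);
        z[i] -= delta;
        max_delta = std::max(max_delta, std::abs(delta));
    }
    return max_delta;
}
```

Calling \texttt{ea\_sweep} repeatedly until the returned value falls below the chosen tolerance reproduces steps 3 and 4 on the host, which can serve as a reference when checking the output of the GPU kernels.
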
-Algorithm~\ref{alg2-cuda} shows a sketch of the Ehrlich-Aberth method using CUDA.
-
-\begin{enumerate}
-\begin{algorithm}[htpb]
-%\LinesNumbered
-\caption{CUDA algorithm to find roots with the Ehrlich-Aberth method}
-\label{alg2-cuda}