X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/kahina_paper2.git/blobdiff_plain/624cd6c9fc05e8de555dc396531fe50e990f9d1b..f7cf9d24e2bcb8efb27d03b167b8c97b5b9560ec:/paper.tex?ds=inline

diff --git a/paper.tex b/paper.tex
index 9810500..cca9a54 100644
--- a/paper.tex
+++ b/paper.tex
@@ -461,10 +461,7 @@ This paper is organized as follows, in section 2 we recall the Ehrlich-Aberth me
 \subsection{OpenMP}
 Open Multi-Processing (OpenMP) is a shared memory architecture API that provides multi-threading capability~\cite{openmp13}. OpenMP is a portable approach for parallel programming on shared memory systems based on compiler directives that can be included in order
-to parallelize a loop. In this way, a set of loops can be distributed along the different threads that will access to different data allo-
-cated in local shared memory. One of the advantages of OpenMP is its global view of application memory address space that allows relatively fast development of parallel applications with easier maintenance. However, it is often difficult to get high rates of
-performance in large scale applications. Although, in OpenMP a usage of threads ids and managing data explicitly as done in an MPI
-code can be considered, it defeats the advantages of OpenMP.
+to parallelize a loop. In this way, a set of loops can be distributed among the different threads, which access different data allocated in local shared memory. One of the advantages of OpenMP is its global view of the application memory address space, which allows relatively fast development of parallel applications with easier maintenance. However, it is often difficult to obtain high performance in large-scale applications. Although using OpenMP thread ids and managing data explicitly, as done with MPI, can be considered, this approach undermines the advantages of OpenMP.
 %\subsection{OpenMP}
 %The article in French: Programmation multiGPU – OpenMP versus MPI
 %OpenMP is a shared memory programming API based on threads from
@@ -477,20 +474,20 @@ code can be considered, it defeats the advantages of OpenMP.
 %have private memory areas [6].
 \subsection{MPI}
- The library MPI allows to use a distributed memory architecture. The various processes have their own environment of execution and execute their codes in a asynchronous way, according to the model MIMD (Multiple Instruction streams, Multiple Dated streams); they communicate and synchronize by exchanges of messages~\cite{Peter96}. MPI messages are explicitly sent, while the exchanges are implicit within the framework of a programming multi-thread (OpenMP/Pthreads).
+The MPI (Message Passing Interface) library makes it possible to create programs that run on a distributed memory architecture. Each process has its own execution environment and executes its code asynchronously, according to the MIMD model (Multiple Instruction streams, Multiple Data streams); the processes communicate and synchronise by exchanging messages~\cite{Peter96}. MPI messages are sent explicitly, whereas exchanges are implicit in a multi-threaded programming environment such as OpenMP or Pthreads.
 \subsection{CUDA}%The article in English: Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications
- CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA~\cite{NVIDIA12}. The
-unit of execution in CUDA is called a thread. Each thread executes the kernel by the streaming processors in parallel. In CUDA,
-a group of threads that are executed together is called thread blocks, and the computational grid consists of a grid of thread
-blocks.
Additionally, a thread block can use the shared memory on a single multiprocessor as while as the grid executes a single
+CUDA (an acronym for Compute Unified Device Architecture) is a parallel computing architecture developed by NVIDIA~\cite{NVIDIA12}. The
+unit of execution in CUDA is called a thread. Each thread executes a kernel on the streaming processors in parallel. In CUDA,
+a group of threads that are executed together is called a thread block, and the computational grid consists of a grid of thread
+blocks. Additionally, a thread block can use the shared memory on a single multiprocessor while the grid executes a single
 CUDA program logically in parallel. Thus, in CUDA programming, it is necessary to carefully design the arrangement of the thread blocks in order to ensure low latency and proper usage of shared memory, since it can be shared only within the scope of a thread block. The effective bandwidth of each memory space depends on the memory access pattern. Since the global memory has lower bandwidth than the shared memory, global memory accesses should be minimized.
-We introduced three paradigms of parallel programming. Our objective consist to implement an algorithm of root finding polynomial on multiple GPUs. It primordial to know how manage CUDA context of different GPUs. A direct method for controlling the various GPU is to use as many threads or processes that GPU. We can choose the GPU index based on the identifier of OpenMP thread or the rank of the MPI process. Both approaches will be created.
+We have introduced three paradigms of parallel programming. Our objective is to implement a polynomial root-finding algorithm on multiple GPUs. It is essential to know how to manage the CUDA contexts of the different GPUs. A direct method for controlling the various GPUs is to use as many threads or processes as GPU devices. The GPU index can then be chosen based on the identifier of the OpenMP thread or the rank of the MPI process. Both approaches will be investigated.
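The device-selection rule described above can be illustrated with a minimal C sketch. It covers only the index arithmetic and makes no CUDA, OpenMP or MPI calls. The helper names `device_for_worker` and `roots_for_worker` and the `root_range` type are hypothetical, and the contiguous block partition of the roots is an assumption, since the text does not say how the polynomial is split among the GPUs.

```c
#include <stddef.h>

/* Hypothetical helper: map a worker (OpenMP thread id or MPI rank)
   to a GPU index. With as many workers as GPUs this is the identity;
   the modulo only guards the case of more workers than devices. */
static int device_for_worker(int worker_id, int num_devices)
{
    return worker_id % num_devices;
}

/* Hypothetical type: half-open range [begin, end) of root indices. */
typedef struct {
    size_t begin;
    size_t end;
} root_range;

/* Hypothetical helper: contiguous block of the n roots handled by one
   worker (an assumed partition). When n is not a multiple of
   num_workers, the first (n % num_workers) workers get one extra root. */
static root_range roots_for_worker(size_t n, int worker_id, int num_workers)
{
    size_t base = n / (size_t)num_workers;
    size_t extra = n % (size_t)num_workers;
    size_t w = (size_t)worker_id;
    root_range r;

    r.begin = w * base + (w < extra ? w : extra);
    r.end = r.begin + base + (w < extra ? 1 : 0);
    return r;
}
```

In an actual program, `worker_id` would typically come from `omp_get_thread_num()` or `MPI_Comm_rank`, and the returned index would be passed to `cudaSetDevice` before any kernel launch.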
 \section{The EA algorithm on single GPU}
 \subsection{the EA method}
@@ -837,33 +834,37 @@ This figure shows 4 curves of execution time of EA algorithm, a curve with singl
 \label{fig:04}
 \end{figure}
-\begin{figure}[htbp]
-\centering
- \includegraphics[angle=-90,width=0.5\textwidth]{Sparse}
-\caption{Comparaison between MPI and OpenMP versions of the Ehrlich-Aberth method for solving sparse plynomials on GPUs}
-\label{fig:05}
-\end{figure}
-\begin{figure}[htbp]
-\centering
- \includegraphics[angle=-90,width=0.5\textwidth]{Full}
-\caption{Comparaison between MPI and OpenMP versions of the Ehrlich-Aberth method for solving full polynomials on GPUs}
-\label{fig:06}
-\end{figure}
+This figure shows the execution time of the EA algorithm on a single GPU and on multiple GPUs (2, 3 and 4 GPUs) for full polynomials. With the CUDA-MPI approach, we notice that the three multi-GPU curves are distinct from each other: the more GPUs we use, the more the execution time decreases. On the other hand, the single-GPU curve is well above the other curves.
+This is due to the MPI parallelization paradigm, which divides the polynomial into sub-polynomials, each assigned to one GPU. In contrast, the single-GPU version solves the whole polynomial on one GPU and consequently requires more execution time.
-\begin{figure}[htbp]
-\centering
- \includegraphics[angle=-90,width=0.5\textwidth]{MPI}
-\caption{Comparaison of execution times of the Ehrlich-Aberth method for solving sparse and full polynomials on GPUs with distributed memory paradigm using MPI}
-\label{fig:07}
-\end{figure}
+%\begin{figure}[htbp]
+%\centering
+ % \includegraphics[angle=-90,width=0.5\textwidth]{Sparse}
+%\caption{Comparison between MPI and OpenMP versions of the Ehrlich-Aberth method for solving sparse polynomials on GPUs}
+%\label{fig:05}
+%\end{figure}
-\begin{figure}[htbp]
-\centering
- \includegraphics[angle=-90,width=0.5\textwidth]{OMP}
-\caption{Comparaison of execution times of the Ehrlich-Aberth method for solving sparse and full polynomials on GPUs with shared memory paradigm using OpenMP}
-\label{fig:08}
-\end{figure}
+%\begin{figure}[htbp]
+%\centering
+ % \includegraphics[angle=-90,width=0.5\textwidth]{Full}
+%\caption{Comparison between MPI and OpenMP versions of the Ehrlich-Aberth method for solving full polynomials on GPUs}
+%\label{fig:06}
+%\end{figure}
+
+%\begin{figure}[htbp]
+%\centering
+ % \includegraphics[angle=-90,width=0.5\textwidth]{MPI}
+%\caption{Comparison of execution times of the Ehrlich-Aberth method for solving sparse and full polynomials on GPUs with distributed memory paradigm using MPI}
+%\label{fig:07}
+%\end{figure}
+
+%\begin{figure}[htbp]
+%\centering
+ % \includegraphics[angle=-90,width=0.5\textwidth]{OMP}
+%\caption{Comparison of execution times of the Ehrlich-Aberth method for solving sparse and full polynomials on GPUs with shared memory paradigm using OpenMP}
+%\label{fig:08}
+%\end{figure}
 % An example of a floating figure using the graphicx package.
 % Note that \label must occur AFTER (or within) \caption.
@@ -963,7 +964,19 @@ This figure shows 4 curves of execution time of EA algorithm, a curve with singl
 \section{Conclusion}
-The conclusion goes here~\cite{IEEEexample:bibtexdesign}.
+In this paper, we have presented a parallel implementation of the Ehrlich-Aberth algorithm for solving full and sparse polynomials on a single GPU with CUDA and on multiple GPUs using two parallel paradigms: shared memory with OpenMP (the CUDA-OpenMP approach) and distributed memory with MPI (the CUDA-MPI approach).
+We have performed many experiments with the Ehrlich-Aberth method on a single GPU and on multiple GPUs with the CUDA-OpenMP and CUDA-MPI approaches, for sparse and full polynomials. The experiments show that a parallel programming model such as OpenMP or MPI can effectively manage multiple graphics cards so that they work together on the same problem and accelerate parallel applications. For example, the CUDA-MPI approach with 4 GPUs can solve a polynomial of degree 1,000,000 with a speed-up of 4 compared with a single GPU.
+
+
+In the future, we will evaluate our parallel implementation of the Ehrlich-Aberth algorithm with other parallel programming models.
+
+
+%present a communication approach between multiple GPUs. The comparison between MPI and OpenMP as GPUs controllers shows that these
+%solutions can effectively manage multiple graphics cards to work together
+%to solve the same problem
+
+
 %than we have presented two communication approach between multiple GPUs.(CUDA-OpenMP) approach and (CUDA-MPI) approach, in the objective to manage multiple graphics cards to work together and solve the same problem. in the objective to manage multiple graphics cards to work together and solve the same problem.