spell checked

author couturie <couturie@extinction>

Fri, 22 Jan 2016 06:24:03 +0000 (07:24 +0100)

committer couturie <couturie@extinction>

Fri, 22 Jan 2016 06:24:03 +0000 (07:24 +0100)
diff --git a/paper.tex b/paper.tex

index 1bda930ea574d8ec55071568270f05aa9eabdb65..0c1a417a55746967570e28255164a75e86f01925 100644 (file)
--- a/paper.tex
+++ b/paper.tex
@@ -121,7 +121,7 @@ model invented by NVIDIA had revived parallel programming interest for
  this problem. Indeed, the computing power of GPUs (Graphics Processing
  Units) has exceeded that of traditional  CPUs processors, which makes it very appealing to the research community to investigate new parallel implementations for a whole set of scientific problems in the reasonable hope to solve bigger instances of well known computationally demanding issues such as the one beforehand. However, CUDA adopts a totally new computing architecture to use the hardware resources provided by the GPU in order to offer a stronger computing ability to the massive data computing. Ghidouche et al.~\cite{Kahinall14} proposed an implementation of the Durand-Kerner method on a single GPU. Their main results showed that a parallel CUDA implementation is about 10 times faster than the sequential implementation on a single CPU for sparse polynomials of degree 48,000. 
  
-In this paper we propose the parallelization of the Ehrlich-Aberth (EA) method which has a cubic convergence rate which is much better than the quadratic rate of the Durand-Kerner method which has already been investigated in \cite{Kahinall14}. In the other hand,  EA is suitable to be implemented in parallel computers according to the data-parallel paradigm. In this model, computing elements carry computations on the data they are assigned and communicate with other computing elements in order to get fresh data or to synchronise. Classically, two parallel programming paradigms OpenMP and MPI are used to code such solutions. But in our case, computing elements are CUDA multi-GPU platforms. This architectural setting poses new programming challenges but offers also new opportunities to efficiently solve huge problems, otherwise considered intractable until recently. To the best of our knowledge, our CUDA-MPI and CUDA-OpenMP codes are the first implementations of EA method with multiple GPUs for finding roots of polynomials. Our major contributions include:
+In this paper we propose the parallelization of the Ehrlich-Aberth (EA) method which has a cubic convergence rate which is much better than the quadratic rate of the Durand-Kerner method which has already been investigated in \cite{Kahinall14}. In the other hand,  EA is suitable to be implemented in parallel computers according to the data-parallel paradigm. In this model, computing elements carry computations on the data they are assigned and communicate with other computing elements in order to get fresh data or to synchronize. Classically, two parallel programming paradigms OpenMP and MPI are used to code such solutions. But in our case, computing elements are CUDA multi-GPU platforms. This architectural setting poses new programming challenges but offers also new opportunities to efficiently solve huge problems, otherwise considered intractable until recently. To the best of our knowledge, our CUDA-MPI and CUDA-OpenMP codes are the first implementations of EA method with multiple GPUs for finding roots of polynomials. Our major contributions include:
   \begin{itemize}
  
  \item The parallel implementation of EA algorithm on a multi-GPU platform with a shared memory using OpenMP API. It is based on threads created from the same system process, such that each thread is attached to one GPU. In this case the communications between GPUs are done by OpenMP threads through shared memory.
@@ -257,13 +257,13 @@ Using the logarithm  and the exponential operators, we can replace any
  multiplications and divisions with additions and
  subtractions. Consequently, computations manipulate lower values in
  absolute values~\cite{Karimall98}. In practice, the  exponential and
-logarithm mode is used when a root is outisde the circle unit represented by the radius $R$ evaluated in C language with:
+logarithm mode is used when a root is outside the circle unit represented by the radius $R$ evaluated in C language with:
  \begin{equation}
  \label{R.EL}
  R = exp(log(DBL\_MAX)/(2*n) );
  \end{equation}
  where \verb=DBL_MAX= stands for the maximum representable
-\verb=double= value and $n$ is the degree of the polynimal.
+\verb=double= value and $n$ is the degree of the polynomial.
  
  
  \subsection{The Ehrlich-Aberth parallel implementation on CUDA}
@@ -546,7 +546,7 @@ These experiments report the execution times of the EA method for sparse and ful
  \label{sec6}
  In this paper, we have presented parallel implementations of the Ehrlich-Aberth algorithm to solve full and sparse polynomials, on a single GPU with CUDA and on multiple GPUs using two parallel paradigms: shared memory with OpenMP and distributed memory with MPI. These architectures were addressed by a CUDA-OpenMP approach and CUDA-MPI approach, respectively. Experiments show that, using parallel programming model like OpenMP or MPI, we can efficiently manage multiple graphics cards to solve the same problem and accelerate the parallel execution with 4 GPUs and solve a polynomial of degree up-to 5,000,000 four times faster than on a single GPU. 
  
-Our next objective is to extend the model presented here to clusters of GPU nodes, with a three-level scheme: inter-node communications via MPI processes (distributed memory), management of multi-GPU nodes by OpenMP threads (shared memory). Actual platforms may probably also contain purely multi-core nodes without any GPU. This heterogeneous setting may lead to the integration of load balancing algorithms so as to allow an optimal use of hardware ressources. 
+Our next objective is to extend the model presented here to clusters of GPU nodes, with a three-level scheme: inter-node communications via MPI processes (distributed memory), management of multi-GPU nodes by OpenMP threads (shared memory). Actual platforms may probably also contain purely multi-core nodes without any GPU. This heterogeneous setting may lead to the integration of load balancing algorithms so as to allow an optimal use of hardware resource's. 
  
  
  \section*{Acknowledgment}