X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/kahina_paper2.git/blobdiff_plain/3579d2fc9e528a19506994328a257a6bce22c702..eabad72090e474064cddadf1b21b1f4fc29ae590:/paper.tex
diff --git a/paper.tex b/paper.tex
index c0d93ab..9bad8af 100644
--- a/paper.tex
+++ b/paper.tex
@@ -654,7 +654,7 @@ CUDA (Compute Unified Device Architecture) is a parallel computing architecture
 %the Ehrlich-Aberth method is an iterative method, contain 4 steps, start from the initial approximations of all the roots of the polynomial,the second step initialize the solution vector $Z$ using the Guggenheimer method to assure the distinction of the initial vector roots, than in step 3 we apply the the iterative function based on the Newton's method and Weiestrass operator~\cite{,}, witch will make it possible to converge to the roots solution, provided that all the root are different.
-The Ehrlich-Aberth method is a simultaneous method~\cite{} using the following iteration
+The Ehrlich-Aberth method is a simultaneous method~\cite{Aberth73} using the following iteration
 \begin{equation}
 \label{Eq:EA1}
 EA: z^{k+1}_{i}=z_{i}^{k}-\frac{\frac{p(z_{i}^{k})}{p'(z_{i}^{k})}}
@@ -700,7 +700,7 @@ This method allows, indeed, to exceed the computation of the polynomials of degr
 \begin{equation}
 \label{Log_H2}
-EA.EL: z^{k+1}_{i}=z_{i}^{k}-\exp \left(\ln \left(
+z^{k+1}_{i}=z_{i}^{k}-\exp \left(\ln \left(
 p(z_{i}^{k})\right)-\ln\left(p'(z^{k}_{i})\right)-
 \ln\left(1-Q(z^{k}_{i})\right)\right),
 \end{equation}
@@ -769,8 +769,8 @@ Algorithm~\ref{alg1-cuda} shows the GPU parallel implementation of Ehrlich-Abert
 %\BlankLine
-\item Initialization of the of P\;
-\item Initialization of the of Pu\;
+\item Initialization of P\;
+\item Initialization of Pu\;
 \item Initialization of the solution vector $Z^{0}$\;
 \item Allocate and copy initial data to the GPU global memory\;
 \item k=0\;
@@ -904,14 +904,14 @@ We study two categories of polynomials: sparse polynomials and full polynomials.
 \end{equation}
 For our tests, a CPU Intel(R) Xeon(R) CPU E5620@2.40GHz and a GPU K40 (with 6 Go of ram) are used.
 %SIDER : Une meilleure présentation de l'architecture est à faire ici.
-
+For our test, a cluster of computing with 72 nodes, 1116 cores, 4 cards GPU tesla Kepler K40 are used,
 In order to evaluate both the M-GPU and Multi-GPU approaches, we performed a set of experiments on a single GPU and multiple GPUs using OpenMP or MPI by EA algorithm, for both sparse and full polynomials of different sizes.
 All experimental results obtained are made in double precision data whereas the convergence threshold of the EA method is set to $10^{-7}$.
 %Since we were more interested in the comparison of the
 %performance behaviors of Ehrlich-Aberth and Durand-Kerner methods on
 %CPUs versus on GPUs.
 The initialization values of the vector solution
-of the methods are given in %Section~\ref{sec:vec_initialization}.
+of the methods are given by Guggenheimer method~\cite{Gugg86} %Section~\ref{sec:vec_initialization}.
 \subsection{Evaluating the M-GPU (CUDA-OpenMP) approach}
@@ -954,9 +954,9 @@ In this part we perform a set of experiments to compare the Multi-GPU (CUDA MPI)
 \label{fig:02}
 \end{figure}
 ~\\
-Figure~\ref{fig:02} shows execution time of EA algorithm, for a single GPU, and multiple GPUs (2, 3, 4) on respectively 2, 3 and four MPI nodes. We can clearly see that the curve for a single GPU is above the other curves, which shows overtime in execution time compared to the Multi-GPU approach. We can see also that the CUDA-MPI approach reduces the execution time by a factor of 10 for polynomials of degree more than 1,000,000. For example, at degree 1000000, the execution time with a single GPU amounted to 10 thousand seconds, while with 4 GPUs, it is lowered to about just one thousand seconds which makes it for a tenfold speedup.
+Figure~\ref{fig:02} shows execution time of EA algorithm, for a single GPU, and multiple GPUs (2, 3, 4) on respectively 2, 3 and four MPI nodes.
We can clearly see that the curve for a single GPU is above the other curves, which shows overtime in execution time compared to the Multi-GPU approach. We can see also that the CUDA-MPI approach reduces the execution time by a factor of 10 for polynomials of degree more than 1,000,000. For example, at degree 1,000,000, the execution time with a single GPU amounted to 10 thousand seconds, while with 4 GPUs, it is lowered to about just one thousand seconds which makes it for a tenfold speedup.
 %%SIDER : Je n'ai pas reformuler car je n'ai pas compris la phrase, merci de l'ecrire ici en fran\cais.
-\\cette figure montre 4 courbes de temps d'exécution pour l'algorithme EA, une courbe avec un seul GPU, 3 courbes pour multiple GPUs(2, 3, 4), on peut constaté clairement que la courbe à un seul GPU est au-dessus des autres courbes, vue sa consomation en temps d'exècution. On peut voir aussi qu'avec l'approche Multi-GPU (CUDA-MPI) reduit le temps d'exècution jusqu'à l'echelle 100 pour le polynômes qui dépasse 1,000,000 tandis que Single GPU est de l'echelle 1000.
+\\cette figure montre 4 courbes de temps d'exécution pour l'algorithme EA, une courbe avec un seul GPU, 3 courbes pour multiple GPUs(2, 3, 4), on peut constaté clairement que la courbe à un seul GPU est au-dessus des autres courbes, vue sa consommation en temps d'exècution. On peut voir aussi qu'avec l'approche Multi-GPU (CUDA-MPI) reduit le temps d'exècution jusqu'à l'echelle 100 pour le polynômes qui dépasse 1,000,000 tandis que Single GPU est de l'echelle 1000.
 \subsubsection{Execution time of the Ehrlich-Aberth method for solving full polynomials on multiple GPUs using the Multi-GPU appraoch}
@@ -986,7 +986,7 @@ In this experiment three sparse polynomials of size 200K, 800K and 1,4M are inve
 \caption{Execution time for solving sparse polynomials of three distinct sizes on multiple GPUs using MPI and OpenMP approaches using Ehrlich-Aberth}
 \label{fig:05}
 \end{figure}
-In Figure~\ref{fig:05} there two curves for each polynomial size : one for the MPI-CUDA and another for the OpenMP. We can see that the results are similar between OpenMP and MPI for the polynomials size of 200K. For the size of 800K, the MPI version is a little slower than the OpenMP approach but for for the 1,4M size, there is a slight advantage for the MPI version.
+In Figure~\ref{fig:05} there two curves for each polynomial size : one for the MPI-CUDA and another for the OpenMP. We can see that the results are similar between OpenMP and MPI for the polynomials size of 200K. For the size of 800K, the MPI version is a little slower than the OpenMP approach but for the 1,4 millions size, there is a slight advantage for the MPI version.
 \subsubsection{Solving full polynomials}
 \begin{figure}[htbp]