From: Kahina <kahina@kahina-VPCEH3K1E.(none)>
Date: Sun, 20 Dec 2015 06:52:58 +0000 (+0100)
Subject: the (CUDA, OpenMP) implementation of EA algorithm
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/kahina_paper2.git/commitdiff_plain/f9bf26cc810e585f03a82f4a0a5e8c64b96842ca

the (CUDA, OpenMP) implementation of EA algorithm
---

diff --git a/paper.tex b/paper.tex
index a96db5d..962c7f9 100644
--- a/paper.tex
+++ b/paper.tex
@@ -565,6 +565,53 @@ Algorithm~\ref{alg2-cuda} shows a sketch of the Ehrlich-Aberth method using CUDA
 \section{The EA algorithm on Multi-GPU}
 
 \subsection{MGPU (OpenMP-CUDA)approach}
+Before beginning the calculation, our implementation parallel with OpenMP and CUDA shares the input data between threads OpenMP, these input data sotn Z: the vector solution, P: the polynomial to solve,
+
+Before starting computations, our parallel implementation shared input data of the root finding polynomial between OpenMP threads. From Algorithm 1, the input data are the solution vector $Z$, the polynomial to solve $P$. Let number of OpenMP threads is equal to the number of GPUs, each threads OpenMP ( T-omp) checks one GPU,  and control a part of the shared memory, that is a part of the vector Z  like: $(n/Nbr_gpu)$ roots, n: the polynomial's degrees, $Nbr_gpu$ the number of GPUs. Then every GPU will have a grid of computation organized with its performances and the size of data of which it checks. In principle a grid is set by two parameter DimGrid, the number of block per grid, DimBloc: the number of threads per block. The following schema  shows the architecture of (CUDA,OpenMP).
+
+
+
+Each thread OpenMP compute the kernels on GPUs,than after each iteration they copy out the data from GPU memory to CPU shared memory. The kernels are re-runs is up to the roots converge sufficiently. Here are below the corresponding algorithm:
+\begin{enumerate}
+\begin{algorithm}[htpb]
+\label{alg2-cuda}
+%\LinesNumbered
+\caption{CUDA OpenMP Algorithm to find roots with the Ehrlich-Aberth method}
+
+\KwIn{$Z^{0}$ (Initial root's vector), $\varepsilon$ (Error tolerance
+  threshold), P (Polynomial to solve), Pu (Derivative of P), $n$ (Polynomial degrees), $\Delta z_{max}$ (Maximum value of stop condition)}
+
+\KwOut {$Z$ (Solution root's vector), $ZPrec$ (Previous solution root's vector)}
+
+\BlankLine
+// selection du GPU\;
+\item cudaSetDevice(i)\;
+// allocations memoire\;
+\verb= #pragma omp single=
+\item hostAlloc(P,Pu,Z)\;
+\verb= #pragma omp parallel shared(Z,âzmax,P)=
+\item deviceAlloc(dP,dPu,dZ)\;
+\verb= #pragma omp barrier=
+// transfers CPU-GPU and compute GPU\;
+\item copyH2D(P,dP)\;
+\item copyH2D(Pu,dPu)\;
+\item copyH2D(Zi,dZi)\;
+\While {$\Delta z_{max} > \epsilon$}{
+\item Let $\Delta z_{max}=0$\;
+\item $ kernel\_save(ZPrec,Z)$\;
+\item  k=k+1\;
+//each GPU i  compute the new root for his part dZi
+\item $ kernel\_update(dZi,P,Pu)$\;
+\item $kernel\_testConverge(\Delta z_{max},dZi,ZPrec)$\;
+}
+\item copyD2H(dZ,Zi)\;
+ // fin omp parallel\;
+
+\end{algorithm}
+\end{enumerate}
+~\\ 
+
+
 \subsection{MGPU (MPI-CUDA)approach}
 
 \section{experiments}