new

author couturie <couturie@extinction>

Thu, 14 Jan 2016 18:23:53 +0000 (19:23 +0100)

committer couturie <couturie@extinction>

Thu, 14 Jan 2016 18:23:53 +0000 (19:23 +0100)
author couturie <couturie@extinction>
Thu, 14 Jan 2016 18:23:53 +0000 (19:23 +0100)
committer couturie <couturie@extinction>
Thu, 14 Jan 2016 18:23:53 +0000 (19:23 +0100)
diff --git a/paper.tex b/paper.tex

index a1615fad2e0b425a029d86681bc70a571c0cbe95..72d93793db60e2b69d0bf7d790c3b1f8322eef30 100644 (file)
--- a/paper.tex
+++ b/paper.tex
@@ -811,8 +811,21 @@ implementation of Ehrlich-Aberth method.
  \section{The EA algorithm on Multiple GPUs}
  \label{sec4}
  \subsection{M-GPU : an OpenMP-CUDA approach}
  \section{The EA algorithm on Multiple GPUs}
  \label{sec4}
  \subsection{M-GPU : an OpenMP-CUDA approach}
-Our OpenMP-CUDA implementation of EA algorithm is based on the hybrid OpenMP and CUDA programming model. It works as follows.
-Based on the metadata, a shared memory is used to make data evenly shared among OpenMP threads. The shared data are the solution vector $Z$, the polynomial to solve $P$, and  the error vector $\Delta z$. Let (T\_omp) the number of OpenMP threads be equal to the number of GPUs, each OpenMP thread binds to one GPU,  and controls a part of the shared memory, that is a part of the vector Z , that is $(n/num\_gpu)$ roots where $n$ is the polynomial's degree and $num\_gpu$ the total number of available GPUs. Each OpenMP thread copies its data from host memory to GPU’s device memory. Then every GPU will have a grid of computation organized according to the device performance and the size of data on which it runs the computation kernels. %In principle a grid is set by two parameter DimGrid, the number of block per grid, DimBloc: the number of threads per block. The following schema  shows the architecture of (CUDA,OpenMP).
+Our OpenMP-CUDA implementation of EA algorithm is based on the hybrid
+OpenMP and CUDA programming model. It works as follows.  All the data
+are shared with OpenMP amoung all the OpenMP threads. The shared data
+are the solution vector $Z$, the polynomial to solve $P$, and the
+error vector $\Delta z$. The number of OpenMP threads is equal to the
+number of GPUs, each OpenMP thread binds to one GPU, and it controls a
+part of the shared memory. More precisely each OpenMP thread owns of
+the vector Z, that is $(n/num\_gpu)$ roots where $n$ is the
+polynomial's degree and $num\_gpu$ the total number of available
+GPUs. Each OpenMP thread copies its data from host memory to GPU’s
+device memory. Then all GPUs will have a grid of computation organized
+according to the device performance and the size of data on which it
+runs the computation kernels.
+
+%In principle a grid is set by two parameter DimGrid, the number of block per grid, DimBloc: the number of threads per block. The following schema  shows the architecture of (CUDA,OpenMP).
  
  %\begin{figure}[htbp]
  %\centering
  
  %\begin{figure}[htbp]
  %\centering
@@ -822,6 +835,7 @@ Based on the metadata, a shared memory is used to make data evenly shared among
  %\end{figure}
  %Each thread OpenMP compute the kernels on GPUs,than after each iteration they copy out the data from GPU memory to CPU shared memory. The kernels are re-runs is up to the roots converge sufficiently. Here are below the corresponding algorithm:
  
  %\end{figure}
  %Each thread OpenMP compute the kernels on GPUs,than after each iteration they copy out the data from GPU memory to CPU shared memory. The kernels are re-runs is up to the roots converge sufficiently. Here are below the corresponding algorithm:
  
+\RC{Surement à virer ou réécrire pour etre compris sans algo}
  $num\_gpus$ OpenMP threads  are created using \verb=omp_set_num_threads();=function (step $3$, Algorithm \ref{alg2-cuda-openmp}), the shared memory is created using \verb=#pragma omp parallel shared()= OpenMP function (line $5$, Algorithm\ref{alg2-cuda-openmp}), then each OpenMP thread allocates memory and copies initial data from CPU memory to GPU global memory, executes the kernels on GPU, but computes only his portion of roots indicated with variable \textit{index} initialized in (line 5, Algorithm \ref{alg2-cuda-openmp}), used as input data in the $kernel\_update$ (line 10, Algorithm \ref{alg2-cuda-openmp}). After each iteration, all OpenMP threads synchronize using \verb=#pragma omp barrier;= to gather all the correct values of $\Delta z$, thus allowing the computation the maximum stop condition on vector $\Delta z$ (line 12, Algorithm \ref{alg2-cuda-openmp}). Finally, threads copy the results from GPU memories to CPU memory. The OpenMP threads execute kernels until the roots sufficiently converge.  
  \begin{enumerate}
  \begin{algorithm}[htpb]
  $num\_gpus$ OpenMP threads  are created using \verb=omp_set_num_threads();=function (step $3$, Algorithm \ref{alg2-cuda-openmp}), the shared memory is created using \verb=#pragma omp parallel shared()= OpenMP function (line $5$, Algorithm\ref{alg2-cuda-openmp}), then each OpenMP thread allocates memory and copies initial data from CPU memory to GPU global memory, executes the kernels on GPU, but computes only his portion of roots indicated with variable \textit{index} initialized in (line 5, Algorithm \ref{alg2-cuda-openmp}), used as input data in the $kernel\_update$ (line 10, Algorithm \ref{alg2-cuda-openmp}). After each iteration, all OpenMP threads synchronize using \verb=#pragma omp barrier;= to gather all the correct values of $\Delta z$, thus allowing the computation the maximum stop condition on vector $\Delta z$ (line 12, Algorithm \ref{alg2-cuda-openmp}). Finally, threads copy the results from GPU memories to CPU memory. The OpenMP threads execute kernels until the roots sufficiently converge.  
  \begin{enumerate}
  \begin{algorithm}[htpb]
author	couturie <couturie@extinction>
	Thu, 14 Jan 2016 18:23:53 +0000 (19:23 +0100)
committer	couturie <couturie@extinction>
	Thu, 14 Jan 2016 18:23:53 +0000 (19:23 +0100)