Explication des kernels

author Kahina <kahina@kahina-VPCEH3K1E.(none)>

Mon, 2 May 2016 06:15:55 +0000 (08:15 +0200)

committer Kahina <kahina@kahina-VPCEH3K1E.(none)>

Mon, 2 May 2016 06:15:55 +0000 (08:15 +0200)
author Kahina <kahina@kahina-VPCEH3K1E.(none)>
Mon, 2 May 2016 06:15:55 +0000 (08:15 +0200)
committer Kahina <kahina@kahina-VPCEH3K1E.(none)>
Mon, 2 May 2016 06:15:55 +0000 (08:15 +0200)
diff --git a/maj.tex b/maj.tex

index ebc8e860f4b4e2dc8105577a1d26c4d5dff2b1b9..c5a0bebe3c8e46fdac9c03faf99cdfa109d3504d 100644 (file)
--- a/maj.tex
+++ b/maj.tex
@@ -10,6 +10,7 @@
  %\usepackage[french,boxed,linesnumbered]{algorithm2e}
  \usepackage{array,multirow,makecell}
  \usepackage[textsize=footnotesize]{todonotes}
+\usepackage{listings}
  \newcommand{\RC}[2][inline]{%
    \todo[color=blue!10,#1]{\sffamily\textbf{RC:} #2}\xspace}
  \newcommand{\KG}[2][inline]{%
@@ -727,15 +728,50 @@ $kernel\_update\_ExpoLog(Z,P,Pu)$\;
  
  We notice that the update kernel is called in two forms, according to the value
  \emph{R} which determines the radius beyond which we apply the exponential logarithm algorithm. 
-
  If the modulus of the current complex is less than a given value called the
  radius i.e. ($ |z^{k}_{i}|<= R$), then the classical form of the EA
-function Eq.~\ref{Eq:Hi} is executed, else the EA.EL
-function Eq.~\ref{Log_H2} is executed.
-(with Eq.~\ref{deflncomplex}, Eq.~\ref{defexpcomplex}). The radius $R$ is evaluated as in Eq.~\ref{R.EL}.
+function Eq.~\ref{Eq:Hi} is executed, else the EA.EL function Eq.~\ref{Log_H2} is executed.
+(with Eq.~\ref{deflncomplex}, Eq.~\ref{defexpcomplex}). The radius $R$ is evaluated as in Eq.~\ref{R.EL}. \KG{experimentally it is difficult to solve the high degree polynomial with the classical Ehrlich-Aberth method, so if the root are under to the unit circle ($R$)the kernel \textit{update} is called in the EA.EL function Eq.~\ref{Log_H2} (with Eq.~\ref{deflncomplex}, Eq.~\ref{defexpcomplex}) line 3, to put into account the limited of grander floating manipulated by processors and compute more roots}. 
+\KG{} We notice that we used \verb= cuDoubleComplex= to exploit the complex number in CUDA, and the functions of the CUBLAS library to implement some vector operations on the GPU. We use the following functions:
+
+\begin{itemize}
+\item \verb= cublasIdamax()= for the 
+\item \verb= cublasGetVector()= for the 
+\end{itemize}
+
+
+\KG{Listing ~\ref{lst:01} shows the main kernel of the iterative function
+shows the a simplified version of second kernel
+code (some parameters in the kernels have been simplified in order to
+increase the readability). As can be seen this
+kernel calls multiple kernels, all the kernels for complex numbers and
+kernels for the evaluation of a polynomial are not detailed.} 
+
+\begin{footnotesize}
+\lstinputlisting[label=lst:01,caption=Kernels to update the roots]{code.c}
+\end{footnotesize}
+\KG{}The Kernel \verb= __global__EA_update()= is a code called from the host and executed on the device, need to call \verb=__device__FirstH_EA()= on the device to compute the root who are under the unit circle ($R$), the \verb= __device__NewH_EA()= is called to compute root with the exp-log  version. The Horner schema are used to evaluate the polynomial and his derivative in $Z[i]$ \verb=__device__Fonction()= and \verb=__deviceFonctionD()= respectively. Its exp-log  version need to be implemented \verb= __device__LogFonction()=, \verb= __device__LogFonctionD()=
+and used in the exp-log version of the Ehrlich-Aberth method
+\begin{itemize}
+\item \verb= __device__LogFonction()= to  
+\item \verb= __device__LogFonctionD()= for the 
+\end{itemize}
+%need to call some function on device  
+
+%The Kernel \verb= EA_update= is a code implemented to executed on GPU and lanced in CPU, 
+This kernel is executed by a large number of GPU threads such that each thread is in charge of the computation of one component of the iterate vector $Z$. We set the size of a thread block, \textit{Threads}, to 512 threads and the number of thread blocks of a kernel, \textit{Blocks}, is computed so as each GPU thread is in charge of one vector element:
+\[ Blocks=\frac{N+Threads-1}{Threads}, N: Polynomial size \]
+%$ Blocks=\frac{N+Threads-1}{Threads}, N: Polynomial size$
+
+Each GPU threads in grid compute one root en parallel, if the polynomial size exceed the capacity of the grid the G.S schema are finely executed, like the grid can only compute << Blocks,Threads>> roots at the same time, if we ned to compute more roots, the grid can used the roots previously executed to compute other root ih the same iteration, like the following schema:
  
-\KG{inserer le code du kernel}
+\begin{figure}[htbp]
+\centering
+  \includegraphics[width=0.8\textwidth]{figures/G.S}
+\caption{Gauss Seidel iteration}
+\label{fig:08}
  
+The kernel are executed from << DimGrid, DimBlock>>, 
  The last kernel checks the convergence of the roots after each update
  of $Z^{k}$, according to formula Eq.~\ref{eq:Aberth-Conv-Cond}. We used the functions of the CUBLAS Library (CUDA Basic Linear Algebra Subroutines) to implement this kernel.
author	Kahina <kahina@kahina-VPCEH3K1E.(none)>
	Mon, 2 May 2016 06:15:55 +0000 (08:15 +0200)
committer	Kahina <kahina@kahina-VPCEH3K1E.(none)>
	Mon, 2 May 2016 06:15:55 +0000 (08:15 +0200)