These copies, along with any scalings or transpositions, are
implemented as CUDA kernels that can be applied to any two
matrices of arbitrary size, starting at any offset.
provide the best performance for such memory-bound kernels.
\item[Step 2] (``Local copies''):~data are copied from
local $R$-matrices to temporary arrays ($U$, $V$) and to $\Re^{O}$.