These copies, along with optional scalings or transpositions, are
implemented as CUDA kernels that can be applied to two
matrices of any size, starting at any offset (a sketch of such a
kernel is given below).
Memory accesses are coalesced\index{GPU!coalesced memory accesses} \cite{CUDA_ProgGuide} in order to
provide the best performance for such memory-bound kernels.
\item[Step 2] (``Local copies''):~data are copied from
local $R$-matrices to temporary arrays ($U$, $V$) and to $\Re^{O}$.
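As an illustration, the following minimal sketch shows how such a
coalesced copy-with-scaling kernel could be written; the kernel name
\texttt{copy\_scale}, the column-major layout, and all parameter names
are our assumptions, not the code of the actual implementation (a
transposing variant would simply swap the row and column indices on
one side).
\begin{verbatim}
// Hypothetical sketch: coalesced copy of an m x n sub-matrix of A
// (column-major, leading dimension lda, offset (row_a, col_a)) into a
// sub-matrix of B (leading dimension ldb, offset (row_b, col_b)),
// scaling each element by alpha.
__global__ void copy_scale(int m, int n, double alpha,
                           const double *A, int lda, int row_a, int col_a,
                           double *B, int ldb, int row_b, int col_b)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // row in the block
    int j = blockIdx.y * blockDim.y + threadIdx.y;  // column in the block
    if (i < m && j < n)
        B[(size_t)(col_b + j) * ldb + (row_b + i)] =
            alpha * A[(size_t)(col_a + j) * lda + (row_a + i)];
}

// Example launch with 32 x 8 thread blocks; the x-dimension runs
// along the rows so that each warp accesses contiguous memory:
//   dim3 threads(32, 8);
//   dim3 blocks((m + threads.x - 1) / threads.x,
//               (n + threads.y - 1) / threads.y);
//   copy_scale<<<blocks, threads>>>(m, n, alpha, d_A, lda, row_a, col_a,
//                                   d_B, ldb, row_b, col_b);
\end{verbatim}
Since consecutive threads along \texttt{threadIdx.x} touch consecutive
rows of one column, which are contiguous in column-major storage, each
warp reads and writes a contiguous memory segment, which is what makes
the accesses coalesced.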
The timings reported below were measured on an NVIDIA Tesla C2050 GPU at
UPMC (Universit\'e Pierre et Marie Curie, Paris, France).
Note that the execution times measured on the C2050 would be the same
on the C2070 and on the C2075, since the only differences between these GPUs
are their memory size and their TDP (Thermal Design Power)\index{TDP (thermal design power)}.
We emphasize that the execution times correspond to the
complete propagation for all six energies of the large case (see
Table~\ref{data-sets}), that is, to the complete execution of