The maximum number of rows per slice is calculated as \emph{size of shared memory per SM in bits} / \emph{blocking factor}. This translates to 6144, 3072, and 1536 rows per slice for blocking factors of 64, 128, and 256-bits, respectively. We need two bytes for the row index.
\noindent Table \ref{table:scoo} summarizes the differences between the three subformats.
\begin{itemize}
\item[a)] Each thread in a block is allocated one memory entry for the result of a row and works on consecutive 12 rows. A p-reduction to Thread\#1 is needed to get the final result of one row.
The maximum number of rows per slice is calculated as \emph{size of shared memory per SM in bits} / \emph{blocking factor}. This translates to 6144, 3072, and 1536 rows per slice for blocking factors of 64, 128, and 256-bits, respectively. We need two bytes for the row index.
\noindent Table \ref{table:scoo} summarizes the differences between the three subformats.
\begin{itemize}
\item[a)] Each thread in a block is allocated one memory entry for the result of a row and works on consecutive 12 rows. A p-reduction to Thread\#1 is needed to get the final result of one row.