However, bank conflicts can occur if two threads of a warp try to access data stored in one single memory bank. In such cases, the parallel process is serialized which may cause significant performance decrease. One easy way to avoid this is to ensure that two consecutive threads in one block always access 32-bit data at two consecutive addresses.
\end{enumerate}
As observed earlier, designing a median filter GPU implementation using only global memory is fairly straightforward, but its performance remains quite low even if it is faster than CPU.
To overcome this, the most frequent choice made in efficient implementations found in literature is to use shared memory. Such option implies prefetching \index{prefetching}data prior to doing the actual computations, a relevant choice, as each pixel of an image belongs to $n^2$ different neighborhoods. Thus, it can be expected that fetching each gray-level value from global memory only once should be more efficient than doing it each time it is required. One of the most efficient implementations using shared memory is presented in \cite{5402362}. In the case of the generic kernel of Listing \ref{lst:medianGeneric}, using shared memory without further optimization would not bring valuable speedup because that would just move redundancy from texture to shared memory fetching and would generate bank conflicts. For information, we wrote such a version of the generic median kernel and our measurements showed a speedup of around 3\% (as an example, $32~ms$ for $5\times 5$ median on a 1024$^2$ pixel image, i.e., $33~MP/s$ ).
However, bank conflicts can occur if two threads of a warp try to access data stored in one single memory bank. In such cases, the parallel process is serialized which may cause significant performance decrease. One easy way to avoid this is to ensure that two consecutive threads in one block always access 32-bit data at two consecutive addresses.
\end{enumerate}
As observed earlier, designing a median filter GPU implementation using only global memory is fairly straightforward, but its performance remains quite low even if it is faster than CPU.
To overcome this, the most frequent choice made in efficient implementations found in literature is to use shared memory. Such option implies prefetching \index{prefetching}data prior to doing the actual computations, a relevant choice, as each pixel of an image belongs to $n^2$ different neighborhoods. Thus, it can be expected that fetching each gray-level value from global memory only once should be more efficient than doing it each time it is required. One of the most efficient implementations using shared memory is presented in \cite{5402362}. In the case of the generic kernel of Listing \ref{lst:medianGeneric}, using shared memory without further optimization would not bring valuable speedup because that would just move redundancy from texture to shared memory fetching and would generate bank conflicts. For information, we wrote such a version of the generic median kernel and our measurements showed a speedup of around 3\% (as an example, $32~ms$ for $5\times 5$ median on a 1024$^2$ pixel image, i.e., $33~MP/s$ ).