Here, the use of shared memory is the best choice: neighbor windows do not overlap, so there is no data reuse to exploit and no further optimization is possible. Moreover, to ensure efficiency, the input image must be read from texture memory, which implies an internal GPU data copy between the two 1D convolution stages: the intermediate result of the first pass has to be copied into the texture-bound array before the second pass can fetch it.
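A minimal sketch of this data flow is given below, written with the CUDA texture-object API (the classic texture-reference API follows the same pattern). The filter radius, the contents of d_kernel, and the helper names makeTexture and separableConvolve are illustrative assumptions, not taken from the original implementation, and shared-memory staging inside the kernels is omitted for brevity. The point of interest is the cudaMemcpy2DToArray call between the two passes: the row pass writes its result to linear device memory, while the column pass reads through a texture bound to a CUDA array, so the intermediate image must be copied device-to-device between the stages.

#include <cuda_runtime.h>

#define RADIUS 4                           // illustrative filter radius
__constant__ float d_kernel[2 * RADIUS + 1];   // filled via cudaMemcpyToSymbol (not shown)

// Row (horizontal) pass: samples the source image through a texture object
// and writes the intermediate result to plain linear device memory.
__global__ void convolveRows(float *dst, cudaTextureObject_t srcTex,
                             int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k)
        sum += d_kernel[k + RADIUS] * tex2D<float>(srcTex, x + k + 0.5f, y + 0.5f);
    dst[y * width + x] = sum;
}

// Column (vertical) pass: same structure, but samples the intermediate image,
// which has been staged into a second texture-bound CUDA array.
__global__ void convolveCols(float *dst, cudaTextureObject_t tmpTex,
                             int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int k = -RADIUS; k <= RADIUS; ++k)
        sum += d_kernel[k + RADIUS] * tex2D<float>(tmpTex, x + 0.5f, y + k + 0.5f);
    dst[y * width + x] = sum;
}

// Creates a texture object over a CUDA array (clamped, unfiltered fetches).
static cudaTextureObject_t makeTexture(cudaArray_t arr)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = arr;

    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeClamp;
    texDesc.addressMode[1] = cudaAddressModeClamp;
    texDesc.filterMode     = cudaFilterModePoint;
    texDesc.readMode       = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
    return tex;
}

// Host-side sketch of the two-pass pipeline. The cudaMemcpy2DToArray call is
// the internal GPU data copy discussed above: the row-pass output lives in
// linear memory and must be copied into a CUDA array before the column pass
// can read it through the texture path.
void separableConvolve(cudaArray_t srcArray, cudaArray_t tmpArray,
                       float *d_tmp, float *d_dst, int width, int height)
{
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);

    cudaTextureObject_t srcTex = makeTexture(srcArray);
    convolveRows<<<grid, block>>>(d_tmp, srcTex, width, height);

    // Device-to-device copy: linear intermediate -> texture-bound CUDA array.
    cudaMemcpy2DToArray(tmpArray, 0, 0, d_tmp, width * sizeof(float),
                        width * sizeof(float), height, cudaMemcpyDeviceToDevice);

    cudaTextureObject_t tmpTex = makeTexture(tmpArray);
    convolveCols<<<grid, block>>>(d_dst, tmpTex, width, height);

    cudaDestroyTextureObject(srcTex);
    cudaDestroyTextureObject(tmpTex);
}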