by 2 in any order. If the kernel is executed on a larger CUDA device containing
4 SMs, blocks are executed 4 by 4 simultaneously. Execution should then be
approximately twice as fast as in the former case. Of course, that depends on
other parameters that will be described later (in this chapter and other
chapters).
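
The scaling argument above can be sketched with a deliberately simplified
model. It assumes that each SM runs exactly one block at a time and that all
blocks take the same time, which is not how a real device schedules work. The
function names rounds_needed and estimated_speedup are hypothetical and are
not part of the CUDA API.

```cpp
#include <cassert>

// Simplified model (an assumption, not the real hardware scheduler):
// each SM runs one block at a time and every block takes the same time.
// Returns how many successive rounds are needed to run all blocks.
int rounds_needed(int num_blocks, int num_sms) {
    assert(num_blocks > 0 && num_sms > 0);
    return (num_blocks + num_sms - 1) / num_sms;  // ceiling division
}

// Estimated speedup when the same grid moves from a device with
// sms_small SMs to a device with sms_large SMs, under the same model.
double estimated_speedup(int num_blocks, int sms_small, int sms_large) {
    return static_cast<double>(rounds_needed(num_blocks, sms_small)) /
           static_cast<double>(rounds_needed(num_blocks, sms_large));
}
```

For a hypothetical grid of 8 blocks, rounds_needed(8, 2) is 4 and
rounds_needed(8, 4) is 2, so the estimated speedup is 2. For a grid of 5
blocks the counts are 3 and 2, giving only 1.5, which is one way the result
depends on other parameters.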