Since inner loops that fill the histogram vector contain very few fetching instructions (from 9 to 49, depending on the window size), it is not surprising to note their neglectable impact compared to outer loops that fetch image pixels (from 256k to 16M instructions).
One could be tempted to claim that CPU has no chance to win, which is not so obvious as it highly depends on what kind of algorithm is run and above all, how it is implemented. To illustrate this, we can notice that, despite a maximum effective throughput potential that is almost five times higher, measured GTX280 throughput values sometimes prove slower than CPU values, as shown in Table \ref{tab:medianHisto1}.
Since inner loops that fill the histogram vector contain very few fetching instructions (from 9 to 49, depending on the window size), it is not surprising to note their neglectable impact compared to outer loops that fetch image pixels (from 256k to 16M instructions).
One could be tempted to claim that CPU has no chance to win, which is not so obvious as it highly depends on what kind of algorithm is run and above all, how it is implemented. To illustrate this, we can notice that, despite a maximum effective throughput potential that is almost five times higher, measured GTX280 throughput values sometimes prove slower than CPU values, as shown in Table \ref{tab:medianHisto1}.