X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/f947c46397ce66a0f014e5e11653f0d34b3b7d50..75a5768805109606ab4e90afd4168e6975389ea0:/BookGPU/Chapters/chapter1/ch1.tex?ds=inline

diff --git a/BookGPU/Chapters/chapter1/ch1.tex b/BookGPU/Chapters/chapter1/ch1.tex
index ec639cb..88c9361 100755
--- a/BookGPU/Chapters/chapter1/ch1.tex
+++ b/BookGPU/Chapters/chapter1/ch1.tex
@@ -100,7 +100,7 @@ get maximum performance.
 On the most powerful GPU cards, called Fermi, multiprocessors are called streaming
 multiprocessors (SMs). Each SM contains 32 cores and is able to perform 32
 floating-point or integer operations on 32-bit numbers per clock, or 16
-point on 64bits number per clock. SM have their own registers, execution
+floating-point operations on 64-bit numbers per clock. SMs have their own registers, execution
 pipelines and caches. On the Fermi architecture, each SM has 64KB of shared memory plus L1
 cache and 32,768 32-bit registers. More precisely, the programmer can
 decide how much shared memory and L1 cache an SM uses. The constraint is
@@ -120,7 +120,9 @@ through the use of cache memories. Moreover, nowadays CPUs perform many
 performance optimizations such as speculative execution, which roughly speaking
 consists in executing a small part of the code in advance even if this work
 later turns out to be useless. In contrast, GPUs do not have low-latency memory; in
-comparison GPUs have ridiculous cache memories. Nevertheless the architecture of GPUs is optimized for throughtput computation and it takes into account the memory latency.
+comparison, GPUs have very small cache memories. Nevertheless, the architecture of
+GPUs is optimized for throughput computation, and it takes memory
+latency into account.

@@ -146,13 +148,13 @@ computation of other tasks.
 \section{Kinds of parallelism}
 Many kinds of parallelism are available depending on the type of hardware.
-Roughtly speaking, there are three classes of parallism: instruction-level
+Roughly speaking, there are three classes of parallelism: instruction-level
 parallelism, data parallelism and task parallelism.

 Instruction-level parallelism consists in re-ordering some instructions in order
-to executed some of them in parallel without changing the result of the code.
+to execute some of them in parallel without changing the result of the code.
 In modern CPUs, instruction pipelines allow the processor to execute instructions
-faster. With a pipeline a processor can execute multiple instruction
+faster. With a pipeline, a processor can execute multiple instructions
 simultaneously, because the output of one pipeline stage is the input of the
 next one.
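
The configurable shared-memory/L1 split described in the first hunk is exposed through the CUDA runtime API. Below is a minimal sketch, assuming a Fermi-class device; the kernel name scale and the problem size are illustrative, not from the chapter.

#include <cuda_runtime.h>

__global__ void scale(float *a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] *= 2.0f;
}

int main(void)
{
    /* On Fermi, the 64KB of on-chip memory per SM can be split as
       48KB shared memory + 16KB L1 (cudaFuncCachePreferShared) or
       16KB shared memory + 48KB L1 (cudaFuncCachePreferL1).
       This call is a per-kernel hint to the runtime. */
    cudaFuncSetCacheConfig(scale, cudaFuncCachePreferShared);

    int n = 1024;
    float *d_a;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMemset(d_a, 0, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(d_a, n);
    cudaDeviceSynchronize();
    cudaFree(d_a);
    return 0;
}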
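The second hunk's claim that GPU architecture is optimized for throughput and hides memory latency is commonly illustrated by a data-parallel kernel: launching far more threads than there are cores lets the hardware scheduler switch to ready warps while others wait on global memory. A minimal sketch, with assumed names and sizes:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* One thread per element: the data-parallel pattern. With 2^20 elements
   we launch about one million threads, so there are always warps ready
   to run while others are stalled on global memory accesses. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void)
{
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", h_c[0]);   /* expect 3.0 */
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}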
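The instruction-level parallelism discussed in the last hunk can also be illustrated at the source level: two independent accumulators form dependence chains that a pipelined processor can overlap, whereas a single accumulator forces each addition to wait for the previous result. The function names below are hypothetical.

/* Two independent accumulators: the s0 and s1 additions do not depend
   on each other, so a pipelined (or superscalar) CPU can overlap them;
   with integers, reordering does not change the result. */
int sum_two_chains(const int *x, int n)   /* assumes n is even */
{
    int s0 = 0, s1 = 0;
    for (int i = 0; i < n; i += 2) {
        s0 += x[i];
        s1 += x[i + 1];
    }
    return s0 + s1;
}

/* A single dependence chain: each addition needs the previous result,
   so the pipeline cannot overlap successive additions. */
int sum_one_chain(const int *x, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}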