evolving. Nevertheless, some trends remain constant throughout this evolution.
The processing units composing a GPU are far simpler than those of a traditional CPU, and
it is much easier to integrate many computing units inside a GPU card than to do
memory latency between a CPU and a GPU. In a CPU, tasks ``ti'' are executed one
by one with a short memory latency to get the data to process. After some tasks,
there is a context switch that allows the CPU to run concurrent applications
principle to obtain a high throughput is to have many tasks to
compute. Later we will see that these tasks are called threads in CUDA. With
this principle, as soon as a task is finished, the next one is ready to be
-executed while the wait for data for the previous task is overlapped by
-computation of other tasks. {\bf HERE}
+executed while the wait for data for the previous task is overlapped by the
+computation of other tasks.
practice, the number of thread blocks and the size of thread blocks are given as
parameters to each kernel. Figure~\ref{ch1:fig:scalability} illustrates an
example of a kernel composed of 8 thread blocks. Then this kernel is executed on
by 2 in any order. If the kernel is executed on a larger CUDA device containing
4 SMs, blocks are executed 4 by 4 simultaneously. The execution should then be
approximately twice as fast in the latter case. Of course, that depends on other
parameters that will be described later (in this chapter and other chapters).
block cooperatively load and store blocks of memory they all
use. Threads in the same block can be synchronized with one another (but not
with threads of different blocks). Threads of the same block can also share results
-The memory hierarchy of GPUs\index{memory~hierarchy} is different from that of CPUs. In practice, there are registers\index{memory~hierarchy!registers}, local
-memory\index{memory~hierarchy!local~memory}, shared
-memory\index{memory~hierarchy!shared~memory}, cache
-memory\index{memory~hierarchy!cache~memory}, and global
-memory\index{memory~hierarchy!global~memory}.
+The memory hierarchy of GPUs\index{memory hierarchy} is different from that of CPUs. In practice, there are registers\index{memory hierarchy!registers}, local
+memory\index{memory hierarchy!local memory}, shared
+memory\index{memory hierarchy!shared memory}, cache
+memory\index{memory hierarchy!cache memory}, and global
+memory\index{memory hierarchy!global memory}.
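
To make some of these memory spaces concrete, here is a minimal sketch in
which each thread block reverses its own chunk of an array. The kernel name
\verb|reverseInBlock|, the constants \verb|BLOCK_SIZE| and
\verb|NUM_BLOCKS|, and the test data are assumptions made for this example
only. The sketch touches registers, shared memory, and global memory; local
memory and cache memory do not appear explicitly, since the former is
allocated by the compiler when needed and the latter is managed by the
hardware.

\begin{verbatim}
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 128  // assumed number of threads per block
#define NUM_BLOCKS 8    // assumed number of thread blocks

// Hypothetical kernel: each block reverses its own chunk of 'in'.
// It must be launched with exactly BLOCK_SIZE threads per block.
__global__ void reverseInBlock(const float *in, float *out)
{
    // Shared memory: one copy per block, visible to all its threads.
    __shared__ float tile[BLOCK_SIZE];

    // Scalar variables such as these are normally kept in registers.
    int t = threadIdx.x;
    int base = blockIdx.x * BLOCK_SIZE;

    tile[t] = in[base + t];   // global memory -> shared memory
    __syncthreads();          // barrier among threads of this block only

    // Each thread reads a value that another thread of its block loaded.
    out[base + t] = tile[BLOCK_SIZE - 1 - t];
}

int main()
{
    const int n = NUM_BLOCKS * BLOCK_SIZE;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; i++)
        h_in[i] = (float)i;

    // Arrays allocated with cudaMalloc reside in global memory.
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    reverseInBlock<<<NUM_BLOCKS, BLOCK_SIZE>>>(d_in, d_out);

    // This copy waits for the kernel to finish before reading the result.
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // out[0] should hold the last element of the first chunk of 'in'.
    printf("out[0] = %.0f, in[%d] = %.0f\n",
           h_out[0], BLOCK_SIZE - 1, h_in[BLOCK_SIZE - 1]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
\end{verbatim}

The barrier is needed because a thread reads an element of \verb|tile|
written by a different thread of the same block; no such barrier is used
between blocks, so each block only works on its own chunk.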