X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/11f93a2e8880680f6b192298e5ce0697d2596a31..55ce7168c6e69a2462d76c95dc9a5298ceedb04f:/BookGPU/Chapters/chapter6/PartieSync.tex
diff --git a/BookGPU/Chapters/chapter6/PartieSync.tex b/BookGPU/Chapters/chapter6/PartieSync.tex
index bc08557..7fdaace 100755
--- a/BookGPU/Chapters/chapter6/PartieSync.tex
+++ b/BookGPU/Chapters/chapter6/PartieSync.tex
@@ -97,7 +97,7 @@ parallel programming schemes on a GPU cluster:
 
 Using CUDA\index{CUDA}, GPU kernel executions are nonblocking, and GPU/CPU data
 transfers\index{CUDA!data transfer} are blocking or nonblocking operations. All GPU kernel executions and CPU/GPU
-data transfers are associated to "streams,"\index{CUDA!stream} and all operations on a same stream
+data transfers are associated to ``streams'',\index{CUDA!stream} and all operations on a same stream
 are serialized. When transferring data from the CPU to the GPU, then running GPU
 computations, and finally transferring results from the GPU to the CPU, there is
 a natural synchronization and serialization if these operations are achieved on
@@ -210,7 +210,7 @@ achieved serially and not overlapped.
 
 When CPU/GPU data transfers are not negligible compared to GPU computations, it
 can be interesting to overlap internode CPU computations with a \emph{GPU
- sequence}\index{GPU sequence} including CPU/GPU data transfers and GPU computations (see
+ sequence}\index{GPU!sequence} including CPU/GPU data transfers and GPU computations (see
 \Fig{fig:ch6p1overlapseqsequence}). Algorithmic issues of this approach are
 basic, but their implementation requires explicit CPU multithreading and
 synchronization, and CPU data buffer duplication. We need to implement two
@@ -367,7 +367,7 @@ of the code.
 
 \Lst{algo:ch6p1overlapstreamsequence} introduces the generic MPI+OpenMP+CUDA
 code, explicitly overlapping MPI communications with
-streamed GPU sequences\index{GPU sequence!streamed}.
+streamed GPU sequences\index{GPU!streamed sequence}.
 
 %\begin{algorithm}
 % \caption{Generic scheme explicitly overlapping MPI communications with streamed sequences of CUDA
@@ -489,7 +489,7 @@ working on independent subsets of data.
 \Lst{algo:ch6p1overlapstreamsequence} is not so generic as
 \Lst{algo:ch6p1overlapseqsequence}.
 
-\subsection{Interleaved communications-transfers-computations\\overlapping}
+\subsection{Interleaved communications-transfers-computations overlapping}
 
 Many algorithms do not support splitting data transfers and kernel calls, and
 cannot exploit CUDA streams, for example, when each GPU thread requires access to
@@ -506,7 +506,8 @@ and twice as many GPU buffers.
 \begin{figure}[t]
  \centering
  \includegraphics{Chapters/chapter6/figures/Sync-CompleteInterleaveOverlap.pdf}
- \caption{Complete overlap of internode CPU communications, CPU/GPU data transfers, and GPU
+ \caption[Complete overlap of internode CPU communications,\break\hfill CPU/GPU data transfers, and GPU
+ computations, interleaving computation-communication iterations.]{Complete overlap of internode CPU communications, CPU/GPU data transfers, and GPU
  computations, interleaving computation-communication iterations.}
  \label{fig:ch6p1overlapinterleaved}
 \end{figure}