X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/6e645292c9053655e4e44bbebf1c1eb751a554a6..b1fd489e34a8d46d286a0d271c38cbfb442f511f:/BookGPU/Chapters/chapter6/PartieSync.tex?ds=inline diff --git a/BookGPU/Chapters/chapter6/PartieSync.tex b/BookGPU/Chapters/chapter6/PartieSync.tex index d213e50..ce21565 100755 --- a/BookGPU/Chapters/chapter6/PartieSync.tex +++ b/BookGPU/Chapters/chapter6/PartieSync.tex @@ -11,29 +11,29 @@ \noindent {\bf Considered parallel algorithms and implementations} \medskip -This section focusses on synchronous parallel algorithms implemented with +This section focuses on synchronous parallel algorithms implemented with overlapping computations and communications\index{overlap!computation and communication}. Parallel synchronous algorithms are -easier to implement, debug and maintain than asynchronous ones, see -Section~\ref{ch6:part2}. Usually, they follow a BSP-like parallel -scheme\index{BSP parallel scheme}, -alternating local computation steps and communication steps, -see~\cite{Valiant:BSP}. Their execution is usually deterministic, excepted +easier to implement, debug, and maintain than asynchronous ones (see +Section~\ref{ch6:part2}). Usually, they follow a BSP-like parallel +scheme\index{BSP parallel scheme} (Bulk Synchronous Parallel model), +alternating local computation steps and communication steps +(see~\cite{Valiant:BSP}). Their execution is usually deterministic, except for stochastic algorithms that contain random number generations. Even in -this case, their execution can be controlled during debug steps, allowing to -track and to fix bugs quickly. +this case, their execution can be controlled during debug steps, allowing the +user to track and to fix bugs quickly. However, depending on the properties of the algorithm, it is sometimes possible to overlap computations and communications. If processes exchange data that is not needed for the -computation that is following immediately, it is possible to implement such +computation immediately following, it is possible to implement such an overlap. We have investigated the efficiency of this approach in previous works~\cite{GUSTEDT:2007:HAL-00280094:1,ChVCV13}, using standard parallel programming tools to achieve the implementation. -The normalized and well known Message Passing Interface (MPI\index{MPI}) includes some asynchronous -point-to-point communication routines, that should allow to implement some +The normalized and well-known Message Passing Interface (MPI\index{MPI}) includes some asynchronous +point-to-point communication routines, that should allow the implementation of some communication/computation overlap. However, current MPI implementations do not achieve -that goal efficiently; effective overlapping with MPI requires a group of +this goal efficiently; effective overlapping with MPI requires a group of dedicated threads (in our case implemented with OpenMP\index{OpenMP}) for the basic synchronous communications while another group of threads executes computations in parallel. @@ -42,11 +42,11 @@ in parallel. % decrease the execution time. Nevertheless, communication and computation are not completely independent on modern multicore architectures: they use shared hardware components such as the -interconnection bus and the RAM. Therefore that approach only saved up to $20\%$ +interconnection bus and the RAM. Therefore, this approach saves only up to $20\%$ of the expected time on such a platform. This picture changes on clusters -equipped with GPU. They effectively allow independence of computations on the -GPU and communication on the mainboard (CPU, interconnection bus, RAM, network -card). We saved up to $100\%$ of the expected time on our GPU cluster, as we +equipped with GPUs. Indeed, GPUs effectively allow independence of computations they +perform and communications done on the mainboard (CPU, interconnection bus, RAM, network +card). We save up to $100\%$ of the expected time on our GPU cluster, as we will expose in the next section. @@ -56,17 +56,16 @@ will expose in the next section. \noindent {\bf Specific interests in GPU clusters} \medskip -In a computing node, a GPU is a kind of scientific coprocessor usually located -on an auxiliary board, with its own memory. So, when data have been transferred -from the CPU memory to the GPU memory, then GPU computations can be achieved on -the GPU board, totally in parallel of any CPU activities (like internode cluster +In a computing node, a GPU is a kind of scientific coprocessor, usually located +on an auxiliary board, with its own memory. So, once data are transferred +from the CPU memory to the GPU memory, GPU computations can be achieved on +the GPU board, totally in parallel with any CPU activities (such as internode cluster communications). The CPU and the GPU access their respective memories and do not -interfere, so they can achieve a very good overlap\index{overlap!computation -and computation} of their activities (better -than two CPU cores). +interfere with each other, so they can achieve a very good overlap\index{overlap!computation +and computation} of their activities (better than two CPU cores). -But using a GPU on a computing node requires to transfer data from the CPU to -the GPU memory, and to transfer the computation results back from the GPU to the +But using a GPU on a computing node requires the transfer of data from the CPU to +the GPU memory, as well as the transfer of the computation results back from the GPU to the CPU. Transfer times are not excessive, but depending on the application they still can be significant compared to the GPU computation times. So, sometimes it can be interesting to overlap the internode cluster communications with both the @@ -74,14 +73,14 @@ CPU/GPU data transfers and the GPU computations. We can identify four main parallel programming schemes on a GPU cluster: \begin{enumerate} -\item parallelizing only 'internode CPU communications' with 'GPU computations', +\item parallelizing only internode CPU communications with GPU computations, and achieving CPU/GPU data transfers before and after this parallel step, -\item parallelizing 'internode CPU communications' with a '(sequential) sequence - of CPU/GPU data transfers and GPU computations', -\item parallelizing 'internode CPU communications' with a 'streamed sequence of - CPU/GPU data transfers and GPU computations', -\item parallelizing 'internode CPU communications' with 'CPU/GPU data transfers' - and with 'GPU computations', interleaving computation-communication +\item parallelizing internode CPU communications with a (sequential) sequence + of CPU/GPU data transfers and GPU computations, +\item parallelizing internode CPU communications with a streamed sequence of + CPU/GPU data transfers and GPU computations, +\item parallelizing internode CPU communications with CPU/GPU data transfers + and with GPU computations, interleaving computation-communication iterations. \end{enumerate} @@ -95,31 +94,36 @@ parallel programming schemes on a GPU cluster: \label{fig:ch6p1overlapnative} \end{figure} -Using CUDA\index{CUDA}, GPU kernel executions are non-blocking, and GPU/CPU data +Using CUDA\index{CUDA}, GPU kernel executions are nonblocking, and GPU/CPU data transfers\index{CUDA!data transfer} -are blocking or non-blocking operations. All GPU kernel executions and CPU/GPU -data transfers are associated to "streams"\index{CUDA!stream}, and all operations on a same stream +are blocking or nonblocking operations. All GPU kernel executions and CPU/GPU +data transfers are associated to ``streams'',\index{CUDA!stream} and all operations on a same stream are serialized. When transferring data from the CPU to the GPU, then running GPU -computations and finally transferring results from the GPU to the CPU, there is +computations, and finally transferring results from the GPU to the CPU, there is a natural synchronization and serialization if these operations are achieved on the same stream. GPU developers can choose to use one (default) or several streams. In this first scheme of overlapping, we consider parallel codes using only one GPU stream. -"Non-blocking GPU kernel execution" means a CPU routine runs a parallel +``Nonblocking GPU kernel execution'' means a CPU routine runs a parallel execution of a GPU computing kernel, and the CPU routine continues its execution (on the CPU) while the GPU kernel is running (on the GPU). Then the CPU routine -can initiate some communications with some others CPU, and so it automatically +can initiate some communications with some other CPUs, and so it automatically overlaps the internode CPU communications with the GPU computations (see \Fig{fig:ch6p1overlapnative}). This overlapping is natural when programming with CUDA and MPI: it is easy to deploy, but does not overlap the CPU/GPU data transfers. +\Lst{algo:ch6p1overlapnative} introduces the generic code of a MPI+CUDA +implementation, natively and implicitly overlapping MPI communications with CUDA +GPU computations. + + %\begin{algorithm}[t] % \caption{Generic scheme implicitly overlapping MPI communications with CUDA GPU % computations}\label{algo:ch6p1overlapnative} -\pagebreak -\begin{Listing}{algo:ch6p1overlapnative}{Generic scheme implicitly overlapping MPI communications with CUDA GPU computations} +%\pagebreak +\begin{Listing}{algo:ch6p1overlapnative}{generic scheme implicitly overlapping MPI communications with CUDA GPU computations} // Input data and result variables and arrays (example with // float datatype, 1D input arrays, and scalar results) float *cpuInputTabAdr, *gpuInputTabAdr; @@ -131,7 +135,7 @@ cudaMalloc(&gpuInputTabAdr,sizeof(float)*N); cpuResTabAdr = malloc(sizeof(float)*NbIter); cudaMalloc(&gpuResAdr,sizeof(float)); -// Definition of the Grid of blocks of GPU threads +// Definition of the grid of blocks of GPU threads dim3 Dg, Db; Dg.x = ... ... @@ -158,34 +162,34 @@ for (int i = 0; i < NbIter; i++) { \end{Listing} %\end{algorithm} -\Lst{algo:ch6p1overlapnative} introduces the generic code of a MPI+CUDA -implementation, natively and implicitly overlapping MPI communications with CUDA -GPU computations. Some input data and output results arrays and variables are -declared and allocated from line 1 up to 10, and a computation loop is -implemented from line 21 up to 34. At each iteration: + + +Some input data and output results arrays and variables are +declared and allocated from line~3 through 10, and a computation loop is +implemented from line~22 through 34. At each iteration: \begin{itemize} -\item \texttt{cudaMemcpy} on line 23 transfers data from the CPU memory +\item \texttt{cudaMemcpy} on line~23 transfers data from the CPU memory to the GPU memory. This is a basic and synchronous data transfer. -\item \texttt{gpuKernel\_k1<<>>} on line 26 starts GPU computation - (running a GPU kernel on the grid of blocks of threads defined at line 12 to - 15). This is a standard GPU kernel run, it is an asynchronous operation. The +\item \texttt{gpuKernel\_k1<<>>} on line~26 starts GPU computation + (running a GPU kernel on the grid of blocks of threads defined in lines~13 to + 15). This is a standard GPU kernel run; it is an asynchronous operation. The CPU can continue to run its code. -\item \texttt{MPI\_Sendrecv\_replace} on line 27 achieves some blocking - internode communications, overlapping GPU computations started just before. -\item If needed, \texttt{cudaMemcpy} on line 31 transfers the iteration result from - one variable in the GPU memory at one array index in the CPU memory (in this example the CPU +\item \texttt{MPI\_Sendrecv\_replace} on line~27 achieves some blocking + internode communications, overlapping GPU computations started just previously. +\item If needed, \texttt{cudaMemcpy} on line~31 transfers the iteration result from + one variable in the GPU memory to one array index in the CPU memory (in this example the CPU collects all iteration results in an array). This operation is started after the end of the MPI communication (previous instruction) and after the end of the GPU kernel execution. CUDA insures an implicit synchronization of all operations involving the same GPU stream, like the default stream in this - example. Result transfer has to wait the GPU kernel execution is finished. - If there is no result transfer implemented, the next operation on the GPU - will wait until the GPU kernel execution will be ended. + example. The transfer of the results has to wait until the GPU kernel execution is finished. + If there is no results transfer implemented, the next operation on the GPU + will wait until the GPU kernel execution has ended. \end{itemize} This implementation is the easiest one involving the GPU. It achieves an -implicit overlap of internode communications and GPU computations, no explicit -multithreading is required on the CPU. However, CPU/GPU data transfers are +implicit overlap of internode communications and GPU computations with no explicit +multithreading required on the CPU. However, CPU/GPU data transfers are achieved serially and not overlapped. @@ -206,22 +210,26 @@ achieved serially and not overlapped. When CPU/GPU data transfers are not negligible compared to GPU computations, it can be interesting to overlap internode CPU computations with a \emph{GPU - sequence}\index{GPU sequence} including CPU/GPU data transfers and GPU computations (see + sequence}\index{GPU!sequence} including CPU/GPU data transfers and GPU computations (see \Fig{fig:ch6p1overlapseqsequence}). Algorithmic issues of this approach are basic, -but their implementation require explicit CPU multithreading and +but their implementation requires explicit CPU multithreading and synchronization, and CPU data buffer duplication. We need to implement two -threads, one starting and achieving MPI communications, and the other running +threads, one starting and achieving MPI communications and the other running the \emph{GPU sequence}. OpenMP allows an easy and portable implementation of this overlapping strategy. However, it remains more complex to develop and to maintain than the previous strategy (overlapping only internode CPU -communications and GPU computations), and should be adopted only when CPU/GPU +communications and GPU computations) and should be adopted only when CPU/GPU data transfer times are not negligible. +\Lst{algo:ch6p1overlapseqsequence} introduces the generic code of a +MPI+OpenMP+CUDA implementation, explicitly overlapping MPI communications with +GPU sequences. + %\begin{algorithm} % \caption{Generic scheme explicitly overlapping MPI communications with % sequences of CUDA CPU/GPU transfers and CUDA GPU % computations}\label{algo:ch6p1overlapseqsequence} -\begin{Listing}{algo:ch6p1overlapseqsequence}{Generic scheme explicitly overlapping MPI communications with sequences of CUDA CPU/GPU transfers and CUDA GPU computations} +\begin{Listing}{algo:ch6p1overlapseqsequence}{generic scheme explicitly overlapping MPI communications with sequences of CUDA CPU/GPU transfers and CUDA GPU computations} // Input data and result variables and arrays (example with // float datatype, 1D input arrays, and scalar results) float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture, *gpuInputTabAdr; @@ -234,7 +242,7 @@ cudaMalloc(&gpuInputTabAdr,sizeof(float)*N); cpuResTabAdr = malloc(sizeof(float)*NbIter); cudaMalloc(&gpuResAdr,sizeof(float)); -// Definition of the Grid of blocks of GPU threads +// Definition of the grid of blocks of GPU threads dim3 Dg, Db; Dg.x = ... ... @@ -276,9 +284,9 @@ omp_set_num_threads(2); cudaMemcpyDeviceToHost); } - // - Wait for both threads have achieved their iteration tasks + // - Wait until both threads have achieved their iteration tasks #pragma omp barrier - // - Each thread permute its local buffer pointers + // - Each thread permutes its local buffer pointers tmp = current; current = future; future = tmp; @@ -288,44 +296,56 @@ omp_set_num_threads(2); \end{Listing} %\end{algorithm} -\Lst{algo:ch6p1overlapseqsequence} introduces the generic code of a -MPI+OpenMP+CUDA implementation, explicitly overlapping MPI communications with -\emph{GPU sequences}. Lines 25--62 implement the OpenMP parallel region, -around the computation loop (lines 33--61). For performances it is +Lines~25--62 implement the OpenMP parallel region, +around the computation loop (lines~33--61). For efficient performances it is important to create and destroy threads only one time (not at each iteration): -the parallel region has to surround the computation loop. Lines 1--11 -consist in declaration and allocation of input data arrays and result arrays and -variables, like in previous algorithm (\Lst{algo:ch6p1overlapnative}). However, we implement two input data buffers on the +the parallel region has to surround the computation loop. Lines~3--11 +consist of declaration and allocation of input data arrays and result arrays and +variables, as in the previous algorithm (\Lst{algo:ch6p1overlapnative}). However, we implement two input data buffers on the CPU (current and future version). As we aim to overlap internode MPI communications and GPU sequence, including CPU to GPU data transfer of current input data array, we need to store the received new input data array in a separate buffer. Then, the current input data array will be safely read on the CPU and copied into the GPU memory. -The thread creations\index{OpenMP!thread creation} are easily achieved with one OpenMP directive (line -25). Then each thread defines and initializes \emph{its} local buffer pointers, -and enters \emph{its} computing loop (lines 27--33). Inside the computing -loop, a test on the thread number allows to run a different code in each -thread. Lines 37--40 implement the MPI synchronous communication run by -thread number $0$. Lines 45--52 implement the GPU sequence run by thread -$1$: CPU to GPU data transfer, GPU computation and GPU to CPU result +The thread creations\index{OpenMP!thread creation} are easily achieved with one OpenMP directive (line~25). +Then each thread defines and initializes \emph{its} local buffer pointers, +and enters \emph{its} computing loop (lines~28--33). Inside the computing +loop, a test on the thread number makes it possible to run a different code in each +thread. Lines~37--40 implement the MPI synchronous communication run by +thread number $0$. Lines~45--52 implement the GPU sequence run by thread +$1$: CPU to GPU data transfer, GPU computation, and GPU to CPU result transfer (if needed). Details of the three operations of this sequence have not changed -compared to the previous overlapping strategy. +from the previous overlapping strategy. At the end of \Lst{algo:ch6p1overlapseqsequence}, an OpenMP synchronization -barrier\index{OpenMP!barrier} on line 56 allows to wait OpenMP threads have achieved MPI -communications and GPU sequence, and do not need to access the current input -data buffer. Then, each thread permute its local buffer pointers (lines 58--60), +barrier\index{OpenMP!barrier} on line~56 forces the OpenMP threads to wait until +MPI +communications and GPU sequence are achieved. %, and do not need to access the current input data buffer. +Then, each thread permutes its local buffer pointers (lines~58--60), and is ready to enter the next iteration, processing the new current input array. - %\subsubsection{Overlapping with a streamed GPU sequence} \bigskip +%\pagebreak \noindent {\bf Overlapping with a streamed GPU sequence} \medskip +Depending on the algorithm implemented, it is sometimes possible to split the +GPU computation into several parts processing distinct data. Then, we can +speedup the GPU sequence using several CUDA streams\index{CUDA!stream}. The goal is +to overlap CPU/GPU data transfers with GPU computations\index{overlap!GPU data transfers with GPU computation} inside the GPU + sequence. Compared to the previous overlapping strategy, we have to split the +initial data transfer into a set of $n$ asynchronous and smaller data transfers, +and split the initial GPU kernel call into a set of $n$ calls to the same GPU +kernel. Usually, these smaller calls are deployed with fewer GPU threads +(i.e., associated to a smaller grid of blocks of threads). Then, the first GPU +computations can start as soon as the first data transfer has been achieved, and +next transfers can be done in parallel with next GPU computations (see +\Fig{fig:ch6p1overlapstreamsequence}). + \begin{figure}[t] \centering \includegraphics[width=\columnwidth]{Chapters/chapter6/figures/Sync-StreamSequenceOverlap.pdf} @@ -334,34 +354,25 @@ array. \label{fig:ch6p1overlapstreamsequence} \end{figure} -Depending on the algorithm implemented, it is sometimes possible to split the -GPU computation into several parts processing distinct data. Then, we can -speedup the \emph{GPU sequence} using several CUDA \emph{streams}\index{CUDA!stream}. The goal is -to overlap CPU/GPU data transfers with GPU computations\index{overlap!GPU data transfers with GPU computation} inside the \emph{GPU - sequence}. Compared to the previous overlapping strategy, we have to split the -initial data transfer in a set of $n$ asynchronous and smaller data transfers, -and to split the initial GPU kernel call in a set of $n$ calls to the same GPU -kernel. Usually, these smaller calls are deployed with less GPU threads -(i.e. associated to a smaller grid of blocks of threads). Then, the first GPU -computations can start as soon as the first data transfer has been achieved, and -next transfers can be done in parallel of next GPU computations (see -\Fig{fig:ch6p1overlapstreamsequence}). - -NVIDIA advises to start all asynchronous CUDA data transfers, and then to call +NVIDIA advises starting all asynchronous CUDA data transfers and then calling all CUDA kernel executions, using up to $16$ streams \cite{cudabestpractices}. -Then, CUDA driver and -runtime optimize the global execution of these operations. So, we cumulate two -overlapping mechanisms. The former is controlled by CPU multithreading, and -overlap MPI communications and the \emph{streamed GPU sequence}. The latter is -controlled by CUDA programming, and overlap CPU/GPU data transfers and GPU -computations. Again, OpenMP allows to easily implement the CPU multithreading, -and to wait for the end of both CPU threads before to execute the next instructions +Then, the CUDA driver and +runtime optimize the global execution of these operations. So, we accumulate two +overlapping mechanisms. The former is controlled by CPU multithreading and +overlaps MPI communications and the streamed GPU sequence. The latter is +controlled by CUDA programming and overlaps CPU/GPU data transfers and GPU +computations. Again, OpenMP allows the easy implementation of the CPU multithreading +and waiting for the end of both CPU threads before executing the next instructions of the code. +\Lst{algo:ch6p1overlapstreamsequence} introduces the generic MPI+OpenMP+CUDA +code, explicitly overlapping MPI communications with +streamed GPU sequences\index{GPU!streamed sequence}. + %\begin{algorithm} % \caption{Generic scheme explicitly overlapping MPI communications with streamed sequences of CUDA % CPU/GPU transfers and CUDA GPU computations}\label{algo:ch6p1overlapstreamsequence} -\begin{Listing}{algo:ch6p1overlapstreamsequence}{Generic scheme explicitly overlapping MPI communications with streamed sequences of CUDA CPU/GPU transfers and CUDA GPU computations} +\begin{Listing}{algo:ch6p1overlapstreamsequence}{generic scheme explicitly overlapping MPI communications with streamed sequences of CUDA CPU/GPU transfers and CUDA GPU computations} // Input data and result variables and arrays (example with // float datatype, 1D input arrays, and scalar results) float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture, *gpuInputTabAdr; @@ -376,7 +387,7 @@ cudaMalloc(&gpuResAdr,sizeof(float)); cudaStream_t TabS[NbS]; for(int s = 0; s < NbS; s++) cudaStreamCreate(&TabS[s]); -// Definition of the Grid of blocks of GPU threads +// Definition of the grid of blocks of GPU threads ... // Set the number of OpenMP threads (to create) to 2 omp_set_num_threads(2); @@ -417,9 +428,9 @@ omp_set_num_threads(2); sizeof(float), cudaMemcpyDeviceToHost); } - // - Wait for both threads have achieved their iteration tasks + // - Wait until both threads have achieved their iteration tasks #pragma omp barrier - // - Each thread permute its local buffer pointers + // - Each thread permutes its local buffer pointers tmp = current; current = future; future = tmp; } // End of computation loop } // End of OpenMP parallel region @@ -431,66 +442,59 @@ for(int s = 0; s < NbS; s++) \end{Listing} %\end{algorithm} -\Lst{algo:ch6p1overlapstreamsequence} introduces the generic MPI+OpenMP+CUDA -code explicitly overlapping MPI communications with -\emph{streamed GPU sequences}\index{GPU sequence!streamed}. Efficient usage of CUDA \emph{streams} requires to execute -asynchronous CPU/GPU data transfers, that needs to read page-locked +Efficient usage of CUDA streams requires executing +asynchronous CPU/GPU data transfers, which implies reading page-locked data\index{page-locked data} in CPU memory. So, CPU -memory allocations on lines 6 and 7 are implemented with \texttt{cudaHostAlloc} instead of -the basic \texttt{malloc} function. Then, $NbS$ \emph{streams} are created on lines 12--14. +memory allocations on lines~6 and 7 are implemented with \texttt{cudaHostAlloc} instead of +the basic \texttt{malloc} function. Then, $NbS$ \emph{streams} are created on lines~12--14. Usually we create $16$ streams: the maximum number supported by CUDA. -An OpenMP parallel region\index{OpenMP!parallel region} including two threads is implemented on lines 17--61 of -\Lst{algo:ch6p1overlapstreamsequence}, similarly to the previous algorithm (see +An OpenMP parallel region\index{OpenMP!parallel region} including two threads is implemented on lines~18--61 of +\Lst{algo:ch6p1overlapstreamsequence}, as in the previous algorithm (see \Lst{algo:ch6p1overlapseqsequence}). Code of thread $0$ achieving MPI communication is unchanged, but -code of thread $1$ is now using streams. Following NVIDIA recommandations, we have first implemented -a loop starting $NbS$ asynchronous data transfers (lines 39--45): transferring $N/NbS$ data on -each stream. Then we have implemented a second loop (lines 46--48), starting asynchronous +code of thread $1$ is now using streams. Following NVIDIA recommandations, we first implement +a loop starting $NbS$ asynchronous data transfers (lines~39--45): transferring $N/NbS$ data on +each stream. Then we implement a second loop (lines~46--48), starting asynchronous executions of $NbS$ grids of blocks of GPU threads (one per stream). Data transfers and kernel executions on the same stream are synchronized by CUDA and the GPU. So, each kernel execution will -start after its data will be transferred into the GPU memory, and the GPU scheduler ensures to start +start after its data has been transferred into the GPU memory, and the GPU +scheduler ensures the start of some kernel executions as soon as the first data transfers are achieved. Then, next data transfers will be overlapped with GPU computations. After the kernel calls, on the different streams, we wait for the end of all GPU threads previously run, calling an explicit synchronization -function on line 49. This synchronization is not mandatory, but it will make the implementation more +function on line~49. This synchronization is not mandatory, but it will make the implementation more robust and will facilitate the debugging steps: all GPU computations run by the OpenMP thread number -$1$ will be achieved before this thread will enter a new loop iteration, or before the computation -loop will be ended. +$1$ will be achieved before this thread enters a new loop iteration, or before the computation +loop has ended. If a partial result has to be transferred from GPU to CPU memory at the end of each loop iteration -(for example the result of one \emph{reduction} per iteration), this transfer is achieved -synchronously on the default stream (no particular stream is specified) on lines 51--54. -Availability of the result values is ensured by the synchronization implemented on line 49. -However, if a partial result has to be transferred on the CPU on each stream, then $NbS$ asynchronous data -transfers could be started in parallel (one per stream), and should be implemented before the -synchronization operation on line 49. The end of the computation loop includes a synchronization -barrier of the two OpenMP threads, waiting they have finished to access the different data -buffers in the current iteration. Then, each OpenMP thread exchanges its local buffer pointers, like -in the previous algorithm. However, after the computation loop, we have added the -destruction of the CUDA streams (lines 63--65). - -Finally, CUDA streams\index{CUDA!stream} have been used to extend \Lst{algo:ch6p1overlapseqsequence} -with respect to its global scheme. \Lst{algo:ch6p1overlapstreamsequence} still creates an -OpenMP parallel region, with two CPU threads, one in charge of MPI communications, and the other -managing data transfers and GPU computations. Unfortunately, using GPU streams require to be able to -split a GPU computation in independent subparts, working on independent subsets of data. -\Lst{algo:ch6p1overlapstreamsequence} is not so generic than \Lst{algo:ch6p1overlapseqsequence}. +(for example, the result of one \emph{reduction} per iteration), this transfer is achieved +synchronously on the default stream (no particular stream is specified) on lines~51--54. +Availability of the result values is ensured by the synchronization implemented on line~49. +However, if a partial result has to be transferred onto the CPU on each stream, then $NbS$ asynchronous data +transfers could be started in parallel (one per stream) and should be implemented before the +synchronization operation on line~49. The end of the computation loop includes a synchronization +barrier of the two OpenMP threads, waiting until they have finished accessing the different data +buffers in the current iteration. Then, each OpenMP thread exchanges its local buffer pointers, as +in the previous algorithm. After the computation loop, we have added the +destruction of the CUDA streams (lines~64--65). + +In conclusion, CUDA streams\index{CUDA!stream} have been used to extend +\Lst{algo:ch6p1overlapseqsequence} with respect to its global +scheme. \Lst{algo:ch6p1overlapstreamsequence} still creates an OpenMP parallel +region, with two CPU threads, one in charge of MPI communications and the other +managing data transfers and GPU computations. Unfortunately, using GPU streams +requires the ability to split a GPU computation into independent subparts, +working on independent subsets of data. \Lst{algo:ch6p1overlapstreamsequence} +is not so generic as \Lst{algo:ch6p1overlapseqsequence}. \subsection{Interleaved communications-transfers-computations overlapping} -\begin{figure}[t] - \centering - \includegraphics{Chapters/chapter6/figures/Sync-CompleteInterleaveOverlap.pdf} - \caption{Complete overlap of internode CPU communications, CPU/GPU data transfers and GPU - computations, interleaving computation-communication iterations} - \label{fig:ch6p1overlapinterleaved} -\end{figure} - -Many algorithms do not support to split data transfers and kernel calls, and can -not exploit CUDA streams. For example, when each GPU thread requires to access +Many algorithms do not support splitting data transfers and kernel calls, and +cannot exploit CUDA streams, for example, when each GPU thread requires access to some data spread in the global set of transferred data. Then, it is possible to -overlap internode CPU communications and CPU/GPU data transfers and GPU +overlap internode CPU communications, CPU/GPU data transfers, and GPU computations, if the algorithm achieves \emph{computation-communication iterations} and if we can interleave these iterations. At iteration $k$: CPUs exchange data $D_k$, each CPU/GPU couple transfers data $D_k$, and each GPU @@ -499,10 +503,23 @@ achieves computations on data $D_{k-1}$ (see strategy requires twice as many CPU data buffers and twice as many GPU buffers. +\begin{figure}[t] + \centering + \includegraphics{Chapters/chapter6/figures/Sync-CompleteInterleaveOverlap.pdf} + \caption{Complete overlap of internode CPU communications, CPU/GPU data transfers, and GPU + computations, interleaving computation-communication iterations.} + \label{fig:ch6p1overlapinterleaved} +\end{figure} + +\Lst{algo:ch6p1overlapinterleaved} introduces the generic code of a +MPI+OpenMP+CUDA implementation, explicitly interleaving +computation-communication iterations and overlapping MPI communications, CUDA CPU/GPU +transfers, and CUDA GPU computations. + %\begin{algorithm} % \caption{Generic scheme explicitly overlapping MPI communications, CUDA CPU/GPU transfers and CUDA % GPU computations, interleaving computation-communication iterations}\label{algo:ch6p1overlapinterleaved} -\begin{Listing}{algo:ch6p1overlapinterleaved}{Generic scheme explicitly overlapping MPI communications, CUDA CPU/GPU transfers and CUDA GPU computations, interleaving computation-communication iterations} +\begin{Listing}{algo:ch6p1overlapinterleaved}{generic scheme explicitly overlapping MPI communications, CUDA CPU/GPU transfers, and CUDA GPU computations, interleaving computation-communication iterations} // Input data and result variables and arrays (example with // float datatype, 1D input arrays, and scalar results) float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture; @@ -517,7 +534,7 @@ cudaMalloc(&gpuInputTabAdrFuture,sizeof(float)*N); cpuResTabAdr = malloc(sizeof(float)*NbIter); cudaMalloc(&gpuResAdr,sizeof(float)); -// Definition of the Grid of blocks of GPU threads +// Definition of the grid of blocks of GPU threads dim3 Dg, Db; Dg.x = ... // Indexes of source and destination MPI processes int dest, src; dest = ... @@ -534,7 +551,7 @@ omp_set_num_threads(3); float *gpuFuture = gpuInputTabAdrFuture; float *tmp; - // Computation loop on: NbIter + 1 iteration + // Computation loop on NbIter + 1 iterations for (int i = 0; i < NbIter + 1; i++) { // - Thread 0: achieves MPI communications if (omp_get_thread_num() == 0) { @@ -551,7 +568,7 @@ omp_set_num_threads(3); sizeof(float)*N, // CPU --> GPU (sync. op) cudaMemcpyHostToDevice); } - // - Thread 2: achieves the GPU computations and the result transfer + // - Thread 2: achieves the GPU computations and result transfer } else if (omp_get_thread_num() == 2) { if (i > 0) { gpuKernel_k1<<>>(gpuCurrent);// GPU comp. (async. op) @@ -561,9 +578,9 @@ omp_set_num_threads(3); cudaMemcpyDeviceToHost); } } - // - Wait for both threads have achieved their iteration tasks + // - Wait until both threads have achieved their iteration tasks #pragma omp barrier - // - Each thread permute its local buffer pointers + // - Each thread permutes its local buffer pointers tmp = cpuCurrent; cpuCurrent = cpuFuture; cpuFuture = tmp; tmp = gpuCurrent; gpuCurrent = gpuFuture; gpuFuture = tmp; } // End of computation loop @@ -572,50 +589,48 @@ omp_set_num_threads(3); \end{Listing} %\end{algorithm} -\Lst{algo:ch6p1overlapinterleaved} introduces the generic code of a -MPI+OpenMP+CUDA implementation, explicitly interleaving -computation-communication iterations and overlapping MPI communications, CUDA CPU/GPU -transfers and CUDA GPU computations. As in the previous algorithms, we declare two CPU input data arrays -(current and future version) on line 3, but we also declare two GPU input data arrays on line 4. On -lines 8--11, these four data arrays are allocated, using \texttt{malloc} and +As in the previous algorithms, we declare two CPU input data arrays +(current and future version) on line~3. However, in this version we also declare two GPU input data arrays on line~4. On +lines~8--11, these four data arrays are allocated, using \texttt{malloc} and \texttt{cudaMalloc}. -We do not need to allocate page-locked memory space. On lines 23--65 we -create an OpenMP parallel region, configured to run three threads (see line 21). Lines 26--30 are +We do not need to allocate page-locked memory space. On lines~23--65 we +create an OpenMP parallel region, configured to run three threads (see line~21). Lines~26--30 are declarations of thread local pointers on data arrays and variables (each thread will use its own -pointers). On line 33, the three threads enter a computation loop of \texttt{NbIter + 1} +pointers). On line~33, the three threads enter a computation loop of \texttt{NbIter + 1} iterations. We need to run one more iteration than with previous algorithms. -Lines 34--41 are the MPI communications, achieved by the thread number $0$. They send the +Lines~35--41 are the MPI communications, achieved by the thread number $0$. They send the current CPU input data array to another CPU, and receive the future CPU input data array from another CPU, like in previous algorithms. But this thread achieves communications only during the -\emph{first} \texttt{NbIter} iterations. Lines 43--48 are the CPU to GPU input data -transfers, achieved by thread number $1$. These data transfers are run in parallel of MPI -communications. They are run during the \emph{first} \texttt{NbIter} iterations, and transfer -current CPU input data array into the future GPU data array. Lines 50--57 +\emph{first} \texttt{NbIter} iterations. Lines~43--48 are the CPU to GPU input data +transfers, achieved by thread number $1$. These data transfers are run in parallel with MPI +communications. They are run during the \emph{first} \texttt{NbIter} iterations and transfer +current CPU input data array into the future GPU data array. Lines~50--57 correspond to the code run by -thread number $3$. They start GPU computations, to process the current GPU input data array, and if +thread number $2$. They start GPU computations, process the current GPU input data array, and if necessary -transfer a GPU result at an index of the CPU result array. These GPU computations and result +transfer a GPU result to an index of the CPU result array. These GPU computations and result transfers are run during the \emph{last} \texttt{NbIter} iterations: the GPU computations -have to wait the first data transfer is ended before to start to process any data, and can not run -during the first iteration. So, the activity of the third thread is shifted of one iteration -compared to the activities of other threads. Moreover, the address of the current GPU input data -array has to be passed as a parameter of the kernel call on line 52, in order the GPU threads access -the right data array. Like in previous algorithms the GPU result is copied at one index of the CPU -result array, in lines 53--56, but due to the shift of the third thread activity this index is +have to wait until the first data transfer is ended before starting to process any data and cannot run +during the first iteration. So, the activity of the third thread is shifted by one iteration +compared to the activities of the other threads. Moreover, the address of the current GPU input data +array has to be passed as a parameter of the kernel call on line~52, in order +for the GPU threads to access +the right data array. As in previous algorithms the GPU result is copied to one index of the CPU +result array, in lines~54--56, but due to the shift of the third thread activity this index is now \texttt{(i - 1)}. -Line 60 is a synchronization barrier\index{OpenMP!barrier} of the three OpenMP threads, followed by a pointer permutation -of local pointers on current and future data arrays, on line 62 and 63. Each +Line~60 is a synchronization barrier\index{OpenMP!barrier} of the three OpenMP threads, followed by a pointer permutation +of local pointers on current and future data arrays, on line~62 and 63. Each thread waits for the completion of other -threads to use the data arrays, and then permutes its data array pointers before to -enter a new loop iteration. +threads to use the data arrays, and then permutes its data array pointers before +entering a new loop iteration. -This complete overlap of MPI communications and CPU/GPU data transfers and GPU computations, is not +This complete overlap of MPI communications, CPU/GPU data transfers, and GPU computations is not too complex to implement, and can be a solution when GPU computations are not adapted to use CUDA -streams: when GPU computations can not be split in subparts working on independent subsets of input -data. However, it requires to run one more iterations (a total of \texttt{NbIter + 1} -iterations). Then, if the number of iterations is very small, it could be more interesting not to +streams: when GPU computations cannot be split into subparts working on independent subsets of input +data. However, this requires running one more iteration (a total of \texttt{NbIter + 1} +iterations). If the number of iterations is very small, it could be more interesting not to attempt to overlap CPU/GPU data transfers and GPU computations, and to implement \Lst{algo:ch6p1overlapseqsequence}. @@ -632,19 +647,18 @@ experiments presented in this chapter: \begin{itemize} \item The first consists of 17 nodes with an Intel Nehalem quad-core processor - at 2.67Ghz, 6 Gb RAM and an NVIDIA GeForce GTX480 GPU, each. + at 2.67Ghz, 6 Gb RAM, and an NVIDIA GeForce GTX480 GPU, each. \item The second consists of 16 nodes with an Intel core2 dual-core processor at - 2.67Ghz, 4 Gb RAM and an NVIDIA GeForce GTX580 GPU, each + 2.67Ghz, 4 Gb RAM, and an NVIDIA GeForce GTX580 GPU, each. \end{itemize} % -Both clusters have a Gigabit Ethernet interconnection network that is connected -through a DELL Power Object 5324 switch. The two switches are linked twice, -insuring the interconnection of the two clusters. The software environment +Both clusters have a gigabit Ethernet interconnection network that is connected +through a Dell Power Object 5324 switch. The two switches are linked twice, +ensuring the interconnection of the two clusters. The software environment consists of a Linux Fedora 64bit OS (kernel v. 2.6.35), GNU C and C++ compilers -(v. 4.5.1) and the CUDA library (v. 4.2). - +(v. 4.5.1), and the CUDA library (v. 4.2). %\subsubsection{Validation of the synchronous approach} @@ -656,18 +670,18 @@ consists of a Linux Fedora 64bit OS (kernel v. 2.6.35), GNU C and C++ compilers \centering \includegraphics{Chapters/chapter6/curves/gpuSyncOverlap.pdf} \caption{Experimental performances of different synchronous algorithms computing a - dense matrix product} + dense matrix product.} \label{fig:ch6p1syncexpematrixprod} \end{figure} \label{ch6:p1block-cyclic} -We have experimented our approach of synchronous parallel algorithms with a classic +We have tested our approach of synchronous parallel algorithms with a classic block cyclic algorithm for dense matrix multiplication\index{matrix - multiplication!block cyclic}. This problem requires to split two input matrices ($A$ and $B$) on a ring of -computing nodes, and to establish a circulation of the slices of $A$ matrix on the ring ($B$ matrix + multiplication!block cyclic}. This problem requires splitting two input matrices ($A$ and $B$) on a ring of +computing nodes and establishing a circulation of the slices of $A$ matrix on the ring ($B$ matrix partition does not evolve during all the run). Compared to our generic algorithms, there is no partial result to transfer from GPU to CPU at the end of each computing iteration. The part of the -result matrix computed on each GPU is transferred on the CPU at the end of the computation loop. +result matrix computed on each GPU is transferred onto the CPU at the end of the computation loop. We have first implemented a synchronous version without any overlap of MPI communications, CPU/GPU data transfers, and GPU computations. We have added some synchronizations in the native overlapping @@ -678,9 +692,9 @@ there is a significant increase in cost when comparing a single node (without an two nodes (starting to use MPI communications). But beyond two nodes we get a classical performance curve. -Then, we implemented and experimented \Lst{algo:ch6p1overlapnative}, see +Then, we implemented and tested \Lst{algo:ch6p1overlapnative}, labeled \emph{ovlp-native} in \Fig{fig:ch6p1syncexpematrixprod}. The native -overlap of MPI communications with asynchronous run of CUDA kernels appears efficient. When the +overlap of MPI communications with the asynchronous run of CUDA kernels appears efficient. When the number of nodes increases the ratio of the MPI communications increases a lot (because the computation times decrease a lot). So, there is not a lot of GPU computation time that remains to be @@ -692,10 +706,10 @@ a performance increase. Finally, we implemented \Lst{algo:ch6p1overlapseqsequence}, overlapping MPI communications with a GPU sequence including both CPU/GPU data transfers and GPU computations, -see \emph{ovlp-GPUsequence} in \Fig{fig:ch6p1syncexpematrixprod}. From four -up to sixteen nodes it achieves better performances than \emph{ovlp-native}: we better overlap -MPI communications. However, this parallelization mechanism has more overhead: OpenMP threads -have to be created and synchronized. Only for two nodes it is less efficient than the native +labeled \emph{ovlp-GPUsequence} in \Fig{fig:ch6p1syncexpematrixprod}. From four +up to sixteen nodes it achieves better performances than \emph{ovlp-native}: the +overlapping of MPI communications is wider and thus more efficient. However, this parallelization mechanism has more overhead: OpenMP threads +have to be created and synchronized. With only two nodes it is less efficient than the native overlapping algorithm. Beyond two nodes, the CPU multithreading overhead seems compensated. % No, it doesn't need the more implementation of time, but more implementation % of code :)