1 \section{General scheme of synchronous code with computation/communication
2 overlapping in GPU clusters}\label{ch6:part1}
4 %Notre chapitre précédent sur l'optimisation des schémas
5 %parallèles~\cite{ChVCV13}.
7 \subsection{Synchronous parallel algorithms on GPU clusters}
9 %\subsubsection{Considered parallel algorithms and implementations}
11 \noindent {\bf Considered parallel algorithms and implementations}
14 This section focusses on synchronous parallel algorithms implemented with
15 overlapping computations and communications\index{overlap!computation and communication}. Parallel synchronous algorithms are
16 easier to implement, debug and maintain than asynchronous ones, see
17 Section~\ref{ch6:part2}. Usually, they follow a BSP-like parallel
18 scheme\index{BSP parallel scheme},
19 alternating local computation steps and communication steps,
20 see~\cite{Valiant:BSP}. Their execution is usually deterministic, excepted
21 for stochastic algorithms that contain random number generations. Even in
22 this case, their execution can be controlled during debug steps, allowing to
23 track and to fix bugs quickly.
25 However, depending on the properties of the algorithm, it is sometimes possible to
26 overlap computations and communications. If processes exchange data
27 that is not needed for the
28 computation that is following immediately, it is possible to implement such
29 an overlap. We have investigated the efficiency of this approach in previous
30 works~\cite{GUSTEDT:2007:HAL-00280094:1,ChVCV13}, using standard parallel
31 programming tools to achieve the implementation.
33 The normalized and well known Message Passing Interface (MPI\index{MPI}) includes some asynchronous
34 point-to-point communication routines, that should allow to implement some
35 communication/computation overlap. However, current MPI implementations do not achieve
36 that goal efficiently; effective overlapping with MPI requires a group of
37 dedicated threads (in our case implemented with OpenMP\index{OpenMP}) for the basic
38 synchronous communications while another group of threads executes computations
40 % Finally, with explicit OpenMP
41 % threads executing MPI communications or various computations, we succeeded to
42 % decrease the execution time.
43 Nevertheless, communication and computation are not completely independent on
44 modern multicore architectures: they use shared hardware components such as the
45 interconnection bus and the RAM. Therefore that approach only saved up to $20\%$
46 of the expected time on such a platform. This picture changes on clusters
47 equipped with GPU. They effectively allow independence of computations on the
48 GPU and communication on the mainboard (CPU, interconnection bus, RAM, network
49 card). We saved up to $100\%$ of the expected time on our GPU cluster, as we
50 will expose in the next section.
53 %\subsubsection{Specific interests for GPU clusters}
56 \noindent {\bf Specific interests in GPU clusters}
59 In a computing node, a GPU is a kind of scientific coprocessor usually located
60 on an auxiliary board, with its own memory. So, when data have been transferred
61 from the CPU memory to the GPU memory, then GPU computations can be achieved on
62 the GPU board, totally in parallel of any CPU activities (like internode cluster
63 communications). The CPU and the GPU access their respective memories and do not
64 interfere, so they can achieve a very good overlap\index{overlap!computation
65 and computation} of their activities (better
68 But using a GPU on a computing node requires to transfer data from the CPU to
69 the GPU memory, and to transfer the computation results back from the GPU to the
70 CPU. Transfer times are not excessive, but depending on the application
71 they still can be significant compared to the GPU computation times. So, sometimes it
72 can be interesting to overlap the internode cluster communications with both the
73 CPU/GPU data transfers and the GPU computations. We can identify four main
74 parallel programming schemes on a GPU cluster:
77 \item parallelizing only 'internode CPU communications' with 'GPU computations',
78 and achieving CPU/GPU data transfers before and after this parallel step,
79 \item parallelizing 'internode CPU communications' with a '(sequential) sequence
80 of CPU/GPU data transfers and GPU computations',
81 \item parallelizing 'internode CPU communications' with a 'streamed sequence of
82 CPU/GPU data transfers and GPU computations',
83 \item parallelizing 'internode CPU communications' with 'CPU/GPU data transfers'
84 and with 'GPU computations', interleaving computation-communication
89 \subsection{Native overlap of CPU communications and GPU computations}
93 \includegraphics{Chapters/chapter6/figures/Sync-NativeOverlap.pdf}
94 \caption{Native overlap of internode CPU communications with GPU computations.}
95 \label{fig:ch6p1overlapnative}
98 Using CUDA\index{CUDA}, GPU kernel executions are non-blocking, and GPU/CPU data
99 transfers\index{CUDA!data transfer}
100 are blocking or non-blocking operations. All GPU kernel executions and CPU/GPU
101 data transfers are associated to "streams"\index{CUDA!stream}, and all operations on a same stream
102 are serialized. When transferring data from the CPU to the GPU, then running GPU
103 computations and finally transferring results from the GPU to the CPU, there is
104 a natural synchronization and serialization if these operations are achieved on
105 the same stream. GPU developers can choose to use one (default) or several
106 streams. In this first scheme of overlapping, we consider parallel codes using
109 "Non-blocking GPU kernel execution" means a CPU routine runs a parallel
110 execution of a GPU computing kernel, and the CPU routine continues its execution
111 (on the CPU) while the GPU kernel is running (on the GPU). Then the CPU routine
112 can initiate some communications with some others CPU, and so it automatically
113 overlaps the internode CPU communications with the GPU computations (see
114 \Fig{fig:ch6p1overlapnative}). This overlapping is natural when programming with
115 CUDA and MPI: it is easy to deploy, but does not overlap the CPU/GPU data
118 %\begin{algorithm}[t]
119 % \caption{Generic scheme implicitly overlapping MPI communications with CUDA GPU
120 % computations}\label{algo:ch6p1overlapnative}
122 \begin{Listing}{algo:ch6p1overlapnative}{Generic scheme implicitly overlapping MPI communications with CUDA GPU computations}
123 // Input data and result variables and arrays (example with
124 // float datatype, 1D input arrays, and scalar results)
125 float *cpuInputTabAdr, *gpuInputTabAdr;
126 float *cpuResTabAdr, *gpuResAdr;
128 // CPU and GPU array allocations
129 cpuInputTabAdr = malloc(sizeof(float)*N);
130 cudaMalloc(&gpuInputTabAdr,sizeof(float)*N);
131 cpuResTabAdr = malloc(sizeof(float)*NbIter);
132 cudaMalloc(&gpuResAdr,sizeof(float));
134 // Definition of the Grid of blocks of GPU threads
139 // Indexes of source and destination MPI processes
143 // Computation loop (using the GPU)
144 for (int i = 0; i < NbIter; i++) {
145 cudaMemcpy(gpuInputTabAdr, cpuInputTabAdr, // Data transfer:
146 sizeof(float)*N, // CPU --> GPU (sync. op)
147 cudaMemcpyHostToDevice);
148 gpuKernel_k1<<<Dg,Db>>>(); // GPU comp. (async. op)
149 MPI_Sendrecv_replace(cpuInputTabAdr, // MPI comms. (sync. op)
151 dest, 0, src, 0, ...);
152 // IF there is (now) a result to transfer from the GPU to the CPU:
153 cudaMemcpy(cpuResTabAdr + i, gpuResAdr, // Data transfer:
154 sizeof(float), // GPU --> CPU (sync. op)
155 cudaMemcpyDeviceToHost);
161 \Lst{algo:ch6p1overlapnative} introduces the generic code of a MPI+CUDA
162 implementation, natively and implicitly overlapping MPI communications with CUDA
163 GPU computations. Some input data and output results arrays and variables are
164 declared and allocated from line 1 up to 10, and a computation loop is
165 implemented from line 21 up to 34. At each iteration:
167 \item \texttt{cudaMemcpy} on line 23 transfers data from the CPU memory
168 to the GPU memory. This is a basic and synchronous data transfer.
169 \item \texttt{gpuKernel\_k1<<<Dg,Db>>>} on line 26 starts GPU computation
170 (running a GPU kernel on the grid of blocks of threads defined at line 12 to
171 15). This is a standard GPU kernel run, it is an asynchronous operation. The
172 CPU can continue to run its code.
173 \item \texttt{MPI\_Sendrecv\_replace} on line 27 achieves some blocking
174 internode communications, overlapping GPU computations started just before.
175 \item If needed, \texttt{cudaMemcpy} on line 31 transfers the iteration result from
176 one variable in the GPU memory at one array index in the CPU memory (in this example the CPU
177 collects all iteration results in an array). This operation is started after
178 the end of the MPI communication (previous instruction) and after the end of
179 the GPU kernel execution. CUDA insures an implicit synchronization of all
180 operations involving the same GPU stream, like the default stream in this
181 example. Result transfer has to wait the GPU kernel execution is finished.
182 If there is no result transfer implemented, the next operation on the GPU
183 will wait until the GPU kernel execution will be ended.
186 This implementation is the easiest one involving the GPU. It achieves an
187 implicit overlap of internode communications and GPU computations, no explicit
188 multithreading is required on the CPU. However, CPU/GPU data transfers are
189 achieved serially and not overlapped.
192 \subsection{Overlapping with sequences of transfers and computations}
194 %\subsubsection{Overlapping with a sequential GPU sequence}
196 \noindent {\bf Overlapping with a sequential GPU sequence}
201 \includegraphics[width=\columnwidth]{Chapters/chapter6/figures/Sync-SeqSequenceOverlap.pdf}
202 \caption{Overlap of internode CPU communications with a sequence of CPU/GPU data transfers and GPU
204 \label{fig:ch6p1overlapseqsequence}
207 When CPU/GPU data transfers are not negligible compared to GPU computations, it
208 can be interesting to overlap internode CPU computations with a \emph{GPU
209 sequence}\index{GPU sequence} including CPU/GPU data transfers and GPU computations (see
210 \Fig{fig:ch6p1overlapseqsequence}). Algorithmic issues of this approach are basic,
211 but their implementation require explicit CPU multithreading and
212 synchronization, and CPU data buffer duplication. We need to implement two
213 threads, one starting and achieving MPI communications, and the other running
214 the \emph{GPU sequence}. OpenMP allows an easy and portable implementation of
215 this overlapping strategy. However, it remains more complex to develop and to
216 maintain than the previous strategy (overlapping only internode CPU
217 communications and GPU computations), and should be adopted only when CPU/GPU
218 data transfer times are not negligible.
221 % \caption{Generic scheme explicitly overlapping MPI communications with
222 % sequences of CUDA CPU/GPU transfers and CUDA GPU
223 % computations}\label{algo:ch6p1overlapseqsequence}
224 \begin{Listing}{algo:ch6p1overlapseqsequence}{Generic scheme explicitly overlapping MPI communications with sequences of CUDA CPU/GPU transfers and CUDA GPU computations}
225 // Input data and result variables and arrays (example with
226 // float datatype, 1D input arrays, and scalar results)
227 float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture, *gpuInputTabAdr;
228 float *cpuResTabAdr, *gpuResAdr;
230 // CPU and GPU array allocations
231 cpuInputTabAdrCurrent = malloc(sizeof(float)*N);
232 cpuInputTabAdrFuture = malloc(sizeof(float)*N);
233 cudaMalloc(&gpuInputTabAdr,sizeof(float)*N);
234 cpuResTabAdr = malloc(sizeof(float)*NbIter);
235 cudaMalloc(&gpuResAdr,sizeof(float));
237 // Definition of the Grid of blocks of GPU threads
242 // Indexes of source and destination MPI processes
246 // Set the number of OpenMP threads (to create) to 2
247 omp_set_num_threads(2);
248 // Create threads and start the parallel OpenMP region
251 // Buffer pointers (thread local variables)
252 float *current = cpuInputTabAdrCurrent;
253 float *future = cpuInputTabAdrFuture;
256 // Computation loop (using the GPU)
257 for (int i = 0; i < NbIter; i++) {
259 // - Thread 0: achieves MPI communications
260 if (omp_get_thread_num() == 0) {
261 MPI_Sendrecv(current, // MPI comms. (sync. op)
262 N, MPI_FLOAT, dest, 0,
264 N, MPI_FLOAT, dest, 0, ...);
266 // - Thread 1: achieves the GPU sequence (GPU computations and
267 // CPU/GPU data transfers)
268 } else if (omp_get_thread_num() == 1) {
269 cudaMemcpy(gpuInputTabAdr, current, // Data transfer:
270 sizeof(float)*N, // CPU --> GPU (sync. op)
271 cudaMemcpyHostToDevice);
272 gpuKernel_k1<<<Dg,Db>>>(); // GPU comp. (async. op)
273 // IF there is (now) a result to transfer from the GPU to the CPU:
274 cudaMemcpy(cpuResTabAdr + i, gpuResAdr,// Data transfer:
275 sizeof(float), // GPU --> CPU (sync. op)
276 cudaMemcpyDeviceToHost);
279 // - Wait for both threads have achieved their iteration tasks
281 // - Each thread permute its local buffer pointers
285 } // End of computation loop
286 } // End of OpenMP parallel region
291 \Lst{algo:ch6p1overlapseqsequence} introduces the generic code of a
292 MPI+OpenMP+CUDA implementation, explicitly overlapping MPI communications with
293 \emph{GPU sequences}. Lines 25--62 implement the OpenMP parallel region,
294 around the computation loop (lines 33--61). For performances it is
295 important to create and destroy threads only one time (not at each iteration):
296 the parallel region has to surround the computation loop. Lines 1--11
297 consist in declaration and allocation of input data arrays and result arrays and
298 variables, like in previous algorithm (\Lst{algo:ch6p1overlapnative}). However, we implement two input data buffers on the
299 CPU (current and future version). As we aim to overlap internode MPI
300 communications and GPU sequence, including CPU to GPU data transfer of current
301 input data array, we need to store the received new input data array in a
302 separate buffer. Then, the current input data array will be safely read on the
303 CPU and copied into the GPU memory.
305 The thread creations\index{OpenMP!thread creation} are easily achieved with one OpenMP directive (line
306 25). Then each thread defines and initializes \emph{its} local buffer pointers,
307 and enters \emph{its} computing loop (lines 27--33). Inside the computing
308 loop, a test on the thread number allows to run a different code in each
309 thread. Lines 37--40 implement the MPI synchronous communication run by
310 thread number $0$. Lines 45--52 implement the GPU sequence run by thread
311 $1$: CPU to GPU data transfer, GPU computation and GPU to CPU result
312 transfer (if needed). Details of the three operations of this sequence have not changed
313 compared to the previous overlapping strategy.
315 At the end of \Lst{algo:ch6p1overlapseqsequence}, an OpenMP synchronization
316 barrier\index{OpenMP!barrier} on line 56 allows to wait OpenMP threads have achieved MPI
317 communications and GPU sequence, and do not need to access the current input
318 data buffer. Then, each thread permute its local buffer pointers (lines 58--60),
319 and is ready to enter the next iteration, processing the new current input
323 %\subsubsection{Overlapping with a streamed GPU sequence}
326 \noindent {\bf Overlapping with a streamed GPU sequence}
331 \includegraphics[width=\columnwidth]{Chapters/chapter6/figures/Sync-StreamSequenceOverlap.pdf}
332 \caption{Overlap of internode CPU communications with a streamed sequence of CPU/GPU data
333 transfers and GPU computations.}
334 \label{fig:ch6p1overlapstreamsequence}
337 Depending on the algorithm implemented, it is sometimes possible to split the
338 GPU computation into several parts processing distinct data. Then, we can
339 speedup the \emph{GPU sequence} using several CUDA \emph{streams}\index{CUDA!stream}. The goal is
340 to overlap CPU/GPU data transfers with GPU computations\index{overlap!GPU data transfers with GPU computation} inside the \emph{GPU
341 sequence}. Compared to the previous overlapping strategy, we have to split the
342 initial data transfer in a set of $n$ asynchronous and smaller data transfers,
343 and to split the initial GPU kernel call in a set of $n$ calls to the same GPU
344 kernel. Usually, these smaller calls are deployed with less GPU threads
345 (i.e. associated to a smaller grid of blocks of threads). Then, the first GPU
346 computations can start as soon as the first data transfer has been achieved, and
347 next transfers can be done in parallel of next GPU computations (see
348 \Fig{fig:ch6p1overlapstreamsequence}).
350 NVIDIA advises to start all asynchronous CUDA data transfers, and then to call
351 all CUDA kernel executions, using up to $16$ streams \cite{cudabestpractices}.
352 Then, CUDA driver and
353 runtime optimize the global execution of these operations. So, we cumulate two
354 overlapping mechanisms. The former is controlled by CPU multithreading, and
355 overlap MPI communications and the \emph{streamed GPU sequence}. The latter is
356 controlled by CUDA programming, and overlap CPU/GPU data transfers and GPU
357 computations. Again, OpenMP allows to easily implement the CPU multithreading,
358 and to wait for the end of both CPU threads before to execute the next instructions
362 % \caption{Generic scheme explicitly overlapping MPI communications with streamed sequences of CUDA
363 % CPU/GPU transfers and CUDA GPU computations}\label{algo:ch6p1overlapstreamsequence}
364 \begin{Listing}{algo:ch6p1overlapstreamsequence}{Generic scheme explicitly overlapping MPI communications with streamed sequences of CUDA CPU/GPU transfers and CUDA GPU computations}
365 // Input data and result variables and arrays (example with
366 // float datatype, 1D input arrays, and scalar results)
367 float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture, *gpuInputTabAdr;
368 float *cpuResTabAdr, *gpuResAdr;
369 // CPU and GPU array allocations (allocates page-locked CPU memory)
370 cudaHostAlloc(&cpuInputTabAdrCurrent,sizeof(float)*N,cudaHostAllocDefault);
371 cudaHostAlloc(&cpuInputTabAdrFuture,sizeof(float)*N,cudaHostAllocDefault);
372 cudaMalloc(&gpuInputTabAdr,sizeof(float)*N);
373 cpuResTabAdr = malloc(sizeof(float)*NbIter);
374 cudaMalloc(&gpuResAdr,sizeof(float));
375 // Stream declaration and creation
376 cudaStream_t TabS[NbS];
377 for(int s = 0; s < NbS; s++)
378 cudaStreamCreate(&TabS[s]);
379 // Definition of the Grid of blocks of GPU threads
381 // Set the number of OpenMP threads (to create) to 2
382 omp_set_num_threads(2);
383 // Create threads and start the parallel OpenMP region
386 // Buffer pointers (thread local variables)
387 float *current = cpuInputTabAdrCurrent;
388 float *future = cpuInputTabAdrFuture;
390 // Stride of data processed per stream
392 // Computation loop (using the GPU)
393 for (int i = 0; i < NbIter; i++) {
394 // - Thread 0: achieves MPI communications
395 if (omp_get_thread_num() == 0) {
396 MPI_Sendrecv(current, // MPI comms. (sync. op)
397 N, MPI_FLOAT, dest, 0,
399 N, MPI_FLOAT, dest, 0, ...);
400 // - Thread 1: achieves the streamed GPU sequence (GPU
401 // computations and CPU/GPU data transfers)
402 } else if (omp_get_thread_num() == 1) {
403 for (int s = 0; s < NbS; s++) { // Start all data transfers:
404 cudaMemcpyAsync(gpuInputTabAdr + s*stride, // CPU --> GPU
405 current + s*stride, // (async. ops)
406 sizeof(float)*stride,
407 cudaMemcpyHostToDevice,
410 for (int s = 0; s < NbS; s++) { // Start all GPU comps. (async.)
411 gpuKernel_k1<<<Dg, Db, 0, TabS[s]>>>(gpuInputTabAdr + s*stride);
413 cudaThreadSynchronize(); // Wait all threads are ended
414 // IF there is (now) a result to transfer from the GPU to the CPU:
415 cudaMemcpy(cpuResTabAdr, // Data transfers:
416 gpuResAdr, // GPU --> CPU (sync. op)
418 cudaMemcpyDeviceToHost);
420 // - Wait for both threads have achieved their iteration tasks
422 // - Each thread permute its local buffer pointers
423 tmp = current; current = future; future = tmp;
424 } // End of computation loop
425 } // End of OpenMP parallel region
427 // Destroy the streams
428 for(int s = 0; s < NbS; s++)
429 cudaStreamDestroy(TabS[s]);
434 \Lst{algo:ch6p1overlapstreamsequence} introduces the generic MPI+OpenMP+CUDA
435 code explicitly overlapping MPI communications with
436 \emph{streamed GPU sequences}\index{GPU sequence!streamed}. Efficient usage of CUDA \emph{streams} requires to execute
437 asynchronous CPU/GPU data transfers, that needs to read page-locked
438 data\index{page-locked data} in CPU memory. So, CPU
439 memory allocations on lines 6 and 7 are implemented with \texttt{cudaHostAlloc} instead of
440 the basic \texttt{malloc} function. Then, $NbS$ \emph{streams} are created on lines 12--14.
441 Usually we create $16$ streams: the maximum number supported by CUDA.
443 An OpenMP parallel region\index{OpenMP!parallel region} including two threads is implemented on lines 17--61 of
444 \Lst{algo:ch6p1overlapstreamsequence}, similarly to the previous algorithm (see
445 \Lst{algo:ch6p1overlapseqsequence}). Code of thread $0$ achieving MPI communication is unchanged, but
446 code of thread $1$ is now using streams. Following NVIDIA recommandations, we have first implemented
447 a loop starting $NbS$ asynchronous data transfers (lines 39--45): transferring $N/NbS$ data on
448 each stream. Then we have implemented a second loop (lines 46--48), starting asynchronous
449 executions of $NbS$ grids of blocks of GPU threads (one per stream). Data transfers and kernel
450 executions on the same stream are synchronized by CUDA and the GPU. So, each kernel execution will
451 start after its data will be transferred into the GPU memory, and the GPU scheduler ensures to start
452 some kernel executions as soon as the first data transfers are achieved. Then, next data transfers
453 will be overlapped with GPU computations. After the kernel calls, on the different streams,
454 we wait for the end of all GPU threads previously run, calling an explicit synchronization
455 function on line 49. This synchronization is not mandatory, but it will make the implementation more
456 robust and will facilitate the debugging steps: all GPU computations run by the OpenMP thread number
457 $1$ will be achieved before this thread will enter a new loop iteration, or before the computation
460 If a partial result has to be transferred from GPU to CPU memory at the end of each loop iteration
461 (for example the result of one \emph{reduction} per iteration), this transfer is achieved
462 synchronously on the default stream (no particular stream is specified) on lines 51--54.
463 Availability of the result values is ensured by the synchronization implemented on line 49.
464 However, if a partial result has to be transferred on the CPU on each stream, then $NbS$ asynchronous data
465 transfers could be started in parallel (one per stream), and should be implemented before the
466 synchronization operation on line 49. The end of the computation loop includes a synchronization
467 barrier of the two OpenMP threads, waiting they have finished to access the different data
468 buffers in the current iteration. Then, each OpenMP thread exchanges its local buffer pointers, like
469 in the previous algorithm. However, after the computation loop, we have added the
470 destruction of the CUDA streams (lines 63--65).
472 Finally, CUDA streams\index{CUDA!stream} have been used to extend \Lst{algo:ch6p1overlapseqsequence}
473 with respect to its global scheme. \Lst{algo:ch6p1overlapstreamsequence} still creates an
474 OpenMP parallel region, with two CPU threads, one in charge of MPI communications, and the other
475 managing data transfers and GPU computations. Unfortunately, using GPU streams require to be able to
476 split a GPU computation in independent subparts, working on independent subsets of data.
477 \Lst{algo:ch6p1overlapstreamsequence} is not so generic than \Lst{algo:ch6p1overlapseqsequence}.
480 \subsection{Interleaved communications-transfers-computations overlapping}
484 \includegraphics{Chapters/chapter6/figures/Sync-CompleteInterleaveOverlap.pdf}
485 \caption{Complete overlap of internode CPU communications, CPU/GPU data transfers and GPU
486 computations, interleaving computation-communication iterations}
487 \label{fig:ch6p1overlapinterleaved}
490 Many algorithms do not support to split data transfers and kernel calls, and can
491 not exploit CUDA streams. For example, when each GPU thread requires to access
492 some data spread in the global set of transferred data. Then, it is possible to
493 overlap internode CPU communications and CPU/GPU data transfers and GPU
494 computations, if the algorithm achieves \emph{computation-communication
495 iterations} and if we can interleave these iterations. At iteration $k$: CPUs
496 exchange data $D_k$, each CPU/GPU couple transfers data $D_k$, and each GPU
497 achieves computations on data $D_{k-1}$ (see
498 \Fig{fig:ch6p1overlapinterleaved}). Compared to the previous strategies, this
499 strategy requires twice as many CPU data buffers
500 and twice as many GPU buffers.
503 % \caption{Generic scheme explicitly overlapping MPI communications, CUDA CPU/GPU transfers and CUDA
504 % GPU computations, interleaving computation-communication iterations}\label{algo:ch6p1overlapinterleaved}
505 \begin{Listing}{algo:ch6p1overlapinterleaved}{Generic scheme explicitly overlapping MPI communications, CUDA CPU/GPU transfers and CUDA GPU computations, interleaving computation-communication iterations}
506 // Input data and result variables and arrays (example with
507 // float datatype, 1D input arrays, and scalar results)
508 float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture;
509 float *gpuInputTabAdrCurrent, *gpuInputTabAdrFuture;
510 float *cpuResTabAdr, *gpuResAdr;
512 // CPU and GPU array allocations
513 cpuInputTabAdrCurrent = malloc(sizeof(float)*N);
514 cpuInputTabAdrFuture = malloc(sizeof(float)*N);
515 cudaMalloc(&gpuInputTabAdrCurrent,sizeof(float)*N);
516 cudaMalloc(&gpuInputTabAdrFuture,sizeof(float)*N);
517 cpuResTabAdr = malloc(sizeof(float)*NbIter);
518 cudaMalloc(&gpuResAdr,sizeof(float));
520 // Definition of the Grid of blocks of GPU threads
521 dim3 Dg, Db; Dg.x = ...
522 // Indexes of source and destination MPI processes
523 int dest, src; dest = ...
525 // Set the number of OpenMP threads (to create) to 2
526 omp_set_num_threads(3);
527 // Create threads and start the parallel OpenMP region
530 // Buffer pointers (thread local variables)
531 float *cpuCurrent = cpuInputTabAdrCurrent;
532 float *cpuFuture = cpuInputTabAdrFuture;
533 float *gpuCurrent = gpuInputTabAdrCurrent;
534 float *gpuFuture = gpuInputTabAdrFuture;
537 // Computation loop on: NbIter + 1 iteration
538 for (int i = 0; i < NbIter + 1; i++) {
539 // - Thread 0: achieves MPI communications
540 if (omp_get_thread_num() == 0) {
542 MPI_Sendrecv(cpuCurrent, // MPI comms. (sync. op)
543 N, MPI_FLOAT, dest, 0,
545 N, MPI_FLOAT, dest, 0, ...);
547 // - Thread 1: achieves the CPU/GPU data transfers
548 } else if (omp_get_thread_num() == 1) {
550 cudaMemcpy(gpuFuture, cpuCurrent, // Data transfer:
551 sizeof(float)*N, // CPU --> GPU (sync. op)
552 cudaMemcpyHostToDevice);
554 // - Thread 2: achieves the GPU computations and the result transfer
555 } else if (omp_get_thread_num() == 2) {
557 gpuKernel_k1<<<Dg,Db>>>(gpuCurrent);// GPU comp. (async. op)
558 // IF there is (now) a result to transfer from GPU to CPU:
559 cudaMemcpy(cpuResTabAdr + (i-1), // Data transfer:
560 gpuResAdr, sizeof(float),// GPU --> CPU (sync. op)
561 cudaMemcpyDeviceToHost);
564 // - Wait for both threads have achieved their iteration tasks
566 // - Each thread permute its local buffer pointers
567 tmp = cpuCurrent; cpuCurrent = cpuFuture; cpuFuture = tmp;
568 tmp = gpuCurrent; gpuCurrent = gpuFuture; gpuFuture = tmp;
569 } // End of computation loop
570 } // End of OpenMP parallel region
575 \Lst{algo:ch6p1overlapinterleaved} introduces the generic code of a
576 MPI+OpenMP+CUDA implementation, explicitly interleaving
577 computation-communication iterations and overlapping MPI communications, CUDA CPU/GPU
578 transfers and CUDA GPU computations. As in the previous algorithms, we declare two CPU input data arrays
579 (current and future version) on line 3, but we also declare two GPU input data arrays on line 4. On
580 lines 8--11, these four data arrays are allocated, using \texttt{malloc} and
582 We do not need to allocate page-locked memory space. On lines 23--65 we
583 create an OpenMP parallel region, configured to run three threads (see line 21). Lines 26--30 are
584 declarations of thread local pointers on data arrays and variables (each thread will use its own
585 pointers). On line 33, the three threads enter a computation loop of \texttt{NbIter + 1}
586 iterations. We need to run one more iteration than with previous algorithms.
588 Lines 34--41 are the MPI communications, achieved by the thread number $0$. They send the
589 current CPU input data array to another CPU, and receive the future CPU input data array from
590 another CPU, like in previous algorithms. But this thread achieves communications only during the
591 \emph{first} \texttt{NbIter} iterations. Lines 43--48 are the CPU to GPU input data
592 transfers, achieved by thread number $1$. These data transfers are run in parallel of MPI
593 communications. They are run during the \emph{first} \texttt{NbIter} iterations, and transfer
594 current CPU input data array into the future GPU data array. Lines 50--57
595 correspond to the code run by
596 thread number $3$. They start GPU computations, to process the current GPU input data array, and if
598 transfer a GPU result at an index of the CPU result array. These GPU computations and result
599 transfers are run during the \emph{last} \texttt{NbIter} iterations: the GPU computations
600 have to wait the first data transfer is ended before to start to process any data, and can not run
601 during the first iteration. So, the activity of the third thread is shifted of one iteration
602 compared to the activities of other threads. Moreover, the address of the current GPU input data
603 array has to be passed as a parameter of the kernel call on line 52, in order the GPU threads access
604 the right data array. Like in previous algorithms the GPU result is copied at one index of the CPU
605 result array, in lines 53--56, but due to the shift of the third thread activity this index is
606 now \texttt{(i - 1)}.
608 Line 60 is a synchronization barrier\index{OpenMP!barrier} of the three OpenMP threads, followed by a pointer permutation
609 of local pointers on current and future data arrays, on line 62 and 63. Each
610 thread waits for the completion of other
611 threads to use the data arrays, and then permutes its data array pointers before to
612 enter a new loop iteration.
614 This complete overlap of MPI communications and CPU/GPU data transfers and GPU computations, is not
615 too complex to implement, and can be a solution when GPU computations are not adapted to use CUDA
616 streams: when GPU computations can not be split in subparts working on independent subsets of input
617 data. However, it requires to run one more iterations (a total of \texttt{NbIter + 1}
618 iterations). Then, if the number of iterations is very small, it could be more interesting not to
619 attempt to overlap CPU/GPU data transfers and GPU computations, and to implement \Lst{algo:ch6p1overlapseqsequence}.
622 \subsection{Experimental validation}
624 %\subsubsection{Experimentation testbed}
626 \noindent {\bf Experimentation testbed}
629 Two clusters located at SUPELEC in Metz (France) have been used for the entire set of
630 experiments presented in this chapter:
634 \item The first consists of 17 nodes with an Intel Nehalem quad-core processor
635 at 2.67Ghz, 6 Gb RAM and an NVIDIA GeForce GTX480 GPU, each.
637 \item The second consists of 16 nodes with an Intel core2 dual-core processor at
638 2.67Ghz, 4 Gb RAM and an NVIDIA GeForce GTX580 GPU, each
642 Both clusters have a Gigabit Ethernet interconnection network that is connected
643 through a DELL Power Object 5324 switch. The two switches are linked twice,
644 insuring the interconnection of the two clusters. The software environment
645 consists of a Linux Fedora 64bit OS (kernel v. 2.6.35), GNU C and C++ compilers
646 (v. 4.5.1) and the CUDA library (v. 4.2).
649 %\subsubsection{Validation of the synchronous approach}
652 \noindent {\bf Validation of the synchronous approach}
657 \includegraphics{Chapters/chapter6/curves/gpuSyncOverlap.pdf}
658 \caption{Experimental performances of different synchronous algorithms computing a
659 dense matrix product}
660 \label{fig:ch6p1syncexpematrixprod}
663 \label{ch6:p1block-cyclic}
664 We have experimented our approach of synchronous parallel algorithms with a classic
665 block cyclic algorithm for dense matrix multiplication\index{matrix
666 multiplication!block cyclic}. This problem requires to split two input matrices ($A$ and $B$) on a ring of
667 computing nodes, and to establish a circulation of the slices of $A$ matrix on the ring ($B$ matrix
668 partition does not evolve during all the run). Compared to our generic algorithms, there is no
669 partial result to transfer from GPU to CPU at the end of each computing iteration. The part of the
670 result matrix computed on each GPU is transferred on the CPU at the end of the computation loop.
672 We have first implemented a synchronous version without any overlap of MPI communications, CPU/GPU
673 data transfers, and GPU computations. We have added some synchronizations in the native overlapping
674 version in order to avoid any overlap. We have measured the performance achieved on our cluster with
675 NVIDIA GTX480 GPUs and matrices sizes of 4096$\times$4096, and we have obtained the curves in \Fig{fig:ch6p1syncexpematrixprod} labeled
676 \emph{no-ovlp}. We observe that performance increases when the number of processor increases. Of course,
677 there is a significant increase in cost when comparing a single node (without any MPI communication) with
678 two nodes (starting to use MPI communications). But beyond two nodes we get a classical performance
681 Then, we implemented and experimented \Lst{algo:ch6p1overlapnative}, see
682 \emph{ovlp-native} in \Fig{fig:ch6p1syncexpematrixprod}. The native
683 overlap of MPI communications with asynchronous run of CUDA kernels appears efficient. When the
684 number of nodes increases the ratio of the MPI communications increases a lot (because the
685 computation times decrease a lot). So, there is not a lot of GPU computation
686 time that remains to be
687 overlapped, and both \emph{no-ovlp} and \emph{ovlp-native} tend to the same
688 limit. Already, the native overlap performed in \Lst{algo:ch6p1overlapnative}
689 achieves a high level of performance very quickly, using only four
690 nodes. Beyond four nodes, a faster interconnection network would be required for
691 a performance increase.
693 Finally, we implemented \Lst{algo:ch6p1overlapseqsequence}, overlapping MPI communications
694 with a GPU sequence including both CPU/GPU data transfers and GPU computations,
695 see \emph{ovlp-GPUsequence} in \Fig{fig:ch6p1syncexpematrixprod}. From four
696 up to sixteen nodes it achieves better performances than \emph{ovlp-native}: we better overlap
697 MPI communications. However, this parallelization mechanism has more overhead: OpenMP threads
698 have to be created and synchronized. Only for two nodes it is less efficient than the native
699 overlapping algorithm. Beyond two nodes, the CPU multithreading overhead seems compensated.
700 % No, it doesn't need the more implementation of time, but more implementation
702 \Lst{algo:ch6p1overlapseqsequence} requires more time for the implemention
703 and can be more complex to maintain, but such extra development cost is
704 justified if we are looking for better performance.
710 %%% ispell-dictionary: "american"
712 %%% TeX-master: "../../BookGPU.tex"