1 \section{General scheme of synchronous code with computation/communication
2 overlapping in GPU clusters}\label{ch6:part1}
4 %Notre chapitre précédent sur l'optimisation des schémas
5 %parallèles~\cite{ChVCV13}.
7 \subsection{Synchronous parallel algorithms on GPU clusters}
9 %\subsubsection{Considered parallel algorithms and implementations}
11 \noindent {\bf Considered parallel algorithms and implementations}
14 This section focuses on synchronous parallel algorithms implemented with
15 overlapping computations and communications\index{overlap!computation and communication}. Parallel synchronous algorithms are
16 easier to implement, debug, and maintain than asynchronous ones (see
17 Section~\ref{ch6:part2}). Usually, they follow a BSP-like parallel
18 scheme\index{BSP parallel scheme} (Bulk Synchronous Parallel model),
19 alternating local computation steps and communication steps
20 (see~\cite{Valiant:BSP}). Their execution is usually deterministic, except
21 for stochastic algorithms that contain random number generations. Even in
22 this case, their execution can be controlled during debug steps, allowing the
23 user to track and to fix bugs quickly.
25 However, depending on the properties of the algorithm, it is sometimes possible to
26 overlap computations and communications. If processes exchange data
27 that is not needed for the
28 computation immediately following, it is possible to implement such
29 an overlap. We have investigated the efficiency of this approach in previous
30 works~\cite{GUSTEDT:2007:HAL-00280094:1,ChVCV13}, using standard parallel
31 programming tools to achieve the implementation.
33 The normalized and well-known Message Passing Interface (MPI\index{MPI}) includes some asynchronous
34 point-to-point communication routines, that should allow the implementation of some
35 communication/computation overlap. However, current MPI implementations do not achieve
36 this goal efficiently; effective overlapping with MPI requires a group of
37 dedicated threads (in our case implemented with OpenMP\index{OpenMP}) for the basic
38 synchronous communications while another group of threads executes computations
40 % Finally, with explicit OpenMP
41 % threads executing MPI communications or various computations, we succeeded to
42 % decrease the execution time.
43 Nevertheless, communication and computation are not completely independent on
44 modern multicore architectures: they use shared hardware components such as the
45 interconnection bus and the RAM. Therefore, this approach saves only up to $20\%$
46 of the expected time on such a platform. This picture changes on clusters
47 equipped with GPUs. Indeed, GPUs effectively allow independence of computations they
48 perform and communications done on the mainboard (CPU, interconnection bus, RAM, network
49 card). We save up to $100\%$ of the expected time on our GPU cluster, as we
50 will expose in the next section.
53 %\subsubsection{Specific interests for GPU clusters}
56 \noindent {\bf Specific interests in GPU clusters}
59 In a computing node, a GPU is a kind of scientific coprocessor, usually located
60 on an auxiliary board, with its own memory. So, once data are transferred
61 from the CPU memory to the GPU memory, GPU computations can be achieved on
62 the GPU board, totally in parallel with any CPU activities (such as internode cluster
63 communications). The CPU and the GPU access their respective memories and do not
64 interfere with each other, so they can achieve a very good overlap\index{overlap!computation
65 and computation} of their activities (better than two CPU cores).
67 But using a GPU on a computing node requires the transfer of data from the CPU to
68 the GPU memory, as well as the transfer of the computation results back from the GPU to the
69 CPU. Transfer times are not excessive, but depending on the application
70 they still can be significant compared to the GPU computation times. So, sometimes it
71 can be interesting to overlap the internode cluster communications with both the
72 CPU/GPU data transfers and the GPU computations. We can identify four main
73 parallel programming schemes on a GPU cluster:
76 \item parallelizing only internode CPU communications with GPU computations,
77 and achieving CPU/GPU data transfers before and after this parallel step,
78 \item parallelizing internode CPU communications with a (sequential) sequence
79 of CPU/GPU data transfers and GPU computations,
80 \item parallelizing internode CPU communications with a streamed sequence of
81 CPU/GPU data transfers and GPU computations,
82 \item parallelizing internode CPU communications with CPU/GPU data transfers
83 and with GPU computations, interleaving computation-communication
88 \subsection{Native overlap of CPU communications and GPU computations}
92 \includegraphics{Chapters/chapter6/figures/Sync-NativeOverlap.pdf}
93 \caption{Native overlap of internode CPU communications with GPU computations.}
94 \label{fig:ch6p1overlapnative}
97 Using CUDA\index{CUDA}, GPU kernel executions are nonblocking, and GPU/CPU data
98 transfers\index{CUDA!data transfer}
99 are blocking or nonblocking operations. All GPU kernel executions and CPU/GPU
100 data transfers are associated to ``streams'',\index{CUDA!stream} and all operations on a same stream
101 are serialized. When transferring data from the CPU to the GPU, then running GPU
102 computations, and finally transferring results from the GPU to the CPU, there is
103 a natural synchronization and serialization if these operations are achieved on
104 the same stream. GPU developers can choose to use one (default) or several
105 streams. In this first scheme of overlapping, we consider parallel codes using
108 ``Nonblocking GPU kernel execution'' means a CPU routine runs a parallel
109 execution of a GPU computing kernel, and the CPU routine continues its execution
110 (on the CPU) while the GPU kernel is running (on the GPU). Then the CPU routine
111 can initiate some communications with some other CPUs, and so it automatically
112 overlaps the internode CPU communications with the GPU computations (see
113 \Fig{fig:ch6p1overlapnative}). This overlapping is natural when programming with
114 CUDA and MPI: it is easy to deploy, but does not overlap the CPU/GPU data
117 \Lst{algo:ch6p1overlapnative} introduces the generic code of a MPI+CUDA
118 implementation, natively and implicitly overlapping MPI communications with CUDA
122 %\begin{algorithm}[t]
123 % \caption{Generic scheme implicitly overlapping MPI communications with CUDA GPU
124 % computations}\label{algo:ch6p1overlapnative}
126 \begin{Listing}{algo:ch6p1overlapnative}{generic scheme implicitly overlapping MPI communications with CUDA GPU computations}
127 // Input data and result variables and arrays (example with
128 // float datatype, 1D input arrays, and scalar results)
129 float *cpuInputTabAdr, *gpuInputTabAdr;
130 float *cpuResTabAdr, *gpuResAdr;
132 // CPU and GPU array allocations
133 cpuInputTabAdr = malloc(sizeof(float)*N);
134 cudaMalloc(&gpuInputTabAdr,sizeof(float)*N);
135 cpuResTabAdr = malloc(sizeof(float)*NbIter);
136 cudaMalloc(&gpuResAdr,sizeof(float));
138 // Definition of the grid of blocks of GPU threads
143 // Indexes of source and destination MPI processes
147 // Computation loop (using the GPU)
148 for (int i = 0; i < NbIter; i++) {
149 cudaMemcpy(gpuInputTabAdr, cpuInputTabAdr, // Data transfer:
150 sizeof(float)*N, // CPU --> GPU (sync. op)
151 cudaMemcpyHostToDevice);
152 gpuKernel_k1<<<Dg,Db>>>(); // GPU comp. (async. op)
153 MPI_Sendrecv_replace(cpuInputTabAdr, // MPI comms. (sync. op)
155 dest, 0, src, 0, ...);
156 // IF there is (now) a result to transfer from the GPU to the CPU:
157 cudaMemcpy(cpuResTabAdr + i, gpuResAdr, // Data transfer:
158 sizeof(float), // GPU --> CPU (sync. op)
159 cudaMemcpyDeviceToHost);
167 Some input data and output results arrays and variables are
168 declared and allocated from line~3 through 10, and a computation loop is
169 implemented from line~22 through 34. At each iteration:
171 \item \texttt{cudaMemcpy} on line~23 transfers data from the CPU memory
172 to the GPU memory. This is a basic and synchronous data transfer.
173 \item \texttt{gpuKernel\_k1<<<Dg,Db>>>} on line~26 starts GPU computation
174 (running a GPU kernel on the grid of blocks of threads defined in lines~13 to
175 15). This is a standard GPU kernel run; it is an asynchronous operation. The
176 CPU can continue to run its code.
177 \item \texttt{MPI\_Sendrecv\_replace} on line~27 achieves some blocking
178 internode communications, overlapping GPU computations started just previously.
179 \item If needed, \texttt{cudaMemcpy} on line~31 transfers the iteration result from
180 one variable in the GPU memory to one array index in the CPU memory (in this example the CPU
181 collects all iteration results in an array). This operation is started after
182 the end of the MPI communication (previous instruction) and after the end of
183 the GPU kernel execution. CUDA insures an implicit synchronization of all
184 operations involving the same GPU stream, like the default stream in this
185 example. The transfer of the results has to wait until the GPU kernel execution is finished.
186 If there is no results transfer implemented, the next operation on the GPU
187 will wait until the GPU kernel execution has ended.
190 This implementation is the easiest one involving the GPU. It achieves an
191 implicit overlap of internode communications and GPU computations with no explicit
192 multithreading required on the CPU. However, CPU/GPU data transfers are
193 achieved serially and not overlapped.
196 \subsection{Overlapping with sequences of transfers and computations}
198 %\subsubsection{Overlapping with a sequential GPU sequence}
200 \noindent {\bf Overlapping with a sequential GPU sequence}
205 \includegraphics[width=\columnwidth]{Chapters/chapter6/figures/Sync-SeqSequenceOverlap.pdf}
206 \caption{Overlap of internode CPU communications with a sequence of CPU/GPU data transfers and GPU
208 \label{fig:ch6p1overlapseqsequence}
211 When CPU/GPU data transfers are not negligible compared to GPU computations, it
212 can be interesting to overlap internode CPU computations with a \emph{GPU
213 sequence}\index{GPU!sequence} including CPU/GPU data transfers and GPU computations (see
214 \Fig{fig:ch6p1overlapseqsequence}). Algorithmic issues of this approach are basic,
215 but their implementation requires explicit CPU multithreading and
216 synchronization, and CPU data buffer duplication. We need to implement two
217 threads, one starting and achieving MPI communications and the other running
218 the \emph{GPU sequence}. OpenMP allows an easy and portable implementation of
219 this overlapping strategy. However, it remains more complex to develop and to
220 maintain than the previous strategy (overlapping only internode CPU
221 communications and GPU computations) and should be adopted only when CPU/GPU
222 data transfer times are not negligible.
224 \Lst{algo:ch6p1overlapseqsequence} introduces the generic code of a
225 MPI+OpenMP+CUDA implementation, explicitly overlapping MPI communications with
229 % \caption{Generic scheme explicitly overlapping MPI communications with
230 % sequences of CUDA CPU/GPU transfers and CUDA GPU
231 % computations}\label{algo:ch6p1overlapseqsequence}
232 \begin{Listing}{algo:ch6p1overlapseqsequence}{generic scheme explicitly overlapping MPI communications with sequences of CUDA CPU/GPU transfers and CUDA GPU computations}
233 // Input data and result variables and arrays (example with
234 // float datatype, 1D input arrays, and scalar results)
235 float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture, *gpuInputTabAdr;
236 float *cpuResTabAdr, *gpuResAdr;
238 // CPU and GPU array allocations
239 cpuInputTabAdrCurrent = malloc(sizeof(float)*N);
240 cpuInputTabAdrFuture = malloc(sizeof(float)*N);
241 cudaMalloc(&gpuInputTabAdr,sizeof(float)*N);
242 cpuResTabAdr = malloc(sizeof(float)*NbIter);
243 cudaMalloc(&gpuResAdr,sizeof(float));
245 // Definition of the grid of blocks of GPU threads
250 // Indexes of source and destination MPI processes
254 // Set the number of OpenMP threads (to create) to 2
255 omp_set_num_threads(2);
256 // Create threads and start the parallel OpenMP region
259 // Buffer pointers (thread local variables)
260 float *current = cpuInputTabAdrCurrent;
261 float *future = cpuInputTabAdrFuture;
264 // Computation loop (using the GPU)
265 for (int i = 0; i < NbIter; i++) {
267 // - Thread 0: achieves MPI communications
268 if (omp_get_thread_num() == 0) {
269 MPI_Sendrecv(current, // MPI comms. (sync. op)
270 N, MPI_FLOAT, dest, 0,
272 N, MPI_FLOAT, dest, 0, ...);
274 // - Thread 1: achieves the GPU sequence (GPU computations and
275 // CPU/GPU data transfers)
276 } else if (omp_get_thread_num() == 1) {
277 cudaMemcpy(gpuInputTabAdr, current, // Data transfer:
278 sizeof(float)*N, // CPU --> GPU (sync. op)
279 cudaMemcpyHostToDevice);
280 gpuKernel_k1<<<Dg,Db>>>(); // GPU comp. (async. op)
281 // IF there is (now) a result to transfer from the GPU to the CPU:
282 cudaMemcpy(cpuResTabAdr + i, gpuResAdr,// Data transfer:
283 sizeof(float), // GPU --> CPU (sync. op)
284 cudaMemcpyDeviceToHost);
287 // - Wait until both threads have achieved their iteration tasks
289 // - Each thread permutes its local buffer pointers
293 } // End of computation loop
294 } // End of OpenMP parallel region
299 Lines~25--62 implement the OpenMP parallel region,
300 around the computation loop (lines~33--61). For efficient performances it is
301 important to create and destroy threads only one time (not at each iteration):
302 the parallel region has to surround the computation loop. Lines~3--11
303 consist of declaration and allocation of input data arrays and result arrays and
304 variables, as in the previous algorithm (\Lst{algo:ch6p1overlapnative}). However, we implement two input data buffers on the
305 CPU (current and future version). As we aim to overlap internode MPI
306 communications and GPU sequence, including CPU to GPU data transfer of current
307 input data array, we need to store the received new input data array in a
308 separate buffer. Then, the current input data array will be safely read on the
309 CPU and copied into the GPU memory.
311 The thread creations\index{OpenMP!thread creation} are easily achieved with one OpenMP directive (line~25).
312 Then each thread defines and initializes \emph{its} local buffer pointers,
313 and enters \emph{its} computing loop (lines~28--33). Inside the computing
314 loop, a test on the thread number makes it possible to run a different code in each
315 thread. Lines~37--40 implement the MPI synchronous communication run by
316 thread number $0$. Lines~45--52 implement the GPU sequence run by thread
317 $1$: CPU to GPU data transfer, GPU computation, and GPU to CPU result
318 transfer (if needed). Details of the three operations of this sequence have not changed
319 from the previous overlapping strategy.
321 At the end of \Lst{algo:ch6p1overlapseqsequence}, an OpenMP synchronization
322 barrier\index{OpenMP!barrier} on line~56 forces the OpenMP threads to wait until
324 communications and GPU sequence are achieved. %, and do not need to access the current input data buffer.
325 Then, each thread permutes its local buffer pointers (lines~58--60),
326 and is ready to enter the next iteration, processing the new current input
329 %\subsubsection{Overlapping with a streamed GPU sequence}
333 \noindent {\bf Overlapping with a streamed GPU sequence}
336 Depending on the algorithm implemented, it is sometimes possible to split the
337 GPU computation into several parts processing distinct data. Then, we can
338 speedup the GPU sequence using several CUDA streams\index{CUDA!stream}. The goal is
339 to overlap CPU/GPU data transfers with GPU computations\index{overlap!GPU data transfers with GPU computation} inside the GPU
340 sequence. Compared to the previous overlapping strategy, we have to split the
341 initial data transfer into a set of $n$ asynchronous and smaller data transfers,
342 and split the initial GPU kernel call into a set of $n$ calls to the same GPU
343 kernel. Usually, these smaller calls are deployed with fewer GPU threads
344 (i.e., associated to a smaller grid of blocks of threads). Then, the first GPU
345 computations can start as soon as the first data transfer has been achieved, and
346 next transfers can be done in parallel with next GPU computations (see
347 \Fig{fig:ch6p1overlapstreamsequence}).
351 \includegraphics[width=\columnwidth]{Chapters/chapter6/figures/Sync-StreamSequenceOverlap.pdf}
352 \caption{Overlap of internode CPU communications with a streamed sequence of CPU/GPU data
353 transfers and GPU computations.}
354 \label{fig:ch6p1overlapstreamsequence}
357 NVIDIA advises starting all asynchronous CUDA data transfers and then calling
358 all CUDA kernel executions, using up to $16$ streams \cite{cudabestpractices}.
359 Then, the CUDA driver and
360 runtime optimize the global execution of these operations. So, we accumulate two
361 overlapping mechanisms. The former is controlled by CPU multithreading and
362 overlaps MPI communications and the streamed GPU sequence. The latter is
363 controlled by CUDA programming and overlaps CPU/GPU data transfers and GPU
364 computations. Again, OpenMP allows the easy implementation of the CPU multithreading
365 and waiting for the end of both CPU threads before executing the next instructions
368 \Lst{algo:ch6p1overlapstreamsequence} introduces the generic MPI+OpenMP+CUDA
369 code, explicitly overlapping MPI communications with
370 streamed GPU sequences\index{GPU!streamed sequence}.
373 % \caption{Generic scheme explicitly overlapping MPI communications with streamed sequences of CUDA
374 % CPU/GPU transfers and CUDA GPU computations}\label{algo:ch6p1overlapstreamsequence}
375 \begin{Listing}{algo:ch6p1overlapstreamsequence}{generic scheme explicitly overlapping MPI communications with streamed sequences of CUDA CPU/GPU transfers and CUDA GPU computations}
376 // Input data and result variables and arrays (example with
377 // float datatype, 1D input arrays, and scalar results)
378 float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture, *gpuInputTabAdr;
379 float *cpuResTabAdr, *gpuResAdr;
380 // CPU and GPU array allocations (allocates page-locked CPU memory)
381 cudaHostAlloc(&cpuInputTabAdrCurrent,sizeof(float)*N,cudaHostAllocDefault);
382 cudaHostAlloc(&cpuInputTabAdrFuture,sizeof(float)*N,cudaHostAllocDefault);
383 cudaMalloc(&gpuInputTabAdr,sizeof(float)*N);
384 cpuResTabAdr = malloc(sizeof(float)*NbIter);
385 cudaMalloc(&gpuResAdr,sizeof(float));
386 // Stream declaration and creation
387 cudaStream_t TabS[NbS];
388 for(int s = 0; s < NbS; s++)
389 cudaStreamCreate(&TabS[s]);
390 // Definition of the grid of blocks of GPU threads
392 // Set the number of OpenMP threads (to create) to 2
393 omp_set_num_threads(2);
394 // Create threads and start the parallel OpenMP region
397 // Buffer pointers (thread local variables)
398 float *current = cpuInputTabAdrCurrent;
399 float *future = cpuInputTabAdrFuture;
401 // Stride of data processed per stream
403 // Computation loop (using the GPU)
404 for (int i = 0; i < NbIter; i++) {
405 // - Thread 0: achieves MPI communications
406 if (omp_get_thread_num() == 0) {
407 MPI_Sendrecv(current, // MPI comms. (sync. op)
408 N, MPI_FLOAT, dest, 0,
410 N, MPI_FLOAT, dest, 0, ...);
411 // - Thread 1: achieves the streamed GPU sequence (GPU
412 // computations and CPU/GPU data transfers)
413 } else if (omp_get_thread_num() == 1) {
414 for (int s = 0; s < NbS; s++) { // Start all data transfers:
415 cudaMemcpyAsync(gpuInputTabAdr + s*stride, // CPU --> GPU
416 current + s*stride, // (async. ops)
417 sizeof(float)*stride,
418 cudaMemcpyHostToDevice,
421 for (int s = 0; s < NbS; s++) { // Start all GPU comps. (async.)
422 gpuKernel_k1<<<Dg, Db, 0, TabS[s]>>>(gpuInputTabAdr + s*stride);
424 cudaThreadSynchronize(); // Wait all threads are ended
425 // IF there is (now) a result to transfer from the GPU to the CPU:
426 cudaMemcpy(cpuResTabAdr, // Data transfers:
427 gpuResAdr, // GPU --> CPU (sync. op)
429 cudaMemcpyDeviceToHost);
431 // - Wait until both threads have achieved their iteration tasks
433 // - Each thread permutes its local buffer pointers
434 tmp = current; current = future; future = tmp;
435 } // End of computation loop
436 } // End of OpenMP parallel region
438 // Destroy the streams
439 for(int s = 0; s < NbS; s++)
440 cudaStreamDestroy(TabS[s]);
445 Efficient usage of CUDA streams requires executing
446 asynchronous CPU/GPU data transfers, which implies reading page-locked
447 data\index{page-locked data} in CPU memory. So, CPU
448 memory allocations on lines~6 and 7 are implemented with \texttt{cudaHostAlloc} instead of
449 the basic \texttt{malloc} function. Then, $NbS$ \emph{streams} are created on lines~12--14.
450 Usually we create $16$ streams: the maximum number supported by CUDA.
452 An OpenMP parallel region\index{OpenMP!parallel region} including two threads is implemented on lines~18--61 of
453 \Lst{algo:ch6p1overlapstreamsequence}, as in the previous algorithm (see
454 \Lst{algo:ch6p1overlapseqsequence}). Code of thread $0$ achieving MPI communication is unchanged, but
455 code of thread $1$ is now using streams. Following NVIDIA recommandations, we first implement
456 a loop starting $NbS$ asynchronous data transfers (lines~39--45): transferring $N/NbS$ data on
457 each stream. Then we implement a second loop (lines~46--48), starting asynchronous
458 executions of $NbS$ grids of blocks of GPU threads (one per stream). Data transfers and kernel
459 executions on the same stream are synchronized by CUDA and the GPU. So, each kernel execution will
460 start after its data has been transferred into the GPU memory, and the GPU
461 scheduler ensures the start of
462 some kernel executions as soon as the first data transfers are achieved. Then, next data transfers
463 will be overlapped with GPU computations. After the kernel calls, on the different streams,
464 we wait for the end of all GPU threads previously run, calling an explicit synchronization
465 function on line~49. This synchronization is not mandatory, but it will make the implementation more
466 robust and will facilitate the debugging steps: all GPU computations run by the OpenMP thread number
467 $1$ will be achieved before this thread enters a new loop iteration, or before the computation
470 If a partial result has to be transferred from GPU to CPU memory at the end of each loop iteration
471 (for example, the result of one \emph{reduction} per iteration), this transfer is achieved
472 synchronously on the default stream (no particular stream is specified) on lines~51--54.
473 Availability of the result values is ensured by the synchronization implemented on line~49.
474 However, if a partial result has to be transferred onto the CPU on each stream, then $NbS$ asynchronous data
475 transfers could be started in parallel (one per stream) and should be implemented before the
476 synchronization operation on line~49. The end of the computation loop includes a synchronization
477 barrier of the two OpenMP threads, waiting until they have finished accessing the different data
478 buffers in the current iteration. Then, each OpenMP thread exchanges its local buffer pointers, as
479 in the previous algorithm. After the computation loop, we have added the
480 destruction of the CUDA streams (lines~64--65).
482 In conclusion, CUDA streams\index{CUDA!stream} have been used to extend
483 \Lst{algo:ch6p1overlapseqsequence} with respect to its global
484 scheme. \Lst{algo:ch6p1overlapstreamsequence} still creates an OpenMP parallel
485 region, with two CPU threads, one in charge of MPI communications and the other
486 managing data transfers and GPU computations. Unfortunately, using GPU streams
487 requires the ability to split a GPU computation into independent subparts,
488 working on independent subsets of data. \Lst{algo:ch6p1overlapstreamsequence}
489 is not so generic as \Lst{algo:ch6p1overlapseqsequence}.
492 \subsection{Interleaved communications-transfers-computations overlapping}
494 Many algorithms do not support splitting data transfers and kernel calls, and
495 cannot exploit CUDA streams, for example, when each GPU thread requires access to
496 some data spread in the global set of transferred data. Then, it is possible to
497 overlap internode CPU communications, CPU/GPU data transfers, and GPU
498 computations, if the algorithm achieves \emph{computation-communication
499 iterations} and if we can interleave these iterations. At iteration $k$: CPUs
500 exchange data $D_k$, each CPU/GPU couple transfers data $D_k$, and each GPU
501 achieves computations on data $D_{k-1}$ (see
502 \Fig{fig:ch6p1overlapinterleaved}). Compared to the previous strategies, this
503 strategy requires twice as many CPU data buffers
504 and twice as many GPU buffers.
508 \includegraphics{Chapters/chapter6/figures/Sync-CompleteInterleaveOverlap.pdf}
509 \caption[Complete overlap of internode CPU communications,\break\hfill CPU/GPU data transfers, and GPU
510 computations, interleaving computation-communication iterations.]{Complete overlap of internode CPU communications, CPU/GPU data transfers, and GPU
511 computations, interleaving computation-communication iterations.}
512 \label{fig:ch6p1overlapinterleaved}
515 \Lst{algo:ch6p1overlapinterleaved} introduces the generic code of a
516 MPI+OpenMP+CUDA implementation, explicitly interleaving
517 computation-communication iterations and overlapping MPI communications, CUDA CPU/GPU
518 transfers, and CUDA GPU computations.
521 % \caption{Generic scheme explicitly overlapping MPI communications, CUDA CPU/GPU transfers and CUDA
522 % GPU computations, interleaving computation-communication iterations}\label{algo:ch6p1overlapinterleaved}
523 \begin{Listing}{algo:ch6p1overlapinterleaved}{generic scheme explicitly overlapping MPI communications, CUDA CPU/GPU transfers, and CUDA GPU computations, interleaving computation-communication iterations}
524 // Input data and result variables and arrays (example with
525 // float datatype, 1D input arrays, and scalar results)
526 float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture;
527 float *gpuInputTabAdrCurrent, *gpuInputTabAdrFuture;
528 float *cpuResTabAdr, *gpuResAdr;
530 // CPU and GPU array allocations
531 cpuInputTabAdrCurrent = malloc(sizeof(float)*N);
532 cpuInputTabAdrFuture = malloc(sizeof(float)*N);
533 cudaMalloc(&gpuInputTabAdrCurrent,sizeof(float)*N);
534 cudaMalloc(&gpuInputTabAdrFuture,sizeof(float)*N);
535 cpuResTabAdr = malloc(sizeof(float)*NbIter);
536 cudaMalloc(&gpuResAdr,sizeof(float));
538 // Definition of the grid of blocks of GPU threads
539 dim3 Dg, Db; Dg.x = ...
540 // Indexes of source and destination MPI processes
541 int dest, src; dest = ...
543 // Set the number of OpenMP threads (to create) to 2
544 omp_set_num_threads(3);
545 // Create threads and start the parallel OpenMP region
548 // Buffer pointers (thread local variables)
549 float *cpuCurrent = cpuInputTabAdrCurrent;
550 float *cpuFuture = cpuInputTabAdrFuture;
551 float *gpuCurrent = gpuInputTabAdrCurrent;
552 float *gpuFuture = gpuInputTabAdrFuture;
555 // Computation loop on NbIter + 1 iterations
556 for (int i = 0; i < NbIter + 1; i++) {
557 // - Thread 0: achieves MPI communications
558 if (omp_get_thread_num() == 0) {
560 MPI_Sendrecv(cpuCurrent, // MPI comms. (sync. op)
561 N, MPI_FLOAT, dest, 0,
563 N, MPI_FLOAT, dest, 0, ...);
565 // - Thread 1: achieves the CPU/GPU data transfers
566 } else if (omp_get_thread_num() == 1) {
568 cudaMemcpy(gpuFuture, cpuCurrent, // Data transfer:
569 sizeof(float)*N, // CPU --> GPU (sync. op)
570 cudaMemcpyHostToDevice);
572 // - Thread 2: achieves the GPU computations and result transfer
573 } else if (omp_get_thread_num() == 2) {
575 gpuKernel_k1<<<Dg,Db>>>(gpuCurrent);// GPU comp. (async. op)
576 // IF there is (now) a result to transfer from GPU to CPU:
577 cudaMemcpy(cpuResTabAdr + (i-1), // Data transfer:
578 gpuResAdr, sizeof(float),// GPU --> CPU (sync. op)
579 cudaMemcpyDeviceToHost);
582 // - Wait until both threads have achieved their iteration tasks
584 // - Each thread permutes its local buffer pointers
585 tmp = cpuCurrent; cpuCurrent = cpuFuture; cpuFuture = tmp;
586 tmp = gpuCurrent; gpuCurrent = gpuFuture; gpuFuture = tmp;
587 } // End of computation loop
588 } // End of OpenMP parallel region
593 As in the previous algorithms, we declare two CPU input data arrays
594 (current and future version) on line~3. However, in this version we also declare two GPU input data arrays on line~4. On
595 lines~8--11, these four data arrays are allocated, using \texttt{malloc} and
597 We do not need to allocate page-locked memory space. On lines~23--65 we
598 create an OpenMP parallel region, configured to run three threads (see line~21). Lines~26--30 are
599 declarations of thread local pointers on data arrays and variables (each thread will use its own
600 pointers). On line~33, the three threads enter a computation loop of \texttt{NbIter + 1}
601 iterations. We need to run one more iteration than with previous algorithms.
603 Lines~35--41 are the MPI communications, achieved by the thread number $0$. They send the
604 current CPU input data array to another CPU, and receive the future CPU input data array from
605 another CPU, like in previous algorithms. But this thread achieves communications only during the
606 \emph{first} \texttt{NbIter} iterations. Lines~43--48 are the CPU to GPU input data
607 transfers, achieved by thread number $1$. These data transfers are run in parallel with MPI
608 communications. They are run during the \emph{first} \texttt{NbIter} iterations and transfer
609 current CPU input data array into the future GPU data array. Lines~50--57
610 correspond to the code run by
611 thread number $2$. They start GPU computations, process the current GPU input data array, and if
613 transfer a GPU result to an index of the CPU result array. These GPU computations and result
614 transfers are run during the \emph{last} \texttt{NbIter} iterations: the GPU computations
615 have to wait until the first data transfer is ended before starting to process any data and cannot run
616 during the first iteration. So, the activity of the third thread is shifted by one iteration
617 compared to the activities of the other threads. Moreover, the address of the current GPU input data
618 array has to be passed as a parameter of the kernel call on line~52, in order
619 for the GPU threads to access
620 the right data array. As in previous algorithms the GPU result is copied to one index of the CPU
621 result array, in lines~54--56, but due to the shift of the third thread activity this index is
622 now \texttt{(i - 1)}.
624 Line~60 is a synchronization barrier\index{OpenMP!barrier} of the three OpenMP threads, followed by a pointer permutation
625 of local pointers on current and future data arrays, on line~62 and 63. Each
626 thread waits for the completion of other
627 threads to use the data arrays, and then permutes its data array pointers before
628 entering a new loop iteration.
630 This complete overlap of MPI communications, CPU/GPU data transfers, and GPU computations is not
631 too complex to implement, and can be a solution when GPU computations are not adapted to use CUDA
632 streams: when GPU computations cannot be split into subparts working on independent subsets of input
633 data. However, this requires running one more iteration (a total of \texttt{NbIter + 1}
634 iterations). If the number of iterations is very small, it could be more interesting not to
635 attempt to overlap CPU/GPU data transfers and GPU computations, and to implement \Lst{algo:ch6p1overlapseqsequence}.
638 \subsection{Experimental validation}
640 %\subsubsection{Experimentation testbed}
642 \noindent {\bf Experimentation testbed}
645 Two clusters located at SUPELEC in Metz (France) have been used for the entire set of
646 experiments presented in this chapter:
650 \item The first consists of 17 nodes with an Intel Nehalem quad-core processor
651 at 2.67Ghz, 6 Gb RAM, and an NVIDIA GeForce GTX480 GPU, each.
653 \item The second consists of 16 nodes with an Intel core2 dual-core processor at
654 2.67Ghz, 4 Gb RAM, and an NVIDIA GeForce GTX580 GPU, each.
658 Both clusters have a gigabit Ethernet interconnection network that is connected
659 through a Dell Power Object 5324 switch. The two switches are linked twice,
660 ensuring the interconnection of the two clusters. The software environment
661 consists of a Linux Fedora 64bit OS (kernel v. 2.6.35), GNU C and C++ compilers
662 (v. 4.5.1), and the CUDA library (v. 4.2).
664 %\subsubsection{Validation of the synchronous approach}
667 \noindent {\bf Validation of the synchronous approach}
672 \includegraphics{Chapters/chapter6/curves/gpuSyncOverlap.pdf}
673 \caption{Experimental performances of different synchronous algorithms computing a
674 dense matrix product.}
675 \label{fig:ch6p1syncexpematrixprod}
678 \label{ch6:p1block-cyclic}
679 We have tested our approach of synchronous parallel algorithms with a classic
680 block cyclic algorithm for dense matrix multiplication\index{matrix
681 multiplication!block cyclic}. This problem requires splitting two input matrices ($A$ and $B$) on a ring of
682 computing nodes and establishing a circulation of the slices of $A$ matrix on the ring ($B$ matrix
683 partition does not evolve during all the run). Compared to our generic algorithms, there is no
684 partial result to transfer from GPU to CPU at the end of each computing iteration. The part of the
685 result matrix computed on each GPU is transferred onto the CPU at the end of the computation loop.
687 We have first implemented a synchronous version without any overlap of MPI communications, CPU/GPU
688 data transfers, and GPU computations. We have added some synchronizations in the native overlapping
689 version in order to avoid any overlap. We have measured the performance achieved on our cluster with
690 NVIDIA GTX480 GPUs and matrices sizes of 4096$\times$4096, and we have obtained the curves in \Fig{fig:ch6p1syncexpematrixprod} labeled
691 \emph{no-ovlp}. We observe that performance increases when the number of processor increases. Of course,
692 there is a significant increase in cost when comparing a single node (without any MPI communication) with
693 two nodes (starting to use MPI communications). But beyond two nodes we get a classical performance
696 Then, we implemented and tested \Lst{algo:ch6p1overlapnative}, labeled
697 \emph{ovlp-native} in \Fig{fig:ch6p1syncexpematrixprod}. The native
698 overlap of MPI communications with the asynchronous run of CUDA kernels appears efficient. When the
699 number of nodes increases the ratio of the MPI communications increases a lot (because the
700 computation times decrease a lot). So, there is not a lot of GPU computation
701 time that remains to be
702 overlapped, and both \emph{no-ovlp} and \emph{ovlp-native} tend to the same
703 limit. Already, the native overlap performed in \Lst{algo:ch6p1overlapnative}
704 achieves a high level of performance very quickly, using only four
705 nodes. Beyond four nodes, a faster interconnection network would be required for
706 a performance increase.
708 Finally, we implemented \Lst{algo:ch6p1overlapseqsequence}, overlapping MPI communications
709 with a GPU sequence including both CPU/GPU data transfers and GPU computations,
710 labeled \emph{ovlp-GPUsequence} in \Fig{fig:ch6p1syncexpematrixprod}. From four
711 up to sixteen nodes it achieves better performances than \emph{ovlp-native}: the
712 overlapping of MPI communications is wider and thus more efficient. However, this parallelization mechanism has more overhead: OpenMP threads
713 have to be created and synchronized. With only two nodes it is less efficient than the native
714 overlapping algorithm. Beyond two nodes, the CPU multithreading overhead seems compensated.
715 % No, it doesn't need the more implementation of time, but more implementation
717 \Lst{algo:ch6p1overlapseqsequence} requires more time for the implemention
718 and can be more complex to maintain, but such extra development cost is
719 justified if we are looking for better performance.
725 %%% ispell-dictionary: "american"
727 %%% TeX-master: "../../BookGPU.tex"