BookGPU/Chapters/chapter6/PartieSync.tex

   1 \section{General scheme of synchronous code with computation/communication
   2   overlapping in GPU clusters}\label{ch6:part1}
   3
   4 %Notre     chapitre     précédent      sur     l'optimisation     des     schémas
   5 %parallèles~\cite{ChVCV13}.
   6
   7 \subsection{Synchronous parallel algorithms on GPU clusters}
   8
   9 %\subsubsection{Considered parallel algorithms and implementations}
  10
  11 \noindent {\bf Considered parallel algorithms and implementations}
  12 \medskip
  13
  14 This section focuses on synchronous parallel algorithms implemented with
  15 overlapping computations and communications\index{overlap!computation and communication}. Parallel synchronous algorithms are
  16 easier to implement, debug, and maintain than asynchronous ones (see
  17 Section~\ref{ch6:part2}). Usually, they follow a BSP-like parallel
  18 scheme\index{BSP parallel scheme} (Bulk Synchronous Parallel model),
  19 alternating local computation steps and communication steps
  20 (see~\cite{Valiant:BSP}). Their execution is usually deterministic, except
  21 for stochastic algorithms that contain random number generations. Even in
  22 this case, their execution can be controlled during debug steps, allowing the
  23 user to track and to fix bugs quickly.
  24
  25 However, depending on the properties of the algorithm, it is sometimes possible to
  26 overlap computations and communications. If processes exchange data
  27 that is not needed for the
  28 computation immediately following, it is possible to implement such
  29 an overlap. We have investigated the efficiency of this approach in previous
  30 works~\cite{GUSTEDT:2007:HAL-00280094:1,ChVCV13}, using standard parallel
  31 programming tools to achieve the implementation.
  32
  33 The normalized and well-known Message Passing Interface (MPI\index{MPI}) includes some asynchronous
  34 point-to-point communication routines, that should allow the implementation of some
  35 communication/computation overlap. However, current MPI implementations do not achieve
  36 this goal efficiently; effective overlapping with MPI requires a group of
  37 dedicated threads (in our case implemented with OpenMP\index{OpenMP}) for the basic
  38 synchronous communications while another group of threads executes computations
  39 in parallel.
  40 % Finally, with explicit OpenMP
  41 % threads executing MPI communications or various computations, we succeeded to
  42 % decrease the execution time.
  43 Nevertheless, communication and computation are not completely independent on
  44 modern multicore architectures: they use shared hardware components such as the
  45 interconnection bus and the RAM. Therefore, this approach saves only up to $20\%$
  46 of the expected time on such a platform. This picture changes on clusters
  47 equipped with GPUs. Indeed, GPUs effectively allow independence of computations they
  48 perform and communications done on the mainboard (CPU, interconnection bus, RAM, network
  49 card). We save up to $100\%$ of the expected time on our GPU cluster, as we
  50 will expose in the next section.
  51
  52
  53 %\subsubsection{Specific interests for GPU clusters}
  54
  55 \bigskip
  56 \noindent {\bf Specific interests in GPU clusters}
  57 \medskip
  58
  59 In a computing node, a GPU is a kind of scientific coprocessor, usually located
  60 on an auxiliary board, with its own memory. So, once data are transferred
  61 from the CPU memory to the GPU memory, GPU computations can be achieved on
  62 the GPU board, totally in parallel with any CPU activities (such as internode cluster
  63 communications). The CPU and the GPU access their respective memories and do not
  64 interfere with each other, so they can achieve a very good overlap\index{overlap!computation
  65 and computation} of their activities (better than two CPU cores).
  66
  67 But using a GPU on a computing node requires the transfer of data from the CPU to
  68 the GPU memory, as well as the transfer of the computation results back from the GPU to the
  69 CPU. Transfer times are not excessive, but depending on the application
  70 they still can be significant compared to the GPU computation times. So, sometimes it
  71 can be interesting to overlap the internode cluster communications with both the
  72 CPU/GPU data transfers and the GPU computations. We can identify four main
  73 parallel programming schemes on a GPU cluster:
  74
  75 \begin{enumerate}
  76 \item parallelizing only internode CPU communications with GPU computations,
  77   and achieving CPU/GPU data transfers before and after this parallel step,
  78 \item parallelizing internode CPU communications with a (sequential) sequence
  79   of CPU/GPU data transfers and GPU computations,
  80 \item parallelizing internode CPU communications with a streamed sequence of
  81   CPU/GPU data transfers and GPU computations,
  82 \item parallelizing internode CPU communications with CPU/GPU data transfers
  83   and with GPU computations, interleaving computation-communication
  84   iterations.
  85 \end{enumerate}
  86
  87
  88 \subsection{Native overlap of CPU communications and GPU computations}
  89
  90 \begin{figure}[t]
  91   \centering
  92   \includegraphics{Chapters/chapter6/figures/Sync-NativeOverlap.pdf}
  93   \caption{Native overlap of internode CPU communications with GPU computations.}
  94   \label{fig:ch6p1overlapnative}
  95 \end{figure}
  96
  97 Using CUDA\index{CUDA}, GPU kernel executions are nonblocking, and GPU/CPU data
  98 transfers\index{CUDA!data transfer}
  99 are blocking or nonblocking operations. All GPU kernel executions and CPU/GPU
 100 data transfers are associated to ``streams'',\index{CUDA!stream} and all operations on a same stream
 101 are serialized. When transferring data from the CPU to the GPU, then running GPU
 102 computations, and finally transferring results from the GPU to the CPU, there is
 103 a natural synchronization and serialization if these operations are achieved on
 104 the same stream. GPU developers can choose to use one (default) or several
 105 streams. In this first scheme of overlapping, we consider parallel codes using
 106 only one GPU stream.
 107
 108 ``Nonblocking  GPU  kernel  execution''  means  a CPU  routine  runs  a  parallel
 109 execution of a GPU computing kernel, and the CPU routine continues its execution
 110 (on the CPU) while the GPU kernel  is running (on the GPU). Then the CPU routine
 111 can initiate some  communications with some other CPUs,  and so it automatically
 112 overlaps  the  internode  CPU  communications  with the  GPU  computations  (see
 113 \Fig{fig:ch6p1overlapnative}). This overlapping is natural when programming with
 114 CUDA  and MPI:  it is  easy to  deploy, but  does not  overlap the  CPU/GPU data
 115 transfers.
 116
 117 \Lst{algo:ch6p1overlapnative} introduces the generic code of a MPI+CUDA
 118 implementation, natively and implicitly overlapping MPI communications with CUDA
 119 GPU computations.
 120
 121
 122 %\begin{algorithm}[t]
 123 %  \caption{Generic scheme implicitly overlapping MPI communications with CUDA GPU
 124 %   computations}\label{algo:ch6p1overlapnative}
 125 %\pagebreak
 126 \begin{Listing}{algo:ch6p1overlapnative}{generic scheme implicitly overlapping MPI communications with CUDA GPU computations}
 127 // Input data and result variables and arrays (example with
 128 // float datatype, 1D input arrays, and scalar results)
 129 float *cpuInputTabAdr, *gpuInputTabAdr;
 130 float *cpuResTabAdr, *gpuResAdr;
 131
 132 // CPU and GPU array allocations
 133 cpuInputTabAdr = malloc(sizeof(float)*N);
 134 cudaMalloc(&gpuInputTabAdr,sizeof(float)*N);
 135 cpuResTabAdr = malloc(sizeof(float)*NbIter);
 136 cudaMalloc(&gpuResAdr,sizeof(float));
 137
 138 // Definition of the grid of blocks of GPU threads
 139 dim3 Dg, Db;
 140 Dg.x = ...
 141 ...
 142
 143 // Indexes of source and destination MPI processes
 144 int dest = ...
 145 int src = ...
 146
 147 // Computation loop (using the GPU)
 148 for (int i = 0; i < NbIter; i++) {
 149   cudaMemcpy(gpuInputTabAdr, cpuInputTabAdr, // Data transfer:
 150              sizeof(float)*N,                // CPU --> GPU (sync. op)
 151              cudaMemcpyHostToDevice);
 152   gpuKernel_k1<<<Dg,Db>>>();                 // GPU comp. (async. op)
 153   MPI_Sendrecv_replace(cpuInputTabAdr,       // MPI comms. (sync. op)
 154                        N,MPI_FLOAT,
 155                        dest, 0, src, 0, ...);
 156   // IF there is (now) a result to transfer from the GPU to the CPU:
 157   cudaMemcpy(cpuResTabAdr + i, gpuResAdr,    // Data transfer:
 158              sizeof(float),                  // GPU --> CPU (sync. op)
 159              cudaMemcpyDeviceToHost);
 160 }
 161 ...
 162 \end{Listing}
 163 %\end{algorithm}
 164
 165
 166
 167 Some input data and output results arrays and variables are
 168 declared and allocated from line~3 through 10, and a computation loop is
 169 implemented from line~22 through 34. At each iteration:
 170 \begin{itemize}
 171 \item \texttt{cudaMemcpy} on line~23 transfers data from the CPU memory
 172   to the GPU memory. This is a basic and synchronous data transfer.
 173 \item \texttt{gpuKernel\_k1<<<Dg,Db>>>} on line~26 starts GPU computation
 174   (running a GPU kernel on the grid of blocks of threads defined in lines~13 to
 175   15). This is a standard GPU kernel run; it is an asynchronous operation. The
 176   CPU can continue to run its code.
 177 \item \texttt{MPI\_Sendrecv\_replace} on line~27 achieves some blocking
 178   internode communications, overlapping GPU computations started just previously.
 179 \item If needed, \texttt{cudaMemcpy} on line~31 transfers the iteration result from
 180   one variable in the GPU memory to one array index in the CPU memory (in this example the CPU
 181   collects all iteration results in an array). This operation is started after
 182   the end of the MPI communication (previous instruction) and after the end of
 183   the GPU kernel execution. CUDA insures an implicit synchronization of all
 184   operations involving the same GPU stream, like the default stream in this
 185   example. The transfer of the results has to wait until the GPU kernel execution is finished.
 186   If there is no results transfer implemented, the next operation on the GPU
 187   will wait until the GPU kernel execution has ended.
 188 \end{itemize}
 189
 190 This implementation is the easiest one involving the GPU. It achieves an
 191 implicit overlap of internode communications and GPU computations with no explicit
 192 multithreading required on the CPU. However, CPU/GPU data transfers are
 193 achieved serially and not overlapped.
 194
 195
 196 \subsection{Overlapping with sequences of transfers and computations}
 197
 198 %\subsubsection{Overlapping with a sequential GPU sequence}
 199
 200 \noindent {\bf Overlapping with a sequential GPU sequence}
 201 \medskip
 202
 203 \begin{figure}[t]
 204   \centering
 205   \includegraphics[width=\columnwidth]{Chapters/chapter6/figures/Sync-SeqSequenceOverlap.pdf}
 206   \caption{Overlap of internode CPU communications with a sequence of CPU/GPU data transfers and GPU
 207   computations.}
 208   \label{fig:ch6p1overlapseqsequence}
 209 \end{figure}
 210
 211 When CPU/GPU data transfers are not negligible compared to GPU computations, it
 212 can be interesting to overlap internode CPU computations with a \emph{GPU
 213   sequence}\index{GPU!sequence} including CPU/GPU data transfers and GPU computations (see
 214 \Fig{fig:ch6p1overlapseqsequence}). Algorithmic issues of this approach are basic,
 215 but their implementation requires explicit CPU multithreading and
 216 synchronization, and CPU data buffer duplication. We need to implement two
 217 threads, one starting and achieving MPI communications and the other running
 218 the \emph{GPU sequence}. OpenMP allows an easy and portable implementation of
 219 this overlapping strategy. However, it remains more complex to develop and to
 220 maintain than the previous strategy (overlapping only internode CPU
 221 communications and GPU computations) and should be adopted only when CPU/GPU
 222 data transfer times are not negligible.
 223
 224 \Lst{algo:ch6p1overlapseqsequence} introduces the generic code of a
 225 MPI+OpenMP+CUDA implementation, explicitly overlapping MPI communications with
 226 GPU sequences.
 227
 228 %\begin{algorithm}
 229 %  \caption{Generic scheme explicitly overlapping MPI communications with
 230 %    sequences of CUDA CPU/GPU transfers and CUDA GPU
 231 %    computations}\label{algo:ch6p1overlapseqsequence}
 232 \begin{Listing}{algo:ch6p1overlapseqsequence}{generic scheme explicitly overlapping MPI communications with sequences of CUDA CPU/GPU transfers and CUDA GPU computations}
 233 // Input data and result variables and arrays (example with
 234 // float datatype, 1D input arrays, and scalar results)
 235 float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture, *gpuInputTabAdr;
 236 float *cpuResTabAdr, *gpuResAdr;
 237
 238 // CPU and GPU array allocations
 239 cpuInputTabAdrCurrent = malloc(sizeof(float)*N);
 240 cpuInputTabAdrFuture = malloc(sizeof(float)*N);
 241 cudaMalloc(&gpuInputTabAdr,sizeof(float)*N);
 242 cpuResTabAdr = malloc(sizeof(float)*NbIter);
 243 cudaMalloc(&gpuResAdr,sizeof(float));
 244
 245 // Definition of the grid of blocks of GPU threads
 246 dim3 Dg, Db;
 247 Dg.x = ...
 248 ...
 249
 250 // Indexes of source and destination MPI processes
 251 int dest = ...
 252 int src = ...
 253
 254 // Set the number of OpenMP threads (to create) to 2
 255 omp_set_num_threads(2);
 256 // Create threads and start the parallel OpenMP region
 257 #pragma omp parallel
 258 {
 259   // Buffer pointers (thread local variables)
 260   float *current = cpuInputTabAdrCurrent;
 261   float *future = cpuInputTabAdrFuture;
 262   float *tmp;
 263
 264   // Computation loop (using the GPU)
 265   for (int i = 0; i < NbIter; i++) {
 266
 267     // - Thread 0: achieves MPI communications
 268     if (omp_get_thread_num() == 0) {
 269       MPI_Sendrecv(current,                  // MPI comms. (sync. op)
 270                    N, MPI_FLOAT, dest, 0,
 271                    future,
 272                    N, MPI_FLOAT, dest, 0, ...);
 273
 274     // - Thread 1: achieves the GPU sequence (GPU computations and
 275     //             CPU/GPU data transfers)
 276     } else if (omp_get_thread_num() == 1) {
 277       cudaMemcpy(gpuInputTabAdr, current,    // Data transfer:
 278                  sizeof(float)*N,            // CPU --> GPU (sync. op)
 279                  cudaMemcpyHostToDevice);
 280       gpuKernel_k1<<<Dg,Db>>>();             // GPU comp. (async. op)
 281     // IF there is (now) a result to transfer from the GPU to the CPU:
 282       cudaMemcpy(cpuResTabAdr + i, gpuResAdr,// Data transfer:
 283                  sizeof(float),              // GPU --> CPU (sync. op)
 284                  cudaMemcpyDeviceToHost);
 285     }
 286
 287     // - Wait until both threads have achieved their iteration tasks
 288     #pragma omp barrier
 289     // - Each thread permutes its local buffer pointers
 290     tmp = current;
 291     current = future;
 292     future = tmp;
 293   } // End of computation loop
 294 } // End of OpenMP parallel region
 295 ...
 296 \end{Listing}
 297 %\end{algorithm}
 298
 299 Lines~25--62 implement the OpenMP parallel region,
 300 around the computation loop (lines~33--61). For efficient performances it is
 301 important to create and destroy threads only one time (not at each iteration):
 302 the parallel region has to surround the computation loop. Lines~3--11
 303 consist of declaration and allocation of input data arrays and result arrays and
 304 variables, as in the previous algorithm (\Lst{algo:ch6p1overlapnative}). However, we implement two input data buffers on the
 305 CPU (current and future version). As we aim to overlap internode MPI
 306 communications and GPU sequence, including CPU to GPU data transfer of current
 307 input data array, we need to store the received new input data array in a
 308 separate buffer. Then, the current input data array will be safely read on the
 309 CPU and copied into the GPU memory.
 310
 311 The thread creations\index{OpenMP!thread creation} are easily achieved with one OpenMP directive (line~25).
 312 Then each thread defines and initializes \emph{its} local buffer pointers,
 313 and enters \emph{its} computing loop (lines~28--33). Inside the computing
 314 loop, a test on the thread number makes it possible to run a different code in each
 315 thread. Lines~37--40 implement the MPI synchronous communication run by
 316 thread number $0$. Lines~45--52 implement the GPU sequence run by thread
 317 $1$: CPU to GPU data transfer, GPU computation, and GPU to CPU result
 318 transfer (if needed). Details of the three operations of this sequence have not changed
 319 from the previous overlapping strategy.
 320
 321 At the end of \Lst{algo:ch6p1overlapseqsequence}, an OpenMP synchronization
 322 barrier\index{OpenMP!barrier} on line~56 forces the  OpenMP threads to wait until
 323 MPI
 324 communications and GPU sequence are achieved. %, and do not need to access the current input data buffer.
 325 Then, each thread permutes its local buffer pointers (lines~58--60),
 326 and is ready to enter the next iteration, processing the new current input
 327 array.
 328
 329 %\subsubsection{Overlapping with a streamed GPU sequence}
 330
 331 \bigskip
 332 %\pagebreak
 333 \noindent {\bf Overlapping with a streamed GPU sequence}
 334 \medskip
 335
 336 Depending on the algorithm implemented, it is sometimes possible to split the
 337 GPU computation into several parts processing distinct data. Then, we can
 338 speedup the GPU sequence using several CUDA streams\index{CUDA!stream}. The goal is
 339 to overlap CPU/GPU data transfers with GPU computations\index{overlap!GPU data transfers with GPU computation} inside the GPU
 340   sequence. Compared to the previous overlapping strategy, we have to split the
 341 initial data transfer into a set of $n$ asynchronous and smaller data transfers,
 342 and  split the initial GPU kernel call into a set of $n$ calls to the same GPU
 343 kernel. Usually, these smaller calls are deployed with fewer GPU threads
 344 (i.e., associated to a smaller grid of blocks of threads). Then, the first GPU
 345 computations can start as soon as the first data transfer has been achieved, and
 346 next transfers can be done in parallel with next GPU computations (see
 347 \Fig{fig:ch6p1overlapstreamsequence}).
 348
 349 \begin{figure}[t]
 350   \centering
 351   \includegraphics[width=\columnwidth]{Chapters/chapter6/figures/Sync-StreamSequenceOverlap.pdf}
 352   \caption{Overlap of internode CPU communications with a streamed sequence of CPU/GPU data
 353   transfers and GPU computations.}
 354   \label{fig:ch6p1overlapstreamsequence}
 355 \end{figure}
 356
 357 NVIDIA advises  starting all asynchronous CUDA data transfers and then calling
 358 all CUDA kernel executions, using up to $16$ streams \cite{cudabestpractices}.
 359 Then, the CUDA driver and
 360 runtime optimize the global execution of these operations. So, we accumulate two
 361 overlapping mechanisms. The former is controlled by CPU multithreading and
 362 overlaps MPI communications and the streamed GPU sequence. The latter is
 363 controlled by CUDA programming and overlaps CPU/GPU data transfers and GPU
 364 computations. Again, OpenMP allows the easy implementation of the CPU multithreading
 365 and  waiting for the end of both CPU threads before executing the next instructions
 366 of the code.
 367
 368 \Lst{algo:ch6p1overlapstreamsequence} introduces the generic MPI+OpenMP+CUDA
 369 code,  explicitly overlapping MPI communications with
 370 streamed GPU sequences\index{GPU!streamed sequence}.
 371
 372 %\begin{algorithm}
 373 %  \caption{Generic scheme explicitly overlapping MPI communications with streamed sequences of CUDA
 374 %  CPU/GPU transfers and CUDA GPU computations}\label{algo:ch6p1overlapstreamsequence}
 375 \begin{Listing}{algo:ch6p1overlapstreamsequence}{generic scheme explicitly overlapping MPI communications with streamed sequences of CUDA CPU/GPU transfers and CUDA GPU computations}
 376 // Input data and result variables and arrays (example with
 377 // float datatype, 1D input arrays, and scalar results)
 378 float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture, *gpuInputTabAdr;
 379 float *cpuResTabAdr, *gpuResAdr;
 380 // CPU and GPU array allocations (allocates page-locked CPU memory)
 381 cudaHostAlloc(&cpuInputTabAdrCurrent,sizeof(float)*N,cudaHostAllocDefault);
 382 cudaHostAlloc(&cpuInputTabAdrFuture,sizeof(float)*N,cudaHostAllocDefault);
 383 cudaMalloc(&gpuInputTabAdr,sizeof(float)*N);
 384 cpuResTabAdr = malloc(sizeof(float)*NbIter);
 385 cudaMalloc(&gpuResAdr,sizeof(float));
 386 // Stream declaration and creation
 387 cudaStream_t TabS[NbS];
 388 for(int s = 0; s < NbS; s++)
 389   cudaStreamCreate(&TabS[s]);
 390 // Definition of the grid of blocks of GPU threads
 391 ...
 392 // Set the number of OpenMP threads (to create) to 2
 393 omp_set_num_threads(2);
 394 // Create threads and start the parallel OpenMP region
 395 #pragma omp parallel
 396 {
 397   // Buffer pointers (thread local variables)
 398   float *current = cpuInputTabAdrCurrent;
 399   float *future = cpuInputTabAdrFuture;
 400   float *tmp;
 401   // Stride of data processed per stream
 402   int stride = N/NbS;
 403   // Computation loop (using the GPU)
 404   for (int i = 0; i < NbIter; i++) {
 405     // - Thread 0: achieves MPI communications
 406     if (omp_get_thread_num() == 0) {
 407       MPI_Sendrecv(current,             // MPI comms. (sync. op)
 408                    N, MPI_FLOAT, dest, 0,
 409                    future,
 410                    N, MPI_FLOAT, dest, 0, ...);
 411     // - Thread 1: achieves the streamed GPU sequence (GPU
 412     //   computations and CPU/GPU data transfers)
 413     } else if (omp_get_thread_num() == 1) {
 414       for (int s = 0; s < NbS; s++) {   // Start all data transfers:
 415         cudaMemcpyAsync(gpuInputTabAdr + s*stride, // CPU --> GPU
 416                         current + s*stride,        // (async. ops)
 417                         sizeof(float)*stride,
 418                         cudaMemcpyHostToDevice,
 419                         TabS[s]);
 420       }
 421       for (int s = 0; s < NbS; s++) { // Start all GPU comps. (async.)
 422         gpuKernel_k1<<<Dg, Db, 0, TabS[s]>>>(gpuInputTabAdr + s*stride);
 423       }
 424       cudaThreadSynchronize();          // Wait all threads are ended
 425    // IF there is (now) a result to transfer from the GPU to the CPU:
 426       cudaMemcpy(cpuResTabAdr,          // Data transfers:
 427                  gpuResAdr,             // GPU --> CPU (sync. op)
 428                  sizeof(float),
 429                  cudaMemcpyDeviceToHost);
 430     }
 431     // - Wait until both threads have achieved their iteration tasks
 432     #pragma omp barrier
 433     // - Each thread permutes its local buffer pointers
 434     tmp = current; current = future; future = tmp;
 435   } // End of computation loop
 436 } // End of OpenMP parallel region
 437 ...
 438 // Destroy the streams
 439 for(int s = 0; s < NbS; s++)
 440   cudaStreamDestroy(TabS[s]);
 441 ...
 442 \end{Listing}
 443 %\end{algorithm}
 444
 445 Efficient usage of CUDA streams requires  executing
 446 asynchronous CPU/GPU data transfers, which implies reading page-locked
 447 data\index{page-locked data} in CPU memory. So, CPU
 448 memory allocations on lines~6 and 7 are implemented with \texttt{cudaHostAlloc} instead of
 449 the basic \texttt{malloc} function. Then, $NbS$ \emph{streams} are created on lines~12--14.
 450 Usually we create $16$ streams: the maximum number supported by CUDA.
 451
 452 An OpenMP parallel region\index{OpenMP!parallel region} including two threads is implemented on lines~18--61 of
 453 \Lst{algo:ch6p1overlapstreamsequence}, as in the previous algorithm (see
 454 \Lst{algo:ch6p1overlapseqsequence}). Code of thread $0$ achieving MPI communication is unchanged, but
 455 code of thread $1$ is now using streams. Following NVIDIA recommandations, we first implement
 456 a loop starting $NbS$ asynchronous data transfers (lines~39--45): transferring $N/NbS$ data on
 457 each stream. Then we  implement a second loop (lines~46--48), starting asynchronous
 458 executions of $NbS$ grids of blocks of GPU threads (one per stream). Data transfers and kernel
 459 executions on the same stream are synchronized by CUDA and the GPU. So, each kernel execution will
 460 start after its data has been transferred into the GPU memory, and the GPU
 461 scheduler ensures the start of
 462 some kernel executions as soon as the first data transfers are achieved. Then, next data transfers
 463 will be overlapped with GPU computations. After the kernel calls, on the different streams,
 464 we wait for the end of all GPU threads previously run, calling an explicit synchronization
 465 function on line~49. This synchronization is not mandatory, but it will make the implementation more
 466 robust and will facilitate the debugging steps: all GPU computations run by the OpenMP thread number
 467 $1$ will be achieved before this thread enters a new loop iteration, or before the computation
 468 loop has ended.
 469
 470 If a partial result has to be transferred from GPU to CPU memory  at the end of each loop iteration
 471 (for example, the result of one \emph{reduction} per iteration), this transfer is achieved
 472 synchronously on  the default stream (no particular stream is specified) on lines~51--54.
 473 Availability of the result values is ensured by the synchronization implemented on line~49.
 474 However, if a partial result has to be transferred onto the CPU on each stream, then $NbS$ asynchronous data
 475 transfers could be started in parallel (one per stream) and should be implemented before the
 476 synchronization operation on line~49. The end of the computation loop includes a synchronization
 477 barrier of the two OpenMP threads, waiting until they have finished accessing the different data
 478 buffers in the current iteration. Then, each OpenMP thread exchanges its local buffer pointers, as
 479 in the previous algorithm. After the computation loop, we have added the
 480 destruction of the CUDA streams (lines~64--65).
 481
 482 In  conclusion,  CUDA  streams\index{CUDA!stream}   have  been  used  to  extend
 483 \Lst{algo:ch6p1overlapseqsequence}     with     respect     to    its     global
 484 scheme. \Lst{algo:ch6p1overlapstreamsequence}  still creates an  OpenMP parallel
 485 region, with two CPU threads, one  in charge of MPI communications and the other
 486 managing data  transfers and GPU computations. Unfortunately,  using GPU streams
 487 requires  the ability  to split  a  GPU computation  into independent  subparts,
 488 working on  independent subsets of  data.  \Lst{algo:ch6p1overlapstreamsequence}
 489 is not so generic as \Lst{algo:ch6p1overlapseqsequence}.
 490
 491
 492 \subsection{Interleaved communications-transfers-computations overlapping}
 493
 494 Many algorithms do not support splitting data transfers and kernel calls, and
 495 cannot exploit CUDA streams, for example, when each GPU thread requires access to
 496 some data spread in the global set of transferred data. Then, it is possible to
 497 overlap internode CPU communications, CPU/GPU data transfers, and GPU
 498 computations, if the algorithm achieves \emph{computation-communication
 499   iterations} and if we can interleave these iterations. At iteration $k$: CPUs
 500 exchange data $D_k$, each CPU/GPU couple transfers data $D_k$, and each GPU
 501 achieves computations on data $D_{k-1}$ (see
 502 \Fig{fig:ch6p1overlapinterleaved}). Compared to the previous strategies, this
 503 strategy requires twice as many CPU data buffers
 504 and twice as many GPU buffers.
 505
 506 \begin{figure}[t]
 507   \centering
 508   \includegraphics{Chapters/chapter6/figures/Sync-CompleteInterleaveOverlap.pdf}
 509   \caption{Complete overlap of internode CPU communications, CPU/GPU data transfers, and GPU
 510   computations, interleaving computation-communication iterations.}
 511   \label{fig:ch6p1overlapinterleaved}
 512 \end{figure}
 513
 514 \Lst{algo:ch6p1overlapinterleaved} introduces the generic code of a
 515 MPI+OpenMP+CUDA implementation, explicitly interleaving
 516 computation-communication iterations and overlapping MPI communications, CUDA CPU/GPU
 517 transfers, and CUDA GPU computations.
 518
 519 %\begin{algorithm}
 520 %  \caption{Generic scheme explicitly overlapping MPI communications, CUDA CPU/GPU transfers and CUDA
 521 %  GPU computations, interleaving computation-communication iterations}\label{algo:ch6p1overlapinterleaved}
 522 \begin{Listing}{algo:ch6p1overlapinterleaved}{generic scheme explicitly overlapping MPI communications, CUDA CPU/GPU transfers, and CUDA GPU computations, interleaving computation-communication iterations}
 523 // Input data and result variables and arrays (example with
 524 // float datatype, 1D input arrays, and scalar results)
 525 float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture;
 526 float *gpuInputTabAdrCurrent, *gpuInputTabAdrFuture;
 527 float *cpuResTabAdr, *gpuResAdr;
 528
 529 // CPU and GPU array allocations
 530 cpuInputTabAdrCurrent = malloc(sizeof(float)*N);
 531 cpuInputTabAdrFuture = malloc(sizeof(float)*N);
 532 cudaMalloc(&gpuInputTabAdrCurrent,sizeof(float)*N);
 533 cudaMalloc(&gpuInputTabAdrFuture,sizeof(float)*N);
 534 cpuResTabAdr = malloc(sizeof(float)*NbIter);
 535 cudaMalloc(&gpuResAdr,sizeof(float));
 536
 537 // Definition of the grid of blocks of GPU threads
 538 dim3 Dg, Db; Dg.x = ...
 539 // Indexes of source and destination MPI processes
 540 int dest, src; dest = ...
 541
 542 // Set the number of OpenMP threads (to create) to 2
 543 omp_set_num_threads(3);
 544 // Create threads and start the parallel OpenMP region
 545 #pragma omp parallel
 546 {
 547   // Buffer pointers (thread local variables)
 548   float *cpuCurrent = cpuInputTabAdrCurrent;
 549   float *cpuFuture  = cpuInputTabAdrFuture;
 550   float *gpuCurrent = gpuInputTabAdrCurrent;
 551   float *gpuFuture  = gpuInputTabAdrFuture;
 552   float *tmp;
 553
 554   // Computation loop on NbIter + 1 iterations
 555   for (int i = 0; i < NbIter + 1; i++) {
 556     // - Thread 0: achieves MPI communications
 557     if (omp_get_thread_num() == 0) {
 558       if (i < NbIter) {
 559         MPI_Sendrecv(cpuCurrent,            // MPI comms. (sync. op)
 560                      N, MPI_FLOAT, dest, 0,
 561                      cpuFuture,
 562                      N, MPI_FLOAT, dest, 0, ...);
 563       }
 564     // - Thread 1: achieves the CPU/GPU data transfers
 565     } else if (omp_get_thread_num() == 1) {
 566       if (i < NbIter) {
 567         cudaMemcpy(gpuFuture, cpuCurrent,   // Data transfer:
 568                    sizeof(float)*N,         // CPU --> GPU (sync. op)
 569                    cudaMemcpyHostToDevice);
 570       }
 571     // - Thread 2: achieves the GPU computations and result transfer
 572     } else if (omp_get_thread_num() == 2) {
 573       if (i > 0) {
 574         gpuKernel_k1<<<Dg,Db>>>(gpuCurrent);// GPU comp. (async. op)
 575         // IF there is (now) a result to transfer from GPU to CPU:
 576         cudaMemcpy(cpuResTabAdr + (i-1),    // Data transfer:
 577                    gpuResAdr, sizeof(float),// GPU --> CPU (sync. op)
 578                    cudaMemcpyDeviceToHost);
 579       }
 580     }
 581     // - Wait until both threads have achieved their iteration tasks
 582     #pragma omp barrier
 583     // - Each thread permutes its local buffer pointers
 584     tmp = cpuCurrent; cpuCurrent = cpuFuture; cpuFuture = tmp;
 585     tmp = gpuCurrent; gpuCurrent = gpuFuture; gpuFuture = tmp;
 586   } // End of computation loop
 587 } // End of OpenMP parallel region
 588 ...
 589 \end{Listing}
 590 %\end{algorithm}
 591
 592 As in the previous algorithms, we declare two CPU input data arrays
 593 (current and future version) on line~3. However, in this version we also declare two GPU input data arrays on line~4. On
 594 lines~8--11, these four data arrays are allocated, using \texttt{malloc} and
 595 \texttt{cudaMalloc}.
 596 We do not need to allocate page-locked memory space. On lines~23--65 we
 597 create an OpenMP parallel region, configured to run three threads (see line~21). Lines~26--30 are
 598 declarations of thread local pointers on data arrays and variables (each thread will use its own
 599 pointers). On line~33, the three threads enter a computation loop of \texttt{NbIter + 1}
 600 iterations. We need to run one more iteration than with previous algorithms.
 601
 602 Lines~35--41 are the MPI communications, achieved by the thread number $0$. They send the
 603 current CPU input data array to another CPU, and receive the future CPU input data array from
 604 another CPU, like in previous algorithms. But this thread achieves communications only during the
 605 \emph{first} \texttt{NbIter} iterations. Lines~43--48 are the CPU to GPU input data
 606 transfers, achieved by thread number $1$. These data transfers are run in parallel with MPI
 607 communications. They are run during the \emph{first} \texttt{NbIter} iterations and transfer
 608 current CPU input data array into the future GPU data array. Lines~50--57
 609 correspond to the code run by
 610 thread number $2$. They start GPU computations, process the current GPU input data array, and if
 611 necessary
 612 transfer a GPU result to an index of the CPU result array. These GPU computations and result
 613 transfers are run during the \emph{last} \texttt{NbIter} iterations: the GPU computations
 614 have to wait until the first data transfer is ended before starting to process any data and cannot run
 615 during the first iteration. So, the activity of the third thread is shifted by one iteration
 616 compared to the activities of the other threads. Moreover, the address of the current GPU input data
 617 array has to be passed as a parameter of the kernel call on line~52, in order
 618 for the GPU threads to access
 619 the right data array. As in previous algorithms the GPU result is copied to one index of the CPU
 620 result array, in lines~54--56, but due to the shift of the third thread activity this index is
 621 now \texttt{(i - 1)}.
 622
 623 Line~60 is a synchronization barrier\index{OpenMP!barrier} of the three OpenMP threads, followed by a pointer permutation
 624 of local pointers on current and future data arrays, on line~62 and 63. Each
 625 thread waits for the completion of other
 626 threads to use the data arrays, and then permutes its data array pointers before
 627 entering a new loop iteration.
 628
 629 This complete overlap of MPI communications, CPU/GPU data transfers, and GPU computations is not
 630 too complex to implement, and can be a solution when GPU computations are not adapted to use CUDA
 631 streams: when GPU computations cannot be split into subparts working on independent subsets of input
 632 data. However, this requires running one more iteration (a total of \texttt{NbIter + 1}
 633 iterations). If the number of iterations is very small, it could be more interesting not to
 634 attempt to overlap CPU/GPU data transfers and GPU computations, and to implement \Lst{algo:ch6p1overlapseqsequence}.
 635
 636
 637 \subsection{Experimental validation}
 638 \label{ch6:p1expes}
 639 %\subsubsection{Experimentation testbed}
 640
 641 \noindent {\bf Experimentation testbed}
 642 \medskip
 643
 644 Two clusters located at SUPELEC in Metz (France) have been used for the entire set of
 645 experiments presented in this chapter:
 646 %
 647 \begin{itemize}
 648
 649 \item The first consists of 17 nodes with an Intel Nehalem quad-core processor
 650   at 2.67Ghz, 6 Gb RAM, and an NVIDIA GeForce GTX480 GPU, each.
 651
 652 \item The second consists of 16 nodes with an Intel core2 dual-core processor at
 653   2.67Ghz, 4 Gb RAM, and an NVIDIA GeForce GTX580 GPU, each.
 654
 655 \end{itemize}
 656 %
 657 Both clusters have a gigabit Ethernet interconnection network that is connected
 658 through a Dell Power Object 5324 switch. The two switches are linked twice,
 659 ensuring the interconnection of the two clusters.  The software environment
 660 consists of a Linux Fedora 64bit OS (kernel v. 2.6.35), GNU C and C++ compilers
 661 (v. 4.5.1), and the CUDA library (v. 4.2).
 662
 663 %\subsubsection{Validation of the synchronous approach}
 664
 665 \bigskip
 666 \noindent {\bf Validation of the synchronous approach}
 667 \medskip
 668
 669 \begin{figure}[t]
 670   \centering
 671   \includegraphics{Chapters/chapter6/curves/gpuSyncOverlap.pdf}
 672   \caption{Experimental performances of different synchronous algorithms computing a
 673     dense matrix product.}
 674   \label{fig:ch6p1syncexpematrixprod}
 675 \end{figure}
 676
 677 \label{ch6:p1block-cyclic}
 678 We have tested our approach of synchronous parallel algorithms with a classic
 679 block cyclic algorithm for dense matrix multiplication\index{matrix
 680   multiplication!block cyclic}. This problem requires splitting two input matrices ($A$ and $B$) on a ring of
 681 computing nodes and  establishing a circulation of the slices of $A$ matrix on the ring ($B$ matrix
 682 partition does not evolve during all the run). Compared to our generic algorithms, there is no
 683 partial result to transfer from GPU to CPU at the end of each computing iteration. The part of the
 684 result matrix computed on each GPU is transferred onto the CPU at the end of the computation loop.
 685
 686 We have first implemented a synchronous version without any overlap of MPI communications, CPU/GPU
 687 data transfers, and GPU computations. We have added some synchronizations in the native overlapping
 688 version in order to avoid any overlap. We have measured the performance achieved on our cluster with
 689 NVIDIA GTX480 GPUs and matrices sizes of 4096$\times$4096, and we have obtained the curves in \Fig{fig:ch6p1syncexpematrixprod} labeled
 690 \emph{no-ovlp}. We observe that performance increases when the number of processor increases. Of course,
 691 there is a significant increase in cost when comparing a single node (without any MPI communication) with
 692 two nodes (starting to use MPI communications). But beyond two nodes we get a classical performance
 693 curve.
 694
 695 Then, we implemented and tested \Lst{algo:ch6p1overlapnative}, labeled
 696 \emph{ovlp-native} in \Fig{fig:ch6p1syncexpematrixprod}. The native
 697 overlap of MPI communications with the asynchronous run of CUDA kernels appears efficient. When the
 698 number of nodes increases the ratio of the MPI communications increases a lot (because the
 699 computation times decrease a lot). So, there is not a lot of GPU computation
 700 time that remains to be
 701 overlapped, and both \emph{no-ovlp} and \emph{ovlp-native} tend to the same
 702 limit. Already, the native overlap performed in \Lst{algo:ch6p1overlapnative}
 703 achieves a high level of performance very quickly, using only four
 704 nodes. Beyond four nodes, a faster interconnection network would be required for
 705 a performance increase.
 706
 707 Finally, we implemented \Lst{algo:ch6p1overlapseqsequence}, overlapping MPI communications
 708 with a GPU sequence including both CPU/GPU data transfers and GPU computations,
 709 labeled \emph{ovlp-GPUsequence} in \Fig{fig:ch6p1syncexpematrixprod}. From four
 710 up to sixteen nodes it achieves better performances than \emph{ovlp-native}: the
 711 overlapping of MPI communications is wider and thus more efficient. However, this parallelization mechanism has more overhead: OpenMP threads
 712 have to be created and synchronized. With only two nodes it is less efficient than the native
 713 overlapping algorithm. Beyond two nodes, the CPU multithreading overhead seems compensated.
 714 % No, it doesn't need the more implementation of time, but more implementation
 715 % of code :)
 716 \Lst{algo:ch6p1overlapseqsequence} requires more time for the implemention
 717 and can be more complex to maintain, but such extra development cost is
 718 justified if we are looking for better performance.
 719
 720
 721 %%% Local Variables:
 722 %%% mode: latex
 723 %%% fill-column: 80
 724 %%% ispell-dictionary: "american"
 725 %%% mode: flyspell
 726 %%% TeX-master: "../../BookGPU.tex"
 727 %%% End: