BookGPU/Chapters/chapter6/PartieSync.tex

   1 \section{General scheme of synchronous code with computation/communication
   2   overlapping in GPU clusters}\label{ch6:part1}
   3
   4 %Notre     chapitre     précédent      sur     l'optimisation     des     schémas
   5 %parallèles~\cite{ChVCV13}.
   6
   7 \subsection{Synchronous parallel algorithms on GPU clusters}
   8
   9 %\subsubsection{Considered parallel algorithms and implementations}
  10
  11 \noindent {\bf Considered parallel algorithms and implementations}
  12 \medskip
  13
  14 This section focusses on synchronous parallel algorithms implemented with
  15 overlapping computations and communications\index{overlap!computation and communication}. Parallel synchronous algorithms are
  16 easier to implement, debug and maintain than asynchronous ones, see
  17 Section~\ref{ch6:part2}. Usually, they follow a BSP-like parallel
  18 scheme\index{BSP parallel scheme},
  19 alternating local computation steps and communication steps,
  20 see~\cite{Valiant:BSP}. Their execution is usually deterministic, excepted
  21 for stochastic algorithms that contain random number generations. Even in
  22 this case, their execution can be controlled during debug steps, allowing to
  23 track and to fix bugs quickly.
  24
  25 However, depending on the properties of the algorithm, it is sometimes possible to
  26 overlap computations and communications. If processes exchange data
  27 that is not needed for the
  28 computation that is following immediately, it is possible to implement such
  29 an overlap. We have investigated the efficiency of this approach in previous
  30 works~\cite{GUSTEDT:2007:HAL-00280094:1,ChVCV13}, using standard parallel
  31 programming tools to achieve the implementation.
  32
  33 The normalized and well known Message Passing Interface (MPI\index{MPI}) includes some asynchronous
  34 point-to-point communication routines, that should allow to implement some
  35 communication/computation overlap. However, current MPI implementations do not achieve
  36 that goal efficiently; effective overlapping with MPI requires a group of
  37 dedicated threads (in our case implemented with OpenMP\index{OpenMP}) for the basic
  38 synchronous communications while another group of threads executes computations
  39 in parallel.
  40 % Finally, with explicit OpenMP
  41 % threads executing MPI communications or various computations, we succeeded to
  42 % decrease the execution time.
  43 Nevertheless, communication and computation are not completely independent on
  44 modern multicore architectures: they use shared hardware components such as the
  45 interconnection bus and the RAM. Therefore that approach only saved up to $20\%$
  46 of the expected time on such a platform. This picture changes on clusters
  47 equipped with GPU. They effectively allow independence of computations on the
  48 GPU and communication on the mainboard (CPU, interconnection bus, RAM, network
  49 card). We saved up to $100\%$ of the expected time on our GPU cluster, as we
  50 will expose in the next section.
  51
  52
  53 %\subsubsection{Specific interests for GPU clusters}
  54
  55 \bigskip
  56 \noindent {\bf Specific interests in GPU clusters}
  57 \medskip
  58
  59 In a computing node, a GPU is a kind of scientific coprocessor usually located
  60 on an auxiliary board, with its own memory. So, when data have been transferred
  61 from the CPU memory to the GPU memory, then GPU computations can be achieved on
  62 the GPU board, totally in parallel of any CPU activities (like internode cluster
  63 communications). The CPU and the GPU access their respective memories and do not
  64 interfere, so they can achieve a very good overlap\index{overlap!computation
  65 and computation} of their activities (better
  66 than two CPU cores).
  67
  68 But using a GPU on a computing node requires to transfer data from the CPU to
  69 the GPU memory, and to transfer the computation results back from the GPU to the
  70 CPU. Transfer times are not excessive, but depending on the application
  71 they still can be significant compared to the GPU computation times. So, sometimes it
  72 can be interesting to overlap the internode cluster communications with both the
  73 CPU/GPU data transfers and the GPU computations. We can identify four main
  74 parallel programming schemes on a GPU cluster:
  75
  76 \begin{enumerate}
  77 \item parallelizing only 'internode CPU communications' with 'GPU computations',
  78   and achieving CPU/GPU data transfers before and after this parallel step,
  79 \item parallelizing 'internode CPU communications' with a '(sequential) sequence
  80   of CPU/GPU data transfers and GPU computations',
  81 \item parallelizing 'internode CPU communications' with a 'streamed sequence of
  82   CPU/GPU data transfers and GPU computations',
  83 \item parallelizing 'internode CPU communications' with 'CPU/GPU data transfers'
  84   and with 'GPU computations', interleaving computation-communication
  85   iterations.
  86 \end{enumerate}
  87
  88
  89 \subsection{Native overlap of CPU communications and GPU computations}
  90
  91 \begin{figure}[t]
  92   \centering
  93   \includegraphics{Chapters/chapter6/figures/Sync-NativeOverlap.pdf}
  94   \caption{Native overlap of internode CPU communications with GPU computations.}
  95   \label{fig:ch6p1overlapnative}
  96 \end{figure}
  97
  98 Using CUDA\index{CUDA}, GPU kernel executions are non-blocking, and GPU/CPU data
  99 transfers\index{CUDA!data transfer}
 100 are blocking or non-blocking operations. All GPU kernel executions and CPU/GPU
 101 data transfers are associated to "streams"\index{CUDA!stream}, and all operations on a same stream
 102 are serialized. When transferring data from the CPU to the GPU, then running GPU
 103 computations and finally transferring results from the GPU to the CPU, there is
 104 a natural synchronization and serialization if these operations are achieved on
 105 the same stream. GPU developers can choose to use one (default) or several
 106 streams. In this first scheme of overlapping, we consider parallel codes using
 107 only one GPU stream.
 108
 109 "Non-blocking  GPU  kernel  execution"  means  a CPU  routine  runs  a  parallel
 110 execution of a GPU computing kernel, and the CPU routine continues its execution
 111 (on the CPU) while the GPU kernel  is running (on the GPU). Then the CPU routine
 112 can initiate some  communications with some others CPU,  and so it automatically
 113 overlaps  the  internode  CPU  communications  with the  GPU  computations  (see
 114 \Fig{fig:ch6p1overlapnative}). This overlapping is natural when programming with
 115 CUDA  and MPI:  it is  easy to  deploy, but  does not  overlap the  CPU/GPU data
 116 transfers.
 117
 118 %\begin{algorithm}[t]
 119 %  \caption{Generic scheme implicitly overlapping MPI communications with CUDA GPU
 120 %   computations}\label{algo:ch6p1overlapnative}
 121 \pagebreak
 122 \begin{Listing}{algo:ch6p1overlapnative}{Generic scheme implicitly overlapping MPI communications with CUDA GPU computations}
 123 // Input data and result variables and arrays (example with
 124 // float datatype, 1D input arrays, and scalar results)
 125 float *cpuInputTabAdr, *gpuInputTabAdr;
 126 float *cpuResTabAdr, *gpuResAdr;
 127
 128 // CPU and GPU array allocations
 129 cpuInputTabAdr = malloc(sizeof(float)*N);
 130 cudaMalloc(&gpuInputTabAdr,sizeof(float)*N);
 131 cpuResTabAdr = malloc(sizeof(float)*NbIter);
 132 cudaMalloc(&gpuResAdr,sizeof(float));
 133
 134 // Definition of the Grid of blocks of GPU threads
 135 dim3 Dg, Db;
 136 Dg.x = ...
 137 ...
 138
 139 // Indexes of source and destination MPI processes
 140 int dest = ...
 141 int src = ...
 142
 143 // Computation loop (using the GPU)
 144 for (int i = 0; i < NbIter; i++) {
 145   cudaMemcpy(gpuInputTabAdr, cpuInputTabAdr, // Data transfer:
 146              sizeof(float)*N,                // CPU --> GPU (sync. op)
 147              cudaMemcpyHostToDevice);
 148   gpuKernel_k1<<<Dg,Db>>>();                 // GPU comp. (async. op)
 149   MPI_Sendrecv_replace(cpuInputTabAdr,       // MPI comms. (sync. op)
 150                        N,MPI_FLOAT,
 151                        dest, 0, src, 0, ...);
 152   // IF there is (now) a result to transfer from the GPU to the CPU:
 153   cudaMemcpy(cpuResTabAdr + i, gpuResAdr,    // Data transfer:
 154              sizeof(float),                  // GPU --> CPU (sync. op)
 155              cudaMemcpyDeviceToHost);
 156 }
 157 ...
 158 \end{Listing}
 159 %\end{algorithm}
 160
 161 \Lst{algo:ch6p1overlapnative} introduces the generic code of a MPI+CUDA
 162 implementation, natively and implicitly overlapping MPI communications with CUDA
 163 GPU computations. Some input data and output results arrays and variables are
 164 declared and allocated from line 1 up to 10, and a computation loop is
 165 implemented from line 21 up to 34. At each iteration:
 166 \begin{itemize}
 167 \item \texttt{cudaMemcpy} on line 23 transfers data from the CPU memory
 168   to the GPU memory. This is a basic and synchronous data transfer.
 169 \item \texttt{gpuKernel\_k1<<<Dg,Db>>>} on line 26 starts GPU computation
 170   (running a GPU kernel on the grid of blocks of threads defined at line 12 to
 171   15). This is a standard GPU kernel run, it is an asynchronous operation. The
 172   CPU can continue to run its code.
 173 \item \texttt{MPI\_Sendrecv\_replace} on line 27 achieves some blocking
 174   internode communications, overlapping GPU computations started just before.
 175 \item If needed, \texttt{cudaMemcpy} on line 31 transfers the iteration result from
 176   one variable in the GPU memory at one array index in the CPU memory (in this example the CPU
 177   collects all iteration results in an array). This operation is started after
 178   the end of the MPI communication (previous instruction) and after the end of
 179   the GPU kernel execution. CUDA insures an implicit synchronization of all
 180   operations involving the same GPU stream, like the default stream in this
 181   example. Result transfer has to wait the GPU kernel execution is finished.
 182   If there is no result transfer implemented, the next operation on the GPU
 183   will wait until the GPU kernel execution will be ended.
 184 \end{itemize}
 185
 186 This implementation is the easiest one involving the GPU. It achieves an
 187 implicit overlap of internode communications and GPU computations, no explicit
 188 multithreading is required on the CPU. However, CPU/GPU data transfers are
 189 achieved serially and not overlapped.
 190
 191
 192 \subsection{Overlapping with sequences of transfers and computations}
 193
 194 %\subsubsection{Overlapping with a sequential GPU sequence}
 195
 196 \noindent {\bf Overlapping with a sequential GPU sequence}
 197 \medskip
 198
 199 \begin{figure}[t]
 200   \centering
 201   \includegraphics[width=\columnwidth]{Chapters/chapter6/figures/Sync-SeqSequenceOverlap.pdf}
 202   \caption{Overlap of internode CPU communications with a sequence of CPU/GPU data transfers and GPU
 203   computations.}
 204   \label{fig:ch6p1overlapseqsequence}
 205 \end{figure}
 206
 207 When CPU/GPU data transfers are not negligible compared to GPU computations, it
 208 can be interesting to overlap internode CPU computations with a \emph{GPU
 209   sequence}\index{GPU sequence} including CPU/GPU data transfers and GPU computations (see
 210 \Fig{fig:ch6p1overlapseqsequence}). Algorithmic issues of this approach are basic,
 211 but their implementation require explicit CPU multithreading and
 212 synchronization, and CPU data buffer duplication. We need to implement two
 213 threads, one starting and achieving MPI communications, and the other running
 214 the \emph{GPU sequence}. OpenMP allows an easy and portable implementation of
 215 this overlapping strategy. However, it remains more complex to develop and to
 216 maintain than the previous strategy (overlapping only internode CPU
 217 communications and GPU computations), and should be adopted only when CPU/GPU
 218 data transfer times are not negligible.
 219
 220 %\begin{algorithm}
 221 %  \caption{Generic scheme explicitly overlapping MPI communications with
 222 %    sequences of CUDA CPU/GPU transfers and CUDA GPU
 223 %    computations}\label{algo:ch6p1overlapseqsequence}
 224 \begin{Listing}{algo:ch6p1overlapseqsequence}{Generic scheme explicitly overlapping MPI communications with sequences of CUDA CPU/GPU transfers and CUDA GPU computations}
 225 // Input data and result variables and arrays (example with
 226 // float datatype, 1D input arrays, and scalar results)
 227 float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture, *gpuInputTabAdr;
 228 float *cpuResTabAdr, *gpuResAdr;
 229
 230 // CPU and GPU array allocations
 231 cpuInputTabAdrCurrent = malloc(sizeof(float)*N);
 232 cpuInputTabAdrFuture = malloc(sizeof(float)*N);
 233 cudaMalloc(&gpuInputTabAdr,sizeof(float)*N);
 234 cpuResTabAdr = malloc(sizeof(float)*NbIter);
 235 cudaMalloc(&gpuResAdr,sizeof(float));
 236
 237 // Definition of the Grid of blocks of GPU threads
 238 dim3 Dg, Db;
 239 Dg.x = ...
 240 ...
 241
 242 // Indexes of source and destination MPI processes
 243 int dest = ...
 244 int src = ...
 245
 246 // Set the number of OpenMP threads (to create) to 2
 247 omp_set_num_threads(2);
 248 // Create threads and start the parallel OpenMP region
 249 #pragma omp parallel
 250 {
 251   // Buffer pointers (thread local variables)
 252   float *current = cpuInputTabAdrCurrent;
 253   float *future = cpuInputTabAdrFuture;
 254   float *tmp;
 255
 256   // Computation loop (using the GPU)
 257   for (int i = 0; i < NbIter; i++) {
 258
 259     // - Thread 0: achieves MPI communications
 260     if (omp_get_thread_num() == 0) {
 261       MPI_Sendrecv(current,                  // MPI comms. (sync. op)
 262                    N, MPI_FLOAT, dest, 0,
 263                    future,
 264                    N, MPI_FLOAT, dest, 0, ...);
 265
 266     // - Thread 1: achieves the GPU sequence (GPU computations and
 267     //             CPU/GPU data transfers)
 268     } else if (omp_get_thread_num() == 1) {
 269       cudaMemcpy(gpuInputTabAdr, current,    // Data transfer:
 270                  sizeof(float)*N,            // CPU --> GPU (sync. op)
 271                  cudaMemcpyHostToDevice);
 272       gpuKernel_k1<<<Dg,Db>>>();             // GPU comp. (async. op)
 273     // IF there is (now) a result to transfer from the GPU to the CPU:
 274       cudaMemcpy(cpuResTabAdr + i, gpuResAdr,// Data transfer:
 275                  sizeof(float),              // GPU --> CPU (sync. op)
 276                  cudaMemcpyDeviceToHost);
 277     }
 278
 279     // - Wait for both threads have achieved their iteration tasks
 280     #pragma omp barrier
 281     // - Each thread permute its local buffer pointers
 282     tmp = current;
 283     current = future;
 284     future = tmp;
 285   } // End of computation loop
 286 } // End of OpenMP parallel region
 287 ...
 288 \end{Listing}
 289 %\end{algorithm}
 290
 291 \Lst{algo:ch6p1overlapseqsequence} introduces the generic code of a
 292 MPI+OpenMP+CUDA implementation, explicitly overlapping MPI communications with
 293 \emph{GPU sequences}. Lines 25--62 implement the OpenMP parallel region,
 294 around the computation loop (lines 33--61). For performances it is
 295 important to create and destroy threads only one time (not at each iteration):
 296 the parallel region has to surround the computation loop. Lines 1--11
 297 consist in declaration and allocation of input data arrays and result arrays and
 298 variables, like in previous algorithm (\Lst{algo:ch6p1overlapnative}). However, we implement two input data buffers on the
 299 CPU (current and future version). As we aim to overlap internode MPI
 300 communications and GPU sequence, including CPU to GPU data transfer of current
 301 input data array, we need to store the received new input data array in a
 302 separate buffer. Then, the current input data array will be safely read on the
 303 CPU and copied into the GPU memory.
 304
 305 The thread creations\index{OpenMP!thread creation} are easily achieved with one OpenMP directive (line
 306 25). Then each thread defines and initializes \emph{its} local buffer pointers,
 307 and enters \emph{its} computing loop (lines 27--33). Inside the computing
 308 loop, a test on the thread number allows to run a different code in each
 309 thread. Lines 37--40 implement the MPI synchronous communication run by
 310 thread number $0$. Lines 45--52 implement the GPU sequence run by thread
 311 $1$: CPU to GPU data transfer, GPU computation and GPU to CPU result
 312 transfer (if needed). Details of the three operations of this sequence have not changed
 313 compared to the previous overlapping strategy.
 314
 315 At the end of \Lst{algo:ch6p1overlapseqsequence}, an OpenMP synchronization
 316 barrier\index{OpenMP!barrier} on line 56 allows to wait OpenMP threads have achieved MPI
 317 communications and GPU sequence, and do not need to access the current input
 318 data buffer. Then, each thread permute its local buffer pointers (lines 58--60),
 319 and is ready to enter the next iteration, processing the new current input
 320 array.
 321
 322
 323 %\subsubsection{Overlapping with a streamed GPU sequence}
 324
 325 \bigskip
 326 \noindent {\bf Overlapping with a streamed GPU sequence}
 327 \medskip
 328
 329 \begin{figure}[t]
 330   \centering
 331   \includegraphics[width=\columnwidth]{Chapters/chapter6/figures/Sync-StreamSequenceOverlap.pdf}
 332   \caption{Overlap of internode CPU communications with a streamed sequence of CPU/GPU data
 333   transfers and GPU computations.}
 334   \label{fig:ch6p1overlapstreamsequence}
 335 \end{figure}
 336
 337 Depending on the algorithm implemented, it is sometimes possible to split the
 338 GPU computation into several parts processing distinct data. Then, we can
 339 speedup the \emph{GPU sequence} using several CUDA \emph{streams}\index{CUDA!stream}. The goal is
 340 to overlap CPU/GPU data transfers with GPU computations\index{overlap!GPU data transfers with GPU computation} inside the \emph{GPU
 341   sequence}. Compared to the previous overlapping strategy, we have to split the
 342 initial data transfer in a set of $n$ asynchronous and smaller data transfers,
 343 and to split the initial GPU kernel call in a set of $n$ calls to the same GPU
 344 kernel. Usually, these smaller calls are deployed with less GPU threads
 345 (i.e. associated to a smaller grid of blocks of threads). Then, the first GPU
 346 computations can start as soon as the first data transfer has been achieved, and
 347 next transfers can be done in parallel of next GPU computations (see
 348 \Fig{fig:ch6p1overlapstreamsequence}).
 349
 350 NVIDIA advises to start all asynchronous CUDA data transfers, and then to call
 351 all CUDA kernel executions, using up to $16$ streams \cite{cudabestpractices}.
 352 Then, CUDA driver and
 353 runtime optimize the global execution of these operations. So, we cumulate two
 354 overlapping mechanisms. The former is controlled by CPU multithreading, and
 355 overlap MPI communications and the \emph{streamed GPU sequence}. The latter is
 356 controlled by CUDA programming, and overlap CPU/GPU data transfers and GPU
 357 computations. Again, OpenMP allows to easily implement the CPU multithreading,
 358 and to wait for the end of both CPU threads before to execute the next instructions
 359 of the code.
 360
 361 %\begin{algorithm}
 362 %  \caption{Generic scheme explicitly overlapping MPI communications with streamed sequences of CUDA
 363 %  CPU/GPU transfers and CUDA GPU computations}\label{algo:ch6p1overlapstreamsequence}
 364 \begin{Listing}{algo:ch6p1overlapstreamsequence}{Generic scheme explicitly overlapping MPI communications with streamed sequences of CUDA CPU/GPU transfers and CUDA GPU computations}
 365 // Input data and result variables and arrays (example with
 366 // float datatype, 1D input arrays, and scalar results)
 367 float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture, *gpuInputTabAdr;
 368 float *cpuResTabAdr, *gpuResAdr;
 369 // CPU and GPU array allocations (allocates page-locked CPU memory)
 370 cudaHostAlloc(&cpuInputTabAdrCurrent,sizeof(float)*N,cudaHostAllocDefault);
 371 cudaHostAlloc(&cpuInputTabAdrFuture,sizeof(float)*N,cudaHostAllocDefault);
 372 cudaMalloc(&gpuInputTabAdr,sizeof(float)*N);
 373 cpuResTabAdr = malloc(sizeof(float)*NbIter);
 374 cudaMalloc(&gpuResAdr,sizeof(float));
 375 // Stream declaration and creation
 376 cudaStream_t TabS[NbS];
 377 for(int s = 0; s < NbS; s++)
 378   cudaStreamCreate(&TabS[s]);
 379 // Definition of the Grid of blocks of GPU threads
 380 ...
 381 // Set the number of OpenMP threads (to create) to 2
 382 omp_set_num_threads(2);
 383 // Create threads and start the parallel OpenMP region
 384 #pragma omp parallel
 385 {
 386   // Buffer pointers (thread local variables)
 387   float *current = cpuInputTabAdrCurrent;
 388   float *future = cpuInputTabAdrFuture;
 389   float *tmp;
 390   // Stride of data processed per stream
 391   int stride = N/NbS;
 392   // Computation loop (using the GPU)
 393   for (int i = 0; i < NbIter; i++) {
 394     // - Thread 0: achieves MPI communications
 395     if (omp_get_thread_num() == 0) {
 396       MPI_Sendrecv(current,             // MPI comms. (sync. op)
 397                    N, MPI_FLOAT, dest, 0,
 398                    future,
 399                    N, MPI_FLOAT, dest, 0, ...);
 400     // - Thread 1: achieves the streamed GPU sequence (GPU
 401     //   computations and CPU/GPU data transfers)
 402     } else if (omp_get_thread_num() == 1) {
 403       for (int s = 0; s < NbS; s++) {   // Start all data transfers:
 404         cudaMemcpyAsync(gpuInputTabAdr + s*stride, // CPU --> GPU
 405                         current + s*stride,        // (async. ops)
 406                         sizeof(float)*stride,
 407                         cudaMemcpyHostToDevice,
 408                         TabS[s]);
 409       }
 410       for (int s = 0; s < NbS; s++) { // Start all GPU comps. (async.)
 411         gpuKernel_k1<<<Dg, Db, 0, TabS[s]>>>(gpuInputTabAdr + s*stride);
 412       }
 413       cudaThreadSynchronize();          // Wait all threads are ended
 414    // IF there is (now) a result to transfer from the GPU to the CPU:
 415       cudaMemcpy(cpuResTabAdr,          // Data transfers:
 416                  gpuResAdr,             // GPU --> CPU (sync. op)
 417                  sizeof(float),
 418                  cudaMemcpyDeviceToHost);
 419     }
 420     // - Wait for both threads have achieved their iteration tasks
 421     #pragma omp barrier
 422     // - Each thread permute its local buffer pointers
 423     tmp = current; current = future; future = tmp;
 424   } // End of computation loop
 425 } // End of OpenMP parallel region
 426 ...
 427 // Destroy the streams
 428 for(int s = 0; s < NbS; s++)
 429   cudaStreamDestroy(TabS[s]);
 430 ...
 431 \end{Listing}
 432 %\end{algorithm}
 433
 434 \Lst{algo:ch6p1overlapstreamsequence} introduces the generic MPI+OpenMP+CUDA
 435 code  explicitly overlapping MPI communications with
 436 \emph{streamed GPU sequences}\index{GPU sequence!streamed}. Efficient usage of CUDA \emph{streams} requires to execute
 437 asynchronous CPU/GPU data transfers, that needs to read page-locked
 438 data\index{page-locked data} in CPU memory. So, CPU
 439 memory allocations on lines 6 and 7 are implemented with \texttt{cudaHostAlloc} instead of
 440 the basic \texttt{malloc} function. Then, $NbS$ \emph{streams} are created on lines 12--14.
 441 Usually we create $16$ streams: the maximum number supported by CUDA.
 442
 443 An OpenMP parallel region\index{OpenMP!parallel region} including two threads is implemented on lines 17--61 of
 444 \Lst{algo:ch6p1overlapstreamsequence}, similarly to the previous algorithm (see
 445 \Lst{algo:ch6p1overlapseqsequence}). Code of thread $0$ achieving MPI communication is unchanged, but
 446 code of thread $1$ is now using streams. Following NVIDIA recommandations, we have first implemented
 447 a loop starting $NbS$ asynchronous data transfers (lines 39--45): transferring $N/NbS$ data on
 448 each stream. Then we have implemented a second loop (lines 46--48), starting asynchronous
 449 executions of $NbS$ grids of blocks of GPU threads (one per stream). Data transfers and kernel
 450 executions on the same stream are synchronized by CUDA and the GPU. So, each kernel execution will
 451 start after its data will be transferred into the GPU memory, and the GPU scheduler ensures to start
 452 some kernel executions as soon as the first data transfers are achieved. Then, next data transfers
 453 will be overlapped with GPU computations. After the kernel calls, on the different streams,
 454 we wait for the end of all GPU threads previously run, calling an explicit synchronization
 455 function on line 49. This synchronization is not mandatory, but it will make the implementation more
 456 robust and will facilitate the debugging steps: all GPU computations run by the OpenMP thread number
 457 $1$ will be achieved before this thread will enter a new loop iteration, or before the computation
 458 loop will be ended.
 459
 460 If a partial result has to be transferred from GPU to CPU memory  at the end of each loop iteration
 461 (for example the result of one \emph{reduction} per iteration), this transfer is achieved
 462 synchronously on  the default stream (no particular stream is specified) on lines 51--54.
 463 Availability of the result values is ensured by the synchronization implemented on line 49.
 464 However, if a partial result has to be transferred on the CPU on each stream, then $NbS$ asynchronous data
 465 transfers could be started in parallel (one per stream), and should be implemented before the
 466 synchronization operation on line 49. The end of the computation loop includes a synchronization
 467 barrier of the two OpenMP threads, waiting they have finished to access the different data
 468 buffers in the current iteration. Then, each OpenMP thread exchanges its local buffer pointers, like
 469 in the previous algorithm. However, after the computation loop, we have added the
 470 destruction of the CUDA streams (lines 63--65).
 471
 472 Finally, CUDA streams\index{CUDA!stream} have been used to extend \Lst{algo:ch6p1overlapseqsequence}
 473 with respect to its global scheme. \Lst{algo:ch6p1overlapstreamsequence} still creates an
 474 OpenMP parallel region, with two CPU threads, one in charge of MPI communications, and the other
 475 managing data transfers and GPU computations. Unfortunately, using GPU streams require to be able to
 476 split a GPU computation in independent subparts, working on independent subsets of data.
 477 \Lst{algo:ch6p1overlapstreamsequence} is not so generic than \Lst{algo:ch6p1overlapseqsequence}.
 478
 479
 480 \subsection{Interleaved communications-transfers-computations overlapping}
 481
 482 \begin{figure}[t]
 483   \centering
 484   \includegraphics{Chapters/chapter6/figures/Sync-CompleteInterleaveOverlap.pdf}
 485   \caption{Complete overlap of internode CPU communications, CPU/GPU data transfers and GPU
 486   computations, interleaving computation-communication iterations}
 487   \label{fig:ch6p1overlapinterleaved}
 488 \end{figure}
 489
 490 Many algorithms do not support to split data transfers and kernel calls, and can
 491 not exploit CUDA streams. For example, when each GPU thread requires to access
 492 some data spread in the global set of transferred data. Then, it is possible to
 493 overlap internode CPU communications and CPU/GPU data transfers and GPU
 494 computations, if the algorithm achieves \emph{computation-communication
 495   iterations} and if we can interleave these iterations. At iteration $k$: CPUs
 496 exchange data $D_k$, each CPU/GPU couple transfers data $D_k$, and each GPU
 497 achieves computations on data $D_{k-1}$ (see
 498 \Fig{fig:ch6p1overlapinterleaved}). Compared to the previous strategies, this
 499 strategy requires twice as many CPU data buffers
 500 and twice as many GPU buffers.
 501
 502 %\begin{algorithm}
 503 %  \caption{Generic scheme explicitly overlapping MPI communications, CUDA CPU/GPU transfers and CUDA
 504 %  GPU computations, interleaving computation-communication iterations}\label{algo:ch6p1overlapinterleaved}
 505 \begin{Listing}{algo:ch6p1overlapinterleaved}{Generic scheme explicitly overlapping MPI communications, CUDA CPU/GPU transfers and CUDA GPU computations, interleaving computation-communication iterations}
 506 // Input data and result variables and arrays (example with
 507 // float datatype, 1D input arrays, and scalar results)
 508 float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture;
 509 float *gpuInputTabAdrCurrent, *gpuInputTabAdrFuture;
 510 float *cpuResTabAdr, *gpuResAdr;
 511
 512 // CPU and GPU array allocations
 513 cpuInputTabAdrCurrent = malloc(sizeof(float)*N);
 514 cpuInputTabAdrFuture = malloc(sizeof(float)*N);
 515 cudaMalloc(&gpuInputTabAdrCurrent,sizeof(float)*N);
 516 cudaMalloc(&gpuInputTabAdrFuture,sizeof(float)*N);
 517 cpuResTabAdr = malloc(sizeof(float)*NbIter);
 518 cudaMalloc(&gpuResAdr,sizeof(float));
 519
 520 // Definition of the Grid of blocks of GPU threads
 521 dim3 Dg, Db; Dg.x = ...
 522 // Indexes of source and destination MPI processes
 523 int dest, src; dest = ...
 524
 525 // Set the number of OpenMP threads (to create) to 2
 526 omp_set_num_threads(3);
 527 // Create threads and start the parallel OpenMP region
 528 #pragma omp parallel
 529 {
 530   // Buffer pointers (thread local variables)
 531   float *cpuCurrent = cpuInputTabAdrCurrent;
 532   float *cpuFuture  = cpuInputTabAdrFuture;
 533   float *gpuCurrent = gpuInputTabAdrCurrent;
 534   float *gpuFuture  = gpuInputTabAdrFuture;
 535   float *tmp;
 536
 537   // Computation loop on: NbIter + 1 iteration
 538   for (int i = 0; i < NbIter + 1; i++) {
 539     // - Thread 0: achieves MPI communications
 540     if (omp_get_thread_num() == 0) {
 541       if (i < NbIter) {
 542         MPI_Sendrecv(cpuCurrent,            // MPI comms. (sync. op)
 543                      N, MPI_FLOAT, dest, 0,
 544                      cpuFuture,
 545                      N, MPI_FLOAT, dest, 0, ...);
 546       }
 547     // - Thread 1: achieves the CPU/GPU data transfers
 548     } else if (omp_get_thread_num() == 1) {
 549       if (i < NbIter) {
 550         cudaMemcpy(gpuFuture, cpuCurrent,   // Data transfer:
 551                    sizeof(float)*N,         // CPU --> GPU (sync. op)
 552                    cudaMemcpyHostToDevice);
 553       }
 554  // - Thread 2: achieves the GPU computations and the result transfer
 555     } else if (omp_get_thread_num() == 2) {
 556       if (i > 0) {
 557         gpuKernel_k1<<<Dg,Db>>>(gpuCurrent);// GPU comp. (async. op)
 558         // IF there is (now) a result to transfer from GPU to CPU:
 559         cudaMemcpy(cpuResTabAdr + (i-1),    // Data transfer:
 560                    gpuResAdr, sizeof(float),// GPU --> CPU (sync. op)
 561                    cudaMemcpyDeviceToHost);
 562       }
 563     }
 564     // - Wait for both threads have achieved their iteration tasks
 565     #pragma omp barrier
 566     // - Each thread permute its local buffer pointers
 567     tmp = cpuCurrent; cpuCurrent = cpuFuture; cpuFuture = tmp;
 568     tmp = gpuCurrent; gpuCurrent = gpuFuture; gpuFuture = tmp;
 569   } // End of computation loop
 570 } // End of OpenMP parallel region
 571 ...
 572 \end{Listing}
 573 %\end{algorithm}
 574
 575 \Lst{algo:ch6p1overlapinterleaved} introduces the generic code of a
 576 MPI+OpenMP+CUDA implementation, explicitly interleaving
 577 computation-communication iterations and overlapping MPI communications, CUDA CPU/GPU
 578 transfers and CUDA GPU computations. As in the previous algorithms, we declare two CPU input data arrays
 579 (current and future version) on line 3, but we also declare two GPU input data arrays on line 4. On
 580 lines 8--11, these four data arrays are allocated, using \texttt{malloc} and
 581 \texttt{cudaMalloc}.
 582 We do not need to allocate page-locked memory space. On lines 23--65 we
 583 create an OpenMP parallel region, configured to run three threads (see line 21). Lines 26--30 are
 584 declarations of thread local pointers on data arrays and variables (each thread will use its own
 585 pointers). On line 33, the three threads enter a computation loop of \texttt{NbIter + 1}
 586 iterations. We need to run one more iteration than with previous algorithms.
 587
 588 Lines 34--41 are the MPI communications, achieved by the thread number $0$. They send the
 589 current CPU input data array to another CPU, and receive the future CPU input data array from
 590 another CPU, like in previous algorithms. But this thread achieves communications only during the
 591 \emph{first} \texttt{NbIter} iterations. Lines 43--48 are the CPU to GPU input data
 592 transfers, achieved by thread number $1$. These data transfers are run in parallel of MPI
 593 communications. They are run during the \emph{first} \texttt{NbIter} iterations, and transfer
 594 current CPU input data array into the future GPU data array. Lines 50--57
 595 correspond to the code run by
 596 thread number $3$. They start GPU computations, to process the current GPU input data array, and if
 597 necessary
 598 transfer a GPU result at an index of the CPU result array. These GPU computations and result
 599 transfers are run during the \emph{last} \texttt{NbIter} iterations: the GPU computations
 600 have to wait the first data transfer is ended before to start to process any data, and can not run
 601 during the first iteration. So, the activity of the third thread is shifted of one iteration
 602 compared to the activities of other threads. Moreover, the address of the current GPU input data
 603 array has to be passed as a parameter of the kernel call on line 52, in order the GPU threads access
 604 the right data array. Like in previous algorithms the GPU result is copied at one index of the CPU
 605 result array, in lines 53--56, but due to the shift of the third thread activity this index is
 606 now \texttt{(i - 1)}.
 607
 608 Line 60 is a synchronization barrier\index{OpenMP!barrier} of the three OpenMP threads, followed by a pointer permutation
 609 of local pointers on current and future data arrays, on line 62 and 63. Each
 610 thread waits for the completion of other
 611 threads to use the data arrays, and then permutes its data array pointers before to
 612 enter a new loop iteration.
 613
 614 This complete overlap of MPI communications and CPU/GPU data transfers and GPU computations, is not
 615 too complex to implement, and can be a solution when GPU computations are not adapted to use CUDA
 616 streams: when GPU computations can not be split in subparts working on independent subsets of input
 617 data. However, it requires to run one more iterations (a total of \texttt{NbIter + 1}
 618 iterations). Then, if the number of iterations is very small, it could be more interesting not to
 619 attempt to overlap CPU/GPU data transfers and GPU computations, and to implement \Lst{algo:ch6p1overlapseqsequence}.
 620
 621
 622 \subsection{Experimental validation}
 623 \label{ch6:p1expes}
 624 %\subsubsection{Experimentation testbed}
 625
 626 \noindent {\bf Experimentation testbed}
 627 \medskip
 628
 629 Two clusters located at SUPELEC in Metz (France) have been used for the entire set of
 630 experiments presented in this chapter:
 631 %
 632 \begin{itemize}
 633
 634 \item The first consists of 17 nodes with an Intel Nehalem quad-core processor
 635   at 2.67Ghz, 6 Gb RAM and an NVIDIA GeForce GTX480 GPU, each.
 636
 637 \item The second consists of 16 nodes with an Intel core2 dual-core processor at
 638   2.67Ghz, 4 Gb RAM and an NVIDIA GeForce GTX580 GPU, each
 639
 640 \end{itemize}
 641 %
 642 Both clusters have a Gigabit Ethernet interconnection network that is connected
 643 through a DELL Power Object 5324 switch. The two switches are linked twice,
 644 insuring the interconnection of the two clusters.  The software environment
 645 consists of a Linux Fedora 64bit OS (kernel v. 2.6.35), GNU C and C++ compilers
 646 (v. 4.5.1) and the CUDA library (v. 4.2).
 647
 648
 649 %\subsubsection{Validation of the synchronous approach}
 650
 651 \bigskip
 652 \noindent {\bf Validation of the synchronous approach}
 653 \medskip
 654
 655 \begin{figure}[t]
 656   \centering
 657   \includegraphics{Chapters/chapter6/curves/gpuSyncOverlap.pdf}
 658   \caption{Experimental performances of different synchronous algorithms computing a
 659     dense matrix product}
 660   \label{fig:ch6p1syncexpematrixprod}
 661 \end{figure}
 662
 663 \label{ch6:p1block-cyclic}
 664 We have experimented our approach of synchronous parallel algorithms with a classic
 665 block cyclic algorithm for dense matrix multiplication\index{matrix
 666   multiplication!block cyclic}. This problem requires to split two input matrices ($A$ and $B$) on a ring of
 667 computing nodes, and to establish a circulation of the slices of $A$ matrix on the ring ($B$ matrix
 668 partition does not evolve during all the run). Compared to our generic algorithms, there is no
 669 partial result to transfer from GPU to CPU at the end of each computing iteration. The part of the
 670 result matrix computed on each GPU is transferred on the CPU at the end of the computation loop.
 671
 672 We have first implemented a synchronous version without any overlap of MPI communications, CPU/GPU
 673 data transfers, and GPU computations. We have added some synchronizations in the native overlapping
 674 version in order to avoid any overlap. We have measured the performance achieved on our cluster with
 675 NVIDIA GTX480 GPUs and matrices sizes of 4096$\times$4096, and we have obtained the curves in \Fig{fig:ch6p1syncexpematrixprod} labeled
 676 \emph{no-ovlp}. We observe that performance increases when the number of processor increases. Of course,
 677 there is a significant increase in cost when comparing a single node (without any MPI communication) with
 678 two nodes (starting to use MPI communications). But beyond two nodes we get a classical performance
 679 curve.
 680
 681 Then, we implemented and experimented \Lst{algo:ch6p1overlapnative}, see
 682 \emph{ovlp-native} in \Fig{fig:ch6p1syncexpematrixprod}. The native
 683 overlap of MPI communications with asynchronous run of CUDA kernels appears efficient. When the
 684 number of nodes increases the ratio of the MPI communications increases a lot (because the
 685 computation times decrease a lot). So, there is not a lot of GPU computation
 686 time that remains to be
 687 overlapped, and both \emph{no-ovlp} and \emph{ovlp-native} tend to the same
 688 limit. Already, the native overlap performed in \Lst{algo:ch6p1overlapnative}
 689 achieves a high level of performance very quickly, using only four
 690 nodes. Beyond four nodes, a faster interconnection network would be required for
 691 a performance increase.
 692
 693 Finally, we implemented \Lst{algo:ch6p1overlapseqsequence}, overlapping MPI communications
 694 with a GPU sequence including both CPU/GPU data transfers and GPU computations,
 695 see \emph{ovlp-GPUsequence} in \Fig{fig:ch6p1syncexpematrixprod}. From four
 696 up to sixteen nodes it achieves better performances than \emph{ovlp-native}: we better overlap
 697 MPI communications. However, this parallelization mechanism has more overhead: OpenMP threads
 698 have to be created and synchronized. Only for two nodes it is less efficient than the native
 699 overlapping algorithm. Beyond two nodes, the CPU multithreading overhead seems compensated.
 700 % No, it doesn't need the more implementation of time, but more implementation
 701 % of code :)
 702 \Lst{algo:ch6p1overlapseqsequence} requires more time for the implemention
 703 and can be more complex to maintain, but such extra development cost is
 704 justified if we are looking for better performance.
 705
 706
 707 %%% Local Variables:
 708 %%% mode: latex
 709 %%% fill-column: 80
 710 %%% ispell-dictionary: "american"
 711 %%% mode: flyspell
 712 %%% TeX-master: "../../BookGPU.tex"
 713 %%% End: