BookGPU/Chapters/chapter6/PartieAsync.tex

   1 \section{General scheme of asynchronous parallel code with
   2 computation/communication overlapping}
   3 \label{ch6:part2}
   4
   5 In the previous part, we have seen how to efficiently implement overlap of
   6 computations (CPU and GPU) with communications (GPU transfers and inter-node
   7 communications).  However, we have previously shown that for some parallel
   8 iterative algorithms, it is sometimes even more efficient to use an asynchronous
   9 scheme of iterations\index{iterations!asynchronous} \cite{HPCS2002,ParCo05,Para10}.  In that case, the nodes do
  10 not wait for each others but they perform their iterations using the last
  11 external data they have received from the other nodes, even if this
  12 data was produced \emph{before} the previous iteration on the other nodes.
  13
  14 Formally, if we denote by $f=(f_1,...,f_n)$ the function representing the
  15 iterative process and by $x^t=(x_1^t,...,x_n^t)$ the values of the $n$ elements of
  16 the system at iteration $t$, we pass from a synchronous iterative scheme of the
  17 form:
  18 %% \begin{algorithm}[H]
  19 %%   \caption{Synchronous iterative scheme}\label{algo:ch6p2sync}
  20 %%   \begin{Algo}
  21 %%     $x^{0}=(x_{1}^{0},...,x_{n}^{0})$\\
  22 %%     \textbf{for} $t=0,1,...$\\
  23 %%     \>\textbf{for} $i=1,...,n$\\
  24 %%     \>\>$x_{i}^{t+1}=f_{i}(x_{1}^t,...,x_i^t,...,x_{n}^t)$\\
  25 %%     \>\textbf{endfor}\\
  26 %%     \textbf{endfor}
  27 %%   \end{Algo}
  28 %% \end{algorithm}
  29 \begin{algorithm}[H]
  30   \caption{Synchronous iterative scheme}\label{algo:ch6p2sync}
  31   $x^{0}=(x_{1}^{0},...,x_{n}^{0})$\;
  32   \For{ $t=0,1,...$} {
  33     \For{ $i=1,...,n$}{
  34       $x_{i}^{t+1}=f_{i}(x_{1}^t,...,x_i^t,...,x_{n}^t)$\;
  35     }
  36   }
  37 \end{algorithm}
  38
  39
  40 \noindent
  41 to an asynchronous iterative scheme of the form:\\
  42 %% \begin{algorithm}[H]
  43 %%   \caption{Asynchronous iterative scheme}\label{algo:ch6p2async}
  44 %%   \begin{Algo}
  45 %%     $x^{0}=(x_{1}^{0},...,x_{n}^{0})$\\
  46 %%     \textbf{for} $t=0,1,...$\\
  47 %%     \>\textbf{for} $i=1,...,n$\\
  48 %%     \>\>$x_{i}^{t+1}=\left\{
  49 %%       \begin{array}[h]{ll}
  50 %%         x_i^t & \text{if } i \text{ is \emph{not} updated at iteration } i\\
  51 %%         f_i(x_1^{s_1^i(t)},...,x_n^{s_n^i(t)}) & \text{if } i \text{ is updated at iteration } i
  52 %%       \end{array}
  53 %%     \right.$\\
  54 %%     \>\textbf{endfor}\\
  55 %%     \textbf{endfor}
  56 %%   \end{Algo}
  57 %% \end{algorithm}
  58 \begin{algorithm}[H]
  59   \caption{Asynchronous iterative scheme}\label{algo:ch6p2async}
  60     $x^{0}=(x_{1}^{0},...,x_{n}^{0})$\;
  61     \For {$t=0,1,...$} {
  62       \For{ $i=1,...,n$} {
  63         $x_{i}^{t+1}=\left\{
  64       \begin{array}[h]{ll}
  65         x_i^t & \text{if } i \text{ is \emph{not} updated at iteration } i\\
  66         f_i(x_1^{s_1^i(t)},...,x_n^{s_n^i(t)}) & \text{if } i \text{ is updated at iteration } i
  67       \end{array}
  68     \right.$
  69       }
  70     }
  71 \end{algorithm}
  72 where $s_j^i(t)$ is the iteration number of the production of the value $x_j$ of
  73 element $j$  that is  used on element  $i$ at iteration  $t$ (see for example~\cite{BT89,
  74   FS2000} for further details).
  75 Such schemes  are called AIAC\index{AIAC} for  \emph{Asynchronous Iterations and
  76   Asynchronous Communications}.  They combine  two aspects that are respectively
  77 different computation speeds of  the computing elements and communication delays
  78 between them.
  79
  80 The key feature of such algorithmic schemes
  81 is  that they  may be  faster  than their  synchronous counterparts  due to  the
  82 implied total overlap of computations  with communications: in fact, this scheme
  83 suppresses all the  idle times induced by nodes  synchronizations between each iteration.
  84
  85 However, the efficiency of such a scheme is directly linked to the frequency at
  86 which new data arrives on each node. Typically,
  87 if a node receives newer data only every four or five local iterations, it is
  88 strongly probable that the evolution of its local iterative process will be
  89 slower than if it receives data at every iteration. The key point here is that
  90 this frequency does not only depend on the hardware configuration of the
  91 parallel system but it also depends on the software that is used to
  92 implement the algorithmic scheme.
  93
  94 The impact of the programming environments used to implement asynchronous
  95 algorithms has already been investigated in~\cite{SuperCo05}.  Although the
  96 features required to efficiently implement asynchronous schemes have not
  97 changed, the available programming environments and computing hardware have
  98 evolved, in particular now GPUs are available.  So, there is a need to reconsider the
  99 implementation schemes of AIAC according to the new de facto standards for parallel
 100 programming (communications and threads) as well as the integration of the GPUs.
 101 One of the main objective here is to obtain a maximal overlap between the
 102 activities of the three types of devices that are the CPU, the GPU and the
 103 network. Moreover, another objective is to present what we think
 104 is the best compromise between the simplicity of the implementation and its
 105 maintainability on one side and its
 106 performance on the other side.  This is especially important for industries where
 107 implementation and maintenance costs are strong constraints.
 108
 109 For the sake of clarity, we present the different algorithmic schemes in a progressive
 110 order of complexity, from the basic asynchronous scheme to the complete scheme
 111 with full overlap.  Between these two extremes, we propose a synchronization
 112 mechanism on top of our asynchronous scheme that can be used either statically or
 113 dynamically during the application execution.
 114
 115 Although there exist several programming environments for inter-node
 116 communications, multi-threading and GPU programming, a few of them have
 117 become \emph{de facto standards}, either due to their good stability, their ease
 118 of use and/or their wide adoption by the scientific community.
 119 Therefore, as in the previous section all the schemes presented in the following use MPI~\cite{MPI},
 120 OpenMP~\cite{openMP} and CUDA~\cite{CUDA}.  However, there is no loss of
 121 generality as those schemes may easily be implemented with other libraries.
 122
 123 Finally, in order to stay as clear as possible, only the parts of code and
 124 variables related to the control of parallelism (communications, threads,...)
 125 are presented in our schemes. The inner organization of data is not detailed as
 126 it depends on the application.  We only consider that we have two data arrays
 127 (previous version and current version) and communication buffers.  However, in
 128 most of the cases, those buffers can correspond to the data arrays themselves to
 129 avoid data copies.
 130
 131 \subsection{A basic asynchronous scheme}
 132 \label{ch6:p2BasicAsync}
 133
 134 The first step toward our complete scheme is to implement a basic asynchronous
 135 scheme that includes an actual overlap of the communications with the
 136 computations\index{overlap!computation and communication}. In order to ensure that the communications are actually performed
 137 in parallel of the computations, it is necessary to use different threads.  It
 138 is important to remember that asynchronous communications provided in
 139 communication libraries like MPI are not systematically performed in parallel of
 140 the computations~\cite{ChVCV13,Hoefler08a}.  So, the logical and classical way
 141 to implement such an overlap is to use three threads: one for
 142 computing, one for sending and one for receiving. Moreover, since
 143 the communication is performed by threads, blocking synchronous communications\index{MPI!communication!blocking}\index{MPI!communication!synchronous}
 144 can be used without deteriorating the overall performance.
 145
 146 In this basic version, the termination\index{termination} of the global process is performed
 147 individually on each node according to their own termination. This can be guided by either a
 148 number of iterations or a local convergence detection\index{convergence detection}. The important step at
 149 the end of the process is to perform the receptions of all pending
 150 communications in order to ensure the termination of the two communication
 151 threads.
 152
 153 So, the global organization of this scheme is set up in \Lst{algo:ch6p2BasicAsync}.
 154
 155 % \begin{algorithm}[H]
 156 %   \caption{Initialization of the basic asynchronous scheme.}
 157 %   \label{algo:ch6p2BasicAsync}
 158 \begin{Listing}{algo:ch6p2BasicAsync}{Initialization of the basic asynchronous scheme}
 159 // Variables declaration and initialization
 160 omp_lock_t lockSend; // Controls the sendings from the computing thread
 161 omp_lock_t lockRec;  // Ensures the initial reception of external data
 162 char Finished = 0;   // Boolean indicating the end of the process
 163 char SendsInProgress = 0; // Boolean indicating if previous data sendings are still in progress
 164 double Threshold;    // Threshold of the residual for convergence detection
 165
 166 // Parameters reading
 167 ...
 168
 169 // MPI initialization
 170 MPI_Init_thread(argc, argv, MPI_THREAD_MULTIPLE, &provided);
 171 MPI_Comm_size(MPI_COMM_WORLD, &nbP);
 172 MPI_Comm_rank(MPI_COMM_WORLD, &numP);
 173
 174 // Data initialization and distribution among the nodes
 175 ...
 176
 177 // OpenMP initialization (mainly declarations and setting up of locks)
 178 omp_set_num_threads(3);
 179 omp_init_lock(&lockSend);
 180 omp_set_lock(&lockSend); // Initially locked, unlocked to start sendings
 181 omp_init_lock(&lockRec);
 182 omp_set_lock(&lockRec);  // Initially locked, unlocked when initial data are received
 183
 184 #pragma omp parallel
 185 {
 186   switch(omp_get_thread_num()){
 187     case COMPUTATION :
 188     computations(... @\emph{relevant parameters}@ ...);
 189     break;
 190
 191     case SENDINGS :
 192     sendings();
 193     break;
 194
 195     case RECEPTIONS :
 196     receptions();
 197     break;
 198   }
 199 }
 200
 201 // Cleaning of OpenMP locks
 202 omp_test_lock(&lockSend);
 203 omp_unset_lock(&lockSend);
 204 omp_destroy_lock(&lockSend);
 205 omp_test_lock(&lockRec);
 206 omp_unset_lock(&lockRec);
 207 omp_destroy_lock(&lockRec);
 208
 209 // MPI termination
 210 MPI_Finalize();
 211 \end{Listing}
 212 %\end{algorithm}
 213
 214 % \ANNOT{Est-ce qu'on laisse le 1er échange de données ou pas dans
 215 %   l'algo~\ref{algo:ch6p2BasicAsyncComp} ? (lignes 16-19)\\
 216 %   (ça n'est pas nécessaire si les données de
 217 %   départ sont distribuées avec les dépendances qui correspondent, sinon il faut
 218 %   le faire)}
 219
 220 In this scheme, the \texttt{lockRec} mutex\index{OpenMP!mutex} is not mandatory.
 221 It is only used to ensure that data dependencies are actually exchanged at the
 222 first iteration of the process.  Data initialization and distribution
 223 (lines~16-17) are not detailed here because they are directly related to the
 224 application. The important point is that, in most cases, they should be done
 225 before the iterative process.  The computing function is given in
 226 \Lst{algo:ch6p2BasicAsyncComp}.
 227
 228 %\begin{algorithm}[H]
 229 %  \caption{Computing function in the basic asynchronous scheme.}
 230 %  \label{algo:ch6p2BasicAsyncComp}
 231 \begin{Listing}{algo:ch6p2BasicAsyncComp}{Computing function in the basic asynchronous scheme}
 232 // Variables declaration and initialization
 233 int iter = 1;      // Number of the current iteration
 234 double difference; // Variation of one element between two iterations
 235 double residual;   // Residual of the current iteration
 236
 237 // Computation loop
 238 while(!Finished){
 239   // Sendings of data dependencies if there is no previous sending in progress
 240   if(!SendsInProgress){
 241     // Potential copy of data to be sent in additional buffers
 242     ...
 243     // Change of sending state
 244     SendsInProgress = 1;
 245     omp_unset_lock(&lockSend);
 246   }
 247
 248   // Blocking receptions at the first iteration
 249   if(iter == 1){
 250     omp_set_lock(&lockRec);
 251   }
 252
 253   // Initialization of the residual
 254   residual = 0.0;
 255   // Swapping of data arrays (current and previous)
 256   tmp = current;      // Pointers swapping to avoid
 257   current = previous; // actual data copies between
 258   previous = tmp;     // the two data versions
 259   // Computation of current iteration over local data
 260   for(ind=0; ind<localSize; ++ind){
 261     // Updating of current array using previous array
 262     ...
 263     // Updating of the residual
 264     // (max difference between two successive iterations)
 265     difference = fabs(current[ind] - previous[ind]);
 266     if(difference > residual){
 267       residual = difference;
 268     }
 269   }
 270
 271   // Checking of the end of the process (residual under threshold)
 272   // Other conditions can be added to the termination detection
 273   if(residual <= Threshold){
 274     Finished = 1;
 275     omp_unset_lock(&lockSend); // Activation of end messages sendings
 276     MPI_Ssend(&Finished, 1, MPI_CHAR, numP, tagEnd, MPI_COMM_WORLD);
 277   }
 278
 279   // Updating of the iteration number
 280   iter++;
 281 }
 282 \end{Listing}
 283 %\end{algorithm}
 284
 285 As mentioned above, it can be seen in line~18 of \Lst{algo:ch6p2BasicAsyncComp}
 286 that the \texttt{lockRec} mutex is used only at the first iteration to wait for
 287 the initial data dependencies before the computations. The residual\index{residual}, initialized
 288 in line~23 and computed in lines~34-37, is defined by the maximal difference
 289 between the elements from two consecutive iterations. It is classically used to
 290 detect the local convergence of the process on each node.  In the more
 291 complete schemes presented in the sequel, a global termination detection that
 292 takes the states of all the nodes into account will be exhibited.
 293
 294 Finally, the local convergence is tested and updated when necessary. In line~44,
 295 the \texttt{lockSend} mutex is unlocked to allow the sending function to send
 296 final messages to the dependency nodes.  Those messages are required to keep the
 297 reception function alive until all the final messages have been received.
 298 Otherwise, a node could stop its reception function whereas other nodes are
 299 still trying to communicate with it.  Moreover, a local sending of a final
 300 message to the node itself is required (line~45) to ensure that the reception
 301 function will not stay blocked in a message probing
 302 (see~\Lst{algo:ch6p2BasicAsyncReceptions}, line~11). This may happen if the node
 303 receives the final messages from its dependencies \emph{before} being itself in
 304 local convergence.
 305
 306 All the messages but this final local one are performed in the sending function
 307 described in \Lst{algo:ch6p2BasicAsyncSendings}.
 308
 309 The main loop is only conditioned by the end of the computing process (line~4).
 310 At each iteration, the thread waits for the permission from the computing thread
 311 (according to the \texttt{lockSend} mutex).  Then, data are sent with
 312 blocking synchronous communications.  The \texttt{SendsInProgress} boolean
 313 allows the computing thread to skip data sendings as long as a previous sending
 314 is in progress. This skip is possible due to the nature of asynchronous
 315 algorithms that allows such \emph{message loss}\index{message!loss/miss} or \emph{message miss}. After
 316 the main loop, the final messages are sent to the dependencies of the node.
 317
 318 %\begin{algorithm}[H]
 319 %  \caption{Sending function in the basic asynchronous scheme.}
 320 %  \label{algo:ch6p2BasicAsyncSendings}
 321 \begin{Listing}{algo:ch6p2BasicAsyncSendings}{Sending function in the basic asynchronous scheme}
 322 // Variables declaration and initialization
 323 ...
 324
 325 while(!Finished){
 326   omp_set_lock(&lockSend); // Waiting for signal from the comp. thread
 327   if(!Finished){
 328     // Blocking synchronous sends to all dependencies
 329     for(i=0; i<nbDeps; ++i){
 330       MPI_Ssend(&dataToSend[deps[i]], nb_data, type_of_data, deps[i], tagCom, MPI_COMM_WORLD);
 331     }
 332     SendsInProgress = 0; // Indicates that the sendings are done
 333   }
 334 }
 335 // At the end of the process, sendings of final messages
 336 for(i=0; i<nbDeps; ++i){
 337   MPI_Ssend(&Finished, 1, MPI_CHAR, deps[i], tagEnd, MPI_COMM_WORLD);
 338 }
 339 \end{Listing}
 340 %\end{algorithm}
 341
 342 The last function, detailed in \Lst{algo:ch6p2BasicAsyncReceptions}, does all the messages receptions.
 343 %\begin{algorithm}[H]
 344 %  \caption{Reception function in the basic asynchronous scheme.}
 345 %  \label{algo:ch6p2BasicAsyncReceptions}
 346 \begin{Listing}{algo:ch6p2BasicAsyncReceptions}{Reception function in the basic asynchronous scheme}
 347 // Variables declaration and initialization
 348 char countReceipts = 0; // Boolean indicating whether receptions are counted or not
 349 int nbEndMsg = 0;       // Number of end messages received
 350 int arrived = 0;        // Boolean indicating if a message is arrived
 351 int srcNd;              // Source node of the message
 352 int size;               // Message size
 353
 354 // Main loop of receptions
 355 while(!Finished){
 356   // Waiting for an incoming message
 357   MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
 358   if(!Finished){
 359     // Management of data messages
 360     switch(status.MPI_TAG){
 361       case tagCom: // Management of data messages
 362        srcNd = status.MPI_SOURCE; // Get the source node of the message
 363        // Actual data reception in the corresponding buffer
 364        MPI_Recv(dataBufferOf(srcNd), nbDataOf(srcNd), dataTypeOf(srcNd), srcNd, tagCom, MPI_COMM_WORLD, &status);
 365        // Unlocking of the computing thread when data are received from all dependencies
 366        if(countReceipts == 1 && ... @\emph{receptions from ALL dependencies}@ ...){
 367          omp_unset_lock(&lockRec);
 368          countReceipts = 0; // No more counting after first iteration
 369        }
 370        break;
 371       case tagEnd: // Management of end messages
 372        // Actual end message reception in dummy buffer
 373        MPI_Recv(dummyBuffer, 1, MPI_CHAR, status.MPI_SOURCE, tagEnd, MPI_COMM_WORLD, &status);
 374        nbEndMsg++;
 375     }
 376   }
 377 }
 378
 379 // Reception of pending messages and counting of end messages
 380 do{ // Loop over the remaining incoming/waited messages
 381   MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
 382   MPI_Get_count(&status, MPI_CHAR, &size);
 383   // Actual reception in dummy buffer
 384   MPI_Recv(dummyBuffer, size, MPI_CHAR, status.MPI_SOURCE, status.MPI_TAG, MPI_COMM_WORLD, &status);
 385   if(status.MPI_TAG == tagEnd){ // Counting of end messages
 386     nbEndMsg++;
 387   }
 388   MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &arrived, &status);
 389 }while(arrived == 1 || nbEndMsg < nbDeps + 1);
 390 \end{Listing}
 391 %\end{algorithm}
 392
 393 As in the sending function, the main loop of receptions is done while the
 394 iterative process is not \texttt{Finished}.  In line~11, the thread waits until
 395 a message arrives on the node.  Then, it performs the actual reception and the
 396 corresponding subsequent actions (potential data copies for data messages and
 397 counting for end messages). Lines 20-23 check that all data dependencies have
 398 been received before unlocking the \texttt{lockRec} mutex. As mentioned before,
 399 they are not mandatory and are included only to ensure that all data
 400 dependencies are received at the first iteration. Lines 25-28 are required to
 401 manage end messages that arrive on the node \emph{before} it reaches its own
 402 termination process. As the nodes are \emph{not} synchronized, this may happen.
 403 Finally, lines 34-43 perform the receptions of all pending communications,
 404 including the remaining end messages (at least the one from the node itself).
 405
 406 \medskip
 407 So, with those algorithms, we obtain a quite simple and efficient asynchronous
 408 iterative scheme. It is interesting to notice that GPU computing can be easily
 409 included in the computing thread.  This will be fully addressed in
 410 paragraph~\ref{ch6:p2GPUAsync}.  However, before presenting the complete
 411 asynchronous scheme with GPU computing, we have to detail how our initial scheme
 412 can be made synchronous.
 413
 414 \subsection{Synchronization of the  asynchronous scheme}
 415 \label{ch6:p2SsyncOverAsync}
 416
 417 The presence of synchronization in the previous scheme may seem contradictory to our goal,
 418 and obviously, it is neither the simplest way to obtain a synchronous scheme nor
 419 the most efficient (as presented in~\Sec{ch6:part1}). However, it is necessary
 420 for our global convergence detection strategy\index{convergence detection!global}. Recall that the
 421 global convergence\index{convergence!global} is the extension of the local convergence\index{convergence!local} concept to all the
 422 nodes. This implies that all the nodes have to be in local convergence at the
 423 same time to achieve global convergence. Typically, if we use the residual and a
 424 threshold to stop the iterative process, all the nodes have to continue their
 425 local iterative process until \emph{all} of them obtain a residual under the
 426 threshold.
 427
 428 % Moreover, it  allows the gathering  of both operating  modes in a  single source
 429 % code, which tends to improve  the implementation and maintenance costs when both
 430 % versions are required.
 431 % The  second one  is  that
 432 In our context, the interest of being able to dynamically change the operating
 433 mode (sync/async) during the process execution, is that this strongly simplifies
 434 the global convergence detection.  In fact, our past experience in the design
 435 and implementation of global convergence detection in asynchronous
 436 algorithms~\cite{SuperCo05, BCC07, Vecpar08a}, have led us to the conclusion
 437 that although a decentralized detection scheme is possible and may be more
 438 efficient in some situations, its much higher complexity is an
 439 obstacle to actual use in practice, especially in industrial contexts where
 440 implementation/maintenance costs are strong constraints. Moreover, although the
 441 decentralized scheme does not slow down the computations, it requires more iterations than a synchronous version and
 442 thus may induce longer detection times in some cases.  So, the solution we
 443 present below is a good compromise between simplicity and efficiency.  It
 444 consists in dynamically changing the operating mode between asynchronous and synchronous
 445 during the execution of the process in order to check the global convergence.
 446 This is why we need to synchronize our asynchronous scheme.
 447
 448 In each  algorithm of the  initial scheme, we  only exhibit the  additional code
 449 required to change the operating mode.
 450
 451 %\begin{algorithm}[H]
 452 %  \caption{Initialization of the synchronized scheme.}
 453 %  \label{algo:ch6p2Sync}
 454 \begin{Listing}{algo:ch6p2Sync}{Initialization of the synchronized scheme}
 455 // Variables declarations and initialization
 456 ...
 457 omp_lock_t lockStates; // Controls the synchronous exchange of local states
 458 omp_lock_t lockIter;   // Controls the synchronization at the end of each iteration
 459 char localCV = 0;      // Boolean indicating whether the local stabilization is reached or not
 460 int nbOtherCVs = 0;    // Number of other nodes being in local stabilization
 461
 462 // Parameters reading
 463 ...
 464 // MPI initialization
 465 ...
 466 // Data initialization and distribution among the nodes
 467 ...
 468 // OpenMP initialization (mainly declarations and setting up of locks)
 469 ...
 470 omp_init_lock(&lockStates);
 471 omp_set_lock(&lockStates); // Initially locked, unlocked when all state messages are received
 472 omp_init_lock(&lockIter);
 473 omp_set_lock(&lockIter);   // Initially locked, unlocked when all "end of iteration" messages are received
 474
 475 // Threads launching
 476 #pragma omp parallel
 477 {
 478   switch(omp_get_thread_num()){
 479     ...
 480   }
 481 }
 482
 483 // Cleaning of OpenMP locks
 484 ...
 485 omp_test_lock(&lockStates);
 486 omp_unset_lock(&lockStates);
 487 omp_destroy_lock(&lockStates);
 488 omp_test_lock(&lockIter);
 489 omp_unset_lock(&lockIter);
 490 omp_destroy_lock(&lockIter);
 491
 492 // MPI termination
 493 MPI_Finalize();
 494 \end{Listing}
 495 %\end{algorithm}
 496
 497 As can be seen in \Lst{algo:ch6p2Sync}, the synchronization implies two
 498 additional mutex.  The \texttt{lockStates} mutex is used to wait for the
 499 receptions of all state messages coming from the other nodes.  As shown
 500 in \Lst{algo:ch6p2SyncComp}, those messages contain only a boolean indicating
 501 for each node if it is in local convergence\index{convergence!local}. So, once all the states are
 502 received on a node, it is possible to determine if all the nodes are in local
 503 convergence, and thus to detect the global convergence.  The \texttt{lockIter}
 504 mutex is used to synchronize all the nodes at the end of each iteration.  There
 505 are also two new variables that respectively represent the local state of the
 506 node (\texttt{localCV}) according to the iterative process (convergence) and the
 507 number of other nodes that are in local convergence (\texttt{nbOtherCVs}).
 508
 509 The computation thread is where most of the modifications take place, as shown
 510 in \Lst{algo:ch6p2SyncComp}.
 511
 512 %\begin{algorithm}[H]
 513 %  \caption{Computing function in the synchronized scheme.}
 514 %  \label{algo:ch6p2SyncComp}
 515 \begin{Listing}{algo:ch6p2SyncComp}{Computing function in the synchronized scheme}
 516 // Variables declarations and initialization
 517 ...
 518
 519 // Computation loop
 520 while(!Finished){
 521   // Sendings of data dependencies at @\emph{each}@ iteration
 522   // Potential copy of data to be sent in additional buffers
 523   ...
 524   omp_unset_lock(&lockSend);
 525
 526   // Blocking receptions at @\emph{each}@ iteration
 527   omp_set_lock(&lockRec);
 528
 529   // Local computation
 530   // (init of residual, arrays swapping and iteration computation)
 531   ...
 532
 533   // Checking of the stabilization of the local process
 534   // Other conditions than the residual can be added
 535   if(residual <= Threshold){
 536     localCV = 1;
 537   }else{
 538     localCV = 0;
 539   }
 540
 541   // Global exchange of local states of the nodes
 542   for(ind=0; ind<nbP; ++ind){
 543     if(ind != numP){
 544       MPI_Ssend(&localCV, 1, MPI_CHAR, ind, tagState, MPI_COMM_WORLD);
 545     }
 546   }
 547
 548   // Waiting for the state messages receptions from the other nodes
 549   omp_set_lock(&lockStates);
 550
 551   // Determination of global convergence (if all nodes are in local CV)
 552   if(localCV + nbOtherCVs == nbP){
 553     // Entering global CV state
 554     Finished = 1;
 555     // Unlocking of sending thread to start sendings of end messages
 556     omp_unset_lock(&lockSend);
 557     MPI_Ssend(&Finished, 1, MPI_CHAR, numP, tagEnd, MPI_COMM_WORLD);
 558   }else{
 559     // Resetting of information about the states of the other nodes
 560     ...
 561     // Global barrier at the end of each iteration during the process
 562     for(ind=0; ind<nbP; ++ind){
 563       if(ind != numP){
 564         MPI_Ssend(&Finished, 1, MPI_CHAR, ind, tagIter, MPI_COMM_WORLD);
 565       }
 566     }
 567     omp_set_lock(&lockIter);
 568   }
 569
 570
 571   // Updating of the iteration number
 572   iter++;
 573 }
 574 \end{Listing}
 575 %\end{algorithm}
 576
 577 Most of the added code is related to the waiting for specific communications.
 578 Between lines~6 and~7, the use of the flag \texttt{SendsInProgress} is no longer
 579 needed since the sends are performed at each iteration.  In line~12, the
 580 thread waits for the data receptions from its dependencies. In
 581 lines~26-34, the local states are determined and exchanged among all nodes.  A
 582 new message tag (\texttt{tagState}) is required for identifying those messages.
 583 In line~37, the global termination state is determined.  When it is reached,
 584 lines~38-42 change the \texttt{Finished} boolean to stop the iterative process,
 585 and send the end messages.  Otherwise each node resets its local state
 586 information about the other nodes and a global barrier is done between all the
 587 nodes at the end of each iteration with another new tag (\texttt{tagIter}).
 588 That barrier is needed to ensure that data messages from successive iterations
 589 are actually received during the \emph{same} iteration on the destination nodes.
 590 Nevertheless, it is not useful at the termination of the global process as it is
 591 replaced by the global exchange of end messages.
 592
 593 There is no big modification induced by the synchronization in the sending
 594 function. The only change could be the suppression of line~11 that is not useful
 595 in this case.  Apart from that, the function stays the same as
 596 in \Lst{algo:ch6p2BasicAsyncSendings}.
 597
 598 In the reception function, given in \Lst{algo:ch6p2SyncReceptions}, there are
 599 mainly two insertions (in lines~19-30 and 31-40), corresponding to the
 600 additional types of messages to receive.  There is also the insertion of three
 601 variables that are used for the receptions of the new message types.  In
 602 lines~24-29 and 34-39 are located messages counting and mutex unlocking
 603 mechanisms that are used to block the computing thread at the corresponding steps of its
 604 execution. They are similar to the mechanism used for managing the end messages
 605 at the end of the entire process.  Line~23 directly updates the
 606 number of other nodes that are in local convergence by adding the
 607 received state of the source node. This is possible due to the encoding that is used to
 608 represent the local convergence (1) and the non-convergence (0).
 609
 610 %\begin{algorithm}[H]
 611 %  \caption{Reception function in the synchronized scheme.}
 612 %  \label{algo:ch6p2SyncReceptions}
 613 \begin{Listing}{algo:ch6p2SyncReceptions}{Reception function in the synchronized scheme}
 614 // Variables declarations and initialization
 615 ...
 616 int nbStateMsg = 0; // Number of local state messages received
 617 int nbIterMsg = 0;  // Number of "end of iteration" messages received
 618 char recvdState;    // Received state from another node (0 or 1)
 619
 620 // Main loop of receptions
 621 while(!Finished){
 622   // Waiting for an incoming message
 623   MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
 624   if(!Finished){
 625     switch(status.MPI_TAG){ // Actions related to message type
 626       case tagCom: // Management of data messages
 627        ...
 628        break;
 629       case tagEnd: // Management of termination messages
 630        ...
 631        break;
 632       case tagState: // Management of local state messages
 633        // Actual reception of the message
 634        MPI_Recv(&recvdState, 1, MPI_CHAR, status.MPI_SOURCE, tagState, MPI_COMM_WORLD, &status);
 635        // Updates of numbers of stabilized nodes and received state msgs
 636        nbOtherCVs += recvdState;
 637        nbStateMsg++;
 638        // Unlocking of the computing thread when states of all other nodes are received
 639        if(nbStateMsg == nbP-1){
 640          nbStateMsg = 0;
 641          omp_unset_lock(&lockStates);
 642        }
 643        break;
 644       case tagIter: // Management of "end of iteration" messages
 645        // Actual reception of the message in dummy buffer
 646        MPI_Recv(dummyBuffer, 1, MPI_CHAR, status.MPI_SOURCE, tagIter, MPI_COMM_WORLD, &status);
 647        nbIterMsg++; // Update of the nb of iteration messages
 648        // Unlocking of the computing thread when iteration messages are received from all other nodes
 649        if(nbIterMsg == nbP - 1){
 650          nbIterMsg = 0;
 651          omp_unset_lock(&lockIter);
 652        }
 653        break;
 654     }
 655   }
 656 }
 657
 658 // Reception of pending messages and counting of end messages
 659 do{ // Loop over the remaining incoming/waited messages
 660   ...
 661 }while(arrived == 1 || nbEndMsg < nbDeps + 1);
 662 \end{Listing}
 663 %\end{algorithm}
 664
 665 Now that we can synchronize our asynchronous scheme, the final step is to
 666 dynamically alternate the two operating modes in order to regularly check the
 667 global convergence of the iterative process. This is detailed in the following
 668 paragraph together with the inclusion of GPU computing in the final asynchronous
 669 scheme.
 670
 671 \subsection{Asynchronous scheme using MPI, OpenMP and CUDA}
 672 \label{ch6:p2GPUAsync}
 673
 674 As mentioned above, the strategy proposed to obtain a good compromise between
 675 simplicity and efficiency in the asynchronous scheme is to dynamically change
 676 the operating mode of the process.  A good way to obtain a maximal
 677 simplification of the final scheme while preserving good performance is to
 678 perform local and global convergence detections only in synchronous
 679 mode. Moreover, as two successive iterations are sufficient in synchronous mode
 680 to detect local and global convergences, the key is to alternate some
 681 asynchronous iterations with two synchronous iterations until
 682 convergence.
 683
 684 The last problem is to decide \emph{when} to switch from the
 685 asynchronous to the synchronous mode.  Here again, for the sake of simplicity, any
 686 asynchronous mechanism for \emph{detecting} such instant is avoided and we
 687 prefer to use a mechanism that is local to each node. Obviously, that local system must
 688 rely neither on the number of local iterations done nor on the local
 689 convergence.  The former would slow down the fastest nodes according to the
 690 slowest ones.  The latter would provoke too much synchronization because the
 691 residuals on all nodes commonly do not evolve in the same way and, in most
 692 cases, there is a convergence wave phenomenon throughout the elements.  So, a
 693 good solution is to insert a local timer mechanism on each node with a given
 694 initial duration. Then, that duration may be modified during the execution
 695 according to the successive results of the synchronous sections.
 696
 697 Another problem induced by entering synchronous mode from the asynchronous one
 698 is the possibility to receive some data messages
 699 from previous asynchronous iterations during synchronous iterations. This could lead to deadlocks. In order
 700 to avoid this, a wait of the end of previous send is added to the
 701 transition between the two modes.  This is implemented by replacing the variable
 702 \texttt{SendsInProgress} by a mutex \texttt{lockSendsDone} which is unlocked
 703 once all the messages have been sent in the sending function.  Moreover, it is
 704 also necessary to stamp data messages\index{message!stamping} (by the function \texttt{stampData}) with
 705 a Boolean indicating whether they have been sent during a synchronous or
 706 asynchronous iteration.  Then, the \texttt{lockRec} mutex is unlocked only
 707 after to the complete reception of data messages from synchronous
 708 iterations.  The message ordering of point-to-point communications in MPI
 709 together with the barrier at the end of each iteration ensure two important
 710 properties of this mechanism. First, data messages from previous
 711 asynchronous iterations will be received but not taken into account during
 712 synchronous sections. Then, a data message from a synchronous
 713 iteration cannot be received in another synchronous iteration.  In the
 714 asynchronous sections, no additional mechanism is needed as there are no such
 715 constraints concerning the data receptions.
 716
 717 Finally, the required modifications of the previous scheme
 718 % to make it dynamically change its operating mode (asynchronous or synchronous)
 719 are mainly related to the computing thread.  Small additions or modifications
 720 are also required in the main process and the other threads.
 721
 722 In the main process, two new variables are added to store respectively the main
 723 operating mode of the iterative process (\texttt{mainMode}) and the duration of
 724 asynchronous sections (\texttt{asyncDuration}). Those variables are
 725 initialized by the programmer. The \texttt{lockSendsDone} mutex is also declared,
 726 initialized (locked) and destroyed with the other mutex in this process.
 727
 728 In the computing function, shown in \Lst{algo:ch6p2AsyncSyncComp}, the
 729 modifications consist of the insertion of the timer mechanism and of the conditions
 730 to differentiate the actions to be done in each mode.  Some additional variables
 731 are also required to store the current operating mode in action during the
 732 execution (\texttt{curMode}), the starting time of the current asynchronous
 733 section (\texttt{asyncStart}) and the number of successive synchronous
 734 iterations done (\texttt{nbSyncIter}).
 735
 736 %\begin{algorithm}[H]
 737 %  \caption{Computing function in the final asynchronous scheme.}% without GPU computing.}
 738 %  \label{algo:ch6p2AsyncSyncComp}
 739 %\pagebreak
 740 \begin{Listing}{algo:ch6p2AsyncSyncComp}{Computing function in the final asynchronous scheme}% without GPU computing.}
 741 // Variables declarations and initialization
 742 ...
 743 OpMode curMode = SYNC; // Current operating mode (always begin in sync)
 744 double asyncStart;     // Starting time of the current async section
 745 int nbSyncIter = 0;    // Number of sync iterations done in async mode
 746
 747 // Computation loop
 748 while(!Finished){
 749   // Determination of the dynamic operating mode
 750   if(curMode == ASYNC){
 751     // Entering synchronous mode when asyncDuration is reached
 752 @%    // (additional conditions can be specified if needed)
 753 @    if(MPI_Wtime() - asyncStart >= asyncDuration){
 754       // Waiting for the end of previous sends before starting sync mode
 755       omp_set_lock(&lockSendsDone);
 756       curMode = SYNC;              // Entering synchronous mode
 757       stampData(dataToSend, SYNC); // Mark data to send with sync flag
 758       nbSyncIter = 0;
 759     }
 760   }else{
 761     // In main async mode, going back to async mode when the max number of sync iterations are done
 762     if(mainMode == ASYNC){
 763       nbSyncIter++; // Update of the number of sync iterations done
 764       if(nbSyncIter == 2){
 765         curMode = ASYNC;              // Going back to async mode
 766         stampData(dataToSend, ASYNC); // Mark data to send
 767         asyncStart = MPI_Wtime();     // Get the async starting time
 768       }
 769     }
 770   }
 771
 772   // Sendings of data dependencies
 773   if(curMode == SYNC || !SendsInProgress){
 774     ...
 775   }
 776
 777   // Blocking data receptions in sync mode
 778   if(curMode == SYNC){
 779     omp_set_lock(&lockRec);
 780   }
 781
 782   // Local computation
 783   // (init of residual, arrays swapping and iteration computation)
 784   ...
 785
 786   // Checking of convergences (local & global) only in sync mode
 787   if(curMode == SYNC){
 788     // Local convergence checking (residual under threshold)
 789     ...
 790     // Blocking global exchange of local states of the nodes
 791     ...
 792     // Determination of global convergence (all nodes in local CV)
 793     //    Stop of the iterative process and sending of end messages
 794     // or Re-initialization of state information and iteration barrier
 795     ...
 796     }
 797   }
 798
 799   // Updating of the iteration number
 800   iter++;
 801 }
 802 \end{Listing}
 803 %\end{algorithm}
 804
 805 In the sending function, the only modification is the replacement in line~11 of
 806 the assignment of variable \texttt{SendsInProgress} by the unlocking of
 807 \texttt{lockSendsDone}.  Finally, in the reception function, the only
 808 modification is the insertion before line~19
 809 of \Lst{algo:ch6p2BasicAsyncReceptions} of the extraction of the stamp from the
 810 message and its counting among the receipts only if the stamp is \texttt{SYNC}.
 811
 812 The final step to get our complete scheme using GPU is to insert the GPU
 813 management in the computing thread.  The first possibility, detailed
 814 in \Lst{algo:ch6p2syncGPU}, is to simply replace the
 815 CPU kernel (lines~41-43 in \Lst{algo:ch6p2AsyncSyncComp}) by a blocking GPU kernel call.  This includes data
 816 transfers from the node RAM to the GPU RAM, the launching of the GPU kernel, the
 817 waiting for kernel completion and the results transfers from GPU RAM to
 818 node RAM.
 819
 820 %\begin{algorithm}[H]
 821 %  \caption{Computing function in the final asynchronous scheme.}
 822 %  \label{algo:ch6p2syncGPU}
 823 \begin{Listing}{algo:ch6p2syncGPU}{Computing function in the final asynchronous scheme}
 824 // Variables declarations and initialization
 825 ...
 826 dim3 Dg, Db; // CUDA kernel grids
 827
 828 // Computation loop
 829 while(!Finished){
 830   // Determination of the dynamic operating mode, sendings of data dependencies and blocking data receptions in sync mode
 831   ...
 832   // Local GPU computation
 833   // Data transfers from node RAM to GPU
 834   CHECK_CUDA_SUCCESS(cudaMemcpyToSymbol(dataOnGPU, dataInRAM, inputsSize, 0, cudaMemcpyHostToDevice), "Data transfer");
 835   ... // There may be several data transfers: typically A and b in linear problems
 836   // GPU grid definition
 837   Db.x = BLOCK_SIZE_X; // BLOCK_SIZE_# are kernel design dependent
 838   Db.y = BLOCK_SIZE_Y;
 839   Db.z = BLOCK_SIZE_Z;
 840   Dg.x = localSize/BLOCK_SIZE_X + (localSize%BLOCK_SIZE_X ? 1 : 0);
 841   Dg.y = localSize/BLOCK_SIZE_Y + (localSize%BLOCK_SIZE_Y ? 1 : 0);
 842   Dg.z = localSize/BLOCK_SIZE_Z + (localSize%BLOCK_SIZE_Z ? 1 : 0);
 843   // Use of shared memory (when possible)
 844   cudaFuncSetCacheConfig(gpuKernelName, cudaFuncCachePreferShared);
 845   // Kernel call
 846   gpuKernelName<<<Dg,Db>>>(... @\emph{kernel parameters}@ ...);
 847   // Waiting for kernel completion
 848   cudaDeviceSynchronize();
 849   // Results transfer from GPU to node RAM
 850   CHECK_CUDA_SUCCESS(cudaMemcpyFromSymbol(resultsInRam, resultsOnGPU, resultsSize, 0, cudaMemcpyDeviceToHost), "Results transfer");
 851   // Potential post-treatment of results on the CPU
 852   ...
 853
 854   // Convergences checking
 855   ...
 856 }
 857 \end{Listing}
 858 %\end{algorithm}
 859
 860 This scheme provides asynchronism through a cluster of GPUs as well as a
 861 complete overlap of communications with GPU computations (similarly
 862 to~\Sec{ch6:part1}).  However, the autonomy of GPU devices according to their
 863 host can be further exploited in order to perform some computations on the CPU
 864 while the GPU kernel is running. The nature of computations that can be done by
 865 the CPU may vary depending on the application. For example, when processing data
 866 streams (pipelines), pre-processing of next data item and/or post-processing of
 867 previous result can be done on the CPU while the GPU is processing the current
 868 data item.  In other cases, the CPU can perform \emph{auxiliary}
 869 computations\index{computation!auxiliary}
 870 that are not absolutely required to obtain the result but that may accelerate
 871 the entire iterative process.  Another possibility would be to distribute the
 872 main computations between the GPU and CPU. However, this
 873 usually leads to poor performance increases. This is mainly due to data
 874 dependencies that often require additional transfers between CPU and GPU.
 875
 876 So, if we consider that the application enables such overlap of
 877 computations, its implementation is straightforward as it consists in inserting
 878 the additional CPU computations between lines~23 and~24
 879 in \Lst{algo:ch6p2syncGPU}.  Nevertheless, such scheme is fully efficient only
 880 if the computation times on both sides are similar.
 881
 882 In some cases, especially with auxiliary computations, another interesting
 883 solution is to add a fourth CPU thread to perform them. This suppresses the
 884 duration constraint over those optional computations as they are performed in
 885 parallel of the main iterative process, without blocking it.  Moreover, this
 886 scheme stays coherent with current architectures as most nodes include four CPU
 887 cores.  The algorithmic scheme of such context of complete overlap of
 888 CPU/GPU computations and communications is described in
 889 Listings~\ref{algo:ch6p2FullOverAsyncMain},~\ref{algo:ch6p2FullOverAsyncComp1}
 890 and~\ref{algo:ch6p2FullOverAsyncComp2}, where we suppose that  auxiliary
 891 computations use intermediate results of the main computation process from any previous iteration. This may be
 892 different according to the application.
 893
 894 %\begin{algorithm}[H]
 895 %  \caption{Initialization of the main process of complete overlap with asynchronism.}
 896 % \label{algo:ch6p2FullOverAsyncMain}
 897 \pagebreak
 898 \begin{Listing}{algo:ch6p2FullOverAsyncMain}{Initialization of the main process of complete overlap with asynchronism}
 899 // Variables declarations and initialization
 900 ...
 901 omp_lock_t lockAux;   // Informs main thread about new aux results
 902 omp_lock_t lockRes;   // Informs aux thread about new results
 903 omp_lock_t lockWrite; // Controls exclusion of results access
 904 ... auxRes ... ;      // Results of auxiliary computations
 905
 906 // Parameters reading, MPI initialization, data initialization and distribution
 907 ...
 908 // OpenMP initialization
 909 ...
 910 omp_init_lock(&lockAux);
 911 omp_set_lock(&lockAux);  // Unlocked when new aux results are available
 912 omp_init_lock(&lockRes);
 913 omp_set_lock(&lockRes);  // Unlocked when new results are available
 914 omp_init_lock(&lockWrite);
 915 omp_unset_lock(&lockWrite); // Controls access to results from threads
 916
 917 #pragma omp parallel
 918 {
 919   switch(omp_get_thread_num()){
 920     case COMPUTATION :
 921     computations(... @\emph{relevant parameters}@ ...);
 922     break;
 923
 924     case AUX_COMPS :
 925     auxComps(... @\emph{relevant parameters}@ ...);
 926     break;
 927
 928     case SENDINGS :
 929     sendings();
 930     break;
 931
 932     case RECEPTIONS :
 933     receptions();
 934     break;
 935   }
 936 }
 937
 938 // Cleaning of OpenMP locks
 939 ...
 940 omp_test_lock(&lockAux);
 941 omp_unset_lock(&lockAux);
 942 omp_destroy_lock(&lockAux);
 943 omp_test_lock(&lockRes);
 944 omp_unset_lock(&lockRes);
 945 omp_destroy_lock(&lockRes);
 946 omp_test_lock(&lockWrite);
 947 omp_unset_lock(&lockWrite);
 948 omp_destroy_lock(&lockWrite);
 949
 950 // MPI termination
 951 MPI_Finalize();
 952 \end{Listing}
 953 %\end{algorithm}
 954
 955 %\begin{algorithm}[H]
 956 %  \caption{Computing function in the final asynchronous scheme with CPU/GPU overlap.}
 957 %  \label{algo:ch6p2FullOverAsyncComp1}
 958 \pagebreak
 959 \begin{Listing}{algo:ch6p2FullOverAsyncComp1}{Computing function in the final asynchronous scheme with CPU/GPU overlap}
 960 // Variables declarations and initialization
 961 ...
 962 dim3 Dg, Db; // CUDA kernel grids
 963
 964 // Computation loop
 965 while(!Finished){
 966   // Determination of the dynamic operating mode, sendings of data dependencies and blocking data receptions in sync mode
 967   ...
 968   // Local GPU computation
 969   // Data transfers from node RAM to GPU, GPU grid definition and init of shared mem
 970   CHECK_CUDA_SUCCESS(cudaMemcpyToSymbol(dataOnGPU, dataInRAM, inputsSize, 0, cudaMemcpyHostToDevice), "Data transfer");
 971   ...
 972   // Kernel call
 973   gpuKernelName<<<Dg,Db>>>(... @\emph{kernel parameters}@ ...);
 974   // Potential pre/post-treatments in pipeline like computations
 975   ...
 976   // Waiting for kernel completion
 977   cudaDeviceSynchronize();
 978   // Results transfer from GPU to node RAM
 979   omp_set_lock(&lockWrite); // Wait for write access to resultsInRam
 980   CHECK_CUDA_SUCCESS(cudaMemcpyFromSymbol(resultsInRam, resultsOnGPU, resultsSize, 0, cudaMemcpyDeviceToHost), "Results transfer");
 981   // Potential post-treatments in non-pipeline computations
 982   ...
 983   omp_unset_lock(&lockWrite); // Give back read access to aux thread
 984   omp_test_lock(&lockRes);
 985   omp_unset_lock(&lockRes);   // Informs aux thread of new results
 986
 987   // Auxiliary computations availability checking
 988   if(omp_test_lock(&lockAux)){
 989     // Use auxRes to update the iterative process
 990     ... // May induce additional GPU transfers
 991   }
 992
 993   // Convergences checking
 994   if(curMode == SYNC){
 995     // Local convergence checking and global exchange of local states
 996     ...
 997     // Determination of global convergence (all nodes in local CV)
 998     if(cvLocale == 1 && nbCVLocales == nbP-1){
 999       // Stop of the iterative process and sending of end messages
1000       ...
1001       // Unlocking of aux thread for termination
1002       omp_test_lock(&lockRes);
1003       omp_unset_lock(&lockRes);
1004     }else{
1005       // Re-initialization of state information and iteration barrier
1006       ...
1007     }
1008   }
1009 }
1010 \end{Listing}
1011 %\end{algorithm}
1012
1013 %\begin{algorithm}[H]
1014 %  \caption{Auxiliary computing function in the final asynchronous scheme with CPU/GPU overlap.}
1015 %  \label{algo:ch6p2FullOverAsyncComp2}
1016 \pagebreak
1017 \begin{Listing}{algo:ch6p2FullOverAsyncComp2}{Auxiliary computing function in the final asynchronous scheme with CPU/GPU overlap}
1018 // Variables declarations and initialization
1019 ... auxInput ... // Local array for input data
1020
1021 // Computation loop
1022 while(!Finished){
1023   // Data copy from resultsInRam into auxInput
1024   omp_set_lock(&lockRes);   // Waiting for new results from main comps
1025   if(!Finished){
1026     omp_set_lock(&lockWrite); // Waiting for access to results
1027     for(ind=0; ind<resultsSize; ++ind){
1028       auxInput[ind] = resultsInRam[ind];
1029     }
1030     omp_unset_lock(&lockWrite); // Give back write access to main thread
1031     // Auxiliary computations with possible interruption at the end
1032     for(ind=0; ind<auxSize && !Finished; ++ind){
1033       // Computation of auxRes array according to auxInput
1034       ...
1035     }
1036     // Informs main thread that new aux results are available in auxData
1037     omp_test_lock(&lockAux); // Ensures mutex is locked when unlocking
1038     omp_unset_lock(&lockAux);
1039   }
1040 }
1041 \end{Listing}
1042 %\end{algorithm}
1043
1044 As can be seen in \Lst{algo:ch6p2FullOverAsyncMain}, there are three additional
1045 mutex (\texttt{lockAux}, \texttt{lockRes} and \texttt{lockWrite}) that are used
1046 respectively to inform the main computation thread that new auxiliary results
1047 are available (lines~20-21 in \Lst{algo:ch6p2FullOverAsyncComp2} and line~29 in
1048 \Lst{algo:ch6p2FullOverAsyncComp1}), to inform the auxiliary thread that new
1049 results from the main thread are available (lines~25-26
1050 in \Lst{algo:ch6p2FullOverAsyncComp1} and line~7
1051 in \Lst{algo:ch6p2FullOverAsyncComp2}), and to perform exclusive accesses to the
1052 results from those two threads (lines~20,~24
1053 in \Lst{algo:ch6p2FullOverAsyncComp1} and 9,~13
1054 in \Lst{algo:ch6p2FullOverAsyncComp2}).  Also, an additional array
1055 (\texttt{auxRes}) is required to store the results of the auxiliary computations
1056 as well as a local array for the input of the auxiliary function
1057 (\texttt{auxInput}). That last function has the same general organization as the
1058 send/receive ones, that is a global loop conditioned by the end of the global
1059 process.  At each iteration in this function, the thread waits for the
1060 availability of new results produced by the main computation thread. This avoids
1061 to perform the same computations several times with the same input data.
1062 Then, input data of auxiliary computations
1063 % (as supposed here, they often
1064 % correspond to the results of the main computations, but may sometimes be
1065 % something else)
1066 is copied with a mutual exclusion mechanism. Finally, auxiliary
1067 computations are performed.  When they are completed, the associated mutex is
1068 unlocked to signal the availability of those auxiliary results to the main
1069 computing thread.  The main thread regularly checks this availability at the end
1070 of its iterations and takes them into account whenever this is possible.
1071
1072 Finally, we  obtain an algorithmic  scheme allowing maximal  overlap between
1073 CPU and  GPU computations as well  as communications. It is  worth noticing that
1074 such scheme is also usable for systems without GPUs but 4-cores nodes.
1075
1076 \subsection{Experimental validation}
1077 \label{sec:ch6p2expes}
1078
1079 As in~\Sec{ch6:part1},  we validate the  feasibility of our  asynchronous scheme
1080 with  some experiments  performed with  a representative  example  of scientific
1081 application.      It     is    a     three-dimensional     version    of     the
1082 advection-diffusion-reaction   process\index{PDE example}  that   models  the   evolution   of  the
1083 concentrations of  two chemical  species in shallow  waters. As this  process is
1084 dynamic in time,  the simulation is performed for a  given number of consecutive
1085 time steps.  This  implies two nested loops in the  iterative process, the outer
1086 one for the time  steps and the inner one for solving  the problem at each time.
1087 Full details about this PDE  problem can be found in~\cite{ChapNRJ2011}. That
1088 two-stage  iterative process  implies a  few adaptations  of the  general scheme
1089 presented above in order to include the  outer iterations over  the time steps,  but the
1090 inner iterative process closely follows the same scheme.
1091
1092 We show two  series of experiments performed with 16 nodes  of the first cluster
1093 described  in~\Sec{ch6:p1expes}.  The  first one  deals with  the  comparison of
1094 synchronous and asynchronous computations.  The second one is related to the use
1095 of auxiliary computations. In the context of our PDE application, they consist in
1096 the update of the Jacobian of the system.
1097
1098 \subsubsection*{Synchronous and asynchronous computations}
1099
1100 The first experiment allows us  to check that the asynchronous behavior obtained
1101 with our  scheme corresponds  to the expected  one according to  its synchronous
1102 counterpart. So,  we show  in~\Fig{fig:ch6p2syncasync} the computation  times of
1103 our test application  in both modes for different problem  sizes. The size shown
1104 is the number of discrete spatial elements on each side of the cube representing
1105 the  3D   volume.  Moreover,  for  each   of  these  elements,   there  are  the
1106 concentrations of the two chemical species considered.  So, for example, size 30
1107 corresponds in fact to $30\times30\times30\times2$ values.
1108
1109 \begin{figure}[h]
1110   \centering
1111   \includegraphics[width=.75\columnwidth]{Chapters/chapter6/curves/syncasync.pdf}
1112   \caption{Computation times of the test application in synchronous and
1113     asynchronous modes.}
1114   \label{fig:ch6p2syncasync}
1115 \end{figure}
1116
1117 The results obtained show that  the asynchronous version is sensibly faster than
1118 the synchronous one  for smaller problem sizes, then it  becomes similar or even
1119 a bit slower  for  larger problem  sizes.   A  closer  comparison of computation  and
1120 communication times  in each execution  confirms that this behavior  is consistent.
1121 The  asynchronous version is  interesting if communication time
1122 is similar or  larger than computation time.  In our  example, this is the
1123 case up  to a problem  size between 50 and 60.   Then, computations  become longer
1124 than communications. Since asynchronous computations  often require more
1125 iterations to converge, the gain  obtained on the communication side becomes smaller
1126 than the  overhead generated on the computation side, and the  asynchronous version
1127 takes longer.
1128 % So, the observed behavior is fully coherent with the expected behavior.
1129
1130 \subsubsection*{Overlap of auxiliary computations}
1131
1132 In this experiment, we use only the asynchronous version of the application.  In
1133 the context of  our test application, we have an iterative  PDE solver based on
1134 Netwon    resolution.    Such    solvers    are   written    under   the    form
1135 $x=T(x),~x\in\Reals^n$ where $T(x)=x-F'(x)^{-1}F(x)$ and $F'$ is the Jacobian of
1136 the system.  In such cases, it is  necessary to compute the vector $\Delta x$ in
1137 $F'\times \Delta  x=-F$ to update $x$ with  $\Delta x$. There are  two levels of
1138 iterations, the  inner level to get a  stabilized version of $x$,  and the outer
1139 level to compute $x$ at the successive time steps in the simulation process.  In
1140 this context,  classical algorithms either compute  $F'$ only at the  first iteration
1141 of each time step or at some iterations but not all because the  computation of $F'$ is done
1142 in the main iterative process and it has a relatively high computing cost.
1143
1144 However, with the scheme presented above, it is possible to continuously compute
1145 new  versions  of  $F'$  in  parallel  to the  main  iterative  process  without
1146 penalizing  it. Hence,  $F'$ is  updated  as often  as possible  and taken  into
1147 account in  the main computations  when it is  relevant. So, the  Newton process
1148 should be  accelerated a little bit.
1149
1150 We  compare the  performance obtained  with overlapped  Jacobian  updatings and
1151 non-overlapped ones for several problem sizes, see~\Fig{fig:ch6p2aux}.
1152 \begin{figure}[h]
1153   \centering
1154   \includegraphics[width=.75\columnwidth]{Chapters/chapter6/curves/recouvs.pdf}
1155   \caption{Computation times  with or without overlap  of Jacobian updatings
1156     in asynchronous mode.}
1157   \label{fig:ch6p2aux}
1158 \end{figure}
1159
1160 The  overlap is  clearly efficient  as  the computation  times with  overlapping
1161 Jacobian updatings  are much better  than without overlap.  Moreover,  the ratio
1162 between the  two versions tend  to increase with  the problem size, which  is as
1163 expected. Also, we have tested the application without auxiliary computations at
1164 all, that is, the  Jacobian is computed only once at the  beginning of each time
1165 step of the  simulation. The results for this last version  are quite similar to
1166 the overlapped auxiliary computations, and  even better for small problem sizes.
1167 The fact that no sensible gain can be seen on this range of problem sizes is due
1168 to the limited number of Jacobian updates taken into  account in the main
1169 computation.  This happens when  the Jacobian update is as long as
1170 several  iterations of  the main  process. So,  the benefit  is reduced  in this
1171 particular case.
1172
1173 Those results show two things; first, auxiliary computations do not induce great
1174 overhead in  the whole  process.  Second, for  this particular  application the
1175 choice of updating the Jacobian  matrix as auxiliary computations does not speed
1176 up the iterative process.  This does  not question the parallel scheme in itself
1177 but   merely  points  out   the  difficulty   to  identify   relevant  auxiliary
1178 computations.  Indeed, this identification depends on the considered application
1179 and requires a profound specialized analysis.
1180
1181 Another  interesting choice  could be  the  computation of  load estimation  for
1182 dynamic load  balancing, especially in decentralized  diffusion strategies where
1183 loads are  transferred between  neighboring nodes~\cite{BCVG11}.  In  such case,
1184 the load evaluation and the comparison  with other nodes can be done in parallel
1185 of the  main computations without perturbing  them.
1186
1187 %%% Local Variables:
1188 %%% mode: latex
1189 %%% fill-column: 80
1190 %%% ispell-dictionary: "american"
1191 %%% mode: flyspell
1192 %%% TeX-master: "../../BookGPU.tex"
1193 %%% End: