1 \section{General scheme of asynchronous parallel code with
2 computation/communication overlapping}
5 In the previous section, we have seen how to efficiently implement overlap of
6 computations (CPU and GPU) with communications (GPU transfers and internode
7 communications). However, we have previously shown that for some parallel
8 iterative algorithms, it is sometimes even more efficient to use an asynchronous
9 scheme of iterations\index{iterations asynchronous} \cite{HPCS2002,ParCo05,Para10}. In that case, the nodes do
not wait for each other; instead, each of them performs its iterations using the
latest external data it has received from the other nodes, even if those data
were produced \emph{before} the previous iteration on those nodes.
14 Formally, if we denote by $f=(f_1,...,f_n)$ the function representing the
15 iterative process and by $x^t=(x_1^t,...,x_n^t)$ the values of the $n$ elements of
16 the system at iteration $t$, we pass from a synchronous iterative scheme of the
17 form given in \Alg{algo:ch6p2sync}
\begin{algorithm}[H]
  \caption{synchronous iterative scheme}\label{algo:ch6p2sync}
  $x^{0}=(x_{1}^{0},...,x_{n}^{0})$\;
  \For{$t=0,1,...$}{
    \For{$i=1,...,n$}{
      $x_{i}^{t+1}=f_{i}(x_{1}^t,...,x_i^t,...,x_{n}^t)$\;
    }
  }
\end{algorithm}
40 to an asynchronous iterative scheme of the form given in \Alg{algo:ch6p2async}.\\
\begin{algorithm}[H]
  \caption{asynchronous iterative scheme}\label{algo:ch6p2async}
  $x^{0}=(x_{1}^{0},...,x_{n}^{0})$\;
  \For{$t=0,1,...$}{
    \For{$i=1,...,n$}{
      $x_{i}^{t+1}=\left\{
        \begin{array}{ll}
          x_i^t & \text{if } i \text{ is \emph{not} updated at iteration } t\\
          f_i(x_1^{s_1^i(t)},...,x_n^{s_n^i(t)}) & \text{if } i \text{ is updated at iteration } t
        \end{array}
      \right.$\;
    }
  }
\end{algorithm}
In this scheme, $s_j^i(t)$ is the iteration at which the value of element $j$
used on element $i$ at iteration $t$ was produced (see, for example,~\cite{BT89,FS2000}
for further details).
Such schemes are called AIAC\index{AIAC} for \emph{Asynchronous Iterations and
Asynchronous Communications}. They combine two sources of asynchronism, which
respectively account for the different computation speeds of the computing
elements and for the communication delays between them.
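
As a small illustrative example (not taken from a specific application),
consider a system of $n=2$ elements mapped on two nodes, where the data of
element~2 reach element~1 only every third iteration. One possible execution of
the asynchronous scheme then gives, for every iteration $t$ at which element~1
is updated,
\[
x_1^{t+1}=f_1\big(x_1^{t},\,x_2^{s_2^1(t)}\big) \qquad\text{with}\qquad
s_2^1(t)=3\left\lfloor t/3\right\rfloor\le t,
\]
whereas the synchronous scheme imposes $s_j^i(t)=t$ for all $i$ and $j$.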
79 The key feature of such algorithmic schemes
80 is that they may be faster than their synchronous counterparts due to the
implied total overlap of computations with communications: in fact, this scheme
suppresses all the idle times induced by node synchronizations between iterations.
84 However, the efficiency of such a scheme is directly linked to the frequency at
85 which new data arrives on each node. Typically,
if a node receives newer data only every four or five local iterations, it is
highly probable that the evolution of its local iterative process will be
88 slower than if it receives data at every iteration. The key point here is that
89 not only does this frequency depend on the hardware configuration of the
90 parallel system but it also depends on the software that is used to
91 implement the algorithmic scheme.
93 The impact of the programming environments used to implement asynchronous
94 algorithms has already been investigated in~\cite{SuperCo05}. Although the
95 features required to efficiently implement asynchronous schemes have not
96 changed, the available programming environments and computing hardware have
97 evolved, in particular now that GPUs are available. So, there is a need to reconsider the
98 implementation schemes of AIAC according to the new de facto standards for parallel
99 programming (communications and threads) as well as the integration of the GPUs.
One of the main objectives here is to obtain a maximal overlap between the
activities of the three types of devices: the CPU, the GPU, and the
network. Another objective is to present what we think
is the best compromise between the simplicity and maintainability of the
implementation on one side and its
105 performance on the other side. This is especially important for industries where
106 implementation and maintenance costs are strong constraints.
108 For the sake of clarity, we present the different algorithmic schemes in a progressive
109 order of complexity, from the basic asynchronous scheme to the complete scheme
110 with full overlap. Between these two extremes, we propose a synchronization
111 mechanism on top of our asynchronous scheme that can be used either statically or
112 dynamically during the application execution.
114 Although there exist several programming environments for internode
115 communications, multithreading, and GPU programming, a few of them have
116 become \emph{de facto standards}, due to their good stability, their ease
117 of use, and/or their wide adoption by the scientific community.
118 Therefore, as in the previous section, all the schemes presented in the following use MPI~\cite{MPI},
119 OpenMP~\cite{openMP}, and CUDA~\cite{CUDA}. However, there is no loss of
120 generality as these schemes may easily be implemented with other libraries.
122 Finally, in order to stay as clear as possible, only the parts of code and
123 variables related to the control of parallelism (communications, threads, etc.)
124 are presented in our schemes. The inner organization of data is not detailed as
125 it depends on the application. We only consider that we have two data arrays
(previous version and current version) and communication buffers. However, in
most cases, those buffers can correspond to the data arrays themselves, which
avoids additional data copies.
130 \subsection{A basic asynchronous scheme}
131 \label{ch6:p2BasicAsync}
133 The first step toward our complete scheme is to implement a basic asynchronous
134 scheme that includes an actual overlap of the communications with the
135 computations\index{overlap!computation and communication}. In order to ensure that the communications are actually performed
136 in parallel with the computations, it is necessary to use different threads. It
137 is important to remember that asynchronous communications provided in
138 communication libraries such as MPI are not systematically performed in parallel with
139 the computations~\cite{ChVCV13,Hoefler08a}. So, the logical and classical way
140 to implement such an overlap is to use three threads: one for
141 computing, one for sending, and one for receiving. Moreover, since
142 the communication is performed by threads, blocking synchronous communications\index{MPI!communication!blocking}\index{MPI!communication!synchronous}
143 can be used without deteriorating the overall performance.
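
Since the sending and the receiving threads both call MPI functions
concurrently, the MPI library must be initialized with full thread support. The
following minimal check (a sketch that is not part of the original listings; it
assumes it is placed in \texttt{main} with \texttt{<mpi.h>} and
\texttt{<stdio.h>} included) verifies the thread level actually provided:
\begin{Listing}{algo:ch6p2ThreadLevelCheck}{checking the MPI thread support level (illustrative sketch)}
int provided; // Thread support level actually provided by the MPI library
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
if(provided < MPI_THREAD_MULTIPLE){
  fprintf(stderr, "MPI_THREAD_MULTIPLE is not supported by this MPI library\n");
  MPI_Abort(MPI_COMM_WORLD, 1); // The three-thread scheme cannot run safely
}
\end{Listing}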
In this basic version, the termination\index{termination} of the global process is decided
individually on each node, according to its own local criterion. This can be guided either by a
maximal number of iterations or by a local convergence detection\index{convergence detection}. The important step at
the end of the process is to perform the receptions of all pending
communications in order to ensure the proper termination of the two communication
threads.
So, the global organization of this scheme is given in \Lst{algo:ch6p2BasicAsync}.
157 \begin{Listing}{algo:ch6p2BasicAsync}{initialization of the basic asynchronous scheme}
158 // Variables declaration and initialization
// Controls the sendings from the computing thread
omp_lock_t lockSend;
// Ensures the initial reception of external data
omp_lock_t lockRec;
char Finished = 0; // Boolean indicating the end of the process
// Boolean indicating if previous data sendings are still in progress
char SendsInProgress = 0;
// Threshold of the residual for convergence detection
double Threshold;
169 // Parameters reading
172 // MPI initialization
173 MPI_Init_thread(argc, argv, MPI_THREAD_MULTIPLE, &provided);
174 MPI_Comm_size(MPI_COMM_WORLD, &nbP);
175 MPI_Comm_rank(MPI_COMM_WORLD, &numP);
177 // Data initialization and distribution among the nodes
180 // OpenMP initialization (mainly declarations and setting up of locks)
181 omp_set_num_threads(3);
182 omp_init_lock(&lockSend);
183 omp_set_lock(&lockSend);//Initially locked, unlocked to start sendings
184 omp_init_lock(&lockRec);
185 //Initially locked, unlocked when initial data are received
186 omp_set_lock(&lockRec);
switch(omp_get_thread_num()){
  case 0 : // Computing thread
    computations(... @\emph{relevant parameters}@ ...);
    break;
  case 1 : // Sending thread
    sendings(... @\emph{relevant parameters}@ ...);
    break;
  case 2 : // Reception thread
    receptions(... @\emph{relevant parameters}@ ...);
    break;
}
205 // Cleaning of OpenMP locks
206 omp_test_lock(&lockSend);
207 omp_unset_lock(&lockSend);
208 omp_destroy_lock(&lockSend);
209 omp_test_lock(&lockRec);
210 omp_unset_lock(&lockRec);
211 omp_destroy_lock(&lockRec);
224 In this scheme, the \texttt{lockRec} mutex\index{OpenMP!mutex} is not mandatory.
225 It is only used to ensure that data dependencies are actually exchanged at the
226 first iteration of the process. Data initialization and distribution
227 (lines~20--21) are not detailed here because they are directly related to the
228 application. The important point is that, in most cases, they should be done
229 before the iterative process. The computing function is given in
230 \Lst{algo:ch6p2BasicAsyncComp}.
235 \begin{Listing}{algo:ch6p2BasicAsyncComp}{computing function in the basic asynchronous scheme}
236 // Variables declaration and initialization
237 int iter = 1; // Number of the current iteration
238 double difference; // Variation of one element between two iterations
239 double residual; // Residual of the current iteration
243 // Sending of data dependencies if there is no previous sending
245 if(!SendsInProgress){
246 // Potential copy of data to be sent into additional buffers
// Change of sending state
SendsInProgress = 1;
250 omp_unset_lock(&lockSend);
253 // Blocking receptions at the first iteration
255 omp_set_lock(&lockRec);
// Initialization of the residual
residual = 0.0;
260 // Swapping of data arrays (current and previous)
261 tmp = current; // Pointers swapping to avoid
262 current = previous; // actual data copies between
263 previous = tmp; // the two data versions
264 // Computation of current iteration over local data
265 for(ind=0; ind<localSize; ++ind){
266 // Updating of current array using previous array
268 // Updating of the residual
269 // (max difference between two successive iterations)
270 difference = fabs(current[ind] - previous[ind]);
271 if(difference > residual){
272 residual = difference;
276 // Checking of the end of the process (residual under threshold)
277 // Other conditions can be added to the termination detection
if(residual <= Threshold){
  Finished = 1; // Stopping of the local iterative process
  omp_unset_lock(&lockSend); // Activation of end messages sendings
281 MPI_Ssend(&Finished, 1, MPI_CHAR, numP, tagEnd, MPI_COMM_WORLD);
// Updating of the iteration number
iter++;
290 As mentioned above, it can be seen in lines~19--21 of \Lst{algo:ch6p2BasicAsyncComp}
291 that the \texttt{lockRec} mutex is used only at the first iteration to wait for
292 the initial data dependencies before the computations. The residual\index{residual}, initialized
293 in line~24 and computed in lines~35--38, is defined by the maximal difference
294 between the elements from two consecutive iterations. It is classically used to
295 detect the local convergence of the process on each node. In the more
296 complete schemes presented in the sequel, a global termination detection that
297 takes the states of all the nodes into account will be exhibited.
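
With the notations of the listing, the residual computed on a node at a given
iteration is thus simply the max-norm of the local variation between the two
data arrays:
\[
\mathit{residual} \;=\; \max_{0\,\le\, \mathit{ind}\, <\, \mathit{localSize}}
\big|\,\mathit{current}[\mathit{ind}]-\mathit{previous}[\mathit{ind}]\,\big| .
\]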
299 Finally, the local convergence is tested and updated when necessary. In line~45,
300 the \texttt{lockSend} mutex is unlocked to allow the sending function to send
301 final messages to the dependency nodes. Those messages are required to keep the
302 reception function alive until all the final messages have been received.
303 Otherwise, a node could stop its reception function while other nodes are
304 still trying to communicate with it. Moreover, a local sending of a final
305 message to the node itself is required (line~46) to ensure that the reception
306 function will not stay blocked in a message probing
(see~\Lst{algo:ch6p2BasicAsyncReceptions}, line~12). This may happen if the node
receives the final messages from its dependencies \emph{before} reaching its own
local convergence.
All the messages except this final local one are sent by the sending function
described in \Lst{algo:ch6p2BasicAsyncSendings}.
313 The main loop is only conditioned by the end of the computing process (line~4).
314 At each iteration, the thread waits for the permission from the computing thread
315 (according to the \texttt{lockSend} mutex). Then, data are sent with
316 blocking synchronous communications. The \texttt{SendsInProgress} boolean
317 allows the computing thread to skip data sendings as long as a previous sending
318 is in progress. This skip is possible due to the nature of asynchronous
319 algorithms that allows such \emph{message loss}\index{message!loss/miss} or \emph{message miss}. After
320 the main loop, the final messages are sent to the dependencies of the node.
325 \begin{Listing}{algo:ch6p2BasicAsyncSendings}{sending function in the basic asynchronous scheme}
326 // Variables declaration and initialization
330 omp_set_lock(&lockSend); // Waiting for signal from the comp. thread
332 // Blocking synchronous sends to all dependencies
333 for(i=0; i<nbDeps; ++i){
334 MPI_Ssend(&dataToSend[deps[i]], nb_data, type_of_data, deps[i], tagCom, MPI_COMM_WORLD);
336 SendsInProgress = 0; // Indicates that the sendings are done
339 // At the end of the process, sendings of final messages
340 for(i=0; i<nbDeps; ++i){
341 MPI_Ssend(&Finished, 1, MPI_CHAR, deps[i], tagEnd, MPI_COMM_WORLD);
The last function, detailed in \Lst{algo:ch6p2BasicAsyncReceptions}, performs all the message receptions.
\begin{Listing}{algo:ch6p2BasicAsyncReceptions}{reception function in the basic asynchronous scheme}
351 // Variables declaration and initialization
char countReceipts = 1; // Boolean indicating whether receptions are
// counted or not (used only at the first iteration)
354 int nbEndMsg = 0; // Number of end messages received
355 int arrived = 0; // Boolean indicating if a message has arrived
356 int srcNd; // Source node of the message
357 int size; // Message size
359 // Main loop of receptions
361 // Waiting for an incoming message
362 MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
364 // Management of data messages
365 switch(status.MPI_TAG){
366 case tagCom: // Management of data messages
367 // Get the source node of the message
368 srcNd = status.MPI_SOURCE;
369 // Actual data reception in the corresponding buffer
370 MPI_Recv(dataBufferOf(srcNd), nbDataOf(srcNd), dataTypeOf(srcNd), srcNd, tagCom, MPI_COMM_WORLD, &status);
371 // Unlocking of the computing thread when data are received
372 // from all dependencies
373 if(countReceipts == 1 && ... @\emph{receptions from ALL dependencies}@ ...){
374 omp_unset_lock(&lockRec);
375 countReceipts = 0; // No more counting after first iteration
378 case tagEnd: // Management of end messages
379 // Actual end message reception in dummy buffer
380 MPI_Recv(dummyBuffer, 1, MPI_CHAR, status.MPI_SOURCE, tagEnd, MPI_COMM_WORLD, &status);
386 // Reception of pending messages and counting of end messages
387 do{ // Loop over the remaining incoming/end messages
388 MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
389 MPI_Get_count(&status, MPI_CHAR, &size);
390 // Actual reception in dummy buffer
391 MPI_Recv(dummyBuffer, size, MPI_CHAR, status.MPI_SOURCE, status.MPI_TAG, MPI_COMM_WORLD, &status);
if(status.MPI_TAG == tagEnd){ // Counting of end messages
  nbEndMsg++;
}
395 MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &arrived, &status);
396 }while(arrived == 1 || nbEndMsg < nbDeps + 1);
400 As in the sending function, the main loop of receptions is done while the
401 iterative process is not \texttt{Finished}. In line~12, the thread waits until
402 a message arrives on the node. Then, it performs the actual reception and the
403 corresponding subsequent actions (potential data copies for data messages and
counting for end messages). Lines~23--26 check, only at the first iteration of
computations, that all data dependencies have been received before unlocking the
\texttt{lockRec} mutex. Although this check is not mandatory, it ensures that the
computations do not start before all the initial data dependencies have been
received.
409 Lines~28--31 are required to manage end messages that arrive on the
410 node \emph{before} it reaches its own termination process. As the nodes are
411 \emph{not} synchronized, this may happen. Finally, lines~37--46 perform the
412 receptions of all pending communications, including the remaining end messages
413 (at least the one from the node itself).
416 So, with these algorithms, we obtain a quite simple and efficient asynchronous
417 iterative scheme. It is interesting to notice that GPU computing can be easily
418 included in the computing thread. This will be fully addressed in
419 Section~\ref{ch6:p2GPUAsync}. However, before presenting the complete
420 asynchronous scheme with GPU computing, we have to detail how our initial scheme
421 can be made synchronous.
423 \subsection{Synchronization of the asynchronous scheme}
424 \label{ch6:p2SsyncOverAsync}
Adding synchronization to the previous scheme may seem contradictory to our goal,
and obviously, it is neither the simplest way to obtain a synchronous scheme nor
the most efficient one (as presented in~\Sec{ch6:part1}). However, it is necessary
429 for our global convergence detection strategy\index{convergence detection!global}. Recall that the
430 global convergence\index{convergence!global} is the extension of the local convergence\index{convergence!local} concept to all the
431 nodes. This implies that all the nodes have to be in local convergence at the
432 same time to achieve global convergence. Typically, if we use the residual and a
threshold to stop the iterative process, all the nodes have to continue their
local iterative process until \emph{all} of them obtain a residual under the
threshold.
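
With these notations, and denoting by $r_p$ the residual computed on node $p$
at a given (synchronous) iteration, the global stopping criterion can simply be
written as
\[
\max_{1\le p\le N} r_p \;\le\; \varepsilon,
\]
where $N$ is the total number of nodes (\texttt{nbP} in the listings) and
$\varepsilon$ the convergence threshold (\texttt{Threshold}).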
441 In our context, being able to dynamically change the operating
442 mode (sync/async) during the process execution strongly simplifies
443 the global convergence detection. In fact, our past experience in the design
444 and implementation of global convergence detection in asynchronous
445 algorithms~\cite{SuperCo05, BCC07, Vecpar08a} has led us to the conclusion
446 that although a decentralized detection scheme is possible and may be more
447 efficient in some situations, its much higher complexity is an
448 obstacle to actual use in practice, especially in industrial contexts where
449 implementation/maintenance costs are strong constraints. Moreover, although the
450 decentralized scheme does not slow down the computations, it requires more iterations than a synchronous version and
451 thus may induce longer detection times in some cases. So, the solution we
452 present below is a good compromise between simplicity and efficiency. It
453 consists in dynamically changing the operating mode between asynchronous and synchronous
454 during the execution of the process in order to check the global convergence.
455 This is why we need to synchronize our asynchronous scheme.
457 In each algorithm of the initial scheme, we only give the additional code
458 required to change the operating mode.
463 \begin{Listing}{algo:ch6p2Sync}{initialization of the synchronized scheme}
464 // Variables declarations and initialization
466 // Controls the synchronous exchange of local states
467 omp_lock_t lockStates;
// Controls the synchronization at the end of each iteration
omp_lock_t lockIter;
//Boolean indicating whether the local stabilization is reached or not
char localCV = 0;
// Number of other nodes being in local stabilization
int nbOtherCVs = 0;
475 // Parameters reading
477 // MPI initialization
479 // Data initialization and distribution among the nodes
481 // OpenMP initialization (mainly declarations and setting up of locks)
483 omp_init_lock(&lockStates);
484 // Initially locked, unlocked when all state messages are received
485 omp_set_lock(&lockStates);
486 omp_init_lock(&lockIter);
// Initially locked, unlocked when all "end of iteration" messages are
// received
489 omp_set_lock(&lockIter);
494 switch(omp_get_thread_num()){
499 // Cleaning of OpenMP locks
501 omp_test_lock(&lockStates);
502 omp_unset_lock(&lockStates);
503 omp_destroy_lock(&lockStates);
504 omp_test_lock(&lockIter);
505 omp_unset_lock(&lockIter);
506 omp_destroy_lock(&lockIter);
As can be seen in \Lst{algo:ch6p2Sync}, the synchronization implies two
additional mutexes. The \texttt{lockStates} mutex is used to wait for the
515 receptions of all state messages coming from the other nodes. As shown
516 in \Lst{algo:ch6p2SyncComp}, those messages contain only a boolean indicating
517 for each node if it is in local convergence\index{convergence!local}. So, once all the states are
518 received on a node, it is possible to determine if all the nodes are in local
519 convergence and, thus, to detect the global convergence. The \texttt{lockIter}
520 mutex is used to synchronize all the nodes at the end of each iteration. There
521 are also two new variables that represent the local state of the
522 node (\texttt{localCV}) according to the iterative process (convergence) and the
523 number of other nodes that are in local convergence (\texttt{nbOtherCVs}).
525 The computation thread is where most of the modifications take place, as shown
526 in \Lst{algo:ch6p2SyncComp}.
531 \begin{Listing}{algo:ch6p2SyncComp}{computing function in the synchronized scheme}
532 // Variables declarations and initialization
537 // Sending of data dependencies at @\color{white}\emph{\textbf{each}}@ iteration
538 // Potential copy of data to be sent in additional buffers
540 omp_unset_lock(&lockSend);
542 // Blocking receptions at @\color{white}\emph{\textbf{each}}@ iteration
543 omp_set_lock(&lockRec);
546 // (init of residual, arrays swapping and iteration computation)
549 // Checking of the stabilization of the local process
550 // Other conditions than the residual can be added
if(residual <= Threshold){
  localCV = 1; // Local convergence is reached
}else{
  localCV = 0; // Local convergence is not reached
}
557 // Global exchange of local states of the nodes
558 for(ind=0; ind<nbP; ++ind){
560 MPI_Ssend(&localCV, 1, MPI_CHAR, ind, tagState, MPI_COMM_WORLD);
564 // Waiting for the state messages receptions from the other nodes
565 omp_set_lock(&lockStates);
567 //Determination of global convergence (if all nodes are in local CV)
568 if(localCV + nbOtherCVs == nbP){
// Entering global CV state
Finished = 1;
// Unlocking of sending thread to start sendings of end messages
572 omp_unset_lock(&lockSend);
573 MPI_Ssend(&Finished, 1, MPI_CHAR, numP, tagEnd, MPI_COMM_WORLD);
}else{
  // Resetting of information about the states of the other nodes
  nbOtherCVs = 0;
577 // Global barrier at the end of each iteration during the process
578 for(ind=0; ind<nbP; ++ind){
580 MPI_Ssend(&Finished, 1, MPI_CHAR, ind, tagIter, MPI_COMM_WORLD);
583 omp_set_lock(&lockIter);
587 // Updating of the iteration number
593 Most of the added code is related to the waiting for specific communications.
594 Between lines~6 and~7, the use of the flag \texttt{SendsInProgress} is no longer
595 needed since the sends are performed at each iteration. In line~12, the
596 thread waits for the data receptions from its dependencies. In
597 lines~27--34, the local states are determined and exchanged among all nodes. A
598 new message tag (\texttt{tagState}) is required for identifying those messages.
599 In line~37, the global termination state is determined. When it is reached,
600 lines~39--42 change the \texttt{Finished} boolean to stop the iterative process
601 and send the end messages. Otherwise each node resets its local state
602 information about the other nodes and a global barrier is added between all the
603 nodes at the end of each iteration with another new tag (\texttt{tagIter}).
604 That barrier is needed to ensure that data messages from successive iterations
605 are actually received during the \emph{same} iteration on the destination nodes.
606 Nevertheless, it is not useful at the termination of the global process as it is
607 replaced by the global exchange of end messages.
The synchronization induces no major modification in the sending function, which
stays almost the same as in \Lst{algo:ch6p2BasicAsyncSendings}. The only possible
change is the suppression of line~11, which is not useful in this case.
614 In the reception function, given in \Lst{algo:ch6p2SyncReceptions}, there are
615 mainly two insertions (in lines~19--31 and 32--42), corresponding to the
616 additional types of messages to receive. There is also the insertion of three
variables that are used for the receptions of the new message types.
Lines~24--30 and 35--41 contain the message counting and mutex unlocking
mechanisms that are used to block the computing thread at the corresponding steps of its
620 execution. They are similar to the mechanism used for managing the end messages
621 at the end of the entire process. Line~23 directly updates the
622 number of other nodes that are in local convergence by adding the
623 received state of the source node. This is possible due to the encoding that is used to
represent local convergence (1) and nonconvergence (0).
629 \begin{Listing}{algo:ch6p2SyncReceptions}{reception function in the synchronized scheme}
630 // Variables declarations and initialization
632 int nbStateMsg = 0; // Number of local state messages received
633 int nbIterMsg = 0; // Number of "end of iteration" messages received
634 char recvdState; // Received state from another node (0 or 1)
636 // Main loop of receptions
638 // Waiting for an incoming message
639 MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
641 switch(status.MPI_TAG){ // Actions related to message type
642 case tagCom: // Management of data messages
645 case tagEnd: // Management of termination messages
648 case tagState: // Management of local state messages
649 // Actual reception of the message
650 MPI_Recv(&recvdState, 1, MPI_CHAR, status.MPI_SOURCE, tagState, MPI_COMM_WORLD, &status);
651 // Updates of numbers of stabilized nodes and received state msgs
nbOtherCVs += recvdState;
nbStateMsg++;
654 // Unlocking of the computing thread when states of all other
655 // nodes are received
if(nbStateMsg == nbP-1){
  nbStateMsg = 0; // Reset of the state messages counter
  omp_unset_lock(&lockStates);
661 case tagIter: // Management of "end of iteration" messages
662 // Actual reception of the message in dummy buffer
663 MPI_Recv(dummyBuffer, 1, MPI_CHAR, status.MPI_SOURCE, tagIter, MPI_COMM_WORLD, &status);
664 nbIterMsg++; // Update of the number of iteration messages
665 // Unlocking of the computing thread when iteration messages
666 // are received from all other nodes
if(nbIterMsg == nbP - 1){
  nbIterMsg = 0; // Reset of the "end of iteration" messages counter
  omp_unset_lock(&lockIter);
676 // Reception of pending messages and counting of end messages
677 do{ // Loop over the remaining incoming/end messages
679 }while(arrived == 1 || nbEndMsg < nbDeps + 1);
683 Now that we can synchronize our asynchronous scheme, the final step is to
684 dynamically alternate the two operating modes in order to regularly check the
global convergence of the iterative process. This is detailed in the following
section, together with the inclusion of GPU computing in the final asynchronous
scheme.
689 \subsection{Asynchronous scheme using MPI, OpenMP, and CUDA}
690 \label{ch6:p2GPUAsync}
692 As mentioned above, the strategy proposed to obtain a good compromise between
693 simplicity and efficiency in the asynchronous scheme is to dynamically change
694 the operating mode of the process. A good way to obtain a maximal
695 simplification of the final scheme while preserving good performance is to
696 perform local and global convergence detections only in synchronous
697 mode. Moreover, as two successive iterations are sufficient in synchronous mode
to detect local and global convergences, the key is to alternate some
asynchronous iterations with two synchronous iterations until the global
convergence is detected.
702 The last problem is to decide \emph{when} to switch from the
703 asynchronous to the synchronous mode. Here again, for the sake of simplicity, any
asynchronous mechanism for \emph{detecting} such a moment is avoided, and we
prefer to use a mechanism that is local to each node. Obviously, that local system must
rely neither on the number of local iterations done nor on the local
convergence. The former would slow down the fastest nodes to the pace of the
708 slowest ones. The latter would provoke too much synchronization because the
709 residuals on all nodes generally do not evolve in the same way, and in most
710 cases, there is a convergence wave phenomenon throughout the elements. So, a
711 good solution is to insert a local timer mechanism on each node with a given
712 initial duration. Then, that duration may be modified during the execution
713 according to the successive results of the synchronous sections.
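
As an illustration only (the adaptation policy, the \texttt{globalCV} variable,
and the growth factor below are assumptions, not part of the original scheme),
the duration may for instance be increased after every synchronous section that
fails to detect global convergence:
\begin{Listing}{algo:ch6p2asyncDuration}{possible adaptation of the duration of asynchronous sections (illustrative sketch)}
// Executed at the end of a synchronous section (hypothetical policy)
if(!globalCV){
  // Global convergence is still far: check it less frequently
  asyncDuration = asyncDuration * 1.5; // The factor 1.5 is arbitrary
}
\end{Listing}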
715 Another problem induced by entering synchronous mode from the asynchronous one
716 is the possibility of receiving some data messages
717 from previous asynchronous iterations during synchronous iterations. This could lead to deadlocks. In order
to avoid this, a wait for the completion of the previous sends is added to the
transition between the two modes. This is implemented by replacing the variable
720 \texttt{SendsInProgress} with a mutex \texttt{lockSendsDone} which is unlocked
721 once all the messages have been sent in the sending function. Moreover, it is
722 also necessary to stamp data messages\index{message!stamping} (by the function \texttt{stampData}) with
723 a boolean indicating whether they have been sent during a synchronous or
asynchronous iteration. Then, the \texttt{lockRec} mutex is unlocked only
after the complete reception of the data messages from synchronous
726 iterations. The message ordering of point-to-point communications in MPI
727 and the barrier at the end of each iteration ensure two important
728 properties of this mechanism. First, data messages from previous
729 asynchronous iterations will be received but not taken into account during
synchronous sections. Second, a data message from a given synchronous
iteration cannot be received during another synchronous iteration. In the
732 asynchronous sections, no additional mechanism is needed as there are no such
733 constraints concerning the data receptions.
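
A possible implementation of this stamping (a sketch relying on the assumption
that the messages are sent as raw byte buffers with one extra leading byte; the
actual implementation may differ) is the following:
\begin{Listing}{algo:ch6p2stampSketch}{possible implementation of the message stamping (illustrative sketch)}
// Hypothetical stamping: the first byte of each message buffer carries the
// operating mode of the iteration that produced it (the payload starts at
// buffer[1])
void stampData(char *buffer, char mode){ // mode is SYNC or ASYNC
  buffer[0] = mode;
}
char stampOf(char *buffer){ // Extraction of the stamp on the receiver side
  return buffer[0];
}
\end{Listing}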
Finally, the modifications required to make the previous scheme dynamically
change its operating mode are mainly related to the computing thread. Small additions or modifications
738 are also required in the main process and the other threads.
740 In the main process, two new variables are added to store the main
741 operating mode of the iterative process (\texttt{mainMode}) and the duration of
742 asynchronous sections (\texttt{asyncDuration}). Those variables are
743 initialized by the programmer. The mutex \texttt{lockSendsDone} is also declared,
744 initialized (locked), and destroyed with the other mutex in this process.
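
For instance, the corresponding additions in the main process may look as
follows (a sketch in which the initial values are only examples):
\begin{Listing}{algo:ch6p2mainModeSketch}{additional declarations in the main process (illustrative sketch)}
// New variables of the main process (initial values are arbitrary examples)
OpMode mainMode = ASYNC;      // Main operating mode of the iterative process
double asyncDuration = 30.0;  // Duration (in seconds) of the async sections
omp_lock_t lockSendsDone;     // Unlocked when all sends of an iteration are done
...
omp_init_lock(&lockSendsDone);
omp_set_lock(&lockSendsDone); // Initially locked, unlocked by the sending thread
...
omp_test_lock(&lockSendsDone); // Cleaned with the other locks at the end
omp_unset_lock(&lockSendsDone);
omp_destroy_lock(&lockSendsDone);
\end{Listing}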
746 In the computing function, shown in \Lst{algo:ch6p2AsyncSyncComp}, the
747 modifications consist of the insertion of the timer mechanism and the tests
748 to differentiate the actions to be done in each mode. Some additional variables
749 are also required to store the current operating mode in action during the
750 execution (\texttt{curMode}), the starting time of the current asynchronous
751 section (\texttt{asyncStart}), and the number of successive synchronous
752 iterations done (\texttt{nbSyncIter}).
\begin{Listing}{algo:ch6p2AsyncSyncComp}{computing function in the final asynchronous scheme without GPU computing}
759 // Variables declarations and initialization
761 OpMode curMode = SYNC;// Current operating mode (always begin in sync)
762 double asyncStart; // Starting time of the current async section
763 int nbSyncIter = 0; // Number of sync iterations done in async mode
767 // Determination of the dynamic operating mode
768 if(curMode == ASYNC){
769 // Entering synchronous mode when asyncDuration is reached
// (additional conditions can be specified if needed)
if(MPI_Wtime() - asyncStart >= asyncDuration){
772 // Waiting for the end of previous sends before starting sync mode
773 omp_set_lock(&lockSendsDone);
774 curMode = SYNC; // Entering synchronous mode
775 stampData(dataToSend, SYNC); // Mark data to send with sync flag
779 // In main async mode, going back to async mode when the max number
780 // of sync iterations are done
781 if(mainMode == ASYNC){
nbSyncIter++; // Update of the number of sync iterations done
if(nbSyncIter == 2){ // Two sync iterations suffice to check convergence
  curMode = ASYNC; // Going back to async mode
785 stampData(dataToSend, ASYNC); // Mark data to send
786 asyncStart = MPI_Wtime(); // Get the async starting time
791 // Sending of data dependencies
792 if(curMode == SYNC || !SendsInProgress){
// Blocking data receptions in sync mode
if(curMode == SYNC){
798 omp_set_lock(&lockRec);
802 // (init of residual, arrays swapping, and iteration computation)
// Checking convergences (local & global) only in sync mode
if(curMode == SYNC){
807 // Local convergence checking (residual under threshold)
809 // Blocking global exchange of local states of the nodes
811 // Determination of global convergence (all nodes in local CV)
812 // Stopping the iterative process and sending end messages
813 // or reinitialization of state information and iteration barrier
818 // Updating of the iteration number
824 In the sending function, the only modification is the replacement in line~11 of
825 the assignment of variable \texttt{SendsInProgress} with the unlocking of
\texttt{lockSendsDone}. Finally, in the reception function, the only
modification is the insertion, before line~21 of
\Lst{algo:ch6p2BasicAsyncReceptions}, of the extraction of the stamp of the
message, which is counted among the receipts only if the stamp is \texttt{SYNC}.
831 The final step to get our complete scheme using GPU is to insert the GPU
832 management in the computing thread. The first possibility, detailed
833 in \Lst{algo:ch6p2syncGPU}, is to simply replace the
834 CPU kernel (lines~42--44 in \Lst{algo:ch6p2AsyncSyncComp}) by a blocking GPU kernel call. This includes data
transfers from the node RAM to the GPU RAM, the launching of the GPU kernel, the
waiting for kernel completion, and the results transfers from GPU RAM to node RAM.
842 \begin{Listing}{algo:ch6p2syncGPU}{computing function in the final asynchronous scheme}
843 // Variables declarations and initialization
dim3 Dg, Db; // CUDA kernel grid and block dimensions
849 // Determination of the dynamic operating mode, sendings of data
850 // dependencies, and blocking data receptions in sync mode
852 // Local GPU computation
853 // Data transfers from node RAM to GPU
854 CHECK_CUDA_SUCCESS(cudaMemcpyToSymbol(dataOnGPU, dataInRAM, inputsSize, 0, cudaMemcpyHostToDevice), "Data transfer");
855 ... // There may be several data transfers: typically A and b in
856 // linear problems of the form A.x = b
857 // GPU grid definition
Db.x = BLOCK_SIZE_X; // BLOCK_SIZE_# are kernel design dependent
Db.y = BLOCK_SIZE_Y;
Db.z = BLOCK_SIZE_Z;
861 Dg.x = localSize/BLOCK_SIZE_X + (localSize%BLOCK_SIZE_X ? 1 : 0);
862 Dg.y = localSize/BLOCK_SIZE_Y + (localSize%BLOCK_SIZE_Y ? 1 : 0);
863 Dg.z = localSize/BLOCK_SIZE_Z + (localSize%BLOCK_SIZE_Z ? 1 : 0);
864 // Use of shared memory (when possible)
865 cudaFuncSetCacheConfig(gpuKernelName, cudaFuncCachePreferShared);
867 gpuKernelName<<<Dg,Db>>>(... @\emph{kernel parameters}@ ...);
868 // Waiting for kernel completion
869 cudaDeviceSynchronize();
870 // Results transfer from GPU to node RAM
871 CHECK_CUDA_SUCCESS(cudaMemcpyFromSymbol(resultsInRam, resultsOnGPU, resultsSize, 0, cudaMemcpyDeviceToHost), "Results transfer");
872 // Potential post-treatment of results on the CPU
875 // Convergences checking
881 This scheme provides asynchronism through a cluster of GPUs as well as a
882 complete overlap of communications with GPU computations (similar
to the one described in~\Sec{ch6:part1}). However, the autonomy of the GPU devices with respect to their
host can be further exploited in order to perform some computations on the CPU
885 while the GPU kernel is running. The nature of computations that can be done by
886 the CPU may vary depending on the application. For example, when processing data
887 streams (pipelines), pre-processing of the next data item and/or post-processing
888 of the previous result can be done on the CPU while the GPU is processing the current
889 data item. In other cases, the CPU can perform \emph{auxiliary}
890 computations\index{computation auxiliary}
891 that are not absolutely required to obtain the result but that may accelerate
892 the entire iterative process. Another possibility would be to distribute the
893 main computations between the GPU and CPU. However, this
894 usually leads to poor performance increases mainly due to data
895 dependencies that often require additional transfers between CPU and GPU.
897 So, if we consider that the application enables such overlap of
898 computations, its implementation is straightforward as it consists in inserting
899 the additional CPU computations between lines~25 and~26
900 in \Lst{algo:ch6p2syncGPU}. Nevertheless, such a scheme is fully efficient only
901 if the computation times on both sides are similar.
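
Concretely, with the calls of \Lst{algo:ch6p2syncGPU}, this insertion amounts to
placing the CPU work between the asynchronous kernel launch and the
synchronization point (the CPU function below is a hypothetical placeholder):
\begin{Listing}{algo:ch6p2cpuGpuOverlap}{insertion of CPU computations during the GPU kernel execution (illustrative sketch)}
// Asynchronous kernel launch: the call returns immediately to the CPU
gpuKernelName<<<Dg,Db>>>(... @\emph{kernel parameters}@ ...);
// CPU work performed while the GPU kernel is running (placeholder name)
cpuComputations(... @\emph{relevant parameters}@ ...);
// Waiting for kernel completion before transferring the results
cudaDeviceSynchronize();
\end{Listing}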
903 In some cases, especially with auxiliary computations, another interesting
solution is to add a fourth CPU thread to perform them. This removes the
duration constraint on those optional computations as they are performed in
parallel with the main iterative process, without blocking it. Moreover, this
scheme stays coherent with current architectures as most nodes include at least
four CPU cores. The algorithmic scheme of this complete overlap of
CPU/GPU computations and communications is described in
910 Listings~\ref{algo:ch6p2FullOverAsyncMain},~\ref{algo:ch6p2FullOverAsyncComp1},
911 and~\ref{algo:ch6p2FullOverAsyncComp2}, where we assume that auxiliary
912 computations use intermediate results of the main computation process from any previous iteration. This may be
913 different according to the application.
919 \begin{Listing}{algo:ch6p2FullOverAsyncMain}{initialization of the main process of complete overlap with asynchronism}
920 // Variables declarations and initialization
922 omp_lock_t lockAux; // Informs main thread about new aux results
923 omp_lock_t lockRes; // Informs aux thread about new results
924 omp_lock_t lockWrite; // Controls exclusion of results access
925 ... auxRes ... ; // Results of auxiliary computations
927 // Parameters reading, MPI initialization, and data initialization and
930 // OpenMP initialization
932 omp_init_lock(&lockAux);
933 omp_set_lock(&lockAux);//Unlocked when new aux results are available
934 omp_init_lock(&lockRes);
935 omp_set_lock(&lockRes); // Unlocked when new results are available
936 omp_init_lock(&lockWrite);
// lockWrite is left unlocked: it controls exclusive access to the results
// shared by the main and auxiliary threads
switch(omp_get_thread_num()){
  case 0 : // Main computing thread (CPU/GPU management)
    computations(... @\emph{relevant parameters}@ ...);
    break;
  case 1 : // Auxiliary computations thread
    auxComps(... @\emph{relevant parameters}@ ...);
    break;
  case 2 : // Sending thread
    sendings(... @\emph{relevant parameters}@ ...);
    break;
  case 3 : // Reception thread
    receptions(... @\emph{relevant parameters}@ ...);
    break;
}
960 // Cleaning of OpenMP locks
962 omp_test_lock(&lockAux);
963 omp_unset_lock(&lockAux);
964 omp_destroy_lock(&lockAux);
965 omp_test_lock(&lockRes);
966 omp_unset_lock(&lockRes);
967 omp_destroy_lock(&lockRes);
968 omp_test_lock(&lockWrite);
969 omp_unset_lock(&lockWrite);
970 omp_destroy_lock(&lockWrite);
981 \begin{Listing}{algo:ch6p2FullOverAsyncComp1}{computing function in the final asynchronous scheme with CPU/GPU overlap}
982 // Variables declarations and initialization
dim3 Dg, Db; // CUDA kernel grid and block dimensions
988 // Determination of the dynamic operating mode, sending of data
989 // dependencies, and blocking data receptions in sync mode
991 // Local GPU computation
992 // Data transfers from node RAM to GPU, GPU grid definition,
993 // and init of shared memory
994 CHECK_CUDA_SUCCESS(cudaMemcpyToSymbol(dataOnGPU, dataInRAM, inputsSize, 0, cudaMemcpyHostToDevice), "Data transfer");
997 gpuKernelName<<<Dg,Db>>>(... @\emph{kernel parameters}@ ...);
998 // Potential pre-/post-treatments in pipeline-like computations
1000 // Waiting for kernel completion
1001 cudaDeviceSynchronize();
1002 // Results transfer from GPU to node RAM
1003 omp_set_lock(&lockWrite); // Wait for write access to resultsInRam
1004 CHECK_CUDA_SUCCESS(cudaMemcpyFromSymbol(resultsInRam, resultsOnGPU, resultsSize, 0, cudaMemcpyDeviceToHost), "Results transfer");
1005 // Potential post-treatments in non pipeline computations
1007 omp_unset_lock(&lockWrite); // Give back read access to aux thread
1008 omp_test_lock(&lockRes);
1009 omp_unset_lock(&lockRes); // Informs aux thread of new results
1011 // Auxiliary computations availability checking
1012 if(omp_test_lock(&lockAux)){
1013 // Use auxRes to update the iterative process
1014 ... // May induce additional GPU transfers
1017 // Convergences checking
1018 if(curMode == SYNC){
1019 // Local convergence checking and global exchange of local states
1021 // Determination of global convergence (all nodes in local CV)
1022 if(cvLocale == 1 && nbCVLocales == nbP-1){
1023 // Stopping the iterative process and sending end messages
1025 // Unlocking aux thread for termination
1026 omp_test_lock(&lockRes);
1027 omp_unset_lock(&lockRes);
1029 // Reinitialization of state information and iteration barrier
1041 \begin{Listing}{algo:ch6p2FullOverAsyncComp2}{auxiliary computing function in the final asynchronous scheme with CPU/GPU overlap}
1042 // Variables declarations and initialization
1043 ... auxInput ... // Local array for input data
1047 // Data copy from resultsInRam into auxInput
1048 omp_set_lock(&lockRes); // Waiting for new results from main comps
1050 omp_set_lock(&lockWrite); // Waiting for access to results
1051 for(ind=0; ind<resultsSize; ++ind){
1052 auxInput[ind] = resultsInRam[ind];
1054 omp_unset_lock(&lockWrite);//Give back write access to main thread
1055 // Auxiliary computations with possible interruption at the end
1056 for(ind=0; ind<auxSize && !Finished; ++ind){
1057 // Computation of auxRes array according to auxInput
// Informs main thread that new aux results are available in auxRes
1061 omp_test_lock(&lockAux); // Ensures mutex is locked when unlocking
1062 omp_unset_lock(&lockAux);
As can be seen in \Lst{algo:ch6p2FullOverAsyncMain}, there are three additional
mutexes (\texttt{lockAux}, \texttt{lockRes}, and \texttt{lockWrite}) that are used
1070 to inform the main computation thread that new auxiliary results
1071 are available (lines~20--21 in \Lst{algo:ch6p2FullOverAsyncComp2} and line~31 in
1072 \Lst{algo:ch6p2FullOverAsyncComp1}), to inform the auxiliary thread that new
1073 results from the main thread are available (lines~27--28
1074 in \Lst{algo:ch6p2FullOverAsyncComp1} and line~7
1075 in \Lst{algo:ch6p2FullOverAsyncComp2}), and to perform exclusive accesses to the
1076 results from those two threads (lines~22 and 26
1077 in \Lst{algo:ch6p2FullOverAsyncComp1} and 9 and 13
1078 in \Lst{algo:ch6p2FullOverAsyncComp2}). Also, an additional array
1079 (\texttt{auxRes}) is required to store the results of the auxiliary computations
1080 as well as a local array for the input of the auxiliary function
(\texttt{auxInput}). The auxiliary function has the same general organization as the
send/receive ones, that is, a global loop conditioned by the end of the global
process. At each iteration in this function, the thread waits for the
1084 availability of new results produced by the main computation thread. This avoids
1085 performing the same computations several times with the same input data.
Then, the input data of the auxiliary computations are copied with a mutual
exclusion mechanism. Finally, the auxiliary
1091 computations are performed. When they are completed, the associated mutex is
1092 unlocked to signal the availability of those auxiliary results to the main
1093 computing thread. The main thread regularly checks this availability at the end
1094 of its iterations and takes them into account whenever possible.
1096 Finally, we obtain an algorithmic scheme allowing maximal overlap between CPU
and GPU computations as well as communications. It is worth noticing that such a
scheme is also usable on systems without GPUs, provided that the nodes have at
least four CPU cores. In such contexts, each thread in \Lst{algo:ch6p2FullOverAsyncMain} can
be executed on a distinct core.
1102 \subsection{Experimental validation}
1103 \label{sec:ch6p2expes}
1105 As in~\Sec{ch6:part1}, we validate the feasibility of our asynchronous scheme
1106 with some experiments performed with a representative example of scientific
1107 application. This three-dimensional version of the
1108 advection-diffusion-reaction process\index{PDE example} models the evolution of the
1109 concentrations of two chemical species in shallow waters. As this process is
1110 dynamic in time, the simulation is performed for a given number of consecutive
1111 time steps. This implies two nested loops in the iterative process, the outer
one for the time steps and the inner one for solving the problem at each time step.
1113 Full details about this PDE problem can be found in~\cite{ChapNRJ2011}. This
1114 two-stage iterative process implies a few adaptations of the general scheme
1115 presented above in order to include the outer iterations over the time steps, but the
1116 inner iterative process closely follows the same scheme.
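
Schematically, the resulting structure is the following (a simplified sketch in
which \texttt{nbTimeSteps} and the solver function are placeholders):
\begin{Listing}{algo:ch6p2timeSteps}{two-stage iterative process of the PDE example (illustrative sketch)}
// Outer loop over the consecutive time steps of the simulation
for(ts=0; ts<nbTimeSteps; ++ts){
  // Inner iterative solver for the current time step, following the
  // asynchronous/synchronous scheme described in the previous subsections
  solveTimeStep(... @\emph{relevant parameters}@ ...);
}
\end{Listing}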
1118 We show two series of experiments performed with 16 nodes of the first cluster
1119 described in~\Sec{ch6:p1expes}. The first one deals with the comparison of
1120 synchronous and asynchronous computations. The second one is related to the use
1121 of auxiliary computations. In the context of our PDE application, they consist of
1122 the update of the Jacobian of the system.
1124 \subsubsection*{Synchronous and asynchronous computations}
The first experiment allows us to check that the asynchronous behavior obtained
with our scheme corresponds to what is expected relative to its synchronous
counterpart. So, we show in~\Fig{fig:ch6p2syncasync} the computation times of
1129 our test application in both modes for different problem sizes. The size shown
1130 is the number of discrete spatial elements on each side of the cube representing
the 3D volume. Moreover, each of these elements carries the concentrations of
the two chemical species considered. So, for example, size 30 corresponds in
fact to $30\times30\times30\times2$, that is, 54,000 values.
1137 \includegraphics[width=.75\columnwidth]{Chapters/chapter6/curves/syncasync.pdf}
1138 \caption{Computation times of the test application in synchronous and
1139 asynchronous modes.}
1140 \label{fig:ch6p2syncasync}
1143 The results obtained show that the asynchronous version is significantly faster than
1144 the synchronous one for smaller problem sizes, then it becomes similar or even
1145 a bit slower for larger problem sizes. A closer comparison of computation and
1146 communication times of each execution confirms that this behavior is consistent.
The asynchronous version is interesting when the communication time is similar
to or larger than the computation time. In our example, this is the case up to a
problem size between 50 and 60. Beyond that, computations become longer than
communications. Since asynchronous computations often require more iterations to
converge, the gain obtained on the communication side becomes smaller than the
overhead generated on the computation side, and the asynchronous version then
becomes slower than the synchronous one.
1156 \subsubsection*{Overlap of auxiliary computations}
1158 In this experiment, we use only the asynchronous version of the application. In
1159 the context of our test application, we have an iterative PDE solver based on
Newton resolution. Such solvers are written under the form
1161 $x=T(x),~x\in\Reals^n$ where $T(x)=x-F'(x)^{-1}F(x)$ and $F'$ is the Jacobian of
1162 the system. In such cases, it is necessary to compute the vector $\Delta x$ in
1163 $F'\times \Delta x=-F$ to update $x$ with $\Delta x$. There are two levels of
1164 iterations, the inner level to get a stabilized version of $x$, and the outer
1165 level to compute $x$ at the successive time steps in the simulation process. In
this context, classic algorithms either compute $F'$ only at the first iteration
of each time step or at some iterations but not all, because the computation of
$F'$ is done in the main iterative process and has a relatively high computing cost.
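In such a setting, the inner update is thus performed with a possibly outdated
(frozen) Jacobian,
\[
x^{k+1}=x^{k}+\Delta x^{k}, \qquad F'(\tilde{x})\,\Delta x^{k}=-F(x^{k}),
\]
where $\tilde{x}$ is the point at which the Jacobian was last evaluated; the
closer $\tilde{x}$ is to the current iterate, the better the convergence of the
Newton process generally is.
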
1170 However, with the scheme presented above, it is possible to continuously compute
1171 new versions of $F'$ in parallel with the main iterative process without
1172 penalizing it. Hence, $F'$ is updated as often as possible and taken into
account in the main computations whenever it is relevant. So, the Newton process
should be slightly accelerated.
We compare the performance obtained with overlapped Jacobian updates and
nonoverlapped ones for several problem sizes (see~\Fig{fig:ch6p2aux}).
1180 \includegraphics[width=.75\columnwidth]{Chapters/chapter6/curves/recouvs.pdf}
\caption{Computation times with or without overlap of Jacobian updates
  in asynchronous mode.}
1183 \label{fig:ch6p2aux}
The overlap is clearly efficient as the computation times with overlapped
Jacobian updates are much better than the ones without overlap. Moreover, the ratio
1188 between the two versions tends to increase with the problem size, which is as
1189 expected. Also, we have tested the application without auxiliary computations at
1190 all, that is, the Jacobian is computed only once at the beginning of each time
1191 step of the simulation. The results for this last version are quite similar to
1192 the overlapped auxiliary computations, and even better for small problem sizes.
1193 The fact that no significant gain can be seen on this range of problem sizes is due
1194 to the limited number of Jacobian updates taken into account in the main
computation. This happens when one Jacobian update takes as long as several
iterations of the main process, so that the benefit is reduced in this
particular case.
Those results show two things. First, the auxiliary computations do not induce a
significant overhead in the whole process. Second, for this particular application,
the choice of updating the Jacobian matrix as auxiliary computations does not speed
up the iterative process. This does not question the parallel scheme in itself
but merely points out the difficulty of identifying relevant auxiliary
computations. Indeed, this identification depends on the considered application
and requires a thorough, specialized analysis.
1207 Another interesting choice could be the computation of load estimation for
1208 dynamic load balancing, especially in decentralized diffusion strategies where
1209 loads are transferred between neighboring nodes~\cite{BCVG11}. In such a case,
1210 the load evaluation and the comparison with other nodes can be done in parallel
1211 with the main computations without perturbing them.