+\section{General scheme of asynchronous parallel code with
+computation/communication overlapping}
+\label{ch6:part2}
+
+In the previous part, we have seen how to efficiently implement overlap of
+computations (CPU and GPU) with communications (GPU transfers and inter-node
+communications). However, we have previously shown that for some parallel
+iterative algorithms, it is sometimes even more efficient to use an asynchronous
+scheme of iterations\index{iterations!asynchronous} \cite{HPCS2002,ParCo05,Para10}. In that case, the nodes do
+not wait for each other but perform their iterations using the latest
+external data they have received from the other nodes, even if these
+data were produced \emph{before} the previous iteration on the other nodes.
+
+Formally, if we denote by $f=(f_1,...,f_n)$ the function representing the
+iterative process and by $x^t=(x_1^t,...,x_n^t)$ the values of the $n$ elements of
+the system at iteration $t$, we pass from a synchronous iterative scheme of the
+form:
+\begin{algorithm}[H]
+ \caption{Synchronous iterative scheme}\label{algo:ch6p2sync}
+ \begin{Algo}
+ $x^{0}=(x_{1}^{0},...,x_{n}^{0})$\\
+ \textbf{for} $t=0,1,...$\\
+ \>\textbf{for} $i=1,...,n$\\
+ \>\>$x_{i}^{t+1}=f_{i}(x_{1}^t,...,x_i^t,...,x_{n}^t)$\\
+ \>\textbf{endfor}\\
+ \textbf{endfor}
+ \end{Algo}
+\end{algorithm}
+\noindent
+to an asynchronous iterative scheme of the form:
+\begin{algorithm}[H]
+ \caption{Asynchronous iterative scheme}\label{algo:ch6p2async}
+ \begin{Algo}
+ $x^{0}=(x_{1}^{0},...,x_{n}^{0})$\\
+ \textbf{for} $t=0,1,...$\\
+ \>\textbf{for} $i=1,...,n$\\
+ \>\>$x_{i}^{t+1}=\left\{
+ \begin{array}[h]{ll}
+ x_i^t & \text{if } i \text{ is \emph{not} updated at iteration } t\\
+ f_i(x_1^{s_1^i(t)},...,x_n^{s_n^i(t)}) & \text{if } i \text{ is updated at iteration } t
+ \end{array}
+ \right.$\\
+ \>\textbf{endfor}\\
+ \textbf{endfor}
+ \end{Algo}
+\end{algorithm}
+where $s_j^i(t)$ is the iteration number of the production of the value $x_j$ of
+element $j$ that is used on element $i$ at iteration $t$ (see for example~\cite{BT89,
+ FS2000} for further details).
+Such schemes are called AIAC\index{AIAC} for \emph{Asynchronous Iterations and
+ Asynchronous Communications}. They take into account both the different
+computation speeds of the computing elements and the communication delays
+between them.
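+
+As a small illustration of this notation, consider a system of $n=2$ elements in
+which, at a given iteration $t$, element~1 uses a value of $x_2$ that is two
+iterations old while element~2 uses the latest value of $x_1$. This corresponds
+to $s_1^1(t)=s_1^2(t)=s_2^2(t)=t$ and $s_2^1(t)=t-2$, so that the updates
+performed at that iteration are:
+\[
+  x_1^{t+1}=f_1(x_1^{t},x_2^{t-2}), \qquad x_2^{t+1}=f_2(x_1^{t},x_2^{t}).
+\]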
+
+The key feature of such algorithmic schemes
+is that they may be faster than their synchronous counterparts thanks to the
+implied total overlap of computations with communications: indeed, this scheme
+suppresses all the idle times induced by node synchronizations between iterations.
+
+However, the efficiency of such a scheme is directly linked to the frequency at
+which new data arrives on each node. Typically,
+if a node receives newer data only every four or five local iterations, it is
+very likely that the evolution of its local iterative process will be
+slower than if it received data at every iteration. The key point here is that
+this frequency does not only depend on the hardware configuration of the
+parallel system but it also depends on the software that is used to
+implement the algorithmic scheme.
+
+The impact of the programming environments used to implement asynchronous
+algorithms has already been investigated in~\cite{SuperCo05}. Although the
+features required to efficiently implement asynchronous schemes have not
+changed, the available programming environments and computing hardware have
+evolved, in particular with the advent of GPUs. So, there is a need to reconsider the
+implementation schemes of AIAC according to the new de facto standards for parallel
+programming (communications and threads) as well as the integration of GPUs.
+One of the main objectives here is to obtain a maximal overlap between the
+activities of the three types of devices involved, namely the CPU, the GPU and the
+network. Another objective is to present what we think
+is the best compromise between the simplicity of the implementation and its
+maintainability on one side and its
+performance on the other side. This is especially important for industries where
+implementation and maintenance costs are strong constraints.
+
+For the sake of clarity, we present the different algorithmic schemes in a progressive
+order of complexity, from the basic asynchronous scheme to the complete scheme
+with full overlap. Between these two extremes, we propose a synchronization
+mechanism on top of our asynchronous scheme that can be used either statically or
+dynamically during the application execution.
+
+Although there exist several programming environments for inter-node
+communications, multi-threading and GPU programming, a few of them have
+become \emph{de facto standards}, due to their stability, their ease
+of use and/or their wide adoption by the scientific community.
+Therefore, as in the previous section, all the schemes presented in the following use MPI~\cite{MPI},
+OpenMP~\cite{openMP} and CUDA~\cite{CUDA}. However, there is no loss of
+generality as those schemes may easily be implemented with other libraries.
+
+Finally, in order to stay as clear as possible, only the parts of the code and the
+variables related to the control of parallelism (communications, threads, etc.)
+are presented in our schemes. The inner organization of data is not detailed as
+it depends on the application. We only consider that we have two data arrays
+(previous version and current version) and communication buffers. However, in
+most of the cases, those buffers can correspond to the data arrays themselves to
+avoid data copies.
+
+\subsection{A basic asynchronous scheme}
+\label{ch6:p2BasicAsync}
+
+The first step toward our complete scheme is to implement a basic asynchronous
+scheme that includes an actual overlap of the communications with the
+computations\index{overlap!computation and communication}. In order to ensure that the communications are actually performed
+in parallel with the computations, it is necessary to use different threads. It
+is important to remember that the asynchronous communications provided by
+communication libraries like MPI are not systematically performed in parallel with
+the computations~\cite{ChVCV13,Hoefler08a}. So, the logical and classical way
+to implement such an overlap is to use three threads: one for
+computing, one for sending and one for receiving. Moreover, since
+the communication is performed by threads, blocking synchronous communications\index{MPI!communication!blocking}\index{MPI!communication!synchronous}
+can be used without deteriorating the overall performance.
+
+In this basic version, the termination\index{termination} of the global process is performed
+individually on each node according to its own local criterion. This can be either a given
+number of iterations or a local convergence detection\index{convergence detection}. The important step at
+the end of the process is to perform the receptions of all pending
+communications in order to ensure the termination of the two communication
+threads.
+
+So, the global organization of this scheme is shown in \Lst{algo:ch6p2BasicAsync}.
+
+% \begin{algorithm}[H]
+% \caption{Initialization of the basic asynchronous scheme.}
+% \label{algo:ch6p2BasicAsync}
+\begin{Listing}{algo:ch6p2BasicAsync}{Initialization of the basic asynchronous scheme}
+// Variables declaration and initialization
+omp_lock_t lockSend; // Controls the sendings from the computing thread
+omp_lock_t lockRec; // Ensures the initial reception of external data
+char Finished = 0; // Boolean indicating the end of the process
+char SendsInProgress = 0; // Boolean indicating if previous data sendings are still in progress
+double Threshold; // Threshold of the residual for convergence detection
+
+// Parameters reading
+...
+
+// MPI initialization
+MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
+MPI_Comm_size(MPI_COMM_WORLD, &nbP);
+MPI_Comm_rank(MPI_COMM_WORLD, &numP);
+
+// Data initialization and distribution among the nodes
+...
+
+// OpenMP initialization (mainly declarations and setting up of locks)
+omp_set_num_threads(3);
+omp_init_lock(&lockSend);
+omp_set_lock(&lockSend); // Initially locked, unlocked to start sendings
+omp_init_lock(&lockRec);
+omp_set_lock(&lockRec); // Initially locked, unlocked when initial data are received
+
+#pragma omp parallel
+{
+ switch(omp_get_thread_num()){
+ case COMPUTATION :
+ computations(... @\emph{relevant parameters}@ ...);
+ break;
+
+ case SENDINGS :
+ sendings();
+ break;
+
+ case RECEPTIONS :
+ receptions();
+ break;
+ }
+}
+
+// Cleaning of OpenMP locks
+omp_test_lock(&lockSend);
+omp_unset_lock(&lockSend);
+omp_destroy_lock(&lockSend);
+omp_test_lock(&lockRec);
+omp_unset_lock(&lockRec);
+omp_destroy_lock(&lockRec);
+
+// MPI termination
+MPI_Finalize();
+\end{Listing}
+%\end{algorithm}
+
+
+In this scheme, the \texttt{lockRec} mutex\index{OpenMP!mutex} is not mandatory.
+It is only used to ensure that data dependencies are actually exchanged at the
+first iteration of the process. Data initialization and distribution
+(lines~16-17) are not detailed here because they are directly related to the
+application. The important point is that, in most cases, they should be done
+before the iterative process. The computing function is given in
+\Lst{algo:ch6p2BasicAsyncComp}.
+
+%\begin{algorithm}[H]
+% \caption{Computing function in the basic asynchronous scheme.}
+% \label{algo:ch6p2BasicAsyncComp}
+\begin{Listing}{algo:ch6p2BasicAsyncComp}{Computing function in the basic asynchronous scheme}
+// Variables declaration and initialization
+int iter = 1; // Number of the current iteration
+double difference; // Variation of one element between two iterations
+double residual; // Residual of the current iteration
+
+// Computation loop
+while(!Finished){
+ // Sendings of data dependencies if there is no previous sending in progress
+ if(!SendsInProgress){
+ // Potential copy of data to be sent in additional buffers
+ ...
+ // Change of sending state
+ SendsInProgress = 1;
+ omp_unset_lock(&lockSend);
+ }
+
+ // Blocking receptions at the first iteration
+ if(iter == 1){
+ omp_set_lock(&lockRec);
+ }
+
+ // Initialization of the residual
+ residual = 0.0;
+ // Swapping of data arrays (current and previous)
+ tmp = current; // Pointers swapping to avoid
+ current = previous; // actual data copies between
+ previous = tmp; // the two data versions
+ // Computation of current iteration over local data
+ for(ind=0; ind<localSize; ++ind){
+ // Updating of current array using previous array
+ ...
+ // Updating of the residual
+ // (max difference between two successive iterations)
+ difference = fabs(current[ind] - previous[ind]);
+ if(difference > residual){
+ residual = difference;
+ }
+ }
+
+ // Checking of the end of the process (residual under threshold)
+ // Other conditions can be added to the termination detection
+ if(residual <= Threshold){
+ Finished = 1;
+ omp_unset_lock(&lockSend); // Activation of end messages sendings
+ MPI_Ssend(&Finished, 1, MPI_CHAR, numP, tagEnd, MPI_COMM_WORLD);
+ }
+
+ // Updating of the iteration number
+ iter++;
+}
+\end{Listing}
+%\end{algorithm}
+
+As mentioned above, it can be seen in line~18 of \Lst{algo:ch6p2BasicAsyncComp}
+that the \texttt{lockRec} mutex is used only at the first iteration to wait for
+the initial data dependencies before the computations. The residual\index{residual}, initialized
+in line~23 and computed in lines~34-37, is defined by the maximal difference
+between the elements from two consecutive iterations. It is classically used to
+detect the local convergence of the process on each node. In the more
+complete schemes presented in the sequel, a global termination detection that
+takes the states of all the nodes into account will be exhibited.
+
+Finally, the local convergence is tested and updated when necessary. In line~44,
+the \texttt{lockSend} mutex is unlocked to allow the sending function to send
+final messages to the dependency nodes. Those messages are required to keep the
+reception function alive until all the final messages have been received.
+Otherwise, a node could stop its reception function whereas other nodes are
+still trying to communicate with it. Moreover, a local sending of a final
+message to the node itself is required (line~45) to ensure that the reception
+function will not stay blocked in a message probing
+(see~\Lst{algo:ch6p2BasicAsyncReceptions}, line~11). This may happen if the node
+receives the final messages from its dependencies \emph{before} being itself in
+local convergence.
+
+All the messages except this final local one are sent by the sending function
+described in \Lst{algo:ch6p2BasicAsyncSendings}.
+
+The main loop is only conditioned by the end of the computing process (line~4).
+At each iteration, the thread waits for the permission of the computing thread
+(via the \texttt{lockSend} mutex). Then, data are sent with
+blocking synchronous communications. The \texttt{SendsInProgress} boolean
+allows the computing thread to skip data sendings as long as a previous sending
+is in progress. This skip is possible due to the nature of asynchronous
+algorithms, which tolerate such \emph{message loss}\index{message!loss/miss} or \emph{message miss}. After
+the main loop, the final messages are sent to the dependencies of the node.
+
+%\begin{algorithm}[H]
+% \caption{Sending function in the basic asynchronous scheme.}
+% \label{algo:ch6p2BasicAsyncSendings}
+\begin{Listing}{algo:ch6p2BasicAsyncSendings}{Sending function in the basic asynchronous scheme}
+// Variables declaration and initialization
+...
+
+while(!Finished){
+ omp_set_lock(&lockSend); // Waiting for signal from the comp. thread
+ if(!Finished){
+ // Blocking synchronous sends to all dependencies
+ for(i=0; i<nbDeps; ++i){
+ MPI_Ssend(&dataToSend[deps[i]], nb_data, type_of_data, deps[i], tagCom, MPI_COMM_WORLD);
+ }
+ SendsInProgress = 0; // Indicates that the sendings are done
+ }
+}
+// At the end of the process, sendings of final messages
+for(i=0; i<nbDeps; ++i){
+ MPI_Ssend(&Finished, 1, MPI_CHAR, deps[i], tagEnd, MPI_COMM_WORLD);
+}
+\end{Listing}
+%\end{algorithm}
+
+The last function, detailed in \Lst{algo:ch6p2BasicAsyncReceptions}, does all the messages receptions.
+%\begin{algorithm}[H]
+% \caption{Reception function in the basic asynchronous scheme.}
+% \label{algo:ch6p2BasicAsyncReceptions}
+\begin{Listing}{algo:ch6p2BasicAsyncReceptions}{Reception function in the basic asynchronous scheme}
+// Variables declaration and initialization
+char countReceipts = 1; // Boolean indicating whether receptions are counted or not
+int nbEndMsg = 0; // Number of end messages received
+int arrived = 0; // Boolean indicating if a message is arrived
+int srcNd; // Source node of the message
+int size; // Message size
+
+// Main loop of receptions
+while(!Finished){
+ // Waiting for an incoming message
+ MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
+ if(!Finished){
+ // Management of data messages
+ switch(status.MPI_TAG){
+ case tagCom: // Management of data messages
+ srcNd = status.MPI_SOURCE; // Get the source node of the message
+ // Actual data reception in the corresponding buffer
+ MPI_Recv(dataBufferOf(srcNd), nbDataOf(srcNd), dataTypeOf(srcNd), srcNd, tagCom, MPI_COMM_WORLD, &status);
+ // Unlocking of the computing thread when data are received from all dependencies
+ if(countReceipts == 1 && ... @\emph{receptions from ALL dependencies}@ ...){
+ omp_unset_lock(&lockRec);
+ countReceipts = 0; // No more counting after first iteration
+ }
+ break;
+ case tagEnd: // Management of end messages
+ // Actual end message reception in dummy buffer
+ MPI_Recv(dummyBuffer, 1, MPI_CHAR, status.MPI_SOURCE, tagEnd, MPI_COMM_WORLD, &status);
+ nbEndMsg++;
+ }
+ }
+}
+
+// Reception of pending messages and counting of end messages
+do{ // Loop over the remaining incoming/waited messages
+ MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
+ MPI_Get_count(&status, MPI_CHAR, &size);
+ // Actual reception in dummy buffer
+ MPI_Recv(dummyBuffer, size, MPI_CHAR, status.MPI_SOURCE, status.MPI_TAG, MPI_COMM_WORLD, &status);
+ if(status.MPI_TAG == tagEnd){ // Counting of end messages
+ nbEndMsg++;
+ }
+ MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &arrived, &status);
+}while(arrived == 1 || nbEndMsg < nbDeps + 1);
+\end{Listing}
+%\end{algorithm}
+
+As in the sending function, the main loop of receptions is done while the
+iterative process is not \texttt{Finished}. In line~11, the thread waits until
+a message arrives on the node. Then, it performs the actual reception and the
+corresponding subsequent actions (potential data copies for data messages and
+counting for end messages). Lines 20-23 check that all data dependencies have
+been received before unlocking the \texttt{lockRec} mutex. As mentioned before,
+they are not mandatory and are included only to ensure that all data
+dependencies are received at the first iteration. Lines 25-28 are required to
+manage end messages that arrive on the node \emph{before} it reaches its own
+termination process. As the nodes are \emph{not} synchronized, this may happen.
+Finally, lines 34-43 perform the receptions of all pending communications,
+including the remaining end messages (at least the one from the node itself).
+
+\medskip
+So, with those algorithms, we obtain a quite simple and efficient asynchronous
+iterative scheme. It is interesting to notice that GPU computing can be easily
+included in the computing thread. This will be fully addressed in
+paragraph~\ref{ch6:p2GPUAsync}. However, before presenting the complete
+asynchronous scheme with GPU computing, we have to detail how our initial scheme
+can be made synchronous.
+
+\subsection{Synchronization of the asynchronous scheme}
+\label{ch6:p2SsyncOverAsync}
+
+Adding synchronization to the previous asynchronous scheme may seem contradictory to our goal,
+and obviously, it is neither the simplest way to obtain a synchronous scheme nor
+the most efficient one (as presented in~\Sec{ch6:part1}). However, it is necessary
+for our global convergence detection strategy\index{convergence detection!global}. Recall that the
+global convergence\index{convergence!global} is the extension of the local convergence\index{convergence!local} concept to all the
+nodes. This implies that all the nodes have to be in local convergence at the
+same time to achieve global convergence. Typically, if we use the residual and a
+threshold to stop the iterative process, all the nodes have to continue their
+local iterative process until \emph{all} of them obtain a residual under the
+threshold.
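+
+With the residual-based criterion, this can be written more formally: if $r_k$
+denotes the current residual on node $k$ and $\varepsilon$ the convergence
+threshold, the global convergence over the $N$ nodes is reached as soon as
+\[
+  \max_{1\le k\le N} r_k \le \varepsilon .
+\]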
+
+% Moreover, it allows the gathering of both operating modes in a single source
+% code, which tends to improve the implementation and maintenance costs when both
+% versions are required.
+% The second one is that
+In our context, the interest of being able to dynamically change the operating
+mode (sync/async) during the execution is that it strongly simplifies
+the global convergence detection. In fact, our past experience in the design
+and implementation of global convergence detection in asynchronous
+algorithms~\cite{SuperCo05, BCC07, Vecpar08a} has led us to the conclusion
+that although a decentralized detection scheme is possible and may be more
+efficient in some situations, its much higher complexity is an
+obstacle to actual use in practice, especially in industrial contexts where
+implementation/maintenance costs are strong constraints. Moreover, although the
+decentralized scheme does not slow down the computations, it requires more iterations than a synchronous version and
+thus may induce longer detection times in some cases. So, the solution we
+present below is a good compromise between simplicity and efficiency. It
+consists in dynamically changing the operating mode between asynchronous and synchronous
+during the execution of the process in order to check the global convergence.
+This is why we need to synchronize our asynchronous scheme.
+
+In each algorithm of the initial scheme, we only exhibit the additional code
+required to change the operating mode.
+
+%\begin{algorithm}[H]
+% \caption{Initialization of the synchronized scheme.}
+% \label{algo:ch6p2Sync}
+\begin{Listing}{algo:ch6p2Sync}{Initialization of the synchronized scheme}
+// Variables declarations and initialization
+...
+omp_lock_t lockStates; // Controls the synchronous exchange of local states
+omp_lock_t lockIter; // Controls the synchronization at the end of each iteration
+char localCV = 0; // Boolean indicating whether the local stabilization is reached or not
+int nbOtherCVs = 0; // Number of other nodes being in local stabilization
+
+// Parameters reading
+...
+// MPI initialization
+...
+// Data initialization and distribution among the nodes
+...
+// OpenMP initialization (mainly declarations and setting up of locks)
+...
+omp_init_lock(&lockStates);
+omp_set_lock(&lockStates); // Initially locked, unlocked when all state messages are received
+omp_init_lock(&lockIter);
+omp_set_lock(&lockIter); // Initially locked, unlocked when all "end of iteration" messages are received
+
+// Threads launching
+#pragma omp parallel
+{
+ switch(omp_get_thread_num()){
+ ...
+ }
+}
+
+// Cleaning of OpenMP locks
+...
+omp_test_lock(&lockStates);
+omp_unset_lock(&lockStates);
+omp_destroy_lock(&lockStates);
+omp_test_lock(&lockIter);
+omp_unset_lock(&lockIter);
+omp_destroy_lock(&lockIter);
+
+// MPI termination
+MPI_Finalize();
+\end{Listing}
+%\end{algorithm}
+
+As can be seen in \Lst{algo:ch6p2Sync}, the synchronization implies two
+additional mutexes. The \texttt{lockStates} mutex is used to wait for the
+receptions of all state messages coming from the other nodes. As shown
+in \Lst{algo:ch6p2SyncComp}, those messages contain only a boolean indicating
+for each node if it is in local convergence\index{convergence!local}. So, once all the states are
+received on a node, it is possible to determine if all the nodes are in local
+convergence, and thus to detect the global convergence. The \texttt{lockIter}
+mutex is used to synchronize all the nodes at the end of each iteration. There
+are also two new variables that respectively represent the local state of the
+node (\texttt{localCV}) according to the iterative process (convergence) and the
+number of other nodes that are in local convergence (\texttt{nbOtherCVs}).
+
+The computation thread is where most of the modifications take place, as shown
+in \Lst{algo:ch6p2SyncComp}.
+
+%\begin{algorithm}[H]
+% \caption{Computing function in the synchronized scheme.}
+% \label{algo:ch6p2SyncComp}
+\begin{Listing}{algo:ch6p2SyncComp}{Computing function in the synchronized scheme}
+// Variables declarations and initialization
+...
+
+// Computation loop
+while(!Finished){
+ // Sendings of data dependencies at @\emph{each}@ iteration
+ // Potential copy of data to be sent in additional buffers
+ ...
+ omp_unset_lock(&lockSend);
+
+ // Blocking receptions at @\emph{each}@ iteration
+ omp_set_lock(&lockRec);
+
+ // Local computation
+ // (init of residual, arrays swapping and iteration computation)
+ ...
+
+ // Checking of the stabilization of the local process
+ // Other conditions than the residual can be added
+ if(residual <= Threshold){
+ localCV = 1;
+ }else{
+ localCV = 0;
+ }
+
+ // Global exchange of local states of the nodes
+ for(ind=0; ind<nbP; ++ind){
+ if(ind != numP){
+ MPI_Ssend(&localCV, 1, MPI_CHAR, ind, tagState, MPI_COMM_WORLD);
+ }
+ }
+
+ // Waiting for the state messages receptions from the other nodes
+ omp_set_lock(&lockStates);
+
+ // Determination of global convergence (if all nodes are in local CV)
+ if(localCV + nbOtherCVs == nbP){
+ // Entering global CV state
+ Finished = 1;
+ // Unlocking of sending thread to start sendings of end messages
+ omp_unset_lock(&lockSend);
+ MPI_Ssend(&Finished, 1, MPI_CHAR, numP, tagEnd, MPI_COMM_WORLD);
+ }else{
+ // Resetting of information about the states of the other nodes
+ ...
+ // Global barrier at the end of each iteration during the process
+ for(ind=0; ind<nbP; ++ind){
+ if(ind != numP){
+ MPI_Ssend(&Finished, 1, MPI_CHAR, ind, tagIter, MPI_COMM_WORLD);
+ }
+ }
+ omp_set_lock(&lockIter);
+ }
+
+
+ // Updating of the iteration number
+ iter++;
+}
+\end{Listing}
+%\end{algorithm}
+
+Most of the added code is related to waiting for specific communications.
+Between lines~6 and~7, the use of the flag \texttt{SendsInProgress} is no longer
+needed since the sends are performed at each iteration. In line~12, the
+thread waits for the data receptions from its dependencies. In
+lines~26-34, the local states are determined and exchanged among all nodes. A
+new message tag (\texttt{tagState}) is required for identifying those messages.
+In line~37, the global termination state is determined. When it is reached,
+lines~38-42 change the \texttt{Finished} boolean to stop the iterative process,
+and send the end messages. Otherwise, each node resets its local information
+about the states of the other nodes, and a global barrier is performed between all the
+nodes at the end of each iteration, using another new tag (\texttt{tagIter}).
+That barrier is needed to ensure that data messages from successive iterations
+are actually received during the \emph{same} iteration on the destination nodes.
+Nevertheless, it is not needed at the termination of the global process as it is
+replaced by the global exchange of end messages.
+
+The synchronization induces no major modification in the sending
+function. The only possible change is the suppression of line~11, which is not useful
+in this case. Apart from that, the function stays the same as
+in \Lst{algo:ch6p2BasicAsyncSendings}.
+
+In the reception function, given in \Lst{algo:ch6p2SyncReceptions}, there are
+mainly two insertions (in lines~19-30 and 31-40), corresponding to the
+additional types of messages to receive. Three new variables are also inserted
+for the receptions of those new message types. Lines~24-29 and 34-39
+contain the message counting and mutex unlocking
+mechanisms that are used to unblock the computing thread at the corresponding steps of its
+execution. They are similar to the mechanism used for managing the end messages
+at the end of the entire process. Line~23 directly updates the
+number of other nodes that are in local convergence by adding the
+received state of the source node. This is possible due to the encoding that is used to
+represent the local convergence (1) and the non-convergence (0).
+
+%\begin{algorithm}[H]
+% \caption{Reception function in the synchronized scheme.}
+% \label{algo:ch6p2SyncReceptions}
+\begin{Listing}{algo:ch6p2SyncReceptions}{Reception function in the synchronized scheme}
+// Variables declarations and initialization
+...
+int nbStateMsg = 0; // Number of local state messages received
+int nbIterMsg = 0; // Number of "end of iteration" messages received
+char recvdState; // Received state from another node (0 or 1)
+
+// Main loop of receptions
+while(!Finished){
+ // Waiting for an incoming message
+ MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
+ if(!Finished){
+ switch(status.MPI_TAG){ // Actions related to message type
+ case tagCom: // Management of data messages
+ ...
+ break;
+ case tagEnd: // Management of termination messages
+ ...
+ break;
+ case tagState: // Management of local state messages
+ // Actual reception of the message
+ MPI_Recv(&recvdState, 1, MPI_CHAR, status.MPI_SOURCE, tagState, MPI_COMM_WORLD, &status);
+ // Updates of numbers of stabilized nodes and received state msgs
+ nbOtherCVs += recvdState;
+ nbStateMsg++;
+ // Unlocking of the computing thread when states of all other nodes are received
+ if(nbStateMsg == nbP-1){
+ nbStateMsg = 0;
+ omp_unset_lock(&lockStates);
+ }
+ break;
+ case tagIter: // Management of "end of iteration" messages
+ // Actual reception of the message in dummy buffer
+ MPI_Recv(dummyBuffer, 1, MPI_CHAR, status.MPI_SOURCE, tagIter, MPI_COMM_WORLD, &status);
+ nbIterMsg++; // Update of the nb of iteration messages
+ // Unlocking of the computing thread when iteration messages are received from all other nodes
+ if(nbIterMsg == nbP - 1){
+ nbIterMsg = 0;
+ omp_unset_lock(&lockIter);
+ }
+ break;
+ }
+ }
+}
+
+// Reception of pending messages and counting of end messages
+do{ // Loop over the remaining incoming/waited messages
+ ...
+}while(arrived == 1 || nbEndMsg < nbDeps + 1);
+\end{Listing}
+%\end{algorithm}
+
+Now that we can synchronize our asynchronous scheme, the final step is to
+dynamically alternate the two operating modes in order to regularly check the
+global convergence of the iterative process. This is detailed in the following
+paragraph together with the inclusion of GPU computing in the final asynchronous
+scheme.
+
+\subsection{Asynchronous scheme using MPI, OpenMP and CUDA}
+\label{ch6:p2GPUAsync}
+
+As mentioned above, the strategy proposed to obtain a good compromise between
+simplicity and efficiency in the asynchronous scheme is to dynamically change
+the operating mode of the process. A good way to obtain a maximal
+simplification of the final scheme while preserving good performance is to
+perform local and global convergence detections only in synchronous
+mode. Moreover, as two successive iterations are sufficient in synchronous mode
+to detect local and global convergences, the key is to alternate some
+asynchronous iterations with two synchronous iterations until
+convergence.
+
+The last problem is to decide \emph{when} to switch from the
+asynchronous to the synchronous mode. Here again, for the sake of simplicity, any
+asynchronous mechanism for \emph{detecting} such an instant is avoided and we
+prefer to use a mechanism that is local to each node. Obviously, that local mechanism must
+rely neither on the number of local iterations done nor on the local
+convergence. The former would slow down the fastest nodes to the pace of the
+slowest ones. The latter would provoke too much synchronization because the
+residuals on all nodes commonly do not evolve in the same way and, in most
+cases, there is a convergence wave phenomenon throughout the elements. So, a
+good solution is to insert a local timer mechanism on each node with a given
+initial duration. Then, that duration may be modified during the execution
+according to the successive results of the synchronous sections.
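+
+As an illustration, a possible adaptation rule for that duration is sketched
+below. It is only one option among many: the constants and the criterion (based
+on the local residual observed during the last synchronous section) are
+arbitrary choices for this sketch, not part of the scheme itself.
+
+\begin{Listing}{algo:ch6p2AsyncDurationSketch}{Possible adaptation of the duration of asynchronous sections (sketch)}
+// Called after each synchronous section (illustrative heuristic only)
+void adaptAsyncDuration(double *asyncDuration, double residual, double threshold){
+  if(residual > 10.0 * threshold){
+    // Still far from convergence: lengthen async sections to reduce
+    // the synchronization overhead
+    *asyncDuration *= 1.5;
+  }else{
+    // Getting close to convergence: shorten async sections so that
+    // global convergence is checked more often
+    *asyncDuration *= 0.5;
+  }
+}
+\end{Listing}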
+
+Another problem induced by entering synchronous mode from the asynchronous one
+is the possibility to receive some data messages
+from previous asynchronous iterations during synchronous iterations. This could lead to deadlocks. In order
+to avoid this, a wait for the completion of the previous sends is added to the
+transition between the two modes. This is implemented by replacing the variable
+\texttt{SendsInProgress} by a mutex \texttt{lockSendsDone}, which is unlocked
+once all the messages have been sent in the sending function. Moreover, it is
+also necessary to stamp data messages\index{message!stamping} (by the function \texttt{stampData}) with
+a Boolean indicating whether they have been sent during a synchronous or an
+asynchronous iteration. Then, the \texttt{lockRec} mutex is unlocked only
+after the complete reception of the data messages from synchronous
+iterations. The message ordering of point-to-point communications in MPI
+together with the barrier at the end of each iteration ensure two important
+properties of this mechanism. First, data messages from previous
+asynchronous iterations will be received but not taken into account during
+synchronous sections. Second, a data message from a synchronous
+iteration cannot be received during another synchronous iteration. In the
+asynchronous sections, no additional mechanism is needed as there are no such
+constraints concerning the data receptions.
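+
+There are several ways to implement such a stamping. A minimal sketch is given
+below, in which a one-byte header holding the stamp is assumed to be prepended
+to the packed data of each outgoing message; this byte-packed buffer layout
+(\texttt{sendBuffer}, \texttt{payloadSize}) is an assumption of the sketch and
+differs slightly from the raw data arrays used in the previous listings.
+
+\begin{Listing}{algo:ch6p2StampSketch}{Possible implementation of the data message stamping (sketch)}
+// Marks the next data messages with the current operating mode (SYNC/ASYNC).
+// sendBuffer[i] is assumed to hold, for dependency i, one stamp byte followed
+// by payloadSize[i] bytes of packed data.
+void stampData(char **sendBuffer, char mode){
+  int i;
+  for(i = 0; i < nbDeps; ++i){
+    sendBuffer[i][0] = mode;  // First byte = stamp
+  }
+}
+
+// Corresponding send in the sending thread (replaces the raw data send):
+// MPI_Ssend(sendBuffer[i], 1 + payloadSize[i], MPI_CHAR, deps[i], tagCom,
+//           MPI_COMM_WORLD);
+\end{Listing}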
+
+Finally, the required modifications of the previous scheme
+% to make it dynamically change its operating mode (asynchronous or synchronous)
+are mainly related to the computing thread. Small additions or modifications
+are also required in the main process and the other threads.
+
+In the main process, two new variables are added to store respectively the main
+operating mode of the iterative process (\texttt{mainMode}) and the duration of
+asynchronous sections (\texttt{asyncDuration}). Those variables are
+initialized by the programmer. The \texttt{lockSendsDone} mutex is also declared,
+initialized (locked) and destroyed with the other mutexes in this process.
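+
+A possible form of those additions in the main process is sketched below; the
+initial values are only indicative and must be chosen by the programmer.
+
+\begin{Listing}{algo:ch6p2MainModeSketch}{Additions to the main process for the dynamic operating mode (sketch)}
+// OpMode, SYNC and ASYNC are assumed to be defined elsewhere (e.g. an enum)
+OpMode mainMode = ASYNC;     // Main operating mode of the iterative process
+double asyncDuration = 5.0;  // Duration (in seconds) of the asynchronous sections
+omp_lock_t lockSendsDone;    // Unlocked when all previous sends are completed
+...
+// With the other locks, before the threads launching
+omp_init_lock(&lockSendsDone);
+omp_set_lock(&lockSendsDone); // Initially locked, unlocked by the sending thread
+...
+// With the other locks, after the parallel region
+omp_test_lock(&lockSendsDone);
+omp_unset_lock(&lockSendsDone);
+omp_destroy_lock(&lockSendsDone);
+\end{Listing}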
+
+In the computing function, shown in \Lst{algo:ch6p2AsyncSyncComp}, the
+modifications consist of the insertion of the timer mechanism and of the conditions
+to differentiate the actions to be done in each mode. Some additional variables
+are also required to store the current operating mode in action during the
+execution (\texttt{curMode}), the starting time of the current asynchronous
+section (\texttt{asyncStart}) and the number of successive synchronous
+iterations done (\texttt{nbSyncIter}).
+
+%\begin{algorithm}[H]
+% \caption{Computing function in the final asynchronous scheme.}% without GPU computing.}
+% \label{algo:ch6p2AsyncSyncComp}
+%\pagebreak
+\begin{Listing}{algo:ch6p2AsyncSyncComp}{Computing function in the final asynchronous scheme}% without GPU computing.}
+// Variables declarations and initialization
+...
+OpMode curMode = SYNC; // Current operating mode (always begin in sync)
+double asyncStart; // Starting time of the current async section
+int nbSyncIter = 0; // Number of sync iterations done in async mode
+
+// Computation loop
+while(!Finished){
+ // Determination of the dynamic operating mode
+ if(curMode == ASYNC){
+ // Entering synchronous mode when asyncDuration is reached
+ if(MPI_Wtime() - asyncStart >= asyncDuration){ // (additional conditions can be specified if needed)
+ // Waiting for the end of previous sends before starting sync mode
+ omp_set_lock(&lockSendsDone);
+ curMode = SYNC; // Entering synchronous mode
+ stampData(dataToSend, SYNC); // Mark data to send with sync flag
+ nbSyncIter = 0;
+ }
+ }else{
+ // In main async mode, going back to async mode when the max number of sync iterations are done
+ if(mainMode == ASYNC){
+ nbSyncIter++; // Update of the number of sync iterations done
+ if(nbSyncIter == 2){
+ curMode = ASYNC; // Going back to async mode
+ stampData(dataToSend, ASYNC); // Mark data to send
+ asyncStart = MPI_Wtime(); // Get the async starting time
+ }
+ }
+ }
+
+ // Sendings of data dependencies
+ if(curMode == SYNC || !SendsInProgress){
+ ...
+ }
+
+ // Blocking data receptions in sync mode
+ if(curMode == SYNC){
+ omp_set_lock(&lockRec);
+ }
+
+ // Local computation
+ // (init of residual, arrays swapping and iteration computation)
+ ...
+
+ // Checking of convergences (local & global) only in sync mode
+ if(curMode == SYNC){
+ // Local convergence checking (residual under threshold)
+ ...
+ // Blocking global exchange of local states of the nodes
+ ...
+ // Determination of global convergence (all nodes in local CV)
+ // Stop of the iterative process and sending of end messages
+ // or Re-initialization of state information and iteration barrier
+ ...
+ }
+
+ // Updating of the iteration number
+ iter++;
+}
+\end{Listing}
+%\end{algorithm}
+
+In the sending function, the only modification is the replacement in line~11 of
+the assignment of variable \texttt{SendsInProgress} by the unlocking of
+\texttt{lockSendsDone}. Finally, in the reception function, the only
+modification is the insertion before line~19
+of \Lst{algo:ch6p2BasicAsyncReceptions} of the extraction of the stamp from the
+message and its counting among the receipts only if the stamp is \texttt{SYNC}.
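+
+With the one-byte header assumed in the stamping sketch above, that insertion
+could take the following form; the buffer and counter names
+(\texttt{recvBuffer}, \texttt{nbSyncDataMsg}) are assumptions of the sketch.
+
+\begin{Listing}{algo:ch6p2StampRecSketch}{Possible stamp handling in the reception function (sketch)}
+// Inserted in the tagCom case, after the actual data reception:
+// the stamp is the first byte of the received buffer and only data produced
+// during synchronous iterations are counted among the awaited receptions.
+char stamp = recvBuffer[srcNd][0];  // Extraction of the stamp from the message
+if(stamp == SYNC){
+  nbSyncDataMsg++;   // Counting among the receipts only if the stamp is SYNC
+}
+// The unlocking of lockRec (used in synchronous iterations) is then
+// conditioned by nbSyncDataMsg reaching the number of dependencies of the node
+\end{Listing}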
+
+The final step to get our complete scheme using GPUs is to insert the GPU
+management in the computing thread. The first possibility, detailed
+in \Lst{algo:ch6p2syncGPU}, is to simply replace the
+CPU kernel (lines~41-43 in \Lst{algo:ch6p2AsyncSyncComp}) by a GPU kernel sequence that blocks the computing thread. This includes the data
+transfers from the node RAM to the GPU RAM, the launching of the GPU kernel, the
+waiting for kernel completion and the results transfers from GPU RAM to
+node RAM.
+
+%\begin{algorithm}[H]
+% \caption{Computing function in the final asynchronous scheme.}
+% \label{algo:ch6p2syncGPU}
+\begin{Listing}{algo:ch6p2syncGPU}{Computing function in the final asynchronous scheme}
+// Variables declarations and initialization
+...
+dim3 Dg, Db; // CUDA kernel grids
+
+// Computation loop
+while(!Finished){
+ // Determination of the dynamic operating mode, sendings of data dependencies and blocking data receptions in sync mode
+ ...
+ // Local GPU computation
+ // Data transfers from node RAM to GPU
+ CHECK_CUDA_SUCCESS(cudaMemcpyToSymbol(dataOnGPU, dataInRAM, inputsSize, 0, cudaMemcpyHostToDevice), "Data transfer");
+ ... // There may be several data transfers: typically A and b in linear problems
+ // GPU grid definition
+ Db.x = BLOCK_SIZE_X; // BLOCK_SIZE_# are kernel design dependent
+ Db.y = BLOCK_SIZE_Y;
+ Db.z = BLOCK_SIZE_Z;
+ Dg.x = localSize/BLOCK_SIZE_X + (localSize%BLOCK_SIZE_X ? 1 : 0);
+ Dg.y = localSize/BLOCK_SIZE_Y + (localSize%BLOCK_SIZE_Y ? 1 : 0);
+ Dg.z = localSize/BLOCK_SIZE_Z + (localSize%BLOCK_SIZE_Z ? 1 : 0);
+ // Use of shared memory (when possible)
+ cudaFuncSetCacheConfig(gpuKernelName, cudaFuncCachePreferShared);
+ // Kernel call
+ gpuKernelName<<<Dg,Db>>>(... @\emph{kernel parameters}@ ...);
+ // Waiting for kernel completion
+ cudaDeviceSynchronize();
+ // Results transfer from GPU to node RAM
+ CHECK_CUDA_SUCCESS(cudaMemcpyFromSymbol(resultsInRam, resultsOnGPU, resultsSize, 0, cudaMemcpyDeviceToHost), "Results transfer");
+ // Potential post-treatment of results on the CPU
+ ...
+
+ // Convergences checking
+ ...
+}
+\end{Listing}
+%\end{algorithm}
+
+This scheme provides asynchronism through a cluster of GPUs as well as a
+complete overlap of communications with GPU computations (similarly
+to~\Sec{ch6:part1}). However, the autonomy of GPU devices with respect to their
+host can be further exploited in order to perform some computations on the CPU
+while the GPU kernel is running. The nature of the computations that can be done by
+the CPU may vary depending on the application. For example, when processing data
+streams (pipelines), the pre-processing of the next data item and/or the post-processing of the
+previous result can be done on the CPU while the GPU is processing the current
+data item. In other cases, the CPU can perform \emph{auxiliary}
+computations\index{computation!auxiliary}
+that are not absolutely required to obtain the result but that may accelerate
+the entire iterative process. Another possibility would be to distribute the
+main computations between the GPU and the CPU. However, this
+usually yields only small performance gains. This is mainly due to the data
+dependencies, which often require additional transfers between the CPU and the GPU.
+
+So, if we consider that the application enables such an overlap of
+computations, its implementation is straightforward as it consists in inserting
+the additional CPU computations between lines~23 and~24
+in \Lst{algo:ch6p2syncGPU}. Nevertheless, such a scheme is fully efficient only
+if the computation times on both sides are similar.
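+
+For instance, with the pipeline example mentioned above, that insertion could
+be sketched as follows; the pre- and post-processing functions are hypothetical
+and stand for whatever CPU work the application can perform at that point.
+
+\begin{Listing}{algo:ch6p2CPUOverlapSketch}{CPU computations overlapped with the GPU kernel (sketch)}
+// Kernel call: the launch is asynchronous with respect to the host,
+// so the CPU is free until the synchronization below
+gpuKernelName<<<Dg,Db>>>(... @\emph{kernel parameters}@ ...);
+// Additional CPU computations overlapped with the GPU kernel
+// (hypothetical functions for a pipeline-like application)
+postProcessOnCPU(previousResults);  // Post-processing of the previous result
+preProcessOnCPU(nextDataItem);      // Pre-processing of the next data item
+// Waiting for kernel completion
+cudaDeviceSynchronize();
+\end{Listing}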
+
+In some cases, especially with auxiliary computations, another interesting
+solution is to add a fourth CPU thread to perform them. This suppresses the
+duration constraint over those optional computations as they are performed in
+parallel with the main iterative process, without blocking it. Moreover, this
+scheme stays coherent with current architectures as most nodes include four CPU
+cores. The algorithmic scheme of such a complete overlap of
+CPU/GPU computations and communications is described in
+Listings~\ref{algo:ch6p2FullOverAsyncMain},~\ref{algo:ch6p2FullOverAsyncComp1}
+and~\ref{algo:ch6p2FullOverAsyncComp2}, where we suppose that the auxiliary
+computations use intermediate results of the main computation process from any previous iteration. This may be
+different according to the application.
+
+%\begin{algorithm}[H]
+% \caption{Initialization of the main process of complete overlap with asynchronism.}
+% \label{algo:ch6p2FullOverAsyncMain}
+\pagebreak
+\begin{Listing}{algo:ch6p2FullOverAsyncMain}{Initialization of the main process of complete overlap with asynchronism}
+// Variables declarations and initialization
+...
+omp_lock_t lockAux; // Informs main thread about new aux results
+omp_lock_t lockRes; // Informs aux thread about new results
+omp_lock_t lockWrite; // Controls exclusion of results access
+... auxRes ... ; // Results of auxiliary computations
+
+// Parameters reading, MPI initialization, data initialization and distribution
+...
+// OpenMP initialization
+...
+omp_init_lock(&lockAux);
+omp_set_lock(&lockAux); // Unlocked when new aux results are available
+omp_init_lock(&lockRes);
+omp_set_lock(&lockRes); // Unlocked when new results are available
+omp_init_lock(&lockWrite);
+// lockWrite is left unlocked: it only ensures exclusive access to the results
+
+#pragma omp parallel
+{
+ switch(omp_get_thread_num()){
+ case COMPUTATION :
+ computations(... @\emph{relevant parameters}@ ...);
+ break;
+
+ case AUX_COMPS :
+ auxComps(... @\emph{relevant parameters}@ ...);
+ break;
+
+ case SENDINGS :
+ sendings();
+ break;
+
+ case RECEPTIONS :
+ receptions();
+ break;
+ }
+}
+
+// Cleaning of OpenMP locks
+...
+omp_test_lock(&lockAux);
+omp_unset_lock(&lockAux);
+omp_destroy_lock(&lockAux);
+omp_test_lock(&lockRes);
+omp_unset_lock(&lockRes);
+omp_destroy_lock(&lockRes);
+omp_test_lock(&lockWrite);
+omp_unset_lock(&lockWrite);
+omp_destroy_lock(&lockWrite);
+
+// MPI termination
+MPI_Finalize();
+\end{Listing}
+%\end{algorithm}
+
+%\begin{algorithm}[H]
+% \caption{Computing function in the final asynchronous scheme with CPU/GPU overlap.}
+% \label{algo:ch6p2FullOverAsyncComp1}
+\pagebreak
+\begin{Listing}{algo:ch6p2FullOverAsyncComp1}{Computing function in the final asynchronous scheme with CPU/GPU overlap}
+// Variables declarations and initialization
+...
+dim3 Dg, Db; // CUDA kernel grids
+
+// Computation loop
+while(!Finished){
+ // Determination of the dynamic operating mode, sendings of data dependencies and blocking data receptions in sync mode
+ ...
+ // Local GPU computation
+ // Data transfers from node RAM to GPU, GPU grid definition and init of shared mem
+ CHECK_CUDA_SUCCESS(cudaMemcpyToSymbol(dataOnGPU, dataInRAM, inputsSize, 0, cudaMemcpyHostToDevice), "Data transfer");
+ ...
+ // Kernel call
+ gpuKernelName<<<Dg,Db>>>(... @\emph{kernel parameters}@ ...);
+ // Potential pre/post-treatments in pipeline like computations
+ ...
+ // Waiting for kernel completion
+ cudaDeviceSynchronize();
+ // Results transfer from GPU to node RAM
+ omp_set_lock(&lockWrite); // Wait for write access to resultsInRam
+ CHECK_CUDA_SUCCESS(cudaMemcpyFromSymbol(resultsInRam, resultsOnGPU, resultsSize, 0, cudaMemcpyDeviceToHost), "Results transfer");
+ // Potential post-treatments in non-pipeline computations
+ ...
+ omp_unset_lock(&lockWrite); // Give back read access to aux thread
+ omp_test_lock(&lockRes);
+ omp_unset_lock(&lockRes); // Informs aux thread of new results
+
+ // Auxiliary computations availability checking
+ if(omp_test_lock(&lockAux)){
+ // Use auxRes to update the iterative process
+ ... // May induce additional GPU transfers
+ }
+
+ // Convergences checking
+ if(curMode == SYNC){
+ // Local convergence checking and global exchange of local states
+ ...
+ // Determination of global convergence (all nodes in local CV)
+ if(localCV == 1 && nbOtherCVs == nbP-1){
+ // Stop of the iterative process and sending of end messages
+ ...
+ // Unlocking of aux thread for termination
+ omp_test_lock(&lockRes);
+ omp_unset_lock(&lockRes);
+ }else{
+ // Re-initialization of state information and iteration barrier
+ ...
+ }
+ }
+}
+\end{Listing}
+%\end{algorithm}
+
+%\begin{algorithm}[H]
+% \caption{Auxiliary computing function in the final asynchronous scheme with CPU/GPU overlap.}
+% \label{algo:ch6p2FullOverAsyncComp2}
+\pagebreak
+\begin{Listing}{algo:ch6p2FullOverAsyncComp2}{Auxiliary computing function in the final asynchronous scheme with CPU/GPU overlap}
+// Variables declarations and initialization
+... auxInput ... // Local array for input data
+
+// Computation loop
+while(!Finished){
+ // Data copy from resultsInRam into auxInput
+ omp_set_lock(&lockRes); // Waiting for new results from main comps
+ if(!Finished){
+ omp_set_lock(&lockWrite); // Waiting for access to results
+ for(ind=0; ind<resultsSize; ++ind){
+ auxInput[ind] = resultsInRam[ind];
+ }
+ omp_unset_lock(&lockWrite); // Give back write access to main thread
+ // Auxiliary computations with possible interruption at the end
+ for(ind=0; ind<auxSize && !Finished; ++ind){
+ // Computation of auxRes array according to auxInput
+ ...
+ }
+ // Informs main thread that new aux results are available in auxRes
+ omp_test_lock(&lockAux); // Ensures mutex is locked when unlocking
+ omp_unset_lock(&lockAux);
+ }
+}
+\end{Listing}
+%\end{algorithm}
+
+As can be seen in \Lst{algo:ch6p2FullOverAsyncMain}, there are three additional
+mutexes (\texttt{lockAux}, \texttt{lockRes} and \texttt{lockWrite}) that are used
+respectively to inform the main computation thread that new auxiliary results
+are available (lines~20-21 in \Lst{algo:ch6p2FullOverAsyncComp2} and line~29 in
+\Lst{algo:ch6p2FullOverAsyncComp1}), to inform the auxiliary thread that new
+results from the main thread are available (lines~25-26
+in \Lst{algo:ch6p2FullOverAsyncComp1} and line~7
+in \Lst{algo:ch6p2FullOverAsyncComp2}), and to perform exclusive accesses to the
+results from those two threads (lines~20,~24
+in \Lst{algo:ch6p2FullOverAsyncComp1} and 9,~13
+in \Lst{algo:ch6p2FullOverAsyncComp2}). Also, an additional array
+(\texttt{auxRes}) is required to store the results of the auxiliary computations
+as well as a local array for the input of the auxiliary function
+(\texttt{auxInput}). That last function has the same general organization as the
+send/receive ones, that is, a global loop conditioned by the end of the global
+process. At each iteration of this function, the thread waits for the
+availability of new results produced by the main computation thread. This avoids
+performing the same computations several times with the same input data.
+Then, input data of auxiliary computations
+% (as supposed here, they often
+% correspond to the results of the main computations, but may sometimes be
+% something else)
+is copied with a mutual exclusion mechanism. Finally, auxiliary
+computations are performed. When they are completed, the associated mutex is
+unlocked to signal the availability of those auxiliary results to the main
+computing thread. The main thread regularly checks this availability at the end
+of its iterations and takes them into account whenever this is possible.
+
+Finally, we obtain an algorithmic scheme allowing maximal overlap between
+CPU and GPU computations as well as communications. It is worth noticing that
+such a scheme is also usable on systems without GPUs but with four-core nodes.
+
+\subsection{Experimental validation}
+\label{sec:ch6p2expes}
+
+As in~\Sec{ch6:part1}, we validate the feasibility of our asynchronous scheme
+with some experiments performed with a representative example of a scientific
+application. It is a three-dimensional version of the
+advection-diffusion-reaction process\index{PDE example} that models the evolution of the
+concentrations of two chemical species in shallow waters. As this process is
+dynamic in time, the simulation is performed for a given number of consecutive
+time steps. This implies two nested loops in the iterative process, the outer
+one for the time steps and the inner one for solving the problem at each time.
+Full details about this PDE problem can be found in~\cite{ChapNRJ2011}. That
+two-stage iterative process implies a few adaptations of the general scheme
+presented above in order to include the outer iterations over the time steps, but the
+inner iterative process closely follows the same scheme.
+
+We show two series of experiments performed with 16 nodes of the first cluster
+described in~\Sec{ch6:p1expes}. The first one deals with the comparison of
+synchronous and asynchronous computations. The second one is related to the use
+of auxiliary computations. In the context of our PDE application, they consist in
+the update of the Jacobian of the system.
+
+\subsubsection*{Synchronous and asynchronous computations}
+
+The first experiment allows us to check that the asynchronous behavior obtained
+with our scheme corresponds to the expected one with respect to its synchronous
+counterpart. So, we show in~\Fig{fig:ch6p2syncasync} the computation times of
+our test application in both modes for different problem sizes. The size shown
+is the number of discrete spatial elements on each side of the cube representing
+the 3D volume. Moreover, each of these elements holds the
+concentrations of the two chemical species considered. So, for example, size 30
+corresponds in fact to $30\times30\times30\times2$ values.
+
+\begin{figure}[h]
+ \centering
+ \includegraphics[width=.75\columnwidth]{Chapters/chapter6/curves/syncasync.pdf}
+ \caption{Computation times of the test application in synchronous and
+ asynchronous modes.}
+ \label{fig:ch6p2syncasync}
+\end{figure}
+
+The results obtained show that the asynchronous version is noticeably faster than
+the synchronous one for smaller problem sizes, then it becomes similar or even
+a bit slower for larger problem sizes. A closer comparison of computation and
+communication times in each execution confirms that this behavior is consistent.
+The asynchronous version is interesting when the communication time
+is similar to or larger than the computation time. In our example, this is the
+case up to a problem size between 50 and 60. Then, computations become longer
+than communications. Since asynchronous computations often require more
+iterations to converge, the gain obtained on the communication side becomes smaller
+than the overhead generated on the computation side, and the asynchronous version
+takes longer.
+% So, the observed behavior is fully coherent with the expected behavior.
+
+\subsubsection*{Overlap of auxiliary computations}
+
+In this experiment, we use only the asynchronous version of the application. In
+the context of our test application, we have an iterative PDE solver based on a
+Newton resolution. Such solvers are written under the form
+$x=T(x),~x\in\Reals^n$ where $T(x)=x-F'(x)^{-1}F(x)$ and $F'$ is the Jacobian of
+the system. In such cases, it is necessary to compute the vector $\Delta x$ in
+$F'\times \Delta x=-F$ to update $x$ with $\Delta x$. There are two levels of
+iterations, the inner level to get a stabilized version of $x$, and the outer
+level to compute $x$ at the successive time steps in the simulation process. In
+this context, classical algorithms compute $F'$ either only at the first iteration
+of each time step or at some iterations but not all, because the computation of $F'$ is done
+in the main iterative process and has a relatively high computing cost.
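+
+Written at the level of one inner iteration $k$ of a time step, the update
+performed is thus
+\[
+  F'\,\Delta x^{k} = -F(x^{k}), \qquad x^{k+1} = x^{k} + \Delta x^{k},
+\]
+where $F'$ is the Jacobian available at that iteration which, as explained
+above, may have been computed at an earlier iteration.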
+
+However, with the scheme presented above, it is possible to continuously compute
+new versions of $F'$ in parallel to the main iterative process without
+penalizing it. Hence, $F'$ is updated as often as possible and taken into
+account in the main computations when it is relevant. So, the Newton process
+should be accelerated a little bit.
+
+We compare the performance obtained with overlapped Jacobian updates and
+non-overlapped ones for several problem sizes, see~\Fig{fig:ch6p2aux}.
+\begin{figure}[h]
+ \centering
+ \includegraphics[width=.75\columnwidth]{Chapters/chapter6/curves/recouvs.pdf}
+ \caption{Computation times with or without overlap of Jacobian updates
+ in asynchronous mode.}
+ \label{fig:ch6p2aux}
+\end{figure}
+
+The overlap is clearly efficient as the computation times with overlapped
+Jacobian updates are much better than without overlap. Moreover, the ratio
+between the two versions tends to increase with the problem size, which is as
+expected. Also, we have tested the application without auxiliary computations at
+all, that is, the Jacobian is computed only once at the beginning of each time
+step of the simulation. The results for this last version are quite similar to
+those with overlapped auxiliary computations, and even better for small problem sizes.
+The fact that no significant gain can be seen on this range of problem sizes is due
+to the limited number of Jacobian updates taken into account in the main
+computation. This happens when the Jacobian update takes as long as
+several iterations of the main process. So, the benefit is reduced in this
+particular case.
+
+Those results show two things. First, auxiliary computations do not induce great
+overhead in the whole process. Second, for this particular application the
+choice of updating the Jacobian matrix as auxiliary computations does not speed
+up the iterative process. This does not question the parallel scheme in itself
+but merely points out the difficulty of identifying relevant auxiliary
+computations. Indeed, this identification depends on the considered application
+and requires a thorough, specialized analysis.
+
+Another interesting choice could be the computation of load estimations for
+dynamic load balancing, especially in decentralized diffusion strategies where
+loads are transferred between neighboring nodes~\cite{BCVG11}. In such a case,
+the load evaluation and the comparison with the other nodes can be done in parallel
+with the main computations without perturbing them.
+
+%%% Local Variables:
+%%% mode: latex
+%%% fill-column: 80
+%%% ispell-dictionary: "american"
+%%% mode: flyspell
+%%% TeX-master: "../../BookGPU.tex"
+%%% End: