relaxed components to be used in the computational process may be guided by any criterion,
and in particular, a natural criterion is to pick up the most recently available
values of the components computed by the processors. Furthermore, the asynchronous
-iterations are implemented by means of nonblocking MPI communication subroutines\index{MPI subroutines!nonblocking}
+iterations are implemented by means of nonblocking MPI communication subroutines\index{MPI!nonblocking}
(asynchronous communications).
The important property ensuring the convergence of the parallel projected Richardson
and \verb+cublasGetVectorAsync()+ in the asynchronous algorithm. Moreover, we
use the communication routines of the MPI library to carry out the data exchanges
between neighboring nodes. We use \verb+MPI_Isend()+
-and \verb+MPI_Irecv()+ to perform nonblocking\index{MPI subroutines!nonblocking}
+and \verb+MPI_Irecv()+ to perform nonblocking\index{MPI!nonblocking}
sends and receives, respectively. For the synchronous algorithm, we use the MPI
routine \verb+MPI_Waitall()+, which puts the MPI process of a computing node in
blocking status until all data exchanges with neighboring nodes (sends and receives)
are completed. In contrast, for the asynchronous algorithms, we use the MPI routine
\verb+MPI_Test()+, which tests the completion of a data exchange (send or receive)
-without putting the MPI process in blocking status\index{MPI subroutines!blocking}.
+without putting the MPI process in blocking status\index{MPI!blocking}.
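To make the two completion strategies concrete, the following simplified C sketch
(it is not the chapter's actual code; the buffer names, the message size, and the
neighbor rank are placeholders) shows how a boundary exchange with one neighboring
node could be posted with \verb+MPI_Isend()+ and \verb+MPI_Irecv()+ and then either
completed with \verb+MPI_Waitall()+ (synchronous algorithm) or merely tested with
\verb+MPI_Test()+ (asynchronous algorithm).
\begin{verbatim}
/* Simplified sketch (not the chapter's code): boundary exchange with one
 * neighboring node; buffers, message size, and neighbor rank are
 * placeholders.                                                          */
#include <mpi.h>

/* Post the nonblocking send and receive of one boundary exchange. */
static void post_exchange(double *send_buf, double *recv_buf, int count,
                          int neighbor, MPI_Comm comm, MPI_Request reqs[2])
{
  MPI_Isend(send_buf, count, MPI_DOUBLE, neighbor, 0, comm, &reqs[0]);
  MPI_Irecv(recv_buf, count, MPI_DOUBLE, neighbor, 0, comm, &reqs[1]);
}

/* Synchronous algorithm: block until both the send and the receive
 * are completed.                                                         */
static void complete_exchange_sync(MPI_Request reqs[2])
{
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}

/* Asynchronous algorithm: test completion without blocking; while the
 * exchange is pending, the node keeps relaxing with the data it already
 * has and tests again at the next iteration.                             */
static int exchange_completed_async(MPI_Request reqs[2])
{
  int sent = 0, received = 0;
  MPI_Test(&reqs[0], &sent, MPI_STATUS_IGNORE);
  MPI_Test(&reqs[1], &received, MPI_STATUS_IGNORE);
  return sent && received;
}
\end{verbatim}
In the synchronous case, a node posts the exchange and then blocks until it completes
before starting the next relaxation; in the asynchronous case, it only reposts a new
exchange when the nonblocking test reports that the previous one has finished.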
The function $Compute\_New\_Vector\_Elements()$ (line~$6$ in Algorithm~\ref{ch13:alg:02})
computes, at each iteration, the new elements of the iterate vector $U$. Its general code
conv \leftarrow true;
\end{array}
$$
-where the function $AllReduce()$ uses the MPI global reduction subroutine\index{MPI subroutines!global}
+where the function $AllReduce()$ uses the MPI global reduction subroutine\index{MPI!global}
\verb+MPI_Allreduce()+ to compute the maximal value, $maxerror$, among the local
absolute errors, $error$, of all computing nodes; $p$ (in Algorithm~\ref{ch13:alg:02})
is used as a counter of the local relaxations carried out by a computing node. In
\begin{table}
\centering
+\begin{scriptsize}
\begin{tabular}{|c|c|c|c|c|c|c|c|}
\hline
\multirow{2}{*}{\bf Pb. size} & \multicolumn{3}{c|}{\bf Synchronous} & \multicolumn{3}{c|}{\bf Asynchronous} & \multirow{2}{*}{\bf Gain\%} \\ \cline{2-7}
$800^{3}$ & $3,950.87$ & $899,088$ & $56.22$ & $3,636.57$ & $834,900$ & $51.91$ & $7.95$ \\ \hline
\end{tabular}
+\end{scriptsize}
\vspace{0.5cm}
\caption{Execution times in seconds of the parallel projected Richardson method implemented on a cluster of 12 GPUs.}
\label{ch13:tab:02}
%%--------------------------%%
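As a complement to the convergence detection described above, the following simplified
C sketch (not the chapter's code; the variable names and the threshold \verb+eps+ are
placeholders) shows how \verb+MPI_Allreduce()+ with the \verb+MPI_MAX+ operation reduces
the local absolute errors to their maximum, $maxerror$, on every computing node.
\begin{verbatim}
/* Simplified sketch (not the chapter's code) of the global convergence
 * test; variable names and the threshold eps are placeholders.          */
#include <mpi.h>

int convergence_reached(double error, double eps, MPI_Comm comm)
{
  double maxerror;
  /* Every node contributes its local absolute error; all nodes receive
   * the maximum over the whole cluster.                                  */
  MPI_Allreduce(&error, &maxerror, 1, MPI_DOUBLE, MPI_MAX, comm);
  return (maxerror < eps);  /* conv <- true when the global error is small */
}
\end{verbatim}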
\section{Red-black ordering technique}
\label{ch13:sec:06}
-As is wellknown, the Jacobi method\index{iterative method!Jacobi} is characterized
+As is well known, the Jacobi method\index{iterative method!Jacobi} is characterized
by a slow convergence\index{convergence} rate compared to some other iterative methods\index{iterative method}
(for example, the Gauss-Seidel method\index{iterative method!Gauss-Seidel}). So, in this
section, we present some solutions to reduce the execution time and the number of
we apply the point red-black ordering\index{iterative method!red-black ordering}
according to the $y$-coordinate, as shown in Figure~\ref{ch13:fig:06.02}. In
this case, the vector elements having even $y$-coordinate are computed in parallel
-using the values of those having odd $y$-coordinate and then viceversa. Moreover,
+using the values of those having odd $y$-coordinate and then vice versa. Moreover,
in the GPU implementation of the parallel projected Richardson method (Section~\ref{ch13:sec:04}),
we have shown that a subproblem of size $(NX\times ny\times nz)$ is decomposed into
$nz$ grids of size $(NX\times ny)$. Then, each kernel is executed in parallel by