+\subsection{An MPI-CUDA approach}
+
+Our parallel implementation of EA for finding the roots of polynomials
+with a CUDA-MPI approach follows a computing scheme similar to the one
+used in CUDA-OpenMP. Each process is responsible for computing its own
+part of the roots, using all the roots computed by the other processes
+at the previous iteration. The difference between the two approaches
+lies in the way processes communicate and exchange data: with MPI,
+processes need to send and receive data explicitly. In
+Algorithm~\ref{alg2-cuda-mpi}, after the initialization, all the
+processes hold the same $Z$ vector. They then compute the
+parameters used by the $MPI\_AlltoAll$ routines (line 4). In practice,
+each process computes its offset and its local size. Next, each
+process allocates memory on its GPU (line 5). At the
+beginning of each iteration, a process starts by transferring the
+whole vector $Z$ from the CPU to the GPU (line 7). Then only the local
+part of $Z^{prev}$ is saved (line 8). After that, the process
+computes its own roots (line 9). Next, the local error is
+computed (line 10), followed by the global error (line 11). The local
+roots are then transferred from the GPU memory to the CPU memory (line 12)
+before being exchanged between all processes (line 13), so that
+every process has the latest version of the roots. If
+convergence is not reached, a new iteration is executed.
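+
+To illustrate the offset and local size computed at line 4, one
+possible choice is a contiguous block distribution. This is only a
+sketch under our own assumptions: the symbols $n$ (number of roots),
+$p$ (number of processes), $k$ (rank of a process, $0 \le k < p$),
+$d_k$ (offset) and $n_k$ (local size) are introduced here for
+illustration, and the actual implementation may distribute the roots
+differently. Writing $q = \lfloor n/p \rfloor$ and $r = n \bmod p$,
+\[
+  d_k = k\,q + \min(k, r), \qquad n_k = d_{k+1} - d_k ,
+\]
+so that the first $r$ processes receive $q+1$ roots and the others
+receive $q$ roots. For instance, with $n = 10$ and $p = 4$ we get
+$q = 2$ and $r = 2$, hence the offsets $(d_0, d_1, d_2, d_3) = (0, 3, 6, 8)$
+and the local sizes $(n_0, n_1, n_2, n_3) = (3, 3, 2, 2)$, which sum to
+$n$ as required.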