-\section {A CUDA parallel Ehrlich-Aberth method}
-In the following, we describe the parallel implementation of Ehrlich-Aberth method on GPU
-for solving high degree polynomials (up to $1,000,000$). First, the hardware and software of the GPUs are presented. Then, the CUDA parallel Ehrlich-Aberth method is presented.
-
-\subsection{Background on the GPU architecture}
-A GPU is viewed as an accelerator for the data-parallel and
-intensive arithmetic computations. It draws its computing power
-from the parallel nature of its hardware and software
-architectures. A GPU is composed of hundreds of Streaming
-Processors (SPs) organized in several blocks called Streaming
-Multiprocessors (SMs). It also has a memory hierarchy. It has a
-private read-write local memory per SP, fast shared memory and
-read-only constant and texture caches per SM and a read-write
-global memory shared by all its SPs~\cite{NVIDIA10}.
-
-On a CPU equipped with a GPU, all the data-parallel and intensive
-functions of an application running on the CPU are off-loaded onto
-the GPU in order to accelerate their computations. A similar
-data-parallel function is executed on a GPU as a kernel by
-thousands or even millions of parallel threads, grouped together
-as a grid of thread blocks. Therefore, each SM of the GPU executes
-one or more thread blocks in SIMD fashion (Single Instruction,
-Multiple Data) and in turn each SP of a GPU SM runs one or more
-threads within a block in SIMT fashion (Single Instruction,
-Multiple threads). Indeed at any given clock cycle, the threads
-execute the same instruction of a kernel, but each of them
-operates on different data.
- GPUs only work on data filled in their
-global memories and the final results of their kernel executions
-must be communicated to their CPUs. Hence, the data must be
-transferred in and out of the GPU. However, the speed of memory
-copy between the GPU and the CPU is slower than the memory
-bandwidths of the GPU memories and, thus, it dramatically affects
-the performances of GPU computations. Accordingly, it is necessary
-to limit as much as possible, data transfers between the GPU and its CPU during the
-computations.
-\subsection{Background on the CUDA Programming Model}
-
-The CUDA programming model is similar in style to a single program
-multiple-data (SPMD) software model. The GPU is viewed as a
-coprocessor that executes data-parallel kernel functions. CUDA
-provides three key abstractions, a hierarchy of thread groups,
-shared memories, and barrier synchronization. Threads have a three
-level hierarchy. A grid is a set of thread blocks that execute a
-kernel function. Each grid consists of blocks of threads. Each
-block is composed of hundreds of threads. Threads within one block
-can share data using shared memory and can be synchronized at a
-barrier. All threads within a block are executed concurrently on a
-multithreaded architecture.The programmer specifies the number of
-threads per block, and the number of blocks per grid. A thread in
-the CUDA programming language is much lighter weight than a thread
-in traditional operating systems. A thread in CUDA typically
-processes one data element at a time. The CUDA programming model
-has two shared read-write memory spaces, the shared memory space
-and the global memory space. The shared memory is local to a block
-and the global memory space is accessible by all blocks. CUDA also
-provides two read-only memory spaces, the constant space and the
-texture space, which reside in external DRAM, and are accessed via
-read-only caches.
+%% \section {A CUDA parallel Ehrlich-Aberth method}
+%% In the following, we describe the parallel implementation of Ehrlich-Aberth method on GPU
+%% for solving high degree polynomials (up to $1,000,000$). First, the hardware and software of the GPUs are presented. Then, the CUDA parallel Ehrlich-Aberth method is presented.
+
+%% \subsection{Background on the GPU architecture}
+%% A GPU is viewed as an accelerator for the data-parallel and
+%% intensive arithmetic computations. It draws its computing power
+%% from the parallel nature of its hardware and software
+%% architectures. A GPU is composed of hundreds of Streaming
+%% Processors (SPs) organized in several blocks called Streaming
+%% Multiprocessors (SMs). It also has a memory hierarchy. It has a
+%% private read-write local memory per SP, fast shared memory and
+%% read-only constant and texture caches per SM and a read-write
+%% global memory shared by all its SPs~\cite{NVIDIA10}.
+
+%% On a CPU equipped with a GPU, all the data-parallel and intensive
+%% functions of an application running on the CPU are off-loaded onto
+%% the GPU in order to accelerate their computations. A similar
+%% data-parallel function is executed on a GPU as a kernel by
+%% thousands or even millions of parallel threads, grouped together
+%% as a grid of thread blocks. Therefore, each SM of the GPU executes
+%% one or more thread blocks in SIMD fashion (Single Instruction,
+%% Multiple Data) and in turn each SP of a GPU SM runs one or more
+%% threads within a block in SIMT fashion (Single Instruction,
+%% Multiple threads). Indeed at any given clock cycle, the threads
+%% execute the same instruction of a kernel, but each of them
+%% operates on different data.
+%% GPUs only work on data filled in their
+%% global memories and the final results of their kernel executions
+%% must be communicated to their CPUs. Hence, the data must be
+%% transferred in and out of the GPU. However, the speed of memory
+%% copy between the GPU and the CPU is slower than the memory
+%% bandwidths of the GPU memories and, thus, it dramatically affects
+%% the performances of GPU computations. Accordingly, it is necessary
+%% to limit as much as possible, data transfers between the GPU and its CPU during the
+%% computations.
+%% \subsection{Background on the CUDA Programming Model}
+
+%% The CUDA programming model is similar in style to a single program
+%% multiple-data (SPMD) software model. The GPU is viewed as a
+%% coprocessor that executes data-parallel kernel functions. CUDA
+%% provides three key abstractions, a hierarchy of thread groups,
+%% shared memories, and barrier synchronization. Threads have a three
+%% level hierarchy. A grid is a set of thread blocks that execute a
+%% kernel function. Each grid consists of blocks of threads. Each
+%% block is composed of hundreds of threads. Threads within one block
+%% can share data using shared memory and can be synchronized at a
+%% barrier. All threads within a block are executed concurrently on a
+%% multithreaded architecture.The programmer specifies the number of
+%% threads per block, and the number of blocks per grid. A thread in
+%% the CUDA programming language is much lighter weight than a thread
+%% in traditional operating systems. A thread in CUDA typically
+%% processes one data element at a time. The CUDA programming model
+%% has two shared read-write memory spaces, the shared memory space
+%% and the global memory space. The shared memory is local to a block
+%% and the global memory space is accessible by all blocks. CUDA also
+%% provides two read-only memory spaces, the constant space and the
+%% texture space, which reside in external DRAM, and are accessed via
+%% read-only caches.