-\subsection{Parallel implementation with CUDA}
-On the CPU, steps 3 and 4 each contain a \verb=for= loop, and a single thread executes all the instructions in the loop $n$ times. In this subsection, we explain how the GPU architecture can execute this loop in parallel and reduce the execution time.
-On the GPU, the scheduler assigns the execution of this loop to a
-group of threads organised as a grid of blocks, with each block containing a
-number of threads. All threads within a block are executed
-concurrently. The instructions run on the GPU are grouped
-in special functions called kernels. With CUDA, a programmer must
-describe the kernel execution context: the size of the grid, that is, the number of blocks, and the number of threads per block.
+
+%On the CPU, steps 3 and 4 each contain a \verb=for= loop, and a single thread executes all the instructions in the loop $n$ times. In this subsection, we explain how the GPU architecture can execute this loop in parallel and reduce the execution time.
+%On the GPU, the scheduler assigns the execution of this loop to a
+%group of threads organised as a grid of blocks, with each block containing a
+%number of threads. All threads within a block are executed
+%concurrently. The instructions run on the GPU are grouped
+%in special functions called kernels. With CUDA, a programmer must
+%describe the kernel execution context: the size of the grid, that is, the number of blocks, and the number of threads per block.