-Such a mask allows to replace a generic 2-D convolution operation by two consecutive stages of a 1-D convolution operation: a vertical of mask $h_v$ and a horizontal of mask $h_h$.
-This saves a lot of arithmetic operations, as a generic $n\times n$ convolution applied on a $H\times L$ image basically represents $H.L.n^2$ multiplications and as many additions, while two consecutive $n\times 1$ convolutions only represents $2.H.L.n$ of each, \textit{eg} 60\% operations are saved per pixel of the image for a $5\times 5$ mask.\\
-However, beside reducing the operation count, performing a separable convolution also means writing an intermediate image into global memory.
-CPU implementations of separable convolutions often use a single function to perform both 1-D convolution stages. To do so, this function reads the input image and actually ouputs the transposed filtered image.
-Applying that principle to GPUs is not efficient, as outputting the transposed image means non-coalescent writes into global memory, generating severe performance loss. Hence the idea of developing two different kernels, one for each of both vertical and horizontal convolutions.
-
-Here, the use of Shared memory is the best choice, as there is no overlapping between neighbor windows and thus no possible optimization.
-Moreover, to ensure efficiency, it is important to read the input image from texture memory, which implies an internal GPU data copy between both 1-D convolution stages.
-Which, even if it is faster than CPU/GPU data transfer, makes separable convolutions slower than generic convolutions for small mask sizes. On C2070, the lower limit is $7\times 7$ pixels ($9\times 9$ for $512\times 512$ images).
-
-Both vertical and horizontal kernels feature similar runtimes: Table \ref{tab:convoSepSh1} only contains their average execution time, including the internal data copy stage, while Table \ref{tab:convoSepSh2} shows the achieved global throughput values. Timings of the data copy stage are given in Table \ref{tab:cpyToArray}.
-Listings \ref{lst:convoSepShV} and \ref{lst:convoSepShH} detail the implementation of both 1-D kernels, while Listing \ref{lst:convoSepSh} shows how to use them in addition with the data copy function in order to achieve a whole separable convolution. The shared memory size is dynamically passed as a parameter at kernel call time. Its expression is given in the comment line before its declaration.
-\begin{table}[h]
-\centering
-{\normalsize
-\begin{tabular}{|c||r|}
-\hline
-\textbf{Image size}&\textbf{C2070}\\\hline\hline
-$\mathbf{512\times 512}$ &0.029 \\\hline
-$\mathbf{1024\times 1024}$&0.101 \\\hline
-$\mathbf{2048\times 2048}$&0.387 \\\hline
-$\mathbf{4096\times 4096}$&1.533 \\\hline
-\end{tabular}
-}
-\caption{Time cost of data copy between the vertical and the horizontal 1-D convolution stages, on a C2070 cards (in milliseconds).}
-\label{tab:cpyToArray}
-\end{table}
+Such a mask allows us to replace a generic 2D convolution by two consecutive 1D convolution stages: a vertical pass with mask $h_v$ followed by a horizontal pass with mask $h_h$.
+This saves many arithmetic operations: a generic $n\times n$ convolution applied to an $H\times L$ image requires $HLn^2$ multiplications and as many additions, while two consecutive $n\times 1$ convolutions require only $2HLn$ of each, e.g., 60\% of the operations are saved per pixel for a $5\times 5$ mask.
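+To make the $5\times 5$ case explicit, the per-pixel cost drops from $n^2=25$ multiplications (and as many additions) to $2n=10$, hence a saving of
+\[
+\frac{n^2-2n}{n^2}=\frac{25-10}{25}=0.6=60\,\%.
+\]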
+
+However, besides reducing the operation count, performing a separable convolution also means writing an intermediate image into global memory.
+CPU implementations of separable convolutions often use a single function to perform both 1D convolution stages: it reads the input image and outputs the transposed filtered image, so that calling it twice performs both passes and restores the original orientation.
+Applying this principle to GPUs is not efficient, as outputting the transposed image means non-coalesced writes into global memory, which causes a severe performance loss. Hence the choice of two different kernels, one for the vertical convolution and one for the horizontal convolution.
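+As a purely illustrative sketch of this decomposition (not the chapter's Listing \ref{lst:convoSepShV}: it ignores texture and shared memory, reads straight from global memory, and assumes the 1D mask has been copied beforehand into a hypothetical constant array), a vertical pass could look as follows; its only point is that the output keeps the input's row-major layout, so global writes stay coalesced:
+\begin{lstlisting}
+#define MASK_MAX 32                 // assumed upper bound on mask width
+__constant__ float d_hv[MASK_MAX];  // 1-D vertical mask (set by the host)
+
+// Naive vertical pass: one thread per output pixel, mask radius r.
+__global__ void kernel_convoV_naive(const float *in, float *out,
+                                    int L, int H, int r)
+{
+  int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
+  int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
+  if (x >= L || y >= H) return;
+
+  float sum = 0.0f;
+  for (int k = -r; k <= r; k++) {
+    int yy = min(max(y + k, 0), H - 1);            // clamp rows at borders
+    sum += d_hv[k + r] * in[yy * L + x];
+  }
+  // Same row-major layout as the input: consecutive threads (consecutive x)
+  // write consecutive addresses, so the stores are coalesced.
+  out[y * L + x] = sum;
+}
+\end{lstlisting}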
+
+Here, the use of shared memory is the best choice, as there is no overlap between neighboring windows and thus no other optimization to exploit.
+Moreover, to ensure efficiency, it is important to read the input image from texture memory, which implies an internal GPU data copy between the two 1D convolution stages.
+Even though this copy is faster than a CPU/GPU transfer, it makes separable convolutions slower than generic convolutions for small mask sizes: on the C2070, the lower limit is a $7\times 7$ mask ($9\times 9$ for $512\times 512$ images).
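+As an example, that internal copy can be a simple device-to-device transfer that refills the texture-bound CUDA array with the intermediate image left in a linear buffer by the vertical pass (the names below are placeholders, not those of the actual listings):
+\begin{lstlisting}
+#include <cuda_runtime.h>
+
+// Hypothetical helper: copy the intermediate image (H rows of L float pixels)
+// from the linear buffer d_tmp into the CUDA array bound to the input texture,
+// so that the horizontal pass can fetch it through the texture unit.
+void copy_intermediate(cudaArray_t array_img, const float *d_tmp, int L, int H)
+{
+  cudaMemcpyToArray(array_img, 0, 0, d_tmp,
+                    (size_t)H * L * sizeof(float),
+                    cudaMemcpyDeviceToDevice);
+}
+\end{lstlisting}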
+
+Both vertical and horizontal kernels feature similar runtimes: Table \ref{tab:convoSepSh1} contains only their average execution time, including the internal data copy stage, while Table \ref{tab:convoSepSh2} shows the achieved global throughput values. Timings of the data copy stage are given in Table \ref{tab:cpyToArray}.
+Listings \ref{lst:convoSepShV} and \ref{lst:convoSepShH} detail the implementation of both 1D kernels, while Listing \ref{lst:convoSepSh} shows how to chain them with the data copy function in order to perform a complete separable convolution. The shared memory size is passed dynamically as a parameter at kernel call time; its expression is given in both listings, in the comment lines preceding its declaration.
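+The mechanism behind this dynamic allocation is the standard CUDA one: the kernel declares an unsized \texttt{extern \_\_shared\_\_} buffer and the host supplies its byte size as the third execution-configuration argument. A minimal sketch, with hypothetical names and a purely illustrative size expression (the exact one is the expression quoted in the listings' comments):
+\begin{lstlisting}
+// Device side: the tile buffer is declared without a size; its byte size is
+// fixed only at launch time by the third execution-configuration argument.
+__global__ void kernel_sketch(float *out, int L, int H, int r)
+{
+  extern __shared__ float sh_data[];
+  // ... load a (blockDim.y + 2*r) x blockDim.x tile into sh_data, sync,
+  //     then convolve, as the actual kernels do ...
+}
+
+// Host side: compute the tile size in bytes and pass it at kernel call time.
+void launch_sketch(float *d_out, int L, int H, int r)
+{
+  dim3 blk(32, 8);
+  dim3 grd((L + blk.x - 1) / blk.x, (H + blk.y - 1) / blk.y);
+  // Illustrative expression: one float per pixel of a tile that is
+  // blk.y + 2*r rows high and blk.x columns wide.
+  size_t shmem = (size_t)blk.x * (blk.y + 2 * r) * sizeof(float);
+  kernel_sketch<<<grd, blk, shmem>>>(d_out, L, H, r);
+}
+\end{lstlisting}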
+