The Basic Linear Algebra Subprograms (BLAS) give programmers access to
high-performance implementations of commonly used routines. Those routines are
heavily used in many scientific
-applications and are very optimzed for vector operations, matrix-vector
+applications and are very optimized for vector operations, matrix-vector
operations and matrix-matrix
-operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seems
+operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem
to be easy to implement with CUDA. Nevertheless, as soon as a reduction is
needed, implementing an efficient reduction routine with CUDA is far from
simple.
In this second example, we consider that we have two vectors $A$ and $B$. First
-of all we want to compute the sum of both vectors in a vector $C$. Then we want
+of all, we want to compute the sum of both vectors in a vector $C$. Then we want
to compute the scalar product between $1/C$ and $1/A$. This is just an example
-which has not direct interest except to show how to program it with CUDA.
+which has no direct interest except to show how to program it with CUDA.
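Written out, and assuming that both vectors have $n$ elements indexed from $0$
(the symbols $n$, $i$ and $s$ are introduced here only for illustration), the
two steps are
\[
C_i = A_i + B_i \quad (0 \le i < n), \qquad
s = \sum_{i=0}^{n-1} \frac{1}{C_i} \cdot \frac{1}{A_i},
\]
where $1/C$ and $1/A$ denote the element-wise inverses, which requires every
$A_i$ and every $C_i$ to be nonzero.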
Listing~\ref{ch2:lst:ex2} shows this example with CUDA. The first kernel for the
-addition of two arrays is exactly the same that the one described in the
+addition of two arrays is exactly the same as the one described in the
previous example.
The kernel to compute the inverse of the elements of an array is very
array.
In the main function, the beginning is very similar to the one in the previous
-example. First the number of elements is asked to the user. Then a call
+example. First, the user is asked for the number of elements. Then a call
to \texttt{cublasCreate} initializes the cublas library and creates a
handle. Then all the arrays are allocated on the host and the device, as in the
previous example. Both arrays $A$ and $B$ are initialized. Then the CPU
computation is performed and the time for this CPU computation is measured. In
order to compute the same result on the GPU, first of all, data from the CPU
need to be copied into the memory of the GPU. For that, it is possible to use
-cublas function \texttt{cublasSetVector}.
+the cublas function \texttt{cublasSetVector}. This function takes six arguments.
+More precisely, the first argument is the number of elements to transfer, the
+second is the size of each element in bytes, the third is the source array to
+transfer (on the CPU), the fourth is the stride between consecutive elements of
+the source (usually this value is set to 1), the fifth is the destination (on
+the GPU) and the last is the stride between consecutive elements of the
+destination.
\lstinputlisting[label=ch2:lst:ex2,caption=A simple example]{Chapters/chapter2/ex2.cu}
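
The transfer step described above can also be looked at in isolation. The
following minimal sketch is not taken from the listing above: the helper
\texttt{roundtrip} and the names \texttt{h\_src}, \texttt{h\_dst} and
\texttt{d\_buf} are hypothetical and chosen only for illustration. It assumes
double-precision data and a stride of 1 on both sides. It copies a host array to
the GPU with \texttt{cublasSetVector} and back with the mirror function
\texttt{cublasGetVector}, which should leave the data unchanged.

\begin{lstlisting}
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Hypothetical helper: copies n doubles from h_src to the GPU and back
// into h_dst. Returns 0 on success and 1 on any CUDA or cublas error.
int roundtrip(const double *h_src, double *h_dst, int n) {
  cublasHandle_t handle;
  double *d_buf = NULL;
  int failed = 1;

  // Initialize the library and obtain a handle
  if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) return 1;

  if (cudaMalloc((void **)&d_buf, n * sizeof(double)) == cudaSuccess) {
    // n elements of sizeof(double) bytes each:
    // host source with stride 1, device destination with stride 1
    cublasStatus_t up =
        cublasSetVector(n, sizeof(double), h_src, 1, d_buf, 1);
    // Mirror call: device source with stride 1, host destination with stride 1
    cublasStatus_t down =
        cublasGetVector(n, sizeof(double), d_buf, 1, h_dst, 1);
    failed = (up != CUBLAS_STATUS_SUCCESS || down != CUBLAS_STATUS_SUCCESS);
    cudaFree(d_buf);
  }

  cublasDestroy(handle);
  return failed;
}

int main(void) {
  const int n = 1000;  // assumed size, for illustration only
  double *a = (double *)malloc(n * sizeof(double));
  double *b = (double *)malloc(n * sizeof(double));
  if (a == NULL || b == NULL) return 1;

  for (int i = 0; i < n; i++) {
    a[i] = i + 1.0;
    b[i] = 0.0;
  }

  int failed = roundtrip(a, b, n);

  // Count the elements that differ after the round trip
  int mismatches = 0;
  for (int i = 0; i < n; i++)
    if (a[i] != b[i]) mismatches++;

  printf("transfer %s, %d mismatching elements\n",
         failed ? "failed" : "succeeded", mismatches);

  free(a);
  free(b);
  return failed || mismatches != 0;
}
\end{lstlisting}

Such a file would typically be compiled with \texttt{nvcc} and linked against
the cublas library (\texttt{-lcublas}). A stride other than 1 can be useful, for
example, to transfer one row of a matrix stored in column-major order.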