X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/fc4670d0de6814f682df0ce247905cba40b9d547..2d004d3498accbbc57604339ae5815d96f2e3bf2:/BookGPU/Chapters/chapter2/ch2.tex

diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex
index a9e9a87..ae0704b 100755
--- a/BookGPU/Chapters/chapter2/ch2.tex
+++ b/BookGPU/Chapters/chapter2/ch2.tex
@@ -70,20 +70,20 @@ block.
 
 The Basic Linear Algebra Subprograms (BLAS) allows programmer to use performant
 routines that are often used. Those routines are heavily used in many scientific
-applications and are very optimzed for vector operations, matrix-vector
+applications and are very optimized for vector operations, matrix-vector
 operations and matrix-matrix
-operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seems
+operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem
 to be easy to implement with CUDA. Nevertheless, as soon as a reduction is
 needed, implementing an efficient reduction routines with CUDA is far from being
 simple.
 
 In this second example, we consider that we have two vectors $A$ and $B$. First
-of all we want to compute the sum of both vectors in a vector $C$. Then we want
+of all, we want to compute the sum of both vectors in a vector $C$. Then we want
 to compute the scalar product between $1/C$ and $1/A$. This is just an example
-which has not direct interest except to show how to program it with CUDA.
+which has no direct interest except to show how to program it with CUDA.
 
 Listing~\ref{ch2:lst:ex2} shows this example with CUDA. The first kernel for the
-addition of two arrays is exactly the same that the one described in the
+addition of two arrays is exactly the same as the one described in the
 previous example.
 
 The kernel to compute the inverse of the elements of an array is very
@@ -91,14 +91,20 @@ simple. For each thread index, the inverse of the array replaces the initial
 array.
 
 In the main function, the beginning is very similar to the one in the previous
-example. First the number of elements is asked to the user. Then a call
+example. First, the number of elements is asked to the user. Then a call
 to \texttt{cublasCreate} allows to initialize the cublas library. It creates an
 handle. Then all the arrays are allocated in the host and the device, as in the
 previous example. Both arrays $A$ and $B$ are initialized. Then the CPU
 computation is performed and the time for this CPU computation is measured. In
 order to compute the same result on the GPU, first of all, data from the CPU
 need to be copied into the memory of the GPU. For that, it is possible to use
-cublas function \texttt{cublasSetVector}.
+cublas function \texttt{cublasSetVector}. This function takes several arguments. More
+precisely, the first argument represents the number of elements to transfer, the
+second argument is the size of each element, the third argument represents the
+source of the array to transfer (in the CPU), the fourth is an offset between
+each element of the source (usually this value is set to 1), the fifth is the
+destination (in the GPU) and the last is an offset between each element of the
+destination.
 
 \lstinputlisting[label=ch2:lst:ex2,caption=A simple example]{Chapters/chapter2/ex2.cu}
 