diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex
index b330d6b..75be84b 100755
--- a/BookGPU/Chapters/chapter2/ch2.tex
+++ b/BookGPU/Chapters/chapter2/ch2.tex
@@ -23,24 +23,24 @@
 are executed on a GPU. This code is in Listing~\ref{ch2:lst:ex1}.
 
 As GPUs have their own memory, the first step consists of allocating memory on
-the GPU. A call to \texttt{cudaMalloc}\index{CUDA~functions!cudaMalloc}
+the GPU. A call to \texttt{cudaMalloc}\index{CUDA functions!cudaMalloc}
 allocates memory on the GPU. The second parameter represents the size of the
 allocated variables; this size is expressed in bytes.
 
-
+\pagebreak
 \lstinputlisting[label=ch2:lst:ex1,caption=simple example]{Chapters/chapter2/ex1.cu}
 
 In this example, we want to compare the execution time of the addition of two
 arrays on the CPU and on the GPU. So for both of these operations, a timer is
 created to measure the time. CUDA makes it quite easy to manipulate timers. The first
-step is to create the timer\index{CUDA~functions!timer}, then to start it, and at
+step is to create the timer\index{CUDA functions!timer}, then to start it, and at
 the end to stop it. For each of these operations a dedicated function is used.
 
 In order to compute the same sum with a GPU, the first step consists of
 transferring the data from the CPU (considered as the host with CUDA) to the GPU
 (considered as the device with CUDA). A call to \texttt{cudaMemcpy} copies the content
 of an array allocated on the host to the device when the fourth parameter is set
-to \texttt{cudaMemcpyHostToDevice}\index{CUDA~functions!cudaMemcpy}. The first
+to \texttt{cudaMemcpyHostToDevice}\index{CUDA functions!cudaMemcpy}. The first
 parameter of the function is the destination array, the second is the
 source array, and the third is the size of the data to copy (expressed in
 bytes).
@@ -52,26 +52,26 @@
 two arrays in parallel (if the number of blocks and threads per block is
 sufficient). At the beginning of Listing~\ref{ch2:lst:ex1}, a simple kernel,
 called \texttt{addition}, is defined to compute the summation of the
 two arrays in parallel. With CUDA, a kernel starts with the
-keyword \texttt{\_\_global\_\_} \index{CUDA~keywords!\_\_shared\_\_} which
+keyword \texttt{\_\_global\_\_} \index{CUDA keywords!\_\_global\_\_} which
 indicates that this kernel can be called from the C code. The first instruction
 in this kernel is used to compute the variable \texttt{tid} which represents the
-thread index. This thread index\index{thread index} is computed according to
+thread index. This thread index\index{CUDA keywords!thread index} is computed according to
 the values of the block index
-(called \texttt{blockIdx} \index{CUDA~keywords!blockIdx} in CUDA) and of the
-thread index (called \texttt{threadIdx}\index{CUDA~keywords!threadIdx} in
+(called \texttt{blockIdx} \index{CUDA keywords!blockIdx} in CUDA) and of the
+thread index (called \texttt{threadIdx}\index{CUDA keywords!threadIdx} in
 CUDA). Blocks of threads and thread indexes can be decomposed into 1 dimension,
-2 dimensions, or 3 dimensions. {\bf A REGARDER} According to the dimension of manipulated data,
-the appropriate dimension can be useful. In our example, only one dimension is
+2 dimensions, or 3 dimensions. According to the dimension of the manipulated data,
+the dimension of the blocks of threads must be chosen carefully. In our example, only one dimension is
 used. Then using the notation \texttt{.x}, we can access the first dimension
 (\texttt{.y} and \texttt{.z} allow access to the second and
-third dimension). The variable \texttt{blockDim}\index{CUDA~keywords!blockDim}
+third dimensions, respectively). The variable \texttt{blockDim}\index{CUDA keywords!blockDim}
 gives the size of each block.
 
-\section{Second example: using CUBLAS}
+\section{Second example: using CUBLAS \index{CUBLAS}}
 \label{ch2:2ex}
 
 The Basic Linear Algebra Subprograms (BLAS) allow programmers to use efficient
@@ -81,7 +81,7 @@ operations, and matrix-matrix
 operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem
 to be easy to implement with CUDA. Nevertheless, as soon as a reduction is
 needed, implementing an efficient reduction routine with CUDA is far from being
-simple. Roughly speaking, a reduction operation\index{reduction~operation} is an
+simple. Roughly speaking, a reduction operation\index{reduction operation} is an
 operation which combines all the elements of an array into a single number
 computed from all of them. For example, a sum, a maximum, and a dot
 product are all reduction operations.
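
The sketches below illustrate the CUDA calls discussed in the section above. Since ex1.cu itself is not reproduced in this diff, every variable name, array length, and launch parameter is an assumption made for illustration, not the listing's actual code. First, the allocation step: the second parameter of cudaMalloc is a size expressed in bytes.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
  const int N = 1000;   /* assumed array length (not from ex1.cu) */
  float *d_a = NULL;    /* illustrative name for a device array */

  /* cudaMalloc's second parameter is the allocation size in bytes */
  cudaError_t err = cudaMalloc((void **)&d_a, N * sizeof(float));
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  cudaFree(d_a);        /* release the device memory */
  return 0;
}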
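The create/start/stop timer pattern described above can be realized with CUDA events; whether the chapter's listing uses events or the timer helpers of the older CUDA SDK is not visible in this diff, so this is only one possible sketch.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
  cudaEvent_t start, stop;
  float elapsed_ms = 0.0f;

  cudaEventCreate(&start);      /* create the timer objects... */
  cudaEventCreate(&stop);

  cudaEventRecord(start, 0);    /* ...start the timer... */

  /* (the work to be timed, e.g. a kernel launch, would go here) */

  cudaEventRecord(stop, 0);     /* ...and stop it */
  cudaEventSynchronize(stop);   /* wait until the stop event is reached */
  cudaEventElapsedTime(&elapsed_ms, start, stop);
  printf("elapsed time: %f ms\n", elapsed_ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}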
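The transfer step, with the cudaMemcpy parameter order described above (destination, source, size in bytes, direction); h_a, d_a, and N are assumed names.

#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
  const int N = 1000;                               /* assumed length */
  float *h_a = (float *)malloc(N * sizeof(float));  /* host array */
  float *d_a = NULL;                                /* device array */
  for (int i = 0; i < N; i++) h_a[i] = (float)i;    /* fill the host array */

  cudaMalloc((void **)&d_a, N * sizeof(float));

  /* destination, source, size in bytes, direction */
  cudaMemcpy(d_a, h_a, N * sizeof(float), cudaMemcpyHostToDevice);

  cudaFree(d_a);
  free(h_a);
  return 0;
}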
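A minimal one-dimensional kernel built around the tid computation described above; the signature and launch configuration of the real addition kernel in ex1.cu may differ, so treat this as a sketch.

#include <cuda_runtime.h>

/* __global__ marks a kernel that can be called from the host C code */
__global__ void addition(int size, int *d_c, int *d_a, int *d_b) {
  /* one-dimensional thread index, combining the block index (blockIdx),
     the block size (blockDim), and the index within the block (threadIdx) */
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < size)                    /* guard the last, partially used block */
    d_c[tid] = d_a[tid] + d_b[tid];
}

int main(void) {
  const int size = 1000;             /* assumed array length */
  int *d_a, *d_b, *d_c;
  cudaMalloc((void **)&d_a, size * sizeof(int));
  cudaMalloc((void **)&d_b, size * sizeof(int));
  cudaMalloc((void **)&d_c, size * sizeof(int));
  cudaMemset(d_a, 0, size * sizeof(int));  /* give the inputs defined values */
  cudaMemset(d_b, 0, size * sizeof(int));

  int threads = 128;                           /* threads per block (assumed) */
  int blocks = (size + threads - 1) / threads; /* enough blocks to cover size */
  addition<<<blocks, threads>>>(size, d_c, d_a, d_b);
  cudaDeviceSynchronize();           /* wait for the kernel to finish */

  cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
  return 0;
}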
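Finally, for the CUBLAS section: a dot product is one of the reductions that the library implements efficiently. The sketch below uses cublasSdot from the CUBLAS v2 API (compile with -lcublas); the chapter's own CUBLAS example is not part of this diff, so this only illustrates the idea.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
  const int n = 1000;
  float *h_x = (float *)malloc(n * sizeof(float));
  float *h_y = (float *)malloc(n * sizeof(float));
  for (int i = 0; i < n; i++) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

  float *d_x, *d_y;
  cudaMalloc((void **)&d_x, n * sizeof(float));
  cudaMalloc((void **)&d_y, n * sizeof(float));
  cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, h_y, n * sizeof(float), cudaMemcpyHostToDevice);

  cublasHandle_t handle;
  cublasCreate(&handle);

  /* a dot product reduces the two arrays to a single number */
  float result = 0.0f;
  cublasSdot(handle, n, d_x, 1, d_y, 1, &result);
  printf("dot product = %f\n", result);  /* 1000 * (1.0 * 2.0) = 2000.0 */

  cublasDestroy(handle);
  cudaFree(d_x); cudaFree(d_y);
  free(h_x); free(h_y);
  return 0;
}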