X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/f3c358c4bba6851fbf01121853d7e21866f01884..f8cabdbc3622e49f7f8e47e0ef8884770a84d07c:/BookGPU/Chapters/chapter2/ch2.tex

diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex
index 9c6d0de..68c309a 100755
--- a/BookGPU/Chapters/chapter2/ch2.tex
+++ b/BookGPU/Chapters/chapter2/ch2.tex
@@ -23,7 +23,7 @@ are executed on a GPU. This code is in Listing~\ref{ch2:lst:ex1}. As GPUs have their own memory, the first step consists of allocating memory on
-the GPU. A call to \texttt{cudaMalloc}\index{CUDA~functions!cudaMalloc}
+the GPU. A call to \texttt{cudaMalloc}\index{CUDA functions!cudaMalloc}
 allocates memory on the GPU. The second parameter represents the size of the allocated variables, this size is expressed in bits.
@@ -33,14 +33,14 @@ allocated variables, this size is expressed in bits. In this example, we want to compare the execution time of the additions of two arrays in CPU and GPU. So for both these operations, a timer is created to measure the time. CUDA proposes to manipulate timers quite easily. The first
-step is to create the timer\index{CUDA~functions!timer}, then to start it, and at
+step is to create the timer\index{CUDA functions!timer}, then to start it, and at
 the end to stop it. For each of these operations a dedicated function is used. In order to compute the same sum with a GPU, the first step consists of transferring the data from the CPU (considered as the host with CUDA) to the GPU (considered as the device with CUDA). A call to \texttt{cudaMemcpy} copies the content of an array allocated in the host to the device when the fourth parameter is set
-to \texttt{cudaMemcpyHostToDevice}\index{CUDA~functions!cudaMemcpy}. The first
+to \texttt{cudaMemcpyHostToDevice}\index{CUDA functions!cudaMemcpy}. The first
 parameter of the function is the destination array, the second is the source array, and the third is the number of elements to copy (expressed in bytes).
@@ -52,26 +52,26 @@ two arrays in parallel (if the number of blocks and threads per blocks is sufficient). In Listing~\ref{ch2:lst:ex1} at the beginning, a simple kernel, called \texttt{addition} is defined to compute in parallel the summation of the two arrays. With CUDA, a kernel starts with the
-keyword \texttt{\_\_global\_\_} \index{CUDA~keywords!\_\_shared\_\_} which
+keyword \texttt{\_\_global\_\_} \index{CUDA keywords!\_\_shared\_\_} which
 indicates that this kernel can be called from the C code. The first instruction in this kernel is used to compute the variable \texttt{tid} which represents the
-thread index. This thread index\index{thread index} is computed according to
+thread index. This thread index\index{CUDA keywords!thread index} is computed according to
 the values of the block index
-(called \texttt{blockIdx} \index{CUDA~keywords!blockIdx} in CUDA) and of the
-thread index (called \texttt{threadIdx}\index{CUDA~keywords!threadIdx} in
+(called \texttt{blockIdx} \index{CUDA keywords!blockIdx} in CUDA) and of the
+thread index (called \texttt{threadIdx}\index{CUDA keywords!threadIdx} in
 CUDA). Blocks of threads and thread indexes can be decomposed into 1 dimension,
-2 dimensions, or 3 dimensions. {\bf A REGARDER} According to the dimension of manipulated data,
-the appropriate dimension can be useful. In our example, only one dimension is
+2 dimensions, or 3 dimensions. According to the dimension of manipulated data,
+the dimension of blocks of threads must be chosen carefully. In our example, only one dimension is
 used. Then using the notation \texttt{.x}, we can access the first dimension (\texttt{.y} and \texttt{.z}, respectively allow access to the second and
-third dimension). The variable \texttt{blockDim}\index{CUDA~keywords!blockDim}
+third dimension). The variable \texttt{blockDim}\index{CUDA keywords!blockDim}
 gives the size of each block.
-\section{Second example: using CUBLAS}
+\section{Second example: using CUBLAS \index{CUBLAS}}
 \label{ch2:2ex} The Basic Linear Algebra Subprograms (BLAS) allows programmers to use efficient
@@ -81,7 +81,7 @@ operations, and matrix-matri operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem to be easy to implement with CUDA. Nevertheless, as soon as a reduction is needed, implementing an efficient reduction routine with CUDA is far from being
-simple. Roughly speaking, a reduction operation\index{reduction~operation} is an
+simple. Roughly speaking, a reduction operation\index{reduction operation} is an
 operation which combines all the elements of an array and extracts a number computed from all the elements. For example, a sum, a maximum, or a dot product are reduction operations.
@@ -189,10 +189,10 @@ considering the difficulty of parallelizing this code. \lstinputlisting[label=ch2:lst:ex3,caption=simple matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu} \section{Conclusion}
-In this chapter, three simple CUDA examples have been presented. They are
-quite simple. As we cannot present all the possibilities of the CUDA
-programming, interested readers are invited to consult CUDA programming
-introduction books if some issues regarding the CUDA programming are not clear.
+In this chapter, three simple CUDA examples have been presented. As we cannot
+present all the possibilities of the CUDA programming, interested readers are
+invited to consult CUDA programming introduction books if some issues regarding
+the CUDA programming are not clear.
 \putbib[Chapters/chapter2/biblio]