From: couturie Date: Fri, 7 Dec 2012 19:15:43 +0000 (+0100) Subject: ajout chap6 X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/commitdiff_plain/44b8f845847505b81dc0f1199c49e67a495ed7a0?hp=e36f6851e20a7674991b7dbc9d984f59bd30d459 ajout chap6 --- diff --git a/BookGPU/BookGPU.tex b/BookGPU/BookGPU.tex index 9ec4258..293141d 100755 --- a/BookGPU/BookGPU.tex +++ b/BookGPU/BookGPU.tex @@ -21,6 +21,9 @@ \usepackage{multirow} \usepackage{layouts} \usepackage{comment} +\usepackage{xcolor} +\usepackage{upquote} +\usepackage{framed} \frenchspacing \tolerance=5000 @@ -114,6 +117,7 @@ \part{This is a Part} \include{Chapters/chapter1/ch1} \include{Chapters/chapter2/ch2} +\include{Chapters/chapter6/ch6} \include{Chapters/chapter8/ch8} \include{Chapters/chapter11/ch11} \include{Chapters/chapter14/ch14} diff --git a/BookGPU/Chapters/chapter6/Conclu.tex b/BookGPU/Chapters/chapter6/Conclu.tex new file mode 100644 index 0000000..8571a7f --- /dev/null +++ b/BookGPU/Chapters/chapter6/Conclu.tex @@ -0,0 +1,25 @@ +\section{Conclusion}\label{ch6:conclu} + +In this chapter, different methodologies that effectively take advantage of +current cluster technologies with GPUs have been presented. Beyond the simple +collaboration of several nodes that include GPUs, we have addressed parallel +schemes to efficiently overlap communications with computations (including GPU +transfers), and also computations on the CPU with computations on the +GPU. Moreover, parallel schemes for synchronous as well as asynchronous +iterative processes have been proposed. The proposed schemes have been validated +experimentally and provide the expected behavior. Finally, as a prospect we have +developed a programming tool that will, in middle or long term, provide all the +required technical elements to implement the proposed schemes in a single tool, +without requiring several external libraries. + +We conclude that GPU computations are very well suited to achieve overlap with +CPU computations and communications and they can be fully integrated in +algorithmic schemes combining several levels of parallelism. + +%%% Local Variables: +%%% mode: latex +%%% fill-column: 80 +%%% ispell-dictionary: "american" +%%% mode: flyspell +%%% TeX-master: "../../BookGPU.tex" +%%% End: diff --git a/BookGPU/Chapters/chapter6/Intro.tex b/BookGPU/Chapters/chapter6/Intro.tex new file mode 100644 index 0000000..5a8d0e7 --- /dev/null +++ b/BookGPU/Chapters/chapter6/Intro.tex @@ -0,0 +1,34 @@ +\section{Introduction}\label{ch6:intro} + +This chapter proposes to draw several development methodologies to obtain +efficient codes in classical scientific applications. Those methodologies are +based on the feedback from several research works involving GPUs, either alone +in a single machine or in a cluster of machines. Indeed, our past +collaborations with industries have allowed us to point out that in their +economical context, they can adopt a parallel technology only if its +implementation and maintenance costs are small according to the potential +benefits (performance, accuracy,...). So, in such contexts, GPU programming is +still regarded with some distance according to its specific field of +applicability (SIMD/SIMT model) and its still higher programming complexity and +maintenance. In the academic domain, things are a bit different but studies for +efficiently integrating GPU computations in multi-core clusters with maximal +overlapping of computations with communications and/or other computations, are +still rare. + +For these reasons, the major aim of that chapter is to propose as simple as +possible general programming patterns that can be followed or adapted in +practical implementations of parallel scientific applications. +% Also, according to our experience in industrial collaborations, we propose a +% small prospect analysis about the perenity of such accelerators in the +% middle/long term. +Also, we propose in a third part, a prospect analysis together with a particular +programming tool that is intended to ease multi-core GPU cluster programming. + + +%%% Local Variables: +%%% mode: latex +%%% fill-column: 80 +%%% ispell-dictionary: "american" +%%% mode: flyspell +%%% TeX-master: "../../BookGPU.tex" +%%% End: diff --git a/BookGPU/Chapters/chapter6/Makefile b/BookGPU/Chapters/chapter6/Makefile new file mode 100644 index 0000000..9c18dac --- /dev/null +++ b/BookGPU/Chapters/chapter6/Makefile @@ -0,0 +1,21 @@ +NOM=ch6 +CIBLE=$(NOM).pdf +SOURCES=$(wildcard *.tex) +DEPOT ?= . +WorkDir = `basename $(PWD)` +ARCHIVE=$(WorkDir)_BookGPU_`date +"%Y-%m-%d"`.tar.gz + +$(CIBLE): $(SOURCES) + @printf "Compilation du chapitre..." + @make -C ../.. + @printf "terminée.\n" + +archive: + @printf "Archivage local vers %s ... " "$(ARCHIVE)" + @cd ..; tar zcf $(DEPOT)/$(ARCHIVE) $(WorkDir)/Makefile $(WorkDir)/*.tex $(WorkDir)/*.bib $(WorkDir)/figures; cd - > /dev/null + @printf "terminé.\n" + +clean: + @printf "Nettoyage du répertoire..." + @rm -f *~ *.aux *.log *.toc *.bbl ../../*~ ../../*.aux ../../*.log ../../*.toc ../../*.ind ../../*.bbl + @printf "terminé.\n" \ No newline at end of file diff --git a/BookGPU/Chapters/chapter6/PartieAsync.tex b/BookGPU/Chapters/chapter6/PartieAsync.tex new file mode 100644 index 0000000..44eadd5 --- /dev/null +++ b/BookGPU/Chapters/chapter6/PartieAsync.tex @@ -0,0 +1,1168 @@ +\section{General scheme of asynchronous parallel code with +computation/communication overlapping} +\label{ch6:part2} + +In the previous part, we have seen how to efficiently implement overlap of +computations (CPU and GPU) with communications (GPU transfers and inter-node +communications). However, we have previously shown that for some parallel +iterative algorithms, it is sometimes even more efficient to use an asynchronous +scheme of iterations\index{iterations!asynchronous} \cite{HPCS2002,ParCo05,Para10}. In that case, the nodes do +not wait for each others but they perform their iterations using the last +external data they have received from the other nodes, even if this +data was produced \emph{before} the previous iteration on the other nodes. + +Formally, if we denote by $f=(f_1,...,f_n)$ the function representing the +iterative process and by $x^t=(x_1^t,...,x_n^t)$ the values of the $n$ elements of +the system at iteration $t$, we pass from a synchronous iterative scheme of the +form: +\begin{algorithm}[H] + \caption{Synchronous iterative scheme}\label{algo:ch6p2sync} + \begin{Algo} + $x^{0}=(x_{1}^{0},...,x_{n}^{0})$\\ + \textbf{for} $t=0,1,...$\\ + \>\textbf{for} $i=1,...,n$\\ + \>\>$x_{i}^{t+1}=f_{i}(x_{1}^t,...,x_i^t,...,x_{n}^t)$\\ + \>\textbf{endfor}\\ + \textbf{endfor} + \end{Algo} +\end{algorithm} +\noindent +to an asynchronous iterative scheme of the form: +\begin{algorithm}[H] + \caption{Asynchronous iterative scheme}\label{algo:ch6p2async} + \begin{Algo} + $x^{0}=(x_{1}^{0},...,x_{n}^{0})$\\ + \textbf{for} $t=0,1,...$\\ + \>\textbf{for} $i=1,...,n$\\ + \>\>$x_{i}^{t+1}=\left\{ + \begin{array}[h]{ll} + x_i^t & \text{if } i \text{ is \emph{not} updated at iteration } i\\ + f_i(x_1^{s_1^i(t)},...,x_n^{s_n^i(t)}) & \text{if } i \text{ is updated at iteration } i + \end{array} + \right.$\\ + \>\textbf{endfor}\\ + \textbf{endfor} + \end{Algo} +\end{algorithm} +where $s_j^i(t)$ is the iteration number of the production of the value $x_j$ of +element $j$ that is used on element $i$ at iteration $t$ (see for example~\cite{BT89, + FS2000} for further details). +Such schemes are called AIAC\index{AIAC} for \emph{Asynchronous Iterations and + Asynchronous Communications}. They combine two aspects that are respectively +different computation speeds of the computing elements and communication delays +between them. + +The key feature of such algorithmic schemes +is that they may be faster than their synchronous counterparts due to the +implied total overlap of computations with communications: in fact, this scheme +suppresses all the idle times induced by nodes synchronizations between each iteration. + +However, the efficiency of such a scheme is directly linked to the frequency at +which new data arrives on each node. Typically, +if a node receives newer data only every four or five local iterations, it is +strongly probable that the evolution of its local iterative process will be +slower than if it receives data at every iteration. The key point here is that +this frequency does not only depend on the hardware configuration of the +parallel system but it also depends on the software that is used to +implement the algorithmic scheme. + +The impact of the programming environments used to implement asynchronous +algorithms has already been investigated in~\cite{SuperCo05}. Although the +features required to efficiently implement asynchronous schemes have not +changed, the available programming environments and computing hardware have +evolved, in particular now GPUs are available. So, there is a need to reconsider the +implementation schemes of AIAC according to the new de facto standards for parallel +programming (communications and threads) as well as the integration of the GPUs. +One of the main objective here is to obtain a maximal overlap between the +activities of the three types of devices that are the CPU, the GPU and the +network. Moreover, another objective is to present what we think +is the best compromise between the simplicity of the implementation and its +maintainability on one side and its +performance on the other side. This is especially important for industries where +implementation and maintenance costs are strong constraints. + +For the sake of clarity, we present the different algorithmic schemes in a progressive +order of complexity, from the basic asynchronous scheme to the complete scheme +with full overlap. Between these two extremes, we propose a synchronization +mechanism on top of our asynchronous scheme that can be used either statically or +dynamically during the application execution. + +Although there exist several programming environments for inter-node +communications, multi-threading and GPU programming, a few of them have +become \emph{de facto standards}, either due to their good stability, their ease +of use and/or their wide adoption by the scientific community. +Therefore, as in the previous section all the schemes presented in the following use MPI~\cite{MPI}, +OpenMP~\cite{openMP} and CUDA~\cite{CUDA}. However, there is no loss of +generality as those schemes may easily be implemented with other libraries. + +Finally, in order to stay as clear as possible, only the parts of code and +variables related to the control of parallelism (communications, threads,...) +are presented in our schemes. The inner organization of data is not detailed as +it depends on the application. We only consider that we have two data arrays +(previous version and current version) and communication buffers. However, in +most of the cases, those buffers can correspond to the data arrays themselves to +avoid data copies. + +\subsection{A basic asynchronous scheme} +\label{ch6:p2BasicAsync} + +The first step toward our complete scheme is to implement a basic asynchronous +scheme that includes an actual overlap of the communications with the +computations\index{overlap!computation and communication}. In order to ensure that the communications are actually performed +in parallel of the computations, it is necessary to use different threads. It +is important to remember that asynchronous communications provided in +communication libraries like MPI are not systematically performed in parallel of +the computations~\cite{ChVCV13,Hoefler08a}. So, the logical and classical way +to implement such an overlap is to use three threads: one for +computing, one for sending and one for receiving. Moreover, since +the communication is performed by threads, blocking synchronous communications\index{MPI!communication!blocking}\index{MPI!communication!synchronous} +can be used without deteriorating the overall performance. + +In this basic version, the termination\index{termination} of the global process is performed +individually on each node according to their own termination. This can be guided by either a +number of iterations or a local convergence detection\index{convergence detection}. The important step at +the end of the process is to perform the receptions of all pending +communications in order to ensure the termination of the two communication +threads. + +So, the global organization of this scheme is set up in \Lst{algo:ch6p2BasicAsync}. + +% \begin{algorithm}[H] +% \caption{Initialization of the basic asynchronous scheme.} +% \label{algo:ch6p2BasicAsync} +\begin{Listing}{algo:ch6p2BasicAsync}{Initialization of the basic asynchronous scheme} +// Variables declaration and initialization +omp_lock_t lockSend; // Controls the sendings from the computing thread +omp_lock_t lockRec; // Ensures the initial reception of external data +char Finished = 0; // Boolean indicating the end of the process +char SendsInProgress = 0; // Boolean indicating if previous data sendings are still in progress +double Threshold; // Threshold of the residual for convergence detection + +// Parameters reading +... + +// MPI initialization +MPI_Init_thread(argc, argv, MPI_THREAD_MULTIPLE, &provided); +MPI_Comm_size(MPI_COMM_WORLD, &nbP); +MPI_Comm_rank(MPI_COMM_WORLD, &numP); + +// Data initialization and distribution among the nodes +... + +// OpenMP initialization (mainly declarations and setting up of locks) +omp_set_num_threads(3); +omp_init_lock(&lockSend); +omp_set_lock(&lockSend); // Initially locked, unlocked to start sendings +omp_init_lock(&lockRec); +omp_set_lock(&lockRec); // Initially locked, unlocked when initial data are received + +#pragma omp parallel +{ + switch(omp_get_thread_num()){ + case COMPUTATION : + computations(... @\emph{relevant parameters}@ ...); + break; + + case SENDINGS : + sendings(); + break; + + case RECEPTIONS : + receptions(); + break; + } +} + +// Cleaning of OpenMP locks +omp_test_lock(&lockSend); +omp_unset_lock(&lockSend); +omp_destroy_lock(&lockSend); +omp_test_lock(&lockRec); +omp_unset_lock(&lockRec); +omp_destroy_lock(&lockRec); + +// MPI termination +MPI_Finalize(); +\end{Listing} +%\end{algorithm} + +% \ANNOT{Est-ce qu'on laisse le 1er échange de données ou pas dans +% l'algo~\ref{algo:ch6p2BasicAsyncComp} ? (lignes 16-19)\\ +% (ça n'est pas nécessaire si les données de +% départ sont distribuées avec les dépendances qui correspondent, sinon il faut +% le faire)} + +In this scheme, the \texttt{lockRec} mutex\index{OpenMP!mutex} is not mandatory. +It is only used to ensure that data dependencies are actually exchanged at the +first iteration of the process. Data initialization and distribution +(lines~16-17) are not detailed here because they are directly related to the +application. The important point is that, in most cases, they should be done +before the iterative process. The computing function is given in +\Lst{algo:ch6p2BasicAsyncComp}. + +%\begin{algorithm}[H] +% \caption{Computing function in the basic asynchronous scheme.} +% \label{algo:ch6p2BasicAsyncComp} +\begin{Listing}{algo:ch6p2BasicAsyncComp}{Computing function in the basic asynchronous scheme} +// Variables declaration and initialization +int iter = 1; // Number of the current iteration +double difference; // Variation of one element between two iterations +double residual; // Residual of the current iteration + +// Computation loop +while(!Finished){ + // Sendings of data dependencies if there is no previous sending in progress + if(!SendsInProgress){ + // Potential copy of data to be sent in additional buffers + ... + // Change of sending state + SendsInProgress = 1; + omp_unset_lock(&lockSend); + } + + // Blocking receptions at the first iteration + if(iter == 1){ + omp_set_lock(&lockRec); + } + + // Initialization of the residual + residual = 0.0; + // Swapping of data arrays (current and previous) + tmp = current; // Pointers swapping to avoid + current = previous; // actual data copies between + previous = tmp; // the two data versions + // Computation of current iteration over local data + for(ind=0; ind residual){ + residual = difference; + } + } + + // Checking of the end of the process (residual under threshold) + // Other conditions can be added to the termination detection + if(residual <= Threshold){ + Finished = 1; + omp_unset_lock(&lockSend); // Activation of end messages sendings + MPI_Ssend(&Finished, 1, MPI_CHAR, numP, tagEnd, MPI_COMM_WORLD); + } + + // Updating of the iteration number + iter++; +} +\end{Listing} +%\end{algorithm} + +As mentioned above, it can be seen in line~18 of \Lst{algo:ch6p2BasicAsyncComp} +that the \texttt{lockRec} mutex is used only at the first iteration to wait for +the initial data dependencies before the computations. The residual\index{residual}, initialized +in line~23 and computed in lines~34-37, is defined by the maximal difference +between the elements from two consecutive iterations. It is classically used to +detect the local convergence of the process on each node. In the more +complete schemes presented in the sequel, a global termination detection that +takes the states of all the nodes into account will be exhibited. + +Finally, the local convergence is tested and updated when necessary. In line~44, +the \texttt{lockSend} mutex is unlocked to allow the sending function to send +final messages to the dependency nodes. Those messages are required to keep the +reception function alive until all the final messages have been received. +Otherwise, a node could stop its reception function whereas other nodes are +still trying to communicate with it. Moreover, a local sending of a final +message to the node itself is required (line~45) to ensure that the reception +function will not stay blocked in a message probing +(see~\Lst{algo:ch6p2BasicAsyncReceptions}, line~11). This may happen if the node +receives the final messages from its dependencies \emph{before} being itself in +local convergence. + +All the messages but this final local one are performed in the sending function +described in \Lst{algo:ch6p2BasicAsyncSendings}. + +The main loop is only conditioned by the end of the computing process (line~4). +At each iteration, the thread waits for the permission from the computing thread +(according to the \texttt{lockSend} mutex). Then, data are sent with +blocking synchronous communications. The \texttt{SendsInProgress} boolean +allows the computing thread to skip data sendings as long as a previous sending +is in progress. This skip is possible due to the nature of asynchronous +algorithms that allows such \emph{message loss}\index{message!loss/miss} or \emph{message miss}. After +the main loop, the final messages are sent to the dependencies of the node. + +%\begin{algorithm}[H] +% \caption{Sending function in the basic asynchronous scheme.} +% \label{algo:ch6p2BasicAsyncSendings} +\begin{Listing}{algo:ch6p2BasicAsyncSendings}{Sending function in the basic asynchronous scheme} +// Variables declaration and initialization +... + +while(!Finished){ + omp_set_lock(&lockSend); // Waiting for signal from the comp. thread + if(!Finished){ + // Blocking synchronous sends to all dependencies + for(i=0; i= asyncDuration){ + // Waiting for the end of previous sends before starting sync mode + omp_set_lock(&lockSendsDone); + curMode = SYNC; // Entering synchronous mode + stampData(dataToSend, SYNC); // Mark data to send with sync flag + nbSyncIter = 0; + } + }else{ + // In main async mode, going back to async mode when the max number of sync iterations are done + if(mainMode == ASYNC){ + nbSyncIter++; // Update of the number of sync iterations done + if(nbSyncIter == 2){ + curMode = ASYNC; // Going back to async mode + stampData(dataToSend, ASYNC); // Mark data to send + asyncStart = MPI_Wtime(); // Get the async starting time + } + } + } + + // Sendings of data dependencies + if(curMode == SYNC || !SendsInProgress){ + ... + } + + // Blocking data receptions in sync mode + if(curMode == SYNC){ + omp_set_lock(&lockRec); + } + + // Local computation + // (init of residual, arrays swapping and iteration computation) + ... + + // Checking of convergences (local & global) only in sync mode + if(curMode == SYNC){ + // Local convergence checking (residual under threshold) + ... + // Blocking global exchange of local states of the nodes + ... + // Determination of global convergence (all nodes in local CV) + // Stop of the iterative process and sending of end messages + // or Re-initialization of state information and iteration barrier + ... + } + } + + // Updating of the iteration number + iter++; +} +\end{Listing} +%\end{algorithm} + +In the sending function, the only modification is the replacement in line~11 of +the assignment of variable \texttt{SendsInProgress} by the unlocking of +\texttt{lockSendsDone}. Finally, in the reception function, the only +modification is the insertion before line~19 +of \Lst{algo:ch6p2BasicAsyncReceptions} of the extraction of the stamp from the +message and its counting among the receipts only if the stamp is \texttt{SYNC}. + +The final step to get our complete scheme using GPU is to insert the GPU +management in the computing thread. The first possibility, detailed +in \Lst{algo:ch6p2syncGPU}, is to simply replace the +CPU kernel (lines~41-43 in \Lst{algo:ch6p2AsyncSyncComp}) by a blocking GPU kernel call. This includes data +transfers from the node RAM to the GPU RAM, the launching of the GPU kernel, the +waiting for kernel completion and the results transfers from GPU RAM to +node RAM. + +%\begin{algorithm}[H] +% \caption{Computing function in the final asynchronous scheme.} +% \label{algo:ch6p2syncGPU} +\begin{Listing}{algo:ch6p2syncGPU}{Computing function in the final asynchronous scheme} +// Variables declarations and initialization +... +dim3 Dg, Db; // CUDA kernel grids + +// Computation loop +while(!Finished){ + // Determination of the dynamic operating mode, sendings of data dependencies and blocking data receptions in sync mode + ... + // Local GPU computation + // Data transfers from node RAM to GPU + CHECK_CUDA_SUCCESS(cudaMemcpyToSymbol(dataOnGPU, dataInRAM, inputsSize, 0, cudaMemcpyHostToDevice), "Data transfer"); + ... // There may be several data transfers: typically A and b in linear problems + // GPU grid definition + Db.x = BLOCK_SIZE_X; // BLOCK_SIZE_# are kernel design dependent + Db.y = BLOCK_SIZE_Y; + Db.z = BLOCK_SIZE_Z; + Dg.x = localSize/BLOCK_SIZE_X + (localSize%BLOCK_SIZE_X ? 1 : 0); + Dg.y = localSize/BLOCK_SIZE_Y + (localSize%BLOCK_SIZE_Y ? 1 : 0); + Dg.z = localSize/BLOCK_SIZE_Z + (localSize%BLOCK_SIZE_Z ? 1 : 0); + // Use of shared memory (when possible) + cudaFuncSetCacheConfig(gpuKernelName, cudaFuncCachePreferShared); + // Kernel call + gpuKernelName<<>>(... @\emph{kernel parameters}@ ...); + // Waiting for kernel completion + cudaDeviceSynchronize(); + // Results transfer from GPU to node RAM + CHECK_CUDA_SUCCESS(cudaMemcpyFromSymbol(resultsInRam, resultsOnGPU, resultsSize, 0, cudaMemcpyDeviceToHost), "Results transfer"); + // Potential post-treatment of results on the CPU + ... + + // Convergences checking + ... +} +\end{Listing} +%\end{algorithm} + +This scheme provides asynchronism through a cluster of GPUs as well as a +complete overlap of communications with GPU computations (similarly +to~\Sec{ch6:part1}). However, the autonomy of GPU devices according to their +host can be further exploited in order to perform some computations on the CPU +while the GPU kernel is running. The nature of computations that can be done by +the CPU may vary depending on the application. For example, when processing data +streams (pipelines), pre-processing of next data item and/or post-processing of +previous result can be done on the CPU while the GPU is processing the current +data item. In other cases, the CPU can perform \emph{auxiliary} +computations\index{computation!auxiliary} +that are not absolutely required to obtain the result but that may accelerate +the entire iterative process. Another possibility would be to distribute the +main computations between the GPU and CPU. However, this +usually leads to poor performance increases. This is mainly due to data +dependencies that often require additional transfers between CPU and GPU. + +So, if we consider that the application enables such overlap of +computations, its implementation is straightforward as it consists in inserting +the additional CPU computations between lines~23 and~24 +in \Lst{algo:ch6p2syncGPU}. Nevertheless, such scheme is fully efficient only +if the computation times on both sides are similar. + +In some cases, especially with auxiliary computations, another interesting +solution is to add a fourth CPU thread to perform them. This suppresses the +duration constraint over those optional computations as they are performed in +parallel of the main iterative process, without blocking it. Moreover, this +scheme stays coherent with current architectures as most nodes include four CPU +cores. The algorithmic scheme of such context of complete overlap of +CPU/GPU computations and communications is described in +Listings~\ref{algo:ch6p2FullOverAsyncMain},~\ref{algo:ch6p2FullOverAsyncComp1} +and~\ref{algo:ch6p2FullOverAsyncComp2}, where we suppose that auxiliary +computations use intermediate results of the main computation process from any previous iteration. This may be +different according to the application. + +%\begin{algorithm}[H] +% \caption{Initialization of the main process of complete overlap with asynchronism.} +% \label{algo:ch6p2FullOverAsyncMain} +\pagebreak +\begin{Listing}{algo:ch6p2FullOverAsyncMain}{Initialization of the main process of complete overlap with asynchronism} +// Variables declarations and initialization +... +omp_lock_t lockAux; // Informs main thread about new aux results +omp_lock_t lockRes; // Informs aux thread about new results +omp_lock_t lockWrite; // Controls exclusion of results access +... auxRes ... ; // Results of auxiliary computations + +// Parameters reading, MPI initialization, data initialization and distribution +... +// OpenMP initialization +... +omp_init_lock(&lockAux); +omp_set_lock(&lockAux); // Unlocked when new aux results are available +omp_init_lock(&lockRes); +omp_set_lock(&lockRes); // Unlocked when new results are available +omp_init_lock(&lockWrite); +omp_unset_lock(&lockWrite); // Controls access to results from threads + +#pragma omp parallel +{ + switch(omp_get_thread_num()){ + case COMPUTATION : + computations(... @\emph{relevant parameters}@ ...); + break; + + case AUX_COMPS : + auxComps(... @\emph{relevant parameters}@ ...); + break; + + case SENDINGS : + sendings(); + break; + + case RECEPTIONS : + receptions(); + break; + } +} + +// Cleaning of OpenMP locks +... +omp_test_lock(&lockAux); +omp_unset_lock(&lockAux); +omp_destroy_lock(&lockAux); +omp_test_lock(&lockRes); +omp_unset_lock(&lockRes); +omp_destroy_lock(&lockRes); +omp_test_lock(&lockWrite); +omp_unset_lock(&lockWrite); +omp_destroy_lock(&lockWrite); + +// MPI termination +MPI_Finalize(); +\end{Listing} +%\end{algorithm} + +%\begin{algorithm}[H] +% \caption{Computing function in the final asynchronous scheme with CPU/GPU overlap.} +% \label{algo:ch6p2FullOverAsyncComp1} +\pagebreak +\begin{Listing}{algo:ch6p2FullOverAsyncComp1}{Computing function in the final asynchronous scheme with CPU/GPU overlap} +// Variables declarations and initialization +... +dim3 Dg, Db; // CUDA kernel grids + +// Computation loop +while(!Finished){ + // Determination of the dynamic operating mode, sendings of data dependencies and blocking data receptions in sync mode + ... + // Local GPU computation + // Data transfers from node RAM to GPU, GPU grid definition and init of shared mem + CHECK_CUDA_SUCCESS(cudaMemcpyToSymbol(dataOnGPU, dataInRAM, inputsSize, 0, cudaMemcpyHostToDevice), "Data transfer"); + ... + // Kernel call + gpuKernelName<<>>(... @\emph{kernel parameters}@ ...); + // Potential pre/post-treatments in pipeline like computations + ... + // Waiting for kernel completion + cudaDeviceSynchronize(); + // Results transfer from GPU to node RAM + omp_set_lock(&lockWrite); // Wait for write access to resultsInRam + CHECK_CUDA_SUCCESS(cudaMemcpyFromSymbol(resultsInRam, resultsOnGPU, resultsSize, 0, cudaMemcpyDeviceToHost), "Results transfer"); + // Potential post-treatments in non-pipeline computations + ... + omp_unset_lock(&lockWrite); // Give back read access to aux thread + omp_test_lock(&lockRes); + omp_unset_lock(&lockRes); // Informs aux thread of new results + + // Auxiliary computations availability checking + if(omp_test_lock(&lockAux)){ + // Use auxRes to update the iterative process + ... // May induce additional GPU transfers + } + + // Convergences checking + if(curMode == SYNC){ + // Local convergence checking and global exchange of local states + ... + // Determination of global convergence (all nodes in local CV) + if(cvLocale == 1 && nbCVLocales == nbP-1){ + // Stop of the iterative process and sending of end messages + ... + // Unlocking of aux thread for termination + omp_test_lock(&lockRes); + omp_unset_lock(&lockRes); + }else{ + // Re-initialization of state information and iteration barrier + ... + } + } +} +\end{Listing} +%\end{algorithm} + +%\begin{algorithm}[H] +% \caption{Auxiliary computing function in the final asynchronous scheme with CPU/GPU overlap.} +% \label{algo:ch6p2FullOverAsyncComp2} +\pagebreak +\begin{Listing}{algo:ch6p2FullOverAsyncComp2}{Auxiliary computing function in the final asynchronous scheme with CPU/GPU overlap} +// Variables declarations and initialization +... auxInput ... // Local array for input data + +// Computation loop +while(!Finished){ + // Data copy from resultsInRam into auxInput + omp_set_lock(&lockRes); // Waiting for new results from main comps + if(!Finished){ + omp_set_lock(&lockWrite); // Waiting for access to results + for(ind=0; ind + +for (size_t i = 0; i < k; ++i) { + MBlock next; + parallel-do { + operation 1: ; + operation 2: { + ; + ; + } + } + operation 3: { + ; + A = next; + } +} + + +\end{Listing} +%\end{algorithm} + + +\Lst{algo:ch6p3ORWLlcopy} shows the local copy operation~3 as it could +be realized with ORWL. +It uses two resource handles +\texttt{nextRead} and \texttt{curWrite} and marks nested \emph{critical + sections} for these handles. Inside the nested sections it obtains pointers to +the resource data; the resource is \emph{mapped} into the address space of the +program, and then a standard call to \texttt{memcpy} achieves the operation +itself. The operation is integrated in its own \textbf{for}-loop, such that it +could run independently in an OS thread by its own. +%\begin{algorithm}[H] +% \caption{An iterative local copy operation.} +% \label{algo:ch6p3ORWLlcopy} +\begin{Listing}{algo:ch6p3ORWLlcopy}{An iterative local copy operation} +for (size_t i = 0; i < k; ++i) { + ORWL_SECTION(nextRead) { + MBlock const* sBlock = orwl_read_map(nextRead); + ORWL_SECTION(curWrite) { + MBlock * tBlock = orwl_write_map(curWrite); + memcpy(tBlock, sBlock, sizeof *tBlock); + } + } +} +\end{Listing} +%\end{algorithm} + +Next, in \Lst{algo:ch6p3ORWLrcopy} we copy data from a remote task to +a local task. Substantially the operation is the same, only that in the example +different handles (\texttt{remRead} and \texttt{nextWrite}) are used to +represent the respective resources. +%\begin{algorithm}[H] +% \caption{An iterative remote copy operation as part of a block cyclic matrix +% multiplication task.} +%\label{algo:ch6p3ORWLrcopy} +\begin{Listing}{algo:ch6p3ORWLrcopy}{An iterative remote copy operation as part of a block cyclic matrix multiplication task} +for (size_t i = 0; i < k; ++i) { + ORWL_SECTION(remRead) { + MBlock const* sBlock = orwl_read_map(remRead); + ORWL_SECTION(nextWrite) { + MBlock * tBlock = orwl_write_map(nextWrite); + memcpy(tBlock, sBlock, sizeof *tBlock); + } + } +} +\end{Listing} +%\end{algorithm} + +Now let us have a look into the operation that probably interests us the most, +the interaction with the GPU in \Lst{algo:ch6p3ORWLtrans}. Again there +is much structural resemblance to the copy operations from above, only that we +transfer the data to the GPU in the innermost block and then run the GPU MM +kernel while we still are inside the critical section for the GPU. +%\begin{algorithm}[H] +% \caption{An iterative GPU transfer and compute operation as part of a block cyclic matrix +% multiplication task.} +% \label{algo:ch6p3ORWLtrans} +\begin{Listing}{algo:ch6p3ORWLtrans}{An iterative GPU transfer and compute operation as part of a block cyclic matrix multiplication task} +for (size_t i = 0; i < k; ++i) { + ORWL_SECTION(GPUhandle) { + ORWL_SECTION(curRead) { + MBlock const* sBlock = orwl_read_map(curRead); + transferToGPU(sBlock, i); + } + runMMonGPU(i); + } +} +\end{Listing} +%\end{algorithm} + +Now that we have seen how the actual procedural access to the resources is +regulated we will show how the association between handles and resources is +specified. E.g in our application of block-cyclic MM the \texttt{curRead} handle +should correspond to current matrix block of the corresponding task, whereas +\texttt{remRead} should point to the current block of the neighboring task. +Both read operations on these matrix blocks can be effected without creating +conflicts, so we would like to express that fact in our resource specification. +From a point of view of the resource ``current block'' of a particular task, +this means that it can have two simultaneous readers, the task itself performing +the transfer to the GPU, and the neighboring task transferring the data to its +``next block''. + +\Lst{algo:ch6p3ORWLdecl} first shows the local dynamic declarations of +our application; it declares a \texttt{block} type for the matrix blocks, a +\texttt{result} data for the collective resource, and the six handles that we +have seen so far. +%\begin{algorithm}[H] +% \caption{Dynamic declaration of handles to represent the resources.} +% \label{algo:ch6p3ORWLdecl} +\begin{Listing}{algo:ch6p3ORWLdecl}{Dynamic declaration of handles to represent the resources} +/* A type for the matrix blocks */ +typedef double MBlock[N][N]; +/* Declaration to handle the collective resource */ +ORWL_GATHER_DECLARE(MBlock, result); + +/* Variables to handle data resources */ +orwl_handle2 remRead = ORWL_HANDLE2_INITIALIZER; +orwl_handle2 nextWrite = ORWL_HANDLE2_INITIALIZER; +orwl_handle2 nextRead = ORWL_HANDLE2_INITIALIZER; +orwl_handle2 curWrite = ORWL_HANDLE2_INITIALIZER; +orwl_handle2 curRead = ORWL_HANDLE2_INITIALIZER; + +/* Variable to handle the device resources */ +orwl_handle2 GPUhandle = ORWL_HANDLE2_INITIALIZER; +\end{Listing} +%\end{algorithm} + +With these declarations, we didn't yet tell ORWL much about the resources to +which these handles refer, nor the type (read or write) or the priority (FIFO +position) of the access. This is done in code +\Lst{algo:ch6p3ORWLinit}. The handles for +\Lst{algo:ch6p3ORWLtrans} are given first, \texttt{GPUhandle} will be +accessed exclusively (therefore the \texttt{write}) and, as said, +\texttt{curRead} is used shared (so a \texttt{read}). Both are inserted in the +FIFO of there respective resources with highest priority, specified by the +\texttt{0}s in the third function parameter. The resources to which they +correspond are specified through calls to the macro \texttt{ORWL\_LOCATION}, +indicating the task (\texttt{orwl\_mytid} is the ID of the current task) and the +specific resource of that task, here \texttt{GPU} and \texttt{curBlock}. + +Likewise, a second block of insertions concerns the handles for +\Lst{algo:ch6p3ORWLrcopy}: \texttt{newWrite} reclaims an exclusive +access and \texttt{remRead} a shared. \texttt{remRead} corresponds to a +resource of another task; \texttt{previous(orwl\_mytid)} is supposed to return +the ID of the previous task in the cycle. Both accesses can be effected +concurrently with the previous operation, so we insert with the same priority +\texttt{0} as before. + +Then, for the specification of the third operation +(\Lst{algo:ch6p3ORWLlcopy}) we need to use a different priority: the +copy operation from \texttt{nextBlock} to \texttt{curBlock} has to be performed +after the other operations have terminated. + +As a final step, we then tell ORWL that the specification of all accesses is +complete and that it may schedule\index{ORWL!schedule} all these accesses in the respective FIFOs of +the resources. +%\begin{algorithm}[H] +% \caption{Dynamic initialization of access mode and priorities.} +% \label{algo:ch6p3ORWLinit} +\begin{Listing}{algo:ch6p3ORWLinit}{Dynamic initialization of access mode and priorities} +/* One operation with priority 0 (highest) consists */ +/* in copying from current to the GPU and run MM, there. */ +orwl_write_insert(&GPUhandle, ORWL_LOCATION(orwl_mytid, GPU), 0); +orwl_read_insert(&curRead, ORWL_LOCATION(orwl_mytid, curBlock), 0); + +/* Another operation with priority 0 consists */ +/* in copying from remote to next */ +orwl_read_insert(&remRead, ORWL_LOCATION(previous(orwl_mytid), curBlock), 0); +orwl_write_insert(&nextWrite, ORWL_LOCATION(orwl_mytid, nextBlock), 0); + +/* One operation with priority 1 consists */ +/* in copying from next to current */ +orwl_read_insert(&nextRead, ORWL_LOCATION(orwl_mytid, nextBlock), 1); +orwl_write_insert(&curWrite, ORWL_LOCATION(orwl_mytid, curBlock), 1); + +orwl_schedule(); +\end{Listing} +%\end{algorithm} + +\subsection{Tasks and operations} +\label{sec:ch6p3tasks} +With that example we have now seen that ORWL distinguishes +\emph{tasks}\index{ORWL!task} and +\emph{operations}\index{ORWL!operation}. An ORWL program is divided into tasks that can be seen as the +algorithmic units that will concurrently access the resources that the program +uses. A task for ORWL is characterized by +\begin{itemize} +\item a fixed set of resources that it manages, ``\emph{owns}'', in our example + the four resources that are declared in \Lst{algo:ch6p3ORWLresources}. +\item a larger set of resources that it accesses, in our example all resources + that are used in \Lst{algo:ch6p3ORWLinit}. +\item a set of operations that act on these resources, in our example the three + operations that are used in \Lst{algo:ch6p3ORWLBCCMM}, and that are + elaborated in Listings~\ref{algo:ch6p3ORWLlcopy},~\ref{algo:ch6p3ORWLrcopy} + and~\ref{algo:ch6p3ORWLtrans}. +\end{itemize} + +\noindent +Each ORWL operation is characterized by +\begin{itemize} +\item one resource, usually one that is owned by the enclosing task, that it + accesses exclusively. In our example, operation~1 has exclusive access to the + \texttt{next} block, operation~2 has exclusive access the \texttt{GPU} + resource, and operation~3 to the \texttt{A} block. +\item several resources that are accessed concurrently with others. +\end{itemize} +In fact each ORWL~operation can be viewed as a compute-and-update procedure of a +particular resource with knowledge of another set of resources. + + +%%% Local Variables: +%%% mode: latex +%%% fill-column: 80 +%%% ispell-dictionary: "american" +%%% mode: flyspell +%%% TeX-master: "../../BookGPU.tex" +%%% End: diff --git a/BookGPU/Chapters/chapter6/PartieSync.tex b/BookGPU/Chapters/chapter6/PartieSync.tex new file mode 100755 index 0000000..ac9a3dd --- /dev/null +++ b/BookGPU/Chapters/chapter6/PartieSync.tex @@ -0,0 +1,713 @@ +\section{General scheme of synchronous code with computation/communication + overlapping in GPU clusters}\label{ch6:part1} + +%Notre chapitre précédent sur l'optimisation des schémas +%parallèles~\cite{ChVCV13}. + +\subsection{Synchronous parallel algorithms on GPU clusters} + +%\subsubsection{Considered parallel algorithms and implementations} + +\noindent {\bf Considered parallel algorithms and implementations} +\medskip + +This section focusses on synchronous parallel algorithms implemented with +overlapping computations and communications\index{overlap!computation and communication}. Parallel synchronous algorithms are +easier to implement, debug and maintain than asynchronous ones, see +Section~\ref{ch6:part2}. Usually, they follow a BSP-like parallel +scheme\index{BSP parallel scheme}, +alternating local computation steps and communication steps, +see~\cite{Valiant:BSP}. Their execution is usually deterministic, excepted +for stochastic algorithms that contain random number generations. Even in +this case, their execution can be controlled during debug steps, allowing to +track and to fix bugs quickly. + +However, depending on the properties of the algorithm, it is sometimes possible to +overlap computations and communications. If processes exchange data +that is not needed for the +computation that is following immediately, it is possible to implement such +an overlap. We have investigated the efficiency of this approach in previous +works~\cite{GUSTEDT:2007:HAL-00280094:1,ChVCV13}, using standard parallel +programming tools to achieve the implementation. + +The normalized and well known Message Passing Interface (MPI\index{MPI}) includes some asynchronous +point-to-point communication routines, that should allow to implement some +communication/computation overlap. However, current MPI implementations do not achieve +that goal efficiently; effective overlapping with MPI requires a group of +dedicated threads (in our case implemented with OpenMP\index{OpenMP}) for the basic +synchronous communications while another group of threads executes computations +in parallel. +% Finally, with explicit OpenMP +% threads executing MPI communications or various computations, we succeeded to +% decrease the execution time. +Nevertheless, communication and computation are not completely independent on +modern multicore architectures: they use shared hardware components such as the +interconnection bus and the RAM. Therefore that approach only saved up to $20\%$ +of the expected time on such a platform. This picture changes on clusters +equipped with GPU. They effectively allow independence of computations on the +GPU and communication on the mainboard (CPU, interconnection bus, RAM, network +card). We saved up to $100\%$ of the expected time on our GPU cluster, as we +will expose in the next section. + + +%\subsubsection{Specific interests for GPU clusters} + +\bigskip +\noindent {\bf Specific interests in GPU clusters} +\medskip + +In a computing node, a GPU is a kind of scientific coprocessor usually located +on an auxiliary board, with its own memory. So, when data have been transferred +from the CPU memory to the GPU memory, then GPU computations can be achieved on +the GPU board, totally in parallel of any CPU activities (like internode cluster +communications). The CPU and the GPU access their respective memories and do not +interfere, so they can achieve a very good overlap\index{overlap!computation +and computation} of their activities (better +than two CPU cores). + +But using a GPU on a computing node requires to transfer data from the CPU to +the GPU memory, and to transfer the computation results back from the GPU to the +CPU. Transfer times are not excessive, but depending on the application +they still can be significant compared to the GPU computation times. So, sometimes it +can be interesting to overlap the internode cluster communications with both the +CPU/GPU data transfers and the GPU computations. We can identify four main +parallel programming schemes on a GPU cluster: + +\begin{enumerate} +\item parallelizing only 'internode CPU communications' with 'GPU computations', + and achieving CPU/GPU data transfers before and after this parallel step, +\item parallelizing 'internode CPU communications' with a '(sequential) sequence + of CPU/GPU data transfers and GPU computations', +\item parallelizing 'internode CPU communications' with a 'streamed sequence of + CPU/GPU data transfers and GPU computations', +\item parallelizing 'internode CPU communications' with 'CPU/GPU data transfers' + and with 'GPU computations', interleaving computation-communication + iterations. +\end{enumerate} + + +\subsection{Native overlap of CPU communications and GPU computations} + +\begin{figure}[t] + \centering + \includegraphics{Chapters/chapter6/figures/Sync-NativeOverlap.pdf} + \caption{Native overlap of internode CPU communications with GPU computations.} + \label{fig:ch6p1overlapnative} +\end{figure} + +Using CUDA\index{CUDA}, GPU kernel executions are non-blocking, and GPU/CPU data +transfers\index{CUDA!data transfer} +are blocking or non-blocking operations. All GPU kernel executions and CPU/GPU +data transfers are associated to "streams"\index{CUDA!stream}, and all operations on a same stream +are serialized. When transferring data from the CPU to the GPU, then running GPU +computations and finally transferring results from the GPU to the CPU, there is +a natural synchronization and serialization if these operations are achieved on +the same stream. GPU developers can choose to use one (default) or several +streams. In this first scheme of overlapping, we consider parallel codes using +only one GPU stream. + +"Non-blocking GPU kernel execution" means a CPU routine runs a parallel +execution of a GPU computing kernel, and the CPU routine continues its execution +(on the CPU) while the GPU kernel is running (on the GPU). Then the CPU routine +can initiate some communications with some others CPU, and so it automatically +overlaps the internode CPU communications with the GPU computations (see +\Fig{fig:ch6p1overlapnative}). This overlapping is natural when programming with +CUDA and MPI: it is easy to deploy, but does not overlap the CPU/GPU data +transfers. + +%\begin{algorithm}[t] +% \caption{Generic scheme implicitly overlapping MPI communications with CUDA GPU +% computations}\label{algo:ch6p1overlapnative} +\pagebreak +\begin{Listing}{algo:ch6p1overlapnative}{Generic scheme implicitly overlapping MPI communications with CUDA GPU computations} +// Input data and result variables and arrays (example with +// float datatype, 1D input arrays, and scalar results) +float *cpuInputTabAdr, *gpuInputTabAdr; +float *cpuResTabAdr, *gpuResAdr; + +// CPU and GPU array allocations +cpuInputTabAdr = malloc(sizeof(float)*N); +cudaMalloc(&gpuInputTabAdr,sizeof(float)*N); +cpuResTabAdr = malloc(sizeof(float)*NbIter); +cudaMalloc(&gpuResAdr,sizeof(float)); + +// Definition of the Grid of blocks of GPU threads +dim3 Dg, Db; +Dg.x = ... +... + +// Indexes of source and destination MPI processes +int dest = ... +int src = ... + +// Computation loop (using the GPU) +for (int i = 0; i < NbIter; i++) { + cudaMemcpy(gpuInputTabAdr, cpuInputTabAdr, // Data transfer: + sizeof(float)*N, // CPU --> GPU (sync. op) + cudaMemcpyHostToDevice); + gpuKernel_k1<<>>(); // GPU comp. (async. op) + MPI_Sendrecv_replace(cpuInputTabAdr, // MPI comms. (sync. op) + N,MPI_FLOAT, + dest, 0, src, 0, ...); + // IF there is (now) a result to transfer from the GPU to the CPU: + cudaMemcpy(cpuResTabAdr + i, gpuResAdr, // Data transfer: + sizeof(float), // GPU --> CPU (sync. op) + cudaMemcpyDeviceToHost); +} +... +\end{Listing} +%\end{algorithm} + +\Lst{algo:ch6p1overlapnative} introduces the generic code of a MPI+CUDA +implementation, natively and implicitly overlapping MPI communications with CUDA +GPU computations. Some input data and output results arrays and variables are +declared and allocated from line 1 up to 10, and a computation loop is +implemented from line 21 up to 34. At each iteration: +\begin{itemize} +\item \texttt{cudaMemcpy} on line 23 transfers data from the CPU memory + to the GPU memory. This is a basic and synchronous data transfer. +\item \texttt{gpuKernel\_k1<<>>} on line 26 starts GPU computation + (running a GPU kernel on the grid of blocks of threads defined at line 12 to + 15). This is a standard GPU kernel run, it is an asynchronous operation. The + CPU can continue to run its code. +\item \texttt{MPI\_Sendrecv\_replace} on line 27 achieves some blocking + internode communications, overlapping GPU computations started just before. +\item If needed, \texttt{cudaMemcpy} on line 31 transfers the iteration result from + one variable in the GPU memory at one array index in the CPU memory (in this example the CPU + collects all iteration results in an array). This operation is started after + the end of the MPI communication (previous instruction) and after the end of + the GPU kernel execution. CUDA insures an implicit synchronization of all + operations involving the same GPU stream, like the default stream in this + example. Result transfer has to wait the GPU kernel execution is finished. + If there is no result transfer implemented, the next operation on the GPU + will wait until the GPU kernel execution will be ended. +\end{itemize} + +This implementation is the easiest one involving the GPU. It achieves an +implicit overlap of internode communications and GPU computations, no explicit +multithreading is required on the CPU. However, CPU/GPU data transfers are +achieved serially and not overlapped. + + +\subsection{Overlapping with sequences of transfers and computations} + +%\subsubsection{Overlapping with a sequential GPU sequence} + +\noindent {\bf Overlapping with a sequential GPU sequence} +\medskip + +\begin{figure}[t] + \centering + \includegraphics[width=\columnwidth]{Chapters/chapter6/figures/Sync-SeqSequenceOverlap.pdf} + \caption{Overlap of internode CPU communications with a sequence of CPU/GPU data transfers and GPU + computations.} + \label{fig:ch6p1overlapseqsequence} +\end{figure} + +When CPU/GPU data transfers are not negligible compared to GPU computations, it +can be interesting to overlap internode CPU computations with a \emph{GPU + sequence}\index{GPU sequence} including CPU/GPU data transfers and GPU computations (see +\Fig{fig:ch6p1overlapseqsequence}). Algorithmic issues of this approach are basic, +but their implementation require explicit CPU multithreading and +synchronization, and CPU data buffer duplication. We need to implement two +threads, one starting and achieving MPI communications, and the other running +the \emph{GPU sequence}. OpenMP allows an easy and portable implementation of +this overlapping strategy. However, it remains more complex to develop and to +maintain than the previous strategy (overlapping only internode CPU +communications and GPU computations), and should be adopted only when CPU/GPU +data transfer times are not negligible. + +%\begin{algorithm} +% \caption{Generic scheme explicitly overlapping MPI communications with +% sequences of CUDA CPU/GPU transfers and CUDA GPU +% computations}\label{algo:ch6p1overlapseqsequence} +\begin{Listing}{algo:ch6p1overlapseqsequence}{Generic scheme explicitly overlapping MPI communications with sequences of CUDA CPU/GPU transfers and CUDA GPU computations} +// Input data and result variables and arrays (example with +// float datatype, 1D input arrays, and scalar results) +float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture, *gpuInputTabAdr; +float *cpuResTabAdr, *gpuResAdr; + +// CPU and GPU array allocations +cpuInputTabAdrCurrent = malloc(sizeof(float)*N); +cpuInputTabAdrFuture = malloc(sizeof(float)*N); +cudaMalloc(&gpuInputTabAdr,sizeof(float)*N); +cpuResTabAdr = malloc(sizeof(float)*NbIter); +cudaMalloc(&gpuResAdr,sizeof(float)); + +// Definition of the Grid of blocks of GPU threads +dim3 Dg, Db; +Dg.x = ... +... + +// Indexes of source and destination MPI processes +int dest = ... +int src = ... + +// Set the number of OpenMP threads (to create) to 2 +omp_set_num_threads(2); +// Create threads and start the parallel OpenMP region +#pragma omp parallel +{ + // Buffer pointers (thread local variables) + float *current = cpuInputTabAdrCurrent; + float *future = cpuInputTabAdrFuture; + float *tmp; + + // Computation loop (using the GPU) + for (int i = 0; i < NbIter; i++) { + + // - Thread 0: achieves MPI communications + if (omp_get_thread_num() == 0) { + MPI_Sendrecv(current, // MPI comms. (sync. op) + N, MPI_FLOAT, dest, 0, + future, + N, MPI_FLOAT, dest, 0, ...); + + // - Thread 1: achieves the GPU sequence (GPU computations and + // CPU/GPU data transfers) + } else if (omp_get_thread_num() == 1) { + cudaMemcpy(gpuInputTabAdr, current, // Data transfer: + sizeof(float)*N, // CPU --> GPU (sync. op) + cudaMemcpyHostToDevice); + gpuKernel_k1<<>>(); // GPU comp. (async. op) + // IF there is (now) a result to transfer from the GPU to the CPU: + cudaMemcpy(cpuResTabAdr + i, gpuResAdr, // Data transfer: + sizeof(float), // GPU --> CPU (sync. op) + cudaMemcpyDeviceToHost); + } + + // - Wait for both threads have achieved their iteration tasks + #pragma omp barrier + // - Each thread permute its local buffer pointers + tmp = current; + current = future; + future = tmp; + } // End of computation loop +} // End of OpenMP parallel region +... +\end{Listing} +%\end{algorithm} + +\Lst{algo:ch6p1overlapseqsequence} introduces the generic code of a +MPI+OpenMP+CUDA implementation, explicitly overlapping MPI communications with +\emph{GPU sequences}. Lines 25--62 implement the OpenMP parallel region, +around the computation loop (lines 33--61). For performances it is +important to create and destroy threads only one time (not at each iteration): +the parallel region has to surround the computation loop. Lines 1--11 +consist in declaration and allocation of input data arrays and result arrays and +variables, like in previous algorithm (\Lst{algo:ch6p1overlapnative}). However, we implement two input data buffers on the +CPU (current and future version). As we aim to overlap internode MPI +communications and GPU sequence, including CPU to GPU data transfer of current +input data array, we need to store the received new input data array in a +separate buffer. Then, the current input data array will be safely read on the +CPU and copied into the GPU memory. + +The thread creations\index{OpenMP!thread creation} are easily achieved with one OpenMP directive (line +25). Then each thread defines and initializes \emph{its} local buffer pointers, +and enters \emph{its} computing loop (lines 27--33). Inside the computing +loop, a test on the thread number allows to run a different code in each +thread. Lines 37--40 implement the MPI synchronous communication run by +thread number $0$. Lines 45--52 implement the GPU sequence run by thread +$1$: CPU to GPU data transfer, GPU computation and GPU to CPU result +transfer (if needed). Details of the three operations of this sequence have not changed +compared to the previous overlapping strategy. + +At the end of \Lst{algo:ch6p1overlapseqsequence}, an OpenMP synchronization +barrier\index{OpenMP!barrier} on line 56 allows to wait OpenMP threads have achieved MPI +communications and GPU sequence, and do not need to access the current input +data buffer. Then, each thread permute its local buffer pointers (lines 58--60), +and is ready to enter the next iteration, processing the new current input +array. + + +%\subsubsection{Overlapping with a streamed GPU sequence} + +\bigskip +\noindent {\bf Overlapping with a streamed GPU sequence} +\medskip + +\begin{figure}[t] + \centering + \includegraphics[width=\columnwidth]{Chapters/chapter6/figures/Sync-StreamSequenceOverlap.pdf} + \caption{Overlap of internode CPU communications with a streamed sequence of CPU/GPU data + transfers and GPU computations.} + \label{fig:ch6p1overlapstreamsequence} +\end{figure} + +Depending on the algorithm implemented, it is sometimes possible to split the +GPU computation into several parts processing distinct data. Then, we can +speedup the \emph{GPU sequence} using several CUDA \emph{streams}\index{CUDA!stream}. The goal is +to overlap CPU/GPU data transfers with GPU computations\index{overlap!GPU data transfers with GPU computation} inside the \emph{GPU + sequence}. Compared to the previous overlapping strategy, we have to split the +initial data transfer in a set of $n$ asynchronous and smaller data transfers, +and to split the initial GPU kernel call in a set of $n$ calls to the same GPU +kernel. Usually, these smaller calls are deployed with less GPU threads +(i.e. associated to a smaller grid of blocks of threads). Then, the first GPU +computations can start as soon as the first data transfer has been achieved, and +next transfers can be done in parallel of next GPU computations (see +\Fig{fig:ch6p1overlapstreamsequence}). + +NVIDIA advises to start all asynchronous CUDA data transfers, and then to call +all CUDA kernel executions, using up to $16$ streams \cite{cudabestpractices}. +Then, CUDA driver and +runtime optimize the global execution of these operations. So, we cumulate two +overlapping mechanisms. The former is controlled by CPU multithreading, and +overlap MPI communications and the \emph{streamed GPU sequence}. The latter is +controlled by CUDA programming, and overlap CPU/GPU data transfers and GPU +computations. Again, OpenMP allows to easily implement the CPU multithreading, +and to wait for the end of both CPU threads before to execute the next instructions +of the code. + +%\begin{algorithm} +% \caption{Generic scheme explicitly overlapping MPI communications with streamed sequences of CUDA +% CPU/GPU transfers and CUDA GPU computations}\label{algo:ch6p1overlapstreamsequence} +\begin{Listing}{algo:ch6p1overlapstreamsequence}{Generic scheme explicitly overlapping MPI communications with streamed sequences of CUDA CPU/GPU transfers and CUDA GPU computations} +// Input data and result variables and arrays (example with +// float datatype, 1D input arrays, and scalar results) +float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture, *gpuInputTabAdr; +float *cpuResTabAdr, *gpuResAdr; +// CPU and GPU array allocations (allocates page-locked CPU memory) +cudaHostAlloc(&cpuInputTabAdrCurrent,sizeof(float)*N,cudaHostAllocDefault); +cudaHostAlloc(&cpuInputTabAdrFuture,sizeof(float)*N,cudaHostAllocDefault); +cudaMalloc(&gpuInputTabAdr,sizeof(float)*N); +cpuResTabAdr = malloc(sizeof(float)*NbIter); +cudaMalloc(&gpuResAdr,sizeof(float)); +// Stream declaration and creation +cudaStream_t TabS[NbS]; +for(int s = 0; s < NbS; s++) + cudaStreamCreate(&TabS[s]); +// Definition of the Grid of blocks of GPU threads +... +// Set the number of OpenMP threads (to create) to 2 +omp_set_num_threads(2); +// Create threads and start the parallel OpenMP region +#pragma omp parallel +{ + // Buffer pointers (thread local variables) + float *current = cpuInputTabAdrCurrent; + float *future = cpuInputTabAdrFuture; + float *tmp; + // Stride of data processed per stream + int stride = N/NbS; + // Computation loop (using the GPU) + for (int i = 0; i < NbIter; i++) { + // - Thread 0: achieves MPI communications + if (omp_get_thread_num() == 0) { + MPI_Sendrecv(current, // MPI comms. (sync. op) + N, MPI_FLOAT, dest, 0, + future, + N, MPI_FLOAT, dest, 0, ...); + // - Thread 1: achieves the streamed GPU sequence (GPU computations + // and CPU/GPU data transfers) + } else if (omp_get_thread_num() == 1) { + for (int s = 0; s < NbS; s++) { // Start all data transfers: + cudaMemcpyAsync(gpuInputTabAdr + s*stride, // CPU --> GPU + current + s*stride, // (async. ops) + sizeof(float)*stride, + cudaMemcpyHostToDevice, + TabS[s]); + } + for (int s = 0; s < NbS; s++) { // Start all GPU comps. (async.) + gpuKernel_k1<<>>(gpuInputTabAdr + s*stride); + } + cudaThreadSynchronize(); // Wait all threads are ended + // IF there is (now) a result to transfer from the GPU to the CPU: + cudaMemcpy(cpuResTabAdr, // Data transfers: + gpuResAdr, // GPU --> CPU (sync. op) + sizeof(float), + cudaMemcpyDeviceToHost); + } + // - Wait for both threads have achieved their iteration tasks + #pragma omp barrier + // - Each thread permute its local buffer pointers + tmp = current; current = future; future = tmp; + } // End of computation loop +} // End of OpenMP parallel region +... +// Destroy the streams +for(int s = 0; s < NbS; s++) + cudaStreamDestroy(TabS[s]); +... +\end{Listing} +%\end{algorithm} + +\Lst{algo:ch6p1overlapstreamsequence} introduces the generic MPI+OpenMP+CUDA +code explicitly overlapping MPI communications with +\emph{streamed GPU sequences}\index{GPU sequence!streamed}. Efficient usage of CUDA \emph{streams} requires to execute +asynchronous CPU/GPU data transfers, that needs to read page-locked +data\index{page-locked data} in CPU memory. So, CPU +memory allocations on lines 6 and 7 are implemented with \texttt{cudaHostAlloc} instead of +the basic \texttt{malloc} function. Then, $NbS$ \emph{streams} are created on lines 12--14. +Usually we create $16$ streams: the maximum number supported by CUDA. + +An OpenMP parallel region\index{OpenMP!parallel region} including two threads is implemented on lines 17--61 of +\Lst{algo:ch6p1overlapstreamsequence}, similarly to the previous algorithm (see +\Lst{algo:ch6p1overlapseqsequence}). Code of thread $0$ achieving MPI communication is unchanged, but +code of thread $1$ is now using streams. Following NVIDIA recommandations, we have first implemented +a loop starting $NbS$ asynchronous data transfers (lines 39--45): transferring $N/NbS$ data on +each stream. Then we have implemented a second loop (lines 46--48), starting asynchronous +executions of $NbS$ grids of blocks of GPU threads (one per stream). Data transfers and kernel +executions on the same stream are synchronized by CUDA and the GPU. So, each kernel execution will +start after its data will be transferred into the GPU memory, and the GPU scheduler ensures to start +some kernel executions as soon as the first data transfers are achieved. Then, next data transfers +will be overlapped with GPU computations. After the kernel calls, on the different streams, +we wait for the end of all GPU threads previously run, calling an explicit synchronization +function on line 49. This synchronization is not mandatory, but it will make the implementation more +robust and will facilitate the debugging steps: all GPU computations run by the OpenMP thread number +$1$ will be achieved before this thread will enter a new loop iteration, or before the computation +loop will be ended. + +If a partial result has to be transferred from GPU to CPU memory at the end of each loop iteration +(for example the result of one \emph{reduction} per iteration), this transfer is achieved +synchronously on the default stream (no particular stream is specified) on lines 51--54. +Availability of the result values is ensured by the synchronization implemented on line 49. +However, if a partial result has to be transferred on the CPU on each stream, then $NbS$ asynchronous data +transfers could be started in parallel (one per stream), and should be implemented before the +synchronization operation on line 49. The end of the computation loop includes a synchronization +barrier of the two OpenMP threads, waiting they have finished to access the different data +buffers in the current iteration. Then, each OpenMP thread exchanges its local buffer pointers, like +in the previous algorithm. However, after the computation loop, we have added the +destruction of the CUDA streams (lines 63--65). + +Finally, CUDA streams\index{CUDA!stream} have been used to extend \Lst{algo:ch6p1overlapseqsequence} +with respect to its global scheme. \Lst{algo:ch6p1overlapstreamsequence} still creates an +OpenMP parallel region, with two CPU threads, one in charge of MPI communications, and the other +managing data transfers and GPU computations. Unfortunately, using GPU streams require to be able to +split a GPU computation in independent subparts, working on independent subsets of data. +\Lst{algo:ch6p1overlapstreamsequence} is not so generic than \Lst{algo:ch6p1overlapseqsequence}. + + +\subsection{Interleaved communications-transfers-computations overlapping} + +\begin{figure}[t] + \centering + \includegraphics{Chapters/chapter6/figures/Sync-CompleteInterleaveOverlap.pdf} + \caption{Complete overlap of internode CPU communications, CPU/GPU data transfers and GPU + computations, interleaving computation-communication iterations} + \label{fig:ch6p1overlapinterleaved} +\end{figure} + +Many algorithms do not support to split data transfers and kernel calls, and can +not exploit CUDA streams. For example, when each GPU thread requires to access +some data spread in the global set of transferred data. Then, it is possible to +overlap internode CPU communications and CPU/GPU data transfers and GPU +computations, if the algorithm achieves \emph{computation-communication + iterations} and if we can interleave these iterations. At iteration $k$: CPUs +exchange data $D_k$, each CPU/GPU couple transfers data $D_k$, and each GPU +achieves computations on data $D_{k-1}$ (see +\Fig{fig:ch6p1overlapinterleaved}). Compared to the previous strategies, this +strategy requires twice as many CPU data buffers +and twice as many GPU buffers. + +%\begin{algorithm} +% \caption{Generic scheme explicitly overlapping MPI communications, CUDA CPU/GPU transfers and CUDA +% GPU computations, interleaving computation-communication iterations}\label{algo:ch6p1overlapinterleaved} +\begin{Listing}{algo:ch6p1overlapinterleaved}{Generic scheme explicitly overlapping MPI communications, CUDA CPU/GPU transfers and CUDA GPU computations, interleaving computation-communication iterations} +// Input data and result variables and arrays (example with +// float datatype, 1D input arrays, and scalar results) +float *cpuInputTabAdrCurrent, *cpuInputTabAdrFuture; +float *gpuInputTabAdrCurrent, *gpuInputTabAdrFuture; +float *cpuResTabAdr, *gpuResAdr; + +// CPU and GPU array allocations +cpuInputTabAdrCurrent = malloc(sizeof(float)*N); +cpuInputTabAdrFuture = malloc(sizeof(float)*N); +cudaMalloc(&gpuInputTabAdrCurrent,sizeof(float)*N); +cudaMalloc(&gpuInputTabAdrFuture,sizeof(float)*N); +cpuResTabAdr = malloc(sizeof(float)*NbIter); +cudaMalloc(&gpuResAdr,sizeof(float)); + +// Definition of the Grid of blocks of GPU threads +dim3 Dg, Db; Dg.x = ... +// Indexes of source and destination MPI processes +int dest, src; dest = ... + +// Set the number of OpenMP threads (to create) to 2 +omp_set_num_threads(3); +// Create threads and start the parallel OpenMP region +#pragma omp parallel +{ + // Buffer pointers (thread local variables) + float *cpuCurrent = cpuInputTabAdrCurrent; + float *cpuFuture = cpuInputTabAdrFuture; + float *gpuCurrent = gpuInputTabAdrCurrent; + float *gpuFuture = gpuInputTabAdrFuture; + float *tmp; + + // Computation loop on: NbIter + 1 iteration + for (int i = 0; i < NbIter + 1; i++) { + // - Thread 0: achieves MPI communications + if (omp_get_thread_num() == 0) { + if (i < NbIter) { + MPI_Sendrecv(cpuCurrent, // MPI comms. (sync. op) + N, MPI_FLOAT, dest, 0, + cpuFuture, + N, MPI_FLOAT, dest, 0, ...); + } + // - Thread 1: achieves the CPU/GPU data transfers + } else if (omp_get_thread_num() == 1) { + if (i < NbIter) { + cudaMemcpy(gpuFuture, cpuCurrent, // Data transfer: + sizeof(float)*N, // CPU --> GPU (sync. op) + cudaMemcpyHostToDevice); + } + // - Thread 2: achieves the GPU computations and the result transfer + } else if (omp_get_thread_num() == 2) { + if (i > 0) { + gpuKernel_k1<<>>(gpuCurrent); // GPU comp. (async. op) + // IF there is (now) a result to transfer from GPU to CPU: + cudaMemcpy(cpuResTabAdr + (i-1), // Data transfer: + gpuResAdr, sizeof(float), // GPU --> CPU (sync. op) + cudaMemcpyDeviceToHost); + } + } + // - Wait for both threads have achieved their iteration tasks + #pragma omp barrier + // - Each thread permute its local buffer pointers + tmp = cpuCurrent; cpuCurrent = cpuFuture; cpuFuture = tmp; + tmp = gpuCurrent; gpuCurrent = gpuFuture; gpuFuture = tmp; + } // End of computation loop +} // End of OpenMP parallel region +... +\end{Listing} +%\end{algorithm} + +\Lst{algo:ch6p1overlapinterleaved} introduces the generic code of a +MPI+OpenMP+CUDA implementation, explicitly interleaving +computation-communication iterations and overlapping MPI communications, CUDA CPU/GPU +transfers and CUDA GPU computations. As in the previous algorithms, we declare two CPU input data arrays +(current and future version) on line 3, but we also declare two GPU input data arrays on line 4. On +lines 8--11, these four data arrays are allocated, using \texttt{malloc} and +\texttt{cudaMalloc}. +We do not need to allocate page-locked memory space. On lines 23--65 we +create an OpenMP parallel region, configured to run three threads (see line 21). Lines 26--30 are +declarations of thread local pointers on data arrays and variables (each thread will use its own +pointers). On line 33, the three threads enter a computation loop of \texttt{NbIter + 1} +iterations. We need to run one more iteration than with previous algorithms. + +Lines 34--41 are the MPI communications, achieved by the thread number $0$. They send the +current CPU input data array to another CPU, and receive the future CPU input data array from +another CPU, like in previous algorithms. But this thread achieves communications only during the +\emph{first} \texttt{NbIter} iterations. Lines 43--48 are the CPU to GPU input data +transfers, achieved by thread number $1$. These data transfers are run in parallel of MPI +communications. They are run during the \emph{first} \texttt{NbIter} iterations, and transfer +current CPU input data array into the future GPU data array. Lines 50--57 +correspond to the code run by +thread number $3$. They start GPU computations, to process the current GPU input data array, and if +necessary +transfer a GPU result at an index of the CPU result array. These GPU computations and result +transfers are run during the \emph{last} \texttt{NbIter} iterations: the GPU computations +have to wait the first data transfer is ended before to start to process any data, and can not run +during the first iteration. So, the activity of the third thread is shifted of one iteration +compared to the activities of other threads. Moreover, the address of the current GPU input data +array has to be passed as a parameter of the kernel call on line 52, in order the GPU threads access +the right data array. Like in previous algorithms the GPU result is copied at one index of the CPU +result array, in lines 53--56, but due to the shift of the third thread activity this index is +now \texttt{(i - 1)}. + +Line 60 is a synchronization barrier\index{OpenMP!barrier} of the three OpenMP threads, followed by a pointer permutation +of local pointers on current and future data arrays, on line 62 and 63. Each +thread waits for the completion of other +threads to use the data arrays, and then permutes its data array pointers before to +enter a new loop iteration. + +This complete overlap of MPI communications and CPU/GPU data transfers and GPU computations, is not +too complex to implement, and can be a solution when GPU computations are not adapted to use CUDA +streams: when GPU computations can not be split in subparts working on independent subsets of input +data. However, it requires to run one more iterations (a total of \texttt{NbIter + 1} +iterations). Then, if the number of iterations is very small, it could be more interesting not to +attempt to overlap CPU/GPU data transfers and GPU computations, and to implement \Lst{algo:ch6p1overlapseqsequence}. + + +\subsection{Experimental validation} +\label{ch6:p1expes} +%\subsubsection{Experimentation testbed} + +\noindent {\bf Experimentation testbed} +\medskip + +Two clusters located at SUPELEC in Metz (France) have been used for the entire set of +experiments presented in this chapter: +% +\begin{itemize} + +\item The first consists of 17 nodes with an Intel Nehalem quad-core processor + at 2.67Ghz, 6 Gb RAM and an NVIDIA GeForce GTX480 GPU, each. + +\item The second consists of 16 nodes with an Intel core2 dual-core processor at + 2.67Ghz, 4 Gb RAM and an NVIDIA GeForce GTX580 GPU, each + +\end{itemize} +% +Both clusters have a Gigabit Ethernet interconnection network that is connected +through a DELL Power Object 5324 switch. The two switches are linked twice, +insuring the interconnection of the two clusters. The software environment +consists of a Linux Fedora 64bit OS (kernel v. 2.6.35), GNU C and C++ compilers +(v. 4.5.1) and the CUDA library (v. 4.2). + + +%\subsubsection{Validation of the synchronous approach} + +\bigskip +\noindent {\bf Validation of the synchronous approach} +\medskip + +\begin{figure}[t] + \centering + \includegraphics{Chapters/chapter6/curves/gpuSyncOverlap.pdf} + \caption{Experimental performances of different synchronous algorithms computing a + dense matrix product} + \label{fig:ch6p1syncexpematrixprod} +\end{figure} + +\label{ch6:p1block-cyclic} +We have experimented our approach of synchronous parallel algorithms with a classic +block cyclic algorithm for dense matrix multiplication\index{matrix + multiplication!block cyclic}. This problem requires to split two input matrices ($A$ and $B$) on a ring of +computing nodes, and to establish a circulation of the slices of $A$ matrix on the ring ($B$ matrix +partition does not evolve during all the run). Compared to our generic algorithms, there is no +partial result to transfer from GPU to CPU at the end of each computing iteration. The part of the +result matrix computed on each GPU is transferred on the CPU at the end of the computation loop. + +We have first implemented a synchronous version without any overlap of MPI communications, CPU/GPU +data transfers, and GPU computations. We have added some synchronizations in the native overlapping +version in order to avoid any overlap. We have measured the performance achieved on our cluster with +NVIDIA GTX480 GPUs and matrices sizes of 4096$\times$4096, and we have obtained the curves in \Fig{fig:ch6p1syncexpematrixprod} labeled +\emph{no-ovlp}. We observe that performance increases when the number of processor increases. Of course, +there is a significant increase in cost when comparing a single node (without any MPI communication) with +two nodes (starting to use MPI communications). But beyond two nodes we get a classical performance +curve. + +Then, we implemented and experimented \Lst{algo:ch6p1overlapnative}, see +\emph{ovlp-native} in \Fig{fig:ch6p1syncexpematrixprod}. The native +overlap of MPI communications with asynchronous run of CUDA kernels appears efficient. When the +number of nodes increases the ratio of the MPI communications increases a lot (because the +computation times decrease a lot). So, there is not a lot of GPU computation +time that remains to be +overlapped, and both \emph{no-ovlp} and \emph{ovlp-native} tend to the same +limit. Already, the native overlap performed in \Lst{algo:ch6p1overlapnative} +achieves a high level of performance very quickly, using only four +nodes. Beyond four nodes, a faster interconnection network would be required for +a performance increase. + +Finally, we implemented \Lst{algo:ch6p1overlapseqsequence}, overlapping MPI communications +with a GPU sequence including both CPU/GPU data transfers and GPU computations, +see \emph{ovlp-GPUsequence} in \Fig{fig:ch6p1syncexpematrixprod}. From four +up to sixteen nodes it achieves better performances than \emph{ovlp-native}: we better overlap +MPI communications. However, this parallelization mechanism has more overhead: OpenMP threads +have to be created and synchronized. Only for two nodes it is less efficient than the native +overlapping algorithm. Beyond two nodes, the CPU multithreading overhead seems compensated. +% No, it doesn't need the more implementation of time, but more implementation +% of code :) +\Lst{algo:ch6p1overlapseqsequence} requires more time for the implemention +and can be more complex to maintain, but such extra development cost is +justified if we are looking for better performance. + + +%%% Local Variables: +%%% mode: latex +%%% fill-column: 80 +%%% ispell-dictionary: "american" +%%% mode: flyspell +%%% TeX-master: "../../BookGPU.tex" +%%% End: diff --git a/BookGPU/Chapters/chapter6/biblio6.bib b/BookGPU/Chapters/chapter6/biblio6.bib new file mode 100644 index 0000000..65dfda7 --- /dev/null +++ b/BookGPU/Chapters/chapter6/biblio6.bib @@ -0,0 +1,310 @@ +@manual{cudabestpractices, + TITLE = {{NVIDIA {CUDA C} Best Practices Guide 4.0}}, + organization = {NVIDIA}, + month = {May}, + year = {2011}, + howpublished = "\url{http://docs.nvidia.com/cuda/pdf/CUDA_C_Best_Practices_Guide.pdf}", +} + +@ARTICLE{FS2000, + author = {A. Frommer and D. B. Szyld}, + year = 2000, + title = {On Asynchronous Iterations}, + journal = {J. Comput. and Appl. Math.}, + volume = 123, + pages = {201-216} +} + + +@Book{BT89, + author = {D.P.~Bertsekas and J.N.~Tsitsiklis}, + ALTeditor = {}, + title = {Parallel and Distributed Computation}, + publisher = {Prentice Hall}, + year = {1999}, + address = {Englewood Cliffs, New Jersey}, +} + +@InBook{ChapNRJ2011, + author = {Vialle, S. and Contassot-Vivier, S. and Jost, T.}, + editor = {Sanjay Ranka and Ishfag Ahmad}, + title = {Handbook of Energy-Aware and Green Computing}, + chapter = {Optimizing Computing and Energy Performances in Heterogeneous Clusters of CPUs and GPUs}, + publisher = {Chapman and Hall/CRC}, + year = {2012}, + url = {http://www.crcpress.com/product/isbn/9781466501164#}, + isbn = {9781466501164}, + OPTkey = {}, + OPTvolume = {}, + OPTnumber = {}, + series = {Computer \& Information Science Series}, + OPTtype = {}, + OPTaddress = {}, + OPTedition = {}, + month = {Jan}, + OPTpages = {}, + OPTannote = {} +} + +@InBook{ChVCV13, + author = {St\'{e}phane Vialle and Sylvain Contassot-Vivier}, + editor = {Magoul\'{e}s, Fr\'{e}d\'{e}ric}, + title = {Patterns for parallel programming on {G}{P}{U}s}, + chapter = {Optimization methodology for Parallel Programming of Homogeneous or Hybrid Clusters}, + publisher = {Saxe-Coburg Publications}, + year = {2013}, + OPTkey = {}, + OPTvolume = {}, + OPTnumber = {}, + OPTseries = {}, + OPTtype = {}, + OPTaddress = {}, + OPTedition = {}, + month = feb, + OPTpages = {}, + note = {to appear}, + OPTannote = {} +} + +@Book{BCC07, + author = {Bahi, Jacques M. and Contassot-Vivier, Sylvain and Couturier, Rapha\"{e}l}, + title = {Parallel Iterative Algorithms: from sequential to grid computing}, + publisher = {Chapman \& Hall/CRC}, + year = {2007}, + series = {Numerical Analysis \& Scientific Computing Series}, + OPTdoi = {http://www.crcpress.com/shopping_cart/products/product_detail.asp?sku=C808X&isbn=9781584888086&parent_id=&pc=}, +} + +@InProceedings{HPCS2002, + author = {Bahi, Jacques M. and Contassot-Vivier, Sylvain and Couturier, Rapha\"{e}l}, + title = {Asynchronism for Iterative Algorithms in a Global Computing Environment}, + booktitle = {The 16th Annual International Symposium on High Performance +Computing Systems and Applications (HPCS'2002)}, + pages = "90--97", + year = {2002}, + address = {Moncton, Canada}, + month = jun, +} + +@InProceedings{Vecpar08a, + author = {Bahi, Jacques M. and Contassot-Vivier, Sylvain and Couturier, Rapha\"{e}l}, + title = {An efficient and robust decentralized algorithm for detecting the global +convergence in asynchronous iterative algorithms}, + booktitle = {8th International Meeting on High Performance Computing for Computational Science, VECPAR'08}, + pages = {251--264}, + year = {2008}, + address = {Toulouse}, + month = jun, +} + +@article{ParCo05, + author = {Bahi, Jacques M. and Contassot-Vivier, Sylvain and Couturier, Rapha\"{e}l}, + title = {Evaluation of the Asynchronous Iterative Algorithms in the Context of Distant Heterogeneous Clusters}, + journal = {Parallel Computing}, + volume = {31}, + number = {5}, + year = {2005}, + pages = {439-461} +} + +@InProceedings{ECost10, + author = {Contassot-Vivier, Sylvain and Vialle, St\'{e}phane and Jost, Thomas}, + title = {Optimizing computing and energy performances on {GPU} clusters: experimentation on a {PDE} solver}, + booktitle = {COST Action IC0804 on Large Scale Distributed Systems,1st Year}, + OPTpages = {}, + year = {2010}, + editor = {Jean-Marc Pierson and Helmut Hlavacs}, + organization = {IRIT}, + note = {ISBN: 978-2-917490-10-5}, +} + +@Article{SuperCo05, + author = {Bahi, Jacques M. and Contassot-Vivier, Sylvain and Couturier, Rapha\"{e}l}, + title = {Performance comparison of parallel programming environments for implementing {AIAC} algorithms}, + journal = {Journal of Supercomputing. Special Issue on Performance Modelling and Evaluation of Parallel and Distributed Systems}, + year = {2006}, + volume = {35}, + number = {3}, + pages = {227-244}, +} + +@InProceedings{Para10, + author = {Contassot-Vivier, Sylvain and Jost, Thomas and Vialle, St\'{e}phane}, + title = {Impact of asynchronism on {GPU} accelerated parallel iterative computations}, + booktitle = {PARA 2010 conference: State of the Art in Scientific and Parallel Computing}, + OPTpages = {--}, + year = {2010}, + address = {Reykjavík, Iceland}, + month = jun, +} + +@InProceedings{ECost10, + author = {Contassot-Vivier, Sylvain and Vialle, St\'{e}phane and Jost, Thomas}, + title = {Optimizing computing and energy performances on {GPU} clusters: experimentation on a {PDE} solver}, + booktitle = {COST Action IC0804 on Large Scale Distributed Systems,1st Year}, + OPTpages = {}, + year = {2010}, + editor = {Jean-Marc Pierson and Helmut Hlavacs}, + organization = {IRIT}, + note = {ISBN: 978-2-917490-10-5}, +} + +@InCollection{JCVV10, + author = {Thomas Jost and Sylvain Contassot-Vivier and St\'{e}phane Vialle}, + title = {An efficient multi-algorithm sparse linear solver for {GPU}s}, + booktitle = {Parallel Computing : From Multicores and GPU's to Petascale}, + pages = {546--553}, + publisher = {IOS Press}, + year = {2010}, + OPTeditor = {}, + volume = {19}, + OPTnumber = {}, + series = {Advances in Parallel Computing}, + OPTtype = {}, + OPTchapter = {}, + OPTaddress = {}, + OPTedition = {}, + OPTmonth = {}, + OPTnote = {}, + OPTannote = {Extended version of EuroGPU symposium article, in the International Conference on Parallel Computing (ParCo) 2009} +} + +@InProceedings{ParCo09, + author = {Thomas Jost and Sylvain Contassot-Vivier and St\'{e}phane Vialle}, + title = {An efficient multi-algorithms sparse linear solver for {GPU}s}, + booktitle = {EuroGPU mini-symposium of the International Conference on Parallel Computing, ParCo'2009}, + pages = {546--553}, + year = {2009}, + address = {Lyon}, + month = sep, +} + +@InProceedings{BCVG11, + author = {Bahi, Jacques M. and Contassot-Vivier, Sylvain and Giersch, Arnaud}, + title = {Load Balancing in Dynamic Networks by Bounded Delays Asynchronous Diffusion}, + booktitle = {VECPAR 2010}, + pages = {352--365}, + year = {2011}, + editor = {J.M.L.M. Palma et al.}, + volume = {6449}, + OPTnumber = {}, + series = {LNCS}, + publisher = {Springer, Heidelberg}, + note = "\url{DOI:~10.1007/978-3-642-19328-6\33}" +} + +@manual{CUDA, + title = {{NVIDIA {CUDA} C Programming Guide 4.0}}, + organization = {NVIDIA}, + howpublished = "\url{http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Programming_Guide.pdf}", + month = {June}, + year = {2011} +} + +@Misc{openMPI, + title = {Open Source High Performance Computing}, + howpublished = {\url{http://www.open-mpi.org}} +} + +@Misc{MPI, + title = {Message Passing Interface}, + howpublished = {\url{http://www.mpi-forum.org/docs}} +} + +@Misc{openMP, + title = {Open{M}{P} multi-threaded programming {API}}, + howpublished = {\url{http://www.openmp.org}} +} + +@article{Hoefler08a, +author = {Torsten Hoefler and Andrew Lumsdaine}, +title = {Overlapping Communication and Computation with High Level Communication Routines}, +journal ={Cluster Computing and the Grid, IEEE International Symposium on}, +OPTvolume = {0}, +isbn = {978-0-7695-3156-4}, +year = {2008}, +pages = {572-577}, +note = "\url{http://doi.ieeecomputersociety.org/10.1109/CCGRID.2008.15}", +publisher = {IEEE Computer Society}, +address = {Los Alamitos, CA, USA}, +} + +@Article{Valiant:BSP, + author = {Valiant, Leslie G.}, + title = {A bridging model for parallel computation}, + journal = {Communications of the ACM}, + year = 1990, + volume = 33, + number = 8, + pages = {103-111} +} + +@inproceedings{gustedt:hal-00639289, + AUTHOR = {Gustedt, Jens and Jeanvoine, Emmanuel}, + TITLE = {{Relaxed Synchronization with Ordered Read-Write Locks}}, + BOOKTITLE = {{Euro-Par 2011: Parallel Processing Workshops}}, + YEAR = {2012}, + MONTH = May, + SERIES = {LNCS}, + EDITOR = {Michael Alexander and others}, + PUBLISHER = {Springer}, + VOLUME = {7155}, + PAGES = {387-397}, + ADDRESS = {Bordeaux, France}, + X-INTERNATIONAL-AUDIENCE = {yes}, + X-PROCEEDINGS = {yes}, + URL = {http://hal.inria.fr/hal-00639289}, + X-ID-HAL = {hal-00639289}, +} + +@article{clauss:2010:inria-00330024:1, + AUTHOR = {Clauss, Pierre-Nicolas and Gustedt, Jens}, + TITLE = {{Iterative Computations with Ordered Read-Write Locks}}, + JOURNAL = {{Journal of Parallel and Distributed Computing}}, + PUBLISHER = {Elsevier}, + VOLUME = {70}, + NUMBER = {5}, + PAGES = {496-504}, + YEAR = {2010}, + DOI = {10.1016/j.jpdc.2009.09.002}, + X-INTERNATIONAL-AUDIENCE = {yes}, + X-EDITORIAL-BOARD = {yes}, + URL = {http://hal.inria.fr/inria-00330024/en}, + X-ID-HAL = {inria-00330024}, +} + +@inproceedings{GUSTEDT:2007:HAL-00280094:1, + TITLE = {The par{X}{X}{L} Environment: Scalable Fine Grained Development for Large Coarse Grained Platforms}, + X-PAYS = {IT}, + X-INTERNATIONAL-AUDIENCE = {yes}, + AUTHOR = {Gustedt, Jens AND Vialle, Stéphane AND De Vivo, Amelia}, + BOOKTITLE = {PARA 06}, + LONG-BOOKTITLE = {PARA 06: Worshop on state-of-the-art in scientific and parallel computing }, + EDITOR = {Bo K{\aa}gstr{\"o}m and others}, + PAGES = {1094-1104 }, + ADDRESS = {Ume{\aa}, Sweden}, + SERIES = LNCS, + PUBLISHER = {Springer}, + VOLUME = {4699}, + YEAR = 2007, + URL = {http://hal-supelec.archives-ouvertes.fr/hal-00280094/en/}, + X-PROCEEDINGS = {yes}, +} + +@InProceedings{suss04:users_exper, + author = {S\"{u}{\ss}, Michael and Leopold, Claudia}, + title = {A User's Experience with Parallel Sorting and {O}pen{M}{P}}, + booktitle = {Proceedings of the 6th European Workshop on OpenMP (EWOMP)}, + pages = {23-28}, + year = 2004, + editor = {Eduard Ayguad\'{e} and others}, + address = {Stockholm, Sweden}} + + +@Book{C11, + editor = {JTC1/SC22/WG14}, + title = {Programming languages - C}, + publisher = {ISO}, + year = 2011, + number = {ISO/IEC 9899}, + edition = {Cor. 1:2012}} diff --git a/BookGPU/Chapters/chapter6/ch6.aux b/BookGPU/Chapters/chapter6/ch6.aux new file mode 100644 index 0000000..973672c --- /dev/null +++ b/BookGPU/Chapters/chapter6/ch6.aux @@ -0,0 +1,147 @@ +\relax +\@writefile{toc}{\author{Sylvain Contassot-Vivier}{}} +\@writefile{toc}{\author{Stephane Vialle}{}} +\@writefile{toc}{\author{Jens Gustedt}{}} +\@writefile{loa}{\addvspace {10\p@ }} +\@writefile{toc}{\contentsline {chapter}{\numberline {3}Development methodologies for GPU and cluster of GPUs}{23}} +\@writefile{lof}{\addvspace {10\p@ }} +\@writefile{lot}{\addvspace {10\p@ }} +\@writefile{toc}{\contentsline {section}{\numberline {3.1}Introduction}{24}} +\newlabel{ch6:intro}{{3.1}{24}} +\@writefile{toc}{\contentsline {section}{\numberline {3.2}General scheme of synchronous code with computation/communication overlapping in GPU clusters}{24}} +\newlabel{ch6:part1}{{3.2}{24}} +\@writefile{toc}{\contentsline {subsection}{\numberline {3.2.1}Synchronous parallel algorithms on GPU clusters}{24}} +\@writefile{lof}{\contentsline {figure}{\numberline {3.1}{\ignorespaces Native overlap of internode CPU communications with GPU computations.\relax }}{26}} +\newlabel{fig:ch6p1overlapnative}{{3.1}{26}} +\@writefile{toc}{\contentsline {subsection}{\numberline {3.2.2}Native overlap of CPU communications and GPU computations}{26}} +\newlabel{algo:ch6p1overlapnative}{{3.1}{27}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.1}Generic scheme implicitly overlapping MPI communications with CUDA GPU computations}{27}} +\@writefile{lof}{\contentsline {figure}{\numberline {3.2}{\ignorespaces Overlap of internode CPU communications with a sequence of CPU/GPU data transfers and GPU computations.\relax }}{28}} +\newlabel{fig:ch6p1overlapseqsequence}{{3.2}{28}} +\@writefile{toc}{\contentsline {subsection}{\numberline {3.2.3}Overlapping with sequences of transfers and computations}{28}} +\newlabel{algo:ch6p1overlapseqsequence}{{3.2}{29}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.2}Generic scheme explicitly overlapping MPI communications with sequences of CUDA CPU/GPU transfers and CUDA GPU computations}{29}} +\@writefile{lof}{\contentsline {figure}{\numberline {3.3}{\ignorespaces Overlap of internode CPU communications with a streamed sequence of CPU/GPU data transfers and GPU computations.\relax }}{30}} +\newlabel{fig:ch6p1overlapstreamsequence}{{3.3}{30}} +\newlabel{algo:ch6p1overlapstreamsequence}{{3.3}{31}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.3}Generic scheme explicitly overlapping MPI communications with streamed sequences of CUDA CPU/GPU transfers and CUDA GPU computations}{31}} +\@writefile{lof}{\contentsline {figure}{\numberline {3.4}{\ignorespaces Complete overlap of internode CPU communications, CPU/GPU data transfers and GPU computations, interleaving computation-communication iterations\relax }}{33}} +\newlabel{fig:ch6p1overlapinterleaved}{{3.4}{33}} +\@writefile{toc}{\contentsline {subsection}{\numberline {3.2.4}Interleaved communications-transfers-computations overlapping}{33}} +\newlabel{algo:ch6p1overlapinterleaved}{{3.4}{34}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.4}Generic scheme explicitly overlapping MPI communications, CUDA CPU/GPU transfers and CUDA GPU computations, interleaving computation-communication iterations}{34}} +\@writefile{toc}{\contentsline {subsection}{\numberline {3.2.5}Experimental validation}{36}} +\newlabel{ch6:p1expes}{{3.2.5}{36}} +\newlabel{ch6:p1block-cyclic}{{3.2.5}{36}} +\@writefile{lof}{\contentsline {figure}{\numberline {3.5}{\ignorespaces Experimental performances of different synchronous algorithms computing a dense matrix product\relax }}{37}} +\newlabel{fig:ch6p1syncexpematrixprod}{{3.5}{37}} +\@writefile{toc}{\contentsline {section}{\numberline {3.3}General scheme of asynchronous parallel code with computation/communication overlapping}{38}} +\newlabel{ch6:part2}{{3.3}{38}} +\@writefile{loa}{\contentsline {algorithm}{\numberline {1}{\ignorespaces Synchronous iterative scheme\relax }}{38}} +\newlabel{algo:ch6p2sync}{{1}{38}} +\@writefile{loa}{\contentsline {algorithm}{\numberline {2}{\ignorespaces Asynchronous iterative scheme\relax }}{38}} +\newlabel{algo:ch6p2async}{{2}{38}} +\@writefile{toc}{\contentsline {subsection}{\numberline {3.3.1}A basic asynchronous scheme}{40}} +\newlabel{ch6:p2BasicAsync}{{3.3.1}{40}} +\newlabel{algo:ch6p2BasicAsync}{{3.5}{40}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.5}Initialization of the basic asynchronous scheme}{40}} +\newlabel{algo:ch6p2BasicAsyncComp}{{3.6}{41}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.6}Computing function in the basic asynchronous scheme}{41}} +\newlabel{algo:ch6p2BasicAsyncSendings}{{3.7}{42}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.7}Sending function in the basic asynchronous scheme}{42}} +\newlabel{algo:ch6p2BasicAsyncReceptions}{{3.8}{43}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.8}Reception function in the basic asynchronous scheme}{43}} +\@writefile{toc}{\contentsline {subsection}{\numberline {3.3.2}Synchronization of the asynchronous scheme}{44}} +\newlabel{ch6:p2SsyncOverAsync}{{3.3.2}{44}} +\newlabel{algo:ch6p2Sync}{{3.9}{45}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.9}Initialization of the synchronized scheme}{45}} +\newlabel{algo:ch6p2SyncComp}{{3.10}{46}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.10}Computing function in the synchronized scheme}{46}} +\newlabel{algo:ch6p2SyncReceptions}{{3.11}{47}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.11}Reception function in the synchronized scheme}{47}} +\@writefile{toc}{\contentsline {subsection}{\numberline {3.3.3}Asynchronous scheme using MPI, OpenMP and CUDA}{48}} +\newlabel{ch6:p2GPUAsync}{{3.3.3}{48}} +\newlabel{algo:ch6p2AsyncSyncComp}{{3.12}{50}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.12}Computing function in the final asynchronous scheme}{50}} +\newlabel{algo:ch6p2syncGPU}{{3.13}{51}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.13}Computing function in the final asynchronous scheme}{51}} +\newlabel{algo:ch6p2FullOverAsyncMain}{{3.14}{53}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.14}Initialization of the main process of complete overlap with asynchronism}{53}} +\newlabel{algo:ch6p2FullOverAsyncComp1}{{3.15}{54}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.15}Computing function in the final asynchronous scheme with CPU/GPU overlap}{54}} +\newlabel{algo:ch6p2FullOverAsyncComp2}{{3.16}{55}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.16}Auxiliary computing function in the final asynchronous scheme with CPU/GPU overlap}{55}} +\@writefile{toc}{\contentsline {subsection}{\numberline {3.3.4}Experimental validation}{56}} +\newlabel{sec:ch6p2expes}{{3.3.4}{56}} +\@writefile{lof}{\contentsline {figure}{\numberline {3.6}{\ignorespaces Computation times of the test application in synchronous and asynchronous modes.\relax }}{57}} +\newlabel{fig:ch6p2syncasync}{{3.6}{57}} +\@writefile{lof}{\contentsline {figure}{\numberline {3.7}{\ignorespaces Computation times with or without overlap of Jacobian updatings in asynchronous mode.\relax }}{58}} +\newlabel{fig:ch6p2aux}{{3.7}{58}} +\@writefile{toc}{\contentsline {section}{\numberline {3.4}Perspective: A unifying programming model}{59}} +\newlabel{sec:ch6p3unify}{{3.4}{59}} +\@writefile{toc}{\contentsline {subsection}{\numberline {3.4.1}Resources}{59}} +\newlabel{sec:ch6p3resources}{{3.4.1}{59}} +\newlabel{algo:ch6p3ORWLresources}{{3.17}{60}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.17}Declaration of ORWL resources for a block-cyclic matrix multiplication}{60}} +\@writefile{toc}{\contentsline {subsection}{\numberline {3.4.2}Control}{60}} +\newlabel{sec:ch6p3ORWLcontrol}{{3.4.2}{60}} +\@writefile{toc}{\contentsline {subsection}{\numberline {3.4.3}Example: block-cyclic matrix multiplication (MM)}{61}} +\newlabel{sec:ch6p3ORWLMM}{{3.4.3}{61}} +\newlabel{algo:ch6p3ORWLBCCMM}{{3.18}{61}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.18}Block-cyclic matrix multiplication, high level per task view}{61}} +\newlabel{algo:ch6p3ORWLlcopy}{{3.19}{62}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.19}An iterative local copy operation}{62}} +\newlabel{algo:ch6p3ORWLrcopy}{{3.20}{62}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.20}An iterative remote copy operation as part of a block cyclic matrix multiplication task}{62}} +\newlabel{algo:ch6p3ORWLtrans}{{3.21}{62}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.21}An iterative GPU transfer and compute operation as part of a block cyclic matrix multiplication task}{62}} +\newlabel{algo:ch6p3ORWLdecl}{{3.22}{63}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.22}Dynamic declaration of handles to represent the resources}{63}} +\newlabel{algo:ch6p3ORWLinit}{{3.23}{64}} +\@writefile{lol}{\contentsline {lstlisting}{\numberline {3.23}Dynamic initialization of access mode and priorities}{64}} +\@writefile{toc}{\contentsline {subsection}{\numberline {3.4.4}Tasks and operations}{64}} +\newlabel{sec:ch6p3tasks}{{3.4.4}{64}} +\@writefile{toc}{\contentsline {section}{\numberline {3.5}Conclusion}{65}} +\newlabel{ch6:conclu}{{3.5}{65}} +\@writefile{toc}{\contentsline {section}{\numberline {3.6}Glossary}{65}} +\@writefile{toc}{\contentsline {section}{Bibliography}{66}} +\@setckpt{Chapters/chapter6/ch6}{ +\setcounter{page}{68} +\setcounter{equation}{0} +\setcounter{enumi}{4} +\setcounter{enumii}{0} +\setcounter{enumiii}{0} +\setcounter{enumiv}{21} +\setcounter{footnote}{0} +\setcounter{mpfootnote}{0} +\setcounter{part}{1} +\setcounter{chapter}{3} +\setcounter{section}{6} +\setcounter{subsection}{0} +\setcounter{subsubsection}{0} +\setcounter{paragraph}{0} +\setcounter{subparagraph}{0} +\setcounter{figure}{7} +\setcounter{table}{0} +\setcounter{numauthors}{0} +\setcounter{parentequation}{0} +\setcounter{subfigure}{0} +\setcounter{lofdepth}{1} +\setcounter{subtable}{0} +\setcounter{lotdepth}{1} +\setcounter{lstnumber}{17} +\setcounter{ContinuedFloat}{0} +\setcounter{float@type}{16} +\setcounter{algorithm}{2} +\setcounter{ALC@unique}{0} +\setcounter{ALC@line}{0} +\setcounter{ALC@rem}{0} +\setcounter{ALC@depth}{0} +\setcounter{AlgoLine}{0} +\setcounter{algocfline}{0} +\setcounter{algocfproc}{0} +\setcounter{algocf}{0} +\setcounter{proposition}{0} +\setcounter{proof}{0} +\setcounter{lstlisting}{23} +} diff --git a/BookGPU/Chapters/chapter6/ch6.tex b/BookGPU/Chapters/chapter6/ch6.tex new file mode 100755 index 0000000..79cd429 --- /dev/null +++ b/BookGPU/Chapters/chapter6/ch6.tex @@ -0,0 +1,89 @@ +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% Personnal defs +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +\newcommand{\Sec}[1]{Section~\ref{#1}} +\newcommand{\Fig}[1]{Figure~\ref{#1}} +\newcommand{\Alg}[1]{Algorithm~\ref{#1}} +\newcommand{\Lst}[1]{Listing~\ref{#1}} +\newcommand{\Tab}[1]{Table~\ref{#1}} +\newcommand{\Equ}[1]{(\ref{#1})} +\def\Reals{\mathbb{R}} + +\definecolor{shadecolor}{rgb}{0.95,0.95,0.95} +\newenvironment{Algo}{\vspace{-1em}\begin{center}\begin{minipage}[h]{0.95\columnwidth}\begin{shaded}\begin{tabbing}% + \hspace{3mm}\=\hspace{3mm}\=\hspace{3mm}\=\hspace{3mm}\=\hspace{3mm}\=\hspace{3mm}\=\hspace{3mm}\= \kill} % + { \end{tabbing}\vspace{-1em}\end{shaded}\end{minipage}\end{center}\vspace{-1em}} + +\lstnewenvironment{Listing}[2]{\lstset{basicstyle=\scriptsize\ttfamily,% + breaklines=true, breakatwhitespace=true, language=C, keywordstyle=\color{black},% + prebreak = \raisebox{0ex}[0ex][0ex]{\ensuremath{\hookleftarrow}},% + commentstyle=\textit, numbersep=1em, numberstyle=\tiny, numbers=left,% + numberblanklines=false, mathescape, escapechar=@, label=#1, caption={#2}}}{} + +\def\N{$\mathbb N$ } +\def\R{$\mathbb R$ } +\def\Z{$\mathbb Z$ } +\def\Q{$\mathbb Q$ } +\def\C{$\mathbb C$ } +\def\affect{$\leftarrow$ } + +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% Main document +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +\chapterauthor{Sylvain Contassot-Vivier}{Université Lorraine, Loria UMR 7503 \& AlGorille INRIA Project Team, Nancy, France.} +\chapterauthor{Stephane Vialle}{SUPELEC, UMI GT-CNRS 2958 \& AlGorille INRIA Project Team, Metz, France.} +\chapterauthor{Jens Gustedt}{INRIA Nancy -- Grand Est, AlGorille INRIA Project Team, Strasbourg, France.} + +\chapter{Development methodologies for GPU and cluster of GPUs} + +% Introduction +\input{Chapters/chapter6/Intro} + +% Partie 1 : CUDA - MPI synchrone avec recouvrement +\input{Chapters/chapter6/PartieSync} + +% Partie 2 : CUDA - MPI asynchrone avec recouvrement +\input{Chapters/chapter6/PartieAsync} + +% Partie 6 : Analyse prospective +\input{Chapters/chapter6/PartieORWL} + +% Conclusion +\input{Chapters/chapter6/Conclu} + +% Glossaire +\section{Glossary} +\begin{Glossary} +\item[AIAC] Asynchronous Iterations - Asynchronous Communications. +\item[Asynchronous iterations] Iterative process where each element is updated + without waiting for the last updates of the other elements. +\item[Auxiliary computations] Optional computations performed in parallel to the + main computations and used to complete them or speed them up. +\item[BSP parallel scheme] Bulk Synchronous Parallel, a parallel model that uses + a repeated pattern (superstep) composed of: computation, communication, barrier. +\item[GPU stream] Serialized data transfers and computations performed on a same + piece of data. +\item[Message loss/miss] Can be said about a message that is either not + sent or sent but not received (possible with unreliable communication protocols). +\item[Message stamping] Inclusion of a specific value in messages of the same tag to + distinguish them (kind of secondary tag). +\item[ORWL] Ordered Read Write Locks, a programming tool proposing a unified + programming model. +\item[Page-locked data] Data that are locked in cache memory to ensure fast accesses. +\item[Residual] Difference between results of consecutive iterations in an + iterative process. +\item[Streamed GPU sequence] GPU transfers and computations performed + simultaneously via distinct GPU streams. +\end{Glossary} + +% Biblio +\vspace{-2em} +\putbib[Chapters/chapter6/biblio6] + +%%% Local Variables: +%%% mode: latex +%%% fill-column: 80 +%%% ispell-dictionary: "american" +%%% mode: flyspell +%%% TeX-master: "../../BookGPU.tex" +%%% End: diff --git a/BookGPU/Chapters/chapter6/curves/gpuSyncOverlap.pdf b/BookGPU/Chapters/chapter6/curves/gpuSyncOverlap.pdf new file mode 100644 index 0000000..76d47cf Binary files /dev/null and b/BookGPU/Chapters/chapter6/curves/gpuSyncOverlap.pdf differ diff --git a/BookGPU/Chapters/chapter6/curves/recouvs.pdf b/BookGPU/Chapters/chapter6/curves/recouvs.pdf new file mode 100644 index 0000000..c5a64ca Binary files /dev/null and b/BookGPU/Chapters/chapter6/curves/recouvs.pdf differ diff --git a/BookGPU/Chapters/chapter6/curves/syncasync.pdf b/BookGPU/Chapters/chapter6/curves/syncasync.pdf new file mode 100644 index 0000000..f9c764c Binary files /dev/null and b/BookGPU/Chapters/chapter6/curves/syncasync.pdf differ diff --git a/BookGPU/Chapters/chapter6/figures/Sync-CompleteInterleaveOverlap.pdf b/BookGPU/Chapters/chapter6/figures/Sync-CompleteInterleaveOverlap.pdf new file mode 100644 index 0000000..85d019d Binary files /dev/null and b/BookGPU/Chapters/chapter6/figures/Sync-CompleteInterleaveOverlap.pdf differ diff --git a/BookGPU/Chapters/chapter6/figures/Sync-NativeOverlap.pdf b/BookGPU/Chapters/chapter6/figures/Sync-NativeOverlap.pdf new file mode 100755 index 0000000..3771aaf Binary files /dev/null and b/BookGPU/Chapters/chapter6/figures/Sync-NativeOverlap.pdf differ diff --git a/BookGPU/Chapters/chapter6/figures/Sync-SeqSequenceOverlap.pdf b/BookGPU/Chapters/chapter6/figures/Sync-SeqSequenceOverlap.pdf new file mode 100644 index 0000000..55ee6f6 Binary files /dev/null and b/BookGPU/Chapters/chapter6/figures/Sync-SeqSequenceOverlap.pdf differ diff --git a/BookGPU/Chapters/chapter6/figures/Sync-StreamSequenceOverlap.pdf b/BookGPU/Chapters/chapter6/figures/Sync-StreamSequenceOverlap.pdf new file mode 100644 index 0000000..f19ad31 Binary files /dev/null and b/BookGPU/Chapters/chapter6/figures/Sync-StreamSequenceOverlap.pdf differ diff --git a/BookGPU/Makefile b/BookGPU/Makefile index 43eb90f..56526d8 100644 --- a/BookGPU/Makefile +++ b/BookGPU/Makefile @@ -8,7 +8,8 @@ all: bibtex bu2 bibtex bu3 bibtex bu4 - bibtex bu6 + bibtex bu5 + bibtex bu7 makeindex ${BOOK}.idx pdflatex ${BOOK} pdflatex ${BOOK}