X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/bdec1b5087c2ea922fcf62ad0591b8d784ddf3b7..fa1939d2294b408e8a62f2d91149f369f8710113:/BookGPU/Chapters/chapter17/ch17.tex?ds=inline diff --git a/BookGPU/Chapters/chapter17/ch17.tex b/BookGPU/Chapters/chapter17/ch17.tex index 9d458de..d6e3a31 100755 --- a/BookGPU/Chapters/chapter17/ch17.tex +++ b/BookGPU/Chapters/chapter17/ch17.tex @@ -1,4 +1,9 @@ -\chapterauthor{G. Laville, C. Lang, K. Mazouzi, N. Marilleau, B. Herrmann, L. Philippe}{Femto-ST Institute, University of Franche-Comt{\'e}} +\chapterauthor{Guillaume Laville, Christophe Lang, Bénédicte Herrmann and Laurent Philippe}{Femto-ST Institute, University of Franche-Comte, France} +%\chapterauthor{Christophe Lang}{Femto-ST Institute, University of Franche-Comte, France} +\chapterauthor{Kamel Mazouzi}{Franche-Comte Computing Center, University of Franche-Comte, France} +\chapterauthor{Nicolas Marilleau}{UMMISCO, Institut de Recherche pour le Developpement (IRD), France} +%\chapterauthor{Bénédicte Herrmann}{Femto-ST Institute, University of Franche-Comte, France} +%\chapterauthor{Laurent Philippe}{Femto-ST Institute, University of Franche-Comte, France} \newlength\mylen \newcommand\myinput[1]{% @@ -6,7 +11,7 @@ \setlength\hangindent{\mylen}% \hspace*{\mylen}#1\\} -\chapter{Implementing MAS on GPU} +\chapter{Implementing Multi-Agent Systems on GPU} \label{chapter17} @@ -22,7 +27,7 @@ systems are often modeled by mathematical representations and their dynamic behavior simulated by differential equations. The simulation of the system thus often relay on the resolution of a linear system that can be efficiently computed on a graphical processing unit as -shown in the preceeding chapters. But when the behavior of the system +shown in the preceding chapters. But when the behavior of the system elements is not uniformly driven by the same law, when these elements have their own behavior, the modeling process is too complex to rely on formal expressions. In this context MAS is a recognized approach to @@ -42,7 +47,7 @@ solution to increase simulation performance but Graphical Processing Units (GPU) are also a promising technology with an attractive performance/cost ratio. -Conceptually a MAS is a distributed system as it favors the definition +Conceptually a MAS\index{Multi-Agent System} is a distributed system as it favors the definition and description of large sets of individuals, the agents, that can be run in parallel. As a large set of agents could have the same behavior a SIMD model should fit the simulation execution. Most of the @@ -62,7 +67,7 @@ parallelism. Another key point that advocates for the use of many-core in MAS is the growing need for multi-scale simulations. Multi-scale simulations explore problems with interactions between several scales. The -different scales use different granularities of the structure and +different scales use different granularity of the structure and potentially different models. Most of the time the lower scale simulations provide results to higher scale simulations. In that case the execution of the simulations can easily be distributed between the @@ -70,8 +75,8 @@ local cores and a many-core architecture, i.e. a GPU device. We explore in this chapter the use of many-core architectures to execute agent-based simulations. We illustrate our reflexion with two -uses cases: the colembola simulator designed to simulate the diffusion -of collembola between plots of land and the MIOR simulator that +uses cases: the Colembola simulator designed to simulate the diffusion +of Colembola between plots of land and the MIOR simulator that reproduces effects of earthworms on bacteria dynamics in a bulked soil. In Section \ref{ch17:ABM} we present the work related to MAS and parallelization with a special focus on many-core use. In sections @@ -121,7 +126,7 @@ primitives or indirectly through the information stored in the environment. Agent based simulations have been used for more than one decade to -reproduce, understand even predic complex system dynamic. They have proved their interest in various +reproduce, understand even predict complex system dynamic. They have proved their interest in various scientific communities. Nowadays generic agent based frameworks are promoted such as Repast~\cite{repast_home} or NetLogo~\cite{netlogo_home} to implement simulators. Many ABM such as @@ -280,7 +285,7 @@ of agents or simulations sharing common data (such as the environment) or algorithms (such as the background evolution process). In the following examples the JOCL~\cite{jocl_home} binding is used -to access the OpenCL platfmorm from the Java programming language. +to access the OpenCL platform from the Java programming language. In the next sections we present two practical cases that will be studied in details, from the model to its implementation and @@ -290,20 +295,20 @@ performance. \label{ch17:sec:1stmodel} The first model, the Collembola model, simulates the propagation of -collembolas in flieds and forests. It is based on a diffusion +collembolas in fields and forests. It is based on a diffusion algorithm which illustrates the case of agents with a simple behavior and few synchronization problems. -\subsection{The Collembola model} +\subsection{The Collembola model\index{Collembola model}} \label{ch17:subsec:collembolamodel} The Collembola model is an example of multi-agent system using GIS (Geographical Information System) and survey data (population count) - to modelize the evolution of the biodiversity -across land plots. A first version of this model has been developed with Netlogo framework by Bioemco and UMMISCO researchers. In this model, the biodiversity is modelized by populations of + to model the evolution of the biodiversity +across land plots. A first version of this model has been developed with Netlogo framework by Bioemco and UMMISCO researchers. In this model, the biodiversity is modeled by populations of athropod individuals, the Collembola, which can reproduce and diffuse to favorable new habitats. The simulator allows to study the diffusion -of collembola, between plots of land depending on their landuse +of collembola, between plots of land depending on their use (artifical soil, crop, forest\ldots). In this model the environment is composed of the studied land and the colembola are agents. Every land plot is divided into several cells, @@ -340,10 +345,10 @@ associated to one plot to obtain its population. The {\bf diffusion} stage simulates the natural behavior of the collembola that tends toward occupying the whole space over time. This -stage constists in computing a new value for each cell depending on +stage consists in computing a new value for each cell depending on the population of the neighbor cells. This process can be assimilated -to a linear diffusion at each iteration of the populationof the cells -across their neightbors. +to a linear diffusion at each iteration of the population of the cells +across their neighbors. These two processes are quite common in numerical computations so that the collembola model can be adapted to a GPU execution without @@ -369,6 +374,7 @@ the total numbers of access needed to updated those populations. %\lstinputlisting[language=C,caption=Collembola OpenCL %kernels,label=fig:collem_kernels]{Chapters/chapter17/code/collem_kernels.cl} +\pagebreak \lstinputlisting[caption=Collembola OpenCL Diffusion kernel,label=ch17:listing:collembola-diffuse]{Chapters/chapter17/code/collem_kernel_diffuse.cl} The reproduction, diffusion and culling steps are implemented on GPU @@ -382,7 +388,7 @@ calculated, and divided by the number of neightbors. Note that, at the model frontier, populations can also overflow outside the environment grid but we do not manage those external populations, since there are no reason to assume our model to be isolated of its surroundings. The -overflow by neigbors value is stored for each cell, before +overflow by neighbors value is stored for each cell, before encountering the barrier. After the barrier is met, each cell read the overflows stored by all its neighbors at the previous step and applies them to its own population. In this manner, only one barrier is @@ -403,7 +409,7 @@ could be implemented. In this part we present the performance of the Collembola model on various CPU and GPU execution platforms. Figure~\ref{ch17:fig:mior_perfs_collem} shows that the -number of cores and the processor archiecture as a strong influence on the obtained results +number of cores and the processor architecture as a strong influence on the obtained results % : the % dual-core processor equipping our Thinkpad platform has two to six @@ -425,7 +431,7 @@ number of cores and the processor archiecture as a strong influence on the obtai \item Older GPU cards can be slower than modern processors. This can be explained by the new cache and memory access optimizations implemented in newer generations of GPU devices. These optimizations - reduce the penalities associated with irregular and frequent global + reduce the penalties associated with irregular and frequent global memory accesses. They are not available on our Tesla nodes. \item GPU curves exhibit a odd-even pattern in their performance results. Since this phenomenon is visible on two distinct @@ -441,7 +447,7 @@ number of cores and the processor archiecture as a strong influence on the obtai %begin{minipage}[t]{0.49\linewidth} \centering \includegraphics[width=0.7\linewidth]{./Chapters/chapter17/figs/collem_perfs_nothinkpad.pdf} -\caption{Performances CPU et GPU du modèle Collemboles} +\caption{Performance of the Collembola model on CPU and GPU} \label{ch17:fig:mior_perfs_collem} %end{minipage} \end{figure} @@ -452,7 +458,7 @@ on modern hardware. These gains are more mixed on older GPU platforms due to the limitations when dealing with irregular memory or execution patterns often encountered in MAS systems. This can be closely linked to the availability of caching facilities on the GPU hardware and its -dramatical effects depending on the locality and frequency of data +dramatically effects depending on the locality and frequency of data accesses. In this case, even if the Tesla architectures provides more execution cores and is the far costlier solution, more recent, cheaper, solutions such as high-end GPU provide better performance when the @@ -619,7 +625,7 @@ list and the number of non productive accesses. \end{figure} Since dynamic memory allocation is not possible yet in OpenCL and only -provided in the lastest revisions of the CUDA standard, these matrices +provided in the latest revisions of the CUDA standard, these matrices are statically allocated. The allocation is based on the worst-case scenario where all OM and MM are linked as the real occupation of the matrix cells cannot be deduced without some kind of preprocessing @@ -641,6 +647,7 @@ implementation by enforcing a quasi-sequential execution. It is necessary, in the case of MIOR as well as for other ABM, to ensure that each work-item is not too constrained in its execution. +\pagebreak \lstinputlisting[caption=Main MIOR kernel,label=ch17:listing:mior_kernels]{Chapters/chapter17/code/mior_kernels.cl} From the sequential algorithm 1 where all the agents share the same @@ -674,7 +681,7 @@ detection can be done either on the CPU or the GPU but it requires a global view on the system execution. In the first case, when the CPU controls the GPU simulation process, -the dectection is done in two steps: (i) the CPU starts the execution +the detection is done in two steps: (i) the CPU starts the execution of a simulation step on the GPU, (ii) the CPU retrieves the GPU data and determines if another iteration must be launched or if the simulation has terminated. This approach allows a fine-grain control @@ -724,9 +731,9 @@ performance indicator. The two last implementations -- \textbf{GPU 4.0} and \textbf{GPU 5.0} -- illustrate the gain provided by a better usage of the hardware -resources, thanks to the driver execution overlaping capabilities. A +resources, thanks to the driver execution overlapping capabilities. A sequential version of the MIOR algorithm, labeled as \textbf{CPU}, is -provided for comparison purpose. This sequential version is developped +provided for comparison purpose. This sequential version is developed in Java, the same language used for GPU implementations and the Sworm model. @@ -838,7 +845,7 @@ hardware limits are reached. This indicates that the cost of preparing and launching kernels become negligible compared to the computing time once a good GPU occupation rate is achieved. -\section{Analysis and recommandations} +\section{Analysis and recommendations} \label{ch17:analysis} In this section we synthesize the observations done on the two models @@ -855,7 +862,7 @@ take into account considerations such as the cost of the different parts of the algorithm and the expected gains. In the case of the Collembola model two steps of the algorithm were -ported to GPU. Both steps use straighforward, easily-parallelizable, +ported to GPU. Both steps use straightforward, easily-parallelizable, operations where a direct gain can be expected by using more executions cores without important modifications of the algorithm. @@ -863,10 +870,10 @@ In the MIOR model case however no such inherently parallelizable parts are evident in the original sequential algorithm. This is mainly explained by the rate of interactions between agents in this model in the form of many operations (breathing, growth) on shared carbon -ressources. In this case the algorithm had to be more profundly +resources. In this case the algorithm had to be more profoundly modified while keeping in head the need to remain true to the original model, to synchronize the main execution step of all agents in the -model, to ensure enquity, and to minimize the numbers of +model, to ensure equity, and to minimize the numbers of synchronizations. The minimization is done by factoring the repartition of carbon in the model in two separated step at the beginning and the end of each iterations rather than at multiple point of the execution. @@ -892,8 +899,8 @@ This workflow works well if the considered data transfer time is negligible compared to GPU execution or can be done in parallel, thanks to the asynchronous nature of OpenCL. If we are to update the MAS model after each iteration then performance risk being -degratated. This is illustrated in the MIOR model by the fact that the -speedup observerd on GPU is much more significant for bigger +degraded. This is illustrated in the MIOR model by the fact that the +speedup observed on GPU is much more significant for bigger simulations, which implies longer GPU execution times. Our solution to this problem is to desynchronize the execution of the SMA model and its GPU parts by requesting the execution of multiple steps of the @@ -907,9 +914,9 @@ meaningless in regard to GPU usage, copying cost, and actual gain in execution time, if the process is not already parallel and costly at this scale. In the MIOR and the Collembola models, this is solved by executing the computations for all agents of the model at the same -time. If the model has hovewer only chronic needs for intensive +time. If the model has however only chronic needs for intensive computations then some kind of batching mechanism is required to store -waiting treatements in a queue, until the total sum of waiting +waiting treatments in a queue, until the total sum of waiting computations justify the transfers cost to the GPU platform. \subsection{Implementation challenges} @@ -919,14 +926,14 @@ implementation challenges also occur when implementing an OpenCL version of a MAS model, mainly related to the underlying limitations of the execution platform. -The first one is the impossibility (except in lastest CUDA versions) -to dynamicaly allocate memory during execution. This is a problem in +The first one is the impossibility (except in latest CUDA versions) +to dynamically allocate memory during execution. This is a problem in the case of models where the number of agent can vary during the simulation, such as prey-predator models. In this case, the only solution is to overestimate the size of arrays or data structures to -accomodate these additionnal individuals, or using the CPU to resize -data structures when these situations occur. Both approachs require -to trend either memory or performance and are not always pratical. +accommodate these additional individuals, or using the CPU to resize +data structures when these situations occur. Both approaches require +to trend either memory or performance and are not always practical. Another limitation is the impossibility to store pointers in data structures, we restraint any OpenCL to use only one dimension static @@ -935,10 +942,10 @@ usage of structures like linked-list, graphs or sparse matrix not represented by some combination of static arrays, and can be another source of memory or performance losses. -In the case of MIOR, this problem is especially exarcebed in the case -of neighboring storage: both representations consum much more memory +In the case of MIOR, this problem is especially exacerbated in the case +of neighboring storage: both representations consume much more memory that actually required, since the worst case (all agents have -access to all others agents) has to be taken into account defensivly +access to all others agents) has to be taken into account defensively when dimensioning the data structure. The existence of different generations of GPU hardwares is also a @@ -948,8 +955,8 @@ and requires very regular access patterns to perform efficiently. MAS having many random accesses, such as MIOR, or many small global memory accesses, such as Collembola, are penalized on these older cards. Fortunately, these requirements are less present is modern -cards, which offer caches and other facilities traditionnaly present -on CPU to offset these kind of penalities. +cards, which offer caches and other facilities traditionally present +on CPU to offset these kind of penalties. The final concern is related to the previous ones and often result in more memory consumption. The amount of memory available on GPU cards @@ -967,7 +974,7 @@ As shown in the previous sections, many data representations choices are common to entire classes of MAS. The paradigm of grid, for example, is often encountered in models where each cell constitutes either the elementary unit of simulation (SugarScape~\cite{Dsouza2007}) or a -discretisation of the environment space +discretization of the environment space (Pathfinding~\cite{Guy09clearpath}). These grids can be considered as two or three-dimensional matrices, whose processing can be directly distributed. @@ -976,25 +983,25 @@ Another common data representation encountered in MAS system is the usage of 2d or 3d coordinates to store the position of each agent of the model. In this case, even if the environment is no longer discrete, the location information still imply computations -(euclidian distances) which can be parallelized. +(euclidean distances) which can be parallelized. -MCSMA~\cite{lmlm+13:ip} is a framework developped to provide to the MAS designer +MCSMA~\cite{lmlm+13:ip} is a framework developed to provide to the MAS designer those basic data structures, and the associated operations, to facilitate the portage of existing MAS on GPU. Two levels of -utilization are provided to the developper, depending on its usage +utilization are provided to the developer, depending on its usage profile. \begin{itemize} -\item A high-level library, constitued of modules regrouping classes +\item A high-level library, constituted of modules regrouping classes of operations. A module provides multiple methods of distance computations in 1d, 2d or 3d grids, another one the diffusion algorithm... -\item A low level API which allows the developped direct access to the +\item A low level API which allows the developed direct access to the GPU and the inner working of MCSMA, to develop its own module in the case where the required operations are not yet provided by the platform. \end{itemize} -Both usage levels were illustrated in the above two pratical cases. In +Both usage levels were illustrated in the above two practical cases. In MIOR, the whole algorithm (baring initialization) is ported on GPU as a specific plugin which allows executing $n$ MIOR simulations and retrieve their results. This first approach requires extensive @@ -1006,7 +1013,7 @@ is not modified and GPU execution is only applied to specific parts of the execution which may benefit from it. These two programming approaches allow incremental adaptations of existing Java MAS to accelerate their execution on GPU, while retaining the option to -develop their own reusable or more efficeint module to supplement the +develop their own reusable or more efficient module to supplement the already existing ones. \section{Conclusion} @@ -1041,7 +1048,7 @@ lists. Another goal is to provide easier means to control the distribution of specific processes between CPU or GPU, to allow the easy exploitation of the strengths of each platform in the same multi-agent simulation. We think that the same approach, i.e -developping specific environments that facilitate the developper +developing specific environments that facilitate the developer access to the GPU power, can be applied in many domains with compute intensive needs to open the GPU use to a larger community.