-\chapterauthor{G. Laville, C. Lang, K. Mazouzi, N. Marilleau, B. Herrmann, L. Philippe}{Femto-ST Institute, University of Franche-Comt{\'e}}
+\chapterauthor{Guillaume Laville}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Christophe Lang}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Kamel Mazouzi}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Nicolas Marilleau}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Bénédicte Herrmann}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Laurent Philippe}{Femto-ST Institute, University of Franche-Comte, France}
\newlength\mylen
\newcommand\myinput[1]{%
\setlength\hangindent{\mylen}%
\hspace*{\mylen}#1\\}
-\chapter{Implementing MAS on GPU}
+\chapter{Implementing Multi-Agent Systems on GPU}
\label{chapter17}
dynamic behavior simulated by differential equations. The simulation
of the system thus often relay on the resolution of a linear system
that can be efficiently computed on a graphical processing unit as
-shown in the preceeding chapters. But when the behavior of the system
+shown in the preceding chapters. But when the behavior of the system
elements is not uniformly driven by the same law, when these elements
have their own behavior, the modeling process is too complex to rely
on formal expressions. In this context MAS is a recognized approach to
Units (GPU) are also a promising technology with an attractive
performance/cost ratio.
-Conceptually a MAS is a distributed system as it favors the definition
+Conceptually a MAS\index{Multi-Agent System} is a distributed system as it favors the definition
and description of large sets of individuals, the agents, that can be
run in parallel. As a large set of agents could have the same behavior
a SIMD model should fit the simulation execution. Most of the
Another key point that advocates for the use of many-core in MAS is
the growing need for multi-scale simulations. Multi-scale simulations
explore problems with interactions between several scales. The
-different scales use different granularities of the structure and
+different scales use different granularity of the structure and
potentially different models. Most of the time the lower scale
simulations provide results to higher scale simulations. In that case
the execution of the simulations can easily be distributed between the
We explore in this chapter the use of many-core architectures to
execute agent-based simulations. We illustrate our reflexion with two
-uses cases: the colembola simulator designed to simulate the diffusion
-of collembola between plots of land and the MIOR simulator that
+uses cases: the Colembola simulator designed to simulate the diffusion
+of Colembola between plots of land and the MIOR simulator that
reproduces effects of earthworms on bacteria dynamics in a bulked
soil. In Section \ref{ch17:ABM} we present the work related to MAS and
parallelization with a special focus on many-core use. In sections
environment.
Agent based simulations have been used for more than one decade to
-reproduce, understand even predic complex system dynamic. They have proved their interest in various
+reproduce, understand even predict complex system dynamic. They have proved their interest in various
scientific communities. Nowadays generic agent based frameworks are
promoted such as Repast~\cite{repast_home} or
NetLogo~\cite{netlogo_home} to implement simulators. Many ABM such as
or algorithms (such as the background evolution process).
In the following examples the JOCL~\cite{jocl_home} binding is used
-to access the OpenCL platfmorm from the Java programming language.
+to access the OpenCL platform from the Java programming language.
In the next sections we present two practical cases that will be
studied in details, from the model to its implementation and
\label{ch17:sec:1stmodel}
The first model, the Collembola model, simulates the propagation of
-collembolas in flieds and forests. It is based on a diffusion
+collembolas in fields and forests. It is based on a diffusion
algorithm which illustrates the case of agents with a simple behavior
and few synchronization problems.
-\subsection{The Collembola model}
+\subsection{The Collembola model\index{Collembola model}}
\label{ch17:subsec:collembolamodel}
The Collembola model is an example of multi-agent system using GIS (Geographical Information System)
and survey data (population count)
- to modelize the evolution of the biodiversity
-across land plots. A first version of this model has been developed with Netlogo framework by Bioemco and UMMISCO researchers. In this model, the biodiversity is modelized by populations of
+ to model the evolution of the biodiversity
+across land plots. A first version of this model has been developed with Netlogo framework by Bioemco and UMMISCO researchers. In this model, the biodiversity is modeled by populations of
athropod individuals, the Collembola, which can reproduce and diffuse
to favorable new habitats. The simulator allows to study the diffusion
-of collembola, between plots of land depending on their landuse
+of collembola, between plots of land depending on their use
(artifical soil, crop, forest\ldots). In this
model the environment is composed of the studied land and the
colembola are agents. Every land plot is divided into several cells,
The {\bf diffusion} stage simulates the natural behavior of the
collembola that tends toward occupying the whole space over time. This
-stage constists in computing a new value for each cell depending on
+stage consists in computing a new value for each cell depending on
the population of the neighbor cells. This process can be assimilated
-to a linear diffusion at each iteration of the populationof the cells
-across their neightbors.
+to a linear diffusion at each iteration of the population of the cells
+across their neighbors.
These two processes are quite common in numerical computations so
that the collembola model can be adapted to a GPU execution without
model frontier, populations can also overflow outside the environment
grid but we do not manage those external populations, since there are
no reason to assume our model to be isolated of its surroundings. The
-overflow by neigbors value is stored for each cell, before
+overflow by neighbors value is stored for each cell, before
encountering the barrier. After the barrier is met, each cell read the
overflows stored by all its neighbors at the previous step and applies
them to its own population. In this manner, only one barrier is
In this part we present the performance of the Collembola model on
various CPU and GPU execution
platforms. Figure~\ref{ch17:fig:mior_perfs_collem} shows that the
-number of cores and the processor archiecture as a strong influence on the obtained results
+number of cores and the processor architecture as a strong influence on the obtained results
% : the
% dual-core processor equipping our Thinkpad platform has two to six
\item Older GPU cards can be slower than modern processors. This can
be explained by the new cache and memory access optimizations
implemented in newer generations of GPU devices. These optimizations
- reduce the penalities associated with irregular and frequent global
+ reduce the penalties associated with irregular and frequent global
memory accesses. They are not available on our Tesla nodes.
\item GPU curves exhibit a odd-even pattern in their performance
results. Since this phenomenon is visible on two distinct
%begin{minipage}[t]{0.49\linewidth}
\centering
\includegraphics[width=0.7\linewidth]{./Chapters/chapter17/figs/collem_perfs_nothinkpad.pdf}
-\caption{Performances CPU et GPU du modèle Collemboles}
+\caption{Performance of the Collembola model on CPU and GPU}
\label{ch17:fig:mior_perfs_collem}
%end{minipage}
\end{figure}
due to the limitations when dealing with irregular memory or execution
patterns often encountered in MAS systems. This can be closely linked
to the availability of caching facilities on the GPU hardware and its
-dramatical effects depending on the locality and frequency of data
+dramatically effects depending on the locality and frequency of data
accesses. In this case, even if the Tesla architectures provides more
execution cores and is the far costlier solution, more recent, cheaper,
solutions such as high-end GPU provide better performance when the
\end{figure}
Since dynamic memory allocation is not possible yet in OpenCL and only
-provided in the lastest revisions of the CUDA standard, these matrices
+provided in the latest revisions of the CUDA standard, these matrices
are statically allocated. The allocation is based on the worst-case
scenario where all OM and MM are linked as the real occupation of the
matrix cells cannot be deduced without some kind of preprocessing
global view on the system execution.
In the first case, when the CPU controls the GPU simulation process,
-the dectection is done in two steps: (i) the CPU starts the execution
+the detection is done in two steps: (i) the CPU starts the execution
of a simulation step on the GPU, (ii) the CPU retrieves the GPU data
and determines if another iteration must be launched or if the
simulation has terminated. This approach allows a fine-grain control
The two last implementations -- \textbf{GPU 4.0} and \textbf{GPU 5.0}
-- illustrate the gain provided by a better usage of the hardware
-resources, thanks to the driver execution overlaping capabilities. A
+resources, thanks to the driver execution overlapping capabilities. A
sequential version of the MIOR algorithm, labeled as \textbf{CPU}, is
-provided for comparison purpose. This sequential version is developped
+provided for comparison purpose. This sequential version is developed
in Java, the same language used for GPU implementations and the Sworm
model.
and launching kernels become negligible compared to the computing
time once a good GPU occupation rate is achieved.
-\section{Analysis and recommandations}
+\section{Analysis and recommendations}
\label{ch17:analysis}
In this section we synthesize the observations done on the two models
parts of the algorithm and the expected gains.
In the case of the Collembola model two steps of the algorithm were
-ported to GPU. Both steps use straighforward, easily-parallelizable,
+ported to GPU. Both steps use straightforward, easily-parallelizable,
operations where a direct gain can be expected by using more
executions cores without important modifications of the algorithm.
are evident in the original sequential algorithm. This is mainly
explained by the rate of interactions between agents in this model in
the form of many operations (breathing, growth) on shared carbon
-ressources. In this case the algorithm had to be more profundly
+resources. In this case the algorithm had to be more profoundly
modified while keeping in head the need to remain true to the original
model, to synchronize the main execution step of all agents in the
-model, to ensure enquity, and to minimize the numbers of
+model, to ensure equity, and to minimize the numbers of
synchronizations. The minimization is done by factoring the repartition of
carbon in the model in two separated step at the beginning and the end
of each iterations rather than at multiple point of the execution.
negligible compared to GPU execution or can be done in parallel,
thanks to the asynchronous nature of OpenCL. If we are to update the
MAS model after each iteration then performance risk being
-degratated. This is illustrated in the MIOR model by the fact that the
-speedup observerd on GPU is much more significant for bigger
+degraded. This is illustrated in the MIOR model by the fact that the
+speedup observed on GPU is much more significant for bigger
simulations, which implies longer GPU execution times. Our solution to
this problem is to desynchronize the execution of the SMA model and
its GPU parts by requesting the execution of multiple steps of the
execution time, if the process is not already parallel and costly at
this scale. In the MIOR and the Collembola models, this is solved by
executing the computations for all agents of the model at the same
-time. If the model has hovewer only chronic needs for intensive
+time. If the model has however only chronic needs for intensive
computations then some kind of batching mechanism is required to store
-waiting treatements in a queue, until the total sum of waiting
+waiting treatments in a queue, until the total sum of waiting
computations justify the transfers cost to the GPU platform.
\subsection{Implementation challenges}
version of a MAS model, mainly related to the underlying limitations
of the execution platform.
-The first one is the impossibility (except in lastest CUDA versions)
-to dynamicaly allocate memory during execution. This is a problem in
+The first one is the impossibility (except in latest CUDA versions)
+to dynamically allocate memory during execution. This is a problem in
the case of models where the number of agent can vary during the
simulation, such as prey-predator models. In this case, the only
solution is to overestimate the size of arrays or data structures to
-accomodate these additionnal individuals, or using the CPU to resize
-data structures when these situations occur. Both approachs require
-to trend either memory or performance and are not always pratical.
+accommodate these additional individuals, or using the CPU to resize
+data structures when these situations occur. Both approaches require
+to trend either memory or performance and are not always practical.
Another limitation is the impossibility to store pointers in data
structures, we restraint any OpenCL to use only one dimension static
represented by some combination of static arrays, and can be another
source of memory or performance losses.
-In the case of MIOR, this problem is especially exarcebed in the case
-of neighboring storage: both representations consum much more memory
+In the case of MIOR, this problem is especially exacerbated in the case
+of neighboring storage: both representations consume much more memory
that actually required, since the worst case (all agents have
-access to all others agents) has to be taken into account defensivly
+access to all others agents) has to be taken into account defensively
when dimensioning the data structure.
The existence of different generations of GPU hardwares is also a
having many random accesses, such as MIOR, or many small global memory
accesses, such as Collembola, are penalized on these older
cards. Fortunately, these requirements are less present is modern
-cards, which offer caches and other facilities traditionnaly present
-on CPU to offset these kind of penalities.
+cards, which offer caches and other facilities traditionally present
+on CPU to offset these kind of penalties.
The final concern is related to the previous ones and often result in
more memory consumption. The amount of memory available on GPU cards
are common to entire classes of MAS. The paradigm of grid, for example,
is often encountered in models where each cell constitutes either the
elementary unit of simulation (SugarScape~\cite{Dsouza2007}) or a
-discretisation of the environment space
+discretization of the environment space
(Pathfinding~\cite{Guy09clearpath}). These grids can be considered as
two or three-dimensional matrices, whose processing can be directly
distributed.
usage of 2d or 3d coordinates to store the position of each agent of
the model. In this case, even if the environment is no longer
discrete, the location information still imply computations
-(euclidian distances) which can be parallelized.
+(euclidean distances) which can be parallelized.
-MCSMA~\cite{lmlm+13:ip} is a framework developped to provide to the MAS designer
+MCSMA~\cite{lmlm+13:ip} is a framework developed to provide to the MAS designer
those basic data structures, and the associated operations, to
facilitate the portage of existing MAS on GPU. Two levels of
-utilization are provided to the developper, depending on its usage
+utilization are provided to the developer, depending on its usage
profile.
\begin{itemize}
-\item A high-level library, constitued of modules regrouping classes
+\item A high-level library, constituted of modules regrouping classes
of operations. A module provides multiple methods of
distance computations in 1d, 2d or 3d grids, another one the
diffusion algorithm...
-\item A low level API which allows the developped direct access to the
+\item A low level API which allows the developed direct access to the
GPU and the inner working of MCSMA, to develop its own module in
the case where the required operations are not yet provided by the platform.
\end{itemize}
-Both usage levels were illustrated in the above two pratical cases. In
+Both usage levels were illustrated in the above two practical cases. In
MIOR, the whole algorithm (baring initialization) is ported on GPU as
a specific plugin which allows executing $n$ MIOR simulations and
retrieve their results. This first approach requires extensive
the execution which may benefit from it. These two programming
approaches allow incremental adaptations of existing Java MAS to
accelerate their execution on GPU, while retaining the option to
-develop their own reusable or more efficeint module to supplement the
+develop their own reusable or more efficient module to supplement the
already existing ones.
\section{Conclusion}
distribution of specific processes between CPU or GPU, to allow the
easy exploitation of the strengths of each platform in the same
multi-agent simulation. We think that the same approach, i.e
-developping specific environments that facilitate the developper
+developing specific environments that facilitate the developer
access to the GPU power, can be applied in many domains with
compute intensive needs to open the GPU use to a larger community.