1 \chapterauthor{Guillaume Laville, Christophe Lang, Bénédicte Herrmann and Laurent Philippe}{Femto-ST Institute, University of Franche-Comte, France}
2 %\chapterauthor{Christophe Lang}{Femto-ST Institute, University of Franche-Comte, France}
3 \chapterauthor{Kamel Mazouzi}{Franche-Comte Computing Center, University of Franche-Comte, France}
4 \chapterauthor{Nicolas Marilleau}{UMMISCO, Institut de Recherche pour le Developpement (IRD), France}
5 %\chapterauthor{Bénédicte Herrmann}{Femto-ST Institute, University of Franche-Comte, France}
6 %\chapterauthor{Laurent Philippe}{Femto-ST Institute, University of Franche-Comte, France}
9 \newcommand\myinput[1]{%
10 \settowidth\mylen{\KwIn{}}%
11 \setlength\hangindent{\mylen}%
14 \chapter{Implementing Multi-Agent Systems on GPU}
18 \section{Introduction}
21 In this chapter we introduce the use of Graphical Processing Units for
22 multi-agents based systems as an example of a not-so-regular
23 application that could benefit from the GPU computing
24 power. Multi-Agent Systems or MAS are a simulation paradigm used to
25 study the behavior of dynamic systems. Dynamic systems as physical
26 systems are often modeled by mathematical representations and their
27 dynamic behavior simulated by differential equations. The simulation
28 of the system thus often relay on the resolution of a linear system
29 that can be efficiently computed on a graphical processing unit as
30 shown in the preceding chapters. But when the behavior of the system
31 elements is not uniformly driven by the same law, when these elements
32 have their own behavior, the modeling process is too complex to rely
33 on formal expressions. In this context MAS is a recognized approach to
34 model and simulate systems where individuals have an autonomous
35 behavior that cannot be simulated by the evolution of a set of
36 variables driven by mathematical laws. MAS are often used to simulate
37 natural or collective phenomenons whose individuals are too numerous or
38 various to provide a unified algorithm describing the system
39 evolution. The agent-based approach is to divide these complex systems
40 into individual self-contained entities with their smaller set of
41 attributes and functions. But, as for mathematical simulations, when
42 the size of the Multi-Agent System (MAS) increases the need of computing
43 power and memory also increases. For this reason, multi-agent systems
44 should benefit from the use of distributed computing
45 architectures. Clusters and Grids are often identified as the main
46 solution to increase simulation performance but Graphical Processing
47 Units (GPU) are also a promising technology with an attractive
48 performance/cost ratio.
50 Conceptually a MAS\index{Multi-Agent System} is a distributed system as it favors the definition
51 and description of large sets of individuals, the agents, that can be
52 run in parallel. As a large set of agents could have the same behavior
53 a SIMD model should fit the simulation execution. Most of the
54 agent-based simulators are however designed with a sequential scheme
55 in mind and these simulators seldom use more than one core for their
56 execution. Due to simulation scheduling constraints, data sharing and
57 exchange between agents and the huge amount of interactions between
58 agents and their environment, it is indeed rather difficult to
59 distribute an agent based simulator, for instance, to take advantage of new
60 multi-threaded computer architectures. Thus, guidelines and tools
61 dedicated to MAS paradigm and HPC is now a need for other complex
62 system communities. Note that, from the described structure (large
63 number of agents sharing data), we can conclude that MAS would more
64 easily benefit from many-cores architectures than from other kinds of
67 Another key point that advocates for the use of many-core in MAS is
68 the growing need for multi-scale simulations. Multi-scale simulations
69 explore problems with interactions between several scales. The
70 different scales use different granularity of the structure and
71 potentially different models. Most of the time the lower scale
72 simulations provide results to higher scale simulations. In that case
73 the execution of the simulations can easily be distributed between the
74 local cores and a many-core architecture, i.e. a GPU device.
76 We explore in this chapter the use of many-core architectures to
77 execute agent-based simulations. We illustrate our reflexion with two
78 uses cases: the Colembola simulator designed to simulate the diffusion
79 of Colembola between plots of land and the MIOR simulator that
80 reproduces effects of earthworms on bacteria dynamics in a bulked
81 soil. In Section \ref{ch17:ABM} we present the work related to MAS and
82 parallelization with a special focus on many-core use. In sections
83 \ref{ch17:sec:1stmodel} and \ref{ch17:sec:2ndmodel} we present in
84 details two multi-agent models, their GPU implementations, the
85 conducted experiments and their performance results. The first model,
86 given in Section \ref{ch17:sec:1stmodel}, illustrates the use of a GPU
87 architecture to speed up the execution of some computation intensive
88 functions while the main model is still executed on the central
89 processing unit. The second model, given in Section
90 \ref{ch17:sec:2ndmodel}, illustrates the use of a GPU architecture to
91 implement the whole model on the GPU processor which implies deeper
92 changes in the initial algorithm. In Section \ref{ch17:analysis} we
93 propose a more general reflexion on these implementations and we
94 provide some guidelines. Then we conclude in Section
95 \ref{ch17:conclusion} on the possible generalization of our work.
98 \section{Running Agent-Based Simulations}
101 In this section, we present the context of multi-agent systems, MAS,
102 their parallelization and we report several existing works on using
103 GPU to simulate multi-agent systems.
105 \subsection{Multi-agent systems and parallelism}
107 Agent-Based systems are often used to simulate natural or collective
108 phenomenons whose actors are too numerous or various to provide a
109 simple unified algorithm describing the studied system
110 dynamic~\cite{Schweitzer2003}. The implementation of an agent based
111 simulation usually starts by designing the underlying Agent Based
112 Model (ABM). Most ABM are based around a few types of entities such as Agents,
113 Environment and around an Interaction Organization~\cite{Vowel02}. In the
114 complex system domain, the environment often describes a real space,
115 its structure (e.g. soil textures and porosities) and its dynamics
116 (e.g. organic matter decomposition). It is a virtual world in which
117 agents represent studied entities (e.g. biotic organisms)
118 evolution. The actual agent is animated by a behavior that can range
119 between reactivity (only react to external stimuli) and cognition
120 (lives its own process alongside other individuals). Interaction
121 and Organization define functions, types and patterns of
122 communications of their member agents in the
123 system~\cite{Odell:2003:RRD:1807559.1807562}. Note that, depending on
124 the MAS, agents can communicate either directly through special
125 primitives or indirectly through the information stored in the
128 Agent based simulations have been used for more than one decade to
129 reproduce, understand even predict complex system dynamic. They have proved their interest in various
130 scientific communities. Nowadays generic agent based frameworks are
131 promoted such as Repast~\cite{repast_home} or
132 NetLogo~\cite{netlogo_home} to implement simulators. Many ABM such as
133 the crown model representing a city-wide
134 scale~\cite{Strippgen_Nagel_2009} tend however to require a large
135 number of agents to provide a realistic behavior and reliable global
136 statistics. Moreover, an achieved model analysis needs to run many
137 simulations identified into an experiment plan to obtain enough
138 confidence in a simulation. In this case the available computing power
139 often limits the simulation size and the result range thus requiring
140 the use of parallelism to explore bigger configurations.
142 For that, three major approaches can be identified:
144 \item parallelizing experiments plan on a cluster or a grid (one or
145 few simulations are submitted to each
146 core)~\cite{Blanchart11,Chuffart2010}.
147 \item parallelizing the simulator on a cluster (the environment of the
148 MAS is split and run on several distributed
149 nodes)~\cite{Cosenza2011,MAR06}
150 \item optimizing simulator by taking advantage of computer resources
151 (multi-threading, GPU, and so on) \cite{Aaby10}
154 In the first case, experiments are run independently each others and only simulation parameters change between two runs so
155 that a simple version of an existing simulator can be used. This
156 approach does not, however, allow to run larger models.
158 %model code changes except for Graphical Unit Interface extracting as
159 %the experiments does not run interactively but in batch.
161 and the third case, model and code modifications are necessary. Only few
162 frameworks however introduce distribution in agent simulation
163 (Madkit~\cite{Gutknecht2000}, MASON~\cite{Sean05}, repastHPC~\cite{Collier11}) and parallel
164 implementations are often based on the explicit use of threads using
165 shared memory~\cite{Guy09clearpath} or cluster libraries such as
166 MPI~\cite{Kiran2010}.
168 Parallelizing a multi-agent simulation is however complex due to space
169 and time constraints. Multi-agent simulations are usually based on a
170 synchronous execution:
171 %by the agents that share the same environment.
172 at each time step, numerous events (space data
173 modification, agent motion) and interactions between agents happen.
174 Distributing the simulation on several computers or grid nodes thus
175 implies to guaranty a distributed synchronous execution and
176 coherency. This often leads to poor performance or complex synchronization
177 problems. Multi-cores execution or delegating part of this execution
178 to others processors as GPUs~\cite{Bleiweiss_2008} are thus usually easier
179 to implement since all the threads share the data and the local clock.
181 % Different agent patterns can be adopted in an ABMs such as
182 % cognitive and reactive ones~\cite{Ferber99}. Cognitive agents act on
183 % the environment and interact with other agents according to a complex
184 % behavior. This behavior takes a local perception of the virtual world
185 % and the agent past (a memory characterized by an internal state and
186 % belief, imperfect knowledge about the world) into account. Reactive
187 % agents have a much systematic pattern of action based on stimuli
188 % response schemes (no or few knowledge and state conservation in agent). The
189 % evolution of the ABM environment, in particular, is often represented with
190 % this last kind of agents. As their behavior is usually simple, we
191 % propose in this chapter to delegate part of the environment and of the
192 % reactive agents execution to the graphical processing unit of the
193 % computer. This way we can balance the load between both CPU and GPU
194 % execution resources.
196 % In the particular case of multi-scale simulations such as the Sworm
197 % simulation~\cite{BLA09} the environment may be used at different
198 % levels. Since the representation of the whole simulated environment
199 % (i.e. the soil area) would be costly, the environment is organized as
200 % a multi-level tree of small soil cubes which can be lazily
201 % instantiated during the simulation. This allows to gradually refine
202 % distribution details in units of soil as agents progress and need
203 % those information, by using a fractal process based on the
204 % bigger-grained already instantiated levels. This characteristic,
205 % especially for a fractal model, could be the key of the
206 % distribution. For instance, each branch of a fractal environment could
207 % be identified as an independent area and parallelized. In addition
208 % Fractal is a famous approach to describe multi-scale environment (such
209 % as soil) and its organization~\cite{perrier}. In that case the lower
210 % scale simulations can also be delegated to the GPU card to limit the
211 % load of the main (upper scale) simulation.
213 \subsection{MAS Implementation on GPU}
214 \label{ch17:subsec:gpu}
216 The last few years saw the apparition of new generations of graphic
217 cards based on more general purpose execution units which are
218 promising for less regular systems as MAS.
219 % These units can be programmed using GPGPU languages to solve various
220 % problems such as linear algebra resolutions, usually based on matrix
222 The matrix-based data representation and SIMD computations are
223 however not straightforward in Multi-Agent Systems, where data
224 structures and algorithms are tightly coupled to the described
225 simulation. However, works from existing literature show that MAS can
226 benefit from these performance gains on various simulation types, such
227 as traffic simulation~\cite{Strippgen_Nagel_2009}, cellular
228 automatons~\cite{Dsouza2007}, mobile-agent based
229 path-finding~\cite{Silveira:2010:PRG:1948395.1948446} or genetic
230 algorithms~\cite{Maitre2009}. Note that an application-specific
231 adaptation process was required in the case of these MAS: some of the
232 previous examples are driven by mathematical laws (path-finding) or
233 use a natural mapping between a discrete environment (cellular
234 automaton) and GPU cores. Unfortunately, this mapping often requires
235 algorithmic adaptations in other models but the experience shows that
236 the more reactive a MAS is the more adapted its implementation is to
239 The first step in the adaptation of an ABM to GPU platforms is the
240 choice of the language. On the one hand, the Java programming language
241 is often used for the implementation of MAS due to its availability
242 on numerous platforms or frameworks and its focus on high-level,
243 object-oriented programming. On the other hand, GPU platforms can
244 only run specific languages as OpenCL or CUDA. OpenCL (supported on
245 AMD, Intel and NVIDIA hardware) better suits the portability concerns
246 across a wide range of hardware needed the agent simulators, as
247 opposed to CUDA which is a NVIDIA-specific library.
249 OpenCL is a C library which provides access to the underlying CPUs or
250 GPUs threads using an asynchronous interface. Various OpenCL functions allow
251 the compilation and the execution of programs on these execution
252 resources, the copy of data buffers between devices, or the collection
253 of profiling information.
255 This library is based around three main concepts:
258 \item the \emph{kernel} (similar to a CUDA kernel) which represents a runnable program
259 containing instructions to be executed on the GPU.
260 \item the \emph{work-item} (equivalent to a CUDA thread), which is analogous to the concept
261 of thread on GPU, in that it represents one running instance of a
263 \item the \emph{work-group} (or execution block) which is a set of work-items
264 sharing some memory to speed up data accesses and
265 computations. Synchronization operations such as barrier can only be
266 used across the same work-group.
269 Running an OpenCL computation consists in launching numerous work-items
270 that execute the same kernel. The work-items are submitted to a
271 submission queue to optimize the available cores usage. A calculus is
272 achieved once all these kernel instances have terminated.
274 The number of work-items used in each work-group is an important
275 implementation choice which determines how many tasks will share the
276 same cache memory. Data used by the work-items can be stored as
277 N-dimensions matrices in local or global GPU memory. Since the size
278 of this memory is often limited to a few hundred of kilobytes,
279 choosing this number often implies a compromise between the model
280 synchronization or data requirements and the available resources.
282 In the case of agent-based simulations, each agent can be naturally
283 mapped to a work-item. Work-groups can then be used to represents groups
284 of agents or simulations sharing common data (such as the environment)
285 or algorithms (such as the background evolution process).
287 In the following examples the JOCL~\cite{jocl_home} binding is used
288 to access the OpenCL platform from the Java programming language.
290 In the next sections we present two practical cases that will be
291 studied in details, from the model to its implementation and
294 \section{A first practical example}
295 \label{ch17:sec:1stmodel}
297 The first model, the Collembola model, simulates the propagation of
298 collembolas in fields and forests. It is based on a diffusion
299 algorithm which illustrates the case of agents with a simple behavior
300 and few synchronization problems.
302 \subsection{The Collembola model\index{Collembola model}}
303 \label{ch17:subsec:collembolamodel}
305 The Collembola model is an example of multi-agent system using GIS (Geographical Information System)
306 and survey data (population count)
307 to model the evolution of the biodiversity
308 across land plots. A first version of this model has been developed with Netlogo framework by Bioemco and UMMISCO researchers. In this model, the biodiversity is modeled by populations of
309 athropod individuals, the Collembola, which can reproduce and diffuse
310 to favorable new habitats. The simulator allows to study the diffusion
311 of collembola, between plots of land depending on their use
312 (artifical soil, crop, forest\ldots). In this
313 model the environment is composed of the studied land and the
314 colembola are agents. Every land plot is divided into several cells,
315 each cell representing a surface unit (16*16 meters). A number of individuals per collembola species is associated to each cell. The model evolution
316 is then based on a common diffusion model that diffuse individuals between
317 cells. Each step of the simulation is based on four stages, as shown
318 on Figure~\ref{ch17:fig:collem_algorithm}:
321 % \item arrival of new individuals
322 % \item reproduction in each cell
323 % \item diffusion between cells
324 % \item updating of colembola lifetime
329 \includegraphics[width=0.6\textwidth]{Chapters/chapter17/figs/algo_collem.pdf}
330 \caption{Evolution algorithm of Collembola model}
331 \label{ch17:fig:collem_algorithm}
334 The algorithm is quite simple but includes two costly operations, the
335 reproduction and the diffusion, that must be parallelized to improve
336 the model performances.
338 The {\bf reproduction} stage consists in updating the total population
339 of each plot by taking the individuals arrived at the preceding
340 computation step. This stage implies to process the whole set of cells
341 of the environment to sum their population. The computed value is
342 recorded in the plot associated to each cell. This process can be
343 assimilated to a reduction operation on all the population cells
344 associated to one plot to obtain its population.
346 The {\bf diffusion} stage simulates the natural behavior of the
347 collembola that tends toward occupying the whole space over time. This
348 stage consists in computing a new value for each cell depending on
349 the population of the neighbor cells. This process can be assimilated
350 to a linear diffusion at each iteration of the population of the cells
351 across their neighbors.
353 These two processes are quite common in numerical computations so
354 that the collembola model can be adapted to a GPU execution without
357 \subsection{Collembola Implementation}
359 In the Collembola simulator biodiversity is modeled by populations
360 of Collembola individuals, which can reproduce and diffuse to
361 favorable new habitats. This is implemented in the algorithm as a fixed
362 reproduction factor, applied to the size of each population, followed
363 by a linear diffusion of each cell population to its eight
364 neighbors. These reproduction and diffusion processes are followed by
365 two more steps on the GPU implementation. The first one consist of
366 culling of populations in a inhospitable environment, by checking each
367 cell value, landuse, and setting its population to zero if
368 necessary. The final simulation step is the reduction of the cells
369 populations for each plot, to obtain an updated plot population
370 for statistic purposes. This separate computation step, done while
371 updating each cell population in the reference sequential algorithm,
372 is motivated by synchronization problems and allow the reduction of
373 the total numbers of access needed to updated those populations.
375 %\lstinputlisting[language=C,caption=Collembola OpenCL
376 %kernels,label=fig:collem_kernels]{Chapters/chapter17/code/collem_kernels.cl}
378 \lstinputlisting[caption=Collembola OpenCL Diffusion kernel,label=ch17:listing:collembola-diffuse]{Chapters/chapter17/code/collem_kernel_diffuse.cl}
380 The reproduction, diffusion and culling steps are implemented on GPU
381 (Figure~\ref{ch17:fig:collem_algorithm}) as a straight mapping of each cell
382 to an OpenCL work-item (GPU thread). Listing
383 \ref{ch17:listing:collembola-diffuse} gives the kernel for the
384 diffusion implementation. To prevent data coherency problems, the
385 diffusion step is split into two phases, separated by an execution
386 {\it barrier}. In the first phase, each cell diffusion overflow is
387 calculated, and divided by the number of neightbors. Note that, at the
388 model frontier, populations can also overflow outside the environment
389 grid but we do not manage those external populations, since there are
390 no reason to assume our model to be isolated of its surroundings. The
391 overflow by neighbors value is stored for each cell, before
392 encountering the barrier. After the barrier is met, each cell read the
393 overflows stored by all its neighbors at the previous step and applies
394 them to its own population. In this manner, only one barrier is
395 required to ensure the consistency of populations value, since no
396 cells ever modify a value other than its own.
398 \lstinputlisting[caption=Collembola OpenCL reduction kernel,label=ch17:listing:collembola-reduc]{Chapters/chapter17/code/collem_kernel_reduc.cl}
400 Listing \ref{ch17:listing:collembola-reduc} gives the kernel for the
401 reduction implementation. The only step requiring numerous
402 synchronized accesses is the reduction one: in a first approach, we
403 chose to use {\it atomic\_add} operation to implement this process, but more
404 efficient implementations using partial reduction and local GPU memory
405 could be implemented.
407 \subsection{Collembola performance}
409 In this part we present the performance of the Collembola model on
410 various CPU and GPU execution
411 platforms. Figure~\ref{ch17:fig:mior_perfs_collem} shows that the
412 number of cores and the processor architecture as a strong influence on the obtained results
415 % dual-core processor equipping our Thinkpad platform has two to six
416 % longer executions times, compared to a six-core Phenom X6.
418 % % \begin{figure}[h]
419 % %begin{minipage}[t]{0.49\linewidth}
420 % \centering \includegraphics[width=0.7\linewidth]{./Chapters/chapter17/figs/collem_perfs.pdf}
421 % \caption{Performance CPU et GPU du modèle Collemboles}
422 % \label{ch17:fig:mior_perfs_collem}
426 % In figure~\ref{ch17:fig:mior_perfs_collem2} the Thinkpad curve is removed
427 % to make other trends clearer. Two more observation can be made, using
428 % this more detailled scale:
431 \item Older GPU cards can be slower than modern processors. This can
432 be explained by the new cache and memory access optimizations
433 implemented in newer generations of GPU devices. These optimizations
434 reduce the penalties associated with irregular and frequent global
435 memory accesses. They are not available on our Tesla nodes.
436 \item GPU curves exhibit a odd-even pattern in their performance
437 results. Since this phenomenon is visible on two distinct
438 manufacturer hardware, drivers and OpenCL implementation, it is
439 likely the result of the decomposition process based on warp of
440 fixed, power-of-two sizes.
441 \item The number of cores is not the only determining factor: an Intel
442 Core i7 2600K processor, even with only four cores, can provide
443 better performance than a Phenom one.
447 %begin{minipage}[t]{0.49\linewidth}
449 \includegraphics[width=0.7\linewidth]{./Chapters/chapter17/figs/collem_perfs_nothinkpad.pdf}
450 \caption{Performance of the Collembola model on CPU and GPU}
451 \label{ch17:fig:mior_perfs_collem}
455 Both figures shows that using the GPU to parallelize part of the
456 simulator results in tangible performance gains over a CPU execution
457 on modern hardware. These gains are more mixed on older GPU platforms
458 due to the limitations when dealing with irregular memory or execution
459 patterns often encountered in MAS systems. This can be closely linked
460 to the availability of caching facilities on the GPU hardware and its
461 dramatically effects depending on the locality and frequency of data
462 accesses. In this case, even if the Tesla architectures provides more
463 execution cores and is the far costlier solution, more recent, cheaper,
464 solutions such as high-end GPU provide better performance when the
465 execution is not constrained by memory size.
467 \section{Second example}
468 \label{ch17:sec:2ndmodel}
470 The second model, the MIOR model, simulates the behavior of microbian
471 colonies. Its execution model is more complex so that it requires
472 changes in the initial algorithm and the use of synchronization to
473 benefit from the GPU architecture.
475 \subsection{The MIOR model}
476 \label{ch17:subsec:miormodel}
478 The MIOR~\cite{C.Cambier2007} model was developed to simulate local
479 interactions in a soil between microbial colonies and organic
480 matters. It reproduces each small cubic units ($0.002 m^3$) of soil as a
483 Multiple implementations of the MIOR model have already been
484 realized, in Smalltalk and Netlogo, in 2 or 3 dimensions. The last
485 implementation, used in our work and referenced as MIOR in the
486 rest of the chapter, is freely accessible online as
487 WebSimMior~\footnote{http://www.IRD.fr/websimmior/}.
489 MIOR is based around two types of agents: (i) the Meta-Mior (MM),
490 which represents microbial colonies consuming carbon and (ii) the
491 Organic Matter (OM) which represents carbon deposits occurring in soil.
493 The Meta-Mior agents are characterized by two distinct behaviors:
495 \item \emph{breath}: the action converts mineral carbon from the soil
496 to carbon dioxide $CO_{2}$ that is released in the soil.
497 \item \emph{growth}: by this action each microbial colony fixes the
498 carbon present in the environment to reproduce itself (augments its
499 size). This action is only possible if the colony breathing needs
500 where covered, i.e. enough mineral carbon is available.
503 These behaviors are described in Algorithm~\ref{ch17:seqalgo}.
506 \caption{Evolution step of each Meta-Mior (microbial colony) agent}
508 \KwIn{A static array $mmList$ of MM agents}
509 \myinput{A static array $omList$ of OM agents}
510 \myinput{A MIOR environment $world$}
511 $breathNeed \gets world.respirationRate \times mm.carbon$\;
512 $growthNeed \gets world.growthRate \times mm.carbon$\;
513 $availableCarbon \gets totalAccessibleCarbon(mm)$\;
514 \uIf{$availableCarbon > breathNeed$}{
516 $mm.active \gets true$\;
517 $availableCarbon \gets availableCarbon - consumCarbon(mm, breathNeed)$\;
518 $world.CO2 \gets world.CO2 + breathNeed$\;
519 \If{$availableCarbon > 0$}{
521 $growthConsum \gets max(totalAccessCarbon(mm), growthNeed)$\;
522 $consumCarbon(mm, growthConsum)$\;
523 $mm.carbon \gets mm.carbon + growthConsum$\;
527 $mm.active \gets false$
531 Since this simulation takes place at a microscopic scale, a large
532 number of these simulations must be executed for each macroscopic
533 simulation step to model realistic-sized unit of soil. This lead to
534 large computing needs despite the small computation cost of each
535 individual simulation.
538 \subsection{MIOR Implementation}
540 As pointed out previously, the MIOR implementation implied more
541 changes for the initial code to be run on GPU. As a first attempt, we
542 realized a simple GPU implementation of the MIOR simulator, with only
543 minimal changes to the CPU algorithm. Execution times showed the
544 inefficiency of this approach, and highlighted the necessity of
545 adapting the simulator to take advantage of the GPU execution
546 capabilities~\cite{lmlm+12:ip}. In this part, we show the main changes
547 which were realized to adapt the MIOR simulator on GPU architectures.
549 \subsubsection{Execution mapping on GPU}
553 \includegraphics[width=0.7\textwidth]{Chapters/chapter17/figs/repartition.pdf}
554 \caption{Execution distribution retained on GPU}
555 \label{ch17:fig:gpu_distribution}
558 Each MIOR simulation is represented by a work-group, and each agent by
559 a work-item. A kernel is in charge of the life cycle process for each
560 agent of the model. This kernel is executed by all the work-items of the simulation
561 each on its own gpu core.
563 The usage of one work-group for each simulation allows to easily
564 execute multiple simulations in parallel, as shown on
565 figure~\ref{ch17:fig:gpu_distribution}. By taking advantage of the
566 execution overlap possibilities provided by OpenCL it then becomes
567 possible to exploit all the cores at the same time, even if an unique
568 simulation is to small to use all the available GPU cores. However,
569 the maximum size of a work-group is limited (to $512$), which only allows
570 us to execute one simulation per work-group when using $310$ threads
571 (number of OM in the reference model) to execute the simulation.
573 The usage of the GPU to execute multiple simulations is initiated by
574 the CPU. The later keeps total control of the simulator execution
575 flow. Thus, optimized scheduling policies (such as submitting kernels
576 in batch, or limiting the number of kernels, or asynchronously
577 retrieving the simulation results) can be defined to minimize the cost
578 related to data transfers between CPU and GPUs.
580 \subsubsection{Data structures translation}
581 \label{ch17:subsec:datastructures}
583 The adaptation of the MIOR model to GPU requires the mapping of the
584 data model to OpenCL data structures. The environment and the agents
585 are represented by arrays of structures, where each structure describes the
586 state of one entity. The behavior of these entities are implemented
587 as OpenCL functions to be called from the kernels during execution.
589 Since the main program is written in Java, JOCL is responsible for the
590 allocation and the mapping of the object data structures to OpenCL
591 ones before execution.
593 Four main data structures
594 (Figure~\ref{ch17:listing:mior_data_structures}) are defined: (i) an
595 array of MM agents, representing the state of the microbial
596 colonies. (ii) an array of OM agents, representing the state of the
597 carbon deposits. (iii) a topology matrix, which stores accessibility
598 information between the two types of agents of the model (iv) a world
599 structure, which contains all the global input data (metabolism rate,
600 numbers of agents) and output data (quantity of $CO_{2}$ produced) of
601 the simulation. These data structures are initialized by the CPU and
602 then copied on the GPU.
604 \lstinputlisting[caption=Main data structures used in a MIOR simulation,label=ch17:listing:mior_data_structures]{Chapters/chapter17/code/data_structures.cl}
606 The world topology is stored as a two-dimension matrix which contains
607 OM indexes on the abscissa and the MM index on the ordinate. Each
608 agent walks through its line/column of the matrix at each iteration to
609 determinate which agents can be accessed during the simulation. Since
610 many agents are not connected this matrix is sparse, which introduces
611 a big number of useless memory accesses. To reduce the impact of these
612 memory accesses we use a compacted, optimized representation of this
613 matrix based on~\cite{Gomez-Luna:2009:PVS:1616772.1616869}, as
614 illustrated on the figure~\ref{ch17:fig:csr_representation}. This compact
615 representation considers each line of the matrix as an index list, and
616 only stores accessible agents continuously, to reduce the size of the
617 list and the number of non productive accesses.
621 \includegraphics[width=0.8\textwidth]{Chapters/chapter17/figs/grid.pdf}
622 %\psfig{file=figs/grid, height=1in}
623 \caption{Compact representation of the topology of a MIOR simulation}
624 \label{ch17:fig:csr_representation}
627 Since dynamic memory allocation is not possible yet in OpenCL and only
628 provided in the latest revisions of the CUDA standard, these matrices
629 are statically allocated. The allocation is based on the worst-case
630 scenario where all OM and MM are linked as the real occupation of the
631 matrix cells cannot be deduced without some kind of preprocessing
634 \subsubsection{Critical resources access management}
635 \label{ch17:subsec:concurrency}
637 One of the main issue in a MIOR model is to ensure that all the
638 microbial colonies will have an equitable access to carbon resources,
639 when multiple colonies share the same deposits. Access
640 synchronizations are mandatory in these cases to prevent conflicting
641 updates on the same data that may lead to calculus error (e.g. loss of
644 On a massively parallel architecture such as GPUs, this kind of
645 synchronization conflicts can however lead to an inefficient
646 implementation by enforcing a quasi-sequential execution. It is
647 necessary, in the case of MIOR as well as for other ABM, to ensure
648 that each work-item is not too constrained in its execution.
651 \lstinputlisting[caption=Main MIOR kernel,label=ch17:listing:mior_kernels]{Chapters/chapter17/code/mior_kernels.cl}
653 From the sequential algorithm 1 where all the agents share the same
654 data, we have developed a parallel algorithm composed of three
655 sequential stages separated by synchronization barriers. This new
656 algorithm is based on the distribution of the available OM carbon
657 deposits into parts at the beginning of each execution step. The three
658 stages, illustrated on Listing~\ref{ch17:listing:mior_kernels}, are the following:
661 \item \emph{distribution}: the available carbon in each carbon deposit
662 (OM) is equitably dispatched between all accessible MM in the form
664 \item \emph{metabolism}: each MM consumes carbon in its allocated parts
665 for its breathing and growing processes.
666 \item \emph{gathering}: unconsumed carbon in parts is gathered back
667 into the carbon deposits.
670 This solution suppresses the data synchronization needed by the first
671 algorithm and thus the need for synchronization barriers, and requires only
672 one kernel launch from Java as described on Listing~\ref{ch17:fig:mior_launcher}.
674 \lstinputlisting[caption=MIOR simulation launcher,label=ch17:fig:mior_launcher]{Chapters/chapter17/code/mior_launcher.java}
676 \subsubsection{Termination detection}
678 The termination of a MIOR simulation is reached when the model
679 stabilizes and no more $CO_{2}$ is produced. This termination
680 detection can be done either on the CPU or the GPU but it requires a
681 global view on the system execution.
683 In the first case, when the CPU controls the GPU simulation process,
684 the detection is done in two steps: (i) the CPU starts the execution
685 of a simulation step on the GPU, (ii) the CPU retrieves the GPU data
686 and determines if another iteration must be launched or if the
687 simulation has terminated. This approach allows a fine-grain control
688 over the GPU execution, but it requires many costly data transfers as
689 each iteration results must be sent from the GPU to the CPU. In the
690 case of the MIOR model these costs are mainly due to the inherent
691 PCI-express port latencies rather than to bandwidth limitation since
692 data sizes remains rather small, in the order of few dozens of
695 In the second case the termination detection is directly implemented
696 on the GPU by checking the amount of available carbon between two
697 iterations. The CPU does not have any feedback while the simulation is
698 running, but retrieves the results once the kernel execution is
699 finished. This approach minimizes the number of transfers between the
702 \subsection{Performance of MIOR implementations}
703 \label{ch17:subsec:miorexperiments}
705 In this part we present several MIOR GPU implementations using the
706 distribution/gathering process described in the previous section and
707 compare their performance on two distinct hardware platform; i.e two
708 different GPU devices. Five incremental MIOR implementations were
709 realized with an increasing level of adaptation for the algorithm: in
710 all cases, we choose the average time over 50 executions as a
711 performance indicator.
714 \item the \textbf{GPU 1.0} implementation is a direct implementation
715 of the existing algorithm and its data structures where data
716 dependencies were removed and that use the non-compact topology
717 representation described in Section~\ref{ch17:subsec:datastructures}
718 \item the \textbf{GPU 2.0} implementation uses the previously
719 described compact representation of the topology and remains
720 otherwise identical to the GPU 1.0 implementation.
721 \item the \textbf{GPU 3.0} implementation introduces the manual
722 copy into local (private) memory of often-accessed global data, such
723 as carbon parts or topology information.
724 \item the \textbf{GPU 4.0} implementation is a variant of the
725 GPU 1.0 implementation which allows the execution of
726 multiple simulations for each kernel execution.
727 \item the \textbf{GPU 5.0} implementation is a multi-simulation of the
728 GPU 2.0 implementation, using the execution of multiple simulations
729 for each kernel execution as for GPU 4.0
732 The two last implementations -- \textbf{GPU 4.0} and \textbf{GPU 5.0}
733 -- illustrate the gain provided by a better usage of the hardware
734 resources, thanks to the driver execution overlapping capabilities. A
735 sequential version of the MIOR algorithm, labeled as \textbf{CPU}, is
736 provided for comparison purpose. This sequential version is developed
737 in Java, the same language used for GPU implementations and the Sworm
740 For these performance evaluations, two platforms are used. The first
741 one is representative of the kind of hardware which is available on
742 HPC clusters. It is a cluster node dedicated to GPU computations with
743 two Intel X5550 processors running at $2.67GHz$ and two Tesla C1060
744 GPU devices running at $1.3$GHz and composed of $240$ cores ($30$
745 multi-processors). Only one of these GPU is used in these experiments,
746 at the moment. The second platform illustrates what can be expected
747 from a personal desktop computer built a few years ago. It uses a
748 Intel Q9300 CPU, running at $2.5$GHz, and a Geforce 8800GT GPU card
749 running at $1.5$GHz and composed of $112$ cores ($14$
750 multi-processors). The purpose of these two platforms is to assess the
751 benefit that could be obtained either when a scientist has access to
752 specialized hardware as a cluster or tries to take benefit from its
753 own personal computer.
756 %begin{minipage}[t]{0.49\linewidth}
758 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/mior_perfs_tesla.pdf}
759 %\caption{Performance CPU and GPU sur carte Tesla C1060}
760 \caption{CPU and GPU performance on a Tesla C1060 node}
761 \label{ch17:fig:mior_perfs_tesla}
765 Figures~\ref{ch17:fig:mior_perfs_tesla}~and~\ref{ch17:fig:mior_perfs_8800gt}
766 show the execution time for $50$ simulations on the two hardware
767 platforms. A size factor is applied to the problem: at scale 1,
768 the model contains $38$ MM and $310$ OM, while at the scale 6 these
769 numbers are multiplied by six. The size of the environment is modified
770 as well to maintain the same average agent density in the model. This
771 scaling factor display the impact of the chosen size of simulation on
774 %hspace{0.02\linewidth}
775 %begin{minipage}[t]{0.49\linewidth}
778 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/mior_perfs_8800gt.pdf}
779 \caption{CPU and GPU performance on a personal computer with a Geforce 8800GT}
780 \label{ch17:fig:mior_perfs_8800gt}
784 The charts show that for small problems the execution times of all
785 the implementations are very close. This is because the GPU execution
786 does not have enough threads (representing agents) for an optimal
787 usage of GPU resources. This trend changes after scale $5$ where GPU
788 2.0 and GPU 3.0 take the advantage on the GPU 1.0 and CPU
789 implementations. This advantage continues to grow with the scaling
790 factor, and reaches a speedup of $10$ at the scale $10$ between the
791 fastest single-simulation GPU implementation and the first, naive one
794 Multiple trends can be observed in these results. First, optimizations
795 for the GPU hardware show a big, positive impact on performance,
796 illustrating the strong requirements on the algorithm properties to
797 reach execution efficiency. These charts also show that despite the
798 vast difference in numbers of cores between the two GPUs platforms the
799 same trends can be observed in both case. We can therefore expect
800 similar results on other GPU cards, without the need for more
805 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/monokernel.pdf}
806 \caption{Execution time of one multi-simulation kernel on the Tesla
808 \label{ch17:fig:monokernel_graph}
813 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/multikernel.pdf}
814 \caption{Total execution time for 1000 simulations on the Tesla
815 platform, while varying the number of simulations for each kernel}
816 \label{ch17:fig:multikernel_graph}
819 There are two ways to measure simulations performance: (i) by
820 executing only one kernel, and varying its size (the number of
821 simulations executed in parallel), as shown on
822 Figure~\ref{ch17:fig:monokernel_graph}, to test the costs linked to the
823 parallelization process or (ii) by executing a fixed number of
824 simulations and varying the size
825 of each kernel, as shown on Figure~\ref{ch17:fig:multikernel_graph}.
827 Figure~\ref{ch17:fig:monokernel_graph} illustrates the execution time for
828 only one kernel. It shows that for small numbers of simulations run in
829 parallel, the compact implementation of the model topology is faster
830 than the two-dimension matrix representation. This trends reverse with
831 more than $50$ simulations in parallel, which can be explained either by
832 the non linear progression of the synchronization costs or by the
833 additional memory required for the access-efficient representation.
835 Figure~\ref{ch17:fig:multikernel_graph} illustrates the execution time of a
836 fixed number of simulations. It shows that for a small number of
837 simulations run in parallel, the costs resulting of program setup,
838 data copies and launch on GPU are determinant and very detrimental to
839 performance. Once the number of simulations executed for each kernel
840 grows, these costs are counterbalanced by computation costs. This
841 trend is more marked in the case of the sparse implementation (GPU
842 4.0) than the compact one but appears on both curves. With more than
843 $30$ simulations for each kernel, execution times stalls, since
844 hardware limits are reached. This indicates that the cost of preparing
845 and launching kernels become negligible compared to the computing
846 time once a good GPU occupation rate is achieved.
848 \section{Analysis and recommendations}
849 \label{ch17:analysis}
851 In this section we synthesize the observations done on the two models
852 and identify some recommendations on implementing complex systems on
855 \subsection{Analysis}
857 In both the Collembola and the MIOR model, a critical problematic is
858 the determination of the part of the simulation which are to be run on
859 GPU and which are to remain on the CPU. The determination of these
860 parts is a classic step of any algorithm parallelization and must
861 take into account considerations such as the cost of the different
862 parts of the algorithm and the expected gains.
864 In the case of the Collembola model two steps of the algorithm were
865 ported to GPU. Both steps use straightforward, easily-parallelizable,
866 operations where a direct gain can be expected by using more
867 executions cores without important modifications of the algorithm.
869 In the MIOR model case however no such inherently parallelizable parts
870 are evident in the original sequential algorithm. This is mainly
871 explained by the rate of interactions between agents in this model in
872 the form of many operations (breathing, growth) on shared carbon
873 resources. In this case the algorithm had to be more profoundly
874 modified while keeping in head the need to remain true to the original
875 model, to synchronize the main execution step of all agents in the
876 model, to ensure equity, and to minimize the numbers of
877 synchronizations. The minimization is done by factoring the repartition of
878 carbon in the model in two separated step at the beginning and the end
879 of each iterations rather than at multiple point of the execution.
881 \subsection{MAS execution workflow}
883 Many MAS simulations decompose their execution process into discrete
884 evolution steps where each step represents a quantum of time (minimal
885 unit of time described). At the end of each step many global
886 information, graphical displays or results are updated. This
887 execution model may not correctly fit on GPU platforms as they assume
888 more or less a batch-like workflow model. The execution model must be
889 split into the following ever repeating steps:
892 \item Allocation of GPU data buffers
893 \item Copy of data from CPU to GPU
894 \item GPU kernels execution
895 \item Copy of results from GPU to CPU
898 This workflow works well if the considered data transfer time is
899 negligible compared to GPU execution or can be done in parallel,
900 thanks to the asynchronous nature of OpenCL. If we are to update the
901 MAS model after each iteration then performance risk being
902 degraded. This is illustrated in the MIOR model by the fact that the
903 speedup observed on GPU is much more significant for bigger
904 simulations, which implies longer GPU execution times. Our solution to
905 this problem is to desynchronize the execution of the SMA model and
906 its GPU parts by requesting the execution of multiple steps of the
907 GPU simulations for each launch.
909 Another related prerequisite of GPU execution is the ability to have
910 many threads to be executed, to allow an efficient exploitation of the
911 superior number of cores provided by the architecture. In the case of
912 MAS models, this means that executing one agent at a time on GPU is
913 meaningless in regard to GPU usage, copying cost, and actual gain in
914 execution time, if the process is not already parallel and costly at
915 this scale. In the MIOR and the Collembola models, this is solved by
916 executing the computations for all agents of the model at the same
917 time. If the model has however only chronic needs for intensive
918 computations then some kind of batching mechanism is required to store
919 waiting treatments in a queue, until the total sum of waiting
920 computations justify the transfers cost to the GPU platform.
922 \subsection{Implementation challenges}
924 Besides the execution strategy challenges described above, some
925 implementation challenges also occur when implementing an OpenCL
926 version of a MAS model, mainly related to the underlying limitations
927 of the execution platform.
929 The first one is the impossibility (except in latest CUDA versions)
930 to dynamically allocate memory during execution. This is a problem in
931 the case of models where the number of agent can vary during the
932 simulation, such as prey-predator models. In this case, the only
933 solution is to overestimate the size of arrays or data structures to
934 accommodate these additional individuals, or using the CPU to resize
935 data structures when these situations occur. Both approaches require
936 to trend either memory or performance and are not always practical.
938 Another limitation is the impossibility to store pointers in data
939 structures, we restraint any OpenCL to use only one dimension static
940 arrays or specific data structures such as textures. This preclude the
941 usage of structures like linked-list, graphs or sparse matrix not
942 represented by some combination of static arrays, and can be another
943 source of memory or performance losses.
945 In the case of MIOR, this problem is especially exacerbated in the case
946 of neighboring storage: both representations consume much more memory
947 that actually required, since the worst case (all agents have
948 access to all others agents) has to be taken into account defensively
949 when dimensioning the data structure.
951 The existence of different generations of GPU hardwares is also a
952 challenge. Older implementations, such as the four year old Tesla
953 C1060 cards, have very strong expectation in term of memory accesses
954 and requires very regular access patterns to perform efficiently. MAS
955 having many random accesses, such as MIOR, or many small global memory
956 accesses, such as Collembola, are penalized on these older
957 cards. Fortunately, these requirements are less present is modern
958 cards, which offer caches and other facilities traditionally present
959 on CPU to offset these kind of penalties.
961 The final concern is related to the previous ones and often result in
962 more memory consumption. The amount of memory available on GPU cards
963 is much more limited and adding new memory capabilities is more costly
964 compared to expending a CPU RAM. On computing clusters, hardwares
965 nodes with 128GB of memory or more have become affordable, whereas
966 newer Tesla architecture remains limited to 16GB of memory. This can
967 be a concern in the case of big MAS models, or small ones which can
968 only use memory-inefficient OpenCL structures.
973 As shown in the previous sections, many data representations choices
974 are common to entire classes of MAS. The paradigm of grid, for example,
975 is often encountered in models where each cell constitutes either the
976 elementary unit of simulation (SugarScape~\cite{Dsouza2007}) or a
977 discretization of the environment space
978 (Pathfinding~\cite{Guy09clearpath}). These grids can be considered as
979 two or three-dimensional matrices, whose processing can be directly
982 Another common data representation encountered in MAS system is the
983 usage of 2d or 3d coordinates to store the position of each agent of
984 the model. In this case, even if the environment is no longer
985 discrete, the location information still imply computations
986 (euclidean distances) which can be parallelized.
988 MCSMA~\cite{lmlm+13:ip} is a framework developed to provide to the MAS designer
989 those basic data structures, and the associated operations, to
990 facilitate the portage of existing MAS on GPU. Two levels of
991 utilization are provided to the developer, depending on its usage
995 \item A high-level library, constituted of modules regrouping classes
996 of operations. A module provides multiple methods of
997 distance computations in 1d, 2d or 3d grids, another one the
998 diffusion algorithm...
999 \item A low level API which allows the developed direct access to the
1000 GPU and the inner working of MCSMA, to develop its own module in
1001 the case where the required operations are not yet provided by the platform.
1004 Both usage levels were illustrated in the above two practical cases. In
1005 MIOR, the whole algorithm (baring initialization) is ported on GPU as
1006 a specific plugin which allows executing $n$ MIOR simulations and
1007 retrieve their results. This first approach requires extensive
1008 adaptations to the original algorithm. In Collembola, on the
1009 contrary, the main steps of the algorithm remain executed on the CPU,
1010 and only specific operations are delegated to generic, already
1011 existing, diffusion and reduction kernels. The fundamental algorithm
1012 is not modified and GPU execution is only applied to specific parts of
1013 the execution which may benefit from it. These two programming
1014 approaches allow incremental adaptations of existing Java MAS to
1015 accelerate their execution on GPU, while retaining the option to
1016 develop their own reusable or more efficient module to supplement the
1017 already existing ones.
1019 \section{Conclusion}
1020 \label{ch17:conclusion}
1022 This chapter addresses the issue of complex system simulation by using
1023 agent based paradigm and GPU hardware. From the experiments on two
1024 existing Agent Based Models of soil science we intend to provide
1025 useful information on the architecture, the algorithm design and, the
1026 data management to run Agent based simulations on GPU, and more
1027 generally to run compute intensive applications that are not based on
1028 a not-so-regular model. The first result of this work is that adapting
1029 the algorithm to a GPU architecture is possible and suitable to speed
1030 up agent based simulations as illustrated by the MIOR model. Coupling
1031 CPU with GPU seems even to be an interesting way to better take
1032 advantage of the power given by computers and clusters as illustrated
1033 by the Collembola model. From our point of view the adaptation process
1034 is lighter than a traditional parallelization on distributed nodes and
1035 not much difficult than a standard multi-threaded parallelization,
1036 since all the data remains on the same host and can be shared in
1037 central memory. The OCL adaptation also enables a portable simulator
1038 that can be run on different graphical units. Even using a mainstream
1039 card as the GPU card of a standard computer can lead to significant
1040 performance improvements. This is an interesting result as it opens
1041 the field of inexpensive HPC to a large community.
1043 In this perspective, we are working on MCSMA a development platform
1044 that would facilitate the use of GPU or many-core architectures for
1045 multi-agent simulations. Our first work has been the definition of
1046 common, efficient, reusable data structures, such as grids or
1047 lists. Another goal is to provide easier means to control the
1048 distribution of specific processes between CPU or GPU, to allow the
1049 easy exploitation of the strengths of each platform in the same
1050 multi-agent simulation. We think that the same approach, i.e
1051 developing specific environments that facilitate the developer
1052 access to the GPU power, can be applied in many domains with
1053 compute intensive needs to open the GPU use to a larger community.
1055 \putbib[Chapters/chapter17/biblio]