1 \chapterauthor{Guillaume Laville, Christophe Lang, Bénédicte Herrmann,
2 and Laurent Philippe}{Femto-ST Institute, University of
4 %\chapterauthor{Christophe Lang}{Femto-ST Institute, University of Franche-Comte, France}
5 \chapterauthor{Kamel Mazouzi}{Franche-Comte Computing Center, University of Franche-Comte, France}
6 \chapterauthor{Nicolas Marilleau}{UMMISCO, Institut de Recherche pour le Developpement (IRD), France}
7 %\chapterauthor{Bénédicte Herrmann}{Femto-ST Institute, University of Franche-Comte, France}
8 %\chapterauthor{Laurent Philippe}{Femto-ST Institute, University of Franche-Comte, France}
11 \newcommand\myinput[1]{%
12 \settowidth\mylen{\KwIn{}}%
13 \setlength\hangindent{\mylen}%
16 \chapter{Implementing Multi-Agent Systems on GPU}
20 \section{Introduction}
23 In this chapter we introduce the use of Graphical Processing Units
24 (GPU) for multi-agents-based systems as an example of a not-so-regular
25 application that could benefit from the GPU computing
26 power. Multi-Agent Systems (MAS) are a simulation paradigm used to
27 study the behavior of dynamic systems. Dynamic systems as physical
28 systems are often modeled by mathematical representations and their
29 dynamic behavior is simulated by differential equations. The
30 simulation of the system thus often relies on the resolution of a
31 linear system that can be efficiently computed on a graphical
32 processing unit as shown in the preceding chapters. But when the
33 behavior of the system elements is not uniformly driven by the same
34 law, when these elements have their own behavior, the modeling process
35 is too complex to rely on formal expressions. In this context MAS is a
36 recognized approach to model and simulate systems where individuals
37 have an autonomous behavior that cannot be simulated by the evolution
38 of a set of variables driven by mathematical laws. MAS are often used
39 to simulate natural or collective phenomena whose individuals are too
40 numerous or various to provide a unified algorithm describing the
41 system evolution. The agent-based approach is to divide these complex
42 systems into individual self-contained entities with their smaller set
43 of attributes and functions. But, as for mathematical simulations,
44 when the size of the MAS increases, the need of computing power and
45 memory also increases. For this reason, multi-agent systems should
46 benefit from the use of distributed computing architectures. Clusters
47 and grids are often identified as the main solution to increase
48 simulation performance but GPUs are also a promising technology with
49 an attractive performance/cost ratio.
51 Conceptually a MAS\index{Multi-Agent System} is a distributed system
52 as it favors the definition and description of large sets of
53 individuals, the agents, that can be run in parallel. As a large set
54 of agents could have the same behavior, a Single Instruction Multiple
55 Data (SIMD) execution architecture should fit the simulation
56 execution. Most of the agent-based simulators are, however, designed
57 with a sequential scheme in mind, and these simulators seldom use more
58 than one core for their execution. Due to simulation scheduling
59 constraints, data sharing and exchange between agents and the huge
60 amount of interactions between agents and their environment, it is
61 indeed rather difficult to distribute an agent based simulator, for
62 instance, to take advantage of new multithreaded computer
63 architectures. Thus, guidelines and tools dedicated to MAS paradigm
64 and High Performance Computing (HPC) are now a need for other complex
65 system communities. Note that, from the described structure (large
66 number of agents sharing data), we can conclude that MAS would more
67 easily benefit from many-core architectures than from other kinds of
70 Another key point that advocates for the use of many-core in MAS is
71 the growing need for multiscale simulations. Multiscale simulations
72 explore problems with interactions between several scales. The
73 different scales use different granularity of the structure and
74 potentially different models. Most of the time the lower scale
75 simulations provide results to higher scale simulations. In that case
76 the execution of the simulations can easily be distributed between the
77 local cores and a many-core architecture, i.e., a GPU device.
79 We explore in this chapter the use of many-core architectures to
80 execute agent-based simulations. We illustrate our reflexion with two
81 cases: the Collembola simulator designed to simulate the diffusion of
82 Collembola between plots of land and the MIOR (MIcro ORganism)
83 simulator that reproduces effects of earthworms on bacteria dynamics
84 in a bulked soil. In Section \ref{ch17:ABM} we present the work
85 related to MAS and parallelization with a special focus on many-core
86 use. In sections \ref{ch17:sec:1stmodel} and \ref{ch17:sec:2ndmodel}
87 we present in detail two multi-agent models, their GPU
88 implementations, the conducted experiments, and their performance
89 results. The first model, given in Section \ref{ch17:sec:1stmodel},
90 illustrates the use of a GPU architecture to speed up the execution of
91 some computation-intensive functions while the main model is still
92 executed on the central processing unit. The second model, given in
93 Section \ref{ch17:sec:2ndmodel}, illustrates the use of a GPU
94 architecture to implement the whole model on the GPU processor which
95 implies deeper changes in the initial algorithm. In Section
96 \ref{ch17:analysis} we propose a more general reflexion on these
97 implementations and provide some guidelines. Then, we conclude in
98 Section \ref{ch17:conclusion} on the possible generalization of our
102 \section{Running agent-based simulations}
105 In this section, we present the context of MAS, their parallelization,
106 and we report several existing works on using GPU to simulate
109 \subsection{Multi-agent systems and parallelism}
111 Agent-based systems are often used to simulate natural or collective
112 phenomena whose actors are too numerous or various to provide a simple
113 unified algorithm describing the studied system
114 dynamic~\cite{Schweitzer2003}. The implementation of an agent based
115 simulation usually starts by designing the underlying agent-based
116 model (ABM). Most ABM are based around a few types of entities such as
117 agents, one environment, or an interaction
118 organization~\cite{Vowel02}. In the complex system domain, the
119 environment often describes a real space, its structure (e.g. soil
120 textures and porosities), and its dynamics (e.g., organic matter
121 decomposition). It is a virtual world in which agents represent
122 studied entities (e.g., biotic organisms) evolution. The actual agent
123 is animated by a behavior that can range between reactivity (only
124 reacts to external stimuli) and cognition (makes complex decisions
125 based on environmental and internal factors). Interaction and
126 organization define functions, types, and patterns of communications
127 of their member agents in the
128 system~\cite{Odell:2003:RRD:1807559.1807562}. Note that, depending on
129 the MAS, agents can communicate either directly through special
130 primitives or indirectly through the information stored in the
133 Agent-based simulations have been used for more than a decade to
134 reproduce, understand and even predict complex system dynamics. They
135 have proved their usefulness in various scientific
136 communities. Nowadays generic agent based frameworks such as
137 Repast~\cite{repast_home} or NetLogo~\cite{netlogo_home} are promoted
138 to implement simulators. Many ABMs such as the crown model
139 representing a city wide scale~\cite{Strippgen_Nagel_2009} tend
140 however to require a large number of agents to provide a realistic
141 behavior and reliable global statistics. Moreover, an achieved model
142 analysis needs to resort to an experiment plan, consisting of multiple
143 simulation runs, to obtain enough confidence in a simulation. In this
144 case the available computing power often limits the simulation size,
145 and the resulting range thus requires the use of parallelism to
146 explore bigger configurations.
148 For that, three major approaches can be identified:
150 \item parallelizing experiments execution on a cluster or a grid (one
151 or a few simulations are submitted to each
152 core)~\cite{Blanchart11,Chuffart2010},
153 \item parallelizing the simulator on a cluster (the environment of the
154 MAS is split and run on several distributed
155 nodes)~\cite{Cosenza2011,MAR06},
156 \item optimizing the simulator by taking advantage of computer
157 resources (multi-threading, GPU, and so on) \cite{Aaby10}.
160 In the first case, experiments are run independently of each other and
161 only simulation parameters are changed between two runs so that a
162 simple version of an existing simulator can be used. This approach
163 does not, however, allow to run larger models. In the second and the
164 third case, model and code modifications are necessary. Only a few
165 frameworks, however, introduce distribution in agent simulation
166 (Madkit~\cite{Gutknecht2000}, MASON~\cite{Sean05},
167 repastHPC~\cite{Collier11}), and parallel implementations are often
168 based on the explicit use of threads using shared
169 memory~\cite{Guy09clearpath} or cluster libraries such as
170 MPI~\cite{Kiran2010}.
172 Parallelizing a multi-agent simulation is however complex due to space
173 and time constraints. Multi-agent simulations are usually based on a
174 synchronous execution: at each time step, numerous events (space data
175 modification, agent motion) and interactions between agents happen.
176 Distributing the simulation on several computers or grid nodes thus
177 implies to guarantee a distributed synchronous execution and
178 coherency. This often leads to poor performance or complex
179 synchronization problems. Multicore execution or delegating part of
180 this execution to others processors such as GPUs~\cite{Bleiweiss_2008}
181 is usually easier to implement since all the threads share the data
184 % Different agent patterns can be adopted in an ABMs such as
185 % cognitive and reactive ones~\cite{Ferber99}. Cognitive agents act on
186 % the environment and interact with other agents according to a complex
187 % behavior. This behavior takes a local perception of the virtual world
188 % and the agent past (a memory characterized by an internal state and
189 % belief, imperfect knowledge about the world) into account. Reactive
190 % agents have a much systematic pattern of action based on stimuli
191 % response schemes (no or few knowledge and state conservation in agent). The
192 % evolution of the ABM environment, in particular, is often represented with
193 % this last kind of agents. As their behavior is usually simple, we
194 % propose in this chapter to delegate part of the environment and of the
195 % reactive agents execution to the graphical processing unit of the
196 % computer. This way we can balance the load between both CPU and GPU
197 % execution resources.
199 % In the particular case of multi-scale simulations such as the Sworm
200 % simulation~\cite{BLA09} the environment may be used at different
201 % levels. Since the representation of the whole simulated environment
202 % (i.e. the soil area) would be costly, the environment is organized as
203 % a multi-level tree of small soil cubes which can be lazily
204 % instantiated during the simulation. This allows to gradually refine
205 % distribution details in units of soil as agents progress and need
206 % those information, by using a fractal process based on the
207 % bigger-grained already instantiated levels. This characteristic,
208 % especially for a fractal model, could be the key of the
209 % distribution. For instance, each branch of a fractal environment could
210 % be identified as an independent area and parallelized. In addition
211 % Fractal is a famous approach to describe multi-scale environment (such
212 % as soil) and its organization~\cite{perrier}. In that case the lower
213 % scale simulations can also be delegated to the GPU card to limit the
214 % load of the main (upper scale) simulation.
216 \subsection{MAS implementation on GPU}
217 \label{ch17:subsec:gpu}
219 The last few years have seen the appearance of new generations of
220 graphic cards based on more general purpose execution units which are
221 promising for large systems such as MAS. Using matrix-based data
222 representations and SIMD computations is however not always
223 straightforward in MAS, where data structures and algorithms are
224 tightly coupled to the described simulation. However, works from
225 existing literature show that MAS can benefit from these performance
226 gains on various simulation types, such as traffic
227 simulation~\cite{Strippgen_Nagel_2009}, cellular
228 automata~\cite{Dsouza2007}, mobile-agent based
229 path-finding~\cite{Silveira:2010:PRG:1948395.1948446} or genetic
230 algorithms~\cite{Maitre2009}. Note that an application-specific
231 adaptation process was required in the case of these MAS: some of the
232 previous examples are driven by mathematical laws (path-finding) or
233 use a natural mapping between a discrete environment (cellular
234 automaton) and GPU cores. Unfortunately, this mapping often requires
235 algorithmic adaptations in other models but experience shows that the
236 more reactive a MAS is the more adapted its implementation is to GPU.
238 The first step in the adaptation of an ABM to GPU platforms is the
239 choice of language. On the one hand, the Java programming language is
240 often used for the implementation of MAS due to its availability on
241 numerous platforms or frameworks and its focus on high-level,
242 object-oriented programming. On the other hand, GPU platforms can only
243 run specific languages such as OpenCL or CUDA. OpenCL (supported on
244 AMD, Intel, and NVIDIA hardware) better suits the portability concerns
245 across a wide range of hardware needed the agent simulators, as
246 opposed to CUDA which is an NVIDIA-specific library.
248 OpenCL is a C library which provides access to the underlying CPU or
249 GPU threads using an asynchronous interface. Various OpenCL functions
250 allow the compilation and the execution of programs on these execution
251 resources, the copying of data buffers between devices, and the
252 collection of profiling information.
254 This library is based around three main concepts:
257 \item the \emph{kernel} (similar to a CUDA kernel), which represents a runnable program
258 containing instructions to be executed on the GPU;
259 \item the \emph{work-item} (equivalent to a CUDA thread), which is analogous to the concept
260 of thread on GPU, in that it represents one running instance of a
262 \item the \emph{work-group} (or execution block) which is a set of work-items
263 sharing some memory to speed up data accesses and
264 computations. Synchronization operations such as barrier can only be
265 used across the same work-group.
268 Running an OpenCL computation consists of launching numerous
269 work-items that execute the same kernel. The work-items are submitted
270 to a submission queue to optimize the available cores usage. A
271 calculus is achieved once all these kernel instances have terminated.
273 The number of work-items used in each work-group is an important
274 implementation choice which determines how many tasks will share the
275 same cache memory. Data used by the work-items can be stored as
276 N-dimensions matrices in local or global GPU memory. Since the size
277 of this memory is often limited to a few hundred kilobytes, choosing
278 this number often implies a compromise between the model
279 synchronization or data requirements and the available resources.
281 In the case of agent-based simulations, each agent can be naturally
282 mapped to a work-item. Work-groups can then be used to represent
283 groups of agents or simulations sharing common data (such as the
284 environment) or algorithms (such as the background evolution process).
286 In the following examples a binding named JOCL~\cite{jocl_home} is
287 used to access the OpenCL platform from the Java programming language.
289 In the next sections we present two practical cases that will be
290 studied in detail, from the model to its implementation and
293 \section{A first practical example}
294 \label{ch17:sec:1stmodel}
296 The first model, the Collembola model, simulates the propagation of
297 collembolas in fields and forests. It is based on a diffusion
298 algorithm which illustrates the case of agents with a simple behavior
299 and few synchronization problems.
301 \subsection{The Collembola model\index{Collembola model}}
302 \label{ch17:subsec:collembolamodel}
304 The Collembola model is an example of multi-agent system using GIS
305 (Geographical Information System) and survey data (population count)
306 to model the evolution of the biodiversity across land plots. A first
307 version of this model has been developed with the Netlogo framework by
308 Bioemco and UMMISCO researchers. In this model, the biodiversity is
309 modeled by populations of athropod individuals, the Collembola, which
310 can reproduce and diffuse to favorable new habitats. The simulator
311 allows us to study the diffusion of collembola, between plots of land
312 depending on their use (artifical soil, crop, forest, etc.) In this
313 model the environment is composed of the studied land, and collembola
314 are used as agents. Every land plot is divided into several cells,
315 each cell representing a surface unit (16x16 meters). A number of
316 individuals per collembola species is associated to each cell. The
317 model evolution is then based on a common diffusion model that
318 diffuses individuals between cells. Each step of the simulation is
319 based on four stages, as shown on
320 Figure~\ref{ch17:fig:collem_algorithm}:
323 % \item arrival of new individuals
324 % \item reproduction in each cell
325 % \item diffusion between cells
326 % \item updating of colembola lifetime
331 \includegraphics[width=0.6\textwidth]{Chapters/chapter17/figs/algo_collem.pdf}
332 \caption{Evolution algorithm of Collembola model.}
333 \label{ch17:fig:collem_algorithm}
336 The algorithm is quite simple but includes two costly operations, the
337 reproduction and the diffusion, that must be parallelized to improve
338 the model performances.
340 The {\bf reproduction} stage consists in updating the total population
341 of each plot by taking the individuals arrived at the preceding
342 computation step. This stage involves processing the whole set of
343 cells of the environment to sum their population. The computed value
344 is recorded in the plot associated to each cell. This process can be
345 assimilated to a reduction operation on all the population cells
346 associated to one plot to obtain its population.
348 The {\bf diffusion} stage simulates the natural behavior of the
349 collembola that tends toward occupying the whole space over time. This
350 stage consists in computing a new value for each cell depending on
351 the population of the neighbor cells. This process can be assimilated
352 to a linear diffusion at each iteration of the population of the cells
353 across their neighbors.
355 These two processes are quite common in numerical computations so that
356 the collembola model can be adapted to a GPU execution without much
359 \subsection{Collembola implementation}
361 In the collembola simulator biodiversity is modeled by populations of
362 collembola individuals, which can reproduce and diffuse to favorable
363 new habitats. This is implemented as a fixed reproduction factor,
364 applied to the size of each population, followed by a linear diffusion
365 of each cell population to its eight neighbors. These reproduction and
366 diffusion processes are followed by two more steps on the GPU
367 implementation. The first one consist of culling of populations in an
368 inhospitable environment, by checking each cell value and terrain
369 type, and setting its population to zero if necessary. The final
370 simulation step is the reduction of the cell populations for each
371 plot, to obtain an updated plot population for statistic
372 purposes. This separate computation step, done while updating each
373 cell population in the reference sequential algorithm, is motivated by
374 synchronization problems and allows the reduction of the total number
375 of memory accesses needed to updated those populations.
377 %\lstinputlisting[language=C,caption=Collembola OpenCL
378 %kernels,label=fig:collem_kernels]{Chapters/chapter17/code/collem_kernels.cl}
380 \lstinputlisting[caption=collembola openCL diffusion kernel,label=ch17:listing:collembola-diffuse]{Chapters/chapter17/code/collem_kernel_diffuse.cl}
382 The reproduction, diffusion and culling steps are implemented on GPU
383 (Figure~\ref{ch17:fig:collem_algorithm}) as a straight mapping of each
384 cell to an OpenCL work-item (GPU thread). Listing
385 \ref{ch17:listing:collembola-diffuse} gives the kernel for the
386 diffusion implementation. To prevent data coherency problems, the
387 diffusion step is split into two phases, separated by an execution
388 {\it barrier}. In the first phase each cell diffusion overflow is
389 calculated and divided by the number of neighbors. Note that, on the
390 border of the grid, populations can also overflow outside the
391 environment grid but we do not manage those external populations,
392 since there are no reason to assume our model to be isolated of its
393 surroundings. The overflow by neighbors value is stored for each cell
394 before encountering the barrier. After the barrier is met, each cell
395 reads the overflows stored by all its neighbors at the previous step
396 and applies them to its own population. In this manner, only one
397 barrier is required to ensure the consistency of population numbers,
398 since no cell ever modify a value other than its own.
400 Listing \ref{ch17:listing:collembola-reduc} gives the kernel for the
401 reduction implementation. The only step requiring numerous
402 synchronized accesses is the reduction one: in this first approach, we
403 chose to use {\it atomic\_add} operation to implement this process,
404 but more efficient implementations using partial reduction and local
405 GPU memory could be implemented.
408 \lstinputlisting[caption=collembola OpenCL reduction kernel,label=ch17:listing:collembola-reduc]{Chapters/chapter17/code/collem_kernel_reduc.cl}
410 \subsection{Collembola performance}
412 In this part we present the performance of the collembola model on
413 various CPU and GPU execution
414 platforms. Figure~\ref{ch17:fig:mior_perfs_collem} shows that the
415 number of cores and the processor architecture as a strong influence
416 on the obtained results
419 % dual-core processor equipping our Thinkpad platform has two to six
420 % longer executions times, compared to a six-core Phenom X6.
422 % % \begin{figure}[h]
423 % %begin{minipage}[t]{0.49\linewidth}
424 % \centering \includegraphics[width=0.7\linewidth]{./Chapters/chapter17/figs/collem_perfs.pdf}
425 % \caption{Performance CPU et GPU du modèle Collemboles}
426 % \label{ch17:fig:mior_perfs_collem}
430 % In figure~\ref{ch17:fig:mior_perfs_collem2} the Thinkpad curve is removed
431 % to make other trends clearer. Two more observation can be made, using
432 % this more detailled scale:
435 \item Older GPU cards can be slower than modern processors. This can
436 be explained by the new cache and memory access optimizations
437 implemented in newer generations of GPU devices. These optimizations
438 reduce the penalties associated with irregular and frequent global
439 memory accesses. They are not available on our Tesla nodes.
440 \item GPU curves exhibit an odd-even pattern in their performance
441 results. Since this phenomenon is visible on two distinct
442 manufacturer hardware, driver, and OpenCL implementation, it is
443 likely the result of the decomposition process based on warp of
444 fixed, power-of-two sizes.
445 \item The number of cores is not the only determining factor: an Intel
446 Core i7 2600K processor, even with only four cores, can provide
447 better performance than a Phenom one.
451 %begin{minipage}[t]{0.49\linewidth}
453 \includegraphics[width=0.7\linewidth]{./Chapters/chapter17/figs/collem_perfs_nothinkpad.pdf}
454 \caption{Performance of the Collembola model on CPU and GPU.}
455 \label{ch17:fig:mior_perfs_collem}
459 Both graphs show that using the GPU to parallelize part of the
460 simulator results in tangible performance gains over a CPU execution
461 on modern hardware. These gains are more mixed on older GPU platforms
462 due to the limitations when dealing with irregular memory or execution
463 patterns often encountered in MAS systems. This can be closely linked
464 to the availability of caching facilities on the GPU hardware and its
465 dramatic effects depend on the locality and frequency of data
466 accesses. In this case, even if the Tesla architecture offers more
467 execution cores and is the far costlier solution, more recent,
468 cheaper, solutions such as high-end GPU provide better performance
469 when the execution is not constrained by memory size.
471 \section{Second example}
472 \label{ch17:sec:2ndmodel}
474 The second model, the MIOR model, simulates the behavior of microbian
475 colonies. Its execution model is more complex so that it requires
476 changes in the initial algorithm and the use of synchronization to
477 benefit from the GPU architecture.
479 \subsection{The MIOR model}
480 \label{ch17:subsec:miormodel}
482 The MIOR~\cite{C.Cambier2007} model was developed to simulate local
483 interactions in soil between microbial colonies and organic
484 matters. It reproduces each small cubic unit ($0.002 m^3$) of soil as
487 Multiple implementations of the MIOR model have already been
488 realized, in Smalltalk and Netlogo, in 2 or 3 dimensions. The last
489 implementation, used in our work and referenced as MIOR in the
490 rest of the chapter, is freely accessible online as
491 WebSimMior~\footnote{http://www.IRD.fr/websimmior/}.
493 MIOR is based around two types of agents: (1) the Meta-Mior (MM),
494 which represents microbial colonies consuming carbon and (2) the
495 Organic Matter (OM) which represents carbon deposits occurring in
498 The Meta-Mior agents are characterized by two distinct behaviors:
500 \item \emph{breath}: this action converts mineral carbon from the soil
501 to carbon dioxide ($CO_{2}$) that is released into the soil and
502 \item \emph{growth}: by this action each microbial colony fixes the
503 carbon present in the environment to reproduce itself (augments its
504 size). This action is only possible if the colony breathing needs
505 are covered, i.e., enough mineral carbon is available.
508 These behaviors are described in Algorithm~\ref{ch17:seqalgo}.
511 \caption{evolution step of each Meta-Mior (microbial colony) agent}
513 \KwIn{A static array $mmList$ of MM agents}
514 \myinput{A static array $omList$ of OM agents}
515 \myinput{A MIOR environment $world$}
516 $breathNeed \gets world.respirationRate \times mm.carbon$\;
517 $growthNeed \gets world.growthRate \times mm.carbon$\;
518 $availableCarbon \gets totalAccessibleCarbon(mm)$\;
519 \uIf{$availableCarbon > breathNeed$}{
521 $mm.active \gets true$\;
522 $availableCarbon \gets availableCarbon - consumCarbon(mm, breathNeed)$\;
523 $world.CO2 \gets world.CO2 + breathNeed$\;
524 \If{$availableCarbon > 0$}{
526 $growthConsum \gets max(totalAccessCarbon(mm), growthNeed)$\;
527 $consumCarbon(mm, growthConsum)$\;
528 $mm.carbon \gets mm.carbon + growthConsum$\;
532 $mm.active \gets false$
536 Since this simulation takes place at a microscopic scale, a large
537 number of these simulations must be executed for each macroscopic
538 simulation step to model a realistic-sized unit of soil. This leads to
539 large computing needs despite the small computation cost of each
540 individual simulation.
543 \subsection{MIOR implementation}
545 As pointed out previously, the MIOR implementation implied more
546 changes for the initial code to be run on GPU. As a first attempt, we
547 tried a simple GPU implementation of the MIOR simulator, with only
548 minimal changes to the CPU algorithm. Execution times showed the
549 inefficiency of this approach and highlighted the necessity of
550 adapting the simulator to take advantage of the GPU execution
551 capabilities~\cite{lmlm+12:ip}. In this part, we show the main changes
552 which were realized to adapt the MIOR simulator on GPU architectures.
554 \subsubsection{Execution mapping on GPU}
558 \includegraphics[width=0.7\textwidth]{Chapters/chapter17/figs/repartition.pdf}
559 \caption{Consolidation of multiple simulations in one OpenCL kernel execution.}
560 \label{ch17:fig:gpu_distribution}
563 Each MIOR simulation is represented by a work-group, and each agent by
564 a work-item. A kernel is in charge of the life cycle process for each
565 agent of the model. This kernel is executed by all the work-items of
566 the simulation each on its own GPU core.
568 The usage of one work-group for each simulation allows the easy
569 execution of multiple simulations in parallel, as shown on
570 figure~\ref{ch17:fig:gpu_distribution}. By taking advantage of the
571 execution overlap possibilities provided by OpenCL, it then becomes
572 possible to exploit all the cores at the same time, even if an unique
573 simulation is too small to use all the available GPU cores. However,
574 the maximum size of a work-group is limited (to $512$), which allows
575 us to execute only one simulation per work-group when using $310$
576 threads (number of OM in the reference model) to execute the
579 The usage of the GPU to execute multiple simulations is initiated by
580 the CPU. The CPU keeps total control of the simulator execution
581 flow. Thus, optimized scheduling policies (such as submitting kernels
582 in batch, limiting the number of kernels, or asynchronously retrieving
583 the simulation results) can be defined to minimize the cost related to
584 data transfers between CPU and GPUs.
586 \subsubsection{Data structures translation}
587 \label{ch17:subsec:datastructures}
589 The adaptation of the MIOR model to GPU requires the mapping of the
590 data model to OpenCL data structures. The environment and the agents
591 are represented by arrays of structures, where each structure
592 describes the state of one entity. The behaviors of these entities are
593 implemented as OpenCL functions to be called from the kernels during
596 Since the main program is written in Java, JOCL is responsible for the
597 allocation and mapping of the object data structures to OpenCL ones
600 Four main data structures are defined: (1) an array of MM agents,
601 representing the state of the microbial colonies. (2) an array of OM
602 agents, representing the state of the carbon deposits. (3) a topology
603 matrix, which stores accessibility information between the two types
604 of agents of the model (4) a world structure, which contains all the
605 global input data (metabolism rate, numbers of agents) and output data
606 (quantity of $CO_{2}$ produced) of the simulation. The C-like OpenCL
607 structures used to represent each type of to agent and the environment
609 (Figure~\ref{ch17:listing:mior_data_structures}). These data
610 structures are initialized by the CPU and then copied on the GPU.
613 \lstinputlisting[caption=main data structures used in a MIOR simulation,label=ch17:listing:mior_data_structures]{Chapters/chapter17/code/data_structures.cl}
615 The world topology is stored as a two-dimension matrix which
616 represents OM indexes on the abscissa and MM indexes on the
617 ordinate. Each agent walks through its line/column of the matrix at
618 each iteration to determinate which agents can be accessed during the
619 simulation. Since many agents are not connected, this matrix is
620 sparse, which introduces a big number of useless memory accesses. To
621 reduce the impact of these memory accesses we use a compacted,
622 optimized representation of this matrix based
623 on~\cite{Gomez-Luna:2009:PVS:1616772.1616869}, as illustrated in
624 Figure~\ref{ch17:fig:csr_representation}. This compact representation
625 considers each line of the matrix as an index list, and only stores
626 accessible agents compactly, to reduce the number of non-productive
631 \includegraphics[width=0.8\textwidth]{Chapters/chapter17/figs/grid.pdf}
632 %\psfig{file=figs/grid, height=1in}
633 \caption{Compact representation of the topology of a MIOR simulation.}
634 \label{ch17:fig:csr_representation}
637 Since dynamic memory allocation is not possible yet in OpenCL and is
638 only provided in the latest revisions of the CUDA standard, these
639 matrices are statically allocated. The allocation is based on the
640 worst-case scenario where all OM and MM are linked since the real
641 occupation of the matrix cells cannot be deduced without some kind of
642 preprocessing computations.
644 \subsubsection{Critical resources access management}
645 \label{ch17:subsec:concurrency}
647 One of the main concers in the MIOR model is to ensure that all the
648 microbial colonies will have an equitable access to carbon resources,
649 when multiple colonies share the same deposits. Access
650 synchronizations are mandatory in these cases to prevent conflicting
651 updates on the same data that may lead to calculation error (e.g. loss
654 On massively parallel architectures such as GPUs, these kind of
655 synchronization conflicts can lead to an inefficient implementation by
656 enforcing a quasi-sequential execution. It is necessary, in the case
657 of MIOR as well as for other ABM, to ensure that each work-item is not
658 too constrained in its execution.
661 \lstinputlisting[caption=main MIOR kernel,label=ch17:listing:mior_kernels]{Chapters/chapter17/code/mior_kernels.cl}
663 From the sequential algorithm (Algorithm~\ref{ch17:seqalgo}) where all
664 the agents share the same data, we have developed a parallel algorithm
665 composed of three sequential stages separated by synchronization
666 barriers. This new algorithm is based on the distribution of the
667 available OM carbon deposits into parts at the beginning of each
668 execution step. The three stages, illustrated in
669 Listing~\ref{ch17:listing:mior_kernels}, are the following:
672 \item \emph{scattering}: the available carbon in each carbon deposit
673 (OM) is equitably dispatched among all accessible MM in the form of
675 \item \emph{live}: each MM consumes carbon in its allocated parts for
676 its breathing and growing processes, and
677 \item \emph{gathering}: unconsumed carbon in parts is gathered back
678 into the carbon deposits.
681 This solution suppresses the data synchronization needed by the first
682 algorithm, thus the need for synchronization barriers, and requires
683 only one kernel launch from Java as described on
684 Listing~\ref{ch17:fig:mior_launcher}.
686 \lstinputlisting[caption=MIOR simulation launcher,label=ch17:fig:mior_launcher]{Chapters/chapter17/code/mior_launcher.java}
688 \subsubsection{Termination detection}
690 The termination of a MIOR simulation is reached when the model
691 stabilizes and no more $CO_{2}$ is produced. This termination
692 detection can be done on either the CPU or the GPU but it requires a
693 global view on the system execution.
695 In the first case, when the CPU controls the GPU simulation process,
696 the detection is done in two steps: (1) the CPU starts the execution
697 of a simulation step on the GPU and (2) the CPU retrieves the GPU data
698 and determines if another iteration must be launched or if the
699 simulation has terminated. This approach allows a fine-grain control
700 over the GPU execution, but it requires many costly data transfers as
701 each iteration result must be sent from the GPU to the CPU. In the
702 case of the MIOR model these costs are mainly due to the inherent
703 PCI-express port latencies rather than to bandwidth limitation since
704 data sizes remains rather small, on the order of few dozens of
707 In the second case the termination detection is directly implemented
708 on the GPU by checking the amount of available carbon between two
709 iterations. The CPU does not have any feedback while the simulation is
710 running, but retrieves the results once the kernel execution is
711 finished. This approach minimizes the number of transfers between the
714 \subsection{Performance of MIOR implementations}
715 \label{ch17:subsec:miorexperiments}
717 In this part we present several MIOR GPU implementations using the
718 distribution/gathering process described in the previous section and
719 compare their performance on two distinct hardware platform, i.e., two
720 different GPU devices. Five incremental MIOR implementations were
721 realized with an increasing level of adaptation for the algorithm: in
722 all cases, we choose the average time over 50 executions as a
723 performance indicator.
726 \item The \textbf{GPU 1.0} implementation is a direct implementation
727 of the existing algorithm and its data structures where data
728 dependencies were removed, and it uses the non-compact topology
729 representation described in Section~\ref{ch17:subsec:datastructures}
730 \item The \textbf{GPU 2.0} implementation uses the previously
731 described compact representation of the topology and remains
732 otherwise identical to the GPU 1.0 implementation.
733 \item The \textbf{GPU 3.0} implementation introduces the manual
734 copy into local (private) memory of often-accessed global data, such
735 as carbon parts or topology information.
736 \item The \textbf{GPU 4.0} implementation is a variant of the GPU 1.0
737 implementation but allows the execution of multiple simulations for
738 each kernel execution.
739 \item the \textbf{GPU 5.0} implementation is a multi-simulation
740 version of the GPU 2.0 implementation, using the execution of
741 multiple simulations for each kernel execution as for GPU 4.0.
744 The two last implementations \textbf{GPU 4.0} and \textbf{GPU 5.0}
745 illustrate the gain provided by a better usage of the hardware
746 resources, thanks to the driver execution overlapping capabilities. A
747 sequential version of the MIOR algorithm, labeled \textbf{CPU}, is
748 included for comparison purpose. This sequential version was developed
749 in Java, the same language used for GPU implementations.
751 For these performance evaluations, two platforms are used. The first
752 one is representative of the kind of hardware which is available on
753 HPC clusters. It is a cluster node dedicated to GPU computations with
754 two Intel X5550 processors running at $2.67$GHz and one Tesla C1060
755 GPU device running at $1.3$GHz and composed of $240$ cores ($30$
756 multi-processors). The second platform illustrates what can be
757 expected from a personal desktop computer built a few years ago. It
758 uses an Intel Q9300 CPU, running at $2.5$GHz, and a Geforce 8800GT GPU
759 running at $1.5$GHz and composed of $112$ cores ($14$
760 multi-processors). The purpose of these two platforms is to assess the
761 benefit that could be obtained when a scientist has access either to
762 specialized hardware as a cluster or tries to take advantage of its
763 own personal computer.
766 %begin{minipage}[t]{0.49\linewidth}
768 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/mior_perfs_tesla.pdf}
769 %\caption{Performance CPU and GPU sur carte Tesla C1060}
770 \caption{CPU and GPU performance on a Tesla C1060 node.}
771 \label{ch17:fig:mior_perfs_tesla}
775 Figures~\ref{ch17:fig:mior_perfs_tesla}~and~\ref{ch17:fig:mior_perfs_8800gt}
776 show the execution time for $50$ simulations on the two hardware
777 platforms. A size factor is applied to the problem: at scale 1, the
778 model contains $38$ MM and $310$ OM, while at the scale 6 these
779 numbers are multiplied by six. The size of the environment is modified
780 as well to maintain the same average agent density in the model. This
781 scaling factor displays the impact of the chosen size of simulation on
784 %hspace{0.02\linewidth}
785 %begin{minipage}[t]{0.49\linewidth}
788 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/mior_perfs_8800gt.pdf}
789 \caption{CPU and GPU performance on a personal computer with a Geforce 8800GT}
790 \label{ch17:fig:mior_perfs_8800gt}
794 \b The charts show that for small problems the execution times of all
795 the implementations are very close. This is because the GPU execution
796 does not have enough threads (representing agents) for an optimal
797 usage of GPU resources. This trend changes around scale $5$ where GPU
798 2.0 and GPU 3.0 take the advantage over the GPU 1.0 and CPU
799 implementations. This advantage continues to grow with the scaling
800 factor, and reaches a speedup of $10$ at the scale $10$ between the
801 fastest single-simulation GPU implementation and the first, naive one
804 Multiple trends can be observed in these results. First, optimizations
805 for the GPU hardware show a large, positive impact on performance,
806 illustrating the strong requirements on the algorithm properties to
807 reach execution efficiency. These charts also show that despite the
808 vast difference in numbers of cores between the two GPU platforms, the
809 same trends can be observed in both cases. We can therefore expect
810 similar results on other GPU cards, without the need for more
815 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/monokernel.pdf}
816 \caption{Execution time of one multi-simulation kernel on the Tesla
818 \label{ch17:fig:monokernel_graph}
823 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/multikernel.pdf}
824 \caption{Total execution time for 1000 simulations on the Tesla
825 platform, while varying the number of simulations for each kernel.}
826 \label{ch17:fig:multikernel_graph}
829 There are two ways to measure simulations performance: (1) by
830 executing only one kernel, and varying its size (the number of
831 simulations executed in parallel), as shown in
832 Figure~\ref{ch17:fig:monokernel_graph}, to test the costs linked to
833 the parallelization process or (2) by executing a fixed number of
834 simulations and varying the size of each kernel, as shown in
835 Figure~\ref{ch17:fig:multikernel_graph}.
837 Figure~\ref{ch17:fig:monokernel_graph} illustrates the execution time
838 for only one kernel. It shows that for small numbers of simulations
839 run in parallel, the compact implementation of the model topology is
840 faster than the two-dimension matrix representation. This trends
841 reverse with more than $50$ simulations in parallel, which can be
842 explained either by the nonlinear progression of the synchronization
843 costs or by the additional memory required for the access-efficient
846 Figure~\ref{ch17:fig:multikernel_graph} illustrates the execution time
847 of a fixed number of simulations. It shows that for a small number of
848 simulations run in parallel, the costs resulting from program setup,
849 data copies, and launch on GPU are very detrimental to
850 performance. Once the number of simulations executed for each kernel
851 grows, these costs are counterbalanced by computation costs. This
852 trend is more marked in the case of the sparse implementation (GPU
853 4.0) than in the compact one but appears on both curves. With more
854 than $30$ simulations for each kernel, execution times stall, since
855 hardware limits are reached. This indicates that the cost of preparing
856 and launching kernels become negligible compared to the computing time
857 once a good GPU occupancy rate is achieved.
859 \section{Analysis and recommendations}
860 \label{ch17:analysis}
862 In this section we synthesize the observations done on the two models
863 and identify some recommendations for implementing complex systems on
866 \subsection{Analysis}
868 In both the collembola and the MIOR model, a critical problematic is
869 the determination of the parts of the simulation that are to be run on
870 GPU and which are to remain on the CPU. The determination of these
871 parts is a classic step of any algorithm parallelization and must take
872 into account considerations such as the cost of the different parts of
873 the algorithm and the expected gains.
875 In the case of the collembola model two steps of the algorithm were
876 ported to GPU. Both steps use straightforward, easily parallelizable
877 operations where a direct gain can be expected by using more execution
878 cores without important modifications to the algorithm.
880 In the MIOR model case, however, no such inherently parallelizable
881 parts are evident in the original sequential algorithm. This is mainly
882 explained by the rate of interactions between agents in this model in
883 the form of two operations (breathing, growth) using heavily-shared
884 carbon resources. In this case the algorithm had to be more profoundly
885 modified while keeping in mind the need to remain true to the original
886 model, to synchronize the main execution step of all agents in the
887 model, to ensure equity, and to minimize the numbers of
888 synchronizations. The minimization is done by factoring the
889 distribution of carbon in the model in two separated steps at the
890 beginning and the end of each iteration rather than at multiple points
893 \subsection{MAS execution workflow}
895 Many MAS simulations decompose their execution process into discrete
896 evolution steps where each step represents a quantum of time (minimal
897 unit of time described). At the end of each step many global data,
898 graphical displays or output files are updated. This execution model
899 may not correctly fit on GPU platforms as they assume more or less a
900 batch-like workflow model. The execution model must be split into the
901 following ever repeating steps:
904 \item Allocation of GPU data buffers
905 \item Copy of data from CPU to GPU
906 \item GPU kernels execution
907 \item Copy of results from GPU to CPU
910 This workflow works well if the considered data transfer time is
911 negligible compared to GPU execution or can be done in parallel,
912 thanks to the asynchronous nature of OpenCL. If we are to update the
913 MAS model after each iteration then performance risks being
914 degraded. This is illustrated in the MIOR model by the fact that the
915 speedup observed on GPU is much more significant for bigger
916 simulations, which imply longer GPU execution times. Our solution to
917 this problem is to desynchronize the execution of the MAS model and
918 its GPU parts by requesting the execution of multiple steps of the GPU
919 simulations for each launch.
921 Another related prerequisite of GPU execution is the ability to have
922 many threads executed, to allow an efficient exploitation of the
923 superior number of cores provided by the architecture. In the case of
924 MAS models, this means that executing one agent at a time on GPU is
925 meaningless in regard to GPU usage, copying cost, and actual gain in
926 execution time, if the agent computations are not complex enough. In
927 the MIOR and the collembola models, this is solved by executing the
928 computations for all agents of the model at the same time. If the
929 model has only chronic needs for intensive computations, then some
930 kind of batching mechanism is required to store waiting treatments in
931 a queue, until the total sum of waiting computations justifies the
932 transfer cost to the GPU platform.
934 \subsection{Implementation challenges}
936 Besides the execution strategy challenges described above, some
937 implementation challenges also occur when implementing an OpenCL
938 version of a MAS model, mainly related to the underlying limitations
939 of the execution platform.
941 The first one is the impossibility (except in latest CUDA versions) to
942 dynamically allocate memory during execution. This is a problem in the
943 case of models where the number of agents can vary during the
944 simulation, such as prey-predator models. In this case, the only
945 solution is to overestimate the size of arrays or data structures to
946 accommodate these additional individuals, or to use the CPU to resize
947 data structures when these situations occur. Both approaches require
948 trending either memory or performance and are not always practical.
950 Another limitation is the impossibility to store pointers in data
951 structures, since OpenCL only allows one dimension static arrays. This
952 precludes the usage of structures such as linked-list, graphs or
953 sparse matrices not represented by some combination of static arrays,
954 and can be another source of memory or performance losses.
956 In the case of MIOR, this problem is especially exacerbated in the
957 case of neighboring storage: both representations consume much more
958 memory than is actually required, since the worst case (all agents
959 have access to all others agents) has to be taken into account when
960 dimensioning the data structure.
962 The existence of different generations of GPU hardware is also a
963 challenge. Older implementations, such as the four year old Tesla
964 C1060 cards, have very strong constraints in term of memory accesses
965 and requires very regular access patterns to perform efficiently. MAS
966 having many random accesses, such as MIOR, or many small global memory
967 accesses, such as Collembola, are penalized on these older
968 cards. Fortunately, these requirements are less present is modern
969 cards, which offer caches and other facilities traditionally present
970 on CPU to offset these kinds of penalties.
972 The final concern is related to the previous ones and often results in
973 more memory consumption. The amount of memory available on GPU cards
974 is much more limited and adding new memory capabilities is more costly
975 compared to expending a CPU RAM. On computing clusters, hardwares
976 nodes with 128GB of memory or more have become affordable, whereas
977 newer Tesla architecture remains limited to 16GB of memory. This can
978 be a concern in the case of big MAS models, or small ones which can
979 only use memory-inefficient OpenCL structures.
984 As shown in the previous sections, many data representation choices
985 are common to entire classes of MAS. The paradigm of grid, for
986 example, is often encountered in models where each cell constitutes
987 either the elementary unit of simulation
988 (SugarScape~\cite{Dsouza2007}) or a discretization of the environment
989 space (Pathfinding~\cite{Guy09clearpath}). These grids can be
990 considered as two- or three-dimensional matrices, whose processing can
991 be directly distributed.
993 Another common data representation encountered in MAS system is the
994 usage of 2D or 3D coordinates to store the position of each agent of
995 the model. In this case, even if the environment is no longer
996 discrete, the location information still imply computations (Euclidean
997 distances) which can be parallelized.
999 MCSMA~\cite{lmlm+13:ip} is a framework developed to provide to the MAS
1000 designer those basic data structures and the associated operations, to
1001 facilitate the portage of existing MAS on GPU. Two levels of
1002 utilization are provided to the developer, depending on its usage
1006 \item A high-level library, composed of modules regrouping classes of
1007 operations. Such operation can distance computations in 1D, 2D or 3D
1008 grids, diffusion or reduction operations on matrices...
1009 \item A low-level API which allows the developer direct access to the
1010 GPU and the inner working of MCSMA, to develop new modules in the
1011 case where the required operations are not yet provided by the
1015 Both usage levels were illustrated in the above two practical
1016 cases. In MIOR, the whole algorithm (baring initialization) is ported
1017 on GPU as a specific plugin which allows executing $n$ MIOR
1018 simulations and retrieving their results. This first approach requires
1019 extensive adaptations to the original algorithm. In collembola, to
1020 the contrary, the main steps of the algorithm remain executed on the
1021 CPU, and only specific operations are delegated to generic, already
1022 existing diffusion and reduction kernels. The fundamental algorithm is
1023 not modified and GPU execution is only applied to specific parts of
1024 the execution which may benefit from it. These two programming
1025 approaches allow incremental adaptations of existing Java MAS to
1026 accelerate their execution on GPU, while retaining the option to
1027 develop their own reusable or more efficient module to supplement the
1028 already existing ones.
1030 \section{Conclusion}
1031 \label{ch17:conclusion}
1033 This chapter has addressed the issue of complex system simulation by
1034 using agent-based paradigms and GPU hardware. From the experiments on
1035 two existing agent-based models of soil science we have provided
1036 useful information on the architecture, the algorithm design, and the
1037 data management to run agent-based simulations on GPU, and more
1038 generally to run computationally intensive applications that are not
1039 based on purely-matricial models. The first result of this work is
1040 that adapting the algorithm to a GPU architecture is possible and
1041 suitable to speed up agent based simulations as illustrated by the
1042 MIOR model. Coupling CPU with GPU seems to be an interesting way to
1043 take better advantage of the power given by computers and clusters as
1044 illustrated by the collembola model. From our point of view the
1045 adaptation process is less costy in time than a traditional
1046 parallelization on distributed nodes and not much difficult than a
1047 standard multithreaded parallelization, since all the data remains on
1048 the same host and can be shared in central memory. The usage of OpenCL
1049 also enables a portable simulator that can be run on different
1050 graphical units. Even using a mainstream card such as the GPU card of
1051 a standard computer can lead to significant performance
1052 improvements. This is an interesting result as it opens up the field
1053 of inexpensive HPC to a large community.
1055 In this perspective, we are working on MCSMA, a development platform
1056 that would facilitate the use of GPU or many-core architectures for
1057 multi-agent simulations. Our first work has been the definition of
1058 common, efficient, reusable data structures, such as grids or
1059 lists. Another goal is to provide easier means to control the
1060 distribution of specific processes between CPU or GPU, to allow the
1061 easy exploitation of the strengths of each platform in the same
1062 multi-agent simulation. We think that the same approach, i.e.,
1063 developing specific environments that facilitate the developer access
1064 to the GPU power, can be applied in many domains with computationally
1065 intensive needs to open the GPU use to a larger community.
1067 \putbib[Chapters/chapter17/biblio]