BookGPU/Chapters/chapter17/ch17.tex

   1 \chapterauthor{Guillaume Laville, Christophe Lang, Bénédicte Herrmann,
   2   and Laurent Philippe}{Femto-ST Institute, University of
   3   Franche-Comte, France}
   4 %\chapterauthor{Christophe Lang}{Femto-ST Institute, University of Franche-Comte, France}
   5 \chapterauthor{Kamel Mazouzi}{Franche-Comte Computing Center, University of Franche-Comte, France}
   6 \chapterauthor{Nicolas Marilleau}{UMMISCO, Institut de Recherche pour le Developpement (IRD), France}
   7 %\chapterauthor{Bénédicte Herrmann}{Femto-ST Institute, University of Franche-Comte, France}
   8 %\chapterauthor{Laurent Philippe}{Femto-ST Institute, University of Franche-Comte, France}
   9
  10 \newlength\mylen
  11 \newcommand\myinput[1]{%
  12   \settowidth\mylen{\KwIn{}}%
  13   \setlength\hangindent{\mylen}%
  14   \hspace*{\mylen}#1\\}
  15
  16 \chapter{Implementing multi-agent systems on GPU}
  17 \label{chapter17}
  18
  19
  20 \section{Introduction}
  21 \label{ch17:intro}
  22
  23 In this chapter we introduce the use of Graphical Processing Units
  24 (GPU) for multi-agents-based systems as an example of a not-so-regular
  25 application that could benefit from the GPU computing
  26 power. Multi-Agent Systems (MAS) are a simulation paradigm used to
  27 study the behavior of dynamic systems. Dynamic systems as physical
  28 systems are often modeled by mathematical representations and their
  29 dynamic behavior is simulated by differential equations. The
  30 simulation of the system thus often relies on the resolution of a
  31 linear system that can be efficiently computed on a graphical
  32 processing unit as shown in the preceding chapters. But when the
  33 behavior of the system elements is not uniformly driven by the same
  34 law, when these elements have their own behavior, the modeling process
  35 is too complex to rely on formal expressions. In this context MAS is a
  36 recognized approach to model and simulate systems where individuals
  37 have an autonomous behavior that cannot be simulated by the evolution
  38 of a set of variables driven by mathematical laws. MAS are often used
  39 to simulate natural or collective phenomena whose individuals are too
  40 numerous or various to provide a unified algorithm describing the
  41 system evolution. The agent-based approach is to divide these complex
  42 systems into individual self-contained entities with their smaller set
  43 of attributes and functions. But, as for mathematical simulations,
  44 when the size of the MAS increases, the need of computing power and
  45 memory also increases. For this reason, multi-agent systems should
  46 benefit from the use of distributed computing architectures. Clusters
  47 and grids are often identified as the main solution to increase
  48 simulation performance but GPUs are also a promising technology with
  49 an attractive performance/cost ratio.
  50
  51 Conceptually a MAS\index{multi-agent system} is a distributed system
  52 as it favors the definition and description of large sets of
  53 individuals, the agents, that can be run in parallel. As a large set
  54 of agents could have the same behavior, a Single Instruction Multiple
  55 Data (SIMD) execution architecture should fit the simulation
  56 execution. Most of the agent-based simulators are, however, designed
  57 with a sequential scheme in mind, and these simulators seldom use more
  58 than one core for their execution. Due to simulation scheduling
  59 constraints, data sharing and exchange between agents and the huge
  60 amount of interactions between agents and their environment, it is
  61 indeed rather difficult to distribute an agent based simulator, for
  62 instance, to take advantage of new multithreaded computer
  63 architectures. Thus, guidelines and tools dedicated to MAS paradigm
  64 and High Performance Computing (HPC) are now a need for other complex
  65 system communities. Note that, from the described structure (large
  66 number of agents sharing data), we can conclude that MAS would more
  67 easily benefit from many-core architectures than from other kinds of
  68 parallelism.
  69
  70 Another key point that advocates for the use of many-core in MAS is
  71 the growing need for multiscale simulations. Multiscale simulations
  72 explore problems with interactions between several scales. The
  73 different scales use different granularity of the structure and
  74 potentially different models. Most of the time the lower scale
  75 simulations provide results to higher scale simulations. In that case
  76 the execution of the simulations can easily be distributed between the
  77 local cores and a many-core architecture, i.e., a GPU device.
  78
  79 We explore in this chapter the use of many-core architectures to
  80 execute agent-based simulations. We illustrate our reflexion with two
  81 cases: the Collembola simulator designed to simulate the diffusion of
  82 Collembola between plots of land and the MIOR (MIcro ORganism)
  83 simulator that reproduces effects of earthworms on bacteria dynamics
  84 in a bulked soil. In Section \ref{ch17:ABM} we present the work
  85 related to MAS and parallelization with a special focus on many-core
  86 use. In sections \ref{ch17:sec:1stmodel} and \ref{ch17:sec:2ndmodel}
  87 we present in detail two multi-agent models, their GPU
  88 implementations, the conducted experiments, and their performance
  89 results. The first model, given in Section \ref{ch17:sec:1stmodel},
  90 illustrates the use of a GPU architecture to speed up the execution of
  91 some computation-intensive functions while the main model is still
  92 executed on the central processing unit. The second model, given in
  93 Section \ref{ch17:sec:2ndmodel}, illustrates the use of a GPU
  94 architecture to implement the whole model on the GPU processor which
  95 implies deeper changes in the initial algorithm. In Section
  96 \ref{ch17:analysis} we propose a more general reflexion on these
  97 implementations and provide some guidelines. Then, we conclude in
  98 Section \ref{ch17:conclusion} on the possible generalization of our
  99 work.
 100
 101
 102 \section{Running agent-based simulations}
 103 \label{ch17:ABM}
 104
 105 In this section, we present the context of MAS, their parallelization,
 106 and we report several existing works on using GPU to simulate
 107 multi-agent systems.
 108
 109 \subsection{Multi-agent systems and parallelism}
 110
 111 Agent-based systems are often used to simulate natural or collective
 112 phenomena whose actors are too numerous or various to provide a simple
 113 unified algorithm describing the studied system
 114 dynamic~\cite{Schweitzer2003}. The implementation of an agent based
 115 simulation usually starts by designing the underlying agent-based
 116 model (ABM). Most ABM are based around a few types of entities such as
 117 agents, one environment, or an interaction
 118 organization~\cite{Vowel02}. In the complex system domain, the
 119 environment often describes a real space, its structure (e.g. soil
 120 textures and porosities), and its dynamics (e.g., organic matter
 121 decomposition). It is a virtual world in which agents represent
 122 studied entities (e.g., biotic organisms) evolution. The actual agent
 123 is animated by a behavior that can range between reactivity (only
 124 reacts to external stimuli) and cognition (makes complex decisions
 125 based on environmental and internal factors). Interaction and
 126 organization define functions, types, and patterns of communications
 127 of their member agents in the
 128 system~\cite{Odell:2003:RRD:1807559.1807562}. Note that, depending on
 129 the MAS, agents can communicate either directly through special
 130 primitives or indirectly through the information stored in the
 131 environment.
 132
 133 Agent-based simulations have been used for more than a decade to
 134 reproduce, understand and even predict complex system dynamics. They
 135 have proved their usefulness in various scientific
 136 communities. Nowadays generic agent based frameworks such as
 137 Repast~\cite{repast_home} or NetLogo~\cite{netlogo_home} are promoted
 138 to implement simulators. Many ABMs such as the crown model
 139 representing a city wide scale~\cite{Strippgen_Nagel_2009} tend
 140 however to require a large number of agents to provide a realistic
 141 behavior and reliable global statistics. Moreover, an achieved model
 142 analysis needs to resort to an experiment plan, consisting of multiple
 143 simulation runs, to obtain enough confidence in a simulation. In this
 144 case the available computing power often limits the simulation size,
 145 and the resulting range thus requires the use of parallelism to
 146 explore bigger configurations.
 147
 148 For that, three major approaches can be identified:
 149 \begin{enumerate}
 150 \item parallelizing experiments execution on a cluster or a grid (one
 151   or a few simulations are submitted to each
 152   core)~\cite{Blanchart11,Chuffart2010},
 153 \item parallelizing the simulator on a cluster (the environment of the
 154   MAS is split and run on several distributed
 155   nodes)~\cite{Cosenza2011,MAR06},
 156 \item optimizing the simulator by taking advantage of computer
 157   resources (multi-threading, GPU, and so on) \cite{Aaby10}.
 158 \end{enumerate}
 159
 160 In the first case, experiments are run independent of each other and
 161 only simulation parameters are changed between two runs so that a
 162 simple version of an existing simulator can be used. This approach
 163 does not, however, allow to run larger models.  In the second and the
 164 third case, model and code modifications are necessary. Only a few
 165 frameworks, however, introduce distribution in agent simulation
 166 (Madkit~\cite{Gutknecht2000}, MASON~\cite{Sean05},
 167 repastHPC~\cite{Collier11}), and parallel implementations are often
 168 based on the explicit use of threads using shared
 169 memory~\cite{Guy09clearpath} or cluster libraries such as
 170 MPI~\cite{Kiran2010}.
 171
 172 Parallelizing a multi-agent simulation is, however, complex due to space
 173 and time constraints. Multi-agent simulations are usually based on a
 174 synchronous execution: at each time step, numerous events (space data
 175 modification, agent motion) and interactions between agents happen.
 176 Distributing the simulation on several computers or grid nodes to guarantee a distributed synchronous execution and
 177 coherency. This often leads to poor performance or complex
 178 synchronization problems. Multicore execution or delegating part of
 179 this execution to others processors such as GPUs~\cite{Bleiweiss_2008}
 180 is usually easier to implement since all the threads share the data
 181 and the local clock.
 182
 183 % Different agent patterns can be adopted in an ABMs such as
 184 % cognitive and reactive ones~\cite{Ferber99}. Cognitive agents act on
 185 % the environment and interact with other agents according to a complex
 186 % behavior. This behavior takes a local perception of the virtual world
 187 % and the agent past (a memory characterized by an internal state and
 188 % belief, imperfect knowledge about the world) into account. Reactive
 189 % agents have a much systematic pattern of action based on stimuli
 190 % response schemes (no or few knowledge and state conservation in agent). The
 191 % evolution of the ABM environment, in particular, is often represented with
 192 % this last kind of agents. As their behavior is usually simple, we
 193 % propose in this chapter to delegate part of the environment and of the
 194 % reactive agents execution to the graphical processing unit of the
 195 % computer. This way we can balance the load between both CPU and GPU
 196 % execution resources.
 197
 198 % In the particular case of multi-scale simulations such as the Sworm
 199 % simulation~\cite{BLA09} the environment may be used at different
 200 % levels. Since the representation of the whole simulated environment
 201 % (i.e. the soil area) would be costly, the environment is organized as
 202 % a multi-level tree of small soil cubes which can be lazily
 203 % instantiated during the simulation. This allows to gradually refine
 204 % distribution details in units of soil as agents progress and need
 205 % those information, by using a fractal process based on the
 206 % bigger-grained already instantiated levels. This characteristic,
 207 % especially for a fractal model, could be the key of the
 208 % distribution. For instance, each branch of a fractal environment could
 209 % be identified as an independent area and parallelized. In addition
 210 % Fractal is a famous approach to describe multi-scale environment (such
 211 % as soil) and its organization~\cite{perrier}. In that case the lower
 212 % scale simulations can also be delegated to the GPU card to limit the
 213 % load of the main (upper scale) simulation.
 214
 215 \subsection{MAS implementation on GPU}
 216 \label{ch17:subsec:gpu}
 217
 218 The last few years have seen the appearance of new generations of
 219 graphic cards based on more general purpose execution units which are
 220 promising for large systems such as MAS. Using matrix-based data
 221 representations and SIMD computations is however not always
 222 straightforward in MAS, where data structures and algorithms are
 223 tightly coupled to the described simulation. However, works from
 224 existing literature show that MAS can benefit from these performance
 225 gains on various simulation types, such as traffic
 226 simulation~\cite{Strippgen_Nagel_2009}, cellular
 227 automata~\cite{Dsouza2007}, mobile-agent based
 228 path-finding~\cite{Silveira:2010:PRG:1948395.1948446} or genetic
 229 algorithms~\cite{Maitre2009}. Note that an application-specific
 230 adaptation process was required in the case of these MAS: some of the
 231 previous examples are driven by mathematical laws (path-finding) or
 232 use a natural mapping between a discrete environment (cellular
 233 automaton) and GPU cores. Unfortunately, this mapping often requires
 234 algorithmic adaptations in other models but experience shows that the
 235 more reactive a MAS is the more adapted its implementation is to GPU.
 236
 237 The first step in the adaptation of an ABM to GPU platforms is the
 238 choice of language. On the one hand, the Java programming language is
 239 often used for the implementation of MAS due to its availability on
 240 numerous platforms or frameworks and its focus on high-level,
 241 object-oriented programming. On the other hand, GPU platforms can only
 242 run specific languages such as OpenCL or CUDA. OpenCL (supported on
 243 AMD, Intel, and NVIDIA hardware) better suits the portability concerns
 244 across a wide range of hardware needed the agent simulators, as
 245 opposed to CUDA which is an NVIDIA-specific library.
 246
 247 OpenCL is a C library which provides access to the underlying CPU or
 248 GPU threads using an asynchronous interface. Various OpenCL functions
 249 allow the compilation and the execution of programs on these execution
 250 resources, the copying of data buffers between devices, and the
 251 collection of profiling information.
 252
 253 This library is based around three main concepts:
 254
 255 \begin{itemize}
 256 \item the \emph{kernel} (similar to a CUDA kernel), which represents a runnable program
 257   containing instructions to be executed on the GPU;
 258 \item the \emph{work-item} (equivalent to a CUDA thread), which is analogous to the concept
 259   of thread on GPU, in that it represents one running instance of a
 260   GPU kernel; and
 261 \item the \emph{work-group} (or execution block) which is a set of work-items
 262   sharing some memory to speed up data accesses and
 263   computations. Synchronization operations such as barrier can only be
 264   used across the same work-group.
 265 \end{itemize}
 266
 267 Running an OpenCL computation consists of launching numerous
 268 work-items that execute the same kernel. The work-items are submitted
 269 to a submission queue to optimize the available cores usage. A
 270 calculus is achieved once all these kernel instances have terminated.
 271
 272 The number of work-items used in each work-group is an important
 273 implementation choice which determines how many tasks will share the
 274 same cache memory. Data used by the work-items can be stored as
 275 N-dimensions matrices in local or global GPU memory.  Since the size
 276 of this memory is often limited to a few hundred kilobytes, choosing
 277 this number often implies a compromise between the model
 278 synchronization or data requirements and the available resources.
 279
 280 In the case of agent-based simulations, each agent can be naturally
 281 mapped to a work-item. Work-groups can then be used to represent
 282 groups of agents or simulations sharing common data (such as the
 283 environment) or algorithms (such as the background evolution process).
 284
 285 In the following examples a binding named JOCL~\cite{jocl_home} is
 286 used to access the OpenCL platform from the Java programming language.
 287
 288 In the next sections we present two practical cases that will be
 289 studied in detail, from the model to its implementation and
 290 performance.
 291
 292 \section{A first practical example}
 293 \label{ch17:sec:1stmodel}
 294
 295 The first model, the Collembola model, simulates the propagation of
 296 collembolas in fields and forests. It is based on a diffusion
 297 algorithm which illustrates the case of agents with a simple behavior
 298 and few synchronization problems.
 299
 300 \subsection{The Collembola model\index{collembola model}}
 301 \label{ch17:subsec:collembolamodel}
 302
 303 The Collembola model is an example of multi-agent system using GIS
 304 (Geographical Information System) and survey data (population count)
 305 to model the evolution of the biodiversity across land plots. A first
 306 version of this model has been developed with the Netlogo framework by
 307 Bioemco and UMMISCO researchers. In this model, the biodiversity is
 308 modeled by populations of athropod individuals, the Collembola, which
 309 can reproduce and diffuse to favorable new habitats. The simulator
 310 allows us to study the diffusion of Collembola, between plots of land
 311 depending on their use (artifical soil, crop, forest, etc.) In this
 312 model the environment is composed of the studied land, and collembola
 313 are used as agents. Every land plot is divided into several cells,
 314 each cell representing a surface unit (16x16 meters). A number of
 315 individuals per collembola species is associated to each cell. The
 316 model evolution is then based on a common diffusion model that
 317 diffuses individuals between cells. Each step of the simulation is
 318 based on four stages, as shown on
 319 Figure~\ref{ch17:fig:collem_algorithm}:
 320
 321 % \begin{enumerate}
 322 % \item arrival of new individuals
 323 % \item reproduction in each cell
 324 % \item diffusion between cells
 325 % \item updating of colembola lifetime
 326 % \end{enumerate}
 327
 328 \begin{figure}[h]
 329 \centering
 330 \includegraphics[width=0.6\textwidth]{Chapters/chapter17/figs/algo_collem.pdf}
 331 \caption{Evolution algorithm of the Collembola model.}
 332 \label{ch17:fig:collem_algorithm}
 333 \end{figure}
 334
 335 The algorithm is quite simple but includes two costly operations, the
 336 reproduction and the diffusion, that must be parallelized to improve
 337 the model performances.
 338
 339 The {\bf reproduction} stage consists in updating the total population
 340 of each plot by taking the individuals arrived at the preceding
 341 computation step. This stage involves processing the whole set of
 342 cells of the environment to sum their population. The computed value
 343 is recorded in the plot associated to each cell. This process can be
 344 assimilated to a reduction operation on all the population cells
 345 associated to one plot to obtain its population.
 346
 347 The {\bf diffusion} stage simulates the natural behavior of the
 348 Collembola that tends toward occupying the whole space over time. This
 349 stage consists in computing a new value for each cell depending on
 350 the population of the neighbor cells. This process can be assimilated
 351 to a linear diffusion at each iteration of the population of the cells
 352 across their neighbors.
 353
 354 These two processes are quite common in numerical computations so that
 355 the Collembola model can be adapted to a GPU execution without much
 356 difficulty.
 357
 358 \subsection{Collembola implementation}
 359
 360 In the collembola simulator biodiversity is modeled by populations of
 361 collembola individuals, which can reproduce and diffuse to favorable
 362 new habitats. This is implemented as a fixed reproduction factor,
 363 applied to the size of each population, followed by a linear diffusion
 364 of each cell population to its eight neighbors. These reproduction and
 365 diffusion processes are followed by two more steps on the GPU
 366 implementation. The first one consist of culling of populations in an
 367 inhospitable environment, by checking each cell value and terrain
 368 type, and setting its population to zero if necessary.  The final
 369 simulation step is the reduction of the cell populations for each
 370 plot, to obtain an updated plot population for statistic
 371 purposes. This separate computation step, done while updating each
 372 cell population in the reference sequential algorithm, is motivated by
 373 synchronization problems and allows the reduction of the total number
 374 of memory accesses needed to updated those populations.
 375
 376 %\lstinputlisting[language=C,caption=Collembola OpenCL
 377 %kernels,label=fig:collem_kernels]{Chapters/chapter17/code/collem_kernels.cl}
 378 %\pagebreak
 379 \lstinputlisting[caption=collembola openCL diffusion kernel,label=ch17:listing:collembola-diffuse]{Chapters/chapter17/code/collem_kernel_diffuse.cl}
 380
 381 The reproduction, diffusion and culling steps are implemented on GPU
 382 (Figure~\ref{ch17:fig:collem_algorithm}) as a straight mapping of each
 383 cell to an OpenCL work-item (GPU thread). Listing
 384 \ref{ch17:listing:collembola-diffuse} gives the kernel for the
 385 diffusion implementation.  To prevent data coherency problems, the
 386 diffusion step is split into two phases, separated by an execution
 387 {\it barrier}. In the first phase each cell diffusion overflow is
 388 calculated and divided by the number of neighbors. Note that, on the
 389 border of the grid, populations can also overflow outside the
 390 environment grid but we do not manage those external populations,
 391 since there are no reason to assume our model to be isolated of its
 392 surroundings. The overflow by neighbors value is stored for each cell
 393 before encountering the barrier. After the barrier is met, each cell
 394 reads the overflows stored by all its neighbors at the previous step
 395 and applies them to its own population. In this manner, only one
 396 barrier is required to ensure the consistency of population numbers,
 397 since no cell ever modify a value other than its own.
 398
 399 Listing \ref{ch17:listing:collembola-reduc} gives the kernel for the
 400 reduction implementation.  The only step requiring numerous
 401 synchronized accesses is the reduction one: in this first approach, we
 402 chose to use {\it atomic\_add} operation to implement this process,
 403 but more efficient implementations using partial reduction and local
 404 GPU memory could be implemented.
 405
 406 \pagebreak
 407 \lstinputlisting[caption=collembola OpenCL reduction kernel,label=ch17:listing:collembola-reduc]{Chapters/chapter17/code/collem_kernel_reduc.cl}
 408
 409 \subsection{Collembola performance}
 410
 411 In this part we present the performance of the collembola model on
 412 various CPU and GPU execution
 413 platforms. Figure~\ref{ch17:fig:mior_perfs_collem} shows that the
 414 number of cores and the processor architecture as a strong influence
 415 on the obtained results
 416
 417 % : the
 418 % dual-core processor equipping our Thinkpad platform has two to six
 419 % longer executions times, compared to a six-core Phenom X6.
 420
 421 % % \begin{figure}[h]
 422 % %begin{minipage}[t]{0.49\linewidth}
 423 % \centering \includegraphics[width=0.7\linewidth]{./Chapters/chapter17/figs/collem_perfs.pdf}
 424 % \caption{Performance CPU et GPU du modèle Collemboles}
 425 % \label{ch17:fig:mior_perfs_collem}
 426 % %end{minipage}
 427 % \end{figure}
 428
 429 % In figure~\ref{ch17:fig:mior_perfs_collem2} the Thinkpad curve is removed
 430 % to make other trends clearer. Two more observation can be made, using
 431 % this more detailled scale:
 432
 433 \begin{itemize}
 434 \item Older GPU cards can be slower than modern processors. This can
 435   be explained by the new cache and memory access optimizations
 436   implemented in newer generations of GPU devices. These optimizations
 437   reduce the penalties associated with irregular and frequent global
 438   memory accesses. They are not available on our Tesla nodes.
 439 \item GPU curves exhibit an odd-even pattern in their performance
 440   results. Since this phenomenon is visible on two distinct
 441   manufacturer hardware, driver, and OpenCL implementation, it is
 442   likely the result of the decomposition process based on warp of
 443   fixed, power-of-two sizes.
 444 \item The number of cores is not the only determining factor: an Intel
 445   Core i7 2600K processor, even with only four cores, can provide
 446   better performance than a Phenom one.
 447 \end{itemize}
 448
 449 \begin{figure}[h]
 450 %begin{minipage}[t]{0.49\linewidth}
 451 \centering
 452 \includegraphics[width=0.7\linewidth]{./Chapters/chapter17/figs/collem_perfs_nothinkpad.pdf}
 453 \caption{Performance of the Collembola model on CPU and GPU.}
 454 \label{ch17:fig:mior_perfs_collem}
 455 %end{minipage}
 456 \end{figure}
 457
 458 Both graphs show that using the GPU to parallelize part of the
 459 simulator results in tangible performance gains over a CPU execution
 460 on modern hardware. These gains are more mixed on older GPU platforms
 461 due to the limitations when dealing with irregular memory or execution
 462 patterns often encountered in MAS systems. This can be closely linked
 463 to the availability of caching facilities on the GPU hardware and its
 464 dramatic effects depend on the locality and frequency of data
 465 accesses. In this case, even if the Tesla architecture offers more
 466 execution cores and is the far costlier solution, more recent,
 467 cheaper, solutions such as high-end GPU provide better performance
 468 when the execution is not constrained by memory size.
 469
 470 \section{Second example}
 471 \label{ch17:sec:2ndmodel}
 472
 473 The second model, the MIOR model, simulates the behavior of microbian
 474 colonies. Its execution model is more complex so that it requires
 475 changes in the initial algorithm and the use of synchronization to
 476 benefit from the GPU architecture.
 477
 478 \subsection{The MIOR model}
 479 \label{ch17:subsec:miormodel}
 480
 481 The MIOR~\cite{C.Cambier2007} model was developed to simulate local
 482 interactions in soil between microbial colonies and organic
 483 matters. It reproduces each small cubic unit ($0.002 m^3$) of soil as
 484 a MAS.
 485
 486 Multiple implementations of the MIOR model have already been
 487 realized, in Smalltalk and Netlogo, in 2 or 3 dimensions. The last
 488 implementation, used in our work and referenced as MIOR in the
 489 rest of the chapter, is freely accessible online as
 490 WebSimMior\footnote{http://www.IRD.fr/websimmior/}.
 491
 492 MIOR is based around two types of agents: (1) the Meta-Mior (MM),
 493 which represents microbial colonies consuming carbon and (2) the
 494 Organic Matter (OM) which represents carbon deposits occurring in
 495 soil.
 496
 497 The Meta-Mior agents are characterized by two distinct behaviors:
 498 \begin{itemize}
 499 \item \emph{breath}: this action converts mineral carbon from the soil
 500   to carbon dioxide ($CO_{2}$) that is released into the soil and
 501 \item \emph{growth}: by this action each microbial colony fixes the
 502   carbon present in the environment to reproduce itself (augment its
 503   size). This action is only possible if the colony breathing needs
 504   are covered, i.e., enough mineral carbon is available.
 505 \end{itemize}
 506
 507 These behaviors are described in Algorithm~\ref{ch17:seqalgo}.
 508
 509 \begin{algorithm}[h]
 510 \caption{evolution step of each Meta-Mior (microbial colony) agent}
 511 \label{ch17:seqalgo}
 512 \KwIn{A static array $mmList$ of MM agents}
 513 \myinput{A static array $omList$ of OM agents}
 514 \myinput{A MIOR environment $world$}
 515 $breathNeed \gets world.respirationRate \times mm.carbon$\;
 516 $growthNeed \gets world.growthRate \times mm.carbon$\;
 517 $availableCarbon \gets totalAccessibleCarbon(mm)$\;
 518 \uIf{$availableCarbon > breathNeed$}{
 519   \tcc{ Breath }
 520   $mm.active \gets true$\;
 521   $availableCarbon \gets availableCarbon - consumCarbon(mm, breathNeed)$\;
 522   $world.CO2 \gets world.CO2 + breathNeed$\;
 523   \If{$availableCarbon > 0$}{
 524     \tcc{ Growth }
 525     $growthConsum \gets max(totalAccessCarbon(mm), growthNeed)$\;
 526     $consumCarbon(mm, growthConsum)$\;
 527     $mm.carbon \gets mm.carbon + growthConsum$\;
 528   }
 529 }
 530 \Else{
 531   $mm.active \gets false$
 532 }
 533 \end{algorithm}
 534
 535 Since this simulation takes place at a microscopic scale, a large
 536 number of these simulations must be executed for each macroscopic
 537 simulation step to model a realistic-sized unit of soil. This leads to
 538 large computing needs despite the small computation cost of each
 539 individual simulation.
 540
 541
 542 \subsection{MIOR implementation}
 543
 544 As pointed out previously, the MIOR implementation implied more
 545 changes for the initial code to be run on GPU.  As a first attempt, we
 546 tried a simple GPU implementation of the MIOR simulator, with only
 547 minimal changes to the CPU algorithm. Execution times showed the
 548 inefficiency of this approach and highlighted the necessity of
 549 adapting the simulator to take advantage of the GPU execution
 550 capabilities~\cite{lmlm+12:ip}. In this part, we show the main changes
 551 which were realized to adapt the MIOR simulator on GPU architectures.
 552
 553 \subsubsection{Execution mapping on GPU}
 554
 555 \begin{figure}
 556 \centering
 557 \includegraphics[width=0.7\textwidth]{Chapters/chapter17/figs/repartition.pdf}
 558 \caption{Consolidation of multiple simulations in one OpenCL kernel execution.}
 559 \label{ch17:fig:gpu_distribution}
 560 \end{figure}
 561
 562 Each MIOR simulation is represented by a work-group, and each agent by
 563 a work-item. A kernel is in charge of the life cycle process for each
 564 agent of the model. This kernel is executed by all the work-items of
 565 the simulation each on its own GPU core.
 566
 567 The usage of one work-group for each simulation allows the easy
 568 execution of multiple simulations in parallel, as shown on
 569 Figure~\ref{ch17:fig:gpu_distribution}.  By taking advantage of the
 570 execution overlap possibilities provided by OpenCL, it then becomes
 571 possible to exploit all the cores at the same time, even if an unique
 572 simulation is too small to use all the available GPU cores. However,
 573 the maximum size of a work-group is limited (to $512$), which allows
 574 us to execute only one simulation per work-group when using $310$
 575 threads (number of OM in the reference model) to execute the
 576 simulation.
 577
 578 The usage of the GPU to execute multiple simulations is initiated by
 579 the CPU. The CPU keeps total control of the simulator execution
 580 flow. Thus, optimized scheduling policies (such as submitting kernels
 581 in batch, limiting the number of kernels, or asynchronously retrieving
 582 the simulation results) can be defined to minimize the cost related to
 583 data transfers between CPU and GPUs.
 584
 585 \subsubsection{Data structures translation}
 586 \label{ch17:subsec:datastructures}
 587
 588 The adaptation of the MIOR model to GPU requires the mapping of the
 589 data model to OpenCL data structures. The environment and the agents
 590 are represented by arrays of structures, where each structure
 591 describes the state of one entity. The behaviors of these entities are
 592 implemented as OpenCL functions to be called from the kernels during
 593 execution.
 594
 595 Since the main program is written in Java, JOCL is responsible for the
 596 allocation and mapping of the object data structures to OpenCL ones
 597 before execution.
 598
 599 Four main data structures are defined: (1) an array of MM agents,
 600 representing the state of the microbial colonies. (2) an array of OM
 601 agents, representing the state of the carbon deposits. (3) a topology
 602 matrix, which stores accessibility information between the two types
 603 of agents of the model (4) a world structure, which contains all the
 604 global input data (metabolism rate, numbers of agents) and output data
 605 (quantity of $CO_{2}$ produced) of the simulation. The C-like OpenCL
 606 structures used to represent each type of to agent and the environment
 607 are illustrated in
 608 (Figure~\ref{ch17:listing:mior_data_structures}). These data
 609 structures are initialized by the CPU and then copied on the GPU.
 610
 611
 612 \lstinputlisting[caption=main data structures used in a MIOR simulation,label=ch17:listing:mior_data_structures]{Chapters/chapter17/code/data_structures.cl}
 613
 614 The world topology is stored as a two-dimension matrix which
 615 represents OM indexes on the abscissa and MM indexes on the
 616 ordinate. Each agent walks through its line/column of the matrix at
 617 each iteration to determinate which agents can be accessed during the
 618 simulation.  Since many agents are not connected, this matrix is
 619 sparse, which introduces a big number of useless memory accesses. To
 620 reduce the impact of these memory accesses we use a compacted,
 621 optimized representation of this matrix based
 622 on~\cite{Gomez-Luna:2009:PVS:1616772.1616869}, as illustrated in
 623 Figure~\ref{ch17:fig:csr_representation}. This compact representation
 624 considers each line of the matrix as an index list, and only stores
 625 accessible agents compactly, to reduce the number of nonproductive
 626 accesses.
 627
 628 \begin{figure}[h]
 629 \centering
 630 \includegraphics[width=0.8\textwidth]{Chapters/chapter17/figs/grid.pdf}
 631 %\psfig{file=figs/grid, height=1in}
 632 \caption{Compact representation of the topology of a MIOR simulation.}
 633 \label{ch17:fig:csr_representation}
 634 \end{figure}
 635
 636 Since dynamic memory allocation is not possible yet in OpenCL and is
 637 only provided in the latest revisions of the CUDA standard, these
 638 matrices are statically allocated. The allocation is based on the
 639 worst-case scenario where all OM and MM are linked since the real
 640 occupation of the matrix cells cannot be deduced without some kind of
 641 preprocessing computations.
 642
 643 \subsubsection{Critical resources access management}
 644 \label{ch17:subsec:concurrency}
 645
 646 One of the main concers in the MIOR model is to ensure that all the
 647 microbial colonies will have an equitable access to carbon resources,
 648 when multiple colonies share the same deposits. Access
 649 synchronizations are mandatory in these cases to prevent conflicting
 650 updates on the same data that may lead to calculation error (e.g. loss
 651 of matter).
 652
 653 On massively parallel architectures such as GPUs, these kind of
 654 synchronization conflicts can lead to an inefficient implementation by
 655 enforcing a quasi-sequential execution. It is necessary, in the case
 656 of MIOR as well as for other ABM, to ensure that each work-item is not
 657 too constrained in its execution.
 658
 659 \pagebreak
 660 \lstinputlisting[caption=main MIOR kernel,label=ch17:listing:mior_kernels]{./Chapters/chapter17/code/mior_kernels.cl}
 661
 662 From the sequential algorithm (Algorithm~\ref{ch17:seqalgo}) where all
 663 the agents share the same data, we have developed a parallel algorithm
 664 composed of three sequential stages separated by synchronization
 665 barriers. This new algorithm is based on the distribution of the
 666 available OM carbon deposits into parts at the beginning of each
 667 execution step. The three stages, illustrated in
 668 Listing~\ref{ch17:listing:mior_kernels}, are the following:
 669
 670 \begin{enumerate}
 671 \item \emph{scattering}: the available carbon in each carbon deposit
 672   (OM) is equitably dispatched among all accessible MM in the form of
 673   parts,
 674 \item \emph{live}: each MM consumes carbon in its allocated parts for
 675   its breathing and growing processes, and
 676 \item \emph{gathering}: unconsumed carbon in parts is gathered back
 677   into the carbon deposits.
 678 \end{enumerate}
 679
 680 This solution suppresses the data synchronization needed by the first
 681 algorithm, thus the need for synchronization barriers, and requires
 682 only one kernel launch from Java as described on
 683 Listing~\ref{ch17:fig:mior_launcher}.
 684
 685 \lstinputlisting[caption=MIOR simulation launcher,label=ch17:fig:mior_launcher]{Chapters/chapter17/code/mior_launcher.java}
 686
 687 \subsubsection{Termination detection}
 688
 689 The termination of a MIOR simulation is reached when the model
 690 stabilizes and no more $CO_{2}$ is produced. This termination
 691 detection can be done on either the CPU or the GPU but it requires a
 692 global view on the system execution.
 693
 694 In the first case, when the CPU controls the GPU simulation process,
 695 the detection is done in two steps: (1) the CPU starts the execution
 696 of a simulation step on the GPU and (2) the CPU retrieves the GPU data
 697 and determines if another iteration must be launched or if the
 698 simulation has terminated. This approach allows a fine-grain control
 699 over the GPU execution, but it requires many costly data transfers as
 700 each iteration result must be sent from the GPU to the CPU. In the
 701 case of the MIOR model these costs are mainly due to the inherent
 702 PCI-express port latencies rather than to bandwidth limitation since
 703 data sizes remains rather small, on the order of few dozens of
 704 Megabytes.
 705
 706 In the second case the termination detection is directly implemented
 707 on the GPU by checking the amount of available carbon between two
 708 iterations. The CPU does not have any feedback while the simulation is
 709 running, but retrieves the results once the kernel execution is
 710 finished. This approach minimizes the number of transfers between the
 711 CPU and the GPU.
 712
 713 \subsection{Performance of MIOR implementations}
 714 \label{ch17:subsec:miorexperiments}
 715
 716 In this part we present several MIOR GPU implementations using the
 717 distribution/gathering process described in the previous section and
 718 compare their performance on two distinct hardware platform, i.e., two
 719 different GPU devices. Five incremental MIOR implementations were
 720 realized with an increasing level of adaptation for the algorithm: in
 721 all cases, we choose the average time over 50 executions as a
 722 performance indicator.
 723
 724 \begin{itemize}
 725 \item The \textbf{GPU 1.0} implementation is a direct implementation
 726   of the existing algorithm and its data structures where data
 727   dependencies were removed, and it uses the noncompact topology
 728   representation described in Section~\ref{ch17:subsec:datastructures}
 729 \item The \textbf{GPU 2.0} implementation uses the previously
 730   described compact representation of the topology and remains
 731   otherwise identical to the GPU 1.0 implementation.
 732 \item The \textbf{GPU 3.0} implementation introduces the manual
 733   copy into local (private) memory of often-accessed global data, such
 734   as carbon parts or topology information.
 735 \item The \textbf{GPU 4.0} implementation is a variant of the GPU 1.0
 736   implementation but allows the execution of multiple simulations for
 737   each kernel execution.
 738 \item The \textbf{GPU 5.0} implementation is a multisimulation
 739   version of the GPU 2.0 implementation, using the execution of
 740   multiple simulations for each kernel execution as for GPU 4.0.
 741 \end{itemize}
 742
 743 The two last implementations \textbf{GPU 4.0} and \textbf{GPU 5.0}
 744 illustrate the gain provided by a better usage of the hardware
 745 resources, thanks to the driver execution overlapping capabilities. A
 746 sequential version of the MIOR algorithm, labeled \textbf{CPU}, is
 747 included for comparison purpose. This sequential version was developed
 748 in Java, the same language used for GPU implementations.
 749
 750 For these performance evaluations, two platforms are used. The first
 751 one is representative of the kind of hardware which is available on
 752 HPC clusters. It is a cluster node dedicated to GPU computations with
 753 two Intel X5550 processors running at $2.67$GHz and one Tesla C1060
 754 GPU device running at $1.3$GHz and composed of $240$ cores ($30$
 755 multiprocessors). The second platform illustrates what can be
 756 expected from a personal desktop computer built a few years ago. It
 757 uses an Intel Q9300 CPU, running at $2.5$GHz, and a Geforce 8800GT GPU
 758 running at $1.5$GHz and composed of $112$ cores ($14$
 759 multiprocessors). The purpose of these two platforms is to assess the
 760 benefit that could be obtained when a scientist has access either to
 761 specialized hardware as a cluster or tries to take advantage of its
 762 own personal computer.
 763
 764 \begin{figure}[!h]
 765 %begin{minipage}[t]{0.49\linewidth}
 766 \centering
 767 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/mior_perfs_tesla.pdf}
 768 %\caption{Performance CPU and GPU sur carte Tesla C1060}
 769 \caption{CPU and GPU performance on a Tesla C1060 node.}
 770 \label{ch17:fig:mior_perfs_tesla}
 771 %end{minipage}
 772 \end{figure}
 773
 774 Figures~\ref{ch17:fig:mior_perfs_tesla}~and~\ref{ch17:fig:mior_perfs_8800gt}
 775 show the execution time for $50$ simulations on the two hardware
 776 platforms. A size factor is applied to the problem: at scale 1, the
 777 model contains $38$ MM and $310$ OM, while at the scale 6 these
 778 numbers are multiplied by six. The size of the environment is modified
 779 as well to maintain the same average agent density in the model. This
 780 scaling factor displays the impact of the chosen size of simulation on
 781 performance.
 782
 783 %hspace{0.02\linewidth}
 784 %begin{minipage}[t]{0.49\linewidth}
 785 \begin{figure}[!h]
 786 \centering
 787 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/mior_perfs_8800gt.pdf}
 788 \caption{CPU and GPU performance on a personal computer with a Geforce 8800GT.}
 789 \label{ch17:fig:mior_perfs_8800gt}
 790 %end{minipage}
 791 \end{figure}
 792
 793 The charts show that for small problems the execution times of all
 794 the implementations are very close. This is because the GPU execution
 795 does not have enough threads (representing agents) for an optimal
 796 usage of GPU resources. This trend changes around scale $5$ where GPU
 797 2.0 and GPU 3.0 take the advantage over the GPU 1.0 and CPU
 798 implementations. This advantage continues to grow with the scaling
 799 factor, and reaches a speedup of $10$ at the scale $10$ between the
 800 fastest single-simulation GPU implementation and the first, naive one
 801 GPU 1.0.
 802
 803 Multiple trends can be observed in these results. First, optimizations
 804 for the GPU hardware show a large, positive impact on performance,
 805 illustrating the strong requirements on the algorithm properties to
 806 reach execution efficiency. These charts also show that despite the
 807 vast difference in numbers of cores between the two GPU platforms, the
 808 same trends can be observed in both cases. We can therefore expect
 809 similar results on other GPU cards, without the need for more
 810 adaptations.
 811
 812 \begin{figure}[!h]
 813 \centering
 814 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/monokernel.pdf}
 815 \caption{Execution time of one multi-simulation kernel on the Tesla
 816   platform.}
 817 \label{ch17:fig:monokernel_graph}
 818 \end{figure}
 819
 820 \begin{figure}[!h]
 821 \centering
 822 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/multikernel.pdf}
 823 \caption{Total execution time for 1000 simulations on the Tesla
 824   platform, while varying the number of simulations for each kernel.}
 825 \label{ch17:fig:multikernel_graph}
 826 \end{figure}
 827
 828 There are two ways to measure simulations performance: (1) by
 829 executing only one kernel, and varying its size (the number of
 830 simulations executed in parallel), as shown in
 831 Figure~\ref{ch17:fig:monokernel_graph}, to test the costs linked to
 832 the parallelization process or (2) by executing a fixed number of
 833 simulations and varying the size of each kernel, as shown in
 834 Figure~\ref{ch17:fig:multikernel_graph}.
 835
 836 Figure~\ref{ch17:fig:monokernel_graph} illustrates the execution time
 837 for only one kernel. It shows that for small numbers of simulations
 838 run in parallel, the compact implementation of the model topology is
 839 faster than the two-dimension matrix representation. This trends
 840 reverse with more than $50$ simulations in parallel, which can be
 841 explained either by the nonlinear progression of the synchronization
 842 costs or by the additional memory required for the access-efficient
 843 representation.
 844
 845 Figure~\ref{ch17:fig:multikernel_graph} illustrates the execution time
 846 of a fixed number of simulations. It shows that for a small number of
 847 simulations run in parallel, the costs resulting from program setup,
 848 data copies, and launch on GPU are very detrimental to
 849 performance. Once the number of simulations executed for each kernel
 850 grows, these costs are counterbalanced by computation costs. This
 851 trend is more marked in the case of the sparse implementation (GPU
 852 4.0) than in the compact one but appears on both curves. With more
 853 than $30$ simulations for each kernel, execution times stall, since
 854 hardware limits are reached. This indicates that the cost of preparing
 855 and launching kernels become negligible compared to the computing time
 856 once a good GPU occupancy rate is achieved.
 857
 858 \section{Analysis and recommendations}
 859 \label{ch17:analysis}
 860
 861 In this section we synthesize the observations done on the two models
 862 and identify some recommendations for implementing complex systems on
 863 GPU platforms.
 864
 865 \subsection{Analysis}
 866
 867 In both the collembola and the MIOR model, a critical problematic is
 868 the determination of the parts of the simulation that are to be run on
 869 GPU and which are to remain on the CPU. The determination of these
 870 parts is a classic step of any algorithm parallelization and must take
 871 into account considerations such as the cost of the different parts of
 872 the algorithm and the expected gains.
 873
 874 In the case of the collembola model two steps of the algorithm were
 875 ported to GPU. Both steps use straightforward, easily parallelizable
 876 operations where a direct gain can be expected by using more execution
 877 cores without important modifications to the algorithm.
 878
 879 In the MIOR model case, however, no such inherently parallelizable
 880 parts are evident in the original sequential algorithm. This is mainly
 881 explained by the rate of interactions between agents in this model in
 882 the form of two operations (breathing, growth) using heavily-shared
 883 carbon resources. In this case the algorithm had to be more profoundly
 884 modified while keeping in mind the need to remain true to the original
 885 model, to synchronize the main execution step of all agents in the
 886 model, to ensure equity, and to minimize the numbers of
 887 synchronizations. The minimization is done by factoring the
 888 distribution of carbon in the model in two separated steps at the
 889 beginning and the end of each iteration rather than at multiple points
 890 of the execution.
 891
 892 \subsection{MAS execution workflow}
 893
 894 Many MAS simulations decompose their execution process into discrete
 895 evolution steps where each step represents a quantum of time (minimal
 896 unit of time described). At the end of each step many global data,
 897 graphical displays or output files are updated. This execution model
 898 may not correctly fit on GPU platforms as they assume more or less a
 899 batch-like workflow model. The execution model must be split into the
 900 following ever repeating steps:
 901
 902 \begin{itemize}
 903 \item Allocation of GPU data buffers
 904 \item Copy of data from CPU to GPU
 905 \item GPU kernels execution
 906 \item Copy of results from GPU to CPU
 907 \end{itemize}
 908
 909 This workflow works well if the considered data transfer time is
 910 negligible compared to GPU execution or can be done in parallel,
 911 thanks to the asynchronous nature of OpenCL. If we are to update the
 912 MAS model after each iteration then performance risks being
 913 degraded. This is illustrated in the MIOR model by the fact that the
 914 speedup observed on GPU is much more significant for bigger
 915 simulations, which imply longer GPU execution times. Our solution to
 916 this problem is to desynchronize the execution of the MAS model and
 917 its GPU parts by requesting the execution of multiple steps of the GPU
 918 simulations for each launch.
 919
 920 Another related prerequisite of GPU execution is the ability to have
 921 many threads executed, to allow an efficient exploitation of the
 922 superior number of cores provided by the architecture. In the case of
 923 MAS models, this means that executing one agent at a time on GPU is
 924 meaningless in regard to GPU usage, copying cost, and actual gain in
 925 execution time, if the agent computations are not complex enough. In
 926 the MIOR and the collembola models, this is solved by executing the
 927 computations for all agents of the model at the same time. If the
 928 model has only chronic needs for intensive computations, then some
 929 kind of batching mechanism is required to store waiting treatments in
 930 a queue, until the total sum of waiting computations justifies the
 931 transfer cost to the GPU platform.
 932
 933 \subsection{Implementation challenges}
 934
 935 Besides the execution strategy challenges described above, some
 936 implementation challenges also occur when implementing an OpenCL
 937 version of a MAS model, mainly related to the underlying limitations
 938 of the execution platform.
 939
 940 The first one is the impossibility (except in latest CUDA versions) to
 941 dynamically allocate memory during execution. This is a problem in the
 942 case of models where the number of agents can vary during the
 943 simulation, such as prey-predator models. In this case, the only
 944 solution is to overestimate the size of arrays or data structures to
 945 accommodate these additional individuals, or to use the CPU to resize
 946 data structures when these situations occur. Both approaches require
 947 trending either memory or performance and are not always practical.
 948
 949 Another limitation is the impossibility to store pointers in data
 950 structures, since OpenCL only allows one-dimension static arrays. This
 951 precludes the usage of structures such as linked-list, graphs or
 952 sparse matrices not represented by some combination of static arrays,
 953 and can be another source of memory or performance losses.
 954
 955 In the case of MIOR, this problem is especially exacerbated in the
 956 case of neighboring storage: both representations consume much more
 957 memory than is actually required, since the worst case (all agents
 958 have access to all others agents) has to be taken into account when
 959 dimensioning the data structure.
 960
 961 The existence of different generations of GPU hardware is also a
 962 challenge. Older implementations, such as the four year old Tesla
 963 C1060 cards, have very strong constraints in term of memory accesses
 964 and requires very regular access patterns to perform efficiently. MAS
 965 having many random accesses, such as MIOR, or many small global memory
 966 accesses, such as Collembola, are penalized on these older
 967 cards. Fortunately, these requirements are less present is modern
 968 cards, which offer caches and other facilities traditionally present
 969 on CPU to offset these kinds of penalties.
 970
 971 The final concern is related to the previous ones and often results in
 972 more memory consumption. The amount of memory available on GPU cards
 973 is much more limited and adding new memory capabilities is more costly
 974 compared to expending a CPU RAM. On computing clusters, hardwares
 975 nodes with 128GB of memory or more have become affordable, whereas
 976 newer Tesla architecture remains limited to 16GB of memory. This can
 977 be a concern in the case of big MAS models, or small ones which can
 978 only use memory-inefficient OpenCL structures.
 979
 980 \subsection{MCSMA}
 981 \label{ch17:Mcsma}
 982
 983 As shown in the previous sections, many data representation choices
 984 are common to entire classes of MAS. The paradigm of grid, for
 985 example, is often encountered in models where each cell constitutes
 986 either the elementary unit of simulation
 987 (SugarScape~\cite{Dsouza2007}) or a discretization of the environment
 988 space (Pathfinding~\cite{Guy09clearpath}). These grids can be
 989 considered as two- or three-dimensional matrices, whose processing can
 990 be directly distributed.
 991
 992 Another common data representation encountered in MAS system is the
 993 usage of 2D or 3D coordinates to store the position of each agent of
 994 the model. In this case, even if the environment is no longer
 995 discrete, the location information still imply computations (Euclidean
 996 distances) which can be parallelized.
 997
 998 MCSMA~\cite{lmlm+13:ip} is a framework developed to provide to the MAS
 999 designer those basic data structures and the associated operations, to
1000 facilitate the portage of existing MAS on GPU. Two levels of
1001 utilization are provided to the developer, depending on its usage
1002 profile:
1003
1004 \begin{itemize}
1005 \item A high-level library, composed of modules regrouping classes of
1006   operations. Such operation can distance computations in 1D, 2D or 3D
1007   grids, diffusion or reduction operations on matrices...
1008 \item A low-level API which allows the developer direct access to the
1009   GPU and the inner working of MCSMA, to develop new modules in the
1010   case where the required operations are not yet provided by the
1011   platform.
1012 \end{itemize}
1013
1014 Both usage levels were illustrated in the above two practical
1015 cases. In MIOR, the whole algorithm (baring initialization) is ported
1016 on GPU as a specific plugin which allows executing $n$ MIOR
1017 simulations and retrieving their results. This first approach requires
1018 extensive adaptations to the original algorithm.  In collembola, to
1019 the contrary, the main steps of the algorithm remain executed on the
1020 CPU, and only specific operations are delegated to generic, already
1021 existing diffusion and reduction kernels. The fundamental algorithm is
1022 not modified and GPU execution is only applied to specific parts of
1023 the execution which may benefit from it. These two programming
1024 approaches allow incremental adaptations of existing Java MAS to
1025 accelerate their execution on GPU, while retaining the option to
1026 develop their own reusable or more efficient module to supplement the
1027 already existing ones.
1028
1029 \section{Conclusion}
1030 \label{ch17:conclusion}
1031
1032 This chapter has addressed the issue of complex system simulation by
1033 using agent-based paradigms and GPU hardware. From the experiments on
1034 two existing agent-based models of soil science we have provided
1035 useful information on the architecture, the algorithm design, and the
1036 data management to run agent-based simulations on GPU, and more
1037 generally to run computationally intensive applications that are not
1038 based on purely-matricial models. The first result of this work is
1039 that adapting the algorithm to a GPU architecture is possible and
1040 suitable to speed up agent based simulations as illustrated by the
1041 MIOR model. Coupling CPU with GPU seems to be an interesting way to
1042 take better advantage of the power given by computers and clusters as
1043 illustrated by the collembola model. From our point of view the
1044 adaptation process is less costy in time than a traditional
1045 parallelization on distributed nodes and not much difficult than a
1046 standard multithreaded parallelization, since all the data remains on
1047 the same host and can be shared in central memory. The usage of OpenCL
1048 also enables a portable simulator that can be run on different
1049 graphical units. Even using a mainstream card such as the GPU card of
1050 a standard computer can lead to significant performance
1051 improvements. This is an interesting result as it opens up the field
1052 of inexpensive HPC to a large community.
1053
1054 In this perspective, we are working on MCSMA, a development platform
1055 that would facilitate the use of GPU or many-core architectures for
1056 multi-agent simulations. Our first work has been the definition of
1057 common, efficient, reusable data structures, such as grids or
1058 lists. Another goal is to provide easier means to control the
1059 distribution of specific processes between CPU or GPU, to allow the
1060 easy exploitation of the strengths of each platform in the same
1061 multi-agent simulation. We think that the same approach, i.e.,
1062 developing specific environments that facilitate the developer access
1063 to the GPU power, can be applied in many domains with computationally
1064 intensive needs to open the GPU use to a larger community.
1065
1066 \putbib[Chapters/chapter17/biblio]