   1 \chapterauthor{Guillaume Laville, Christophe Lang, Bénédicte Herrmann,
   2   and Laurent Philippe}{Femto-ST Institute, University of
   3   Franche-Comte, France}
   4 %\chapterauthor{Christophe Lang}{Femto-ST Institute, University of Franche-Comte, France}
   5 \chapterauthor{Kamel Mazouzi}{Franche-Comte Computing Center, University of Franche-Comte, France}
   6 \chapterauthor{Nicolas Marilleau}{UMMISCO, Institut de Recherche pour le Developpement (IRD), France}
   7 %\chapterauthor{Bénédicte Herrmann}{Femto-ST Institute, University of Franche-Comte, France}
   8 %\chapterauthor{Laurent Philippe}{Femto-ST Institute, University of Franche-Comte, France}
   9
  10 \newlength\mylen
  11 \newcommand\myinput[1]{%
  12   \settowidth\mylen{\KwIn{}}%
  13   \setlength\hangindent{\mylen}%
  14   \hspace*{\mylen}#1\\}
  15
  16 \chapter{Implementing Multi-Agent Systems on GPU}
  17 \label{chapter17}
  18
  19
  20 \section{Introduction}
  21 \label{ch17:intro}
  22
  23 In this chapter we introduce the use of Graphics Processing Units
  24 (GPUs) for multi-agent-based systems as an example of a not-so-regular
  25 application that could benefit from the GPU computing
  26 power. Multi-Agent Systems (MAS) are a simulation paradigm used to
  27 study the behavior of dynamic systems. Dynamic systems such as physical
  28 systems are often modeled by mathematical representations and their
  29 dynamic behavior is simulated by differential equations. The
  30 simulation of the system thus often relies on the resolution of a
  31 linear system that can be efficiently computed on a graphics
  32 processing unit as shown in the preceding chapters. But when the
  33 behavior of the system elements is not uniformly driven by the same
  34 law, when these elements have their own behavior, the modeling process
  35 is too complex to rely on formal expressions. In this context MAS is a
  36 recognized approach to model and simulate systems where individuals
  37 have an autonomous behavior that cannot be simulated by the evolution
  38 of a set of variables driven by mathematical laws. MAS are often used
  39 to simulate natural or collective phenomena whose individuals are too
  40 numerous or various to provide a unified algorithm describing the
  41 system evolution. The agent-based approach is to divide these complex
  42 systems into individual self-contained entities with their smaller set
  43 of attributes and functions. But, as for mathematical simulations,
  44 when the size of the MAS increases, the need for computing power and
  45 memory also increases. For this reason, multi-agent systems should
  46 benefit from the use of distributed computing architectures. Clusters
  47 and grids are often identified as the main solution to increase
  48 simulation performance but GPUs are also a promising technology with
  49 an attractive performance/cost ratio.
  50
  51 Conceptually a MAS\index{Multi-Agent System} is a distributed system
  52 as it favors the definition and description of large sets of
  53 individuals, the agents, that can be run in parallel. As a large set
  54 of agents could have the same behavior, a Single Instruction Multiple
  55 Data (SIMD) execution architecture should fit the simulation
  56 execution. Most of the agent-based simulators are, however, designed
  57 with a sequential scheme in mind, and these simulators seldom use more
  58 than one core for their execution. Due to simulation scheduling
  59 constraints, data sharing and exchange between agents and the huge
  60 amount of interactions between agents and their environment, it is
  61 indeed rather difficult to distribute an agent-based simulator, for
  62 instance, to take advantage of new multithreaded computer
  63 architectures. Thus, guidelines and tools dedicated to the MAS paradigm
  64 and High Performance Computing (HPC) are now needed by complex
  65 system communities. Note that, from the described structure (large
  66 number of agents sharing data), we can conclude that MAS would more
  67 easily benefit from many-core architectures than from other kinds of
  68 parallelism.
  69
  70 Another key point that advocates for the use of many-core in MAS is
  71 the growing need for multiscale simulations. Multiscale simulations
  72 explore problems with interactions between several scales. The
  73 different scales use different granularity of the structure and
  74 potentially different models. Most of the time the lower scale
  75 simulations provide results to higher scale simulations. In that case
  76 the execution of the simulations can easily be distributed between the
  77 local cores and a many-core architecture, i.e., a GPU device.
  78
  79 We explore in this chapter the use of many-core architectures to
  80 execute agent-based simulations. We illustrate our discussion with two
  81 cases: the Collembola simulator designed to simulate the diffusion of
  82 Collembola between plots of land and the MIOR (MIcro ORganism)
  83 simulator that reproduces effects of earthworms on bacteria dynamics
  84 in a bulked soil. In Section \ref{ch17:ABM} we present the work
  85 related to MAS and parallelization with a special focus on many-core
  86 use. In Sections \ref{ch17:sec:1stmodel} and \ref{ch17:sec:2ndmodel}
  87 we present in detail two multi-agent models, their GPU
  88 implementations, the conducted experiments, and their performance
  89 results. The first model, given in Section \ref{ch17:sec:1stmodel},
  90 illustrates the use of a GPU architecture to speed up the execution of
  91 some computation-intensive functions while the main model is still
  92 executed on the central processing unit. The second model, given in
  93 Section \ref{ch17:sec:2ndmodel}, illustrates the use of a GPU
  94 architecture to implement the whole model on the GPU processor which
  95 implies deeper changes in the initial algorithm. In Section
  96 \ref{ch17:analysis} we propose a more general reflection on these
  97 implementations and provide some guidelines. Then, we conclude in
  98 Section \ref{ch17:conclusion} on the possible generalization of our
  99 work.
 100
 101
 102 \section{Running agent-based simulations}
 103 \label{ch17:ABM}
 104
 105 In this section, we present the context of MAS, their parallelization,
 106 and we report several existing works on using GPU to simulate
 107 multi-agent systems.
 108
 109 \subsection{Multi-agent systems and parallelism}
 110
 111 Agent-based systems are often used to simulate natural or collective
 112 phenomena whose actors are too numerous or various to provide a simple
 113 unified algorithm describing the studied system
 114 dynamics~\cite{Schweitzer2003}. The implementation of an agent-based
 115 simulation usually starts by designing the underlying agent-based
 116 model (ABM). Most ABMs are based around a few types of entities such as
 117 agents, one environment, and an interaction
 118 organization~\cite{Vowel02}. In the complex system domain, the
 119 environment often describes a real space, its structure (e.g., soil
 120 textures and porosities), and its dynamics (e.g., organic matter
 121 decomposition). It is a virtual world in which agents represent the
 122 evolution of the studied entities (e.g., biotic organisms). Each agent
 123 is animated by a behavior that can range between reactivity (only
 124 reacts to external stimuli) and cognition (makes complex decisions
 125 based on environmental and internal factors). Interaction and
 126 organization define functions, types, and patterns of communications
 127 of their member agents in the
 128 system~\cite{Odell:2003:RRD:1807559.1807562}. Note that, depending on
 129 the MAS, agents can communicate either directly through special
 130 primitives or indirectly through the information stored in the
 131 environment.
 132
 133 Agent-based simulations have been used for more than a decade to
 134 reproduce, understand and even predict complex system dynamics. They
 135 have proved their usefulness in various scientific
 136 communities. Nowadays generic agent-based frameworks such as
 137 Repast~\cite{repast_home} or NetLogo~\cite{netlogo_home} are promoted
 138 to implement simulators. Many ABMs, such as the crowd model
 139 representing a city-wide scale~\cite{Strippgen_Nagel_2009}, tend
 140 however to require a large number of agents to provide a realistic
 141 behavior and reliable global statistics. Moreover, a complete model
 142 analysis requires an experiment plan, consisting of multiple
 143 simulation runs, to obtain enough confidence in a simulation. In this
 144 case the available computing power often limits the simulation size
 145 and the explorable range, so parallelism is required to
 146 explore bigger configurations.
 147
 148 For that, three major approaches can be identified:
 149 \begin{enumerate}
 150 \item parallelizing experiments execution on a cluster or a grid (one
 151   or a few simulations are submitted to each
 152   core)~\cite{Blanchart11,Chuffart2010},
 153 \item parallelizing the simulator on a cluster (the environment of the
 154   MAS is split and run on several distributed
 155   nodes)~\cite{Cosenza2011,MAR06},
 156 \item optimizing the simulator by taking advantage of computer
 157   resources (multi-threading, GPU, and so on) \cite{Aaby10}.
 158 \end{enumerate}
 159
 160 In the first case, experiments are run independently of each other and
 161 only simulation parameters are changed between two runs so that a
 162 simple version of an existing simulator can be used. This approach
 163 does not, however, allow larger models to be run. In the second and
 164 third cases, model and code modifications are necessary. Only a few
 165 frameworks, however, introduce distribution in agent simulation
 166 (Madkit~\cite{Gutknecht2000}, MASON~\cite{Sean05},
 167 repastHPC~\cite{Collier11}), and parallel implementations are often
 168 based on the explicit use of threads using shared
 169 memory~\cite{Guy09clearpath} or cluster libraries such as
 170 MPI~\cite{Kiran2010}.
 171
 172 Parallelizing a multi-agent simulation is however complex due to space
 173 and time constraints. Multi-agent simulations are usually based on a
 174 synchronous execution: at each time step, numerous events (space data
 175 modification, agent motion) and interactions between agents happen.
 176 Distributing the simulation on several computers or grid nodes thus
 177 requires guaranteeing a distributed synchronous execution and
 178 coherency. This often leads to poor performance or complex
 179 synchronization problems. Multicore execution or delegating part of
 180 this execution to other processors such as GPUs~\cite{Bleiweiss_2008}
 181 is usually easier to implement since all the threads share the data
 182 and the local clock.
 183
 184 % Different agent patterns can be adopted in an ABMs such as
 185 % cognitive and reactive ones~\cite{Ferber99}. Cognitive agents act on
 186 % the environment and interact with other agents according to a complex
 187 % behavior. This behavior takes a local perception of the virtual world
 188 % and the agent past (a memory characterized by an internal state and
 189 % belief, imperfect knowledge about the world) into account. Reactive
 190 % agents have a much systematic pattern of action based on stimuli
 191 % response schemes (no or few knowledge and state conservation in agent). The
 192 % evolution of the ABM environment, in particular, is often represented with
 193 % this last kind of agents. As their behavior is usually simple, we
 194 % propose in this chapter to delegate part of the environment and of the
 195 % reactive agents execution to the graphical processing unit of the
 196 % computer. This way we can balance the load between both CPU and GPU
 197 % execution resources.
 198
 199 % In the particular case of multi-scale simulations such as the Sworm
 200 % simulation~\cite{BLA09} the environment may be used at different
 201 % levels. Since the representation of the whole simulated environment
 202 % (i.e. the soil area) would be costly, the environment is organized as
 203 % a multi-level tree of small soil cubes which can be lazily
 204 % instantiated during the simulation. This allows to gradually refine
 205 % distribution details in units of soil as agents progress and need
 206 % those information, by using a fractal process based on the
 207 % bigger-grained already instantiated levels. This characteristic,
 208 % especially for a fractal model, could be the key of the
 209 % distribution. For instance, each branch of a fractal environment could
 210 % be identified as an independent area and parallelized. In addition
 211 % Fractal is a famous approach to describe multi-scale environment (such
 212 % as soil) and its organization~\cite{perrier}. In that case the lower
 213 % scale simulations can also be delegated to the GPU card to limit the
 214 % load of the main (upper scale) simulation.
 215
 216 \subsection{MAS implementation on GPU}
 217 \label{ch17:subsec:gpu}
 218
 219 The last few years have seen the appearance of new generations of
 220 graphic cards based on more general purpose execution units which are
 221 promising for large systems such as MAS. Using matrix-based data
 222 representations and SIMD computations is however not always
 223 straightforward in MAS, where data structures and algorithms are
 224 tightly coupled to the described simulation. However, works from
 225 existing literature show that MAS can benefit from these performance
 226 gains on various simulation types, such as traffic
 227 simulation~\cite{Strippgen_Nagel_2009}, cellular
 228 automata~\cite{Dsouza2007}, mobile-agent based
 229 path-finding~\cite{Silveira:2010:PRG:1948395.1948446} or genetic
 230 algorithms~\cite{Maitre2009}. Note that an application-specific
 231 adaptation process was required in the case of these MAS: some of the
 232 previous examples are driven by mathematical laws (path-finding) or
 233 use a natural mapping between a discrete environment (cellular
 234 automaton) and GPU cores. Unfortunately, this mapping often requires
 235 algorithmic adaptations in other models but experience shows that the
 236 more reactive a MAS is the more adapted its implementation is to GPU.
 237
 238 The first step in the adaptation of an ABM to GPU platforms is the
 239 choice of language. On the one hand, the Java programming language is
 240 often used for the implementation of MAS due to its availability on
 241 numerous platforms or frameworks and its focus on high-level,
 242 object-oriented programming. On the other hand, GPU platforms can only
 243 run specific languages such as OpenCL or CUDA. OpenCL (supported on
 244 AMD, Intel, and NVIDIA hardware) better suits the portability concerns
 245 across a wide range of hardware needed by agent simulators, as
 246 opposed to CUDA, which is an NVIDIA-specific platform.
 247
 248 OpenCL is a C library which provides access to the underlying CPU or
 249 GPU threads using an asynchronous interface. Various OpenCL functions
 250 allow the compilation and the execution of programs on these execution
 251 resources, the copying of data buffers between devices, and the
 252 collection of profiling information.
 253
 254 This library is based around three main concepts:
 255
 256 \begin{itemize}
 257 \item the \emph{kernel} (similar to a CUDA kernel), which represents a runnable program
 258   containing instructions to be executed on the GPU;
 259 \item the \emph{work-item} (equivalent to a CUDA thread), which
 260   represents one running instance of a
 261   GPU kernel; and
 262 \item the \emph{work-group} (or execution block), which is a set of work-items
 263   sharing some memory to speed up data accesses and
 264   computations. Synchronization operations such as barriers can only be
 265   used within the same work-group.
 266 \end{itemize}
 267
 268 Running an OpenCL computation consists of launching numerous
 269 work-items that execute the same kernel. The work-items are submitted
 270 to a queue to optimize the usage of the available cores. A
 271 computation is complete once all these kernel instances have terminated.
 272
 273 The number of work-items used in each work-group is an important
 274 implementation choice which determines how many tasks will share the
 275 same cache memory. Data used by the work-items can be stored as
 276 N-dimensional matrices in local or global GPU memory. Since the size
 277 of local memory is often limited to a few tens of kilobytes, choosing
 278 the work-group size often implies a compromise between the model
 279 synchronization or data requirements and the available resources.
 280
 281 In the case of agent-based simulations, each agent can be naturally
 282 mapped to a work-item. Work-groups can then be used to represent
 283 groups of agents or simulations sharing common data (such as the
 284 environment) or algorithms (such as the background evolution process).
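
The mapping above can be illustrated with the index arithmetic that
relates a work-item to its work-group. The following C fragment is only
a host-side sketch, not code from the simulators: the type
\texttt{work\_item\_pos} and both function names are hypothetical, and
a one-dimensional index space without global offset is assumed. It
mirrors the relation between \texttt{get\_global\_id(0)},
\texttt{get\_group\_id(0)}, \texttt{get\_local\_size(0)}, and
\texttt{get\_local\_id(0)} inside a kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Position of one work-item in a one-dimensional index space
 * (hypothetical helper type, introduced for illustration only). */
typedef struct {
    size_t group_id;  /* work-group index, e.g. one group of agents   */
    size_t local_id;  /* rank inside the work-group, e.g. one agent   */
} work_item_pos;

/* Global index of a work-item: group_id * local_size + local_id.
 * Assumes no global offset and uniform work-group sizes. */
size_t global_id_of(work_item_pos pos, size_t local_size)
{
    assert(pos.local_id < local_size);
    return pos.group_id * local_size + pos.local_id;
}

/* Inverse mapping: recover the group and the local rank of a work-item
 * from its global index. */
work_item_pos position_of(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    work_item_pos pos = { global_id / local_size, global_id % local_size };
    return pos;
}
```

With work-groups of 64 work-items, the agent of rank 5 in group 3 is
handled by the work-item whose global index is $3 \times 64 + 5 = 197$.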
 285
 286 In the following examples a binding named JOCL~\cite{jocl_home} is
 287 used to access the OpenCL platform from the Java programming language.
 288
 289 In the next sections we present two practical cases that will be
 290 studied in detail, from the model to its implementation and
 291 performance.
 292
 293 \section{A first practical example}
 294 \label{ch17:sec:1stmodel}
 295
 296 The first model, the Collembola model, simulates the propagation of
 297 collembolas in fields and forests. It is based on a diffusion
 298 algorithm which illustrates the case of agents with a simple behavior
 299 and few synchronization problems.
 300
 301 \subsection{The Collembola model\index{Collembola model}}
 302 \label{ch17:subsec:collembolamodel}
 303
 304 The Collembola model is an example of multi-agent system using GIS
 305 (Geographical Information System) and survey data (population count)
 306 to model the evolution of the biodiversity across land plots. A first
 307 version of this model has been developed with the NetLogo framework by
 308 Bioemco and UMMISCO researchers. In this model, the biodiversity is
 309 modeled by populations of arthropod individuals, the Collembola, which
 310 can reproduce and diffuse to favorable new habitats. The simulator
 311 allows us to study the diffusion of collembola between plots of land
 312 depending on their use (artificial soil, crop, forest, etc.). In this
 313 model the environment is composed of the studied land, and collembola
 314 are used as agents. Every land plot is divided into several cells,
 315 each cell representing a surface unit ($16 \times 16$ meters). A number of
 316 individuals per collembola species is associated with each cell. The
 317 model evolution is then based on a common diffusion model that
 318 diffuses individuals between cells. Each step of the simulation is
 319 based on four stages, as shown in
 320 Figure~\ref{ch17:fig:collem_algorithm}.
 321
 322 % \begin{enumerate}
 323 % \item arrival of new individuals
 324 % \item reproduction in each cell
 325 % \item diffusion between cells
 326 % \item updating of colembola lifetime
 327 % \end{enumerate}
 328
 329 \begin{figure}[h]
 330 \centering
 331 \includegraphics[width=0.6\textwidth]{Chapters/chapter17/figs/algo_collem.pdf}
 332 \caption{Evolution algorithm of Collembola model.}
 333 \label{ch17:fig:collem_algorithm}
 334 \end{figure}
 335
 336 The algorithm is quite simple but includes two costly operations, the
 337 reproduction and the diffusion, that must be parallelized to improve
 338 the model performance.
 339
 340 The {\bf reproduction} stage consists of updating the total population
 341 of each plot by taking into account the individuals that arrived at the
 342 preceding computation step. This stage involves processing the whole set of
 343 cells of the environment to sum their population. The computed value
 344 is recorded in the plot associated with each cell. This process can be
 345 viewed as a reduction operation on all the population cells
 346 associated with one plot to obtain its population.
 347
 348 The {\bf diffusion} stage simulates the natural behavior of the
 349 collembola, which tend to occupy the whole space over time. This
 350 stage consists of computing a new value for each cell depending on
 351 the population of the neighbor cells. This process can be viewed
 352 as a linear diffusion, at each iteration, of the population of the cells
 353 across their neighbors.
 354
 355 These two processes are quite common in numerical computations so that
 356 the collembola model can be adapted to a GPU execution without much
 357 difficulty.
 358
 359 \subsection{Collembola implementation}
 360
 361 In the collembola simulator biodiversity is modeled by populations of
 362 collembola individuals, which can reproduce and diffuse to favorable
 363 new habitats. This is implemented as a fixed reproduction factor,
 364 applied to the size of each population, followed by a linear diffusion
 365 of each cell population to its eight neighbors. These reproduction and
 366 diffusion processes are followed by two more steps in the GPU
 367 implementation. The first one consists of culling populations in an
 368 inhospitable environment, by checking each cell value and terrain
 369 type, and setting its population to zero if necessary. The final
 370 simulation step is the reduction of the cell populations for each
 371 plot, to obtain an updated plot population for statistical
 372 purposes. This computation, which the reference sequential algorithm
 373 performs while updating each cell population, is made a separate step
 374 because of synchronization problems; it also reduces the total number
 375 of memory accesses needed to update those populations.
 376
 377 %\lstinputlisting[language=C,caption=Collembola OpenCL
 378 %kernels,label=fig:collem_kernels]{Chapters/chapter17/code/collem_kernels.cl}
 379 %\pagebreak
 380 \lstinputlisting[caption=Collembola OpenCL diffusion kernel,label=ch17:listing:collembola-diffuse]{Chapters/chapter17/code/collem_kernel_diffuse.cl}
 381
 382 The reproduction, diffusion and culling steps are implemented on GPU
 383 (Figure~\ref{ch17:fig:collem_algorithm}) as a straight mapping of each
 384 cell to an OpenCL work-item (GPU thread). Listing
 385 \ref{ch17:listing:collembola-diffuse} gives the kernel for the
 386 diffusion implementation.  To prevent data coherency problems, the
 387 diffusion step is split into two phases, separated by an execution
 388 {\it barrier}. In the first phase each cell's diffusion overflow is
 389 calculated and divided by the number of neighbors. Note that, on the
 390 border of the grid, populations can also overflow outside the
 391 environment grid but we do not manage those external populations,
 392 since there is no reason to assume that our model is isolated from its
 393 surroundings. The per-neighbor overflow value is stored for each cell
 394 before encountering the barrier. After the barrier is met, each cell
 395 reads the overflows stored by all its neighbors at the previous step
 396 and applies them to its own population. In this manner, only one
 397 barrier is required to ensure the consistency of population numbers,
 398 since no cell ever modifies a value other than its own.
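
The two-phase scheme can be emulated sequentially to make the data
flow explicit. The sketch below is plain C rather than the OpenCL
kernel of the chapter, and rests on assumptions: the function names
are hypothetical, and the overflow of a cell is taken to be a fixed
fraction \texttt{rate} of its population, which may differ from the
rule used by the actual simulator. The end of the first function
plays the role of the barrier: every overflow is stored before any
population is updated.

```c
#include <assert.h>
#include <stddef.h>

/* Phase 1: each cell stores the share it sends to EACH of its eight
 * neighbors. `rate` (fraction of the population that diffuses) is an
 * assumption made for this sketch. */
void diffuse_phase1(const float *pop, float *overflow,
                    size_t w, size_t h, float rate)
{
    assert(rate >= 0.0f && rate <= 1.0f);
    for (size_t i = 0; i < w * h; ++i)
        overflow[i] = pop[i] * rate / 8.0f;
}

/* Phase 2 (after the barrier): each cell gives away eight shares and
 * collects the shares stored by its in-grid neighbors. A cell only
 * writes its own population entry. Shares sent outside the grid are
 * simply lost, as in the text. */
void diffuse_phase2(float *pop, const float *overflow, size_t w, size_t h)
{
    for (size_t y = 0; y < h; ++y) {
        for (size_t x = 0; x < w; ++x) {
            float received = 0.0f;
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    long nx = (long)x + dx;
                    long ny = (long)y + dy;
                    if (dx == 0 && dy == 0)
                        continue;               /* skip the cell itself */
                    if (nx < 0 || ny < 0 || nx >= (long)w || ny >= (long)h)
                        continue;               /* neighbor outside grid */
                    received += overflow[(size_t)ny * w + (size_t)nx];
                }
            }
            size_t i = y * w + x;
            pop[i] += received - 8.0f * overflow[i];
        }
    }
}
```

On a $3 \times 3$ grid whose central cell holds 80 individuals, a rate
of $0.5$ leaves 40 individuals in the center and gives 5 to each of
the eight surrounding cells.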
 399
 400 Listing \ref{ch17:listing:collembola-reduc} gives the kernel for the
 401 reduction implementation.  The only step requiring numerous
 402 synchronized accesses is the reduction one: in this first approach, we
 403 chose to use the {\it atomic\_add} operation to implement this process,
 404 but more efficient implementations using partial reduction and local
 405 GPU memory are possible.
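
As an illustration of the partial reduction mentioned above, the
following plain C sketch accumulates the cells of each group into a
small per-group table before updating the plot totals. It is not the
kernel used in this chapter: the function name, the
\texttt{MAX\_PLOTS} bound, and the array layout are assumptions. On a
GPU the per-group table would live in local memory, the inner
accumulation would itself need local atomics or a tree reduction, and
the final update would remain an atomic addition, issued once per plot
and per work-group instead of once per cell.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_PLOTS = 16 };  /* hypothetical upper bound on the plot count */

/* Sum cell populations per plot, one group of cells at a time.
 * cell_plot[i] is the plot index of cell i. */
void reduce_by_plot(const int *cell_pop, const int *cell_plot,
                    size_t n_cells, size_t group_size,
                    int *plot_pop, size_t n_plots)
{
    assert(group_size > 0 && n_plots <= MAX_PLOTS);

    for (size_t p = 0; p < n_plots; ++p)
        plot_pop[p] = 0;

    for (size_t start = 0; start < n_cells; start += group_size) {
        int partial[MAX_PLOTS] = { 0 };  /* stands in for local memory */
        size_t end = start + group_size;
        if (end > n_cells)
            end = n_cells;

        /* Partial reduction restricted to the cells of this group. */
        for (size_t i = start; i < end; ++i) {
            assert(cell_plot[i] >= 0 && (size_t)cell_plot[i] < n_plots);
            partial[cell_plot[i]] += cell_pop[i];
        }

        /* One global update per plot and per group. */
        for (size_t p = 0; p < n_plots; ++p)
            plot_pop[p] += partial[p];
    }
}
```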
 406
 407 \pagebreak
 408 \lstinputlisting[caption=Collembola OpenCL reduction kernel,label=ch17:listing:collembola-reduc]{Chapters/chapter17/code/collem_kernel_reduc.cl}
 409
 410 \subsection{Collembola performance}
 411
 412 In this part we present the performance of the collembola model on
 413 various CPU and GPU execution
 414 platforms. Figure~\ref{ch17:fig:mior_perfs_collem} shows that the
 415 number of cores and the processor architecture have a strong influence
 416 on the obtained results. Three observations can be made:
 417
 418 % : the
 419 % dual-core processor equipping our Thinkpad platform has two to six
 420 % longer executions times, compared to a six-core Phenom X6.
 421
 422 % % \begin{figure}[h]
 423 % %begin{minipage}[t]{0.49\linewidth}
 424 % \centering \includegraphics[width=0.7\linewidth]{./Chapters/chapter17/figs/collem_perfs.pdf}
 425 % \caption{Performance CPU et GPU du modèle Collemboles}
 426 % \label{ch17:fig:mior_perfs_collem}
 427 % %end{minipage}
 428 % \end{figure}
 429
 430 % In figure~\ref{ch17:fig:mior_perfs_collem2} the Thinkpad curve is removed
 431 % to make other trends clearer. Two more observation can be made, using
 432 % this more detailled scale:
 433
 434 \begin{itemize}
 435 \item Older GPU cards can be slower than modern processors. This can
 436   be explained by the new cache and memory access optimizations
 437   implemented in newer generations of GPU devices. These optimizations
 438   reduce the penalties associated with irregular and frequent global
 439   memory accesses. They are not available on our Tesla nodes.
 440 \item GPU curves exhibit an odd-even pattern in their performance
 441   results. Since this phenomenon is visible across the hardware,
 442   drivers, and OpenCL implementations of two distinct manufacturers, it is
 443   likely the result of the decomposition process based on warps of
 444   fixed, power-of-two sizes.
 445 \item The number of cores is not the only determining factor: an Intel
 446   Core i7 2600K processor, even with only four cores, can provide
 447   better performance than a Phenom one.
 448 \end{itemize}
 449
 450 \begin{figure}[h]
 451 %begin{minipage}[t]{0.49\linewidth}
 452 \centering
 453 \includegraphics[width=0.7\linewidth]{./Chapters/chapter17/figs/collem_perfs_nothinkpad.pdf}
 454 \caption{Performance of the Collembola model on CPU and GPU.}
 455 \label{ch17:fig:mior_perfs_collem}
 456 %end{minipage}
 457 \end{figure}
 458
 459 These results show that using the GPU to parallelize part of the
 460 simulator results in tangible performance gains over a CPU execution
 461 on modern hardware. These gains are more mixed on older GPU platforms
 462 due to their limitations when dealing with the irregular memory or execution
 463 patterns often encountered in MAS. This can be closely linked
 464 to the availability of caching facilities on the GPU hardware, whose
 465 effect depends strongly on the locality and frequency of data
 466 accesses. In this case, even if the Tesla architecture offers more
 467 execution cores and is by far the costlier solution, more recent and
 468 cheaper solutions such as high-end GPUs provide better performance
 469 when the execution is not constrained by memory size.
 470
 471 \section{Second example}
 472 \label{ch17:sec:2ndmodel}
 473
 474 The second model, the MIOR model, simulates the behavior of microbial
 475 colonies. Its execution model is more complex so that it requires
 476 changes in the initial algorithm and the use of synchronization to
 477 benefit from the GPU architecture.
 478
 479 \subsection{The MIOR model}
 480 \label{ch17:subsec:miormodel}
 481
 482 The MIOR~\cite{C.Cambier2007} model was developed to simulate local
 483 interactions in soil between microbial colonies and organic
 484 matter. It reproduces each small cubic unit ($0.002~\mathrm{m}^3$) of soil as
 485 a MAS.
 486
 487 Multiple implementations of the MIOR model have already been
 488 realized, in Smalltalk and NetLogo, in two or three dimensions. The latest
 489 implementation, used in our work and referred to as MIOR in the
 490 rest of the chapter, is freely accessible online as
 491 WebSimMior\footnote{http://www.IRD.fr/websimmior/}.
 492
 493 MIOR is based around two types of agents: (1) the Meta-Mior (MM),
 494 which represents microbial colonies consuming carbon and (2) the
 495 Organic Matter (OM) which represents carbon deposits occurring in
 496 soil.
 497
 498 The Meta-Mior agents are characterized by two distinct behaviors:
 499 \begin{itemize}
 500 \item \emph{breath}: this action converts mineral carbon from the soil
 501   to carbon dioxide ($CO_{2}$) that is released into the soil; and
 502 \item \emph{growth}: by this action each microbial colony fixes the
 503   carbon present in the environment to reproduce itself (increasing its
 504   size). This action is only possible if the colony's breathing needs
 505   are covered, i.e., enough mineral carbon is available.
 506 \end{itemize}
 507
 508 These behaviors are described in Algorithm~\ref{ch17:seqalgo}.
 509
 510 \begin{algorithm}[h]
 511 \caption{evolution step of each Meta-Mior (microbial colony) agent}
 512 \label{ch17:seqalgo}
 513 \KwIn{A static array $mmList$ of MM agents}
 514 \myinput{A static array $omList$ of OM agents}
 515 \myinput{A MIOR environment $world$}
 516 $breathNeed \gets world.respirationRate \times mm.carbon$\;
 517 $growthNeed \gets world.growthRate \times mm.carbon$\;
 518 $availableCarbon \gets totalAccessibleCarbon(mm)$\;
 519 \uIf{$availableCarbon > breathNeed$}{
 520   \tcc{ Breath }
 521   $mm.active \gets true$\;
 522   $availableCarbon \gets availableCarbon - consumCarbon(mm, breathNeed)$\;
 523   $world.CO2 \gets world.CO2 + breathNeed$\;
 524   \If{$availableCarbon > 0$}{
 525     \tcc{ Growth }
 526     $growthConsum \gets min(totalAccessibleCarbon(mm), growthNeed)$\;
 527     $consumCarbon(mm, growthConsum)$\;
 528     $mm.carbon \gets mm.carbon + growthConsum$\;
 529   }
 530 }
 531 \Else{
 532   $mm.active \gets false$
 533 }
 534 \end{algorithm}
 535
 536 Since this simulation takes place at a microscopic scale, a large
 537 number of these simulations must be executed for each macroscopic
 538 simulation step to model a realistic-sized unit of soil. This leads to
 539 large computing needs despite the small computation cost of each
 540 individual simulation.
 541
 542
 543 \subsection{MIOR implementation}
 544
 545 As pointed out previously, the MIOR implementation required more
 546 changes to the initial code to run on GPU. As a first attempt, we
 547 tried a simple GPU implementation of the MIOR simulator, with only
 548 minimal changes to the CPU algorithm. Execution times showed the
 549 inefficiency of this approach and highlighted the necessity of
 550 adapting the simulator to take advantage of the GPU execution
 551 capabilities~\cite{lmlm+12:ip}. In this part, we show the main changes
 552 that were made to adapt the MIOR simulator to GPU architectures.
 553
 554 \subsubsection{Execution mapping on GPU}
 555
 556 \begin{figure}
 557 \centering
 558 \includegraphics[width=0.7\textwidth]{Chapters/chapter17/figs/repartition.pdf}
 559 \caption{Consolidation of multiple simulations in one OpenCL kernel execution.}
 560 \label{ch17:fig:gpu_distribution}
 561 \end{figure}
 562
 563 Each MIOR simulation is represented by a work-group, and each agent by
 564 a work-item. A kernel is in charge of the life cycle process for each
 565 agent of the model. This kernel is executed by all the work-items of
 566 the simulation, each on its own GPU core.
 567
 568 The usage of one work-group for each simulation allows the easy
 569 execution of multiple simulations in parallel, as shown on
 570 figure~\ref{ch17:fig:gpu_distribution}.  By taking advantage of the
 571 execution overlap possibilities provided by OpenCL, it then becomes
 572 possible to exploit all the cores at the same time, even if an unique
 573 simulation is too small to use all the available GPU cores. However,
 574 the maximum size of a work-group is limited (to $512$), which allows
 575 us to execute only one simulation per work-group when using $310$
 576 threads (number of OM in the reference model) to execute the
 577 simulation.
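
As an illustration, the mapping above can be written down as index
arithmetic. The following host-side sketch is plain C rather than
OpenCL; the names \texttt{launch\_dims}, \texttt{make\_launch\_dims},
\texttt{simulation\_of}, and \texttt{agent\_of} are hypothetical and do
not come from the MIOR code. In a kernel, the two indices would be
obtained from \texttt{get\_group\_id(0)} and \texttt{get\_local\_id(0)}.

\begin{lstlisting}[language=C]
#include <assert.h>
#include <stddef.h>

/* Work-group size limit quoted in the text (hardware dependent). */
#define MAX_WORK_GROUP_SIZE 512

typedef struct {
    size_t local_size;   /* work-items per work-group = agents per simulation */
    size_t global_size;  /* total work-items = simulations * agents */
} launch_dims;

/* One work-group per simulation, one work-item per agent. */
static launch_dims make_launch_dims(size_t n_simulations, size_t n_agents)
{
    assert(n_agents > 0 && n_agents <= MAX_WORK_GROUP_SIZE);
    launch_dims d = { n_agents, n_simulations * n_agents };
    return d;
}

/* Emulates get_group_id(0): the simulation a global index belongs to. */
static size_t simulation_of(size_t global_id, size_t n_agents)
{
    return global_id / n_agents;
}

/* Emulates get_local_id(0): the agent index inside that simulation. */
static size_t agent_of(size_t global_id, size_t n_agents)
{
    return global_id % n_agents;
}
\end{lstlisting}

With $50$ simulations of $310$ agents, this gives $50$ work-groups and
$15500$ work-items. Packing two such simulations into one work-group
would need $620$ work-items, which exceeds the limit of $512$.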
 578
 579 The usage of the GPU to execute multiple simulations is initiated by
 580 the CPU. The CPU keeps total control of the simulator execution
 581 flow. Thus, optimized scheduling policies (such as submitting kernels
 582 in batch, limiting the number of kernels, or asynchronously retrieving
 583 the simulation results) can be defined to minimize the cost related to
 584 data transfers between CPU and GPUs.
 585
 586 \subsubsection{Data structures translation}
 587 \label{ch17:subsec:datastructures}
 588
 589 The adaptation of the MIOR model to GPU requires the mapping of the
 590 data model to OpenCL data structures. The environment and the agents
 591 are represented by arrays of structures, where each structure
 592 describes the state of one entity. The behaviors of these entities are
 593 implemented as OpenCL functions to be called from the kernels during
 594 execution.
 595
 596 Since the main program is written in Java, JOCL is responsible for the
 597 allocation and mapping of the object data structures to OpenCL ones
 598 before execution.
 599
 600 Four main data structures are defined: (1) an array of MM agents,
 601 representing the state of the microbial colonies; (2) an array of OM
 602 agents, representing the state of the carbon deposits; (3) a topology
 603 matrix, which stores accessibility information between the two types
 604 of agents of the model; and (4) a world structure, which contains all the
 605 global input data (metabolism rate, numbers of agents) and output data
 606 (quantity of $CO_{2}$ produced) of the simulation. The C-like OpenCL
 607 structures used to represent each type of agent and the environment
 608 are shown in
 609 Listing~\ref{ch17:listing:mior_data_structures}. These data
 610 structures are initialized by the CPU and then copied to the GPU.
 611
 612
 613 \lstinputlisting[caption=Main data structures used in a MIOR simulation,label=ch17:listing:mior_data_structures]{Chapters/chapter17/code/data_structures.cl}
 614
 615 The world topology is stored as a two-dimensional matrix which
 616 represents OM indexes on the abscissa and MM indexes on the
 617 ordinate. Each agent walks through its row or column of the matrix at
 618 each iteration to determine which agents can be accessed during the
 619 simulation.  Since many agents are not connected, this matrix is
 620 sparse, which introduces a large number of useless memory accesses. To
 621 reduce the impact of these memory accesses we use a compacted,
 622 optimized representation of this matrix based
 623 on~\cite{Gomez-Luna:2009:PVS:1616772.1616869}, as illustrated in
 624 Figure~\ref{ch17:fig:csr_representation}. This compact representation
 625 considers each row of the matrix as an index list, and stores
 626 only the accessible agents, to reduce the number of non-productive
 627 accesses.
 628
 629 \begin{figure}[h]
 630 \centering
 631 \includegraphics[width=0.8\textwidth]{Chapters/chapter17/figs/grid.pdf}
 632 %\psfig{file=figs/grid, height=1in}
 633 \caption{Compact representation of the topology of a MIOR simulation.}
 634 \label{ch17:fig:csr_representation}
 635 \end{figure}
 636
 637 Since dynamic memory allocation is not yet possible in OpenCL and is
 638 only provided in the latest revisions of the CUDA standard, these
 639 matrices are statically allocated. The allocation is based on the
 640 worst-case scenario where all OM and MM are linked, since the real
 641 occupation of the matrix cells cannot be deduced without some kind of
 642 preprocessing computations.
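
A minimal sketch of such a compact, statically allocated topology is
given below in plain C. The layout (one counter and one index list per
MM row, sized for the worst case) is our assumption for illustration
and may differ from the one used in the cited work; the names
\texttt{compact\_topology} and \texttt{compact\_from\_dense}, and the
sizes \texttt{N\_MM} and \texttt{N\_OM}, are hypothetical.

\begin{lstlisting}[language=C]
#include <assert.h>
#include <stddef.h>

#define N_MM 3   /* hypothetical sizes, kept small for illustration */
#define N_OM 4

/* Compact topology: row m lists the OM indices reachable from MM m.
 * Storage is sized for the worst case (every MM linked to every OM). */
typedef struct {
    int count[N_MM];        /* number of valid entries in each row */
    int index[N_MM][N_OM];  /* first count[m] entries of row m are OM indices */
} compact_topology;

/* Builds the compact form from a dense 0/1 accessibility matrix. */
static void compact_from_dense(const int dense[N_MM][N_OM],
                               compact_topology *out)
{
    assert(out != NULL);
    for (int m = 0; m < N_MM; ++m) {
        int n = 0;
        for (int o = 0; o < N_OM; ++o) {
            if (dense[m][o]) {
                out->index[m][n++] = o;  /* keep only accessible deposits */
            }
        }
        out->count[m] = n;
    }
}
\end{lstlisting}

An agent then reads only the first \texttt{count[m]} entries of its
row instead of scanning all \texttt{N\_OM} cells, which is where the
saving in memory accesses comes from. The memory footprint itself is
not reduced, since each row is still dimensioned for the worst case.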
 643
 644 \subsubsection{Critical resources access management}
 645 \label{ch17:subsec:concurrency}
 646
 647 One of the main concerns in the MIOR model is to ensure that all the
 648 microbial colonies have equitable access to carbon resources
 649 when multiple colonies share the same deposits. Access
 650 synchronizations are mandatory in these cases to prevent conflicting
 651 updates on the same data that may lead to calculation errors (e.g., loss
 652 of matter).
 653
 654 On massively parallel architectures such as GPUs, these kinds of
 655 synchronization conflicts can lead to an inefficient implementation by
 656 enforcing a quasi-sequential execution. It is necessary, in the case
 657 of MIOR as well as for other ABM, to ensure that each work-item is not
 658 too constrained in its execution.
 659
 660 \pagebreak
 661 \lstinputlisting[caption=Main MIOR kernel,label=ch17:listing:mior_kernels]{./Chapters/chapter17/code/mior_kernels.cl}
 662
 663 From the sequential algorithm (Algorithm~\ref{ch17:seqalgo}) where all
 664 the agents share the same data, we have developed a parallel algorithm
 665 composed of three sequential stages separated by synchronization
 666 barriers. This new algorithm is based on the distribution of the
 667 available OM carbon deposits into parts at the beginning of each
 668 execution step. The three stages, illustrated in
 669 Listing~\ref{ch17:listing:mior_kernels}, are the following:
 670
 671 \begin{enumerate}
 672 \item \emph{scattering}: the available carbon in each carbon deposit
 673   (OM) is equitably dispatched among all accessible MM in the form of
 674   parts,
 675 \item \emph{live}: each MM consumes carbon in its allocated parts for
 676   its breathing and growing processes, and
 677 \item \emph{gathering}: unconsumed carbon in parts is gathered back
 678   into the carbon deposits.
 679 \end{enumerate}
 680
 681 This solution removes the per-access data synchronization needed by
 682 the first algorithm, leaving only the synchronization barriers between
 683 stages, and requires only one kernel launch from Java, as shown in
 684 Listing~\ref{ch17:fig:mior_launcher}.
 685
 686 \lstinputlisting[caption=MIOR simulation launcher,label=ch17:fig:mior_launcher]{Chapters/chapter17/code/mior_launcher.java}
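
To make the three stages concrete, the following plain C sketch
emulates one simulation step sequentially. It is not the MIOR kernel:
the function \texttt{mior\_step}, the dense \texttt{access} matrix,
and the per-colony \texttt{demand} array are hypothetical
simplifications. On the GPU, each outer loop would be spread over the
work-items, with a barrier where indicated.

\begin{lstlisting}[language=C]
#include <assert.h>

#define N_MM 2   /* hypothetical sizes, kept small for illustration */
#define N_OM 2

/* One simulation step. Returns the total carbon consumed by the MM. */
static double mior_step(const int access[N_MM][N_OM],
                        double carbon[N_OM],
                        const double demand[N_MM])
{
    double parts[N_MM][N_OM] = {{0}};
    double consumed = 0.0;

    /* Stage 1, scattering: each deposit is split equally among the
     * colonies that can reach it. */
    for (int o = 0; o < N_OM; ++o) {
        int n = 0;
        for (int m = 0; m < N_MM; ++m) n += access[m][o] ? 1 : 0;
        if (n == 0) continue;      /* unreachable deposit keeps its carbon */
        for (int m = 0; m < N_MM; ++m)
            if (access[m][o]) parts[m][o] = carbon[o] / n;
        carbon[o] = 0.0;
    }
    /* (barrier on the GPU) */

    /* Stage 2, live: each colony only touches its own parts, so no
     * lock is needed on the shared deposits. */
    for (int m = 0; m < N_MM; ++m) {
        double need = demand[m];
        assert(need >= 0.0);
        for (int o = 0; o < N_OM && need > 0.0; ++o) {
            double take = parts[m][o] < need ? parts[m][o] : need;
            parts[m][o] -= take;
            need -= take;
            consumed += take;      /* would be a reduction on the GPU */
        }
    }
    /* (barrier on the GPU) */

    /* Stage 3, gathering: unconsumed parts return to their deposit. */
    for (int o = 0; o < N_OM; ++o)
        for (int m = 0; m < N_MM; ++m)
            carbon[o] += parts[m][o];

    return consumed;
}
\end{lstlisting}

In this sketch the carbon left in the deposits plus the carbon
consumed always equals the initial amount, which is the property (no
loss of matter) that the synchronization is meant to protect.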
 687
 688 \subsubsection{Termination detection}
 689
 690 The termination of a MIOR simulation is reached when the model
 691 stabilizes and no more $CO_{2}$ is produced. This termination
 692 detection can be done on either the CPU or the GPU but it requires a
 693 global view of the system execution.
 694
 695 In the first case, when the CPU controls the GPU simulation process,
 696 the detection is done in two steps: (1) the CPU starts the execution
 697 of a simulation step on the GPU and (2) the CPU retrieves the GPU data
 698 and determines if another iteration must be launched or if the
 699 simulation has terminated. This approach allows fine-grained control
 700 over the GPU execution, but it requires many costly data transfers as
 701 each iteration result must be sent from the GPU to the CPU. In the
 702 case of the MIOR model these costs are mainly due to the inherent
 703 PCI Express latencies rather than to bandwidth limitations, since
 704 data sizes remain rather small, on the order of a few dozen
 705 megabytes.
 706
 707 In the second case the termination detection is directly implemented
 708 on the GPU by checking the amount of available carbon between two
 709 iterations. The CPU does not have any feedback while the simulation is
 710 running, but retrieves the results once the kernel execution is
 711 finished. This approach minimizes the number of transfers between the
 712 CPU and the GPU.
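
The termination test itself is the same in both cases; only the place
where the loop runs changes. A minimal sketch in plain C follows,
under the assumption that a step reports the $CO_{2}$ it produced. The
names \texttt{step\_fn}, \texttt{run\_until\_stable}, and the toy step
\texttt{halve\_step} are hypothetical, as are the tolerance
\texttt{epsilon} and the safety bound \texttt{max\_steps}.

\begin{lstlisting}[language=C]
#include <assert.h>
#include <stddef.h>

/* One simulation step: advances the state and returns the CO2
 * produced during that step. */
typedef double (*step_fn)(void *state);

/* Iterates until a step produces no more than epsilon CO2, or until
 * max_steps is reached. Returns the number of steps executed. */
static size_t run_until_stable(step_fn step, void *state,
                               double epsilon, size_t max_steps)
{
    assert(step != NULL && epsilon >= 0.0);
    size_t n = 0;
    while (n < max_steps) {
        double produced = step(state);
        ++n;
        if (produced <= epsilon) break;   /* model has stabilized */
    }
    return n;
}

/* Toy step for illustration only: half of the remaining carbon is
 * turned into CO2 at each call. */
static double halve_step(void *state)
{
    double *carbon = state;
    double produced = *carbon / 2.0;
    *carbon -= produced;
    return produced;
}
\end{lstlisting}

When this loop runs on the CPU, each call to the step implies a kernel
launch and a read-back. When it runs inside the kernel, the CPU sees
only the final state, which matches the second case described above.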
 713
 714 \subsection{Performance of MIOR implementations}
 715 \label{ch17:subsec:miorexperiments}
 716
 717 In this part we present several MIOR GPU implementations using the
 718 distribution/gathering process described in the previous section and
 719 compare their performance on two distinct hardware platforms, i.e., two
 720 different GPU devices. Five incremental MIOR implementations were
 721 developed with an increasing level of adaptation of the algorithm. In
 722 all cases, we use the average time over 50 executions as the
 723 performance indicator.
 724
 725 \begin{itemize}
 726 \item The \textbf{GPU 1.0} implementation is a direct implementation
 727   of the existing algorithm and its data structures where data
 728   dependencies were removed, and it uses the non-compact topology
 729   representation described in Section~\ref{ch17:subsec:datastructures}.
 730 \item The \textbf{GPU 2.0} implementation uses the previously
 731   described compact representation of the topology and remains
 732   otherwise identical to the GPU 1.0 implementation.
 733 \item The \textbf{GPU 3.0} implementation introduces the manual
 734   copy into local (private) memory of often-accessed global data, such
 735   as carbon parts or topology information.
 736 \item The \textbf{GPU 4.0} implementation is a variant of the GPU 1.0
 737   implementation but allows the execution of multiple simulations for
 738   each kernel execution.
 739 \item The \textbf{GPU 5.0} implementation is a multi-simulation
 740   version of the GPU 2.0 implementation, using the execution of
 741   multiple simulations for each kernel execution as for GPU 4.0.
 742 \end{itemize}
 743
 744 The last two implementations, \textbf{GPU 4.0} and \textbf{GPU 5.0},
 745 illustrate the gain provided by a better usage of the hardware
 746 resources, thanks to the driver execution overlapping capabilities. A
 747 sequential version of the MIOR algorithm, labeled \textbf{CPU}, is
 748 included for comparison purposes. This sequential version was developed
 749 in Java, the same language used for GPU implementations.
 750
 751 For these performance evaluations, two platforms are used. The first
 752 one is representative of the kind of hardware which is available on
 753 HPC clusters. It is a cluster node dedicated to GPU computations with
 754 two Intel X5550 processors running at $2.67$GHz and one Tesla C1060
 755 GPU device running at $1.3$GHz and composed of $240$ cores ($30$
 756 multi-processors). The second platform illustrates what can be
 757 expected from a personal desktop computer built a few years ago. It
 758 uses an Intel Q9300 CPU, running at $2.5$GHz, and a GeForce 8800GT GPU
 759 running at $1.5$GHz and composed of $112$ cores ($14$
 760 multi-processors). The purpose of these two platforms is to assess the
 761 benefit that can be obtained when a scientist either has access to
 762 specialized hardware such as a cluster or tries to take advantage of a
 763 personal computer.
 764
 765 \begin{figure}[!h]
 766 %begin{minipage}[t]{0.49\linewidth}
 767 \centering
 768 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/mior_perfs_tesla.pdf}
 769 %\caption{Performance CPU and GPU sur carte Tesla C1060}
 770 \caption{CPU and GPU performance on a Tesla C1060 node.}
 771 \label{ch17:fig:mior_perfs_tesla}
 772 %end{minipage}
 773 \end{figure}
 774
 775 Figures~\ref{ch17:fig:mior_perfs_tesla}~and~\ref{ch17:fig:mior_perfs_8800gt}
 776 show the execution time for $50$ simulations on the two hardware
 777 platforms. A size factor is applied to the problem: at scale 1, the
 778 model contains $38$ MM and $310$ OM, while at scale 6 these
 779 numbers are multiplied by six. The size of the environment is modified
 780 as well to maintain the same average agent density in the model. This
 781 scaling factor displays the impact of the chosen size of simulation on
 782 performance.
 783
 784 %hspace{0.02\linewidth}
 785 %begin{minipage}[t]{0.49\linewidth}
 786 \begin{figure}[!h]
 787 \centering
 788 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/mior_perfs_8800gt.pdf}
 789 \caption{CPU and GPU performance on a personal computer with a GeForce 8800GT.}
 790 \label{ch17:fig:mior_perfs_8800gt}
 791 %end{minipage}
 792 \end{figure}
 793
 794 The charts show that for small problems the execution times of all
 795 the implementations are very close. This is because the GPU execution
 796 does not have enough threads (representing agents) for an optimal
 797 usage of GPU resources. This trend changes around scale $5$, where GPU
 798 2.0 and GPU 3.0 gain the advantage over the GPU 1.0 and CPU
 799 implementations. This advantage continues to grow with the scaling
 800 factor, and reaches a speedup of $10$ at scale $10$ between the
 801 fastest single-simulation GPU implementation and the first, naive one
 802 (GPU 1.0).
 803
 804 Multiple trends can be observed in these results. First, optimizations
 805 for the GPU hardware show a large, positive impact on performance,
 806 illustrating the strong requirements on the algorithm properties to
 807 reach execution efficiency. These charts also show that despite the
 808 vast difference in numbers of cores between the two GPU platforms, the
 809 same trends can be observed in both cases. We can therefore expect
 810 similar results on other GPU cards, without the need for more
 811 adaptations.
 812
 813 \begin{figure}[!h]
 814 \centering
 815 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/monokernel.pdf}
 816 \caption{Execution time of one multi-simulation kernel on the Tesla
 817   platform.}
 818 \label{ch17:fig:monokernel_graph}
 819 \end{figure}
 820
 821 \begin{figure}[!h]
 822 \centering
 823 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/multikernel.pdf}
 824 \caption{Total execution time for 1000 simulations on the Tesla
 825   platform, while varying the number of simulations for each kernel.}
 826 \label{ch17:fig:multikernel_graph}
 827 \end{figure}
 828
 829 There are two ways to measure simulation performance: (1) by
 830 executing only one kernel, and varying its size (the number of
 831 simulations executed in parallel), as shown in
 832 Figure~\ref{ch17:fig:monokernel_graph}, to test the costs linked to
 833 the parallelization process or (2) by executing a fixed number of
 834 simulations and varying the size of each kernel, as shown in
 835 Figure~\ref{ch17:fig:multikernel_graph}.
 836
 837 Figure~\ref{ch17:fig:monokernel_graph} illustrates the execution time
 838 for only one kernel. It shows that for small numbers of simulations
 839 run in parallel, the compact implementation of the model topology is
 840 faster than the two-dimensional matrix representation. This trend
 841 reverses with more than $50$ simulations in parallel, which can be
 842 explained either by the nonlinear progression of the synchronization
 843 costs or by the additional memory required for the access-efficient
 844 representation.
 845
 846 Figure~\ref{ch17:fig:multikernel_graph} illustrates the execution time
 847 of a fixed number of simulations. It shows that for a small number of
 848 simulations run in parallel, the costs resulting from program setup,
 849 data copies, and launch on GPU are very detrimental to
 850 performance. Once the number of simulations executed for each kernel
 851 grows, these costs are counterbalanced by computation costs. This
 852 trend is more marked in the case of the sparse implementation (GPU
 853 4.0) than in the compact one but appears on both curves. With more
 854 than $30$ simulations for each kernel, execution times level off, since
 855 hardware limits are reached. This indicates that the cost of preparing
 856 and launching kernels becomes negligible compared to the computing time
 857 once a good GPU occupancy rate is achieved.
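
These observations are consistent with a simple amortization model,
given here as an assumption of ours rather than as a fit to the
measurements. Let $N$ be the total number of simulations, $n$ the
number of simulations per kernel, $t_o$ the fixed cost of one launch
(setup, data copies, submission), and $t_c(n)$ the computing time of
one kernel. The total time is then
\[
  T(n) = \left\lceil \frac{N}{n} \right\rceil \left( t_o + t_c(n) \right).
\]
While the GPU is underused, $t_c(n)$ grows slowly with $n$, so $T(n)$
decreases roughly as $1/n$. Once the hardware is saturated, $t_c(n)$
grows in proportion to $n$, the product
$\lceil N/n \rceil \, t_c(n)$ becomes nearly constant, and the
overhead term $\lceil N/n \rceil \, t_o$ becomes small, which gives
the plateau seen in the figure.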
 858
 859 \section{Analysis and recommendations}
 860 \label{ch17:analysis}
 861
 862 In this section we synthesize the observations made on the two models
 863 and identify some recommendations for implementing complex systems on
 864 GPU platforms.
 865
 866 \subsection{Analysis}
 867
 868 In both the collembola and the MIOR model, a critical issue is
 869 the determination of the parts of the simulation that are to be run on
 870 GPU and those that are to remain on the CPU. The determination of these
 871 parts is a classic step of any algorithm parallelization and must take
 872 into account considerations such as the cost of the different parts of
 873 the algorithm and the expected gains.
 874
 875 In the case of the collembola model two steps of the algorithm were
 876 ported to GPU. Both steps use straightforward, easily parallelizable
 877 operations where a direct gain can be expected by using more execution
 878 cores without major modifications to the algorithm.
 879
 880 In the MIOR model case, however, no such inherently parallelizable
 881 parts are evident in the original sequential algorithm. This is mainly
 882 explained by the rate of interactions between agents in this model in
 883 the form of two operations (breathing, growth) using heavily-shared
 884 carbon resources. In this case the algorithm had to be more profoundly
 885 modified while keeping in mind the need to remain true to the original
 886 model, to synchronize the main execution step of all agents in the
 887 model, to ensure equity, and to minimize the number of
 888 synchronizations. The minimization is done by factoring the
 889 distribution of carbon in the model into two separate steps at the
 890 beginning and the end of each iteration rather than at multiple points
 891 of the execution.
 892
 893 \subsection{MAS execution workflow}
 894
 895 Many MAS simulations decompose their execution process into discrete
 896 evolution steps where each step represents a quantum of time (minimal
 897 unit of time described). At the end of each step, global data,
 898 graphical displays, or output files are updated. This execution model
 899 may not fit GPU platforms well, as they assume a more or less
 900 batch-like workflow model. The execution model must be split into the
 901 following ever-repeating steps:
 902
 903 \begin{itemize}
 904 \item Allocation of GPU data buffers
 905 \item Copy of data from CPU to GPU
 906 \item GPU kernels execution
 907 \item Copy of results from GPU to CPU
 908 \end{itemize}
 909
 910 This workflow works well if the considered data transfer time is
 911 negligible compared to GPU execution or can be done in parallel,
 912 thanks to the asynchronous nature of OpenCL. If we are to update the
 913 MAS model after each iteration then performance risks being
 914 degraded. This is illustrated in the MIOR model by the fact that the
 915 speedup observed on GPU is much more significant for bigger
 916 simulations, which imply longer GPU execution times. Our solution to
 917 this problem is to desynchronize the execution of the MAS model and
 918 its GPU parts by requesting the execution of multiple steps of the GPU
 919 simulations for each launch.
 920
 921 Another related prerequisite of GPU execution is the ability to have
 922 many threads executed, to allow an efficient exploitation of the
 923 superior number of cores provided by the architecture. In the case of
 924 MAS models, this means that executing one agent at a time on GPU is
 925 meaningless in regard to GPU usage, copying cost, and actual gain in
 926 execution time, if the agent computations are not complex enough. In
 927 the MIOR and the collembola models, this is solved by executing the
 928 computations for all agents of the model at the same time. If the
 929 model has only occasional needs for intensive computations, then some
 930 kind of batching mechanism is required to store pending computations in
 931 a queue, until the total sum of waiting computations justifies the
 932 transfer cost to the GPU platform.
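
Such a batching mechanism can be sketched in a few lines of plain C.
This is an illustration under our own assumptions, not part of either
model: the type \texttt{batch\_queue}, its functions, the capacity,
and the threshold policy (flush after a fixed number of requests) are
all hypothetical, and \texttt{batch\_flush} only stands in for the
actual transfer and kernel launch.

\begin{lstlisting}[language=C]
#include <assert.h>
#include <stddef.h>

#define BATCH_CAPACITY 8   /* hypothetical queue size */

typedef struct {
    int    pending[BATCH_CAPACITY]; /* identifiers of waiting computations */
    size_t count;                   /* number of waiting computations */
    size_t threshold;               /* flush once this many are waiting */
    size_t flushes;                 /* number of batches sent so far */
} batch_queue;

static void batch_init(batch_queue *q, size_t threshold)
{
    assert(q != NULL);
    assert(threshold > 0 && threshold <= BATCH_CAPACITY);
    q->count = 0;
    q->threshold = threshold;
    q->flushes = 0;
}

/* Stand-in for the transfer and kernel launch: here it only counts
 * the batch and empties the queue. */
static void batch_flush(batch_queue *q)
{
    if (q->count == 0) return;
    q->flushes += 1;
    q->count = 0;
}

/* Queues one request and flushes when the threshold is reached.
 * Returns 1 if this call triggered a flush, 0 otherwise. */
static int batch_submit(batch_queue *q, int request_id)
{
    q->pending[q->count++] = request_id;
    if (q->count >= q->threshold) {
        batch_flush(q);
        return 1;
    }
    return 0;
}
\end{lstlisting}

A final explicit call to \texttt{batch\_flush} is needed at the end of
a simulation step so that a partially filled batch is not left
waiting.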
 933
 934 \subsection{Implementation challenges}
 935
 936 Besides the execution strategy challenges described above, some
 937 implementation challenges also occur when implementing an OpenCL
 938 version of a MAS model, mainly related to the underlying limitations
 939 of the execution platform.
 940
 941 The first one is the impossibility (except in the latest CUDA versions) of
 942 dynamically allocating memory during execution. This is a problem in the
 943 case of models where the number of agents can vary during the
 944 simulation, such as prey-predator models. In this case, the only
 945 solution is to overestimate the size of arrays or data structures to
 946 accommodate these additional individuals, or to use the CPU to resize
 947 data structures when these situations occur. Both approaches require
 948 trading off either memory or performance and are not always practical.
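
The first approach, overestimating the array sizes, can be sketched as
a fixed-capacity pool with one liveness flag per slot. The code below
is plain, sequential C with hypothetical names (\texttt{agent\_pool},
\texttt{pool\_spawn}, \texttt{pool\_kill}) and a hypothetical
\texttt{energy} attribute; a parallel version would also need an
atomic operation or a per-work-item slot to claim entries safely.

\begin{lstlisting}[language=C]
#include <assert.h>
#include <stddef.h>

#define MAX_AGENTS 4   /* hypothetical upper bound fixed before launch */

typedef struct {
    float energy[MAX_AGENTS];
    int   alive[MAX_AGENTS];   /* 1 if the slot holds a live agent */
} agent_pool;

static void pool_init(agent_pool *p)
{
    assert(p != NULL);
    for (int i = 0; i < MAX_AGENTS; ++i) {
        p->energy[i] = 0.0f;
        p->alive[i] = 0;
    }
}

/* Claims the first free slot. Returns its index, or -1 when the pool
 * is full, i.e. when the overestimate was too small. */
static int pool_spawn(agent_pool *p, float energy)
{
    for (int i = 0; i < MAX_AGENTS; ++i) {
        if (!p->alive[i]) {
            p->alive[i] = 1;
            p->energy[i] = energy;
            return i;
        }
    }
    return -1;
}

/* Frees a slot so that a later birth can reuse it. */
static void pool_kill(agent_pool *p, int slot)
{
    assert(slot >= 0 && slot < MAX_AGENTS);
    p->alive[slot] = 0;
}
\end{lstlisting}

The cost of this design is visible in the code: memory for
\texttt{MAX\_AGENTS} agents is reserved whatever the actual
population, and a return value of $-1$ has to be handled, for example
by falling back to the CPU to resize the structures.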
 949
 950 Another limitation is the impossibility of storing pointers in data
 951 structures, since OpenCL only allows one-dimensional static arrays. This
 952 precludes the usage of structures such as linked lists, graphs, or
 953 sparse matrices not represented by some combination of static arrays,
 954 and can be another source of memory or performance losses.
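
A linked list expressed as a combination of static arrays, as
mentioned above, replaces each pointer by an index into a companion
array. The following plain C sketch uses hypothetical names
(\texttt{index\_list}, \texttt{list\_sum}, \texttt{NIL}) and a
hypothetical fixed size \texttt{N\_NODES}.

\begin{lstlisting}[language=C]
#include <assert.h>

#define N_NODES 5     /* hypothetical fixed capacity */
#define NIL     (-1)  /* plays the role of a null pointer */

typedef struct {
    int value[N_NODES];
    int next[N_NODES];  /* index of the following node, or NIL */
    int head;           /* index of the first node, or NIL */
} index_list;

/* Sums the values by following indices instead of pointers. */
static int list_sum(const index_list *l)
{
    int sum = 0;
    int steps = 0;
    for (int i = l->head; i != NIL; i = l->next[i]) {
        assert(i >= 0 && i < N_NODES);  /* index must stay in bounds */
        assert(steps < N_NODES);        /* guards against a cyclic chain */
        ++steps;
        sum += l->value[i];
    }
    return sum;
}
\end{lstlisting}

The arrays can be copied to the GPU as they are, since indices remain
valid in any address space, but every slot is allocated whether or not
it belongs to the list.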
 955
 956 In the case of MIOR, this problem is especially acute for the
 957 storage of neighborhoods: both representations consume much more
 958 memory than is actually required, since the worst case (all agents
 959 have access to all other agents) has to be taken into account when
 960 dimensioning the data structure.
 961
 962 The existence of different generations of GPU hardware is also a
 963 challenge. Older implementations, such as the four-year-old Tesla
 964 C1060 cards, have very strong constraints in terms of memory accesses
 965 and require very regular access patterns to perform efficiently. MAS
 966 having many random accesses, such as MIOR, or many small global memory
 967 accesses, such as Collembola, are penalized on these older
 968 cards. Fortunately, these requirements are less stringent on modern
 969 cards, which offer caches and other facilities traditionally present
 970 on CPU to offset these kinds of penalties.
 971
 972 The final concern follows from the previous ones, which often result in
 973 higher memory consumption. The amount of memory available on GPU cards
 974 is much more limited and adding new memory capabilities is more costly
 975 compared to expanding CPU RAM. On computing clusters, hardware
 976 nodes with 128GB of memory or more have become affordable, whereas
 977 the newer Tesla architecture remains limited to 16GB of memory. This can
 978 be a concern in the case of big MAS models, or small ones which can
 979 only use memory-inefficient OpenCL structures.
 980
 981 \subsection{MCSMA}
 982 \label{ch17:Mcsma}
 983
 984 As shown in the previous sections, many data representation choices
 985 are common to entire classes of MAS. The grid paradigm, for
 986 example, is often encountered in models where each cell constitutes
 987 either the elementary unit of simulation
 988 (SugarScape~\cite{Dsouza2007}) or a discretization of the environment
 989 space (Pathfinding~\cite{Guy09clearpath}). These grids can be
 990 considered as two- or three-dimensional matrices, whose processing can
 991 be directly distributed.
 992
 993 Another common data representation encountered in MAS is the
 994 usage of 2D or 3D coordinates to store the position of each agent of
 995 the model. In this case, even if the environment is no longer
 996 discrete, the location information still implies computations (Euclidean
 997 distances) which can be parallelized.
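
As an example of such a computation, the plain C sketch below counts
the agents lying within a given radius of agent $i$ in 2D. The
function name \texttt{count\_neighbors} and the separate \texttt{x}
and \texttt{y} arrays are our own choices for illustration. Squared
distances are compared with the squared radius, which gives the same
result as comparing Euclidean distances while avoiding the square
root. On a GPU, one work-item per agent $i$ would execute the loop.

\begin{lstlisting}[language=C]
#include <assert.h>
#include <stddef.h>

/* Counts the agents j != i whose Euclidean distance to agent i is at
 * most radius. x and y hold the n agent coordinates. */
static size_t count_neighbors(const float *x, const float *y, size_t n,
                              size_t i, float radius)
{
    assert(i < n && radius >= 0.0f);
    size_t count = 0;
    for (size_t j = 0; j < n; ++j) {
        if (j == i) continue;
        float dx = x[j] - x[i];
        float dy = y[j] - y[i];
        /* d <= r is equivalent to d*d <= r*r for non-negative values */
        if (dx * dx + dy * dy <= radius * radius) ++count;
    }
    return count;
}
\end{lstlisting}

Each call is independent of the others and only reads the coordinate
arrays, so no synchronization is required between agents.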
 998
 999 MCSMA~\cite{lmlm+13:ip} is a framework developed to provide the MAS
1000 designer with those basic data structures and the associated operations, to
1001 facilitate the porting of existing MAS to GPU. Two levels of
1002 utilization are provided to the developer, depending on the usage
1003 profile:
1004
1005 \begin{itemize}
1006 \item A high-level library, composed of modules grouping classes of
1007   operations. Such operations include distance computations in 1D, 2D, or 3D
1008   grids, and diffusion or reduction operations on matrices.
1009 \item A low-level API which allows the developer direct access to the
1010   GPU and the inner working of MCSMA, to develop new modules in the
1011   case where the required operations are not yet provided by the
1012   platform.
1013 \end{itemize}
1014
1015 Both usage levels were illustrated in the above two practical
1016 cases. In MIOR, the whole algorithm (barring initialization) is ported
1017 to GPU as a specific plugin which allows executing $n$ MIOR
1018 simulations and retrieving their results. This first approach requires
1019 extensive adaptations to the original algorithm.  In collembola, on
1020 the contrary, the main steps of the algorithm remain executed on the
1021 CPU, and only specific operations are delegated to generic, already
1022 existing diffusion and reduction kernels. The fundamental algorithm is
1023 not modified and GPU execution is only applied to specific parts of
1024 the execution which may benefit from it. These two programming
1025 approaches allow incremental adaptations of existing Java MAS to
1026 accelerate their execution on GPU, while retaining the option to
1027 develop their own reusable or more efficient modules to supplement the
1028 already existing ones.
1029
1030 \section{Conclusion}
1031 \label{ch17:conclusion}
1032
1033 This chapter has addressed the issue of complex system simulation by
1034 using agent-based paradigms and GPU hardware. From the experiments on
1035 two existing agent-based models of soil science we have provided
1036 useful information on the architecture, the algorithm design, and the
1037 data management to run agent-based simulations on GPU, and more
1038 generally to run computationally intensive applications that are not
1039 based on purely matrix-oriented models. The first result of this work is
1040 that adapting the algorithm to a GPU architecture is possible and
1041 suitable to speed up agent-based simulations, as illustrated by the
1042 MIOR model. Coupling CPU with GPU seems to be an interesting way to
1043 take better advantage of the power given by computers and clusters, as
1044 illustrated by the collembola model. From our point of view the
1045 adaptation process is less time-consuming than a traditional
1046 parallelization on distributed nodes and not much more difficult than a
1047 standard multithreaded parallelization, since all the data remains on
1048 the same host and can be shared in central memory. The usage of OpenCL
1049 also enables a portable simulator that can be run on different
1050 graphical units. Even using a mainstream card such as the GPU card of
1051 a standard computer can lead to significant performance
1052 improvements. This is an interesting result as it opens up the field
1053 of inexpensive HPC to a large community.
1054
1055 In this perspective, we are working on MCSMA, a development platform
1056 that would facilitate the use of GPU or many-core architectures for
1057 multi-agent simulations. Our first work has been the definition of
1058 common, efficient, reusable data structures, such as grids or
1059 lists. Another goal is to provide easier means to control the
1060 distribution of specific processes between CPU and GPU, to allow the
1061 easy exploitation of the strengths of each platform in the same
1062 multi-agent simulation. We think that the same approach, i.e.,
1063 developing specific environments that facilitate the developer access
1064 to the GPU power, can be applied in many domains with computationally
1065 intensive needs to open the GPU use to a larger community.
1066
1067 \putbib[Chapters/chapter17/biblio]