BookGPU/Chapters/chapter17/ch17.tex

   1 \chapterauthor{G. Laville, C. Lang, K. Mazouzi, N. Marilleau, B. Herrmann, L. Philippe}{Femto-ST Institute, University of Franche-Comt{\'e}}
   2
   3 \newlength\mylen
   4 \newcommand\myinput[1]{%
   5   \settowidth\mylen{\KwIn{}}%
   6   \setlength\hangindent{\mylen}%
   7   \hspace*{\mylen}#1\\}
   8
   9 \chapter{Implementing MAS on GPU}
  10 \label{chapter17}
  11
  12
  13 \section{Introduction}
  14 \label{ch17:intro}
  15
  16 In this chapter we introduce the use of Graphical Processing Units for
  17 multi-agents based systems as an example of a not-so-regular
  18 application that could benefit from the GPU computing
  19 power. Multi-Agent Systems or MAS are a simulation paradigm used to
  20 study the behavior of dynamic systems. Dynamic systems as physical
  21 systems are often modeled by mathematical representations and their
  22 dynamic behavior simulated by differential equations. The simulation
  23 of the system thus often relay on the resolution of a linear system
  24 that can be efficiently computed on a graphical processing unit as
  25 shown in the preceeding chapters. But when the behavior of the system
  26 elements is not uniformly driven by the same law, when these elements
  27 have their own behavior, the modeling process is too complex to rely
  28 on formal expressions. In this context MAS is a recognized approach to
  29 model and simulate systems where individuals have an autonomous
  30 behavior that cannot be simulated by the evolution of a set of
  31 variables driven by mathematical laws. MAS are often used to simulate
  32 natural or collective phenomenons whose individuals are too numerous or
  33 various to provide a unified algorithm describing the system
  34 evolution. The agent-based approach is to divide these complex systems
  35 into individual self-contained entities with their smaller set of
  36 attributes and functions. But, as for mathematical simulations, when
  37 the size of the Multi-Agent System (MAS) increases the need of computing
  38 power and memory also increases. For this reason, multi-agent systems
  39 should benefit from the use of distributed computing
  40 architectures. Clusters and Grids are often identified as the main
  41 solution to increase simulation performance but Graphical Processing
  42 Units (GPU) are also a promising technology with an attractive
  43 performance/cost ratio.
  44
  45 Conceptually a MAS is a distributed system as it favors the definition
  46 and description of large sets of individuals, the agents, that can be
  47 run in parallel. As a large set of agents could have the same behavior
  48 a SIMD model should fit the simulation execution. Most of the
  49 agent-based simulators are however designed with a sequential scheme
  50 in mind and these simulators seldom use more than one core for their
  51 execution. Due to simulation scheduling constraints, data sharing and
  52 exchange between agents and the huge amount of interactions between
  53 agents and their environment, it is indeed rather difficult to
  54 distribute an agent based simulator, for instance, to take advantage of new
  55 multi-threaded computer architectures. Thus, guidelines and tools
  56 dedicated to MAS paradigm and HPC is now a need for other complex
  57 system communities. Note that, from the described structure (large
  58 number of agents sharing data), we can conclude that MAS would more
  59 easily benefit from many-cores architectures than from other kinds of
  60 parallelism.
  61
  62 Another key point that advocates for the use of many-core in MAS is
  63 the growing need for multi-scale simulations. Multi-scale simulations
  64 explore problems with interactions between several scales. The
  65 different scales use different granularities of the structure and
  66 potentially different models. Most of the time the lower scale
  67 simulations provide results to higher scale simulations. In that case
  68 the execution of the simulations can easily be distributed between the
  69 local cores and a many-core architecture, i.e. a GPU device.
  70
  71 We explore in this chapter the use of many-core architectures to
  72 execute agent-based simulations. We illustrate our reflexion with two
  73 uses cases: the colembola simulator designed to simulate the diffusion
  74 of collembola between plots of land and the MIOR simulator that
  75 reproduces effects of earthworms on bacteria dynamics in a bulked
  76 soil. In Section \ref{ch17:ABM} we present the work related to MAS and
  77 parallelization with a special focus on many-core use. In sections
  78 \ref{ch17:sec:1stmodel} and \ref{ch17:sec:2ndmodel} we present in
  79 details two multi-agent models, their GPU implementations, the
  80 conducted experiments and their performance results. The first model,
  81 given in Section \ref{ch17:sec:1stmodel}, illustrates the use of a GPU
  82 architecture to speed up the execution of some computation intensive
  83 functions while the main model is still executed on the central
  84 processing unit. The second model, given in Section
  85 \ref{ch17:sec:2ndmodel}, illustrates the use of a GPU architecture to
  86 implement the whole model on the GPU processor which implies deeper
  87 changes in the initial algorithm. In Section \ref{ch17:analysis} we
  88 propose a more general reflexion on these implementations and we
  89 provide some guidelines. Then we conclude in Section
  90 \ref{ch17:conclusion} on the possible generalization of our work.
  91
  92
  93 \section{Running Agent-Based Simulations}
  94 \label{ch17:ABM}
  95
  96 In this section, we present the context of multi-agent systems, MAS,
  97 their parallelization and we report several existing works on using
  98 GPU to simulate multi-agent systems.
  99
 100 \subsection{Multi-agent systems and parallelism}
 101
 102 Agent-Based systems are often used to simulate natural or collective
 103 phenomenons whose actors are too numerous or various to provide a
 104 simple unified algorithm describing the studied system
 105 dynamic~\cite{Schweitzer2003}. The implementation of an agent based
 106 simulation usually starts by designing the underlying Agent Based
 107 Model (ABM). Most ABM are based around a few types of entities such as Agents,
 108 Environment and around an Interaction Organization~\cite{Vowel02}. In the
 109 complex system domain, the environment often describes a real space,
 110 its structure (e.g. soil textures and porosities) and its dynamics
 111 (e.g. organic matter decomposition). It is a virtual world in which
 112 agents represent studied entities (e.g. biotic organisms)
 113 evolution. The actual agent is animated by a behavior that can range
 114 between reactivity (only react to external stimuli) and cognition
 115 (lives its own process alongside other individuals). Interaction
 116 and Organization define functions, types and patterns of
 117 communications of their member agents in the
 118 system~\cite{Odell:2003:RRD:1807559.1807562}. Note that, depending on
 119 the MAS, agents can communicate either directly through special
 120 primitives or indirectly through the information stored in the
 121 environment.
 122
 123 Agent based simulations have been used for more than one decade to
 124 reproduce, understand even predic complex system dynamic. They have proved their interest in various
 125 scientific communities. Nowadays generic agent based frameworks are
 126 promoted such as Repast~\cite{repast_home} or
 127 NetLogo~\cite{netlogo_home} to implement simulators. Many ABM such as
 128 the crown model representing a city-wide
 129 scale~\cite{Strippgen_Nagel_2009} tend however to require a large
 130 number of agents to provide a realistic behavior and reliable global
 131 statistics. Moreover, an achieved model analysis needs to run many
 132 simulations identified into an experiment plan to obtain enough
 133 confidence in a simulation. In this case the available computing power
 134 often limits the simulation size and the result range thus requiring
 135 the use of parallelism to explore bigger configurations.
 136
 137 For that, three major approaches can be identified:
 138 \begin{enumerate}
 139 \item parallelizing experiments plan on a cluster or a grid (one or
 140   few simulations are submitted to each
 141   core)~\cite{Blanchart11,Chuffart2010}.
 142 \item parallelizing the simulator on a cluster (the environment of the
 143   MAS is split and run on several distributed
 144   nodes)~\cite{Cosenza2011,MAR06}
 145 \item optimizing simulator by taking advantage of computer resources
 146   (multi-threading, GPU, and so on) \cite{Aaby10}
 147 \end{enumerate}
 148
 149 In the first case, experiments are run independently each others and only simulation parameters change between two runs so
 150 that a simple version of an existing simulator can be used. This
 151 approach does not, however, allow to run larger models.
 152 % It does not imply
 153 %model code changes except for Graphical Unit Interface extracting as
 154 %the experiments does not run interactively but in batch.
 155 In the second
 156 and the third case, model and code modifications are necessary. Only few
 157 frameworks however introduce distribution in agent simulation
 158 (Madkit~\cite{Gutknecht2000}, MASON~\cite{Sean05}, repastHPC~\cite{Collier11}) and parallel
 159 implementations are often based on the explicit use of threads using
 160 shared memory~\cite{Guy09clearpath} or cluster libraries such as
 161 MPI~\cite{Kiran2010}.
 162
 163 Parallelizing a multi-agent simulation is however complex due to space
 164 and time constraints. Multi-agent simulations are usually based on a
 165 synchronous execution:
 166 %by the agents that share the same environment.
 167 at each time step, numerous events (space data
 168 modification, agent motion) and interactions between agents happen.
 169 Distributing the simulation on several computers or grid nodes thus
 170 implies to guaranty a distributed synchronous execution and
 171 coherency. This often leads to poor performance or complex synchronization
 172 problems. Multi-cores execution or delegating part of this execution
 173 to others processors as GPUs~\cite{Bleiweiss_2008} are thus usually easier
 174 to implement since all the threads share the data and the local clock.
 175
 176 % Different agent patterns can be adopted in an ABMs such as
 177 % cognitive and reactive ones~\cite{Ferber99}. Cognitive agents act on
 178 % the environment and interact with other agents according to a complex
 179 % behavior. This behavior takes a local perception of the virtual world
 180 % and the agent past (a memory characterized by an internal state and
 181 % belief, imperfect knowledge about the world) into account. Reactive
 182 % agents have a much systematic pattern of action based on stimuli
 183 % response schemes (no or few knowledge and state conservation in agent). The
 184 % evolution of the ABM environment, in particular, is often represented with
 185 % this last kind of agents. As their behavior is usually simple, we
 186 % propose in this chapter to delegate part of the environment and of the
 187 % reactive agents execution to the graphical processing unit of the
 188 % computer. This way we can balance the load between both CPU and GPU
 189 % execution resources.
 190
 191 % In the particular case of multi-scale simulations such as the Sworm
 192 % simulation~\cite{BLA09} the environment may be used at different
 193 % levels. Since the representation of the whole simulated environment
 194 % (i.e. the soil area) would be costly, the environment is organized as
 195 % a multi-level tree of small soil cubes which can be lazily
 196 % instantiated during the simulation. This allows to gradually refine
 197 % distribution details in units of soil as agents progress and need
 198 % those information, by using a fractal process based on the
 199 % bigger-grained already instantiated levels. This characteristic,
 200 % especially for a fractal model, could be the key of the
 201 % distribution. For instance, each branch of a fractal environment could
 202 % be identified as an independent area and parallelized. In addition
 203 % Fractal is a famous approach to describe multi-scale environment (such
 204 % as soil) and its organization~\cite{perrier}. In that case the lower
 205 % scale simulations can also be delegated to the GPU card to limit the
 206 % load of the main (upper scale) simulation.
 207
 208 \subsection{MAS Implementation on GPU}
 209 \label{ch17:subsec:gpu}
 210
 211 The last few years saw the apparition of new generations of graphic
 212 cards based on more general purpose execution units which are
 213 promising for less regular systems as MAS.
 214 % These units can be programmed using GPGPU languages to solve various
 215 % problems such as linear algebra resolutions, usually based on matrix
 216 % manipulations.
 217 The matrix-based data representation and SIMD computations are
 218 however not straightforward in Multi-Agent Systems, where data
 219 structures and algorithms are tightly coupled to the described
 220 simulation. However, works from existing literature show that MAS can
 221 benefit from these performance gains on various simulation types, such
 222 as traffic simulation~\cite{Strippgen_Nagel_2009}, cellular
 223 automatons~\cite{Dsouza2007}, mobile-agent based
 224 path-finding~\cite{Silveira:2010:PRG:1948395.1948446} or genetic
 225 algorithms~\cite{Maitre2009}. Note that an application-specific
 226 adaptation process was required in the case of these MAS: some of the
 227 previous examples are driven by mathematical laws (path-finding) or
 228 use a natural mapping between a discrete environment (cellular
 229 automaton) and GPU cores. Unfortunately, this mapping often requires
 230 algorithmic adaptations in other models but the experience shows that
 231 the more reactive a MAS is the more adapted its implementation is to
 232 GPU.
 233
 234 The first step in the adaptation of an ABM to GPU platforms is the
 235 choice of the language. On the one hand, the Java programming language
 236 is often used for the implementation of MAS due to its availability
 237 on numerous platforms or frameworks and its focus on high-level,
 238 object-oriented programming. On the other hand, GPU platforms can
 239 only run specific languages as OpenCL or CUDA. OpenCL (supported on
 240 AMD, Intel and NVIDIA hardware) better suits the portability concerns
 241 across a wide range of hardware needed the agent simulators, as
 242 opposed to CUDA which is a NVIDIA-specific library.
 243
 244 OpenCL is a C library which provides access to the underlying CPUs or
 245 GPUs threads using an asynchronous interface. Various OpenCL functions allow
 246 the compilation and the execution of programs on these execution
 247 resources, the copy of data buffers between devices, or the collection
 248 of profiling information.
 249
 250 This library is based around three main concepts:
 251
 252 \begin{itemize}
 253 \item the \emph{kernel} (similar to a CUDA kernel) which represents a runnable program
 254   containing instructions to be executed on the GPU.
 255 \item the \emph{work-item} (equivalent to a CUDA thread), which is analogous to the concept
 256   of thread on GPU, in that it represents one running instance of a
 257   GPU kernel.
 258 \item the \emph{work-group} (or execution block) which is a set of work-items
 259   sharing some memory to speed up data accesses and
 260   computations. Synchronization operations such as barrier can only be
 261   used across the same work-group.
 262 \end{itemize}
 263
 264 Running an OpenCL computation consists in launching numerous work-items
 265 that execute the same kernel. The work-items are submitted to a
 266 submission queue to optimize the available cores usage. A calculus is
 267 achieved once all these kernel instances have terminated.
 268
 269 The number of work-items used in each work-group is an important
 270 implementation choice which determines how many tasks will share the
 271 same cache memory. Data used by the work-items can be stored as
 272 N-dimensions matrices in local or global GPU memory.  Since the size
 273 of this memory is often limited to a few hundred of kilobytes,
 274 choosing this number often implies a compromise between the model
 275 synchronization or data requirements and the available resources.
 276
 277 In the case of agent-based simulations, each agent can be naturally
 278 mapped to a work-item. Work-groups can then be used to represents groups
 279 of agents or simulations sharing common data (such as the environment)
 280 or algorithms (such as the background evolution process).
 281
 282 In the following examples the JOCL~\cite{jocl_home} binding is used
 283 to access the OpenCL platfmorm from the Java programming language.
 284
 285 In the next sections we present two practical cases that will be
 286 studied in details, from the model to its implementation and
 287 performance.
 288
 289 \section{A first practical example}
 290 \label{ch17:sec:1stmodel}
 291
 292 The first model, the Collembola model, simulates the propagation of
 293 collembolas in flieds and forests. It is based on a diffusion
 294 algorithm which illustrates the case of agents with a simple behavior
 295 and few synchronization problems.
 296
 297 \subsection{The Collembola model}
 298 \label{ch17:subsec:collembolamodel}
 299
 300 The Collembola model is an example of multi-agent system using GIS (Geographical Information System)
 301  and survey data (population count)
 302  to modelize the evolution of the biodiversity
 303 across land plots. A first version of this model has been developed with Netlogo framework by Bioemco and UMMISCO researchers. In this model, the biodiversity is modelized by populations of
 304 athropod individuals, the Collembola, which can reproduce and diffuse
 305 to favorable new habitats. The simulator allows to study the diffusion
 306 of collembola, between plots of land depending on their landuse
 307 (artifical soil, crop, forest\ldots). In this
 308 model the environment is composed of the studied land and the
 309 colembola are agents. Every land plot is divided into several cells,
 310 each cell representing a surface unit (16*16 meters). A number of individuals per collembola species is associated to each cell. The model evolution
 311 is then based on a common diffusion model that diffuse individuals between
 312 cells. Each step of the simulation is based on four stages, as shown
 313 on Figure~\ref{ch17:fig:collem_algorithm}:
 314
 315 % \begin{enumerate}
 316 % \item arrival of new individuals
 317 % \item reproduction in each cell
 318 % \item diffusion between cells
 319 % \item updating of colembola lifetime
 320 % \end{enumerate}
 321
 322 \begin{figure}[h]
 323 \centering
 324 \includegraphics[width=0.6\textwidth]{Chapters/chapter17/figs/algo_collem.pdf}
 325 \caption{Evolution algorithm of Collembola model}
 326 \label{ch17:fig:collem_algorithm}
 327 \end{figure}
 328
 329 The algorithm is quite simple but includes two costly operations, the
 330 reproduction and the diffusion, that must be parallelized to improve
 331 the model performances.
 332
 333 The {\bf reproduction} stage consists in updating the total population
 334 of each plot by taking the individuals arrived at the preceding
 335 computation step. This stage implies to process the whole set of cells
 336 of the environment to sum their population. The computed value is
 337 recorded in the plot associated to each cell. This process can be
 338 assimilated to a reduction operation on all the population cells
 339 associated to one plot to obtain its population.
 340
 341 The {\bf diffusion} stage simulates the natural behavior of the
 342 collembola that tends toward occupying the whole space over time. This
 343 stage constists in computing a new value for each cell depending on
 344 the population of the neighbor cells. This process can be assimilated
 345 to a linear diffusion at each iteration of the populationof the cells
 346 across their neightbors.
 347
 348 These two processes are quite common in numerical computations so
 349 that the collembola model can be adapted to a GPU execution without
 350 much difficulties.
 351
 352 \subsection{Collembola Implementation}
 353
 354 In the Collembola simulator biodiversity is modeled by populations
 355 of Collembola individuals, which can reproduce and diffuse to
 356 favorable new habitats. This is implemented in the algorithm as a fixed
 357 reproduction factor, applied to the size of each population, followed
 358 by a linear diffusion of each cell population to its eight
 359 neighbors. These reproduction and diffusion processes are followed by
 360 two more steps on the GPU implementation. The first one consist of
 361 culling of populations in a inhospitable environment, by checking each
 362 cell value, landuse, and setting its population to zero if
 363 necessary.  The final simulation step is the reduction of the cells
 364 populations for each plot, to obtain an updated plot population
 365 for statistic purposes. This separate computation step, done while
 366 updating each cell population in the reference sequential algorithm,
 367 is motivated by synchronization problems and allow the reduction of
 368 the total numbers of access needed to updated those populations.
 369
 370 %\lstinputlisting[language=C,caption=Collembola OpenCL
 371 %kernels,label=fig:collem_kernels]{Chapters/chapter17/code/collem_kernels.cl}
 372 \lstinputlisting[caption=Collembola OpenCL Diffusion kernel,label=ch17:listing:collembola-diffuse]{Chapters/chapter17/code/collem_kernel_diffuse.cl}
 373
 374 The reproduction, diffusion and culling steps are implemented on GPU
 375 (Figure~\ref{ch17:fig:collem_algorithm}) as a straight mapping of each cell
 376 to an OpenCL work-item (GPU thread). Listing
 377 \ref{ch17:listing:collembola-diffuse} gives the kernel for the
 378 diffusion implementation.  To prevent data coherency problems, the
 379 diffusion step is split into two phases, separated by an execution
 380 {\it barrier}. In the first phase, each cell diffusion overflow is
 381 calculated, and divided by the number of neightbors. Note that, at the
 382 model frontier, populations can also overflow outside the environment
 383 grid but we do not manage those external populations, since there are
 384 no reason to assume our model to be isolated of its surroundings. The
 385 overflow by neigbors value is stored for each cell, before
 386 encountering the barrier. After the barrier is met, each cell read the
 387 overflows stored by all its neighbors at the previous step and applies
 388 them to its own population. In this manner, only one barrier is
 389 required to ensure the consistency of populations value, since no
 390 cells ever modify a value other than its own.
 391
 392 \lstinputlisting[caption=Collembola OpenCL reduction kernel,label=ch17:listing:collembola-reduc]{Chapters/chapter17/code/collem_kernel_reduc.cl}
 393
 394 Listing \ref{ch17:listing:collembola-reduc} gives the kernel for the
 395 reduction implementation.  The only step requiring numerous
 396 synchronized accesses is the reduction one: in a first approach, we
 397 chose to use {\it atomic\_add} operation to implement this process, but more
 398 efficient implementations using partial reduction and local GPU memory
 399 could be implemented.
 400
 401 \subsection{Collembola performance}
 402
 403 In this part we present the performance of the Collembola model on
 404 various CPU and GPU execution
 405 platforms. Figure~\ref{ch17:fig:mior_perfs_collem} shows that the
 406 number of cores and the processor archiecture as a strong influence on the obtained results
 407
 408 % : the
 409 % dual-core processor equipping our Thinkpad platform has two to six
 410 % longer executions times, compared to a six-core Phenom X6.
 411
 412 % % \begin{figure}[h]
 413 % %begin{minipage}[t]{0.49\linewidth}
 414 % \centering \includegraphics[width=0.7\linewidth]{./Chapters/chapter17/figs/collem_perfs.pdf}
 415 % \caption{Performance CPU et GPU du modèle Collemboles}
 416 % \label{ch17:fig:mior_perfs_collem}
 417 % %end{minipage}
 418 % \end{figure}
 419
 420 % In figure~\ref{ch17:fig:mior_perfs_collem2} the Thinkpad curve is removed
 421 % to make other trends clearer. Two more observation can be made, using
 422 % this more detailled scale:
 423
 424 \begin{itemize}
 425 \item Older GPU cards can be slower than modern processors. This can
 426   be explained by the new cache and memory access optimizations
 427   implemented in newer generations of GPU devices. These optimizations
 428   reduce the penalities associated with irregular and frequent global
 429   memory accesses. They are not available on our Tesla nodes.
 430 \item GPU curves exhibit a odd-even pattern in their performance
 431   results. Since this phenomenon is visible on two distinct
 432   manufacturer hardware, drivers and OpenCL implementation, it is
 433   likely the result of the decomposition process based on warp of
 434   fixed, power-of-two sizes.
 435 \item The number of cores is not the only determining factor: an Intel
 436   Core i7 2600K processor, even with only four cores, can provide
 437   better performance than a Phenom one.
 438 \end{itemize}
 439
 440 \begin{figure}[h]
 441 %begin{minipage}[t]{0.49\linewidth}
 442 \centering
 443 \includegraphics[width=0.7\linewidth]{./Chapters/chapter17/figs/collem_perfs_nothinkpad.pdf}
 444 \caption{Performances CPU et GPU du modèle Collemboles}
 445 \label{ch17:fig:mior_perfs_collem}
 446 %end{minipage}
 447 \end{figure}
 448
 449 Both figures shows that using the GPU to parallelize part of the
 450 simulator results in tangible performance gains over a CPU execution
 451 on modern hardware. These gains are more mixed on older GPU platforms
 452 due to the limitations when dealing with irregular memory or execution
 453 patterns often encountered in MAS systems. This can be closely linked
 454 to the availability of caching facilities on the GPU hardware and its
 455 dramatical effects depending on the locality and frequency of data
 456 accesses. In this case, even if the Tesla architectures provides more
 457 execution cores and is the far costlier solution, more recent, cheaper,
 458 solutions such as high-end GPU provide better performance when the
 459 execution is not constrained by memory size.
 460
 461 \section{Second example}
 462 \label{ch17:sec:2ndmodel}
 463
 464 The second model, the MIOR model, simulates the behavior of microbian
 465 colonies. Its execution model is more complex so that it requires
 466 changes in the initial algorithm and the use of synchronization to
 467 benefit from the GPU architecture.
 468
 469 \subsection{The MIOR model}
 470 \label{ch17:subsec:miormodel}
 471
 472 The MIOR~\cite{C.Cambier2007} model was developed to simulate local
 473 interactions in a soil between microbial colonies and organic
 474 matters. It reproduces each small cubic units ($0.002 m^3$) of soil as a
 475 MAS.
 476
 477 Multiple implementations of the MIOR model have already been
 478 realized, in Smalltalk and Netlogo, in 2 or 3 dimensions. The last
 479 implementation, used in our work and referenced as MIOR in the
 480 rest of the chapter, is freely accessible online as
 481 WebSimMior~\footnote{http://www.IRD.fr/websimmior/}.
 482
 483 MIOR is based around two types of agents: (i) the Meta-Mior (MM),
 484 which represents microbial colonies consuming carbon and (ii) the
 485 Organic Matter (OM) which represents carbon deposits occurring in soil.
 486
 487 The Meta-Mior agents are characterized by two distinct behaviors:
 488 \begin{itemize}
 489 \item \emph{breath}: the action converts mineral carbon from the soil
 490   to carbon dioxide $CO_{2}$ that is released in the soil.
 491 \item \emph{growth}: by this action each microbial colony fixes the
 492   carbon present in the environment to reproduce itself (augments its
 493   size). This action is only possible if the colony breathing needs
 494   where covered, i.e. enough mineral carbon is available.
 495 \end{itemize}
 496
 497 These behaviors are described in Algorithm~\ref{ch17:seqalgo}.
 498
 499 \begin{algorithm}[h]
 500 \caption{Evolution step of each Meta-Mior (microbial colony) agent}
 501 \label{ch17:seqalgo}
 502 \KwIn{A static array $mmList$ of MM agents}
 503 \myinput{A static array $omList$ of OM agents}
 504 \myinput{A MIOR environment $world$}
 505 $breathNeed \gets world.respirationRate \times mm.carbon$\;
 506 $growthNeed \gets world.growthRate \times mm.carbon$\;
 507 $availableCarbon \gets totalAccessibleCarbon(mm)$\;
 508 \uIf{$availableCarbon > breathNeed$}{
 509   \tcc{ Breath }
 510   $mm.active \gets true$\;
 511   $availableCarbon \gets availableCarbon - consumCarbon(mm, breathNeed)$\;
 512   $world.CO2 \gets world.CO2 + breathNeed$\;
 513   \If{$availableCarbon > 0$}{
 514     \tcc{ Growth }
 515     $growthConsum \gets max(totalAccessCarbon(mm), growthNeed)$\;
 516     $consumCarbon(mm, growthConsum)$\;
 517     $mm.carbon \gets mm.carbon + growthConsum$\;
 518   }
 519 }
 520 \Else{
 521   $mm.active \gets false$
 522 }
 523 \end{algorithm}
 524
 525 Since this simulation takes place at a microscopic scale, a large
 526 number of these simulations must be executed for each macroscopic
 527 simulation step to model realistic-sized unit of soil. This lead to
 528 large computing needs despite the small computation cost of each
 529 individual simulation.
 530
 531
 532 \subsection{MIOR Implementation}
 533
 534 As pointed out previously, the MIOR implementation implied more
 535 changes for the initial code to be run on GPU.  As a first attempt, we
 536 realized a simple GPU implementation of the MIOR simulator, with only
 537 minimal changes to the CPU algorithm. Execution times showed the
 538 inefficiency of this approach, and highlighted the necessity of
 539 adapting the simulator to take advantage of the GPU execution
 540 capabilities~\cite{lmlm+12:ip}. In this part, we show the main changes
 541 which were realized to adapt the MIOR simulator on GPU architectures.
 542
 543 \subsubsection{Execution mapping on GPU}
 544
 545 \begin{figure}
 546 \centering
 547 \includegraphics[width=0.7\textwidth]{Chapters/chapter17/figs/repartition.pdf}
 548 \caption{Execution distribution retained on GPU}
 549 \label{ch17:fig:gpu_distribution}
 550 \end{figure}
 551
 552 Each MIOR simulation is represented by a work-group, and each agent by
 553 a work-item. A kernel is in charge of the life cycle process for each
 554 agent of the model. This kernel is executed by all the work-items of the simulation
 555 each on its own gpu core.
 556
 557 The usage of one work-group for each simulation allows to easily
 558 execute multiple simulations in parallel, as shown on
 559 figure~\ref{ch17:fig:gpu_distribution}.  By taking advantage of the
 560 execution overlap possibilities provided by OpenCL it then becomes
 561 possible to exploit all the cores at the same time, even if an unique
 562 simulation is to small to use all the available GPU cores. However,
 563 the maximum size of a work-group is limited (to $512$), which only allows
 564 us to execute one simulation per work-group when using $310$ threads
 565 (number of OM in the reference model) to execute the simulation.
 566
 567 The usage of the GPU to execute multiple simulations is initiated by
 568 the CPU. The later keeps total control of the simulator execution
 569 flow. Thus, optimized scheduling policies (such as submitting kernels
 570 in batch, or limiting the number of kernels, or asynchronously
 571 retrieving the simulation results) can be defined to minimize the cost
 572 related to data transfers between CPU and GPUs.
 573
 574 \subsubsection{Data structures translation}
 575 \label{ch17:subsec:datastructures}
 576
 577 The adaptation of the MIOR model to GPU requires the mapping of the
 578 data model to OpenCL data structures. The environment and the agents
 579 are represented by arrays of structures, where each structure describes the
 580 state of one entity. The behavior of these entities are implemented
 581 as OpenCL functions to be called from the kernels during execution.
 582
 583 Since the main program is written in Java, JOCL is responsible for the
 584 allocation and the mapping of the object data structures to OpenCL
 585 ones before execution.
 586
 587 Four main data structures
 588 (Figure~\ref{ch17:listing:mior_data_structures}) are defined: (i) an
 589 array of MM agents, representing the state of the microbial
 590 colonies. (ii) an array of OM agents, representing the state of the
 591 carbon deposits. (iii) a topology matrix, which stores accessibility
 592 information between the two types of agents of the model (iv) a world
 593 structure, which contains all the global input data (metabolism rate,
 594 numbers of agents) and output data (quantity of $CO_{2}$ produced) of
 595 the simulation. These data structures are initialized by the CPU and
 596 then copied on the GPU.
 597
 598 \lstinputlisting[caption=Main data structures used in a MIOR simulation,label=ch17:listing:mior_data_structures]{Chapters/chapter17/code/data_structures.cl}
 599
 600 The world topology is stored as a two-dimension matrix which contains
 601 OM indexes on the abscissa and the MM index on the ordinate. Each
 602 agent walks through its line/column of the matrix at each iteration to
 603 determinate which agents can be accessed during the simulation.  Since
 604 many agents are not connected this matrix is sparse, which introduces
 605 a big number of useless memory accesses. To reduce the impact of these
 606 memory accesses we use a compacted, optimized representation of this
 607 matrix based on~\cite{Gomez-Luna:2009:PVS:1616772.1616869}, as
 608 illustrated on the figure~\ref{ch17:fig:csr_representation}. This compact
 609 representation considers each line of the matrix as an index list, and
 610 only stores accessible agents continuously, to reduce the size of the
 611 list and the number of non productive accesses.
 612
 613 \begin{figure}[h]
 614 \centering
 615 \includegraphics[width=0.8\textwidth]{Chapters/chapter17/figs/grid.pdf}
 616 %\psfig{file=figs/grid, height=1in}
 617 \caption{Compact representation of the topology of a MIOR simulation}
 618 \label{ch17:fig:csr_representation}
 619 \end{figure}
 620
 621 Since dynamic memory allocation is not possible yet in OpenCL and only
 622 provided in the lastest revisions of the CUDA standard, these matrices
 623 are statically allocated. The allocation is based on the worst-case
 624 scenario where all OM and MM are linked as the real occupation of the
 625 matrix cells cannot be deduced without some kind of preprocessing
 626 computations.
 627
 628 \subsubsection{Critical resources access management}
 629 \label{ch17:subsec:concurrency}
 630
 631 One of the main issue in a MIOR model is to ensure that all the
 632 microbial colonies will have an equitable access to carbon resources,
 633 when multiple colonies share the same deposits. Access
 634 synchronizations are mandatory in these cases to prevent conflicting
 635 updates on the same data that may lead to calculus error (e.g. loss of
 636 matter).
 637
 638 On a massively parallel architecture such as GPUs, this kind of
 639 synchronization conflicts can however lead to an inefficient
 640 implementation by enforcing a quasi-sequential execution. It is
 641 necessary, in the case of MIOR as well as for other ABM, to ensure
 642 that each work-item is not too constrained in its execution.
 643
 644 \lstinputlisting[caption=Main MIOR kernel,label=ch17:listing:mior_kernels]{Chapters/chapter17/code/mior_kernels.cl}
 645
 646 From the sequential algorithm 1 where all the agents share the same
 647 data, we have developed a parallel algorithm composed of three
 648 sequential stages separated by synchronization barriers. This new
 649 algorithm is based on the distribution of the available OM carbon
 650 deposits into parts at the beginning of each execution step. The three
 651 stages, illustrated on Listing~\ref{ch17:listing:mior_kernels}, are the following:
 652
 653 \begin{enumerate}
 654 \item \emph{distribution}: the available carbon in each carbon deposit
 655   (OM) is equitably dispatched between all accessible MM in the form
 656   of parts.
 657 \item \emph{metabolism}: each MM consumes carbon in its allocated parts
 658   for its breathing and growing processes.
 659 \item \emph{gathering}: unconsumed carbon in parts is gathered back
 660   into the carbon deposits.
 661 \end{enumerate}
 662
 663 This solution suppresses the data synchronization needed by the first
 664 algorithm and thus the need for synchronization barriers, and requires only
 665 one kernel launch from Java as described on Listing~\ref{ch17:fig:mior_launcher}.
 666
 667 \lstinputlisting[caption=MIOR simulation launcher,label=ch17:fig:mior_launcher]{Chapters/chapter17/code/mior_launcher.java}
 668
 669 \subsubsection{Termination detection}
 670
 671 The termination of a MIOR simulation is reached when the model
 672 stabilizes and no more $CO_{2}$ is produced. This termination
 673 detection can be done either on the CPU or the GPU but it requires a
 674 global view on the system execution.
 675
 676 In the first case, when the CPU controls the GPU simulation process,
 677 the dectection is done in two steps: (i) the CPU starts the execution
 678 of a simulation step on the GPU, (ii) the CPU retrieves the GPU data
 679 and determines if another iteration must be launched or if the
 680 simulation has terminated. This approach allows a fine-grain control
 681 over the GPU execution, but it requires many costly data transfers as
 682 each iteration results must be sent from the GPU to the CPU. In the
 683 case of the MIOR model these costs are mainly due to the inherent
 684 PCI-express port latencies rather than to bandwidth limitation since
 685 data sizes remains rather small, in the order of few dozens of
 686 Megabytes.
 687
 688 In the second case the termination detection is directly implemented
 689 on the GPU by checking the amount of available carbon between two
 690 iterations. The CPU does not have any feedback while the simulation is
 691 running, but retrieves the results once the kernel execution is
 692 finished. This approach minimizes the number of transfers between the
 693 CPU and the GPU.
 694
 695 \subsection{Performance of MIOR implementations}
 696 \label{ch17:subsec:miorexperiments}
 697
 698 In this part we present several MIOR GPU implementations using the
 699 distribution/gathering process described in the previous section and
 700 compare their performance on two distinct hardware platform; i.e two
 701 different GPU devices. Five incremental MIOR implementations were
 702 realized with an increasing level of adaptation for the algorithm: in
 703 all cases, we choose the average time over 50 executions as a
 704 performance indicator.
 705
 706 \begin{itemize}
 707 \item the \textbf{GPU 1.0} implementation is a direct implementation
 708   of the existing algorithm and its data structures where data
 709   dependencies were removed and that use the non-compact topology
 710   representation described in Section~\ref{ch17:subsec:datastructures}
 711 \item the \textbf{GPU 2.0} implementation uses the previously
 712   described compact representation of the topology and remains
 713   otherwise identical to the GPU 1.0 implementation.
 714 \item the \textbf{GPU 3.0} implementation introduces the manual
 715   copy into local (private) memory of often-accessed global data, such
 716   as carbon parts or topology information.
 717 \item the \textbf{GPU 4.0} implementation is a variant of the
 718   GPU 1.0 implementation which allows the execution of
 719   multiple simulations for each kernel execution.
 720 \item the \textbf{GPU 5.0} implementation is a multi-simulation of the
 721   GPU 2.0 implementation, using the execution of multiple simulations
 722   for each kernel execution as for GPU 4.0
 723 \end{itemize}
 724
 725 The two last implementations -- \textbf{GPU 4.0} and \textbf{GPU 5.0}
 726 -- illustrate the gain provided by a better usage of the hardware
 727 resources, thanks to the driver execution overlaping capabilities. A
 728 sequential version of the MIOR algorithm, labeled as \textbf{CPU}, is
 729 provided for comparison purpose. This sequential version is developped
 730 in Java, the same language used for GPU implementations and the Sworm
 731 model.
 732
 733 For these performance evaluations, two platforms are used. The first
 734 one is representative of the kind of hardware which is available on
 735 HPC clusters. It is a cluster node dedicated to GPU computations with
 736 two Intel X5550 processors running at $2.67GHz$ and two Tesla C1060
 737 GPU devices running at $1.3$GHz and composed of $240$ cores ($30$
 738 multi-processors). Only one of these GPU is used in these experiments,
 739 at the moment. The second platform illustrates what can be expected
 740 from a personal desktop computer built a few years ago. It uses a
 741 Intel Q9300 CPU, running at $2.5$GHz, and a Geforce 8800GT GPU card
 742 running at $1.5$GHz and composed of $112$ cores ($14$
 743 multi-processors). The purpose of these two platforms is to assess the
 744 benefit that could be obtained either when a scientist has access to
 745 specialized hardware as a cluster or tries to take benefit from its
 746 own personal computer.
 747
 748 \begin{figure}[h]
 749 %begin{minipage}[t]{0.49\linewidth}
 750 \centering
 751 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/mior_perfs_tesla.pdf}
 752 %\caption{Performance CPU and GPU sur carte Tesla C1060}
 753 \caption{CPU and GPU performance on a Tesla C1060 node}
 754 \label{ch17:fig:mior_perfs_tesla}
 755 %end{minipage}
 756 \end{figure}
 757
 758 Figures~\ref{ch17:fig:mior_perfs_tesla}~and~\ref{ch17:fig:mior_perfs_8800gt}
 759 show the execution time for $50$ simulations on the two hardware
 760 platforms. A size factor is applied to the problem: at scale 1,
 761 the model contains $38$ MM and $310$ OM, while at the scale 6 these
 762 numbers are multiplied by six. The size of the environment is modified
 763 as well to maintain the same average agent density in the model. This
 764 scaling factor display the impact of the chosen size of simulation on
 765 performance.
 766
 767 %hspace{0.02\linewidth}
 768 %begin{minipage}[t]{0.49\linewidth}
 769 \begin{figure}[h]
 770 \centering
 771 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/mior_perfs_8800gt.pdf}
 772 \caption{CPU and GPU performance on a personal computer with a Geforce 8800GT}
 773 \label{ch17:fig:mior_perfs_8800gt}
 774 %end{minipage}
 775 \end{figure}
 776
 777 The charts show that for small problems the execution times of all
 778 the implementations are very close. This is because the GPU execution
 779 does not have enough threads (representing agents) for an optimal
 780 usage of GPU resources. This trend changes after scale $5$ where GPU
 781 2.0 and GPU 3.0 take the advantage on the GPU 1.0 and CPU
 782 implementations. This advantage continues to grow with the scaling
 783 factor, and reaches a speedup of $10$ at the scale $10$ between the
 784 fastest single-simulation GPU implementation and the first, naive one
 785 GPU 1.0.
 786
 787 Multiple trends can be observed in these results. First, optimizations
 788 for the GPU hardware show a big, positive impact on performance,
 789 illustrating the strong requirements on the algorithm properties to
 790 reach execution efficiency. These charts also show that despite the
 791 vast difference in numbers of cores between the two GPUs platforms the
 792 same trends can be observed in both case. We can therefore expect
 793 similar results on other GPU cards, without the need for more
 794 adaptations.
 795
 796 \begin{figure}[h]
 797 \centering
 798 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/monokernel.pdf}
 799 \caption{Execution time of one multi-simulation kernel on the Tesla
 800   platform}
 801 \label{ch17:fig:monokernel_graph}
 802 \end{figure}
 803
 804 \begin{figure}[h]
 805 \centering
 806 \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/multikernel.pdf}
 807 \caption{Total execution time for 1000 simulations on the Tesla
 808   platform, while varying the number of simulations for each kernel}
 809 \label{ch17:fig:multikernel_graph}
 810 \end{figure}
 811
 812 There are two ways to measure simulations performance: (i) by
 813 executing only one kernel, and varying its size (the number of
 814 simulations executed in parallel), as shown on
 815 Figure~\ref{ch17:fig:monokernel_graph}, to test the costs linked to the
 816 parallelization process or (ii) by executing a fixed number of
 817 simulations and varying the size
 818 of each kernel, as shown on Figure~\ref{ch17:fig:multikernel_graph}.
 819
 820 Figure~\ref{ch17:fig:monokernel_graph} illustrates the execution time for
 821 only one kernel. It shows that for small numbers of simulations run in
 822 parallel, the compact implementation of the model topology is faster
 823 than the two-dimension matrix representation. This trends reverse with
 824 more than $50$ simulations in parallel, which can be explained either by
 825 the non linear progression of the synchronization costs or by the
 826 additional memory required for the access-efficient representation.
 827
 828 Figure~\ref{ch17:fig:multikernel_graph} illustrates the execution time of a
 829 fixed number of simulations. It shows that for a small number of
 830 simulations run in parallel, the costs resulting of program setup,
 831 data copies and launch on GPU are determinant and very detrimental to
 832 performance. Once the number of simulations executed for each kernel
 833 grows, these costs are counterbalanced by computation costs. This
 834 trend is more marked in the case of the sparse implementation (GPU
 835 4.0) than the compact one but appears on both curves. With more than
 836 $30$ simulations for each kernel, execution times stalls, since
 837 hardware limits are reached. This indicates that the cost of preparing
 838 and launching kernels become negligible compared to the computing
 839 time once a good GPU occupation rate is achieved.
 840
 841 \section{Analysis and recommandations}
 842 \label{ch17:analysis}
 843
 844 In this section we synthesize the observations done on the two models
 845 and identify some recommendations on implementing complex systems on
 846 GPU platforms.
 847
 848 \subsection{Analysis}
 849
 850 In both the Collembola and the MIOR model, a critical problematic is
 851 the determination of the part of the simulation which are to be run on
 852 GPU and which are to remain on the CPU. The determination of these
 853 parts is a classic step of any algorithm parallelization and must
 854 take into account considerations such as the cost of the different
 855 parts of the algorithm and the expected gains.
 856
 857 In the case of the Collembola model two steps of the algorithm were
 858 ported to GPU. Both steps use straighforward, easily-parallelizable,
 859 operations where a direct gain can be expected by using more
 860 executions cores without important modifications of the algorithm.
 861
 862 In the MIOR model case however no such inherently parallelizable parts
 863 are evident in the original sequential algorithm. This is mainly
 864 explained by the rate of interactions between agents in this model in
 865 the form of many operations (breathing, growth) on shared carbon
 866 ressources. In this case the algorithm had to be more profundly
 867 modified while keeping in head the need to remain true to the original
 868 model, to synchronize the main execution step of all agents in the
 869 model, to ensure enquity, and to minimize the numbers of
 870 synchronizations. The minimization is done by factoring the repartition of
 871 carbon in the model in two separated step at the beginning and the end
 872 of each iterations rather than at multiple point of the execution.
 873
 874 \subsection{MAS execution workflow}
 875
 876 Many MAS simulations decompose their execution process into discrete
 877 evolution steps where each step represents a quantum of time (minimal
 878 unit of time described). At the end of each step many global
 879 information, graphical displays or results are updated. This
 880 execution model may not correctly fit on GPU platforms as they assume
 881 more or less a batch-like workflow model. The execution model must be
 882 split into the following ever repeating steps:
 883
 884 \begin{itemize}
 885 \item Allocation of GPU data buffers
 886 \item Copy of data from CPU to GPU
 887 \item GPU kernels execution
 888 \item Copy of results from GPU to CPU
 889 \end{itemize}
 890
 891 This workflow works well if the considered data transfer time is
 892 negligible compared to GPU execution or can be done in parallel,
 893 thanks to the asynchronous nature of OpenCL. If we are to update the
 894 MAS model after each iteration then performance risk being
 895 degratated. This is illustrated in the MIOR model by the fact that the
 896 speedup observerd on GPU is much more significant for bigger
 897 simulations, which implies longer GPU execution times. Our solution to
 898 this problem is to desynchronize the execution of the SMA model and
 899 its GPU parts by requesting the execution of multiple steps of the
 900 GPU simulations for each launch.
 901
 902 Another related prerequisite of GPU execution is the ability to have
 903 many threads to be executed, to allow an efficient exploitation of the
 904 superior number of cores provided by the architecture. In the case of
 905 MAS models, this means that executing one agent at a time on GPU is
 906 meaningless in regard to GPU usage, copying cost, and actual gain in
 907 execution time, if the process is not already parallel and costly at
 908 this scale. In the MIOR and the Collembola models, this is solved by
 909 executing the computations for all agents of the model at the same
 910 time. If the model has hovewer only chronic needs for intensive
 911 computations then some kind of batching mechanism is required to store
 912 waiting treatements in a queue, until the total sum of waiting
 913 computations justify the transfers cost to the GPU platform.
 914
 915 \subsection{Implementation challenges}
 916
 917 Besides the execution strategy challenges described above, some
 918 implementation challenges also occur when implementing an OpenCL
 919 version of a MAS model, mainly related to the underlying limitations
 920 of the execution platform.
 921
 922 The first one is the impossibility (except in lastest CUDA versions)
 923 to dynamicaly allocate memory during execution. This is a problem in
 924 the case of models where the number of agent can vary during the
 925 simulation, such as prey-predator models. In this case, the only
 926 solution is to overestimate the size of arrays or data structures to
 927 accomodate these additionnal individuals, or using the CPU to resize
 928 data structures when these situations occur. Both approachs require
 929 to trend either memory or performance and are not always pratical.
 930
 931 Another limitation is the impossibility to store pointers in data
 932 structures, we restraint any OpenCL to use only one dimension static
 933 arrays or specific data structures such as textures. This preclude the
 934 usage of structures like linked-list, graphs or sparse matrix not
 935 represented by some combination of static arrays, and can be another
 936 source of memory or performance losses.
 937
 938 In the case of MIOR, this problem is especially exarcebed in the case
 939 of neighboring storage: both representations consum much more memory
 940 that actually required, since the worst case (all agents have
 941 access to all others agents) has to be taken into account defensivly
 942 when dimensioning the data structure.
 943
 944 The existence of different generations of GPU hardwares is also a
 945 challenge. Older implementations, such as the four year old Tesla
 946 C1060 cards, have very strong expectation in term of memory accesses
 947 and requires very regular access patterns to perform efficiently. MAS
 948 having many random accesses, such as MIOR, or many small global memory
 949 accesses, such as Collembola, are penalized on these older
 950 cards. Fortunately, these requirements are less present is modern
 951 cards, which offer caches and other facilities traditionnaly present
 952 on CPU to offset these kind of penalities.
 953
 954 The final concern is related to the previous ones and often result in
 955 more memory consumption. The amount of memory available on GPU cards
 956 is much more limited and adding new memory capabilities is more costly
 957 compared to expending a CPU RAM. On computing clusters, hardwares
 958 nodes with 128GB of memory or more have become affordable, whereas
 959 newer Tesla architecture remains limited to 16GB of memory. This can
 960 be a concern in the case of big MAS models, or small ones which can
 961 only use memory-inefficient OpenCL structures.
 962
 963 \subsection{MCSMA}
 964 \label{ch17:Mcsma}
 965
 966 As shown in the previous sections, many data representations choices
 967 are common to entire classes of MAS. The paradigm of grid, for example,
 968 is often encountered in models where each cell constitutes either the
 969 elementary unit of simulation (SugarScape~\cite{Dsouza2007}) or a
 970 discretisation of the environment space
 971 (Pathfinding~\cite{Guy09clearpath}). These grids can be considered as
 972 two or three-dimensional matrices, whose processing can be directly
 973 distributed.
 974
 975 Another common data representation encountered in MAS system is the
 976 usage of 2d or 3d coordinates to store the position of each agent of
 977 the model. In this case, even if the environment is no longer
 978 discrete, the location information still imply computations
 979 (euclidian distances) which can be parallelized.
 980
 981 MCSMA~\cite{lmlm+13:ip} is a framework developped to provide to the MAS designer
 982 those basic data structures, and the associated operations, to
 983 facilitate the portage of existing MAS on GPU. Two levels of
 984 utilization are provided to the developper, depending on its usage
 985 profile.
 986
 987 \begin{itemize}
 988 \item A high-level library, constitued of modules regrouping classes
 989   of operations. A module provides multiple methods of
 990   distance computations in 1d, 2d or 3d grids, another one the
 991   diffusion algorithm...
 992 \item A low level API which allows the developped direct access to the
 993   GPU and the inner working of MCSMA, to develop its own module in
 994   the case where the required operations are not yet provided by the platform.
 995 \end{itemize}
 996
 997 Both usage levels were illustrated in the above two pratical cases. In
 998 MIOR, the whole algorithm (baring initialization) is ported on GPU as
 999 a specific plugin which allows executing $n$ MIOR simulations and
1000 retrieve their results. This first approach requires extensive
1001 adaptations to the original algorithm.  In Collembola, on the
1002 contrary, the main steps of the algorithm remain executed on the CPU,
1003 and only specific operations are delegated to generic, already
1004 existing, diffusion and reduction kernels. The fundamental algorithm
1005 is not modified and GPU execution is only applied to specific parts of
1006 the execution which may benefit from it. These two programming
1007 approaches allow incremental adaptations of existing Java MAS to
1008 accelerate their execution on GPU, while retaining the option to
1009 develop their own reusable or more efficeint module to supplement the
1010 already existing ones.
1011
1012 \section{Conclusion}
1013 \label{ch17:conclusion}
1014
1015 This chapter addresses the issue of complex system simulation by using
1016 agent based paradigm and GPU hardware. From the experiments on two
1017 existing Agent Based Models of soil science we intend to provide
1018 useful information on the architecture, the algorithm design and, the
1019 data management to run Agent based simulations on GPU, and more
1020 generally to run compute intensive applications that are not based on
1021 a not-so-regular model. The first result of this work is that adapting
1022 the algorithm to a GPU architecture is possible and suitable to speed
1023 up agent based simulations as illustrated by the MIOR model. Coupling
1024 CPU with GPU seems even to be an interesting way to better take
1025 advantage of the power given by computers and clusters as illustrated
1026 by the Collembola model. From our point of view the adaptation process
1027 is lighter than a traditional parallelization on distributed nodes and
1028 not much difficult than a standard multi-threaded parallelization,
1029 since all the data remains on the same host and can be shared in
1030 central memory. The OCL adaptation also enables a portable simulator
1031 that can be run on different graphical units. Even using a mainstream
1032 card as the GPU card of a standard computer can lead to significant
1033 performance improvements. This is an interesting result as it opens
1034 the field of inexpensive HPC to a large community.
1035
1036 In this perspective, we are working on MCSMA a development platform
1037 that would facilitate the use of GPU or many-core architectures for
1038 multi-agent simulations. Our first work has been the definition of
1039 common, efficient, reusable data structures, such as grids or
1040 lists. Another goal is to provide easier means to control the
1041 distribution of specific processes between CPU or GPU, to allow the
1042 easy exploitation of the strengths of each platform in the same
1043 multi-agent simulation. We think that the same approach, i.e
1044 developping specific environments that facilitate the developper
1045 access to the GPU power, can be applied in many domains with
1046 compute intensive needs to open the GPU use to a larger community.
1047
1048 \putbib[Chapters/chapter17/biblio]
1049