X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/b59596b1c8419b7e13191270be64b855dd28d01e..2ae394d22362fb3f71c54c11871811afcac3eed0:/BookGPU/Chapters/chapter17/ch17.tex diff --git a/BookGPU/Chapters/chapter17/ch17.tex b/BookGPU/Chapters/chapter17/ch17.tex index 6ad3a8c..f69219e 100755 --- a/BookGPU/Chapters/chapter17/ch17.tex +++ b/BookGPU/Chapters/chapter17/ch17.tex @@ -1,9 +1,11 @@ -\chapterauthor{Guillaume Laville}{Femto-ST Institute, University of Franche-Comte, France} -\chapterauthor{Christophe Lang}{Femto-ST Institute, University of Franche-Comte, France} -\chapterauthor{Kamel Mazouzi}{Femto-ST Institute, University of Franche-Comte, France} -\chapterauthor{Nicolas Marilleau}{Femto-ST Institute, University of Franche-Comte, France} -\chapterauthor{Bénédicte Herrmann}{Femto-ST Institute, University of Franche-Comte, France} -\chapterauthor{Laurent Philippe}{Femto-ST Institute, University of Franche-Comte, France} +\chapterauthor{Guillaume Laville, Christophe Lang, Bénédicte Herrmann, + and Laurent Philippe}{Femto-ST Institute, University of + Franche-Comte, France} +%\chapterauthor{Christophe Lang}{Femto-ST Institute, University of Franche-Comte, France} +\chapterauthor{Kamel Mazouzi}{Franche-Comte Computing Center, University of Franche-Comte, France} +\chapterauthor{Nicolas Marilleau}{UMMISCO, Institut de Recherche pour le Developpement (IRD), France} +%\chapterauthor{Bénédicte Herrmann}{Femto-ST Institute, University of Franche-Comte, France} +%\chapterauthor{Laurent Philippe}{Femto-ST Institute, University of Franche-Comte, France} \newlength\mylen \newcommand\myinput[1]{% @@ -11,172 +13,173 @@ \setlength\hangindent{\mylen}% \hspace*{\mylen}#1\\} -\chapter{Implementing MAS on GPU} +\chapter{Implementing Multi-Agent Systems on GPU} \label{chapter17} \section{Introduction} \label{ch17:intro} -In this chapter we introduce the use of Graphical Processing Units for -multi-agents based systems as an example of a not-so-regular +In this chapter we introduce the use of Graphical Processing Units +(GPU) for multi-agents-based systems as an example of a not-so-regular application that could benefit from the GPU computing -power. Multi-Agent Systems or MAS are a simulation paradigm used to +power. Multi-Agent Systems (MAS) are a simulation paradigm used to study the behavior of dynamic systems. Dynamic systems as physical systems are often modeled by mathematical representations and their -dynamic behavior simulated by differential equations. The simulation -of the system thus often relay on the resolution of a linear system -that can be efficiently computed on a graphical processing unit as -shown in the preceding chapters. But when the behavior of the system -elements is not uniformly driven by the same law, when these elements -have their own behavior, the modeling process is too complex to rely -on formal expressions. In this context MAS is a recognized approach to -model and simulate systems where individuals have an autonomous -behavior that cannot be simulated by the evolution of a set of -variables driven by mathematical laws. MAS are often used to simulate -natural or collective phenomenons whose individuals are too numerous or -various to provide a unified algorithm describing the system -evolution. The agent-based approach is to divide these complex systems -into individual self-contained entities with their smaller set of -attributes and functions. But, as for mathematical simulations, when -the size of the Multi-Agent System (MAS) increases the need of computing -power and memory also increases. For this reason, multi-agent systems -should benefit from the use of distributed computing -architectures. Clusters and Grids are often identified as the main -solution to increase simulation performance but Graphical Processing -Units (GPU) are also a promising technology with an attractive -performance/cost ratio. - -Conceptually a MAS is a distributed system as it favors the definition -and description of large sets of individuals, the agents, that can be -run in parallel. As a large set of agents could have the same behavior -a SIMD model should fit the simulation execution. Most of the -agent-based simulators are however designed with a sequential scheme -in mind and these simulators seldom use more than one core for their -execution. Due to simulation scheduling constraints, data sharing and -exchange between agents and the huge amount of interactions between -agents and their environment, it is indeed rather difficult to -distribute an agent based simulator, for instance, to take advantage of new -multi-threaded computer architectures. Thus, guidelines and tools -dedicated to MAS paradigm and HPC is now a need for other complex +dynamic behavior is simulated by differential equations. The +simulation of the system thus often relies on the resolution of a +linear system that can be efficiently computed on a graphical +processing unit as shown in the preceding chapters. But when the +behavior of the system elements is not uniformly driven by the same +law, when these elements have their own behavior, the modeling process +is too complex to rely on formal expressions. In this context MAS is a +recognized approach to model and simulate systems where individuals +have an autonomous behavior that cannot be simulated by the evolution +of a set of variables driven by mathematical laws. MAS are often used +to simulate natural or collective phenomena whose individuals are too +numerous or various to provide a unified algorithm describing the +system evolution. The agent-based approach is to divide these complex +systems into individual self-contained entities with their smaller set +of attributes and functions. But, as for mathematical simulations, +when the size of the MAS increases, the need of computing power and +memory also increases. For this reason, multi-agent systems should +benefit from the use of distributed computing architectures. Clusters +and grids are often identified as the main solution to increase +simulation performance but GPUs are also a promising technology with +an attractive performance/cost ratio. + +Conceptually a MAS\index{Multi-Agent System} is a distributed system +as it favors the definition and description of large sets of +individuals, the agents, that can be run in parallel. As a large set +of agents could have the same behavior, a Single Instruction Multiple +Data (SIMD) execution architecture should fit the simulation +execution. Most of the agent-based simulators are, however, designed +with a sequential scheme in mind, and these simulators seldom use more +than one core for their execution. Due to simulation scheduling +constraints, data sharing and exchange between agents and the huge +amount of interactions between agents and their environment, it is +indeed rather difficult to distribute an agent based simulator, for +instance, to take advantage of new multithreaded computer +architectures. Thus, guidelines and tools dedicated to MAS paradigm +and High Performance Computing (HPC) are now a need for other complex system communities. Note that, from the described structure (large number of agents sharing data), we can conclude that MAS would more -easily benefit from many-cores architectures than from other kinds of +easily benefit from many-core architectures than from other kinds of parallelism. Another key point that advocates for the use of many-core in MAS is -the growing need for multi-scale simulations. Multi-scale simulations +the growing need for multiscale simulations. Multiscale simulations explore problems with interactions between several scales. The different scales use different granularity of the structure and potentially different models. Most of the time the lower scale simulations provide results to higher scale simulations. In that case the execution of the simulations can easily be distributed between the -local cores and a many-core architecture, i.e. a GPU device. +local cores and a many-core architecture, i.e., a GPU device. We explore in this chapter the use of many-core architectures to execute agent-based simulations. We illustrate our reflexion with two -uses cases: the Colembola simulator designed to simulate the diffusion -of Colembola between plots of land and the MIOR simulator that -reproduces effects of earthworms on bacteria dynamics in a bulked -soil. In Section \ref{ch17:ABM} we present the work related to MAS and -parallelization with a special focus on many-core use. In sections -\ref{ch17:sec:1stmodel} and \ref{ch17:sec:2ndmodel} we present in -details two multi-agent models, their GPU implementations, the -conducted experiments and their performance results. The first model, -given in Section \ref{ch17:sec:1stmodel}, illustrates the use of a GPU -architecture to speed up the execution of some computation intensive -functions while the main model is still executed on the central -processing unit. The second model, given in Section -\ref{ch17:sec:2ndmodel}, illustrates the use of a GPU architecture to -implement the whole model on the GPU processor which implies deeper -changes in the initial algorithm. In Section \ref{ch17:analysis} we -propose a more general reflexion on these implementations and we -provide some guidelines. Then we conclude in Section -\ref{ch17:conclusion} on the possible generalization of our work. - - -\section{Running Agent-Based Simulations} +cases: the Collembola simulator designed to simulate the diffusion of +Collembola between plots of land and the MIOR (MIcro ORganism) +simulator that reproduces effects of earthworms on bacteria dynamics +in a bulked soil. In Section \ref{ch17:ABM} we present the work +related to MAS and parallelization with a special focus on many-core +use. In sections \ref{ch17:sec:1stmodel} and \ref{ch17:sec:2ndmodel} +we present in detail two multi-agent models, their GPU +implementations, the conducted experiments, and their performance +results. The first model, given in Section \ref{ch17:sec:1stmodel}, +illustrates the use of a GPU architecture to speed up the execution of +some computation-intensive functions while the main model is still +executed on the central processing unit. The second model, given in +Section \ref{ch17:sec:2ndmodel}, illustrates the use of a GPU +architecture to implement the whole model on the GPU processor which +implies deeper changes in the initial algorithm. In Section +\ref{ch17:analysis} we propose a more general reflexion on these +implementations and provide some guidelines. Then, we conclude in +Section \ref{ch17:conclusion} on the possible generalization of our +work. + + +\section{Running agent-based simulations} \label{ch17:ABM} -In this section, we present the context of multi-agent systems, MAS, -their parallelization and we report several existing works on using -GPU to simulate multi-agent systems. +In this section, we present the context of MAS, their parallelization, +and we report several existing works on using GPU to simulate +multi-agent systems. \subsection{Multi-agent systems and parallelism} -Agent-Based systems are often used to simulate natural or collective -phenomenons whose actors are too numerous or various to provide a -simple unified algorithm describing the studied system +Agent-based systems are often used to simulate natural or collective +phenomena whose actors are too numerous or various to provide a simple +unified algorithm describing the studied system dynamic~\cite{Schweitzer2003}. The implementation of an agent based -simulation usually starts by designing the underlying Agent Based -Model (ABM). Most ABM are based around a few types of entities such as Agents, -Environment and around an Interaction Organization~\cite{Vowel02}. In the -complex system domain, the environment often describes a real space, -its structure (e.g. soil textures and porosities) and its dynamics -(e.g. organic matter decomposition). It is a virtual world in which -agents represent studied entities (e.g. biotic organisms) -evolution. The actual agent is animated by a behavior that can range -between reactivity (only react to external stimuli) and cognition -(lives its own process alongside other individuals). Interaction -and Organization define functions, types and patterns of -communications of their member agents in the +simulation usually starts by designing the underlying agent-based +model (ABM). Most ABM are based around a few types of entities such as +agents, one environment, or an interaction +organization~\cite{Vowel02}. In the complex system domain, the +environment often describes a real space, its structure (e.g. soil +textures and porosities), and its dynamics (e.g., organic matter +decomposition). It is a virtual world in which agents represent +studied entities (e.g., biotic organisms) evolution. The actual agent +is animated by a behavior that can range between reactivity (only +reacts to external stimuli) and cognition (makes complex decisions +based on environmental and internal factors). Interaction and +organization define functions, types, and patterns of communications +of their member agents in the system~\cite{Odell:2003:RRD:1807559.1807562}. Note that, depending on the MAS, agents can communicate either directly through special primitives or indirectly through the information stored in the environment. -Agent based simulations have been used for more than one decade to -reproduce, understand even predict complex system dynamic. They have proved their interest in various -scientific communities. Nowadays generic agent based frameworks are -promoted such as Repast~\cite{repast_home} or -NetLogo~\cite{netlogo_home} to implement simulators. Many ABM such as -the crown model representing a city-wide -scale~\cite{Strippgen_Nagel_2009} tend however to require a large -number of agents to provide a realistic behavior and reliable global -statistics. Moreover, an achieved model analysis needs to run many -simulations identified into an experiment plan to obtain enough -confidence in a simulation. In this case the available computing power -often limits the simulation size and the result range thus requiring -the use of parallelism to explore bigger configurations. +Agent-based simulations have been used for more than a decade to +reproduce, understand and even predict complex system dynamics. They +have proved their usefulness in various scientific +communities. Nowadays generic agent based frameworks such as +Repast~\cite{repast_home} or NetLogo~\cite{netlogo_home} are promoted +to implement simulators. Many ABMs such as the crown model +representing a city wide scale~\cite{Strippgen_Nagel_2009} tend +however to require a large number of agents to provide a realistic +behavior and reliable global statistics. Moreover, an achieved model +analysis needs to resort to an experiment plan, consisting of multiple +simulation runs, to obtain enough confidence in a simulation. In this +case the available computing power often limits the simulation size, +and the resulting range thus requires the use of parallelism to +explore bigger configurations. For that, three major approaches can be identified: \begin{enumerate} -\item parallelizing experiments plan on a cluster or a grid (one or - few simulations are submitted to each - core)~\cite{Blanchart11,Chuffart2010}. +\item parallelizing experiments execution on a cluster or a grid (one + or a few simulations are submitted to each + core)~\cite{Blanchart11,Chuffart2010}, \item parallelizing the simulator on a cluster (the environment of the MAS is split and run on several distributed - nodes)~\cite{Cosenza2011,MAR06} -\item optimizing simulator by taking advantage of computer resources - (multi-threading, GPU, and so on) \cite{Aaby10} + nodes)~\cite{Cosenza2011,MAR06}, +\item optimizing the simulator by taking advantage of computer + resources (multi-threading, GPU, and so on) \cite{Aaby10}. \end{enumerate} -In the first case, experiments are run independently each others and only simulation parameters change between two runs so -that a simple version of an existing simulator can be used. This -approach does not, however, allow to run larger models. -% It does not imply -%model code changes except for Graphical Unit Interface extracting as -%the experiments does not run interactively but in batch. -In the second -and the third case, model and code modifications are necessary. Only few -frameworks however introduce distribution in agent simulation -(Madkit~\cite{Gutknecht2000}, MASON~\cite{Sean05}, repastHPC~\cite{Collier11}) and parallel -implementations are often based on the explicit use of threads using -shared memory~\cite{Guy09clearpath} or cluster libraries such as +In the first case, experiments are run independently of each other and +only simulation parameters are changed between two runs so that a +simple version of an existing simulator can be used. This approach +does not, however, allow to run larger models. In the second and the +third case, model and code modifications are necessary. Only a few +frameworks, however, introduce distribution in agent simulation +(Madkit~\cite{Gutknecht2000}, MASON~\cite{Sean05}, +repastHPC~\cite{Collier11}), and parallel implementations are often +based on the explicit use of threads using shared +memory~\cite{Guy09clearpath} or cluster libraries such as MPI~\cite{Kiran2010}. Parallelizing a multi-agent simulation is however complex due to space and time constraints. Multi-agent simulations are usually based on a -synchronous execution: -%by the agents that share the same environment. -at each time step, numerous events (space data +synchronous execution: at each time step, numerous events (space data modification, agent motion) and interactions between agents happen. Distributing the simulation on several computers or grid nodes thus -implies to guaranty a distributed synchronous execution and -coherency. This often leads to poor performance or complex synchronization -problems. Multi-cores execution or delegating part of this execution -to others processors as GPUs~\cite{Bleiweiss_2008} are thus usually easier -to implement since all the threads share the data and the local clock. +implies to guarantee a distributed synchronous execution and +coherency. This often leads to poor performance or complex +synchronization problems. Multicore execution or delegating part of +this execution to others processors such as GPUs~\cite{Bleiweiss_2008} +is usually easier to implement since all the threads share the data +and the local clock. % Different agent patterns can be adopted in an ABMs such as % cognitive and reactive ones~\cite{Ferber99}. Cognitive agents act on @@ -210,86 +213,82 @@ to implement since all the threads share the data and the local clock. % scale simulations can also be delegated to the GPU card to limit the % load of the main (upper scale) simulation. -\subsection{MAS Implementation on GPU} +\subsection{MAS implementation on GPU} \label{ch17:subsec:gpu} -The last few years saw the apparition of new generations of graphic -cards based on more general purpose execution units which are -promising for less regular systems as MAS. -% These units can be programmed using GPGPU languages to solve various -% problems such as linear algebra resolutions, usually based on matrix -% manipulations. -The matrix-based data representation and SIMD computations are -however not straightforward in Multi-Agent Systems, where data -structures and algorithms are tightly coupled to the described -simulation. However, works from existing literature show that MAS can -benefit from these performance gains on various simulation types, such -as traffic simulation~\cite{Strippgen_Nagel_2009}, cellular -automatons~\cite{Dsouza2007}, mobile-agent based +The last few years have seen the appearance of new generations of +graphic cards based on more general purpose execution units which are +promising for large systems such as MAS. Using matrix-based data +representations and SIMD computations is however not always +straightforward in MAS, where data structures and algorithms are +tightly coupled to the described simulation. However, works from +existing literature show that MAS can benefit from these performance +gains on various simulation types, such as traffic +simulation~\cite{Strippgen_Nagel_2009}, cellular +automata~\cite{Dsouza2007}, mobile-agent based path-finding~\cite{Silveira:2010:PRG:1948395.1948446} or genetic algorithms~\cite{Maitre2009}. Note that an application-specific adaptation process was required in the case of these MAS: some of the previous examples are driven by mathematical laws (path-finding) or use a natural mapping between a discrete environment (cellular automaton) and GPU cores. Unfortunately, this mapping often requires -algorithmic adaptations in other models but the experience shows that -the more reactive a MAS is the more adapted its implementation is to -GPU. +algorithmic adaptations in other models but experience shows that the +more reactive a MAS is the more adapted its implementation is to GPU. The first step in the adaptation of an ABM to GPU platforms is the -choice of the language. On the one hand, the Java programming language -is often used for the implementation of MAS due to its availability -on numerous platforms or frameworks and its focus on high-level, -object-oriented programming. On the other hand, GPU platforms can -only run specific languages as OpenCL or CUDA. OpenCL (supported on -AMD, Intel and NVIDIA hardware) better suits the portability concerns +choice of language. On the one hand, the Java programming language is +often used for the implementation of MAS due to its availability on +numerous platforms or frameworks and its focus on high-level, +object-oriented programming. On the other hand, GPU platforms can only +run specific languages such as OpenCL or CUDA. OpenCL (supported on +AMD, Intel, and NVIDIA hardware) better suits the portability concerns across a wide range of hardware needed the agent simulators, as -opposed to CUDA which is a NVIDIA-specific library. +opposed to CUDA which is an NVIDIA-specific library. -OpenCL is a C library which provides access to the underlying CPUs or -GPUs threads using an asynchronous interface. Various OpenCL functions allow -the compilation and the execution of programs on these execution -resources, the copy of data buffers between devices, or the collection -of profiling information. +OpenCL is a C library which provides access to the underlying CPU or +GPU threads using an asynchronous interface. Various OpenCL functions +allow the compilation and the execution of programs on these execution +resources, the copying of data buffers between devices, and the +collection of profiling information. This library is based around three main concepts: \begin{itemize} -\item the \emph{kernel} (similar to a CUDA kernel) which represents a runnable program - containing instructions to be executed on the GPU. +\item the \emph{kernel} (similar to a CUDA kernel), which represents a runnable program + containing instructions to be executed on the GPU; \item the \emph{work-item} (equivalent to a CUDA thread), which is analogous to the concept of thread on GPU, in that it represents one running instance of a - GPU kernel. + GPU kernel; and \item the \emph{work-group} (or execution block) which is a set of work-items sharing some memory to speed up data accesses and computations. Synchronization operations such as barrier can only be used across the same work-group. \end{itemize} -Running an OpenCL computation consists in launching numerous work-items -that execute the same kernel. The work-items are submitted to a -submission queue to optimize the available cores usage. A calculus is -achieved once all these kernel instances have terminated. +Running an OpenCL computation consists of launching numerous +work-items that execute the same kernel. The work-items are submitted +to a submission queue to optimize the available cores usage. A +calculus is achieved once all these kernel instances have terminated. The number of work-items used in each work-group is an important implementation choice which determines how many tasks will share the same cache memory. Data used by the work-items can be stored as N-dimensions matrices in local or global GPU memory. Since the size -of this memory is often limited to a few hundred of kilobytes, -choosing this number often implies a compromise between the model +of this memory is often limited to a few hundred kilobytes, choosing +this number often implies a compromise between the model synchronization or data requirements and the available resources. In the case of agent-based simulations, each agent can be naturally -mapped to a work-item. Work-groups can then be used to represents groups -of agents or simulations sharing common data (such as the environment) -or algorithms (such as the background evolution process). +mapped to a work-item. Work-groups can then be used to represent +groups of agents or simulations sharing common data (such as the +environment) or algorithms (such as the background evolution process). -In the following examples the JOCL~\cite{jocl_home} binding is used -to access the OpenCL platform from the Java programming language. +In the following examples a binding named JOCL~\cite{jocl_home} is +used to access the OpenCL platform from the Java programming language. In the next sections we present two practical cases that will be -studied in details, from the model to its implementation and -performance. +studied in detail, from the model to its implementation and +performance. \section{A first practical example} \label{ch17:sec:1stmodel} @@ -299,23 +298,26 @@ collembolas in fields and forests. It is based on a diffusion algorithm which illustrates the case of agents with a simple behavior and few synchronization problems. -\subsection{The Collembola model} +\subsection{The Collembola model\index{Collembola model}} \label{ch17:subsec:collembolamodel} -The Collembola model is an example of multi-agent system using GIS (Geographical Information System) - and survey data (population count) - to model the evolution of the biodiversity -across land plots. A first version of this model has been developed with Netlogo framework by Bioemco and UMMISCO researchers. In this model, the biodiversity is modeled by populations of -athropod individuals, the Collembola, which can reproduce and diffuse -to favorable new habitats. The simulator allows to study the diffusion -of collembola, between plots of land depending on their use -(artifical soil, crop, forest\ldots). In this -model the environment is composed of the studied land and the -colembola are agents. Every land plot is divided into several cells, -each cell representing a surface unit (16*16 meters). A number of individuals per collembola species is associated to each cell. The model evolution -is then based on a common diffusion model that diffuse individuals between -cells. Each step of the simulation is based on four stages, as shown -on Figure~\ref{ch17:fig:collem_algorithm}: +The Collembola model is an example of multi-agent system using GIS +(Geographical Information System) and survey data (population count) +to model the evolution of the biodiversity across land plots. A first +version of this model has been developed with the Netlogo framework by +Bioemco and UMMISCO researchers. In this model, the biodiversity is +modeled by populations of athropod individuals, the Collembola, which +can reproduce and diffuse to favorable new habitats. The simulator +allows us to study the diffusion of collembola, between plots of land +depending on their use (artifical soil, crop, forest, etc.) In this +model the environment is composed of the studied land, and collembola +are used as agents. Every land plot is divided into several cells, +each cell representing a surface unit (16x16 meters). A number of +individuals per collembola species is associated to each cell. The +model evolution is then based on a common diffusion model that +diffuses individuals between cells. Each step of the simulation is +based on four stages, as shown on +Figure~\ref{ch17:fig:collem_algorithm}: % \begin{enumerate} % \item arrival of new individuals @@ -327,7 +329,7 @@ on Figure~\ref{ch17:fig:collem_algorithm}: \begin{figure}[h] \centering \includegraphics[width=0.6\textwidth]{Chapters/chapter17/figs/algo_collem.pdf} -\caption{Evolution algorithm of Collembola model} +\caption{Evolution algorithm of Collembola model.} \label{ch17:fig:collem_algorithm} \end{figure} @@ -337,9 +339,9 @@ the model performances. The {\bf reproduction} stage consists in updating the total population of each plot by taking the individuals arrived at the preceding -computation step. This stage implies to process the whole set of cells -of the environment to sum their population. The computed value is -recorded in the plot associated to each cell. This process can be +computation step. This stage involves processing the whole set of +cells of the environment to sum their population. The computed value +is recorded in the plot associated to each cell. This process can be assimilated to a reduction operation on all the population cells associated to one plot to obtain its population. @@ -350,65 +352,68 @@ the population of the neighbor cells. This process can be assimilated to a linear diffusion at each iteration of the population of the cells across their neighbors. -These two processes are quite common in numerical computations so -that the collembola model can be adapted to a GPU execution without -much difficulties. - -\subsection{Collembola Implementation} - -In the Collembola simulator biodiversity is modeled by populations -of Collembola individuals, which can reproduce and diffuse to -favorable new habitats. This is implemented in the algorithm as a fixed -reproduction factor, applied to the size of each population, followed -by a linear diffusion of each cell population to its eight -neighbors. These reproduction and diffusion processes are followed by -two more steps on the GPU implementation. The first one consist of -culling of populations in a inhospitable environment, by checking each -cell value, landuse, and setting its population to zero if -necessary. The final simulation step is the reduction of the cells -populations for each plot, to obtain an updated plot population -for statistic purposes. This separate computation step, done while -updating each cell population in the reference sequential algorithm, -is motivated by synchronization problems and allow the reduction of -the total numbers of access needed to updated those populations. +These two processes are quite common in numerical computations so that +the collembola model can be adapted to a GPU execution without much +difficulty. + +\subsection{Collembola implementation} + +In the collembola simulator biodiversity is modeled by populations of +collembola individuals, which can reproduce and diffuse to favorable +new habitats. This is implemented as a fixed reproduction factor, +applied to the size of each population, followed by a linear diffusion +of each cell population to its eight neighbors. These reproduction and +diffusion processes are followed by two more steps on the GPU +implementation. The first one consist of culling of populations in an +inhospitable environment, by checking each cell value and terrain +type, and setting its population to zero if necessary. The final +simulation step is the reduction of the cell populations for each +plot, to obtain an updated plot population for statistic +purposes. This separate computation step, done while updating each +cell population in the reference sequential algorithm, is motivated by +synchronization problems and allows the reduction of the total number +of memory accesses needed to updated those populations. %\lstinputlisting[language=C,caption=Collembola OpenCL %kernels,label=fig:collem_kernels]{Chapters/chapter17/code/collem_kernels.cl} -\lstinputlisting[caption=Collembola OpenCL Diffusion kernel,label=ch17:listing:collembola-diffuse]{Chapters/chapter17/code/collem_kernel_diffuse.cl} +%\pagebreak +\lstinputlisting[caption=collembola openCL diffusion kernel,label=ch17:listing:collembola-diffuse]{Chapters/chapter17/code/collem_kernel_diffuse.cl} The reproduction, diffusion and culling steps are implemented on GPU -(Figure~\ref{ch17:fig:collem_algorithm}) as a straight mapping of each cell -to an OpenCL work-item (GPU thread). Listing +(Figure~\ref{ch17:fig:collem_algorithm}) as a straight mapping of each +cell to an OpenCL work-item (GPU thread). Listing \ref{ch17:listing:collembola-diffuse} gives the kernel for the diffusion implementation. To prevent data coherency problems, the diffusion step is split into two phases, separated by an execution -{\it barrier}. In the first phase, each cell diffusion overflow is -calculated, and divided by the number of neightbors. Note that, at the -model frontier, populations can also overflow outside the environment -grid but we do not manage those external populations, since there are -no reason to assume our model to be isolated of its surroundings. The -overflow by neighbors value is stored for each cell, before -encountering the barrier. After the barrier is met, each cell read the -overflows stored by all its neighbors at the previous step and applies -them to its own population. In this manner, only one barrier is -required to ensure the consistency of populations value, since no -cells ever modify a value other than its own. - -\lstinputlisting[caption=Collembola OpenCL reduction kernel,label=ch17:listing:collembola-reduc]{Chapters/chapter17/code/collem_kernel_reduc.cl} +{\it barrier}. In the first phase each cell diffusion overflow is +calculated and divided by the number of neighbors. Note that, on the +border of the grid, populations can also overflow outside the +environment grid but we do not manage those external populations, +since there are no reason to assume our model to be isolated of its +surroundings. The overflow by neighbors value is stored for each cell +before encountering the barrier. After the barrier is met, each cell +reads the overflows stored by all its neighbors at the previous step +and applies them to its own population. In this manner, only one +barrier is required to ensure the consistency of population numbers, +since no cell ever modify a value other than its own. Listing \ref{ch17:listing:collembola-reduc} gives the kernel for the reduction implementation. The only step requiring numerous -synchronized accesses is the reduction one: in a first approach, we -chose to use {\it atomic\_add} operation to implement this process, but more -efficient implementations using partial reduction and local GPU memory -could be implemented. +synchronized accesses is the reduction one: in this first approach, we +chose to use {\it atomic\_add} operation to implement this process, +but more efficient implementations using partial reduction and local +GPU memory could be implemented. + +\pagebreak +\lstinputlisting[caption=collembola OpenCL reduction kernel,label=ch17:listing:collembola-reduc]{Chapters/chapter17/code/collem_kernel_reduc.cl} \subsection{Collembola performance} -In this part we present the performance of the Collembola model on +In this part we present the performance of the collembola model on various CPU and GPU execution platforms. Figure~\ref{ch17:fig:mior_perfs_collem} shows that the -number of cores and the processor architecture as a strong influence on the obtained results +number of cores and the processor architecture as a strong influence +on the obtained results % : the % dual-core processor equipping our Thinkpad platform has two to six @@ -432,9 +437,9 @@ number of cores and the processor architecture as a strong influence on the obta implemented in newer generations of GPU devices. These optimizations reduce the penalties associated with irregular and frequent global memory accesses. They are not available on our Tesla nodes. -\item GPU curves exhibit a odd-even pattern in their performance +\item GPU curves exhibit an odd-even pattern in their performance results. Since this phenomenon is visible on two distinct - manufacturer hardware, drivers and OpenCL implementation, it is + manufacturer hardware, driver, and OpenCL implementation, it is likely the result of the decomposition process based on warp of fixed, power-of-two sizes. \item The number of cores is not the only determining factor: an Intel @@ -446,22 +451,22 @@ number of cores and the processor architecture as a strong influence on the obta %begin{minipage}[t]{0.49\linewidth} \centering \includegraphics[width=0.7\linewidth]{./Chapters/chapter17/figs/collem_perfs_nothinkpad.pdf} -\caption{Performance of the Collembola model on CPU and GPU} +\caption{Performance of the Collembola model on CPU and GPU.} \label{ch17:fig:mior_perfs_collem} %end{minipage} \end{figure} -Both figures shows that using the GPU to parallelize part of the +Both graphs show that using the GPU to parallelize part of the simulator results in tangible performance gains over a CPU execution on modern hardware. These gains are more mixed on older GPU platforms due to the limitations when dealing with irregular memory or execution patterns often encountered in MAS systems. This can be closely linked to the availability of caching facilities on the GPU hardware and its -dramatically effects depending on the locality and frequency of data -accesses. In this case, even if the Tesla architectures provides more -execution cores and is the far costlier solution, more recent, cheaper, -solutions such as high-end GPU provide better performance when the -execution is not constrained by memory size. +dramatic effects depend on the locality and frequency of data +accesses. In this case, even if the Tesla architecture offers more +execution cores and is the far costlier solution, more recent, +cheaper, solutions such as high-end GPU provide better performance +when the execution is not constrained by memory size. \section{Second example} \label{ch17:sec:2ndmodel} @@ -475,9 +480,9 @@ benefit from the GPU architecture. \label{ch17:subsec:miormodel} The MIOR~\cite{C.Cambier2007} model was developed to simulate local -interactions in a soil between microbial colonies and organic -matters. It reproduces each small cubic units ($0.002 m^3$) of soil as a -MAS. +interactions in soil between microbial colonies and organic +matters. It reproduces each small cubic unit ($0.002 m^3$) of soil as +a MAS. Multiple implementations of the MIOR model have already been realized, in Smalltalk and Netlogo, in 2 or 3 dimensions. The last @@ -485,24 +490,25 @@ implementation, used in our work and referenced as MIOR in the rest of the chapter, is freely accessible online as WebSimMior~\footnote{http://www.IRD.fr/websimmior/}. -MIOR is based around two types of agents: (i) the Meta-Mior (MM), -which represents microbial colonies consuming carbon and (ii) the -Organic Matter (OM) which represents carbon deposits occurring in soil. +MIOR is based around two types of agents: (1) the Meta-Mior (MM), +which represents microbial colonies consuming carbon and (2) the +Organic Matter (OM) which represents carbon deposits occurring in +soil. The Meta-Mior agents are characterized by two distinct behaviors: \begin{itemize} -\item \emph{breath}: the action converts mineral carbon from the soil - to carbon dioxide $CO_{2}$ that is released in the soil. +\item \emph{breath}: this action converts mineral carbon from the soil + to carbon dioxide ($CO_{2}$) that is released into the soil and \item \emph{growth}: by this action each microbial colony fixes the carbon present in the environment to reproduce itself (augments its size). This action is only possible if the colony breathing needs - where covered, i.e. enough mineral carbon is available. + are covered, i.e., enough mineral carbon is available. \end{itemize} These behaviors are described in Algorithm~\ref{ch17:seqalgo}. \begin{algorithm}[h] -\caption{Evolution step of each Meta-Mior (microbial colony) agent} +\caption{evolution step of each Meta-Mior (microbial colony) agent} \label{ch17:seqalgo} \KwIn{A static array $mmList$ of MM agents} \myinput{A static array $omList$ of OM agents} @@ -529,18 +535,18 @@ $availableCarbon \gets totalAccessibleCarbon(mm)$\; Since this simulation takes place at a microscopic scale, a large number of these simulations must be executed for each macroscopic -simulation step to model realistic-sized unit of soil. This lead to +simulation step to model a realistic-sized unit of soil. This leads to large computing needs despite the small computation cost of each individual simulation. -\subsection{MIOR Implementation} +\subsection{MIOR implementation} As pointed out previously, the MIOR implementation implied more changes for the initial code to be run on GPU. As a first attempt, we -realized a simple GPU implementation of the MIOR simulator, with only +tried a simple GPU implementation of the MIOR simulator, with only minimal changes to the CPU algorithm. Execution times showed the -inefficiency of this approach, and highlighted the necessity of +inefficiency of this approach and highlighted the necessity of adapting the simulator to take advantage of the GPU execution capabilities~\cite{lmlm+12:ip}. In this part, we show the main changes which were realized to adapt the MIOR simulator on GPU architectures. @@ -550,124 +556,132 @@ which were realized to adapt the MIOR simulator on GPU architectures. \begin{figure} \centering \includegraphics[width=0.7\textwidth]{Chapters/chapter17/figs/repartition.pdf} -\caption{Execution distribution retained on GPU} +\caption{Consolidation of multiple simulations in one OpenCL kernel execution.} \label{ch17:fig:gpu_distribution} \end{figure} Each MIOR simulation is represented by a work-group, and each agent by a work-item. A kernel is in charge of the life cycle process for each -agent of the model. This kernel is executed by all the work-items of the simulation -each on its own gpu core. +agent of the model. This kernel is executed by all the work-items of +the simulation each on its own GPU core. -The usage of one work-group for each simulation allows to easily -execute multiple simulations in parallel, as shown on +The usage of one work-group for each simulation allows the easy +execution of multiple simulations in parallel, as shown on figure~\ref{ch17:fig:gpu_distribution}. By taking advantage of the -execution overlap possibilities provided by OpenCL it then becomes +execution overlap possibilities provided by OpenCL, it then becomes possible to exploit all the cores at the same time, even if an unique -simulation is to small to use all the available GPU cores. However, -the maximum size of a work-group is limited (to $512$), which only allows -us to execute one simulation per work-group when using $310$ threads -(number of OM in the reference model) to execute the simulation. +simulation is too small to use all the available GPU cores. However, +the maximum size of a work-group is limited (to $512$), which allows +us to execute only one simulation per work-group when using $310$ +threads (number of OM in the reference model) to execute the +simulation. The usage of the GPU to execute multiple simulations is initiated by -the CPU. The later keeps total control of the simulator execution +the CPU. The CPU keeps total control of the simulator execution flow. Thus, optimized scheduling policies (such as submitting kernels -in batch, or limiting the number of kernels, or asynchronously -retrieving the simulation results) can be defined to minimize the cost -related to data transfers between CPU and GPUs. +in batch, limiting the number of kernels, or asynchronously retrieving +the simulation results) can be defined to minimize the cost related to +data transfers between CPU and GPUs. \subsubsection{Data structures translation} \label{ch17:subsec:datastructures} The adaptation of the MIOR model to GPU requires the mapping of the data model to OpenCL data structures. The environment and the agents -are represented by arrays of structures, where each structure describes the -state of one entity. The behavior of these entities are implemented -as OpenCL functions to be called from the kernels during execution. +are represented by arrays of structures, where each structure +describes the state of one entity. The behaviors of these entities are +implemented as OpenCL functions to be called from the kernels during +execution. Since the main program is written in Java, JOCL is responsible for the -allocation and the mapping of the object data structures to OpenCL -ones before execution. - -Four main data structures -(Figure~\ref{ch17:listing:mior_data_structures}) are defined: (i) an -array of MM agents, representing the state of the microbial -colonies. (ii) an array of OM agents, representing the state of the -carbon deposits. (iii) a topology matrix, which stores accessibility -information between the two types of agents of the model (iv) a world -structure, which contains all the global input data (metabolism rate, -numbers of agents) and output data (quantity of $CO_{2}$ produced) of -the simulation. These data structures are initialized by the CPU and -then copied on the GPU. - -\lstinputlisting[caption=Main data structures used in a MIOR simulation,label=ch17:listing:mior_data_structures]{Chapters/chapter17/code/data_structures.cl} - -The world topology is stored as a two-dimension matrix which contains -OM indexes on the abscissa and the MM index on the ordinate. Each -agent walks through its line/column of the matrix at each iteration to -determinate which agents can be accessed during the simulation. Since -many agents are not connected this matrix is sparse, which introduces -a big number of useless memory accesses. To reduce the impact of these -memory accesses we use a compacted, optimized representation of this -matrix based on~\cite{Gomez-Luna:2009:PVS:1616772.1616869}, as -illustrated on the figure~\ref{ch17:fig:csr_representation}. This compact -representation considers each line of the matrix as an index list, and -only stores accessible agents continuously, to reduce the size of the -list and the number of non productive accesses. +allocation and mapping of the object data structures to OpenCL ones +before execution. + +Four main data structures are defined: (1) an array of MM agents, +representing the state of the microbial colonies. (2) an array of OM +agents, representing the state of the carbon deposits. (3) a topology +matrix, which stores accessibility information between the two types +of agents of the model (4) a world structure, which contains all the +global input data (metabolism rate, numbers of agents) and output data +(quantity of $CO_{2}$ produced) of the simulation. The C-like OpenCL +structures used to represent each type of to agent and the environment +are illustrated in +(Figure~\ref{ch17:listing:mior_data_structures}). These data +structures are initialized by the CPU and then copied on the GPU. + + +\lstinputlisting[caption=main data structures used in a MIOR simulation,label=ch17:listing:mior_data_structures]{Chapters/chapter17/code/data_structures.cl} + +The world topology is stored as a two-dimension matrix which +represents OM indexes on the abscissa and MM indexes on the +ordinate. Each agent walks through its line/column of the matrix at +each iteration to determinate which agents can be accessed during the +simulation. Since many agents are not connected, this matrix is +sparse, which introduces a big number of useless memory accesses. To +reduce the impact of these memory accesses we use a compacted, +optimized representation of this matrix based +on~\cite{Gomez-Luna:2009:PVS:1616772.1616869}, as illustrated in +Figure~\ref{ch17:fig:csr_representation}. This compact representation +considers each line of the matrix as an index list, and only stores +accessible agents compactly, to reduce the number of non-productive +accesses. \begin{figure}[h] \centering \includegraphics[width=0.8\textwidth]{Chapters/chapter17/figs/grid.pdf} %\psfig{file=figs/grid, height=1in} -\caption{Compact representation of the topology of a MIOR simulation} +\caption{Compact representation of the topology of a MIOR simulation.} \label{ch17:fig:csr_representation} \end{figure} -Since dynamic memory allocation is not possible yet in OpenCL and only -provided in the latest revisions of the CUDA standard, these matrices -are statically allocated. The allocation is based on the worst-case -scenario where all OM and MM are linked as the real occupation of the -matrix cells cannot be deduced without some kind of preprocessing -computations. +Since dynamic memory allocation is not possible yet in OpenCL and is +only provided in the latest revisions of the CUDA standard, these +matrices are statically allocated. The allocation is based on the +worst-case scenario where all OM and MM are linked since the real +occupation of the matrix cells cannot be deduced without some kind of +preprocessing computations. \subsubsection{Critical resources access management} \label{ch17:subsec:concurrency} -One of the main issue in a MIOR model is to ensure that all the +One of the main concers in the MIOR model is to ensure that all the microbial colonies will have an equitable access to carbon resources, when multiple colonies share the same deposits. Access synchronizations are mandatory in these cases to prevent conflicting -updates on the same data that may lead to calculus error (e.g. loss of -matter). - -On a massively parallel architecture such as GPUs, this kind of -synchronization conflicts can however lead to an inefficient -implementation by enforcing a quasi-sequential execution. It is -necessary, in the case of MIOR as well as for other ABM, to ensure -that each work-item is not too constrained in its execution. - -\lstinputlisting[caption=Main MIOR kernel,label=ch17:listing:mior_kernels]{Chapters/chapter17/code/mior_kernels.cl} - -From the sequential algorithm 1 where all the agents share the same -data, we have developed a parallel algorithm composed of three -sequential stages separated by synchronization barriers. This new -algorithm is based on the distribution of the available OM carbon -deposits into parts at the beginning of each execution step. The three -stages, illustrated on Listing~\ref{ch17:listing:mior_kernels}, are the following: +updates on the same data that may lead to calculation error (e.g. loss +of matter). + +On massively parallel architectures such as GPUs, these kind of +synchronization conflicts can lead to an inefficient implementation by +enforcing a quasi-sequential execution. It is necessary, in the case +of MIOR as well as for other ABM, to ensure that each work-item is not +too constrained in its execution. + +\pagebreak +\lstinputlisting[caption=main MIOR kernel,label=ch17:listing:mior_kernels]{./Chapters/chapter17/code/mior_kernels.cl} + +From the sequential algorithm (Algorithm~\ref{ch17:seqalgo}) where all +the agents share the same data, we have developed a parallel algorithm +composed of three sequential stages separated by synchronization +barriers. This new algorithm is based on the distribution of the +available OM carbon deposits into parts at the beginning of each +execution step. The three stages, illustrated in +Listing~\ref{ch17:listing:mior_kernels}, are the following: \begin{enumerate} -\item \emph{distribution}: the available carbon in each carbon deposit - (OM) is equitably dispatched between all accessible MM in the form - of parts. -\item \emph{metabolism}: each MM consumes carbon in its allocated parts - for its breathing and growing processes. +\item \emph{scattering}: the available carbon in each carbon deposit + (OM) is equitably dispatched among all accessible MM in the form of + parts, +\item \emph{live}: each MM consumes carbon in its allocated parts for + its breathing and growing processes, and \item \emph{gathering}: unconsumed carbon in parts is gathered back into the carbon deposits. \end{enumerate} This solution suppresses the data synchronization needed by the first -algorithm and thus the need for synchronization barriers, and requires only -one kernel launch from Java as described on Listing~\ref{ch17:fig:mior_launcher}. +algorithm, thus the need for synchronization barriers, and requires +only one kernel launch from Java as described on +Listing~\ref{ch17:fig:mior_launcher}. \lstinputlisting[caption=MIOR simulation launcher,label=ch17:fig:mior_launcher]{Chapters/chapter17/code/mior_launcher.java} @@ -675,19 +689,19 @@ one kernel launch from Java as described on Listing~\ref{ch17:fig:mior_launcher} The termination of a MIOR simulation is reached when the model stabilizes and no more $CO_{2}$ is produced. This termination -detection can be done either on the CPU or the GPU but it requires a +detection can be done on either the CPU or the GPU but it requires a global view on the system execution. In the first case, when the CPU controls the GPU simulation process, -the detection is done in two steps: (i) the CPU starts the execution -of a simulation step on the GPU, (ii) the CPU retrieves the GPU data +the detection is done in two steps: (1) the CPU starts the execution +of a simulation step on the GPU and (2) the CPU retrieves the GPU data and determines if another iteration must be launched or if the simulation has terminated. This approach allows a fine-grain control over the GPU execution, but it requires many costly data transfers as -each iteration results must be sent from the GPU to the CPU. In the +each iteration result must be sent from the GPU to the CPU. In the case of the MIOR model these costs are mainly due to the inherent PCI-express port latencies rather than to bandwidth limitation since -data sizes remains rather small, in the order of few dozens of +data sizes remains rather small, on the order of few dozens of Megabytes. In the second case the termination detection is directly implemented @@ -702,76 +716,74 @@ CPU and the GPU. In this part we present several MIOR GPU implementations using the distribution/gathering process described in the previous section and -compare their performance on two distinct hardware platform; i.e two +compare their performance on two distinct hardware platform, i.e., two different GPU devices. Five incremental MIOR implementations were realized with an increasing level of adaptation for the algorithm: in all cases, we choose the average time over 50 executions as a performance indicator. \begin{itemize} -\item the \textbf{GPU 1.0} implementation is a direct implementation +\item The \textbf{GPU 1.0} implementation is a direct implementation of the existing algorithm and its data structures where data - dependencies were removed and that use the non-compact topology + dependencies were removed, and it uses the non-compact topology representation described in Section~\ref{ch17:subsec:datastructures} -\item the \textbf{GPU 2.0} implementation uses the previously +\item The \textbf{GPU 2.0} implementation uses the previously described compact representation of the topology and remains otherwise identical to the GPU 1.0 implementation. -\item the \textbf{GPU 3.0} implementation introduces the manual +\item The \textbf{GPU 3.0} implementation introduces the manual copy into local (private) memory of often-accessed global data, such as carbon parts or topology information. -\item the \textbf{GPU 4.0} implementation is a variant of the - GPU 1.0 implementation which allows the execution of - multiple simulations for each kernel execution. -\item the \textbf{GPU 5.0} implementation is a multi-simulation of the - GPU 2.0 implementation, using the execution of multiple simulations - for each kernel execution as for GPU 4.0 +\item The \textbf{GPU 4.0} implementation is a variant of the GPU 1.0 + implementation but allows the execution of multiple simulations for + each kernel execution. +\item the \textbf{GPU 5.0} implementation is a multi-simulation + version of the GPU 2.0 implementation, using the execution of + multiple simulations for each kernel execution as for GPU 4.0. \end{itemize} -The two last implementations -- \textbf{GPU 4.0} and \textbf{GPU 5.0} --- illustrate the gain provided by a better usage of the hardware +The two last implementations \textbf{GPU 4.0} and \textbf{GPU 5.0} +illustrate the gain provided by a better usage of the hardware resources, thanks to the driver execution overlapping capabilities. A -sequential version of the MIOR algorithm, labeled as \textbf{CPU}, is -provided for comparison purpose. This sequential version is developed -in Java, the same language used for GPU implementations and the Sworm -model. +sequential version of the MIOR algorithm, labeled \textbf{CPU}, is +included for comparison purpose. This sequential version was developed +in Java, the same language used for GPU implementations. For these performance evaluations, two platforms are used. The first one is representative of the kind of hardware which is available on HPC clusters. It is a cluster node dedicated to GPU computations with -two Intel X5550 processors running at $2.67GHz$ and two Tesla C1060 -GPU devices running at $1.3$GHz and composed of $240$ cores ($30$ -multi-processors). Only one of these GPU is used in these experiments, -at the moment. The second platform illustrates what can be expected -from a personal desktop computer built a few years ago. It uses a -Intel Q9300 CPU, running at $2.5$GHz, and a Geforce 8800GT GPU card +two Intel X5550 processors running at $2.67$GHz and one Tesla C1060 +GPU device running at $1.3$GHz and composed of $240$ cores ($30$ +multi-processors). The second platform illustrates what can be +expected from a personal desktop computer built a few years ago. It +uses an Intel Q9300 CPU, running at $2.5$GHz, and a Geforce 8800GT GPU running at $1.5$GHz and composed of $112$ cores ($14$ multi-processors). The purpose of these two platforms is to assess the -benefit that could be obtained either when a scientist has access to -specialized hardware as a cluster or tries to take benefit from its +benefit that could be obtained when a scientist has access either to +specialized hardware as a cluster or tries to take advantage of its own personal computer. -\begin{figure}[h] +\begin{figure}[!h] %begin{minipage}[t]{0.49\linewidth} \centering \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/mior_perfs_tesla.pdf} %\caption{Performance CPU and GPU sur carte Tesla C1060} -\caption{CPU and GPU performance on a Tesla C1060 node} +\caption{CPU and GPU performance on a Tesla C1060 node.} \label{ch17:fig:mior_perfs_tesla} %end{minipage} \end{figure} Figures~\ref{ch17:fig:mior_perfs_tesla}~and~\ref{ch17:fig:mior_perfs_8800gt} show the execution time for $50$ simulations on the two hardware -platforms. A size factor is applied to the problem: at scale 1, -the model contains $38$ MM and $310$ OM, while at the scale 6 these +platforms. A size factor is applied to the problem: at scale 1, the +model contains $38$ MM and $310$ OM, while at the scale 6 these numbers are multiplied by six. The size of the environment is modified as well to maintain the same average agent density in the model. This -scaling factor display the impact of the chosen size of simulation on +scaling factor displays the impact of the chosen size of simulation on performance. %hspace{0.02\linewidth} %begin{minipage}[t]{0.49\linewidth} -\begin{figure}[h] +\begin{figure}[!h] \centering \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/mior_perfs_8800gt.pdf} \caption{CPU and GPU performance on a personal computer with a Geforce 8800GT} @@ -779,112 +791,114 @@ performance. %end{minipage} \end{figure} -The charts show that for small problems the execution times of all +\b The charts show that for small problems the execution times of all the implementations are very close. This is because the GPU execution does not have enough threads (representing agents) for an optimal -usage of GPU resources. This trend changes after scale $5$ where GPU -2.0 and GPU 3.0 take the advantage on the GPU 1.0 and CPU +usage of GPU resources. This trend changes around scale $5$ where GPU +2.0 and GPU 3.0 take the advantage over the GPU 1.0 and CPU implementations. This advantage continues to grow with the scaling factor, and reaches a speedup of $10$ at the scale $10$ between the fastest single-simulation GPU implementation and the first, naive one GPU 1.0. Multiple trends can be observed in these results. First, optimizations -for the GPU hardware show a big, positive impact on performance, +for the GPU hardware show a large, positive impact on performance, illustrating the strong requirements on the algorithm properties to reach execution efficiency. These charts also show that despite the -vast difference in numbers of cores between the two GPUs platforms the -same trends can be observed in both case. We can therefore expect +vast difference in numbers of cores between the two GPU platforms, the +same trends can be observed in both cases. We can therefore expect similar results on other GPU cards, without the need for more adaptations. -\begin{figure}[h] +\begin{figure}[!h] \centering \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/monokernel.pdf} \caption{Execution time of one multi-simulation kernel on the Tesla - platform} + platform.} \label{ch17:fig:monokernel_graph} \end{figure} -\begin{figure}[h] +\begin{figure}[!h] \centering \includegraphics[width=0.7\linewidth]{Chapters/chapter17/figs/multikernel.pdf} \caption{Total execution time for 1000 simulations on the Tesla - platform, while varying the number of simulations for each kernel} + platform, while varying the number of simulations for each kernel.} \label{ch17:fig:multikernel_graph} \end{figure} -There are two ways to measure simulations performance: (i) by +There are two ways to measure simulations performance: (1) by executing only one kernel, and varying its size (the number of -simulations executed in parallel), as shown on -Figure~\ref{ch17:fig:monokernel_graph}, to test the costs linked to the -parallelization process or (ii) by executing a fixed number of -simulations and varying the size -of each kernel, as shown on Figure~\ref{ch17:fig:multikernel_graph}. - -Figure~\ref{ch17:fig:monokernel_graph} illustrates the execution time for -only one kernel. It shows that for small numbers of simulations run in -parallel, the compact implementation of the model topology is faster -than the two-dimension matrix representation. This trends reverse with -more than $50$ simulations in parallel, which can be explained either by -the non linear progression of the synchronization costs or by the -additional memory required for the access-efficient representation. - -Figure~\ref{ch17:fig:multikernel_graph} illustrates the execution time of a -fixed number of simulations. It shows that for a small number of -simulations run in parallel, the costs resulting of program setup, -data copies and launch on GPU are determinant and very detrimental to +simulations executed in parallel), as shown in +Figure~\ref{ch17:fig:monokernel_graph}, to test the costs linked to +the parallelization process or (2) by executing a fixed number of +simulations and varying the size of each kernel, as shown in +Figure~\ref{ch17:fig:multikernel_graph}. + +Figure~\ref{ch17:fig:monokernel_graph} illustrates the execution time +for only one kernel. It shows that for small numbers of simulations +run in parallel, the compact implementation of the model topology is +faster than the two-dimension matrix representation. This trends +reverse with more than $50$ simulations in parallel, which can be +explained either by the nonlinear progression of the synchronization +costs or by the additional memory required for the access-efficient +representation. + +Figure~\ref{ch17:fig:multikernel_graph} illustrates the execution time +of a fixed number of simulations. It shows that for a small number of +simulations run in parallel, the costs resulting from program setup, +data copies, and launch on GPU are very detrimental to performance. Once the number of simulations executed for each kernel grows, these costs are counterbalanced by computation costs. This trend is more marked in the case of the sparse implementation (GPU -4.0) than the compact one but appears on both curves. With more than -$30$ simulations for each kernel, execution times stalls, since +4.0) than in the compact one but appears on both curves. With more +than $30$ simulations for each kernel, execution times stall, since hardware limits are reached. This indicates that the cost of preparing -and launching kernels become negligible compared to the computing -time once a good GPU occupation rate is achieved. +and launching kernels become negligible compared to the computing time +once a good GPU occupancy rate is achieved. \section{Analysis and recommendations} \label{ch17:analysis} In this section we synthesize the observations done on the two models -and identify some recommendations on implementing complex systems on +and identify some recommendations for implementing complex systems on GPU platforms. \subsection{Analysis} -In both the Collembola and the MIOR model, a critical problematic is -the determination of the part of the simulation which are to be run on +In both the collembola and the MIOR model, a critical problematic is +the determination of the parts of the simulation that are to be run on GPU and which are to remain on the CPU. The determination of these -parts is a classic step of any algorithm parallelization and must -take into account considerations such as the cost of the different -parts of the algorithm and the expected gains. +parts is a classic step of any algorithm parallelization and must take +into account considerations such as the cost of the different parts of +the algorithm and the expected gains. -In the case of the Collembola model two steps of the algorithm were -ported to GPU. Both steps use straightforward, easily-parallelizable, -operations where a direct gain can be expected by using more -executions cores without important modifications of the algorithm. +In the case of the collembola model two steps of the algorithm were +ported to GPU. Both steps use straightforward, easily parallelizable +operations where a direct gain can be expected by using more execution +cores without important modifications to the algorithm. -In the MIOR model case however no such inherently parallelizable parts -are evident in the original sequential algorithm. This is mainly +In the MIOR model case, however, no such inherently parallelizable +parts are evident in the original sequential algorithm. This is mainly explained by the rate of interactions between agents in this model in -the form of many operations (breathing, growth) on shared carbon -resources. In this case the algorithm had to be more profoundly -modified while keeping in head the need to remain true to the original +the form of two operations (breathing, growth) using heavily-shared +carbon resources. In this case the algorithm had to be more profoundly +modified while keeping in mind the need to remain true to the original model, to synchronize the main execution step of all agents in the model, to ensure equity, and to minimize the numbers of -synchronizations. The minimization is done by factoring the repartition of -carbon in the model in two separated step at the beginning and the end -of each iterations rather than at multiple point of the execution. +synchronizations. The minimization is done by factoring the +distribution of carbon in the model in two separated steps at the +beginning and the end of each iteration rather than at multiple points +of the execution. \subsection{MAS execution workflow} Many MAS simulations decompose their execution process into discrete evolution steps where each step represents a quantum of time (minimal -unit of time described). At the end of each step many global -information, graphical displays or results are updated. This -execution model may not correctly fit on GPU platforms as they assume -more or less a batch-like workflow model. The execution model must be -split into the following ever repeating steps: +unit of time described). At the end of each step many global data, +graphical displays or output files are updated. This execution model +may not correctly fit on GPU platforms as they assume more or less a +batch-like workflow model. The execution model must be split into the +following ever repeating steps: \begin{itemize} \item Allocation of GPU data buffers @@ -896,26 +910,26 @@ split into the following ever repeating steps: This workflow works well if the considered data transfer time is negligible compared to GPU execution or can be done in parallel, thanks to the asynchronous nature of OpenCL. If we are to update the -MAS model after each iteration then performance risk being +MAS model after each iteration then performance risks being degraded. This is illustrated in the MIOR model by the fact that the speedup observed on GPU is much more significant for bigger -simulations, which implies longer GPU execution times. Our solution to -this problem is to desynchronize the execution of the SMA model and -its GPU parts by requesting the execution of multiple steps of the -GPU simulations for each launch. +simulations, which imply longer GPU execution times. Our solution to +this problem is to desynchronize the execution of the MAS model and +its GPU parts by requesting the execution of multiple steps of the GPU +simulations for each launch. Another related prerequisite of GPU execution is the ability to have -many threads to be executed, to allow an efficient exploitation of the +many threads executed, to allow an efficient exploitation of the superior number of cores provided by the architecture. In the case of MAS models, this means that executing one agent at a time on GPU is meaningless in regard to GPU usage, copying cost, and actual gain in -execution time, if the process is not already parallel and costly at -this scale. In the MIOR and the Collembola models, this is solved by -executing the computations for all agents of the model at the same -time. If the model has however only chronic needs for intensive -computations then some kind of batching mechanism is required to store -waiting treatments in a queue, until the total sum of waiting -computations justify the transfers cost to the GPU platform. +execution time, if the agent computations are not complex enough. In +the MIOR and the collembola models, this is solved by executing the +computations for all agents of the model at the same time. If the +model has only chronic needs for intensive computations, then some +kind of batching mechanism is required to store waiting treatments in +a queue, until the total sum of waiting computations justifies the +transfer cost to the GPU platform. \subsection{Implementation challenges} @@ -924,39 +938,38 @@ implementation challenges also occur when implementing an OpenCL version of a MAS model, mainly related to the underlying limitations of the execution platform. -The first one is the impossibility (except in latest CUDA versions) -to dynamically allocate memory during execution. This is a problem in -the case of models where the number of agent can vary during the +The first one is the impossibility (except in latest CUDA versions) to +dynamically allocate memory during execution. This is a problem in the +case of models where the number of agents can vary during the simulation, such as prey-predator models. In this case, the only solution is to overestimate the size of arrays or data structures to -accommodate these additional individuals, or using the CPU to resize +accommodate these additional individuals, or to use the CPU to resize data structures when these situations occur. Both approaches require -to trend either memory or performance and are not always practical. +trending either memory or performance and are not always practical. Another limitation is the impossibility to store pointers in data -structures, we restraint any OpenCL to use only one dimension static -arrays or specific data structures such as textures. This preclude the -usage of structures like linked-list, graphs or sparse matrix not -represented by some combination of static arrays, and can be another -source of memory or performance losses. - -In the case of MIOR, this problem is especially exacerbated in the case -of neighboring storage: both representations consume much more memory -that actually required, since the worst case (all agents have -access to all others agents) has to be taken into account defensively -when dimensioning the data structure. - -The existence of different generations of GPU hardwares is also a +structures, since OpenCL only allows one dimension static arrays. This +precludes the usage of structures such as linked-list, graphs or +sparse matrices not represented by some combination of static arrays, +and can be another source of memory or performance losses. + +In the case of MIOR, this problem is especially exacerbated in the +case of neighboring storage: both representations consume much more +memory than is actually required, since the worst case (all agents +have access to all others agents) has to be taken into account when +dimensioning the data structure. + +The existence of different generations of GPU hardware is also a challenge. Older implementations, such as the four year old Tesla -C1060 cards, have very strong expectation in term of memory accesses +C1060 cards, have very strong constraints in term of memory accesses and requires very regular access patterns to perform efficiently. MAS having many random accesses, such as MIOR, or many small global memory accesses, such as Collembola, are penalized on these older cards. Fortunately, these requirements are less present is modern cards, which offer caches and other facilities traditionally present -on CPU to offset these kind of penalties. +on CPU to offset these kinds of penalties. -The final concern is related to the previous ones and often result in +The final concern is related to the previous ones and often results in more memory consumption. The amount of memory available on GPU cards is much more limited and adding new memory capabilities is more costly compared to expending a CPU RAM. On computing clusters, hardwares @@ -968,46 +981,46 @@ only use memory-inefficient OpenCL structures. \subsection{MCSMA} \label{ch17:Mcsma} -As shown in the previous sections, many data representations choices -are common to entire classes of MAS. The paradigm of grid, for example, -is often encountered in models where each cell constitutes either the -elementary unit of simulation (SugarScape~\cite{Dsouza2007}) or a -discretization of the environment space -(Pathfinding~\cite{Guy09clearpath}). These grids can be considered as -two or three-dimensional matrices, whose processing can be directly -distributed. +As shown in the previous sections, many data representation choices +are common to entire classes of MAS. The paradigm of grid, for +example, is often encountered in models where each cell constitutes +either the elementary unit of simulation +(SugarScape~\cite{Dsouza2007}) or a discretization of the environment +space (Pathfinding~\cite{Guy09clearpath}). These grids can be +considered as two- or three-dimensional matrices, whose processing can +be directly distributed. Another common data representation encountered in MAS system is the -usage of 2d or 3d coordinates to store the position of each agent of +usage of 2D or 3D coordinates to store the position of each agent of the model. In this case, even if the environment is no longer -discrete, the location information still imply computations -(euclidean distances) which can be parallelized. +discrete, the location information still imply computations (Euclidean +distances) which can be parallelized. -MCSMA~\cite{lmlm+13:ip} is a framework developed to provide to the MAS designer -those basic data structures, and the associated operations, to +MCSMA~\cite{lmlm+13:ip} is a framework developed to provide to the MAS +designer those basic data structures and the associated operations, to facilitate the portage of existing MAS on GPU. Two levels of utilization are provided to the developer, depending on its usage -profile. +profile:²< \begin{itemize} -\item A high-level library, constituted of modules regrouping classes - of operations. A module provides multiple methods of - distance computations in 1d, 2d or 3d grids, another one the - diffusion algorithm... -\item A low level API which allows the developed direct access to the - GPU and the inner working of MCSMA, to develop its own module in - the case where the required operations are not yet provided by the platform. +\item A high-level library, composed of modules regrouping classes of + operations. Such operation can distance computations in 1D, 2D or 3D + grids, diffusion or reduction operations on matrices... +\item A low-level API which allows the developer direct access to the + GPU and the inner working of MCSMA, to develop new modules in the + case where the required operations are not yet provided by the + platform. \end{itemize} -Both usage levels were illustrated in the above two practical cases. In -MIOR, the whole algorithm (baring initialization) is ported on GPU as -a specific plugin which allows executing $n$ MIOR simulations and -retrieve their results. This first approach requires extensive -adaptations to the original algorithm. In Collembola, on the -contrary, the main steps of the algorithm remain executed on the CPU, -and only specific operations are delegated to generic, already -existing, diffusion and reduction kernels. The fundamental algorithm -is not modified and GPU execution is only applied to specific parts of +Both usage levels were illustrated in the above two practical +cases. In MIOR, the whole algorithm (baring initialization) is ported +on GPU as a specific plugin which allows executing $n$ MIOR +simulations and retrieving their results. This first approach requires +extensive adaptations to the original algorithm. In collembola, to +the contrary, the main steps of the algorithm remain executed on the +CPU, and only specific operations are delegated to generic, already +existing diffusion and reduction kernels. The fundamental algorithm is +not modified and GPU execution is only applied to specific parts of the execution which may benefit from it. These two programming approaches allow incremental adaptations of existing Java MAS to accelerate their execution on GPU, while retaining the option to @@ -1017,38 +1030,38 @@ already existing ones. \section{Conclusion} \label{ch17:conclusion} -This chapter addresses the issue of complex system simulation by using -agent based paradigm and GPU hardware. From the experiments on two -existing Agent Based Models of soil science we intend to provide -useful information on the architecture, the algorithm design and, the -data management to run Agent based simulations on GPU, and more -generally to run compute intensive applications that are not based on -a not-so-regular model. The first result of this work is that adapting -the algorithm to a GPU architecture is possible and suitable to speed -up agent based simulations as illustrated by the MIOR model. Coupling -CPU with GPU seems even to be an interesting way to better take -advantage of the power given by computers and clusters as illustrated -by the Collembola model. From our point of view the adaptation process -is lighter than a traditional parallelization on distributed nodes and -not much difficult than a standard multi-threaded parallelization, -since all the data remains on the same host and can be shared in -central memory. The OCL adaptation also enables a portable simulator -that can be run on different graphical units. Even using a mainstream -card as the GPU card of a standard computer can lead to significant -performance improvements. This is an interesting result as it opens -the field of inexpensive HPC to a large community. - -In this perspective, we are working on MCSMA a development platform +This chapter has addressed the issue of complex system simulation by +using agent-based paradigms and GPU hardware. From the experiments on +two existing agent-based models of soil science we have provided +useful information on the architecture, the algorithm design, and the +data management to run agent-based simulations on GPU, and more +generally to run computationally intensive applications that are not +based on purely-matricial models. The first result of this work is +that adapting the algorithm to a GPU architecture is possible and +suitable to speed up agent based simulations as illustrated by the +MIOR model. Coupling CPU with GPU seems to be an interesting way to +take better advantage of the power given by computers and clusters as +illustrated by the collembola model. From our point of view the +adaptation process is less costy in time than a traditional +parallelization on distributed nodes and not much difficult than a +standard multithreaded parallelization, since all the data remains on +the same host and can be shared in central memory. The usage of OpenCL +also enables a portable simulator that can be run on different +graphical units. Even using a mainstream card such as the GPU card of +a standard computer can lead to significant performance +improvements. This is an interesting result as it opens up the field +of inexpensive HPC to a large community. + +In this perspective, we are working on MCSMA, a development platform that would facilitate the use of GPU or many-core architectures for multi-agent simulations. Our first work has been the definition of common, efficient, reusable data structures, such as grids or lists. Another goal is to provide easier means to control the distribution of specific processes between CPU or GPU, to allow the easy exploitation of the strengths of each platform in the same -multi-agent simulation. We think that the same approach, i.e -developing specific environments that facilitate the developer -access to the GPU power, can be applied in many domains with -compute intensive needs to open the GPU use to a larger community. +multi-agent simulation. We think that the same approach, i.e., +developing specific environments that facilitate the developer access +to the GPU power, can be applied in many domains with computationally +intensive needs to open the GPU use to a larger community. \putbib[Chapters/chapter17/biblio] -