X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/bdec1b5087c2ea922fcf62ad0591b8d784ddf3b7..fa1939d2294b408e8a62f2d91149f369f8710113:/BookGPU/Chapters/chapter17/ch17.tex

diff --git a/BookGPU/Chapters/chapter17/ch17.tex b/BookGPU/Chapters/chapter17/ch17.tex
index 9d458de..d6e3a31 100755
--- a/BookGPU/Chapters/chapter17/ch17.tex
+++ b/BookGPU/Chapters/chapter17/ch17.tex
@@ -1,4 +1,9 @@
-\chapterauthor{G. Laville, C. Lang, K. Mazouzi, N. Marilleau, B. Herrmann, L. Philippe}{Femto-ST Institute, University of Franche-Comt{\'e}}
+\chapterauthor{Guillaume Laville, Christophe Lang, B{\'e}n{\'e}dicte Herrmann and Laurent Philippe}{Femto-ST Institute, University of Franche-Comte, France}
+%\chapterauthor{Christophe Lang}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Kamel Mazouzi}{Franche-Comte Computing Center, University of Franche-Comte, France}
+\chapterauthor{Nicolas Marilleau}{UMMISCO, Institut de Recherche pour le Developpement (IRD), France}
+%\chapterauthor{B{\'e}n{\'e}dicte Herrmann}{Femto-ST Institute, University of Franche-Comte, France}
+%\chapterauthor{Laurent Philippe}{Femto-ST Institute, University of Franche-Comte, France}
 
 \newlength\mylen
 \newcommand\myinput[1]{%
@@ -6,7 +11,7 @@
   \setlength\hangindent{\mylen}%
   \hspace*{\mylen}#1\\}
 
-\chapter{Implementing MAS on GPU}
+\chapter{Implementing Multi-Agent Systems on GPU}
 \label{chapter17}
 
 
@@ -22,7 +27,7 @@ systems are often modeled by mathematical representations and their
 dynamic behavior simulated by differential equations. The simulation
 of the system thus often relay on the resolution of a linear system
 that can be efficiently computed on a graphical processing unit as
-shown in the preceeding chapters. But when the behavior of the system
+shown in the preceding chapters. But when the behavior of the system
 elements is not uniformly driven by the same law, when these elements
 have their own behavior, the modeling process is too complex to rely
 on formal expressions. In this context MAS is a recognized approach to
@@ -42,7 +47,7 @@ solution to increase simulation performance but Graphical Processing
 Units (GPU) are also a promising technology with an attractive
 performance/cost ratio.
 
-Conceptually a MAS is a distributed system as it favors the definition
+Conceptually a MAS\index{Multi-Agent System} is a distributed system as it favors the definition
 and description of large sets of individuals, the agents, that can be
 run in parallel. As a large set of agents could have the same behavior
 a SIMD model should fit the simulation execution. Most of the
@@ -62,7 +67,7 @@ parallelism.
 Another key point that advocates for the use of many-core in MAS is
 the growing need for multi-scale simulations. Multi-scale simulations
 explore problems with interactions between several scales. The
-different scales use different granularities of the structure and
+different scales use different levels of granularity of the structure and
 potentially different models. Most of the time the lower scale
 simulations provide results to higher scale simulations. In that case
 the execution of the simulations can easily be distributed between the
@@ -70,8 +75,8 @@ local cores and a many-core architecture, i.e. a GPU device.
 
 We explore in this chapter the use of many-core architectures to
 execute agent-based simulations. We illustrate our reflexion with two
-uses cases: the colembola simulator designed to simulate the diffusion
-of collembola between plots of land and the MIOR simulator that
+use cases: the Collembola simulator designed to simulate the diffusion
+of Collembola between plots of land and the MIOR simulator that
 reproduces effects of earthworms on bacteria dynamics in a bulked
 soil. In Section \ref{ch17:ABM} we present the work related to MAS and
 parallelization with a special focus on many-core use. In sections
@@ -121,7 +126,7 @@ primitives or indirectly through the information stored in the
 environment.
 
 Agent based simulations have been used for more than one decade to
-reproduce, understand even predic complex system dynamic. They have proved their interest in various
+reproduce, understand, and even predict complex system dynamics. They have proved their value in various
 scientific communities. Nowadays generic agent based frameworks are
 promoted such as Repast~\cite{repast_home} or
 NetLogo~\cite{netlogo_home} to implement simulators. Many ABM such as
@@ -280,7 +285,7 @@ of agents or simulations sharing common data (such as the environment)
 or algorithms (such as the background evolution process).
 
 In the following examples the JOCL~\cite{jocl_home} binding is used
-to access the OpenCL platfmorm from the Java programming language.
+to access the OpenCL platform from the Java programming language.
 
 In the next sections we present two practical cases that will be
 studied in details, from the model to its implementation and 
@@ -290,20 +295,20 @@ performance.
 \label{ch17:sec:1stmodel}
 
 The first model, the Collembola model, simulates the propagation of
-collembolas in flieds and forests. It is based on a diffusion
+Collembola in fields and forests. It is based on a diffusion
 algorithm which illustrates the case of agents with a simple behavior
 and few synchronization problems.
 
-\subsection{The Collembola model}
+\subsection{The Collembola model\index{Collembola model}}
 \label{ch17:subsec:collembolamodel}
 
 The Collembola model is an example of multi-agent system using GIS (Geographical Information System)
  and survey data (population count)
- to modelize the evolution of the biodiversity
-across land plots. A first version of this model has been developed with Netlogo framework by Bioemco and UMMISCO researchers. In this model, the biodiversity is modelized by populations of
+ to model the evolution of the biodiversity
+across land plots. A first version of this model was developed with the NetLogo framework by Bioemco and UMMISCO researchers. In this model, the biodiversity is modeled by populations of
 athropod individuals, the Collembola, which can reproduce and diffuse
 to favorable new habitats. The simulator allows to study the diffusion
-of collembola, between plots of land depending on their landuse
+of Collembola between plots of land depending on their land use
 (artifical soil, crop, forest\ldots). In this
 model the environment is composed of the studied land and the
 colembola are agents. Every land plot is divided into several cells,
@@ -340,10 +345,10 @@ associated to one plot to obtain its population.
 
 The {\bf diffusion} stage simulates the natural behavior of the
 collembola that tends toward occupying the whole space over time. This
-stage constists in computing a new value for each cell depending on
+stage consists in computing a new value for each cell depending on
 the population of the neighbor cells. This process can be assimilated
-to a linear diffusion at each iteration of the populationof the cells
-across their neightbors.
+to a linear diffusion at each iteration of the population of the cells
+across their neighbors.
 
 These two processes are quite common in numerical computations so
 that the collembola model can be adapted to a GPU execution without
@@ -369,6 +374,7 @@ the total numbers of access needed to updated those populations.
 
 %\lstinputlisting[language=C,caption=Collembola OpenCL
 %kernels,label=fig:collem_kernels]{Chapters/chapter17/code/collem_kernels.cl}
+\pagebreak
 \lstinputlisting[caption=Collembola OpenCL Diffusion kernel,label=ch17:listing:collembola-diffuse]{Chapters/chapter17/code/collem_kernel_diffuse.cl}
 
 The reproduction, diffusion and culling steps are implemented on GPU
@@ -382,7 +388,7 @@ calculated, and divided by the number of neightbors. Note that, at the
 model frontier, populations can also overflow outside the environment
 grid but we do not manage those external populations, since there are
 no reason to assume our model to be isolated of its surroundings. The
-overflow by neigbors value is stored for each cell, before
+per-neighbor overflow value is stored for each cell, before
 encountering the barrier. After the barrier is met, each cell read the
 overflows stored by all its neighbors at the previous step and applies
 them to its own population. In this manner, only one barrier is
@@ -403,7 +409,7 @@ could be implemented.
 In this part we present the performance of the Collembola model on
 various CPU and GPU execution
 platforms. Figure~\ref{ch17:fig:mior_perfs_collem} shows that the
-number of cores and the processor archiecture as a strong influence on the obtained results
+number of cores and the processor architecture have a strong influence on the obtained results.
 
 % : the
 % dual-core processor equipping our Thinkpad platform has two to six
@@ -425,7 +431,7 @@ number of cores and the processor archiecture as a strong influence on the obtai
 \item Older GPU cards can be slower than modern processors. This can
   be explained by the new cache and memory access optimizations
   implemented in newer generations of GPU devices. These optimizations
-  reduce the penalities associated with irregular and frequent global
+  reduce the penalties associated with irregular and frequent global
   memory accesses. They are not available on our Tesla nodes.
 \item GPU curves exhibit a odd-even pattern in their performance
   results. Since this phenomenon is visible on two distinct
@@ -441,7 +447,7 @@ number of cores and the processor archiecture as a strong influence on the obtai
 %begin{minipage}[t]{0.49\linewidth}
 \centering
 \includegraphics[width=0.7\linewidth]{./Chapters/chapter17/figs/collem_perfs_nothinkpad.pdf}
-\caption{Performances CPU et GPU du modÃ¨le Collemboles}
+\caption{Performance of the Collembola model on CPU and GPU}
 \label{ch17:fig:mior_perfs_collem}
 %end{minipage}
 \end{figure}
@@ -452,7 +458,7 @@ on modern hardware. These gains are more mixed on older GPU platforms
 due to the limitations when dealing with irregular memory or execution
 patterns often encountered in MAS systems. This can be closely linked
 to the availability of caching facilities on the GPU hardware and its
-dramatical effects depending on the locality and frequency of data
+dramatic effects depending on the locality and frequency of data
 accesses. In this case, even if the Tesla architectures provides more
 execution cores and is the far costlier solution, more recent, cheaper,
 solutions such as high-end GPU provide better performance when the
@@ -619,7 +625,7 @@ list and the number of non productive accesses.
 \end{figure}
 
 Since dynamic memory allocation is not possible yet in OpenCL and only
-provided in the lastest revisions of the CUDA standard, these matrices
+provided in the latest revisions of the CUDA standard, these matrices
 are statically allocated. The allocation is based on the worst-case
 scenario where all OM and MM are linked as the real occupation of the
 matrix cells cannot be deduced without some kind of preprocessing
@@ -641,6 +647,7 @@ implementation by enforcing a quasi-sequential execution. It is
 necessary, in the case of MIOR as well as for other ABM, to ensure
 that each work-item is not too constrained in its execution.
 
+\pagebreak
 \lstinputlisting[caption=Main MIOR kernel,label=ch17:listing:mior_kernels]{Chapters/chapter17/code/mior_kernels.cl}
 
 From the sequential algorithm 1 where all the agents share the same
@@ -674,7 +681,7 @@ detection can be done either on the CPU or the GPU but it requires a
 global view on the system execution.
 
 In the first case, when the CPU controls the GPU simulation process,
-the dectection is done in two steps: (i) the CPU starts the execution
+the detection is done in two steps: (i) the CPU starts the execution
 of a simulation step on the GPU, (ii) the CPU retrieves the GPU data
 and determines if another iteration must be launched or if the
 simulation has terminated. This approach allows a fine-grain control
@@ -724,9 +731,9 @@ performance indicator.
 
 The two last implementations -- \textbf{GPU 4.0} and \textbf{GPU 5.0}
 -- illustrate the gain provided by a better usage of the hardware
-resources, thanks to the driver execution overlaping capabilities. A
+resources, thanks to the driver execution overlapping capabilities. A
 sequential version of the MIOR algorithm, labeled as \textbf{CPU}, is
-provided for comparison purpose. This sequential version is developped
+provided for comparison purposes. This sequential version is developed
 in Java, the same language used for GPU implementations and the Sworm
 model.
 
@@ -838,7 +845,7 @@ hardware limits are reached. This indicates that the cost of preparing
 and launching kernels become negligible compared to the computing
 time once a good GPU occupation rate is achieved.
 
-\section{Analysis and recommandations}
+\section{Analysis and recommendations}
 \label{ch17:analysis}
 
 In this section we synthesize the observations done on the two models
@@ -855,7 +862,7 @@ take into account considerations such as the cost of the different
 parts of the algorithm and the expected gains.
 
 In the case of the Collembola model two steps of the algorithm were
-ported to GPU. Both steps use straighforward, easily-parallelizable,
+ported to GPU. Both steps use straightforward, easily parallelizable
 operations where a direct gain can be expected by using more
 executions cores without important modifications of the algorithm.
 
@@ -863,10 +870,10 @@ In the MIOR model case however no such inherently parallelizable parts
 are evident in the original sequential algorithm. This is mainly
 explained by the rate of interactions between agents in this model in
 the form of many operations (breathing, growth) on shared carbon
-ressources. In this case the algorithm had to be more profundly
+resources. In this case the algorithm had to be more profoundly
 modified while keeping in head the need to remain true to the original
 model, to synchronize the main execution step of all agents in the
-model, to ensure enquity, and to minimize the numbers of
+model, to ensure equity, and to minimize the number of
 synchronizations. The minimization is done by factoring the repartition of
 carbon in the model in two separated step at the beginning and the end
 of each iterations rather than at multiple point of the execution.
@@ -892,8 +899,8 @@ This workflow works well if the considered data transfer time is
 negligible compared to GPU execution or can be done in parallel,
 thanks to the asynchronous nature of OpenCL. If we are to update the
 MAS model after each iteration then performance risk being
-degratated. This is illustrated in the MIOR model by the fact that the
-speedup observerd on GPU is much more significant for bigger
+degraded. This is illustrated in the MIOR model by the fact that the
+speedup observed on GPU is much more significant for bigger
 simulations, which implies longer GPU execution times. Our solution to
 this problem is to desynchronize the execution of the SMA model and
 its GPU parts by requesting the execution of multiple steps of the
@@ -907,9 +914,9 @@ meaningless in regard to GPU usage, copying cost, and actual gain in
 execution time, if the process is not already parallel and costly at
 this scale. In the MIOR and the Collembola models, this is solved by
 executing the computations for all agents of the model at the same
-time. If the model has hovewer only chronic needs for intensive
+time. If, however, the model has only occasional needs for intensive
 computations then some kind of batching mechanism is required to store
-waiting treatements in a queue, until the total sum of waiting
+pending tasks in a queue, until the total amount of pending
 computations justify the transfers cost to the GPU platform.
 
 \subsection{Implementation challenges}
@@ -919,14 +926,14 @@ implementation challenges also occur when implementing an OpenCL
 version of a MAS model, mainly related to the underlying limitations
 of the execution platform.
 
-The first one is the impossibility (except in lastest CUDA versions)
-to dynamicaly allocate memory during execution. This is a problem in
+The first one is the impossibility (except in the latest CUDA versions)
+to dynamically allocate memory during execution. This is a problem in
 the case of models where the number of agent can vary during the
 simulation, such as prey-predator models. In this case, the only
 solution is to overestimate the size of arrays or data structures to
-accomodate these additionnal individuals, or using the CPU to resize
-data structures when these situations occur. Both approachs require
-to trend either memory or performance and are not always pratical.
+accommodate these additional individuals, or to use the CPU to resize
+data structures when these situations occur. Both approaches require
+trading away either memory or performance and are not always practical.
 
 Another limitation is the impossibility to store pointers in data
 structures, we restraint any OpenCL to use only one dimension static
@@ -935,10 +942,10 @@ usage of structures like linked-list, graphs or sparse matrix not
 represented by some combination of static arrays, and can be another
 source of memory or performance losses.
 
-In the case of MIOR, this problem is especially exarcebed in the case
-of neighboring storage: both representations consum much more memory
+In the case of MIOR, this problem is especially exacerbated in the case
+of neighboring storage: both representations consume much more memory
 that actually required, since the worst case (all agents have
-access to all others agents) has to be taken into account defensivly
+access to all other agents) has to be taken into account defensively
 when dimensioning the data structure.
 
 The existence of different generations of GPU hardwares is also a
@@ -948,8 +955,8 @@ and requires very regular access patterns to perform efficiently. MAS
 having many random accesses, such as MIOR, or many small global memory
 accesses, such as Collembola, are penalized on these older
 cards. Fortunately, these requirements are less present is modern
-cards, which offer caches and other facilities traditionnaly present
-on CPU to offset these kind of penalities.
+cards, which offer caches and other facilities traditionally present
+on CPU to offset this kind of penalty.
 
 The final concern is related to the previous ones and often result in
 more memory consumption. The amount of memory available on GPU cards
@@ -967,7 +974,7 @@ As shown in the previous sections, many data representations choices
 are common to entire classes of MAS. The paradigm of grid, for example,
 is often encountered in models where each cell constitutes either the
 elementary unit of simulation (SugarScape~\cite{Dsouza2007}) or a
-discretisation of the environment space
+discretization of the environment space
 (Pathfinding~\cite{Guy09clearpath}). These grids can be considered as
 two or three-dimensional matrices, whose processing can be directly
 distributed.
@@ -976,25 +983,25 @@ Another common data representation encountered in MAS system is the
 usage of 2d or 3d coordinates to store the position of each agent of
 the model. In this case, even if the environment is no longer
 discrete, the location information still imply computations
-(euclidian distances) which can be parallelized.
+(Euclidean distances) which can be parallelized.
 
-MCSMA~\cite{lmlm+13:ip} is a framework developped to provide to the MAS designer
+MCSMA~\cite{lmlm+13:ip} is a framework developed to provide the MAS designer with
 those basic data structures, and the associated operations, to
 facilitate the portage of existing MAS on GPU. Two levels of
-utilization are provided to the developper, depending on its usage
+utilization are provided to the developer, depending on their usage
 profile.
 
 \begin{itemize}
-\item A high-level library, constitued of modules regrouping classes
+\item A high-level library, composed of modules grouping classes
   of operations. A module provides multiple methods of
   distance computations in 1d, 2d or 3d grids, another one the
   diffusion algorithm...
-\item A low level API which allows the developped direct access to the
+\item A low-level API which gives the developer direct access to the
   GPU and the inner working of MCSMA, to develop its own module in
   the case where the required operations are not yet provided by the platform.
 \end{itemize}
 
-Both usage levels were illustrated in the above two pratical cases. In
+Both usage levels were illustrated in the above two practical cases. In
 MIOR, the whole algorithm (baring initialization) is ported on GPU as
 a specific plugin which allows executing $n$ MIOR simulations and
 retrieve their results. This first approach requires extensive
@@ -1006,7 +1013,7 @@ is not modified and GPU execution is only applied to specific parts of
 the execution which may benefit from it. These two programming
 approaches allow incremental adaptations of existing Java MAS to
 accelerate their execution on GPU, while retaining the option to
-develop their own reusable or more efficeint module to supplement the
+develop their own reusable or more efficient module to supplement the
 already existing ones.
 
 \section{Conclusion}
@@ -1041,7 +1048,7 @@ lists. Another goal is to provide easier means to control the
 distribution of specific processes between CPU or GPU, to allow the
 easy exploitation of the strengths of each platform in the same
 multi-agent simulation. We think that the same approach, i.e
-developping specific environments that facilitate the developper
+developing specific environments that facilitate the developer's
 access to the GPU power, can be applied in many domains with
 compute intensive needs to open the GPU use to a larger community.