new

[book_gpu.git] / BookGPU / Chapters / chapter17 / ch17.tex
diff --git a/BookGPU/Chapters/chapter17/ch17.tex b/BookGPU/Chapters/chapter17/ch17.tex

index 9d458de3b2fd3962e6b61691c5984f94dcd573d6..28e8251421d8a72c9cc4f13a965cafa148224348 100755 (executable)
--- a/BookGPU/Chapters/chapter17/ch17.tex
+++ b/BookGPU/Chapters/chapter17/ch17.tex
@@ -1,4 +1,9 @@
-\chapterauthor{G. Laville, C. Lang, K. Mazouzi, N. Marilleau, B. Herrmann, L. Philippe}{Femto-ST Institute, University of Franche-Comt{\'e}}
+\chapterauthor{Guillaume Laville}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Christophe Lang}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Kamel Mazouzi}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Nicolas Marilleau}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Bénédicte Herrmann}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Laurent Philippe}{Femto-ST Institute, University of Franche-Comte, France}
  
  \newlength\mylen
  \newcommand\myinput[1]{%
@@ -6,7 +11,7 @@
    \setlength\hangindent{\mylen}%
    \hspace*{\mylen}#1\\}
  
-\chapter{Implementing MAS on GPU}
+\chapter{Implementing Multi-Agent Systems on GPU}
  \label{chapter17}
  
  
@@ -22,7 +27,7 @@ systems are often modeled by mathematical representations and their
  dynamic behavior simulated by differential equations. The simulation
  of the system thus often relay on the resolution of a linear system
  that can be efficiently computed on a graphical processing unit as
-shown in the preceeding chapters. But when the behavior of the system
+shown in the preceding chapters. But when the behavior of the system
  elements is not uniformly driven by the same law, when these elements
  have their own behavior, the modeling process is too complex to rely
  on formal expressions. In this context MAS is a recognized approach to
@@ -42,7 +47,7 @@ solution to increase simulation performance but Graphical Processing
  Units (GPU) are also a promising technology with an attractive
  performance/cost ratio.
  
-Conceptually a MAS is a distributed system as it favors the definition
+Conceptually a MAS\index{Multi-Agent System} is a distributed system as it favors the definition
  and description of large sets of individuals, the agents, that can be
  run in parallel. As a large set of agents could have the same behavior
  a SIMD model should fit the simulation execution. Most of the
@@ -62,7 +67,7 @@ parallelism.
  Another key point that advocates for the use of many-core in MAS is
  the growing need for multi-scale simulations. Multi-scale simulations
  explore problems with interactions between several scales. The
-different scales use different granularities of the structure and
+different scales use different granularity of the structure and
  potentially different models. Most of the time the lower scale
  simulations provide results to higher scale simulations. In that case
  the execution of the simulations can easily be distributed between the
@@ -70,8 +75,8 @@ local cores and a many-core architecture, i.e. a GPU device.
  
  We explore in this chapter the use of many-core architectures to
  execute agent-based simulations. We illustrate our reflexion with two
-uses cases: the colembola simulator designed to simulate the diffusion
-of collembola between plots of land and the MIOR simulator that
+uses cases: the Colembola simulator designed to simulate the diffusion
+of Colembola between plots of land and the MIOR simulator that
  reproduces effects of earthworms on bacteria dynamics in a bulked
  soil. In Section \ref{ch17:ABM} we present the work related to MAS and
  parallelization with a special focus on many-core use. In sections
@@ -121,7 +126,7 @@ primitives or indirectly through the information stored in the
  environment.
  
  Agent based simulations have been used for more than one decade to
-reproduce, understand even predic complex system dynamic. They have proved their interest in various
+reproduce, understand even predict complex system dynamic. They have proved their interest in various
  scientific communities. Nowadays generic agent based frameworks are
  promoted such as Repast~\cite{repast_home} or
  NetLogo~\cite{netlogo_home} to implement simulators. Many ABM such as
@@ -280,7 +285,7 @@ of agents or simulations sharing common data (such as the environment)
  or algorithms (such as the background evolution process).
  
  In the following examples the JOCL~\cite{jocl_home} binding is used
-to access the OpenCL platfmorm from the Java programming language.
+to access the OpenCL platform from the Java programming language.
  
  In the next sections we present two practical cases that will be
  studied in details, from the model to its implementation and 
@@ -290,20 +295,20 @@ performance.
  \label{ch17:sec:1stmodel}
  
  The first model, the Collembola model, simulates the propagation of
-collembolas in flieds and forests. It is based on a diffusion
+collembolas in fields and forests. It is based on a diffusion
  algorithm which illustrates the case of agents with a simple behavior
  and few synchronization problems.
  
-\subsection{The Collembola model}
+\subsection{The Collembola model\index{Collembola model}}
  \label{ch17:subsec:collembolamodel}
  
  The Collembola model is an example of multi-agent system using GIS (Geographical Information System)
   and survey data (population count)
- to modelize the evolution of the biodiversity
-across land plots. A first version of this model has been developed with Netlogo framework by Bioemco and UMMISCO researchers. In this model, the biodiversity is modelized by populations of
+ to model the evolution of the biodiversity
+across land plots. A first version of this model has been developed with Netlogo framework by Bioemco and UMMISCO researchers. In this model, the biodiversity is modeled by populations of
  athropod individuals, the Collembola, which can reproduce and diffuse
  to favorable new habitats. The simulator allows to study the diffusion
-of collembola, between plots of land depending on their landuse
+of collembola, between plots of land depending on their use
  (artifical soil, crop, forest\ldots). In this
  model the environment is composed of the studied land and the
  colembola are agents. Every land plot is divided into several cells,
@@ -340,10 +345,10 @@ associated to one plot to obtain its population.
  
  The {\bf diffusion} stage simulates the natural behavior of the
  collembola that tends toward occupying the whole space over time. This
-stage constists in computing a new value for each cell depending on
+stage consists in computing a new value for each cell depending on
  the population of the neighbor cells. This process can be assimilated
-to a linear diffusion at each iteration of the populationof the cells
-across their neightbors.
+to a linear diffusion at each iteration of the population of the cells
+across their neighbors.
  
  These two processes are quite common in numerical computations so
  that the collembola model can be adapted to a GPU execution without
@@ -382,7 +387,7 @@ calculated, and divided by the number of neightbors. Note that, at the
  model frontier, populations can also overflow outside the environment
  grid but we do not manage those external populations, since there are
  no reason to assume our model to be isolated of its surroundings. The
-overflow by neigbors value is stored for each cell, before
+overflow by neighbors value is stored for each cell, before
  encountering the barrier. After the barrier is met, each cell read the
  overflows stored by all its neighbors at the previous step and applies
  them to its own population. In this manner, only one barrier is
@@ -403,7 +408,7 @@ could be implemented.
  In this part we present the performance of the Collembola model on
  various CPU and GPU execution
  platforms. Figure~\ref{ch17:fig:mior_perfs_collem} shows that the
-number of cores and the processor archiecture as a strong influence on the obtained results
+number of cores and the processor architecture as a strong influence on the obtained results
  
  % : the
  % dual-core processor equipping our Thinkpad platform has two to six
@@ -425,7 +430,7 @@ number of cores and the processor archiecture as a strong influence on the obtai
  \item Older GPU cards can be slower than modern processors. This can
    be explained by the new cache and memory access optimizations
    implemented in newer generations of GPU devices. These optimizations
-  reduce the penalities associated with irregular and frequent global
+  reduce the penalties associated with irregular and frequent global
    memory accesses. They are not available on our Tesla nodes.
  \item GPU curves exhibit a odd-even pattern in their performance
    results. Since this phenomenon is visible on two distinct
@@ -441,7 +446,7 @@ number of cores and the processor archiecture as a strong influence on the obtai
  %begin{minipage}[t]{0.49\linewidth}
  \centering
  \includegraphics[width=0.7\linewidth]{./Chapters/chapter17/figs/collem_perfs_nothinkpad.pdf}
-\caption{Performances CPU et GPU du modèle Collemboles}
+\caption{Performance of the Collembola model on CPU and GPU}
  \label{ch17:fig:mior_perfs_collem}
  %end{minipage}
  \end{figure}
@@ -452,7 +457,7 @@ on modern hardware. These gains are more mixed on older GPU platforms
  due to the limitations when dealing with irregular memory or execution
  patterns often encountered in MAS systems. This can be closely linked
  to the availability of caching facilities on the GPU hardware and its
-dramatical effects depending on the locality and frequency of data
+dramatically effects depending on the locality and frequency of data
  accesses. In this case, even if the Tesla architectures provides more
  execution cores and is the far costlier solution, more recent, cheaper,
  solutions such as high-end GPU provide better performance when the
@@ -619,7 +624,7 @@ list and the number of non productive accesses.
  \end{figure}
  
  Since dynamic memory allocation is not possible yet in OpenCL and only
-provided in the lastest revisions of the CUDA standard, these matrices
+provided in the latest revisions of the CUDA standard, these matrices
  are statically allocated. The allocation is based on the worst-case
  scenario where all OM and MM are linked as the real occupation of the
  matrix cells cannot be deduced without some kind of preprocessing
@@ -674,7 +679,7 @@ detection can be done either on the CPU or the GPU but it requires a
  global view on the system execution.
  
  In the first case, when the CPU controls the GPU simulation process,
-the dectection is done in two steps: (i) the CPU starts the execution
+the detection is done in two steps: (i) the CPU starts the execution
  of a simulation step on the GPU, (ii) the CPU retrieves the GPU data
  and determines if another iteration must be launched or if the
  simulation has terminated. This approach allows a fine-grain control
@@ -724,9 +729,9 @@ performance indicator.
  
  The two last implementations -- \textbf{GPU 4.0} and \textbf{GPU 5.0}
  -- illustrate the gain provided by a better usage of the hardware
-resources, thanks to the driver execution overlaping capabilities. A
+resources, thanks to the driver execution overlapping capabilities. A
  sequential version of the MIOR algorithm, labeled as \textbf{CPU}, is
-provided for comparison purpose. This sequential version is developped
+provided for comparison purpose. This sequential version is developed
  in Java, the same language used for GPU implementations and the Sworm
  model.
  
@@ -838,7 +843,7 @@ hardware limits are reached. This indicates that the cost of preparing
  and launching kernels become negligible compared to the computing
  time once a good GPU occupation rate is achieved.
  
-\section{Analysis and recommandations}
+\section{Analysis and recommendations}
  \label{ch17:analysis}
  
  In this section we synthesize the observations done on the two models
@@ -855,7 +860,7 @@ take into account considerations such as the cost of the different
  parts of the algorithm and the expected gains.
  
  In the case of the Collembola model two steps of the algorithm were
-ported to GPU. Both steps use straighforward, easily-parallelizable,
+ported to GPU. Both steps use straightforward, easily-parallelizable,
  operations where a direct gain can be expected by using more
  executions cores without important modifications of the algorithm.
  
@@ -863,10 +868,10 @@ In the MIOR model case however no such inherently parallelizable parts
  are evident in the original sequential algorithm. This is mainly
  explained by the rate of interactions between agents in this model in
  the form of many operations (breathing, growth) on shared carbon
-ressources. In this case the algorithm had to be more profundly
+resources. In this case the algorithm had to be more profoundly
  modified while keeping in head the need to remain true to the original
  model, to synchronize the main execution step of all agents in the
-model, to ensure enquity, and to minimize the numbers of
+model, to ensure equity, and to minimize the numbers of
  synchronizations. The minimization is done by factoring the repartition of
  carbon in the model in two separated step at the beginning and the end
  of each iterations rather than at multiple point of the execution.
@@ -892,8 +897,8 @@ This workflow works well if the considered data transfer time is
  negligible compared to GPU execution or can be done in parallel,
  thanks to the asynchronous nature of OpenCL. If we are to update the
  MAS model after each iteration then performance risk being
-degratated. This is illustrated in the MIOR model by the fact that the
-speedup observerd on GPU is much more significant for bigger
+degraded. This is illustrated in the MIOR model by the fact that the
+speedup observed on GPU is much more significant for bigger
  simulations, which implies longer GPU execution times. Our solution to
  this problem is to desynchronize the execution of the SMA model and
  its GPU parts by requesting the execution of multiple steps of the
@@ -907,9 +912,9 @@ meaningless in regard to GPU usage, copying cost, and actual gain in
  execution time, if the process is not already parallel and costly at
  this scale. In the MIOR and the Collembola models, this is solved by
  executing the computations for all agents of the model at the same
-time. If the model has hovewer only chronic needs for intensive
+time. If the model has however only chronic needs for intensive
  computations then some kind of batching mechanism is required to store
-waiting treatements in a queue, until the total sum of waiting
+waiting treatments in a queue, until the total sum of waiting
  computations justify the transfers cost to the GPU platform.
  
  \subsection{Implementation challenges}
@@ -919,14 +924,14 @@ implementation challenges also occur when implementing an OpenCL
  version of a MAS model, mainly related to the underlying limitations
  of the execution platform.
  
-The first one is the impossibility (except in lastest CUDA versions)
-to dynamicaly allocate memory during execution. This is a problem in
+The first one is the impossibility (except in latest CUDA versions)
+to dynamically allocate memory during execution. This is a problem in
  the case of models where the number of agent can vary during the
  simulation, such as prey-predator models. In this case, the only
  solution is to overestimate the size of arrays or data structures to
-accomodate these additionnal individuals, or using the CPU to resize
-data structures when these situations occur. Both approachs require
-to trend either memory or performance and are not always pratical.
+accommodate these additional individuals, or using the CPU to resize
+data structures when these situations occur. Both approaches require
+to trend either memory or performance and are not always practical.
  
  Another limitation is the impossibility to store pointers in data
  structures, we restraint any OpenCL to use only one dimension static
@@ -935,10 +940,10 @@ usage of structures like linked-list, graphs or sparse matrix not
  represented by some combination of static arrays, and can be another
  source of memory or performance losses.
  
-In the case of MIOR, this problem is especially exarcebed in the case
-of neighboring storage: both representations consum much more memory
+In the case of MIOR, this problem is especially exacerbated in the case
+of neighboring storage: both representations consume much more memory
  that actually required, since the worst case (all agents have
-access to all others agents) has to be taken into account defensivly
+access to all others agents) has to be taken into account defensively
  when dimensioning the data structure.
  
  The existence of different generations of GPU hardwares is also a
@@ -948,8 +953,8 @@ and requires very regular access patterns to perform efficiently. MAS
  having many random accesses, such as MIOR, or many small global memory
  accesses, such as Collembola, are penalized on these older
  cards. Fortunately, these requirements are less present is modern
-cards, which offer caches and other facilities traditionnaly present
-on CPU to offset these kind of penalities.
+cards, which offer caches and other facilities traditionally present
+on CPU to offset these kind of penalties.
  
  The final concern is related to the previous ones and often result in
  more memory consumption. The amount of memory available on GPU cards
@@ -967,7 +972,7 @@ As shown in the previous sections, many data representations choices
  are common to entire classes of MAS. The paradigm of grid, for example,
  is often encountered in models where each cell constitutes either the
  elementary unit of simulation (SugarScape~\cite{Dsouza2007}) or a
-discretisation of the environment space
+discretization of the environment space
  (Pathfinding~\cite{Guy09clearpath}). These grids can be considered as
  two or three-dimensional matrices, whose processing can be directly
  distributed.
@@ -976,25 +981,25 @@ Another common data representation encountered in MAS system is the
  usage of 2d or 3d coordinates to store the position of each agent of
  the model. In this case, even if the environment is no longer
  discrete, the location information still imply computations
-(euclidian distances) which can be parallelized.
+(euclidean distances) which can be parallelized.
  
-MCSMA~\cite{lmlm+13:ip} is a framework developped to provide to the MAS designer
+MCSMA~\cite{lmlm+13:ip} is a framework developed to provide to the MAS designer
  those basic data structures, and the associated operations, to
  facilitate the portage of existing MAS on GPU. Two levels of
-utilization are provided to the developper, depending on its usage
+utilization are provided to the developer, depending on its usage
  profile.
  
  \begin{itemize}
-\item A high-level library, constitued of modules regrouping classes
+\item A high-level library, constituted of modules regrouping classes
    of operations. A module provides multiple methods of
    distance computations in 1d, 2d or 3d grids, another one the
    diffusion algorithm...
-\item A low level API which allows the developped direct access to the
+\item A low level API which allows the developed direct access to the
    GPU and the inner working of MCSMA, to develop its own module in
    the case where the required operations are not yet provided by the platform.
  \end{itemize}
  
-Both usage levels were illustrated in the above two pratical cases. In
+Both usage levels were illustrated in the above two practical cases. In
  MIOR, the whole algorithm (baring initialization) is ported on GPU as
  a specific plugin which allows executing $n$ MIOR simulations and
  retrieve their results. This first approach requires extensive
@@ -1006,7 +1011,7 @@ is not modified and GPU execution is only applied to specific parts of
  the execution which may benefit from it. These two programming
  approaches allow incremental adaptations of existing Java MAS to
  accelerate their execution on GPU, while retaining the option to
-develop their own reusable or more efficeint module to supplement the
+develop their own reusable or more efficient module to supplement the
  already existing ones.
  
  \section{Conclusion}
@@ -1041,7 +1046,7 @@ lists. Another goal is to provide easier means to control the
  distribution of specific processes between CPU or GPU, to allow the
  easy exploitation of the strengths of each platform in the same
  multi-agent simulation. We think that the same approach, i.e
-developping specific environments that facilitate the developper
+developing specific environments that facilitate the developer
  access to the GPU power, can be applied in many domains with
  compute intensive needs to open the GPU use to a larger community.