new

author couturie <couturie@extinction>

Mon, 22 Jul 2013 18:12:39 +0000 (20:12 +0200)

committer couturie <couturie@extinction>

Mon, 22 Jul 2013 18:12:39 +0000 (20:12 +0200)
author couturie <couturie@extinction>
Mon, 22 Jul 2013 18:12:39 +0000 (20:12 +0200)
committer couturie <couturie@extinction>
Mon, 22 Jul 2013 18:12:39 +0000 (20:12 +0200)
diff --git a/BookGPU/Chapters/chapter14/ch14.tex b/BookGPU/Chapters/chapter14/ch14.tex

index 318015f7bdff1131bb78b89603d99f62a53022a6..174a250d35b0c616577880a63be73ebe641369b1 100755 (executable)
--- a/BookGPU/Chapters/chapter14/ch14.tex
+++ b/BookGPU/Chapters/chapter14/ch14.tex
@@ -10,17 +10,17 @@ application}
  \section{Introduction}
  The lattice Boltzmann (LB) method \index{Lattice Boltzmann method} (for an overview see, e.g.,
  \cite{succi-book}) has become a popular approach to a variety of fluid
-dynamics problems.  It provides a way to solve the incompressible,
+dynamics problems.  It provides a way to solve the incompressible
  isothermal Navier-Stokes equations and has the attractive features of
  being both explicit in time and local in space. This makes the LB
-method well-suited to parallel computation. Many efficient parallel
+method wellsuited to parallel computation. Many efficient parallel
  implementations of the LB method have been undertaken, typically using
  a combination of distributed domain decomposition and the Message
  Passing Interface (MPI). However, the potential
-performance benefits offered by GPUs has motivated a new `mixed-mode'
+performance benefits offered by GPUs has motivated a new ``mixed-mode''
  approach to address very large problems. Here, fine-grained
  parallelism is implemented on the GPU, while MPI is reserved for
-larger-scale parallelism.  This mixed mode is of increasing interest
+largerscale parallelism.  This mixed mode is of increasing interest
  to application programmers at a time when many supercomputing services
  are moving
  toward clusters of GPU accelerated nodes. The design questions which
@@ -61,7 +61,7 @@ an LB application developed specifically for complex fluids
  (\textit{Ludwig} was named for Boltzmann, 1844--1906).
  We will present the steps
  required to allow \textit{Ludwig} to exploit efficiently both a single
-GPU, and also many GPUs in parallel.  We show that
+GPU and many GPUs in parallel.  We show that
  \textit{Ludwig} scales excellently to at least the one thousand GPU level
  (the largest resource available at the time of writing) with
  indications that the code will scale to much larger systems as
@@ -69,7 +69,7 @@ they become available. In addition, we consider the steps required
  to represent fully-resolved moving solid particles (usually referred
  to as colloids in the context of complex fluids). Such particles
  need to have their surface resolved on the lattice scale in order
-to include relevant surface physics, and must be able to move: e.g.,
+to include relevant surface physics and must be able to move, e.g.,
  to execute Brownian motion in response to random forces from the
  fluid. Standard methods are available to represent such particles
  (e.g., \cite{ladd1994,nguyen2002}) which are amenable to effective
@@ -82,7 +82,7 @@ In the following section we provide a brief overview of the lattice
  Boltzmann method, and mention some of the general issues which can
  influence performance in the CPU context.
  In Section \ref{ch14:sec:singlegpu}, we then describe the alterations
-which are required to exploit the GPU architecture effectively, and
+which are required to exploit the GPU architecture effectively and
  highlight how the resulting code differs from the CPU version. In
  Section \ref{ch14:sec:parallelgpu}, we extend this description to
  include the steps required to allow exploitation of many GPUs in
@@ -128,8 +128,8 @@ are chosen to capture the correct symmetries of the Navier-Stokes
  equations. A typical choice, used here, is the so-called D3Q19
  basis in three dimensions where there is one velocity such that
  $\mathbf{c} \Delta t$ is zero, along with six extending to the nearest
-neighbour
-lattice sites, and twelve extending to the next-nearest neighbour sites
+neighbor
+lattice sites, and twelve extending to the next-nearest neighbor sites
  ($\Delta t$ being the discrete time step). The fundamental object
  in LB is then the distribution function $f_i (\mathbf{r};t)$ whose
  moments are related to the local hydrodynamic quantities: the fluid
@@ -139,16 +139,16 @@ function is described by a discrete Boltzmann equation
  f_i(\mathbf{r} + \mathbf{c}_i \Delta t; t) - f_i(\mathbf{r}; t) 
  = - {\cal L}_{ij} f_j(\mathbf{r};t).
  \end{equation}
-It is convenient to think of this in two stages. First, the right hand
+It is convenient to think of this in two stages. First, the right-hand
  side represents the action of a collision operator ${\cal L}_{ij}$,
  which is local to each lattice site and relaxes the distribution toward
  a local equilibrium at a rate ultimately related to the fluid viscosity.
-Second, the left hand side represents a propagation step (sometimes referred
+Second, the left-hand side represents a propagation step (sometimes referred
  to as streaming step), in which each element $i$ of the distribution is
  displaced $\mathbf{c}_i \Delta t$, i.e., one lattice spacing in the
  appropriate direction per discrete time step. 
  
-More specifically, we store a vector of 19 double precision floating
+More specifically, we store a vector of 19 double-precision floating
  point values at each lattice site for the distribution function
  $f_i(\mathbf{r};t)$.
  The collision operation ${\cal L}_{ij}$, which is local at each lattice
@@ -156,19 +156,35 @@ site, may be thought of as follows. A matrix-vector multiplication
  ${\cal M}_{ij}f_j$ is used to transform the distributions into the
  hydrodynamic quantities, where ${\cal M}_{ij}$ is a constant 19x19
  matrix related to the choice of
-$\mathbf{c}_i$. The non-conserved hydrodynamic quantities are then
-relaxed toward their (known) equilibrium values, and are transformed
-back to new post-collision distributions via the inverse transformation
-${\cal M}^{-1}_{ij}$. This gives rise to the need for a minimum of 2x19$^2$
+$\mathbf{c}_i$. The nonconserved hydrodynamic quantities are then
+relaxed toward their (known) equilibrium values and are transformed
+back to new postcollision distributions via the inverse transformation
+${\cal M}^{-1}_{ij}$. This gives rise to the need for a minimum of $2\times 19^2$
  floating point multiplications per lattice site. (Additional operations are
  required to implement, for example, the force $\mathbf{f}(\mathbf{r})$.)
  In contrast, the
  propagation stage consists solely of moving the distribution values
-between lattice sites, and involves no floating point operations.
+between lattice sites and involves no floating point operations.
+
+
+\begin{figure}[tbp]
+\centering
+\includegraphics[width=12cm]{Chapters/chapter14/figures/decomphalo}
+\caption[The lattice is decomposed between MPI tasks.]{Left: the lattice is decomposed between MPI tasks. For
+  clarity we show a 2D decomposition of a 3D lattice, but in practice
+  we decompose in all 3 dimensions. Halo cells are added to each
+  subdomain (as shown on the upper right for a single slice) which store
+  data retrieved from remote neighbors in the halo exchange. Lower
+  right: the D3Q19 velocity set resident on a lattice site;
+  highlighted are the 5 ``outgoing'' elements to be transferred in a
+  specific direction.}
+\label{ch14:fig:decomphalo}
+\end{figure}
+
  
  In the CPU version, the \textit{Ludwig} implementation stores one time
  level of distribution values $f_i(\mathbf{r}; t)$. This distribution
-is stored as a flat array of C data type \texttt{double}, and laid out
+is stored as a flat array of C data type \texttt{double} and laid out
  so that all elements of the velocity are contiguous at a given site
  (often referred to as ``array-of-structures'' order). This is
  appropriate for sums over the distribution required locally at the
@@ -182,19 +198,21 @@ standard sequential optimisations are relevant for the collision stage
  \cite{wellein2006}).  For example, the collision stage in
  \textit{Ludwig} includes explicit SIMD code, which is useful if the
  compiler on a given platform cannot identify it.  The propagation
-stage is separate, and is organised as a ``pull'' (again, various
-optimisation have been considered, e.g.,
-\cite{pohl2003,mattila2007,wittmann2012}).  No further optimisation is
+stage is separate and is organised as a ``pull'' (again, various
+optimizations have been considered, e.g.,
+\cite{pohl2003,mattila2007,wittmann2012}).  No further optimization is
  done here beyond ensuring that the ordering of the discrete velocities
  allows memory access to be as efficient as possible. While these
-optimisations are important, it should be remembered that for some
+optimizations are important, it should be remembered that for some
  complex fluid problems, the hydrodynamics embodied in the LB
  calculation is a relatively small part of the total computational cost
-(at the 10\%-level in some cases).  This means optimisation effort may
+(at the 10\%level in some cases).  This means optimization effort may
  be better concentrated elsewhere.
  
+
+
  \begin{lstlisting}[float, label=ch14:listing1,
-caption = Collision schematic for CPU.]
+caption = collision schematic for CPU.]
  /* loop over lattice sites */
  for (is = 0; is < nsites; is++) {
    ...
@@ -215,29 +233,17 @@ for (is = 0; is < nsites; is++) {
  \end{lstlisting}
  
  
-\begin{figure}[!t]
-\centering
-\includegraphics[width=12cm]{Chapters/chapter14/figures/decomphalo}
-\caption[The lattice is decomposed between MPI tasks.]{Left: the lattice is decomposed between MPI tasks. For
-  clarity we show a 2D decomposition of a 3D lattice, but in practice
-  we decompose in all 3 dimensions. Halo cells are added to each
-  sub-domain (as shown on the upper right for a single slice) which store
-  data retrieved from remote neighbours in the halo exchange. Lower
-  right: the D3Q19 velocity set resident on a lattice site;
-  highlighted are the 5 ``outgoing'' elements to be transferred in a
-  specific direction.}
-\label{ch14:fig:decomphalo}
-\end{figure}
+
  
  
  The regular 3D decomposition is illustrated in Fig.~\ref{ch14:fig:decomphalo}.
-Each local sub-domain is surrounded by a halo, or ghost, region of width one
+Each local subdomain is surrounded by a halo, or ghost, region of width one
  lattice site. While the collision is local, elements of the distribution
  must be exchanged at the edges of the domains to facilitate the propagation.
  To achieve the full 3D halo exchange, the standard approach of shifting the
-relevant data in each co-ordinate direction in turn is adopted. This
-requires appropriate synchronisation, i.e., a receive in the the first
-co-ordinate direction must be complete before a send in the second direction
+relevant data in each coordinate direction in turn is adopted. This
+requires appropriate synchronization, i.e., a receive in the the first
+coordinate direction must be complete before a send in the second direction
  involving relevant data can take place, and so on. We note that only
  ``outgoing'' elements of the distribution need to be sent at each edge.
  For the D3Q19 model, this reduces the volume of data traffic from 19 to
@@ -246,13 +252,13 @@ version, the necessary transfers are implemented in place using
  a vector of appropriately strided MPI datatypes for each direction.
  
  
-\section{Single GPU Implementation}\label{ch14:sec:singlegpu}
+\section{Single GPU implementation}\label{ch14:sec:singlegpu}
  
  
  In this section we describe the steps taken to enable \textit{Ludwig}
  for the GPU. There are a number of crucial issues: first, the
-minimisation of data traffic between host and device; second, the
-optimal mapping of available parallelism onto the architecture and
+minimization of data traffic between host and device; second, the
+optimal mapping of available parallelism onto the architecture; and
  third, the issue of memory coalescing. We discuss each of these in
  turn.
  
@@ -274,12 +280,12 @@ parallelism inherent in the GPU architecture, particularly for
  those matrix-vector operations within the collision kernel.
  The GPU architecture features a hierarchy of parallelism. At the
  lowest level, groups of 32 threads (warps) operate in
-lock-step on different data elements: this is SIMD style vector-level
+lock-step on different data elements: this is SIMD-style vector-level
  parallelism. Multiple warps are combined into a thread block (in which
-communication and synchronisation are possible), and multiple blocks can
+communication and synchronization are possible), and multiple blocks can
  run concurrently across the streaming multiprocessors in the GPU
  (with no communication or
-synchronisation possible across blocks).  To decompose on the GPU, we
+synchronization possible across blocks).  To decompose on the GPU, we
  must choose which part of the collision to assign to which level of
  parallelism.
  
@@ -290,14 +296,14 @@ to all levels of parallelism, i.e., we use a separate CUDA thread for
  every lattice site, and each thread performs the full matrix-vector
  operation. We find that a block size of 256 performs well on current
  devices: we therefore decompose
-the lattice into groups of 256 sites, and assign each group to a block
+the lattice into groups of 256 sites and assign each group to a block
  of CUDA threads. As the matrix ${\cal M}_{ij}$ is constant, it is
  assigned to the fast {\it constant} on-chip device memory.
  
  For the propagation stage, the GPU implementation adds a second time
-level of distribution values. The data-dependencies inherent in the
+level of distribution values. The datadependencies inherent in the
  propagation mean that the in-place propagation of the CPU version
-cannot be parallelised effectively without the additional time level.
+cannot be parallelized effectively without the additional time level.
  As both time levels may remain resident on the GPU, this is not a
  significant overhead.
  
@@ -308,7 +314,7 @@ is only achieved when data are structured such that threads within a
  segment in a single transaction: this is memory coalescing. It can be
  seen that the array-of-structures ordering used for the distribution
  in the CPU code would not be suitable for coalescing; in fact, it would
-result in serialised memory accesses and relative poor performance.
+result in serialized memory accesses and relative poor performance.
  To meet the coalescing criteria and allow consecutive threads to read
  consecutive memory addresses on the GPU, we transpose the layout of the
  distribution so that, for each velocity component, consecutive sites
@@ -316,7 +322,7 @@ are contiguous in memory (``structure-of-arrays'' order). A schematic of
  the GPU collision code is given in Listing~\ref{ch14:listing2}.
  
  \begin{lstlisting}[float, label=ch14:listing2,
-caption = Collision schematic for GPU.]
+caption = collision schematic for GPU.]
  /* compute current site index 'is' from CUDA thread and block */
  /* indices and load distribution from "structure-of-arrays" */
  
@@ -334,18 +340,17 @@ defer some further comments on software engineering aspects of the code to
  the summary.
  
  
-\section{Multiple GPU Implementation}\label{ch14:sec:parallelgpu}
+\section{Multiple GPU implementation}\label{ch14:sec:parallelgpu}
  
  To execute on multiple GPUs, we use the same domain decomposition
  and message passing framework as the CPU version. Within each
-sub-domain (allocated to one MPI task) the GPU implementation
+subdomain (allocated to one MPI task) the GPU implementation
  proceeds as described in the previous section. The only additional
  complication is that halo transfers between GPUs must be staged
-through the host (in future, direct GPU to GPU data transfers via
+through the host (in the future, direct GPU-to-GPU data transfers via
  MPI may be possible, obviating the need for these steps). This
-means host MPI sends must be preceded by appropriate device to host
-transfers and host MPI receives must be followed by corresponding host
-to device transfers.
+means host MPI sends must be preceded by appropriate device-to-host
+transfers and host MPI receives must be followed by corresponding host-to-device transfers.
  
  %To execute on multiple GPUs in parallel we decompose the problem using
  %the existing high-level parallelisaton framework: each MPI task acts
@@ -380,18 +385,18 @@ to device transfers.
  
  In practice, this data movement requires additional GPU kernels to
  pack and unpack the relevant data before and after corresponding MPI
-calls. However, the standard shift algorithm, in which each co-ordinate
+calls. However, the standard shift algorithm, in which each coordinate
  direction is treated in turn, does provide some scope for the
  overlapping of different operations.
-For example, after the data for the first co-ordinate direction have been
+For example, after the data for the first coordinate direction have been
  retrieved by the host, these can be exchanged using MPI between
  hosts at the same time as kernels for packing and retrieving of
-data for the second co-ordinate direction are
+data for the second coordinate direction are
  executed. This
  overlapping must respect the synchronisation required to ensure
-that data values at the corners of the sub-domain are transferred
+that data values at the corners of the subdomain are transferred
  correctly.
-We use a separate CUDA stream for each co-ordinate direction:
+We use a separate CUDA stream for each coordinate direction:
  this allows some of the host-device communication time to be
  effectively ``hidden'' behind the host-host MPI communication,
  resulting in an overall speedup.
@@ -415,7 +420,7 @@ CPUs per node, and with nodes interconnected using Cray Gemini technology.
  For GPU results, a Cray XK6 system is used: this is very similar to the
  XE6, but has one CPU per node replaced with an NVIDIA X2090 GPU. Each
  node in the GPU system therefore features a single Opteron CPU acting
-as a host to a single GPU. The inter-node interconnect architecture is
+as a host to a single GPU. The internode interconnect architecture is
  the same as for the Cray XE6. The GPU performance tests use a prototype
  Cray XK6 system with 936 nodes (the largest available at the time of writing).
  To provide a fair comparison, we compare scaling on a \textit{per node}
@@ -465,7 +470,7 @@ a simulation in terms of accounting budgets, or electricity.
  
  Figure \ref{ch14:fig:scaling} shows the results of both weak and
  strong scaling tests (top and bottom panels, respectively). For
-weak scaling, where the local sub-domain size is kept fixed
+weak scaling, where the local subdomain size is kept fixed
  (here a relatively large 196$^3$ lattice), the time taken by
  an ideal code would remain constant. It can be seen that while
  scaling of the CPU version shows a pronounced upward slope, the
@@ -475,9 +480,9 @@ has a factor of 32 fewer MPI tasks per node, so communications
  require a significantly smaller number of larger data transfers.
  The performance
  advantage of the GPU version ranges from a factor of around 1.5 to
-around 1.8. Careful study by other authors \cite{williams2011} have
-found the absolute performance (in terms of floating point
-operations per second, or floating point operations per Watt) to
+around 1.8. Careful studies by other authors \cite{williams2011} have
+found the absolute performance (in terms of floating-point
+operations per second, or floating-point operations per Watt) to
  be remarkably similar between architectures.
  
  Perhaps of more interest is the strong scaling picture (lower
@@ -493,7 +498,7 @@ In contrast, the GPU version shows
  reasonably robust scaling even for the smaller system sizes and good
  scaling for the larger systems.
  
-\section{Moving Solid Particles}\label{ch14:sec:particles}
+\section{Moving solid particles}\label{ch14:sec:particles}
  
  \begin{figure}[t]
  \centering
@@ -535,15 +540,15 @@ The introduction of moving solid particles poses an additional hurdle
  to efficient GPU implementation of an LB code such as \textit{Ludwig}.
  In this section, we give a brief overview of the essential features
  of the CPU implementation, and how the considerations raised in the
-previous sections --- maximising parallelism and minimising both
-host to device and GPU to GPU data movement --- shape the design decisions
+previous sections---maximizing parallelism and minimising both
+host-to-device and GPU to GPU data movement---shape the design decisions
  for a GPU implementation. We restrict our discussion to a simple fluid
  in this section; additional solid-fluid boundary conditions (e.g.,
  wetting at a fluid-fluid-solid contact line) usually arise elsewhere
-in the calculation and broadly independent of the hydrodynamic boundary
+in the calculation broadly independent of the hydrodynamic boundary
  conditions which are described in what follows.
  
-Moving solid particles (here, spheres) are defined by a centre position
+Moving solid particles (here, spheres) are defined by a center position
  which is allowed to move continuously across the space of the lattice,
  and a fixed radius which is typically a few lattice spacings $\Delta r$.
  The surface of the particle is defined by a series of \textit{links}
@@ -551,7 +556,7 @@ where a discrete velocity propagation $\mathbf{c}_i \Delta t$ would
  intercept or cut the spherical shell (see Fig.~\ref{ch14:fig:bbl}).
  Hydrodynamic boundary conditions are then implemented via the standard
  approach of bounce-back on links \cite{ladd1994, nguyen2002}, where the
-relevant post-collision distribution values are reversed at the
+relevant postcollision distribution values are reversed at the
  propagation stage with an appropriate correction to allow for the
  solid body motion. The exchange of momentum at each link must then be
  accumulated around the entire particle surface to provide the net
@@ -560,11 +565,11 @@ and torque on the sphere. The particle motion can then be updated in a
  molecular dynamics-like step.
  
  For the CPU implementation, a number of additional MPI communications
-are required: (1) to exchange the centre position, radius, etc of each
+are required: (1) to exchange the center position, radius, etc of each
  particle as it moves and (2), to allow the accumulation of the force
  and torque for each particle (the links of which may be distributed
  between up to 8 MPI tasks in three dimensions). Appropriate marshaling
-of these data can provide an effective parallelisation for relatively
+of these data can provide an effective parallelization for relatively
  dense particle suspensions of up to  40\% solid volume fraction
  \cite{stratford2008}. As a final consideration, fluid distributions
  must be removed and replaced consistently as a given particle moves
@@ -579,7 +584,7 @@ entire particle code to the GPU. However, the code in question involves
  largely conditional logic (e.g., identifying cut surface links) and
  irregular memory accesses (e.g., access to distribution elements around
  a spherical particle). These operations seem poorly suited to effective
-parallelisation on the GPU. As an additional complication, the sums
+parallelization on the GPU. As an additional complication, the sums
  required over the particle surface would involve potentially tricky
  and inefficient reductions in GPU memory.
  \item
@@ -597,7 +602,7 @@ particle information.
  \end{enumerate}
  
  
-We have implemented the second option as follows. For each sub-domain,
+We have implemented the second option as follows. For each subdomain,
  a list of boundary-cutting links is assembled on the CPU which includes
  the identity of the relevant element of the distribution. This list,
  together with the particle information required to compute the correct
@@ -618,14 +623,14 @@ at solid lattice points by consulting a look-up table of solid/fluid
  status. On the GPU, it is perhaps preferable to perform the collision
  stage everywhere instead of moving the look-up table to the GPU and
  introducing the associated logic.
-Ultimately, the GPU might favour other boundary methods which treat solid and
+Ultimately, the GPU might favor other boundary methods which treat solid and
  fluid on a somewhat more equal basis, for example, the immersed boundary
  method \cite{ch14:immersed1,ch14:immersed2,ch14:immersed-lb} or smoothed profile method
  \cite{ch14:spm}.
  However, the approach adopted here  allows us to exploit
-the GPU for the intensive fluid simulation whilst maintaining the complex
+the GPU for the intensive fluid simulation while maintaining the complex
  code required for particles on the CPU. Overheads of CPU-GPU transfer are
-minimised by transferring only those data relevant to the hydrodynamic
+minimized by transferring only those data relevant to the hydrodynamic
  interaction implemented via bounce-back on links.
  
  % END NEW SOLID TEXT
@@ -736,7 +741,7 @@ we have tried to maintain the modular structure of the CPU where
  possible. For each data structure, such as the distribution, a separate
  analogue is maintained in both the CPU and GPU memory spaces. However,
  the GPU copy does not include the complete CPU structure: in
-particular, non-intrinsic datatypes such as MPI datatypes are not
+particular, nonintrinsic datatypes such as MPI datatypes are not
  required on the GPU. Functions to marshal data between CPU and GPU
  are provided for each data structure, abstracting the underlying
  CUDA implementation. (This reasonably lightweight abstraction layer
@@ -769,7 +774,7 @@ keep the entire problem resident on the device.
  \section{Summary}
  \label{ch14:sec:summary}
  
-We have described the steps take to implement, on both single and
+We have described the steps taken to implement, on both single and
  multiple GPUs, the \textit{Ludwig} code which was originally
  designed for complex fluid problems on a conventional CPU
  architecture. We have added the necessary functionality using
@@ -777,14 +782,14 @@ NVIDIA CUDA and discussed the important changes to the main
  data structures. By retaining domain decomposition and message
  passing via MPI, we have demonstrated it is possible to scale
  complex fluid problems to large numbers of GPUs in parallel.
-By following the key design criteria of maximising parallelism
-and minimising host-device data transfers, we have confirmed
+By following the key design criteria of maximizing parallelism
+and minimizing host-device data transfers, we have confirmed that
  the mixed MPI-GPU approach is viable for scientific applications.
  For the more intricate problem of moving solid particles, we find
  it is possible to retain the more serial elements related to
  particle link operations on the CPU, while offloading only the
  parallel lattice-based operations involving the LB distribution
-to the GPU. Again, this minimises host-device movement of data.
+to the GPU. Again, this minimizes host-device movement of data.
  
  From the software engineering viewpoint, some duplication of code
  to allow efficient implementation on both host and device is
@@ -797,7 +802,7 @@ codes which also exhibit portable performance (perhaps in conjunction
  with automatic tuning approaches).
  
  So, while the challenges in designing portable and efficient scientific
-applications remain very real, this work provides some hope the large
+applications remain very real, this work provides some hope that large
  clusters of GPU machines can be used effectively for a wide range of
  complex fluid problems.
  
diff --git a/BookGPU/Chapters/chapter18/biblio18.bib b/BookGPU/Chapters/chapter18/biblio18.bib

index fe7bb5d6083e7272a528ef1a1b3ec7d82addc3a2..c55e31d7832985eb39fafef2eab753ff63edf13e 100644 (file)
--- a/BookGPU/Chapters/chapter18/biblio18.bib
+++ b/BookGPU/Chapters/chapter18/biblio18.bib
@@ -34,7 +34,7 @@
  
  @manual{Nvid10,
   author = {{NVIDIA}},
- title = {CUDA CUBLAS library},
+ title = {{CUDA CUBLAS} library},
   year = {2011},
   Note = {Version 4.0},
   }
@@ -101,7 +101,7 @@ author = "Wu, C. and Xu, Z. and Lin, W. and  Ruan, J."
  
  
  @article{Werndl01032009,
-author = {Werndl, C}, 
+author = {Werndl, C.}, 
  title = {What Are the New Implications of Chaos for Unpredictability?},
  volume = {60}, 
  number = {1},
author	couturie <couturie@extinction>
	Mon, 22 Jul 2013 18:12:39 +0000 (20:12 +0200)
committer	couturie <couturie@extinction>
	Mon, 22 Jul 2013 18:12:39 +0000 (20:12 +0200)
BookGPU/Chapters/chapter14/ch14.tex		patch \| blob \| history
BookGPU/Chapters/chapter18/biblio18.bib		patch \| blob \| history