new

[book_gpu.git] / BookGPU / Chapters / chapter6 / PartieORWL.tex
diff --git a/BookGPU/Chapters/chapter6/PartieORWL.tex b/BookGPU/Chapters/chapter6/PartieORWL.tex

index 2efb11588cd5a5dbcfd95926f3f494ac28bc3120..5ff76bc02db93cc935015807782f67dc7d12f90b 100644 (file)
--- a/BookGPU/Chapters/chapter6/PartieORWL.tex
+++ b/BookGPU/Chapters/chapter6/PartieORWL.tex
@@ -1,38 +1,38 @@
-\clearpage
-\section{Perspective: A unifying programming model}
+%\clearpage
+\section{Perspective: a unifying programming model}
  \label{sec:ch6p3unify}
  
  In the previous sections we have seen that controlling a distributed GPU
-application when using tools that are commonly available is a quite challenging
+application when using tools that are commonly available is quite a challenging
  task. To summarize, such an application has components that can be roughly
-classified as:
+classified as
  
  \begin{description}
    \itemsep=0pt
-\item[CPU:] CPU bound computations, realized as procedures in the chosen
+\item[CPU:] CPU-bound computations, realized as procedures in the chosen
    programming language
-\item[CUDA$_{kern}$:] GPU bound computations, in our context realized as CUDA compute kernels
+\item[CUDA$_{kern}$:] GPU-bound computations, in our context realized as CUDA compute kernels
  \item[CUDA$_{trans}$:] data transfer between CPU and GPU, realized with CUDA function calls
  \item[MPI:] distributed data transfer routines, realized with MPI communication primitives
  \item[OpenMP:] inter-thread control, realized with OpenMP synchronization tools such as mutexes
-\item[CUDA$_{sync}$] synchronization of the GPU, realized with CUDA functions
+\item[CUDA$_{sync}$:] synchronization of the GPU, realized with CUDA functions
  \end{description}
  
  Among these, the last (CUDA$_{sync}$) is not strictly necessary on modern
-systems, but still recommended to obtain optimal performance. With or without
+systems, but it is still recommended to obtain optimal performance. With or without
  that last step, such an application is highly complex: it is difficult to design
  or to maintain, and depends on a lot of different software components. The goal
-of this section is to present a new path of development that allows to replace
-the last three or four types of components that control the application (MPI,
-OpenMP, CUDA$_{sync}$ and eventually CUDA$_{trans}$) by a single tool:
-\textbf{O}rdered \textbf{R}ead-\textbf{W}rite \textbf{L}ocks, ORWL\index{ORWL},
-see~\cite{clauss:2010:inria-00330024:1,gustedt:hal-00639289}. Besides the
-simplification of the algorithmic scheme that we already have mentioned, the
-ongoing implementation of ORWL allows to use a feature of modern platforms that
-can improve the performance of CPU bound computations: lock-free atomic
+of this section is to present a new path of development that allows the
+replacement of the last three or four types of components that control the application (MPI,
+OpenMP, CUDA$_{sync}$, and eventually CUDA$_{trans}$) with a single tool:
+{O}rdered {R}ead-{W}rite {L}ocks, ORWL\index{ORWL}
+(see~\cite{clauss:2010:inria-00330024:1,gustedt:hal-00639289}). Besides the
+simplification of the algorithmic scheme that we have already mentioned, the
+ongoing implementation of ORWL allows the use of a feature of modern platforms that
+can improve the performance of CPU-bound computations: lock-free atomic
  operations to update shared data consistently. For these, ORWL relies on new
  interfaces that are available with the latest revision of the ISO standard for
-the C programming language, see~\cite{C11}.
+the C programming language (see~\cite{C11}).
  
  \subsection{Resources}
  \label{sec:ch6p3resources}
@@ -41,57 +41,58 @@ ORWL places all its concepts that concern data and control around a single
  abstraction: \emph{resources}\index{ORWL!resource}. An ORWL resource may correspond to a local or
  remote entity and is identified through a \emph{location}\index{ORWL!location}, that is a unique
  identification through which it can be accessed from all different components of
-the same application. In fact, resources and locations (entities and their names
+the same application. In fact, resources and locations (entities and their names,
  so to speak) are mostly identified by ORWL and these words will be used
  interchangeably.
  
-Resources may be of very different kind:
+Resources may be of very different kinds:
  \begin{description}
  \item[Data] resources\index{ORWL!resource!data} are entities that represents data and not specific
-  memory buffers or locations. During the execution of an application it can be
+  memory buffers or locations. During the execution of an application they can be
    \emph{mapped} repeatedly into the address space and in effect be represented
    at different addresses. Data resources can be made accessible uniformly in all
-  parts of the application, provided that the locking protocol is observed, see
-  below. Data resources can have different persistence:
+  parts of the application, provided that the locking protocol is observed (see
+  below). Data resources can have different persistence:
    \begin{description}
    \item[RAM] data resources are typically temporary data that serve only during
      a single run of the application. They must be initialized at the beginning of
-    their lifetime and the contents is lost at the end.
+    their lifetime and the contents are lost at the end.
    \item[File] data resources are persistent and linked to a file in the file
      system of the platform.
    \item[Collective] data resources are data to which all tasks of an
      application contribute (see below). Examples for such resources are \emph{broadcast},
-    \emph{gather} or
-    \emph{reduce} resources, e.g to distribute initial data or to collect the
+    \emph{gather}, or
+    \emph{reduce} resources, e.g., to distribute initial data or to collect the
      result of a distributed computation.
    \end{description}
-  Other types of data resources could be easily implemented with ORWL, e.g web
+  Other types of data resources could be easily implemented with ORWL, e.g., web
    resources (through a ftp, http or whatever server address) or fixed hardware
    addresses.
  \item[Device] resources\index{ORWL!resource!device} represent hardware entities of the
    platform. ORWL can then be used to regulate the access to such device
    resources. In our context the most important such resource is the GPU, but we
-  could easily use it to represent a CPU core, a camera or other peripheral
+  could easily use it to represent a CPU core, a camera, or another peripheral
    device.
  \end{description}
  
  \Lst{algo:ch6p3ORWLresources} shows an example of a declaration of
  four resources per task. Two (\texttt{curBlock} and \texttt{nextBlock}) are
  intended to represent the data in a block-cyclic parallel matrix multiplication
-(see p.~\pageref{ch6:p1block-cyclic}),
-\texttt{GPU} represents a GPU device and \texttt{result} will represent a
+(see Section~\ref{ch6:p1block-cyclic}),
+\texttt{GPU} represents a GPU device, and \texttt{result} will represent a
  collective ``gather'' resource among all the tasks.
  
+\pagebreak
  %\begin{algorithm}[H]
  %  \caption{Declaration of ORWL resources for a block-cyclic matrix
  %    multiplication.}
  %  \label{algo:ch6p3ORWLresources}
-\begin{Listing}{algo:ch6p3ORWLresources}{Declaration of ORWL resources for a block-cyclic matrix multiplication}
+\begin{Listing}{algo:ch6p3ORWLresources}{declaration of ORWL resources for a block-cyclic matrix multiplication}
  #include "orwl.h"
  ...
  ORWL_LOCATIONS_PER_TASK(curBlock, nextBlock, GPU, result);
  ORWL_DATA_LOCATION(curBlock);
-ORWL_DATA_LOCATION(nexBlock);
+ORWL_DATA_LOCATION(nextBlock);
  ORWL_DEVICE_LOCATION(GPU);
  ORWL_GATHER_LOCATION(result);
  \end{Listing}
@@ -102,18 +103,18 @@ ORWL_GATHER_LOCATION(result);
  \label{sec:ch6p3ORWLcontrol}
  
  
-ORWL regulates access to all its resources, no ``random access'' to a resource
+ORWL regulates access to all its resources; no ``random access'' to a resource
  is possible. It doesn't even have a user-visible data type for resources.
  \begin{itemize}
  \item All access is provided through \emph{handle}s\index{ORWL!handle}. Similar to pointers or
    links, these only refer to a resource and help to manipulate it. Usually
    several handles to the same resource exist, even inside the same OS process
    or thread, or in the same application task.
-\item The access is locked with RW semantics, where \emph{R} stands for
-  concurrent \textbf{R}ead~access, and \emph{W} for exclusive
-  \textbf{W}rite~access. This feature replaces the control aspect of MPI
-  communications, OpenMP inter-thread control and CUDA$_{sync}$.
-\item This access is \textbf{O}rdered (or serialized)\index{ordered access} through a FIFO, \emph{one
+\item The access is locked with RW semantics, where {R} stands for
+  concurrent {R}ead~access, and {W} for exclusive
+  {W}rite~access. This feature replaces the control aspect of MPI
+  communications, OpenMP inter-thread control, and CUDA$_{sync}$.
+\item This access is {O}rdered (or serialized)\index{ordered access} through a FIFO, \emph{one
      FIFO per resource}. This helps to run the different tasks of an application
    in a controlled order and to always have all resources in a known state. This
    aspect largely replaces and extends the ordering of tasks that MPI typically
@@ -126,15 +127,16 @@ is possible. It doesn't even have a user-visible data type for resources.
  \subsection{Example: block-cyclic matrix multiplication (MM)}
  \label{sec:ch6p3ORWLMM}
  
-Let us now have a look how a block-cyclic matrix multiplication\index{matrix
+Let us now have a look at how a block-cyclic matrix multiplication\index{matrix
    multiplication!block cyclic} algorithm can be
-implemented with these concepts (\Lst{algo:ch6p3ORWLBCCMM}). Inside the loop it mainly consists of three
-different operations, of which the first two can be run concurrently, and the
+implemented with these concepts (\Lst{algo:ch6p3ORWLBCCMM}). Inside the loop
+there are mainly three
+different operations, the first two of which can be run concurrently, and the
  third must be done after the other two.
  %\begin{algorithm}[H]
  %  \caption{Block-cyclic matrix multiplication, high level per task view.}
  %  \label{algo:ch6p3ORWLBCCMM}
-\begin{Listing}{algo:ch6p3ORWLBCCMM}{Block-cyclic matrix multiplication, high level per task view}
+\begin{Listing}{algo:ch6p3ORWLBCCMM}{block-cyclic matrix multiplication, high level per task view}
  typedef double MBlock[N][N];
  MBlock A;
  MBlock B[k];
@@ -162,19 +164,20 @@ for (size_t i = 0; i < k; ++i) {
  %\end{algorithm}
  
  
-\Lst{algo:ch6p3ORWLlcopy} shows the local copy operation~3 as it could
+\Lst{algo:ch6p3ORWLlcopy} shows the local copy operation~3 from line 17 of
+\Lst{algo:ch6p3ORWLBCCMM} as it could
  be realized with ORWL.
  It uses two resource handles
  \texttt{nextRead} and \texttt{curWrite} and marks nested \emph{critical
    sections} for these handles. Inside the nested sections it obtains pointers to
  the resource data; the resource is \emph{mapped} into the address space of the
  program, and then a standard call to \texttt{memcpy} achieves the operation
-itself. The operation is integrated in its own \textbf{for}-loop, such that it
+itself. The operation is integrated in its own {for}-loop, such that it
  could run independently in an OS thread by its own.
  %\begin{algorithm}[H]
  %  \caption{An iterative local copy operation.}
  %  \label{algo:ch6p3ORWLlcopy}
-\begin{Listing}{algo:ch6p3ORWLlcopy}{An iterative local copy operation}
+\begin{Listing}{algo:ch6p3ORWLlcopy}{an iterative local copy operation}
  for (size_t i = 0; i < k; ++i) {
    ORWL_SECTION(nextRead) {
      MBlock const* sBlock = orwl_read_map(nextRead);
@@ -188,14 +191,14 @@ for (size_t i = 0; i < k; ++i) {
  %\end{algorithm}
  
  Next, in \Lst{algo:ch6p3ORWLrcopy} we copy data from a remote task to
-a local task. Substantially the operation is the same, only that in the example
-different handles (\texttt{remRead} and \texttt{nextWrite}) are used to
+a local task. Substantially the operation is the same, save for the
+different handles (\texttt{remRead} and \texttt{nextWrite}) that are used to
  represent the respective resources.
  %\begin{algorithm}[H]
  %  \caption{An iterative remote copy operation as part of a block cyclic matrix
  %    multiplication task.}
  %\label{algo:ch6p3ORWLrcopy}
-\begin{Listing}{algo:ch6p3ORWLrcopy}{An iterative remote copy operation as part of a block cyclic matrix multiplication task}
+\begin{Listing}{algo:ch6p3ORWLrcopy}{an iterative remote copy operation as part of a block cyclic matrix multiplication task}
  for (size_t i = 0; i < k; ++i) {
    ORWL_SECTION(remRead) {
      MBlock const* sBlock = orwl_read_map(remRead);
@@ -210,14 +213,14 @@ for (size_t i = 0; i < k; ++i) {
  
  Now let us have a look into the operation that probably interests us the most,
  the interaction with the GPU in \Lst{algo:ch6p3ORWLtrans}. Again there
-is much structural resemblance to the copy operations from above, only that  we
+is much structural resemblance to the copy operations from above, but  we
  transfer the data to the GPU in the innermost block and then run the GPU MM
-kernel while we still are inside the critical section for the GPU.
+kernel while we are still inside the critical section for the GPU.
  %\begin{algorithm}[H]
  %  \caption{An iterative GPU transfer and compute operation as part of a block cyclic matrix
  %    multiplication task.}
  %  \label{algo:ch6p3ORWLtrans}
-\begin{Listing}{algo:ch6p3ORWLtrans}{An iterative GPU transfer and compute operation as part of a block cyclic matrix multiplication task}
+\begin{Listing}{algo:ch6p3ORWLtrans}{an iterative GPU transfer and compute operation as part of a block cyclic matrix multiplication task}
  for (size_t i = 0; i < k; ++i) {
    ORWL_SECTION(GPUhandle) {
      ORWL_SECTION(curRead) {
@@ -231,16 +234,16 @@ for (size_t i = 0; i < k; ++i) {
  %\end{algorithm}
  
  Now that we have seen how the actual procedural access to the resources is
-regulated we will show how the association between handles and resources is
-specified. E.g in our application of block-cyclic MM the \texttt{curRead} handle
-should correspond to current matrix block of the corresponding task, whereas
+regulated, we will show how the association between handles and resources is
+specified. In our application of block-cyclic MM the \texttt{curRead} handle
+should correspond to the current matrix block of the corresponding task, whereas
  \texttt{remRead} should point to the current block of the neighboring task.
-Both read operations on these matrix blocks can be effected without creating
+Both read operations on these matrix blocks can be performed without creating
  conflicts, so we would like to express that fact in our resource specification.
  From a point of view of the resource ``current block'' of a particular task,
  this means that it can have two simultaneous readers, the task itself performing
  the transfer to the GPU, and the neighboring task transferring the data to its
-``next block''.
+``next block.''
  
  \Lst{algo:ch6p3ORWLdecl} first shows the local dynamic declarations of
  our application; it declares a \texttt{block} type for the matrix blocks, a
@@ -249,7 +252,7 @@ have seen so far.
  %\begin{algorithm}[H]
  %  \caption{Dynamic declaration of handles to represent the resources.}
  %  \label{algo:ch6p3ORWLdecl}
-\begin{Listing}{algo:ch6p3ORWLdecl}{Dynamic declaration of handles to represent the resources}
+\begin{Listing}{algo:ch6p3ORWLdecl}{dynamic declaration of handles to represent the resources}
  /* A type for the matrix blocks */
  typedef double MBlock[N][N];
  /* Declaration to handle the collective resource */
@@ -267,51 +270,54 @@ orwl_handle2 GPUhandle = ORWL_HANDLE2_INITIALIZER;
  \end{Listing}
  %\end{algorithm}
  
-With these declarations, we didn't yet tell ORWL much about the resources to
-which these handles refer, nor the type (read or write) or the priority (FIFO
-position) of the access. This is done in code
-\Lst{algo:ch6p3ORWLinit}. The handles for
-\Lst{algo:ch6p3ORWLtrans} are given first, \texttt{GPUhandle} will be
-accessed exclusively (therefore the \texttt{write}) and, as said,
-\texttt{curRead} is used shared (so a \texttt{read}). Both are inserted in the
-FIFO of there respective resources with highest priority, specified by the
-\texttt{0}s in the third function parameter. The resources to which they
-correspond are specified through calls to the macro \texttt{ORWL\_LOCATION},
-indicating the task (\texttt{orwl\_mytid} is the ID of the current task) and the
-specific resource of that task, here \texttt{GPU} and \texttt{curBlock}.
-
-Likewise, a second block of insertions concerns the handles for
-\Lst{algo:ch6p3ORWLrcopy}: \texttt{newWrite} reclaims an exclusive
-access and \texttt{remRead} a shared. \texttt{remRead} corresponds to a
-resource of another task; \texttt{previous(orwl\_mytid)} is supposed to return
-the ID of the previous task in the cycle. Both accesses can be effected
-concurrently with the previous operation, so we insert with the same priority
-\texttt{0} as before.
-
-Then, for the specification of the third operation
-(\Lst{algo:ch6p3ORWLlcopy}) we need to use a different priority: the
-copy operation from \texttt{nextBlock} to \texttt{curBlock} has to be performed
-after the other operations have terminated.
-
-As a final step, we then tell ORWL that the specification of all accesses is
+With these  declarations, we didn't  yet tell ORWL  much about the  resources to
+which these  handles refer, nor the type  (read or write) or  the priority (FIFO
+position)  of the  access. This  is done  in code  \Lst{algo:ch6p3ORWLinit} that
+shows  six insertions  of  handles  into their  respective  FIFO locations.  The
+handles \texttt{GPUhandle} and \texttt{curRead} are given first (lines 3 and 4),
+\texttt{GPUhandle} will  be accessed exclusively  (therefore the \texttt{write})
+and,   as   said,   \texttt{curRead}   is   used  in   shared   access   (so   a
+\texttt{read}). Both are inserted in the FIFO of their respective resources with
+highest  priority,   specified  by  the   \texttt{0}s  in  the   third  function
+parameter. The resources to which they correspond are specified through calls to
+the macro \texttt{ORWL\_LOCATION},  indicating the task (\texttt{orwl\_mytid} is
+the  ID of  the  current task)  and the  specific  resource of  that task,  here
+\texttt{GPU} and \texttt{curBlock}.
+
+Likewise, a second block of insertions concerns the handles \texttt{remRead} and
+\texttt{nextWrite}  (lines 8  and 9).  \texttt{nextWrite} reclaims  an exclusive
+access  and \texttt{remRead}  a shared  one. \texttt{remRead}  corresponds  to a
+resource of another task; the call to \texttt{previous(orwl\_mytid)} is supposed
+to  return the  ID of  the previous  task  in the  cycle. Both  accesses can  be
+performed concurrently with  the previous operation, so we  insert them with the
+same priority \texttt{0} as previously.
+
+Then,  for  the  specification  of   the  third  operation  related  to  handles
+\texttt{nextRead} and  \texttt{curWrite} (lines 13 and 14), we  need to
+use  a  different  priority:  the  copy  operation  from data locations \texttt{nextBlock}  to
+\texttt{curBlock}  has   to  be  performed  after  the   other  operations  have
+terminated (so the priority \texttt{1}).
+
+As a final step, we then tell ORWL (line 16) that the specification of all accesses is
  complete and that it may schedule\index{ORWL!schedule} all these accesses in the respective FIFOs of
  the resources.
  %\begin{algorithm}[H]
  %  \caption{Dynamic initialization of access mode and priorities.}
  %  \label{algo:ch6p3ORWLinit}
-\begin{Listing}{algo:ch6p3ORWLinit}{Dynamic initialization of access mode and priorities}
+%\pagebreak
+\begin{Listing}{algo:ch6p3ORWLinit}{dynamic initialization of access mode and priorities}
  /* One operation with priority 0 (highest) consists      */
-/* in copying from current to the GPU and run MM, there. */
+/* of copying from current to the GPU and running MM, there. */
  orwl_write_insert(&GPUhandle, ORWL_LOCATION(orwl_mytid, GPU), 0);
  orwl_read_insert(&curRead, ORWL_LOCATION(orwl_mytid, curBlock), 0);
  
  /* Another operation with priority 0 consists             */
-/* in copying from remote to next                         */
+/* of copying from remote to next                         */
  orwl_read_insert(&remRead, ORWL_LOCATION(previous(orwl_mytid), curBlock), 0);
-orwl_write_insert(&nextWrite, ORWL_LOCATION(orwl_mytid, nextBlock), 0);
+orwl_write_insert(&nextWrite, ORWL_LOCATION(orwl_mytid,nextBlock), 0);
  
  /* One operation with priority 1 consists       */
-/* in copying from next to current              */
+/* of copying from next to current              */
  orwl_read_insert(&nextRead, ORWL_LOCATION(orwl_mytid, nextBlock), 1);
  orwl_write_insert(&curWrite, ORWL_LOCATION(orwl_mytid, curBlock), 1);
  
@@ -321,19 +327,19 @@ orwl_schedule();
  
  \subsection{Tasks and operations}
  \label{sec:ch6p3tasks}
-With that example we have now seen that ORWL distinguishes
+With this example we have now seen that ORWL distinguishes
  \emph{tasks}\index{ORWL!task} and
  \emph{operations}\index{ORWL!operation}. An ORWL program is divided into tasks that can be seen as the
  algorithmic units that will concurrently access the resources that the program
  uses. A task for ORWL is characterized by
  \begin{itemize}
-\item a fixed set of resources that it manages, ``\emph{owns}'', in our example
+\item a fixed set of resources that it manages, ``\emph{owns},'' in our example
    the four resources that are declared in \Lst{algo:ch6p3ORWLresources}.
  \item a larger set of resources that it accesses, in our example all resources
    that are used in \Lst{algo:ch6p3ORWLinit}.
  \item a set of operations that act on these resources, in our example the three
    operations that are used in \Lst{algo:ch6p3ORWLBCCMM}, and that are
-  elaborated in Listings~\ref{algo:ch6p3ORWLlcopy},~\ref{algo:ch6p3ORWLrcopy}
+  elaborated in Listings~\ref{algo:ch6p3ORWLlcopy},~\ref{algo:ch6p3ORWLrcopy},
    and~\ref{algo:ch6p3ORWLtrans}.
  \end{itemize}