-\clearpage
-\section{Perspective: A unifying programming model}
+%\clearpage
+\section{Perspective: a unifying programming model}
\label{sec:ch6p3unify}
In the previous sections we have seen that controlling a distributed GPU
-application when using tools that are commonly available is a quite challenging
+application when using tools that are commonly available is quite a challenging
task. To summarize, such an application has components that can be roughly
-classified as:
+classified as
\begin{description}
\itemsep=0pt
-\item[CPU:] CPU bound computations, realized as procedures in the chosen
+\item[CPU:] CPU-bound computations, realized as procedures in the chosen
programming language
-\item[CUDA$_{kern}$:] GPU bound computations, in our context realized as CUDA compute kernels
+\item[CUDA$_{kern}$:] GPU-bound computations, in our context realized as CUDA compute kernels
\item[CUDA$_{trans}$:] data transfer between CPU and GPU, realized with CUDA function calls
\item[MPI:] distributed data transfer routines, realized with MPI communication primitives
\item[OpenMP:] inter-thread control, realized with OpenMP synchronization tools such as mutexes
-\item[CUDA$_{sync}$] synchronization of the GPU, realized with CUDA functions
+\item[CUDA$_{sync}$:] synchronization of the GPU, realized with CUDA functions
\end{description}
Among these, the last (CUDA$_{sync}$) is not strictly necessary on modern
-systems, but still recommended to obtain optimal performance. With or without
+systems, but it is still recommended to obtain optimal performance. With or without
that last step, such an application is highly complex: it is difficult to design
or to maintain, and depends on a lot of different software components. The goal
-of this section is to present a new path of development that allows to replace
-the last three or four types of components that control the application (MPI,
-OpenMP, CUDA$_{sync}$ and eventually CUDA$_{trans}$) by a single tool:
-\textbf{O}rdered \textbf{R}ead-\textbf{W}rite \textbf{L}ocks, ORWL\index{ORWL},
-see~\cite{clauss:2010:inria-00330024:1,gustedt:hal-00639289}. Besides the
-simplification of the algorithmic scheme that we already have mentioned, the
-ongoing implementation of ORWL allows to use a feature of modern platforms that
-can improve the performance of CPU bound computations: lock-free atomic
+of this section is to present a new path of development that allows the
+replacement of the last three or four types of components that control the application (MPI,
+OpenMP, CUDA$_{sync}$, and eventually CUDA$_{trans}$) with a single tool:
+{O}rdered {R}ead-{W}rite {L}ocks, ORWL\index{ORWL}
+(see~\cite{clauss:2010:inria-00330024:1,gustedt:hal-00639289}). Besides the
+simplification of the algorithmic scheme that we have already mentioned, the
+ongoing implementation of ORWL allows the use of a feature of modern platforms that
+can improve the performance of CPU-bound computations: lock-free atomic
operations to update shared data consistently. For these, ORWL relies on new
interfaces that are available with the latest revision of the ISO standard for
-the C programming language, see~\cite{C11}.
+the C programming language (see~\cite{C11}).
\subsection{Resources}
\label{sec:ch6p3resources}
abstraction: \emph{resources}\index{ORWL!resource}. An ORWL resource may correspond to a local or
remote entity and is identified through a \emph{location}\index{ORWL!location}, that is a unique
identification through which it can be accessed from all different components of
-the same application. In fact, resources and locations (entities and their names
+the same application. In fact, resources and locations (entities and their names,
so to speak) are mostly identified by ORWL and these words will be used
interchangeably.
-Resources may be of very different kind:
+Resources may be of very different kinds:
\begin{description}
\item[Data] resources\index{ORWL!resource!data} are entities that represents data and not specific
- memory buffers or locations. During the execution of an application it can be
+ memory buffers or locations. During the execution of an application they can be
\emph{mapped} repeatedly into the address space and in effect be represented
at different addresses. Data resources can be made accessible uniformly in all
- parts of the application, provided that the locking protocol is observed, see
- below. Data resources can have different persistence:
+ parts of the application, provided that the locking protocol is observed (see
+ below). Data resources can have different persistence:
\begin{description}
\item[RAM] data resources are typically temporary data that serve only during
a single run of the application. They must be initialized at the beginning of
- their lifetime and the contents is lost at the end.
+ their lifetime and the contents are lost at the end.
\item[File] data resources are persistent and linked to a file in the file
system of the platform.
\item[Collective] data resources are data to which all tasks of an
application contribute (see below). Examples for such resources are \emph{broadcast},
- \emph{gather} or
- \emph{reduce} resources, e.g to distribute initial data or to collect the
+ \emph{gather}, or
+ \emph{reduce} resources, e.g., to distribute initial data or to collect the
result of a distributed computation.
\end{description}
- Other types of data resources could be easily implemented with ORWL, e.g web
+ Other types of data resources could be easily implemented with ORWL, e.g., web
resources (through a ftp, http or whatever server address) or fixed hardware
addresses.
\item[Device] resources\index{ORWL!resource!device} represent hardware entities of the
platform. ORWL can then be used to regulate the access to such device
resources. In our context the most important such resource is the GPU, but we
- could easily use it to represent a CPU core, a camera or other peripheral
+ could easily use it to represent a CPU core, a camera, or another peripheral
device.
\end{description}
\Lst{algo:ch6p3ORWLresources} shows an example of a declaration of
four resources per task. Two (\texttt{curBlock} and \texttt{nextBlock}) are
intended to represent the data in a block-cyclic parallel matrix multiplication
-(see p.~\pageref{ch6:p1block-cyclic}),
-\texttt{GPU} represents a GPU device and \texttt{result} will represent a
+(see Section~\ref{ch6:p1block-cyclic}),
+\texttt{GPU} represents a GPU device, and \texttt{result} will represent a
collective ``gather'' resource among all the tasks.
+\pagebreak
%\begin{algorithm}[H]
% \caption{Declaration of ORWL resources for a block-cyclic matrix
% multiplication.}
% \label{algo:ch6p3ORWLresources}
-\begin{Listing}{algo:ch6p3ORWLresources}{Declaration of ORWL resources for a block-cyclic matrix multiplication}
+\begin{Listing}{algo:ch6p3ORWLresources}{declaration of ORWL resources for a block-cyclic matrix multiplication}
#include "orwl.h"
...
ORWL_LOCATIONS_PER_TASK(curBlock, nextBlock, GPU, result);
ORWL_DATA_LOCATION(curBlock);
-ORWL_DATA_LOCATION(nexBlock);
+ORWL_DATA_LOCATION(nextBlock);
ORWL_DEVICE_LOCATION(GPU);
ORWL_GATHER_LOCATION(result);
\end{Listing}
\label{sec:ch6p3ORWLcontrol}
-ORWL regulates access to all its resources, no ``random access'' to a resource
+ORWL regulates access to all its resources; no ``random access'' to a resource
is possible. It doesn't even have a user-visible data type for resources.
\begin{itemize}
\item All access is provided through \emph{handle}s\index{ORWL!handle}. Similar to pointers or
links, these only refer to a resource and help to manipulate it. Usually
several handles to the same resource exist, even inside the same OS process
or thread, or in the same application task.
-\item The access is locked with RW semantics, where \emph{R} stands for
- concurrent \textbf{R}ead~access, and \emph{W} for exclusive
- \textbf{W}rite~access. This feature replaces the control aspect of MPI
- communications, OpenMP inter-thread control and CUDA$_{sync}$.
-\item This access is \textbf{O}rdered (or serialized)\index{ordered access} through a FIFO, \emph{one
+\item The access is locked with RW semantics, where {R} stands for
+ concurrent {R}ead~access, and {W} for exclusive
+ {W}rite~access. This feature replaces the control aspect of MPI
+ communications, OpenMP inter-thread control, and CUDA$_{sync}$.
+\item This access is {O}rdered (or serialized)\index{ordered access} through a FIFO, \emph{one
FIFO per resource}. This helps to run the different tasks of an application
in a controlled order and to always have all resources in a known state. This
aspect largely replaces and extends the ordering of tasks that MPI typically
\subsection{Example: block-cyclic matrix multiplication (MM)}
\label{sec:ch6p3ORWLMM}
-Let us now have a look how a block-cyclic matrix multiplication\index{matrix
+Let us now have a look at how a block-cyclic matrix multiplication\index{matrix
multiplication!block cyclic} algorithm can be
-implemented with these concepts (\Lst{algo:ch6p3ORWLBCCMM}). Inside the loop it mainly consists of three
-different operations, of which the first two can be run concurrently, and the
+implemented with these concepts (\Lst{algo:ch6p3ORWLBCCMM}). Inside the loop
+there are mainly three
+different operations, the first two of which can be run concurrently, and the
third must be done after the other two.
%\begin{algorithm}[H]
% \caption{Block-cyclic matrix multiplication, high level per task view.}
% \label{algo:ch6p3ORWLBCCMM}
-\begin{Listing}{algo:ch6p3ORWLBCCMM}{Block-cyclic matrix multiplication, high level per task view}
+\begin{Listing}{algo:ch6p3ORWLBCCMM}{block-cyclic matrix multiplication, high level per task view}
typedef double MBlock[N][N];
MBlock A;
MBlock B[k];
%\end{algorithm}
-\Lst{algo:ch6p3ORWLlcopy} shows the local copy operation~3 as it could
+\Lst{algo:ch6p3ORWLlcopy} shows the local copy operation~3 from line 17 of
+\Lst{algo:ch6p3ORWLBCCMM} as it could
be realized with ORWL.
It uses two resource handles
\texttt{nextRead} and \texttt{curWrite} and marks nested \emph{critical
sections} for these handles. Inside the nested sections it obtains pointers to
the resource data; the resource is \emph{mapped} into the address space of the
program, and then a standard call to \texttt{memcpy} achieves the operation
-itself. The operation is integrated in its own \textbf{for}-loop, such that it
+itself. The operation is integrated in its own {for}-loop, such that it
could run independently in an OS thread by its own.
%\begin{algorithm}[H]
% \caption{An iterative local copy operation.}
% \label{algo:ch6p3ORWLlcopy}
-\begin{Listing}{algo:ch6p3ORWLlcopy}{An iterative local copy operation}
+\begin{Listing}{algo:ch6p3ORWLlcopy}{an iterative local copy operation}
for (size_t i = 0; i < k; ++i) {
ORWL_SECTION(nextRead) {
MBlock const* sBlock = orwl_read_map(nextRead);
%\end{algorithm}
Next, in \Lst{algo:ch6p3ORWLrcopy} we copy data from a remote task to
-a local task. Substantially the operation is the same, only that in the example
-different handles (\texttt{remRead} and \texttt{nextWrite}) are used to
+a local task. Substantially the operation is the same, save for the
+different handles (\texttt{remRead} and \texttt{nextWrite}) that are used to
represent the respective resources.
%\begin{algorithm}[H]
% \caption{An iterative remote copy operation as part of a block cyclic matrix
% multiplication task.}
%\label{algo:ch6p3ORWLrcopy}
-\begin{Listing}{algo:ch6p3ORWLrcopy}{An iterative remote copy operation as part of a block cyclic matrix multiplication task}
+\begin{Listing}{algo:ch6p3ORWLrcopy}{an iterative remote copy operation as part of a block cyclic matrix multiplication task}
for (size_t i = 0; i < k; ++i) {
ORWL_SECTION(remRead) {
MBlock const* sBlock = orwl_read_map(remRead);
Now let us have a look into the operation that probably interests us the most,
the interaction with the GPU in \Lst{algo:ch6p3ORWLtrans}. Again there
-is much structural resemblance to the copy operations from above, only that we
+is much structural resemblance to the copy operations from above, but we
transfer the data to the GPU in the innermost block and then run the GPU MM
-kernel while we still are inside the critical section for the GPU.
+kernel while we are still inside the critical section for the GPU.
%\begin{algorithm}[H]
% \caption{An iterative GPU transfer and compute operation as part of a block cyclic matrix
% multiplication task.}
% \label{algo:ch6p3ORWLtrans}
-\begin{Listing}{algo:ch6p3ORWLtrans}{An iterative GPU transfer and compute operation as part of a block cyclic matrix multiplication task}
+\begin{Listing}{algo:ch6p3ORWLtrans}{an iterative GPU transfer and compute operation as part of a block cyclic matrix multiplication task}
for (size_t i = 0; i < k; ++i) {
ORWL_SECTION(GPUhandle) {
ORWL_SECTION(curRead) {
%\end{algorithm}
Now that we have seen how the actual procedural access to the resources is
-regulated we will show how the association between handles and resources is
-specified. E.g in our application of block-cyclic MM the \texttt{curRead} handle
-should correspond to current matrix block of the corresponding task, whereas
+regulated, we will show how the association between handles and resources is
+specified. In our application of block-cyclic MM the \texttt{curRead} handle
+should correspond to the current matrix block of the corresponding task, whereas
\texttt{remRead} should point to the current block of the neighboring task.
-Both read operations on these matrix blocks can be effected without creating
+Both read operations on these matrix blocks can be performed without creating
conflicts, so we would like to express that fact in our resource specification.
From a point of view of the resource ``current block'' of a particular task,
this means that it can have two simultaneous readers, the task itself performing
the transfer to the GPU, and the neighboring task transferring the data to its
-``next block''.
+``next block.''
\Lst{algo:ch6p3ORWLdecl} first shows the local dynamic declarations of
our application; it declares a \texttt{block} type for the matrix blocks, a
%\begin{algorithm}[H]
% \caption{Dynamic declaration of handles to represent the resources.}
% \label{algo:ch6p3ORWLdecl}
-\begin{Listing}{algo:ch6p3ORWLdecl}{Dynamic declaration of handles to represent the resources}
+\begin{Listing}{algo:ch6p3ORWLdecl}{dynamic declaration of handles to represent the resources}
/* A type for the matrix blocks */
typedef double MBlock[N][N];
/* Declaration to handle the collective resource */
\end{Listing}
%\end{algorithm}
-With these declarations, we didn't yet tell ORWL much about the resources to
-which these handles refer, nor the type (read or write) or the priority (FIFO
-position) of the access. This is done in code
-\Lst{algo:ch6p3ORWLinit}. The handles for
-\Lst{algo:ch6p3ORWLtrans} are given first, \texttt{GPUhandle} will be
-accessed exclusively (therefore the \texttt{write}) and, as said,
-\texttt{curRead} is used shared (so a \texttt{read}). Both are inserted in the
-FIFO of there respective resources with highest priority, specified by the
-\texttt{0}s in the third function parameter. The resources to which they
-correspond are specified through calls to the macro \texttt{ORWL\_LOCATION},
-indicating the task (\texttt{orwl\_mytid} is the ID of the current task) and the
-specific resource of that task, here \texttt{GPU} and \texttt{curBlock}.
-
-Likewise, a second block of insertions concerns the handles for
-\Lst{algo:ch6p3ORWLrcopy}: \texttt{newWrite} reclaims an exclusive
-access and \texttt{remRead} a shared. \texttt{remRead} corresponds to a
-resource of another task; \texttt{previous(orwl\_mytid)} is supposed to return
-the ID of the previous task in the cycle. Both accesses can be effected
-concurrently with the previous operation, so we insert with the same priority
-\texttt{0} as before.
-
-Then, for the specification of the third operation
-(\Lst{algo:ch6p3ORWLlcopy}) we need to use a different priority: the
-copy operation from \texttt{nextBlock} to \texttt{curBlock} has to be performed
-after the other operations have terminated.
-
-As a final step, we then tell ORWL that the specification of all accesses is
+With these declarations, we didn't yet tell ORWL much about the resources to
+which these handles refer, nor the type (read or write) or the priority (FIFO
+position) of the access. This is done in code \Lst{algo:ch6p3ORWLinit} that
+shows six insertions of handles into their respective FIFO locations. The
+handles \texttt{GPUhandle} and \texttt{curRead} are given first (lines 3 and 4),
+\texttt{GPUhandle} will be accessed exclusively (therefore the \texttt{write})
+and, as said, \texttt{curRead} is used in shared access (so a
+\texttt{read}). Both are inserted in the FIFO of their respective resources with
+highest priority, specified by the \texttt{0}s in the third function
+parameter. The resources to which they correspond are specified through calls to
+the macro \texttt{ORWL\_LOCATION}, indicating the task (\texttt{orwl\_mytid} is
+the ID of the current task) and the specific resource of that task, here
+\texttt{GPU} and \texttt{curBlock}.
+
+Likewise, a second block of insertions concerns the handles \texttt{remRead} and
+\texttt{nextWrite} (lines 8 and 9). \texttt{nextWrite} reclaims an exclusive
+access and \texttt{remRead} a shared one. \texttt{remRead} corresponds to a
+resource of another task; the call to \texttt{previous(orwl\_mytid)} is supposed
+to return the ID of the previous task in the cycle. Both accesses can be
+performed concurrently with the previous operation, so we insert them with the
+same priority \texttt{0} as previously.
+
+Then, for the specification of the third operation related to handles
+\texttt{nextRead} and \texttt{curWrite} (lines 13 and 14), we need to
+use a different priority: the copy operation from data locations \texttt{nextBlock} to
+\texttt{curBlock} has to be performed after the other operations have
+terminated (so the priority \texttt{1}).
+
+As a final step, we then tell ORWL (line 16) that the specification of all accesses is
complete and that it may schedule\index{ORWL!schedule} all these accesses in the respective FIFOs of
the resources.
%\begin{algorithm}[H]
% \caption{Dynamic initialization of access mode and priorities.}
% \label{algo:ch6p3ORWLinit}
-\begin{Listing}{algo:ch6p3ORWLinit}{Dynamic initialization of access mode and priorities}
+%\pagebreak
+\begin{Listing}{algo:ch6p3ORWLinit}{dynamic initialization of access mode and priorities}
/* One operation with priority 0 (highest) consists */
-/* in copying from current to the GPU and run MM, there. */
+/* of copying from current to the GPU and running MM, there. */
orwl_write_insert(&GPUhandle, ORWL_LOCATION(orwl_mytid, GPU), 0);
orwl_read_insert(&curRead, ORWL_LOCATION(orwl_mytid, curBlock), 0);
/* Another operation with priority 0 consists */
-/* in copying from remote to next */
+/* of copying from remote to next */
orwl_read_insert(&remRead, ORWL_LOCATION(previous(orwl_mytid), curBlock), 0);
-orwl_write_insert(&nextWrite, ORWL_LOCATION(orwl_mytid, nextBlock), 0);
+orwl_write_insert(&nextWrite, ORWL_LOCATION(orwl_mytid,nextBlock), 0);
/* One operation with priority 1 consists */
-/* in copying from next to current */
+/* of copying from next to current */
orwl_read_insert(&nextRead, ORWL_LOCATION(orwl_mytid, nextBlock), 1);
orwl_write_insert(&curWrite, ORWL_LOCATION(orwl_mytid, curBlock), 1);
\subsection{Tasks and operations}
\label{sec:ch6p3tasks}
-With that example we have now seen that ORWL distinguishes
+With this example we have now seen that ORWL distinguishes
\emph{tasks}\index{ORWL!task} and
\emph{operations}\index{ORWL!operation}. An ORWL program is divided into tasks that can be seen as the
algorithmic units that will concurrently access the resources that the program
uses. A task for ORWL is characterized by
\begin{itemize}
-\item a fixed set of resources that it manages, ``\emph{owns}'', in our example
+\item a fixed set of resources that it manages, ``\emph{owns},'' in our example
the four resources that are declared in \Lst{algo:ch6p3ORWLresources}.
\item a larger set of resources that it accesses, in our example all resources
that are used in \Lst{algo:ch6p3ORWLinit}.
\item a set of operations that act on these resources, in our example the three
operations that are used in \Lst{algo:ch6p3ORWLBCCMM}, and that are
- elaborated in Listings~\ref{algo:ch6p3ORWLlcopy},~\ref{algo:ch6p3ORWLrcopy}
+ elaborated in Listings~\ref{algo:ch6p3ORWLlcopy},~\ref{algo:ch6p3ORWLrcopy},
and~\ref{algo:ch6p3ORWLtrans}.
\end{itemize}