BookGPU/Chapters/chapter6/PartieORWL.tex

   1 \clearpage
   2 \section{Perspective: A unifying programming model}
   3 \label{sec:ch6p3unify}
   4
   5 In the previous sections we have seen that controlling a distributed GPU
   6 application when using tools that are commonly available is a quite challenging
   7 task. To summarize, such an application has components that can be roughly
   8 classified as:
   9
  10 \begin{description}
  11   \itemsep=0pt
  12 \item[CPU:] CPU bound computations, realized as procedures in the chosen
  13   programming language
  14 \item[CUDA$_{kern}$:] GPU bound computations, in our context realized as CUDA compute kernels
  15 \item[CUDA$_{trans}$:] data transfer between CPU and GPU, realized with CUDA function calls
  16 \item[MPI:] distributed data transfer routines, realized with MPI communication primitives
  17 \item[OpenMP:] inter-thread control, realized with OpenMP synchronization tools such as mutexes
  18 \item[CUDA$_{sync}$] synchronization of the GPU, realized with CUDA functions
  19 \end{description}
  20
  21 Among these, the last (CUDA$_{sync}$) is not strictly necessary on modern
  22 systems, but still recommended to obtain optimal performance. With or without
  23 that last step, such an application is highly complex: it is difficult to design
  24 or to maintain, and depends on a lot of different software components. The goal
  25 of this section is to present a new path of development that allows to replace
  26 the last three or four types of components that control the application (MPI,
  27 OpenMP, CUDA$_{sync}$ and eventually CUDA$_{trans}$) by a single tool:
  28 \textbf{O}rdered \textbf{R}ead-\textbf{W}rite \textbf{L}ocks, ORWL\index{ORWL},
  29 see~\cite{clauss:2010:inria-00330024:1,gustedt:hal-00639289}. Besides the
  30 simplification of the algorithmic scheme that we already have mentioned, the
  31 ongoing implementation of ORWL allows to use a feature of modern platforms that
  32 can improve the performance of CPU bound computations: lock-free atomic
  33 operations to update shared data consistently. For these, ORWL relies on new
  34 interfaces that are available with the latest revision of the ISO standard for
  35 the C programming language, see~\cite{C11}.
  36
  37 \subsection{Resources}
  38 \label{sec:ch6p3resources}
  39
  40 ORWL places all its concepts that concern data and control around a single
  41 abstraction: \emph{resources}\index{ORWL!resource}. An ORWL resource may correspond to a local or
  42 remote entity and is identified through a \emph{location}\index{ORWL!location}, that is a unique
  43 identification through which it can be accessed from all different components of
  44 the same application. In fact, resources and locations (entities and their names
  45 so to speak) are mostly identified by ORWL and these words will be used
  46 interchangeably.
  47
  48 Resources may be of very different kind:
  49 \begin{description}
  50 \item[Data] resources\index{ORWL!resource!data} are entities that represents data and not specific
  51   memory buffers or locations. During the execution of an application it can be
  52   \emph{mapped} repeatedly into the address space and in effect be represented
  53   at different addresses. Data resources can be made accessible uniformly in all
  54   parts of the application, provided that the locking protocol is observed, see
  55   below. Data resources can have different persistence:
  56   \begin{description}
  57   \item[RAM] data resources are typically temporary data that serve only during
  58     a single run of the application. They must be initialized at the beginning of
  59     their lifetime and the contents is lost at the end.
  60   \item[File] data resources are persistent and linked to a file in the file
  61     system of the platform.
  62   \item[Collective] data resources are data to which all tasks of an
  63     application contribute (see below). Examples for such resources are \emph{broadcast},
  64     \emph{gather} or
  65     \emph{reduce} resources, e.g to distribute initial data or to collect the
  66     result of a distributed computation.
  67   \end{description}
  68   Other types of data resources could be easily implemented with ORWL, e.g web
  69   resources (through a ftp, http or whatever server address) or fixed hardware
  70   addresses.
  71 \item[Device] resources\index{ORWL!resource!device} represent hardware entities of the
  72   platform. ORWL can then be used to regulate the access to such device
  73   resources. In our context the most important such resource is the GPU, but we
  74   could easily use it to represent a CPU core, a camera or other peripheral
  75   device.
  76 \end{description}
  77
  78 \Lst{algo:ch6p3ORWLresources} shows an example of a declaration of
  79 four resources per task. Two (\texttt{curBlock} and \texttt{nextBlock}) are
  80 intended to represent the data in a block-cyclic parallel matrix multiplication
  81 (see p.~\pageref{ch6:p1block-cyclic}),
  82 \texttt{GPU} represents a GPU device and \texttt{result} will represent a
  83 collective ``gather'' resource among all the tasks.
  84
  85 %\begin{algorithm}[H]
  86 %  \caption{Declaration of ORWL resources for a block-cyclic matrix
  87 %    multiplication.}
  88 %  \label{algo:ch6p3ORWLresources}
  89 \begin{Listing}{algo:ch6p3ORWLresources}{Declaration of ORWL resources for a block-cyclic matrix multiplication}
  90 #include "orwl.h"
  91 ...
  92 ORWL_LOCATIONS_PER_TASK(curBlock, nextBlock, GPU, result);
  93 ORWL_DATA_LOCATION(curBlock);
  94 ORWL_DATA_LOCATION(nexBlock);
  95 ORWL_DEVICE_LOCATION(GPU);
  96 ORWL_GATHER_LOCATION(result);
  97 \end{Listing}
  98 %\end{algorithm}
  99
 100
 101 \subsection{Control}
 102 \label{sec:ch6p3ORWLcontrol}
 103
 104
 105 ORWL regulates access to all its resources, no ``random access'' to a resource
 106 is possible. It doesn't even have a user-visible data type for resources.
 107 \begin{itemize}
 108 \item All access is provided through \emph{handle}s\index{ORWL!handle}. Similar to pointers or
 109   links, these only refer to a resource and help to manipulate it. Usually
 110   several handles to the same resource exist, even inside the same OS process
 111   or thread, or in the same application task.
 112 \item The access is locked with RW semantics, where \emph{R} stands for
 113   concurrent \textbf{R}ead~access, and \emph{W} for exclusive
 114   \textbf{W}rite~access. This feature replaces the control aspect of MPI
 115   communications, OpenMP inter-thread control and CUDA$_{sync}$.
 116 \item This access is \textbf{O}rdered (or serialized)\index{ordered access} through a FIFO, \emph{one
 117     FIFO per resource}. This helps to run the different tasks of an application
 118   in a controlled order and to always have all resources in a known state. This
 119   aspect largely replaces and extends the ordering of tasks that MPI typically
 120   achieves through the passing of messages.
 121 \item The access is transparently managed for remote or local resources.
 122   Communication, if necessary, is done asynchronously behind the scenes. This
 123   replaces the explicit handling of buffers and messages with MPI.
 124 \end{itemize}
 125
 126 \subsection{Example: block-cyclic matrix multiplication (MM)}
 127 \label{sec:ch6p3ORWLMM}
 128
 129 Let us now have a look how a block-cyclic matrix multiplication\index{matrix
 130   multiplication!block cyclic} algorithm can be
 131 implemented with these concepts (\Lst{algo:ch6p3ORWLBCCMM}). Inside the loop it mainly consists of three
 132 different operations, of which the first two can be run concurrently, and the
 133 third must be done after the other two.
 134 %\begin{algorithm}[H]
 135 %  \caption{Block-cyclic matrix multiplication, high level per task view.}
 136 %  \label{algo:ch6p3ORWLBCCMM}
 137 \begin{Listing}{algo:ch6p3ORWLBCCMM}{Block-cyclic matrix multiplication, high level per task view}
 138 typedef double MBlock[N][N];
 139 MBlock A;
 140 MBlock B[k];
 141 MBlock C[k];
 142
 143 <do some initialization>
 144
 145 for (size_t i = 0; i < k; ++i) {
 146   MBlock next;
 147   parallel-do {
 148     operation 1: <copy the matrix A of the left neighbor into next>;
 149     operation 2: {
 150       <copy the local matrix A to the GPU >;
 151       <on GPU perform C[i] = A * B[0] + ... + A * B[k-1]; >;
 152     }
 153   }
 154   operation 3: {
 155     <wait until the right neighbor has read our block A>;
 156     A = next;
 157   }
 158 }
 159
 160 <collect the result matrix C consisting of all C blocks>
 161 \end{Listing}
 162 %\end{algorithm}
 163
 164
 165 \Lst{algo:ch6p3ORWLlcopy} shows the local copy operation~3 as it could
 166 be realized with ORWL.
 167 It uses two resource handles
 168 \texttt{nextRead} and \texttt{curWrite} and marks nested \emph{critical
 169   sections} for these handles. Inside the nested sections it obtains pointers to
 170 the resource data; the resource is \emph{mapped} into the address space of the
 171 program, and then a standard call to \texttt{memcpy} achieves the operation
 172 itself. The operation is integrated in its own \textbf{for}-loop, such that it
 173 could run independently in an OS thread by its own.
 174 %\begin{algorithm}[H]
 175 %  \caption{An iterative local copy operation.}
 176 %  \label{algo:ch6p3ORWLlcopy}
 177 \begin{Listing}{algo:ch6p3ORWLlcopy}{An iterative local copy operation}
 178 for (size_t i = 0; i < k; ++i) {
 179   ORWL_SECTION(nextRead) {
 180     MBlock const* sBlock = orwl_read_map(nextRead);
 181     ORWL_SECTION(curWrite) {
 182       MBlock * tBlock = orwl_write_map(curWrite);
 183       memcpy(tBlock, sBlock, sizeof *tBlock);
 184     }
 185   }
 186 }
 187 \end{Listing}
 188 %\end{algorithm}
 189
 190 Next, in \Lst{algo:ch6p3ORWLrcopy} we copy data from a remote task to
 191 a local task. Substantially the operation is the same, only that in the example
 192 different handles (\texttt{remRead} and \texttt{nextWrite}) are used to
 193 represent the respective resources.
 194 %\begin{algorithm}[H]
 195 %  \caption{An iterative remote copy operation as part of a block cyclic matrix
 196 %    multiplication task.}
 197 %\label{algo:ch6p3ORWLrcopy}
 198 \begin{Listing}{algo:ch6p3ORWLrcopy}{An iterative remote copy operation as part of a block cyclic matrix multiplication task}
 199 for (size_t i = 0; i < k; ++i) {
 200   ORWL_SECTION(remRead) {
 201     MBlock const* sBlock = orwl_read_map(remRead);
 202     ORWL_SECTION(nextWrite) {
 203       MBlock * tBlock = orwl_write_map(nextWrite);
 204       memcpy(tBlock, sBlock, sizeof *tBlock);
 205     }
 206   }
 207 }
 208 \end{Listing}
 209 %\end{algorithm}
 210
 211 Now let us have a look into the operation that probably interests us the most,
 212 the interaction with the GPU in \Lst{algo:ch6p3ORWLtrans}. Again there
 213 is much structural resemblance to the copy operations from above, only that  we
 214 transfer the data to the GPU in the innermost block and then run the GPU MM
 215 kernel while we still are inside the critical section for the GPU.
 216 %\begin{algorithm}[H]
 217 %  \caption{An iterative GPU transfer and compute operation as part of a block cyclic matrix
 218 %    multiplication task.}
 219 %  \label{algo:ch6p3ORWLtrans}
 220 \begin{Listing}{algo:ch6p3ORWLtrans}{An iterative GPU transfer and compute operation as part of a block cyclic matrix multiplication task}
 221 for (size_t i = 0; i < k; ++i) {
 222   ORWL_SECTION(GPUhandle) {
 223     ORWL_SECTION(curRead) {
 224       MBlock const* sBlock = orwl_read_map(curRead);
 225       transferToGPU(sBlock, i);
 226     }
 227     runMMonGPU(i);
 228   }
 229 }
 230 \end{Listing}
 231 %\end{algorithm}
 232
 233 Now that we have seen how the actual procedural access to the resources is
 234 regulated we will show how the association between handles and resources is
 235 specified. E.g in our application of block-cyclic MM the \texttt{curRead} handle
 236 should correspond to current matrix block of the corresponding task, whereas
 237 \texttt{remRead} should point to the current block of the neighboring task.
 238 Both read operations on these matrix blocks can be effected without creating
 239 conflicts, so we would like to express that fact in our resource specification.
 240 From a point of view of the resource ``current block'' of a particular task,
 241 this means that it can have two simultaneous readers, the task itself performing
 242 the transfer to the GPU, and the neighboring task transferring the data to its
 243 ``next block''.
 244
 245 \Lst{algo:ch6p3ORWLdecl} first shows the local dynamic declarations of
 246 our application; it declares a \texttt{block} type for the matrix blocks, a
 247 \texttt{result} data for the collective resource, and the six handles that we
 248 have seen so far.
 249 %\begin{algorithm}[H]
 250 %  \caption{Dynamic declaration of handles to represent the resources.}
 251 %  \label{algo:ch6p3ORWLdecl}
 252 \begin{Listing}{algo:ch6p3ORWLdecl}{Dynamic declaration of handles to represent the resources}
 253 /* A type for the matrix blocks */
 254 typedef double MBlock[N][N];
 255 /* Declaration to handle the collective resource */
 256 ORWL_GATHER_DECLARE(MBlock, result);
 257
 258 /* Variables to handle data resources */
 259 orwl_handle2 remRead   = ORWL_HANDLE2_INITIALIZER;
 260 orwl_handle2 nextWrite = ORWL_HANDLE2_INITIALIZER;
 261 orwl_handle2 nextRead  = ORWL_HANDLE2_INITIALIZER;
 262 orwl_handle2 curWrite  = ORWL_HANDLE2_INITIALIZER;
 263 orwl_handle2 curRead   = ORWL_HANDLE2_INITIALIZER;
 264
 265 /* Variable to handle the device resources */
 266 orwl_handle2 GPUhandle = ORWL_HANDLE2_INITIALIZER;
 267 \end{Listing}
 268 %\end{algorithm}
 269
 270 With these declarations, we didn't yet tell ORWL much about the resources to
 271 which these handles refer, nor the type (read or write) or the priority (FIFO
 272 position) of the access. This is done in code
 273 \Lst{algo:ch6p3ORWLinit}. The handles for
 274 \Lst{algo:ch6p3ORWLtrans} are given first, \texttt{GPUhandle} will be
 275 accessed exclusively (therefore the \texttt{write}) and, as said,
 276 \texttt{curRead} is used shared (so a \texttt{read}). Both are inserted in the
 277 FIFO of there respective resources with highest priority, specified by the
 278 \texttt{0}s in the third function parameter. The resources to which they
 279 correspond are specified through calls to the macro \texttt{ORWL\_LOCATION},
 280 indicating the task (\texttt{orwl\_mytid} is the ID of the current task) and the
 281 specific resource of that task, here \texttt{GPU} and \texttt{curBlock}.
 282
 283 Likewise, a second block of insertions concerns the handles for
 284 \Lst{algo:ch6p3ORWLrcopy}: \texttt{newWrite} reclaims an exclusive
 285 access and \texttt{remRead} a shared. \texttt{remRead} corresponds to a
 286 resource of another task; \texttt{previous(orwl\_mytid)} is supposed to return
 287 the ID of the previous task in the cycle. Both accesses can be effected
 288 concurrently with the previous operation, so we insert with the same priority
 289 \texttt{0} as before.
 290
 291 Then, for the specification of the third operation
 292 (\Lst{algo:ch6p3ORWLlcopy}) we need to use a different priority: the
 293 copy operation from \texttt{nextBlock} to \texttt{curBlock} has to be performed
 294 after the other operations have terminated.
 295
 296 As a final step, we then tell ORWL that the specification of all accesses is
 297 complete and that it may schedule\index{ORWL!schedule} all these accesses in the respective FIFOs of
 298 the resources.
 299 %\begin{algorithm}[H]
 300 %  \caption{Dynamic initialization of access mode and priorities.}
 301 %  \label{algo:ch6p3ORWLinit}
 302 \begin{Listing}{algo:ch6p3ORWLinit}{Dynamic initialization of access mode and priorities}
 303 /* One operation with priority 0 (highest) consists      */
 304 /* in copying from current to the GPU and run MM, there. */
 305 orwl_write_insert(&GPUhandle, ORWL_LOCATION(orwl_mytid, GPU), 0);
 306 orwl_read_insert(&curRead, ORWL_LOCATION(orwl_mytid, curBlock), 0);
 307
 308 /* Another operation with priority 0 consists             */
 309 /* in copying from remote to next                         */
 310 orwl_read_insert(&remRead, ORWL_LOCATION(previous(orwl_mytid), curBlock), 0);
 311 orwl_write_insert(&nextWrite, ORWL_LOCATION(orwl_mytid, nextBlock), 0);
 312
 313 /* One operation with priority 1 consists       */
 314 /* in copying from next to current              */
 315 orwl_read_insert(&nextRead, ORWL_LOCATION(orwl_mytid, nextBlock), 1);
 316 orwl_write_insert(&curWrite, ORWL_LOCATION(orwl_mytid, curBlock), 1);
 317
 318 orwl_schedule();
 319 \end{Listing}
 320 %\end{algorithm}
 321
 322 \subsection{Tasks and operations}
 323 \label{sec:ch6p3tasks}
 324 With that example we have now seen that ORWL distinguishes
 325 \emph{tasks}\index{ORWL!task} and
 326 \emph{operations}\index{ORWL!operation}. An ORWL program is divided into tasks that can be seen as the
 327 algorithmic units that will concurrently access the resources that the program
 328 uses. A task for ORWL is characterized by
 329 \begin{itemize}
 330 \item a fixed set of resources that it manages, ``\emph{owns}'', in our example
 331   the four resources that are declared in \Lst{algo:ch6p3ORWLresources}.
 332 \item a larger set of resources that it accesses, in our example all resources
 333   that are used in \Lst{algo:ch6p3ORWLinit}.
 334 \item a set of operations that act on these resources, in our example the three
 335   operations that are used in \Lst{algo:ch6p3ORWLBCCMM}, and that are
 336   elaborated in Listings~\ref{algo:ch6p3ORWLlcopy},~\ref{algo:ch6p3ORWLrcopy}
 337   and~\ref{algo:ch6p3ORWLtrans}.
 338 \end{itemize}
 339
 340 \noindent
 341 Each ORWL operation is characterized by
 342 \begin{itemize}
 343 \item one resource, usually one that is owned by the enclosing task, that it
 344   accesses exclusively. In our example, operation~1 has exclusive access to the
 345   \texttt{next} block, operation~2 has exclusive access the \texttt{GPU}
 346   resource, and operation~3 to the \texttt{A} block.
 347 \item several resources that are accessed concurrently with others.
 348 \end{itemize}
 349 In fact each ORWL~operation can be viewed as a compute-and-update procedure of a
 350 particular resource with knowledge of another set of resources.
 351
 352
 353 %%% Local Variables:
 354 %%% mode: latex
 355 %%% fill-column: 80
 356 %%% ispell-dictionary: "american"
 357 %%% mode: flyspell
 358 %%% TeX-master: "../../BookGPU.tex"
 359 %%% End: