BookGPU/Chapters/chapter6/PartieORWL.tex

   1 %\clearpage
   2 \section{Perspective: a unifying programming model}
   3 \label{sec:ch6p3unify}
   4
   5 In the previous sections we have seen that controlling a distributed GPU
   6 application when using tools that are commonly available is quite a challenging
   7 task. To summarize, such an application has components that can be roughly
   8 classified as
   9
  10 \begin{description}
  11   \itemsep=0pt
  12 \item[CPU:] CPU-bound computations, realized as procedures in the chosen
  13   programming language
  14 \item[CUDA$_{kern}$:] GPU-bound computations, in our context realized as CUDA compute kernels
  15 \item[CUDA$_{trans}$:] data transfer between CPU and GPU, realized with CUDA function calls
  16 \item[MPI:] distributed data transfer routines, realized with MPI communication primitives
  17 \item[OpenMP:] inter-thread control, realized with OpenMP synchronization tools such as mutexes
  18 \item[CUDA$_{sync}$:] synchronization of the GPU, realized with CUDA functions
  19 \end{description}
  20
  21 Among these, the last (CUDA$_{sync}$) is not strictly necessary on modern
  22 systems, but it is still recommended to obtain optimal performance. With or without
  23 that last step, such an application is highly complex: it is difficult to design
  24 or to maintain, and depends on a lot of different software components. The goal
  25 of this section is to present a new path of development that allows the
  26 replacement of the last three or four types of components that control the application (MPI,
  27 OpenMP, CUDA$_{sync}$, and eventually CUDA$_{trans}$) with a single tool:
  28 {O}rdered {R}ead-{W}rite {L}ocks, ORWL\index{ORWL}
  29 (see~\cite{clauss:2010:inria-00330024:1,gustedt:hal-00639289}). Besides the
  30 simplification of the algorithmic scheme that we have already mentioned, the
  31 ongoing implementation of ORWL allows the use of a feature of modern platforms that
  32 can improve the performance of CPU-bound computations: lock-free atomic
  33 operations to update shared data consistently. For these, ORWL relies on new
  34 interfaces that are available with the latest revision of the ISO standard for
  35 the C programming language (see~\cite{C11}).
  36
  37 \subsection{Resources}
  38 \label{sec:ch6p3resources}
  39
  40 ORWL places all its concepts that concern data and control around a single
  41 abstraction: \emph{resources}\index{ORWL!resource}. An ORWL resource may correspond to a local or
  42 remote entity and is identified through a \emph{location}\index{ORWL!location}, that is a unique
  43 identification through which it can be accessed from all different components of
  44 the same application. In fact, resources and locations (entities and their names,
  45 so to speak) are mostly identified by ORWL and these words will be used
  46 interchangeably.
  47
  48 Resources may be of very different kinds:
  49 \begin{description}
  50 \item[Data] resources\index{ORWL!resource!data} are entities that represents data and not specific
  51   memory buffers or locations. During the execution of an application they can be
  52   \emph{mapped} repeatedly into the address space and in effect be represented
  53   at different addresses. Data resources can be made accessible uniformly in all
  54   parts of the application, provided that the locking protocol is observed (see
  55   below). Data resources can have different persistence:
  56   \begin{description}
  57   \item[RAM] data resources are typically temporary data that serve only during
  58     a single run of the application. They must be initialized at the beginning of
  59     their lifetime and the contents are lost at the end.
  60   \item[File] data resources are persistent and linked to a file in the file
  61     system of the platform.
  62   \item[Collective] data resources are data to which all tasks of an
  63     application contribute (see below). Examples for such resources are \emph{broadcast},
  64     \emph{gather}, or
  65     \emph{reduce} resources, e.g., to distribute initial data or to collect the
  66     result of a distributed computation.
  67   \end{description}
  68   Other types of data resources could be easily implemented with ORWL, e.g., web
  69   resources (through a ftp, http or whatever server address) or fixed hardware
  70   addresses.
  71 \item[Device] resources\index{ORWL!resource!device} represent hardware entities of the
  72   platform. ORWL can then be used to regulate the access to such device
  73   resources. In our context the most important such resource is the GPU, but we
  74   could easily use it to represent a CPU core, a camera, or another peripheral
  75   device.
  76 \end{description}
  77
  78 \Lst{algo:ch6p3ORWLresources} shows an example of a declaration of
  79 four resources per task. Two (\texttt{curBlock} and \texttt{nextBlock}) are
  80 intended to represent the data in a block-cyclic parallel matrix multiplication
  81 (see Section~\ref{ch6:p1block-cyclic}),
  82 \texttt{GPU} represents a GPU device, and \texttt{result} will represent a
  83 collective ``gather'' resource among all the tasks.
  84
  85 \pagebreak
  86 %\begin{algorithm}[H]
  87 %  \caption{Declaration of ORWL resources for a block-cyclic matrix
  88 %    multiplication.}
  89 %  \label{algo:ch6p3ORWLresources}
  90 \begin{Listing}{algo:ch6p3ORWLresources}{declaration of ORWL resources for a block-cyclic matrix multiplication}
  91 #include "orwl.h"
  92 ...
  93 ORWL_LOCATIONS_PER_TASK(curBlock, nextBlock, GPU, result);
  94 ORWL_DATA_LOCATION(curBlock);
  95 ORWL_DATA_LOCATION(nextBlock);
  96 ORWL_DEVICE_LOCATION(GPU);
  97 ORWL_GATHER_LOCATION(result);
  98 \end{Listing}
  99 %\end{algorithm}
 100
 101
 102 \subsection{Control}
 103 \label{sec:ch6p3ORWLcontrol}
 104
 105
 106 ORWL regulates access to all its resources; no ``random access'' to a resource
 107 is possible. It doesn't even have a user-visible data type for resources.
 108 \begin{itemize}
 109 \item All access is provided through \emph{handle}s\index{ORWL!handle}. Similar to pointers or
 110   links, these only refer to a resource and help to manipulate it. Usually
 111   several handles to the same resource exist, even inside the same OS process
 112   or thread, or in the same application task.
 113 \item The access is locked with RW semantics, where {R} stands for
 114   concurrent {R}ead~access, and {W} for exclusive
 115   {W}rite~access. This feature replaces the control aspect of MPI
 116   communications, OpenMP inter-thread control, and CUDA$_{sync}$.
 117 \item This access is {O}rdered (or serialized)\index{ordered access} through a FIFO, \emph{one
 118     FIFO per resource}. This helps to run the different tasks of an application
 119   in a controlled order and to always have all resources in a known state. This
 120   aspect largely replaces and extends the ordering of tasks that MPI typically
 121   achieves through the passing of messages.
 122 \item The access is transparently managed for remote or local resources.
 123   Communication, if necessary, is done asynchronously behind the scenes. This
 124   replaces the explicit handling of buffers and messages with MPI.
 125 \end{itemize}
 126
 127 \subsection{Example: block-cyclic matrix multiplication (MM)}
 128 \label{sec:ch6p3ORWLMM}
 129
 130 Let us now have a look at how a block-cyclic matrix multiplication\index{matrix
 131   multiplication!block cyclic} algorithm can be
 132 implemented with these concepts (\Lst{algo:ch6p3ORWLBCCMM}). Inside the loop
 133 there are mainly three
 134 different operations, the first two of which can be run concurrently, and the
 135 third must be done after the other two.
 136 %\begin{algorithm}[H]
 137 %  \caption{Block-cyclic matrix multiplication, high level per task view.}
 138 %  \label{algo:ch6p3ORWLBCCMM}
 139 \begin{Listing}{algo:ch6p3ORWLBCCMM}{block-cyclic matrix multiplication, high level per task view}
 140 typedef double MBlock[N][N];
 141 MBlock A;
 142 MBlock B[k];
 143 MBlock C[k];
 144
 145 <do some initialization>
 146
 147 for (size_t i = 0; i < k; ++i) {
 148   MBlock next;
 149   parallel-do {
 150     operation 1: <copy the matrix A of the left neighbor into next>;
 151     operation 2: {
 152       <copy the local matrix A to the GPU >;
 153       <on GPU perform C[i] = A * B[0] + ... + A * B[k-1]; >;
 154     }
 155   }
 156   operation 3: {
 157     <wait until the right neighbor has read our block A>;
 158     A = next;
 159   }
 160 }
 161
 162 <collect the result matrix C consisting of all C blocks>
 163 \end{Listing}
 164 %\end{algorithm}
 165
 166
 167 \Lst{algo:ch6p3ORWLlcopy} shows the local copy operation~3 from line 17 of
 168 \Lst{algo:ch6p3ORWLBCCMM} as it could
 169 be realized with ORWL.
 170 It uses two resource handles
 171 \texttt{nextRead} and \texttt{curWrite} and marks nested \emph{critical
 172   sections} for these handles. Inside the nested sections it obtains pointers to
 173 the resource data; the resource is \emph{mapped} into the address space of the
 174 program, and then a standard call to \texttt{memcpy} achieves the operation
 175 itself. The operation is integrated in its own {for}-loop, such that it
 176 could run independently in an OS thread by its own.
 177 %\begin{algorithm}[H]
 178 %  \caption{An iterative local copy operation.}
 179 %  \label{algo:ch6p3ORWLlcopy}
 180 \begin{Listing}{algo:ch6p3ORWLlcopy}{an iterative local copy operation}
 181 for (size_t i = 0; i < k; ++i) {
 182   ORWL_SECTION(nextRead) {
 183     MBlock const* sBlock = orwl_read_map(nextRead);
 184     ORWL_SECTION(curWrite) {
 185       MBlock * tBlock = orwl_write_map(curWrite);
 186       memcpy(tBlock, sBlock, sizeof *tBlock);
 187     }
 188   }
 189 }
 190 \end{Listing}
 191 %\end{algorithm}
 192
 193 Next, in \Lst{algo:ch6p3ORWLrcopy} we copy data from a remote task to
 194 a local task. Substantially the operation is the same, save for the
 195 different handles (\texttt{remRead} and \texttt{nextWrite}) that are used to
 196 represent the respective resources.
 197 %\begin{algorithm}[H]
 198 %  \caption{An iterative remote copy operation as part of a block cyclic matrix
 199 %    multiplication task.}
 200 %\label{algo:ch6p3ORWLrcopy}
 201 \begin{Listing}{algo:ch6p3ORWLrcopy}{an iterative remote copy operation as part of a block cyclic matrix multiplication task}
 202 for (size_t i = 0; i < k; ++i) {
 203   ORWL_SECTION(remRead) {
 204     MBlock const* sBlock = orwl_read_map(remRead);
 205     ORWL_SECTION(nextWrite) {
 206       MBlock * tBlock = orwl_write_map(nextWrite);
 207       memcpy(tBlock, sBlock, sizeof *tBlock);
 208     }
 209   }
 210 }
 211 \end{Listing}
 212 %\end{algorithm}
 213
 214 Now let us have a look into the operation that probably interests us the most,
 215 the interaction with the GPU in \Lst{algo:ch6p3ORWLtrans}. Again there
 216 is much structural resemblance to the copy operations from above, but  we
 217 transfer the data to the GPU in the innermost block and then run the GPU MM
 218 kernel while we are still inside the critical section for the GPU.
 219 %\begin{algorithm}[H]
 220 %  \caption{An iterative GPU transfer and compute operation as part of a block cyclic matrix
 221 %    multiplication task.}
 222 %  \label{algo:ch6p3ORWLtrans}
 223 \begin{Listing}{algo:ch6p3ORWLtrans}{an iterative GPU transfer and compute operation as part of a block cyclic matrix multiplication task}
 224 for (size_t i = 0; i < k; ++i) {
 225   ORWL_SECTION(GPUhandle) {
 226     ORWL_SECTION(curRead) {
 227       MBlock const* sBlock = orwl_read_map(curRead);
 228       transferToGPU(sBlock, i);
 229     }
 230     runMMonGPU(i);
 231   }
 232 }
 233 \end{Listing}
 234 %\end{algorithm}
 235
 236 Now that we have seen how the actual procedural access to the resources is
 237 regulated, we will show how the association between handles and resources is
 238 specified. In our application of block-cyclic MM the \texttt{curRead} handle
 239 should correspond to the current matrix block of the corresponding task, whereas
 240 \texttt{remRead} should point to the current block of the neighboring task.
 241 Both read operations on these matrix blocks can be performed without creating
 242 conflicts, so we would like to express that fact in our resource specification.
 243 From a point of view of the resource ``current block'' of a particular task,
 244 this means that it can have two simultaneous readers, the task itself performing
 245 the transfer to the GPU, and the neighboring task transferring the data to its
 246 ``next block.''
 247
 248 \Lst{algo:ch6p3ORWLdecl} first shows the local dynamic declarations of
 249 our application; it declares a \texttt{block} type for the matrix blocks, a
 250 \texttt{result} data for the collective resource, and the six handles that we
 251 have seen so far.
 252 %\begin{algorithm}[H]
 253 %  \caption{Dynamic declaration of handles to represent the resources.}
 254 %  \label{algo:ch6p3ORWLdecl}
 255 \begin{Listing}{algo:ch6p3ORWLdecl}{dynamic declaration of handles to represent the resources}
 256 /* A type for the matrix blocks */
 257 typedef double MBlock[N][N];
 258 /* Declaration to handle the collective resource */
 259 ORWL_GATHER_DECLARE(MBlock, result);
 260
 261 /* Variables to handle data resources */
 262 orwl_handle2 remRead   = ORWL_HANDLE2_INITIALIZER;
 263 orwl_handle2 nextWrite = ORWL_HANDLE2_INITIALIZER;
 264 orwl_handle2 nextRead  = ORWL_HANDLE2_INITIALIZER;
 265 orwl_handle2 curWrite  = ORWL_HANDLE2_INITIALIZER;
 266 orwl_handle2 curRead   = ORWL_HANDLE2_INITIALIZER;
 267
 268 /* Variable to handle the device resources */
 269 orwl_handle2 GPUhandle = ORWL_HANDLE2_INITIALIZER;
 270 \end{Listing}
 271 %\end{algorithm}
 272
 273 With these  declarations, we didn't  yet tell ORWL  much about the  resources to
 274 which these  handles refer, nor the type  (read or write) or  the priority (FIFO
 275 position)  of the  access. This  is done  in code  \Lst{algo:ch6p3ORWLinit} that
 276 shows  six insertions  of  handles  into their  respective  FIFO locations.  The
 277 handles \texttt{GPUhandle} and \texttt{curRead} are given first (lines 3 and 4),
 278 \texttt{GPUhandle} will  be accessed exclusively  (therefore the \texttt{write})
 279 and,   as   said,   \texttt{curRead}   is   used  in   shared   access   (so   a
 280 \texttt{read}). Both are inserted in the FIFO of their respective resources with
 281 highest  priority,   specified  by  the   \texttt{0}s  in  the   third  function
 282 parameter. The resources to which they correspond are specified through calls to
 283 the macro \texttt{ORWL\_LOCATION},  indicating the task (\texttt{orwl\_mytid} is
 284 the  ID of  the  current task)  and the  specific  resource of  that task,  here
 285 \texttt{GPU} and \texttt{curBlock}.
 286
 287 Likewise, a second block of insertions concerns the handles \texttt{remRead} and
 288 \texttt{nextWrite}  (lines 8  and 9).  \texttt{nextWrite} reclaims  an exclusive
 289 access  and \texttt{remRead}  a shared  one. \texttt{remRead}  corresponds  to a
 290 resource of another task; the call to \texttt{previous(orwl\_mytid)} is supposed
 291 to  return the  ID of  the previous  task  in the  cycle. Both  accesses can  be
 292 performed concurrently with  the previous operation, so we  insert them with the
 293 same priority \texttt{0} as previously.
 294
 295 Then,  for  the  specification  of   the  third  operation  related  to  handles
 296 \texttt{nextRead} and  \texttt{curWrite} (lines 13 and 14), we  need to
 297 use  a  different  priority:  the  copy  operation  from data locations \texttt{nextBlock}  to
 298 \texttt{curBlock}  has   to  be  performed  after  the   other  operations  have
 299 terminated (so the priority \texttt{1}).
 300
 301 As a final step, we then tell ORWL (line 16) that the specification of all accesses is
 302 complete and that it may schedule\index{ORWL!schedule} all these accesses in the respective FIFOs of
 303 the resources.
 304 %\begin{algorithm}[H]
 305 %  \caption{Dynamic initialization of access mode and priorities.}
 306 %  \label{algo:ch6p3ORWLinit}
 307 %\pagebreak
 308 \begin{Listing}{algo:ch6p3ORWLinit}{dynamic initialization of access mode and priorities}
 309 /* One operation with priority 0 (highest) consists      */
 310 /* of copying from current to the GPU and running MM, there. */
 311 orwl_write_insert(&GPUhandle, ORWL_LOCATION(orwl_mytid, GPU), 0);
 312 orwl_read_insert(&curRead, ORWL_LOCATION(orwl_mytid, curBlock), 0);
 313
 314 /* Another operation with priority 0 consists             */
 315 /* of copying from remote to next                         */
 316 orwl_read_insert(&remRead, ORWL_LOCATION(previous(orwl_mytid), curBlock), 0);
 317 orwl_write_insert(&nextWrite, ORWL_LOCATION(orwl_mytid,nextBlock), 0);
 318
 319 /* One operation with priority 1 consists       */
 320 /* of copying from next to current              */
 321 orwl_read_insert(&nextRead, ORWL_LOCATION(orwl_mytid, nextBlock), 1);
 322 orwl_write_insert(&curWrite, ORWL_LOCATION(orwl_mytid, curBlock), 1);
 323
 324 orwl_schedule();
 325 \end{Listing}
 326 %\end{algorithm}
 327
 328 \subsection{Tasks and operations}
 329 \label{sec:ch6p3tasks}
 330 With this example we have now seen that ORWL distinguishes
 331 \emph{tasks}\index{ORWL!task} and
 332 \emph{operations}\index{ORWL!operation}. An ORWL program is divided into tasks that can be seen as the
 333 algorithmic units that will concurrently access the resources that the program
 334 uses. A task for ORWL is characterized by
 335 \begin{itemize}
 336 \item a fixed set of resources that it manages, ``\emph{owns},'' in our example
 337   the four resources that are declared in \Lst{algo:ch6p3ORWLresources}.
 338 \item a larger set of resources that it accesses, in our example all resources
 339   that are used in \Lst{algo:ch6p3ORWLinit}.
 340 \item a set of operations that act on these resources, in our example the three
 341   operations that are used in \Lst{algo:ch6p3ORWLBCCMM}, and that are
 342   elaborated in Listings~\ref{algo:ch6p3ORWLlcopy},~\ref{algo:ch6p3ORWLrcopy},
 343   and~\ref{algo:ch6p3ORWLtrans}.
 344 \end{itemize}
 345
 346 \noindent
 347 Each ORWL operation is characterized by
 348 \begin{itemize}
 349 \item one resource, usually one that is owned by the enclosing task, that it
 350   accesses exclusively. In our example, operation~1 has exclusive access to the
 351   \texttt{next} block, operation~2 has exclusive access the \texttt{GPU}
 352   resource, and operation~3 to the \texttt{A} block.
 353 \item several resources that are accessed concurrently with others.
 354 \end{itemize}
 355 In fact each ORWL~operation can be viewed as a compute-and-update procedure of a
 356 particular resource with knowledge of another set of resources.
 357
 358
 359 %%% Local Variables:
 360 %%% mode: latex
 361 %%% fill-column: 80
 362 %%% ispell-dictionary: "american"
 363 %%% mode: flyspell
 364 %%% TeX-master: "../../BookGPU.tex"
 365 %%% End: