\section{Perspective: A unifying programming model}

In the previous sections we have seen that controlling a distributed GPU
application with commonly available tools is quite a challenging
task. To summarize, such an application has components that can be roughly
classified as follows:
\begin{description}
\item[CPU:] CPU-bound computations, realized as procedures in the chosen
  programming language
\item[CUDA$_{kern}$:] GPU-bound computations, in our context realized as CUDA compute kernels
\item[CUDA$_{trans}$:] data transfer between CPU and GPU, realized with CUDA function calls
\item[MPI:] distributed data transfer routines, realized with MPI communication primitives
\item[OpenMP:] inter-thread control, realized with OpenMP synchronization tools such as mutexes
\item[CUDA$_{sync}$:] synchronization of the GPU, realized with CUDA functions
\end{description}
Among these, the last (CUDA$_{sync}$) is not strictly necessary on modern
systems, but it is still recommended for optimal performance. With or without
that last step, such an application is highly complex: it is difficult to design
and to maintain, and it depends on many different software components. The goal
of this section is to present a new development path that makes it possible to replace
the last three or four types of components that control the application (MPI,
OpenMP, CUDA$_{sync}$ and possibly CUDA$_{trans}$) by a single tool:
\textbf{O}rdered \textbf{R}ead-\textbf{W}rite \textbf{L}ocks, ORWL\index{ORWL},
see~\cite{clauss:2010:inria-00330024:1,gustedt:hal-00639289}. Besides the
simplification of the algorithmic scheme that we have already mentioned, the
ongoing implementation of ORWL makes it possible to use a feature of modern platforms that
can improve the performance of CPU-bound computations: lock-free atomic
operations to update shared data consistently. For these, ORWL relies on new
interfaces that are available with the latest revision of the ISO standard for
the C programming language, see~\cite{C11}.
\subsection{Resources}
\label{sec:ch6p3resources}
ORWL places all its concepts that concern data and control around a single
abstraction: \emph{resources}\index{ORWL!resource}. An ORWL resource may correspond to a local or
remote entity and is identified through a \emph{location}\index{ORWL!location}, that is, a unique
identification through which it can be accessed from all different components of
the same application. In fact, resources and locations (entities and their names,
so to speak) are mostly identified by ORWL, and these two words will be used
interchangeably.
Resources may be of very different kinds:
\begin{description}
\item[Data] resources\index{ORWL!resource!data} are entities that represent data, not specific
  memory buffers or locations. During the execution of an application a data resource can be
  \emph{mapped} repeatedly into the address space and in effect be represented
  at different addresses. Data resources can be made accessible uniformly in all
  parts of the application, provided that the locking protocol is observed, see
  below. Data resources can have different persistence:
  \begin{description}
  \item[RAM] data resources are typically temporary data that serve only during
    a single run of the application. They must be initialized at the beginning of
    their lifetime, and their contents are lost at the end.
  \item[File] data resources are persistent and linked to a file in the file
    system of the platform.
  \item[Collective] data resources are data to which all tasks of an
    application contribute (see below). Examples of such resources are \emph{broadcast} or
    \emph{reduce} resources, e.g.\ to distribute initial data or to collect the
    result of a distributed computation.
  \end{description}
  Other types of data resources could easily be implemented with ORWL, e.g.\ web
  resources (through an FTP, HTTP or other server address) or fixed hardware
  addresses.
\item[Device] resources\index{ORWL!resource!device} represent hardware entities of the
  platform. ORWL can then be used to regulate the access to such device
  resources. In our context the most important such resource is the GPU, but we
  could easily use it to represent a CPU core, a camera, or another peripheral
  device.
\end{description}
\Lst{algo:ch6p3ORWLresources} shows an example of a declaration of
four resources per task. Two (\texttt{curBlock} and \texttt{nextBlock}) are
intended to represent the data in a block-cyclic parallel matrix multiplication
(see p.~\pageref{ch6:p1block-cyclic}),
\texttt{GPU} represents a GPU device, and \texttt{result} will represent a
collective ``gather'' resource among all the tasks.
%\begin{algorithm}[H]
% \caption{Declaration of ORWL resources for a block-cyclic matrix
%   multiplication.}
% \label{algo:ch6p3ORWLresources}
\begin{Listing}{algo:ch6p3ORWLresources}{Declaration of ORWL resources for a block-cyclic matrix multiplication}
ORWL_LOCATIONS_PER_TASK(curBlock, nextBlock, GPU, result);
ORWL_DATA_LOCATION(curBlock);
ORWL_DATA_LOCATION(nextBlock);
ORWL_DEVICE_LOCATION(GPU);
ORWL_GATHER_LOCATION(result);
\end{Listing}
\subsection{Control}
\label{sec:ch6p3ORWLcontrol}

ORWL regulates access to all its resources; no ``random access'' to a resource
is possible. It does not even have a user-visible data type for resources.
\begin{itemize}
\item All access is provided through \emph{handle}s\index{ORWL!handle}. Similar to pointers or
  links, these only refer to a resource and help to manipulate it. Usually
  several handles to the same resource exist, even inside the same OS process
  or thread, or in the same application task.
\item The access is locked with RW semantics, where \emph{R} stands for
  concurrent \textbf{R}ead~access, and \emph{W} for exclusive
  \textbf{W}rite~access. This feature replaces the control aspect of MPI
  communications, of OpenMP inter-thread control, and of CUDA$_{sync}$.
\item This access is \textbf{O}rdered (or serialized)\index{ordered access} through a FIFO, \emph{one
  FIFO per resource}. This helps to run the different tasks of an application
  in a controlled order and to always have all resources in a known state. This
  aspect largely replaces and extends the ordering of tasks that MPI typically
  achieves through the passing of messages.
\item The access is transparently managed for remote or local resources.
  Communication, if necessary, is done asynchronously behind the scenes. This
  replaces the explicit handling of buffers and messages with MPI.
\end{itemize}
\subsection{Example: block-cyclic matrix multiplication (MM)}
\label{sec:ch6p3ORWLMM}
Let us now have a look at how a block-cyclic matrix multiplication\index{matrix
multiplication!block cyclic} algorithm can be
implemented with these concepts (\Lst{algo:ch6p3ORWLBCCMM}). Inside the loop it mainly consists of three
different operations, of which the first two can be run concurrently, while the
third must be done after the other two.
%\begin{algorithm}[H]
% \caption{Block-cyclic matrix multiplication, high level per task view.}
% \label{algo:ch6p3ORWLBCCMM}
\begin{Listing}{algo:ch6p3ORWLBCCMM}{Block-cyclic matrix multiplication, high level per task view}
typedef double MBlock[N][N];

<do some initialization>

for (size_t i = 0; i < k; ++i) {

  operation 1: <copy the matrix A of the left neighbor into next>;

  operation 2: <copy the local matrix A to the GPU >;
               <on GPU perform C[i] = A * B[0] + ... + A * B[k-1]; >;

  operation 3: <wait until the right neighbor has read our block A>;
               <copy the block next into the local matrix A>;
}

<collect the result matrix C consisting of all C blocks>
\end{Listing}
\Lst{algo:ch6p3ORWLlcopy} shows the local copy operation~3 as it could
be realized with ORWL.
It uses two resource handles,
\texttt{nextRead} and \texttt{curWrite}, and marks nested \emph{critical
sections} for these handles. Inside the nested sections it obtains pointers to
the resource data; the resource is \emph{mapped} into the address space of the
program, and then a standard call to \texttt{memcpy} achieves the operation
itself. The operation is integrated in its own \textbf{for}-loop, such that it
could run independently in an OS thread of its own.
%\begin{algorithm}[H]
% \caption{An iterative local copy operation.}
% \label{algo:ch6p3ORWLlcopy}
\begin{Listing}{algo:ch6p3ORWLlcopy}{An iterative local copy operation}
for (size_t i = 0; i < k; ++i) {
  ORWL_SECTION(nextRead) {
    MBlock const* sBlock = orwl_read_map(nextRead);
    ORWL_SECTION(curWrite) {
      MBlock * tBlock = orwl_write_map(curWrite);
      memcpy(tBlock, sBlock, sizeof *tBlock);
    }
  }
}
\end{Listing}
Next, in \Lst{algo:ch6p3ORWLrcopy} we copy data from a remote task to
a local task. Substantially the operation is the same, except that in the example
different handles (\texttt{remRead} and \texttt{nextWrite}) are used to
represent the respective resources.
%\begin{algorithm}[H]
% \caption{An iterative remote copy operation as part of a block-cyclic matrix
%   multiplication task.}
% \label{algo:ch6p3ORWLrcopy}
\begin{Listing}{algo:ch6p3ORWLrcopy}{An iterative remote copy operation as part of a block-cyclic matrix multiplication task}
for (size_t i = 0; i < k; ++i) {
  ORWL_SECTION(remRead) {
    MBlock const* sBlock = orwl_read_map(remRead);
    ORWL_SECTION(nextWrite) {
      MBlock * tBlock = orwl_write_map(nextWrite);
      memcpy(tBlock, sBlock, sizeof *tBlock);
    }
  }
}
\end{Listing}
Now let us have a look at the operation that probably interests us the most,
the interaction with the GPU in \Lst{algo:ch6p3ORWLtrans}. Again there
is much structural resemblance to the copy operations from above, except that we
transfer the data to the GPU in the innermost block and then run the GPU MM
kernel while we are still inside the critical section for the GPU.
%\begin{algorithm}[H]
% \caption{An iterative GPU transfer and compute operation as part of a block-cyclic matrix
%   multiplication task.}
% \label{algo:ch6p3ORWLtrans}
\begin{Listing}{algo:ch6p3ORWLtrans}{An iterative GPU transfer and compute operation as part of a block-cyclic matrix multiplication task}
for (size_t i = 0; i < k; ++i) {
  ORWL_SECTION(GPUhandle) {
    ORWL_SECTION(curRead) {
      MBlock const* sBlock = orwl_read_map(curRead);
      transferToGPU(sBlock, i);
    }
    <run the MM kernel on the GPU>;
  }
}
\end{Listing}
Now that we have seen how the actual procedural access to the resources is
regulated, we will show how the association between handles and resources is
specified. E.g.\ in our application of block-cyclic MM, the \texttt{curRead} handle
should correspond to the current matrix block of the corresponding task, whereas
\texttt{remRead} should point to the current block of the neighboring task.
Both read operations on these matrix blocks can be effected without creating
conflicts, so we would like to express that fact in our resource specification.
From the point of view of the resource ``current block'' of a particular task,
this means that it can have two simultaneous readers: the task itself, performing
the transfer to the GPU, and the neighboring task, transferring the data to its
\texttt{next} block.
\Lst{algo:ch6p3ORWLdecl} first shows the local dynamic declarations of
our application; it declares a \texttt{block} type for the matrix blocks, a
\texttt{result} declaration for the collective resource, and the six handles that we
will need.
%\begin{algorithm}[H]
% \caption{Dynamic declaration of handles to represent the resources.}
% \label{algo:ch6p3ORWLdecl}
\begin{Listing}{algo:ch6p3ORWLdecl}{Dynamic declaration of handles to represent the resources}
/* A type for the matrix blocks */
typedef double MBlock[N][N];
/* Declaration to handle the collective resource */
ORWL_GATHER_DECLARE(MBlock, result);

/* Variables to handle data resources */
orwl_handle2 remRead = ORWL_HANDLE2_INITIALIZER;
orwl_handle2 nextWrite = ORWL_HANDLE2_INITIALIZER;
orwl_handle2 nextRead = ORWL_HANDLE2_INITIALIZER;
orwl_handle2 curWrite = ORWL_HANDLE2_INITIALIZER;
orwl_handle2 curRead = ORWL_HANDLE2_INITIALIZER;

/* Variable to handle the device resources */
orwl_handle2 GPUhandle = ORWL_HANDLE2_INITIALIZER;
\end{Listing}
With these declarations, we have not yet told ORWL much about the resources to
which these handles refer, nor about the type (read or write) or the priority (FIFO
position) of the access. This is done in
\Lst{algo:ch6p3ORWLinit}. The handles for
\Lst{algo:ch6p3ORWLtrans} are given first: \texttt{GPUhandle} will be
accessed exclusively (therefore the \texttt{write}) and, as said,
\texttt{curRead} is used shared (so a \texttt{read}). Both are inserted in the
FIFOs of their respective resources with highest priority, specified by the
\texttt{0}s in the third function parameter. The resources to which they
correspond are specified through calls to the macro \texttt{ORWL\_LOCATION},
indicating the task (\texttt{orwl\_mytid} is the ID of the current task) and the
specific resource of that task, here \texttt{GPU} and \texttt{curBlock}.
Likewise, a second block of insertions concerns the handles for
\Lst{algo:ch6p3ORWLrcopy}: \texttt{nextWrite} requests exclusive
access and \texttt{remRead} shared access. \texttt{remRead} corresponds to a
resource of another task; \texttt{previous(orwl\_mytid)} is supposed to return
the ID of the previous task in the cycle. Both accesses can be effected
concurrently with the previous operation, so we insert them with the same priority
\texttt{0} as before.
Then, for the specification of the third operation
(\Lst{algo:ch6p3ORWLlcopy}) we need to use a different priority: the
copy operation from \texttt{nextBlock} to \texttt{curBlock} has to be performed
after the other operations have terminated.

As a final step, we then tell ORWL that the specification of all accesses is
complete and that it may schedule\index{ORWL!schedule} all these accesses in the respective FIFOs of
the resources.
%\begin{algorithm}[H]
% \caption{Dynamic initialization of access mode and priorities.}
% \label{algo:ch6p3ORWLinit}
\begin{Listing}{algo:ch6p3ORWLinit}{Dynamic initialization of access mode and priorities}
/* One operation with priority 0 (highest) consists */
/* in copying from current to the GPU and running MM there. */
orwl_write_insert(&GPUhandle, ORWL_LOCATION(orwl_mytid, GPU), 0);
orwl_read_insert(&curRead, ORWL_LOCATION(orwl_mytid, curBlock), 0);

/* Another operation with priority 0 consists */
/* in copying from remote to next */
orwl_read_insert(&remRead, ORWL_LOCATION(previous(orwl_mytid), curBlock), 0);
orwl_write_insert(&nextWrite, ORWL_LOCATION(orwl_mytid, nextBlock), 0);

/* One operation with priority 1 consists */
/* in copying from next to current */
orwl_read_insert(&nextRead, ORWL_LOCATION(orwl_mytid, nextBlock), 1);
orwl_write_insert(&curWrite, ORWL_LOCATION(orwl_mytid, curBlock), 1);
\end{Listing}
\subsection{Tasks and operations}
\label{sec:ch6p3tasks}
With that example we have now seen that ORWL distinguishes
\emph{tasks}\index{ORWL!task} and
\emph{operations}\index{ORWL!operation}. An ORWL program is divided into tasks that can be seen as the
algorithmic units that will concurrently access the resources that the program
uses. A task for ORWL is characterized by
\begin{itemize}
\item a fixed set of resources that it manages, ``\emph{owns}'', in our example
  the four resources that are declared in \Lst{algo:ch6p3ORWLresources};
\item a larger set of resources that it accesses, in our example all resources
  that are used in \Lst{algo:ch6p3ORWLinit};
\item a set of operations that act on these resources, in our example the three
  operations that are used in \Lst{algo:ch6p3ORWLBCCMM}, and that are
  elaborated in Listings~\ref{algo:ch6p3ORWLlcopy},~\ref{algo:ch6p3ORWLrcopy},
  and~\ref{algo:ch6p3ORWLtrans}.
\end{itemize}
Each ORWL operation is characterized by
\begin{itemize}
\item one resource, usually one that is owned by the enclosing task, that it
  accesses exclusively. In our example, operation~1 has exclusive access to the
  \texttt{next} block, operation~2 has exclusive access to the \texttt{GPU}
  resource, and operation~3 to the \texttt{A} block.
\item several resources that are accessed concurrently with others.
\end{itemize}
In fact, each ORWL~operation can be viewed as a compute-and-update procedure of a
particular resource with knowledge of another set of resources.
%%% Local Variables:
%%% ispell-dictionary: "american"
%%% TeX-master: "../../BookGPU.tex"
%%% End: