2 \section{Perspective: a unifying programming model}
5 In the previous sections we have seen that controlling a distributed GPU
6 application when using tools that are commonly available is quite a challenging
7 task. To summarize, such an application has components that can be roughly
12 \item[CPU:] CPU-bound computations, realized as procedures in the chosen
14 \item[CUDA$_{kern}$:] GPU-bound computations, in our context realized as CUDA compute kernels
15 \item[CUDA$_{trans}$:] data transfer between CPU and GPU, realized with CUDA function calls
16 \item[MPI:] distributed data transfer routines, realized with MPI communication primitives
17 \item[OpenMP:] inter-thread control, realized with OpenMP synchronization tools such as mutexes
18 \item[CUDA$_{sync}$:] synchronization of the GPU, realized with CUDA functions
21 Among these, the last (CUDA$_{sync}$) is not strictly necessary on modern
22 systems, but it is still recommended to obtain optimal performance. With or without
23 that last step, such an application is highly complex: it is difficult to design
24 or to maintain, and depends on a lot of different software components. The goal
25 of this section is to present a new path of development that allows the
26 replacement of the last three or four types of components that control the application (MPI,
27 OpenMP, CUDA$_{sync}$, and eventually CUDA$_{trans}$) with a single tool:
28 {O}rdered {R}ead-{W}rite {L}ocks, ORWL\index{ORWL}
29 (see~\cite{clauss:2010:inria-00330024:1,gustedt:hal-00639289}). Besides the
30 simplification of the algorithmic scheme that we have already mentioned, the
31 ongoing implementation of ORWL allows the use of a feature of modern platforms that
32 can improve the performance of CPU-bound computations: lock-free atomic
33 operations to update shared data consistently. For these, ORWL relies on new
34 interfaces that are available with the latest revision of the ISO standard for
35 the C programming language (see~\cite{C11}).
37 \subsection{Resources}
38 \label{sec:ch6p3resources}
40 ORWL places all its concepts that concern data and control around a single
41 abstraction: \emph{resources}\index{ORWL!resource}. An ORWL resource may correspond to a local or
42 remote entity and is identified through a \emph{location}\index{ORWL!location}, that is a unique
43 identification through which it can be accessed from all different components of
44 the same application. In fact, resources and locations (entities and their names,
45 so to speak) are mostly identified by ORWL and these words will be used
48 Resources may be of very different kinds:
50 \item[Data] resources\index{ORWL!resource!data} are entities that represents data and not specific
51 memory buffers or locations. During the execution of an application they can be
52 \emph{mapped} repeatedly into the address space and in effect be represented
53 at different addresses. Data resources can be made accessible uniformly in all
54 parts of the application, provided that the locking protocol is observed (see
55 below). Data resources can have different persistence:
57 \item[RAM] data resources are typically temporary data that serve only during
58 a single run of the application. They must be initialized at the beginning of
59 their lifetime and the contents are lost at the end.
60 \item[File] data resources are persistent and linked to a file in the file
61 system of the platform.
62 \item[Collective] data resources are data to which all tasks of an
63 application contribute (see below). Examples for such resources are \emph{broadcast},
65 \emph{reduce} resources, e.g., to distribute initial data or to collect the
66 result of a distributed computation.
68 Other types of data resources could be easily implemented with ORWL, e.g., web
69 resources (through a ftp, http or whatever server address) or fixed hardware
71 \item[Device] resources\index{ORWL!resource!device} represent hardware entities of the
72 platform. ORWL can then be used to regulate the access to such device
73 resources. In our context the most important such resource is the GPU, but we
74 could easily use it to represent a CPU core, a camera, or another peripheral
78 \Lst{algo:ch6p3ORWLresources} shows an example of a declaration of
79 four resources per task. Two (\texttt{curBlock} and \texttt{nextBlock}) are
80 intended to represent the data in a block-cyclic parallel matrix multiplication
81 (see Section~\ref{ch6:p1block-cyclic}),
82 \texttt{GPU} represents a GPU device, and \texttt{result} will represent a
83 collective ``gather'' resource among all the tasks.
87 % \caption{Declaration of ORWL resources for a block-cyclic matrix
89 % \label{algo:ch6p3ORWLresources}
90 \begin{Listing}{algo:ch6p3ORWLresources}{declaration of ORWL resources for a block-cyclic matrix multiplication}
93 ORWL_LOCATIONS_PER_TASK(curBlock, nextBlock, GPU, result);
94 ORWL_DATA_LOCATION(curBlock);
95 ORWL_DATA_LOCATION(nextBlock);
96 ORWL_DEVICE_LOCATION(GPU);
97 ORWL_GATHER_LOCATION(result);
103 \label{sec:ch6p3ORWLcontrol}
106 ORWL regulates access to all its resources; no ``random access'' to a resource
107 is possible. It doesn't even have a user-visible data type for resources.
109 \item All access is provided through \emph{handle}s\index{ORWL!handle}. Similar to pointers or
110 links, these only refer to a resource and help to manipulate it. Usually
111 several handles to the same resource exist, even inside the same OS process
112 or thread, or in the same application task.
113 \item The access is locked with RW semantics, where {R} stands for
114 concurrent {R}ead~access, and {W} for exclusive
115 {W}rite~access. This feature replaces the control aspect of MPI
116 communications, OpenMP inter-thread control, and CUDA$_{sync}$.
117 \item This access is {O}rdered (or serialized)\index{ordered access} through a FIFO, \emph{one
118 FIFO per resource}. This helps to run the different tasks of an application
119 in a controlled order and to always have all resources in a known state. This
120 aspect largely replaces and extends the ordering of tasks that MPI typically
121 achieves through the passing of messages.
122 \item The access is transparently managed for remote or local resources.
123 Communication, if necessary, is done asynchronously behind the scenes. This
124 replaces the explicit handling of buffers and messages with MPI.
127 \subsection{Example: block-cyclic matrix multiplication (MM)}
128 \label{sec:ch6p3ORWLMM}
130 Let us now have a look at how a block-cyclic matrix multiplication\index{matrix
131 multiplication!block cyclic} algorithm can be
132 implemented with these concepts (\Lst{algo:ch6p3ORWLBCCMM}). Inside the loop
133 there are mainly three
134 different operations, the first two of which can be run concurrently, and the
135 third must be done after the other two.
136 %\begin{algorithm}[H]
137 % \caption{Block-cyclic matrix multiplication, high level per task view.}
138 % \label{algo:ch6p3ORWLBCCMM}
139 \begin{Listing}{algo:ch6p3ORWLBCCMM}{block-cyclic matrix multiplication, high level per task view}
140 typedef double MBlock[N][N];
145 <do some initialization>
147 for (size_t i = 0; i < k; ++i) {
150 operation 1: <copy the matrix A of the left neighbor into next>;
152 <copy the local matrix A to the GPU >;
153 <on GPU perform C[i] = A * B[0] + ... + A * B[k-1]; >;
157 <wait until the right neighbor has read our block A>;
162 <collect the result matrix C consisting of all C blocks>
167 \Lst{algo:ch6p3ORWLlcopy} shows the local copy operation~3 from line 17 of
168 \Lst{algo:ch6p3ORWLBCCMM} as it could
169 be realized with ORWL.
170 It uses two resource handles
171 \texttt{nextRead} and \texttt{curWrite} and marks nested \emph{critical
172 sections} for these handles. Inside the nested sections it obtains pointers to
173 the resource data; the resource is \emph{mapped} into the address space of the
174 program, and then a standard call to \texttt{memcpy} achieves the operation
175 itself. The operation is integrated in its own {for}-loop, such that it
176 could run independently in an OS thread by its own.
177 %\begin{algorithm}[H]
178 % \caption{An iterative local copy operation.}
179 % \label{algo:ch6p3ORWLlcopy}
180 \begin{Listing}{algo:ch6p3ORWLlcopy}{an iterative local copy operation}
181 for (size_t i = 0; i < k; ++i) {
182 ORWL_SECTION(nextRead) {
183 MBlock const* sBlock = orwl_read_map(nextRead);
184 ORWL_SECTION(curWrite) {
185 MBlock * tBlock = orwl_write_map(curWrite);
186 memcpy(tBlock, sBlock, sizeof *tBlock);
193 Next, in \Lst{algo:ch6p3ORWLrcopy} we copy data from a remote task to
194 a local task. Substantially the operation is the same, save for the
195 different handles (\texttt{remRead} and \texttt{nextWrite}) that are used to
196 represent the respective resources.
197 %\begin{algorithm}[H]
198 % \caption{An iterative remote copy operation as part of a block cyclic matrix
199 % multiplication task.}
200 %\label{algo:ch6p3ORWLrcopy}
201 \begin{Listing}{algo:ch6p3ORWLrcopy}{an iterative remote copy operation as part of a block cyclic matrix multiplication task}
202 for (size_t i = 0; i < k; ++i) {
203 ORWL_SECTION(remRead) {
204 MBlock const* sBlock = orwl_read_map(remRead);
205 ORWL_SECTION(nextWrite) {
206 MBlock * tBlock = orwl_write_map(nextWrite);
207 memcpy(tBlock, sBlock, sizeof *tBlock);
214 Now let us have a look into the operation that probably interests us the most,
215 the interaction with the GPU in \Lst{algo:ch6p3ORWLtrans}. Again there
216 is much structural resemblance to the copy operations from above, but we
217 transfer the data to the GPU in the innermost block and then run the GPU MM
218 kernel while we are still inside the critical section for the GPU.
219 %\begin{algorithm}[H]
220 % \caption{An iterative GPU transfer and compute operation as part of a block cyclic matrix
221 % multiplication task.}
222 % \label{algo:ch6p3ORWLtrans}
223 \begin{Listing}{algo:ch6p3ORWLtrans}{an iterative GPU transfer and compute operation as part of a block cyclic matrix multiplication task}
224 for (size_t i = 0; i < k; ++i) {
225 ORWL_SECTION(GPUhandle) {
226 ORWL_SECTION(curRead) {
227 MBlock const* sBlock = orwl_read_map(curRead);
228 transferToGPU(sBlock, i);
236 Now that we have seen how the actual procedural access to the resources is
237 regulated, we will show how the association between handles and resources is
238 specified. In our application of block-cyclic MM the \texttt{curRead} handle
239 should correspond to the current matrix block of the corresponding task, whereas
240 \texttt{remRead} should point to the current block of the neighboring task.
241 Both read operations on these matrix blocks can be performed without creating
242 conflicts, so we would like to express that fact in our resource specification.
243 From a point of view of the resource ``current block'' of a particular task,
244 this means that it can have two simultaneous readers, the task itself performing
245 the transfer to the GPU, and the neighboring task transferring the data to its
248 \Lst{algo:ch6p3ORWLdecl} first shows the local dynamic declarations of
249 our application; it declares a \texttt{block} type for the matrix blocks, a
250 \texttt{result} data for the collective resource, and the six handles that we
252 %\begin{algorithm}[H]
253 % \caption{Dynamic declaration of handles to represent the resources.}
254 % \label{algo:ch6p3ORWLdecl}
255 \begin{Listing}{algo:ch6p3ORWLdecl}{dynamic declaration of handles to represent the resources}
256 /* A type for the matrix blocks */
257 typedef double MBlock[N][N];
258 /* Declaration to handle the collective resource */
259 ORWL_GATHER_DECLARE(MBlock, result);
261 /* Variables to handle data resources */
262 orwl_handle2 remRead = ORWL_HANDLE2_INITIALIZER;
263 orwl_handle2 nextWrite = ORWL_HANDLE2_INITIALIZER;
264 orwl_handle2 nextRead = ORWL_HANDLE2_INITIALIZER;
265 orwl_handle2 curWrite = ORWL_HANDLE2_INITIALIZER;
266 orwl_handle2 curRead = ORWL_HANDLE2_INITIALIZER;
268 /* Variable to handle the device resources */
269 orwl_handle2 GPUhandle = ORWL_HANDLE2_INITIALIZER;
273 With these declarations, we didn't yet tell ORWL much about the resources to
274 which these handles refer, nor the type (read or write) or the priority (FIFO
275 position) of the access. This is done in code \Lst{algo:ch6p3ORWLinit} that
276 shows six insertions of handles into their respective FIFO locations. The
277 handles \texttt{GPUhandle} and \texttt{curRead} are given first (lines 3 and 4),
278 \texttt{GPUhandle} will be accessed exclusively (therefore the \texttt{write})
279 and, as said, \texttt{curRead} is used in shared access (so a
280 \texttt{read}). Both are inserted in the FIFO of their respective resources with
281 highest priority, specified by the \texttt{0}s in the third function
282 parameter. The resources to which they correspond are specified through calls to
283 the macro \texttt{ORWL\_LOCATION}, indicating the task (\texttt{orwl\_mytid} is
284 the ID of the current task) and the specific resource of that task, here
285 \texttt{GPU} and \texttt{curBlock}.
287 Likewise, a second block of insertions concerns the handles \texttt{remRead} and
288 \texttt{nextWrite} (lines 8 and 9). \texttt{nextWrite} reclaims an exclusive
289 access and \texttt{remRead} a shared one. \texttt{remRead} corresponds to a
290 resource of another task; the call to \texttt{previous(orwl\_mytid)} is supposed
291 to return the ID of the previous task in the cycle. Both accesses can be
292 performed concurrently with the previous operation, so we insert them with the
293 same priority \texttt{0} as previously.
295 Then, for the specification of the third operation related to handles
296 \texttt{nextRead} and \texttt{curWrite} (lines 13 and 14), we need to
297 use a different priority: the copy operation from data locations \texttt{nextBlock} to
298 \texttt{curBlock} has to be performed after the other operations have
299 terminated (so the priority \texttt{1}).
301 As a final step, we then tell ORWL (line 16) that the specification of all accesses is
302 complete and that it may schedule\index{ORWL!schedule} all these accesses in the respective FIFOs of
304 %\begin{algorithm}[H]
305 % \caption{Dynamic initialization of access mode and priorities.}
306 % \label{algo:ch6p3ORWLinit}
308 \begin{Listing}{algo:ch6p3ORWLinit}{dynamic initialization of access mode and priorities}
309 /* One operation with priority 0 (highest) consists */
310 /* of copying from current to the GPU and running MM, there. */
311 orwl_write_insert(&GPUhandle, ORWL_LOCATION(orwl_mytid, GPU), 0);
312 orwl_read_insert(&curRead, ORWL_LOCATION(orwl_mytid, curBlock), 0);
314 /* Another operation with priority 0 consists */
315 /* of copying from remote to next */
316 orwl_read_insert(&remRead, ORWL_LOCATION(previous(orwl_mytid), curBlock), 0);
317 orwl_write_insert(&nextWrite, ORWL_LOCATION(orwl_mytid,nextBlock), 0);
319 /* One operation with priority 1 consists */
320 /* of copying from next to current */
321 orwl_read_insert(&nextRead, ORWL_LOCATION(orwl_mytid, nextBlock), 1);
322 orwl_write_insert(&curWrite, ORWL_LOCATION(orwl_mytid, curBlock), 1);
328 \subsection{Tasks and operations}
329 \label{sec:ch6p3tasks}
330 With this example we have now seen that ORWL distinguishes
331 \emph{tasks}\index{ORWL!task} and
332 \emph{operations}\index{ORWL!operation}. An ORWL program is divided into tasks that can be seen as the
333 algorithmic units that will concurrently access the resources that the program
334 uses. A task for ORWL is characterized by
336 \item a fixed set of resources that it manages, ``\emph{owns},'' in our example
337 the four resources that are declared in \Lst{algo:ch6p3ORWLresources}.
338 \item a larger set of resources that it accesses, in our example all resources
339 that are used in \Lst{algo:ch6p3ORWLinit}.
340 \item a set of operations that act on these resources, in our example the three
341 operations that are used in \Lst{algo:ch6p3ORWLBCCMM}, and that are
342 elaborated in Listings~\ref{algo:ch6p3ORWLlcopy},~\ref{algo:ch6p3ORWLrcopy},
343 and~\ref{algo:ch6p3ORWLtrans}.
347 Each ORWL operation is characterized by
349 \item one resource, usually one that is owned by the enclosing task, that it
350 accesses exclusively. In our example, operation~1 has exclusive access to the
351 \texttt{next} block, operation~2 has exclusive access the \texttt{GPU}
352 resource, and operation~3 to the \texttt{A} block.
353 \item several resources that are accessed concurrently with others.
355 In fact each ORWL~operation can be viewed as a compute-and-update procedure of a
356 particular resource with knowledge of another set of resources.
362 %%% ispell-dictionary: "american"
364 %%% TeX-master: "../../BookGPU.tex"