/*
 * This file is part of the StarPU Handbook.
 * Copyright (C) 2009--2011 Université de Bordeaux 1
 * Copyright (C) 2010, 2011, 2012, 2013 Centre National de la Recherche Scientifique
 * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
 * See the file version.doxy for copying conditions.
 */

/*! \page cExtensions C Extensions

When GCC plug-in support is available, StarPU builds a plug-in for the
GNU Compiler Collection (GCC), which defines extensions to languages of
the C family (C, C++, Objective-C) that make it easier to write StarPU
code. This feature is only available for GCC 4.5 and later; it
is known to work with GCC 4.5, 4.6, and 4.7. You
may need to install a specific <c>-dev</c> package of your distro, such
as <c>gcc-4.6-plugin-dev</c> on Debian and derivatives. In addition,
the plug-in's test suite is only run when <a href="http://www.gnu.org/software/guile/">GNU Guile</a> is found at
<c>configure</c>-time. Building the GCC plug-in
can be disabled by configuring with <c>--disable-gcc-extensions</c>.

Those extensions include syntactic sugar for defining
tasks and their implementations, invoking a task, and manipulating data
buffers. Use of these extensions can be made conditional on the
availability of the plug-in, leading to valid C sequential code when the
plug-in is not used (\ref Conditional_Extensions).

When StarPU has been installed with its GCC plug-in, programs that use
these extensions can be compiled this way:

\verbatim
$ gcc -c -fplugin=`pkg-config starpu-1.1 --variable=gccplugin` foo.c
\endverbatim

When the plug-in is not available, the above <c>pkg-config</c>
command returns the empty string.

In addition, the <c>-fplugin-arg-starpu-verbose</c> flag can be used to
obtain feedback from the compiler as it analyzes the C extensions used
in source files.

This section describes the C extensions implemented by StarPU's GCC
plug-in. It does not require detailed knowledge of the StarPU library.

Note: as of StarPU @value{VERSION}, this is still an area under
development and subject to change.

\section Defining_Tasks Defining Tasks

The StarPU GCC plug-in views tasks as "extended" C functions:

<ul>
<li>
tasks may have several implementations, e.g., one for CPUs, one written
in OpenCL, one written in CUDA;
</li>
<li>
tasks may have several implementations for the same target, e.g.,
several CPU implementations;
</li>
<li>
when a task is invoked, it may run in parallel, and StarPU is free to
choose any of its implementations.
</li>
</ul>

Tasks and their implementations must be <em>declared</em>. These
declarations are annotated with attributes (see "Attribute Syntax" in
<em>Using the GNU Compiler Collection (GCC)</em>): the declaration of a
task is a regular C function declaration
with an additional <c>task</c> attribute, and task implementations are
declared with a <c>task_implementation</c> attribute.

The following function attributes are provided:

<dl>
<dt><c>task</c></dt>
<dd>
Declare the given function as a StarPU task. Its return type must be
<c>void</c>. When a function declared as <c>task</c> has a user-defined
body, that body is interpreted as the implicit definition of the
task's CPU implementation (see example below). In all cases, the
actual definition of a task's body is automatically generated by the
compiler.

Under the hood, declaring a task leads to the declaration of the
corresponding <c>codelet</c> (see Codelets and Tasks). If one or
more task implementations are declared in the same compilation unit,
then the codelet and the function itself are also defined; they inherit
the scope of the task.

Scalar arguments to the task are passed by value and copied to the
target device if need be; technically, they are passed as the
<c>cl_arg</c> buffer (see <c>cl_arg</c> in Codelets and Tasks).

Pointer arguments are assumed to be registered data buffers, i.e., the
<c>buffers</c> argument of a task (see <c>buffers</c> in Codelets and
Tasks); <c>const</c>-qualified pointer arguments are viewed as
read-only buffers (<c>STARPU_R</c>), and non-<c>const</c>-qualified
pointer arguments are assumed to be used read-write (<c>STARPU_RW</c>). In
addition, the <c>output</c> type attribute can be used as a type qualifier
for output pointer or array parameters (<c>STARPU_W</c>).
</dd>
<dt><c>task_implementation (target, task)</c></dt>
<dd>
Declare the given function as an implementation of <c>task</c> to run on
<c>target</c>. <c>target</c> must be a string, currently one of
<c>"cpu"</c>, <c>"opencl"</c>, or <c>"cuda"</c>.
\internal
FIXME: Update when OpenCL support is ready.
\endinternal
</dd>
</dl>

Here is an example:

\code{.c}
#define __output  __attribute__ ((output))

static void matmul (const float *A, const float *B,
                    __output float *C,
                    unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task));

static void matmul_cpu (const float *A, const float *B,
                        __output float *C,
                        unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task_implementation ("cpu", matmul)));

static void
matmul_cpu (const float *A, const float *B, __output float *C,
            unsigned nx, unsigned ny, unsigned nz)
{
  unsigned i, j, k;

  for (j = 0; j < ny; j++)
    for (i = 0; i < nx; i++)
      {
        /* C is write-only (STARPU_W), so accumulate in a local
           variable rather than reading C's previous contents.  */
        float sum = 0.0f;

        for (k = 0; k < nz; k++)
          sum += A[j * nz + k] * B[k * nx + i];

        C[j * nx + i] = sum;
      }
}
\endcode

A <c>matmul</c> task is defined; it has only one implementation,
<c>matmul_cpu</c>, which runs on the CPU. Variables <c>A</c> and
<c>B</c> are input buffers, whereas <c>C</c> is an output
buffer, since it is declared with the <c>output</c> attribute.
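
To see the three access modes side by side, here is a minimal sketch. The
task <c>scale_and_sum</c>, its parameters, the <c>__task</c> macro, and the
<c>check_scale_and_sum</c> helper are hypothetical names chosen for
illustration; they are not part of StarPU. The attributes are expanded only
when <c>STARPU_GCC_PLUGIN</c> is defined (\ref Conditional_Extensions), so
the file also builds as plain sequential C.

```c
#include <assert.h>

/* Expand the StarPU attributes only when the plug-in is loaded.  */
#ifdef STARPU_GCC_PLUGIN
# define __task   __attribute__ ((task))
# define __output __attribute__ ((output))
#else
# define __task
# define __output
#endif

/* Hypothetical task.  `n' is a scalar, passed by value; `x' is
   const-qualified, hence read-only (STARPU_R); `y' is a plain pointer,
   hence read-write (STARPU_RW); `z' is marked as output (STARPU_W).  */
static void scale_and_sum (unsigned n, const float *x, float *y,
                           __output float *z) __task;

/* Implicit CPU implementation of the task.  */
static void
scale_and_sum (unsigned n, const float *x, float *y, __output float *z)
{
  unsigned i;

  for (i = 0; i < n; i++)
    {
      y[i] *= 2.0f;        /* reads and writes y */
      z[i] = x[i] + y[i];  /* never reads the previous contents of z */
    }
}

#ifndef STARPU_GCC_PLUGIN
/* Sequential self-check, compiled only without the plug-in, where the
   call below is an ordinary synchronous function call.  */
static void
check_scale_and_sum (void)
{
  float x[2] = { 1.0f, 2.0f };
  float y[2] = { 3.0f, 4.0f };
  float z[2];

  scale_and_sum (2, x, y, z);
  assert (y[0] == 6.0f && y[1] == 8.0f);
  assert (z[0] == 7.0f && z[1] == 10.0f);
}
#endif
```

The self-check is excluded from plug-in builds because, with the plug-in,
the buffers would first have to be registered and the call would be
asynchronous.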

For convenience, when a function declared with the <c>task</c> attribute
has a user-defined body, that body is assumed to be that of the CPU
implementation of a task, which we call an implicit task CPU
implementation. Thus, the above snippet can be simplified like this:

\code{.c}
#define __output  __attribute__ ((output))

static void matmul (const float *A, const float *B,
                    __output float *C,
                    unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task));

/* Implicit definition of the CPU implementation of the
   `matmul' task.  */
static void
matmul (const float *A, const float *B, __output float *C,
        unsigned nx, unsigned ny, unsigned nz)
{
  unsigned i, j, k;

  for (j = 0; j < ny; j++)
    for (i = 0; i < nx; i++)
      {
        /* C is write-only (STARPU_W), so accumulate in a local
           variable rather than reading C's previous contents.  */
        float sum = 0.0f;

        for (k = 0; k < nz; k++)
          sum += A[j * nz + k] * B[k * nx + i];

        C[j * nx + i] = sum;
      }
}
\endcode

Use of implicit CPU task implementations as above has the advantage that
the code is valid sequential code when StarPU's GCC plug-in is not used
(\ref Conditional_Extensions).

CUDA and OpenCL implementations can be declared in a similar way:

\code{.c}
static void matmul_cuda (const float *A, const float *B, float *C,
                         unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task_implementation ("cuda", matmul)));

static void matmul_opencl (const float *A, const float *B, float *C,
                           unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task_implementation ("opencl", matmul)));
\endcode

The CUDA and OpenCL implementations typically either invoke a kernel
written in CUDA or OpenCL (for similar code, see CUDA Kernel and
OpenCL Kernel), or call a library function that uses CUDA or
OpenCL under the hood, such as CUBLAS functions:

\code{.c}
static void
matmul_cuda (const float *A, const float *B, float *C,
             unsigned nx, unsigned ny, unsigned nz)
{
  /* A leading dimension of 0 is not valid for CUBLAS.  The values
     below assume densely packed column-major operands, as the
     legacy CUBLAS API expects.  */
  cublasSgemm ('n', 'n', nx, ny, nz,
               1.0f, A, nx, B, nz,
               0.0f, C, nx);
  cudaStreamSynchronize (starpu_cuda_get_local_stream ());
}
\endcode

A task can be invoked like a regular C function:

\code{.c}
matmul (&A[i * zdim * bydim + k * bzdim * bydim],
        &B[k * xdim * bzdim + j * bxdim * bzdim],
        &C[i * xdim * bydim + j * bxdim * bydim],
        bxdim, bydim, bzdim);
\endcode

This leads to an asynchronous invocation, whereby <c>matmul</c>'s
implementation may run in parallel with the continuation of the caller.

\ref Registered_Data_Buffers describes how memory buffers must be handled in
StarPU-GCC code. For a complete example, see the
<c>gcc-plugin/examples</c> directory of the source distribution, and
\ref Vector_Scaling_Using_the_C_Extension.

\section Synchronization_and_Other_Pragmas Initialization, Termination, and Synchronization

The following pragmas allow user code to control StarPU's lifetime and
to synchronize with tasks.

<dl>
<dt><c>\#pragma starpu initialize</c></dt>
<dd>
Initialize StarPU. This call is compulsory and is <em>never</em> added
implicitly. One of the reasons this has to be done explicitly is that
it provides greater control to user code over its resource usage.
</dd>
<dt><c>\#pragma starpu shutdown</c></dt>
<dd>
Shut down StarPU, giving it an opportunity to write profiling info to a
file on disk, for instance (\ref Off-line_performance_feedback).
</dd>
<dt><c>\#pragma starpu wait</c></dt>
<dd>
Wait for all task invocations to complete, as with
starpu_wait_for_all().
</dd>
</dl>

\section Registered_Data_Buffers Registered Data Buffers

Data buffers such as matrices and vectors that are to be passed to tasks
must be registered. Registration allows StarPU to handle data
transfers among devices, e.g., transferring an input buffer from the
CPU's main memory to a task scheduled to run on a GPU (\ref StarPU_Data_Management_Library).

The following pragmas are provided:

<dl>
<dt><c>\#pragma starpu register ptr [size]</c></dt>
<dd>
Register <c>ptr</c> as a <c>size</c>-element buffer. When <c>ptr</c> has
an array type whose size is known, <c>size</c> may be omitted.
Alternatively, the <c>registered</c> attribute can be used (see below).
</dd>
<dt><c>\#pragma starpu unregister ptr</c></dt>
<dd>
Unregister the previously-registered memory area pointed to by
<c>ptr</c>. As a side-effect, <c>ptr</c> points to a valid copy in main
memory.
</dd>
<dt><c>\#pragma starpu acquire ptr</c></dt>
<dd>
Acquire in main memory an up-to-date copy of the previously-registered
memory area pointed to by <c>ptr</c>, for read-write access.
</dd>
<dt><c>\#pragma starpu release ptr</c></dt>
<dd>
Release the previously-registered memory area pointed to by <c>ptr</c>,
making it available to the tasks.
</dd>
</dl>
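
The pragmas of this section and of the previous one can be combined as in
the following sketch. The <c>increment</c> task, the <c>increment_twice</c>
function, and the <c>__task</c> macro are hypothetical names used for
illustration only. The attribute is expanded only when
<c>STARPU_GCC_PLUGIN</c> is defined (\ref Conditional_Extensions); without
the plug-in the pragmas are ignored and the code runs sequentially.

```c
#include <assert.h>

#ifdef STARPU_GCC_PLUGIN
# define __task __attribute__ ((task))
#else
# define __task
#endif

#define N 4

/* Hypothetical task with an implicit CPU implementation; `v' is not
   const-qualified, so it is used read-write.  */
static void increment (unsigned n, float *v) __task;

static void
increment (unsigned n, float *v)
{
  unsigned i;

  for (i = 0; i < n; i++)
    v[i] += 1.0f;
}

/* Run two tasks on a registered buffer, modifying it from the caller in
   between.  Return the final value of the first element.  */
static float
increment_twice (void)
{
  float v[N] = { 0.0f, 1.0f, 2.0f, 3.0f };

#pragma starpu initialize
#pragma starpu register v

  increment (N, v);   /* asynchronous when the plug-in is used */

  /* Get an up-to-date copy in main memory before touching `v'.  */
#pragma starpu acquire v
  assert (v[0] == 1.0f);
  v[0] = 10.0f;
#pragma starpu release v

  increment (N, v);

  /* Wait for the tasks, then get a valid copy back in `v'.  */
#pragma starpu wait
#pragma starpu unregister v
#pragma starpu shutdown

  return v[0];
}
```

The array size is omitted from the <c>register</c> pragma because <c>v</c>
has an array type of known size.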

Additionally, the following attributes offer a simple way to allocate
and register storage for arrays:

<dl>
<dt><c>registered</c></dt>
<dd>
This attribute applies to local variables with an array type. Its
effect is to automatically register the array's storage, as per
<c>\#pragma starpu register</c>. The array is automatically unregistered
when the variable's scope is left. This attribute is typically used in
conjunction with the <c>heap_allocated</c> attribute, described below.
</dd>
<dt><c>heap_allocated</c></dt>
<dd>
This attribute applies to local variables with an array type. Its
effect is to automatically allocate the array's storage on
the heap, using starpu_malloc() under the hood. The heap-allocated array is automatically
freed when the variable's scope is left, as with
automatic variables.
</dd>
</dl>

The following example illustrates use of the <c>heap_allocated</c>
attribute:

\code{.c}
extern void cholesky (unsigned nblocks, unsigned size,
                      float mat[nblocks][nblocks][size])
  __attribute__ ((task));

int
main (int argc, char *argv[])
{
#pragma starpu initialize

  /* ... */

  int nblocks, size;
  parse_args (&nblocks, &size);

  /* Allocate an array of the required size on the heap,
     and register it.  */
  {
    float matrix[nblocks][nblocks][size]
      __attribute__ ((heap_allocated, registered));

    cholesky (nblocks, size, matrix);

#pragma starpu wait

  }   /* MATRIX is automatically unregistered & freed here.  */

#pragma starpu shutdown

  return EXIT_SUCCESS;
}
\endcode
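
As an analogy for the scoping behaviour described above, the sketch below
uses GCC's standard <c>cleanup</c> variable attribute to free a heap buffer
when its variable goes out of scope. It only illustrates the idea of
scope-bound heap storage in plain GNU C. It is not a description of how the
plug-in implements <c>heap_allocated</c>, it uses <c>malloc</c> rather than
starpu_malloc(), and the names <c>release_buffer</c>, <c>live_buffers</c>,
and <c>sum_of_squares</c> are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Number of buffers allocated and not yet freed.  */
static int live_buffers;

/* Cleanup handler: GCC calls it with a pointer to the annotated
   variable when that variable goes out of scope.  */
static void
release_buffer (float **buffer)
{
  free (*buffer);
  live_buffers--;
}

/* Sum the squares 0, 1, 4, ... of the first N integers, using a
   temporary heap buffer.  N must be positive.  */
static float
sum_of_squares (unsigned n)
{
  float sum = 0.0f;

  {
    float *squares __attribute__ ((cleanup (release_buffer)))
      = malloc (n * sizeof *squares);
    unsigned i;

    assert (squares != NULL);
    live_buffers++;

    for (i = 0; i < n; i++)
      {
        squares[i] = (float) (i * i);
        sum += squares[i];
      }
  }   /* release_buffer (&squares) runs here.  */

  assert (live_buffers == 0);
  return sum;
}
```

As with <c>heap_allocated</c>, the caller never frees the storage
explicitly; leaving the inner block is enough.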
\section Conditional_Extensions Using C Extensions Conditionally

The C extensions described in this chapter are only available when GCC
and its StarPU plug-in are in use. Yet, it is possible to make use of
these extensions when they are available, leading to hybrid CPU/GPU
code, and discard them when they are not available, leading to valid
sequential code.

To that end, the GCC plug-in defines a C preprocessor macro when it is
being used:

<dl>
<dt><c>STARPU_GCC_PLUGIN</c></dt>
<dd>
Defined for code being compiled with the StarPU GCC plug-in. When
defined, this macro expands to an integer denoting the version of the
supported C extensions.
</dd>
</dl>
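
As a minimal sketch of how the macro can be tested, the helper below
reports which kind of build is running. The function names
<c>c_extensions_version</c> and <c>report_build_mode</c> are hypothetical,
and using 0 to mean "no plug-in" is an assumption of this example, not
something the plug-in defines.

```c
#include <stdio.h>

/* Hypothetical helper: version of the supported C extensions, or 0
   (a value chosen for this example) when the file is compiled without
   StarPU's GCC plug-in.  */
static int
c_extensions_version (void)
{
#ifdef STARPU_GCC_PLUGIN
  return STARPU_GCC_PLUGIN;
#else
  return 0;
#endif
}

/* Tell the user which kind of build is running.  */
static void
report_build_mode (void)
{
#ifdef STARPU_GCC_PLUGIN
  printf ("StarPU C extensions enabled, version %d\n",
          c_extensions_version ());
#else
  printf ("sequential build: StarPU C extensions are disabled\n");
#endif
}
```

Testing the macro with <c>\#ifdef</c>, rather than comparing its value,
keeps the code valid for compilers that never define it.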

The code below illustrates how to define a task and its implementations
in a way that allows it to be compiled without the GCC plug-in:

\code{.c}
/* This program is valid, whether or not StarPU's GCC plug-in
   is being used.  */

#include <stdlib.h>

/* The attribute below is ignored when GCC is not used.  */
static void matmul (const float *A, const float *B, float * C,
                    unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task));

static void
matmul (const float *A, const float *B, float * C,
        unsigned nx, unsigned ny, unsigned nz)
{
  /* Code of the CPU kernel here...  */
}

#ifdef STARPU_GCC_PLUGIN
/* Optional OpenCL task implementation.  */

static void matmul_opencl (const float *A, const float *B, float * C,
                           unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task_implementation ("opencl", matmul)));

static void
matmul_opencl (const float *A, const float *B, float * C,
               unsigned nx, unsigned ny, unsigned nz)
{
  /* Code that invokes the OpenCL kernel here...  */
}
#endif

int
main (int argc, char *argv[])
{
  /* The pragmas below are simply ignored when StarPU-GCC
     is not used.  */
#pragma starpu initialize

  float A[123][42][7], B[123][42][7], C[123][42][7];

#pragma starpu register A
#pragma starpu register B
#pragma starpu register C

  /* When StarPU-GCC is used, the call below is asynchronous;
     otherwise, it is synchronous.  */
  matmul ((float *) A, (float *) B, (float *) C, 123, 42, 7);

#pragma starpu wait
#pragma starpu shutdown

  return EXIT_SUCCESS;
}
\endcode

The above program is a valid StarPU program when StarPU's GCC plug-in is
used; it is also a valid sequential program when the plug-in is not
used.

Note that attributes such as <c>task</c> as well as <c>starpu</c>
pragmas are simply ignored by GCC when the StarPU plug-in is not loaded.
However, <c>gcc -Wall</c> emits a warning for unknown attributes and
pragmas, which can be inconvenient. In addition, other compilers may be
unable to parse the attribute syntax (in practice, Clang and
several proprietary compilers do implement attributes), so you may want to
wrap attributes in macros like this:

\code{.c}
/* Use the `task' attribute only when StarPU's GCC plug-in
   is available.  */
#ifdef STARPU_GCC_PLUGIN
# define __task  __attribute__ ((task))
#else
# define __task
#endif

static void matmul (const float *A, const float *B, float *C,
                    unsigned nx, unsigned ny, unsigned nz) __task;
\endcode
*/