/* StarPU --- Runtime system for heterogeneous multicore architectures.
 *
 * Copyright (C) 2010-2018 CNRS
 * Copyright (C) 2009-2011,2014-2015 Université de Bordeaux
 * Copyright (C) 2011-2012 Inria
 *
 * StarPU is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or (at
 * your option) any later version.
 *
 * StarPU is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * See the GNU Lesser General Public License in COPYING.LGPL for more details.
 */

/*! \page cExtensions C Extensions

When GCC plug-in support is available, StarPU builds a plug-in for the
GNU Compiler Collection (GCC), which defines extensions to languages of
the C family (C, C++, Objective-C) that make it easier to write StarPU
code. This feature is only available for GCC 4.5 and later; it
is known to work with GCC 4.5, 4.6, and 4.7. You
may need to install a specific <c>-dev</c> package of your distro, such
as <c>gcc-4.6-plugin-dev</c> on Debian and derivatives. In addition,
the plug-in's test suite is only run when GNU Guile (http://www.gnu.org/software/guile/)
is found at <c>configure</c>-time. Building the GCC plug-in
can be disabled by configuring with \ref disable-gcc-extensions "--disable-gcc-extensions".

Those extensions include syntactic sugar for defining
tasks and their implementations, invoking a task, and manipulating data
buffers. Use of these extensions can be made conditional on the
availability of the plug-in, leading to valid C sequential code when the
plug-in is not used (\ref UsingCExtensionsConditionally).

When StarPU has been installed with its GCC plug-in, programs that use
these extensions can be compiled this way:

\verbatim
$ gcc -c -fplugin=`pkg-config starpu-1.3 --variable=gccplugin` foo.c
\endverbatim

When the plug-in is not available, the above <c>pkg-config</c>
command returns the empty string.

In addition, the <c>-fplugin-arg-starpu-verbose</c> flag can be used to
obtain feedback from the compiler as it analyzes the C extensions used
in source files.

This section describes the C extensions implemented by StarPU's GCC
plug-in. It does not require detailed knowledge of the StarPU library.

Note: this is still an area under development and subject to change.

\section DefiningTasks Defining Tasks

The StarPU GCC plug-in views tasks as "extended" C functions:

<ul>
<li>
tasks may have several implementations---e.g., one for CPUs, one written
in OpenCL, one written in CUDA;
</li>
<li>
tasks may have several implementations of the same target---e.g.,
several CPU implementations;
</li>
<li>
when a task is invoked, it may run in parallel, and StarPU is free to
choose any of its implementations.
</li>
</ul>

Tasks and their implementations must be <em>declared</em>. These
declarations are annotated with attributes
(http://gcc.gnu.org/onlinedocs/gcc/Attribute-Syntax.html#Attribute-Syntax):
the declaration of a task is a regular C function declaration with an
additional <c>task</c> attribute, and task implementations are
declared with a <c>task_implementation</c> attribute.

The following function attributes are provided:

<dl>

<dt><c>task</c></dt>
<dd>
Declare the given function as a StarPU task. Its return type must be
<c>void</c>. When a function declared as <c>task</c> has a user-defined
body, that body is interpreted as the implicit definition of the
task's CPU implementation (see example below). In all cases, the
actual definition of a task's body is automatically generated by the
compiler.

Under the hood, declaring a task leads to the declaration of the
corresponding <c>codelet</c> (\ref CodeletAndTasks). If one or
more task implementations are declared in the same compilation unit,
then the codelet and the function itself are also defined; they inherit
the scope of the task.

Scalar arguments to the task are passed by value and copied to the
target device if need be; technically, they are passed as the buffer
starpu_task::cl_arg (\ref CodeletAndTasks).

Pointer arguments are assumed to be registered data buffers, that is, the
handles argument of a task (starpu_task::handles); <c>const</c>-qualified
pointer arguments are viewed as read-only buffers (::STARPU_R), and
non-<c>const</c>-qualified buffers are assumed to be used read-write
(::STARPU_RW). In addition, the <c>output</c> type attribute can be
used as a type qualifier for output pointer or array parameters
(::STARPU_W).
</dd>

<dt><c>task_implementation (target, task)</c></dt>
<dd>
Declare the given function as an implementation of <c>task</c> to run on
<c>target</c>. <c>target</c> must be a string, currently one of
<c>"cpu"</c>, <c>"opencl"</c>, or <c>"cuda"</c>.
// FIXME: Update when OpenCL support is ready.
</dd>

</dl>

Here is an example:

\code{.c}
#define __output  __attribute__ ((output))

static void matmul (const float *A, const float *B, __output float *C, unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task));

static void matmul_cpu (const float *A, const float *B, __output float *C, unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task_implementation ("cpu", matmul)));

static void
matmul_cpu (const float *A, const float *B, __output float *C, unsigned nx, unsigned ny, unsigned nz)
{
  unsigned i, j, k;

  for (j = 0; j < ny; j++)
    for (i = 0; i < nx; i++)
      {
        for (k = 0; k < nz; k++)
          C[j * nx + i] += A[j * nz + k] * B[k * nx + i];
      }
}
\endcode

A <c>matmul</c> task is defined; it has only one implementation,
<c>matmul_cpu</c>, which runs on the CPU. Variables <c>A</c> and
<c>B</c> are input buffers, whereas <c>C</c>, which is declared with the
<c>output</c> attribute, is treated as an output buffer.

For convenience, when a function declared with the <c>task</c> attribute
has a user-defined body, that body is assumed to be that of the CPU
implementation of a task, which we call an implicit task CPU
implementation. Thus, the above snippet can be simplified like this:

\code{.c}
#define __output  __attribute__ ((output))

static void matmul (const float *A, const float *B, __output float *C, unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task));

/* Implicit definition of the CPU implementation of the
   `matmul' task.  */
static void matmul (const float *A, const float *B, __output float *C, unsigned nx, unsigned ny, unsigned nz)
{
  unsigned i, j, k;

  for (j = 0; j < ny; j++)
    for (i = 0; i < nx; i++)
      {
        for (k = 0; k < nz; k++)
          C[j * nx + i] += A[j * nz + k] * B[k * nx + i];
      }
}
\endcode

Use of implicit CPU task implementations as above has the advantage that
the code is valid sequential code when StarPU's GCC plug-in is not used
(\ref UsingCExtensionsConditionally).

CUDA and OpenCL implementations can be declared in a similar way:

\code{.c}
static void matmul_cuda (const float *A, const float *B, float *C, unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task_implementation ("cuda", matmul)));

static void matmul_opencl (const float *A, const float *B, float *C, unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task_implementation ("opencl", matmul)));
\endcode

The CUDA and OpenCL implementations typically either invoke a kernel
written in CUDA or OpenCL (for similar code, see \ref CUDAKernel and
\ref OpenCLKernel), or call a library function that uses CUDA or
OpenCL under the hood, such as CUBLAS functions:

\code{.c}
static void matmul_cuda (const float *A, const float *B, float *C, unsigned nx, unsigned ny, unsigned nz)
{
  cublasSgemm ('n', 'n', nx, ny, nz, 1.0f, A, 0, B, 0, 0.0f, C, 0);
  cudaStreamSynchronize (starpu_cuda_get_local_stream ());
}
\endcode

A task can be invoked like a regular C function:

\code{.c}
matmul (&A[i * zdim * bydim + k * bzdim * bydim],
        &B[k * xdim * bzdim + j * bxdim * bzdim],
        &C[i * xdim * bydim + j * bxdim * bydim],
        bxdim, bydim, bzdim);
\endcode

This leads to an asynchronous invocation, whereby <c>matmul</c>'s
implementation may run in parallel with the continuation of the caller.

\ref RegisteredDataBuffers describes how memory buffers must be handled in
StarPU-GCC code. For a complete example, see the
<c>gcc-plugin/examples</c> directory of the source distribution, and
\ref VectorScalingUsingTheCExtension.

\section InitializationTerminationAndSynchronization Initialization, Termination, and Synchronization

The following pragmas allow user code to control StarPU's lifetime and
to synchronize with tasks.

<dl>

<dt><c>\#pragma starpu initialize</c></dt>
<dd>
Initialize StarPU. This call is compulsory and is <em>never</em> added
implicitly. One of the reasons this has to be done explicitly is that
it provides greater control to user code over its resource usage.
</dd>

<dt><c>\#pragma starpu shutdown</c></dt>
<dd>
Shut down StarPU, giving it an opportunity to write profiling info to a
file on disk, for instance (\ref Off-linePerformanceFeedback).
</dd>

<dt><c>\#pragma starpu wait</c></dt>
<dd>
Wait for all task invocations to complete, as with
starpu_task_wait_for_all().
</dd>

</dl>
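
As a rough sketch of how these pragmas bracket a task invocation (this is
not taken from the StarPU sources): the <c>vector_scal</c> task, the
<c>scale_example</c> function, and the <c>__task</c> macro below are
hypothetical names chosen for illustration, and the <c>register</c> and
<c>unregister</c> pragmas are those described in \ref RegisteredDataBuffers.
The <c>\#ifdef</c> guards are only there so that the same file also builds
as plain sequential C when the plug-in is not loaded.

\code{.c}
/* Hypothetical helper macro: expands to the `task' attribute only when
   StarPU's GCC plug-in is loaded.  */
#ifdef STARPU_GCC_PLUGIN
# define __task  __attribute__ ((task))
#else
# define __task
#endif

#define NX 8

/* Hypothetical task: multiply the `size' elements of `vec' by `factor'.
   `vec' is not const-qualified, so it is treated as a read-write buffer.  */
static void vector_scal (unsigned size, float *vec, float factor) __task;

/* Implicit CPU implementation of `vector_scal'.  */
static void
vector_scal (unsigned size, float *vec, float factor)
{
  unsigned i;

  for (i = 0; i < size; i++)
    vec[i] *= factor;
}

/* Scale a small vector and return its last element.  */
float
scale_example (void)
{
  float vec[NX];
  unsigned i;

  for (i = 0; i < NX; i++)
    vec[i] = (float) i;

#ifdef STARPU_GCC_PLUGIN
#pragma starpu initialize       /* compulsory, never added implicitly */
#pragma starpu register vec     /* size is known from the array type */
#endif

  vector_scal (NX, vec, 2.0f);  /* asynchronous when the plug-in is used */

#ifdef STARPU_GCC_PLUGIN
#pragma starpu wait             /* block until the invocation completes */
#pragma starpu unregister vec   /* `vec' is valid in main memory again */
#pragma starpu shutdown
#endif

  return vec[NX - 1];
}
\endcode

Built without the plug-in, the attribute and the pragmas vanish and
<c>scale_example()</c> runs the loop in the calling thread; in both cases
it is expected to return <c>14</c>.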

\section RegisteredDataBuffers Registered Data Buffers

Data buffers such as matrices and vectors that are to be passed to tasks
must be registered. Registration allows StarPU to handle data
transfers among devices, for example transferring an input buffer from the
CPU's main memory to a task scheduled to run on a GPU (\ref StarPUDataManagementLibrary).

The following pragmas are provided:

<dl>

<dt><c>\#pragma starpu register ptr [size]</c></dt>
<dd>
Register <c>ptr</c> as a <c>size</c>-element buffer. When <c>ptr</c> has
an array type whose size is known, <c>size</c> may be omitted.
Alternatively, the <c>registered</c> attribute can be used (see below).
</dd>

<dt><c>\#pragma starpu unregister ptr</c></dt>
<dd>
Unregister the previously-registered memory area pointed to by
<c>ptr</c>. As a side-effect, <c>ptr</c> points to a valid copy in main
memory.
</dd>

<dt><c>\#pragma starpu acquire ptr</c></dt>
<dd>
Acquire in main memory an up-to-date copy of the previously-registered
memory area pointed to by <c>ptr</c>, for read-write access.
</dd>

<dt><c>\#pragma starpu release ptr</c></dt>
<dd>
Release the previously-registered memory area pointed to by <c>ptr</c>,
making it available to the tasks.
</dd>

</dl>
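
The sketch below suggests how <c>acquire</c> and <c>release</c> might be
used to touch a registered buffer from the main program between two task
invocations. The <c>increment</c> task, the <c>acquire_example</c> function,
and the <c>__task</c> macro are hypothetical; the sketch also assumes that,
as with starpu_data_acquire(), acquisition waits for previously invoked
tasks that use the buffer. The <c>\#ifdef</c> guards merely keep the file
valid sequential C without the plug-in.

\code{.c}
/* Hypothetical helper macro, empty when the plug-in is not loaded.  */
#ifdef STARPU_GCC_PLUGIN
# define __task  __attribute__ ((task))
#else
# define __task
#endif

#define NX 4

/* Hypothetical task: add 1 to each of the `size' elements of `vec'.  */
static void increment (unsigned size, float *vec) __task;

static void
increment (unsigned size, float *vec)
{
  unsigned i;

  for (i = 0; i < size; i++)
    vec[i] += 1.0f;
}

float
acquire_example (void)
{
  float vec[NX] = { 0.0f, 0.0f, 0.0f, 0.0f };

#ifdef STARPU_GCC_PLUGIN
#pragma starpu initialize
#pragma starpu register vec
#endif

  increment (NX, vec);          /* every element becomes 1 */

#ifdef STARPU_GCC_PLUGIN
#pragma starpu acquire vec      /* up-to-date copy in main memory */
#endif

  vec[0] = 10.0f;               /* modified by the main program */

#ifdef STARPU_GCC_PLUGIN
#pragma starpu release vec      /* hand the buffer back to the tasks */
#endif

  increment (NX, vec);          /* vec[0] becomes 11, the others 2 */

#ifdef STARPU_GCC_PLUGIN
#pragma starpu wait
#pragma starpu unregister vec
#pragma starpu shutdown
#endif

  return vec[0] + vec[1];
}
\endcode

Under these assumptions <c>acquire_example()</c> returns <c>13</c>, with or
without the plug-in.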

Additionally, the following attributes offer a simple way to allocate
and register storage for arrays:

<dl>

<dt><c>registered</c></dt>
<dd>
This attribute applies to local variables with an array type. Its
effect is to automatically register the array's storage, as per
<c>\#pragma starpu register</c>. The array is automatically unregistered
when the variable's scope is left. This attribute is typically used in
conjunction with the <c>heap_allocated</c> attribute, described below.
</dd>

<dt><c>heap_allocated</c></dt>
<dd>
This attribute applies to local variables with an array type. Its
effect is to automatically allocate the array's storage on
the heap, using starpu_malloc() under the hood. The heap-allocated array is automatically
freed when the variable's scope is left, as with
automatic variables.
</dd>

</dl>

The following example illustrates the use of the <c>heap_allocated</c>
attribute:

\snippet cholesky_pragma.c To be included. You should update doxygen if you see this text.
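
Independently of the snippet above, here is a minimal sketch that combines
both attributes on one local array. The <c>vector_scal</c> task, the
<c>heap_allocated_example</c> function, and the <c>__task</c> and
<c>__starpu_array</c> macros are hypothetical names; the nested block is
there so that the array goes out of scope, and is thus unregistered and
freed, before StarPU is shut down.

\code{.c}
/* Hypothetical helper macros, empty when the plug-in is not loaded.  */
#ifdef STARPU_GCC_PLUGIN
# define __task          __attribute__ ((task))
# define __starpu_array  __attribute__ ((heap_allocated, registered))
#else
# define __task
# define __starpu_array
#endif

#define NX 8

/* Hypothetical task: multiply the `size' elements of `vec' by `factor'.  */
static void vector_scal (unsigned size, float *vec, float factor) __task;

static void
vector_scal (unsigned size, float *vec, float factor)
{
  unsigned i;

  for (i = 0; i < size; i++)
    vec[i] *= factor;
}

float
heap_allocated_example (void)
{
  float last;

#ifdef STARPU_GCC_PLUGIN
#pragma starpu initialize
#endif

  {
    /* With the plug-in: allocated on the heap and registered here.
       Without it: an ordinary automatic array.  */
    float vec[NX] __starpu_array;
    unsigned i;

    for (i = 0; i < NX; i++)
      vec[i] = (float) i;

    vector_scal (NX, vec, 3.0f);

#ifdef STARPU_GCC_PLUGIN
#pragma starpu wait
#pragma starpu acquire vec      /* make sure main memory is up to date */
#endif

    last = vec[NX - 1];

#ifdef STARPU_GCC_PLUGIN
#pragma starpu release vec
#endif
  } /* `vec' is unregistered and freed when its scope is left.  */

#ifdef STARPU_GCC_PLUGIN
#pragma starpu shutdown
#endif

  return last;
}
\endcode

With <c>NX</c> set to 8 and a factor of 3, the function is expected to
return <c>21</c>.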

\section UsingCExtensionsConditionally Using C Extensions Conditionally

The C extensions described in this chapter are only available when GCC
and its StarPU plug-in are in use. Yet, it is possible to make use of
these extensions when they are available (leading to hybrid CPU/GPU
code) and discard them when they are not available (leading to valid
sequential code).

To that end, the GCC plug-in defines the C preprocessor macro
<c>STARPU_GCC_PLUGIN</c> when it is being used. When defined, this
macro expands to an integer denoting the version of the supported C
extensions.

The code below illustrates how to define a task and its implementations
in a way that allows it to be compiled without the GCC plug-in:

\snippet matmul_pragma.c To be included. You should update doxygen if you see this text.

The above program is a valid StarPU program when StarPU's GCC plug-in is
used; it is also a valid sequential program when the plug-in is not
used.
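
As a smaller illustration of the same idea, the sketch below tests
<c>STARPU_GCC_PLUGIN</c> directly around the attribute and the pragmas. The
<c>vector_sum</c> task and the <c>c_extensions_version</c> and
<c>sum_example</c> functions are hypothetical, and returning <c>-1</c> for a
sequential build is an arbitrary convention that assumes version numbers are
never negative.

\code{.c}
/* Hypothetical task: add the `size' elements of `in' to `*out'.  The
   attribute is only spelled out when the plug-in is loaded; otherwise
   this is a plain function prototype.  */
static void vector_sum (unsigned size, const float *in, float *out)
#ifdef STARPU_GCC_PLUGIN
  __attribute__ ((task))
#endif
  ;

static void
vector_sum (unsigned size, const float *in, float *out)
{
  unsigned i;

  for (i = 0; i < size; i++)
    *out += in[i];
}

/* Version of the C extensions in use, or -1 for a sequential build.  */
int
c_extensions_version (void)
{
#ifdef STARPU_GCC_PLUGIN
  return STARPU_GCC_PLUGIN;
#else
  return -1;
#endif
}

float
sum_example (void)
{
  float in[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
  float out[1] = { 0.0f };

#ifdef STARPU_GCC_PLUGIN
#pragma starpu initialize
#pragma starpu register in
#pragma starpu register out
#endif

  vector_sum (4, in, out);

#ifdef STARPU_GCC_PLUGIN
#pragma starpu wait
#pragma starpu unregister out
#pragma starpu unregister in
#pragma starpu shutdown
#endif

  return out[0];
}
\endcode

Compiled without the plug-in, <c>c_extensions_version()</c> returns
<c>-1</c> and <c>sum_example()</c> computes the sum sequentially, returning
<c>10</c>.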

Note that attributes such as <c>task</c> as well as <c>starpu</c>
pragmas are simply ignored by GCC when the StarPU plug-in is not loaded.
However, <c>gcc -Wall</c> emits a warning for unknown attributes and
pragmas, which can be inconvenient. In addition, other compilers may be
unable to parse the attribute syntax (in practice, Clang and
several proprietary compilers implement attributes), so you may want to
wrap attributes in macros like this:

\snippet matmul_pragma2.c To be included. You should update doxygen if you see this text.
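
A minimal sketch of such wrappers, assuming the macro names <c>__task</c>
and <c>__output</c> and a hypothetical <c>vector_fill</c> task (none of
which is mandated by StarPU), could look as follows. The pragmas are guarded
in the same way so that a build without the plug-in sees only standard C.

\code{.c}
/* Use the attributes only when StarPU's GCC plug-in is loaded; otherwise
   the macros expand to nothing.  */
#ifdef STARPU_GCC_PLUGIN
# define __task    __attribute__ ((task))
# define __output  __attribute__ ((output))
#else
# define __task
# define __output
#endif

/* Hypothetical task: store `value' in each of the `size' elements of
   `vec', which is only written to.  */
static void vector_fill (unsigned size, __output float *vec, float value) __task;

static void
vector_fill (unsigned size, __output float *vec, float value)
{
  unsigned i;

  for (i = 0; i < size; i++)
    vec[i] = value;
}

float
fill_example (void)
{
  float vec[4];

#ifdef STARPU_GCC_PLUGIN
#pragma starpu initialize
#pragma starpu register vec
#endif

  vector_fill (4, vec, 2.5f);

#ifdef STARPU_GCC_PLUGIN
#pragma starpu wait
#pragma starpu unregister vec
#pragma starpu shutdown
#endif

  return vec[0] + vec[3];
}
\endcode

Here <c>fill_example()</c> is expected to return <c>5</c> whether or not
the plug-in is in use.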

*/