/*
* This file is part of the StarPU Handbook.
* Copyright (C) 2009--2011 Université de Bordeaux 1
* Copyright (C) 2010, 2011, 2012, 2013 Centre National de la Recherche Scientifique
* Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
* See the file version.doxy for copying conditions.
*/
/*! \page cExtensions C Extensions
When GCC plug-in support is available, StarPU builds a plug-in for the
GNU Compiler Collection (GCC), which defines extensions to languages of
the C family (C, C++, Objective-C) that make it easier to write StarPU
code. This feature is only available for GCC 4.5 and later; it
is known to work with GCC 4.5, 4.6, and 4.7. You
may need to install a specific -dev package of your distro, such
as gcc-4.6-plugin-dev on Debian and derivatives. In addition,
the plug-in's test suite is only run when GNU Guile is found at
configure-time. Building the GCC plug-in
can be disabled by configuring with --disable-gcc-extensions.
Those extensions include syntactic sugar for defining
tasks and their implementations, invoking a task, and manipulating data
buffers. Use of these extensions can be made conditional on the
availability of the plug-in, leading to valid C sequential code when the
plug-in is not used (\ref Conditional_Extensions).
When StarPU has been installed with its GCC plug-in, programs that use
these extensions can be compiled this way:
\verbatim
$ gcc -c -fplugin=`pkg-config starpu-1.1 --variable=gccplugin` foo.c
\endverbatim
When the plug-in is not available, the above pkg-config
command returns the empty string.
In addition, the -fplugin-arg-starpu-verbose flag can be used to
obtain feedback from the compiler as it analyzes the C extensions used
in source files.
This section describes the C extensions implemented by StarPU's GCC
plug-in. It does not require detailed knowledge of the StarPU library.
Note: as of this version of StarPU, this is still an area under
development and subject to change.
\section Defining_Tasks Defining Tasks
The StarPU GCC plug-in views tasks as "extended" C functions:
- tasks may have several implementations, e.g., one for CPUs, one written
in OpenCL, one written in CUDA;
- tasks may have several implementations for the same target, e.g.,
several CPU implementations;
- when a task is invoked, it may run in parallel, and StarPU is free to
choose any of its implementations.
Tasks and their implementations must be declared. These
declarations are annotated with attributes (see "Attribute Syntax" in
Using the GNU Compiler Collection (GCC)): the declaration of a task is
a regular C function declaration with an additional task attribute, and
task implementations are declared with a task_implementation attribute.
The following function attributes are provided:
- task
-
Declare the given function as a StarPU task. Its return type must be
void. When a function declared as task has a user-defined
body, that body is interpreted as the implicit definition of the
task's CPU implementation (see example below). In all cases, the
actual definition of a task's body is automatically generated by the
compiler.
Under the hood, declaring a task leads to the declaration of the
corresponding codelet (see Codelets and Tasks). If one or
more task implementations are declared in the same compilation unit,
then the codelet and the function itself are also defined; they inherit
the scope of the task.
Scalar arguments to the task are passed by value and copied to the
target device if need be; technically, they are passed as the
cl_arg buffer (see cl_arg in Codelets and Tasks).
Pointer arguments are assumed to be registered data buffers, that is, the
buffers argument of a task (see buffers in Codelets and Tasks).
const-qualified pointer arguments are viewed as
read-only buffers (STARPU_R), and non-const-qualified
buffers are assumed to be used read-write (STARPU_RW). In
addition, the output type attribute can be used as a type qualifier
for output pointer or array parameters (STARPU_W).
- task_implementation (target, task)
-
Declare the given function as an implementation of task to run on
target. target must be a string, currently one of
"cpu", "opencl", or "cuda".
\internal
FIXME: Update when OpenCL support is ready.
\endinternal
Here is an example:
\code{.c}
#define __output __attribute__ ((output))

static void matmul (const float *A, const float *B,
                    __output float *C,
                    unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task));

static void matmul_cpu (const float *A, const float *B,
                        __output float *C,
                        unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task_implementation ("cpu", matmul)));

static void
matmul_cpu (const float *A, const float *B, __output float *C,
            unsigned nx, unsigned ny, unsigned nz)
{
  unsigned i, j, k;

  for (j = 0; j < ny; j++)
    for (i = 0; i < nx; i++)
      {
        /* C is write-only, so accumulate in a local variable.  */
        float sum = 0.0f;

        for (k = 0; k < nz; k++)
          sum += A[j * nz + k] * B[k * nx + i];

        C[j * nx + i] = sum;
      }
}
\endcode
A matmul task is defined; it has only one implementation,
matmul_cpu, which runs on the CPU. Variables A and
B are input buffers, whereas C is an output buffer, since it is
declared with the output attribute.
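The three access modes described above can be contrasted in a short sketch. The vector_fill and vector_axpy tasks below are hypothetical and not part of StarPU. The __task and __output macros expand to nothing when the plug-in is not in use (\ref Conditional_Extensions), so the snippet should also build as plain sequential C.

```c
#ifdef STARPU_GCC_PLUGIN
# define __task   __attribute__ ((task))
# define __output __attribute__ ((output))
#else
# define __task
# define __output
#endif

/* N and VALUE are scalars passed by value; V is write-only (STARPU_W)
   because of the `output' attribute.  */
static void vector_fill (unsigned n, float value, __output float *v) __task;

/* X is read-only (STARPU_R) because it is const-qualified; Y is
   read-write (STARPU_RW) because it is neither const nor output.  */
static void vector_axpy (unsigned n, float a, const float *x, float *y)
  __task;

/* Implicit CPU implementation of the hypothetical `vector_fill' task.  */
static void
vector_fill (unsigned n, float value, __output float *v)
{
  unsigned i;

  for (i = 0; i < n; i++)
    v[i] = value;
}

/* Implicit CPU implementation of the hypothetical `vector_axpy' task.  */
static void
vector_axpy (unsigned n, float a, const float *x, float *y)
{
  unsigned i;

  for (i = 0; i < n; i++)
    y[i] += a * x[i];
}
```

Since vector_axpy both reads and updates y, its parameter is left without the output attribute; vector_fill never reads v, so write-only access is sufficient there.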
For convenience, when a function declared with the task attribute
has a user-defined body, that body is assumed to be that of the CPU
implementation of a task, which we call an implicit task CPU
implementation. Thus, the above snippet can be simplified like this:
\code{.c}
#define __output __attribute__ ((output))

static void matmul (const float *A, const float *B,
                    __output float *C,
                    unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task));

/* Implicit definition of the CPU implementation of the
   `matmul' task.  */
static void
matmul (const float *A, const float *B, __output float *C,
        unsigned nx, unsigned ny, unsigned nz)
{
  unsigned i, j, k;

  for (j = 0; j < ny; j++)
    for (i = 0; i < nx; i++)
      {
        /* C is write-only, so accumulate in a local variable.  */
        float sum = 0.0f;

        for (k = 0; k < nz; k++)
          sum += A[j * nz + k] * B[k * nx + i];

        C[j * nx + i] = sum;
      }
}
\endcode
Use of implicit CPU task implementations as above has the advantage that
the code is valid sequential code when StarPU's GCC plug-in is not used
(\ref Conditional_Extensions).
CUDA and OpenCL implementations can be declared in a similar way:
\code{.c}
static void matmul_cuda (const float *A, const float *B, float *C,
                         unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task_implementation ("cuda", matmul)));

static void matmul_opencl (const float *A, const float *B, float *C,
                           unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task_implementation ("opencl", matmul)));
\endcode
The CUDA and OpenCL implementations typically either invoke a kernel
written in CUDA or OpenCL (for similar code, see CUDA Kernel and
OpenCL Kernel), or call a library function that uses CUDA or
OpenCL under the hood, such as CUBLAS functions:
\code{.c}
static void
matmul_cuda (const float *A, const float *B, float *C,
             unsigned nx, unsigned ny, unsigned nz)
{
  cublasSgemm ('n', 'n', nx, ny, nz,
               1.0f, A, 0, B, 0,
               0.0f, C, 0);
  cudaStreamSynchronize (starpu_cuda_get_local_stream ());
}
\endcode
A task can be invoked like a regular C function:
\code{.c}
matmul (&A[i * zdim * bydim + k * bzdim * bydim],
        &B[k * xdim * bzdim + j * bxdim * bzdim],
        &C[i * xdim * bydim + j * bxdim * bydim],
        bxdim, bydim, bzdim);
\endcode
This leads to an asynchronous invocation, whereby matmul's
implementation may run in parallel with the continuation of the caller.
The next section describes how memory buffers must be handled in
StarPU-GCC code. For a complete example, see the
gcc-plugin/examples directory of the source distribution, and
\ref Vector_Scaling_Using_the_C_Extension.
\section Synchronization_and_Other_Pragmas Initialization, Termination, and Synchronization
The following pragmas allow user code to control StarPU's lifetime and
to synchronize with tasks.
- \#pragma starpu initialize
-
Initialize StarPU. This call is compulsory and is never added
implicitly. One of the reasons this has to be done explicitly is that
it provides greater control to user code over its resource usage.
- \#pragma starpu shutdown
-
Shut down StarPU, giving it an opportunity to write profiling info to a
file on disk, for instance (\ref Off-line_performance_feedback).
- \#pragma starpu wait
-
Wait for all task invocations to complete, as with
starpu_wait_for_all().
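The three pragmas typically bracket a batch of task invocations. The following is a minimal sketch under a few assumptions: the record_value task and the run_once driver are hypothetical, the task takes only a scalar so that no buffer needs to be registered, and the CPU implementation is assumed to run in the caller's address space so that its write to a global variable is visible after the wait.

```c
#ifdef STARPU_GCC_PLUGIN
# define __task __attribute__ ((task))
#else
# define __task
#endif

/* Written by the hypothetical task below.  */
static int last_value;

/* A task with a single scalar argument, passed by value.  */
static void record_value (int value) __task;

/* Implicit CPU implementation of `record_value'.  */
static void
record_value (int value)
{
  last_value = value;
}

/* Hypothetical driver: start StarPU, run one task, wait for it,
   and shut StarPU down again.  */
static int
run_once (int value)
{
#pragma starpu initialize

  /* Asynchronous when the plug-in is used, synchronous otherwise.  */
  record_value (value);

  /* Past this point, all task invocations have completed.  */
#pragma starpu wait

#pragma starpu shutdown
  return last_value;
}
```

Without the plug-in the pragmas are ignored and run_once degenerates to an ordinary function call followed by a read of the global.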
\section Registered_Data_Buffers Registered Data Buffers
Data buffers such as matrices and vectors that are to be passed to tasks
must be registered. Registration allows StarPU to handle data
transfers among devices, e.g., transferring an input buffer from the
CPU's main memory to a task scheduled to run on a GPU (\ref StarPU_Data_Management_Library).
The following pragmas are provided:
- \#pragma starpu register ptr [size]
-
Register ptr as a size-element buffer. When ptr has
an array type whose size is known, size may be omitted.
Alternatively, the registered attribute can be used (see below).
- \#pragma starpu unregister ptr
-
Unregister the previously-registered memory area pointed to by
ptr. As a side-effect, ptr points to a valid copy in main
memory.
- \#pragma starpu acquire ptr
-
Acquire in main memory an up-to-date copy of the previously-registered
memory area pointed to by ptr, for read-write access.
- \#pragma starpu release ptr
-
Release the previously-registered memory area pointed to by ptr,
making it available to the tasks.
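The four pragmas above can be combined as in the following sketch. The vector_increment task and the increment_twice helper are hypothetical. The sketch assumes that StarPU has already been initialized by the caller, and that acquire waits for pending tasks that use the buffer, as starpu_data_acquire() does in the StarPU library.

```c
#ifdef STARPU_GCC_PLUGIN
# define __task __attribute__ ((task))
#else
# define __task
#endif

/* V is a read-write buffer (STARPU_RW); N is passed by value.  */
static void vector_increment (unsigned n, float *v) __task;

/* Implicit CPU implementation of `vector_increment'.  */
static void
vector_increment (unsigned n, float *v)
{
  unsigned i;

  for (i = 0; i < n; i++)
    v[i] += 1.0f;
}

/* Hypothetical helper: increment V twice, and return the value of
   V[0] as seen from the host between the two task invocations.  */
static float
increment_twice (float *v, unsigned n)
{
  float first;

  /* V is a plain pointer, so the element count must be given.  */
#pragma starpu register v n

  vector_increment (n, v);

  /* Get an up-to-date copy in main memory before reading it.  */
#pragma starpu acquire v
  first = v[0];
#pragma starpu release v

  /* After the release, tasks may use the buffer again.  */
  vector_increment (n, v);

#pragma starpu wait
#pragma starpu unregister v
  return first;
}
```

Host code only touches v between acquire and release, or after unregister; everywhere else the buffer is left to the tasks.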
Additionally, the following attributes offer a simple way to allocate
and register storage for arrays:
- registered
-
This attribute applies to local variables with an array type. Its
effect is to automatically register the array's storage, as per
\#pragma starpu register. The array is automatically unregistered
when the variable's scope is left. This attribute is typically used in
conjunction with the heap_allocated attribute, described below.
- heap_allocated
-
This attribute applies to local variables with an array type. Its
effect is to automatically allocate the array's storage on
the heap, using starpu_malloc() under the hood. The heap-allocated array is automatically
freed when the variable's scope is left, as with
automatic variables.
The following example illustrates use of the heap_allocated
attribute:
\code{.c}
extern void cholesky (unsigned nblocks, unsigned size,
                      float mat[nblocks][nblocks][size])
  __attribute__ ((task));

int
main (int argc, char *argv[])
{
#pragma starpu initialize

  /* ... */

  int nblocks, size;
  parse_args (&nblocks, &size);

  /* Allocate an array of the required size on the heap,
     and register it.  */
  {
    float matrix[nblocks][nblocks][size]
      __attribute__ ((heap_allocated, registered));

    cholesky (nblocks, size, matrix);

#pragma starpu wait

  }   /* MATRIX is automatically unregistered & freed here.  */

#pragma starpu shutdown

  return EXIT_SUCCESS;
}
\endcode
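The registered attribute can also be used on its own, for a small array that lives on the stack. The following is a sketch only: the vector_fill task and the filled_sum helper are hypothetical, StarPU is assumed to have been initialized by the caller, and acquire is assumed to wait for the pending task before the host reads the array.

```c
#ifdef STARPU_GCC_PLUGIN
# define __task       __attribute__ ((task))
# define __output     __attribute__ ((output))
# define __registered __attribute__ ((registered))
#else
# define __task
# define __output
# define __registered
#endif

/* V is write-only (STARPU_W); N and VALUE are passed by value.  */
static void vector_fill (unsigned n, float value, __output float *v) __task;

/* Implicit CPU implementation of `vector_fill'.  */
static void
vector_fill (unsigned n, float value, __output float *v)
{
  unsigned i;

  for (i = 0; i < n; i++)
    v[i] = value;
}

/* Hypothetical helper: fill a 16-element array with VALUE in a task
   and return the sum of its elements.  */
static float
filled_sum (float value)
{
  float total = 0.0f;
  unsigned i;

  {
    /* Registered here; the array size is known, so no explicit
       `register' pragma with an element count is needed.  */
    float v[16] __registered;

    vector_fill (16, value, v);

#pragma starpu acquire v
    for (i = 0; i < 16; i++)
      total += v[i];
#pragma starpu release v

  }   /* V is automatically unregistered here.  */

  return total;
}
```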
\section Conditional_Extensions Using C Extensions Conditionally
The C extensions described in this chapter are only available when GCC
and its StarPU plug-in are in use. Yet, it is possible to make use of
these extensions when they are available, leading to hybrid CPU/GPU
code, and to discard them when they are not available, leading to valid
sequential code.
To that end, the GCC plug-in defines a C preprocessor macro when it is
being used:
- STARPU_GCC_PLUGIN
-
Defined for code being compiled with the StarPU GCC plug-in. When
defined, this macro expands to an integer denoting the version of the
supported C extensions.
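Because the macro expands to an integer, it can be used both in preprocessor conditionals and as an ordinary expression. A minimal sketch follows; the c_extensions_version helper is hypothetical, and -1 is an arbitrary sentinel chosen here to mean "plug-in not in use".

```c
/* Return the version of the C extensions supported by the plug-in,
   or -1 (a sentinel chosen for this sketch) when the code is being
   compiled without the plug-in.  */
static int
c_extensions_version (void)
{
#ifdef STARPU_GCC_PLUGIN
  return STARPU_GCC_PLUGIN;
#else
  return -1;
#endif
}
```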
The code below illustrates how to define a task and its implementations
in a way that allows it to be compiled without the GCC plug-in:
\code{.c}
/* This program is valid, whether or not StarPU's GCC plug-in
   is being used.  */

#include <stdlib.h>

/* The attribute below is ignored when GCC is not used.  */
static void matmul (const float *A, const float *B, float *C,
                    unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task));

static void
matmul (const float *A, const float *B, float *C,
        unsigned nx, unsigned ny, unsigned nz)
{
  /* Code of the CPU kernel here...  */
}

#ifdef STARPU_GCC_PLUGIN
/* Optional OpenCL task implementation.  */

static void matmul_opencl (const float *A, const float *B, float *C,
                           unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task_implementation ("opencl", matmul)));

static void
matmul_opencl (const float *A, const float *B, float *C,
               unsigned nx, unsigned ny, unsigned nz)
{
  /* Code that invokes the OpenCL kernel here...  */
}
#endif

int
main (int argc, char *argv[])
{
  /* The pragmas below are simply ignored when StarPU-GCC
     is not used.  */
#pragma starpu initialize

  float A[123][42][7], B[123][42][7], C[123][42][7];

#pragma starpu register A
#pragma starpu register B
#pragma starpu register C

  /* When StarPU-GCC is used, the call below is asynchronous;
     otherwise, it is synchronous.  */
  matmul ((float *) A, (float *) B, (float *) C, 123, 42, 7);

#pragma starpu wait
#pragma starpu shutdown

  return EXIT_SUCCESS;
}
\endcode
The above program is a valid StarPU program when StarPU's GCC plug-in is
used; it is also a valid sequential program when the plug-in is not
used.
Note that attributes such as task, as well as starpu
pragmas, are simply ignored by GCC when the StarPU plug-in is not loaded.
However, gcc -Wall emits a warning for unknown attributes and
pragmas, which can be inconvenient. In addition, other compilers may be
unable to parse the attribute syntax (in practice, Clang and
several proprietary compilers do implement attributes), so you may want to
wrap attributes in macros like this:
\code{.c}
/* Use the `task' attribute only when StarPU's GCC plug-in
is available. */
#ifdef STARPU_GCC_PLUGIN
# define __task __attribute__ ((task))
#else
# define __task
#endif
static void matmul (const float *A, const float *B, float *C,
                    unsigned nx, unsigned ny, unsigned nz) __task;
\endcode
*/