/* StarPU --- Runtime system for heterogeneous multicore architectures.
 *
 * Copyright (C) 2010-2018 CNRS
 * Copyright (C) 2009-2011,2014-2015 Université de Bordeaux
 * Copyright (C) 2011-2012 Inria
 *
 * StarPU is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or (at
 * your option) any later version.
 *
 * StarPU is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *
 * See the GNU Lesser General Public License in COPYING.LGPL for more details.
 */

/*! \page cExtensions C Extensions

When GCC plug-in support is available, StarPU builds a plug-in for the GNU Compiler Collection (GCC), which defines extensions to languages of the C family (C, C++, Objective-C) that make it easier to write StarPU code. This feature is only available for GCC 4.5 and later; it is known to work with GCC 4.5, 4.6, and 4.7. You may need to install a specific -dev package of your distro, such as gcc-4.6-plugin-dev on Debian and derivatives. In addition, the plug-in's test suite is only run when GNU Guile (http://www.gnu.org/software/guile/) is found at configure-time. Building the GCC plug-in can be disabled by configuring with \ref disable-gcc-extensions "--disable-gcc-extensions".

Those extensions include syntactic sugar for defining tasks and their implementations, invoking a task, and manipulating data buffers. Use of these extensions can be made conditional on the availability of the plug-in, leading to valid C sequential code when the plug-in is not used (\ref UsingCExtensionsConditionally).

When StarPU has been installed with its GCC plug-in, programs that use these extensions can be compiled this way:

\verbatim
$ gcc -c -fplugin=`pkg-config starpu-1.3 --variable=gccplugin` foo.c
\endverbatim

When the plug-in is not available, the above pkg-config command returns the empty string.

In addition, the -fplugin-arg-starpu-verbose flag can be used to obtain feedback from the compiler as it analyzes the C extensions used in source files.

This section describes the C extensions implemented by StarPU's GCC plug-in. It does not require detailed knowledge of the StarPU library.

Note: this is still an area under development and subject to change.

\section DefiningTasks Defining Tasks

The StarPU GCC plug-in views tasks as "extended" C functions:

tasks may have several implementations---e.g., one for CPUs, one written in OpenCL, one written in CUDA;
tasks may have several implementations of the same target---e.g., several CPU implementations;
when a task is invoked, it may run in parallel, and StarPU is free to choose any of its implementations.

Tasks and their implementations must be declared. These declarations are annotated with attributes (http://gcc.gnu.org/onlinedocs/gcc/Attribute-Syntax.html#Attribute-Syntax): the declaration of a task is a regular C function declaration with an additional task attribute, and task implementations are declared with a task_implementation attribute. The following function attributes are provided:

task: Declare the given function as a StarPU task. Its return type must be void. When a function declared as task has a user-defined body, that body is interpreted as the implicit definition of the task's CPU implementation (see example below). In all cases, the actual definition of a task's body is automatically generated by the compiler. Under the hood, declaring a task leads to the declaration of the corresponding codelet (\ref CodeletAndTasks). If one or more task implementations are declared in the same compilation unit, then the codelet and the function itself are also defined; they inherit the scope of the task. Scalar arguments to the task are passed by value and copied to the target device if need be---technically, they are passed as the buffer starpu_task::cl_arg (\ref CodeletAndTasks). Pointer arguments are assumed to be registered data buffers---the handles argument of a task (starpu_task::handles); const-qualified pointer arguments are viewed as read-only buffers (::STARPU_R), and non-const-qualified buffers are assumed to be used read-write (::STARPU_RW). In addition, the output type attribute can be used as a type qualifier for output pointer or array parameters (::STARPU_W).
task_implementation (target, task): Declare the given function as an implementation of task to run on target. target must be a string, currently one of "cpu", "opencl", or "cuda".

Here is an example:

\code{.c}
#define __output  __attribute__ ((output))

static void matmul (const float *A, const float *B,
                    __output float *C,
                    unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task));

static void matmul_cpu (const float *A, const float *B,
                        __output float *C,
                        unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task_implementation ("cpu", matmul)));

static void
matmul_cpu (const float *A, const float *B, __output float *C,
            unsigned nx, unsigned ny, unsigned nz)
{
  unsigned i, j, k;

  for (j = 0; j < ny; j++)
    for (i = 0; i < nx; i++)
      {
        for (k = 0; k < nz; k++)
          C[j * nx + i] += A[j * nz + k] * B[k * nx + i];
      }
}
\endcode

A matmul task is defined; it has only one implementation, matmul_cpu, which runs on the CPU. Variables A and B are input buffers. C carries the output attribute and is therefore declared as an output buffer (::STARPU_W); since the loop accumulates into C with +=, code that depends on the previous contents of C should omit the output qualifier so that C is treated as an input/output buffer (::STARPU_RW).

For convenience, when a function declared with the task attribute has a user-defined body, that body is assumed to be that of the CPU implementation of a task, which we call an implicit task CPU implementation. Thus, the above snippet can be simplified like this:

\code{.c}
#define __output  __attribute__ ((output))

static void matmul (const float *A, const float *B,
                    __output float *C,
                    unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task));

/* Implicit definition of the CPU implementation of the
   `matmul' task.  */
static void
matmul (const float *A, const float *B, __output float *C,
        unsigned nx, unsigned ny, unsigned nz)
{
  unsigned i, j, k;

  for (j = 0; j < ny; j++)
    for (i = 0; i < nx; i++)
      {
        for (k = 0; k < nz; k++)
          C[j * nx + i] += A[j * nz + k] * B[k * nx + i];
      }
}
\endcode

Use of implicit CPU task implementations as above has the advantage that the code is valid sequential code when StarPU's GCC plug-in is not used (\ref UsingCExtensionsConditionally).
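
The sequential fallback described above can be sketched as follows. This is a minimal illustration, not code from the StarPU distribution: it assumes the STARPU_GCC_PLUGIN macro described in \ref UsingCExtensionsConditionally, and the __task macro is a hypothetical name introduced here. Unlike the snippet above, the loop uses a local accumulator and a plain assignment, so that C is only ever written, in line with its output qualifier.

```c
#include <assert.h>

// Assumption: STARPU_GCC_PLUGIN is defined only when the plug-in is loaded.
// Without it, both macros expand to nothing and this is ordinary C.
#ifdef STARPU_GCC_PLUGIN
# define __task   __attribute__ ((task))
# define __output __attribute__ ((output))
#else
# define __task
# define __output
#endif

// Task declaration; the body below is its implicit CPU implementation.
void matmul (const float *A, const float *B, __output float *C,
             unsigned nx, unsigned ny, unsigned nz) __task;

// C (ny x nx) = A (ny x nz) * B (nz x nx), all stored in row-major order.
void
matmul (const float *A, const float *B, __output float *C,
        unsigned nx, unsigned ny, unsigned nz)
{
  unsigned i, j, k;

  assert (A != 0 && B != 0 && C != 0);

  for (j = 0; j < ny; j++)
    for (i = 0; i < nx; i++)
      {
        // Accumulate locally so that the previous contents of C are
        // never read.
        float sum = 0.0f;

        for (k = 0; k < nz; k++)
          sum += A[j * nz + k] * B[k * nx + i];

        C[j * nx + i] = sum;
      }
}
```

Compiled without -fplugin, a call such as matmul (A, B, C, nx, ny, nz) is a plain synchronous function call. With the plug-in, the same call becomes an asynchronous task invocation, and the pointer arguments must be registered buffers.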
CUDA and OpenCL implementations can be declared in a similar way:

\code{.c}
static void matmul_cuda (const float *A, const float *B, float *C,
                         unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task_implementation ("cuda", matmul)));

static void matmul_opencl (const float *A, const float *B, float *C,
                           unsigned nx, unsigned ny, unsigned nz)
  __attribute__ ((task_implementation ("opencl", matmul)));
\endcode

The CUDA and OpenCL implementations typically either invoke a kernel written in CUDA or OpenCL (for similar code, see \ref CUDAKernel and \ref OpenCLKernel), or call a library function that uses CUDA or OpenCL under the hood, such as CUBLAS functions:

\code{.c}
static void
matmul_cuda (const float *A, const float *B, float *C,
             unsigned nx, unsigned ny, unsigned nz)
{
  // Illustrative call only: CUBLAS requires non-zero leading dimensions
  // (the 0 arguments below), whose values depend on the matrix layout.
  cublasSgemm ('n', 'n', nx, ny, nz,
               1.0f, A, 0, B, 0,
               0.0f, C, 0);
  cudaStreamSynchronize (starpu_cuda_get_local_stream ());
}
\endcode

A task can be invoked like a regular C function:

\code{.c}
matmul (&A[i * zdim * bydim + k * bzdim * bydim],
        &B[k * xdim * bzdim + j * bxdim * bzdim],
        &C[i * xdim * bydim + j * bxdim * bydim],
        bxdim, bydim, bzdim);
\endcode

This leads to an asynchronous invocation, whereby matmul's implementation may run in parallel with the continuation of the caller.

The next section describes how memory buffers must be handled in StarPU-GCC code. For a complete example, see the gcc-plugin/examples directory of the source distribution, and \ref VectorScalingUsingTheCExtension.

\section InitializationTerminationAndSynchronization Initialization, Termination, and Synchronization

The following pragmas allow user code to control StarPU's lifetime and to synchronize with tasks.

\#pragma starpu initialize: Initialize StarPU. This call is compulsory and is never added implicitly. One of the reasons this has to be done explicitly is that it provides greater control to user code over its resource usage.
\#pragma starpu shutdown: Shut down StarPU, giving it an opportunity to write profiling info to a file on disk, for instance (\ref Off-linePerformanceFeedback).
\#pragma starpu wait: Wait for all task invocations to complete, as with starpu_task_wait_for_all().
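
The order in which these three pragmas are typically used can be sketched as follows. This is a minimal sketch under stated assumptions: the __task macro, the greet task, the last_greeting variable and the run_greeting() helper are hypothetical names, and the pragmas are guarded by the STARPU_GCC_PLUGIN macro (\ref UsingCExtensionsConditionally) so that the file also compiles without the plug-in. Reading a global variable written by the task is only meaningful here because the task's sole implementation runs on the CPU.

```c
#include <assert.h>
#include <stdio.h>

#ifdef STARPU_GCC_PLUGIN
# define __task __attribute__ ((task))
#else
# define __task
#endif

// Hypothetical variable used to observe that the task has run.
static int last_greeting = -1;

// A task that takes only a scalar argument, which is passed by value.
void greet (int x) __task;

// Implicit CPU implementation of the greet task.
void
greet (int x)
{
  printf ("Hello from a task, x = %d\n", x);
  last_greeting = x;
}

// Hypothetical helper: initialize, invoke the task, wait, shut down.
int
run_greeting (int x)
{
  int result;

#ifdef STARPU_GCC_PLUGIN
#pragma starpu initialize
#endif

  // Asynchronous with the plug-in, a plain call without it.
  greet (x);

#ifdef STARPU_GCC_PLUGIN
#pragma starpu wait
#endif

  // Once the wait has returned, the task invocation has completed.
  assert (last_greeting == x);
  result = last_greeting;

#ifdef STARPU_GCC_PLUGIN
#pragma starpu shutdown
#endif

  return result;
}
```

Calling run_greeting (42) from main() prints the greeting once and returns 42 in both builds. The initialize pragma comes before the first task invocation, and the shutdown pragma comes after the last wait.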

\section RegisteredDataBuffers Registered Data Buffers

Data buffers such as matrices and vectors that are to be passed to tasks must be registered. Registration allows StarPU to handle data transfers among devices---e.g., transferring an input buffer from the CPU's main memory to a task scheduled to run on a GPU (\ref StarPUDataManagementLibrary). The following pragmas are provided:

\#pragma starpu register ptr [size]: Register ptr as a size-element buffer. When ptr has an array type whose size is known, size may be omitted. Alternatively, the registered attribute can be used (see below).
\#pragma starpu unregister ptr: Unregister the previously-registered memory area pointed to by ptr. As a side-effect, ptr points to a valid copy in main memory.
\#pragma starpu acquire ptr: Acquire in main memory an up-to-date copy of the previously-registered memory area pointed to by ptr, for read-write access.
\#pragma starpu release ptr: Release the previously-registered memory area pointed to by ptr, making it available to the tasks.
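
The four pragmas above can be combined as in the following sketch. It is an illustration under stated assumptions rather than code from the StarPU distribution: the __task macro, the vector_scal task and the scaled_sum() helper are hypothetical names, and all pragmas are guarded by the STARPU_GCC_PLUGIN macro (\ref UsingCExtensionsConditionally) so that the same file is valid sequential C without the plug-in.

```c
#include <assert.h>

#ifdef STARPU_GCC_PLUGIN
# define __task __attribute__ ((task))
#else
# define __task
#endif

#define NX 4

// Task: multiply each element of a read-write buffer by factor.
void vector_scal (unsigned size, float *vector, float factor) __task;

// Implicit CPU implementation of the vector_scal task.
void
vector_scal (unsigned size, float *vector, float factor)
{
  unsigned i;

  assert (vector != 0);

  for (i = 0; i < size; i++)
    vector[i] *= factor;
}

// Hypothetical helper: scale { 1, 2, 3, 4 } and return the sum.
float
scaled_sum (float factor)
{
  float vector[NX] = { 1.0f, 2.0f, 3.0f, 4.0f };
  float sum = 0.0f;
  unsigned i;

#ifdef STARPU_GCC_PLUGIN
#pragma starpu initialize
  // The size is omitted because vector has an array type of known size.
#pragma starpu register vector
#endif

  vector_scal (NX, vector, factor);

#ifdef STARPU_GCC_PLUGIN
  // Obtain an up-to-date copy in main memory before reading it.
#pragma starpu acquire vector
#endif

  for (i = 0; i < NX; i++)
    sum += vector[i];

#ifdef STARPU_GCC_PLUGIN
#pragma starpu release vector
#pragma starpu unregister vector
#pragma starpu shutdown
#endif

  return sum;
}
```

The main program reads the buffer only between the acquire and release pragmas. Outside that window the buffer belongs to StarPU, which may keep its valid copy on another device.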

Additionally, the following attributes offer a simple way to allocate and register storage for arrays:

registered: This attribute applies to local variables with an array type. Its effect is to automatically register the array's storage, as per \#pragma starpu register. The array is automatically unregistered when the variable's scope is left. This attribute is typically used in conjunction with the heap_allocated attribute, described below.
heap_allocated: This attribute applies to local variables with an array type. Its effect is to automatically allocate the array's storage on the heap, using starpu_malloc() under the hood. The heap-allocated array is automatically freed when the variable's scope is left, as with automatic variables.
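
As a small sketch of how the two attributes combine, consider the code below. The names __task, __heap_allocated, __registered, vector_add and shifted_last() are hypothetical and introduced only for illustration; the macros expand to nothing when the STARPU_GCC_PLUGIN macro (\ref UsingCExtensionsConditionally) is not defined, in which case the array is an ordinary variable-length array on the stack.

```c
#include <assert.h>

#ifdef STARPU_GCC_PLUGIN
# define __task           __attribute__ ((task))
# define __heap_allocated __attribute__ ((heap_allocated))
# define __registered     __attribute__ ((registered))
#else
# define __task
# define __heap_allocated
# define __registered
#endif

// Task: add delta to each element of a read-write buffer.
void vector_add (unsigned size, float *vector, float delta) __task;

// Implicit CPU implementation of the vector_add task.
void
vector_add (unsigned size, float *vector, float delta)
{
  unsigned i;

  for (i = 0; i < size; i++)
    vector[i] += delta;
}

// Hypothetical helper: fill { 0, 1, ..., n-1 }, add delta to every
// element, and return the last one.
float
shifted_last (unsigned n, float delta)
{
  float last;
  unsigned i;

  // Keep n small: without the plug-in the array lives on the stack.
  assert (n > 0 && n <= 1024);

#ifdef STARPU_GCC_PLUGIN
#pragma starpu initialize
#endif

  {
    // With the plug-in: allocated on the heap and registered here.
    float vector[n] __heap_allocated __registered;

    for (i = 0; i < n; i++)
      vector[i] = (float) i;

    vector_add (n, vector, delta);

#ifdef STARPU_GCC_PLUGIN
#pragma starpu acquire vector
#endif
    last = vector[n - 1];
#ifdef STARPU_GCC_PLUGIN
#pragma starpu release vector
#endif
  }
  // With the plug-in, vector has been unregistered and freed at this point.

#ifdef STARPU_GCC_PLUGIN
#pragma starpu shutdown
#endif

  return last;
}
```

The inner braces delimit the array's scope, which, per the descriptions above, is where unregistration and deallocation take place.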

The following example illustrates use of the heap_allocated attribute:

\snippet cholesky_pragma.c To be included. You should update doxygen if you see this text.

\section UsingCExtensionsConditionally Using C Extensions Conditionally

The C extensions described in this chapter are only available when GCC and its StarPU plug-in are in use. Yet, it is possible to make use of these extensions when they are available---leading to hybrid CPU/GPU code---and discard them when they are not available---leading to valid sequential code.

To that end, the GCC plug-in defines the C preprocessor macro STARPU_GCC_PLUGIN when it is being used. When defined, this macro expands to an integer denoting the version of the supported C extensions.

The code below illustrates how to define a task and its implementations in a way that allows it to be compiled without the GCC plug-in:

\snippet matmul_pragma.c To be included. You should update doxygen if you see this text.

The above program is a valid StarPU program when StarPU's GCC plug-in is used; it is also a valid sequential program when the plug-in is not used.

Note that attributes such as task as well as starpu pragmas are simply ignored by GCC when the StarPU plug-in is not loaded. However, gcc -Wall emits a warning for unknown attributes and pragmas, which can be inconvenient. In addition, other compilers may be unable to parse the attribute syntax (in practice, Clang and several proprietary compilers do implement attributes), so you may want to wrap attributes in macros like this:

\snippet matmul_pragma2.c To be included. You should update doxygen if you see this text.

*/