\input texinfo @c -*-texinfo-*-

@c %**start of header
@setfilename starpu.info
@settitle StarPU
@c %**end of header

@setchapternewpage odd

@titlepage
@title StarPU
@page
@vskip 0pt plus 1filll
@comment For the @value{version-GCC} Version*
@end titlepage

@summarycontents
@contents
@page

@node Top
@top Preface
@cindex Preface

This manual documents the usage of StarPU.

@comment
@comment  When you add a new menu item, please keep the right hand
@comment  aligned to the same column.  Do not use tabs.  This provides
@comment  better formatting.
@comment
@menu
* Introduction::        A basic introduction to using StarPU.
* Installing StarPU::   How to configure, build and install StarPU
* StarPU API::          The API to use StarPU
* Basic Examples::      Basic examples of the use of StarPU
* Advanced Topics::     Advanced use of StarPU
@end menu

@c ---------------------------------------------------------------------
@c Introduction to StarPU
@c ---------------------------------------------------------------------

@node Introduction
@chapter Introduction to StarPU

@menu
* Motivation::              Why StarPU?
* StarPU in a Nutshell::    The Fundamentals of StarPU
@end menu

@node Motivation
@section Motivation

@c complex machines with heterogeneous cores/devices
The use of specialized hardware such as accelerators or coprocessors offers an
interesting approach to overcome the physical limits encountered by processor
architects. As a result, many machines are now equipped with one or several
accelerators (e.g. a GPU), in addition to the usual processor(s). While a lot of
effort has been devoted to offloading computation onto such accelerators, very
little attention has been paid to portability concerns on the one hand, and to the
possibility of having heterogeneous accelerators and processors interact on the
other hand.

StarPU is a runtime system that offers support for heterogeneous multicore
architectures. It not only offers a unified view of the computational resources
(i.e.
CPUs and accelerators at the same time), but it also takes care of efficiently
mapping and executing tasks on a heterogeneous machine while transparently
handling low-level issues in a portable fashion.

@c this leads to a complicated distributed memory design
@c which is not (easily) manageable by hand

@c added value/benefits of StarPU
@c   - portability
@c   - scheduling, perf. portability

@node StarPU in a Nutshell
@section StarPU in a Nutshell

From a programming point of view, StarPU is not a new language but a library
that executes tasks explicitly submitted by the application. The data that a
task manipulates are automatically transferred onto the accelerator so that the
programmer does not have to take care of complex data movements. StarPU also
takes particular care of scheduling those tasks efficiently and allows
scheduling experts to implement custom scheduling policies in a portable
fashion.

@c explain the notion of codelet and task (ie. g(A, B)
@subsection Codelets and Tasks

One of StarPU's primary data structures is the @b{codelet}. A codelet describes a
computational kernel that can possibly be implemented on multiple architectures
such as a CPU, a CUDA device or a Cell's SPU.

@c TODO insert illustration f : f_spu, f_cpu, ...

Another important data structure is the @b{task}. Executing a StarPU task
consists in applying a codelet on a data set, on one of the architectures on
which the codelet is implemented. In addition to the codelet that it applies, a
task also describes which data are accessed, and how they are accessed during
the computation (read and/or write).
StarPU tasks are asynchronous: submitting a task to StarPU is a non-blocking
operation. The task structure can also specify a @b{callback} function that is
called once StarPU has properly executed the task. It also contains optional
fields that the application may use to give hints to the scheduler (such as
priority levels).
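
As a purely conceptual illustration of these two structures, the following
plain C sketch models a codelet as a set of per-architecture function pointers
and a task as a codelet applied to some data, with an optional callback. Every
name in it (@code{toy_codelet}, @code{toy_task}, @code{toy_execute}, ...) is
hypothetical and none of it is the StarPU API; in particular, the toy executes
the task synchronously on the CPU, whereas StarPU submission is non-blocking.

@example
#include <stdio.h>

/* Hypothetical names throughout: this mimics the concepts, not the StarPU API. */
enum arch @{ ARCH_CPU = 1 << 0, ARCH_CUDA = 1 << 1 @};

/* A codelet: one kernel, possibly implemented for several architectures. */
struct toy_codelet @{
    unsigned where;                       /* bitmask of supported architectures */
    void (*cpu_func)(float *, unsigned);
    void (*cuda_func)(float *, unsigned); /* NULL when not implemented */
@};

/* A task: a codelet applied to a data set, plus an optional callback. */
struct toy_task @{
    struct toy_codelet *cl;
    float *data;
    unsigned n;
    void (*callback)(void *);
    void *callback_arg;
@};

/* CPU implementation of the kernel: negate every element in place. */
static void negate_cpu(float *v, unsigned n)
@{
    for (unsigned i = 0; i < n; i++)
        v[i] = -v[i];
@}

/* Callback: record completion in the int pointed to by arg. */
static void mark_done(void *arg)
@{
    *(int *)arg = 1;
@}

/* Run the task on the CPU if the codelet allows it, then fire the callback.
   Returns 0 on success, -1 if no usable CPU implementation exists. */
static int toy_execute(struct toy_task *t)
@{
    if (!(t->cl->where & ARCH_CPU) || !t->cl->cpu_func)
        return -1;
    t->cl->cpu_func(t->data, t->n);
    if (t->callback)
        t->callback(t->callback_arg);
    return 0;
@}

int main(void)
@{
    struct toy_codelet cl = @{ .where = ARCH_CPU, .cpu_func = negate_cpu,
                              .cuda_func = NULL @};
    float v[2] = @{ 1.0f, -2.0f @};
    int done = 0;
    struct toy_task t = @{ &cl, v, 2, mark_done, &done @};

    if (toy_execute(&t) == 0)
        printf("v = (%.1f, %.1f), done = %d\n", v[0], v[1], done);
    return 0;
@}
@end example

Running this program negates the two elements of @code{v} and sets
@code{done} to 1 through the callback.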

A task may be identified by a unique 64-bit number which we refer to as a @b{tag}.
Task dependencies can be enforced either by means of callback functions, or
by expressing dependencies between tags.

@c TODO insert illustration f(Ar, Brw, Cr) + ..

@c DSM
@subsection StarPU Data Management Library

@c ---------------------------------------------------------------------
@c Installing StarPU
@c ---------------------------------------------------------------------

@node Installing StarPU
@chapter Installing StarPU

StarPU can be built and installed by the standard means of the GNU
autotools. This chapter briefly recalls how these tools can be used to
install StarPU.

@section Configuring StarPU

@subsection Generating Makefiles and configuration scripts

This step is not necessary when using the tarball releases of StarPU. If you
are using the source code from the svn repository, you first need to generate
the configure scripts and the Makefiles.

@example
$ autoreconf -i
@end example

@subsection Configuring StarPU

@example
$ ./configure
@end example

@c TODO enumerate the list of interesting options

@section Building and Installing StarPU

@subsection Building

@example
$ make
@end example

@subsection Sanity Checks

In order to make sure that StarPU is working properly on the system, it is also
possible to run a test suite.

@example
$ make check
@end example

@subsection Installing

In order to install StarPU at the location that was specified during
configuration:

@example
# make install
@end example

@subsection pkg-config configuration

Compiling and linking an application against StarPU may require specific flags
or libraries (for instance @code{CUDA} or @code{libspe2}). The
@code{pkg-config} tool can be used to retrieve them.

If StarPU was not installed at some standard location, the path of StarPU's
library must be specified in the @code{PKG_CONFIG_PATH} environment variable so
that @code{pkg-config} can find it.
So if StarPU was installed in @code{$prefix_dir}:

@example
$ export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$prefix_dir/lib/pkgconfig
@end example

The flags required to compile or link against StarPU are then
accessible with the following commands:

@example
$ pkg-config --cflags libstarpu  # options for the compiler
$ pkg-config --libs libstarpu    # options for the linker
@end example

@c ---------------------------------------------------------------------
@c StarPU API
@c ---------------------------------------------------------------------

@node StarPU API
@chapter StarPU API

@menu
* Initialization and Termination::   Initialization and Termination methods
* Data Library::                     Methods to manipulate data
* Codelets and Tasks::               Methods to construct tasks
* Tags::                             Task dependencies
@end menu

@node Initialization and Termination
@section Initialization and Termination

@menu
* starpu_init::          Initialize StarPU
* struct starpu_conf::   StarPU runtime configuration
* starpu_shutdown::      Terminate StarPU
@end menu

@node starpu_init
@subsection @code{starpu_init} -- Initialize StarPU
@table @asis
@item @emph{Description}:
This is the StarPU initialization method, which must be called prior to any
other StarPU call. It is possible to specify StarPU's configuration (e.g.
scheduling policy, number of cores, ...) by passing a non-null argument. The
default configuration is used if the passed argument is @code{NULL}.
@item @emph{Prototype}:
@code{void starpu_init(struct starpu_conf *conf);}
@end table

@node struct starpu_conf
@subsection @code{struct starpu_conf} -- StarPU runtime configuration
@table @asis
@item @emph{Description}:
TODO
@item @emph{Definition}:
TODO
@end table

@node starpu_shutdown
@subsection @code{starpu_shutdown} -- Terminate StarPU
@table @asis
@item @emph{Description}:
This is the StarPU termination method.
It must be called at the end of the application: statistics
and other post-mortem debugging information are not guaranteed to be
available until this method has been called.
@item @emph{Prototype}:
@code{void starpu_shutdown(void);}
@end table

@node Data Library
@section Data Library

@c data_handle_t
@c void starpu_delete_data(struct starpu_data_state_t *state);

@c user interaction with the DSM
@c   void starpu_sync_data_with_mem(struct starpu_data_state_t *state);
@c   void starpu_notify_data_modification(struct starpu_data_state_t *state, uint32_t modifying_node);

@node Codelets and Tasks
@section Codelets and Tasks

@menu
* starpu_task_create::   Allocate and Initialize a Task
@end menu

@c struct starpu_task
@c struct starpu_codelet

@node starpu_task_create
@subsection @code{starpu_task_create} -- Allocate and Initialize a Task
@table @asis
@item @emph{Description}:
TODO
@item @emph{Prototype}:
@code{struct starpu_task *starpu_task_create(void);}
@end table

@c Callbacks : what can we put in callbacks ?

@node Tags
@section Tags

@menu
* starpu_tag_t::                    Task identifier
* starpu_tag_declare_deps::         Declare the Dependencies of a Tag
* starpu_tag_declare_deps_array::   Declare the Dependencies of a Tag
* starpu_tag_wait::                 Block until a Tag is terminated
* starpu_tag_wait_array::           Block until a set of Tags is terminated
* starpu_tag_remove::               Destroy a Tag
@end menu

@node starpu_tag_t
@subsection @code{starpu_tag_t} -- Task identifier
@c mention the tag_id field of the task structure
@table @asis
@item @emph{Definition}:
TODO
@end table

@node starpu_tag_declare_deps
@subsection @code{starpu_tag_declare_deps} -- Declare the Dependencies of a Tag
@table @asis
@item @emph{Description}:
TODO
@item @emph{Prototype}:
@code{void starpu_tag_declare_deps(starpu_tag_t id, unsigned ndeps, ...);}
@end table

@node starpu_tag_declare_deps_array
@subsection @code{starpu_tag_declare_deps_array} -- Declare the Dependencies of a Tag
@table @asis
@item @emph{Description}:
TODO
@item @emph{Prototype}:
@code{void starpu_tag_declare_deps_array(starpu_tag_t id, unsigned ndeps, starpu_tag_t *array);}
@end table

@node starpu_tag_wait
@subsection @code{starpu_tag_wait} -- Block until a Tag is terminated
@table @asis
@item @emph{Description}:
TODO
@item @emph{Prototype}:
@code{void starpu_tag_wait(starpu_tag_t id);}
@end table

@node starpu_tag_wait_array
@subsection @code{starpu_tag_wait_array} -- Block until a set of Tags is terminated
@table @asis
@item @emph{Description}:
TODO
@item @emph{Prototype}:
@code{void starpu_tag_wait_array(unsigned ntags, starpu_tag_t *id);}
@end table

@node starpu_tag_remove
@subsection @code{starpu_tag_remove} -- Destroy a Tag
@table @asis
@item @emph{Description}:
TODO
@item @emph{Prototype}:
@code{void starpu_tag_remove(starpu_tag_t id);}
@end table

@section Extensions

@subsection CUDA extensions

@c void starpu_malloc_pinned_if_possible(float **A, size_t dim);

@c subsubsection driver API specific calls

@subsection Cell extensions

@c
@c ---------------------------------------------------------------------
@c Basic Examples
@c ---------------------------------------------------------------------

@node Basic Examples
@chapter Basic Examples

@menu
* Compiling and linking::        Compiling and Linking Options
* Hello World::                  Submitting Tasks
* Scaling a Vector::             Manipulating Data
* Scaling a Vector (hybrid)::    Handling Heterogeneous Architectures
@end menu

@node Compiling and linking
@section Compiling and linking options

The Makefile could for instance contain the following lines to define which
options must be given to the compiler and to the linker:

@example
@c @cartouche
CFLAGS+=$$(pkg-config --cflags libstarpu)
LIBS+=$$(pkg-config --libs libstarpu)
@c @end cartouche
@end example

@node Hello World
@section Hello World

In this section, we show how to implement a simple program that submits a task
to StarPU.

@subsection Required Headers

The @code{starpu.h} header should be included in any code using StarPU.

@example
@c @cartouche
#include <starpu.h>
@c @end cartouche
@end example

@subsection Defining a Codelet

@example
@c @cartouche
void cpu_func(starpu_data_interface_t *buffers, void *func_arg)
@{
    float *array = func_arg;

    printf("Hello world (array = @{%f, %f@} )\n", array[0], array[1]);
@}

starpu_codelet cl =
@{
    .where = CORE,
    .core_func = cpu_func,
    .nbuffers = 0
@};
@c @end cartouche
@end example

A codelet is a structure that represents a computational kernel. Such a codelet
may contain an implementation of the same kernel on different architectures
(e.g. CUDA, Cell's SPU, x86, ...).

The @code{.nbuffers} field specifies the number of data buffers that are
manipulated by the codelet: here the codelet does not access or modify any data
that is controlled by our data management library. Note that the argument
passed to the codelet (the @code{.cl_arg} field of the @code{starpu_task}
structure) does not count as a buffer since it is not managed by our data
management library.

@c TODO need a crossref to the proper description of "where" see bla for more ...
We create a codelet which may only be executed on the CPUs. The
@code{.where} field is a bitmask that defines where the codelet may be
executed. Here, the @code{CORE} value means that only CPUs can execute this
codelet (@pxref{Codelets and Tasks} for more details on that field).
When a CPU core executes a codelet, it calls the @code{.core_func} function,
which @emph{must} have the following prototype:

@code{void (*core_func)(starpu_data_interface_t *, void *)}

In this example, we can ignore the first argument of this function, which gives a
description of the input and output buffers (e.g. the size and the location of
the matrices). The second argument is a pointer to a buffer passed as an
argument to the codelet by means of the @code{.cl_arg} field of the
@code{starpu_task} structure. Be aware that this may be a pointer to a
@emph{copy} of the actual buffer, and not the pointer given by the programmer:
if the codelet modifies this buffer, there is no guarantee that the initial
buffer will be modified as well. This implies, for instance, that the buffer
cannot be used as a synchronization medium.

@subsection Submitting a Task

@example
@c @cartouche
void callback_func(void *callback_arg)
@{
    printf("Callback function (arg %p)\n", callback_arg);
@}

int main(int argc, char **argv)
@{
    /* initialize StarPU */
    starpu_init(NULL);

    struct starpu_task *task = starpu_task_create();

    task->cl = &cl;

    float array[2] = @{1.0f, -1.0f@};
    task->cl_arg = &array;
    task->cl_arg_size = 2*sizeof(float);

    task->callback_func = callback_func;
    task->callback_arg = (void *)0x42;

    /* starpu_submit_task will be a blocking call */
    task->synchronous = 1;

    /* submit the task to StarPU */
    starpu_submit_task(task);

    /* terminate StarPU */
    starpu_shutdown();

    return 0;
@}
@c @end cartouche
@end example

Before submitting any tasks to StarPU, @code{starpu_init} must be called.
The @code{NULL} argument specifies that we use the default configuration. Tasks
cannot be submitted after the termination of StarPU by a call to
@code{starpu_shutdown}.

In the example above, a task structure is allocated by a call to
@code{starpu_task_create}. This function only allocates and fills the
corresponding structure with the default settings (@pxref{starpu_task_create}),
but it does not submit the task to StarPU.

@c not really clear ;)
The @code{.cl} field is a pointer to the codelet which the task will
execute: in other words, the codelet structure describes which computational
kernel should be offloaded on the different architectures, and the task
structure is a wrapper containing a codelet and the piece of data on which the
codelet should operate.

The optional @code{.cl_arg} field is a pointer to a buffer (of size
@code{.cl_arg_size}) with some parameters for the kernel
described by the codelet. For instance, if a codelet implements a computational
kernel that multiplies its input vector by a constant, the constant could be
specified by means of this buffer.

Once a task has been executed, an optional callback function can be called.
While the computational kernel could be offloaded on various architectures, the
callback function is always executed on a CPU. The @code{.callback_arg}
pointer is passed as an argument of the callback. The prototype of a callback
function must be:

@example
void (*callback_function)(void *);
@end example

If the @code{.synchronous} field is non-zero, task submission will be
synchronous: the @code{starpu_submit_task} function will not return until the
task has been executed. Note that the @code{starpu_shutdown} method does not
guarantee that asynchronous tasks have been executed before it returns.

@node Scaling a Vector
@section Manipulating Data: Scaling a Vector

The previous example has shown how to submit tasks. In this section we show how
StarPU tasks can manipulate data.
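
For reference, the computation that this section expresses as a StarPU task is
an in-place scaling of a @code{float} vector. A minimal plain C sketch of it,
without any StarPU involvement, could look as follows; the name
@code{scale_vector} and the sample values are ours and only serve as an
illustration.

@example
#include <stdio.h>

/* Multiply each of the n elements of val by factor, in place.
   Hypothetical helper, not part of StarPU. */
static void scale_vector(float *val, unsigned n, float factor)
@{
    unsigned i;

    for (i = 0; i < n; i++)
        val[i] *= factor;
@}

int main(void)
@{
    float tab[4] = @{ 1.0f, 2.0f, 3.0f, 4.0f @};
    unsigned i;

    scale_vector(tab, 4, 2.0f);

    for (i = 0; i < 4; i++)
        printf("%.1f ", tab[i]);
    printf("\n");

    return 0;
@}
@end example

The rest of this section wraps the same loop in a codelet and lets StarPU
manage the vector.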

Programmers can describe the data layout of their application so that StarPU is
responsible for enforcing data coherency and availability across the machine.
Instead of handling complex (and non-portable) mechanisms to perform data
movements, programmers only declare which piece of data is accessed and/or
modified by a task, and StarPU makes sure that when a computational kernel
starts somewhere (e.g. on a GPU), its data are available locally.

Before submitting those tasks, the programmer first needs to declare the
different pieces of data to StarPU using the @code{starpu_register_*_data}
functions. To ease the development of applications for StarPU, it is possible
to describe multiple types of data layout. A type of data layout is called an
@b{interface}. StarPU provides several interfaces by default: here we will
consider the @b{vector interface}.

The following lines show how to declare an array of @code{n} elements of type
@code{float} using the vector interface:

@example
float tab[n];

starpu_data_handle tab_handle;
starpu_register_vector_data(&tab_handle, 0, tab, n, sizeof(float));
@end example

The first argument, called the @b{data handle}, is an opaque pointer which
designates the array in StarPU. This is also the structure which is used to
describe which data is used by a task.
@c TODO: what is 0 ?

It is possible to construct a StarPU task that multiplies this vector by a
constant factor:

@example
float factor;
struct starpu_task *task = starpu_task_create();

task->cl = &cl;

task->buffers[0].handle = tab_handle;
task->buffers[0].mode = STARPU_RW;

task->cl_arg = &factor;
task->cl_arg_size = sizeof(float);
@end example

Since the factor is constant, it does not need a preliminary declaration, and
can just be passed through the @code{cl_arg} pointer like in the previous
example. The vector parameter is described by its handle.
There are two fields in each element of the @code{buffers} array.
@code{.handle} is the handle of the data, and @code{.mode} specifies how the
kernel will access the data (@code{STARPU_R} for read-only, @code{STARPU_W} for
write-only and @code{STARPU_RW} for read and write access).

The definition of the codelet can be written as follows:

@example
void scal_func(starpu_data_interface_t *buffers, void *arg)
@{
    unsigned i;
    float *factor = arg;

    /* length of the vector */
    unsigned n = buffers[0].vector.nx;
    /* local copy of the vector pointer */
    float *val = (float *)buffers[0].vector.ptr;

    for (i = 0; i < n; i++)
        val[i] *= *factor;
@}

starpu_codelet cl = @{
    .where = CORE,
    .core_func = scal_func,
    .nbuffers = 1
@};
@end example

The second argument of the @code{scal_func} function contains a pointer to the
parameters of the codelet (given in @code{task->cl_arg}), so that we read the
constant factor from this pointer. The first argument is an array that gives
a description of every buffer passed in the @code{task->buffers} array, the
number of which is given by the @code{.nbuffers} field of the codelet structure.
In the @b{vector interface}, the location of the vector (resp. its length)
is accessible in the @code{.vector.ptr} (resp. @code{.vector.nx}) field of each
element of this array. Since the vector is accessed in a read-write fashion, any
modification will automatically affect future accesses to that vector made by
other tasks.

@node Scaling a Vector (hybrid)
@section Vector Scaling on a Hybrid CPU/GPU Machine

Contrary to the previous examples, the task submitted in this example may not
only be executed by the CPUs, but also by a CUDA device.

TODO

@c ---------------------------------------------------------------------
@c Advanced Topics
@c ---------------------------------------------------------------------

@node Advanced Topics
@chapter Advanced Topics

@bye