\input texinfo @c -*-texinfo-*-

@c %**start of header
@setfilename starpu.info
@settitle StarPU
@c %**end of header

@setchapternewpage odd

@titlepage
@title StarPU
@page
@vskip 0pt plus 1filll
@comment For the @value{version-GCC} Version*
@end titlepage

@summarycontents
@contents
@page

@node Top
@top Preface
@cindex Preface

This manual documents the usage of StarPU.


@comment
@comment  When you add a new menu item, please keep the right hand
@comment  aligned to the same column.  Do not use tabs.  This provides
@comment  better formatting.
@comment
@menu
* Introduction::          A basic introduction to using StarPU.
* Installing StarPU::     How to configure, build and install StarPU.
* Configuration options:: Configuration options.
* Environment variables:: Environment variables used by StarPU.
* StarPU API::            The API to use StarPU.
* Basic Examples::        Basic examples of the use of StarPU.
* Advanced Topics::       Advanced use of StarPU.
@end menu

@c ---------------------------------------------------------------------
@c Introduction to StarPU
@c ---------------------------------------------------------------------

@node Introduction
@chapter Introduction to StarPU

@menu
* Motivation::             Why StarPU ?
* StarPU in a Nutshell::   The Fundamentals of StarPU
@end menu

@node Motivation
@section Motivation

@c complex machines with heterogeneous cores/devices
The use of specialized hardware such as accelerators or coprocessors offers an
interesting approach to overcome the physical limits encountered by processor
architects. As a result, many machines are now equipped with one or several
accelerators (e.g.@: a GPU), in addition to the usual processor(s). While a lot of
effort has been devoted to offloading computation onto such accelerators, very
little attention has been paid to portability concerns on the one hand, and to
the possibility of having heterogeneous accelerators and processors interact on
the other hand.

StarPU is a runtime system that offers support for heterogeneous multicore
architectures. It not only offers a unified view of the computational resources
(i.e.@: CPUs and accelerators at the same time), but also takes care of
efficiently mapping and executing tasks onto a heterogeneous machine while
transparently handling low-level issues in a portable fashion.

@c this leads to a complicated distributed memory design
@c which is not (easily) manageable by hand

@c added value/benefits of StarPU
@c   - portability
@c   - scheduling, perf. portability

@node StarPU in a Nutshell
@section StarPU in a Nutshell

From a programming point of view, StarPU is not a new language but a library
that executes tasks explicitly submitted by the application.  The data that a
task manipulates are automatically transferred onto the accelerator so that the
programmer does not have to take care of complex data movements.  StarPU also
takes particular care of scheduling those tasks efficiently and allows
scheduling experts to implement custom scheduling policies in a portable
fashion.

@c explain the notion of codelet and task (ie. g(A, B)
@subsection Codelet and Tasks
One of StarPU's primary data structures is the @b{codelet}. A codelet describes a
computational kernel that can possibly be implemented on multiple architectures
such as a CPU, a CUDA device or a Cell's SPU.

@c TODO insert illustration f : f_spu, f_cpu, ...

Another important data structure is the @b{task}. Executing a StarPU task
consists in applying a codelet to a data set, on one of the architectures on
which the codelet is implemented. In addition to the codelet that a task
implements, it also describes which data are accessed, and how they are
accessed during the computation (read and/or write).
StarPU tasks are asynchronous: submitting a task to StarPU is a non-blocking
operation. The task structure can also specify a @b{callback} function that is
called once StarPU has properly executed the task. It also contains optional
fields that the application may use to give hints to the scheduler (such as
priority levels).

A task may be identified by a unique 64-bit number which we refer to as a @b{tag}.
Task dependencies can be enforced either by the means of callback functions, or
by expressing dependencies between tags.

@c TODO insert illustration f(Ar, Brw, Cr) + ..

@c DSM
@subsection StarPU Data Management Library

Because StarPU schedules tasks at runtime, data transfers have to be
done automatically and ``just-in-time'' between processing units,
relieving the application programmer from explicit data transfers.
Moreover, to avoid unnecessary transfers, StarPU keeps data
where it was last needed, even if it was modified there, and it
allows multiple copies of the same data to reside at the same time on
several processing units as long as it is not modified.

@c ---------------------------------------------------------------------
@c Installing StarPU
@c ---------------------------------------------------------------------

@node Installing StarPU
@chapter Installing StarPU

StarPU can be built and installed by the standard means of the GNU
autotools. This chapter briefly recalls how these tools can be used to
install StarPU.

@section Configuring StarPU

@subsection Generating Makefiles and configuration scripts

This step is not necessary when using the tarball releases of StarPU.  If you
are using the source code from the svn repository, you first need to generate
the configure scripts and the Makefiles.

@example
$ autoreconf -vfi
@end example

@subsection Configuring StarPU

@example
$ ./configure
@end example

@c TODO enumerate the list of interesting options: refer to a specific section

@section Building and Installing StarPU

@subsection Building

@example
$ make
@end example

@subsection Sanity Checks

In order to make sure that StarPU is working properly on the system, it is also
possible to run a test suite.

@example
$ make check
@end example

@subsection Installing

In order to install StarPU at the location that was specified during
configuration:

@example
$ make install
@end example

@subsection pkg-config configuration

Compiling and linking an application against StarPU may require specific
flags or libraries (for instance @code{CUDA} or @code{libspe2}). The
@code{pkg-config} tool can be used to retrieve them.

If StarPU was not installed at some standard location, the directory that
contains StarPU's @code{pkg-config} file (typically the @file{lib/pkgconfig}
subdirectory of the installation prefix) must be added to the
@code{PKG_CONFIG_PATH} environment variable so that @code{pkg-config} can find
it. For instance, if StarPU was installed in @code{$prefix_dir}:

@example
$ export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$prefix_dir/lib/pkgconfig
@end example

The flags required to compile or link against StarPU are then
accessible with the following commands:

@example
$ pkg-config --cflags libstarpu  # options for the compiler
$ pkg-config --libs libstarpu    # options for the linker
@end example
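
As an illustration, both commands can be combined to build a program in a
single step. This is only a sketch: the compiler (@code{gcc}) and the file
names (@file{hello.c}, @file{hello}) are assumptions made for the example.

@example
$ gcc $(pkg-config --cflags libstarpu) hello.c -o hello $(pkg-config --libs libstarpu)
@end example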

@c ---------------------------------------------------------------------
@c Configuration options
@c ---------------------------------------------------------------------

@node Configuration options
@chapter Configuration options

TODO

@c ---------------------------------------------------------------------
@c Environment variables
@c ---------------------------------------------------------------------

@node Environment variables
@chapter Environment variables

@menu
* Workers::     Configuring workers
* Scheduling::  Configuring the Scheduling engine
* Misc::        Miscellaneous and debug
@end menu

TODO. Note that an explicit configuration (passed to @code{starpu_init})
overrides the environment variables.

@node Workers
@section Configuring workers

@menu
* NCPUS     :: Number of CPU workers
* NCUDA     :: Number of CUDA workers
* NGORDON   :: Number of SPU workers (Cell)
* WORKERS_CPUID  :: Bind workers to specific CPUs
* WORKERS_GPUID  :: Select specific CUDA devices
@end menu

@node NCPUS
@subsection @code{NCPUS} -- Number of CPU workers
@table @asis

@item @emph{Description}:
TODO

@end table

@node NCUDA
@subsection @code{NCUDA} -- Number of CUDA workers
@table @asis

@item @emph{Description}:
TODO

@end table

@node NGORDON
@subsection @code{NGORDON} -- Number of SPU workers (Cell)
@table @asis

@item @emph{Description}:
TODO

@end table


@node WORKERS_CPUID
@subsection @code{WORKERS_CPUID} -- Bind workers to specific CPUs
@table @asis

@item @emph{Description}:
TODO

@end table

@node WORKERS_GPUID
@subsection @code{WORKERS_GPUID} -- Select specific CUDA devices
@table @asis

@item @emph{Description}:
TODO

@end table

@node Scheduling
@section Configuring the Scheduling engine

@menu
* SCHED     :: Scheduling policy
* CALIBRATE :: Calibrate performance models
* PREFETCH  :: Use data prefetch
@end menu

@node SCHED
@subsection @code{SCHED} -- Scheduling policy
@table @asis

@item @emph{Description}:
TODO

@end table

@node CALIBRATE
@subsection @code{CALIBRATE} -- Calibrate performance models
@table @asis

@item @emph{Description}:
TODO

@end table

@node PREFETCH
@subsection @code{PREFETCH} -- Use data prefetch
@table @asis

@item @emph{Description}:
TODO

@end table

@node Misc
@section Miscellaneous and debug

@menu
* LOGFILENAME  :: Select debug file name
@end menu

@node LOGFILENAME
@subsection @code{LOGFILENAME} -- Select debug file name
@table @asis

@item @emph{Description}:
TODO

@end table

@c ---------------------------------------------------------------------
@c StarPU API
@c ---------------------------------------------------------------------

@node StarPU API
@chapter StarPU API

@menu
* Initialization and Termination::       Initialization and Termination methods
* Workers' Properties::                  Methods to enumerate workers' properties
* Data Library::                         Methods to manipulate data
* Codelets and Tasks::                   Methods to construct tasks
* Tags::                                 Task dependencies
@end menu

@node Initialization and Termination
@section Initialization and Termination

@menu
* starpu_init::            Initialize StarPU
* struct starpu_conf::     StarPU runtime configuration
* starpu_shutdown::        Terminate StarPU
@end menu

@node starpu_init
@subsection @code{starpu_init} -- Initialize StarPU
@table @asis

@item @emph{Description}:
This is StarPU's initialization method, which must be called prior to any other
StarPU call.  It is possible to specify StarPU's configuration (e.g.@: scheduling
policy, number of cores, ...) by passing a non-null argument. The default
configuration is used if the passed argument is @code{NULL}.
@item @emph{Return value}:
Upon successful completion, this function returns 0. Otherwise, @code{-ENODEV}
indicates that no worker was available (so that StarPU was not initialized).

@item @emph{Prototype}:
@code{int starpu_init(struct starpu_conf *conf);}
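
@item @emph{Example}:
The following minimal sketch initializes StarPU with the default configuration
and terminates it again. It only relies on the functions documented in this
section; the error message is an arbitrary choice made for the example.

@example
#include <errno.h>
#include <stdio.h>
#include <starpu.h>

int main(void)
@{
    /* NULL selects the default configuration. */
    int ret = starpu_init(NULL);
    if (ret == -ENODEV)
    @{
        fprintf(stderr, "StarPU: no worker available\n");
        return 1;
    @}

    /* ... submit tasks here ... */

    starpu_shutdown();
    return 0;
@}
@end example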

@end table

@node struct starpu_conf
@subsection @code{struct starpu_conf} -- StarPU runtime configuration

@table @asis
@item @emph{Description}:
This structure is passed to the @code{starpu_init} function in order to
configure StarPU. When the default value is used, StarPU automatically selects
the number of processing units and takes the default scheduling policy. These
parameters override the equivalent environment variables.

@item @emph{Fields}:
@table @asis 
@item @code{sched_policy} (default = NULL):
This is the name of the scheduling policy. This can also be specified with the
@code{SCHED} environment variable.

@item @code{ncpus} (default = -1):
This is the maximum number of CPU cores that StarPU can use. This can also be
specified with the @code{NCPUS} environment variable.

@item @code{ncuda} (default = -1):
This is the maximum number of CUDA devices that StarPU can use. This can also be
specified with the @code{NCUDA} environment variable.

@item @code{nspus} (default = -1):
This is the maximum number of Cell SPUs that StarPU can use. This can also be
specified with the @code{NGORDON} environment variable.

@item @code{calibrate} (default = 0):
If this flag is set, StarPU will calibrate the performance models when
executing tasks. This can also be specified with the @code{CALIBRATE}
environment variable.
@end table
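
@item @emph{Example}:
As a sketch, the configuration below limits the number of CPU cores and leaves
every other parameter at its documented default. It assumes that the structure
contains only the fields listed above, which is why each of them is set
explicitly; the value 2 is arbitrary.

@example
struct starpu_conf conf;

conf.sched_policy = NULL; /* default scheduling policy */
conf.ncpus = 2;           /* use at most two CPU cores */
conf.ncuda = -1;          /* default value */
conf.nspus = -1;          /* default value */
conf.calibrate = 0;       /* do not calibrate performance models */

if (starpu_init(&conf) == -ENODEV)
    fprintf(stderr, "StarPU: no worker available\n");
@end example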

@end table


@node starpu_shutdown
@subsection @code{starpu_shutdown} -- Terminate StarPU
@table @asis

@item @emph{Description}:
This is StarPU's termination method. It must be called at the end of the
application: statistics and other post-mortem debugging information are not
guaranteed to be available until this method has been called.

@item @emph{Prototype}:
@code{void starpu_shutdown(void);}

@end table

@node Workers' Properties
@section Workers' Properties

@menu
* starpu_get_worker_count:: Get the number of processing units
* starpu_get_worker_id::    Get the identifier of the current worker
* starpu_get_worker_type::  Get the type of processing unit associated to a worker
* starpu_get_worker_name::  Get the name of a worker
@end menu

@node starpu_get_worker_count
@subsection @code{starpu_get_worker_count} -- Get the number of processing units
@table @asis

@item @emph{Description}:
This function returns the number of workers (i.e.@: processing units executing
StarPU tasks). The returned value should be at most @code{STARPU_NMAXWORKERS}.

@item @emph{Prototype}:
@code{unsigned starpu_get_worker_count(void);}

@end table


@node starpu_get_worker_id
@subsection @code{starpu_get_worker_id} -- Get the identifier of the current worker
@table @asis

@item @emph{Description}:
This function returns the identifier of the worker associated to the calling
thread. The returned value is either -1 if the current context is not a StarPU
worker (i.e.@: when called from the application outside a task or a callback), or
an integer between 0 and @code{starpu_get_worker_count() - 1}.

@item @emph{Prototype}:
@code{int starpu_get_worker_id(void);}

@end table

@node starpu_get_worker_type
@subsection @code{starpu_get_worker_type} -- Get the type of processing unit associated to a worker
@table @asis

@item @emph{Description}:
This function returns the type of worker associated to an identifier (as
returned by the @code{starpu_get_worker_id} function). The returned value
indicates the architecture of the worker: @code{STARPU_CORE_WORKER} for a CPU
core, @code{STARPU_CUDA_WORKER} for a CUDA device, and
@code{STARPU_GORDON_WORKER} for a Cell SPU. The value returned for an invalid
identifier is unspecified.

@item @emph{Prototype}:
@code{enum starpu_archtype starpu_get_worker_type(int id);}

@end table

@node starpu_get_worker_name
@subsection @code{starpu_get_worker_name} -- Get the name of a worker
@table @asis

@item @emph{Description}:
StarPU associates a unique human readable string to each processing unit. This
function copies at most the first @code{maxlen} bytes of the unique string
associated to a worker identified by its identifier @code{id} into the
@code{dst} buffer. The caller is responsible for ensuring that @code{dst}
is a valid pointer to a buffer of at least @code{maxlen} bytes. Calling this
function on an invalid identifier results in unspecified behaviour.

@item @emph{Prototype}:
@code{void starpu_get_worker_name(int id, char *dst, size_t maxlen);}
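
@item @emph{Example}:
The functions of this section can be combined to list the available workers,
as sketched below. The buffer size of 64 bytes is an arbitrary assumption, and
the last byte is forced to zero because this manual does not state whether a
truncated name is terminated.

@example
char name[64];
unsigned nworkers = starpu_get_worker_count();
unsigned id;

for (id = 0; id < nworkers; id++)
@{
    starpu_get_worker_name((int)id, name, sizeof(name));
    name[sizeof(name) - 1] = '\0';   /* defensive: force termination */

    if (starpu_get_worker_type((int)id) == STARPU_CUDA_WORKER)
        printf("worker %u: %s (CUDA device)\n", id, name);
    else
        printf("worker %u: %s\n", id, name);
@}
@end example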

@end table

@node Data Library
@section Data Library

@c data_handle_t 

@c void starpu_delete_data(struct starpu_data_state_t *state);

@c user interaction with the DSM
@c   void starpu_sync_data_with_mem(struct starpu_data_state_t *state);
@c   void starpu_notify_data_modification(struct starpu_data_state_t *state, uint32_t modifying_node);

@node Codelets and Tasks
@section Codelets and Tasks

@menu
* struct starpu_codelet::         StarPU codelet structure
* struct starpu_task::            StarPU task structure
* starpu_task_init::              Initialize a Task
* starpu_task_create::            Allocate and Initialize a Task
* starpu_task_destroy::           Destroy a dynamically allocated Task
* starpu_submit_task::            Submit a Task
* starpu_wait_task::              Wait for the termination of a Task
* starpu_wait_all_tasks::         Wait for the termination of all Tasks
@end menu


@c struct starpu_task
@c struct starpu_codelet

@node struct starpu_codelet
@subsection @code{struct starpu_codelet} -- StarPU codelet structure
@table @asis 
@item @emph{Description}:
The codelet structure describes a kernel that is possibly implemented on
various targets.
@item @emph{Fields}:
@table @asis
@item @code{where}: 
Indicates which types of processing units are able to execute that codelet.
@code{CORE|CUDA} for instance indicates that the codelet is implemented for
both CPU cores and CUDA devices while @code{GORDON} indicates that it is only
available on Cell SPUs.

@item @code{core_func} (optional):
Is a function pointer to the CPU implementation of the codelet. Its prototype
must be: @code{void core_func(starpu_data_interface_t *descr, void *arg)}. The
first argument is the array of data managed by the data management library,
and the second argument is a pointer to the argument (possibly a copy of it)
passed from the @code{.cl_arg} field of the @code{starpu_task} structure. This
pointer is ignored if @code{CORE} does not appear in the @code{.where} field;
it must be non-null otherwise.

@item @code{cuda_func} (optional):
Is a function pointer to the CUDA implementation of the codelet. @emph{This
must be a host-function written in the CUDA runtime API}. Its prototype must
be: @code{void cuda_func(starpu_data_interface_t *descr, void *arg)}. This
pointer is ignored if @code{CUDA} does not appear in the @code{.where} field;
it must be non-null otherwise.

@item @code{gordon_func} (optional):
This is the index of the Cell SPU implementation within the Gordon library.
TODO

@item @code{nbuffers}:
Specifies the number of arguments taken by the codelet. These arguments are
managed by the DSM and are accessed from the @code{starpu_data_interface_t *}
array. The constant argument passed with the @code{.cl_arg} field of the
@code{starpu_task} structure is not counted in this number.  This value should
not be above @code{STARPU_NMAXBUFS}.

@item @code{model} (optional):
This is a pointer to the performance model associated to this codelet. This
optional field is ignored when null. TODO

@end table
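
@item @emph{Example}:
The sketch below declares a codelet that is implemented for both CPU cores and
CUDA devices and accesses a single buffer. The two kernel functions are
hypothetical and are only declared here; the type is written as
@code{starpu_codelet}, following the Hello World example of this manual.

@example
/* Hypothetical kernels, defined elsewhere by the application. */
void scal_core_func(starpu_data_interface_t *descr, void *arg);
void scal_cuda_func(starpu_data_interface_t *descr, void *arg);

starpu_codelet scal_cl =
@{
    .where = CORE|CUDA,           /* may run on CPU cores or CUDA devices */
    .core_func = scal_core_func,  /* required because CORE is set */
    .cuda_func = scal_cuda_func,  /* required because CUDA is set */
    .nbuffers = 1                 /* one buffer managed by StarPU */
@};
@end example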
@end table

@node struct starpu_task
@subsection @code{struct starpu_task} -- StarPU task structure
@table @asis
@item @emph{Description}:
The @code{starpu_task} structure describes a task that can be offloaded on the
various processing units managed by StarPU. It instantiates a codelet. It can either be
allocated dynamically with the @code{starpu_task_create} method, or declared
statically. In the latter case, the programmer has to zero the
@code{starpu_task} structure and to fill the different fields properly. The
indicated default values correspond to the configuration of a task allocated
with @code{starpu_task_create}.

@item @emph{Fields}:
@table @asis
@item @code{cl}:
Is a pointer to the corresponding @code{starpu_codelet} data structure. This
describes where the kernel should be executed, and supplies the appropriate
implementations. When set to @code{NULL}, no code is executed during the task;
such empty tasks can be useful for synchronization purposes.

@item @code{buffers}:
TODO

@item @code{cl_arg} (optional) (default = NULL):
TODO

@item @code{cl_arg_size} (optional):
TODO
@c ignored if only executable on CPUs or CUDA ...

@item @code{callback_func} (optional) (default = @code{NULL}):
This is a function pointer of prototype @code{void (*f)(void *)} which
specifies a possible callback. If that pointer is non-null, the callback
function is executed @emph{on the host} after the execution of the task. The
callback is passed the value contained in the @code{callback_arg} field. No
callback is executed if @code{callback_func} is null.

@item @code{callback_arg} (optional) (default = @code{NULL}):
This is the pointer passed to the callback function. This field is ignored if
the @code{callback_func} is null.

@item @code{use_tag} (optional) (default = 0):
If set, this flag indicates that the task should be associated with the tag
contained in the @code{tag_id} field. Tags allow the application to synchronize
with the task and to express task dependencies easily.

@item @code{tag_id}:
This field contains the tag associated to the task if the @code{use_tag} field
was set; it is ignored otherwise.

@item @code{synchronous}:
If this flag is set, the @code{starpu_submit_task} function is blocking and
returns only when the task has been executed (or if no worker is able to
process the task). Otherwise, @code{starpu_submit_task} returns immediately.

@item @code{priority} (optional) (default = @code{DEFAULT_PRIO}):
This field indicates a level of priority for the task. This is an integer value
that must be selected between @code{MIN_PRIO} (for the least important tasks)
and @code{MAX_PRIO} (for the most important tasks) inclusive. The default
priority is @code{DEFAULT_PRIO}.  Scheduling strategies that take priorities
into account can use this parameter to make better scheduling decisions, but
the scheduling policy may also ignore it.

@item @code{execute_on_a_specific_worker} (default = 0):
If this flag is set, StarPU will bypass the scheduler and directly assign this
task to the worker specified by the @code{workerid} field.

@item @code{workerid} (optional):
If the @code{execute_on_a_specific_worker} field is set, this field indicates
the identifier of the worker that should process this task (as
returned by @code{starpu_get_worker_id}). This field is ignored if the
@code{execute_on_a_specific_worker} field is set to 0.

@item @code{detach} (optional) (default = 1):
If this flag is set, it is not possible to synchronize with the task
by the means of @code{starpu_wait_task} later on. If this flag is not set,
internal data structures are only guaranteed to be freed once
@code{starpu_wait_task} is called.

@item @code{destroy} (optional) (default = 1):
If this flag is set, the task structure will automatically be freed, either
after the execution of the callback if the task is detached, or during
@code{starpu_wait_task} otherwise. If this flag is not set, dynamically
allocated data structures will not be freed until @code{starpu_task_destroy}
is called explicitly. Setting this flag for a statically allocated task
structure will result in undefined behaviour.

@end table
@end table

@node starpu_task_init
@subsection @code{starpu_task_init} -- Initialize a Task
@table @asis
@item @emph{Description}:
TODO
@item @emph{Prototype}:
@code{void starpu_task_init(struct starpu_task *task);}
@end table

@node starpu_task_create
@subsection @code{starpu_task_create} -- Allocate and Initialize a Task
@table @asis
@item @emph{Description}:
TODO
(Describe the different default fields ...)
@item @emph{Prototype}:
@code{struct starpu_task *starpu_task_create(void);}
@end table

@node starpu_task_destroy
@subsection @code{starpu_task_destroy} -- Destroy a dynamically allocated Task
@table @asis
@item @emph{Description}:
Free the resources allocated by @code{starpu_task_create}. This function can
be called automatically after the execution of a task by setting the
@code{.destroy} flag of the @code{starpu_task} structure (default behaviour).
Calling this function on a statically allocated task results in undefined
behaviour.

@item @emph{Prototype}:
@code{void starpu_task_destroy(struct starpu_task *task);}
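
@item @emph{Example}:
When the application needs to keep the task structure after its execution, the
automatic destruction can be disabled and the task destroyed by hand, as
sketched below. The codelet @code{cl} is hypothetical and assumed to be defined
elsewhere; error handling is omitted for brevity.

@example
struct starpu_task *task = starpu_task_create();

task->cl = &cl;      /* hypothetical codelet defined elsewhere */
task->detach = 0;    /* allow starpu_wait_task later on */
task->destroy = 0;   /* do not free the task automatically */

starpu_submit_task(task);
starpu_wait_task(task);

/* ... the task structure is still valid here ... */

starpu_task_destroy(task);
@end example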
@end table

@node starpu_wait_task
@subsection @code{starpu_wait_task} -- Wait for the termination of a Task
@table @asis
@item @emph{Description}:
This function blocks until the task has been executed. It is not possible to
synchronize with a task more than once. It is not possible to wait for
synchronous or detached tasks.
@item @emph{Return value}:
Upon successful completion, this function returns 0. Otherwise, @code{-EINVAL}
indicates that the waited task was either synchronous or detached.
@item @emph{Prototype}:
@code{int starpu_wait_task(struct starpu_task *task);}
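
@item @emph{Example}:
The sketch below submits an asynchronous task that is not detached, so that the
application can wait for it explicitly. The codelet @code{cl} is hypothetical
and assumed to be defined elsewhere.

@example
struct starpu_task *task = starpu_task_create();
int ret;

task->cl = &cl;      /* hypothetical codelet defined elsewhere */
task->detach = 0;    /* tasks are detached by default */

ret = starpu_submit_task(task);
if (ret == 0)
@{
    /* ... the application may do other work here ... */
    ret = starpu_wait_task(task);   /* -EINVAL if the task cannot be waited for */
@}

/* The destroy flag is set by default: do not access the task
 * structure after starpu_wait_task has returned. */
@end example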
@end table


@node starpu_submit_task
@subsection @code{starpu_submit_task} -- Submit a Task
@table @asis
@item @emph{Description}:
This function submits task @code{task} to StarPU. Calling this function does
not mean that the task will be executed immediately as there can be data or task
(tag) dependencies that are not fulfilled yet: StarPU will take care to
schedule this task with respect to such dependencies.
This function returns immediately if the @code{synchronous} field of the
@code{starpu_task} structure was set to 0, and blocks until the termination of
the task otherwise. It is also possible to synchronize the application with
asynchronous tasks by the means of tags, using the @code{starpu_tag_wait}
function for instance.

In case of success, this function returns 0. A return value of @code{-ENODEV}
means that there is no worker able to process that task (e.g.@: there is no GPU
available and this task is only implemented on top of CUDA).
@item @emph{Prototype}:
@code{int starpu_submit_task(struct starpu_task *task);}
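
@item @emph{Example}:
As a sketch, a blocking submission that checks for the absence of a suitable
worker could look as follows. The codelet @code{cl} is hypothetical and the
error message is an arbitrary choice.

@example
struct starpu_task *task = starpu_task_create();
int ret;

task->cl = &cl;          /* hypothetical codelet defined elsewhere */
task->synchronous = 1;   /* block until the task has been executed */

ret = starpu_submit_task(task);
if (ret == -ENODEV)
    fprintf(stderr, "no worker is able to execute this task\n");
@end example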
@end table

@node starpu_wait_all_tasks
@subsection @code{starpu_wait_all_tasks} -- Wait for the termination of all Tasks
@table @asis
@item @emph{Description}:
This function blocks until all the tasks that were submitted have terminated.

@item @emph{Prototype}:
@code{void starpu_wait_all_tasks(void);}
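
@item @emph{Example}:
A typical use, sketched below, is to submit a batch of asynchronous tasks and
to wait for all of them at once. The codelet @code{cl} and the number of tasks
are assumptions made for the example; return values are ignored for brevity.

@example
unsigned ntasks = 16;   /* arbitrary */
unsigned i;

for (i = 0; i < ntasks; i++)
@{
    struct starpu_task *task = starpu_task_create();

    task->cl = &cl;     /* hypothetical codelet defined elsewhere */
    starpu_submit_task(task);
@}

/* Returns once every submitted task has terminated. */
starpu_wait_all_tasks();
@end example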
@end table


@c Callbacks : what can we put in callbacks ?

@node Tags
@section Tags

@menu
* starpu_tag_t::                   Task identifier
* starpu_tag_declare_deps::        Declare the Dependencies of a Tag
* starpu_tag_declare_deps_array::  Declare the Dependencies of a Tag
* starpu_tag_wait::                Block until a Tag is terminated
* starpu_tag_wait_array::          Block until a set of Tags is terminated
* starpu_tag_remove::              Destroy a Tag
* starpu_tag_notify_from_apps::    Feed a Tag explicitly
@end menu


@node starpu_tag_t 
@subsection @code{starpu_tag_t} -- Task identifier
@table @asis
@item @emph{Description}:
It is possible to associate a task with a unique ``tag'' and to express
dependencies between tasks by the means of those tags. To do so, fill the
@code{tag_id} field of the @code{starpu_task} structure with a tag number (can
be arbitrary) and set the @code{use_tag} field to 1.

If @code{starpu_tag_declare_deps} is called with that tag number, the task will
not be started until the tasks bearing the declared dependency tags are
complete.
@end table

@node starpu_tag_declare_deps
@subsection @code{starpu_tag_declare_deps} -- Declare the Dependencies of a Tag
@table @asis
@item @emph{Description}:
Specify the dependencies of the task identified by tag @code{id}. The first
argument specifies the tag which is configured, the second argument gives the
number of tag(s) on which @code{id} depends. The following arguments are the
tags which have to terminate to unlock the task.

This function must be called before the associated task is submitted to StarPU
with @code{starpu_submit_task}.

@item @emph{Remark}:
Because of the variable arity of @code{starpu_tag_declare_deps}, note that the
last arguments @emph{must} be of type @code{starpu_tag_t}: constant values
typically need to be explicitly cast. Using the
@code{starpu_tag_declare_deps_array} function avoids this hazard.

@item @emph{Prototype}:
@code{void starpu_tag_declare_deps(starpu_tag_t id, unsigned ndeps, ...);}

@item @emph{Example}:
@example
@c @cartouche
/*  Tag 0x1 depends on tags 0x32 and 0x52 */
starpu_tag_declare_deps((starpu_tag_t)0x1,
        2, (starpu_tag_t)0x32, (starpu_tag_t)0x52);

@c @end cartouche
@end example


@end table

@node starpu_tag_declare_deps_array
@subsection @code{starpu_tag_declare_deps_array} -- Declare the Dependencies of a Tag
@table @asis
@item @emph{Description}:
This function is similar to @code{starpu_tag_declare_deps}, except that it
does not take a variable number of arguments but an array of tags of size
@code{ndeps}.
@item @emph{Prototype}:
@code{void starpu_tag_declare_deps_array(starpu_tag_t id, unsigned ndeps, starpu_tag_t *array);}
@item @emph{Example}:
@example
@c @cartouche
/*  Tag 0x1 depends on tags 0x32 and 0x52 */
starpu_tag_t tag_array[2] = @{0x32, 0x52@};
starpu_tag_declare_deps_array((starpu_tag_t)0x1, 2, tag_array);

@c @end cartouche
@end example


@end table


@node starpu_tag_wait
@subsection @code{starpu_tag_wait} -- Block until a Tag is terminated
@table @asis
@item @emph{Description}:
This function blocks until the task associated to tag @code{id} has been
executed. This is a blocking call which must therefore not be called within
tasks or callbacks, but only from the application directly.  It is possible to
synchronize with the same tag multiple times, as long as the
@code{starpu_tag_remove} function is not called.  Note that it is still
possible to synchronize with a tag associated to a task whose @code{starpu_task}
data structure was freed (e.g.@: if the @code{destroy} flag of the
@code{starpu_task} was enabled).

@item @emph{Prototype}:
@code{void starpu_tag_wait(starpu_tag_t id);}
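
@item @emph{Example}:
The sketch below associates a task with a tag and waits for that tag from the
application. The tag value 0x42 is arbitrary and the codelet @code{cl} is
hypothetical.

@example
struct starpu_task *task = starpu_task_create();

task->cl = &cl;                      /* hypothetical codelet */
task->use_tag = 1;                   /* take tag_id into account */
task->tag_id = (starpu_tag_t)0x42;

if (starpu_submit_task(task) == 0)
    starpu_tag_wait((starpu_tag_t)0x42);
@end example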
@end table

@node starpu_tag_wait_array
@subsection @code{starpu_tag_wait_array} -- Block until a set of Tags is terminated
@table @asis
@item @emph{Description}:
This function is similar to @code{starpu_tag_wait} except that it blocks until
@emph{all} the @code{ntags} tags contained in the @code{id} array are
terminated.
@item @emph{Prototype}:
@code{void starpu_tag_wait_array(unsigned ntags, starpu_tag_t *id);}
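
@item @emph{Example}:
As a sketch, waiting for two tags at once could be written as follows. The tag
values are arbitrary and are assumed to have been attached to tasks submitted
earlier.

@example
/*  Block until the tasks tagged 0x32 and 0x52 are both terminated */
starpu_tag_t tag_array[2] = @{0x32, 0x52@};
starpu_tag_wait_array(2, tag_array);
@end example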
@end table


@node starpu_tag_remove
@subsection @code{starpu_tag_remove} -- Destroy a Tag
@table @asis
@item @emph{Description}:
This function releases the resources associated to tag @code{id}. It can be
called once the corresponding task has been executed and when there is no tag
that depends on that one anymore.
@item @emph{Prototype}:
@code{void starpu_tag_remove(starpu_tag_t id);}
@end table

@node starpu_tag_notify_from_apps
@subsection @code{starpu_tag_notify_from_apps} -- Feed a Tag explicitly
@table @asis
@item @emph{Description}:
This function explicitly unlocks tag @code{id}. It may be useful in the
case of applications which execute part of their computation outside StarPU
tasks (e.g.@: third-party libraries).  It is also provided as a
convenient tool for the programmer, for instance to entirely construct the task
DAG before actually giving StarPU the opportunity to execute the tasks.
@item @emph{Prototype}:
@code{void starpu_tag_notify_from_apps(starpu_tag_t id);}
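
@item @emph{Example}:
The sketch below holds a task back until the application releases it. It
assumes that a tag which is not attached to any task can be used as an ordinary
dependency; the tag values are arbitrary and the codelet @code{cl} is
hypothetical.

@example
struct starpu_task *task = starpu_task_create();

/*  Tag 0x1 depends on tag 0x2, which is not attached to any task */
starpu_tag_declare_deps((starpu_tag_t)0x1, 1, (starpu_tag_t)0x2);

task->cl = &cl;                     /* hypothetical codelet */
task->use_tag = 1;
task->tag_id = (starpu_tag_t)0x1;
starpu_submit_task(task);           /* held back until tag 0x2 is unlocked */

/* ... submit other tasks, or run code outside StarPU ... */

starpu_tag_notify_from_apps((starpu_tag_t)0x2);
@end example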
@end table


@section Extensions

@subsection CUDA extensions

@c void starpu_malloc_pinned_if_possible(float **A, size_t dim);

@subsection Cell extensions

@c ---------------------------------------------------------------------
@c Basic Examples
@c ---------------------------------------------------------------------

@node Basic Examples
@chapter Basic Examples

@menu
* Compiling and linking::        Compiling and Linking Options
* Hello World::                  Submitting Tasks
* Scaling a Vector::             Manipulating Data
* Scaling a Vector (hybrid)::    Handling Heterogeneous Architectures
@end menu

@node Compiling and linking
@section Compiling and linking options

The Makefile could for instance contain the following lines to define which
options must be given to the compiler and to the linker:

@example
@c @cartouche
CFLAGS+=$$(pkg-config --cflags libstarpu)
LIBS+=$$(pkg-config --libs libstarpu)
@c @end cartouche
@end example

@node Hello World
@section Hello World

In this section, we show how to implement a simple program that submits a task to StarPU.

@subsection Required Headers

The @code{starpu.h} header should be included in any code using StarPU.

@example 
@c @cartouche
#include <starpu.h>
@c @end cartouche
@end example


@subsection Defining a Codelet

@example
@c @cartouche
void cpu_func(starpu_data_interface_t *buffers, void *func_arg)
@{
    float *array = func_arg;

    printf("Hello world (array = @{%f, %f@} )\n", array[0], array[1]);
@}

starpu_codelet cl =
@{
    .where = CORE,
    .core_func = cpu_func,
    .nbuffers = 0
@};
@c @end cartouche
@end example

A codelet is a structure that represents a computational kernel. Such a codelet
may contain an implementation of the same kernel on different architectures
(eg. CUDA, Cell's SPU, x86, ...).

The ''@code{.nbuffers}'' field specifies the number of data buffers that are
manipulated by the codelet: here the codelet does not access or modify any data
that is controlled by our data management library. Note that the argument
passed to the codelet (the ''@code{.cl_arg}'' field of the @code{starpu_task}
structure) does not count as a buffer since it is not managed by our data
management library. 

@c TODO need a crossref to the proper description of "where" see bla for more ...
We create a codelet which may only be executed on the CPUs. The ''@code{.where}''
field is a bitmask that defines where the codelet may be executed. Here, the
@code{CORE} value means that only CPUs can execute this codelet
(@pxref{Codelets and Tasks} for more details on that field).
When a CPU core executes a codelet, it calls the @code{.core_func} function,
which @emph{must} have the following prototype:

@code{void (*core_func)(starpu_data_interface_t *, void *)}

In this example, we can ignore the first argument of this function which gives a
description of the input and output buffers (eg. the size and the location of
the matrices). The second argument is a pointer to a buffer passed as an
argument to the codelet by the means of the ''@code{.cl_arg}'' field of the
@code{starpu_task} structure. Be aware that this may be a pointer to a
@emph{copy} of the actual buffer, and not the pointer given by the programmer.
If the codelet modifies this buffer, there is no guarantee that the initial
buffer will be modified as well; in particular, this implies that the buffer
cannot be used as a synchronization medium.

@subsection Submitting a Task

@example
@c @cartouche
void callback_func(void *callback_arg)
@{
    printf("Callback function (arg %p)\n", callback_arg);
@}

int main(int argc, char **argv)
@{
    /* initialize StarPU */
    starpu_init(NULL);

    struct starpu_task *task = starpu_task_create();

    task->cl = &cl;
    
    float array[2] = @{1.0f, -1.0f@};
    task->cl_arg = &array;
    task->cl_arg_size = 2*sizeof(float);

    task->callback_func = callback_func;
    task->callback_arg = (void *) 0x42;

    /* starpu_submit_task will be a blocking call */
    task->synchronous = 1;

    /* submit the task to StarPU */
    starpu_submit_task(task);

    /* terminate StarPU */
    starpu_shutdown();

    return 0;
@}
@c @end cartouche
@end example

Before submitting any tasks to StarPU, @code{starpu_init} must be called. The
@code{NULL} argument specifies that we use the default configuration. Tasks cannot
be submitted after the termination of StarPU by a call to
@code{starpu_shutdown}.

In the example above, a task structure is allocated by a call to
@code{starpu_task_create}. This function only allocates and fills the
corresponding structure with the default settings (@pxref{starpu_task_create}),
but it does not submit the task to StarPU.

@c not really clear ;)
The ''@code{.cl}'' field is a pointer to the codelet which the task will
execute: in other words, the codelet structure describes which computational
kernel should be offloaded on the different architectures, and the task
structure is a wrapper containing a codelet and the piece of data on which the
codelet should operate.

The optional ''@code{.cl_arg}'' field is a pointer to a buffer (of size
@code{.cl_arg_size}) with some parameters for the kernel
described by the codelet. For instance, if a codelet implements a computational
kernel that multiplies its input vector by a constant, the constant could be
specified by the means of this buffer.

Once a task has been executed, an optional callback function can be called.
While the computational kernel could be offloaded on various architectures, the
callback function is always executed on a CPU. The ''@code{.callback_arg}''
pointer is passed as an argument of the callback. The prototype of a callback
function must be:
@example
void (*callback_function)(void *);
@end example

If the @code{.synchronous} field is non-zero, task submission will be
synchronous: the @code{starpu_submit_task} function will not return until the
task has been executed. Note that the @code{starpu_shutdown} function does not
guarantee that asynchronous tasks have been executed before it returns.

@node Scaling a Vector
@section Manipulating Data: Scaling a Vector

The previous example has shown how to submit tasks. In this section we show how
StarPU tasks can manipulate data.

Programmers can describe the data layout of their application so that StarPU is
responsible for enforcing data coherency and availability across the machine.
Instead of handling complex (and non-portable) mechanisms to perform data
movements, programmers only declare which piece of data is accessed and/or
modified by a task, and StarPU makes sure that when a computational kernel
starts somewhere (eg. on a GPU), its data are available locally.

Before submitting those tasks, the programmer first needs to declare the
different pieces of data to StarPU using the @code{starpu_register_*_data}
functions. To ease the development of applications for StarPU, it is possible
to describe multiple types of data layout. A type of data layout is called an
@b{interface}. StarPU provides several interfaces by default;
here we will consider the @b{vector interface}.

The following lines show how to declare an array of @code{n} elements of type
@code{float} using the vector interface:
@example
float tab[n];

starpu_data_handle tab_handle;
starpu_register_vector_data(&tab_handle, 0, tab, n, sizeof(float));
@end example

The first argument, called the @b{data handle}, is an opaque pointer which
designates the array in StarPU. This is also the structure which is used to
describe which data is used by a task.
@c TODO: what is 0 ?
It is possible to construct a StarPU
task that multiplies this vector by a constant factor:
@example
float factor = 3.0f; /* any scaling constant */
struct starpu_task *task = starpu_task_create();

task->cl = &cl;

task->buffers[0].handle = tab_handle;
task->buffers[0].mode = STARPU_RW;

task->cl_arg = &factor;
task->cl_arg_size = sizeof(float);
@end example

Since the factor is constant, it does not need a preliminary declaration, and
can just be passed through the @code{cl_arg} pointer like in the previous
example.  The vector parameter is described by its handle.
There are two fields in each element of the @code{buffers} array.
@code{.handle} is the handle of the data, and @code{.mode} specifies how the
kernel will access the data (@code{STARPU_R} for read-only, @code{STARPU_W} for
write-only and @code{STARPU_RW} for read and write access).

The definition of the codelet can be written as follows:

@example
void scal_func(starpu_data_interface_t *buffers, void *arg)
@{
    unsigned i;
    float *factor = arg;

    /* length of the vector */
    unsigned n = buffers[0].vector.nx;
    /* local copy of the vector pointer */
    float *val = (float *)buffers[0].vector.ptr;

    for (i = 0; i < n; i++)
        val[i] *= *factor;
@}

starpu_codelet cl = @{
    .where = CORE,
    .core_func = scal_func,
    .nbuffers = 1
@};
@end example


The second argument of the @code{scal_func} function contains a pointer to the
parameters of the codelet (given in @code{task->cl_arg}), so that we read the
constant factor from this pointer. The first argument is an array that describes
every buffer passed in the @code{task->buffers} array; the number of buffers
is given by the @code{.nbuffers} field of the codelet structure.
In the @b{vector interface}, the location of the vector (resp. its length)
is accessible in the @code{.vector.ptr} (resp. @code{.vector.nx}) field of each
element of this array. Since the vector is accessed in a read-write fashion, any modification
will automatically affect future accesses to that vector made by other tasks.

@node Scaling a Vector (hybrid)
@section Vector Scaling on a Hybrid CPU/GPU Machine

Contrary to the previous examples, the task submitted in this example may not
only be executed by the CPUs, but also by a CUDA device.

TODO

@c ---------------------------------------------------------------------
@c Advanced Topics
@c ---------------------------------------------------------------------

@node Advanced Topics
@chapter Advanced Topics

@bye