浏览代码

doc: first version of the doxygen documentation

Nathalie Furmento 12 年之前
父节点
当前提交
a41b1f31f6

文件差异内容过多而无法显示
+ 1811 - 0
doc/doxygen/Doxyfile.handbook


文件差异内容过多而无法显示
+ 1260 - 0
doc/doxygen/chapters/advanced_examples.doxy


+ 553 - 0
doc/doxygen/chapters/api/codelet_and_tasks.doxy

@@ -0,0 +1,553 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup Codelet_And_Tasks Codelet And Tasks
+
+\brief This section describes the interface to manipulate codelets and tasks.
+
+\def STARPU_CPU
+\ingroup Codelet_And_Tasks
+\brief This macro is used when setting the field starpu_codelet::where
+to specify the codelet may be executed on a CPU processing unit.
+
+\def STARPU_CUDA
+\ingroup Codelet_And_Tasks
+\brief This macro is used when setting the field starpu_codelet::where
+to specify the codelet may be executed on a CUDA processing unit.
+
+\def STARPU_OPENCL
+\ingroup Codelet_And_Tasks
+\brief This macro is used when setting the field starpu_codelet::where to
+specify the codelet may be executed on a OpenCL processing unit.
+
+\def STARPU_MULTIPLE_CPU_IMPLEMENTATIONS
+\deprecated
+\ingroup Codelet_And_Tasks
+\brief Setting the field starpu_codelet::cpu_func with this macro
+indicates the codelet will have several implementations. The use of
+this macro is deprecated. One should always only define the field
+starpu_codelet::cpu_funcs.
+
+\def STARPU_MULTIPLE_CUDA_IMPLEMENTATIONS
+\deprecated
+\ingroup Codelet_And_Tasks
+\brief Setting the field starpu_codelet::cuda_func with this macro
+indicates the codelet will have several implementations. The use of
+this macro is deprecated. One should always only define the field
+starpu_codelet::cuda_funcs.
+
+\def STARPU_MULTIPLE_OPENCL_IMPLEMENTATIONS
+\deprecated
+\ingroup Codelet_And_Tasks
+\brief Setting the field starpu_codelet::opencl_func with
+this macro indicates the codelet will have several implementations.
+The use of this macro is deprecated. One should always only define the
+field starpu_codelet::opencl_funcs.
+
+\struct starpu_codelet
+\brief The codelet structure describes a kernel that is possibly
+implemented on various targets. For compatibility, make sure to
+initialize the whole structure to zero, either by using explicit
+memset, or the function starpu_codelet_init(), or by letting the
+compiler implicitly do it in e.g. static storage case.
+\ingroup Codelet_And_Tasks
+\var starpu_codelet::where.
+Optional field to indicate which types of processing units are able to
+execute the codelet. The different values ::STARPU_CPU, ::STARPU_CUDA,
+::STARPU_OPENCL can be combined to specify on which types of processing
+units the codelet can be executed. ::STARPU_CPU|::STARPU_CUDA for instance
+indicates that the codelet is implemented for both CPU cores and CUDA
+devices while ::STARPU_OPENCL indicates that it is only available on
+OpenCL devices. If the field is unset, its value will be automatically
+set based on the availability of the XXX_funcs fields defined below.
+
+\var starpu_codelet::can_execute
+Define a function which should return 1 if the worker designated by
+workerid can execute the <c>nimpl</c>th implementation of the given
+task, 0 otherwise.
+
+\var starpu_codelet::type
+Optional field to specify the type of the codelet. The default is
+::STARPU_SEQ, i.e. usual sequential implementation. Other values
+(::STARPU_SPMD or ::STARPU_FORKJOIN declare that a parallel implementation
+is also available. See \ref Parallel_Tasks for details.
+
+\var starpu_codelet::max_parallelism
+Optional field. If a parallel implementation is available, this
+denotes the maximum combined worker size that StarPU will use to
+execute parallel tasks for this codelet.
+
+\var starpu_codelet::cpu_func
+\deprecated
+Optional field which has been made deprecated. One should use instead
+the field starpu_codelet::cpu_funcs.
+
+\var starpu_codelet::cuda_func
+\deprecated
+Optional field which has been made deprecated. One should use instead
+the starpu_codelet::cuda_funcs field.
+
+\var starpu_codelet::opencl_func
+\deprecated
+Optional field which has been made deprecated. One should use instead
+the starpu_codelet::opencl_funcs field.
+
+\var starpu_codelet::cpu_funcs
+Optional array of function pointers to the CPU implementations of the
+codelet. It must be terminated by a NULL value. The functions
+prototype must be:
+\code{.c}
+void cpu_func(void *buffers[], void *cl_arg)
+\endcode
+The first argument being the array of data managed by the data
+management library, and the second argument is a pointer to the
+argument passed from the field starpu_task::cl_arg. If the field
+starpu_codelet::where is set, then the field starpu_codelet::cpu_funcs
+is ignored if ::STARPU_CPU does not appear in the field
+starpu_codelet::where, it must be non-null otherwise.
+
+\var starpu_codelet::cuda_funcs
+Optional array of function pointers to the CUDA implementations of the
+codelet. It must be terminated by a NULL value. The functions must be
+host-functions written in the CUDA runtime API. Their prototype must
+be:
+\code{.c}
+void cuda_func(void *buffers[], void *cl_arg)
+\endcode
+If the field starpu_codelet::where is set, then the field
+starpu_codelet::cuda_funcs is ignored if ::STARPU_CUDA does not appear
+in the field starpu_codelet::where, it must be non-null otherwise.
+
+\var starpu_codelet::opencl_funcs
+Optional array of function pointers to the OpenCL implementations of
+the codelet. It must be terminated by a NULL value. The functions
+prototype must be:
+\code{.c}
+void opencl_func(void *buffers[], void *cl_arg)
+\endcode
+If the field starpu_codelet::where field is set, then the field
+starpu_codelet::opencl_funcs is ignored if ::STARPU_OPENCL does not
+appear in the field starpu_codelet::where, it must be non-null
+otherwise.
+
+\var starpu_codelet::nbuffers
+Specify the number of arguments taken by the codelet. These arguments
+are managed by the DSM and are accessed from the <c>void *buffers[]</c>
+array. The constant argument passed with the field starpu_task::cl_arg
+is not counted in this number. This value should not be above
+STARPU_NMAXBUFS.
+
+\var starpu_codelet::modes
+Is an array of ::starpu_data_access_mode. It describes the required
+access modes to the data neeeded by the codelet (e.g. ::STARPU_RW). The
+number of entries in this array must be specified in the field
+starpu_codelet::nbuffers, and should not exceed STARPU_NMAXBUFS. If
+unsufficient, this value can be set with the <c>--enable-maxbuffers</c>
+option when configuring StarPU.
+
+\var starpu_codelet::dyn_modes
+Is an array of ::starpu_data_access_mode. It describes the required
+access modes to the data neeeded by the codelet (e.g. ::STARPU_RW).
+The number of entries in this array must be specified in the field
+starpu_codelet::nbuffers. This field should be used for codelets having a
+number of datas greater than STARPU_NMAXBUFS (see \ref
+Setting_the_Data_Handles_for_a_Task). When defining a codelet, one
+should either define this field or the field starpu_codelet::modes defined above.
+
+\var starpu_codelet::model
+Optional pointer to the task duration performance model associated to
+this codelet. This optional field is ignored when set to <c>NULL</c> or when
+its field starpu_perfmodel::symbol is not set.
+
+\var starpu_codelet::power_model
+Optional pointer to the task power consumption performance model
+associated to this codelet. This optional field is ignored when set to
+<c>NULL or when its field starpu_perfmodel::field is not set. In the
+case of parallel codelets, this has to account for all processing
+units involved in the parallel execution.
+
+\var starpu_codelet::per_worker_stats
+Optional array for statistics collected at runtime: this is filled by
+StarPU and should not be accessed directly, but for example by calling
+the function starpu_codelet_display_stats() (See
+starpu_codelet_display_stats() for details).
+
+\var starpu_codelet::name
+Optional name of the codelet. This can be useful for debugging
+purposes.
+
+\fn void starpu_codelet_init(struct starpu_codelet *cl)
+\ingroup Codelet_And_Tasks
+\brief Initialize \p cl with default values. Codelets should
+preferably be initialized statically as shown in \ref
+Defining_a_Codelet. However such a initialisation is not always
+possible, e.g. when using C++.
+
+\struct starpu_data_descr
+\ingroup Codelet_And_Tasks
+\brief This type is used to describe a data handle along with an
+access mode.
+\var starpu_data_descr::handle
+describes a data
+\var starpu_data_descr::mode
+describes its access mode
+
+\struct starpu_task
+\ingroup Codelet_And_Tasks
+\brief The structure describes a task that can be offloaded on the
+various processing units managed by StarPU. It instantiates a codelet.
+It can either be allocated dynamically with the function
+starpu_task_create(), or declared statically. In the latter case, the
+programmer has to zero the structure starpu_task and to fill the
+different fields properly. The indicated default values correspond to
+the configuration of a task allocated with starpu_task_create().
+\var starpu_task::cl
+Is a pointer to the corresponding structure starpu_codelet. This
+describes where the kernel should be executed, and supplies the
+appropriate implementations. When set to NULL, no code is executed
+during the tasks, such empty tasks can be useful for synchronization
+purposes.
+\var starpu_task::buffers
+\deprecated
+This field has been made deprecated. One should use instead the
+field starpu_task::handles to specify the data handles accessed
+by the task. The access modes are now defined in the field
+starpu_codelet::mode.
+\var starpu_task::handles
+Is an array of starpu_data_handle_t. It specifies the handles to the
+different pieces of data accessed by the task. The number of entries
+in this array must be specified in the field starpu_codelet::nbuffers,
+and should not exceed STARPU_NMAXBUFS. If unsufficient, this value can
+be set with the option <c>--enable-maxbuffers</c> when configuring
+StarPU.
+\var starpu_task::dyn_handles
+Is an array of starpu_data_handle_t. It specifies the handles to the
+different pieces of data accessed by the task. The number of entries
+in this array must be specified in the field starpu_codelet::nbuffers.
+This field should be used for tasks having a number of datas greater
+than STARPU_NMAXBUFS (see \ref Setting_the_Data_Handles_for_a_Task).
+When defining a task, one should either define this field or the field
+starpu_task::handles defined above.
+
+\var starpu_task::interfaces
+The actual data pointers to the memory node where execution will
+happen, managed by the DSM.
+
+\var starpu_task::dyn_interfaces
+The actual data pointers to the memory node where execution will
+happen, managed by the DSM. Is used when the field
+starpu_task::dyn_handles is defined.
+
+\var starpu_task::cl_arg
+Optional pointer which is passed to the codelet through the second
+argument of the codelet implementation (e.g. starpu_codelet::cpu_func
+or starpu_codelet::cuda_func). The default value is <c>NULL</c>.
+
+\var starpu_task::cl_arg_size
+Optional field. For some specific drivers, the pointer
+starpu_task::cl_arg cannot not be directly given to the driver
+function. A buffer of size starpu_task::cl_arg_size needs to be
+allocated on the driver. This buffer is then filled with the
+starpu_task::cl_arg_size bytes starting at address
+starpu_task::cl_arg. In this case, the argument given to the codelet
+is therefore not the starpu_task::cl_arg pointer, but the address of
+the buffer in local store (LS) instead. This field is ignored for CPU,
+CUDA and OpenCL codelets, where the starpu_task::cl_arg pointer is
+given as such.
+
+\var starpu_task::callback_func
+Optional field, the default value is <c>NULL</c>. This is a function
+pointer of prototype <c>void (*f)(void *)</c> which specifies a
+possible callback. If this pointer is non-null, the callback function
+is executed on the host after the execution of the task. Tasks which
+depend on it might already be executing. The callback is passed the
+value contained in the starpu_task::callback_arg field. No callback is
+executed if the field is set to NULL.
+
+\var starpu_task::callback_arg (optional) (default: NULL)
+Optional field, the default value is <c>NULL</c>. This is the pointer
+passed to the callback function. This field is ignored if the
+callback_func is set to <c>NULL</c>.
+
+\var starpu_task::use_tag
+Optional field, the default value is 0. If set, this flag indicates
+that the task should be associated with the tag contained in the
+starpu_task::tag_id field. Tag allow the application to synchronize
+with the task and to express task dependencies easily. 
+
+\var starpu_task::tag_id
+This optional field contains the tag associated to the task if the
+field starpu_task::use_tag is set, it is ignored otherwise.
+
+\var starpu_task::sequential_consistency
+If this flag is set (which is the default), sequential consistency is
+enforced for the data parameters of this task for which sequential
+consistency is enabled. Clearing this flag permits to disable
+sequential consistency for this task, even if data have it enabled.
+
+\var starpu_task::synchronous
+If this flag is set, the function starpu_task_submit() is blocking and
+returns only when the task has been executed (or if no worker is able
+to process the task). Otherwise, starpu_task_submit() returns
+immediately.
+
+\var starpu_task::priority
+Optional field, the default value is STARPU_DEFAULT_PRIO. This field
+indicates a level of priority for the task. This is an integer value
+that must be set between the return values of the function
+starpu_sched_get_min_priority() for the least important tasks, and
+that of the function starpu_sched_get_max_priority() for the most
+important tasks (included). The STARPU_MIN_PRIO and STARPU_MAX_PRIO
+macros are provided for convenience and respectively returns the value
+of starpu_sched_get_min_priority() and
+starpu_sched_get_max_priority(). Default priority is
+STARPU_DEFAULT_PRIO, which is always defined as 0 in order to allow
+static task initialization. Scheduling strategies that take priorities
+into account can use this parameter to take better scheduling
+decisions, but the scheduling policy may also ignore it.
+
+\var starpu_task::execute_on_a_specific_worker
+Default value is 0. If this flag is set, StarPU will bypass the
+scheduler and directly affect this task to the worker specified by the
+field starpu_task::workerid.
+
+\var starpu_task::workerid
+Optional field. If the field starpu_task::execute_on_a_specific_worker
+is set, this field indicates the identifier of the worker that should
+process this task (as returned by starpu_worker_get_id()). This field
+is ignored if the field starpu_task::execute_on_a_specific_worker is
+set to 0.
+
+\var starpu_task::bundle
+Optional field. The bundle that includes this task. If no bundle is
+used, this should be NULL.
+
+\var starpu_task::detach
+Optional field, default value is 1. If this flag is set, it is not
+possible to synchronize with the task by the means of starpu_task_wait()
+later on. Internal data structures are only guaranteed to be freed
+once starpu_task_wait() is called if the flag is not set.
+
+\var starpu_task::destroy
+Optional value. Default value is 0 for starpu_task_init(), and 1 for
+starpu_task_create(). If this flag is set, the task structure will
+automatically be freed, either after the execution of the callback if
+the task is detached, or during starpu_task_wait() otherwise. If this
+flag is not set, dynamically allocated data structures will not be
+freed until starpu_task_destroy() is called explicitly. Setting this
+flag for a statically allocated task structure will result in
+undefined behaviour. The flag is set to 1 when the task is created by
+calling starpu_task_create(). Note that starpu_task_wait_for_all()
+will not free any task.
+
+\var starpu_task::regenerate
+Optional field. If this flag is set, the task will be re-submitted to
+StarPU once it has been executed. This flag must not be set if the
+destroy flag is set.
+
+\var starpu_task::status
+Optional field. Current state of the task.
+
+\var starpu_task::profiling_info
+Optional field. Profiling information for the task.
+
+\var starpu_task::predicted
+Output field. Predicted duration of the task. This field is only set
+if the scheduling strategy used performance models.
+
+\var starpu_task::predicted_transfer
+Optional field. Predicted data transfer duration for the task in
+microseconds. This field is only valid if the scheduling strategy uses
+performance models.
+
+\var starpu_task::prev
+\private
+A pointer to the previous task. This should only be used by StarPU.
+
+\var starpu_task::next
+\private
+A pointer to the next task. This should only be used by StarPU.
+
+\var starpu_task::mf_skip
+\private
+This is only used for tasks that use multiformat handle. This should
+only be used by StarPU.
+
+\var starpu_task::flops
+This can be set to the number of floating points operations that the
+task will have to achieve. This is useful for easily getting GFlops
+curves from starpu_perfmodel_plot(), and for the hypervisor load
+balancing.
+
+\var starpu_task::starpu_private
+\private
+This is private to StarPU, do not modify. If the task is allocated by
+hand (without starpu_task_create()), this field should be set to NULL.
+
+\var starpu_task::magic
+\private
+This field is set when initializing a task. It prevents a task from
+being submitted if it has not been properly initialized.
+
+\fn void starpu_task_init(struct starpu_task *task)
+\ingroup Codelet_And_Tasks
+\brief Initialize task with default values. This function is
+implicitly called by starpu_task_create(). By default, tasks initialized
+with starpu_task_init() must be deinitialized explicitly with
+starpu_task_clean(). Tasks can also be initialized statically, using
+STARPU_TASK_INITIALIZER.
+
+\def STARPU_TASK_INITIALIZER
+\ingroup Codelet_And_Tasks
+\brief It is possible to initialize statically allocated tasks with
+this value. This is equivalent to initializing a structure starpu_task
+with the function starpu_task_init() function.
+
+\def STARPU_TASK_GET_HANDLE(struct starpu_task *task, int i)
+\ingroup Codelet_And_Tasks
+\brief Return the \p i th data handle of the given task. If the task
+is defined with a static or dynamic number of handles, will either
+return the \p i th element of the field starpu_task::handles or the \p
+i th element of the field starpu_task::dyn_handles (see \ref
+Setting_the_Data_Handles_for_a_Task)
+
+\def STARPU_TASK_SET_HANDLE(struct starpu_task *task, starpu_data_handle_t handle, int i)
+\ingroup Codelet_And_Tasks
+\brief Set the \p i th data handle of the given task with the given
+dat handle. If the task is defined with a static or dynamic number of
+handles, will either set the \p i th element of the field
+starpu_task::handles or the \p i th element of the field
+starpu_task::dyn_handles (see \ref
+Setting_the_Data_Handles_for_a_Task)
+
+\def STARPU_CODELET_GET_MODE(struct starpu_codelet *codelet, int i)
+\ingroup Codelet_And_Tasks
+\brief Return the access mode of the \p i th data handle of the given
+codelet. If the codelet is defined with a static or dynamic number of
+handles, will either return the \p i th element of the field
+starpu_codelet::modes or the \p i th element of the field
+starpu_codelet::dyn_modes (see \ref
+Setting_the_Data_Handles_for_a_Task)
+
+\def STARPU_CODELET_SET_MODE(struct starpu_codelet *codelet, enum starpu_data_access_mode mode, int i)
+\ingroup Codelet_And_Tasks
+\brief Set the access mode of the \p i th data handle of the given
+codelet. If the codelet is defined with a static or dynamic number of
+handles, will either set the \p i th element of the field
+starpu_codelet::modes or the \p i th element of the field
+starpu_codelet::dyn_modes (see \ref
+Setting_the_Data_Handles_for_a_Task)
+
+\fn struct starpu_task * starpu_task_create(void)
+\ingroup Codelet_And_Tasks
+\brief Allocate a task structure and initialize it with default
+values. Tasks allocated dynamically with starpu_task_create() are
+automatically freed when the task is terminated. This means that the
+task pointer can not be used any more once the task is submitted,
+since it can be executed at any time (unless dependencies make it
+wait) and thus freed at any time. If the field starpu_task::destroy is
+explicitly unset, the resources used by the task have to be freed by
+calling starpu_task_destroy().
+
+\fn struct starpu_task * starpu_task_dup(struct starpu_task *task)
+\ingroup Codelet_And_Tasks
+\brief Allocate a task structure which is the exact duplicate of the
+given task.
+
+\fn void starpu_task_clean(struct starpu_task *task)
+\ingroup Codelet_And_Tasks
+\brief Release all the structures automatically allocated to execute
+task, but not the task structure itself and values set by the user
+remain unchanged. It is thus useful for statically allocated tasks for
+instance. It is also useful when users want to execute the same
+operation several times with as least overhead as possible. It is
+called automatically by starpu_task_destroy(). It has to be called
+only after explicitly waiting for the task or after starpu_shutdown()
+(waiting for the callback is not enough, since StarPU still
+manipulates the task after calling the callback).
+
+\fn void starpu_task_destroy(struct starpu_task *task)
+\ingroup Codelet_And_Tasks
+\brief Free the resource allocated during starpu_task_create() and
+associated with task. This function is already called automatically
+after the execution of a task when the field starpu_task::destroy is
+set, which is the default for tasks created by starpu_task_create().
+Calling this function on a statically allocated task results in an
+undefined behaviour.
+
+\fn int starpu_task_wait(struct starpu_task *task)
+\ingroup Codelet_And_Tasks
+\brief This function blocks until \p task has been executed. It is not
+possible to synchronize with a task more than once. It is not possible
+to wait for synchronous or detached tasks. Upon successful completion,
+this function returns 0. Otherwise, <c>-EINVAL</c> indicates that the
+specified task was either synchronous or detached. 
+
+\fn int starpu_task_submit(struct starpu_task *task)
+\ingroup Codelet_And_Tasks
+\brief This function submits task to StarPU. Calling this function
+does not mean that the task will be executed immediately as there can
+be data or task (tag) dependencies that are not fulfilled yet: StarPU
+will take care of scheduling this task with respect to such
+dependencies. This function returns immediately if the field
+starpu_task::synchronous is set to 0, and block until the
+termination of the task otherwise. It is also possible to synchronize
+the application with asynchronous tasks by the means of tags, using
+the function starpu_tag_wait() function for instance. In case of
+success, this function returns 0, a return value of <c>-ENODEV</c>
+means that there is no worker able to process this task (e.g. there is
+no GPU available and this task is only implemented for CUDA devices).
+starpu_task_submit() can be called from anywhere, including codelet
+functions and callbacks, provided that the field
+starpu_task::synchronous is set to 0.
+
+\fn int starpu_task_wait_for_all(void)
+\ingroup Codelet_And_Tasks
+\brief This function blocks until all the tasks that were submitted
+are terminated. It does not destroy these tasks. 
+
+\fn int starpu_task_nready(void)
+\ingroup Codelet_And_Tasks
+\brief TODO
+
+\brief int starpu_task_nsubmitted(void)
+\ingroup Codelet_And_Tasks
+Return the number of submitted tasks which have not completed yet. 
+
+\fn int starpu_task_nready(void)
+\ingroup Codelet_And_Tasks
+\brief Return the number of submitted tasks which are ready for
+execution are already executing. It thus does not include tasks
+waiting for dependencies. 
+
+\fn struct starpu_task * starpu_task_get_current(void)
+\ingroup Codelet_And_Tasks
+\brief This function returns the task currently executed by the
+worker, or <c>NULL</c> if it is called either from a thread that is not a
+task or simply because there is no task being executed at the moment. 
+
+\fn void starpu_codelet_display_stats(struct starpu_codelet *cl)
+\ingroup Codelet_And_Tasks
+\brief Output on stderr some statistics on the codelet \p cl.
+
+\fn int starpu_task_wait_for_no_ready(void)
+\ingroup Codelet_And_Tasks
+\brief This function waits until there is no more ready task. 
+
+\fn void starpu_task_set_implementation(struct starpu_task *task, unsigned impl)
+\ingroup Codelet_And_Tasks
+\brief This function should be called by schedulers to specify the
+codelet implementation to be executed when executing the task. 
+
+\fn unsigned starpu_task_get_implementation(struct starpu_task *task)
+\ingroup Codelet_And_Tasks
+\brief This function return the codelet implementation to be executed
+when executing the task.
+
+
+*/

+ 806 - 0
doc/doxygen/chapters/api/data_interfaces.doxy

@@ -0,0 +1,806 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup Data_Interfaces Data Interfaces
+
+\struct starpu_data_interface_ops
+\brief Per-interface data transfer methods.
+\ingroup Data_Interfaces
+\var starpu_data_interface_ops::register_data_handle
+Register an existing interface into a data handle.
+\var starpu_data_interface_ops::allocate_data_on_node
+Allocate data for the interface on a given node.
+\var starpu_data_interface_ops::free_data_on_node
+Free data of the interface on a given node.
+\var starpu_data_interface_ops::copy_methods
+ram/cuda/opencl synchronous and asynchronous transfer methods.
+\var starpu_data_interface_ops::handle_to_pointer
+Return the current pointer (if any) for the handle on the given node.
+\var starpu_data_interface_ops::get_size
+Return an estimation of the size of data, for performance models.
+\var starpu_data_interface_ops::footprint
+Return a 32bit footprint which characterizes the data size.
+\var starpu_data_interface_ops::compare
+Compare the data size of two interfaces.
+\var starpu_data_interface_ops::display
+Dump the sizes of a handle to a file.
+\var starpu_data_interface_ops::interfaceid
+An identifier that is unique to each interface.
+\var starpu_data_interface_ops::interface_size
+The size of the interface data descriptor.
+\var starpu_data_interface_ops::is_multiformat
+todo
+\var starpu_data_interface_ops::get_mf_ops
+todo
+\var starpu_data_interface_ops::pack_data
+Pack the data handle into a contiguous buffer at the address ptr and
+set the size of the newly created buffer in count. If ptr is NULL, the
+function should not copy the data in the buffer but just set count to
+the size of the buffer which would have been allocated. The special
+value -1 indicates the size is yet unknown.
+\var starpu_data_interface_ops::unpack_data
+Unpack the data handle from the contiguous buffer at the address ptr
+of size count
+
+\struct starpu_data_copy_methods
+\brief Defines the per-interface methods. If the any_to_any method is
+provided, it will be used by default if no more specific method is
+provided. It can still be useful to provide more specific method in
+case of e.g. available particular CUDA or OpenCL support.
+\ingroup Data_Interfaces
+\var starpu_data_copy_methods::ram_to_ram
+Define how to copy data from the src_interface interface on the
+src_node CPU node to the dst_interface interface on the dst_node CPU
+node. Return 0 on success.
+\var starpu_data_copy_methods::ram_to_cuda
+Define how to copy data from the src_interface interface on the
+src_node CPU node to the dst_interface interface on the dst_node CUDA
+node. Return 0 on success.
+\var starpu_data_copy_methods::ram_to_opencl
+Define how to copy data from the src_interface interface on the
+src_node CPU node to the dst_interface interface on the dst_node
+OpenCL node. Return 0 on success.
+\var starpu_data_copy_methods::cuda_to_ram
+Define how to copy data from the src_interface interface on the
+src_node CUDA node to the dst_interface interface on the dst_node
+CPU node. Return 0 on success.
+\var starpu_data_copy_methods::cuda_to_cuda
+Define how to copy data from the src_interface interface on the
+src_node CUDA node to the dst_interface interface on the dst_node CUDA
+node. Return 0 on success.
+\var starpu_data_copy_methods::cuda_to_opencl
+Define how to copy data from the src_interface interface on the
+src_node CUDA node to the dst_interface interface on the dst_node
+OpenCL node. Return 0 on success.
+\var starpu_data_copy_methods::opencl_to_ram
+Define how to copy data from the src_interface interface on the
+src_node OpenCL node to the dst_interface interface on the dst_node
+CPU node. Return 0 on success.
+\var starpu_data_copy_methods::opencl_to_cuda
+Define how to copy data from the src_interface interface on the
+src_node OpenCL node to the dst_interface interface on the dst_node
+CUDA node. Return 0 on success.
+\var starpu_data_copy_methods::opencl_to_opencl
+Define how to copy data from the src_interface interface on the
+src_node OpenCL node to the dst_interface interface on the dst_node
+OpenCL node. Return 0 on success.
+
+\var starpu_data_copy_methods::ram_to_cuda_async
+Define how to copy data from the src_interface interface on the
+src_node CPU node to the dst_interface interface on the dst_node CUDA
+node, using the given stream. Must return 0 if the transfer was
+actually completed completely synchronously, or -EAGAIN if at least
+some transfers are still ongoing and should be awaited for by the core.
+\var starpu_data_copy_methods::cuda_to_ram_async
+Define how to copy data from the src_interface interface on the
+src_node CUDA node to the dst_interface interface on the dst_node CPU
+node, using the given stream. Must return 0 if the transfer was
+actually completed completely synchronously, or -EAGAIN if at least
+some transfers are still ongoing and should be awaited for by the core.
+\var starpu_data_copy_methods::cuda_to_cuda_async
+Define how to copy data from the src_interface interface on the
+src_node CUDA node to the dst_interface interface on the dst_node CUDA
+node, using the given stream. Must return 0 if the transfer was
+actually completed completely synchronously, or -EAGAIN if at least
+some transfers are still ongoing and should be awaited for by the core.
+
+\var starpu_data_copy_methods::ram_to_opencl_async
+Define how to copy data from the src_interface interface on the
+src_node CPU node to the dst_interface interface on the dst_node
+OpenCL node, by recording in event, a pointer to a cl_event, the event
+of the last submitted transfer. Must return 0 if the transfer was
+actually completed completely synchronously, or -EAGAIN if at least
+some transfers are still ongoing and should be awaited for by the
+core.
+\var starpu_data_copy_methods::opencl_to_ram_async
+Define how to copy data from the src_interface interface on the
+src_node OpenCL node to the dst_interface interface on the dst_node
+CPU node, by recording in event, a pointer to a cl_event, the event of
+the last submitted transfer. Must return 0 if the transfer was
+actually completed completely synchronously, or -EAGAIN if at least
+some transfers are still ongoing and should be awaited for by the
+core.
+\var starpu_data_copy_methods::opencl_to_opencl_async
+Define how to copy data from the src_interface interface on the
+src_node OpenCL node to the dst_interface interface on the dst_node
+OpenCL node, by recording in event, a pointer to a cl_event, the event
+of the last submitted transfer. Must return 0 if the transfer was
+actually completed completely synchronously, or -EAGAIN if at least
+some transfers are still ongoing and should be awaited for by the
+core.
+
+\var starpu_data_copy_methods::any_to_any
+Define how to copy data from the src_interface interface on the
+src_node node to the dst_interface interface on the dst_node node.
+This is meant to be implemented through the starpu_interface_copy()
+helper, to which async_data should be passed as such, and will be used
+to manage asynchronicity. This must return -EAGAIN if any of the
+starpu_interface_copy() calls has returned -EAGAIN (i.e. at least some
+transfer is still ongoing), and return 0 otherwise.
+
+@name Registering Data
+\ingroup Data_Interfaces
+
+There are several ways to register a memory region so that it can be
+managed by StarPU. The functions below allow the registration of
+vectors, 2D matrices, 3D matrices as well as BCSR and CSR sparse
+matrices.
+
+\fn void starpu_void_data_register(starpu_data_handle_t *handle)
+\ingroup Data_Interfaces
+\brief Register a void interface. There is no data really associated
+to that interface, but it may be used as a synchronization mechanism.
+It also permits to express an abstract piece of data that is managed
+by the application internally: this makes it possible to forbid the
+concurrent execution of different tasks accessing the same <c>void</c> data
+in read-write concurrently. 
+
+\fn void starpu_variable_data_register(starpu_data_handle_t *handle, unsigned home_node, uintptr_t ptr, size_t size)
+\ingroup Data_Interfaces
+\brief Register the \p size byte element pointed to by \p ptr, which is
+typically a scalar, and initialize \p handle to represent this data item.
+
+Here an example of how to use the function.
+\code{.c}
+float var;
+starpu_data_handle_t var_handle;
+starpu_variable_data_register(&var_handle, 0, (uintptr_t)&var, sizeof(var));
+\endcode
+
+\fn void starpu_vector_data_register(starpu_data_handle_t *handle, unsigned home_node, uintptr_t ptr, uint32_t nx, size_t elemsize)
+\ingroup Data_Interfaces
+\brief Register the \p nx elemsize-byte elements pointed to by \p ptr and initialize \p handle to represent it.
+
+Here an example of how to use the function.
+\code{.c}
+float vector[NX];
+starpu_data_handle_t vector_handle;
+starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector, NX, sizeof(vector[0]));
+\endcode
+
+\fn void starpu_matrix_data_register(starpu_data_handle_t *handle, unsigned home_node, uintptr_t ptr, uint32_t ld, uint32_t nx, uint32_t ny, size_t elemsize)
+\ingroup Data_Interfaces
+\brief Register the \p nx x \p  ny 2D matrix of \p elemsize-byte elements pointed
+by \p ptr and initialize \p handle to represent it. \p ld specifies the number
+of elements between rows. a value greater than \p nx adds padding, which
+can be useful for alignment purposes.
+
+Here an example of how to use the function.
+\code{.c}
+float *matrix;
+starpu_data_handle_t matrix_handle;
+matrix = (float*)malloc(width * height * sizeof(float));
+starpu_matrix_data_register(&matrix_handle, 0, (uintptr_t)matrix, width, width, height, sizeof(float));
+\endcode
+
+\fn void starpu_block_data_register(starpu_data_handle_t *handle, unsigned home_node, uintptr_t ptr, uint32_t ldy, uint32_t ldz, uint32_t nx, uint32_t ny, uint32_t nz, size_t elemsize)
+\ingroup Data_Interfaces
+\brief Register the \p nx x \p ny x \p nz 3D matrix of \p elemsize byte elements
+pointed by \p ptr and initialize \p handle to represent it. Again, \p ldy and
+\p ldz specify the number of elements between rows and between z planes.
+
+Here an example of how to use the function.
+\code{.c}
+float *block;
+starpu_data_handle_t block_handle;
+block = (float*)malloc(nx*ny*nz*sizeof(float));
+starpu_block_data_register(&block_handle, 0, (uintptr_t)block, nx, nx*ny, nx, ny, nz, sizeof(float));
+\endcode
+
+\fn void starpu_bcsr_data_register(starpu_data_handle_t *handle, unsigned home_node, uint32_t nnz, uint32_t nrow, uintptr_t nzval, uint32_t *colind, uint32_t *rowptr, uint32_t firstentry, uint32_t r, uint32_t c, size_t elemsize)
+\ingroup Data_Interfaces
+\brief This variant of starpu_data_register() uses the BCSR (Blocked
+Compressed Sparse Row Representation) sparse matrix interface.
+Register the sparse matrix made of \p nnz non-zero blocks of elements of
+size \p elemsize stored in \p nzval and initializes \p handle to represent it.
+Blocks have size \p r * \p c. \p nrow is the number of rows (in terms of
+blocks), \p colind[i] is the block-column index for block i in \p nzval,
+\p rowptr[i] is the block-index (in \p nzval) of the first block of row i.
+\p firstentry is the index of the first entry of the given arrays
+(usually 0 or 1). 
+
+\fn void starpu_csr_data_register(starpu_data_handle_t *handle, unsigned home_node, uint32_t nnz, uint32_t nrow, uintptr_t nzval, uint32_t *colind, uint32_t *rowptr, uint32_t firstentry, size_t elemsize)
+\ingroup Data_Interfaces
+\brief This variant of starpu_data_register() uses the CSR (Compressed
+Sparse Row Representation) sparse matrix interface. TODO
+
+\fn void starpu_coo_data_register(starpu_data_handle_t *handleptr, unsigned home_node, uint32_t nx, uint32_t ny, uint32_t n_values, uint32_t *columns, uint32_t *rows, uintptr_t values, size_t elemsize);
+\ingroup Data_Interfaces
+\brief Register the \p nx x \p ny 2D matrix given in the COO format, using the
+\p columns, \p rows, \p values arrays, which must have \p n_values elements of
+size \p elemsize. Initialize \p handleptr.
+
+\fn void *starpu_data_get_interface_on_node(starpu_data_handle_t handle, unsigned memory_node)
+\ingroup Data_Interfaces
+\brief Return the interface associated with \p handle on \p memory_node.
+
+@name Accessing Data Interfaces
+\ingroup Data_Interfaces
+
+Each data interface is provided with a set of field access functions.
+The ones using a void * parameter aimed to be used in codelet
+implementations (see for example the code in \ref
+Vector_Scaling_Using_StarPU_API).
+
+\fn void *starpu_data_handle_to_pointer(starpu_data_handle_t handle, unsigned node)
+\ingroup Data_Interfaces
+\brief Return the pointer associated with \p handle on node \p node or <c>NULL</c>
+if handle’s interface does not support this operation or data for this
+\p handle is not allocated on that \p node.
+
+\fn void *starpu_data_get_local_ptr(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the local pointer associated with \p handle or <c>NULL</c> if
+\p handle’s interface does not have data allocated locally 
+
+\fn enum starpu_data_interface_id starpu_data_get_interface_id(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the unique identifier of the interface associated with
+the given \p handle.
+
+\fn size_t starpu_data_get_size(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the size of the data associated with \p handle.
+
+\fn int starpu_data_pack(starpu_data_handle_t handle, void **ptr, starpu_ssize_t *count)
+\ingroup Data_Interfaces
+\brief Execute the packing operation of the interface of the data
+registered at \p handle (see starpu_data_interface_ops). This
+packing operation must allocate a buffer large enough at \p ptr and copy
+into the newly allocated buffer the data associated to \p handle. \p count
+will be set to the size of the allocated buffer. If \p ptr is NULL, the
+function should not copy the data in the buffer but just set \p count to
+the size of the buffer which would have been allocated. The special
+value -1 indicates the size is yet unknown.
+
+\fn int starpu_data_unpack(starpu_data_handle_t handle, void *ptr, size_t count)
+\ingroup Data_Interfaces
+\brief Unpack in handle the data located at \p ptr of size \p count as
+described by the interface of the data. The interface registered at
+\p handle must define a unpacking operation (see
+starpu_data_interface_ops). The memory at the address \p ptr is freed
+after calling the data unpacking operation.
+
+@name Accessing Variable Data Interfaces
+\ingroup Data_Interfaces
+
+\fn size_t starpu_variable_get_elemsize(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the size of the variable designated by \p handle.
+
+\fn uintptr_t starpu_variable_get_local_ptr(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return a pointer to the variable designated by \p handle.
+
+\def STARPU_VARIABLE_GET_PTR(interface)
+\ingroup Data_Interfaces
+\brief Return a pointer to the variable designated by \p interface.
+
+\def STARPU_VARIABLE_GET_ELEMSIZE(interface)
+\ingroup Data_Interfaces
+\brief Return the size of the variable designated by \p interface.
+
+\def STARPU_VARIABLE_GET_DEV_HANDLE(interface)
+\ingroup Data_Interfaces
+\brief Return a device handle for the variable designated by
+\p interface, to be used on OpenCL. The offset documented below has to be
+used in addition to this.
+
+\def STARPU_VARIABLE_GET_OFFSET()
+\ingroup Data_Interfaces
+\brief Return the offset in the variable designated by \p interface, to
+be used with the device handle.
+
+@name Accessing Vector Data Interfaces
+\ingroup Data_Interfaces
+
+\fn uint32_t starpu_vector_get_nx(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the number of elements registered into the array designated by \p handle.
+
+\fn size_t starpu_vector_get_elemsize(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the size of each element of the array designated by \p handle.
+
+\fn uintptr_t starpu_vector_get_local_ptr(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the local pointer associated with \p handle.
+
+\def STARPU_VECTOR_GET_PTR(void *interface)
+\ingroup Data_Interfaces
+\brief Return a pointer to the array designated by \p interface, valid on
+CPUs and CUDA only. For OpenCL, the device handle and offset need to
+be used instead.
+
+\def STARPU_VECTOR_GET_DEV_HANDLE(void *interface)
+\ingroup Data_Interfaces
+\brief Return a device handle for the array designated by \p interface,
+to be used on OpenCL. the offset documented below has to be used in
+addition to this.
+
+\def STARPU_VECTOR_GET_OFFSET(void *interface)
+\ingroup Data_Interfaces
+\brief Return the offset in the array designated by \p interface, to be
+used with the device handle.
+
+\def STARPU_VECTOR_GET_NX(void *interface)
+\ingroup Data_Interfaces
+\brief Return the number of elements registered into the array
+designated by \p interface.
+
+\def STARPU_VECTOR_GET_ELEMSIZE(void *interface)
+\ingroup Data_Interfaces
+\brief Return the size of each element of the array designated by
+\p interface.
+
+@name Accessing Matrix Data Interfaces
+\ingroup Data_Interfaces
+
+\fn uint32_t starpu_matrix_get_nx(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the number of elements on the x-axis of the matrix
+designated by \p handle.
+
+\fn uint32_t starpu_matrix_get_ny(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the number of elements on the y-axis of the matrix
+designated by \p handle.
+
+\fn uint32_t starpu_matrix_get_local_ld(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the number of elements between each row of the matrix
+designated by \p handle. Maybe be equal to nx when there is no padding.
+
+\fn uintptr_t starpu_matrix_get_local_ptr(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the local pointer associated with \p handle.
+
+\fn size_t starpu_matrix_get_elemsize(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the size of the elements registered into the matrix
+designated by \p handle.
+
+\def STARPU_MATRIX_GET_PTR(void *interface)
+\ingroup Data_Interfaces
+\brief Return a pointer to the matrix designated by \p interface, valid
+on CPUs and CUDA devices only. For OpenCL devices, the device handle
+and offset need to be used instead.
+
+\def STARPU_MATRIX_GET_DEV_HANDLE(void *interface)
+\ingroup Data_Interfaces
+\brief Return a device handle for the matrix designated by \p interface,
+to be used on OpenCL. The offset documented below has to be used in
+addition to this.
+
+\def STARPU_MATRIX_GET_OFFSET(void *interface)
+\ingroup Data_Interfaces
+\brief Return the offset in the matrix designated by \p interface, to be
+used with the device handle.
+
+\def STARPU_MATRIX_GET_NX(void *interface)
+\ingroup Data_Interfaces
+\brief Return the number of elements on the x-axis of the matrix
+designated by \p interface.
+
+\def STARPU_MATRIX_GET_NY(void *interface)
+\ingroup Data_Interfaces
+\brief Return the number of elements on the y-axis of the matrix
+designated by \p interface.
+
+\def STARPU_MATRIX_GET_LD(void *interface)
+\ingroup Data_Interfaces
+\brief Return the number of elements between each row of the matrix
+designated by \p interface. May be equal to nx when there is no padding.
+
+\def STARPU_MATRIX_GET_ELEMSIZE(void *interface)
+\ingroup Data_Interfaces
+\brief Return the size of the elements registered into the matrix
+designated by \p interface.
+
+@name Accessing Block Data Interfaces
+\ingroup Data_Interfaces
+
+\fn uint32_t starpu_block_get_nx(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the number of elements on the x-axis of the block
+designated by \p handle.
+
+\fn uint32_t starpu_block_get_ny(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the number of elements on the y-axis of the block
+designated by \p handle.
+
+\fn uint32_t starpu_block_get_nz(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the number of elements on the z-axis of the block
+designated by \p handle.
+
+\fn uint32_t starpu_block_get_local_ldy(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the number of elements between each row of the block
+designated by \p handle, in the format of the current memory node.
+
+\fn uint32_t starpu_block_get_local_ldz(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the number of elements between each z plane of the block
+designated by \p handle, in the format of the current memory node.
+
+\fn uintptr_t starpu_block_get_local_ptr(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the local pointer associated with \p handle.
+
+\fn size_t starpu_block_get_elemsize(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the size of the elements of the block designated by
+\p handle.
+
+\def STARPU_BLOCK_GET_PTR(void *interface)
+\ingroup Data_Interfaces
+\brief Return a pointer to the block designated by \p interface.
+
+\def STARPU_BLOCK_GET_DEV_HANDLE(void *interface)
+\ingroup Data_Interfaces
+\brief Return a device handle for the block designated by \p interface,
+to be used on OpenCL. The offset document below has to be used in
+addition to this.
+
+\def STARPU_BLOCK_GET_OFFSET(void *interface)
+\ingroup Data_Interfaces
+\brief Return the offset in the block designated by \p interface, to be
+used with the device handle.
+
+\def STARPU_BLOCK_GET_NX(void *interface)
+\ingroup Data_Interfaces
+\brief Return the number of elements on the x-axis of the block
+designated by \p interface.
+
+\def STARPU_BLOCK_GET_NY(void *interface)
+\ingroup Data_Interfaces
+\brief Return the number of elements on the y-axis of the block
+designated by \p interface.
+
+\def STARPU_BLOCK_GET_NZ(void *interface)
+\ingroup Data_Interfaces
+\brief Return the number of elements on the z-axis of the block
+designated by \p interface.
+
+\def STARPU_BLOCK_GET_LDY(void *interface)
+\ingroup Data_Interfaces
+\brief Return the number of elements between each row of the block
+designated by \p interface. May be equal to nx when there is no padding.
+
+\def STARPU_BLOCK_GET_LDZ(void *interface)
+\ingroup Data_Interfaces
+\brief Return the number of elements between each z plane of the block
+designated by \p interface. May be equal to nx*ny when there is no
+padding.
+
+\def STARPU_BLOCK_GET_ELEMSIZE(void *interface)
+\ingroup Data_Interfaces
+\brief Return the size of the elements of the block designated by
+\p interface.
+
+@name Accessing BCSR Data Interfaces
+\ingroup Data_Interfaces
+
+\fn uint32_t starpu_bcsr_get_nnz(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the number of non-zero elements in the matrix designated
+by \p handle.
+
+\fn uint32_t starpu_bcsr_get_nrow(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the number of rows (in terms of blocks of size r*c) in
+the matrix designated by \p handle.
+
+\fn uint32_t starpu_bcsr_get_firstentry(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the index at which all arrays (the column indexes, the
+row pointers...) of the matrix desginated by \p handle.
+
+\fn uintptr_t starpu_bcsr_get_local_nzval(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return a pointer to the non-zero values of the matrix
+designated by \p handle.
+
+\fn uint32_t * starpu_bcsr_get_local_colind(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return a pointer to the column index, which holds the positions
+of the non-zero entries in the matrix designated by \p handle.
+
+\fn uint32_t * starpu_bcsr_get_local_rowptr(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the row pointer array of the matrix designated by
+\p handle.
+
+\fn uint32_t starpu_bcsr_get_r(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the number of rows in a block.
+
+\fn uint32_t starpu_bcsr_get_c(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the numberof columns in a block.
+
+\fn size_t starpu_bcsr_get_elemsize(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the size of the elements in the matrix designated by
+\p handle.
+
+\def STARPU_BCSR_GET_NNZ(void *interface)
+\ingroup Data_Interfaces
+\brief Return the number of non-zero values in the matrix designated
+by \p interface.
+
+\def STARPU_BCSR_GET_NZVAL(void *interface)
+\ingroup Data_Interfaces
+\brief Return a pointer to the non-zero values of the matrix
+designated by \p interface.
+
+\def STARPU_BCSR_GET_NZVAL_DEV_HANDLE(void *interface)
+\ingroup Data_Interfaces
+\brief Return a device handle for the array of non-zero values in the
+matrix designated by \p interface. The offset documented below has to be
+used in addition to this.
+
+\def STARPU_BCSR_GET_COLIND(void *interface)
+\ingroup Data_Interfaces
+\brief Return a pointer to the column index of the matrix designated
+by \p interface.
+
+\def STARPU_BCSR_GET_COLIND_DEV_HANDLE(void *interface)
+\ingroup Data_Interfaces
+\brief Return a device handle for the column index of the matrix
+designated by \p interface. The offset documented below has to be used in
+addition to this.
+
+\def STARPU_BCSR_GET_ROWPTR(void *interface)
+\ingroup Data_Interfaces
+\brief Return a pointer to the row pointer array of the matrix
+designated by \p interface.
+
+\def STARPU_CSR_GET_ROWPTR_DEV_HANDLE(void *interface)
+\ingroup Data_Interfaces
+\brief Return a device handle for the row pointer array of the matrix
+designated by \p interface. The offset documented below has to be used in
+addition to this.
+
+\def STARPU_BCSR_GET_OFFSET(void *interface)
+\ingroup Data_Interfaces
+\brief Return the offset in the arrays (coling, rowptr, nzval) of the
+matrix designated by \p interface, to be used with the device handles.
+
+
+@name Accessing CSR Data Interfaces
+\ingroup Data_Interfaces
+
+\fn uint32_t starpu_csr_get_nnz(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the number of non-zero values in the matrix designated
+by \p handle.
+
+\fn uint32_t starpu_csr_get_nrow(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the size of the row pointer array of the matrix
+designated by \p handle.
+
+\fn uint32_t starpu_csr_get_firstentry(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the index at which all arrays (the column indexes, the
+row pointers...) of the matrix designated by \p handle.
+
+\fn uintptr_t starpu_csr_get_local_nzval(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return a local pointer to the non-zero values of the matrix
+designated by \p handle.
+
+\fn uint32_t * starpu_csr_get_local_colind(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return a local pointer to the column index of the matrix
+designated by \p handle.
+
+\fn uint32_t * starpu_csr_get_local_rowptr(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return a local pointer to the row pointer array of the matrix
+designated by \p handle.
+
+\fn size_t starpu_csr_get_elemsize(starpu_data_handle_t handle)
+\ingroup Data_Interfaces
+\brief Return the size of the elements registered into the matrix
+designated by \p handle.
+
+\def STARPU_CSR_GET_NNZ(void *interface)
+\ingroup Data_Interfaces
+\brief Return the number of non-zero values in the matrix designated
+by \p interface.
+
+\def STARPU_CSR_GET_NROW(void *interface)
+\ingroup Data_Interfaces
+\brief Return the size of the row pointer array of the matrix
+designated by \p interface.
+
+\def STARPU_CSR_GET_NZVAL(void *interface)
+\ingroup Data_Interfaces
+\brief Return a pointer to the non-zero values of the matrix
+designated by \p interface.
+
+\def STARPU_CSR_GET_NZVAL_DEV_HANDLE(void *interface)
+\ingroup Data_Interfaces
+\brief Return a device handle for the array of non-zero values in the
+matrix designated by \p interface. The offset documented below has to be
+used in addition to this.
+
+\def STARPU_CSR_GET_COLIND(void *interface)
+\ingroup Data_Interfaces
+\brief Return a pointer to the column index of the matrix designated
+by \p interface.
+
+\def STARPU_CSR_GET_COLIND_DEV_HANDLE(void *interface)
+\ingroup Data_Interfaces
+\brief Return a device handle for the column index of the matrix
+designated by \p interface. The offset documented below has to be used in
+addition to this.
+
+\def STARPU_CSR_GET_ROWPTR(void *interface)
+\ingroup Data_Interfaces
+\brief Return a pointer to the row pointer array of the matrix
+designated by \p interface.
+
+\def STARPU_CSR_GET_ROWPTR_DEV_HANDLE(void *interface)
+\ingroup Data_Interfaces
+\brief Return a device handle for the row pointer array of the matrix
+designated by \p interface. The offset documented below has to be used in
+addition to this.
+
+\def STARPU_CSR_GET_OFFSET(void *interface)
+\ingroup Data_Interfaces
+\brief Return the offset in the arrays (colind, rowptr, nzval) of the
+matrix designated by \p interface, to be used with the device handles.
+
+\def STARPU_CSR_GET_FIRSTENTRY(void *interface)
+\ingroup Data_Interfaces
+\brief Return the index at which all arrays (the column indexes, the
+row pointers...) of the \p interface start.
+
+\def STARPU_CSR_GET_ELEMSIZE(void *interface)
+\ingroup Data_Interfaces
+\brief Return the size of the elements registered into the matrix
+designated by \p interface.
+
+@name Accessing COO Data Interfaces
+\ingroup Data_Interfaces
+
+\def STARPU_COO_GET_COLUMNS(void *interface)
+\ingroup Data_Interfaces
+\brief Return a pointer to the column array of the matrix designated
+by \p interface.
+
+\def STARPU_COO_GET_COLUMNS_DEV_HANDLE(void *interface)
+\ingroup Data_Interfaces
+\brief Return a device handle for the column array of the matrix
+designated by \p interface, to be used on OpenCL. The offset documented
+below has to be used in addition to this.
+
+\def STARPU_COO_GET_ROWS(interface)
+\ingroup Data_Interfaces
+\brief Return a pointer to the rows array of the matrix designated by
+\p interface.
+
+\def STARPU_COO_GET_ROWS_DEV_HANDLE(void *interface)
+\ingroup Data_Interfaces
+\brief Return a device handle for the row array of the matrix
+designated by \p interface, to be used on OpenCL. The offset documented
+below has to be used in addition to this.
+
+\def STARPU_COO_GET_VALUES(interface)
+\ingroup Data_Interfaces
+\brief Return a pointer to the values array of the matrix designated
+by \p interface.
+
+\def STARPU_COO_GET_VALUES_DEV_HANDLE(void *interface)
+\ingroup Data_Interfaces
+\brief Return a device handle for the value array of the matrix
+designated by \p interface, to be used on OpenCL. The offset documented
+below has to be used in addition to this.
+
+\def STARPU_COO_GET_OFFSET(void *itnerface)
+\ingroup Data_Interfaces
+\brief Return the offset in the arrays of the COO matrix designated by
+\p interface.
+
+\def STARPU_COO_GET_NX(interface)
+\ingroup Data_Interfaces
+\brief Return the number of elements on the x-axis of the matrix
+designated by \p interface.
+
+\def STARPU_COO_GET_NY(interface)
+\ingroup Data_Interfaces
+\brief Return the number of elements on the y-axis of the matrix
+designated by \p interface.
+
+\def STARPU_COO_GET_NVALUES(interface)
+\ingroup Data_Interfaces
+\brief Return the number of values registered in the matrix designated
+by \p interface.
+
+\def STARPU_COO_GET_ELEMSIZE(interface)
+\ingroup Data_Interfaces
+\brief Return the size of the elements registered into the matrix
+designated by \p interface.
+
+@name Defining Interface
+\ingroup Data_Interfaces
+
+Applications can provide their own interface as shown in \ref
+Defining_a_New_Data_Interface.
+
+\fn uintptr_t starpu_malloc_on_node(unsigned dst_node, size_t size)
+\ingroup Data_Interfaces
+\brief Allocate \p size bytes on node \p dst_node. This returns 0 if
+allocation failed, the allocation method should then return <c>-ENOMEM</c> as
+allocated size.
+
+\fn void starpu_free_on_node(unsigned dst_node, uintptr_t addr, size_t size)
+\ingroup Data_Interfaces
+\brief Free \p addr of \p size bytes on node \p dst_node.
+
+\fn int starpu_interface_copy(uintptr_t src, size_t src_offset, unsigned src_node, uintptr_t dst, size_t dst_offset, unsigned dst_node, size_t size, void *async_data)
+\ingroup Data_Interfaces
+\brief Copy \p size bytes from byte offset \p src_offset of \p src on \p src_node
+to byte offset \p dst_offset of \p dst on \p dst_node. This is to be used in
+the any_to_any() copy method, which is provided with the async_data to
+be passed to starpu_interface_copy(). this returns <c>-EAGAIN</c> if the
+transfer is still ongoing, or 0 if the transfer is already completed.
+
+\fn uint32_t starpu_hash_crc32c_be_n(const void *input, size_t n, uint32_t inputcrc)
+\ingroup Data_Interfaces
+\brief Compute the CRC of a byte buffer seeded by the \p inputcrc
+<em>current state</em>. The return value should be considered as the new
+<em>current state</em> for future CRC computation. This is used for computing
+data size footprint.
+
+\fn uint32_t starpu_hash_crc32c_be(uint32_t input, uint32_t inputcrc)
+\ingroup Data_Interfaces
+\brief Compute the CRC of a 32bit number seeded by the \p inputcrc
+<em>current state</em>. The return value should be considered as the new
+<em>current state</em> for future CRC computation. This is used for computing
+data size footprint.
+
+\fn uint32_t starpu_hash_crc32c_string(const char *str, uint32_t inputcrc)
+\ingroup Data_Interfaces
+\brief Compute the CRC of a string seeded by the \p inputcrc <em>current
+state</em>. The return value should be considered as the new <em>current
+state</em> for future CRC computation. This is used for computing data
+size footprint.
+
+\fn int starpu_data_interface_get_next_id(void)
+\ingroup Data_Interfaces
+\brief Return the next available id for a newly created data interface
+(\ref Defining_a_New_Data_Interface).
+
+*/
+

+ 215 - 0
doc/doxygen/chapters/api/data_management.doxy

@@ -0,0 +1,215 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup Data_Management Data Management
+
+\brief This section describes the data management facilities provided
+by StarPU. We show how to use existing data interfaces in \ref
+Data_Interfaces, but developers can design their own data interfaces
+if required.
+
+\typedef starpu_data_handle_t
+\ingroup Data_Management
+\brief StarPU uses ::starpu_data_handle_t as an opaque handle to
+manage a piece of data. Once a piece of data has been registered to
+StarPU, it is associated to a starpu_data_handle_t which keeps track
+of the state of the piece of data over the entire machine, so that we
+can maintain data consistency and locate data replicates for instance.
+
+@name Basic Data Management API
+\ingroup Data_Management
+
+Data management is done at a high-level in StarPU: rather than
+accessing a mere list of contiguous buffers, the tasks may manipulate
+data that are described by a high-level construct which we call data
+interface.
+
+An example of data interface is the "vector" interface which describes
+a contiguous data array on a spefic memory node. This interface is a
+simple structure containing the number of elements in the array, the
+size of the elements, and the address of the array in the appropriate
+address space (this address may be invalid if there is no valid copy
+of the array in the memory node). More informations on the data
+interfaces provided by StarPU are given in \ref Data_Interfaces.
+
+When a piece of data managed by StarPU is used by a task, the task
+implementation is given a pointer to an interface describing a valid
+copy of the data that is accessible from the current processing unit.
+
+Every worker is associated to a memory node which is a logical
+abstraction of the address space from which the processing unit gets
+its data. For instance, the memory node associated to the different
+CPU workers represents main memory (RAM), the memory node associated
+to a GPU is DRAM embedded on the device. Every memory node is
+identified by a logical index which is accessible from the
+starpu_worker_get_memory_node function. When registering a piece of
+data to StarPU, the specified memory node indicates where the piece of
+data initially resides (we also call this memory node the home node of
+a piece of data).
+
+\fn void starpu_data_register(starpu_data_handle_t *handleptr, unsigned home_node, void *data_interface, struct starpu_data_interface_ops *ops)
+\ingroup Data_Management
+\brief Register a piece of data into the handle located at the
+\p handleptr address. The \p data_interface buffer contains the initial
+description of the data in the \p home_node. The \p ops argument is a
+pointer to a structure describing the different methods used to
+manipulate this type of interface. See starpu_data_interface_ops for
+more details on this structure.
+If \p home_node is -1, StarPU will automatically allocate the memory when
+it is used for the first time in write-only mode. Once such data
+handle has been automatically allocated, it is possible to access it
+using any access mode.
+Note that StarPU supplies a set of predefined types of interface (e.g.
+vector or matrix) which can be registered by the means of helper
+functions (e.g. starpu_vector_data_register() or
+starpu_matrix_data_register()).
+
+\fn void starpu_data_register_same(starpu_data_handle_t *handledst, starpu_data_handle_t handlesrc)
+\ingroup Data_Management
+\brief Register a new piece of data into the handle \p handledst with the
+same interface as the handle \p handlesrc.
+
+\fn void starpu_data_unregister(starpu_data_handle_t handle)
+\ingroup Data_Management
+\brief This function unregisters a data handle from StarPU. If the
+data was automatically allocated by StarPU because the home node was
+-1, all automatically allocated buffers are freed. Otherwise, a valid
+copy of the data is put back into the home node in the buffer that was
+initially registered. Using a data handle that has been unregistered
+from StarPU results in an undefined behaviour.
+
+\fn void starpu_data_unregister_no_coherency(starpu_data_handle_t handle)
+\ingroup Data_Management
+\brief This is the same as starpu_data_unregister(), except that
+StarPU does not put back a valid copy into the home node, in the
+buffer that was initially registered.
+
+\fn void starpu_data_unregister_submit(starpu_data_handle_t handle)
+\ingroup Data_Management
+\brief Destroy the data handle once it is not needed anymore by any
+submitted task. No coherency is assumed.
+
+\fn void starpu_data_invalidate(starpu_data_handle_t handle)
+\ingroup Data_Management
+\brief Destroy all replicates of the data handle immediately. After
+data invalidation, the first access to the handle must be performed in
+write-only mode. Accessing an invalidated data in read-mode results in
+undefined behaviour.
+
+\fn void starpu_data_invalidate_submit(starpu_data_handle_t handle)
+\ingroup Data_Management
+\brief Submits invalidation of the data handle after completion of
+previously submitted tasks.
+
+\fn void starpu_data_set_wt_mask(starpu_data_handle_t handle, uint32_t wt_mask)
+\ingroup Data_Management
+\brief This function sets the write-through mask of a given data, i.e.
+a bitmask of nodes where the data should be always replicated after
+modification. It also prevents the data from being evicted from these
+nodes when memory gets scarse.
+
+\fn int starpu_data_prefetch_on_node(starpu_data_handle_t handle, unsigned node, unsigned async)
+\ingroup Data_Management
+\brief Issue a prefetch request for a given data to a given node, i.e.
+requests that the data be replicated to the given node, so that it is
+available there for tasks. If the \p async parameter is 0, the call will
+block until the transfer is achieved, else the call will return as
+soon as the request is scheduled (which may however have to wait for a
+task completion).
+
+\fn starpu_data_handle_t starpu_data_lookup(const void *ptr)
+\ingroup Data_Management
+\brief Return the handle corresponding to the data pointed to by the \p ptr host pointer.
+
+\fn int starpu_data_request_allocation(starpu_data_handle_t handle, unsigned node)
+\ingroup Data_Management
+\brief Explicitly ask StarPU to allocate room for a piece of data on
+the specified memory node.
+
+\fn void starpu_data_query_status(starpu_data_handle_t handle, int memory_node, int *is_allocated, int *is_valid, int *is_requested)
+\ingroup Data_Management
+\brief Query the status of \p handle on the specified \p memory_node.
+
+\fn void starpu_data_advise_as_important(starpu_data_handle_t handle, unsigned is_important)
+\ingroup Data_Management
+\brief This function allows to specify that a piece of data can be
+discarded without impacting the application.
+
+\fn void starpu_data_set_reduction_methods(starpu_data_handle_t handle, struct starpu_codelet *redux_cl, struct starpu_codelet *init_cl)
+\ingroup Data_Management
+\brief This sets the codelets to be used for \p handle when it is
+accessed in STARPU_REDUX mode. Per-worker buffers will be initialized with
+the \p init_cl codelet, and reduction between per-worker buffers will be
+done with the \p redux_cl codelet.
+
+@name Access registered data from the application
+\ingroup Data_Management
+
+\fn int starpu_data_acquire(starpu_data_handle_t handle, enum starpu_data_access_mode mode)
+\ingroup Data_Management
+\brief The application must call this function prior to accessing
+registered data from main memory outside tasks. StarPU ensures that
+the application will get an up-to-date copy of the data in main memory
+located where the data was originally registered, and that all
+concurrent accesses (e.g. from tasks) will be consistent with the
+access mode specified in the mode argument. starpu_data_release() must
+be called once the application does not need to access the piece of
+data anymore. Note that implicit data dependencies are also enforced
+by starpu_data_acquire(), i.e. starpu_data_acquire() will wait for all
+tasks scheduled to work on the data, unless they have been disabled
+explictly by calling starpu_data_set_default_sequential_consistency_flag() or
+starpu_data_set_sequential_consistency_flag(). starpu_data_acquire() is a
+blocking call, so that it cannot be called from tasks or from their
+callbacks (in that case, starpu_data_acquire() returns <c>-EDEADLK</c>). Upon
+successful completion, this function returns 0.
+
+\fn int starpu_data_acquire_cb(starpu_data_handle_t handle, enum starpu_data_access_mode mode, void (*callback)(void *), void *arg)
+\ingroup Data_Management
+\brief Asynchronous equivalent of starpu_data_acquire(). When the data
+specified in \p handle is available in the appropriate access
+mode, the \p callback function is executed. The application may access
+the requested data during the execution of this \p callback. The \p callback
+function must call starpu_data_release() once the application does not
+need to access the piece of data anymore. Note that implicit data
+dependencies are also enforced by starpu_data_acquire_cb() in case they
+are not disabled. Contrary to starpu_data_acquire(), this function is
+non-blocking and may be called from task callbacks. Upon successful
+completion, this function returns 0.
+
+\fn int starpu_data_acquire_on_node(starpu_data_handle_t handle, unsigned node, enum starpu_data_access_mode mode)
+\ingroup Data_Management
+\brief This is the same as starpu_data_acquire(), except that the data
+will be available on the given memory node instead of main memory.
+
+\fn int starpu_data_acquire_on_node_cb(starpu_data_handle_t handle, unsigned node, enum starpu_data_access_mode mode, void (*callback)(void *), void *arg)
+\ingroup Data_Management
+\brief This is the same as starpu_data_acquire_cb(), except that the
+data will be available on the given memory node instead of main
+memory.
+
+\def STARPU_DATA_ACQUIRE_CB(starpu_data_handle_t handle, enum starpu_data_access_mode mode, code)
+\ingroup Data_Management
+\brief STARPU_DATA_ACQUIRE_CB() is the same as starpu_data_acquire_cb(),
+except that the code to be executed in a callback is directly provided
+as a macro parameter, and the data \p handle is automatically released
+after it. This permits to easily execute code which depends on the
+value of some registered data. This is non-blocking too and may be
+called from task callbacks.
+
+\fn void starpu_data_release(starpu_data_handle_t handle)
+\ingroup Data_Management
+\brief This function releases the piece of data acquired by the
+application either by starpu_data_acquire() or by
+starpu_data_acquire_cb().
+
+\fn void starpu_data_release_on_node(starpu_data_handle_t handle, unsigned node)
+\ingroup Data_Management
+\brief This is the same as starpu_data_release(), except that the data
+will be available on the given memory \p node instead of main memory.
+
+*/

+ 257 - 0
doc/doxygen/chapters/api/data_partition.doxy

@@ -0,0 +1,257 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup Data_Partition Data Partition
+
+\struct starpu_data_filter
+\brief The filter structure describes a data partitioning operation, to be given to the starpu_data_partition() function.
+\ingroup Data_Partition
+\var starpu_data_filter::filter_func
+This function fills the child_interface structure with interface
+information for the id-th child of the parent father_interface (among
+nparts).
+\var starpu_data_filter::nchildren
+This is the number of parts to partition the data into.
+\var starpu_data_filter::get_nchildren
+This returns the number of children. This can be used instead of
+nchildren when the number of children depends on the actual data (e.g.
+the number of blocks in a sparse matrix).
+\var starpu_data_filter::get_child_ops
+In case the resulting children use a different data interface, this
+function returns which interface is used by child number id.
+\var starpu_data_filter::filter_arg
+Allow to define an additional parameter for the filter function.
+\var starpu_data_filter::filter_arg_ptr
+Allow to define an additional pointer parameter for the filter
+function, such as the sizes of the different parts.
+
+@name Basic API
+\ingroup Data_Partition
+
+\fn void starpu_data_partition(starpu_data_handle_t initial_handle, struct starpu_data_filter *f)
+\ingroup Data_Partition
+\brief This requests partitioning one StarPU data initial_handle into
+several subdata according to the filter \p f.
+
+Here an example of how to use the function.
+\code{.c}
+struct starpu_data_filter f = {
+        .filter_func = starpu_matrix_filter_block,
+        .nchildren = nslicesx,
+        .get_nchildren = NULL,
+        .get_child_ops = NULL
+};
+starpu_data_partition(A_handle, &f);
+\endcode
+
+\fn void starpu_data_unpartition(starpu_data_handle_t root_data, unsigned gathering_node)
+\ingroup Data_Partition
+\brief This unapplies one filter, thus unpartitioning the data. The
+pieces of data are collected back into one big piece in the
+\p gathering_node (usually 0). Tasks working on the partitioned data must
+be already finished when calling starpu_data_unpartition().
+
+Here an example of how to use the function.
+\code{.c}
+starpu_data_unpartition(A_handle, 0);
+\endcode
+
+\fn int starpu_data_get_nb_children(starpu_data_handle_t handle)
+\ingroup Data_Partition
+\brief This function returns the number of children.
+
+\fn starpu_data_handle_t starpu_data_get_child(starpu_data_handle_t handle, unsigned i)
+\ingroup Data_Partition
+\brief Return the ith child of the given \p handle, which must have been
+partitionned beforehand.
+
+\fn starpu_data_handle_t starpu_data_get_sub_data (starpu_data_handle_t root_data, unsigned depth, ... )
+\ingroup Data_Partition
+\brief After partitioning a StarPU data by applying a filter,
+starpu_data_get_sub_data() can be used to get handles for each of the
+data portions. \p root_data is the parent data that was partitioned.
+\p depth is the number of filters to traverse (in case several filters
+have been applied, to e.g. partition in row blocks, and then in column
+blocks), and the subsequent parameters are the indexes. The function
+returns a handle to the subdata.
+
+Here an example of how to use the function.
+\code{.c}
+h = starpu_data_get_sub_data(A_handle, 1, taskx);
+\endcode
+
+\fn starpu_data_handle_t starpu_data_vget_sub_data(starpu_data_handle_t root_data, unsigned depth, va_list pa)
+\ingroup Data_Partition
+\brief This function is similar to starpu_data_get_sub_data() but uses a
+va_list for the parameter list.
+
+\fn void starpu_data_map_filters(starpu_data_handle_t root_data, unsigned nfilters, ...)
+\ingroup Data_Partition
+\brief Applies \p nfilters filters to the handle designated by
+\p root_handle recursively. \p nfilters pointers to variables of the type
+starpu_data_filter should be given.
+
+\fn void starpu_data_vmap_filters(starpu_data_handle_t root_data, unsigned nfilters, va_list pa)
+\ingroup Data_Partition
+\brief Applies \p nfilters filters to the handle designated by
+\p root_handle recursively. It uses a va_list of pointers to variables of
+the type starpu_data_filter.
+
+@name Predefined Vector Filter Functions
+\ingroup Data_Partition
+
+This section gives a partial list of the predefined partitioning
+functions for vector data. Examples on how to use them are shown in
+\ref Partitioning_Data. The complete list can be found in the file
+starpu_data_filters.h.
+
+\fn void starpu_vector_filter_block(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup Data_Partition
+\brief Return in \p child_interface the \p id th element of the vector
+represented by \p father_interface once partitioned in \p nparts chunks of
+equal size.
+
+\fn void starpu_vector_filter_block_shadow(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup Data_Partition
+\brief Return in \p child_interface the \p id th element of the vector
+represented by \p father_interface once partitioned in \p nparts chunks of
+equal size with a shadow border <c>filter_arg_ptr</c>, thus getting a vector
+of size (n-2*shadow)/nparts+2*shadow. The <c>filter_arg_ptr</c> field
+of \p f must be the shadow size casted into void*. <b>IMPORTANT</b>:
+This can only be used for read-only access, as no coherency is
+enforced for the shadowed parts. An usage example is available in
+examples/filters/shadow.c
+
+\fn void starpu_vector_filter_list(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup Data_Partition
+\brief Return in \p child_interface the \p id th element of the vector
+represented by \p father_interface once partitioned into \p nparts chunks
+according to the <c>filter_arg_ptr</c> field of \p f. The
+<c>filter_arg_ptr</c> field must point to an array of \p nparts uint32_t
+elements, each of which specifies the number of elements in each chunk
+of the partition.
+
+\fn void starpu_vector_filter_divide_in_2(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup Data_Partition
+\brief Return in \p child_interface the \p id th element of the vector
+represented by \p father_interface once partitioned in <c>2</c> chunks of
+equal size, ignoring nparts. Thus, \p id must be <c>0</c> or <c>1</c>.
+
+@name Predefined Matrix Filter Functions
+\ingroup Data_Partition
+
+This section gives a partial list of the predefined partitioning
+functions for matrix data. Examples on how to use them are shown in
+\ref Partitioning_Data. The complete list can be found in the file
+starpu_data_filters.h.
+
+\fn void starpu_matrix_filter_block(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup Data_Partition
+\brief This partitions a dense Matrix along the x dimension, thus
+getting (x/\p nparts ,y) matrices. If \p nparts does not divide x, the
+last submatrix contains the remainder.
+
+\fn void starpu_matrix_filter_block_shadow(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup Data_Partition
+\brief This partitions a dense Matrix along the x dimension, with a
+shadow border <c>filter_arg_ptr</c>, thus getting ((x-2*shadow)/\p
+nparts +2*shadow,y) matrices. If \p nparts does not divide x-2*shadow,
+the last submatrix contains the remainder. <b>IMPORTANT</b>: This can
+only be used for read-only access, as no coherency is enforced for the
+shadowed parts. A usage example is available in
+examples/filters/shadow2d.c
+
+\fn void starpu_matrix_filter_vertical_block(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup Data_Partition
+\brief This partitions a dense Matrix along the y dimension, thus
+getting (x,y/\p nparts) matrices. If \p nparts does not divide y, the
+last submatrix contains the remainder.
+
+\fn void starpu_matrix_filter_vertical_block_shadow(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup Data_Partition
+\brief This partitions a dense Matrix along the y dimension, with a
+shadow border <c>filter_arg_ptr</c>, thus getting
+(x,(y-2*shadow)/\p nparts +2*shadow) matrices. If \p nparts does not
+divide y-2*shadow, the last submatrix contains the remainder.
+<b>IMPORTANT</b>: This can only be used for read-only access, as no
+coherency is enforced for the shadowed parts. A usage example is
+available in examples/filters/shadow2d.c 
+
+@name Predefined Block Filter Functions
+\ingroup Data_Partition
+
+This section gives a partial list of the predefined partitioning
+functions for block data. Examples on how to use them are shown in
+\ref Partitioning_Data. The complete list can be found in the file
+starpu_data_filters.h. A usage example is available in
+examples/filters/shadow3d.c
+
+\fn void starpu_block_filter_block (void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup Data_Partition
+\brief This partitions a block along the X dimension, thus getting
+(x/\p nparts ,y,z) 3D matrices. If \p nparts does not divide x, the last
+submatrix contains the remainder.
+
+\fn void starpu_block_filter_block_shadow (void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup Data_Partition
+\brief This partitions a block along the X dimension, with a
+shadow border <p>filter_arg_ptr</p>, thus getting
+((x-2*shadow)/\p nparts +2*shadow,y,z) blocks. If \p nparts does not
+divide x, the last submatrix contains the remainder. <b>IMPORTANT</b>:
+This can only be used for read-only access, as no coherency is
+enforced for the shadowed parts.
+
+\fn void starpu_block_filter_vertical_block (void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup Data_Partition
+\brief This partitions a block along the Y dimension, thus getting
+(x,y/\p nparts ,z) blocks. If \p nparts does not divide y, the last
+submatrix contains the remainder.
+
+\fn void starpu_block_filter_vertical_block_shadow (void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup Data_Partition
+\brief This partitions a block along the Y dimension, with a
+shadow border <p>filter_arg_ptr</p>, thus getting
+(x,(y-2*shadow)/\p nparts +2*shadow,z) 3D matrices. If \p nparts does not
+divide y, the last submatrix contains the remainder. <b>IMPORTANT</b>:
+This can only be used for read-only access, as no coherency is
+enforced for the shadowed parts.
+
+\fn void starpu_block_filter_depth_block (void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup Data_Partition
+\brief This partitions a block along the Z dimension, thus getting
+(x,y,z/\p nparts) blocks. If \p nparts does not divide z, the last
+submatrix contains the remainder.
+
+\fn void starpu_block_filter_depth_block_shadow (void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup Data_Partition
+\brief This partitions a block along the Z dimension, with a
+shadow border <p>filter_arg_ptr</p>, thus getting
+(x,y,(z-2*shadow)/\p nparts +2*shadow) blocks. If \p nparts does not
+divide z, the last submatrix contains the remainder. <b>IMPORTANT</b>:
+This can only be used for read-only access, as no coherency is
+enforced for the shadowed parts.
+
+@name Predefined BCSR Filter Functions
+\ingroup Data_Partition
+
+This section gives a partial list of the predefined partitioning
+functions for BCSR data. Examples on how to use them are shown in
+\ref Partitioning_Data. The complete list can be found in the file
+starpu_data_filters.h.
+
+\fn void starpu_bcsr_filter_canonical_block (void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup Data_Partition
+\brief This partitions a block-sparse matrix into dense matrices.
+
+\fn void starpu_csr_filter_vertical_block (void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup Data_Partition
+\brief This partitions a block-sparse matrix into vertical
+block-sparse matrices.
+
+*/
+

+ 211 - 0
doc/doxygen/chapters/api/initialization.doxy

@@ -0,0 +1,211 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup Initialization_and_Termination Initialization and Termination
+
+\struct starpu_driver
+\brief structure for a driver
+\ingroup Initialization_and_Termination
+\var starpu_driver::type
+The type of the driver. Only STARPU_CPU_DRIVER,
+STARPU_CUDA_DRIVER and STARPU_OPENCL_DRIVER are currently supported.
+\var starpu_driver::id
+The identifier of the driver.
+
+\struct starpu_vector_interface
+\brief vector interface for contiguous (non-strided) buffers
+\ingroup Initialization_and_Termination
+
+
+\struct starpu_conf
+\ingroup Initialization_and_Termination
+\brief structure for configuring StarPU.
+
+This structure is passed to the starpu_init() function in order to
+configure StarPU. It has to be initialized with starpu_conf_init().
+When the default value is used, StarPU automatically selects the
+number of processing units and takes the default scheduling policy.
+The environment variables overwrite the equivalent parameters.
+\var starpu_conf::magic
+\private
+Will be initialized by starpu_conf_init(). Should not be set by hand.
+\var starpu_conf::sched_policy_name
+This is the name of the scheduling policy. This can also be specified
+with the STARPU_SCHED environment variable. (default = NULL).
+\var starpu_conf::sched_policy
+This is the definition of the scheduling policy. This field is ignored
+if starpu_conf::sched_policy_name is set. (default = NULL)
+\var starpu_conf::ncpus
+This is the number of CPU cores that StarPU can use. This can also be
+specified with the STARPU_NCPU environment variable. (default = -1)
+\var starpu_conf::ncuda
+This is the number of CUDA devices that StarPU can use. This can also
+be specified with the STARPU_NCUDA environment variable. (default =
+-1)
+\var starpu_conf::nopencl
+This is the number of OpenCL devices that StarPU can use. This can
+also be specified with the STARPU_NOPENCL environment variable.
+(default = -1)
+\var starpu_conf::use_explicit_workers_bindid
+If this flag is set, the starpu_conf::workers_bindid array indicates
+where the different workers are bound, otherwise StarPU automatically
+selects where to bind the different workers. This can also be
+specified with the STARPU_WORKERS_CPUID environment variable. (default = 0)
+\var starpu_conf::workers_bindid
+If the starpu_conf::use_explicit_workers_bindid flag is set, this
+array indicates where to bind the different workers. The i-th entry of
+the starpu_conf::workers_bindid indicates the logical identifier of
+the processor which should execute the i-th worker. Note that the
+logical ordering of the CPUs is either determined by the OS, or
+provided by the hwloc library in case it is available.
+\var starpu_conf::use_explicit_workers_cuda_gpuid
+If this flag is set, the CUDA workers will be attached to the CUDA
+devices specified in the starpu_conf::workers_cuda_gpuid array.
+Otherwise, StarPU affects the CUDA devices in a round-robin fashion.
+This can also be specified with the STARPU_WORKERS_CUDAID environment
+variable. (default = 0)
+\var starpu_conf::workers_cuda_gpuid
+If the starpu_conf::use_explicit_workers_cuda_gpuid flag is set, this
+array contains the logical identifiers of the CUDA devices (as used by
+cudaGetDevice()).
+\var starpu_conf::use_explicit_workers_opencl_gpuid
+If this flag is set, the OpenCL workers will be attached to the OpenCL
+devices specified in the starpu_conf::workers_opencl_gpuid array.
+Otherwise, StarPU affects the OpenCL devices in a round-robin fashion.
+This can also be specified with the STARPU_WORKERS_OPENCLID
+environment variable. (default = 0)
+\var starpu_conf::workers_opencl_gpuid
+If the starpu_conf::use_explicit_workers_opencl_gpuid flag is set,
+this array contains the logical identifiers of the OpenCL devices to
+be used.
+\var starpu_conf::bus_calibrate
+If this flag is set, StarPU will recalibrate the bus.  If this value
+is equal to <c>-1</c>, the default value is used.  This can also be
+specified with the STARPU_BUS_CALIBRATE environment variable. (default
+= 0)
+\var starpu_conf::calibrate
+If this flag is set, StarPU will calibrate the performance models when
+executing tasks. If this value is equal to <c>-1</c>, the default
+value is used. If the value is equal to <c>1</c>, it will force
+continuing calibration. If the value is equal to <c>2</c>, the
+existing performance models will be overwritten. This can also be
+specified with the STARPU_CALIBRATE environment variable. (default =
+0)
+\var starpu_conf::single_combined_worker
+By default, StarPU executes parallel tasks
+concurrently. Some parallel libraries (e.g. most OpenMP
+implementations) however do not support concurrent calls to
+parallel code. In such case, setting this flag makes StarPU
+only start one parallel task at a time (but other CPU and
+GPU tasks are not affected and can be run concurrently).
+The parallel task scheduler will however still however
+still try varying combined worker sizes to look for the
+most efficient ones. This can also be specified with the
+STARPU_SINGLE_COMBINED_WORKER environment variable.
+(default = 0)
+\var starpu_conf::disable_asynchronous_copy
+This flag should be set to 1 to disable
+asynchronous copies between CPUs and all accelerators. This
+can also be specified with the
+STARPU_DISABLE_ASYNCHRONOUS_COPY environment variable. The
+AMD implementation of OpenCL is known to fail when copying
+data asynchronously. When using this implementation, it is
+therefore necessary to disable asynchronous data transfers.
+This can also be specified at compilation time by giving to
+the configure script the option
+<c>--disable-asynchronous-copy</c>. (default = 0)
+\var starpu_conf::disable_asynchronous_cuda_copy
+This flag should be set to 1 to disable
+asynchronous copies between CPUs and CUDA accelerators.
+This can also be specified with the
+STARPU_DISABLE_ASYNCHRONOUS_CUDA_COPY environment variable.
+This can also be specified at compilation time by giving to
+the configure script the option
+<c>--disable-asynchronous-cuda-copy</c>. (default = 0)
+\var starpu_conf::disable_asynchronous_opencl_copy
+This flag should be set to 1 to disable
+asynchronous copies between CPUs and OpenCL accelerators.
+This can also be specified with the
+STARPU_DISABLE_ASYNCHRONOUS_OPENCL_COPY environment
+variable. The AMD implementation of OpenCL is known to fail
+when copying data asynchronously. When using this
+implementation, it is therefore necessary to disable
+asynchronous data transfers. This can also be specified at
+compilation time by giving to the configure script the
+option <c>--disable-asynchronous-opencl-copy</c>. (default
+= 0)
+\var starpu_conf::cuda_opengl_interoperability
+Enable CUDA/OpenGL interoperation on these CUDA
+devices. This can be set to an array of CUDA device
+identifiers for which cudaGLSetGLDevice() should be called
+instead of cudaSetDevice(). Its size is specified by the
+starpu_conf::n_cuda_opengl_interoperability field below
+(default = NULL)
+\var starpu_conf::n_cuda_opengl_interoperability
+\var starpu_conf::not_launched_drivers
+Array of drivers that should not be launched by
+StarPU. The application will run in one of its own
+threads. (default = NULL)
+\var starpu_conf::n_not_launched_drivers
+The number of StarPU drivers that should not be
+launched by StarPU. (default = 0)
+\var starpu_conf::trace_buffer_size
+Specifies the buffer size used for FxT tracing.
+Starting from FxT version 0.2.12, the buffer will
+automatically be flushed when it fills in, but it may still
+be interesting to specify a bigger value to avoid any
+flushing (which would disturb the trace).
+
+\fn int starpu_init(struct starpu_conf *conf)
+\ingroup Initialization_and_Termination
+\brief This is StarPU initialization method, which must be called prior to
+any other StarPU call. It is possible to specify StarPU’s
+configuration (e.g. scheduling policy, number of cores, ...) by
+passing a non-null argument. Default configuration is used if the
+passed argument is NULL. Upon successful completion, this function
+returns 0. Otherwise, -ENODEV indicates that no worker was available
+(so that StarPU was not initialized).
+
+\fn int starpu_conf_init(struct starpu_conf *conf)
+\ingroup Initialization_and_Termination
+\brief This function initializes the conf structure passed as argument with
+the default values. In case some configuration parameters are already
+specified through environment variables, starpu_conf_init initializes
+the fields of the structure according to the environment variables.
+For instance if STARPU_CALIBRATE is set, its value is put in the
+.calibrate field of the structure passed as argument. Upon successful
+completion, this function returns 0. Otherwise, -EINVAL indicates that
+the argument was NULL.
+
+\fn void starpu_shutdown(void)
+\ingroup Initialization_and_Termination
+\brief This is StarPU termination method. It must be called at the end of the
+application: statistics and other post-mortem debugging information
+are not guaranteed to be available until this method has been called.
+
+\fn int starpu_asynchronous_copy_disabled(void)
+\ingroup Initialization_and_Termination
+\brief Return 1 if asynchronous data transfers between CPU and accelerators
+are disabled.
+
+\fn int starpu_asynchronous_cuda_copy_disabled(void)
+\ingroup Initialization_and_Termination
+\brief Return 1 if asynchronous data transfers between CPU and CUDA
+accelerators are disabled.
+
+\fn int starpu_asynchronous_opencl_copy_disabled(void)
+\ingroup Initialization_and_Termination
+\brief Return 1 if asynchronous data transfers between CPU and OpenCL
+accelerators are disabled.
+
+\fn void starpu_topology_print(FILE *f)
+\ingroup Initialization_and_Termination
+\brief Prints a description of the topology on f.
+
+*/
+

+ 54 - 0
doc/doxygen/chapters/api/multiformat_data_interface.doxy

@@ -0,0 +1,54 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup Multiformat_Data_Interface Multiformat Data Interface
+
+\struct starpu_multiformat_data_interface_ops
+\ingroup Multiformat_Data_Interface
+\brief The different fields are:
+\var starpu_multiformat_data_interface_ops::cpu_elemsize
+        the size of each element on CPUs
+\var starpu_multiformat_data_interface_ops::opencl_elemsize
+        the size of each element on OpenCL devices
+\var starpu_multiformat_data_interface_ops::cpu_to_opencl_cl
+        pointer to a codelet which converts from CPU to OpenCL
+\var starpu_multiformat_data_interface_ops::opencl_to_cpu_cl
+        pointer to a codelet which converts from OpenCL to CPU
+\var starpu_multiformat_data_interface_ops::cuda_elemsize
+        the size of each element on CUDA devices
+\var starpu_multiformat_data_interface_ops::cpu_to_cuda_cl
+        pointer to a codelet which converts from CPU to CUDA
+\var starpu_multiformat_data_interface_ops::cuda_to_cpu_cl
+        pointer to a codelet which converts from CUDA to CPU
+
+\fn void starpu_multiformat_data_register(starpu_data_handle_t *handle, unsigned home_node, void *ptr, uint32_t nobjects, struct starpu_multiformat_data_interface_ops *format_ops)
+\ingroup Multiformat_Data_Interface
+\brief Register a piece of data that can be represented in different
+ways, depending upon the processing unit that manipulates it. It
+allows the programmer, for instance, to use an array of structures
+when working on a CPU, and a structure of arrays when working on a
+GPU. \p nobjects is the number of elements in the data. \p format_ops
+describes the format.
+
+\def STARPU_MULTIFORMAT_GET_CPU_PTR(void *interface)
+\ingroup Multiformat_Data_Interface
+\brief returns the local pointer to the data with CPU format.
+
+\def STARPU_MULTIFORMAT_GET_CUDA_PTR(void *interface)
+\ingroup Multiformat_Data_Interface
+\brief returns the local pointer to the data with CUDA format.
+
+\def STARPU_MULTIFORMAT_GET_OPENCL_PTR(void *interface)
+\ingroup Multiformat_Data_Interface
+\brief returns the local pointer to the data with OpenCL format.
+
+\def STARPU_MULTIFORMAT_GET_NX (void *interface)
+\ingroup Multiformat_Data_Interface
+\brief returns the number of elements in the data.
+
+*/

+ 65 - 0
doc/doxygen/chapters/api/standard_memory_library.doxy

@@ -0,0 +1,65 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup Standard_Memory_Library Standard Memory Library
+
+\def STARPU_MALLOC_PINNED
+\ingroup Standard_Memory_Library
+\brief Value passed to the function starpu_malloc_flags() to indicate the memory allocation should be pinned. 
+
+\def STARPU_MALLOC_COUNT
+\ingroup Standard_Memory_Library
+\brief Value passed to the function starpu_malloc_flags() to indicate
+the memory allocation should be in the limit defined by the
+environment variables <c>STARPU_LIMIT_CUDA_devid_MEM</c>,
+<c>STARPU_LIMIT_CUDA_MEM</c>, <c>STARPU_LIMIT_OPENCL_devid_MEM</c>,
+<c>STARPU_LIMIT_OPENCL_MEM</c> and <c>STARPU_LIMIT_CPU_MEM</c> (see
+Section \ref How_to_limit_memory_per_node).
+If no memory is available, it tries to reclaim memory from StarPU.
+Memory allocated this way needs to be freed by calling the
+starpu_free_flags() function with the same flag. 
+
+\fn int starpu_malloc_flags(void **A, size_t dim, int flags)
+\ingroup Standard_Memory_Library
+\brief Performs a memory allocation based on the constraints defined
+by the given flag.
+
+\fn void starpu_malloc_set_align(size_t align)
+\ingroup Standard_Memory_Library
+\brief This function sets an alignment constraints for starpu_malloc()
+allocations. align must be a power of two. This is for instance called
+automatically by the OpenCL driver to specify its own alignment
+constraints.
+
+\fn int starpu_malloc(void **A, size_t dim)
+\ingroup Standard_Memory_Library
+\brief This function allocates data of the given size in main memory.
+It will also try to pin it in CUDA or OpenCL, so that data transfers
+from this buffer can be asynchronous, and thus permit data transfer
+and computation overlapping. The allocated buffer must be freed thanks
+to the starpu_free() function.
+
+\fn int starpu_free(void *A)
+\ingroup Standard_Memory_Library
+\brief This function frees memory which has previously been allocated
+with starpu_malloc().
+
+\fn int starpu_free_flags(void *A, size_t dim, int flags)
+\ingroup Standard_Memory_Library
+\brief This function frees memory by specifying its size. The given
+flags should be consistent with the ones given to starpu_malloc_flags()
+when allocating the memory.
+
+\fn ssize_t starpu_memory_get_available(unsigned node)
+\ingroup Standard_Memory_Library
+\brief If a memory limit is defined on the given node (see Section \ref
+How_to_limit_memory_per_node), return the amount of available memory
+on the node. Otherwise return -1.
+
+*/
+

+ 28 - 0
doc/doxygen/chapters/api/versioning.doxy

@@ -0,0 +1,28 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup Versioning Versioning
+
+\def STARPU_MAJOR_VERSION
+\ingroup Versioning
+\brief Define the major version of StarPU. This is the version used when compiling the application.
+
+\def STARPU_MINOR_VERSION
+\ingroup Versioning
+\brief Define the minor version of StarPU. This is the version used when compiling the application.
+
+\def STARPU_RELEASE_VERSION
+\ingroup Versioning
+\brief Define the release version of StarPU. This is the version used when compiling the application.
+
+\fn void starpu_get_version(int *major, int *minor, int *release)
+\ingroup Versioning
+\brief Return as 3 integers the version of StarPU used when running the application.
+
+*/
+

+ 115 - 0
doc/doxygen/chapters/api/workers.doxy

@@ -0,0 +1,115 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup Workers_Properties Workers’ Properties
+
+\fn unsigned starpu_worker_get_count(void)
+\ingroup Workers_Properties
+\brief This function returns the number of workers (i.e. processing
+units executing StarPU tasks). The returned value should be at most
+STARPU_NMAXWORKERS. 
+
+\fn int starpu_worker_get_count_by_type(enum starpu_worker_archtype type)
+\ingroup Workers_Properties
+\brief Returns the number of workers of the given type. A positive (or
+NULL) value is returned in case of success, -EINVAL indicates that the
+type is not valid otherwise.
+
+\fn unsigned starpu_cpu_worker_get_count(void)
+\ingroup Workers_Properties
+\brief This function returns the number of CPUs controlled by StarPU. The
+returned value should be at most STARPU_MAXCPUS.
+
+\fn unsigned starpu_cuda_worker_get_count(void)
+\ingroup Workers_Properties
+\brief This function returns the number of CUDA devices controlled by
+StarPU. The returned value should be at most STARPU_MAXCUDADEVS.
+
+\fn unsigned starpu_opencl_worker_get_count(void)
+\ingroup Workers_Properties
+\brief This function returns the number of OpenCL devices controlled by
+StarPU. The returned value should be at most STARPU_MAXOPENCLDEVS.
+
+\fn int starpu_worker_get_id (void)
+\ingroup Workers_Properties
+\brief This function returns the identifier of the current worker, i.e
+the one associated to the calling thread. The returned value is either
+-1 if the current context is not a StarPU worker (i.e. when called
+from the application outside a task or a callback), or an integer
+between 0 and starpu_worker_get_count() - 1.
+
+\fn int starpu_worker_get_ids_by_type(enum starpu_worker_archtype type, int *workerids, int maxsize)
+\ingroup Workers_Properties
+\brief This function gets the list of identifiers of workers with the
+given type. It fills the workerids array with the identifiers of the
+workers that have the type indicated in the first argument. The
+maxsize argument indicates the size of the workids array. The returned
+value gives the number of identifiers that were put in the array.
+-ERANGE is returned is maxsize is lower than the number of workers
+with the appropriate type: in that case, the array is filled with the
+maxsize first elements. To avoid such overflows, the value of maxsize
+can be chosen by the means of the starpu_worker_get_count_by_type
+function, or by passing a value greater or equal to
+STARPU_NMAXWORKERS.
+
+\fn int starpu_worker_get_by_type(enum starpu_worker_archtype type, int num)
+\ingroup Workers_Properties
+\brief This returns the identifier of the num-th worker that has the
+specified type type. If there are no such worker, -1 is returned.
+
+\fn int starpu_worker_get_by_devid(enum starpu_worker_archtype type, int devid)
+\ingroup Workers_Properties
+\brief This returns the identifier of the worker that has the specified type
+type and devid devid (which may not be the n-th, if some devices are
+skipped for instance). If there are no such worker, -1 is returned.
+
+\fn int starpu_worker_get_devid(int id)
+\ingroup Workers_Properties
+\brief This function returns the device id of the given worker. The
+worker should be identified with the value returned by the
+starpu_worker_get_id() function. In the case of a CUDA worker, this
+device identifier is the logical device identifier exposed by CUDA
+(used by the cudaGetDevice function for instance). The device
+identifier of a CPU worker is the logical identifier of the core on
+which the worker was bound; this identifier is either provided by the
+OS or by the hwloc library in case it is available.
+
+\fn enum starpu_worker_archtype starpu_worker_get_type(int id)
+\ingroup Workers_Properties
+\brief This function returns the type of processing unit associated to
+a worker. The worker identifier is a value returned by the
+starpu_worker_get_id() function). The returned value indicates the
+architecture of the worker: STARPU_CPU_WORKER for a CPU core,
+STARPU_CUDA_WORKER for a CUDA device, and STARPU_OPENCL_WORKER for a
+OpenCL device. The value returned for an invalid identifier is
+unspecified.
+
+\fn void starpu_worker_get_name(int id, char *dst, size_t maxlen)
+\ingroup Workers_Properties
+\brief This function allows to get the name of a given worker. StarPU
+associates a unique human readable string to each processing unit.
+This function copies at most the maxlen first bytes of the unique
+string associated to a worker identified by its identifier id into the
+dst buffer. The caller is responsible for ensuring that the dst is a
+valid pointer to a buffer of maxlen bytes at least. Calling this
+function on an invalid identifier results in an unspecified behaviour.
+
+\fn unsigned starpu_worker_get_memory_node(unsigned workerid)
+\ingroup Workers_Properties
+\brief This function returns the identifier of the memory node
+associated to the worker identified by workerid.
+
+\fn enum starpu_node_kind starpu_node_get_kind(unsigned node)
+\ingroup Workers_Properties
+\brief Returns the type of the given node as defined by
+::starpu_node_kind. For example, when defining a new data interface,
+this function should be used in the allocation function to determine
+on which device the memory needs to be allocated.
+
+*/
+

+ 863 - 0
doc/doxygen/chapters/basic_examples.doxy

@@ -0,0 +1,863 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page basicExamples Basic Examples
+
+\section Hello_World_using_the_C_Extension Hello World using the C Extension
+
+This section shows how to implement a simple program that submits a task
+to StarPU using the StarPU C extension (\ref cExtensions). The complete example, and additional examples,
+is available in the <c>gcc-plugin/examples</c> directory of the StarPU
+distribution. A similar example showing how to directly use the StarPU's API is shown
+in @ref{Hello World using StarPU's API}.
+
+GCC from version 4.5 permit to use the StarPU GCC plug-in (\ref cExtensions). This makes writing a task both simpler and less error-prone.
+In a nutshell, all it takes is to declare a task, declare and define its
+implementations (for CPU, OpenCL, and/or CUDA), and invoke the task like
+a regular C function.  The example below defines <c>my_task</c> which
+has a single implementation for CPU:
+
+\code{.c}
+#include <stdio.h>
+
+/* Task declaration.  */
+static void my_task (int x) __attribute__ ((task));
+
+/* Definition of the CPU implementation of `my_task'.  */
+static void my_task (int x)
+{
+  printf ("Hello, world!  With x = %d\n", x);
+}
+
+int main ()
+{
+  /* Initialize StarPU. */
+#pragma starpu initialize
+
+  /* Do an asynchronous call to `my_task'. */
+  my_task (42);
+
+  /* Wait for the call to complete.  */
+#pragma starpu wait
+
+  /* Terminate. */
+#pragma starpu shutdown
+
+  return 0;
+}
+\endcode
+
+The code can then be compiled and linked with GCC and the <c>-fplugin</c> flag:
+
+\verbatim
+$ gcc `pkg-config starpu-1.1 --cflags` hello-starpu.c \
+    -fplugin=`pkg-config starpu-1.1 --variable=gccplugin` \
+    `pkg-config starpu-1.1 --libs`
+\endverbatim
+
+The code can also be compiled without the StarPU C extension and will
+behave as a normal sequential code.
+
+\verbatim
+$ gcc hello-starpu.c
+hello-starpu.c:33:1: warning: ‘task’ attribute directive ignored [-Wattributes]
+$ ./a.out
+Hello, world! With x = 42
+\endverbatim
+
+As can be seen above, the C extensions allows programmers to
+use StarPU tasks by essentially annotating ``regular'' C code.
+
+\section Hello_World_using_StarPU_API Hello World using StarPU's API
+
+This section shows how to achieve the same result as in the previous
+section using StarPU's standard C API.
+
+\subsection Required_Headers Required Headers
+
+The starpu.h header should be included in any code using StarPU.
+
+\code{.c}
+#include <starpu.h>
+\endcode
+
+\subsection Defining_a_Codelet Defining a Codelet
+
+\code{.c}
+struct params
+{
+    int i;
+    float f;
+};
+void cpu_func(void *buffers[], void *cl_arg)
+{
+    struct params *params = cl_arg;
+
+    printf("Hello world (params = {%i, %f} )\n", params->i, params->f);
+}
+
+struct starpu_codelet cl =
+{
+    .where = STARPU_CPU,
+    .cpu_funcs = { cpu_func, NULL },
+    .nbuffers = 0
+};
+\endcode
+
+A codelet is a structure that represents a computational kernel. Such a codelet
+may contain an implementation of the same kernel on different architectures
+(e.g. CUDA, x86, ...). For compatibility, make sure that the whole
+structure is properly initialized to zero, either by using the
+function starpu_codelet_init (@pxref{starpu_codelet_init}), or by letting the
+compiler implicitly do it as examplified above.
+
+The <c>nbuffers</c> field specifies the number of data buffers that are
+manipulated by the codelet: here the codelet does not access or modify any data
+that is controlled by our data management library. Note that the argument
+passed to the codelet (the <c>cl_arg</c> field of the <c>starpu_task</c>
+structure) does not count as a buffer since it is not managed by our data
+management library, but just contain trivial parameters.
+
+\internal
+TODO need a crossref to the proper description of "where" see bla for more ...
+\endinternal
+
+We create a codelet which may only be executed on the CPUs. The <c>where</c>
+field is a bitmask that defines where the codelet may be executed. Here, the
+<c>STARPU_CPU</c> value means that only CPUs can execute this codelet
+(@pxref{Codelets and Tasks} for more details on this field). Note that
+the <c>where</c> field is optional, when unset its value is
+automatically set based on the availability of the different
+<c>XXX_funcs</c> fields.
+When a CPU core executes a codelet, it calls the <c>cpu_func</c> function,
+which \em must have the following prototype:
+
+\code{.c}
+void (*cpu_func)(void *buffers[], void *cl_arg);
+\endcode
+
+In this example, we can ignore the first argument of this function which gives a
+description of the input and output buffers (e.g. the size and the location of
+the matrices) since there is none.
+The second argument is a pointer to a buffer passed as an
+argument to the codelet by the means of the <c>cl_arg</c> field of the
+<c>starpu_task</c> structure.
+
+\internal
+TODO rewrite so that it is a little clearer ?
+\endinternal
+
+Be aware that this may be a pointer to a
+\em copy of the actual buffer, and not the pointer given by the programmer:
+if the codelet modifies this buffer, there is no guarantee that the initial
+buffer will be modified as well: this for instance implies that the buffer
+cannot be used as a synchronization medium. If synchronization is needed, data
+has to be registered to StarPU, see \ref Vector_Scaling_Using_StarPU_API.
+
+\subsection Submitting_a_Task Submitting a Task
+
+\code{.c}
+void callback_func(void *callback_arg)
+{
+    printf("Callback function (arg %x)\n", callback_arg);
+}
+
+int main(int argc, char **argv)
+{
+    /* initialize StarPU */
+    starpu_init(NULL);
+
+    struct starpu_task *task = starpu_task_create();
+
+    task->cl = &cl; /* Pointer to the codelet defined above */
+
+    struct params params = { 1, 2.0f };
+    task->cl_arg = &params;
+    task->cl_arg_size = sizeof(params);
+
+    task->callback_func = callback_func;
+    task->callback_arg = 0x42;
+
+    /* starpu_task_submit will be a blocking call */
+    task->synchronous = 1;
+
+    /* submit the task to StarPU */
+    starpu_task_submit(task);
+
+    /* terminate StarPU */
+    starpu_shutdown();
+
+    return 0;
+}
+\endcode
+
+Before submitting any tasks to StarPU, starpu_init() must be called. The
+<c>NULL</c> argument specifies that we use default configuration. Tasks cannot
+be submitted after the termination of StarPU by a call to
+starpu_shutdown().
+
+In the example above, a task structure is allocated by a call to
+starpu_task_create(). This function only allocates and fills the
+corresponding structure with the default settings (@pxref{Codelets and
+Tasks, starpu_task_create}), but it does not submit the task to StarPU.
+
+\internal
+not really clear ;)
+\endinternal
+
+The <c>cl</c> field is a pointer to the codelet which the task will
+execute: in other words, the codelet structure describes which computational
+kernel should be offloaded on the different architectures, and the task
+structure is a wrapper containing a codelet and the piece of data on which the
+codelet should operate.
+
+The optional <c>cl_arg</c> field is a pointer to a buffer (of size
+<c>cl_arg_size</c>) with some parameters for the kernel
+described by the codelet. For instance, if a codelet implements a computational
+kernel that multiplies its input vector by a constant, the constant could be
+specified by the means of this buffer, instead of registering it as a StarPU
+data. It must however be noted that StarPU avoids making copy whenever possible
+and rather passes the pointer as such, so the buffer which is pointed at must
+kept allocated until the task terminates, and if several tasks are submitted
+with various parameters, each of them must be given a pointer to their own
+buffer.
+
+Once a task has been executed, an optional callback function is be called.
+While the computational kernel could be offloaded on various architectures, the
+callback function is always executed on a CPU. The <c>callback_arg</c>
+pointer is passed as an argument of the callback. The prototype of a callback
+function must be:
+
+\code{.c}
+void (*callback_function)(void *);
+\endcode
+
+If the <c>synchronous</c> field is non-zero, task submission will be
+synchronous: the starpu_task_submit() function will not return until the
+task was executed. Note that the starpu_shutdown() function does not
+guarantee that asynchronous tasks have been executed before it returns,
+starpu_task_wait_for_all() can be used to that effect, or data can be
+unregistered (starpu_data_unregister()), which will
+implicitly wait for all the tasks scheduled to work on it, unless explicitly
+disabled thanks to starpu_data_set_default_sequential_consistency_flag() or
+starpu_data_set_sequential_consistency_flag().
+
+\subsection Execution_of_Hello_World Execution of Hello World
+
+\verbatim
+$ make hello_world
+cc $(pkg-config --cflags starpu-1.1)  $(pkg-config --libs starpu-1.1) hello_world.c -o hello_world
+$ ./hello_world
+Hello world (params = {1, 2.000000} )
+Callback function (arg 42)
+\endverbatim
+
+\section Vector_Scaling_Using_the_C_Extension Vector Scaling Using the C Extension
+
+The previous example has shown how to submit tasks. In this section,
+we show how StarPU tasks can manipulate data.
+
+We will first show how to use the C language extensions provided by
+the GCC plug-in (\ref cExtensions). The complete example, and
+additional examples, is available in the <c>gcc-plugin/examples</c>
+directory of the StarPU distribution. These extensions map directly
+to StarPU's main concepts: tasks, task implementations for CPU,
+OpenCL, or CUDA, and registered data buffers. The standard C version
+that uses StarPU's standard C programming interface is given in the
+next section (\ref Vector_Scaling_Using_StarPU_API).
+
+First of all, the vector-scaling task and its simple CPU implementation
+has to be defined:
+
+\code{.c}
+/* Declare the `vector_scal' task.  */
+static void vector_scal (unsigned size, float vector[size],
+                         float factor)
+  __attribute__ ((task));
+
+/* Define the standard CPU implementation.  */
+static void
+vector_scal (unsigned size, float vector[size], float factor)
+{
+  unsigned i;
+  for (i = 0; i < size; i++)
+    vector[i] *= factor;
+}
+\endcode
+
+Next, the body of the program, which uses the task defined above, can be
+implemented:
+
+\code{.c}
+int
+main (void)
+{
+#pragma starpu initialize
+
+#define NX     0x100000
+#define FACTOR 3.14
+
+  {
+    float vector[NX]
+       __attribute__ ((heap_allocated, registered));
+
+    size_t i;
+    for (i = 0; i < NX; i++)
+      vector[i] = (float) i;
+
+    vector_scal (NX, vector, FACTOR);
+
+#pragma starpu wait
+  } /* VECTOR is automatically freed here. */
+
+#pragma starpu shutdown
+
+  return valid ? EXIT_SUCCESS : EXIT_FAILURE;
+}
+\endcode
+
+The <c>main</c> function above does several things:
+
+<ul>
+<li>
+It initializes StarPU.
+</li>
+<li>
+It allocates <c>vector</c> in the heap; it will automatically be freed
+when its scope is left.  Alternatively, good old <c>malloc</c> and
+<c>free</c> could have been used, but they are more error-prone and
+require more typing.
+</li>
+<li>
+It registers the memory pointed to by <c>vector</c>.  Eventually,
+when OpenCL or CUDA task implementations are added, this will allow
+StarPU to transfer that memory region between GPUs and the main memory.
+Removing this <c>pragma</c> is an error.
+</li>
+<li>
+It invokes the <c>vector_scal</c> task.  The invocation looks the same
+as a standard C function call.  However, it is an asynchronous
+invocation, meaning that the actual call is performed in parallel with
+the caller's continuation.
+</li>
+<li>
+It waits for the termination of the <c>vector_scal</c>
+asynchronous call.
+</li>
+<li>
+Finally, StarPU is shut down.
+</li>
+</ul>
+
+The program can be compiled and linked with GCC and the <c>-fplugin</c>
+flag:
+
+\verbatim
+$ gcc `pkg-config starpu-1.1 --cflags` vector_scal.c \
+    -fplugin=`pkg-config starpu-1.1 --variable=gccplugin` \
+    `pkg-config starpu-1.1 --libs`
+\endverbatim
+
+And voilà!
+
+\subsection Adding_an_OpenCL_Task_Implementation Adding an OpenCL Task Implementation
+
+Now, this is all fine and great, but you certainly want to take
+advantage of these newfangled GPUs that your lab just bought, don't you?
+
+So, let's add an OpenCL implementation of the <c>vector_scal</c> task.
+We assume that the OpenCL kernel is available in a file,
+<c>vector_scal_opencl_kernel.cl</c>, not shown here.  The OpenCL task
+implementation is similar to that used with the standard C API
+(\ref Definition_of_the_OpenCL_Kernel).  It is declared and defined
+in our C file like this:
+
+\code{.c}
+/* The OpenCL programs, loaded from 'main' (see below). */
+static struct starpu_opencl_program cl_programs;
+
+static void vector_scal_opencl (unsigned size, float vector[size],
+                                float factor)
+  __attribute__ ((task_implementation ("opencl", vector_scal)));
+
+static void
+vector_scal_opencl (unsigned size, float vector[size], float factor)
+{
+  int id, devid, err;
+  cl_kernel kernel;
+  cl_command_queue queue;
+  cl_event event;
+
+  /* VECTOR is GPU memory pointer, not a main memory pointer. */
+  cl_mem val = (cl_mem) vector;
+
+  id = starpu_worker_get_id ();
+  devid = starpu_worker_get_devid (id);
+
+  /* Prepare to invoke the kernel.  In the future, this will be largely automated.  */
+  err = starpu_opencl_load_kernel (&kernel, &queue, &cl_programs,
+                                   "vector_mult_opencl", devid);
+  if (err != CL_SUCCESS)
+    STARPU_OPENCL_REPORT_ERROR (err);
+
+  err = clSetKernelArg (kernel, 0, sizeof (size), &size);
+  err |= clSetKernelArg (kernel, 1, sizeof (val), &val);
+  err |= clSetKernelArg (kernel, 2, sizeof (factor), &factor);
+  if (err)
+    STARPU_OPENCL_REPORT_ERROR (err);
+
+  size_t global = 1, local = 1;
+  err = clEnqueueNDRangeKernel (queue, kernel, 1, NULL, &global,
+                                &local, 0, NULL, &event);
+  if (err != CL_SUCCESS)
+    STARPU_OPENCL_REPORT_ERROR (err);
+
+  clFinish (queue);
+  starpu_opencl_collect_stats (event);
+  clReleaseEvent (event);
+
+  /* Done with KERNEL. */
+  starpu_opencl_release_kernel (kernel);
+}
+\endcode
+
+The OpenCL kernel itself must be loaded from <c>main</c>, sometime after
+the <c>initialize</c> pragma:
+
+\code{.c}
+starpu_opencl_load_opencl_from_file ("vector_scal_opencl_kernel.cl",
+                                       &cl_programs, "");
+\endcode
+
+And that's it.  The <c>vector_scal</c> task now has an additional
+implementation, for OpenCL, which StarPU's scheduler may choose to use
+at run-time.  Unfortunately, the <c>vector_scal_opencl</c> above still
+has to go through the common OpenCL boilerplate; in the future,
+additional extensions will automate most of it.
+
+\subsection Adding_a_CUDA_Task_Implementation Adding a CUDA Task Implementation
+
+Adding a CUDA implementation of the task is very similar, except that
+the implementation itself is typically written in CUDA, and compiled
+with <c>nvcc</c>.  Thus, the C file only needs to contain an external
+declaration for the task implementation:
+
+\code{.c}
+extern void vector_scal_cuda (unsigned size, float vector[size],
+                              float factor)
+  __attribute__ ((task_implementation ("cuda", vector_scal)));
+\endcode
+
+The actual implementation of the CUDA task goes into a separate
+compilation unit, in a <c>.cu</c> file.  It is very close to the
+implementation when using StarPU's standard C API (\ref Definition_of_the_CUDA_Kernel).
+
+\code{.c}
+/* CUDA implementation of the `vector_scal' task, to be compiled with `nvcc'. */
+
+#include <starpu.h>
+#include <stdlib.h>
+
+static __global__ void
+vector_mult_cuda (unsigned n, float *val, float factor)
+{
+  unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
+
+  if (i < n)
+    val[i] *= factor;
+}
+
+/* Definition of the task implementation declared in the C file. */
+extern "C" void
+vector_scal_cuda (size_t size, float vector[], float factor)
+{
+  unsigned threads_per_block = 64;
+  unsigned nblocks = (size + threads_per_block - 1) / threads_per_block;
+
+  vector_mult_cuda <<< nblocks, threads_per_block, 0,
+    starpu_cuda_get_local_stream () >>> (size, vector, factor);
+
+  cudaStreamSynchronize (starpu_cuda_get_local_stream ());
+}
+\endcode
+
+The complete source code, in the <c>gcc-plugin/examples/vector_scal</c>
+directory of the StarPU distribution, also shows how an SSE-specialized
+CPU task implementation can be added.
+
+For more details on the C extensions provided by StarPU's GCC plug-in,
+\ref cExtensions.
+
+\section Vector_Scaling_Using_StarPU_API Vector Scaling Using StarPU's API
+
+This section shows how to achieve the same result as explained in the
+previous section using StarPU's standard C API.
+
+The full source code for
+this example is given in @ref{Full source code for the 'Scaling a
+Vector' example}.
+
+\subsection Source_Code_of_Vector_Scaling Source Code of Vector Scaling
+
+Programmers can describe the data layout of their application so that StarPU is
+responsible for enforcing data coherency and availability across the machine.
+Instead of handling complex (and non-portable) mechanisms to perform data
+movements, programmers only declare which piece of data is accessed and/or
+modified by a task, and StarPU makes sure that when a computational kernel
+starts somewhere (e.g. on a GPU), its data are available locally.
+
+Before submitting those tasks, the programmer first needs to declare the
+different pieces of data to StarPU using the <c>starpu_*_data_register</c>
+functions. To ease the development of applications for StarPU, it is possible
+to describe multiple types of data layout. A type of data layout is called an
+<b>interface</b>. There are different predefined interfaces available in StarPU:
+here we will consider the <b>vector interface</b>.
+
+The following lines show how to declare an array of <c>NX</c> elements of type
+<c>float</c> using the vector interface:
+
+\code{.c}
+float vector[NX];
+
+starpu_data_handle_t vector_handle;
+starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector, NX,
+                            sizeof(vector[0]));
+\endcode
+
+The first argument, called the <b>data handle</b>, is an opaque pointer which
+designates the array in StarPU. This is also the structure which is used to
+describe which data is used by a task. The second argument is the node number
+where the data originally resides. Here it is 0 since the <c>vector array</c> is in
+the main memory. Then comes the pointer <c>vector</c> where the data can be found in main memory,
+the number of elements in the vector and the size of each element.
+The following shows how to construct a StarPU task that will manipulate the
+vector and a constant factor.
+
+\code{.c}
+float factor = 3.14;
+struct starpu_task *task = starpu_task_create();
+
+task->cl = &cl;                      /* Pointer to the codelet defined below */
+task->handles[0] = vector_handle;    /* First parameter of the codelet */
+task->cl_arg = &factor;
+task->cl_arg_size = sizeof(factor);
+task->synchronous = 1;
+
+starpu_task_submit(task);
+\endcode
+
+Since the factor is a mere constant float value parameter,
+it does not need a preliminary registration, and
+can just be passed through the <c>cl_arg</c> pointer like in the previous
+example.  The vector parameter is described by its handle.
+There are two fields in each element of the <c>buffers</c> array.
+<c>handle</c> is the handle of the data, and <c>mode</c> specifies how the
+kernel will access the data (<c>STARPU_R</c> for read-only, <c>STARPU_W</c> for
+write-only and <c>STARPU_RW</c> for read and write access).
+
+The definition of the codelet can be written as follows:
+
+\code{.c}
+void scal_cpu_func(void *buffers[], void *cl_arg)
+{
+    unsigned i;
+    float *factor = cl_arg;
+
+    /* length of the vector */
+    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
+    /* CPU copy of the vector pointer */
+    float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
+
+    for (i = 0; i < n; i++)
+        val[i] *= *factor;
+}
+
+struct starpu_codelet cl =
+{
+    .cpu_funcs = { scal_cpu_func, NULL },
+    .nbuffers = 1,
+    .modes = { STARPU_RW }
+};
+\endcode
+
+The first argument is an array that gives
+a description of all the buffers passed in the <c>task->handles</c> array. The
+size of this array is given by the <c>nbuffers</c> field of the codelet
+structure. For the sake of genericity, this array contains pointers to the
+different interfaces describing each buffer.  In the case of the <b>vector
+interface</b>, the location of the vector (resp. its length) is accessible in the
+\<c>ptr<c> (resp. <c>nx</c>) of this array. Since the vector is accessed in a
+read-write fashion, any modification will automatically affect future accesses
+to this vector made by other tasks.
+
+The second argument of the <c>scal_cpu_func</c> function contains a pointer to the
+parameters of the codelet (given in <c>task->cl_arg</c>), so that we read the
+constant factor from this pointer.
+
+\subsection Execution_of_Vector_Scaling Execution of Vector Scaling
+
+\verbatim
+$ make vector_scal
+cc $(pkg-config --cflags starpu-1.1)  $(pkg-config --libs starpu-1.1)  vector_scal.c   -o vector_scal
+$ ./vector_scal
+0.000000 3.000000 6.000000 9.000000 12.000000
+\endverbatim
+
+\section Vector_Scaling_on_an_Hybrid_CPU_GPU_Machine Vector Scaling on an Hybrid CPU/GPU Machine
+
+Contrary to the previous examples, the task submitted in this example may not
+only be executed by the CPUs, but also by a CUDA device.
+
+\subsection Definition_of_the_CUDA_Kernel Definition of the CUDA Kernel
+
+The CUDA implementation can be written as follows. It needs to be compiled with
+a CUDA compiler such as nvcc, the NVIDIA CUDA compiler driver. It must be noted
+that the vector pointer returned by STARPU_VECTOR_GET_PTR is here a pointer in GPU
+memory, so that it can be passed as such to the <c>vector_mult_cuda</c> kernel
+call.
+
+\code{.c}
+#include <starpu.h>
+
+static __global__ void vector_mult_cuda(unsigned n, float *val,
+                                        float factor)
+{
+    unsigned i =  blockIdx.x*blockDim.x + threadIdx.x;
+    if (i < n)
+        val[i] *= factor;
+}
+
+extern "C" void scal_cuda_func(void *buffers[], void *_args)
+{
+    float *factor = (float *)_args;
+
+    /* length of the vector */
+    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
+    /* CUDA copy of the vector pointer */
+    float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
+    unsigned threads_per_block = 64;
+    unsigned nblocks = (n + threads_per_block-1) / threads_per_block;
+
+@i{    vector_mult_cuda<<<nblocks,threads_per_block, 0, starpu_cuda_get_local_stream()>>>}
+@i{                    (n, val, *factor);}
+
+@i{    cudaStreamSynchronize(starpu_cuda_get_local_stream());}
+}
+\endcode
+
+\subsection Definition_of_the_OpenCL_Kernel Definition of the OpenCL Kernel
+
+The OpenCL implementation can be written as follows. StarPU provides
+tools to compile a OpenCL kernel stored in a file.
+
+\code{.c}
+__kernel void vector_mult_opencl(int nx, __global float* val, float factor)
+{
+        const int i = get_global_id(0);
+        if (i < nx) {
+                val[i] *= factor;
+        }
+}
+\endcode
+
+Contrary to CUDA and CPU, <c>STARPU_VECTOR_GET_DEV_HANDLE</c> has to be used,
+which returns a <c>cl_mem</c> (which is not a device pointer, but an OpenCL
+handle), which can be passed as such to the OpenCL kernel. The difference is
+important when using partitioning, see @ref{Partitioning Data}.
+
+\code{.c}
+#include <starpu.h>
+
+extern struct starpu_opencl_program programs;
+
+void scal_opencl_func(void *buffers[], void *_args)
+{
+    float *factor = _args;
+    int id, devid, err;     /* OpenCL specific code */
+    cl_kernel kernel;       /* OpenCL specific code */
+    cl_command_queue queue; /* OpenCL specific code */
+    cl_event event;         /* OpenCL specific code */
+
+    /* length of the vector */
+    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
+    /* OpenCL copy of the vector pointer */
+    cl_mem val = (cl_mem) STARPU_VECTOR_GET_DEV_HANDLE(buffers[0]);
+
+    { /* OpenCL specific code */
+        id = starpu_worker_get_id();
+        devid = starpu_worker_get_devid(id);
+
+	err = starpu_opencl_load_kernel(&kernel, &queue, &programs,
+	                       "vector_mult_opencl", devid);   /* Name of the codelet defined above */
+        if (err != CL_SUCCESS) STARPU_OPENCL_REPORT_ERROR(err);
+
+        err = clSetKernelArg(kernel, 0, sizeof(n), &n);
+        err |= clSetKernelArg(kernel, 1, sizeof(val), &val);
+        err |= clSetKernelArg(kernel, 2, sizeof(*factor), factor);
+        if (err) STARPU_OPENCL_REPORT_ERROR(err);
+    }
+
+    {  /* OpenCL specific code */
+        size_t global=n;
+        size_t local=1;
+        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &event);
+        if (err != CL_SUCCESS) STARPU_OPENCL_REPORT_ERROR(err);
+    }
+
+    {  /* OpenCL specific code */
+        clFinish(queue);
+        starpu_opencl_collect_stats(event);
+        clReleaseEvent(event);
+
+        starpu_opencl_release_kernel(kernel);
+    }
+}
+\endcode
+
+
+\subsection Definition_of_the_Main_Code Definition of the Main Code
+
+The CPU implementation is the same as in the previous section.
+
+Here is the source of the main application. You can notice that the fields
+<c>cuda_funcs</c> and <c>opencl_funcs</c> of the codelet are set to
+define the pointers to the CUDA and OpenCL implementations of the
+task.
+
+\code{.c}
+#include <starpu.h>
+
+#define NX 2048
+
+extern void scal_cuda_func(void *buffers[], void *_args);
+extern void scal_cpu_func(void *buffers[], void *_args);
+extern void scal_opencl_func(void *buffers[], void *_args);
+
+/* Definition of the codelet */
+static struct starpu_codelet cl =
+{
+    .cuda_funcs = { scal_cuda_func, NULL },
+    .cpu_funcs = { scal_cpu_func, NULL },
+    .opencl_funcs = { scal_opencl_func, NULL },
+    .nbuffers = 1,
+    .modes = { STARPU_RW }
+}
+
+#ifdef STARPU_USE_OPENCL
+/* The compiled version of the OpenCL program */
+struct starpu_opencl_program programs;
+#endif
+
+int main(int argc, char **argv)
+{
+    float *vector;
+    int i, ret;
+    float factor=3.0;
+    struct starpu_task *task;
+    starpu_data_handle_t vector_handle;
+
+    starpu_init(NULL);                            /* Initialising StarPU */
+
+#ifdef STARPU_USE_OPENCL
+    starpu_opencl_load_opencl_from_file(
+            "examples/basic_examples/vector_scal_opencl_codelet.cl",
+            &programs, NULL);
+#endif
+
+    vector = malloc(NX*sizeof(vector[0]));
+    assert(vector);
+    for(i=0 ; i<NX ; i++) vector[i] = i;
+
+    /* Registering data within StarPU */
+    starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector,
+                                NX, sizeof(vector[0]));
+
+    /* Definition of the task */
+    task = starpu_task_create();
+    task->cl = &cl;
+    task->handles[0] = vector_handle;
+    task->cl_arg = &factor;
+    task->cl_arg_size = sizeof(factor);
+
+    /* Submitting the task */
+    ret = starpu_task_submit(task);
+    if (ret == -ENODEV) {
+            fprintf(stderr, "No worker may execute this task\n");
+            return 1;
+    }
+
+    /* Waiting for its termination */
+    starpu_task_wait_for_all();
+
+    /* Update the vector in RAM */
+    starpu_data_acquire(vector_handle, STARPU_R);
+
+    /* Access the data */
+    for(i=0 ; i<NX; i++) {
+      fprintf(stderr, "%f ", vector[i]);
+    }
+    fprintf(stderr, "\n");
+
+    /* Release the RAM view of the data before unregistering it and shutting down StarPU */
+    starpu_data_release(vector_handle);
+    starpu_data_unregister(vector_handle);
+    starpu_shutdown();
+
+    return 0;
+}
+\endcode
+
+\subsection Execution_of_Hybrid_Vector_Scaling Execution of Hybrid Vector Scaling
+
+The Makefile given at the beginning of the section must be extended to
+give the rules to compile the CUDA source code. Note that the source
+file of the OpenCL kernel does not need to be compiled now, it will
+be compiled at run-time when calling the function
+starpu_opencl_load_opencl_from_file() (@pxref{starpu_opencl_load_opencl_from_file}).
+
+\verbatim
+CFLAGS  += $(shell pkg-config --cflags starpu-1.1)
+LDFLAGS += $(shell pkg-config --libs starpu-1.1)
+CC       = gcc
+
+vector_scal: vector_scal.o vector_scal_cpu.o vector_scal_cuda.o vector_scal_opencl.o
+
+%.o: %.cu
+       nvcc $(CFLAGS) $< -c $@
+
+clean:
+       rm -f vector_scal *.o
+\endverbatim
+
+\verbatim
+$ make
+\endverbatim
+
+and to execute it, with the default configuration:
+
+\verbatim
+$ ./vector_scal
+0.000000 3.000000 6.000000 9.000000 12.000000
+\endverbatim
+
+or for example, by disabling CPU devices:
+
+\verbatim
+$ STARPU_NCPU=0 ./vector_scal
+0.000000 3.000000 6.000000 9.000000 12.000000
+\endverbatim
+
+or by disabling CUDA devices (which may permit to enable the use of OpenCL,
+see \ref Enabling_OpenCL) :
+
+\verbatim
+$ STARPU_NCUDA=0 ./vector_scal
+0.000000 3.000000 6.000000 9.000000 12.000000
+\endverbatim
+
+*/

+ 289 - 0
doc/doxygen/chapters/building.doxy

@@ -0,0 +1,289 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page buildingAndInstalling Building and Installing StarPU
+
+\section installing_a_binary_package Installing a Binary Package
+
+One of the StarPU developers being a Debian Developer, the packages
+are well integrated and very uptodate. To see which packages are
+available, simply type:
+
+\verbatim
+$ apt-cache search starpu
+\endverbatim
+
+To install what you need, type:
+
+\verbatim
+$ sudo apt-get install libstarpu-1.0 libstarpu-dev
+\endverbatim
+
+\section installing_from_source Installing from Source
+
+StarPU can be built and installed by the standard means of the GNU
+autotools. The following chapter is intended to briefly remind how these tools
+can be used to install StarPU.
+
+\subsection optional_dependencies Optional Dependencies
+
+The <a href="http://www.open-mpi.org/software/hwloc"><c>hwloc</c> topology
+discovery library</a> is not mandatory to use StarPU but strongly
+recommended.  It allows for topology aware scheduling, which improves
+performance.  <c>hwloc</c> is available in major free operating system
+distributions, and for most operating systems.
+
+If <c>hwloc</c> is not available on your system, the option
+<c>--without-hwloc</c> should be explicitely given when calling the
+<c>configure</c> script. If <c>hwloc</c> is installed with a <c>pkg-config</c> file,
+no option is required, it will be detected automatically, otherwise
+<c>with-hwloc=prefix</c> should be used to specify the location
+of <c>hwloc</c>.
+
+\subsection getting_sources Getting Sources
+
+StarPU's sources can be obtained from the <a href="http://runtime.bordeaux.inria.fr/StarPU/files/">download page of
+the StarPU website</a>
+
+All releases and the development tree of StarPU are freely available
+on INRIA's gforge under the LGPL license. Some releases are available
+under the BSD license.
+
+The latest release can be downloaded from the <a href="http://gforge.inria.fr/frs/?group_id=1570">INRIA's gforge</a> or
+directly from the <a href="http://runtime.bordeaux.inria.fr/StarPU/files/">StarPU download page</a>.
+
+The latest nightly snapshot can be downloaded from the <a href="http://starpu.gforge.inria.fr/testing/">StarPU gforge website</a>.
+
+\verbatim
+$ wget http://starpu.gforge.inria.fr/testing/starpu-nightly-latest.tar.gz
+\endverbatim
+
+And finally, current development version is also accessible via svn.
+It should be used only if you need the very latest changes (i.e. less
+than a day!). Note that the client side of the software Subversion can
+be obtained from http://subversion.tigris.org. If you
+are running on Windows, you will probably prefer to use <a href="http://tortoisesvn.tigris.org/">TortoiseSVN</a>.
+
+\verbatim
+$ svn checkout svn://scm.gforge.inria.fr/svn/starpu/trunk StarPU
+\endverbatim
+
+\subsection configuring_starpu Configuring StarPU
+
+Running <c>autogen.sh</c> is not necessary when using the tarball
+releases of StarPU.  If you are using the source code from the svn
+repository, you first need to generate the configure scripts and the
+Makefiles. This requires the availability of <c>autoconf</c>,
+<c>automake</c> >= 2.60, and <c>makeinfo</c>.
+
+\verbatim
+$ ./autogen.sh
+\endverbatim
+
+You then need to configure StarPU. Details about options that are
+useful to give to <c>./configure</c> are given in @ref{Compilation
+configuration}.
+
+\verbatim
+$ ./configure
+\endverbatim
+
+If <c>configure</c> does not detect some software or produces errors, please
+make sure to post the content of <c>config.log</c> when reporting the issue.
+
+By default, the files produced during the compilation are placed in
+the source directory. As the compilation generates a lot of files, it
+is advised to to put them all in a separate directory. It is then
+easier to cleanup, and this allows to compile several configurations
+out of the same source tree. For that, simply enter the directory
+where you want the compilation to produce its files, and invoke the
+<c>configure</c> script located in the StarPU source directory.
+
+\verbatim
+$ mkdir build
+$ cd build
+$ ../configure
+\endverbatim
+
+\subsection building_starpu Building StarPU
+
+\verbatim
+$ make
+\endverbatim
+
+Once everything is built, you may want to test the result. An
+extensive set of regression tests is provided with StarPU. Running the
+tests is done by calling <c>make check</c>. These tests are run every night
+and the result from the main profile is publicly <a href="http://starpu.gforge.inria.fr/testing/">available</a>.
+
+\verbatim
+$ make check
+\endverbatim
+
+\subsection installing_starpu Installing StarPU
+
+In order to install StarPU at the location that was specified during
+configuration:
+
+\verbatim
+$ make install
+\endverbatim
+
+Libtool interface versioning information are included in
+libraries names (libstarpu-1.0.so, libstarpumpi-1.0.so and
+libstarpufft-1.0.so).
+
+\section setting_up_your_own Code Setting up Your Own Code
+
+\subsection setting_flags_for_compiling Setting Flags for Compiling, Linking and Running Applications
+
+StarPU provides a pkg-config executable to obtain relevant compiler
+and linker flags.
+Compiling and linking an application against StarPU may require to use
+specific flags or libraries (for instance <c>CUDA</c> or <c>libspe2</c>).
+To this end, it is possible to use the <c>pkg-config</c> tool.
+
+If StarPU was not installed at some standard location, the path of StarPU's
+library must be specified in the <c>PKG_CONFIG_PATH</c> environment variable so
+that <c>pkg-config</c> can find it. For example if StarPU was installed in
+<c>$prefix_dir</c>:
+
+\verbatim
+$ PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$prefix_dir/lib/pkgconfig
+\endverbatim
+
+The flags required to compile or link against StarPU are then
+accessible with the following commands:
+
+\verbatim
+$ pkg-config --cflags starpu-1.1  # options for the compiler
+$ pkg-config --libs starpu-1.1    # options for the linker
+\endverbatim
+
+Note that it is still possible to use the API provided in the version
+0.9 of StarPU by calling <c>pkg-config</c> with the <c>libstarpu</c> package.
+Similar packages are provided for <c>libstarpumpi</c> and <c>libstarpufft</c>.
+
+Make sure that <c>pkg-config --libs starpu-1.1</c> actually produces some output
+before going further: <c>PKG_CONFIG_PATH</c> has to point to the place where
+<c>starpu-1.1.pc</c> was installed during <c>make install</c>.
+
+Also pass the <c>--static</c> option if the application is to be
+linked statically.
+
+It is also necessary to set the variable <c>LD_LIBRARY_PATH</c> to
+locate dynamic libraries at runtime.
+
+\verbatim
+$ LD_LIBRARY_PATH=$prefix_dir/lib:$LD_LIBRARY_PATH
+\endverbatim
+
+When using a Makefile, the following lines can be added to set the
+options for the compiler and the linker:
+
+\verbatim
+CFLAGS          +=      $$(pkg-config --cflags starpu-1.1)
+LDFLAGS         +=      $$(pkg-config --libs starpu-1.1)
+\endverbatim
+
+\subsection running_a_basic_starpu_application Running a Basic StarPU Application
+
+Basic examples using StarPU are built in the directory
+<c>examples/basic_examples/</c> (and installed in
+<c>$prefix_dir/lib/starpu/examples/</c>). You can for example run the example
+<c>vector_scal</c>.
+
+\verbatim
+$ ./examples/basic_examples/vector_scal
+BEFORE: First element was 1.000000
+AFTER: First element is 3.140000
+\endverbatim
+
+When StarPU is used for the first time, the directory
+<c>$STARPU_HOME/.starpu/</c> is created, performance models will be stored in
+that directory (@pxref{STARPU_HOME}).
+
+Please note that buses are benchmarked when StarPU is launched for the
+first time. This may take a few minutes, or less if <c>hwloc</c> is
+installed. This step is done only once per user and per machine.
+
+\subsection kernel_threads_started_by_starpu Kernel Threads Started by StarPU
+
+StarPU automatically binds one thread per CPU core. It does not use
+SMT/hyperthreading because kernels are usually already optimized for using a
+full core, and using hyperthreading would make kernel calibration rather random.
+
+Since driving GPUs is a CPU-consuming task, StarPU dedicates one core per GPU
+
+While StarPU tasks are executing, the application is not supposed to do
+computations in the threads it starts itself, tasks should be used instead.
+
+TODO: add a StarPU function to bind an application thread (e.g. the main thread)
+to a dedicated core (and thus disable the corresponding StarPU CPU worker).
+
+\subsection Enabling_OpenCL Enabling OpenCL
+
+When both CUDA and OpenCL drivers are enabled, StarPU will launch an
+OpenCL worker for NVIDIA GPUs only if CUDA is not already running on them.
+This design choice was necessary as OpenCL and CUDA can not run at the
+same time on the same NVIDIA GPU, as there is currently no interoperability
+between them.
+
+To enable OpenCL, you need either to disable CUDA when configuring StarPU:
+
+\verbatim
+$ ./configure --disable-cuda
+\endverbatim
+
+or when running applications:
+
+\verbatim
+$ STARPU_NCUDA=0 ./application
+\endverbatim
+
+OpenCL will automatically be started on any device not yet used by
+CUDA. So on a machine running 4 GPUS, it is therefore possible to
+enable CUDA on 2 devices, and OpenCL on the 2 other devices by doing
+so:
+
+\verbatim
+$ STARPU_NCUDA=2 ./application
+\endverbatim
+
+\section benchmarking_starpu Benchmarking StarPU
+
+Some interesting benchmarks are installed among examples in
+<c>$prefix_dir/lib/starpu/examples/</c>. Make sure to try various
+schedulers, for instance <c>STARPU_SCHED=dmda</c>.
+
+\subsection task_size_overhead Task size overhead
+
+This benchmark gives a glimpse into how big a size should be for StarPU overhead
+to be low enough.  Run <c>tasks_size_overhead.sh</c>, it will generate a plot
+of the speedup of tasks of various sizes, depending on the number of CPUs being
+used.
+
+\subsection data_transfer_latency Data transfer latency
+
+<c>local_pingpong</c> performs a ping-pong between the first two CUDA nodes, and
+prints the measured latency.
+
+\subsection matrix_matrix_multiplication Matrix-matrix multiplication
+
+<c>sgemm</c> and <c>dgemm</c> perform a blocked matrix-matrix
+multiplication using BLAS and cuBLAS. They output the obtained GFlops.
+
+\subsection cholesky_factorization Cholesky factorization
+
+<c>cholesky\*</c> perform a Cholesky factorization (single precision). They use different dependency primitives.
+
+\subsection lu_factorization LU factorization
+
+<c>lu\*</c> perform an LU factorization. They use different dependency primitives.
+
+*/

+ 464 - 0
doc/doxygen/chapters/c_extensions.doxy

@@ -0,0 +1,464 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page cExtensions C Extensions
+
+When GCC plug-in support is available, StarPU builds a plug-in for the
+GNU Compiler Collection (GCC), which defines extensions to languages of
+the C family (C, C++, Objective-C) that make it easier to write StarPU
+code. This feature is only available for GCC 4.5 and later; it
+is known to work with GCC 4.5, 4.6, and 4.7.  You
+may need to install a specific <c>-dev</c> package of your distro, such
+as <c>gcc-4.6-plugin-dev</c> on Debian and derivatives.  In addition,
+the plug-in's test suite is only run when <a href="http://www.gnu.org/software/guile/">GNU Guile</a> is found at
+<c>configure</c>-time.  Building the GCC plug-in
+can be disabled by configuring with <c>--disable-gcc-extensions</c>.
+
+Those extensions include syntactic sugar for defining
+tasks and their implementations, invoking a task, and manipulating data
+buffers.  Use of these extensions can be made conditional on the
+availability of the plug-in, leading to valid C sequential code when the
+plug-in is not used (\ref Conditional_Extensions).
+
+When StarPU has been installed with its GCC plug-in, programs that use
+these extensions can be compiled this way:
+
+\verbatim
+$ gcc -c -fplugin=`pkg-config starpu-1.1 --variable=gccplugin` foo.c
+\endverbatim
+
+When the plug-in is not available, the above <c>pkg-config</c>
+command returns the empty string.
+
+In addition, the <c>-fplugin-arg-starpu-verbose</c> flag can be used to
+obtain feedback from the compiler as it analyzes the C extensions used
+in source files.
+
+This section describes the C extensions implemented by StarPU's GCC
+plug-in.  It does not require detailed knowledge of the StarPU library.
+
+Note: as of StarPU @value{VERSION}, this is still an area under
+development and subject to change.
+
+\section Defining_Tasks Defining Tasks
+
+The StarPU GCC plug-in views tasks as ``extended'' C functions:
+
+<ul>
+<Li>
+tasks may have several implementations---e.g., one for CPUs, one written
+in OpenCL, one written in CUDA;
+</li>
+<Li>
+tasks may have several implementations of the same target---e.g.,
+several CPU implementations;
+</li>
+<li>
+when a task is invoked, it may run in parallel, and StarPU is free to
+choose any of its implementations.
+</li>
+</ul>
+
+Tasks and their implementations must be <em>declared</em>.  These
+declarations are annotated with attributes (@pxref{Attribute
+Syntax, attributes in GNU C,, gcc, Using the GNU Compiler Collection
+(GCC)}): the declaration of a task is a regular C function declaration
+with an additional <c>task</c> attribute, and task implementations are
+declared with a <c>task_implementation</c> attribute.
+
+The following function attributes are provided:
+
+<dl>
+
+<dt><c>task</c></dt>
+<dd>
+Declare the given function as a StarPU task.  Its return type must be
+<c>void</c>.  When a function declared as <c>task</c> has a user-defined
+body, that body is interpreted as the implicit definition of the
+task's CPU implementation (see example below).  In all cases, the
+actual definition of a task's body is automatically generated by the
+compiler.
+
+Under the hood, declaring a task leads to the declaration of the
+corresponding <c>codelet</c> (@pxref{Codelet and Tasks}).  If one or
+more task implementations are declared in the same compilation unit,
+then the codelet and the function itself are also defined; they inherit
+the scope of the task.
+
+Scalar arguments to the task are passed by value and copied to the
+target device if need be---technically, they are passed as the
+<c>cl_arg</c> buffer (@pxref{Codelets and Tasks, <c>cl_arg</c>}).
+
+Pointer arguments are assumed to be registered data buffers---the
+<c>buffers</c> argument of a task (@pxref{Codelets and Tasks,
+<c>buffers</c>}); <c>const</c>-qualified pointer arguments are viewed as
+read-only buffers (<c>STARPU_R</c>), and non-<c>const</c>-qualified
+buffers are assumed to be used read-write (<c>STARPU_RW</c>).  In
+addition, the <c>output</c> type attribute can be as a type qualifier
+for output pointer or array parameters (<c>STARPU_W</c>).
+</dd>
+
+<dt><c>task_implementation (target, task)</c></dt>
+<dd>
+Declare the given function as an implementation of <c>task</c> to run on
+<c>target</c>.  <c>target</c> must be a string, currently one of
+<c>"cpu"</c>, <c>"opencl"</c>, or <c>"cuda"</c>.
+\internal
+FIXME: Update when OpenCL support is ready.
+\endinternal
+</dd>
+</dl>
+
+Here is an example:
+
+\code{.c}
+#define __output  __attribute__ ((output))
+
+static void matmul (const float *A, const float *B,
+                    __output float *C,
+                    unsigned nx, unsigned ny, unsigned nz)
+  __attribute__ ((task));
+
+static void matmul_cpu (const float *A, const float *B,
+                        __output float *C,
+                        unsigned nx, unsigned ny, unsigned nz)
+  __attribute__ ((task_implementation ("cpu", matmul)));
+
+
+static void
+matmul_cpu (const float *A, const float *B, __output float *C,
+            unsigned nx, unsigned ny, unsigned nz)
+{
+  unsigned i, j, k;
+
+  for (j = 0; j < ny; j++)
+    for (i = 0; i < nx; i++)
+      {
+        for (k = 0; k < nz; k++)
+          C[j * nx + i] += A[j * nz + k] * B[k * nx + i];
+      }
+}
+\endcode
+
+A <c>matmult</c> task is defined; it has only one implementation,
+<c>matmult_cpu</c>, which runs on the CPU.  Variables <c>A</c> and
+<c>B</c> are input buffers, whereas <c>C</c> is considered an input/output
+buffer.
+
+For convenience, when a function declared with the <c>task</c> attribute
+has a user-defined body, that body is assumed to be that of the CPU
+implementation of a task, which we call an implicit task CPU
+implementation.  Thus, the above snippet can be simplified like this:
+
+\code{.c}
+#define __output  __attribute__ ((output))
+
+static void matmul (const float *A, const float *B,
+                    __output float *C,
+                    unsigned nx, unsigned ny, unsigned nz)
+  __attribute__ ((task));
+
+/* Implicit definition of the CPU implementation of the
+   `matmul' task.  */
+static void
+matmul (const float *A, const float *B, __output float *C,
+        unsigned nx, unsigned ny, unsigned nz)
+{
+  unsigned i, j, k;
+
+  for (j = 0; j < ny; j++)
+    for (i = 0; i < nx; i++)
+      {
+        for (k = 0; k < nz; k++)
+          C[j * nx + i] += A[j * nz + k] * B[k * nx + i];
+      }
+}
+\endcode
+
+Use of implicit CPU task implementations as above has the advantage that
+the code is valid sequential code when StarPU's GCC plug-in is not used
+(\ref Conditional_Extensions).
+
+CUDA and OpenCL implementations can be declared in a similar way:
+
+\code{.c}
+static void matmul_cuda (const float *A, const float *B, float *C,
+                         unsigned nx, unsigned ny, unsigned nz)
+  __attribute__ ((task_implementation ("cuda", matmul)));
+
+static void matmul_opencl (const float *A, const float *B, float *C,
+                           unsigned nx, unsigned ny, unsigned nz)
+  __attribute__ ((task_implementation ("opencl", matmul)));
+\endcode
+
+The CUDA and OpenCL implementations typically either invoke a kernel
+written in CUDA or OpenCL (for similar code, @pxref{CUDA Kernel}, and
+@pxref{OpenCL Kernel}), or call a library function that uses CUDA or
+OpenCL under the hood, such as CUBLAS functions:
+
+\code{.c}
+static void
+matmul_cuda (const float *A, const float *B, float *C,
+             unsigned nx, unsigned ny, unsigned nz)
+{
+  cublasSgemm ('n', 'n', nx, ny, nz,
+               1.0f, A, 0, B, 0,
+               0.0f, C, 0);
+  cudaStreamSynchronize (starpu_cuda_get_local_stream ());
+}
+\endcode
+
+A task can be invoked like a regular C function:
+
+\code{.c}
+matmul (&A[i * zdim * bydim + k * bzdim * bydim],
+        &B[k * xdim * bzdim + j * bxdim * bzdim],
+        &C[i * xdim * bydim + j * bxdim * bydim],
+        bxdim, bydim, bzdim);
+\endcode
+
+This leads to an asynchronous invocation, whereby <c>matmult</c>'s
+implementation may run in parallel with the continuation of the caller.
+
+The next section describes how memory buffers must be handled in
+StarPU-GCC code.  For a complete example, see the
+<c>gcc-plugin/examples</c> directory of the source distribution, and
+\ref Vector_Scaling_Using_the_C_Extension.
+
+
+\section Synchronization_and_Other_Pragmas Initialization, Termination, and Synchronization
+
+The following pragmas allow user code to control StarPU's life time and
+to synchronize with tasks.
+
+<dl>
+
+<dt><c>\#pragma starpu initialize</c></dt>
+<dd>
+Initialize StarPU.  This call is compulsory and is <em>never</em> added
+implicitly.  One of the reasons this has to be done explicitly is that
+it provides greater control to user code over its resource usage.
+</dd>
+
+<dt><c>\#pragma starpu shutdown</c></dt>
+<dd>
+Shut down StarPU, giving it an opportunity to write profiling info to a
+file on disk, for instance (\ref Off-line_performance_feedback).
+</dd>
+
+<dt><c>\#pragma starpu wait</c></dt>
+<dd>
+Wait for all task invocations to complete, as with
+starpu_wait_for_all().
+</dd>
+</dl>
+
+\section Registered_Data_Buffers Registered Data Buffers
+
+Data buffers such as matrices and vectors that are to be passed to tasks
+must be registered.  Registration allows StarPU to handle data
+transfers among devices---e.g., transferring an input buffer from the
+CPU's main memory to a task scheduled to run a GPU (\ref StarPU_Data_Management_Library).
+
+The following pragmas are provided:
+
+<dl>
+
+<dt><c>\#pragma starpu register ptr [size]</c></dt>
+<dd>
+Register <c>ptr</c> as a <c>size</c>-element buffer.  When <c>ptr</c> has
+an array type whose size is known, <c>size</c> may be omitted.
+Alternatively, the <c>registered</c> attribute can be used (see below.)
+</dd>
+
+<dt><c>\#pragma starpu unregister ptr</c></dt>
+<dd>
+Unregister the previously-registered memory area pointed to by
+<c>ptr</c>.  As a side-effect, <c>ptr</c> points to a valid copy in main
+memory.
+</dd>
+
+<dt><c>\#pragma starpu acquire ptr</c></dt>
+<dd>
+Acquire in main memory an up-to-date copy of the previously-registered
+memory area pointed to by <c>ptr</c>, for read-write access.
+</dd>
+
+<dt><c>\#pragma starpu release ptr</c></dt>
+<dd>
+Release the previously-register memory area pointed to by <c>ptr</c>,
+making it available to the tasks.
+</dd>
+</dl>
+
+Additionally, the following attributes offer a simple way to allocate
+and register storage for arrays:
+
+<dl>
+
+<dt><c>registered</c></dt>
+<dd>
+This attributes applies to local variables with an array type.  Its
+effect is to automatically register the array's storage, as per
+<c>\#pragma starpu register</c>.  The array is automatically unregistered
+when the variable's scope is left.  This attribute is typically used in
+conjunction with the <c>heap_allocated</c> attribute, described below.
+</dd>
+
+<dt><c>heap_allocated</c></dt>
+<dd>
+This attributes applies to local variables with an array type.  Its
+effect is to automatically allocate the array's storage on
+the heap, using starpu_malloc() under the hood.  The heap-allocated array is automatically
+freed when the variable's scope is left, as with
+automatic variables.
+</dd>
+</dl>
+
+The following example illustrates use of the <c>heap_allocated</c>
+attribute:
+
+\code{.c}
+extern void cholesky(unsigned nblocks, unsigned size,
+                    float mat[nblocks][nblocks][size])
+  __attribute__ ((task));
+
+int
+main (int argc, char *argv[])
+{
+#pragma starpu initialize
+
+  /* ... */
+
+  int nblocks, size;
+  parse_args (&nblocks, &size);
+
+  /* Allocate an array of the required size on the heap,
+     and register it.  */
+
+  {
+    float matrix[nblocks][nblocks][size]
+      __attribute__ ((heap_allocated, registered));
+
+    cholesky (nblocks, size, matrix);
+
+#pragma starpu wait
+
+  }   /* MATRIX is automatically unregistered & freed here.  */
+
+#pragma starpu shutdown
+
+  return EXIT_SUCCESS;
+}
+\endcode
+
+\section Conditional_Extensions Using C Extensions Conditionally
+
+The C extensions described in this chapter are only available when GCC
+and its StarPU plug-in are in use.  Yet, it is possible to make use of
+these extensions when they are available---leading to hybrid CPU/GPU
+code---and discard them when they are not available---leading to valid
+sequential code.
+
+To that end, the GCC plug-in defines a C preprocessor macro when it is
+being used:
+
+@defmac STARPU_GCC_PLUGIN
+Defined for code being compiled with the StarPU GCC plug-in.  When
+defined, this macro expands to an integer denoting the version of the
+supported C extensions.
+@end defmac
+
+The code below illustrates how to define a task and its implementations
+in a way that allows it to be compiled without the GCC plug-in:
+
+\code{.c}
+/* This program is valid, whether or not StarPU's GCC plug-in
+   is being used.  */
+
+#include <stdlib.h>
+
+/* The attribute below is ignored when GCC is not used.  */
+static void matmul (const float *A, const float *B, float * C,
+                    unsigned nx, unsigned ny, unsigned nz)
+  __attribute__ ((task));
+
+static void
+matmul (const float *A, const float *B, float * C,
+        unsigned nx, unsigned ny, unsigned nz)
+{
+  /* Code of the CPU kernel here...  */
+}
+
+#ifdef STARPU_GCC_PLUGIN
+/* Optional OpenCL task implementation.  */
+
+static void matmul_opencl (const float *A, const float *B, float * C,
+                           unsigned nx, unsigned ny, unsigned nz)
+  __attribute__ ((task_implementation ("opencl", matmul)));
+
+static void
+matmul_opencl (const float *A, const float *B, float * C,
+               unsigned nx, unsigned ny, unsigned nz)
+{
+  /* Code that invokes the OpenCL kernel here...  */
+}
+#endif
+
+int
+main (int argc, char *argv[])
+{
+  /* The pragmas below are simply ignored when StarPU-GCC
+     is not used.  */
+#pragma starpu initialize
+
+  float A[123][42][7], B[123][42][7], C[123][42][7];
+
+#pragma starpu register A
+#pragma starpu register B
+#pragma starpu register C
+
+  /* When StarPU-GCC is used, the call below is asynchronous;
+     otherwise, it is synchronous.  */
+  matmul ((float *) A, (float *) B, (float *) C, 123, 42, 7);
+
+#pragma starpu wait
+#pragma starpu shutdown
+
+  return EXIT_SUCCESS;
+}
+\endcode
+
+The above program is a valid StarPU program when StarPU's GCC plug-in is
+used; it is also a valid sequential program when the plug-in is not
+used.
+
+Note that attributes such as <c>task</c> as well as <c>starpu</c>
+pragmas are simply ignored by GCC when the StarPU plug-in is not loaded.
+However, <c>gcc -Wall</c> emits a warning for unknown attributes and
+pragmas, which can be inconvenient.  In addition, other compilers may be
+unable to parse the attribute syntax (In practice, Clang and
+several proprietary compilers implement attributes.), so you may want to
+wrap attributes in macros like this:
+
+\code{.c}
+/* Use the `task' attribute only when StarPU's GCC plug-in
+   is available.   */
+#ifdef STARPU_GCC_PLUGIN
+# define __task  __attribute__ ((task))
+#else
+# define __task
+#endif
+
+static void matmul (const float *A, const float *B, float *C,
+                    unsigned nx, unsigned ny, unsigned nz) __task;
+\endcode
+
+
+*/
+

+ 71 - 0
doc/doxygen/chapters/fft_support.doxy

@@ -0,0 +1,71 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page fftSupport StarPU FFT Support
+
+StarPU provides <c>libstarpufft</c>, a library whose design is very similar to
+both fftw and cufft, the difference being that it takes benefit from both CPUs
+and GPUs. It should however be noted that GPUs do not have the same precision as
+CPUs, so the results may different by a negligible amount.
+
+Different precisions are available, namely float, double and long
+double precisions, with the following fftw naming conventions:
+
+<ul>
+<li>
+double precision structures and functions are named e.g. starpufft_execute()
+</li>
+<li>
+float precision structures and functions are named e.g. starpufftf_execute()
+</li>
+<li>
+long double precision structures and functions are named e.g. starpufftl_execute()
+</li>
+</ul>
+
+The documentation below is given with names for double precision, replace
+<c>starpufft_</c> with <c>starpufftf_</c> or <c>starpufftl_</c> as appropriate.
+
+Only complex numbers are supported at the moment.
+
+The application has to call starpu_init() before calling starpufft functions.
+
+Either main memory pointers or data handles can be provided.
+
+<ul>
+<li>
+To provide main memory pointers, use starpufft_start() or
+starpufft_execute(). Only one FFT can be performed at a time, because
+StarPU will have to register the data on the fly. In the starpufft_start()
+case, starpufft_cleanup() needs to be called to unregister the data.
+</li>
+<li>
+To provide data handles (which is preferrable),
+use starpufft_start_handle() (preferred) or
+starpufft_execute_handle(). Several FFTs Several FFT tasks can be submitted
+for a given plan, which permits e.g. to start a series of FFT with just one
+plan. starpufft_start_handle() is preferrable since it does not wait for
+the task completion, and thus permits to enqueue a series of tasks.
+</li>
+</ul>
+
+All functions are defined in @ref{FFT Support}.
+
+\section Compilation Compilation
+
+The flags required to compile or link against the FFT library are accessible
+with the following commands:
+
+\verbatim
+$ pkg-config --cflags starpufft-1.0  # options for the compiler
+$ pkg-config --libs starpufft-1.0    # options for the linker
+\endverbatim
+
+Also pass the <c>--static</c> option if the application is to be linked statically.
+
+*/

+ 217 - 0
doc/doxygen/chapters/introduction.doxy

@@ -0,0 +1,217 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+*/
+
+/*! \mainpage Introduction
+
+\section motivation Motivation
+
+\internal
+complex machines with heterogeneous cores/devices
+\endinternal
+
+The use of specialized hardware such as accelerators or coprocessors offers an
+interesting approach to overcome the physical limits encountered by processor
+architects. As a result, many machines are now equipped with one or several
+accelerators (e.g. a GPU), in addition to the usual processor(s). While a lot of
+efforts have been devoted to offload computation onto such accelerators, very
+little attention as been paid to portability concerns on the one hand, and to the
+possibility of having heterogeneous accelerators and processors to interact on the other hand.
+
+StarPU is a runtime system that offers support for heterogeneous multicore
+architectures, it not only offers a unified view of the computational resources
+(i.e. CPUs and accelerators at the same time), but it also takes care of
+efficiently mapping and executing tasks onto an heterogeneous machine while
+transparently handling low-level issues such as data transfers in a portable
+fashion.
+
+\internal
+this leads to a complicated distributed memory design
+which is not (easily) manageable by hand
+
+added value/benefits of StarPU
+   - portability
+   - scheduling, perf. portability
+\endinternal
+
+\section starpu_in_a_nutshell StarPU in a Nutshell
+
+StarPU is a software tool aiming to allow programmers to exploit the
+computing power of the available CPUs and GPUs, while relieving them
+from the need to specially adapt their programs to the target machine
+and processing units.
+
+At the core of StarPU is its run-time support library, which is
+responsible for scheduling application-provided tasks on heterogeneous
+CPU/GPU machines.  In addition, StarPU comes with programming language
+support, in the form of extensions to languages of the C family
+(\ref cExtensions), as well as an OpenCL front-end (\ref soclOpenclExtensions).
+
+StarPU's run-time and programming language extensions support a
+task-based programming model. Applications submit computational
+tasks, with CPU and/or GPU implementations, and StarPU schedules these
+tasks and associated data transfers on available CPUs and GPUs.  The
+data that a task manipulates are automatically transferred among
+accelerators and the main memory, so that programmers are freed from the
+scheduling issues and technical details associated with these transfers.
+
+StarPU takes particular care of scheduling tasks efficiently, using
+well-known algorithms from the literature (\ref Task_scheduling_policy).  In addition, it allows scheduling experts, such as compiler
+or computational library developers, to implement custom scheduling
+policies in a portable fashion (\ref Defining_a_New_Scheduling_Policy).
+
+The remainder of this section describes the main concepts used in StarPU.
+
+\internal
+explain the notion of codelet and task (i.e. g(A, B)
+\endinternal
+
+\subsection codelet_and_tasks Codelet and Tasks
+
+One of the StarPU primary data structures is the \b codelet. A codelet describes a
+computational kernel that can possibly be implemented on multiple architectures
+such as a CPU, a CUDA device or an OpenCL device.
+
+\internal
+TODO insert illustration f: f_spu, f_cpu, ...
+\endinternal
+
+Another important data structure is the \b task. Executing a StarPU task
+consists in applying a codelet on a data set, on one of the architectures on
+which the codelet is implemented. A task thus describes the codelet that it
+uses, but also which data are accessed, and how they are
+accessed during the computation (read and/or write).
+StarPU tasks are asynchronous: submitting a task to StarPU is a non-blocking
+operation. The task structure can also specify a \b callback function that is
+called once StarPU has properly executed the task. It also contains optional
+fields that the application may use to give hints to the scheduler (such as
+priority levels).
+
+By default, task dependencies are inferred from data dependency (sequential
+coherence) by StarPU. The application can however disable sequential coherency
+for some data, and dependencies be expressed by hand.
+A task may be identified by a unique 64-bit number chosen by the application
+which we refer as a \b tag.
+Task dependencies can be enforced by hand either by the means of callback functions, by
+submitting other tasks, or by expressing dependencies
+between tags (which can thus correspond to tasks that have not been submitted
+yet).
+
+\internal
+TODO insert illustration f(Ar, Brw, Cr) + ..
+\endinternal
+
+\internal
+DSM
+\endinternal
+
+\subsection StarPU_Data_Management_Library StarPU Data Management Library
+
+Because StarPU schedules tasks at runtime, data transfers have to be
+done automatically and ``just-in-time'' between processing units,
+relieving the application programmer from explicit data transfers.
+Moreover, to avoid unnecessary transfers, StarPU keeps data
+where it was last needed, even if it was modified there, and it
+allows multiple copies of the same data to reside at the same time on
+several processing units as long as it is not modified.
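+
+Data is made known to this library by registering it, which yields a
+data handle; for instance, a sketch for a vector of <c>NX</c> floats
+(<c>NX</c> being assumed to be defined by the application):
+
+\code{.c}
+float vector[NX];
+starpu_data_handle_t vector_handle;
+
+/* Node 0 is the main memory, where the vector currently resides. */
+starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector,
+                            NX, sizeof(vector[0]));
+
+/* ... submit tasks working on vector_handle ... */
+
+/* Bring the data back to main memory and stop tracking it. */
+starpu_data_unregister(vector_handle);
+\endcode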
+
+\section application_taskification Application taskification
+
+TODO
+
+\internal
+TODO: section describing what taskifying an application means: before
+porting to StarPU, turn the program into:
+"pure" functions, which only access data from their passed parameters
+a main function which just calls these pure functions
+
+and then it's trivial to use StarPU or any other kind of task-based library:
+simply replace calling the function with submitting a task.
+\endinternal
+
+\section glossary Glossary
+
+A \b codelet records pointers to various implementations of the same
+theoretical function.
+
+A <b>memory node</b> can be either the main RAM or GPU-embedded memory.
+
+A \b bus is a link between memory nodes.
+
+A <b>data handle</b> keeps track of replicates of the same data (\b registered by the
+application) over various memory nodes. The data management library manages
+keeping them coherent.
+
+The \b home memory node of a data handle is the memory node from which the data
+was registered (usually the main memory node).
+
+A \b task represents a scheduled execution of a codelet on some data handles.
+
+A \b tag is a rendezvous point. Tasks typically have their own tag, and can
+depend on other tags. The value is chosen by the application.
+
+A \b worker executes tasks. There is typically one per CPU computation core and
+one per accelerator (for which a whole CPU core is dedicated).
+
+A \b driver drives a given kind of workers. There are currently CPU, CUDA,
+and OpenCL drivers. They usually start several workers to actually drive
+them.
+
+A <b>performance model</b> is a (dynamic or static) model of the performance of a
+given codelet. Codelets can have an execution time performance model as well as
+a power consumption performance model.
+
+A data \b interface describes the layout of the data: for a vector, a pointer
+to the start, the number of elements, and the size of each element; for a matrix,
+a pointer to the start, the number of elements per row, the offset between rows,
+and the size of each element; etc. To access their data, codelet functions are
+given interfaces for the local memory node replicates of the data handles of the
+scheduled task.
+
+\b Partitioning data means dividing the data of a given data handle (called
+\b father) into a series of \b children data handles which designate various
+portions of the former.
+
+A \b filter is the function which computes children data handles from a father
+data handle, and thus describes how the partitioning should be done (horizontal,
+vertical, etc.).
+
+\b Acquiring a data handle can be done from the main application, to safely
+access the data of a data handle from its home node, without having to
+unregister it.
+
+
+\section research_papers Research Papers
+
+Research papers about StarPU can be found at
+http://runtime.bordeaux.inria.fr/Publis/Keyword/STARPU.html.
+
+A good overview is available in the research report at
+http://hal.archives-ouvertes.fr/inria-00467677.
+
+\section Further_Reading Further Reading
+
+The documentation chapters include
+
+<ul>
+<li> \ref buildingAndInstalling
+<li> \ref basicExamples
+<li> \ref advancedExamples
+<li> \ref optimizePerformance
+<li> \ref performanceFeedback
+<li> \ref tipsTricks
+<li> \ref mpiSupport
+<li> \ref fftSupport
+<li> \ref cExtensions
+<li> \ref soclOpenclExtensions
+<li> \ref schedulingContexts
+<li> \ref schedulingContextHypervisor
+</ul>
+
+Make sure to have had a look at those too!
+
+*/

+ 377 - 0
doc/doxygen/chapters/mpi_support.doxy

@@ -0,0 +1,377 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page mpiSupport StarPU MPI Support
+
+The integration of MPI transfers within task parallelism is done in a
+very natural way by the means of asynchronous interactions between the
+application and StarPU.  This is implemented in a separate libstarpumpi library
+which basically provides "StarPU" equivalents of <c>MPI_*</c> functions, where
+<c>void *</c> buffers are replaced with <c>starpu_data_handle_t</c>s, and all
+GPU-RAM-NIC transfers are handled efficiently by StarPU-MPI.  The user has to
+use the usual <c>mpirun</c> command of the MPI implementation to start StarPU on
+the different MPI nodes.
+
+An MPI Insert Task function provides an even more seamless transition to a
+distributed application, by automatically issuing all required data transfers
+according to the task graph and an application-provided distribution.
+
+\section Simple_Example Simple Example
+
+The flags required to compile or link against the MPI layer are
+accessible with the following commands:
+
+\verbatim
+$ pkg-config --cflags starpumpi-1.0  # options for the compiler
+$ pkg-config --libs starpumpi-1.0    # options for the linker
+\endverbatim
+
+You also need to pass the <c>--static</c> option if the application is to
+be linked statically.
+
+\code{.c}
+void increment_token(void)
+{
+    struct starpu_task *task = starpu_task_create();
+
+    task->cl = &increment_cl;
+    task->handles[0] = token_handle;
+
+    starpu_task_submit(task);
+}
+
+int main(int argc, char **argv)
+{
+    int rank, size;
+
+    starpu_init(NULL);
+    starpu_mpi_initialize_extended(&rank, &size);
+
+    starpu_vector_data_register(&token_handle, 0, (uintptr_t)&token, 1, sizeof(unsigned));
+
+    unsigned nloops = NITER;
+    unsigned loop;
+
+    unsigned last_loop = nloops - 1;
+    unsigned last_rank = size - 1;
+
+    for (loop = 0; loop < nloops; loop++) {
+        int tag = loop*size + rank;
+
+        if (loop == 0 && rank == 0)
+        {
+            token = 0;
+            fprintf(stdout, "Start with token value %d\n", token);
+        }
+        else
+        {
+            starpu_mpi_irecv_detached(token_handle, (rank+size-1)%size, tag,
+                    MPI_COMM_WORLD, NULL, NULL);
+        }
+
+        increment_token();
+
+        if (loop == last_loop && rank == last_rank)
+        {
+            starpu_data_acquire(token_handle, STARPU_R);
+            fprintf(stdout, "Finished: token value %d\n", token);
+            starpu_data_release(token_handle);
+        }
+        else
+        {
+            starpu_mpi_isend_detached(token_handle, (rank+1)%size, tag+1,
+                    MPI_COMM_WORLD, NULL, NULL);
+        }
+    }
+
+    starpu_task_wait_for_all();
+
+    starpu_mpi_shutdown();
+    starpu_shutdown();
+
+    if (rank == last_rank)
+    {
+        fprintf(stderr, "[%d] token = %d == %d * %d ?\n", rank, token, nloops, size);
+        STARPU_ASSERT(token == nloops*size);
+    }
+\endcode
+
+\section Point_to_point_communication Point to point communication
+
+The standard point to point communications of MPI have been
+implemented. The semantics are similar to those of MPI, but adapted to
+the DSM provided by StarPU. An MPI request will only be submitted when
+the data is available in the main memory of the node submitting the
+request.
+
+There are two types of asynchronous communications: the classic
+asynchronous communications and the detached communications. The
+classic asynchronous communications (starpu_mpi_isend() and
+starpu_mpi_irecv()) need to be followed by a call to
+starpu_mpi_wait() or to starpu_mpi_test() to wait for or to
+test the completion of the communication. Waiting for or testing the
+completion of detached communications is not possible: this is done
+internally by StarPU-MPI, and the resources are automatically released
+on completion. This mechanism is similar to the pthread
+detach state attribute, which determines whether a thread will be
+created in a joinable or a detached state.
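+
+As a sketch of the classic flavour (<c>data_handle</c>, <c>dest_rank</c>
+and <c>tag</c> being hypothetical application variables):
+
+\code{.c}
+starpu_mpi_req request;
+
+/* Asynchronous send: returns as soon as the request is submitted. */
+starpu_mpi_isend(data_handle, &request, dest_rank, tag, MPI_COMM_WORLD);
+
+/* ... other work ... */
+
+/* Explicitly wait for the completion of the communication. */
+starpu_mpi_wait(&request, MPI_STATUS_IGNORE);
+\endcode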
+
+For any communication, calling the function results in the
+creation of a StarPU-MPI request; the function
+starpu_data_acquire_cb() is then called to asynchronously request
+StarPU to fetch the data into main memory. When the data is available in
+main memory, a StarPU-MPI function is called to put the new request in
+the list of ready requests if it is a send request, or in a
+hashmap if it is a receive request.
+
+Internally, all MPI communications submitted by StarPU use a unique
+tag, which has a default value and can be accessed with the functions
+starpu_mpi_get_communication_tag() and
+starpu_mpi_set_communication_tag().
+
+The matching of tags with corresponding requests is done within StarPU-MPI.
+To handle this, any communication is a double communication based on an
+envelope + data system: for every piece of data to be sent, an
+envelope which describes the data (particularly its tag) is sent first,
+so the receiver can get the matching pending receive request
+from the hashmap, and submit it to receive the data correctly.
+
+To this aim, the StarPU-MPI progression thread has a permanently-submitted
+request to receive incoming envelopes from all sources.
+
+The StarPU-MPI progression thread regularly polls this list of ready
+requests. For each new ready request, the appropriate function is
+called to post the corresponding MPI call. For example, calling
+starpu_mpi_isend() will result in posting <c>MPI_Isend</c>. If
+the request is marked as detached, the request will be put in the list
+of detached requests.
+
+The StarPU-MPI progression thread also polls the list of detached
+requests. For each detached request, it regularly tests the completion
+of the MPI request by calling <c>MPI_Test</c>. On completion, the data
+handle is released, and if a callback was defined, it is called.
+
+Finally, the StarPU-MPI progression thread checks whether an envelope has
+arrived. If so, it checks whether the corresponding receive has already
+been submitted by the application. If it has, the request is submitted
+just as those on the list of ready requests are.
+If it has not, a temporary handle is allocated to store the data that
+will arrive just after, so that when the corresponding receive request
+is eventually submitted by the application, the data is copied from this
+temporary handle instead of submitting a new StarPU-MPI request.
+
+\ref Communication gives the list of all the point to point
+communications defined in StarPU-MPI.
+
+\section Exchanging_User_Defined_Data_Interface Exchanging User Defined Data Interface
+
+New data interfaces defined as explained in
+\ref Defining_a_New_Data_Interface can also be used within StarPU-MPI
+and exchanged between nodes. Two functions need to be defined through
+the type <c>struct starpu_data_interface_ops</c>. The pack function
+takes a handle and returns a contiguous memory buffer, along with its
+size, into which the data to be conveyed to another node should be
+copied. The reverse operation is implemented in the unpack function,
+which takes a contiguous memory buffer and recreates the data handle.
+
+\code{.c}
+static int complex_pack_data(starpu_data_handle_t handle, unsigned node, void **ptr, ssize_t *count)
+{
+  STARPU_ASSERT(starpu_data_test_if_allocated_on_node(handle, node));
+
+  struct starpu_complex_interface *complex_interface =
+    (struct starpu_complex_interface *) starpu_data_get_interface_on_node(handle, node);
+
+  *count = complex_get_size(handle);
+  *ptr = malloc(*count);
+  memcpy(*ptr, complex_interface->real, complex_interface->nx*sizeof(double));
+  memcpy((char *)*ptr + complex_interface->nx*sizeof(double), complex_interface->imaginary,
+         complex_interface->nx*sizeof(double));
+
+  return 0;
+}
+
+static int complex_unpack_data(starpu_data_handle_t handle, unsigned node, void *ptr, size_t count)
+{
+  STARPU_ASSERT(starpu_data_test_if_allocated_on_node(handle, node));
+
+  struct starpu_complex_interface *complex_interface =
+    (struct starpu_complex_interface *) starpu_data_get_interface_on_node(handle, node);
+
+  memcpy(complex_interface->real, ptr, complex_interface->nx*sizeof(double));
+  memcpy(complex_interface->imaginary, (char *)ptr + complex_interface->nx*sizeof(double),
+         complex_interface->nx*sizeof(double));
+
+  return 0;
+}
+
+static struct starpu_data_interface_ops interface_complex_ops =
+{
+  ...
+  .pack_data = complex_pack_data,
+  .unpack_data = complex_unpack_data
+};
+\endcode
+
+\section MPI_Insert_Task_Utility MPI Insert Task Utility
+
+To save the programmer from having to make all communications explicit, StarPU
+provides an "MPI Insert Task Utility". The principle is that the application
+decides a distribution of the data over the MPI nodes by allocating it and
+notifying StarPU of that decision, i.e. it tells StarPU which MPI node "owns"
+which data. It also decides, for each handle, an MPI tag which will be used to
+exchange the content of the handle. All MPI nodes then process the whole task
+graph, and StarPU automatically determines which node actually executes which
+task, and triggers the required MPI transfers.
+
+The list of functions is described in \ref MPI_Insert_Task.
+
+Here is a stencil example showing how to use starpu_mpi_insert_task(). One
+first needs to define a distribution function which specifies the
+locality of the data. Note that the distribution information needs to
+be given to StarPU by calling starpu_data_set_rank(). An MPI tag
+should also be defined for each data handle by calling
+starpu_data_set_tag().
+
+\code{.c}
+/* Returns the MPI node number where data is */
+int my_distrib(int x, int y, int nb_nodes) {
+  /* Block distrib */
+  return ((int)(x / sqrt(nb_nodes) + (y / sqrt(nb_nodes)) * sqrt(nb_nodes))) % nb_nodes;
+
+  // /* Other examples useful for other kinds of computations */
+  // /* / distrib */
+  // return (x+y) % nb_nodes;
+
+  // /* Block cyclic distrib */
+  // unsigned side = sqrt(nb_nodes);
+  // return x % side + (y % side) * side;
+}
+\endcode
+
+Now the data can be registered within StarPU. Data which are not
+owned but will be needed for computations can be registered through
+the lazy allocation mechanism, i.e. with a <c>home_node</c> set to -1.
+StarPU will automatically allocate the memory when it is used for the
+first time.
+
+One can note an optimization here (the <c>else if</c> test): we only register
+data which will be needed by the tasks that we will execute.
+
+\code{.c}
+    unsigned matrix[X][Y];
+    starpu_data_handle_t data_handles[X][Y];
+
+    for(x = 0; x < X; x++) {
+        for (y = 0; y < Y; y++) {
+            int mpi_rank = my_distrib(x, y, size);
+            if (mpi_rank == my_rank)
+                /* Owning data */
+                starpu_variable_data_register(&data_handles[x][y], 0,
+                                              (uintptr_t)&(matrix[x][y]), sizeof(unsigned));
+            else if (my_rank == my_distrib(x+1, y, size) || my_rank == my_distrib(x-1, y, size)
+                  || my_rank == my_distrib(x, y+1, size) || my_rank == my_distrib(x, y-1, size))
+                /* I don't own that index, but will need it for my computations */
+                starpu_variable_data_register(&data_handles[x][y], -1,
+                                              (uintptr_t)NULL, sizeof(unsigned));
+            else
+                /* I know it's useless to allocate anything for this */
+                data_handles[x][y] = NULL;
+            if (data_handles[x][y]) {
+                starpu_data_set_rank(data_handles[x][y], mpi_rank);
+                starpu_data_set_tag(data_handles[x][y], x*X+y);
+            }
+        }
+    }
+\endcode
+
+Now starpu_mpi_insert_task() can be called for the different
+steps of the application.
+
+\code{.c}
+    for(loop=0 ; loop<niter; loop++)
+        for (x = 1; x < X-1; x++)
+            for (y = 1; y < Y-1; y++)
+                starpu_mpi_insert_task(MPI_COMM_WORLD, &stencil5_cl,
+                                       STARPU_RW, data_handles[x][y],
+                                       STARPU_R, data_handles[x-1][y],
+                                       STARPU_R, data_handles[x+1][y],
+                                       STARPU_R, data_handles[x][y-1],
+                                       STARPU_R, data_handles[x][y+1],
+                                       0);
+    starpu_task_wait_for_all();
+\endcode
+
+I.e. all MPI nodes process the whole task graph, but as mentioned above, for
+each task, only the MPI node which owns the data being written to (here,
+<c>data_handles[x][y]</c>) will actually run the task. The other MPI nodes will
+automatically send the required data.
+
+Having all MPI nodes process the whole task graph can become a concern
+with a growing number of nodes. To avoid this, the application can prune
+the task submission loops according to the data distribution, so as to
+only submit tasks on nodes which have to care about them (either to
+execute them, or to send the required data).
+
+\section MPI_Collective_Operations MPI Collective Operations
+
+The functions are described in \ref Collective_Operations.
+
+\code{.c}
+if (rank == root)
+{
+    /* Allocate the vector */
+    vector = malloc(nblocks * sizeof(float *));
+    for(x=0 ; x<nblocks ; x++)
+    {
+        starpu_malloc((void **)&vector[x], block_size*sizeof(float));
+    }
+}
+
+/* Allocate data handles and register data to StarPU */
+data_handles = malloc(nblocks*sizeof(starpu_data_handle_t *));
+for(x = 0; x < nblocks ;  x++)
+{
+    int mpi_rank = my_distrib(x, nodes);
+    if (rank == root) {
+        starpu_vector_data_register(&data_handles[x], 0, (uintptr_t)vector[x],
+                                    block_size, sizeof(float));
+    }
+    else if ((mpi_rank == rank) || ((rank == mpi_rank+1 || rank == mpi_rank-1))) {
+        /* I own that index, or I will need it for my computations */
+        starpu_vector_data_register(&data_handles[x], -1, (uintptr_t)NULL,
+                                   block_size, sizeof(float));
+    }
+    else {
+        /* I know it's useless to allocate anything for this */
+        data_handles[x] = NULL;
+    }
+    if (data_handles[x]) {
+        starpu_data_set_rank(data_handles[x], mpi_rank);
+        starpu_data_set_tag(data_handles[x], x);
+    }
+}
+
+/* Scatter the vector among the nodes */
+starpu_mpi_scatter_detached(data_handles, nblocks, root, MPI_COMM_WORLD);
+
+/* Calculation */
+for(x = 0; x < nblocks ;  x++) {
+    if (data_handles[x]) {
+        int owner = starpu_data_get_rank(data_handles[x]);
+        if (owner == rank) {
+            starpu_insert_task(&cl, STARPU_RW, data_handles[x], 0);
+        }
+    }
+}
+
+/* Gather the vector on the main node */
+starpu_mpi_gather_detached(data_handles, nblocks, 0, MPI_COMM_WORLD);
+\endcode
+
+*/

+ 529 - 0
doc/doxygen/chapters/optimize_performance.doxy

@@ -0,0 +1,529 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page optimizePerformance How to optimize performance with StarPU
+
+TODO: improve!
+
+Simply encapsulating application kernels into tasks already permits to
+seamlessly support CPUs and GPUs at the same time. To achieve good performance, a
+few additional changes are needed.
+
+\section Data_management Data management
+
+When the application allocates data, whenever possible it should use the
+starpu_malloc() function, which will ask CUDA or
+OpenCL to make the allocation itself and pin the corresponding allocated
+memory. This is needed to permit asynchronous data transfer, i.e. permit data
+transfer to overlap with computations. Otherwise, the trace will show that the
+<c>DriverCopyAsync</c> state takes a lot of time: this is because CUDA or OpenCL
+then reverts to synchronous transfers.
+
+By default, StarPU leaves replicates of data wherever they were used, in case they
+will be re-used by other tasks, thus saving the data transfer time. When some
+task modifies some data, all the other replicates are invalidated, and only the
+processing unit which ran that task will have a valid replicate of the data. If the application knows
+that this data will not be re-used by further tasks, it should advise StarPU to
+immediately replicate it to a desired list of memory nodes (given through a
+bitmask). This can be understood like the write-through mode of CPU caches.
+
+\code{.c}
+starpu_data_set_wt_mask(img_handle, 1<<0);
+\endcode
+
+will for instance request to always automatically transfer a replicate into the
+main memory (node 0), as bit 0 of the write-through bitmask is being set.
+
+\code{.c}
+starpu_data_set_wt_mask(img_handle, ~0U);
+\endcode
+
+will request to always automatically broadcast the updated data to all memory
+nodes.
+
+Setting the write-through mask to <c>~0U</c> can also be useful to make sure all
+memory nodes always have a copy of the data, so that it is never evicted when
+memory gets scarce.
+
+Implicit data dependency computation can become expensive if a lot
+of tasks access the same piece of data. If no dependency is required
+on some piece of data (e.g. because it is only accessed in read-only
+mode, or because write accesses are actually commutative), use the
+starpu_data_set_sequential_consistency_flag() function to disable implicit
+dependencies on that data.
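+
+For instance (a minimal sketch, <c>handle</c> being an
+already-registered data handle):
+
+\code{.c}
+/* Disable implicit dependencies; they must now be expressed by hand. */
+starpu_data_set_sequential_consistency_flag(handle, 0);
+\endcode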
+
+In the same vein, accumulation of results in the same data can become a
+bottleneck. The use of the <c>STARPU_REDUX</c> mode permits to optimize such
+accumulation (see \ref Data_reduction).
+
+Applications often need a piece of data just for temporary results. In such a case,
+registration can be made without an initial value; for instance this produces a vector data:
+
+\code{.c}
+starpu_vector_data_register(&handle, -1, 0, n, sizeof(float));
+\endcode
+
+StarPU will then allocate the actual buffer only when it is actually needed,
+e.g. directly on the GPU without allocating in main memory.
+
+In the same vein, once the temporary results are not useful any more, the
+data should be thrown away. If the handle is not to be reused, it can be
+unregistered:
+
+\code{.c}
+starpu_unregister_submit(handle);
+\endcode
+
+actual unregistration will be done after all tasks working on the handle
+terminate.
+
+If the handle is to be reused, instead of unregistering it, it can simply be invalidated:
+
+\code{.c}
+starpu_invalidate_submit(handle);
+\endcode
+
+the buffers containing the current value will then be freed, and reallocated
+only when another task writes some value to the handle.
+
+\section Task_granularity Task granularity
+
+Like any other runtime, StarPU has some overhead to manage tasks. Since
+it does smart scheduling and data management, that overhead is not always
+negligible. The order of magnitude of the overhead is typically a couple of
+microseconds, which is actually quite smaller than the CUDA overhead itself. The
+amount of work that a task should do should thus be somewhat
+bigger, to make sure that the overhead becomes negligible. The offline
+performance feedback can provide a measure of task length, which should thus be
+checked if bad performance is observed. To get a grasp at the scalability
+possibility according to task size, one can run
+<c>tests/microbenchs/tasks_size_overhead.sh</c> which draws curves of the
+speedup of independent tasks of very small sizes.
+
+The choice of scheduler also has an impact on the overhead: for instance, the
+<c>dmda</c> scheduler takes time to make a decision, while <c>eager</c> does
+not. <c>tasks_size_overhead.sh</c> can again be used to get a grasp at how much
+impact that has on the target machine.
+
+\section Task_submission Task submission
+
+To let StarPU make online optimizations, tasks should be submitted
+asynchronously as much as possible. Ideally, all the tasks should be
+submitted, and mere calls to starpu_task_wait_for_all() or
+starpu_data_unregister() should then be done to wait for
+termination. StarPU will then be able to rework the whole schedule, overlap
+computation with communication, manage accelerator local memory usage, etc.
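+
+A sketch of this pattern (<c>cl</c>, <c>handles</c> and <c>ntasks</c>
+being hypothetical application names):
+
+\code{.c}
+unsigned i;
+for (i = 0; i < ntasks; i++)
+{
+    struct starpu_task *task = starpu_task_create();
+    task->cl = &cl;
+    task->handles[0] = handles[i];
+    starpu_task_submit(task);  /* returns immediately */
+}
+starpu_task_wait_for_all();    /* single synchronization point */
+\endcode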
+
+\section Task_priorities Task priorities
+
+By default, StarPU will consider the tasks in the order they are submitted by
+the application. If the application programmer knows that some tasks should
+be performed in priority (for instance because their output is needed by many
+other tasks and may thus be a bottleneck if not executed early enough), the
+<c>priority</c> field of the task structure should be set to transmit the
+priority information to StarPU.
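+
+For instance (a minimal sketch):
+
+\code{.c}
+/* Hint the scheduler that this task is on the critical path. */
+task->priority = STARPU_MAX_PRIO;
+\endcode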
+
+\section Task_scheduling_policy Task scheduling policy
+
+By default, StarPU uses the <c>eager</c> simple greedy scheduler. This is
+because it provides correct load balance even if the application codelets do not
+have performance models. If your application codelets have performance models
+(\ref Performance_model_example for example showing how to do it),
+you should change the scheduler thanks to the <c>STARPU_SCHED</c> environment
+variable. For instance <c>export STARPU_SCHED=dmda</c>. Use <c>export
+STARPU_SCHED=help</c> to get the list of available schedulers.
+
+The <b>eager</b> scheduler uses a central task queue, from which workers draw tasks
+to work on. This however does not permit to prefetch data since the scheduling
+decision is taken late. If a task has a non-zero priority, it is put at the front of the queue.
+
+The <b>prio</b> scheduler also uses a central task queue, but sorts tasks by
+priority (between -5 and 5).
+
+The <b>random</b> scheduler distributes tasks randomly according to assumed worker
+overall performance.
+
+The <b>ws</b> (work stealing) scheduler schedules tasks on the local worker by
+default. When a worker becomes idle, it steals a task from the most loaded
+worker.
+
+The <b>dm</b> (deque model) scheduler takes task execution performance models into account to
+perform a HEFT-like scheduling strategy: it schedules tasks where their
+termination time will be minimal.
+
+The <b>dmda</b> (deque model data aware) scheduler is similar to dm, but it also
+takes data transfer time into account.
+
+The <b>dmdar</b> (deque model data aware ready) scheduler is similar to dmda,
+but it also sorts tasks on per-worker queues by number of already-available data
+buffers.
+
+The <b>dmdas</b> (deque model data aware sorted) scheduler is similar to dmda,
+but it also supports arbitrary priority values.
+
+The <b>heft</b> (heterogeneous earliest finish time) scheduler is deprecated. It
+is now just an alias for <b>dmda</b>.
+
+The <b>pheft</b> (parallel HEFT) scheduler is similar to heft, but it also supports
+parallel tasks (still experimental).
+
+The <b>peager</b> (parallel eager) scheduler is similar to eager, but it also
+supports parallel tasks (still experimental).
+
+\section Performance_model_calibration Performance model calibration
+
+Most schedulers are based on an estimation of codelet duration on each kind
+of processing unit. For this to be possible, the application programmer needs
+to configure a performance model for the codelets of the application (see
+\ref Performance_model_example for instance). History-based performance models
+use on-line calibration.  StarPU will automatically calibrate codelets
+which have never been calibrated yet, and save the result in
+<c>$STARPU_HOME/.starpu/sampling/codelets</c>.
+The models are indexed by machine name. To share the models between machines (e.g. for a homogeneous cluster), use <c>export STARPU_HOSTNAME=some_global_name</c>. To force continuing calibration, use
+<c>export STARPU_CALIBRATE=1</c> . This may be necessary if your application
+has not-so-stable performance. StarPU will force calibration (and thus ignore
+the current result) until 10 (_STARPU_CALIBRATION_MINIMUM) measurements have been
+made on each architecture, to avoid badly scheduling tasks just because the
+first measurements were not so good. Details on the current performance model status
+can be obtained from the <c>starpu_perfmodel_display</c> command: the <c>-l</c>
+option lists the available performance models, and the <c>-s</c> option permits
+to choose the performance model to be displayed. The result looks like:
+
+\verbatim
+$ starpu_perfmodel_display -s starpu_dlu_lu_model_22
+performance model for cpu
+# hash    size     mean          dev           n
+880805ba  98304    2.731309e+02  6.010210e+01  1240
+b50b6605  393216   1.469926e+03  1.088828e+02  1240
+5c6c3401  1572864  1.125983e+04  3.265296e+03  1240
+\endverbatim
+
+This shows that for the LU 22 kernel with a 1.5MiB matrix, the average
+execution time on CPUs was about 11ms, with a 3ms standard deviation, over
+1240 samples. It is a good idea to check this before doing actual performance
+measurements.
+
+A graph can be drawn by using the <c>starpu_perfmodel_plot</c> tool:
+
+\verbatim
+$ starpu_perfmodel_plot -s starpu_dlu_lu_model_22
+98304 393216 1572864
+$ gnuplot starpu_starpu_dlu_lu_model_22.gp
+$ gv starpu_starpu_dlu_lu_model_22.eps
+\endverbatim
+
+If a kernel's source code was modified (e.g. for a performance improvement), the
+calibration information is stale and should be dropped, so as to re-calibrate
+from scratch. This can be done by using <c>export STARPU_CALIBRATE=2</c>.
+
+Note: due to CUDA limitations, to be able to measure kernel duration,
+calibration mode needs to disable asynchronous data transfers. Calibration thus
+disables data transfer / computation overlapping, and should thus not be used
+for benchmarks. Note 2: history-based performance models get calibrated
+only if a performance-model-based scheduler is chosen.
+
+The history-based performance models can also be explicitly filled by the
+application without execution, if e.g. the application already has a series of
+measurements. This can be done by using starpu_perfmodel_update_history(),
+for instance:
+
+\code{.c}
+static struct starpu_perfmodel perf_model = {
+    .type = STARPU_HISTORY_BASED,
+    .symbol = "my_perfmodel",
+};
+
+struct starpu_codelet cl = {
+    .where = STARPU_CUDA,
+    .cuda_funcs = { cuda_func1, cuda_func2, NULL },
+    .nbuffers = 1,
+    .modes = {STARPU_W},
+    .model = &perf_model
+};
+
+void feed(void) {
+    struct my_measure *measure;
+    struct starpu_task task;
+    starpu_task_init(&task);
+
+    task.cl = &cl;
+
+    for (measure = &measures[0]; measure < &measures[last]; measure++) {
+        starpu_data_handle_t handle;
+        starpu_vector_data_register(&handle, -1, 0, measure->size, sizeof(float));
+        task.handles[0] = handle;
+        starpu_perfmodel_update_history(&perf_model, &task,
+                                        STARPU_CUDA_DEFAULT + measure->cudadev, 0,
+                                        measure->implementation, measure->time);
+        starpu_task_clean(&task);
+        starpu_data_unregister(handle);
+    }
+}
+\endcode
+
+Measurements have to be provided in milliseconds for the completion time models,
+and in Joules for the energy consumption models.
+
+\section Task_distribution_vs_Data_transfer Task distribution vs Data transfer
+
+Distributing tasks to balance the load induces data transfer penalty. StarPU
+thus needs to find a balance between both. The target function that the
+<c>dmda</c> scheduler of StarPU
+tries to minimize is <c>alpha * T_execution + beta * T_data_transfer</c>, where
+<c>T_execution</c> is the estimated execution time of the codelet (usually
+accurate), and <c>T_data_transfer</c> is the estimated data transfer time. The
+latter is estimated based on bus calibration before execution start,
+i.e. with an idle machine, thus without contention. You can force bus re-calibration by running
+<c>starpu_calibrate_bus</c>. The beta parameter defaults to 1, but it can be
+worth trying to tweak it by using <c>export STARPU_SCHED_BETA=2</c> for instance,
+since during real application execution, contention makes transfer times bigger.
+This is of course imprecise, but in practice, a rough estimation already gives
+results as good as those a precise estimation would give.
+
+\section Data_prefetch Data prefetch
+
+The <c>heft</c>, <c>dmda</c> and <c>pheft</c> scheduling policies perform data prefetch (see \ref STARPU_PREFETCH):
+as soon as a scheduling decision is taken for a task, requests are issued to
+transfer its required data to the target processing unit, if needed, so that
+when the processing unit actually starts the task, its data will hopefully be
+already available and it will not have to wait for the transfer to finish.
+
+The application may want to perform some manual prefetching, for several reasons
+such as excluding initial data transfers from performance measurements, or
+setting up an initial statically-computed data distribution on the machine
+before submitting tasks, which will thus guide StarPU toward an initial task
+distribution (since StarPU will try to avoid further transfers).
+
+This can be achieved by giving the starpu_data_prefetch_on_node() function
+the handle and the desired target memory node.
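+
+For instance (a sketch, assuming memory node 1 is the target device;
+the last argument requests an asynchronous prefetch):
+
+\code{.c}
+/* Asynchronously transfer the handle to memory node 1. */
+starpu_data_prefetch_on_node(handle, 1, 1);
+\endcode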
+
+\section Power-based_scheduling Power-based scheduling
+
+If the application can provide some power performance model (through
+the <c>power_model</c> field of the codelet structure), StarPU will
+take it into account when distributing tasks. The target function that
+the <c>dmda</c> scheduler minimizes becomes <c>alpha * T_execution +
+beta * T_data_transfer + gamma * Consumption</c> , where <c>Consumption</c>
+is the estimated task consumption in Joules. To tune this parameter, use
+<c>export STARPU_SCHED_GAMMA=3000</c> for instance, to express that each Joule
+(i.e. 1 kW during 1 ms) is worth 3000us execution time penalty. Setting
+<c>alpha</c> and <c>beta</c> to zero permits to only take into account power consumption.
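+
+A sketch of a codelet providing both an execution time model and a power
+model (the model symbols and the <c>cpu_func</c> kernel name are
+hypothetical):
+
+\code{.c}
+static struct starpu_perfmodel time_model =
+{
+    .type = STARPU_HISTORY_BASED,
+    .symbol = "my_codelet_time"
+};
+
+static struct starpu_perfmodel power_model =
+{
+    .type = STARPU_HISTORY_BASED,
+    .symbol = "my_codelet_power"
+};
+
+static struct starpu_codelet cl =
+{
+    .cpu_funcs = { cpu_func, NULL },
+    .nbuffers = 1,
+    .modes = { STARPU_RW },
+    .model = &time_model,        /* execution time model */
+    .power_model = &power_model  /* power consumption model */
+};
+\endcode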
+
+This is however not sufficient to correctly optimize power: the scheduler would
+simply tend to run all computations on the most energy-conservative processing
+unit. To account for the consumption of the whole machine (including idle
+processing units), the idle power of the machine should be given by setting
+<c>export STARPU_IDLE_POWER=200</c> for 200W, for instance. This value can often
+be obtained from the machine power supplier.
+
+The power actually consumed by the total execution can be displayed by setting
+<c>export STARPU_PROFILING=1 STARPU_WORKER_STATS=1</c> .
+
+On-line task consumption measurement is currently only supported through the
+<c>CL_PROFILING_POWER_CONSUMED</c> OpenCL extension, implemented in the MoviSim
+simulator. Applications can however provide explicit measurements by using the
+starpu_perfmodel_update_history() function (exemplified in \ref Performance_model_example
+with the <c>power_model</c> performance model). Fine-grain
+measurement is often not feasible with the feedback provided by the hardware, so
+the user can for instance run a given task a thousand times, measure the global
+consumption for that series of tasks, divide it by a thousand, repeat for
+varying kinds of tasks and task sizes, and eventually feed StarPU
+with these manual measurements through starpu_perfmodel_update_history().
+
+\section Static_scheduling Static scheduling
+
+In some cases, one may want to force some scheduling, for instance force a given
+set of tasks to GPU0, another set to GPU1, etc. while letting some other tasks
+be scheduled on any other device. This can indeed be useful to guide StarPU into
+some work distribution, while still letting some degree of dynamism. For
+instance, to force execution of a task on CUDA0:
+
+\code{.c}
+task->execute_on_a_specific_worker = 1;
+task->worker = starpu_worker_get_by_type(STARPU_CUDA_WORKER, 0);
+\endcode
+
+\section Profiling Profiling
+
+A quick view of how many tasks each worker has executed can be obtained by setting
+<c>export STARPU_WORKER_STATS=1</c>. This is a convenient way to check that
+execution did happen on accelerators without penalizing performance with
+the profiling overhead.
+
+A quick view of how many data transfers have been issued can be obtained by setting
+<c>export STARPU_BUS_STATS=1</c> .
+
+More detailed profiling information can be enabled by using <c>export STARPU_PROFILING=1</c> or by
+calling starpu_profiling_status_set() from the source code.
+Statistics on the execution can then be obtained by using <c>export
+STARPU_BUS_STATS=1</c> and <c>export STARPU_WORKER_STATS=1</c> .
+More details on performance feedback are provided in the next chapter.
+
+\section CUDA-specific_optimizations CUDA-specific optimizations
+
+Due to CUDA limitations, StarPU will have a hard time overlapping its own
+communications and the codelet computations if the application does not use a
+dedicated CUDA stream for its computations instead of the default stream,
+which synchronizes all operations of the GPU. StarPU provides one by the use
+of starpu_cuda_get_local_stream() which can be used by all CUDA codelet
+operations to avoid this issue. For instance:
+
+\code{.c}
+func <<<grid,block,0,starpu_cuda_get_local_stream()>>> (foo, bar);
+cudaStreamSynchronize(starpu_cuda_get_local_stream());
+\endcode
+
+StarPU already does appropriate calls for the CUBLAS library.
+
+Unfortunately, some CUDA libraries do not have stream variants of
+kernels. That will lower the potential for overlapping.
+
+\section Performance_debugging Performance debugging
+
+To get an idea of what is happening, a lot of performance feedback is available,
+as detailed in the next chapter. The various pieces of information should be checked:
+
+<ul>
+<li>
+What does the Gantt diagram look like? (see \ref Creating_a_Gantt_Diagram)
+<ul>
+  <li> If it is mostly green (tasks running in the initial context), or if a
+  context-specific color prevails, then the machine is properly
+  utilized, and perhaps the codelets are just slow. Check their performance, see
+  \ref Performance_of_codelets.
+  </li>
+  <li> If it's mostly purple (FetchingInput), tasks keep waiting for data
+  transfers, do you perhaps have far more communication than computation? Did
+  you properly use CUDA streams to make sure communication can be
+  overlapped? Did you use data-locality aware schedulers to avoid transfers as
+  much as possible?
+  </li>
+  <li> If it's mostly red (Blocked), tasks keep waiting for dependencies,
+  do you have enough parallelism? It might be a good idea to check what the DAG
+  looks like (see \ref Creating_a_DAG_with_graphviz).
+  </li>
+  <li> If only some workers are completely red (Blocked), for some reason the
+  scheduler didn't assign tasks to them. Perhaps the performance model is bogus,
+  check it (see \ref Performance_of_codelets). Do all your codelets have a
+  performance model?  When some of them do not, the scheduler switches to a
+  greedy algorithm, which thus performs badly.
+  </li>
+</ul>
+</li>
+</ul>
+
+You can also use the Temanejo task debugger (see \ref Using_the_Temanejo_task_debugger) to
+visualize the task graph more easily.
+
+\section Simulated_performance Simulated performance
+
+StarPU can use Simgrid in order to simulate execution on an arbitrary
+platform.
+
+\subsection Calibration Calibration
+
+The idea is to first compile StarPU normally, and run the application,
+so as to automatically benchmark the bus and the codelets.
+
+\verbatim
+$ ./configure && make
+$ STARPU_SCHED=dmda ./examples/matvecmult/matvecmult
+[starpu][_starpu_load_history_based_model] Warning: model matvecmult
+   is not calibrated, forcing calibration for this run. Use the
+   STARPU_CALIBRATE environment variable to control this.
+$ ...
+$ STARPU_SCHED=dmda ./examples/matvecmult/matvecmult
+TEST PASSED
+\endverbatim
+
+Note that we force the use of the dmda scheduler to generate performance
+models for the application. The application may need to be run several
+times before the model is calibrated.
+
+\subsection Simulation Simulation
+
+Then, recompile StarPU, passing <c>--enable-simgrid</c> to <c>./configure</c>, and re-run the
+application:
+
+\verbatim
+$ ./configure --enable-simgrid && make
+$ STARPU_SCHED=dmda ./examples/matvecmult/matvecmult
+TEST FAILED !!!
+\endverbatim
+
+It is normal that the test fails: since the computations are not actually done
+(that is the whole point of simgrid), the result is wrong, of course.
+
+If the performance model is not calibrated enough, the following error
+message will be displayed:
+
+\verbatim
+$ STARPU_SCHED=dmda ./examples/matvecmult/matvecmult
+[starpu][_starpu_load_history_based_model] Warning: model matvecmult
+    is not calibrated, forcing calibration for this run. Use the
+    STARPU_CALIBRATE environment variable to control this.
+[starpu][_starpu_simgrid_execute_job][assert failure] Codelet
+    matvecmult does not have a perfmodel, or is not calibrated enough
+\endverbatim
+
+The number of devices can be chosen as usual with <c>STARPU_NCPU</c>,
+<c>STARPU_NCUDA</c>, and <c>STARPU_NOPENCL</c>.  For now, only the number of
+CPUs can be arbitrarily chosen; the numbers of CUDA and OpenCL devices have to be
+lower than the real numbers on the current machine.
+
+The amount of simulated GPU memory is for now unbounded by default, but
+it can be chosen by hand through the <c>STARPU_LIMIT_CUDA_MEM</c>,
+<c>STARPU_LIMIT_CUDA_devid_MEM</c>, <c>STARPU_LIMIT_OPENCL_MEM</c>, and
+<c>STARPU_LIMIT_OPENCL_devid_MEM</c> environment variables.
+
+The Simgrid default stack size is small; to increase it use the
+parameter <c>--cfg=contexts/stack_size</c>, for example:
+
+\verbatim
+$ ./example --cfg=contexts/stack_size:8192
+TEST FAILED !!!
+\endverbatim
+
+Note: of course, if the application uses <c>gettimeofday</c> to make its
+performance measurements, the real time will be used, which will be bogus. To
+get the simulated time, it has to use starpu_timing_now() which returns the
+virtual timestamp in us.
+
+\subsection Simulation_on_another_machine Simulation on another machine
+
+The simgrid support even permits to perform simulations on another machine,
+typically your desktop. To achieve this, one still needs to perform the Calibration
+step on the actual machine to be simulated, then copy the resulting performance
+models (the <c>$STARPU_HOME/.starpu</c> directory) to your desktop machine. One can then perform the
+Simulation step on the desktop machine, by setting the <c>STARPU_HOSTNAME</c>
+environment variable to the name of the actual machine, to make StarPU use the
+performance models of the simulated machine even on the desktop machine.
+
+If the desktop machine does not have CUDA or OpenCL, StarPU is still able to
+use simgrid to simulate execution with CUDA/OpenCL devices, but the application
+source code will probably disable the CUDA and OpenCL codelets in that
+case. Since during simgrid execution, the functions of the codelet are actually
+not called, one can use dummy functions such as the following to still permit
+CUDA or OpenCL execution:
+
+\code{.c}
+static struct starpu_codelet cl11 =
+{
+	.cpu_funcs = {chol_cpu_codelet_update_u11, NULL},
+#ifdef STARPU_USE_CUDA
+	.cuda_funcs = {chol_cublas_codelet_update_u11, NULL},
+#elif defined(STARPU_SIMGRID)
+	.cuda_funcs = {(void*)1, NULL},
+#endif
+	.nbuffers = 1,
+	.modes = {STARPU_RW},
+	.model = &chol_model_11
+};
+\endcode
+
+*/

+ 573 - 0
doc/doxygen/chapters/performance_feedback.doxy

@@ -0,0 +1,573 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page performanceFeedback Performance Feedback
+
+\section Using_the_Temanejo_task_debugger Using the Temanejo task debugger
+
+StarPU can connect to Temanejo (see
+http://www.hlrs.de/temanejo) to permit
+nice visual task debugging. To do so, build Temanejo's <c>libayudame.so</c>,
+install <c>Ayudame.h</c> to e.g. <c>/usr/local/include</c>, and apply
+<c>tools/patch-ayudame</c> to it to fix the C build. Then re-run <c>./configure</c>,
+make sure that it finds Temanejo, and rebuild StarPU. Finally, run the Temanejo GUI,
+and give it the path to your application, any options you want to pass it, and the
+path to libayudame.so.
+
+Make sure to specify at least the same number of CPUs in the dialog box as your
+machine has, otherwise an error will happen during execution. Future versions
+of Temanejo should be able to tell StarPU the number of CPUs to use.
+
+Tag numbers have to be below <c>4000000000000000000ULL</c> to be usable for
+Temanejo (so as to distinguish them from tasks).
+
+\section On-line_performance_feedback On-line performance feedback
+
+\subsection Enabling_on-line_performance_monitoring Enabling on-line performance monitoring
+
+In order to enable online performance monitoring, the application can call
+<c>starpu_profiling_status_set(STARPU_PROFILING_ENABLE)</c>. It is possible to
+detect whether monitoring is already enabled or not by calling
+starpu_profiling_status_get(). Enabling monitoring also reinitializes all
+previously collected feedback. The <c>STARPU_PROFILING</c> environment variable
+can also be set to 1 to achieve the same effect.
+
+Likewise, performance monitoring is stopped by calling
+<c>starpu_profiling_status_set(STARPU_PROFILING_DISABLE)</c>. Note that this
+does not reset the performance counters so that the application may consult
+them later on.
+
+More details about the performance monitoring API are available in
+\ref Profiling_API.
+
+\subsection Per-Task_feedback Per-task feedback
+
+If profiling is enabled, a pointer to a <c>struct starpu_profiling_task_info</c>
+is put in the <c>.profiling_info</c> field of the <c>starpu_task</c>
+structure when a task terminates.
+This structure is automatically destroyed when the task structure is destroyed,
+either automatically or by calling starpu_task_destroy().
+
+The <c>struct starpu_profiling_task_info</c> indicates the date when the
+task was submitted (<c>submit_time</c>), started (<c>start_time</c>), and
+terminated (<c>end_time</c>), relative to the initialization of
+StarPU with starpu_init(). It also specifies the identifier of the worker
+that has executed the task (<c>workerid</c>).
+These dates are stored as <c>timespec</c> structures which the user may convert
+into micro-seconds using the starpu_timing_timespec_to_us() helper
+function.
+
+It is worth noting that the application may directly access this structure from
+the callback executed at the end of the task. The <c>starpu_task</c> structure
+associated to the callback currently being executed is indeed accessible with
+the starpu_task_get_current() function.
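+
+A sketch of such a callback:
+
+\code{.c}
+void my_callback(void *arg)
+{
+    struct starpu_task *task = starpu_task_get_current();
+    struct starpu_profiling_task_info *info = task->profiling_info;
+
+    /* Convert the timespec dates into micro-seconds. */
+    double start = starpu_timing_timespec_to_us(&info->start_time);
+    double end = starpu_timing_timespec_to_us(&info->end_time);
+
+    fprintf(stderr, "task lasted %f us on worker %d\n",
+            end - start, info->workerid);
+}
+\endcode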
+
+\subsection Per-codelet_feedback Per-codelet feedback
+
+The <c>per_worker_stats</c> field of the <c>struct starpu_codelet</c> structure is
+an array of counters. The i-th entry of the array is incremented every time a
+task implementing the codelet is executed on the i-th worker.
+This array is not reinitialized when profiling is enabled or disabled.
+
+\subsection Per-worker_feedback Per-worker feedback
+
+The starpu_profiling_worker_get_info() function fills its second
+argument, a <c>struct starpu_profiling_worker_info</c>, with
+statistics about the specified worker. This structure specifies when StarPU
+started collecting profiling information for that worker (<c>start_time</c>),
+the duration of the profiling measurement interval (<c>total_time</c>), the
+time spent executing kernels (<c>executing_time</c>), the time spent sleeping
+because there is no task to execute at all (<c>sleeping_time</c>), and the
+number of tasks that were executed while profiling was enabled.
+These values give an estimation of the proportion of time spent doing real work,
+and the time spent either sleeping because there are not enough executable
+tasks or simply wasted in pure StarPU overhead.
+
+Calling starpu_profiling_worker_get_info() resets the profiling
+information associated to a worker.
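+
+A sketch printing the proportion of time each worker spent executing
+kernels (note that each call resets the counters of the queried worker):
+
+\code{.c}
+unsigned workerid;
+for (workerid = 0; workerid < starpu_worker_get_count(); workerid++)
+{
+    struct starpu_profiling_worker_info info;
+    starpu_profiling_worker_get_info(workerid, &info);
+
+    double total = starpu_timing_timespec_to_us(&info.total_time);
+    double executing = starpu_timing_timespec_to_us(&info.executing_time);
+
+    fprintf(stderr, "worker %u: %.1f%% executing\n",
+            workerid, 100.0 * executing / total);
+}
+\endcode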
+
+When an FxT trace is generated (see \ref Generating_traces_with_FxT), it is also
+possible to use the <c>starpu_workers_activity</c> script (see \ref Monitoring_activity) to
+generate a graphic showing the evolution of these values over time, for
+the different workers.
+
+\subsection Bus-related_feedback Bus-related feedback
+
+TODO: ajouter STARPU_BUS_STATS
+
+\internal
+how to enable/disable performance monitoring
+what kind of information do we get ?
+\endinternal
+
+The bus speed measured by StarPU can be displayed by using the
+<c>starpu_machine_display</c> tool, for instance:
+
+\verbatim
+StarPU has found:
+        3 CUDA devices
+                CUDA 0 (Tesla C2050 02:00.0)
+                CUDA 1 (Tesla C2050 03:00.0)
+                CUDA 2 (Tesla C2050 84:00.0)
+from    to RAM          to CUDA 0       to CUDA 1       to CUDA 2
+RAM     0.000000        5176.530428     5176.492994     5191.710722
+CUDA 0  4523.732446     0.000000        2414.074751     2417.379201
+CUDA 1  4523.718152     2414.078822     0.000000        2417.375119
+CUDA 2  4534.229519     2417.069025     2417.060863     0.000000
+\endverbatim
+
+\subsection StarPU-Top_interface StarPU-Top interface
+
+StarPU-Top is an interface which remotely displays the on-line state of a StarPU
+application and permits the user to change parameters on the fly.
+
+Variables to be monitored can be registered by calling the
+starpu_top_add_data_boolean(), starpu_top_add_data_integer(),
+starpu_top_add_data_float() functions, e.g.:
+
+\code{.c}
+starpu_top_data *data = starpu_top_add_data_integer("mynum", 0, 100, 1);
+\endcode
+
+The application should then call starpu_top_init_and_wait() to give its name
+and wait for StarPU-Top to get a start request from the user. The name is used
+by StarPU-Top to quickly reload a previously-saved layout of parameter display.
+
+\code{.c}
+starpu_top_init_and_wait("the application");
+\endcode
+
+The new values can then be provided thanks to
+starpu_top_update_data_boolean(), starpu_top_update_data_integer(),
+starpu_top_update_data_float(), e.g.:
+
+\code{.c}
+starpu_top_update_data_integer(data, mynum);
+\endcode
+
+Updateable parameters can be registered thanks to starpu_top_register_parameter_boolean(), starpu_top_register_parameter_integer(), starpu_top_register_parameter_float(), e.g.:
+
+\code{.c}
+float alpha;
+starpu_top_register_parameter_float("alpha", &alpha, 0, 10, modif_hook);
+\endcode
+
+<c>modif_hook</c> is a function which will be called when the parameter is being modified; it can for instance print the new value:
+
+\code{.c}
+void modif_hook(struct starpu_top_param *d) {
+    fprintf(stderr,"%s has been modified: %f\n", d->name, alpha);
+}
+\endcode
+
+Task schedulers should notify StarPU-Top when they have decided when a task will be
+scheduled, so that it can be shown in the Gantt chart, for instance:
+
+\code{.c}
+starpu_top_task_prevision(task, workerid, begin, end);
+\endcode
+
+Starting StarPU-Top (via the binary
+<c>starpu_top</c>) and the application can be done in two ways:
+
+<ul>
+<li> The application is started by hand on some machine (and thus already
+waiting for the start event). In the Preference dialog of StarPU-Top, the SSH
+checkbox should be unchecked, and the hostname and port (default is 2011) on
+which the application is already running should be specified. Clicking on the
+connection button will thus connect to the already-running application.
+</li>
+<li> StarPU-Top is started first, and clicking on the connection button will
+start the application itself (possibly on a remote machine). The SSH checkbox
+should be checked, and a command line provided, e.g.:
+
+\verbatim
+$ ssh myserver STARPU_SCHED=dmda ./application
+\endverbatim
+
+If port 2011 of the remote machine can not be accessed directly, an SSH port forwarding should be added:
+
+\verbatim
+$ ssh -L 2011:localhost:2011 myserver STARPU_SCHED=dmda ./application
+\endverbatim
+
+and "localhost" should be used as IP Address to connect to.
+</li>
+</ul>
+
+\section Off-line_performance_feedback Off-line performance feedback
+
+\subsection Generating_traces_with_FxT Generating traces with FxT
+
+StarPU can use the FxT library (see
+https://savannah.nongnu.org/projects/fkt/) to generate traces
+with a limited runtime overhead.
+
+You can either get a tarball:
+
+\verbatim
+$ wget http://download.savannah.gnu.org/releases/fkt/fxt-0.2.11.tar.gz
+\endverbatim
+
+or use the FxT library from CVS (autotools are required):
+
+\verbatim
+$ cvs -d :pserver:anonymous\@cvs.sv.gnu.org:/sources/fkt co FxT
+$ ./bootstrap
+\endverbatim
+
+Compiling and installing the FxT library in the <c>$FXTDIR</c> path is
+done following the standard procedure:
+
+\verbatim
+$ ./configure --prefix=$FXTDIR
+$ make
+$ make install
+\endverbatim
+
+In order to have StarPU to generate traces, StarPU should be configured with
+the <c>--with-fxt</c> option:
+
+\verbatim
+$ ./configure --with-fxt=$FXTDIR
+\endverbatim
+
+Or you can simply point the <c>PKG_CONFIG_PATH</c> to
+<c>$FXTDIR/lib/pkgconfig</c> and pass <c>--with-fxt</c> to <c>./configure</c>
+
+When FxT is enabled, a trace is generated when StarPU is terminated by calling
+starpu_shutdown(). The trace is a binary file whose name has the form
+<c>prof_file_XXX_YYY</c> where <c>XXX</c> is the user name, and
+<c>YYY</c> is the pid of the process that used StarPU. This file is saved in the
+<c>/tmp/</c> directory by default, or in the directory specified by
+the <c>STARPU_FXT_PREFIX</c> environment variable.
+
+\subsection Creating_a_Gantt_Diagram Creating a Gantt Diagram
+
+When the FxT trace file <c>filename</c> has been generated, it is possible to
+generate a trace in the Paje format by calling:
+
+\verbatim
+$ starpu_fxt_tool -i filename
+\endverbatim
+
+Or alternatively, setting the <c>STARPU_GENERATE_TRACE</c> environment variable
+to <c>1</c> before application execution will make StarPU do it automatically at
+application shutdown.
+
+This will create a <c>paje.trace</c> file in the current directory that
+can be inspected with the <a href="http://vite.gforge.inria.fr/">ViTE trace
+visualizing open-source tool</a>.  It is possible to open the
+<c>paje.trace</c> file with ViTE by using the following command:
+
+\verbatim
+$ vite paje.trace
+\endverbatim
+
+To get names of tasks instead of "unknown", fill the optional <c>name</c> field
+of the codelets, or use a performance model for them.
+
+In the MPI execution case, collect the trace files from the MPI nodes, and
+specify them all on the <c>starpu_fxt_tool</c> command, for instance:
+
+\verbatim
+$ starpu_fxt_tool -i filename1 -i filename2
+\endverbatim
+
+By default, all tasks are displayed using a green color. To display tasks with
+varying colors, pass option <c>-c</c> to <c>starpu_fxt_tool</c>.
+
+Traces can also be inspected by hand by using the <c>fxt_print</c> tool, for instance:
+
+\verbatim
+$ fxt_print -o -f filename
+\endverbatim
+
+Timings are in nanoseconds (while timings as seen in <c>vite</c> are in milliseconds).
+
+\subsection Creating_a_DAG_with_graphviz Creating a DAG with graphviz
+
+When the FxT trace file <c>filename</c> has been generated, it is possible to
+generate a task graph in the DOT format by calling:
+
+\verbatim
+$ starpu_fxt_tool -i filename
+\endverbatim
+
+This will create a <c>dag.dot</c> file in the current directory. This file is a
+task graph described using the DOT language. It is possible to get a
+graphical output of the graph by using the graphviz library:
+
+\verbatim
+$ dot -Tpdf dag.dot -o output.pdf
+\endverbatim
+
+\subsection Monitoring_activity Monitoring activity
+
+When the FxT trace file <c>filename</c> has been generated, it is possible to
+generate an activity trace by calling:
+
+\verbatim
+$ starpu_fxt_tool -i filename
+\endverbatim
+
+This will create an <c>activity.data</c> file in the current
+directory. A profile of the application showing the activity of StarPU
+during the execution of the program can be generated:
+
+\verbatim
+$ starpu_workers_activity activity.data
+\endverbatim
+
+This will create a file named <c>activity.eps</c> in the current directory.
+This picture is composed of two parts.
+The first part shows the activity of the different workers. The green sections
+indicate which proportion of the time was spent executing kernels on the
+processing unit. The red sections indicate the proportion of time spent in
+StarPU: an important overhead may indicate that the granularity is too
+low, and that bigger tasks may be appropriate to use the processing unit more
+efficiently. The black sections indicate that the processing unit was blocked
+because there was no task to process: this may indicate a lack of parallelism
+which may be alleviated by creating more tasks when it is possible.
+
+The second part of the <c>activity.eps</c> picture is a graph showing the
+evolution of the number of tasks available in the system during the execution.
+Ready tasks are shown in black, and tasks that are submitted but not
+schedulable yet are shown in grey.
+
+\section Performance_of_codelets Performance of codelets
+
+The performance model of codelets (see \ref Performance_model_example) can be examined by using the
+<c>starpu_perfmodel_display</c> tool:
+
+\verbatim
+$ starpu_perfmodel_display -l
+file: <malloc_pinned.hannibal>
+file: <starpu_slu_lu_model_21.hannibal>
+file: <starpu_slu_lu_model_11.hannibal>
+file: <starpu_slu_lu_model_22.hannibal>
+file: <starpu_slu_lu_model_12.hannibal>
+\endverbatim
+
+Here, the codelets of the lu example are available. We can examine the
+performance of the 22 kernel (in micro-seconds), which is history-based:
+
+\verbatim
+$ starpu_perfmodel_display -s starpu_slu_lu_model_22
+performance model for cpu
+# hash      size       mean          dev           n
+57618ab0    19660800   2.851069e+05  1.829369e+04  109
+performance model for cuda_0
+# hash      size       mean          dev           n
+57618ab0    19660800   1.164144e+04  1.556094e+01  315
+performance model for cuda_1
+# hash      size       mean          dev           n
+57618ab0    19660800   1.164271e+04  1.330628e+01  360
+performance model for cuda_2
+# hash      size       mean          dev           n
+57618ab0    19660800   1.166730e+04  3.390395e+02  456
+\endverbatim
+
+We can see that for the given size, over a sample of a few hundred
+executions, the GPUs are about 20 times faster than the CPUs (numbers are in
+us). The standard deviation is extremely low for the GPUs, and less than 10% for
+the CPUs.
+
+This tool can also be used for regression-based performance models. It will then
+display the regression formula, and in the case of non-linear regression, the
+same performance log as for history-based performance models:
+
+\verbatim
+$ starpu_perfmodel_display -s non_linear_memset_regression_based
+performance model for cpu_impl_0
+	Regression : #sample = 1400
+	Linear: y = alpha size ^ beta
+		alpha = 1.335973e-03
+		beta = 8.024020e-01
+	Non-Linear: y = a size ^b + c
+		a = 5.429195e-04
+		b = 8.654899e-01
+		c = 9.009313e-01
+# hash		size		mean		stddev		n
+a3d3725e	4096           	4.763200e+00   	7.650928e-01   	100
+870a30aa	8192           	1.827970e+00   	2.037181e-01   	100
+48e988e9	16384          	2.652800e+00   	1.876459e-01   	100
+961e65d2	32768          	4.255530e+00   	3.518025e-01   	100
+...
+\endverbatim
+
+The same can also be achieved by using StarPU's library API, see
+the Performance Model API section and notably the starpu_perfmodel_load_symbol()
+function. The source code of the <c>starpu_perfmodel_display</c> tool can be a
+useful example.
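+
+A minimal sketch, assuming the symbol shown above exists in the performance
+model directory:
+
+\code{.c}
+#include <stdio.h>
+#include <string.h>
+#include <starpu.h>
+
+int load_model(void)
+{
+    struct starpu_perfmodel model;
+    memset(&model, 0, sizeof(model));
+    if (starpu_perfmodel_load_symbol("starpu_slu_lu_model_22", &model) != 0)
+    {
+        fprintf(stderr, "symbol not found\n");
+        return 1;
+    }
+    /* the fields of model can now be examined, as done by the
+       starpu_perfmodel_display tool */
+    return 0;
+}
+\endcode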
+
+The <c>starpu_perfmodel_plot</c> tool can be used to draw performance models.
+It writes a <c>.gp</c> file in the current directory, to be run in the
+<c>gnuplot</c> tool, which shows the corresponding curve.
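+
+For instance, a possible session (the exact name of the generated
+<c>.gp</c> file is printed by the tool; <c>*.gp</c> assumes an otherwise
+empty directory):
+
+\verbatim
+$ starpu_perfmodel_plot -s starpu_slu_lu_model_22
+$ gnuplot *.gp
+\endverbatim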
+
+When the <c>flops</c> field of tasks is set, <c>starpu_perfmodel_plot</c> can
+directly draw a GFlops curve, by simply adding the <c>-f</c> option:
+
+\verbatim
+$ starpu_perfmodel_plot -f -s chol_model_11
+\endverbatim
+
+This will however disable displaying the regression model, for which we cannot
+compute GFlops.
+
+When the FxT trace file <c>filename</c> has been generated, it is possible to
+get a profiling of each codelet by calling:
+
+\verbatim
+$ starpu_fxt_tool -i filename
+$ starpu_codelet_profile distrib.data codelet_name
+\endverbatim
+
+This will create profiling data files, and a <c>.gp</c> file in the current
+directory, which draws the distribution of codelet time over the application
+execution, according to data input size.
+
+This is also available in the <c>starpu_perfmodel_plot</c> tool, by passing it
+the FxT trace:
+
+\verbatim
+$ starpu_perfmodel_plot -s non_linear_memset_regression_based -i /tmp/prof_file_foo_0
+\endverbatim
+
+It will produce a <c>.gp</c> file which contains both the performance model
+curves, and the profiling measurements.
+
+If you have the R statistical tool installed, you can additionally use
+
+\verbatim
+$ starpu_codelet_histo_profile distrib.data
+\endverbatim
+
+which will create one PDF file per codelet and per input size, showing a
+histogram of the codelet execution time distribution.
+
+\section Theoretical_lower_bound_on_execution_time Theoretical lower bound on execution time
+
+StarPU can record a trace of what tasks are needed to complete the
+application, and then, by using a linear system, provide a theoretical lower
+bound of the execution time (i.e. with an ideal scheduling).
+
+The computed bound is not really correct when dependencies are not taken into
+account, but for an application which has enough parallelism it is very
+close to the bound computed with dependencies enabled (which takes a lot
+more time to compute), and thus provides a good-enough estimate of the ideal
+execution time.
+
+\ref Theoretical_lower_bound_on_execution_time provides an example on how to
+use this.
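+
+A minimal sketch of the calls involved, here without taking dependencies
+(first argument) nor priorities (second argument) into account:
+
+\code{.c}
+/* start recording the submitted tasks */
+starpu_bound_start(0, 0);
+
+/* ... submit the tasks of the application ... */
+
+starpu_task_wait_for_all();
+starpu_bound_stop();
+
+/* solve the linear system and print the bound */
+double min_time;
+starpu_bound_compute(&min_time, NULL, 0);
+fprintf(stderr, "theoretical lower bound: %f\n", min_time);
+\endcode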
+
+\section Memory_feedback Memory feedback
+
+It is possible to enable memory statistics. To do so, you need to pass the option
+<c>--enable-memory-stats</c> when running configure. It is then
+possible to call the function starpu_display_memory_stats() to
+display statistics about the current data handles registered within StarPU.
+
+Moreover, statistics will be displayed at the end of the execution on
+data handles which have not been cleared out. This can be disabled by
+setting the environment variable <c>STARPU_MEMORY_STATS</c> to 0.
+
+For example, if you do not unregister data at the end of the complex
+example, you will get something similar to:
+
+\verbatim
+$ STARPU_MEMORY_STATS=0 ./examples/interface/complex
+Complex[0] = 45.00 + 12.00 i
+Complex[0] = 78.00 + 78.00 i
+Complex[0] = 45.00 + 12.00 i
+Complex[0] = 45.00 + 12.00 i
+\endverbatim
+
+\verbatim
+$ STARPU_MEMORY_STATS=1 ./examples/interface/complex
+Complex[0] = 45.00 + 12.00 i
+Complex[0] = 78.00 + 78.00 i
+Complex[0] = 45.00 + 12.00 i
+Complex[0] = 45.00 + 12.00 i
+
+#---------------------
+Memory stats:
+#-------
+Data on Node #3
+#-----
+Data : 0x553ff40
+Size : 16
+
+#--
+Data access stats
+/!\ Work Underway
+Node #0
+	Direct access : 4
+	Loaded (Owner) : 0
+	Loaded (Shared) : 0
+	Invalidated (was Owner) : 0
+
+Node #3
+	Direct access : 0
+	Loaded (Owner) : 0
+	Loaded (Shared) : 1
+	Invalidated (was Owner) : 0
+
+#-----
+Data : 0x5544710
+Size : 16
+
+#--
+Data access stats
+/!\ Work Underway
+Node #0
+	Direct access : 2
+	Loaded (Owner) : 0
+	Loaded (Shared) : 1
+	Invalidated (was Owner) : 1
+
+Node #3
+	Direct access : 0
+	Loaded (Owner) : 1
+	Loaded (Shared) : 0
+	Invalidated (was Owner) : 0
+\endverbatim
+
+\section Data_statistics Data statistics
+
+Different data statistics can be displayed at the end of the execution
+of the application. To enable them, you need to pass the option
+<c>--enable-stats</c> when calling <c>configure</c>. When calling
+starpu_shutdown(), various statistics will be displayed:
+execution time, MSI cache statistics, allocation cache statistics, and data
+transfer statistics. The display can be disabled by setting the
+environment variable <c>STARPU_STATS</c> to 0.
+
+\verbatim
+$ ./examples/cholesky/cholesky_tag
+Computation took (in ms)
+518.16
+Synthetic GFlops : 44.21
+#---------------------
+MSI cache stats :
+TOTAL MSI stats	hit 1622 (66.23 %)	miss 827 (33.77 %)
+...
+\endverbatim
+
+\verbatim
+$ STARPU_STATS=0 ./examples/cholesky/cholesky_tag
+Computation took (in ms)
+518.16
+Synthetic GFlops : 44.21
+\endverbatim
+
+\internal
+TODO: data transfer stats are similar to the ones displayed when
+setting STARPU_BUS_STATS
+\endinternal
+
+*/

+ 171 - 0
doc/doxygen/chapters/refman.tex

@@ -0,0 +1,171 @@
+\documentclass{book}
+\usepackage[a4paper,top=2.5cm,bottom=2.5cm,left=2.5cm,right=2.5cm]{geometry}
+\usepackage{makeidx}
+\usepackage{natbib}
+\usepackage{graphicx}
+\usepackage{multicol}
+\usepackage{float}
+\usepackage{listings}
+\usepackage{color}
+\usepackage{ifthen}
+\usepackage[table]{xcolor}
+\usepackage{textcomp}
+\usepackage{alltt}
+\usepackage{ifpdf}
+\ifpdf
+\usepackage[pdftex,
+            pagebackref=true,
+            colorlinks=true,
+            linkcolor=blue,
+            unicode
+           ]{hyperref}
+\else
+\usepackage[ps2pdf,
+            pagebackref=true,
+            colorlinks=true,
+            linkcolor=blue,
+            unicode
+           ]{hyperref}
+\usepackage{pspicture}
+\fi
+\usepackage[utf8]{inputenc}
+\usepackage{mathptmx}
+\usepackage[scaled=.90]{helvet}
+\usepackage{courier}
+\usepackage{sectsty}
+\usepackage{amssymb}
+\usepackage[titles]{tocloft}
+\usepackage{doxygen}
+\lstset{language=C++,inputencoding=utf8,basicstyle=\footnotesize,breaklines=true,breakatwhitespace=true,tabsize=8,numbers=left }
+\makeindex
+\setcounter{tocdepth}{3}
+\renewcommand{\footrulewidth}{0.4pt}
+\renewcommand{\familydefault}{\sfdefault}
+\hfuzz=15pt
+\setlength{\emergencystretch}{15pt}
+\hbadness=750
+\tolerance=750
+\begin{document}
+\hypersetup{pageanchor=false,citecolor=blue}
+\begin{titlepage}
+\vspace*{4cm}
+{\Huge \textbf{StarPU Handbook}}\\
+\rule{\textwidth}{1.5mm}
+\begin{flushright}
+{\Large for StarPU 1.2.0}
+\end{flushright}
+\rule{\textwidth}{1mm}
+~\\
+\vspace*{15cm}
+\begin{flushright}
+Generated by Doxygen $doxygenversion on $datetime
+\end{flushright}
+\end{titlepage}
+
+\begin{figure}[p]
+This manual documents the usage of StarPU version 1.2.0. Its contents
+were last updated on 24 May 2013.\\
+
+Copyright © 2009–2013 Université de Bordeaux 1\\
+
+Copyright © 2010–2013 Centre National de la Recherche Scientifique\\
+
+Copyright © 2011, 2012 Institut National de Recherche en Informatique et Automatique\\
+
+\medskip
+
+\begin{quote}
+Permission is granted to copy, distribute and/or modify this document
+under the terms of the GNU Free Documentation License, Version 1.3 or
+any later version published by the Free Software Foundation; with no
+Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A
+copy of the license is included in the section entitled “GNU Free
+Documentation License”.
+\end{quote}
+\end{figure}
+
+\clearemptydoublepage
+\pagenumbering{roman}
+\tableofcontents
+\clearemptydoublepage
+\pagenumbering{arabic}
+\hypersetup{pageanchor=true,citecolor=blue}
+
+\chapter{Introduction}
+\label{index}
+\hypertarget{index}{}
+\input{index}
+
+\chapter{Building and Installing Star\-P\-U}
+\label{buildingAndInstalling}
+\hypertarget{buildingAndInstalling}{}
+\input{buildingAndInstalling}
+
+\chapter{Basic Examples}
+\label{basicExamples}
+\hypertarget{basicExamples}{}
+\input{basicExamples}
+
+\chapter{Advanced Examples}
+\label{advancedExamples}
+\hypertarget{advancedExamples}{}
+\input{advancedExamples}
+
+\chapter{How to optimize performance with StarPU}
+\label{optimizePerformance}
+\hypertarget{optimizePerformance}{}
+\input{optimizePerformance}
+
+\chapter{Performance Feedback}
+\label{performanceFeedback}
+\hypertarget{performanceFeedback}{}
+\input{performanceFeedback}
+
+\chapter{Tips and Tricks to know about}
+\label{tipsTricks}
+\hypertarget{tipsTricks}{}
+\input{tipsTricks}
+
+\chapter{StarPU MPI Support}
+\label{mpiSupport}
+\hypertarget{mpiSupport}{}
+\input{mpiSupport}
+
+\chapter{StarPU FFT Support}
+\label{fftSupport}
+\hypertarget{fftSupport}{}
+\input{fftSupport}
+
+\chapter{C Extensions}
+\label{cExtensions}
+\hypertarget{cExtensions}{}
+\input{cExtensions}
+
+\chapter{SOCL OpenCL Extensions}
+\label{soclOpenclExtensions}
+\hypertarget{soclOpenclExtensions}{}
+\input{soclOpenclExtensions}
+
+\chapter{Scheduling Contexts in StarPU}
+\label{schedulingContexts}
+\hypertarget{schedulingContexts}{}
+\input{schedulingContexts}
+
+\chapter{Scheduling Context Hypervisor}
+\label{schedulingContextHypervisor}
+\hypertarget{schedulingContextHypervisor}{}
+\input{schedulingContextHypervisor}
+
+\chapter{StarPU's API}
+\label{starpuAPI}
+\hypertarget{starpuAPI}{}
+\input{group__Versioning}
+\input{group__Initialization__and__Termination}
+
+\printindex
+\end{document}
+

+ 109 - 0
doc/doxygen/chapters/scheduling_context_hypervisor.doxy

@@ -0,0 +1,109 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page schedulingContextHypervisor Scheduling Context Hypervisor
+
+\section What_is_the_Hypervisor What is the Hypervisor
+
+StarPU proposes a platform to construct Scheduling Contexts, and to delete and modify them dynamically.
+A parallel kernel can thus be isolated into a scheduling context, and interferences between several parallel kernels are avoided.
+If the user knows exactly how many workers each scheduling context needs, he can assign them to the contexts at their creation time or modify them during the execution of the program.
+
+The Scheduling Context Hypervisor Plugin is available for users whose applications do not exhibit a regular parallelism, who cannot know in advance the exact size of the contexts, and who need to resize the contexts according to the behavior of the parallel kernels.
+The Hypervisor receives information from StarPU concerning the execution of the tasks, the efficiency of the resources, etc., and decides accordingly when and how the contexts can be resized.
+Basic strategies for resizing scheduling contexts already exist, but a platform for implementing additional custom ones is available.
+
+\section Start_the_Hypervisor Start the Hypervisor
+
+The Hypervisor must be initialised once at the beginning of the application. At this point a resizing policy should be indicated. This strategy depends on the information the application is able to provide to the hypervisor, as well
+as on the accuracy needed for the resizing procedure. For example, the application may be able to provide an estimation of the workload of the contexts. In this situation the hypervisor may decide what resources the contexts need.
+However, if no information is provided, the hypervisor evaluates the behavior of the resources and of the application and makes a guess about the future.
+The hypervisor resizes only the registered contexts.
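+
+A sketch of the initialization, assuming the existing "idle" resizing
+policy and an already created context <c>sched_ctx</c>:
+
+\code{.c}
+struct sc_hypervisor_policy policy;
+policy.custom = 0;
+policy.name = "idle";
+void *perf_counters = sc_hypervisor_init(&policy);
+
+/* make the context report its activity to the hypervisor,
+   then register it (0.0: no workload estimation provided) */
+starpu_sched_ctx_set_perf_counters(sched_ctx, perf_counters);
+sc_hypervisor_register_ctx(sched_ctx, 0.0);
+\endcode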
+
+\section Interrogate_the_runtime Interrogate the runtime
+
+The runtime provides the hypervisor with information concerning the behavior of the resources and the application. This is done by using the performance counters: callbacks indicating when the resources are idle or not efficient, when the application submits tasks, or when it becomes too slow.
+
+\section Trigger_the_Hypervisor Trigger the Hypervisor
+
+The resizing is triggered either when the application requires it, or when the initial distribution of resources alters the performance of the application (the application is too slow, or the resources are idle for too long a time, thresholds indicated by the user). When this happens, different resizing strategies are applied, targeting the minimization of the total execution time of the application, of the instantaneous speed, or of the idle time of the resources.
+
+\section Resizing_strategies Resizing strategies
+
+The plugin proposes several strategies for resizing the scheduling context.
+
+The <b>Application driven</b> strategy uses the user's input concerning the moments when the contexts should be resized.
+Thus, the user tags the tasks that should trigger the resizing process, either by directly setting the field <c>hypervisor_tag</c> in the <c>starpu_task</c> data structure, or by
+using the macro <c>STARPU_HYPERVISOR_TAG</c> in the <c>starpu_insert_task</c> function.
+
+\code{.c}
+task.hypervisor_tag = 2;
+\endcode
+
+or
+
+\code{.c}
+starpu_insert_task(&codelet,
+		    ...,
+		    STARPU_HYPERVISOR_TAG, 2,
+                    0);
+\endcode
+
+Then the user has to indicate that when a task with the specified tag is executed the contexts should be resized.
+
+\code{.c}
+sc_hypervisor_resize(sched_ctx, 2);
+\endcode
+
+The user can use the same tag to change the resizing configuration of the contexts if considered necessary.
+
+\code{.c}
+sc_hypervisor_ioctl(sched_ctx,
+                    HYPERVISOR_MIN_WORKERS, 6,
+                    HYPERVISOR_MAX_WORKERS, 12,
+                    HYPERVISOR_TIME_TO_APPLY, 2,
+                    NULL);
+\endcode
+
+
+The <b>Idleness</b>-based strategy resizes the scheduling contexts every time one of their workers stays idle
+for a period longer than the one imposed by the user (see the section on the user's input in the resizing process).
+
+\code{.c}
+int workerids[3] = {1, 3, 10};
+int workerids2[9] = {0, 2, 4, 5, 6, 7, 8, 9, 11};
+sc_hypervisor_ioctl(sched_ctx_id,
+            HYPERVISOR_MAX_IDLE, workerids, 3, 10000.0,
+            HYPERVISOR_MAX_IDLE, workerids2, 9, 50000.0,
+            NULL);
+\endcode
+
+The <b>Gflops rate</b>-based strategy resizes the scheduling contexts such that they all finish at the same time.
+The velocity of each of them is considered, and once one of them is significantly slower the resizing process is triggered.
+In order to do these computations, the user has to input the total number of instructions to be executed by the
+parallel kernels, and the number of instructions to be executed by each task.
+The number of flops to be executed by a context is passed as a parameter when the context is registered to the hypervisor
+(<c>sc_hypervisor_register_ctx(sched_ctx_id, flops)</c>), and the number to be executed by each task is passed when the task is submitted.
+The corresponding field in the <c>starpu_task</c> data structure is <c>flops</c> and
+the corresponding macro in the starpu_insert_task() function is <c>STARPU_FLOPS</c>. When the task is executed,
+the resizing process is triggered.
+
+\code{.c}
+task.flops = 100;
+\endcode
+
+or
+
+\code{.c}
+starpu_insert_task(&codelet,
+                    ...,
+                    STARPU_FLOPS, (double) 100,
+                    0);
+\endcode
+
+*/

+ 102 - 0
doc/doxygen/chapters/scheduling_contexts.doxy

@@ -0,0 +1,102 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page schedulingContexts Scheduling Contexts in StarPU
+
+TODO: improve!
+
+\section General_Idea General Idea
+
+Scheduling contexts represent abstract sets of workers that allow the programmer to control the distribution of computational resources (i.e. CPUs and
+GPUs) to concurrent parallel kernels. The main goal is to minimize interferences between the execution of multiple parallel kernels, by partitioning the underlying pool of workers using contexts.
+
+\section Create_a_Context Create a Context
+
+By default, the application submits tasks to an initial context, which has access to all the computation resources available to StarPU (all the workers).
+If the application programmer plans to launch several parallel kernels simultaneously, by default these kernels will be executed within this initial context, using a single scheduling policy (see \ref Task_scheduling_policy).
+Meanwhile, if the application programmer is aware of the demands of these kernels and of the specificities of the machine used to execute them, the workers can be divided between several contexts.
+These scheduling contexts will isolate the execution of each kernel, and they will permit the use of a scheduling policy proper to each one of them.
+In order to create the contexts, you have to know the identifiers of the workers running within StarPU.
+By passing a set of workers together with the scheduling policy to the function starpu_sched_ctx_create(), you will get an identifier of the context created, which you will use to indicate the context you want to submit the tasks to.
+
+\code{.c}
+/* the list of resources the context will manage */
+int workerids[3] = {1, 3, 10};
+
+/* indicate the scheduling policy to be used within the context, the list of
+   workers assigned to it, the number of workers, the name of the context */
+int id_ctx = starpu_sched_ctx_create("dmda", workerids, 3, "my_ctx");
+
+/* let StarPU know that the following tasks will be submitted to this context */
+starpu_sched_ctx_set_task_context(id_ctx);
+
+/* submit the task to StarPU */
+starpu_task_submit(task);
+\endcode
+
+Note: Parallel greedy and parallel heft scheduling policies do not support the existence of several disjoint contexts on the machine.
+Combined workers are constructed depending on the entire topology of the machine, not only the one belonging to a context.
+
+\section Modify_a_Context Modify a Context
+
+A scheduling context can be modified dynamically. The application may change its requirements during the execution, and the programmer can add additional workers to a context or remove those no longer needed.
+In the following example we have two scheduling contexts <c>sched_ctx1</c> and <c>sched_ctx2</c>. After executing a part of the tasks, some of the workers of <c>sched_ctx1</c> will be moved to context <c>sched_ctx2</c>.
+
+\code{.c}
+/* the list of resources that context 1 will give away */
+int workerids[3] = {1, 3, 10};
+
+/* add the workers to context 2 */
+starpu_sched_ctx_add_workers(workerids, 3, sched_ctx2);
+
+/* remove the workers from context 1 */
+starpu_sched_ctx_remove_workers(workerids, 3, sched_ctx1);
+\endcode
+
+\section Delete_a_Context Delete a Context
+
+When a context is no longer needed, it must be deleted. The application can indicate which context should keep the resources of the deleted one.
+All the tasks of the context should have been executed before doing this. If the application needs to avoid a barrier before moving the resources from the deleted context to the inheritor one, it can just indicate
+when the last task was submitted. The resources will then be moved as soon as this last task has finished executing, but the context should still be deleted at some point of the application.
+
+\code{.c}
+/* when context 2 is deleted, context 1 will inherit its resources */
+starpu_sched_ctx_set_inheritor(sched_ctx2, sched_ctx1);
+
+/* submit tasks to context 2 */
+for (i = 0; i < ntasks; i++)
+    starpu_task_submit_to_ctx(task[i], sched_ctx2);
+
+/* indicate that context 2 finished submitting and that */
+/* as soon as the last task of context 2 finished executing */
+/* its workers can be moved to the inheritor context */
+starpu_sched_ctx_finished_submit(sched_ctx2);
+
+/* wait for the tasks of both contexts to finish */
+starpu_task_wait_for_all();
+
+/* delete context 2 */
+starpu_sched_ctx_delete(sched_ctx2);
+
+/* delete context 1 */
+starpu_sched_ctx_delete(sched_ctx1);
+\endcode
+
+\section Empty_Context Empty Context
+
+A context may not have any resources at the beginning or at a certain moment of the execution. Tasks can still be submitted to these contexts, and they will be executed as soon as the contexts have resources.
+A list of tasks pending to be executed is kept, and these tasks are submitted when workers are added to the contexts. However, if no resources are ever allocated, the program will not terminate.
+If these tasks have low priority, the programmer can prevent the application from submitting them by calling the function starpu_sched_ctx_stop_task_submission().
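+
+A sketch, with illustrative names, assuming a context can be created
+without workers:
+
+\code{.c}
+/* no workers given at creation time */
+unsigned empty_ctx = starpu_sched_ctx_create("dmda", NULL, 0, "empty_ctx");
+
+/* ... tasks submitted to empty_ctx wait for resources ... */
+
+/* low-priority tasks: stop submitting them for now */
+starpu_sched_ctx_stop_task_submission();
+\endcode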
+
+\section Contexts_Sharing_Workers Contexts Sharing Workers
+
+Contexts may share workers when a single context cannot execute efficiently enough alone on these workers, or when the application decides to express a hierarchy of contexts. The workers apply
+a round-robin algorithm to choose the context on which they will ``pop'' next. By using the function <c>void starpu_sched_ctx_set_turn_to_other_ctx(int workerid, unsigned sched_ctx_id)</c>,
+the programmer can force the worker <c>workerid</c> to ``pop'' in the context <c>sched_ctx_id</c> next.
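+
+For instance, a sketch assuming an existing context <c>sched_ctx2</c>:
+
+\code{.c}
+/* force worker 1 to pop next from sched_ctx2 */
+starpu_sched_ctx_set_turn_to_other_ctx(1, sched_ctx2);
+\endcode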
+
+*/

+ 23 - 0
doc/doxygen/chapters/socl_opencl_extensions.doxy

@@ -0,0 +1,23 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page soclOpenclExtensions SOCL OpenCL Extensions
+
+SOCL is an OpenCL implementation based on StarPU. It gives a unified access to
+every available OpenCL device: applications can now share entities such as
+Events, Contexts or Command Queues between several OpenCL implementations.
+
+In addition, command queues that are created without specifying a device provide
+automatic scheduling of the submitted commands on OpenCL devices contained in
+the context to which the command queue is attached.
+
+Note: as of this version of StarPU, this is still an area under
+development and subject to change.
+
+
+*/

+ 98 - 0
doc/doxygen/chapters/tips_and_tricks.doxy

@@ -0,0 +1,98 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page tipsTricks Tips and Tricks to know about
+
+\section How_to_initialize_a_computation_library_once_for_each_worker How to initialize a computation library once for each worker?
+
+Some libraries need to be initialized once for each concurrent instance that
+may run on the machine. For instance, a C++ computation class may not be
+thread-safe by itself, while several instantiated objects of that class
+can be used concurrently. This can be handled in StarPU by initializing one such
+object per worker. For instance, the libstarpufft example does the following to
+be able to use FFTW.
+
+A global array stores the instantiated objects:
+
+\code{.c}
+fftw_plan plan_cpu[STARPU_NMAXWORKERS];
+\endcode
+
+At initialization time of libstarpu, the objects are initialized:
+
+\code{.c}
+int workerid;
+for (workerid = 0; workerid < starpu_worker_get_count(); workerid++) {
+    switch (starpu_worker_get_type(workerid)) {
+        case STARPU_CPU_WORKER:
+            plan_cpu[workerid] = fftw_plan(...);
+            break;
+    }
+}
+\endcode
+
+And in the codelet body, they are used:
+
+\code{.c}
+static void fft(void *descr[], void *_args)
+{
+    int workerid = starpu_worker_get_id();
+    fftw_plan plan = plan_cpu[workerid];
+    ...
+
+    fftw_execute(plan, ...);
+}
+\endcode
+
+Another way to go which may be needed is to execute some code from the workers
+themselves thanks to starpu_execute_on_each_worker(). This may be required
+by CUDA to behave properly due to threading issues. For instance, StarPU's
+starpu_cublas_init() looks like the following to call
+<c>cublasInit</c> from the workers themselves:
+
+\code{.c}
+static void init_cublas_func(void *args STARPU_ATTRIBUTE_UNUSED)
+{
+    cublasStatus cublasst = cublasInit();
+    cublasSetKernelStream(starpu_cuda_get_local_stream());
+}
+void starpu_cublas_init(void)
+{
+    starpu_execute_on_each_worker(init_cublas_func, NULL, STARPU_CUDA);
+}
+\endcode
+
+\section How_to_limit_memory_per_node How to limit memory per node
+
+TODO
+
+Talk about
+<c>STARPU_LIMIT_CUDA_devid_MEM</c>, <c>STARPU_LIMIT_CUDA_MEM</c>,
+<c>STARPU_LIMIT_OPENCL_devid_MEM</c>, <c>STARPU_LIMIT_OPENCL_MEM</c>
+and <c>STARPU_LIMIT_CPU_MEM</c>
+
+starpu_memory_get_available()
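+
+For instance, a sketch of a typical use (the value is expressed in MB;
+<c>./my_app</c> is a placeholder application):
+
+\verbatim
+$ STARPU_LIMIT_CUDA_MEM=1024 ./my_app
+\endverbatim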
+
+\section Thread_Binding_on_NetBSD Thread Binding on NetBSD
+
+When using StarPU on a NetBSD machine, if the topology
+discovery library <c>hwloc</c> is used, thread binding will fail. To
+prevent the problem, you should use at least version 1.7 of
+<c>hwloc</c>, and also issue the following call:
+
+\verbatim
+$ sysctl -w security.models.extensions.user_set_cpu_affinity=1
+\endverbatim
+
+Or add the following line to the file <c>/etc/sysctl.conf</c>:
+
+\verbatim
+security.models.extensions.user_set_cpu_affinity=1
+\endverbatim
+
+*/