Corentin Salingue, 12 years ago
Parent commit: fbbc1a9ff6
100 files changed, 15190 additions and 5 deletions
  1. .gitignore (+2 -0)
  2. ChangeLog (+1 -0)
  3. Makefile.am (+1 -1)
  4. configure.ac (+4 -1)
  5. doc/doxygen/Makefile.am (+210 -0)
  6. doc/doxygen/chapters/advanced_examples.doxy (+1231 -0)
  7. doc/doxygen/chapters/api/codelet_and_tasks.doxy (+690 -0)
  8. doc/doxygen/chapters/api/cuda_extensions.doxy (+80 -0)
  9. doc/doxygen/chapters/api/data_interfaces.doxy (+988 -0)
  10. doc/doxygen/chapters/api/data_management.doxy (+260 -0)
  11. doc/doxygen/chapters/api/data_partition.doxy (+258 -0)
  12. doc/doxygen/chapters/api/expert_mode.doxy (+25 -0)
  13. doc/doxygen/chapters/api/explicit_dependencies.doxy (+113 -0)
  14. doc/doxygen/chapters/api/fft_support.doxy (+63 -0)
  15. doc/doxygen/chapters/api/fxt_support.doxy (+81 -0)
  16. doc/doxygen/chapters/api/implicit_dependencies.doxy (+42 -0)
  17. doc/doxygen/chapters/api/initialization.doxy (+264 -0)
  18. doc/doxygen/chapters/api/insert_task.doxy (+98 -0)
  19. doc/doxygen/chapters/api/lower_bound.doxy (+50 -0)
  20. doc/doxygen/chapters/api/mic_extensions.doxy (+28 -0)
  21. doc/doxygen/chapters/api/misc_helpers.doxy (+36 -0)
  22. doc/doxygen/chapters/api/mpi.doxy (+276 -0)
  23. doc/doxygen/chapters/api/multiformat_data_interface.doxy (+71 -0)
  24. doc/doxygen/chapters/api/opencl_extensions.doxy (+249 -0)
  25. doc/doxygen/chapters/api/parallel_tasks.doxy (+51 -0)
  26. doc/doxygen/chapters/api/performance_model.doxy (+271 -0)
  27. doc/doxygen/chapters/api/profiling.doxy (+176 -0)
  28. doc/doxygen/chapters/api/running_driver.doxy (+39 -0)
  29. doc/doxygen/chapters/api/scc_extensions.doxy (+28 -0)
  30. doc/doxygen/chapters/api/scheduling_context_hypervisor.doxy (+295 -0)
  31. doc/doxygen/chapters/api/scheduling_contexts.doxy (+248 -0)
  32. doc/doxygen/chapters/api/scheduling_policy.doxy (+174 -0)
  33. doc/doxygen/chapters/api/standard_memory_library.doxy (+64 -0)
  34. doc/doxygen/chapters/api/task_bundles.doxy (+59 -0)
  35. doc/doxygen/chapters/api/task_lists.doxy (+68 -0)
  36. doc/doxygen/chapters/api/top.doxy (+216 -0)
  37. doc/doxygen/chapters/api/versioning.doxy (+28 -0)
  38. doc/doxygen/chapters/api/workers.doxy (+178 -0)
  39. doc/doxygen/chapters/basic_examples.doxy (+732 -0)
  40. doc/doxygen/chapters/building.doxy (+292 -0)
  41. doc/doxygen/chapters/c_extensions.doxy (+360 -0)
  42. doc/doxygen/chapters/code/cholesky_pragma.c (+50 -0)
  43. doc/doxygen/chapters/code/complex.c (+25 -0)
  44. doc/doxygen/chapters/code/forkmode.c (+42 -0)
  45. doc/doxygen/chapters/code/hello_pragma.c (+46 -0)
  46. doc/doxygen/chapters/code/hello_pragma2.c (+43 -0)
  47. doc/doxygen/chapters/code/matmul_pragma.c (+73 -0)
  48. doc/doxygen/chapters/code/matmul_pragma2.c (+29 -0)
  49. doc/doxygen/chapters/code/multiformat.c (+61 -0)
  50. doc/doxygen/chapters/code/simgrid.c (+32 -0)
  51. doc/doxygen/chapters/code/vector_scal_c.c (+128 -0)
  52. doc/doxygen/chapters/code/vector_scal_cpu.c (+78 -0)
  53. doc/doxygen/chapters/code/vector_scal_cuda.cu (+45 -0)
  54. doc/doxygen/chapters/code/vector_scal_opencl.c (+72 -0)
  55. doc/doxygen/chapters/code/vector_scal_opencl_codelet.cl (+25 -0)
  56. doc/doxygen/chapters/configure_options.doxy (+501 -0)
  57. doc/doxygen/chapters/environment_variables.doxy (+531 -0)
  58. doc/doxygen/chapters/fdl-1.3.doxy (+518 -0)
  59. doc/doxygen/chapters/fft_support.doxy (+71 -0)
  60. doc/doxygen/chapters/introduction.doxy (+242 -0)
  61. doc/doxygen/chapters/mic_scc_support.doxy (+56 -0)
  62. doc/doxygen/chapters/mpi_support.doxy (+377 -0)
  63. doc/doxygen/chapters/optimize_performance.doxy (+522 -0)
  64. doc/doxygen/chapters/performance_feedback.doxy (+580 -0)
  65. doc/doxygen/chapters/scaling-vector-example.doxy (+34 -0)
  66. doc/doxygen/chapters/scheduling_context_hypervisor.doxy (+145 -0)
  67. doc/doxygen/chapters/scheduling_contexts.doxy (+136 -0)
  68. doc/doxygen/chapters/socl_opencl_extensions.doxy (+21 -0)
  69. doc/doxygen/chapters/tips_and_tricks.doxy (+98 -0)
  70. doc/doxygen/doxygen-config.cfg.in (+33 -0)
  71. doc/doxygen/doxygen.cfg (+1904 -0)
  72. doc/doxygen/doxygen_filter.sh.in (+9 -0)
  73. doc/doxygen/foreword.html (+20 -0)
  74. doc/doxygen/refman.tex (+240 -0)
  75. doc/texinfo/Makefile.am (+0 -0)
  76. doc/texinfo/chapters/advanced-examples.texi (+0 -0)
  77. doc/texinfo/chapters/api.texi (+0 -0)
  78. doc/texinfo/chapters/basic-examples.texi (+0 -0)
  79. doc/texinfo/chapters/c-extensions.texi (+0 -0)
  80. doc/texinfo/chapters/configuration.texi (+0 -0)
  81. doc/texinfo/chapters/fdl-1.3.texi (+0 -0)
  82. doc/texinfo/chapters/fft-support.texi (+0 -0)
  83. doc/texinfo/chapters/hypervisor_api.texi (+0 -0)
  84. doc/texinfo/chapters/installing.texi (+0 -0)
  85. doc/texinfo/chapters/introduction.texi (+0 -0)
  86. doc/chapters/mic-scc-support.texi (+3 -3)
  87. doc/texinfo/chapters/mpi-support.texi (+0 -0)
  88. doc/texinfo/chapters/perf-feedback.texi (+0 -0)
  89. doc/texinfo/chapters/perf-optimization.texi (+0 -0)
  90. doc/texinfo/chapters/sc_hypervisor.texi (+0 -0)
  91. doc/texinfo/chapters/scaling-vector-example.texi (+0 -0)
  92. doc/texinfo/chapters/sched_ctx.texi (+0 -0)
  93. doc/texinfo/chapters/socl.texi (+0 -0)
  94. doc/texinfo/chapters/tips-tricks.texi (+0 -0)
  95. doc/texinfo/chapters/vector_scal_c.texi (+0 -0)
  96. doc/texinfo/chapters/vector_scal_cpu.texi (+0 -0)
  97. doc/texinfo/chapters/vector_scal_cuda.texi (+0 -0)
  98. doc/texinfo/chapters/vector_scal_opencl.texi (+0 -0)
  99. doc/texinfo/chapters/vector_scal_opencl_codelet.texi (+0 -0)
  100. doc/starpu.css (+0 -0)

+ 2 - 0
.gitignore

@@ -28,6 +28,8 @@ starpu.log
 /tests/datawizard/data_lookup
 /doc/stamp-vti
 /doc/chapters/version.texi
+/doc/doxygen/chapters/version.sty
+/doc/doxygen/chapters/version.html
 /examples/basic_examples/block
 /examples/basic_examples/hello_world
 /examples/basic_examples/mult

+ 1 - 0
ChangeLog

@@ -204,6 +204,7 @@ Changes:
   * Tutorial is installed in ${docdir}/tutorial
   * Schedulers eager_central_policy, dm and dmda no longer erroneously respect
     priorities. dmdas has to be used to respect priorities.
+  * Documentation is now generated through doxygen.
 
 Small changes:
   * STARPU_NCPU should now be used instead of STARPU_NCPUS. STARPU_NCPUS is

+ 1 - 1
Makefile.am

@@ -21,7 +21,7 @@ SUBDIRS = src
 SUBDIRS += tools tests
 
 if BUILD_DOC
-SUBDIRS += doc
+SUBDIRS += doc/doxygen
 endif
 
 if USE_MPI

+ 4 - 1
configure.ac

@@ -2176,6 +2176,7 @@ AC_CONFIG_COMMANDS([executable-scripts], [
   chmod +x tools/starpu_codelet_profile
   chmod +x tools/starpu_codelet_histo_profile
   chmod +x tools/starpu_workers_activity
+  chmod +x doc/doxygen/doxygen_filter.sh
 ])
 
 # Create links to ICD files in build/socl/vendors directory. SOCL will use this
@@ -2219,7 +2220,6 @@ AC_OUTPUT([
 	examples/stencil/Makefile
 	tests/Makefile
 	tests/loader-cross.sh
-	doc/Makefile
 	mpi/Makefile
 	mpi/src/Makefile
 	mpi/tests/Makefile
@@ -2235,6 +2235,9 @@ AC_OUTPUT([
 	sc_hypervisor/Makefile
 	sc_hypervisor/src/Makefile
 	sc_hypervisor/examples/Makefile
+	doc/doxygen/Makefile
+	doc/doxygen/doxygen-config.cfg
+	doc/doxygen/doxygen_filter.sh
 ])
 
 AC_MSG_NOTICE([

+ 210 - 0
doc/doxygen/Makefile.am

@@ -0,0 +1,210 @@
+# StarPU --- Runtime system for heterogeneous multicore architectures.
+#
+# Copyright (C) 2009, 2011  Université de Bordeaux 1
+# Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+#
+# Permission is granted to copy, distribute and/or modify this document
+# under the terms of the GNU Free Documentation License, Version 1.3
+# or any later version published by the Free Software Foundation;
+# with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
+#
+# See the GNU Free Documentation License in COPYING.GFDL for more details.
+
+
+DOXYGEN = doxygen
+PDFLATEX = pdflatex
+MAKEINDEX = makeindex
+
+DOX_DIR = $(top_builddir)/doc/doxygen
+DOX_CONFIG = $(top_srcdir)/doc/doxygen/doxygen.cfg
+
+DOX_HTML_DIR = html
+DOX_LATEX_DIR = latex
+DOX_PDF = starpu.pdf
+DOX_TAG = starpu.tag
+
+chapters =	\
+	chapters/advanced_examples.doxy \
+	chapters/basic_examples.doxy \
+	chapters/building.doxy \
+	chapters/c_extensions.doxy \
+	chapters/fft_support.doxy \
+	chapters/introduction.doxy \
+	chapters/mpi_support.doxy \
+	chapters/optimize_performance.doxy \
+	chapters/performance_feedback.doxy \
+	chapters/scheduling_context_hypervisor.doxy \
+	chapters/scheduling_contexts.doxy \
+	chapters/socl_opencl_extensions.doxy \
+	chapters/tips_and_tricks.doxy \
+	chapters/environment_variables.doxy \
+	chapters/configure_options.doxy \
+	chapters/fdl-1.3.doxy \
+	chapters/scaling-vector-example.doxy \
+	chapters/mic_scc_support.doxy \
+	chapters/code/hello_pragma2.c \
+	chapters/code/hello_pragma.c \
+	chapters/code/matmul_pragma.c \
+	chapters/code/matmul_pragma2.c \
+	chapters/code/cholesky_pragma.c \
+	chapters/code/forkmode.c \
+	chapters/code/multiformat.c \
+	chapters/code/complex.c \
+	chapters/code/simgrid.c \
+	chapters/code/vector_scal_c.c \
+	chapters/code/vector_scal_cpu.c \
+	chapters/code/vector_scal_cuda.cu \
+	chapters/code/vector_scal_opencl.c \
+	chapters/code/vector_scal_opencl_codelet.cl \
+	chapters/api/codelet_and_tasks.doxy \
+	chapters/api/cuda_extensions.doxy \
+	chapters/api/data_interfaces.doxy \
+	chapters/api/data_management.doxy \
+	chapters/api/data_partition.doxy \
+	chapters/api/expert_mode.doxy \
+	chapters/api/explicit_dependencies.doxy \
+	chapters/api/fft_support.doxy \
+	chapters/api/fxt_support.doxy \
+	chapters/api/implicit_dependencies.doxy \
+	chapters/api/initialization.doxy \
+	chapters/api/insert_task.doxy \
+	chapters/api/lower_bound.doxy \
+	chapters/api/misc_helpers.doxy \
+	chapters/api/mpi.doxy \
+	chapters/api/multiformat_data_interface.doxy \
+	chapters/api/opencl_extensions.doxy \
+	chapters/api/mic_extensions.doxy \
+	chapters/api/scc_extensions.doxy \
+	chapters/api/parallel_tasks.doxy \
+	chapters/api/performance_model.doxy \
+	chapters/api/profiling.doxy \
+	chapters/api/running_driver.doxy \
+	chapters/api/scheduling_context_hypervisor.doxy \
+	chapters/api/scheduling_contexts.doxy \
+	chapters/api/scheduling_policy.doxy \
+	chapters/api/standard_memory_library.doxy \
+	chapters/api/task_bundles.doxy \
+	chapters/api/task_lists.doxy \
+	chapters/api/top.doxy \
+	chapters/api/versioning.doxy \
+	chapters/api/workers.doxy
+
+chapters/version.sty: $(chapters)
+	@-for f in $(chapters) ; do \
+                if test -f $(top_srcdir)/doc/doxygen/$$f ; then stat --format=%Y $(top_srcdir)/doc/doxygen/$$f 2>/dev/null ; fi \
+        done | sort -r | head -1 > timestamp
+	@if test -s timestamp ; then \
+		LC_ALL=C date --date=@`cat timestamp` +"%d %B %Y" > timestamp_updated 2>/dev/null;\
+		LC_ALL=C date --date=@`cat timestamp` +"%B %Y" > timestamp_updated_month 2>/dev/null;\
+	fi
+	@if test -s timestamp_updated ; then \
+		echo "\newcommand{\STARPUUPDATED}{"`cat timestamp_updated`"}" > $(top_srcdir)/doc/doxygen/chapters/version.sty;\
+	else \
+		echo "\newcommand{\STARPUUPDATED}{unknown_date}" > $(top_srcdir)/doc/doxygen/chapters/version.sty;\
+	fi
+	@echo "\newcommand{\STARPUVERSION}{$(VERSION)}" >> $(top_srcdir)/doc/doxygen/chapters/version.sty
+	@-for f in timestamp timestamp_updated timestamp_updated_month ; do \
+		if test -f $$f ; then $(RM) $$f ; fi ;\
+	done
+
+chapters/version.html: $(chapters)
+	@-for f in $(chapters) ; do \
+                if test -f $(top_srcdir)/doc/doxygen/$$f ; then stat --format=%Y $(top_srcdir)/doc/doxygen/$$f 2>/dev/null ; fi \
+        done | sort -r | head -1 > timestamp
+	@if test -s timestamp ; then \
+		LC_ALL=C date --date=@`cat timestamp` +"%d %B %Y" > timestamp_updated 2>/dev/null;\
+		LC_ALL=C date --date=@`cat timestamp` +"%B %Y" > timestamp_updated_month 2>/dev/null;\
+	fi
+	@echo "This manual documents the usage of StarPU version $(VERSION)." > $(top_srcdir)/doc/doxygen/chapters/version.html
+	@if test -s timestamp_updated ; then \
+		echo "Its contents were last updated on "`cat timestamp_updated`"." >> $(top_srcdir)/doc/doxygen/chapters/version.html;\
+	else \
+		echo "Its contents were last updated on <em>unknown_date</em>." >> $(top_srcdir)/doc/doxygen/chapters/version.html;\
+	fi
+	@-for f in timestamp timestamp_updated timestamp_updated_month ; do \
+		if test -f $$f ; then $(RM) $$f ; fi ;\
+	done
+
+EXTRA_DIST	= 		\
+	$(chapters) 		\
+	chapters/version.sty	\
+	chapters/version.html	\
+	doxygen.cfg 		\
+	refman.tex
+
+dox_inputs = $(DOX_CONFIG) 				\
+	$(chapters) 					\
+	chapters/version.sty				\
+	chapters/version.html				\
+	$(top_srcdir)/include/starpu.h			\
+	$(top_srcdir)/include/starpu_data_filters.h	\
+	$(top_srcdir)/include/starpu_data_interfaces.h	\
+	$(top_srcdir)/include/starpu_worker.h		\
+	$(top_srcdir)/include/starpu_task.h		\
+	$(top_srcdir)/include/starpu_task_bundle.h	\
+	$(top_srcdir)/include/starpu_task_list.h	\
+	$(top_srcdir)/include/starpu_task_util.h	\
+	$(top_srcdir)/include/starpu_data.h		\
+	$(top_srcdir)/include/starpu_perfmodel.h	\
+	$(top_srcdir)/include/starpu_util.h		\
+	$(top_srcdir)/include/starpu_fxt.h		\
+	$(top_srcdir)/include/starpu_cuda.h		\
+	$(top_srcdir)/include/starpu_opencl.h		\
+	$(top_srcdir)/include/starpu_sink.h		\
+	$(top_srcdir)/include/starpu_mic.h		\
+	$(top_srcdir)/include/starpu_scc.h		\
+	$(top_srcdir)/include/starpu_expert.h		\
+	$(top_srcdir)/include/starpu_profiling.h	\
+	$(top_srcdir)/include/starpu_bound.h		\
+	$(top_srcdir)/include/starpu_scheduler.h	\
+	$(top_srcdir)/include/starpu_sched_ctx.h	\
+	$(top_srcdir)/include/starpu_top.h		\
+	$(top_srcdir)/include/starpu_hash.h		\
+	$(top_srcdir)/include/starpu_rand.h		\
+	$(top_srcdir)/include/starpu_cublas.h		\
+	$(top_srcdir)/include/starpu_driver.h		\
+	$(top_srcdir)/include/starpu_stdlib.h		\
+	$(top_srcdir)/include/starpu_thread.h		\
+	$(top_srcdir)/include/starpu_thread_util.h
+
+$(DOX_TAG): $(dox_inputs)
+	rm -fr $(DOX_HTML_DIR) $(DOX_LATEX_DIR)
+	$(DOXYGEN) $(DOX_CONFIG)
+	sed -i 's/ModuleDocumentation <\/li>/<a class="el" href="modules.html">Modules<\/a>/' html/index.html
+
+dist_pdf_DATA = $(DOX_PDF)
+
+$(DOX_PDF): $(DOX_TAG) refman.tex
+	cp $(top_srcdir)/doc/doxygen/chapters/version.sty $(DOX_LATEX_DIR)
+	cd $(DOX_LATEX_DIR); \
+	rm -f *.aux *.toc *.idx *.ind *.ilg *.log *.out; \
+	sed -i -e 's/__env__/\\_Environment Variables!/' -e 's/\\-\\_\\-\\-\\_\\-env\\-\\_\\-\\-\\_\\-//' ExecutionConfigurationThroughEnvironmentVariables.tex ;\
+	sed -i -e 's/__configure__/\\_Configure Options!/' -e 's/\\-\\_\\-\\-\\_\\-configure\\-\\_\\-\\-\\_\\-//' CompilationConfiguration.tex ;\
+	sed -i s'/\\item Module\\-Documentation/\\item \\hyperlink{ModuleDocumentation}{Module Documentation}/' index.tex ;\
+	$(PDFLATEX) refman.tex; \
+	$(MAKEINDEX) refman.idx;\
+	$(PDFLATEX) refman.tex; \
+	done=0; repeat=5; \
+	while test $$done = 0 -a $$repeat -gt 0; do \
+           if $(EGREP) 'Rerun (LaTeX|to get cross-references right)' refman.log > /dev/null 2>&1; then \
+	       $(PDFLATEX) refman.tex; \
+	       repeat=`expr $$repeat - 1`; \
+	   else \
+	       done=1; \
+	   fi; \
+	done; \
+	mv refman.pdf ../$(DOX_PDF)
+
+CLEANFILES = $(DOX_TAG) \
+    -r \
+    $(DOX_HTML_DIR) \
+    $(DOX_LATEX_DIR) \
+    $(DOX_PDF)
+
+# Rule to update documentation on web server. Should only be used locally.
+PUBLISHHOST	?= sync
+update-web: $(DOX_PDF)
+	scp -pr starpu.pdf html $(PUBLISHHOST):/web/runtime/html/StarPU/doc
+
+showcheck:
+	-cat /dev/null
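The `chapters/version.sty` and `chapters/version.html` rules above share one trick: take the newest modification time among the chapter sources and turn it into a human-readable date stamp. A self-contained sketch of that trick, assuming GNU coreutils (`stat --format`, `date --date`, `touch -d`) and using throwaway files in place of the real chapters:

```shell
#!/bin/sh
# Pick the newest mtime among a set of files and format it as a date,
# as the version.sty rule does.  Demonstrated on temporary files.
tmpdir=$(mktemp -d)
touch -d '2013-05-01 12:00' "$tmpdir/introduction.doxy"   # older chapter
touch -d '2013-06-15 12:00' "$tmpdir/building.doxy"       # newest chapter

# Newest mtime: emit every timestamp, sort descending, keep the first
# (the Makefile uses `sort -r | head -1`; -n makes the sort numeric).
newest=$(for f in "$tmpdir"/*.doxy; do
             stat --format=%Y "$f" 2>/dev/null
         done | sort -rn | head -1)

# Format it the way the rule does, in the C locale.
stamp=$(LC_ALL=C date --date="@$newest" +"%d %B %Y")
echo "$stamp"
rm -rf "$tmpdir"
```

The real rules then write that stamp into a `\newcommand` for LaTeX and a sentence for the HTML foreword.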

Diff not shown because this file is too large
+ 1231 - 0
doc/doxygen/chapters/advanced_examples.doxy


+ 690 - 0
doc/doxygen/chapters/api/codelet_and_tasks.doxy

@@ -0,0 +1,690 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Université de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Codelet_And_Tasks Codelet And Tasks
+
+\brief This section describes the interface to manipulate codelets and tasks.
+
+\enum starpu_codelet_type
+\ingroup API_Codelet_And_Tasks
+Describes the type of parallel task. See \ref ParallelTasks for details.
+\var starpu_codelet_type::STARPU_SEQ
+\ingroup API_Codelet_And_Tasks
+(default) for classical sequential tasks.
+\var starpu_codelet_type::STARPU_SPMD
+\ingroup API_Codelet_And_Tasks
+for a parallel task whose threads are handled by StarPU; the code has
+to use starpu_combined_worker_get_size() and
+starpu_combined_worker_get_rank() to distribute the work.
+\var starpu_codelet_type::STARPU_FORKJOIN
+\ingroup API_Codelet_And_Tasks
+for a parallel task whose threads are started by the codelet function,
+which has to use starpu_combined_worker_get_size() to determine how
+many threads should be started.
+
+\enum starpu_task_status
+\ingroup API_Codelet_And_Tasks
+Task status
+\var starpu_task_status::STARPU_TASK_INVALID
+\ingroup API_Codelet_And_Tasks
+The task has just been initialized.
+\var starpu_task_status::STARPU_TASK_BLOCKED
+\ingroup API_Codelet_And_Tasks
+The task has just been submitted, and its dependencies have not been
+checked yet.
+\var starpu_task_status::STARPU_TASK_READY
+\ingroup API_Codelet_And_Tasks
+The task is ready for execution.
+\var starpu_task_status::STARPU_TASK_RUNNING
+\ingroup API_Codelet_And_Tasks
+The task is running on some worker.
+\var starpu_task_status::STARPU_TASK_FINISHED
+\ingroup API_Codelet_And_Tasks
+The task is finished executing.
+\var starpu_task_status::STARPU_TASK_BLOCKED_ON_TAG
+\ingroup API_Codelet_And_Tasks
+The task is waiting for a tag.
+\var starpu_task_status::STARPU_TASK_BLOCKED_ON_TASK
+\ingroup API_Codelet_And_Tasks
+The task is waiting for a task.
+\var starpu_task_status::STARPU_TASK_BLOCKED_ON_DATA
+\ingroup API_Codelet_And_Tasks
+The task is waiting for some data.
+
+
+\def STARPU_CPU
+\ingroup API_Codelet_And_Tasks
+This macro is used when setting the field starpu_codelet::where
+to specify the codelet may be executed on a CPU processing unit.
+
+\def STARPU_CUDA
+\ingroup API_Codelet_And_Tasks
+This macro is used when setting the field starpu_codelet::where
+to specify the codelet may be executed on a CUDA processing unit.
+
+\def STARPU_OPENCL
+\ingroup API_Codelet_And_Tasks
+This macro is used when setting the field starpu_codelet::where to
+specify the codelet may be executed on an OpenCL processing unit.
+
+\def STARPU_MIC
+\ingroup API_Codelet_And_Tasks
+This macro is used when setting the field starpu_codelet::where to
+specify the codelet may be executed on a MIC processing unit.
+
+\def STARPU_SCC
+\ingroup API_Codelet_And_Tasks
+This macro is used when setting the field starpu_codelet::where to
+specify the codelet may be executed on an SCC processing unit.
+
+\def STARPU_MULTIPLE_CPU_IMPLEMENTATIONS
+\deprecated
+\ingroup API_Codelet_And_Tasks
+Setting the field starpu_codelet::cpu_func with this macro
+indicates the codelet will have several implementations. The use of
+this macro is deprecated. One should define only the field
+starpu_codelet::cpu_funcs.
+
+\def STARPU_MULTIPLE_CUDA_IMPLEMENTATIONS
+\deprecated
+\ingroup API_Codelet_And_Tasks
+Setting the field starpu_codelet::cuda_func with this macro
+indicates the codelet will have several implementations. The use of
+this macro is deprecated. One should define only the field
+starpu_codelet::cuda_funcs.
+
+\def STARPU_MULTIPLE_OPENCL_IMPLEMENTATIONS
+\deprecated
+\ingroup API_Codelet_And_Tasks
+Setting the field starpu_codelet::opencl_func with
+this macro indicates the codelet will have several implementations.
+The use of this macro is deprecated. One should define only the
+field starpu_codelet::opencl_funcs.
+
+\def starpu_cpu_func_t
+\ingroup API_Codelet_And_Tasks
+CPU implementation of a codelet.
+
+\def starpu_cuda_func_t
+\ingroup API_Codelet_And_Tasks
+CUDA implementation of a codelet.
+
+\def starpu_opencl_func_t
+\ingroup API_Codelet_And_Tasks
+OpenCL implementation of a codelet.
+
+\def starpu_mic_func_t
+\ingroup API_Codelet_And_Tasks
+MIC implementation of a codelet.
+
+\def starpu_scc_func_t
+\ingroup API_Codelet_And_Tasks
+SCC implementation of a codelet.
+
+\struct starpu_codelet
+The codelet structure describes a kernel that is possibly
+implemented on various targets. For compatibility, make sure to
+initialize the whole structure to zero, either with an explicit
+memset, by calling the function starpu_codelet_init(), or by letting
+the compiler implicitly do it, as in the case of static storage.
+\ingroup API_Codelet_And_Tasks
+\var starpu_codelet::where
+Optional field to indicate which types of processing units are able to
+execute the codelet. The different values ::STARPU_CPU, ::STARPU_CUDA,
+::STARPU_OPENCL can be combined to specify on which types of processing
+units the codelet can be executed. ::STARPU_CPU|::STARPU_CUDA for instance
+indicates that the codelet is implemented for both CPU cores and CUDA
+devices while ::STARPU_OPENCL indicates that it is only available on
+OpenCL devices. If the field is unset, its value will be automatically
+set based on the availability of the XXX_funcs fields defined below.
+
+\var starpu_codelet::can_execute
+Define a function which should return 1 if the worker designated by
+workerid can execute the <c>nimpl</c>th implementation of the given
+task, 0 otherwise.
+
+\var starpu_codelet::type
+Optional field to specify the type of the codelet. The default is
+::STARPU_SEQ, i.e. usual sequential implementation. Other values
+(::STARPU_SPMD or ::STARPU_FORKJOIN) declare that a parallel implementation
+is also available. See \ref ParallelTasks for details.
+
+\var starpu_codelet::max_parallelism
+Optional field. If a parallel implementation is available, this
+denotes the maximum combined worker size that StarPU will use to
+execute parallel tasks for this codelet.
+
+\var starpu_codelet::cpu_func
+\deprecated
+Optional field which has been made deprecated. One should use instead
+the field starpu_codelet::cpu_funcs.
+
+\var starpu_codelet::cuda_func
+\deprecated
+Optional field which has been made deprecated. One should use instead
+the starpu_codelet::cuda_funcs field.
+
+\var starpu_codelet::opencl_func
+\deprecated
+Optional field which has been made deprecated. One should use instead
+the starpu_codelet::opencl_funcs field.
+
+\var starpu_codelet::cpu_funcs
+Optional array of function pointers to the CPU implementations of the
+codelet. It must be terminated by a NULL value. The prototype of
+these functions must be:
+\code{.c}
+void cpu_func(void *buffers[], void *cl_arg)
+\endcode
+The first argument is the array of data managed by the data
+management library, and the second argument is a pointer to the
+argument passed from the field starpu_task::cl_arg. If the field
+starpu_codelet::where is set, then the field starpu_codelet::cpu_funcs
+is ignored if ::STARPU_CPU does not appear in the field
+starpu_codelet::where; otherwise it must be non-null.
+
+\var starpu_codelet::cpu_funcs_name
+Optional array of strings which provide the name of the CPU functions
+referenced in the array starpu_codelet::cpu_funcs. This can be used
+when running on MIC devices or the SCC platform, for StarPU to simply
+look up the MIC function implementation through its name.
+
+\var starpu_codelet::cuda_funcs
+Optional array of function pointers to the CUDA implementations of the
+codelet. It must be terminated by a NULL value. The functions must be
+host-functions written in the CUDA runtime API. Their prototype must
+be:
+\code{.c}
+void cuda_func(void *buffers[], void *cl_arg)
+\endcode
+If the field starpu_codelet::where is set, then the field
+starpu_codelet::cuda_funcs is ignored if ::STARPU_CUDA does not appear
+in the field starpu_codelet::where; otherwise it must be non-null.
+
+\var starpu_codelet::opencl_funcs
+Optional array of function pointers to the OpenCL implementations of
+the codelet. It must be terminated by a NULL value. The prototype of
+these functions must be:
+\code{.c}
+void opencl_func(void *buffers[], void *cl_arg)
+\endcode
+If the field starpu_codelet::where is set, then the field
+starpu_codelet::opencl_funcs is ignored if ::STARPU_OPENCL does not
+appear in the field starpu_codelet::where; otherwise it must be
+non-null.
+
+\var starpu_codelet::mic_funcs
+Optional array of function pointers to a function which returns the
+MIC implementation of the codelet. It must be terminated by a NULL
+value. The prototype of these functions must be:
+\code{.c}
+starpu_mic_kernel_t mic_func(struct starpu_codelet *cl, unsigned nimpl)
+\endcode
+If the field starpu_codelet::where is set, then the field
+starpu_codelet::mic_funcs is ignored if ::STARPU_MIC does not appear
+in the field starpu_codelet::where. It can be null if
+starpu_codelet::cpu_funcs_name is non-NULL, in which case StarPU will
+simply make a symbol lookup to get the implementation.
+
+\var starpu_codelet::scc_funcs
+Optional array of function pointers to a function which returns the
+SCC implementation of the codelet. It must be terminated by a NULL value.
+The prototype of these functions must be:
+\code{.c}
+starpu_scc_kernel_t scc_func(struct starpu_codelet *cl, unsigned nimpl)
+\endcode
+If the field starpu_codelet::where is set, then the field
+starpu_codelet::scc_funcs is ignored if ::STARPU_SCC does not appear
+in the field starpu_codelet::where. It can be null if
+starpu_codelet::cpu_funcs_name is non-NULL, in which case StarPU will
+simply make a symbol lookup to get the implementation.
+
+\var starpu_codelet::nbuffers
+Specify the number of arguments taken by the codelet. These arguments
+are managed by the DSM and are accessed from the <c>void *buffers[]</c>
+array. The constant argument passed with the field starpu_task::cl_arg
+is not counted in this number. This value should not be above
+\ref STARPU_NMAXBUFS.
+
+\var starpu_codelet::modes
+Is an array of ::starpu_data_access_mode. It describes the required
+access modes to the data needed by the codelet (e.g. ::STARPU_RW). The
+number of entries in this array must be specified in the field
+starpu_codelet::nbuffers, and should not exceed \ref STARPU_NMAXBUFS. If
+insufficient, this value can be set with the configure option
+\ref enable-maxbuffers "--enable-maxbuffers".
+
+\var starpu_codelet::dyn_modes
+Is an array of ::starpu_data_access_mode. It describes the required
+access modes to the data needed by the codelet (e.g. ::STARPU_RW).
+The number of entries in this array must be specified in the field
+starpu_codelet::nbuffers. This field should be used for codelets
+taking more data than \ref STARPU_NMAXBUFS (see \ref
+SettingTheDataHandlesForATask). When defining a codelet, one
+should either define this field or the field starpu_codelet::modes defined above.
+
+\var starpu_codelet::model
+Optional pointer to the task duration performance model associated to
+this codelet. This optional field is ignored when set to <c>NULL</c> or when
+its field starpu_perfmodel::symbol is not set.
+
+\var starpu_codelet::power_model
+Optional pointer to the task power consumption performance model
+associated to this codelet. This optional field is ignored when set to
+<c>NULL</c> or when its field starpu_perfmodel::symbol is not set. In
+the case of parallel codelets, this has to account for all processing
+units involved in the parallel execution.
+
+\var starpu_codelet::per_worker_stats
+Optional array for statistics collected at runtime: this is filled by
+StarPU and should not be accessed directly, but only through helpers
+such as the function starpu_codelet_display_stats().
+
+\var starpu_codelet::name
+Optional name of the codelet. This can be useful for debugging
+purposes.
+
+\fn void starpu_codelet_init(struct starpu_codelet *cl)
+\ingroup API_Codelet_And_Tasks
+Initialize \p cl with default values. Codelets should
+preferably be initialized statically as shown in \ref
+DefiningACodelet. However, such an initialization is not always
+possible, e.g. when using C++.
+
+\struct starpu_data_descr
+\ingroup API_Codelet_And_Tasks
+This type is used to describe a data handle along with an access mode.
+\var starpu_data_descr::handle
+describes a data
+\var starpu_data_descr::mode
+describes its access mode
+
+\struct starpu_task
+\ingroup API_Codelet_And_Tasks
+The structure describes a task that can be offloaded on the
+various processing units managed by StarPU. It instantiates a codelet.
+It can either be allocated dynamically with the function
+starpu_task_create(), or declared statically. In the latter case, the
+programmer has to zero the structure starpu_task and to fill the
+different fields properly. The indicated default values correspond to
+the configuration of a task allocated with starpu_task_create().
+\var starpu_task::cl
+Is a pointer to the corresponding structure starpu_codelet. This
+describes where the kernel should be executed, and supplies the
+appropriate implementations. When set to NULL, no code is executed
+during the task; such empty tasks can be useful for synchronization
+purposes.
+\var starpu_task::buffers
+\deprecated
+This field has been made deprecated. One should use instead the
+field starpu_task::handles to specify the data handles accessed
+by the task. The access modes are now defined in the field
+starpu_codelet::modes.
+\var starpu_task::handles
+Is an array of ::starpu_data_handle_t. It specifies the handles to the
+different pieces of data accessed by the task. The number of entries
+in this array must be specified in the field starpu_codelet::nbuffers,
+and should not exceed \ref STARPU_NMAXBUFS. If insufficient, this value can
+be set with the configure option \ref enable-maxbuffers "--enable-maxbuffers".
+\var starpu_task::dyn_handles
+Is an array of ::starpu_data_handle_t. It specifies the handles to the
+different pieces of data accessed by the task. The number of entries
+in this array must be specified in the field starpu_codelet::nbuffers.
+This field should be used for tasks taking more data
+than \ref STARPU_NMAXBUFS (see \ref SettingTheDataHandlesForATask).
+When defining a task, one should either define this field or the field
+starpu_task::handles defined above.
+
+\var starpu_task::interfaces
+The actual data pointers to the memory node where execution will
+happen, managed by the DSM.
+
+\var starpu_task::dyn_interfaces
+The actual data pointers to the memory node where execution will
+happen, managed by the DSM. It is used when the field
+starpu_task::dyn_handles is defined.
+
+\var starpu_task::cl_arg
+Optional pointer which is passed to the codelet through the second
+argument of the codelet implementation (e.g. starpu_codelet::cpu_func
+or starpu_codelet::cuda_func). The default value is <c>NULL</c>.
+
+\var starpu_task::cl_arg_size
+Optional field. For some specific drivers, the pointer
+starpu_task::cl_arg cannot be directly given to the driver
+function. A buffer of size starpu_task::cl_arg_size needs to be
+allocated on the driver. This buffer is then filled with the
+starpu_task::cl_arg_size bytes starting at address
+starpu_task::cl_arg. In this case, the argument given to the codelet
+is therefore not the starpu_task::cl_arg pointer, but the address of
+the buffer in local store (LS) instead. This field is ignored for CPU,
+CUDA and OpenCL codelets, where the starpu_task::cl_arg pointer is
+given as such.
+
+\var starpu_task::cl_arg_free
+Optional field. In case starpu_task::cl_arg was allocated by the
+application through <c>malloc()</c>, setting starpu_task::cl_arg_free
+to 1 makes StarPU automatically call <c>free(cl_arg)</c> when
+destroying the task. This saves the user from defining a callback just
+for that. This is mostly useful when targeting MIC or SCC, where the
+codelet does not execute in the same memory space as the main thread.
+
+\var starpu_task::callback_func
+Optional field, the default value is <c>NULL</c>. This is a function
+pointer of prototype <c>void (*f)(void *)</c> which specifies a
+possible callback. If this pointer is non-null, the callback function
+is executed on the host after the execution of the task. Tasks which
+depend on it might already be executing. The callback is passed the
+value contained in the starpu_task::callback_arg field. No callback is
+executed if the field is set to NULL.
+
+\var starpu_task::callback_arg
+Optional field, the default value is <c>NULL</c>. This is the pointer
+passed to the callback function. This field is ignored if the field
+starpu_task::callback_func is set to <c>NULL</c>.
+
+\var starpu_task::use_tag
+Optional field, the default value is 0. If set, this flag indicates
+that the task should be associated with the tag contained in the
+starpu_task::tag_id field. Tags allow the application to synchronize
+with the task and to express task dependencies easily.
+
+\var starpu_task::tag_id
+This optional field contains the tag associated with the task if the
+field starpu_task::use_tag is set; it is ignored otherwise.
+
+\var starpu_task::sequential_consistency
+If this flag is set (which is the default), sequential consistency is
+enforced for the data parameters of this task for which sequential
+consistency is enabled. Clearing this flag disables sequential
+consistency for this task, even for data on which it is enabled.
+
+\var starpu_task::synchronous
+If this flag is set, the function starpu_task_submit() is blocking and
+returns only when the task has been executed (or if no worker is able
+to process the task). Otherwise, starpu_task_submit() returns
+immediately.
+
+\var starpu_task::priority
+Optional field, the default value is ::STARPU_DEFAULT_PRIO. This field
+indicates a level of priority for the task. This is an integer value
+that must be set between the return values of the function
+starpu_sched_get_min_priority() for the least important tasks, and
+that of the function starpu_sched_get_max_priority() for the most
+important tasks (included). The ::STARPU_MIN_PRIO and ::STARPU_MAX_PRIO
+macros are provided for convenience and respectively return the values
+of starpu_sched_get_min_priority() and
+starpu_sched_get_max_priority(). Default priority is
+::STARPU_DEFAULT_PRIO, which is always defined as 0 in order to allow
+static task initialization. Scheduling strategies that take priorities
+into account can use this parameter to take better scheduling
+decisions, but the scheduling policy may also ignore it.
+
+\var starpu_task::execute_on_a_specific_worker
+Default value is 0. If this flag is set, StarPU will bypass the
+scheduler and directly assign this task to the worker specified by the
+field starpu_task::workerid.
+
+\var starpu_task::workerid
+Optional field. If the field starpu_task::execute_on_a_specific_worker
+is set, this field indicates the identifier of the worker that should
+process this task (as returned by starpu_worker_get_id()). This field
+is ignored if the field starpu_task::execute_on_a_specific_worker is
+set to 0.
+
+\var starpu_task::bundle
+Optional field. The bundle that includes this task. If no bundle is
+used, this should be NULL.
+
+\var starpu_task::detach
+Optional field, default value is 1. If this flag is set, it is not
+possible to synchronize with the task by means of starpu_task_wait()
+later on. If the flag is not set, internal data structures are only
+guaranteed to be freed once starpu_task_wait() has been called.
+
+\var starpu_task::destroy
+Optional value. Default value is 0 for starpu_task_init(), and 1 for
+starpu_task_create(). If this flag is set, the task structure will
+automatically be freed, either after the execution of the callback if
+the task is detached, or during starpu_task_wait() otherwise. If this
+flag is not set, dynamically allocated data structures will not be
+freed until starpu_task_destroy() is called explicitly. Setting this
+flag for a statically allocated task structure will result in
+undefined behaviour. The flag is set to 1 when the task is created by
+calling starpu_task_create(). Note that starpu_task_wait_for_all()
+will not free any task.
+
+\var starpu_task::regenerate
+Optional field. If this flag is set, the task will be re-submitted to
+StarPU once it has been executed. This flag must not be set if the
+flag starpu_task::destroy is set.
+
+\var starpu_task::status
+Optional field. Current state of the task.
+
+\var starpu_task::profiling_info
+Optional field. Profiling information for the task.
+
+\var starpu_task::predicted
+Output field. Predicted duration of the task. This field is only set
+if the scheduling strategy uses performance models.
+
+\var starpu_task::predicted_transfer
+Optional field. Predicted data transfer duration for the task in
+microseconds. This field is only valid if the scheduling strategy uses
+performance models.
+
+\var starpu_task::prev
+\private
+A pointer to the previous task. This should only be used by StarPU.
+
+\var starpu_task::next
+\private
+A pointer to the next task. This should only be used by StarPU.
+
+\var starpu_task::mf_skip
+\private
+This is only used for tasks that use multiformat handle. This should
+only be used by StarPU.
+
+\var starpu_task::flops
+This can be set to the number of floating point operations that the
+task will perform. This is useful for easily getting GFlops
+curves from the tool <c>starpu_perfmodel_plot</c>, and for the
+hypervisor load balancing.
+
+\var starpu_task::starpu_private
+\private
+This is private to StarPU, do not modify. If the task is allocated by
+hand (without starpu_task_create()), this field should be set to NULL.
+
+\var starpu_task::magic
+\private
+This field is set when initializing a task. The function
+starpu_task_submit() will fail if the field does not have the right
+value. This will hence avoid submitting tasks which have not been
+properly initialised.
+
+\var starpu_task::sched_ctx
+Scheduling context.
+
+\var starpu_task::hypervisor_tag
+Helps the hypervisor monitor the execution of this task.
+
+\var starpu_task::scheduled
+Whether the scheduler has pushed the task on some queue.
+
+\fn void starpu_task_init(struct starpu_task *task)
+\ingroup API_Codelet_And_Tasks
+Initialize task with default values. This function is
+implicitly called by starpu_task_create(). By default, tasks initialized
+with starpu_task_init() must be deinitialized explicitly with
+starpu_task_clean(). Tasks can also be initialized statically, using
+::STARPU_TASK_INITIALIZER.
+
+\def STARPU_TASK_INITIALIZER
+\ingroup API_Codelet_And_Tasks
+It is possible to initialize statically allocated tasks with
+this value. This is equivalent to initializing a structure starpu_task
+with the function starpu_task_init().
+
+\def STARPU_TASK_GET_HANDLE(struct starpu_task *task, int i)
+\ingroup API_Codelet_And_Tasks
+Return the \p i th data handle of the given task. If the task
+is defined with a static or dynamic number of handles, will either
+return the \p i th element of the field starpu_task::handles or the \p
+i th element of the field starpu_task::dyn_handles (see \ref
+SettingTheDataHandlesForATask).
+
+\def STARPU_TASK_SET_HANDLE(struct starpu_task *task, starpu_data_handle_t handle, int i)
+\ingroup API_Codelet_And_Tasks
+Set the \p i th data handle of the given task with the given
+data handle. If the task is defined with a static or dynamic number of
+handles, will either set the \p i th element of the field
+starpu_task::handles or the \p i th element of the field
+starpu_task::dyn_handles (see \ref
+SettingTheDataHandlesForATask).
+
+\def STARPU_CODELET_GET_MODE(struct starpu_codelet *codelet, int i)
+\ingroup API_Codelet_And_Tasks
+Return the access mode of the \p i th data handle of the given
+codelet. If the codelet is defined with a static or dynamic number of
+handles, will either return the \p i th element of the field
+starpu_codelet::modes or the \p i th element of the field
+starpu_codelet::dyn_modes (see \ref
+SettingTheDataHandlesForATask).
+
+\def STARPU_CODELET_SET_MODE(struct starpu_codelet *codelet, enum starpu_data_access_mode mode, int i)
+\ingroup API_Codelet_And_Tasks
+Set the access mode of the \p i th data handle of the given
+codelet. If the codelet is defined with a static or dynamic number of
+handles, will either set the \p i th element of the field
+starpu_codelet::modes or the \p i th element of the field
+starpu_codelet::dyn_modes (see \ref
+SettingTheDataHandlesForATask).
+
+\fn struct starpu_task * starpu_task_create(void)
+\ingroup API_Codelet_And_Tasks
+Allocate a task structure and initialize it with default
+values. Tasks allocated dynamically with starpu_task_create() are
+automatically freed when the task is terminated. This means that the
+task pointer cannot be used any more once the task is submitted,
+since it can be executed at any time (unless dependencies make it
+wait) and thus freed at any time. If the field starpu_task::destroy is
+explicitly unset, the resources used by the task have to be freed by
+calling starpu_task_destroy().
+
+\fn struct starpu_task * starpu_task_dup(struct starpu_task *task)
+\ingroup API_Codelet_And_Tasks
+Allocate a task structure which is the exact duplicate of the
+given task.
+
+\fn void starpu_task_clean(struct starpu_task *task)
+\ingroup API_Codelet_And_Tasks
+Release all the structures automatically allocated to execute \p task,
+but not the task structure itself: values set by the user remain
+unchanged. It is thus useful for statically allocated tasks, for
+instance. It is also useful when users want to execute the same
+operation several times with as little overhead as possible. It is
+called automatically by starpu_task_destroy(). It has to be called
+only after explicitly waiting for the task or after starpu_shutdown()
+(waiting for the callback is not enough, since StarPU still
+manipulates the task after calling the callback).
+
+\fn void starpu_task_destroy(struct starpu_task *task)
+\ingroup API_Codelet_And_Tasks
+Free the resources allocated during starpu_task_create() and
+associated with \p task. This function is already called automatically
+after the execution of a task when the field starpu_task::destroy is
+set, which is the default for tasks created by starpu_task_create().
+Calling this function on a statically allocated task results in an
+undefined behaviour.
+
+\fn int starpu_task_wait(struct starpu_task *task)
+\ingroup API_Codelet_And_Tasks
+This function blocks until \p task has been executed. It is not
+possible to synchronize with a task more than once. It is not possible
+to wait for synchronous or detached tasks. Upon successful completion,
+this function returns 0. Otherwise, <c>-EINVAL</c> indicates that the
+specified task was either synchronous or detached.
+
+\fn int starpu_task_submit(struct starpu_task *task)
+\ingroup API_Codelet_And_Tasks
+This function submits task to StarPU. Calling this function
+does not mean that the task will be executed immediately as there can
+be data or task (tag) dependencies that are not fulfilled yet: StarPU
+will take care of scheduling this task with respect to such
+dependencies. This function returns immediately if the field
+starpu_task::synchronous is set to 0, and blocks until the
+termination of the task otherwise. It is also possible to synchronize
+the application with asynchronous tasks by means of tags, using
+the function starpu_tag_wait() for instance. In case of
+success, this function returns 0; a return value of <c>-ENODEV</c>
+means that there is no worker able to process this task (e.g. there is
+no GPU available and this task is only implemented for CUDA devices).
+starpu_task_submit() can be called from anywhere, including codelet
+functions and callbacks, provided that the field
+starpu_task::synchronous is set to 0.
+
+\fn int starpu_task_wait_for_all(void)
+\ingroup API_Codelet_And_Tasks
+This function blocks until all the tasks that were submitted
+(to the current context, or to the global one if there is none) are
+terminated. It does not destroy these tasks.
+
+\fn int starpu_task_wait_for_all_in_ctx(unsigned sched_ctx_id)
+\ingroup API_Codelet_And_Tasks
+This function waits until all the tasks that were already
+submitted to the context \p sched_ctx_id have been executed.
+
+\fn int starpu_task_nsubmitted(void)
+\ingroup API_Codelet_And_Tasks
+Return the number of submitted tasks which have not completed yet.
+
+\fn int starpu_task_nready(void)
+\ingroup API_Codelet_And_Tasks
+Return the number of submitted tasks which are ready for
+execution or are already executing. It thus does not include tasks
+waiting for dependencies.
+
+\fn struct starpu_task * starpu_task_get_current(void)
+\ingroup API_Codelet_And_Tasks
+This function returns the task currently executed by the
+worker, or <c>NULL</c> if it is called from a thread that is not a
+task, or if there is no task being executed at the moment.
+
+\fn void starpu_codelet_display_stats(struct starpu_codelet *cl)
+\ingroup API_Codelet_And_Tasks
+Output on stderr some statistics on the codelet \p cl.
+
+\fn int starpu_task_wait_for_no_ready(void)
+\ingroup API_Codelet_And_Tasks
+This function waits until there are no more ready tasks.
+
+\fn void starpu_task_set_implementation(struct starpu_task *task, unsigned impl)
+\ingroup API_Codelet_And_Tasks
+This function should be called by schedulers to specify the
+codelet implementation to be executed when executing the task.
+
+\fn unsigned starpu_task_get_implementation(struct starpu_task *task)
+\ingroup API_Codelet_And_Tasks
+This function returns the codelet implementation to be executed
+when executing the task.
+
+\fn void starpu_create_sync_task(starpu_tag_t sync_tag, unsigned ndeps, starpu_tag_t *deps,	void (*callback)(void *), void *callback_arg)
+\ingroup API_Codelet_And_Tasks
+This creates (and submits) an empty task that unlocks a tag once all
+its dependencies are fulfilled.
+
+
+*/

+ 80 - 0
doc/doxygen/chapters/api/cuda_extensions.doxy

@@ -0,0 +1,80 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Université de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_CUDA_Extensions CUDA Extensions
+
+\def STARPU_USE_CUDA
+\ingroup API_CUDA_Extensions
+This macro is defined when StarPU has been installed with CUDA
+support. It should be used in your code to detect the availability of
+CUDA as shown in \ref FullSourceCodeVectorScal.
+
+\fn cudaStream_t starpu_cuda_get_local_stream(void)
+\ingroup API_CUDA_Extensions
+This function gets the current worker’s CUDA stream. StarPU
+provides a stream for every CUDA device controlled by StarPU. This
+function is only provided for convenience so that programmers can
+easily use asynchronous operations within codelets without having to
+create a stream by hand. Note that the application is not forced to
+use the stream provided by starpu_cuda_get_local_stream() and may also
+create its own streams. Synchronizing with cudaThreadSynchronize() is
+allowed, but will reduce the likelihood of having all transfers
+overlapped.
+
+\fn const struct cudaDeviceProp * starpu_cuda_get_device_properties(unsigned workerid)
+\ingroup API_CUDA_Extensions
+This function returns a pointer to device properties for worker
+\p workerid (assumed to be a CUDA worker).
+
+\fn void starpu_cuda_report_error(const char *func, const char *file, int line, cudaError_t status)
+\ingroup API_CUDA_Extensions
+Report a CUDA error.
+
+\def STARPU_CUDA_REPORT_ERROR (cudaError_t status)
+\ingroup API_CUDA_Extensions
+Calls starpu_cuda_report_error(), passing the current function, file and line position.
+
+\fn int starpu_cuda_copy_async_sync (void *src_ptr, unsigned src_node, void *dst_ptr, unsigned dst_node, size_t ssize, cudaStream_t stream, enum cudaMemcpyKind kind)
+\ingroup API_CUDA_Extensions
+Copy \p ssize bytes from the pointer \p src_ptr on \p src_node
+to the pointer \p dst_ptr on \p dst_node. The function first tries to
+copy the data asynchronously (unless \p stream is <c>NULL</c>). If the
+asynchronous copy fails or if \p stream is <c>NULL</c>, it copies the
+data synchronously. The function returns <c>-EAGAIN</c> if the
+asynchronous launch was successful. It returns 0 if the synchronous
+copy was successful, or fails otherwise.
+
+\fn void starpu_cuda_set_device(unsigned devid)
+\ingroup API_CUDA_Extensions
+Calls cudaSetDevice(devid) or cudaGLSetGLDevice(devid),
+depending on whether \p devid appears in the field
+starpu_conf::cuda_opengl_interoperability.
+
+\fn void starpu_cublas_init(void)
+\ingroup API_CUDA_Extensions
+This function initializes CUBLAS on every CUDA device. The
+CUBLAS library must be initialized prior to any CUBLAS call. Calling
+starpu_cublas_init() will initialize CUBLAS on every CUDA device
+controlled by StarPU. This call blocks until CUBLAS has been properly
+initialized on every device.
+
+\fn void starpu_cublas_shutdown(void)
+\ingroup API_CUDA_Extensions
+This function synchronously deinitializes the CUBLAS library on
+every CUDA device.
+
+\fn void starpu_cublas_report_error(const char *func, const char *file, int line, cublasStatus status)
+\ingroup API_CUDA_Extensions
+Report a CUBLAS error.
+
+\def STARPU_CUBLAS_REPORT_ERROR (cublasStatus status)
+\ingroup API_CUDA_Extensions
+Calls starpu_cublas_report_error(), passing the current
+function, file and line position.
+
+*/

+ 988 - 0
doc/doxygen/chapters/api/data_interfaces.doxy

@@ -0,0 +1,988 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Université de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Data_Interfaces Data Interfaces
+
+\struct starpu_data_interface_ops
+Per-interface data transfer methods.
+\ingroup API_Data_Interfaces
+\var starpu_data_interface_ops::register_data_handle
+Register an existing interface into a data handle.
+\var starpu_data_interface_ops::allocate_data_on_node
+Allocate data for the interface on a given node.
+\var starpu_data_interface_ops::free_data_on_node
+Free data of the interface on a given node.
+\var starpu_data_interface_ops::copy_methods
+ram/cuda/opencl synchronous and asynchronous transfer methods.
+\var starpu_data_interface_ops::handle_to_pointer
+Return the current pointer (if any) for the handle on the given node.
+\var starpu_data_interface_ops::get_size
+Return an estimation of the size of data, for performance models.
+\var starpu_data_interface_ops::footprint
+Return a 32-bit footprint which characterizes the data size.
+\var starpu_data_interface_ops::compare
+Compare the data size of two interfaces.
+\var starpu_data_interface_ops::display
+Dump the sizes of a handle to a file.
+\var starpu_data_interface_ops::interfaceid
+An identifier that is unique to each interface.
+\var starpu_data_interface_ops::interface_size
+The size of the interface data descriptor.
+\var starpu_data_interface_ops::is_multiformat
+todo
+\var starpu_data_interface_ops::get_mf_ops
+todo
+\var starpu_data_interface_ops::pack_data
+Pack the data handle into a contiguous buffer at the address \p ptr and
+set the size of the newly created buffer in \p count. If \p ptr is
+NULL, the function should not copy the data into the buffer but just
+set \p count to the size of the buffer which would have been
+allocated. The special value -1 indicates the size is yet unknown.
+\var starpu_data_interface_ops::unpack_data
+Unpack the data handle from the contiguous buffer at the address \p ptr
+of size \p count.
+
+\struct starpu_data_copy_methods
+Defines the per-interface copy methods. If the any_to_any method is
+provided, it will be used by default when no more specific method is
+available. It can still be useful to provide more specific methods in
+case particular CUDA or OpenCL support is available, for instance.
+\ingroup API_Data_Interfaces
+\var starpu_data_copy_methods::ram_to_ram
+Define how to copy data from the \p src_interface interface on the \p
+src_node CPU node to the \p dst_interface interface on the \p dst_node
+CPU node. Return 0 on success.
+\var starpu_data_copy_methods::ram_to_cuda
+Define how to copy data from the \p src_interface interface on the
+\p src_node CPU node to the \p dst_interface interface on the \p dst_node CUDA
+node. Return 0 on success.
+\var starpu_data_copy_methods::ram_to_opencl
+Define how to copy data from the \p src_interface interface on the
+\p src_node CPU node to the \p dst_interface interface on the \p dst_node
+OpenCL node. Return 0 on success.
+
+\var starpu_data_copy_methods::ram_to_mic
+Define how to copy data from the \p src_interface interface on the
+\p src_node CPU node to the \p dst_interface interface on the \p dst_node MIC
+node. Return 0 on success.
+
+\var starpu_data_copy_methods::cuda_to_ram
+Define how to copy data from the \p src_interface interface on the
+\p src_node CUDA node to the \p dst_interface interface on the \p dst_node
+CPU node. Return 0 on success.
+\var starpu_data_copy_methods::cuda_to_cuda
+Define how to copy data from the \p src_interface interface on the
+\p src_node CUDA node to the \p dst_interface interface on the \p dst_node CUDA
+node. Return 0 on success.
+\var starpu_data_copy_methods::cuda_to_opencl
+Define how to copy data from the \p src_interface interface on the
+\p src_node CUDA node to the \p dst_interface interface on the \p dst_node
+OpenCL node. Return 0 on success.
+\var starpu_data_copy_methods::opencl_to_ram
+Define how to copy data from the \p src_interface interface on the
+\p src_node OpenCL node to the \p dst_interface interface on the \p dst_node
+CPU node. Return 0 on success.
+\var starpu_data_copy_methods::opencl_to_cuda
+Define how to copy data from the \p src_interface interface on the
+\p src_node OpenCL node to the \p dst_interface interface on the \p dst_node
+CUDA node. Return 0 on success.
+\var starpu_data_copy_methods::opencl_to_opencl
+Define how to copy data from the \p src_interface interface on the
+\p src_node OpenCL node to the \p dst_interface interface on the \p dst_node
+OpenCL node. Return 0 on success.
+
+\var starpu_data_copy_methods::mic_to_ram
+Define how to copy data from the \p src_interface interface on the
+\p src_node MIC node to the \p dst_interface interface on the \p dst_node CPU
+node. Return 0 on success.
+
+\var starpu_data_copy_methods::scc_src_to_sink
+Define how to copy data from the \p src_interface interface on the
+\p src_node node to the \p dst_interface interface on the \p dst_node node.
+Must return 0 if the transfer was actually completed completely
+synchronously, or -EAGAIN if at least some transfers are still ongoing
+and should be awaited for by the core.
+\var starpu_data_copy_methods::scc_sink_to_src
+Define how to copy data from the \p src_interface interface on the
+\p src_node node to the \p dst_interface interface on the \p dst_node node.
+Must return 0 if the transfer was actually completed completely
+synchronously, or -EAGAIN if at least some transfers are still ongoing
+and should be awaited for by the core.
+\var starpu_data_copy_methods::scc_sink_to_sink
+Define how to copy data from the \p src_interface interface on the
+\p src_node node to the \p dst_interface interface on the \p dst_node node.
+Must return 0 if the transfer was actually completed completely
+synchronously, or -EAGAIN if at least some transfers are still ongoing
+and should be awaited for by the core.
+
+\var starpu_data_copy_methods::ram_to_cuda_async
+Define how to copy data from the \p src_interface interface on the
+\p src_node CPU node to the \p dst_interface interface on the \p dst_node CUDA
+node, using the given stream. Must return 0 if the transfer was
+actually completed completely synchronously, or -EAGAIN if at least
+some transfers are still ongoing and should be awaited for by the core.
+\var starpu_data_copy_methods::cuda_to_ram_async
+Define how to copy data from the \p src_interface interface on the
+\p src_node CUDA node to the \p dst_interface interface on the \p dst_node CPU
+node, using the given stream. Must return 0 if the transfer was
+actually completed completely synchronously, or -EAGAIN if at least
+some transfers are still ongoing and should be awaited for by the core.
+\var starpu_data_copy_methods::cuda_to_cuda_async
+Define how to copy data from the \p src_interface interface on the
+\p src_node CUDA node to the \p dst_interface interface on the \p dst_node CUDA
+node, using the given stream. Must return 0 if the transfer was
+actually completed completely synchronously, or -EAGAIN if at least
+some transfers are still ongoing and should be awaited for by the core.
+
+\var starpu_data_copy_methods::ram_to_opencl_async
+Define how to copy data from the \p src_interface interface on the
+\p src_node CPU node to the \p dst_interface interface on the \p dst_node
+OpenCL node, by recording in event, a pointer to a cl_event, the event
+of the last submitted transfer. Must return 0 if the transfer was
+actually completed completely synchronously, or -EAGAIN if at least
+some transfers are still ongoing and should be awaited for by the
+core.
+\var starpu_data_copy_methods::opencl_to_ram_async
+Define how to copy data from the \p src_interface interface on the
+\p src_node OpenCL node to the \p dst_interface interface on the \p dst_node
+CPU node, by recording in event, a pointer to a cl_event, the event of
+the last submitted transfer. Must return 0 if the transfer was
+actually completed completely synchronously, or -EAGAIN if at least
+some transfers are still ongoing and should be awaited for by the
+core.
+\var starpu_data_copy_methods::opencl_to_opencl_async
+Define how to copy data from the \p src_interface interface on the
+\p src_node OpenCL node to the \p dst_interface interface on the \p dst_node
+OpenCL node, by recording in event, a pointer to a cl_event, the event
+of the last submitted transfer. Must return 0 if the transfer was
+actually completed completely synchronously, or -EAGAIN if at least
+some transfers are still ongoing and should be awaited for by the
+core.
+
+\var starpu_data_copy_methods::ram_to_mic_async
+Define how to copy data from the \p src_interface interface on the
+\p src_node CPU node to the \p dst_interface interface on the \p dst_node
+MIC node. Must return 0 if the transfer was actually completed
+completely synchronously, or -EAGAIN if at least some transfers are
+still ongoing and should be awaited for by the core.
+\var starpu_data_copy_methods::mic_to_ram_async
+Define how to copy data from the \p src_interface interface on the
+\p src_node MIC node to the \p dst_interface interface on the \p dst_node
+CPU node. Must return 0 if the transfer was actually completed
+completely synchronously, or -EAGAIN if at least some transfers are
+still ongoing and should be awaited for by the core.
+
+\var starpu_data_copy_methods::any_to_any
+Define how to copy data from the \p src_interface interface on the
+\p src_node node to the \p dst_interface interface on the \p dst_node node.
+This is meant to be implemented through the starpu_interface_copy()
+helper, to which async_data should be passed as such, and will be used
+to manage asynchronicity. This must return -EAGAIN if any of the
+starpu_interface_copy() calls has returned -EAGAIN (i.e. at least some
+transfer is still ongoing), and return 0 otherwise.
+
+@name Registering Data
+\ingroup API_Data_Interfaces
+
+There are several ways to register a memory region so that it can be
+managed by StarPU. The functions below allow the registration of
+vectors, 2D matrices, 3D matrices as well as BCSR and CSR sparse
+matrices.
+
+\fn void starpu_void_data_register(starpu_data_handle_t *handle)
+\ingroup API_Data_Interfaces
+Register a void interface. There is no data really associated
+with that interface, but it may be used as a synchronization mechanism.
+It also permits to express an abstract piece of data that is managed
+by the application internally: this makes it possible to forbid the
+concurrent execution of different tasks accessing the same <c>void</c>
+data in read-write mode.
+
+\fn void starpu_variable_data_register(starpu_data_handle_t *handle, unsigned home_node, uintptr_t ptr, size_t size)
+\ingroup API_Data_Interfaces
+Register the \p size byte element pointed to by \p ptr, which is
+typically a scalar, and initialize \p handle to represent this data item.
+
+Here is an example of how to use the function.
+\code{.c}
+float var;
+starpu_data_handle_t var_handle;
+starpu_variable_data_register(&var_handle, 0, (uintptr_t)&var, sizeof(var));
+\endcode
+
+\fn void starpu_vector_data_register(starpu_data_handle_t *handle, unsigned home_node, uintptr_t ptr, uint32_t nx, size_t elemsize)
+\ingroup API_Data_Interfaces
+Register the \p nx elemsize-byte elements pointed to by \p ptr and initialize \p handle to represent it.
+
+Here is an example of how to use the function.
+\code{.c}
+float vector[NX];
+starpu_data_handle_t vector_handle;
+starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector, NX, sizeof(vector[0]));
+\endcode
+
+\fn void starpu_matrix_data_register(starpu_data_handle_t *handle, unsigned home_node, uintptr_t ptr, uint32_t ld, uint32_t nx, uint32_t ny, size_t elemsize)
+\ingroup API_Data_Interfaces
+Register the \p nx x \p ny 2D matrix of \p elemsize-byte elements
+pointed to by \p ptr and initialize \p handle to represent it. \p ld
+specifies the number of elements between rows. A value greater than
+\p nx adds padding, which can be useful for alignment purposes.
+
+Here is an example of how to use the function.
+\code{.c}
+float *matrix;
+starpu_data_handle_t matrix_handle;
+matrix = (float*)malloc(width * height * sizeof(float));
+starpu_matrix_data_register(&matrix_handle, 0, (uintptr_t)matrix, width, width, height, sizeof(float));
+\endcode
+
+\fn void starpu_block_data_register(starpu_data_handle_t *handle, unsigned home_node, uintptr_t ptr, uint32_t ldy, uint32_t ldz, uint32_t nx, uint32_t ny, uint32_t nz, size_t elemsize)
+\ingroup API_Data_Interfaces
+Register the \p nx x \p ny x \p nz 3D matrix of \p elemsize-byte
+elements pointed to by \p ptr and initialize \p handle to represent it.
+\p ldy and \p ldz specify the number of elements between rows and
+between z planes, respectively.
+
+Here is an example of how to use the function.
+\code{.c}
+float *block;
+starpu_data_handle_t block_handle;
+block = (float*)malloc(nx*ny*nz*sizeof(float));
+starpu_block_data_register(&block_handle, 0, (uintptr_t)block, nx, nx*ny, nx, ny, nz, sizeof(float));
+\endcode
+
+\fn void starpu_bcsr_data_register(starpu_data_handle_t *handle, unsigned home_node, uint32_t nnz, uint32_t nrow, uintptr_t nzval, uint32_t *colind, uint32_t *rowptr, uint32_t firstentry, uint32_t r, uint32_t c, size_t elemsize)
+\ingroup API_Data_Interfaces
+This variant of starpu_data_register() uses the BCSR (Blocked
+Compressed Sparse Row Representation) sparse matrix interface.
+Register the sparse matrix made of \p nnz non-zero blocks of elements of
+size \p elemsize stored in \p nzval and initialize \p handle to represent it.
+Blocks have size \p r * \p c. \p nrow is the number of rows (in terms of
+blocks), \p colind[i] is the block-column index for block i in \p nzval, and
+\p rowptr[i] is the block-index (in \p nzval) of the first block of row i.
+\p firstentry is the index of the first entry of the given arrays
+(usually 0 or 1).
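+
+As a hedged illustration (the matrix values and layout below are made
+up for the example), a 4x4 matrix made of three non-zero 2x2 blocks
+could be registered as follows:
+\code{.c}
+/* Matrix, in 2x2 blocks:
+ * | 0  1 |  0  0 |
+ * | 2  3 |  0  0 |
+ * |------+-------|
+ * | 4  5 |  8  9 |
+ * | 6  7 | 10 11 |
+ */
+int nzval[] = {0, 1, 2, 3,     /* block at block-row 0, block-column 0 */
+               4, 5, 6, 7,     /* block at block-row 1, block-column 0 */
+               8, 9, 10, 11};  /* block at block-row 1, block-column 1 */
+uint32_t colind[] = {0, 0, 1}; /* block-column index of each block */
+uint32_t rowptr[] = {0, 1, 3}; /* index in nzval of the first block of each block-row */
+starpu_data_handle_t bcsr_handle;
+starpu_bcsr_data_register(&bcsr_handle, 0, 3, 2, (uintptr_t)nzval,
+                          colind, rowptr, 0, 2, 2, sizeof(nzval[0]));
+\endcode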
+
+\fn void starpu_csr_data_register(starpu_data_handle_t *handle, unsigned home_node, uint32_t nnz, uint32_t nrow, uintptr_t nzval, uint32_t *colind, uint32_t *rowptr, uint32_t firstentry, size_t elemsize)
+\ingroup API_Data_Interfaces
+This variant of starpu_data_register() uses the CSR (Compressed
+Sparse Row Representation) sparse matrix interface. TODO
+
+\fn void starpu_coo_data_register(starpu_data_handle_t *handleptr, unsigned home_node, uint32_t nx, uint32_t ny, uint32_t n_values, uint32_t *columns, uint32_t *rows, uintptr_t values, size_t elemsize);
+\ingroup API_Data_Interfaces
+Register the \p nx x \p ny 2D matrix given in the COO format, using the
+\p columns, \p rows, \p values arrays, which must have \p n_values elements of
+size \p elemsize. Initialize \p handleptr.
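+
+Here is a hedged sketch of how the function might be used (the matrix
+content is made up for the example):
+\code{.c}
+/* 3x3 matrix with two non-zero entries. */
+uint32_t columns[] = {0, 2};
+uint32_t rows[]    = {0, 1};
+float values[]     = {42.f, 12.f};
+starpu_data_handle_t coo_handle;
+starpu_coo_data_register(&coo_handle, 0, 3, 3, 2,
+                         columns, rows, (uintptr_t)values, sizeof(float));
+\endcode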
+
+\fn void *starpu_data_get_interface_on_node(starpu_data_handle_t handle, unsigned memory_node)
+\ingroup API_Data_Interfaces
+Return the interface associated with \p handle on \p memory_node.
+
+@name Accessing Data Interfaces
+\ingroup API_Data_Interfaces
+
+Each data interface is provided with a set of field access functions.
+The ones taking a <c>void *</c> parameter are meant to be used in codelet
+implementations (see for example the code in \ref
+VectorScalingUsingStarPUAPI).
+
+\fn void *starpu_data_handle_to_pointer(starpu_data_handle_t handle, unsigned node)
+\ingroup API_Data_Interfaces
+Return the pointer associated with \p handle on node \p node or <c>NULL</c>
+if handle’s interface does not support this operation or data for this
+\p handle is not allocated on that \p node.
+
+\fn void *starpu_data_get_local_ptr(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the local pointer associated with \p handle or <c>NULL</c> if
+\p handle’s interface does not have data allocated locally.
+
+\fn enum starpu_data_interface_id starpu_data_get_interface_id(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the unique identifier of the interface associated with
+the given \p handle.
+
+\fn size_t starpu_data_get_size(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the size of the data associated with \p handle.
+
+\fn int starpu_data_pack(starpu_data_handle_t handle, void **ptr, starpu_ssize_t *count)
+\ingroup API_Data_Interfaces
+Execute the packing operation of the interface of the data
+registered at \p handle (see starpu_data_interface_ops). This
+packing operation must allocate a buffer large enough at \p ptr and copy
+into the newly allocated buffer the data associated to \p handle. \p count
+will be set to the size of the allocated buffer. If \p ptr is NULL, the
+function should not copy the data into the buffer but just set \p count to
+the size of the buffer which would have been allocated. The special
+value -1 indicates that the size is not yet known.
+
+\fn int starpu_data_unpack(starpu_data_handle_t handle, void *ptr, size_t count)
+\ingroup API_Data_Interfaces
+Unpack in \p handle the data located at \p ptr of size \p count as
+described by the interface of the data. The interface registered at
+\p handle must define an unpacking operation (see
+starpu_data_interface_ops). The memory at the address \p ptr is freed
+after calling the data unpacking operation.
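+
+A hedged sketch of pairing the two functions to serialize the content
+of one handle and restore it into another (\c src_handle and
+\c dst_handle are assumed to be already registered with the same
+interface):
+\code{.c}
+void *buffer;
+starpu_ssize_t count = -1;
+starpu_data_pack(src_handle, &buffer, &count); /* allocates and fills buffer */
+/* ... transmit buffer and count ... */
+starpu_data_unpack(dst_handle, buffer, count); /* also frees buffer */
+\endcode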
+
+@name Accessing Variable Data Interfaces
+\ingroup API_Data_Interfaces
+
+\struct starpu_variable_interface
+Variable interface for a single piece of data (not a vector, a matrix, a list, ...)
+\ingroup API_Data_Interfaces
+\var starpu_variable_interface::id
+Identifier of the interface
+\var starpu_variable_interface::ptr
+local pointer of the variable
+\var starpu_variable_interface::dev_handle
+device handle of the variable.
+\var starpu_variable_interface::offset
+offset in the variable
+\var starpu_variable_interface::elemsize
+size of the variable
+
+\fn size_t starpu_variable_get_elemsize(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the size of the variable designated by \p handle.
+
+\fn uintptr_t starpu_variable_get_local_ptr(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return a pointer to the variable designated by \p handle.
+
+\def STARPU_VARIABLE_GET_PTR(interface)
+\ingroup API_Data_Interfaces
+Return a pointer to the variable designated by \p interface.
+
+\def STARPU_VARIABLE_GET_ELEMSIZE(interface)
+\ingroup API_Data_Interfaces
+Return the size of the variable designated by \p interface.
+
+\def STARPU_VARIABLE_GET_DEV_HANDLE(interface)
+\ingroup API_Data_Interfaces
+Return a device handle for the variable designated by
+\p interface, to be used on OpenCL. The offset documented below has to be
+used in addition to this.
+
+\def STARPU_VARIABLE_GET_OFFSET(interface)
+\ingroup API_Data_Interfaces
+Return the offset in the variable designated by \p interface, to
+be used with the device handle.
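+
+As an illustrative sketch, a CPU codelet implementation could use this
+macro family as follows (the codelet function itself is an assumption,
+not part of the API described here):
+\code{.c}
+void increment_cpu_func(void *buffers[], void *cl_arg)
+{
+    /* buffers[0] is the variable interface of the first data handle. */
+    unsigned *var = (unsigned *)STARPU_VARIABLE_GET_PTR(buffers[0]);
+    (*var)++;
+}
+\endcode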
+
+@name Accessing Vector Data Interfaces
+\ingroup API_Data_Interfaces
+
+\struct starpu_vector_interface
+Vector interface
+\ingroup API_Data_Interfaces
+\var starpu_vector_interface::id
+Identifier of the interface
+\var starpu_vector_interface::ptr
+local pointer of the vector
+\var starpu_vector_interface::dev_handle
+device handle of the vector.
+\var starpu_vector_interface::offset
+offset in the vector
+\var starpu_vector_interface::nx
+number of elements on the x-axis of the vector
+\var starpu_vector_interface::elemsize
+size of the elements of the vector
+
+\fn uint32_t starpu_vector_get_nx(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the number of elements registered into the array designated by \p handle.
+
+\fn size_t starpu_vector_get_elemsize(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the size of each element of the array designated by \p handle.
+
+\fn uintptr_t starpu_vector_get_local_ptr(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the local pointer associated with \p handle.
+
+\def STARPU_VECTOR_GET_PTR(void *interface)
+\ingroup API_Data_Interfaces
+Return a pointer to the array designated by \p interface, valid on
+CPUs and CUDA only. For OpenCL, the device handle and offset need to
+be used instead.
+
+\def STARPU_VECTOR_GET_DEV_HANDLE(void *interface)
+\ingroup API_Data_Interfaces
+Return a device handle for the array designated by \p interface,
+to be used on OpenCL. The offset documented below has to be used in
+addition to this.
+
+\def STARPU_VECTOR_GET_OFFSET(void *interface)
+\ingroup API_Data_Interfaces
+Return the offset in the array designated by \p interface, to be
+used with the device handle.
+
+\def STARPU_VECTOR_GET_NX(void *interface)
+\ingroup API_Data_Interfaces
+Return the number of elements registered into the array
+designated by \p interface.
+
+\def STARPU_VECTOR_GET_ELEMSIZE(void *interface)
+\ingroup API_Data_Interfaces
+Return the size of each element of the array designated by
+\p interface.
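+
+For instance, a CPU implementation of a vector scaling codelet could
+combine these macros as follows (a hedged sketch; the function name is
+made up):
+\code{.c}
+void scal_cpu_func(void *buffers[], void *cl_arg)
+{
+    struct starpu_vector_interface *vector = buffers[0];
+    unsigned n = STARPU_VECTOR_GET_NX(vector);
+    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
+    float factor = *(float *)cl_arg;
+    for (unsigned i = 0; i < n; i++)
+        val[i] *= factor;
+}
+\endcode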
+
+@name Accessing Matrix Data Interfaces
+\ingroup API_Data_Interfaces
+
+\struct starpu_matrix_interface
+Matrix interface for dense matrices
+\ingroup API_Data_Interfaces
+\var starpu_matrix_interface::id
+Identifier of the interface
+\var starpu_matrix_interface::ptr
+local pointer of the matrix
+\var starpu_matrix_interface::dev_handle
+device handle of the matrix.
+\var starpu_matrix_interface::offset
+offset in the matrix
+\var starpu_matrix_interface::nx
+number of elements on the x-axis of the matrix
+\var starpu_matrix_interface::ny
+number of elements on the y-axis of the matrix
+\var starpu_matrix_interface::ld
+number of elements between each row of the matrix. May be equal to
+starpu_matrix_interface::nx when there is no padding.
+\var starpu_matrix_interface::elemsize
+size of the elements of the matrix
+
+\fn uint32_t starpu_matrix_get_nx(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the number of elements on the x-axis of the matrix
+designated by \p handle.
+
+\fn uint32_t starpu_matrix_get_ny(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the number of elements on the y-axis of the matrix
+designated by \p handle.
+
+\fn uint32_t starpu_matrix_get_local_ld(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the number of elements between each row of the matrix
+designated by \p handle. May be equal to nx when there is no padding.
+
+\fn uintptr_t starpu_matrix_get_local_ptr(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the local pointer associated with \p handle.
+
+\fn size_t starpu_matrix_get_elemsize(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the size of the elements registered into the matrix
+designated by \p handle.
+
+\def STARPU_MATRIX_GET_PTR(void *interface)
+\ingroup API_Data_Interfaces
+Return a pointer to the matrix designated by \p interface, valid
+on CPUs and CUDA devices only. For OpenCL devices, the device handle
+and offset need to be used instead.
+
+\def STARPU_MATRIX_GET_DEV_HANDLE(void *interface)
+\ingroup API_Data_Interfaces
+Return a device handle for the matrix designated by \p interface,
+to be used on OpenCL. The offset documented below has to be used in
+addition to this.
+
+\def STARPU_MATRIX_GET_OFFSET(void *interface)
+\ingroup API_Data_Interfaces
+Return the offset in the matrix designated by \p interface, to be
+used with the device handle.
+
+\def STARPU_MATRIX_GET_NX(void *interface)
+\ingroup API_Data_Interfaces
+Return the number of elements on the x-axis of the matrix
+designated by \p interface.
+
+\def STARPU_MATRIX_GET_NY(void *interface)
+\ingroup API_Data_Interfaces
+Return the number of elements on the y-axis of the matrix
+designated by \p interface.
+
+\def STARPU_MATRIX_GET_LD(void *interface)
+\ingroup API_Data_Interfaces
+Return the number of elements between each row of the matrix
+designated by \p interface. May be equal to nx when there is no padding.
+
+\def STARPU_MATRIX_GET_ELEMSIZE(void *interface)
+\ingroup API_Data_Interfaces
+Return the size of the elements registered into the matrix
+designated by \p interface.
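+
+Note that a codelet iterating over a matrix should step between rows
+using the leading dimension, not nx. A hedged sketch (the function name
+is made up):
+\code{.c}
+void matrix_cpu_func(void *buffers[], void *cl_arg)
+{
+    struct starpu_matrix_interface *matrix = buffers[0];
+    unsigned nx = STARPU_MATRIX_GET_NX(matrix);
+    unsigned ny = STARPU_MATRIX_GET_NY(matrix);
+    unsigned ld = STARPU_MATRIX_GET_LD(matrix);
+    float *val = (float *)STARPU_MATRIX_GET_PTR(matrix);
+    for (unsigned j = 0; j < ny; j++)
+        for (unsigned i = 0; i < nx; i++)
+            val[j*ld + i] *= 2.f; /* element i of row j */
+}
+\endcode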
+
+@name Accessing Block Data Interfaces
+\ingroup API_Data_Interfaces
+
+\struct starpu_block_interface
+Block interface for 3D dense blocks
+\ingroup API_Data_Interfaces
+\var starpu_block_interface::id
+Identifier of the interface
+\var starpu_block_interface::ptr
+local pointer of the block
+\var starpu_block_interface::dev_handle
+device handle of the block.
+\var starpu_block_interface::offset
+offset in the block.
+\var starpu_block_interface::nx
+number of elements on the x-axis of the block.
+\var starpu_block_interface::ny
+number of elements on the y-axis of the block.
+\var starpu_block_interface::nz
+number of elements on the z-axis of the block.
+\var starpu_block_interface::ldy
+number of elements between two lines
+\var starpu_block_interface::ldz
+number of elements between two planes
+\var starpu_block_interface::elemsize
+size of the elements of the block.
+
+\fn uint32_t starpu_block_get_nx(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the number of elements on the x-axis of the block
+designated by \p handle.
+
+\fn uint32_t starpu_block_get_ny(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the number of elements on the y-axis of the block
+designated by \p handle.
+
+\fn uint32_t starpu_block_get_nz(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the number of elements on the z-axis of the block
+designated by \p handle.
+
+\fn uint32_t starpu_block_get_local_ldy(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the number of elements between each row of the block
+designated by \p handle, in the format of the current memory node.
+
+\fn uint32_t starpu_block_get_local_ldz(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the number of elements between each z plane of the block
+designated by \p handle, in the format of the current memory node.
+
+\fn uintptr_t starpu_block_get_local_ptr(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the local pointer associated with \p handle.
+
+\fn size_t starpu_block_get_elemsize(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the size of the elements of the block designated by
+\p handle.
+
+\def STARPU_BLOCK_GET_PTR(void *interface)
+\ingroup API_Data_Interfaces
+Return a pointer to the block designated by \p interface.
+
+\def STARPU_BLOCK_GET_DEV_HANDLE(void *interface)
+\ingroup API_Data_Interfaces
+Return a device handle for the block designated by \p interface,
+to be used on OpenCL. The offset documented below has to be used in
+addition to this.
+
+\def STARPU_BLOCK_GET_OFFSET(void *interface)
+\ingroup API_Data_Interfaces
+Return the offset in the block designated by \p interface, to be
+used with the device handle.
+
+\def STARPU_BLOCK_GET_NX(void *interface)
+\ingroup API_Data_Interfaces
+Return the number of elements on the x-axis of the block
+designated by \p interface.
+
+\def STARPU_BLOCK_GET_NY(void *interface)
+\ingroup API_Data_Interfaces
+Return the number of elements on the y-axis of the block
+designated by \p interface.
+
+\def STARPU_BLOCK_GET_NZ(void *interface)
+\ingroup API_Data_Interfaces
+Return the number of elements on the z-axis of the block
+designated by \p interface.
+
+\def STARPU_BLOCK_GET_LDY(void *interface)
+\ingroup API_Data_Interfaces
+Return the number of elements between each row of the block
+designated by \p interface. May be equal to nx when there is no padding.
+
+\def STARPU_BLOCK_GET_LDZ(void *interface)
+\ingroup API_Data_Interfaces
+Return the number of elements between each z plane of the block
+designated by \p interface. May be equal to nx*ny when there is no
+padding.
+
+\def STARPU_BLOCK_GET_ELEMSIZE(void *interface)
+\ingroup API_Data_Interfaces
+Return the size of the elements of the block designated by
+\p interface.
+
+@name Accessing BCSR Data Interfaces
+\ingroup API_Data_Interfaces
+
+\struct starpu_bcsr_interface
+BCSR interface for sparse matrices (blocked compressed sparse
+row representation)
+\ingroup API_Data_Interfaces
+\var starpu_bcsr_interface::id
+Identifier of the interface
+\var starpu_bcsr_interface::nnz
+number of non-zero BLOCKS
+\var starpu_bcsr_interface::nrow
+number of rows (in terms of BLOCKS)
+\var starpu_bcsr_interface::nzval
+non-zero values
+\var starpu_bcsr_interface::colind
+position of non-zero entries on the row
+\var starpu_bcsr_interface::rowptr
+index (in nzval) of the first entry of the row
+\var starpu_bcsr_interface::firstentry
+k for k-based indexing (0 or 1 usually). Also useful when partitioning the matrix.
+\var starpu_bcsr_interface::r
+number of rows in a block
+\var starpu_bcsr_interface::c
+number of columns in a block
+\var starpu_bcsr_interface::elemsize
+size of the elements of the matrix
+
+\fn uint32_t starpu_bcsr_get_nnz(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the number of non-zero elements in the matrix designated
+by \p handle.
+
+\fn uint32_t starpu_bcsr_get_nrow(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the number of rows (in terms of blocks of size r*c) in
+the matrix designated by \p handle.
+
+\fn uint32_t starpu_bcsr_get_firstentry(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the index at which all arrays (the column indexes, the
+row pointers...) of the matrix designated by \p handle start.
+
+\fn uintptr_t starpu_bcsr_get_local_nzval(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return a pointer to the non-zero values of the matrix
+designated by \p handle.
+
+\fn uint32_t * starpu_bcsr_get_local_colind(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return a pointer to the column index, which holds the positions
+of the non-zero entries in the matrix designated by \p handle.
+
+\fn uint32_t * starpu_bcsr_get_local_rowptr(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the row pointer array of the matrix designated by
+\p handle.
+
+\fn uint32_t starpu_bcsr_get_r(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the number of rows in a block.
+
+\fn uint32_t starpu_bcsr_get_c(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the number of columns in a block.
+
+\fn size_t starpu_bcsr_get_elemsize(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the size of the elements in the matrix designated by
+\p handle.
+
+\def STARPU_BCSR_GET_NNZ(void *interface)
+\ingroup API_Data_Interfaces
+Return the number of non-zero values in the matrix designated
+by \p interface.
+
+\def STARPU_BCSR_GET_NZVAL(void *interface)
+\ingroup API_Data_Interfaces
+Return a pointer to the non-zero values of the matrix
+designated by \p interface.
+
+\def STARPU_BCSR_GET_NZVAL_DEV_HANDLE(void *interface)
+\ingroup API_Data_Interfaces
+Return a device handle for the array of non-zero values in the
+matrix designated by \p interface. The offset documented below has to be
+used in addition to this.
+
+\def STARPU_BCSR_GET_COLIND(void *interface)
+\ingroup API_Data_Interfaces
+Return a pointer to the column index of the matrix designated
+by \p interface.
+
+\def STARPU_BCSR_GET_COLIND_DEV_HANDLE(void *interface)
+\ingroup API_Data_Interfaces
+Return a device handle for the column index of the matrix
+designated by \p interface. The offset documented below has to be used in
+addition to this.
+
+\def STARPU_BCSR_GET_ROWPTR(void *interface)
+\ingroup API_Data_Interfaces
+Return a pointer to the row pointer array of the matrix
+designated by \p interface.
+
+\def STARPU_CSR_GET_ROWPTR_DEV_HANDLE(void *interface)
+\ingroup API_Data_Interfaces
+Return a device handle for the row pointer array of the matrix
+designated by \p interface. The offset documented below has to be used in
+addition to this.
+
+\def STARPU_BCSR_GET_OFFSET(void *interface)
+\ingroup API_Data_Interfaces
+Return the offset in the arrays (colind, rowptr, nzval) of the
+matrix designated by \p interface, to be used with the device handles.
+
+@name Accessing CSR Data Interfaces
+\ingroup API_Data_Interfaces
+
+\struct starpu_csr_interface
+CSR interface for sparse matrices (compressed sparse row representation)
+\ingroup API_Data_Interfaces
+\var starpu_csr_interface::id
+Identifier of the interface
+\var starpu_csr_interface::nnz
+number of non-zero entries
+\var starpu_csr_interface::nrow
+number of rows
+\var starpu_csr_interface::nzval
+non-zero values
+\var starpu_csr_interface::colind
+position of non-zero entries on the row
+\var starpu_csr_interface::rowptr
+index (in nzval) of the first entry of the row
+\var starpu_csr_interface::firstentry
+k for k-based indexing (0 or 1 usually). Also useful when partitioning the matrix.
+\var starpu_csr_interface::elemsize
+size of the elements of the matrix
+
+\fn uint32_t starpu_csr_get_nnz(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the number of non-zero values in the matrix designated
+by \p handle.
+
+\fn uint32_t starpu_csr_get_nrow(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the size of the row pointer array of the matrix
+designated by \p handle.
+
+\fn uint32_t starpu_csr_get_firstentry(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the index at which all arrays (the column indexes, the
+row pointers...) of the matrix designated by \p handle start.
+
+\fn uintptr_t starpu_csr_get_local_nzval(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return a local pointer to the non-zero values of the matrix
+designated by \p handle.
+
+\fn uint32_t * starpu_csr_get_local_colind(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return a local pointer to the column index of the matrix
+designated by \p handle.
+
+\fn uint32_t * starpu_csr_get_local_rowptr(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return a local pointer to the row pointer array of the matrix
+designated by \p handle.
+
+\fn size_t starpu_csr_get_elemsize(starpu_data_handle_t handle)
+\ingroup API_Data_Interfaces
+Return the size of the elements registered into the matrix
+designated by \p handle.
+
+\def STARPU_CSR_GET_NNZ(void *interface)
+\ingroup API_Data_Interfaces
+Return the number of non-zero values in the matrix designated
+by \p interface.
+
+\def STARPU_CSR_GET_NROW(void *interface)
+\ingroup API_Data_Interfaces
+Return the size of the row pointer array of the matrix
+designated by \p interface.
+
+\def STARPU_CSR_GET_NZVAL(void *interface)
+\ingroup API_Data_Interfaces
+Return a pointer to the non-zero values of the matrix
+designated by \p interface.
+
+\def STARPU_CSR_GET_NZVAL_DEV_HANDLE(void *interface)
+\ingroup API_Data_Interfaces
+Return a device handle for the array of non-zero values in the
+matrix designated by \p interface. The offset documented below has to be
+used in addition to this.
+
+\def STARPU_CSR_GET_COLIND(void *interface)
+\ingroup API_Data_Interfaces
+Return a pointer to the column index of the matrix designated
+by \p interface.
+
+\def STARPU_CSR_GET_COLIND_DEV_HANDLE(void *interface)
+\ingroup API_Data_Interfaces
+Return a device handle for the column index of the matrix
+designated by \p interface. The offset documented below has to be used in
+addition to this.
+
+\def STARPU_CSR_GET_ROWPTR(void *interface)
+\ingroup API_Data_Interfaces
+Return a pointer to the row pointer array of the matrix
+designated by \p interface.
+
+\def STARPU_CSR_GET_ROWPTR_DEV_HANDLE(void *interface)
+\ingroup API_Data_Interfaces
+Return a device handle for the row pointer array of the matrix
+designated by \p interface. The offset documented below has to be used in
+addition to this.
+
+\def STARPU_CSR_GET_OFFSET(void *interface)
+\ingroup API_Data_Interfaces
+Return the offset in the arrays (colind, rowptr, nzval) of the
+matrix designated by \p interface, to be used with the device handles.
+
+\def STARPU_CSR_GET_FIRSTENTRY(void *interface)
+\ingroup API_Data_Interfaces
+Return the index at which all arrays (the column indexes, the
+row pointers...) of the \p interface start.
+
+\def STARPU_CSR_GET_ELEMSIZE(void *interface)
+\ingroup API_Data_Interfaces
+Return the size of the elements registered into the matrix
+designated by \p interface.
+
+@name Accessing COO Data Interfaces
+\ingroup API_Data_Interfaces
+
+\struct starpu_coo_interface
+COO interface for sparse matrices (coordinate list representation)
+\ingroup API_Data_Interfaces
+\var starpu_coo_interface::id
+identifier of the interface
+\var starpu_coo_interface::columns
+column array of the matrix
+\var starpu_coo_interface::rows
+row array of the matrix
+\var starpu_coo_interface::values
+values of the matrix
+\var starpu_coo_interface::nx
+number of elements on the x-axis of the matrix
+\var starpu_coo_interface::ny
+number of elements on the y-axis of the matrix
+\var starpu_coo_interface::n_values
+number of values registered in the matrix
+\var starpu_coo_interface::elemsize
+size of the elements of the matrix
+
+\def STARPU_COO_GET_COLUMNS(void *interface)
+\ingroup API_Data_Interfaces
+Return a pointer to the column array of the matrix designated
+by \p interface.
+
+\def STARPU_COO_GET_COLUMNS_DEV_HANDLE(void *interface)
+\ingroup API_Data_Interfaces
+Return a device handle for the column array of the matrix
+designated by \p interface, to be used on OpenCL. The offset documented
+below has to be used in addition to this.
+
+\def STARPU_COO_GET_ROWS(interface)
+\ingroup API_Data_Interfaces
+Return a pointer to the rows array of the matrix designated by
+\p interface.
+
+\def STARPU_COO_GET_ROWS_DEV_HANDLE(void *interface)
+\ingroup API_Data_Interfaces
+Return a device handle for the row array of the matrix
+designated by \p interface, to be used on OpenCL. The offset documented
+below has to be used in addition to this.
+
+\def STARPU_COO_GET_VALUES(interface)
+\ingroup API_Data_Interfaces
+Return a pointer to the values array of the matrix designated
+by \p interface.
+
+\def STARPU_COO_GET_VALUES_DEV_HANDLE(void *interface)
+\ingroup API_Data_Interfaces
+Return a device handle for the value array of the matrix
+designated by \p interface, to be used on OpenCL. The offset documented
+below has to be used in addition to this.
+
+\def STARPU_COO_GET_OFFSET(void *interface)
+\ingroup API_Data_Interfaces
+Return the offset in the arrays of the COO matrix designated by
+\p interface.
+
+\def STARPU_COO_GET_NX(interface)
+\ingroup API_Data_Interfaces
+Return the number of elements on the x-axis of the matrix
+designated by \p interface.
+
+\def STARPU_COO_GET_NY(interface)
+\ingroup API_Data_Interfaces
+Return the number of elements on the y-axis of the matrix
+designated by \p interface.
+
+\def STARPU_COO_GET_NVALUES(interface)
+\ingroup API_Data_Interfaces
+Return the number of values registered in the matrix designated
+by \p interface.
+
+\def STARPU_COO_GET_ELEMSIZE(interface)
+\ingroup API_Data_Interfaces
+Return the size of the elements registered into the matrix
+designated by \p interface.
+
+@name Defining Interface
+\ingroup API_Data_Interfaces
+
+Applications can provide their own interface as shown in \ref
+DefiningANewDataInterface.
+
+\fn uintptr_t starpu_malloc_on_node(unsigned dst_node, size_t size)
+\ingroup API_Data_Interfaces
+Allocate \p size bytes on node \p dst_node. This returns 0 if
+the allocation failed, in which case the allocation method should
+return <c>-ENOMEM</c> as allocated size.
+
+\fn void starpu_free_on_node(unsigned dst_node, uintptr_t addr, size_t size)
+\ingroup API_Data_Interfaces
+Free \p addr of \p size bytes on node \p dst_node.
+
+\fn int starpu_interface_copy(uintptr_t src, size_t src_offset, unsigned src_node, uintptr_t dst, size_t dst_offset, unsigned dst_node, size_t size, void *async_data)
+\ingroup API_Data_Interfaces
+Copy \p size bytes from byte offset \p src_offset of \p src on \p src_node
+to byte offset \p dst_offset of \p dst on \p dst_node. This is to be used in
+the any_to_any() copy method, which is provided with the async_data to
+be passed to starpu_interface_copy(). This returns <c>-EAGAIN</c> if the
+transfer is still ongoing, or 0 if the transfer is already completed.
+
+\fn uint32_t starpu_hash_crc32c_be_n(const void *input, size_t n, uint32_t inputcrc)
+\ingroup API_Data_Interfaces
+Compute the CRC of a byte buffer seeded by the \p inputcrc
+<em>current state</em>. The return value should be considered as the new
+<em>current state</em> for future CRC computation. This is used for computing
+data size footprint.
+
+\fn uint32_t starpu_hash_crc32c_be(uint32_t input, uint32_t inputcrc)
+\ingroup API_Data_Interfaces
+Compute the CRC of a 32-bit number seeded by the \p inputcrc
+<em>current state</em>. The return value should be considered as the new
+<em>current state</em> for future CRC computation. This is used for computing
+data size footprint.
+
+\fn uint32_t starpu_hash_crc32c_string(const char *str, uint32_t inputcrc)
+\ingroup API_Data_Interfaces
+Compute the CRC of a string seeded by the \p inputcrc <em>current
+state</em>. The return value should be considered as the new <em>current
+state</em> for future CRC computation. This is used for computing data
+size footprint.
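+
+A hedged sketch of chaining these calls when computing the footprint of
+an interface (\c nx and \c elemsize stand for the fields of a vector
+interface and are assumed to be in scope):
+\code{.c}
+uint32_t footprint = starpu_hash_crc32c_be(nx, 0);
+footprint = starpu_hash_crc32c_be(elemsize, footprint);
+/* footprint is the new current state, usable as data size footprint */
+\endcode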
+
+\fn int starpu_data_interface_get_next_id(void)
+\ingroup API_Data_Interfaces
+Return the next available id for a newly created data interface
+(\ref DefiningANewDataInterface).
+
+*/
+

+ 260 - 0
doc/doxygen/chapters/api/data_management.doxy

@@ -0,0 +1,260 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Data_Management Data Management
+
+\brief This section describes the data management facilities provided
+by StarPU. We show how to use existing data interfaces in \ref
+API_Data_Interfaces, but developers can design their own data interfaces if
+required.
+
+\typedef starpu_data_handle_t
+\ingroup API_Data_Management
+StarPU uses ::starpu_data_handle_t as an opaque handle to
+manage a piece of data. Once a piece of data has been registered to
+StarPU, it is associated with a ::starpu_data_handle_t which keeps track
+of the state of the piece of data over the entire machine, so that we
+can maintain data consistency and locate data replicates for instance.
+
+\enum starpu_data_access_mode
+\ingroup API_Data_Management
+This datatype describes a data access mode.
+\var starpu_data_access_mode::STARPU_NONE
+\ingroup API_Data_Management
+TODO
+\var starpu_data_access_mode::STARPU_R
+\ingroup API_Data_Management
+read-only mode.
+\var starpu_data_access_mode::STARPU_W
+\ingroup API_Data_Management
+write-only mode.
+\var starpu_data_access_mode::STARPU_RW
+\ingroup API_Data_Management
+read-write mode. This is equivalent to ::STARPU_R|::STARPU_W
+\var starpu_data_access_mode::STARPU_SCRATCH
+\ingroup API_Data_Management
+A temporary buffer is allocated for the task, but StarPU does not
+enforce data consistency---i.e. each device has its own buffer,
+independently from each other (even for CPUs), and no data transfer is
+ever performed. This is useful for temporary variables to avoid
+allocating/freeing buffers inside each task. Currently, no behavior is
+defined concerning the relation with the ::STARPU_R and ::STARPU_W modes
+and the value provided at registration --- i.e., the value of the
+scratch buffer is undefined at entry of the codelet function.  It is
+being considered for future extensions at least to define the initial
+value.  For now, data to be used in ::STARPU_SCRATCH mode should be
+registered with node <c>-1</c> and a <c>NULL</c> pointer, since the
+value of the provided buffer is simply ignored for now.
+\var starpu_data_access_mode::STARPU_REDUX
+\ingroup API_Data_Management
+todo
+\var starpu_data_access_mode::STARPU_COMMUTE
+\ingroup API_Data_Management
+In addition to that, ::STARPU_COMMUTE can be passed along ::STARPU_W
+or ::STARPU_RW to express that StarPU can let tasks commute, which is
+useful e.g. when bringing a contribution into some data, which can be
+done in any order (but still require sequential consistency against
+reads or non-commutative writes).
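+
+As a hedged illustration, two accumulation tasks on the same data could
+be allowed to run in either order like this (the codelet
+\c accumulate_cl and the data \c handle are assumed to be defined
+elsewhere):
+\code{.c}
+/* Both tasks write to handle, but StarPU may schedule them in any order. */
+starpu_task_insert(&accumulate_cl, STARPU_RW|STARPU_COMMUTE, handle, 0);
+starpu_task_insert(&accumulate_cl, STARPU_RW|STARPU_COMMUTE, handle, 0);
+\endcode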
+
+@name Basic Data Management API
+\ingroup API_Data_Management
+
+Data management is done at a high-level in StarPU: rather than
+accessing a mere list of contiguous buffers, the tasks may manipulate
+data that are described by a high-level construct which we call data
+interface.
+
+An example of data interface is the "vector" interface which describes
+a contiguous data array on a specific memory node. This interface is a
+simple structure containing the number of elements in the array, the
+size of the elements, and the address of the array in the appropriate
+address space (this address may be invalid if there is no valid copy
+of the array in the memory node). More information on the data
+interfaces provided by StarPU is given in \ref API_Data_Interfaces.
+
+When a piece of data managed by StarPU is used by a task, the task
+implementation is given a pointer to an interface describing a valid
+copy of the data that is accessible from the current processing unit.
+
+Every worker is associated with a memory node which is a logical
+abstraction of the address space from which the processing unit gets
+its data. For instance, the memory node associated with the different
+CPU workers represents main memory (RAM), while the memory node associated
+with a GPU is the DRAM embedded on the device. Every memory node is
+identified by a logical index which is accessible from the
+function starpu_worker_get_memory_node(). When registering a piece of
+data to StarPU, the specified memory node indicates where the piece of
+data initially resides (we also call this memory node the home node of
+a piece of data).
+
+\fn void starpu_data_register(starpu_data_handle_t *handleptr, unsigned home_node, void *data_interface, struct starpu_data_interface_ops *ops)
+\ingroup API_Data_Management
+Register a piece of data into the handle located at the
+\p handleptr address. The \p data_interface buffer contains the initial
+description of the data in the \p home_node. The \p ops argument is a
+pointer to a structure describing the different methods used to
+manipulate this type of interface. See starpu_data_interface_ops for
+more details on this structure.
+If \p home_node is -1, StarPU will automatically allocate the memory when
+it is used for the first time in write-only mode. Once such a data
+handle has been automatically allocated, it is possible to access it
+using any access mode.
+Note that StarPU supplies a set of predefined types of interface (e.g.
+vector or matrix) which can be registered by the means of helper
+functions (e.g. starpu_vector_data_register() or
+starpu_matrix_data_register()).
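+
+Here is an example of registering a vector through the predefined
+helper and unregistering it once all tasks are done (a minimal
+sketch; the array name and sizes are arbitrary):
+\code{.c}
+float vector[1024];
+starpu_data_handle_t vector_handle;
+
+/* Memory node 0 is main memory: the data initially resides in RAM */
+starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector,
+                            1024, sizeof(vector[0]));
+
+/* ... submit tasks working on vector_handle ... */
+
+starpu_data_unregister(vector_handle);
+\endcode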
+
+\fn void starpu_data_register_same(starpu_data_handle_t *handledst, starpu_data_handle_t handlesrc)
+\ingroup API_Data_Management
+Register a new piece of data into the handle \p handledst with the
+same interface as the handle \p handlesrc.
+
+\fn void starpu_data_unregister(starpu_data_handle_t handle)
+\ingroup API_Data_Management
+This function unregisters a data handle from StarPU. If the
+data was automatically allocated by StarPU because the home node was
+-1, all automatically allocated buffers are freed. Otherwise, a valid
+copy of the data is put back into the home node in the buffer that was
+initially registered. Using a data handle that has been unregistered
+from StarPU results in undefined behaviour. In case we do not need
+to update the value of the data in the home node, we can use
+the function starpu_data_unregister_no_coherency() instead.
+
+\fn void starpu_data_unregister_no_coherency(starpu_data_handle_t handle)
+\ingroup API_Data_Management
+This is the same as starpu_data_unregister(), except that
+StarPU does not put back a valid copy into the home node, in the
+buffer that was initially registered.
+
+\fn void starpu_data_unregister_submit(starpu_data_handle_t handle)
+\ingroup API_Data_Management
+Destroy the data handle once it is not needed anymore by any
+submitted task. No coherency is assumed.
+
+\fn void starpu_data_invalidate(starpu_data_handle_t handle)
+\ingroup API_Data_Management
+Destroy all replicates of the data handle immediately. After
+data invalidation, the first access to the handle must be performed in
+write-only mode. Accessing an invalidated data in read-mode results in
+undefined behaviour.
+
+\fn void starpu_data_invalidate_submit(starpu_data_handle_t handle)
+\ingroup API_Data_Management
+Submits invalidation of the data handle after completion of
+previously submitted tasks.
+
+\fn void starpu_data_set_wt_mask(starpu_data_handle_t handle, uint32_t wt_mask)
+\ingroup API_Data_Management
+This function sets the write-through mask of a given data (and
+its children), i.e. a bitmask of nodes where the data should always be
+replicated after modification. It also prevents the data from being
+evicted from these nodes when memory gets scarce. When the data is
+modified, it is automatically transferred to those memory nodes. For
+instance a <c>1<<0</c> write-through mask means that the CUDA workers
+will commit their changes in main memory (node 0).
+
+\fn int starpu_data_prefetch_on_node(starpu_data_handle_t handle, unsigned node, unsigned async)
+\ingroup API_Data_Management
+Issue a prefetch request for a given data to a given node, i.e.
+request that the data be replicated to the given node, so that it is
+available there for tasks. If the \p async parameter is 0, the call will
+block until the transfer is achieved, else the call will return as
+soon as the request is scheduled (which may however have to wait for a
+task completion).
+
+\fn starpu_data_handle_t starpu_data_lookup(const void *ptr)
+\ingroup API_Data_Management
+Return the handle corresponding to the data pointed to by the \p ptr host pointer.
+
+\fn int starpu_data_request_allocation(starpu_data_handle_t handle, unsigned node)
+\ingroup API_Data_Management
+Explicitly ask StarPU to allocate room for a piece of data on
+the specified memory node.
+
+\fn void starpu_data_query_status(starpu_data_handle_t handle, int memory_node, int *is_allocated, int *is_valid, int *is_requested)
+\ingroup API_Data_Management
+Query the status of \p handle on the specified \p memory_node.
+
+\fn void starpu_data_advise_as_important(starpu_data_handle_t handle, unsigned is_important)
+\ingroup API_Data_Management
+This function allows the application to specify that a piece of data
+can be discarded without impacting the application.
+
+\fn void starpu_data_set_reduction_methods(starpu_data_handle_t handle, struct starpu_codelet *redux_cl, struct starpu_codelet *init_cl)
+\ingroup API_Data_Management
+This sets the codelets to be used for \p handle when it is
+accessed in the mode ::STARPU_REDUX. Per-worker buffers will be initialized with
+the codelet \p init_cl, and reduction between per-worker buffers will be
+done with the codelet \p redux_cl.
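+
+Here is a sketch of reduction codelets for a scalar accumulator (the
+function names, the codelet structures and the \c dot_handle variable
+are illustrative assumptions):
+\code{.c}
+void init_cpu_func(void *descr[], void *cl_arg)
+{
+        double *dot = (double *)STARPU_VARIABLE_GET_PTR(descr[0]);
+        *dot = 0.0;
+}
+
+void redux_cpu_func(void *descr[], void *cl_arg)
+{
+        double *dota = (double *)STARPU_VARIABLE_GET_PTR(descr[0]);
+        double *dotb = (double *)STARPU_VARIABLE_GET_PTR(descr[1]);
+        *dota = *dota + *dotb;
+}
+
+struct starpu_codelet init_codelet =
+{
+        .cpu_funcs = {init_cpu_func, NULL},
+        .nbuffers = 1
+};
+
+struct starpu_codelet redux_codelet =
+{
+        .cpu_funcs = {redux_cpu_func, NULL},
+        .nbuffers = 2
+};
+
+starpu_data_set_reduction_methods(dot_handle, &redux_codelet, &init_codelet);
+\endcode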
+
+@name Access registered data from the application
+\ingroup API_Data_Management
+
+\fn int starpu_data_acquire(starpu_data_handle_t handle, enum starpu_data_access_mode mode)
+\ingroup API_Data_Management
+The application must call this function prior to accessing
+registered data from main memory outside tasks. StarPU ensures that
+the application will get an up-to-date copy of the data in main memory
+located where the data was originally registered, and that all
+concurrent accesses (e.g. from tasks) will be consistent with the
+access mode specified in the mode argument. starpu_data_release() must
+be called once the application does not need to access the piece of
+data anymore. Note that implicit data dependencies are also enforced
+by starpu_data_acquire(), i.e. starpu_data_acquire() will wait for all
+tasks scheduled to work on the data, unless they have been disabled
+explicitly by calling starpu_data_set_default_sequential_consistency_flag() or
+starpu_data_set_sequential_consistency_flag(). starpu_data_acquire() is a
+blocking call, so that it cannot be called from tasks or from their
+callbacks (in that case, starpu_data_acquire() returns <c>-EDEADLK</c>). Upon
+successful completion, this function returns 0.
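+
+Here is a typical usage sketch (assuming \c vector_handle was
+registered beforehand):
+\code{.c}
+starpu_data_acquire(vector_handle, STARPU_R);
+/* the up-to-date buffer can safely be read from the application here */
+starpu_data_release(vector_handle);
+\endcode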
+
+\fn int starpu_data_acquire_cb(starpu_data_handle_t handle, enum starpu_data_access_mode mode, void (*callback)(void *), void *arg)
+\ingroup API_Data_Management
+Asynchronous equivalent of starpu_data_acquire(). When the data
+specified in \p handle is available in the appropriate access
+mode, the \p callback function is executed. The application may access
+the requested data during the execution of this \p callback. The \p callback
+function must call starpu_data_release() once the application does not
+need to access the piece of data anymore. Note that implicit data
+dependencies are also enforced by starpu_data_acquire_cb() in case they
+are not disabled. Contrary to starpu_data_acquire(), this function is
+non-blocking and may be called from task callbacks. Upon successful
+completion, this function returns 0.
+
+\fn int starpu_data_acquire_on_node(starpu_data_handle_t handle, unsigned node, enum starpu_data_access_mode mode)
+\ingroup API_Data_Management
+This is the same as starpu_data_acquire(), except that the data
+will be available on the given memory node instead of main memory.
+
+\fn int starpu_data_acquire_on_node_cb(starpu_data_handle_t handle, unsigned node, enum starpu_data_access_mode mode, void (*callback)(void *), void *arg)
+\ingroup API_Data_Management
+This is the same as starpu_data_acquire_cb(), except that the
+data will be available on the given memory node instead of main
+memory.
+
+\def STARPU_DATA_ACQUIRE_CB(starpu_data_handle_t handle, enum starpu_data_access_mode mode, code)
+\ingroup API_Data_Management
+STARPU_DATA_ACQUIRE_CB() is the same as starpu_data_acquire_cb(),
+except that the code to be executed in a callback is directly provided
+as a macro parameter, and the data \p handle is automatically released
+after it. This makes it easy to execute code which depends on the
+value of some registered data. This is non-blocking too and may be
+called from task callbacks.
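+
+Here is a usage sketch (assuming \c vector_handle was registered
+beforehand):
+\code{.c}
+STARPU_DATA_ACQUIRE_CB(vector_handle, STARPU_R,
+                       fprintf(stderr, "The data is now available\n"));
+\endcode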
+
+\fn void starpu_data_release(starpu_data_handle_t handle)
+\ingroup API_Data_Management
+This function releases the piece of data acquired by the
+application either by starpu_data_acquire() or by
+starpu_data_acquire_cb().
+
+\fn void starpu_data_release_on_node(starpu_data_handle_t handle, unsigned node)
+\ingroup API_Data_Management
+This is the same as starpu_data_release(), except that the data
+was made available on the given memory \p node instead of main memory.
+
+*/

+ 258 - 0
doc/doxygen/chapters/api/data_partition.doxy

@@ -0,0 +1,258 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Data_Partition Data Partition
+
+\struct starpu_data_filter
+The filter structure describes a data partitioning operation, to be
+given to the starpu_data_partition() function.
+\ingroup API_Data_Partition
+\var starpu_data_filter::filter_func
+This function fills the child_interface structure with interface
+information for the id-th child of the parent father_interface (among
+nparts).
+\var starpu_data_filter::nchildren
+This is the number of parts to partition the data into.
+\var starpu_data_filter::get_nchildren
+This returns the number of children. This can be used instead of
+nchildren when the number of children depends on the actual data (e.g.
+the number of blocks in a sparse matrix).
+\var starpu_data_filter::get_child_ops
+In case the resulting children use a different data interface, this
+function returns which interface is used by child number id.
+\var starpu_data_filter::filter_arg
+Allows defining an additional parameter for the filter function.
+\var starpu_data_filter::filter_arg_ptr
+Allows defining an additional pointer parameter for the filter
+function, such as the sizes of the different parts.
+
+@name Basic API
+\ingroup API_Data_Partition
+
+\fn void starpu_data_partition(starpu_data_handle_t initial_handle, struct starpu_data_filter *f)
+\ingroup API_Data_Partition
+This requests partitioning the StarPU data \p initial_handle into
+several pieces of subdata, according to the filter \p f.
+
+Here is an example of how to use the function.
+\code{.c}
+struct starpu_data_filter f = {
+        .filter_func = starpu_matrix_filter_block,
+        .nchildren = nslicesx,
+        .get_nchildren = NULL,
+        .get_child_ops = NULL
+};
+starpu_data_partition(A_handle, &f);
+\endcode
+
+\fn void starpu_data_unpartition(starpu_data_handle_t root_data, unsigned gathering_node)
+\ingroup API_Data_Partition
+This unapplies one filter, thus unpartitioning the data. The
+pieces of data are collected back into one big piece in the
+\p gathering_node (usually 0). Tasks working on the partitioned data must
+be already finished when calling starpu_data_unpartition().
+
+Here is an example of how to use the function.
+\code{.c}
+starpu_data_unpartition(A_handle, 0);
+\endcode
+
+\fn int starpu_data_get_nb_children(starpu_data_handle_t handle)
+\ingroup API_Data_Partition
+This function returns the number of children of \p handle.
+
+\fn starpu_data_handle_t starpu_data_get_child(starpu_data_handle_t handle, unsigned i)
+\ingroup API_Data_Partition
+Return the \p i th child of the given \p handle, which must have been
+partitioned beforehand.
+
+\fn starpu_data_handle_t starpu_data_get_sub_data (starpu_data_handle_t root_data, unsigned depth, ... )
+\ingroup API_Data_Partition
+After partitioning a StarPU data by applying a filter,
+starpu_data_get_sub_data() can be used to get handles for each of the
+data portions. \p root_data is the parent data that was partitioned.
+\p depth is the number of filters to traverse (in case several filters
+have been applied, to e.g. partition in row blocks, and then in column
+blocks), and the subsequent parameters are the indexes. The function
+returns a handle to the subdata.
+
+Here is an example of how to use the function.
+\code{.c}
+h = starpu_data_get_sub_data(A_handle, 1, taskx);
+\endcode
+
+\fn starpu_data_handle_t starpu_data_vget_sub_data(starpu_data_handle_t root_data, unsigned depth, va_list pa)
+\ingroup API_Data_Partition
+This function is similar to starpu_data_get_sub_data() but uses a
+va_list for the parameter list.
+
+\fn void starpu_data_map_filters(starpu_data_handle_t root_data, unsigned nfilters, ...)
+\ingroup API_Data_Partition
+Applies \p nfilters filters to the handle designated by
+\p root_data recursively. \p nfilters pointers to variables of the type
+starpu_data_filter should be given.
+
+\fn void starpu_data_vmap_filters(starpu_data_handle_t root_data, unsigned nfilters, va_list pa)
+\ingroup API_Data_Partition
+Applies \p nfilters filters to the handle designated by
+\p root_data recursively. It uses a va_list of pointers to variables of
+the type starpu_data_filter.
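+
+Here is a sketch of applying two matrix filters and then retrieving a
+sub-block (the \c nslicesx, \c nslicesy, \c i and \c j variables are
+assumptions):
+\code{.c}
+struct starpu_data_filter f_vert =
+{
+        .filter_func = starpu_matrix_filter_block,
+        .nchildren = nslicesx
+};
+struct starpu_data_filter f_horiz =
+{
+        .filter_func = starpu_matrix_filter_vertical_block,
+        .nchildren = nslicesy
+};
+starpu_data_map_filters(A_handle, 2, &f_vert, &f_horiz);
+
+/* handle of sub-block (i,j) after both filters have been applied */
+starpu_data_handle_t sub = starpu_data_get_sub_data(A_handle, 2, i, j);
+\endcode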
+
+@name Predefined Vector Filter Functions
+\ingroup API_Data_Partition
+
+This section gives a partial list of the predefined partitioning
+functions for vector data. Examples on how to use them are shown in
+\ref PartitioningData. The complete list can be found in the file
+<c>starpu_data_filters.h</c>.
+
+\fn void starpu_vector_filter_block(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup API_Data_Partition
+Return in \p child_interface the \p id th element of the vector
+represented by \p father_interface once partitioned in \p nparts chunks of
+equal size.
+
+\fn void starpu_vector_filter_block_shadow(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup API_Data_Partition
+Return in \p child_interface the \p id th element of the vector
+represented by \p father_interface once partitioned in \p nparts chunks of
+equal size with a shadow border <c>filter_arg_ptr</c>, thus getting a vector
+of size (n-2*shadow)/nparts+2*shadow. The <c>filter_arg_ptr</c> field
+of \p f must be the shadow size casted into void*. <b>IMPORTANT</b>:
+This can only be used for read-only access, as no coherency is
+enforced for the shadowed parts. A usage example is available in
+examples/filters/shadow.c
+
+\fn void starpu_vector_filter_list(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup API_Data_Partition
+Return in \p child_interface the \p id th element of the vector
+represented by \p father_interface once partitioned into \p nparts chunks
+according to the <c>filter_arg_ptr</c> field of \p f. The
+<c>filter_arg_ptr</c> field must point to an array of \p nparts uint32_t
+elements, each of which specifies the number of elements in each chunk
+of the partition.
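+
+Here is a sketch of partitioning a 1024-element vector into three
+chunks of unequal sizes (the sizes are arbitrary; they must sum to the
+vector length):
+\code{.c}
+uint32_t chunk_sizes[3] = {512, 256, 256};
+struct starpu_data_filter f =
+{
+        .filter_func = starpu_vector_filter_list,
+        .nchildren = 3,
+        .filter_arg_ptr = chunk_sizes
+};
+starpu_data_partition(vector_handle, &f);
+\endcode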
+
+\fn void starpu_vector_filter_divide_in_2(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup API_Data_Partition
+Return in \p child_interface the \p id th element of the vector
+represented by \p father_interface once partitioned in <c>2</c> chunks of
+equal size, ignoring nparts. Thus, \p id must be <c>0</c> or <c>1</c>.
+
+@name Predefined Matrix Filter Functions
+\ingroup API_Data_Partition
+
+This section gives a partial list of the predefined partitioning
+functions for matrix data. Examples on how to use them are shown in
+\ref PartitioningData. The complete list can be found in the file
+<c>starpu_data_filters.h</c>.
+
+\fn void starpu_matrix_filter_block(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup API_Data_Partition
+This partitions a dense Matrix along the x dimension, thus
+getting (x/\p nparts ,y) matrices. If \p nparts does not divide x, the
+last submatrix contains the remainder.
+
+\fn void starpu_matrix_filter_block_shadow(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup API_Data_Partition
+This partitions a dense Matrix along the x dimension, with a
+shadow border <c>filter_arg_ptr</c>, thus getting ((x-2*shadow)/\p
+nparts +2*shadow,y) matrices. If \p nparts does not divide x-2*shadow,
+the last submatrix contains the remainder. <b>IMPORTANT</b>: This can
+only be used for read-only access, as no coherency is enforced for the
+shadowed parts. A usage example is available in
+examples/filters/shadow2d.c
+
+\fn void starpu_matrix_filter_vertical_block(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup API_Data_Partition
+This partitions a dense Matrix along the y dimension, thus
+getting (x,y/\p nparts) matrices. If \p nparts does not divide y, the
+last submatrix contains the remainder.
+
+\fn void starpu_matrix_filter_vertical_block_shadow(void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup API_Data_Partition
+This partitions a dense Matrix along the y dimension, with a
+shadow border <c>filter_arg_ptr</c>, thus getting
+(x,(y-2*shadow)/\p nparts +2*shadow) matrices. If \p nparts does not
+divide y-2*shadow, the last submatrix contains the remainder.
+<b>IMPORTANT</b>: This can only be used for read-only access, as no
+coherency is enforced for the shadowed parts. A usage example is
+available in examples/filters/shadow2d.c 
+
+@name Predefined Block Filter Functions
+\ingroup API_Data_Partition
+
+This section gives a partial list of the predefined partitioning
+functions for block data. Examples on how to use them are shown in
+\ref PartitioningData. The complete list can be found in the file
+<c>starpu_data_filters.h</c>. A usage example is available in
+examples/filters/shadow3d.c
+
+\fn void starpu_block_filter_block (void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup API_Data_Partition
+This partitions a block along the X dimension, thus getting
+(x/\p nparts ,y,z) 3D matrices. If \p nparts does not divide x, the last
+submatrix contains the remainder.
+
+\fn void starpu_block_filter_block_shadow (void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup API_Data_Partition
+This partitions a block along the X dimension, with a
+shadow border <c>filter_arg_ptr</c>, thus getting
+((x-2*shadow)/\p nparts +2*shadow,y,z) blocks. If \p nparts does not
+divide x, the last submatrix contains the remainder. <b>IMPORTANT</b>:
+This can only be used for read-only access, as no coherency is
+enforced for the shadowed parts.
+
+\fn void starpu_block_filter_vertical_block (void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup API_Data_Partition
+This partitions a block along the Y dimension, thus getting
+(x,y/\p nparts ,z) blocks. If \p nparts does not divide y, the last
+submatrix contains the remainder.
+
+\fn void starpu_block_filter_vertical_block_shadow (void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup API_Data_Partition
+This partitions a block along the Y dimension, with a
+shadow border <c>filter_arg_ptr</c>, thus getting
+(x,(y-2*shadow)/\p nparts +2*shadow,z) 3D matrices. If \p nparts does not
+divide y, the last submatrix contains the remainder. <b>IMPORTANT</b>:
+This can only be used for read-only access, as no coherency is
+enforced for the shadowed parts.
+
+\fn void starpu_block_filter_depth_block (void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup API_Data_Partition
+This partitions a block along the Z dimension, thus getting
+(x,y,z/\p nparts) blocks. If \p nparts does not divide z, the last
+submatrix contains the remainder.
+
+\fn void starpu_block_filter_depth_block_shadow (void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup API_Data_Partition
+This partitions a block along the Z dimension, with a
+shadow border <c>filter_arg_ptr</c>, thus getting
+(x,y,(z-2*shadow)/\p nparts +2*shadow) blocks. If \p nparts does not
+divide z, the last submatrix contains the remainder. <b>IMPORTANT</b>:
+This can only be used for read-only access, as no coherency is
+enforced for the shadowed parts.
+
+@name Predefined BCSR Filter Functions
+\ingroup API_Data_Partition
+
+This section gives a partial list of the predefined partitioning
+functions for BCSR data. Examples on how to use them are shown in
+\ref PartitioningData. The complete list can be found in the file
+<c>starpu_data_filters.h</c>.
+
+\fn void starpu_bcsr_filter_canonical_block (void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup API_Data_Partition
+This partitions a block-sparse matrix into dense matrices.
+
+\fn void starpu_csr_filter_vertical_block (void *father_interface, void *child_interface, struct starpu_data_filter *f, unsigned id, unsigned nparts)
+\ingroup API_Data_Partition
+This partitions a block-sparse matrix into vertical
+block-sparse matrices.
+
+*/
+

+ 25 - 0
doc/doxygen/chapters/api/expert_mode.doxy

@@ -0,0 +1,25 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Expert_Mode Expert Mode
+
+\fn void starpu_wake_all_blocked_workers(void)
+\ingroup API_Expert_Mode
+Wake all the workers, so they can inspect data requests and task
+submissions again.
+
+\fn int starpu_progression_hook_register(unsigned (*func)(void *arg), void *arg)
+\ingroup API_Expert_Mode
+Register a progression hook, to be called when workers are idle.
+
+\fn void starpu_progression_hook_deregister(int hook_id)
+\ingroup API_Expert_Mode
+Unregister a given progression hook.
+
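+Here is a sketch of registering and unregistering a hook (the hook
+name and its body are illustrative; the return value shown in the
+comment is an assumption):
+\code{.c}
+unsigned my_hook(void *arg)
+{
+        /* inspect some application-level progress here */
+        return 0;
+}
+
+int hook_id = starpu_progression_hook_register(my_hook, NULL);
+/* ... */
+starpu_progression_hook_deregister(hook_id);
+\endcode
+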
+*/
+

+ 113 - 0
doc/doxygen/chapters/api/explicit_dependencies.doxy

@@ -0,0 +1,113 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Explicit_Dependencies Explicit Dependencies
+
+\fn void starpu_task_declare_deps_array(struct starpu_task *task, unsigned ndeps, struct starpu_task *task_array[])
+\ingroup API_Explicit_Dependencies
+Declare task dependencies between a \p task and an array of
+tasks of length \p ndeps. This function must be called prior to the
+submission of the task, but it may be called after the submission or the
+execution of the tasks in the array, provided the tasks are still
+valid (i.e. they were not automatically destroyed). Calling this
+function on a task that was already submitted or with an entry of
+\p task_array that is no longer a valid task results in an undefined
+behaviour. If \p ndeps is 0, no dependency is added. It is possible to
+call starpu_task_declare_deps_array() several times on the same task,
+in this case, the dependencies are added. It is possible to have
+redundancy in the task dependencies.
+
+\typedef starpu_tag_t
+\ingroup API_Explicit_Dependencies
+This type defines a task logical identifier. It is possible to
+associate a task with a unique <em>tag</em> chosen by the application,
+and to express dependencies between tasks by the means of those tags.
+To do so, fill the field starpu_task::tag_id with a tag number (can be
+arbitrary) and set the field starpu_task::use_tag to 1. If
+starpu_tag_declare_deps() is called with this tag number, the task
+will not be started until the tasks which hold the declared
+dependency tags are completed.
+
+\fn void starpu_tag_declare_deps(starpu_tag_t id, unsigned ndeps, ...)
+\ingroup API_Explicit_Dependencies
+Specify the dependencies of the task identified by tag \p id.
+The first argument specifies the tag which is configured, the second
+argument gives the number of tag(s) on which \p id depends. The
+following arguments are the tags which have to be terminated to unlock
+the task. This function must be called before the associated task is
+submitted to StarPU with starpu_task_submit().
+
+<b>WARNING! Use with caution</b>. Because of the variable arity of
+starpu_tag_declare_deps(), note that the last arguments must be of
+type starpu_tag_t: constant values typically need to be explicitly
+cast. Otherwise, due to integer sizes and argument passing on the
+stack, the C compiler might consider the tag <c>0x200000003</c>
+instead of <c>0x2</c> and <c>0x3</c> when calling
+<c>starpu_tag_declare_deps(0x1, 2, 0x2, 0x3)</c>. Using the
+starpu_tag_declare_deps_array() function avoids this hazard.
+
+\code{.c}
+/*  Tag 0x1 depends on tags 0x32 and 0x52 */
+starpu_tag_declare_deps((starpu_tag_t)0x1, 2, (starpu_tag_t)0x32, (starpu_tag_t)0x52);
+\endcode
+
+\fn void starpu_tag_declare_deps_array(starpu_tag_t id, unsigned ndeps, starpu_tag_t *array)
+\ingroup API_Explicit_Dependencies
+This function is similar to starpu_tag_declare_deps(), except
+that it does not take a variable number of arguments but an array of
+tags of size \p ndeps.
+
+\code{.c}
+/*  Tag 0x1 depends on tags 0x32 and 0x52 */
+starpu_tag_t tag_array[2] = {0x32, 0x52};
+starpu_tag_declare_deps_array((starpu_tag_t)0x1, 2, tag_array);
+\endcode
+
+\fn int starpu_tag_wait(starpu_tag_t id)
+\ingroup API_Explicit_Dependencies
+This function blocks until the task associated to tag \p id has
+been executed. This is a blocking call which must therefore not be
+called within tasks or callbacks, but only from the application
+directly. It is possible to synchronize with the same tag multiple
+times, as long as the starpu_tag_remove() function is not called. Note
+that it is still possible to synchronize with a tag associated to a
+task for which the structure starpu_task was freed (e.g. if the field
+starpu_task::destroy was enabled).
+
+\fn int starpu_tag_wait_array(unsigned ntags, starpu_tag_t *id)
+\ingroup API_Explicit_Dependencies
+This function is similar to starpu_tag_wait() except that it
+blocks until all the \p ntags tags contained in the array \p id are
+terminated.
+
+\fn void starpu_tag_restart(starpu_tag_t id)
+\ingroup API_Explicit_Dependencies
+This function can be used to clear the <em>already
+notified</em> status of a tag which is not associated with a task.
+Before that, calling starpu_tag_notify_from_apps() again will not
+notify the successors. After that, the next call to
+starpu_tag_notify_from_apps() will notify the successors.
+
+\fn void starpu_tag_remove(starpu_tag_t id)
+\ingroup API_Explicit_Dependencies
+This function releases the resources associated to tag \p id.
+It can be called once the corresponding task has been executed and
+when no other tag depends on this tag anymore.
+
+\fn void starpu_tag_notify_from_apps (starpu_tag_t id)
+\ingroup API_Explicit_Dependencies
+This function explicitly unlocks tag \p id. It may be useful in
+the case of applications which execute part of their computation
+outside StarPU tasks (e.g. third-party libraries). It is also provided
+as a convenient tool for the programmer, for instance to entirely
+construct the task DAG before actually giving StarPU the opportunity
+to execute the tasks. When called several times on the same tag,
+notification will be done only on first call, thus implementing "OR"
+dependencies, until the tag is restarted using starpu_tag_restart().
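+
+Here is a sketch of a task that is unlocked from the application
+through a tag (the tag values are arbitrary, and a codelet \c cl is
+assumed to be defined elsewhere):
+\code{.c}
+struct starpu_task *task = starpu_task_create();
+task->cl = &cl;
+task->use_tag = 1;
+task->tag_id = (starpu_tag_t)0x42;
+
+/* the task will wait for tag 0x1, which the application unlocks */
+starpu_tag_declare_deps((starpu_tag_t)0x42, 1, (starpu_tag_t)0x1);
+starpu_task_submit(task);
+
+/* ... computation outside StarPU ... */
+starpu_tag_notify_from_apps((starpu_tag_t)0x1);
+\endcode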
+
+*/

+ 63 - 0
doc/doxygen/chapters/api/fft_support.doxy

@@ -0,0 +1,63 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_FFT_Support FFT Support
+
+\fn void * starpufft_malloc(size_t n)
+\ingroup API_FFT_Support
+Allocates memory for \p n bytes. This is preferred over malloc(),
+since it allocates pinned memory, which allows overlapped transfers.
+
+\fn void * starpufft_free(void *p)
+\ingroup API_FFT_Support
+Release memory previously allocated with starpufft_malloc().
+
+\fn struct starpufft_plan * starpufft_plan_dft_1d(int n, int sign, unsigned flags)
+\ingroup API_FFT_Support
+Initializes a plan for 1D FFT of size \p n. \p sign can be STARPUFFT_FORWARD
+or STARPUFFT_INVERSE. \p flags must be 0.
+
+\fn struct starpufft_plan * starpufft_plan_dft_2d(int n, int m, int sign, unsigned flags)
+\ingroup API_FFT_Support
+Initializes a plan for 2D FFT of size (\p n, \p m). \p sign can be
+STARPUFFT_FORWARD or STARPUFFT_INVERSE. \p flags must be 0.
+
+\fn struct starpu_task * starpufft_start(starpufft_plan p, void *in, void *out)
+\ingroup API_FFT_Support
+Start an FFT previously planned as \p p, using \p in and \p out as
+input and output. This only submits the task and does not wait for it.
+The application should call starpufft_cleanup() to unregister the data.
+
+\fn struct starpu_task * starpufft_start_handle(starpufft_plan p, starpu_data_handle_t in, starpu_data_handle_t out)
+\ingroup API_FFT_Support
+Start an FFT previously planned as \p p, using data handles \p in and
+\p out as input and output (assumed to be vectors of elements of the
+expected types). This only submits the task and does not wait for it.
+
+\fn void starpufft_execute(starpufft_plan p, void *in, void *out)
+\ingroup API_FFT_Support
+Execute an FFT previously planned as \p p, using \p in and \p out as
+input and output. This submits and waits for the task.
+
+\fn void starpufft_execute_handle(starpufft_plan p, starpu_data_handle_t in, starpu_data_handle_t out)
+\ingroup API_FFT_Support
+Execute an FFT previously planned as \p p, using data handles \p in
+and \p out as input and output (assumed to be vectors of elements of
+the expected types). This submits and waits for the task.
+
+\fn void starpufft_cleanup(starpufft_plan p)
+\ingroup API_FFT_Support
+Releases data for plan \p p, in the starpufft_start() case.
+
+\fn void starpufft_destroy_plan(starpufft_plan p)
+\ingroup API_FFT_Support
+Destroys plan \p p, i.e. release all CPU (fftw) and GPU (cufft)
+resources.
+
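+Here is a sketch of a complete 1D FFT through the synchronous
+interface (the size and the \c starpufft_complex element type are
+assumptions):
+\code{.c}
+int n = 1024;
+starpufft_complex *in = starpufft_malloc(n * sizeof(*in));
+starpufft_complex *out = starpufft_malloc(n * sizeof(*out));
+/* ... fill in[] ... */
+
+starpufft_plan plan = starpufft_plan_dft_1d(n, STARPUFFT_FORWARD, 0);
+starpufft_execute(plan, in, out); /* submits the task and waits for it */
+
+starpufft_destroy_plan(plan);
+starpufft_free(in);
+starpufft_free(out);
+\endcode
+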
+*/
+

+ 81 - 0
doc/doxygen/chapters/api/fxt_support.doxy

@@ -0,0 +1,81 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_FxT_Support FxT Support
+
+\struct starpu_fxt_codelet_event
+todo
+\ingroup API_FxT_Support
+\var starpu_fxt_codelet_event::symbol
+name of the codelet
+\var starpu_fxt_codelet_event::workerid
+\var starpu_fxt_codelet_event::archtype
+\var starpu_fxt_codelet_event::hash
+\var starpu_fxt_codelet_event::size
+\var starpu_fxt_codelet_event::time
+
+\struct starpu_fxt_options
+todo
+\ingroup API_FxT_Support
+\var starpu_fxt_options::per_task_colour
+\var starpu_fxt_options::no_counter
+\var starpu_fxt_options::no_bus
+\var starpu_fxt_options::ninputfiles
+\var starpu_fxt_options::filenames
+\var starpu_fxt_options::out_paje_path
+\var starpu_fxt_options::distrib_time_path
+\var starpu_fxt_options::activity_path
+\var starpu_fxt_options::dag_path
+
+\var starpu_fxt_options::file_prefix
+In case we are going to gather multiple traces (e.g. in the case of MPI
+processes), we may need to prefix the name of the containers.
+\var starpu_fxt_options::file_offset
+In case we are going to gather multiple traces (e.g. in the case of MPI
+processes), we may need to prefix the name of the containers.
+\var starpu_fxt_options::file_rank
+In case we are going to gather multiple traces (e.g. in the case of MPI
+processes), we may need to prefix the name of the containers.
+
+\var starpu_fxt_options::worker_names
+Output parameters
+\var starpu_fxt_options::worker_archtypes
+Output parameters
+\var starpu_fxt_options::nworkers
+Output parameters
+
+\var starpu_fxt_options::dumped_codelets
+In case we want to dump the list of codelets to an external tool
+\var starpu_fxt_options::dumped_codelets_count
+In case we want to dump the list of codelets to an external tool
+
+\fn void starpu_fxt_options_init(struct starpu_fxt_options *options)
+\ingroup API_FxT_Support
+todo
+
+\fn void starpu_fxt_generate_trace(struct starpu_fxt_options *options)
+\ingroup API_FxT_Support
+todo
+
+\fn void starpu_fxt_start_profiling(void)
+\ingroup API_FxT_Support
+Start recording the trace. The trace is by default started from the
+starpu_init() call, but can be paused by using
+starpu_fxt_stop_profiling(), in which case
+starpu_fxt_start_profiling() should be called to resume recording
+events.
+
+\fn void starpu_fxt_stop_profiling(void)
+\ingroup API_FxT_Support
+Stop recording the trace. The trace is by default stopped when calling
+starpu_shutdown(). starpu_fxt_stop_profiling() can however be used to
+stop it earlier. starpu_fxt_start_profiling() can then be called to
+start recording it again, etc.
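+
+For instance, to leave a warm-up phase out of the trace (run_warmup()
+and run_measured() are hypothetical application functions):
+
+```c
+starpu_init(NULL);
+starpu_fxt_stop_profiling();  /* pause the trace started by starpu_init() */
+run_warmup();                 /* not recorded */
+starpu_fxt_start_profiling(); /* resume recording */
+run_measured();               /* recorded */
+starpu_shutdown();            /* stops the trace and writes it out */
+```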
+
+*/
+

+ 42 - 0
doc/doxygen/chapters/api/implicit_dependencies.doxy

@@ -0,0 +1,42 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Implicit_Data_Dependencies Implicit Data Dependencies
+
+\brief In this section, we describe how StarPU makes it possible to
+insert implicit task dependencies in order to enforce sequential data
+consistency. When this data consistency is enabled on a specific data
+handle, any data access will appear as sequentially consistent from
+the application's point of view. For instance, if the application
+submits two tasks that access the same piece of data in read-only
+mode, and then a third task that accesses it in write mode,
+dependencies will be added between the first two tasks and the third
+one. Implicit data dependencies are also inserted in the case of data
+accesses from the application.
+
+\fn void starpu_data_set_default_sequential_consistency_flag(unsigned flag)
+\ingroup API_Implicit_Data_Dependencies
+Set the default sequential consistency flag. If a non-zero
+value is passed, sequential data consistency will be enforced for
+all handles registered after this function call; otherwise it is
+disabled. By default, StarPU enables sequential data consistency. It
+is also possible to select the data consistency mode of a specific
+data handle with the function
+starpu_data_set_sequential_consistency_flag().
+
+\fn unsigned starpu_data_get_default_sequential_consistency_flag(void)
+\ingroup API_Implicit_Data_Dependencies
+Return the default sequential consistency flag.
+
+\fn void starpu_data_set_sequential_consistency_flag(starpu_data_handle_t handle, unsigned flag)
+\ingroup API_Implicit_Data_Dependencies
+Set the data consistency mode associated to a data handle. The
+consistency mode set using this function takes priority over the
+default mode which can be set with
+starpu_data_set_default_sequential_consistency_flag().
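+
+A minimal sketch, assuming a scratch buffer for which implicit
+dependencies are not wanted (buf and n are hypothetical; memory node 0
+denotes main RAM here):
+
+```c
+starpu_data_handle_t scratch;
+starpu_vector_data_register(&scratch, 0, (uintptr_t)buf, n, sizeof(float));
+/* Opt this handle out of sequential consistency: tasks accessing it
+ * will not get implicit dependencies between each other. */
+starpu_data_set_sequential_consistency_flag(scratch, 0);
+```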
+
+*/

+ 264 - 0
doc/doxygen/chapters/api/initialization.doxy

@@ -0,0 +1,264 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Initialization_and_Termination Initialization and Termination
+
+\struct starpu_driver
+structure for a driver
+\ingroup API_Initialization_and_Termination
+\var starpu_driver::type
+The type of the driver. Only ::STARPU_CPU_WORKER,
+::STARPU_CUDA_WORKER and ::STARPU_OPENCL_WORKER are currently supported.
+\var starpu_driver::id
+The identifier of the driver.
+
+\struct starpu_vector_interface
+vector interface for contiguous (non-strided) buffers
+\ingroup API_Initialization_and_Termination
+
+\struct starpu_conf
+\ingroup API_Initialization_and_Termination
+This structure is passed to the starpu_init() function in order to
+configure StarPU. It has to be initialized with starpu_conf_init().
+When the default value is used, StarPU automatically selects the
+number of processing units and takes the default scheduling policy.
+The environment variables overwrite the equivalent parameters.
+\var starpu_conf::magic
+\private
+Will be initialized by starpu_conf_init(). Should not be set by hand.
+\var starpu_conf::sched_policy_name
+This is the name of the scheduling policy. This can also be specified
+with the environment variable \ref STARPU_SCHED. (default = NULL).
+\var starpu_conf::sched_policy
+This is the definition of the scheduling policy. This field is ignored
+if starpu_conf::sched_policy_name is set. (default = NULL)
+\var starpu_conf::ncpus
+This is the number of CPU cores that StarPU can use. This can also be
+specified with the environment variable \ref STARPU_NCPU . (default = -1)
+\var starpu_conf::ncuda
+This is the number of CUDA devices that StarPU can use. This can also
+be specified with the environment variable \ref STARPU_NCUDA. (default =
+-1)
+\var starpu_conf::nopencl
+This is the number of OpenCL devices that StarPU can use. This can
+also be specified with the environment variable \ref STARPU_NOPENCL.
+(default = -1)
+\var starpu_conf::nmic
+This is the number of MIC devices that StarPU can use. This can also
+be specified with the environment variable \ref STARPU_NMIC.
+(default = -1)
+\var starpu_conf::nscc
+This is the number of SCC devices that StarPU can use. This can also
+be specified with the environment variable \ref STARPU_NSCC.
+(default = -1)
+
+\var starpu_conf::use_explicit_workers_bindid
+If this flag is set, the starpu_conf::workers_bindid array indicates
+where the different workers are bound, otherwise StarPU automatically
+selects where to bind the different workers. This can also be
+specified with the environment variable \ref STARPU_WORKERS_CPUID. (default = 0)
+\var starpu_conf::workers_bindid
+If the starpu_conf::use_explicit_workers_bindid flag is set, this
+array indicates where to bind the different workers. The i-th entry of
+the starpu_conf::workers_bindid indicates the logical identifier of
+the processor which should execute the i-th worker. Note that the
+logical ordering of the CPUs is either determined by the OS, or
+provided by the hwloc library in case it is available.
+\var starpu_conf::use_explicit_workers_cuda_gpuid
+If this flag is set, the CUDA workers will be attached to the CUDA
+devices specified in the starpu_conf::workers_cuda_gpuid array.
+Otherwise, StarPU assigns the CUDA devices in a round-robin fashion.
+This can also be specified with the environment variable \ref
+STARPU_WORKERS_CUDAID. (default = 0)
+\var starpu_conf::workers_cuda_gpuid
+If the starpu_conf::use_explicit_workers_cuda_gpuid flag is set, this
+array contains the logical identifiers of the CUDA devices (as used by
+cudaGetDevice()).
+\var starpu_conf::use_explicit_workers_opencl_gpuid
+If this flag is set, the OpenCL workers will be attached to the OpenCL
+devices specified in the starpu_conf::workers_opencl_gpuid array.
+Otherwise, StarPU assigns the OpenCL devices in a round-robin fashion.
+This can also be specified with the environment variable \ref
+STARPU_WORKERS_OPENCLID. (default = 0)
+\var starpu_conf::workers_opencl_gpuid
+If the starpu_conf::use_explicit_workers_opencl_gpuid flag is set,
+this array contains the logical identifiers of the OpenCL devices to
+be used.
+
+\var starpu_conf::use_explicit_workers_mic_deviceid
+If this flag is set, the MIC workers will be attached to the MIC
+devices specified in the array starpu_conf::workers_mic_deviceid.
+Otherwise, StarPU assigns the MIC devices in a round-robin fashion.
+This can also be specified with the environment variable \ref
+STARPU_WORKERS_MICID.
+(default = 0)
+\var starpu_conf::workers_mic_deviceid
+If the flag starpu_conf::use_explicit_workers_mic_deviceid is set, the
+array contains the logical identifiers of the MIC devices to be used.
+
+\var starpu_conf::use_explicit_workers_scc_deviceid
+If this flag is set, the SCC workers will be attached to the SCC
+devices specified in the array starpu_conf::workers_scc_deviceid.
+(default = 0)
+\var starpu_conf::workers_scc_deviceid
+If the flag starpu_conf::use_explicit_workers_scc_deviceid is set, the
+array contains the logical identifiers of the SCC devices to be used.
+Otherwise, StarPU assigns the SCC devices in a round-robin fashion.
+This can also be specified with the environment variable \ref
+STARPU_WORKERS_SCCID.
+
+\var starpu_conf::bus_calibrate
+If this flag is set, StarPU will recalibrate the bus.  If this value
+is equal to <c>-1</c>, the default value is used.  This can also be
+specified with the environment variable \ref STARPU_BUS_CALIBRATE. (default
+= 0)
+\var starpu_conf::calibrate
+If this flag is set, StarPU will calibrate the performance models when
+executing tasks. If this value is equal to <c>-1</c>, the default
+value is used. If the value is equal to <c>1</c>, it will force
+continuing calibration. If the value is equal to <c>2</c>, the
+existing performance models will be overwritten. This can also be
+specified with the environment variable \ref STARPU_CALIBRATE. (default =
+0)
+\var starpu_conf::single_combined_worker
+By default, StarPU executes parallel tasks
+concurrently. Some parallel libraries (e.g. most OpenMP
+implementations) however do not support concurrent calls to
+parallel code. In such case, setting this flag makes StarPU
+only start one parallel task at a time (but other CPU and
+GPU tasks are not affected and can be run concurrently).
+The parallel task scheduler will however still
+try varying combined worker sizes to look for the
+most efficient ones. This can also be specified with the environment
+variable \ref STARPU_SINGLE_COMBINED_WORKER.
+(default = 0)
+
+\var starpu_conf::mic_sink_program_path
+Path to the kernel to execute on the MIC device, compiled for MIC
+architecture. When set to NULL, StarPU automatically looks next to the
+host program location.
+(default = NULL)
+
+\var starpu_conf::disable_asynchronous_copy
+This flag should be set to 1 to disable
+asynchronous copies between CPUs and all accelerators. This
+can also be specified with the environment variable \ref
+STARPU_DISABLE_ASYNCHRONOUS_COPY. The
+AMD implementation of OpenCL is known to fail when copying
+data asynchronously. When using this implementation, it is
+therefore necessary to disable asynchronous data transfers.
+This can also be specified at compilation time by giving to
+the configure script the option
+\ref disable-asynchronous-copy "--disable-asynchronous-copy". (default = 0)
+\var starpu_conf::disable_asynchronous_cuda_copy
+This flag should be set to 1 to disable
+asynchronous copies between CPUs and CUDA accelerators.
+This can also be specified with the environment variable
+\ref STARPU_DISABLE_ASYNCHRONOUS_CUDA_COPY.
+This can also be specified at compilation time by giving to
+the configure script the option
+\ref disable-asynchronous-cuda-copy "--disable-asynchronous-cuda-copy". (default = 0)
+\var starpu_conf::disable_asynchronous_opencl_copy
+This flag should be set to 1 to disable
+asynchronous copies between CPUs and OpenCL accelerators.
+This can also be specified with the environment
+variable \ref STARPU_DISABLE_ASYNCHRONOUS_OPENCL_COPY. The AMD
+implementation of OpenCL is known to fail
+when copying data asynchronously. When using this
+implementation, it is therefore necessary to disable
+asynchronous data transfers. This can also be specified at
+compilation time by giving to the configure script the
+option \ref disable-asynchronous-opencl-copy "--disable-asynchronous-opencl-copy".
+(default = 0)
+
+\var starpu_conf::disable_asynchronous_mic_copy
+This flag should be set to 1 to disable asynchronous copies between
+CPUs and MIC accelerators. This can also be specified with the
+environment variable \ref STARPU_DISABLE_ASYNCHRONOUS_MIC_COPY.
+This can also be specified at compilation time by giving to the
+configure script the option \ref disable-asynchronous-mic-copy "--disable-asynchronous-mic-copy".
+(default = 0).
+
+\var starpu_conf::cuda_opengl_interoperability
+Enable CUDA/OpenGL interoperation on these CUDA
+devices. This can be set to an array of CUDA device
+identifiers for which cudaGLSetGLDevice() should be called
+instead of cudaSetDevice(). Its size is specified by the
+starpu_conf::n_cuda_opengl_interoperability field below
+(default = NULL)
+\var starpu_conf::n_cuda_opengl_interoperability
+\var starpu_conf::not_launched_drivers
+Array of drivers that should not be launched by
+StarPU. The application will instead run them in its own
+threads. (default = NULL)
+\var starpu_conf::n_not_launched_drivers
+The number of StarPU drivers that should not be
+launched by StarPU. (default = 0)
+\var starpu_conf::trace_buffer_size
+Specifies the buffer size used for FxT tracing.
+Starting from FxT version 0.2.12, the buffer will
+automatically be flushed when it fills up, but it may still
+be interesting to specify a bigger value to avoid any
+flushing (which would disturb the trace).
+
+\fn int starpu_init(struct starpu_conf *conf)
+\ingroup API_Initialization_and_Termination
+This is the StarPU initialization method, which must be called prior to
+any other StarPU call. It is possible to specify StarPU’s
+configuration (e.g. scheduling policy, number of cores, ...) by
+passing a non-null argument. Default configuration is used if the
+passed argument is NULL. Upon successful completion, this function
+returns 0. Otherwise, -ENODEV indicates that no worker was available
+(so that StarPU was not initialized).
+
+\fn int starpu_initialize(struct starpu_conf *user_conf, int *argc, char ***argv)
+\ingroup API_Initialization_and_Termination
+This is the same as starpu_init(), but also takes the \p argc and \p
+argv as defined by the application. This is needed for SCC execution
+to initialize the communication library.
+Do not call starpu_init() and starpu_initialize() in the
+same program.
+
+\fn int starpu_conf_init(struct starpu_conf *conf)
+\ingroup API_Initialization_and_Termination
+This function initializes the conf structure passed as argument with
+the default values. In case some configuration parameters are already
+specified through environment variables, starpu_conf_init() initializes
+the fields of the structure according to the environment variables.
+For instance if \ref STARPU_CALIBRATE is set, its value is put in the
+field starpu_conf::calibrate of the structure passed as argument. Upon successful
+completion, this function returns 0. Otherwise, -EINVAL indicates that
+the argument was NULL.
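+
+A typical initialization sequence, sketched (the "dmda" policy name
+and the worker count are just examples):
+
+```c
+#include <errno.h>
+#include <starpu.h>
+
+int main(void)
+{
+	struct starpu_conf conf;
+	starpu_conf_init(&conf);         /* defaults + environment variables */
+	conf.ncpus = 2;                  /* restrict StarPU to two CPU workers */
+	conf.sched_policy_name = "dmda"; /* select a scheduling policy by name */
+
+	if (starpu_init(&conf) == -ENODEV)
+		return 77;               /* no worker was available */
+
+	/* ... register data and submit tasks ... */
+
+	starpu_shutdown();
+	return 0;
+}
+```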
+
+\fn void starpu_shutdown(void)
+\ingroup API_Initialization_and_Termination
+This is the StarPU termination method. It must be called at the end of the
+application: statistics and other post-mortem debugging information
+are not guaranteed to be available until this method has been called.
+
+\fn int starpu_asynchronous_copy_disabled(void)
+\ingroup API_Initialization_and_Termination
+Return 1 if asynchronous data transfers between CPU and accelerators
+are disabled.
+
+\fn int starpu_asynchronous_cuda_copy_disabled(void)
+\ingroup API_Initialization_and_Termination
+Return 1 if asynchronous data transfers between CPU and CUDA
+accelerators are disabled.
+
+\fn int starpu_asynchronous_opencl_copy_disabled(void)
+\ingroup API_Initialization_and_Termination
+Return 1 if asynchronous data transfers between CPU and OpenCL
+accelerators are disabled.
+
+\fn void starpu_topology_print(FILE *f)
+\ingroup API_Initialization_and_Termination
+Prints a description of the topology on \p f.
+
+*/
+

+ 98 - 0
doc/doxygen/chapters/api/insert_task.doxy

@@ -0,0 +1,98 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Insert_Task Insert_Task
+
+\fn int starpu_insert_task(struct starpu_codelet *cl, ...)
+\ingroup API_Insert_Task
+Create and submit a task corresponding to \p cl with the
+following arguments. The argument list must be zero-terminated.
+
+The arguments following the codelet can be of the following types:
+<ul>
+<li> ::STARPU_R, ::STARPU_W, ::STARPU_RW, ::STARPU_SCRATCH,
+::STARPU_REDUX an access mode followed by a data handle;
+<li> ::STARPU_DATA_ARRAY followed by an array of data handles and its
+number of elements;
+<li> the specific values ::STARPU_VALUE, ::STARPU_CALLBACK,
+::STARPU_CALLBACK_ARG, ::STARPU_CALLBACK_WITH_ARG, ::STARPU_PRIORITY,
+::STARPU_TAG, ::STARPU_FLOPS, ::STARPU_SCHED_CTX followed by the
+appropriate objects as defined elsewhere.
+</ul>
+
+When using ::STARPU_DATA_ARRAY, the access mode of the data handles is
+not defined.
+
+Parameters to be passed to the codelet implementation are defined
+through the type ::STARPU_VALUE. The function
+starpu_codelet_unpack_args() must be called within the codelet
+implementation to retrieve them.
+
+\def STARPU_VALUE
+\ingroup API_Insert_Task
+this macro is used when calling starpu_insert_task(), and must
+be followed by a pointer to a constant value and the size of the
+constant
+
+\def STARPU_CALLBACK
+\ingroup API_Insert_Task
+this macro is used when calling starpu_insert_task(), and must
+be followed by a pointer to a callback function
+
+\def STARPU_CALLBACK_WITH_ARG
+\ingroup API_Insert_Task
+this macro is used when calling starpu_insert_task(), and must
+be followed by two pointers: one to a callback function, and the other
+to be given as an argument to the callback function; this is
+equivalent to using both ::STARPU_CALLBACK and
+::STARPU_CALLBACK_WITH_ARG.
+
+\def STARPU_CALLBACK_ARG
+\ingroup API_Insert_Task
+this macro is used when calling starpu_insert_task(), and must
+be followed by a pointer to be given as an argument to the callback
+function
+
+\def STARPU_PRIORITY
+\ingroup API_Insert_Task
+this macro is used when calling starpu_insert_task(), and must
+be followed by an integer defining a priority level
+
+\def STARPU_DATA_ARRAY
+\ingroup API_Insert_Task
+TODO
+
+\def STARPU_TAG
+\ingroup API_Insert_Task
+this macro is used when calling starpu_insert_task(), and must be followed by a tag.
+
+\def STARPU_FLOPS
+\ingroup API_Insert_Task
+this macro is used when calling starpu_insert_task(), and must
+be followed by an amount of floating point operations, as a double.
+Users <b>MUST</b> explicitly cast into double, otherwise parameter
+passing will not work.
+
+\def STARPU_SCHED_CTX
+\ingroup API_Insert_Task
+this macro is used when calling starpu_insert_task(), and must
+be followed by the id of the scheduling context to which we want to
+submit the task.
+
+\fn void starpu_codelet_pack_args(void **arg_buffer, size_t *arg_buffer_size, ...)
+\ingroup API_Insert_Task
+Pack arguments of type ::STARPU_VALUE into a buffer which can be
+given to a codelet and later unpacked with the function
+starpu_codelet_unpack_args().
+
+\fn void starpu_codelet_unpack_args (void *cl_arg, ...)
+\ingroup API_Insert_Task
+Retrieve the arguments of type ::STARPU_VALUE associated to a
+task automatically created using the function starpu_insert_task().
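+
+A sketch of the pack/unpack pair in use (the codelet scal_cl, its CPU
+function and the handle vector_handle are hypothetical):
+
+```c
+/* Codelet implementation: retrieve the scalar packed at submit time. */
+void scal_cpu_func(void *buffers[], void *cl_arg)
+{
+	float factor;
+	starpu_codelet_unpack_args(cl_arg, &factor);
+	float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
+	unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
+	for (unsigned i = 0; i < n; i++)
+		v[i] *= factor;
+}
+
+/* Submission side: the argument list is zero-terminated. */
+float factor = 3.14f;
+starpu_insert_task(&scal_cl,
+		   STARPU_RW, vector_handle,
+		   STARPU_VALUE, &factor, sizeof(factor),
+		   0);
+```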
+
+*/

+ 50 - 0
doc/doxygen/chapters/api/lower_bound.doxy

@@ -0,0 +1,50 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Theoretical_Lower_Bound_on_Execution_Time Theoretical Lower Bound on Execution Time
+
+\brief Compute a theoretical upper bound on the computation efficiency
+(i.e. a lower bound on the execution time) of some actual execution.
+
+\fn void starpu_bound_start (int deps, int prio)
+\ingroup API_Theoretical_Lower_Bound_on_Execution_Time
+Start recording tasks (resets stats). \p deps tells whether
+dependencies should be recorded too (this is quite expensive).
+
+\fn void starpu_bound_stop (void)
+\ingroup API_Theoretical_Lower_Bound_on_Execution_Time
+Stop recording tasks
+
+\fn void starpu_bound_print_dot (FILE *output)
+\ingroup API_Theoretical_Lower_Bound_on_Execution_Time
+Print the DAG that was recorded
+
+\fn void starpu_bound_compute (double *res, double *integer_res, int integer)
+\ingroup API_Theoretical_Lower_Bound_on_Execution_Time
+Get theoretical upper bound (in ms) (needs glpk support
+detected by configure script). It returns 0 if some performance models
+are not calibrated.
+
+\fn void starpu_bound_print_lp (FILE *output)
+\ingroup API_Theoretical_Lower_Bound_on_Execution_Time
+Emit the Linear Programming system on \p output for the recorded
+tasks, in the lp format
+
+\fn void starpu_bound_print_mps (FILE *output)
+\ingroup API_Theoretical_Lower_Bound_on_Execution_Time
+Emit the Linear Programming system on \p output for the recorded
+tasks, in the mps format
+
+\fn void starpu_bound_print (FILE *output, int integer)
+\ingroup API_Theoretical_Lower_Bound_on_Execution_Time
+Emit statistics of actual execution vs theoretical upper bound.
+\p integer permits choosing between integer solving (which takes a
+long time but is correct) and relaxed solving (which provides an
+approximate solution).
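+
+Typical use brackets the task submissions (submit_tasks() is a
+hypothetical application function):
+
+```c
+starpu_bound_start(0, 0);      /* record neither dependencies nor priorities */
+submit_tasks();
+starpu_task_wait_for_all();
+starpu_bound_stop();
+starpu_bound_print(stderr, 0); /* relaxed solving: fast, approximate */
+```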
+
+*/

+ 28 - 0
doc/doxygen/chapters/api/mic_extensions.doxy

@@ -0,0 +1,28 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_MIC_Extensions MIC Extensions
+
+\def STARPU_USE_MIC
+\ingroup API_MIC_Extensions
+This macro is defined when StarPU has been installed with MIC support.
+It should be used in your code to detect the availability of MIC.
+
+\fn int starpu_mic_register_kernel(starpu_mic_func_symbol_t *symbol, const char *func_name)
+\ingroup API_MIC_Extensions
+Initiate a lookup on each MIC device to find the address of the
+function named \p func_name, store them in the global array kernels
+and return the index in the array through \p symbol.
+
+\fn starpu_mic_kernel_t starpu_mic_get_kernel(starpu_mic_func_symbol_t symbol)
+\ingroup API_MIC_Extensions
+On success, return the pointer to the function defined by \p symbol on
+the device linked to the called device. This can for instance be used
+in a starpu_mic_func_t implementation.
+
+*/

+ 36 - 0
doc/doxygen/chapters/api/misc_helpers.doxy

@@ -0,0 +1,36 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Miscellaneous_Helpers Miscellaneous Helpers
+
+\fn int starpu_data_cpy(starpu_data_handle_t dst_handle, starpu_data_handle_t src_handle, int asynchronous, void (*callback_func)(void*), void *callback_arg)
+\ingroup API_Miscellaneous_Helpers
+Copy the content of \p src_handle into \p dst_handle. The parameter \p
+asynchronous indicates whether the function should block or not. In
+the case of an asynchronous call, it is possible to synchronize with
+the termination of this operation either by the means of implicit
+dependencies (if enabled) or by calling starpu_task_wait_for_all(). If
+\p callback_func is not NULL, this callback function is executed after
+the handle has been copied, and it is given the pointer \p callback_arg as argument.
+
+\fn void starpu_execute_on_each_worker(void (*func)(void *), void *arg, uint32_t where)
+\ingroup API_Miscellaneous_Helpers
+This function executes the given function on a subset of workers. When
+calling this method, the offloaded function \p func is executed by
+every StarPU worker that may execute the function. The argument \p arg
+is passed to the offloaded function. The argument \p where specifies
+on which types of processing units the function should be executed.
+Similarly to the field starpu_codelet::where, it is possible to
+specify that the function should be executed on every CUDA device and
+every CPU by passing ::STARPU_CPU|::STARPU_CUDA. This function blocks
+until the function has been executed on every appropriate processing
+unit, so it should not be called from a callback function, for
+instance.
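+
+For example, to run a per-worker initialization on every CPU and CUDA
+worker (init_worker() is a hypothetical function):
+
+```c
+#include <stdio.h>
+
+static void init_worker(void *arg)
+{
+	int *verbose = arg; /* the same \p arg is passed to every worker */
+	if (*verbose)
+		printf("worker %d initialized\n", starpu_worker_get_id());
+}
+
+int verbose = 1;
+starpu_execute_on_each_worker(init_worker, &verbose, STARPU_CPU|STARPU_CUDA);
+```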
+
+*/
+

+ 276 - 0
doc/doxygen/chapters/api/mpi.doxy

@@ -0,0 +1,276 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_MPI_Support MPI Support
+
+@name Initialisation
+\ingroup API_MPI_Support
+
+\fn int starpu_mpi_init (int *argc, char ***argv, int initialize_mpi)
+\ingroup API_MPI_Support
+Initializes the starpumpi library. \p initialize_mpi indicates if MPI
+should be initialized or not by StarPU. If the value is not 0, MPI
+will be initialized by calling <c>MPI_Init_thread(argc, argv,
+MPI_THREAD_SERIALIZED, ...)</c>.
+
+\fn int starpu_mpi_initialize (void)
+\deprecated
+\ingroup API_MPI_Support
+This function has been made deprecated. One should instead use the
+function starpu_mpi_init(). This function does not call MPI_Init(); it
+should be called beforehand.
+
+\fn int starpu_mpi_initialize_extended (int *rank, int *world_size)
+\deprecated
+\ingroup API_MPI_Support
+This function has been made deprecated. One should instead use the
+function starpu_mpi_init(). MPI will be initialized by starpumpi by
+calling <c>MPI_Init_thread(argc, argv, MPI_THREAD_SERIALIZED,
+...)</c>.
+
+\fn int starpu_mpi_shutdown (void)
+\ingroup API_MPI_Support
+Cleans up the starpumpi library. This must be called after all calls to
+starpu_mpi functions and before starpu_shutdown(). MPI_Finalize() will
+be called if StarPU-MPI has been initialized by starpu_mpi_init().
+
+\fn void starpu_mpi_comm_amounts_retrieve (size_t *comm_amounts)
+\ingroup API_MPI_Support
+Retrieve the current amount of communications from the current node in
+the array \p comm_amounts which must have a size greater or equal to
+the world size. Communications statistics must be enabled (see
+\ref STARPU_COMM_STATS).
+
+@name Communication
+\anchor MPIPtpCommunication
+\ingroup API_MPI_Support
+
+\fn int starpu_mpi_send (starpu_data_handle_t data_handle, int dest, int mpi_tag, MPI_Comm comm)
+\ingroup API_MPI_Support
+Performs a standard-mode, blocking send of \p data_handle to the node
+\p dest using the message tag \p mpi_tag within the communicator \p
+comm.
+
+\fn int starpu_mpi_recv (starpu_data_handle_t data_handle, int source, int mpi_tag, MPI_Comm comm, MPI_Status *status)
+\ingroup API_MPI_Support
+Performs a standard-mode, blocking receive in \p data_handle from the
+node \p source using the message tag \p mpi_tag within the
+communicator \p comm.
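+
+A sketch of a blocking exchange between ranks 0 and 1 (handle is a
+hypothetical registered data handle, and 42 an arbitrary message tag):
+
+```c
+int rank;
+MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+if (rank == 0)
+	starpu_mpi_send(handle, 1, 42, MPI_COMM_WORLD);
+else if (rank == 1)
+	starpu_mpi_recv(handle, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+```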
+
+\fn int starpu_mpi_isend (starpu_data_handle_t data_handle, starpu_mpi_req *req, int dest, int mpi_tag, MPI_Comm comm)
+\ingroup API_MPI_Support
+Posts a standard-mode, non-blocking send of \p data_handle to the node
+\p dest using the message tag \p mpi_tag within the communicator \p
+comm. After the call, the pointer to the request \p req can be used to
+test or to wait for the completion of the communication.
+
+\fn int starpu_mpi_irecv (starpu_data_handle_t data_handle, starpu_mpi_req *req, int source, int mpi_tag, MPI_Comm comm)
+\ingroup API_MPI_Support
+Posts a nonblocking receive in \p data_handle from the node \p source
+using the message tag \p mpi_tag within the communicator \p comm.
+After the call, the pointer to the request \p req can be used to test
+or to wait for the completion of the communication.
+
+\fn int starpu_mpi_isend_detached (starpu_data_handle_t data_handle, int dest, int mpi_tag, MPI_Comm comm, void (*callback)(void *), void *arg)
+\ingroup API_MPI_Support
+Posts a standard-mode, non-blocking send of \p data_handle to the node
+\p dest using the message tag \p mpi_tag within the communicator \p
+comm. On completion, the \p callback function is called with the
+argument \p arg.
+Similarly to the pthread detached functionality, when a detached
+communication completes, its resources are automatically released back
+to the system, there is no need to test or to wait for the completion
+of the request.
+
+\fn int starpu_mpi_irecv_detached (starpu_data_handle_t data_handle, int source, int mpi_tag, MPI_Comm comm, void (*callback)(void *), void *arg)
+\ingroup API_MPI_Support
+Posts a nonblocking receive in \p data_handle from the node \p source
+using the message tag \p mpi_tag within the communicator \p comm. On
+completion, the \p callback function is called with the argument \p
+arg.
+Similarly to the pthread detached functionality, when a detached
+communication completes, its resources are automatically released back
+to the system, there is no need to test or to wait for the completion
+of the request.
+
+\fn int starpu_mpi_wait (starpu_mpi_req *req, MPI_Status *status)
+\ingroup API_MPI_Support
+Returns when the operation identified by request \p req is complete.
+
+\fn int starpu_mpi_test (starpu_mpi_req *req, int *flag, MPI_Status *status)
+\ingroup API_MPI_Support
+If the operation identified by \p req is complete, set \p flag to 1.
+The \p status object is set to contain information on the completed
+operation.
+
+\fn int starpu_mpi_barrier (MPI_Comm comm)
+\ingroup API_MPI_Support
+Blocks the caller until all group members of the communicator \p comm
+have called it.
+
+\fn int starpu_mpi_isend_detached_unlock_tag (starpu_data_handle_t data_handle, int dest, int mpi_tag, MPI_Comm comm, starpu_tag_t tag)
+\ingroup API_MPI_Support
+Posts a standard-mode, non-blocking send of \p data_handle to the node
+\p dest using the message tag \p mpi_tag within the communicator \p
+comm. On completion, \p tag is unlocked.
+
+\fn int starpu_mpi_irecv_detached_unlock_tag (starpu_data_handle_t data_handle, int source, int mpi_tag, MPI_Comm comm, starpu_tag_t tag)
+\ingroup API_MPI_Support
+Posts a nonblocking receive in \p data_handle from the node \p source
+using the message tag \p mpi_tag within the communicator \p comm. On
+completion, \p tag is unlocked.
+
+\fn int starpu_mpi_isend_array_detached_unlock_tag (unsigned array_size, starpu_data_handle_t *data_handle, int *dest, int *mpi_tag, MPI_Comm *comm, starpu_tag_t tag)
+\ingroup API_MPI_Support
+Posts \p array_size standard-mode, nonblocking sends. The n-th send
+transfers the n-th data of the array \p data_handle to the n-th node
+of the array \p dest using the n-th message tag of the array \p
+mpi_tag within the n-th communicator of the array \p comm. On
+completion of all the requests, \p tag is unlocked.
+
+\fn int starpu_mpi_irecv_array_detached_unlock_tag (unsigned array_size, starpu_data_handle_t *data_handle, int *source, int *mpi_tag, MPI_Comm *comm, starpu_tag_t tag)
+\ingroup API_MPI_Support
+Posts \p array_size nonblocking receives. The n-th receive stores in
+the n-th data of the array \p data_handle the message from the n-th
+node of the array \p source using the n-th message tag of the array \p
+mpi_tag within the n-th communicator of the array \p comm. On
+completion of all the requests, \p tag is unlocked.
+
+@name Communication Cache
+\ingroup API_MPI_Support
+
+\fn void starpu_mpi_cache_flush (MPI_Comm comm, starpu_data_handle_t data_handle)
+\ingroup API_MPI_Support
+Clear the send and receive communication cache for the data
+\p data_handle. The function has to be called synchronously by all the
+MPI nodes. The function does nothing if the cache mechanism is
+disabled (see \ref STARPU_MPI_CACHE).
+
+\fn void starpu_mpi_cache_flush_all_data (MPI_Comm comm)
+\ingroup API_MPI_Support
+Clear the send and receive communication cache for all data. The
+function has to be called synchronously by all the MPI nodes. The
+function does nothing if the cache mechanism is disabled (see
+\ref STARPU_MPI_CACHE).
+
+@name MPI Insert Task
+\anchor MPIInsertTask
+\ingroup API_MPI_Support
+
+\fn int starpu_data_set_tag (starpu_data_handle_t handle, int tag)
+\ingroup API_MPI_Support
+Tell StarPU-MPI which MPI tag to use when exchanging the data.
+
+\fn int starpu_data_get_tag (starpu_data_handle_t handle)
+\ingroup API_MPI_Support
+Returns the MPI tag to be used when exchanging the data.
+
+\fn int starpu_data_set_rank (starpu_data_handle_t handle, int rank)
+\ingroup API_MPI_Support
+Tell StarPU-MPI which MPI node "owns" a given data, that is, the node
+which will always keep an up-to-date value, and will by default
+execute tasks which write to it.
+
+\fn int starpu_data_get_rank (starpu_data_handle_t handle)
+\ingroup API_MPI_Support
+Returns the last value set by starpu_data_set_rank().
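+
+For instance, after registering a piece of data, one typically defines its
+owner and the MPI tag used for its transfers (a sketch; the rank and tag
+values are arbitrary):
+\code{.c}
+starpu_data_handle_t handle;
+starpu_vector_data_register(&handle, 0, (uintptr_t)vector, NX, sizeof(float));
+starpu_data_set_rank(handle, owner_rank); /* node keeping the up-to-date value */
+starpu_data_set_tag(handle, 42);          /* MPI tag used when exchanging it */
+\endcode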
+
+\def STARPU_EXECUTE_ON_NODE
+\ingroup API_MPI_Support
+This macro is used when calling starpu_mpi_insert_task(), and must be
+followed by an integer value which specifies the node on which to
+execute the codelet.
+
+\def STARPU_EXECUTE_ON_DATA
+\ingroup API_MPI_Support
+This macro is used when calling starpu_mpi_insert_task(), and must be
+followed by a data handle to specify that the node owning the given
+data will execute the codelet.
+
+\fn int starpu_mpi_insert_task (MPI_Comm comm, struct starpu_codelet *codelet, ...)
+\ingroup API_MPI_Support
+Create and submit a task corresponding to \p codelet with the following
+arguments. The argument list must be zero-terminated.
+
+The arguments following the codelet are of the same types as for the
+function starpu_insert_task(). The extra argument
+::STARPU_EXECUTE_ON_NODE followed by an integer allows specifying the
+MPI node on which to execute the codelet. It is also possible to specify
+that the node owning a specific data will execute the codelet, by using
+::STARPU_EXECUTE_ON_DATA followed by a data handle.
+
+The internal algorithm is as follows:
+<ol>
+<li>
+        Find out which MPI node is going to execute the codelet.
+        <ul>
+            <li>If only one node owns data in ::STARPU_W mode, that node is selected;
+            <li>If several nodes own data in ::STARPU_W mode, the node selected is the one owning the least data in ::STARPU_R mode, so as to minimize the amount of data to be transferred;
+            <li>The argument ::STARPU_EXECUTE_ON_NODE followed by an integer can be used to specify the node;
+            <li>The argument ::STARPU_EXECUTE_ON_DATA followed by a data handle can be used to specify that the node owning the given data will execute the codelet.
+        </ul>
+</li>
+<li>
+        Send and receive data as requested. Nodes owning data which needs to be read by the task send it to the MPI node which will execute the task; the latter receives it.
+</li>
+<li>
+        Execute the codelet. This is done by the MPI node selected in the 1st step of the algorithm.
+</li>
+<li>
+        If several MPI nodes own data to be written to, send written data back to their owners.
+</li>
+</ol>
+
+The algorithm also includes a communication cache mechanism that
+avoids sending data twice to the same MPI node, unless the data has
+been modified. The cache can be disabled (see \ref STARPU_MPI_CACHE).
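+
+As an example, one can let the owner of the written data execute the task,
+or force execution on a given node (a sketch assuming a codelet \c cl and
+registered handles \c handle_r and \c handle_w):
+\code{.c}
+/* Let the node owning handle_w execute the task: */
+starpu_mpi_insert_task(MPI_COMM_WORLD, &cl,
+                       STARPU_R, handle_r, STARPU_W, handle_w,
+                       0);
+
+/* Or force execution on MPI node 2: */
+starpu_mpi_insert_task(MPI_COMM_WORLD, &cl,
+                       STARPU_R, handle_r, STARPU_W, handle_w,
+                       STARPU_EXECUTE_ON_NODE, 2,
+                       0);
+\endcode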
+
+\fn void starpu_mpi_get_data_on_node (MPI_Comm comm, starpu_data_handle_t data_handle, int node)
+\ingroup API_MPI_Support
+Transfer data \p data_handle to MPI node \p node, sending it from its
+owner if needed. At least the target node and the owner have to call
+the function.
+
+\fn void starpu_mpi_get_data_on_node_detached (MPI_Comm comm, starpu_data_handle_t data_handle, int node, void (*callback)(void*), void *arg)
+\ingroup API_MPI_Support
+Transfer data \p data_handle to MPI node \p node, sending it from its
+owner if needed. At least the target node and the owner have to call
+the function. On reception, the \p callback function is called with
+the argument \p arg.
+
+@name Collective Operations
+\anchor MPICollectiveOperations
+\ingroup API_MPI_Support
+
+\fn void starpu_mpi_redux_data (MPI_Comm comm, starpu_data_handle_t data_handle)
+\ingroup API_MPI_Support
+Perform a reduction on the given data. All nodes send their
+contribution to the owner node, which performs the reduction.
+
+\fn int starpu_mpi_scatter_detached (starpu_data_handle_t *data_handles, int count, int root, MPI_Comm comm, void (*scallback)(void *), void *sarg, void (*rcallback)(void *), void *rarg)
+\ingroup API_MPI_Support
+Scatter data among processes of the communicator based on the
+ownership of the data. For each data of the array \p data_handles, the
+process \p root sends the data to the process owning this data. Processes
+receiving data must have valid data handles to receive them. On
+completion of the collective communication, the \p scallback function is
+called with the argument \p sarg on the process \p root, the \p
+rcallback function is called with the argument \p rarg on any other
+process.
+
+\fn int starpu_mpi_gather_detached (starpu_data_handle_t *data_handles, int count, int root, MPI_Comm comm, void (*scallback)(void *), void *sarg, void (*rcallback)(void *), void *rarg)
+\ingroup API_MPI_Support
+Gather data from the different processes of the communicator onto the
+process \p root. Each process owning data handle in the array
+\p data_handles will send them to the process \p root. The process \p
+root must have valid data handles to receive the data. On completion
+of the collective communication, the \p rcallback function is called
+with the argument \p rarg on the process root, the \p scallback
+function is called with the argument \p sarg on any other process.
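+
+For instance, scattering blocks from the root node and gathering them back
+after computation (a sketch; passing <c>NULL</c> callbacks assumes no
+completion notification is needed):
+\code{.c}
+/* On all nodes, data_handles[x] holds a valid handle for the blocks
+ * the node owns or will receive. */
+starpu_mpi_scatter_detached(data_handles, nblocks, 0, MPI_COMM_WORLD,
+                            NULL, NULL, NULL, NULL);
+/* ... submit tasks working on the local blocks ... */
+starpu_mpi_gather_detached(data_handles, nblocks, 0, MPI_COMM_WORLD,
+                           NULL, NULL, NULL, NULL);
+\endcode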
+
+*/

+ 71 - 0
doc/doxygen/chapters/api/multiformat_data_interface.doxy

@@ -0,0 +1,71 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Multiformat_Data_Interface Multiformat Data Interface
+
+\struct starpu_multiformat_data_interface_ops
+\ingroup API_Multiformat_Data_Interface
+The different fields are:
+\var starpu_multiformat_data_interface_ops::cpu_elemsize
+        the size of each element on CPUs
+\var starpu_multiformat_data_interface_ops::opencl_elemsize
+        the size of each element on OpenCL devices
+\var starpu_multiformat_data_interface_ops::cpu_to_opencl_cl
+        pointer to a codelet which converts from CPU to OpenCL
+\var starpu_multiformat_data_interface_ops::opencl_to_cpu_cl
+        pointer to a codelet which converts from OpenCL to CPU
+\var starpu_multiformat_data_interface_ops::cuda_elemsize
+        the size of each element on CUDA devices
+\var starpu_multiformat_data_interface_ops::cpu_to_cuda_cl
+        pointer to a codelet which converts from CPU to CUDA
+\var starpu_multiformat_data_interface_ops::cuda_to_cpu_cl
+        pointer to a codelet which converts from CUDA to CPU
+\var starpu_multiformat_data_interface_ops::mic_elemsize
+        the size of each element on MIC devices
+\var starpu_multiformat_data_interface_ops::cpu_to_mic_cl
+        pointer to a codelet which converts from CPU to MIC
+\var starpu_multiformat_data_interface_ops::mic_to_cpu_cl
+        pointer to a codelet which converts from MIC to CPU
+
+\struct starpu_multiformat_interface
+todo
+\ingroup API_Multiformat_Data_Interface
+\var starpu_multiformat_interface::id
+\var starpu_multiformat_interface::cpu_ptr
+\var starpu_multiformat_interface::cuda_ptr
+\var starpu_multiformat_interface::opencl_ptr
+\var starpu_multiformat_interface::mic_ptr
+\var starpu_multiformat_interface::nx
+\var starpu_multiformat_interface::ops
+
+\fn void starpu_multiformat_data_register(starpu_data_handle_t *handle, unsigned home_node, void *ptr, uint32_t nobjects, struct starpu_multiformat_data_interface_ops *format_ops)
+\ingroup API_Multiformat_Data_Interface
+Register a piece of data that can be represented in different
+ways, depending upon the processing unit that manipulates it. It
+allows the programmer, for instance, to use an array of structures
+when working on a CPU, and a structure of arrays when working on a
+GPU. \p nobjects is the number of elements in the data. \p format_ops
+describes the format.
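+
+For instance (a sketch; the element types and the conversion codelets
+<c>cpu_to_cuda_cl</c> and <c>cuda_to_cpu_cl</c> are application-defined):
+\code{.c}
+static struct starpu_multiformat_data_interface_ops format_ops =
+{
+    .cpu_elemsize   = sizeof(struct point),            /* array of structures on CPU */
+    .cuda_elemsize  = sizeof(struct struct_of_arrays), /* structure of arrays on GPU */
+    .cpu_to_cuda_cl = &cpu_to_cuda_cl,
+    .cuda_to_cpu_cl = &cuda_to_cpu_cl,
+};
+
+starpu_data_handle_t handle;
+starpu_multiformat_data_register(&handle, 0, data, N_ELEMENTS, &format_ops);
+\endcode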
+
+\def STARPU_MULTIFORMAT_GET_CPU_PTR(void *interface)
+\ingroup API_Multiformat_Data_Interface
+returns the local pointer to the data with CPU format.
+
+\def STARPU_MULTIFORMAT_GET_CUDA_PTR(void *interface)
+\ingroup API_Multiformat_Data_Interface
+returns the local pointer to the data with CUDA format.
+
+\def STARPU_MULTIFORMAT_GET_OPENCL_PTR(void *interface)
+\ingroup API_Multiformat_Data_Interface
+returns the local pointer to the data with OpenCL format.
+
+\def STARPU_MULTIFORMAT_GET_NX (void *interface)
+\ingroup API_Multiformat_Data_Interface
+returns the number of elements in the data.
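+
+A device-specific codelet implementation can then retrieve the layout it
+expects through these macros (a sketch; <c>struct struct_of_arrays</c> is a
+hypothetical CUDA-side type):
+\code{.c}
+void multiformat_cuda_func(void *buffers[], void *arg)
+{
+    unsigned n = STARPU_MULTIFORMAT_GET_NX(buffers[0]);
+    struct struct_of_arrays *soa = STARPU_MULTIFORMAT_GET_CUDA_PTR(buffers[0]);
+    /* ... launch a kernel working on the n elements of soa ... */
+}
+\endcode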
+
+*/

+ 249 - 0
doc/doxygen/chapters/api/opencl_extensions.doxy

@@ -0,0 +1,249 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_OpenCL_Extensions OpenCL Extensions
+
+\def STARPU_USE_OPENCL
+\ingroup API_OpenCL_Extensions
+This macro is defined when StarPU has been installed with
+OpenCL support. It should be used in your code to detect the
+availability of OpenCL as shown in \ref FullSourceCodeVectorScal.
+
+\struct starpu_opencl_program
+\ingroup API_OpenCL_Extensions
+Stores the OpenCL programs as compiled for the different OpenCL
+devices.
+\var starpu_opencl_program::programs
+Stores each program for each OpenCL device.
+
+@name Writing OpenCL kernels
+\ingroup API_OpenCL_Extensions
+
+\fn void starpu_opencl_get_context(int devid, cl_context *context)
+\ingroup API_OpenCL_Extensions
+Places the OpenCL context of the device designated by \p devid
+into \p context.
+
+\fn void starpu_opencl_get_device(int devid, cl_device_id *device)
+\ingroup API_OpenCL_Extensions
+Places the cl_device_id corresponding to \p devid in \p device.
+
+\fn void starpu_opencl_get_queue(int devid, cl_command_queue *queue)
+\ingroup API_OpenCL_Extensions
+Places the command queue of the device designated by \p devid
+into \p queue.
+
+\fn void starpu_opencl_get_current_context(cl_context *context)
+\ingroup API_OpenCL_Extensions
+Return the context of the current worker.
+
+\fn void starpu_opencl_get_current_queue(cl_command_queue *queue)
+\ingroup API_OpenCL_Extensions
+Return the computation kernel command queue of the current
+worker.
+
+\fn int starpu_opencl_set_kernel_args(cl_int *err, cl_kernel *kernel, ...)
+\ingroup API_OpenCL_Extensions
+Sets the arguments of a given kernel. The list of arguments
+must be given as <c>(size_t size_of_the_argument, cl_mem *
+pointer_to_the_argument)</c> pairs, and terminated by 0. On success,
+returns the number of arguments that were set. In case of failure,
+returns the index of the argument that could not be set, and \p err is
+set to the error returned by OpenCL.
+
+Here is an example:
+\code{.c}
+int n;
+cl_int err;
+cl_kernel kernel;
+cl_mem foo, bar;
+/* ... obtain kernel, foo and bar ... */
+n = starpu_opencl_set_kernel_args(&err, &kernel,
+                                  sizeof(foo), &foo,
+                                  sizeof(bar), &bar,
+                                  0);
+if (n != 2)
+   fprintf(stderr, "Error: %d\n", err);
+\endcode
+
+@name Compiling OpenCL kernels
+\ingroup API_OpenCL_Extensions
+
+Source codes for OpenCL kernels can be stored in a file or in a
+string. StarPU provides functions to build the program executable for
+each available OpenCL device as a cl_program object. This program
+executable can then be loaded within a specific queue as explained in
+the next section. These are only helpers; applications can also fill a
+starpu_opencl_program array by hand for more advanced uses (e.g.
+different programs on the different OpenCL devices, for relocation
+purposes).
+
+\fn int starpu_opencl_load_opencl_from_file(const char *source_file_name, struct starpu_opencl_program *opencl_programs, const char* build_options)
+\ingroup API_OpenCL_Extensions
+This function compiles an OpenCL source code stored in a file.
+
+\fn int starpu_opencl_load_opencl_from_string(const char *opencl_program_source, struct starpu_opencl_program *opencl_programs, const char* build_options)
+\ingroup API_OpenCL_Extensions
+This function compiles an OpenCL source code stored in a string.
+
+\fn int starpu_opencl_unload_opencl(struct starpu_opencl_program *opencl_programs)
+\ingroup API_OpenCL_Extensions
+This function unloads an OpenCL compiled code.
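+
+For instance (a sketch; passing <c>NULL</c> build options assumes the
+defaults are sufficient):
+\code{.c}
+struct starpu_opencl_program opencl_programs;
+
+starpu_opencl_load_opencl_from_file("vector_scal_kernel.cl",
+                                    &opencl_programs, NULL);
+/* ... submit tasks whose codelets load kernels from opencl_programs ... */
+starpu_opencl_unload_opencl(&opencl_programs);
+\endcode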
+
+\fn void starpu_opencl_load_program_source(const char *source_file_name, char *located_file_name, char *located_dir_name, char *opencl_program_source)
+\ingroup API_OpenCL_Extensions
+Store the contents of the file \p source_file_name in the buffer
+\p opencl_program_source. The file \p source_file_name can be located in the
+current directory, or in the directory specified by the environment
+variable \ref STARPU_OPENCL_PROGRAM_DIR, or
+in the directory <c>share/starpu/opencl</c> of the installation
+directory of StarPU, or in the source directory of StarPU. When the
+file is found, \p located_file_name is the full name of the file as it
+has been located on the system, \p located_dir_name the directory
+where it has been located. Otherwise, they are both set to the empty
+string.
+
+\fn int starpu_opencl_compile_opencl_from_file(const char *source_file_name, const char * build_options)
+\ingroup API_OpenCL_Extensions
+Compile the OpenCL kernel stored in the file \p source_file_name
+with the given options \p build_options and stores the result in the
+directory <c>$STARPU_HOME/.starpu/opencl</c> with the same filename as
+\p source_file_name. The compilation is done for every OpenCL device,
+and the filename is suffixed with the vendor id and the device id of
+the OpenCL device.
+
+\fn int starpu_opencl_compile_opencl_from_string(const char *opencl_program_source, const char *file_name, const char*build_options)
+\ingroup API_OpenCL_Extensions
+Compile the OpenCL kernel in the string \p opencl_program_source
+with the given options \p build_options and stores the result in the
+directory <c>$STARPU_HOME/.starpu/opencl</c> with the filename \p
+file_name. The compilation is done for every OpenCL device, and the
+filename is suffixed with the vendor id and the device id of the
+OpenCL device.
+
+\fn int starpu_opencl_load_binary_opencl(const char *kernel_id, struct starpu_opencl_program *opencl_programs)
+\ingroup API_OpenCL_Extensions
+Compile the binary OpenCL kernel identified with \p kernel_id.
+For every OpenCL device, the binary OpenCL kernel will be loaded from
+the file
+<c>$STARPU_HOME/.starpu/opencl/\<kernel_id\>.\<device_type\>.vendor_id_\<vendor_id\>_device_id_\<device_id\></c>.
+
+@name Loading OpenCL kernels
+\ingroup API_OpenCL_Extensions
+
+\fn int starpu_opencl_load_kernel(cl_kernel *kernel, cl_command_queue *queue, struct starpu_opencl_program *opencl_programs, const char *kernel_name, int devid)
+\ingroup API_OpenCL_Extensions
+Create a kernel \p kernel for device \p devid, on its computation
+command queue returned in \p queue, using program \p opencl_programs
+and name \p kernel_name.
+
+\fn int starpu_opencl_release_kernel(cl_kernel kernel)
+\ingroup API_OpenCL_Extensions
+Release the given \p kernel, to be called after kernel execution.
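+
+Putting it together, a typical OpenCL codelet loads the kernel on the
+current device, enqueues it, and releases it (a sketch with argument
+setting and error handling elided; \c opencl_programs is assumed to have
+been filled beforehand):
+\code{.c}
+void opencl_func(void *buffers[], void *arg)
+{
+    cl_kernel kernel;
+    cl_command_queue queue;
+    int devid = starpu_worker_get_devid(starpu_worker_get_id());
+
+    starpu_opencl_load_kernel(&kernel, &queue, &opencl_programs,
+                              "vector_scal", devid);
+    /* ... set the kernel arguments and call clEnqueueNDRangeKernel()
+     * on queue ... */
+    clFinish(queue);
+    starpu_opencl_release_kernel(kernel);
+}
+\endcode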
+
+@name OpenCL statistics
+\ingroup API_OpenCL_Extensions
+
+\fn int starpu_opencl_collect_stats(cl_event event)
+\ingroup API_OpenCL_Extensions
+This function allows collecting statistics on a kernel execution.
+After termination of the kernels, the OpenCL codelet should call this
+function with the event returned by clEnqueueNDRangeKernel(), to
+let StarPU collect statistics about the kernel execution (used cycles,
+consumed power).
+
+@name OpenCL utilities
+\ingroup API_OpenCL_Extensions
+
+\fn const char * starpu_opencl_error_string(cl_int status)
+\ingroup API_OpenCL_Extensions
+Return the error message in English corresponding to \p status, an OpenCL
+error code.
+
+\fn void starpu_opencl_display_error(const char *func, const char *file, int line, const char *msg, cl_int status)
+\ingroup API_OpenCL_Extensions
+Given a valid error status, prints the corresponding error message on
+stdout, along with the given function name \p func, the given filename
+\p file, the given line number \p line and the given message \p msg.
+
+\def STARPU_OPENCL_DISPLAY_ERROR(cl_int status)
+\ingroup API_OpenCL_Extensions
+Call the function starpu_opencl_display_error() with the given error
+\p status, the current function name, current file and line number,
+and an empty message.
+
+\fn void starpu_opencl_report_error(const char *func, const char *file, int line, const char *msg, cl_int status)
+\ingroup API_OpenCL_Extensions
+Call the function starpu_opencl_display_error() and abort.
+
+\def STARPU_OPENCL_REPORT_ERROR (cl_int status)
+\ingroup API_OpenCL_Extensions
+Call the function starpu_opencl_report_error() with the given error \p
+status, with the current function name, current file and line number,
+and an empty message.
+
+\def STARPU_OPENCL_REPORT_ERROR_WITH_MSG(const char *msg, cl_int status)
+\ingroup API_OpenCL_Extensions
+Call the function starpu_opencl_report_error() with the given \p msg
+and the given error \p status, with the current function name, current
+file and line number.
+
+\fn cl_int starpu_opencl_allocate_memory(cl_mem *addr, size_t size, cl_mem_flags flags)
+\ingroup API_OpenCL_Extensions
+Allocate \p size bytes of memory, stored in \p addr. \p flags must be a valid
+combination of cl_mem_flags values.
+
+\fn cl_int starpu_opencl_copy_ram_to_opencl(void *ptr, unsigned src_node, cl_mem buffer, unsigned dst_node, size_t size, size_t offset, cl_event *event, int *ret)
+\ingroup API_OpenCL_Extensions
+Copy \p size bytes from the given \p ptr on RAM \p src_node to the
+given \p buffer on OpenCL \p dst_node. \p offset is the offset, in
+bytes, in \p buffer. If \p event is <c>NULL</c>, the copy is
+synchronous, i.e. the queue is synchronised before returning. If not
+<c>NULL</c>, \p event can be used after the call to wait for this
+particular copy to complete. This function returns <c>CL_SUCCESS</c>
+if the copy was successful, or a valid OpenCL error code otherwise.
+The integer pointed to by \p ret is set to <c>-EAGAIN</c> if the
+asynchronous launch was successful, or to 0 if \p event was
+<c>NULL</c>.
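+
+For instance, an asynchronous copy followed by a wait on its event
+(a sketch assuming valid nodes and buffers; <c>-EAGAIN</c> requires
+<c>errno.h</c>):
+\code{.c}
+cl_event event;
+int ret;
+cl_int err = starpu_opencl_copy_ram_to_opencl(ptr, src_node, buffer, dst_node,
+                                              size, 0, &event, &ret);
+if (err != CL_SUCCESS)
+    STARPU_OPENCL_REPORT_ERROR(err);
+if (ret == -EAGAIN)
+{
+    /* Asynchronous launch: wait for this particular copy to complete. */
+    clWaitForEvents(1, &event);
+    clReleaseEvent(event);
+}
+\endcode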
+
+\fn cl_int starpu_opencl_copy_opencl_to_ram(cl_mem buffer, unsigned src_node, void *ptr, unsigned dst_node, size_t size, size_t offset, cl_event *event, int *ret)
+\ingroup API_OpenCL_Extensions
+Copy \p size bytes asynchronously from the given \p buffer on OpenCL
+\p src_node to the given \p ptr on RAM \p dst_node. \p offset is the
+offset, in bytes, in \p buffer. If \p event is <c>NULL</c>, the copy
+is synchronous, i.e. the queue is synchronised before returning. If not
+<c>NULL</c>, \p event can be used after the call to wait for this
+particular copy to complete. This function returns <c>CL_SUCCESS</c>
+if the copy was successful, or a valid OpenCL error code otherwise.
+The integer pointed to by \p ret is set to <c>-EAGAIN</c> if the
+asynchronous launch was successful, or to 0 if \p event was
+<c>NULL</c>.
+
+\fn cl_int starpu_opencl_copy_opencl_to_opencl(cl_mem src, unsigned src_node, size_t src_offset, cl_mem dst, unsigned dst_node, size_t dst_offset, size_t size, cl_event *event, int *ret)
+\ingroup API_OpenCL_Extensions
+Copy \p size bytes asynchronously from byte offset \p src_offset of \p
+src on OpenCL \p src_node to byte offset \p dst_offset of \p dst on
+OpenCL \p dst_node. If \p event is <c>NULL</c>, the copy is
+synchronous, i.e. the queue is synchronised before returning. If not
+<c>NULL</c>, \p event can be used after the call to wait for this
+particular copy to complete. This function returns <c>CL_SUCCESS</c>
+if the copy was successful, or a valid OpenCL error code otherwise.
+The integer pointed to by \p ret is set to <c>-EAGAIN</c> if the
+asynchronous launch was successful, or to 0 if \p event was
+<c>NULL</c>.
+
+\fn cl_int starpu_opencl_copy_async_sync(uintptr_t src, size_t src_offset, unsigned src_node, uintptr_t dst, size_t dst_offset, unsigned dst_node, size_t size, cl_event *event)
+\ingroup API_OpenCL_Extensions
+Copy \p size bytes from byte offset \p src_offset of \p src on \p
+src_node to byte offset \p dst_offset of \p dst on \p dst_node. If \p
+event is <c>NULL</c>, the copy is synchronous, i.e. the queue is
+synchronised before returning. If not <c>NULL</c>, \p event can be
+used after the call to wait for this particular copy to complete. The
+function returns <c>-EAGAIN</c> if the asynchronous launch was
+successful. It returns 0 if the synchronous copy was successful, and
+fails otherwise.
+
+*/

+ 51 - 0
doc/doxygen/chapters/api/parallel_tasks.doxy

@@ -0,0 +1,51 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Parallel_Tasks Parallel Tasks
+
+\fn int starpu_combined_worker_get_size(void)
+\ingroup API_Parallel_Tasks
+Return the size of the current combined worker, i.e. the total number
+of CPUs running the same task in the case of ::STARPU_SPMD parallel
+tasks, or the total number of threads that the task is allowed to
+start in the case of ::STARPU_FORKJOIN parallel tasks.
+
+\fn int starpu_combined_worker_get_rank(void)
+\ingroup API_Parallel_Tasks
+Return the rank of the current thread within the combined worker. Can
+only be used in ::STARPU_SPMD parallel tasks, to know which part
+of the task to work on.
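+
+For instance, a CPU implementation run as a parallel task can partition its
+work over the members of the combined worker using the size and rank (a
+minimal sketch operating on a vector handle):
+\code{.c}
+void spmd_cpu_func(void *buffers[], void *arg)
+{
+    int size = starpu_combined_worker_get_size();
+    int rank = starpu_combined_worker_get_rank();
+
+    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
+    float *v = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
+
+    /* Each member works on its own contiguous chunk. */
+    unsigned chunk = (n + size - 1) / size;
+    unsigned begin = rank * chunk;
+    unsigned end = begin + chunk;
+    if (end > n)
+        end = n;
+    for (unsigned i = begin; i < end; i++)
+        v[i] *= 2;
+}
+\endcode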
+
+\fn unsigned starpu_combined_worker_get_count(void)
+\ingroup API_Parallel_Tasks
+Return the number of different combined workers.
+
+\fn int starpu_combined_worker_get_id(void)
+\ingroup API_Parallel_Tasks
+Return the identifier of the current combined worker.
+
+\fn int starpu_combined_worker_assign_workerid(int nworkers, int workerid_array[])
+\ingroup API_Parallel_Tasks
+Register a new combined worker and get its identifier.
+
+\fn int starpu_combined_worker_get_description(int workerid, int *worker_size, int **combined_workerid)
+\ingroup API_Parallel_Tasks
+Get the description of a combined worker.
+
+\fn int starpu_combined_worker_can_execute_task(unsigned workerid, struct starpu_task *task, unsigned nimpl)
+\ingroup API_Parallel_Tasks
+Variant of starpu_worker_can_execute_task() compatible with combined
+workers.
+
+\fn void starpu_parallel_task_barrier_init(struct starpu_task*task, int workerid)
+\ingroup API_Parallel_Tasks
+Initialise the barrier for the parallel task, and dispatch the task
+between the different combined workers.
+
+*/
+

+ 271 - 0
doc/doxygen/chapters/api/performance_model.doxy

@@ -0,0 +1,271 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Performance_Model Performance Model
+
+\enum starpu_perfmodel_archtype
+\ingroup API_Performance_Model
+Enumerates the various types of architectures.
+
+It is possible to have multiple workers of the same kind, for
+instance multiple GPUs or even different CPUs within the same machine,
+so we do not use the archtype enum type directly for performance
+models.
+
+<ul>
+<li> CPU types range within ::STARPU_CPU_DEFAULT (1 CPU),
+::STARPU_CPU_DEFAULT+1 (2 CPUs), ... ::STARPU_CPU_DEFAULT +
+STARPU_MAXCPUS - 1 (STARPU_MAXCPUS CPUs).
+</li>
+<li> CUDA types range within ::STARPU_CUDA_DEFAULT (GPU number 0),
+::STARPU_CUDA_DEFAULT + 1 (GPU number 1), ..., ::STARPU_CUDA_DEFAULT +
+STARPU_MAXCUDADEVS - 1 (GPU number STARPU_MAXCUDADEVS - 1).
+</li>
+<li> OpenCL types range within ::STARPU_OPENCL_DEFAULT (GPU number
+0), ::STARPU_OPENCL_DEFAULT + 1 (GPU number 1), ...,
+::STARPU_OPENCL_DEFAULT + STARPU_MAXOPENCLDEVS - 1 (GPU number
+STARPU_MAXOPENCLDEVS - 1).
+</ul>
+\var starpu_perfmodel_archtype::STARPU_CPU_DEFAULT
+\ingroup API_Performance_Model
+CPU combined workers between 0 and STARPU_MAXCPUS-1
+\var starpu_perfmodel_archtype::STARPU_CUDA_DEFAULT
+\ingroup API_Performance_Model
+CUDA workers
+\var starpu_perfmodel_archtype::STARPU_OPENCL_DEFAULT
+\ingroup API_Performance_Model
+OpenCL workers
+\var starpu_perfmodel_archtype::STARPU_MIC_DEFAULT
+\ingroup API_Performance_Model
+MIC workers
+\var starpu_perfmodel_archtype::STARPU_SCC_DEFAULT
+\ingroup API_Performance_Model
+SCC workers
+
+\enum starpu_perfmodel_type
+\ingroup API_Performance_Model
+TODO
+\var starpu_perfmodel_type::STARPU_PER_ARCH
+\ingroup API_Performance_Model
+Application-provided per-arch cost model function
+\var starpu_perfmodel_type::STARPU_COMMON
+\ingroup API_Performance_Model
+Application-provided common cost model function, with per-arch factor
+\var starpu_perfmodel_type::STARPU_HISTORY_BASED
+\ingroup API_Performance_Model
+Automatic history-based cost model
+\var starpu_perfmodel_type::STARPU_REGRESSION_BASED
+\ingroup API_Performance_Model
+Automatic linear regression-based cost model  (alpha * size ^ beta)
+\var starpu_perfmodel_type::STARPU_NL_REGRESSION_BASED
+\ingroup API_Performance_Model
+Automatic non-linear regression-based cost model (a * size ^ b + c)
+
+\struct starpu_perfmodel
+Contains all information about a performance model. At least the
+type and symbol fields have to be filled when defining a performance
+model for a codelet. For compatibility, make sure to initialize the
+whole structure to zero, either by using an explicit memset(), or by
+letting the compiler implicitly do it, e.g. in the static storage
+case. Fields which are not used have to be zero.
+\ingroup API_Performance_Model
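+
+For instance, a history-based model attached to a codelet (a minimal
+sketch; the function and symbol names are hypothetical):
+\code{.c}
+static struct starpu_perfmodel mult_perf_model =
+{
+    .type = STARPU_HISTORY_BASED,
+    .symbol = "mult_perf_model"
+};
+
+struct starpu_codelet cl =
+{
+    .cpu_funcs = { mult_cpu_func, NULL },
+    .nbuffers = 3,
+    /* Performance model measured and stored under the given symbol. */
+    .model = &mult_perf_model
+};
+\endcode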
+\var starpu_perfmodel::type
+is the type of performance model
+<ul>
+<li>::STARPU_HISTORY_BASED, ::STARPU_REGRESSION_BASED,
+::STARPU_NL_REGRESSION_BASED: no other field needs to be provided,
+this is purely history-based.
+</li>
+<li> ::STARPU_PER_ARCH: field starpu_perfmodel::per_arch has to be
+filled with functions which return the cost in micro-seconds.
+</li>
+<li> ::STARPU_COMMON: field starpu_perfmodel::cost_function has to be
+filled with a function that returns the cost in micro-seconds on a
+CPU, timing on other archs will be determined by multiplying by an
+arch-specific factor.
+</li>
+</ul>
+\var starpu_perfmodel::symbol
+is the symbol name for the performance model, which will be used as a
+file name to store the model. It must be set, otherwise the model will
+be ignored.
+\var starpu_perfmodel::cost_model
+\deprecated
+This field is deprecated. Use the field starpu_perfmodel::cost_function instead.
+\var starpu_perfmodel::cost_function
+Used by ::STARPU_COMMON: takes a task and implementation number, and
+must return a task duration estimation in micro-seconds.
+\var starpu_perfmodel::size_base
+Used by ::STARPU_HISTORY_BASED, ::STARPU_REGRESSION_BASED and
+::STARPU_NL_REGRESSION_BASED. If not NULL, takes a task and
+implementation number, and returns the size to be used as index for
+history and regression.
+\var starpu_perfmodel::per_arch
+Used by ::STARPU_PER_ARCH: array of structures starpu_perfmodel_per_arch
+\var starpu_perfmodel::is_loaded
+\private
+Whether the performance model is already loaded from the disk.
+\var starpu_perfmodel::benchmarking
+\private
+Whether the performance model is still being calibrated.
+\var starpu_perfmodel::model_rwlock
+\private
+Lock to protect concurrency between loading from disk (W), updating
+the values (W), and making a performance estimation (R).
+
+\struct starpu_perfmodel_regression_model
+...
+\ingroup API_Performance_Model
+\var starpu_perfmodel_regression_model::sumlny
+sum of ln(measured)
+\var starpu_perfmodel_regression_model::sumlnx
+sum of ln(size)
+\var starpu_perfmodel_regression_model::sumlnx2
+sum of ln(size)^2
+\var starpu_perfmodel_regression_model::minx
+minimum size
+\var starpu_perfmodel_regression_model::maxx
+maximum size
+\var starpu_perfmodel_regression_model::sumlnxlny
+sum of ln(size)*ln(measured)
+\var starpu_perfmodel_regression_model::alpha
+estimated = alpha * size ^ beta
+\var starpu_perfmodel_regression_model::beta
+estimated = alpha * size ^ beta
+\var starpu_perfmodel_regression_model::valid
+whether the linear regression model is valid (i.e. enough measures)
+\var starpu_perfmodel_regression_model::a
+estimated = a size ^b + c
+\var starpu_perfmodel_regression_model::b
+estimated = a size ^b + c
+\var starpu_perfmodel_regression_model::c
+estimated = a size ^b + c
+\var starpu_perfmodel_regression_model::nl_valid
+whether the non-linear regression model is valid (i.e. enough measures)
+\var starpu_perfmodel_regression_model::nsample
+number of sample values for non-linear regression
+
+\struct starpu_perfmodel_per_arch
+contains information about the performance model of a given
+arch.
+\ingroup API_Performance_Model
+\var starpu_perfmodel_per_arch::cost_model
+\deprecated
+This field is deprecated. Use the field
+starpu_perfmodel_per_arch::cost_function instead.
+\var starpu_perfmodel_per_arch::cost_function
+Used by ::STARPU_PER_ARCH, must point to functions which take a task,
+the target arch and implementation number (as a mere convenience,
+since the array is already indexed by these), and must return a task
+duration estimation in micro-seconds.
+\var starpu_perfmodel_per_arch::size_base
+Same as in structure starpu_perfmodel, but per-arch, in case it
+depends on the architecture-specific implementation.
+\var starpu_perfmodel_per_arch::history
+\private
+The history of performance measurements.
+\var starpu_perfmodel_per_arch::list
+\private
+Used by ::STARPU_HISTORY_BASED and ::STARPU_NL_REGRESSION_BASED,
+records all execution history measures.
+\var starpu_perfmodel_per_arch::regression
+\private
+Used by ::STARPU_HISTORY_BASED and
+::STARPU_NL_REGRESSION_BASED, contains the estimated factors of the
+regression.
+
+\struct starpu_perfmodel_history_list
+todo
+\ingroup API_Performance_Model
+\var starpu_perfmodel_history_list::next
+todo
+\var starpu_perfmodel_history_list::entry
+todo
+
+\struct starpu_perfmodel_history_entry
+todo
+\ingroup API_Performance_Model
+\var starpu_perfmodel_history_entry::mean
+mean of the measurements: mean_n = (1/n) * sum
+\var starpu_perfmodel_history_entry::deviation
+deviation of the measurements: n * dev_n = sum2 - (1/n) * sum^2
+\var starpu_perfmodel_history_entry::sum
+sum of the measurements
+\var starpu_perfmodel_history_entry::sum2
+sum of the squared measurements
+\var starpu_perfmodel_history_entry::nsample
+todo
+\var starpu_perfmodel_history_entry::footprint
+todo
+\var starpu_perfmodel_history_entry::size
+in bytes
+\var starpu_perfmodel_history_entry::flops
+Provided by the application
+
+\fn int starpu_perfmodel_load_symbol(const char *symbol, struct starpu_perfmodel *model)
+\ingroup API_Performance_Model
+loads a given performance model. The \p model structure has to be
+completely zeroed, and will be filled with the information saved in
+<c>$STARPU_HOME/.starpu</c>. The function is intended to be used by
+external tools that need to read the performance model files.
+
+\fn int starpu_perfmodel_unload_model(struct starpu_perfmodel *model)
+\ingroup API_Performance_Model
+unloads the given model, which has been previously loaded
+through the function starpu_perfmodel_load_symbol().
+
+\fn void starpu_perfmodel_debugfilepath(struct starpu_perfmodel *model, enum starpu_perfmodel_archtype arch, char *path, size_t maxlen, unsigned nimpl)
+\ingroup API_Performance_Model
+returns the path to the debugging information for the performance model.
+
+\fn void starpu_perfmodel_get_arch_name(enum starpu_perfmodel_archtype arch, char *archname, size_t maxlen, unsigned nimpl)
+\ingroup API_Performance_Model
+returns the architecture name for \p arch
+
+\fn enum starpu_perfmodel_archtype starpu_worker_get_perf_archtype(int workerid)
+\ingroup API_Performance_Model
+returns the architecture type of a given worker.
+
+\fn int starpu_perfmodel_list(FILE *output)
+\ingroup API_Performance_Model
+prints a list of all performance models on \p output
+
+\fn void starpu_perfmodel_print(struct starpu_perfmodel *model, enum starpu_perfmodel_archtype arch, unsigned nimpl, char *parameter, uint32_t *footprint, FILE *output)
+\ingroup API_Performance_Model
+todo
+
+\fn int starpu_perfmodel_print_all(struct starpu_perfmodel *model, char *arch, char *parameter, uint32_t *footprint, FILE *output)
+\ingroup API_Performance_Model
+todo
+
+\fn void starpu_bus_print_bandwidth(FILE *f)
+\ingroup API_Performance_Model
+prints a matrix of bus bandwidths on \p f.
+
+\fn void starpu_bus_print_affinity(FILE *f)
+\ingroup API_Performance_Model
+prints the affinity devices on \p f.
+
+\fn void starpu_perfmodel_update_history(struct starpu_perfmodel *model, struct starpu_task *task, enum starpu_perfmodel_archtype arch, unsigned cpuid, unsigned nimpl, double measured);
+\ingroup API_Performance_Model
+Feeds the performance model \p model with an explicit measurement
+\p measured, in addition to measurements done by StarPU itself. This
+can be useful when the application already has an existing set of
+measurements done in good conditions, that StarPU could benefit from
+instead of doing on-line measurements. An example of use can be seen
+in \ref PerformanceModelExample.
+
+\fn double starpu_get_bandwidth_RAM_CUDA(unsigned cudadev)
+\ingroup API_Performance_Model
+Return the measured bandwidth between main memory and CUDA device
+\p cudadev. Used to compute the velocity of resources.
+
+\fn double starpu_get_latency_RAM_CUDA(unsigned cudadev)
+\ingroup API_Performance_Model
+Return the measured latency between main memory and CUDA device
+\p cudadev. Used to compute the velocity of resources.
+
+*/

+ 176 - 0
doc/doxygen/chapters/api/profiling.doxy

@@ -0,0 +1,176 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Profiling Profiling
+
+\struct starpu_profiling_task_info
+\ingroup API_Profiling
+This structure contains information about the execution of a
+task. It is accessible from the field starpu_task::profiling_info if
+profiling was enabled.
+\var starpu_profiling_task_info::submit_time
+Date of task submission (relative to the initialization of StarPU).
+
+\var starpu_profiling_task_info::push_start_time
+Time when the task was submitted to the scheduler.
+
+\var starpu_profiling_task_info::push_end_time
+Time when the scheduler finished with the task submission.
+
+\var starpu_profiling_task_info::pop_start_time
+Time when the scheduler started being asked for a task, which it eventually provided.
+
+\var starpu_profiling_task_info::pop_end_time
+Time when the scheduler finished providing the task for execution.
+
+\var starpu_profiling_task_info::acquire_data_start_time
+Time when the worker started fetching input data.
+
+\var starpu_profiling_task_info::acquire_data_end_time
+Time when the worker finished fetching input data.
+
+\var starpu_profiling_task_info::start_time
+Date of task execution beginning (relative to the initialization of StarPU).
+
+\var starpu_profiling_task_info::end_time
+Date of task execution termination (relative to the initialization of StarPU).
+
+\var starpu_profiling_task_info::release_data_start_time
+Time when the worker started releasing data.
+
+\var starpu_profiling_task_info::release_data_end_time
+Time when the worker finished releasing data.
+
+\var starpu_profiling_task_info::callback_start_time
+        Time when the worker started the application callback for the task.
+
+\var starpu_profiling_task_info::callback_end_time
+        Time when the worker finished the application callback for the task.
+
+\var starpu_profiling_task_info::workerid
+        Identifier of the worker which has executed the task.
+
+\var starpu_profiling_task_info::used_cycles
+        Number of cycles used by the task, only available in the MoviSim
+
+\var starpu_profiling_task_info::stall_cycles
+        Number of cycles stalled within the task, only available in the MoviSim
+
+\var starpu_profiling_task_info::power_consumed
+        Power consumed by the task, only available in the MoviSim
+
+\struct starpu_profiling_worker_info
+This structure contains the profiling information associated to
+a worker. The timing is provided since the previous call to
+starpu_profiling_worker_get_info()
+\ingroup API_Profiling
+\var starpu_profiling_worker_info::start_time
+        Starting date for the reported profiling measurements.
+\var starpu_profiling_worker_info::total_time
+        Duration of the profiling measurement interval.
+\var starpu_profiling_worker_info::executing_time
+        Time spent by the worker to execute tasks during the profiling measurement interval.
+\var starpu_profiling_worker_info::sleeping_time
+        Time spent idling by the worker during the profiling measurement interval.
+\var starpu_profiling_worker_info::executed_tasks
+        Number of tasks executed by the worker during the profiling measurement interval.
+\var starpu_profiling_worker_info::used_cycles
+        Number of cycles used by the worker, only available in the MoviSim
+\var starpu_profiling_worker_info::stall_cycles
+        Number of cycles stalled within the worker, only available in the MoviSim
+\var starpu_profiling_worker_info::power_consumed
+        Power consumed by the worker, only available in the MoviSim
+
+\struct starpu_profiling_bus_info
+todo
+\ingroup API_Profiling
+\var starpu_profiling_bus_info::start_time
+        Time of bus profiling startup.
+\var starpu_profiling_bus_info::total_time
+        Total time of bus profiling.
+\var starpu_profiling_bus_info::transferred_bytes
+        Number of bytes transferred during profiling.
+\var starpu_profiling_bus_info::transfer_count
+        Number of transfers during profiling.
+
+\fn int starpu_profiling_status_set(int status)
+\ingroup API_Profiling
+This function sets the profiling status. Profiling is activated
+by passing \ref STARPU_PROFILING_ENABLE in status. Passing
+\ref STARPU_PROFILING_DISABLE disables profiling. Calling this function
+resets all profiling measurements. When profiling is enabled, the
+field starpu_task::profiling_info points to a valid structure
+starpu_profiling_task_info containing information about the execution
+of the task. Negative return values indicate an error, otherwise the
+previous status is returned.
+
+\fn int starpu_profiling_status_get(void)
+\ingroup API_Profiling
+Return the current profiling status or a negative value in case
+there was an error.
+
+\fn void starpu_profiling_set_id(int new_id)
+\ingroup API_Profiling
+This function sets the ID used for the profiling trace filename. It
+needs to be called before starpu_init().
+
+\fn int starpu_profiling_worker_get_info(int workerid, struct starpu_profiling_worker_info *worker_info)
+\ingroup API_Profiling
+Get the profiling info associated to the worker identified by
+\p workerid, and reset the profiling measurements. If the argument \p
+worker_info is NULL, only reset the counters associated to worker
+\p workerid. Upon successful completion, this function returns 0.
+Otherwise, a negative value is returned.
+
+\fn int starpu_bus_get_profiling_info(int busid, struct starpu_profiling_bus_info *bus_info)
+\ingroup API_Profiling
+todo
+
+\fn int starpu_bus_get_count(void)
+\ingroup API_Profiling
+Return the number of buses in the machine
+
+\fn int starpu_bus_get_id(int src, int dst)
+\ingroup API_Profiling
+Return the identifier of the bus between \p src and \p dst
+
+\fn int starpu_bus_get_src(int busid)
+\ingroup API_Profiling
+Return the source point of bus \p busid
+
+\fn int starpu_bus_get_dst(int busid)
+\ingroup API_Profiling
+Return the destination point of bus \p busid
+
+\fn double starpu_timing_timespec_delay_us(struct timespec *start, struct timespec *end)
+\ingroup API_Profiling
+Returns the time elapsed between \p start and \p end in microseconds.
+
+\fn double starpu_timing_timespec_to_us(struct timespec *ts)
+\ingroup API_Profiling
+Converts the given timespec \p ts into microseconds
+
+\fn void starpu_profiling_bus_helper_display_summary(void)
+\ingroup API_Profiling
+Displays statistics about the bus on stderr if the environment
+variable \ref STARPU_BUS_STATS is defined. The function is called
+automatically by starpu_shutdown().
+
+\fn void starpu_profiling_worker_helper_display_summary(void)
+\ingroup API_Profiling
+Displays statistics about the workers on stderr if the
+environment variable \ref STARPU_WORKER_STATS is defined. The function is
+called automatically by starpu_shutdown().
+
+\fn void starpu_data_display_memory_stats()
+\ingroup API_Profiling
+Display statistics about the current data handles registered
+within StarPU. StarPU must have been configured with the configure
+option \ref enable-memory-stats "--enable-memory-stats" (see \ref MemoryFeedback).
+
+*/

+ 39 - 0
doc/doxygen/chapters/api/running_driver.doxy

@@ -0,0 +1,39 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Running_Drivers Running Drivers
+
+\fn int starpu_driver_run(struct starpu_driver *d)
+\ingroup API_Running_Drivers
+Initialize the given driver, run it until it receives a request to
+terminate, deinitialize it and return 0 on success. It returns
+<c>-EINVAL</c> if <c>d->type</c> is not a valid StarPU device type
+(::STARPU_CPU_WORKER, ::STARPU_CUDA_WORKER or ::STARPU_OPENCL_WORKER). This
+is equivalent to calling starpu_driver_init(), then calling
+starpu_driver_run_once() in a loop, and finally starpu_driver_deinit().
+
+\fn int starpu_driver_init(struct starpu_driver *d)
+\ingroup API_Running_Drivers
+Initialize the given driver. Returns 0 on success, <c>-EINVAL</c> if
+<c>d->type</c> is not a valid ::starpu_worker_archtype.
+
+\fn int starpu_driver_run_once(struct starpu_driver *d)
+\ingroup API_Running_Drivers
+Run the driver once, then return 0 on success, or <c>-EINVAL</c> if <c>d->type</c> is not a valid ::starpu_worker_archtype.
+
+\fn int starpu_driver_deinit(struct starpu_driver *d)
+\ingroup API_Running_Drivers
+Deinitialize the given driver. Returns 0 on success, <c>-EINVAL</c> if
+<c>d->type</c> is not a valid ::starpu_worker_archtype.
+
+\fn void starpu_drivers_request_termination(void)
+\ingroup API_Running_Drivers
+Notify all running drivers they should terminate.
+
+*/

+ 28 - 0
doc/doxygen/chapters/api/scc_extensions.doxy

@@ -0,0 +1,28 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_SCC_Extensions SCC Extensions
+
+\def STARPU_USE_SCC
+\ingroup API_SCC_Extensions
+This macro is defined when StarPU has been installed with SCC support.
+It should be used in your code to detect the availability of SCC.
+
+\fn int starpu_scc_register_kernel(starpu_scc_func_symbol_t *symbol, const char *func_name)
+\ingroup API_SCC_Extensions
+Initiate a lookup on each SCC device to find the address of the
+function named \p func_name, store them in the global array kernels,
+and return the index in the array through \p symbol.
+
+\fn starpu_scc_kernel_t starpu_scc_get_kernel(starpu_scc_func_symbol_t symbol)
+\ingroup API_SCC_Extensions
+On success, return the pointer to the function defined by \p symbol on
+the device linked to the called device. This can for instance be used
+in a starpu_scc_func_t implementation.
+
+*/

+ 295 - 0
doc/doxygen/chapters/api/scheduling_context_hypervisor.doxy

@@ -0,0 +1,295 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Scheduling_Context_Hypervisor Scheduling Context Hypervisor
+
+\struct sc_hypervisor_policy
+\ingroup API_Scheduling_Context_Hypervisor
+This structure contains all the methods that implement a hypervisor resizing policy.
+\var sc_hypervisor_policy::name
+        Indicates the name of the policy. If the policy is not custom, the hypervisor will use the predefined policy with this name
+\var sc_hypervisor_policy::custom
+        Indicates whether the policy is custom or not
+\var sc_hypervisor_policy::handle_idle_cycle
+        It is called whenever the indicated worker executes another idle cycle in sched_ctx
+\var sc_hypervisor_policy::handle_pushed_task
+        It is called whenever a task is pushed on the worker’s queue corresponding to the context sched_ctx
+\var sc_hypervisor_policy::handle_poped_task
+        It is called whenever a task is popped from the worker’s queue corresponding to the context sched_ctx
+\var sc_hypervisor_policy::handle_idle_end
+        It is called whenever a task is executed on the indicated worker and context after a long period of idle time
+\var sc_hypervisor_policy::handle_post_exec_hook
+        It is called whenever a tag task has just been executed. The table of resize requests is provided as well as the tag
+
+\struct sc_hypervisor_policy_config
+\ingroup API_Scheduling_Context_Hypervisor
+This structure contains the configuration information of a context,
+which can be used to construct new resize strategies.
+\var sc_hypervisor_policy_config::min_nworkers
+        Indicates the minimum number of workers needed by the context
+\var sc_hypervisor_policy_config::max_nworkers
+        Indicates the maximum number of workers needed by the context
+\var sc_hypervisor_policy_config::granularity
+        Indicates the workers granularity of the context
+\var sc_hypervisor_policy_config::priority
+        Indicates the priority of each worker in the context
+\var sc_hypervisor_policy_config::max_idle
+        Indicates the maximum idle time accepted before a resize is triggered
+\var sc_hypervisor_policy_config::fixed_workers
+        Indicates which workers can be moved and which ones are fixed
+\var sc_hypervisor_policy_config::new_workers_max_idle
+        Indicates the maximum idle time accepted before a resize is triggered for the workers that just arrived in the new context
+
+\struct sc_hypervisor_wrapper
+\ingroup API_Scheduling_Context_Hypervisor
+This structure is a wrapper of the contexts available in StarPU
+and contains all information about a context obtained by incrementing
+the performance counters.
+\var sc_hypervisor_wrapper::sched_ctx
+        The context wrapped
+\var sc_hypervisor_wrapper::config
+        The corresponding resize configuration
+\var sc_hypervisor_wrapper::current_idle_time
+        The idle time counter of each worker of the context
+\var sc_hypervisor_wrapper::pushed_tasks
+        The number of pushed tasks of each worker of the context
+\var sc_hypervisor_wrapper::poped_tasks
+        The number of popped tasks of each worker of the context
+\var sc_hypervisor_wrapper::total_flops
+        The total number of flops to execute by the context
+\var sc_hypervisor_wrapper::total_elapsed_flops
+        The number of flops executed by each worker of the context
+\var sc_hypervisor_wrapper::elapsed_flops
+        The number of flops executed by each worker of the context from last resize
+\var sc_hypervisor_wrapper::remaining_flops
+        The number of flops that still have to be executed by the workers in the context
+\var sc_hypervisor_wrapper::start_time
+        The time when the context started executing
+\var sc_hypervisor_wrapper::resize_ack
+        The structure confirming the last resize finished and a new one can be done
+
+\struct sc_hypervisor_resize_ack
+\ingroup API_Scheduling_Context_Hypervisor
+This structure checks whether the workers moved to another context
+are actually taken into account in that context.
+\var sc_hypervisor_resize_ack::receiver_sched_ctx
+        The context receiving the new workers
+\var sc_hypervisor_resize_ack::moved_workers
+        The workers moved to the receiver context
+\var sc_hypervisor_resize_ack::nmoved_workers
+        The number of workers moved
+\var sc_hypervisor_resize_ack::acked_workers
+        If the value corresponding to a worker is 1, the worker is
+	already taken into account in the new context; if it is 0, not yet
+
+\struct sc_hypervisor_policy_task_pool
+task wrapper linked list
+\ingroup API_Scheduling_Context_Hypervisor
+\var sc_hypervisor_policy_task_pool::cl
+Which codelet has been executed
+\var sc_hypervisor_policy_task_pool::footprint
+Task footprint key
+\var sc_hypervisor_policy_task_pool::sched_ctx_id
+Context the task belongs to
+\var sc_hypervisor_policy_task_pool::n
+Number of tasks of this kind
+\var sc_hypervisor_policy_task_pool::next
+Other task kinds
+
+@name Managing the hypervisor
+\ingroup API_Scheduling_Context_Hypervisor
+
+There is a single hypervisor that is in charge of resizing contexts
+and the resizing strategy is chosen at the initialization of the
+hypervisor. A single resize can be done at a time.
+
+The Scheduling Context Hypervisor Plugin provides a series of
+performance counters to StarPU. By incrementing them, StarPU can help
+the hypervisor in the resizing decision making process. TODO maybe
+they should be hidden to the user
+
+\fn struct starpu_sched_ctx_performance_counters *sc_hypervisor_init(struct sc_hypervisor_policy * policy)
+\ingroup API_Scheduling_Context_Hypervisor
+Initializes the hypervisor to use the strategy provided as parameter
+and creates the performance counters (see starpu_sched_ctx_performance_counters).
+These performance counters are actually callbacks that the contexts
+will use to notify the hypervisor of the information it needs.
+
+Note: The Hypervisor is actually a worker that takes this role once
+certain conditions trigger the resizing process (there is no
+additional thread assigned to the hypervisor).
+
+\fn void sc_hypervisor_shutdown(void)
+\ingroup API_Scheduling_Context_Hypervisor
+Cleans the hypervisor and all information concerning it. There is no
+synchronization between this function and starpu_shutdown(). Thus,
+this should be called after starpu_shutdown(), because the performance
+counters will still need their allocated callback functions.
+
+@name Registering Scheduling Contexts to the hypervisor
+\ingroup API_Scheduling_Context_Hypervisor
+
+Scheduling Contexts that have to be resized by the hypervisor must be
+first registered to the hypervisor. Whenever we want to exclude
+contexts from the resizing process we have to unregister them from the
+hypervisor.
+
+\fn void sc_hypervisor_register_ctx(unsigned sched_ctx, double total_flops)
+\ingroup API_Scheduling_Context_Hypervisor
+Register the context with the hypervisor, and indicate the number of
+flops the context will execute (needed for the Gflops-rate-based
+strategy, see \ref ResizingStrategies, or any other custom strategy
+needing it; for the others, 0.0 can be passed).
+
+\fn void sc_hypervisor_unregister_ctx (unsigned sched_ctx)
+\ingroup API_Scheduling_Context_Hypervisor
+Unregister the context from the hypervisor.
+
+@name Users’ Input In The Resizing Process
+\anchor UsersInputInTheResizingProcess
+\ingroup API_Scheduling_Context_Hypervisor
+
+The user can completely forbid the resizing of a certain context, or
+later change his mind and allow it again (in this case the resizing is
+managed by the hypervisor, which can forbid it or allow it).
+
+\fn void sc_hypervisor_stop_resize(unsigned sched_ctx)
+\ingroup API_Scheduling_Context_Hypervisor
+Forbid resizing of a context
+
+\fn void sc_hypervisor_start_resize(unsigned sched_ctx)
+\ingroup API_Scheduling_Context_Hypervisor
+Allow resizing of a context. The user can then provide information to
+the hypervisor concerning the conditions of resizing.
+
+\fn void sc_hypervisor_ioctl(unsigned sched_ctx, ...)
+\ingroup API_Scheduling_Context_Hypervisor
+Inputs resizing conditions for the context \p sched_ctx through the
+following arguments. The argument list must be zero-terminated.
+
+\def HYPERVISOR_MAX_IDLE
+\ingroup API_Scheduling_Context_Hypervisor
+This macro is used when calling sc_hypervisor_ioctl() and must be
+followed by 3 arguments: an array of int for the workerids to apply
+the condition, an int to indicate the size of the array, and a double
+value indicating the maximum idle time allowed for a worker before the
+resizing process should be triggered
+
+\def HYPERVISOR_PRIORITY
+\ingroup API_Scheduling_Context_Hypervisor
+This macro is used when calling sc_hypervisor_ioctl() and must be
+followed by 3 arguments: an array of int for the workerids to apply
+the condition, an int to indicate the size of the array, and an int
+value indicating the priority of the workers previously mentioned. The
+workers with the smallest priority are moved first.
+
+\def HYPERVISOR_MIN_WORKERS
+\ingroup API_Scheduling_Context_Hypervisor
+This macro is used when calling sc_hypervisor_ioctl() and must be
+followed by 1 argument (int) indicating the minimum number of workers
+a context should have; below this limit the context cannot execute.
+
+\def HYPERVISOR_MAX_WORKERS
+\ingroup API_Scheduling_Context_Hypervisor
+This macro is used when calling sc_hypervisor_ioctl() and must be
+followed by 1 argument (int) indicating the maximum number of workers
+a context should have; above this limit the context would not be able
+to scale.
+
+\def HYPERVISOR_GRANULARITY
+\ingroup API_Scheduling_Context_Hypervisor
+This macro is used when calling sc_hypervisor_ioctl() and must be
+followed by 1 argument (int) indicating the granularity of the
+resizing process (the number of workers to be moved when the context
+is resized). This parameter is ignored for the Gflops-rate-based
+strategy (see \ref ResizingStrategies), where the number of workers to
+move is computed by the strategy.
+
+\def HYPERVISOR_FIXED_WORKERS
+\ingroup API_Scheduling_Context_Hypervisor
+This macro is used when calling sc_hypervisor_ioctl() and must be
+followed by 2 arguments: an array of int for the workerids to apply
+the condition and an int to indicate the size of the array. These
+workers are not allowed to be moved from the context.
+
+\def HYPERVISOR_MIN_TASKS
+\ingroup API_Scheduling_Context_Hypervisor
+This macro is used when calling sc_hypervisor_ioctl() and must be
+followed by 1 argument (int) indicating the minimum number of tasks
+that have to be executed before the context can be resized. This
+parameter is ignored for the Application Driven strategy (see \ref
+ResizingStrategies), where the user indicates exactly when the resize
+should be done.
+
+\def HYPERVISOR_NEW_WORKERS_MAX_IDLE
+\ingroup API_Scheduling_Context_Hypervisor
+This macro is used when calling sc_hypervisor_ioctl() and must be
+followed by 1 argument, a double value indicating the maximum idle
+time allowed for workers that have just been moved from other contexts
+in the current context.
+
+\def HYPERVISOR_TIME_TO_APPLY
+\ingroup API_Scheduling_Context_Hypervisor
+This macro is used when calling sc_hypervisor_ioctl() and must be
+followed by 1 argument (int) indicating the tag an executed task
+should have such that this configuration should be taken into account.
+
+@name Defining a new hypervisor policy
+\ingroup API_Scheduling_Context_Hypervisor
+
+While Scheduling Context Hypervisor Plugin comes with a variety of
+resizing policies (see \ref ResizingStrategies), it may sometimes be
+desirable to implement custom policies to address specific problems.
+The API described below allows users to write their own resizing policy.
+
+Here is an example of how to define a new policy:
+
+\code{.c}
+struct sc_hypervisor_policy dummy_policy =
+{
+       .handle_poped_task = dummy_handle_poped_task,
+       .handle_pushed_task = dummy_handle_pushed_task,
+       .handle_idle_cycle = dummy_handle_idle_cycle,
+       .handle_idle_end = dummy_handle_idle_end,
+       .handle_post_exec_hook = dummy_handle_post_exec_hook,
+       .custom = 1,
+       .name = "dummy"
+};
+\endcode
+
+\fn void sc_hypervisor_move_workers(unsigned sender_sched_ctx, unsigned receiver_sched_ctx, int *workers_to_move, unsigned nworkers_to_move, unsigned now);
+\ingroup API_Scheduling_Context_Hypervisor
+    Moves workers from one context to another
+
+\fn struct sc_hypervisor_policy_config * sc_hypervisor_get_config(unsigned sched_ctx);
+\ingroup API_Scheduling_Context_Hypervisor
+    Returns the configuration structure of a context
+
+\fn int * sc_hypervisor_get_sched_ctxs();
+\ingroup API_Scheduling_Context_Hypervisor
+    Gets the contexts managed by the hypervisor
+
+\fn int sc_hypervisor_get_nsched_ctxs();
+\ingroup API_Scheduling_Context_Hypervisor
+    Gets the number of contexts managed by the hypervisor
+
+\fn struct sc_hypervisor_wrapper * sc_hypervisor_get_wrapper(unsigned sched_ctx);
+\ingroup API_Scheduling_Context_Hypervisor
+    Returns the wrapper corresponding to the context \p sched_ctx
+
+\fn double sc_hypervisor_get_elapsed_flops_per_sched_ctx(struct sc_hypervisor_wrapper * sc_w);
+\ingroup API_Scheduling_Context_Hypervisor
+    Returns the number of flops executed by a context since the last resize
+
+\fn char * sc_hypervisor_get_policy();
+\ingroup API_Scheduling_Context_Hypervisor
+    Returns the name of the resizing policy the hypervisor uses
+
+*/

+ 248 - 0
doc/doxygen/chapters/api/scheduling_contexts.doxy

@@ -0,0 +1,248 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Scheduling_Contexts Scheduling Contexts
+
+\brief StarPU permits, on the one hand, grouping workers into combined
+workers in order to execute a parallel task, and, on the other hand,
+grouping tasks into bundles that will be executed by a single
+specified worker. In contrast, when workers are grouped into
+scheduling contexts, StarPU tasks are submitted to them and scheduled
+with the policy assigned to the context. Scheduling contexts can be
+created, deleted and modified dynamically.
+
+\enum starpu_worker_collection_type
+\ingroup API_Scheduling_Contexts
+types of structures the worker collection can implement
+\var starpu_worker_collection_type::STARPU_WORKER_LIST
+\ingroup API_Scheduling_Contexts
+List of workers
+
+\struct starpu_sched_ctx_iterator
+\ingroup API_Scheduling_Contexts
+todo
+\var starpu_sched_ctx_iterator::cursor
+todo
+
+\struct starpu_worker_collection
+\ingroup API_Scheduling_Contexts
+A scheduling context manages a collection of workers that can
+be stored using different data structures. Thus, a generic
+structure is available in order to simplify the choice of its type.
+Only the list data structure is available so far, but further data
+structures (like trees) are foreseen.
+\var starpu_worker_collection::workerids
+        The workerids managed by the collection
+\var starpu_worker_collection::nworkers
+        The number of workers in the collection
+\var starpu_worker_collection::type
+        The type of structure (currently ::STARPU_WORKER_LIST is the only one available)
+\var starpu_worker_collection::has_next
+        Checks if there is another element in collection
+\var starpu_worker_collection::get_next
+        return the next element in the collection
+\var starpu_worker_collection::add
+        add a new element in the collection
+\var starpu_worker_collection::remove
+        remove an element from the collection
+\var starpu_worker_collection::init
+        Initialize the collection
+\var starpu_worker_collection::deinit
+        Deinitialize the collection
+\var starpu_worker_collection::init_iterator
+        Initialize the cursor if there is one
+
+\struct starpu_sched_ctx_performance_counters
+Performance counters used by StarPU to inform the
+hypervisor how the application and the resources are performing.
+\ingroup API_Scheduling_Contexts
+\var starpu_sched_ctx_performance_counters::notify_idle_cycle
+        Informs the hypervisor for how long a worker has been idle in the specified context
+\var starpu_sched_ctx_performance_counters::notify_idle_end
+        Informs the hypervisor that after a period of idle, the worker has just executed a task in the specified context. The idle counter is thus reset.
+\var starpu_sched_ctx_performance_counters::notify_pushed_task
+        Notifies the hypervisor a task has been scheduled on the queue of the worker corresponding to the specified context
+\var starpu_sched_ctx_performance_counters::notify_poped_task
+        Informs the hypervisor that a task executing a specified number of instructions has been popped from the worker
+\var starpu_sched_ctx_performance_counters::notify_post_exec_hook
+        Notifies the hypervisor a task has just been executed
+
+@name Scheduling Contexts Basic API
+\ingroup API_Scheduling_Contexts
+
+\fn unsigned starpu_sched_ctx_create(const char *policy_name, int *workerids_ctx, int nworkers_ctx, const char *sched_ctx_name)
+\ingroup API_Scheduling_Contexts
+This function creates a scheduling context which uses the scheduling
+policy \p policy_name and assigns the workers in \p workerids_ctx to
+execute the tasks submitted to it.
+The return value represents the identifier of the context that has
+just been created. It will be further used to indicate the context the
+tasks will be submitted to. The return value should be at most
+\ref STARPU_NMAX_SCHED_CTXS.
+
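A minimal usage sketch, assuming StarPU is already initialized, that workers 0 and 1 exist, and that the "dmda" policy is available; the task is a placeholder supplied by the caller:

```c
#include <starpu.h>

void run_in_context(struct starpu_task *task)
{
	int workerids[2] = {0, 1};
	unsigned ctx = starpu_sched_ctx_create("dmda", workerids, 2, "my_ctx");

	/* subsequent submissions from this thread go to ctx */
	starpu_sched_ctx_set_context(&ctx);
	starpu_task_submit(task);

	starpu_task_wait_for_all();
	starpu_sched_ctx_delete(ctx); /* workers move to the inheritor */
}
```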
+\fn unsigned starpu_sched_ctx_create_inside_interval(const char *policy_name, const char *sched_name, int min_ncpus, int max_ncpus, int min_ngpus, int max_ngpus, unsigned allow_overlap)
+\ingroup API_Scheduling_Contexts
+Create a context indicating an approximate interval of resources
+
+\fn void starpu_sched_ctx_delete(unsigned sched_ctx_id)
+\ingroup API_Scheduling_Contexts
+Delete scheduling context \p sched_ctx_id and transfer remaining
+workers to the inheritor scheduling context.
+
+\fn void starpu_sched_ctx_add_workers(int *workerids_ctx, int nworkers_ctx, unsigned sched_ctx_id)
+\ingroup API_Scheduling_Contexts
+This function dynamically adds the workers in \p workerids_ctx to the
+context \p sched_ctx_id. The last argument cannot be greater than
+\ref STARPU_NMAX_SCHED_CTXS.
+
+\fn void starpu_sched_ctx_remove_workers(int *workerids_ctx, int nworkers_ctx, unsigned sched_ctx_id)
+\ingroup API_Scheduling_Contexts
+This function removes the workers in \p workerids_ctx from the context
+\p sched_ctx_id. The last argument cannot be greater than
+\ref STARPU_NMAX_SCHED_CTXS.
+
+\fn void starpu_sched_ctx_set_inheritor(unsigned sched_ctx_id, unsigned inheritor)
+\ingroup API_Scheduling_Contexts
+Indicate which context will inherit the resources of this context
+when it is deleted.
+
+\fn void starpu_sched_ctx_set_context(unsigned *sched_ctx_id)
+\ingroup API_Scheduling_Contexts
+Set the scheduling context the subsequent tasks will be submitted to
+
+\fn unsigned starpu_sched_ctx_get_context(void)
+\ingroup API_Scheduling_Contexts
+Return the scheduling context the tasks are currently submitted to
+
+\fn void starpu_sched_ctx_stop_task_submission(void)
+\ingroup API_Scheduling_Contexts
+Stop submitting tasks from the empty context list until the next time
+the context has time to check the empty context list
+
+\fn void starpu_sched_ctx_finished_submit(unsigned sched_ctx_id)
+\ingroup API_Scheduling_Contexts
+Indicate to StarPU that the application has finished submitting tasks
+to this context, so that the workers can be moved to the inheritor as
+soon as possible.
+
+\fn unsigned starpu_sched_ctx_get_nworkers(unsigned sched_ctx_id)
+\ingroup API_Scheduling_Contexts
+Return the number of workers managed by the specified context
+(usually needed to check whether it manages any workers or whether it
+should be blocked).
+
+\fn unsigned starpu_sched_ctx_get_nshared_workers(unsigned sched_ctx_id, unsigned sched_ctx_id2)
+\ingroup API_Scheduling_Contexts
+Return the number of workers shared by two contexts.
+
+\fn unsigned starpu_sched_ctx_contains_worker(int workerid, unsigned sched_ctx_id)
+\ingroup API_Scheduling_Contexts
+Return 1 if the worker belongs to the context and 0 otherwise
+
+\fn unsigned starpu_sched_ctx_overlapping_ctxs_on_worker(int workerid)
+\ingroup API_Scheduling_Contexts
+Check if a worker is shared between several contexts
+
+\fn unsigned starpu_sched_ctx_is_ctxs_turn(int workerid, unsigned sched_ctx_id)
+\ingroup API_Scheduling_Contexts
+Manage sharing of resources between contexts: check which context has
+its turn to pop.
+
+\fn void starpu_sched_ctx_set_turn_to_other_ctx(int workerid, unsigned sched_ctx_id)
+\ingroup API_Scheduling_Contexts
+Manage sharing of resources between contexts: by default a
+round-robin strategy is used, but the user can intervene to tell which
+context has its turn to pop.
+
+\fn double starpu_sched_ctx_get_max_time_worker_on_ctx(void)
+\ingroup API_Scheduling_Contexts
+When time-sharing resources, indicate how long a worker has been
+active in the current scheduling context.
+
+@name Scheduling Context Priorities
+\ingroup API_Scheduling_Contexts
+
+\def STARPU_MIN_PRIO
+\ingroup API_Scheduling_Contexts
+Provided for legacy reasons.
+
+\def STARPU_MAX_PRIO
+\ingroup API_Scheduling_Contexts
+Provided for legacy reasons.
+
+\def STARPU_DEFAULT_PRIO
+\ingroup API_Scheduling_Contexts
+By convention, the default priority level should be 0 so that we can
+statically allocate tasks with a default priority.
+
+\fn int starpu_sched_ctx_set_min_priority(unsigned sched_ctx_id, int min_prio)
+\ingroup API_Scheduling_Contexts
+Defines the minimum task priority level supported by the scheduling
+policy of the given scheduler context. The default minimum priority
+level is the same as the default priority level which is 0 by
+convention. The application may access that value by calling the function
+starpu_sched_ctx_get_min_priority(). This function should only
+be called from the initialization method of the scheduling policy, and
+should not be used directly from the application.
+
+\fn int starpu_sched_ctx_set_max_priority(unsigned sched_ctx_id, int max_prio)
+\ingroup API_Scheduling_Contexts
+Defines the maximum priority level supported by the scheduling policy
+of the given scheduler context. The default maximum priority level is
+1. The application may access that value by calling the
+starpu_sched_ctx_get_max_priority function. This function should only
+be called from the initialization method of the scheduling policy, and
+should not be used directly from the application.
+
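For instance, a policy initialization method (my_init_sched is a hypothetical name) might declare the priority range it supports as follows:

```c
#include <starpu.h>

/* Sketch: declare, from a policy's init method, that this policy
 * understands task priorities from -5 to 5. */
static void my_init_sched(unsigned sched_ctx_id)
{
	starpu_sched_ctx_set_min_priority(sched_ctx_id, -5);
	starpu_sched_ctx_set_max_priority(sched_ctx_id, 5);
}
```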
+\fn int starpu_sched_ctx_get_min_priority(unsigned sched_ctx_id)
+\ingroup API_Scheduling_Contexts
+Returns the current minimum priority level supported by the scheduling
+policy of the given scheduler context.
+
+\fn int starpu_sched_ctx_get_max_priority(unsigned sched_ctx_id)
+\ingroup API_Scheduling_Contexts
+Returns the current maximum priority level supported by the scheduling
+policy of the given scheduler context.
+
+@name Scheduling Context Worker Collection
+\ingroup API_Scheduling_Contexts
+
+\fn struct starpu_worker_collection* starpu_sched_ctx_create_worker_collection(unsigned sched_ctx_id, enum starpu_worker_collection_type type)
+\ingroup API_Scheduling_Contexts
+Create a worker collection of the type indicated by the last parameter
+for the context specified through the first parameter.
+
+\fn void starpu_sched_ctx_delete_worker_collection(unsigned sched_ctx_id)
+\ingroup API_Scheduling_Contexts
+Delete the worker collection of the specified scheduling context
+
+\fn struct starpu_worker_collection* starpu_sched_ctx_get_worker_collection(unsigned sched_ctx_id)
+\ingroup API_Scheduling_Contexts
+Return the worker collection managed by the indicated context
+
+@name Scheduling Context Link with Hypervisor
+\ingroup API_Scheduling_Contexts
+
+\fn void starpu_sched_ctx_set_perf_counters(unsigned sched_ctx_id, struct starpu_sched_ctx_performance_counters *perf_counters)
+\ingroup API_Scheduling_Contexts
+Indicate to StarPU the pointer to the performance counters.
+
+\fn void starpu_sched_ctx_call_pushed_task_cb(int workerid, unsigned sched_ctx_id)
+\ingroup API_Scheduling_Contexts
+Callback that lets the scheduling policy tell the hypervisor that a
+task was pushed on a worker
+
+\fn void starpu_sched_ctx_notify_hypervisor_exists(void)
+\ingroup API_Scheduling_Contexts
+Allow the hypervisor to let StarPU know it has been initialised.
+
+\fn unsigned starpu_sched_ctx_check_if_hypervisor_exists(void)
+\ingroup API_Scheduling_Contexts
+Ask StarPU whether it has been informed that the hypervisor is initialised.
+
+*/

+ 174 - 0
doc/doxygen/chapters/api/scheduling_policy.doxy

@@ -0,0 +1,174 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Scheduling_Policy Scheduling Policy
+
+\brief While StarPU comes with a variety of scheduling policies
+(see \ref TaskSchedulingPolicy), it may sometimes be desirable to
+implement custom policies to address specific problems. The API
+described below allows users to write their own scheduling policy.
+
+\struct starpu_sched_policy
+\ingroup API_Scheduling_Policy
+This structure contains all the methods that implement a
+scheduling policy. An application may specify which scheduling
+strategy to use through the field starpu_conf::sched_policy passed to
+the function starpu_init().
+\var starpu_sched_policy::init_sched
+        Initialize the scheduling policy.
+\var starpu_sched_policy::deinit_sched
+        Cleanup the scheduling policy.
+\var starpu_sched_policy::push_task
+        Insert a task into the scheduler.
+\var starpu_sched_policy::push_task_notify
+        Notify the scheduler that a task was pushed on a given worker.
+	This method is called when a task that was explicitly
+	assigned to a worker becomes ready and is about to be executed
+	by the worker. This method therefore makes it possible to keep
+	the state of the scheduler coherent even when StarPU bypasses
+	the scheduling strategy.
+\var starpu_sched_policy::pop_task
+        Get a task from the scheduler. The mutex associated to the
+	worker is already taken when this method is called. If this
+	method is defined as NULL, the worker will only execute tasks
+	from its local queue. In this case, the push_task method
+	should use the starpu_push_local_task method to assign tasks
+	to the different workers.
+\var starpu_sched_policy::pop_every_task
+        Remove all available tasks from the scheduler (tasks are
+	chained by the means of the field starpu_task::prev and
+	starpu_task::next). The mutex associated to the worker is
+	already taken when this method is called. This is currently
+	not used.
+\var starpu_sched_policy::pre_exec_hook
+        Optional field. This method is called every time a task is starting.
+\var starpu_sched_policy::post_exec_hook
+        Optional field. This method is called every time a task has been executed.
+\var starpu_sched_policy::add_workers
+        Initialize scheduling structures corresponding to each worker used by the policy.
+\var starpu_sched_policy::remove_workers
+        Deinitialize scheduling structures corresponding to each worker used by the policy.
+\var starpu_sched_policy::policy_name
+        Optional field. Name of the policy.
+\var starpu_sched_policy::policy_description
+        Optional field. Human readable description of the policy.
+
+\fn struct starpu_sched_policy ** starpu_sched_get_predefined_policies()
+\ingroup API_Scheduling_Policy
+Return a NULL-terminated array of all the predefined scheduling
+policies.
+
+\fn void starpu_worker_get_sched_condition(int workerid, starpu_pthread_mutex_t **sched_mutex, starpu_pthread_cond_t **sched_cond)
+\ingroup API_Scheduling_Policy
+When there is no available task for a worker, StarPU blocks this
+worker on a condition variable. This function specifies which
+condition variable (and the associated mutex) should be used to block
+(and to wake up) a worker. Note that multiple workers may use the same
+condition variable. For instance, in the case of a scheduling strategy
+with a single task queue, the same condition variable would be used to
+block and wake up all workers.
+
+\fn void starpu_sched_ctx_set_policy_data(unsigned sched_ctx_id, void * policy_data)
+\ingroup API_Scheduling_Policy
+Each scheduling policy uses some specific data (queues, variables,
+additional condition variables), stored in a policy-local structure.
+This function assigns that structure to a scheduling context.
+
+\fn void* starpu_sched_ctx_get_policy_data(unsigned sched_ctx_id)
+\ingroup API_Scheduling_Policy
+Returns the policy data previously assigned to a context
+
+\fn int starpu_sched_set_min_priority(int min_prio)
+\ingroup API_Scheduling_Policy
+Defines the minimum task priority level supported by the scheduling
+policy. The default minimum priority level is the same as the default
+priority level which is 0 by convention. The application may access
+that value by calling the function starpu_sched_get_min_priority().
+This function should only be called from the initialization method of
+the scheduling policy, and should not be used directly from the
+application.
+
+\fn int starpu_sched_set_max_priority(int max_prio)
+\ingroup API_Scheduling_Policy
+Defines the maximum priority level supported by the scheduling policy.
+The default maximum priority level is 1. The application may access
+that value by calling the function starpu_sched_get_max_priority().
+This function should only be called from the initialization method of
+the scheduling policy, and should not be used directly from the
+application.
+
+\fn int starpu_sched_get_min_priority(void)
+\ingroup API_Scheduling_Policy
+Returns the current minimum priority level supported by the scheduling
+policy
+
+\fn int starpu_sched_get_max_priority(void)
+\ingroup API_Scheduling_Policy
+Returns the current maximum priority level supported by the scheduling
+policy
+
+\fn int starpu_push_local_task(int workerid, struct starpu_task *task, int back)
+\ingroup API_Scheduling_Policy
+The scheduling policy may put tasks directly into a worker’s local
+queue so that it is not always necessary to create its own queue when
+the local queue is sufficient. If \p back is not 0, \p task is put
+at the back of the queue where the worker will pop tasks first.
+Setting \p back to 0 therefore ensures a FIFO ordering.
+
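A hedged sketch of a push_task method relying only on the workers' local queues; the worker-selection logic below is a placeholder (a real policy would choose a worker from its own state):

```c
#include <starpu.h>

/* Sketch: forward every ready task to one worker's local queue. */
static int my_push_task(struct starpu_task *task)
{
	int workerid = 0; /* placeholder: a real policy picks a worker */

	/* back = 0 keeps FIFO ordering on the worker's queue */
	return starpu_push_local_task(workerid, task, 0);
}
```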
+\fn int starpu_push_task_end(struct starpu_task *task)
+\ingroup API_Scheduling_Policy
+This function must be called by a scheduler to notify that the given
+task has just been pushed.
+
+\fn int starpu_worker_can_execute_task(unsigned workerid, struct starpu_task *task, unsigned nimpl)
+\ingroup API_Scheduling_Policy
+Check whether the worker specified by \p workerid can execute the codelet.
+Schedulers need to call it before assigning a task to a worker,
+otherwise the task may fail to execute.
+
+\fn double starpu_timing_now(void)
+\ingroup API_Scheduling_Policy
+Return the current date in micro-seconds.
+
+\fn uint32_t starpu_task_footprint(struct starpu_perfmodel *model, struct starpu_task * task, enum starpu_perfmodel_archtype arch, unsigned nimpl)
+\ingroup API_Scheduling_Policy
+Returns the footprint for a given task
+
+\fn double starpu_task_expected_length(struct starpu_task *task, enum starpu_perfmodel_archtype arch, unsigned nimpl)
+\ingroup API_Scheduling_Policy
+Returns expected task duration in micro-seconds.
+
+\fn double starpu_worker_get_relative_speedup(enum starpu_perfmodel_archtype perf_archtype)
+\ingroup API_Scheduling_Policy
+Returns an estimated speedup factor relative to CPU speed
+
+\fn double starpu_task_expected_data_transfer_time(unsigned memory_node, struct starpu_task *task)
+\ingroup API_Scheduling_Policy
+Returns expected data transfer time in micro-seconds.
+
+\fn double starpu_data_expected_transfer_time(starpu_data_handle_t handle, unsigned memory_node, enum starpu_data_access_mode mode)
+\ingroup API_Scheduling_Policy
+Predict the transfer time (in micro-seconds) to move \p handle to a memory node
+
+\fn double starpu_task_expected_power(struct starpu_task *task, enum starpu_perfmodel_archtype arch, unsigned nimpl)
+\ingroup API_Scheduling_Policy
+Returns expected power consumption in J
+
+\fn double starpu_task_expected_conversion_time(struct starpu_task *task, enum starpu_perfmodel_archtype arch, unsigned nimpl)
+\ingroup API_Scheduling_Policy
+Returns expected conversion time in ms (multiformat interface only)
+
+\fn int starpu_get_prefetch_flag(void)
+\ingroup API_Scheduling_Policy
+Return whether \ref STARPU_PREFETCH was set
+
+\fn int starpu_prefetch_task_input_on_node(struct starpu_task *task, unsigned node)
+\ingroup API_Scheduling_Policy
+Prefetch data for a given task on a given node
+
+*/

+ 64 - 0
doc/doxygen/chapters/api/standard_memory_library.doxy

@@ -0,0 +1,64 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Standard_Memory_Library Standard Memory Library
+
+\def STARPU_MALLOC_PINNED
+\ingroup API_Standard_Memory_Library
+Value passed to the function starpu_malloc_flags() to indicate the memory allocation should be pinned.
+
+\def STARPU_MALLOC_COUNT
+\ingroup API_Standard_Memory_Library
+Value passed to the function starpu_malloc_flags() to indicate
+the memory allocation should be in the limit defined by the
+environment variables \ref STARPU_LIMIT_CUDA_devid_MEM,
+\ref STARPU_LIMIT_CUDA_MEM, \ref STARPU_LIMIT_OPENCL_devid_MEM,
+\ref STARPU_LIMIT_OPENCL_MEM and \ref STARPU_LIMIT_CPU_MEM (see
+Section \ref HowToLimitMemoryPerNode).
+If no memory is available, it tries to reclaim memory from StarPU.
+Memory allocated this way needs to be freed by calling the function
+starpu_free_flags() with the same flag.
+
+\fn int starpu_malloc_flags(void **A, size_t dim, int flags)
+\ingroup API_Standard_Memory_Library
+Performs a memory allocation based on the constraints defined
+by the given flag.
+
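A sketch of a pinned, accounted allocation freed with the same flags; buffer size and error handling are illustrative:

```c
#include <stdlib.h>
#include <starpu.h>

void use_pinned_buffer(void)
{
	float *buf;
	size_t size = 1024 * sizeof(*buf);
	int flags = STARPU_MALLOC_PINNED | STARPU_MALLOC_COUNT;

	if (starpu_malloc_flags((void **)&buf, size, flags) != 0)
		return; /* allocation failed even after reclaiming memory */

	/* ... use buf ... */

	starpu_free_flags(buf, size, flags); /* same flags as allocation */
}
```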
+\fn void starpu_malloc_set_align(size_t align)
+\ingroup API_Standard_Memory_Library
+This function sets an alignment constraint for starpu_malloc()
+allocations. \p align must be a power of two. This is for instance
+called automatically by the OpenCL driver to specify its own alignment
+constraints.
+
+\fn int starpu_malloc(void **A, size_t dim)
+\ingroup API_Standard_Memory_Library
+This function allocates data of the given size in main memory.
+It will also try to pin it in CUDA or OpenCL, so that data transfers
+from this buffer can be asynchronous, and thus permit data transfer
+and computation overlapping. The allocated buffer must be freed with
+the function starpu_free().
+
+\fn int starpu_free(void *A)
+\ingroup API_Standard_Memory_Library
+This function frees memory which has previously been allocated
+with starpu_malloc().
+
+\fn int starpu_free_flags(void *A, size_t dim, int flags)
+\ingroup API_Standard_Memory_Library
+This function frees memory by specifying its size. The given
+flags should be consistent with the ones given to starpu_malloc_flags()
+when allocating the memory.
+
+\fn ssize_t starpu_memory_get_available(unsigned node)
+\ingroup API_Standard_Memory_Library
+If a memory limit is defined on the given node (see Section \ref
+HowToLimitMemoryPerNode), return the amount of available memory
+on the node. Otherwise return -1.
+
+*/

+ 59 - 0
doc/doxygen/chapters/api/task_bundles.doxy

@@ -0,0 +1,59 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Task_Bundles Task Bundles
+
+\typedef starpu_task_bundle_t
+\ingroup API_Task_Bundles
+Opaque structure describing a list of tasks that should be scheduled
+on the same worker whenever possible. It must be considered as a
+hint given to the scheduler as there is no guarantee that they will be
+executed on the same worker.
+
+\fn void starpu_task_bundle_create (starpu_task_bundle_t *bundle)
+\ingroup API_Task_Bundles
+Factory function creating and initializing \p bundle. When the call
+returns, the needed memory is allocated and \p bundle is ready to use.
+
+\fn int starpu_task_bundle_insert (starpu_task_bundle_t bundle, struct starpu_task *task)
+\ingroup API_Task_Bundles
+Insert \p task in \p bundle. Until \p task is removed from \p bundle
+its expected length and data transfer time will be considered along
+those of the other tasks of bundle. This function must not be called
+if \p bundle is already closed and/or \p task is already submitted.
+On success, it returns 0. There are two error cases: if \p bundle
+is already closed it returns <c>-EPERM</c>; if \p task was already
+submitted it returns <c>-EINVAL</c>.
+
+\fn int starpu_task_bundle_remove (starpu_task_bundle_t bundle, struct starpu_task *task)
+\ingroup API_Task_Bundles
+Remove \p task from \p bundle. Of course \p task must have been
+previously inserted in \p bundle. This function must not be called if
+\p bundle is already closed and/or \p task is already submitted. Doing
+so would result in undefined behaviour. On success, it returns 0. If
+\p bundle is already closed it returns <c>-ENOENT</c>.
+
+\fn void starpu_task_bundle_close (starpu_task_bundle_t bundle)
+\ingroup API_Task_Bundles
+Inform the runtime that the user will not modify \p bundle anymore;
+this means no more inserting or removing tasks. The runtime can then
+destroy it when possible.
+
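Putting the calls together, a minimal sketch (assuming \p t1 and \p t2 are initialized, not-yet-submitted tasks):

```c
#include <starpu.h>

/* Sketch: hint the scheduler to co-locate two not-yet-submitted tasks. */
void bundle_pair(struct starpu_task *t1, struct starpu_task *t2)
{
	starpu_task_bundle_t bundle;

	starpu_task_bundle_create(&bundle);
	starpu_task_bundle_insert(bundle, t1);
	starpu_task_bundle_insert(bundle, t2);

	starpu_task_bundle_close(bundle); /* no more insert/remove */

	starpu_task_submit(t1);
	starpu_task_submit(t2);
}
```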
+\fn double starpu_task_bundle_expected_length (starpu_task_bundle_t bundle, enum starpu_perfmodel_archtype arch, unsigned nimpl)
+\ingroup API_Task_Bundles
+Return the expected duration of \p bundle in micro-seconds.
+
+\fn double starpu_task_bundle_expected_power (starpu_task_bundle_t bundle, enum starpu_perfmodel_archtype arch, unsigned nimpl)
+\ingroup API_Task_Bundles
+Return the expected power consumption of \p bundle in J.
+
+\fn double starpu_task_bundle_expected_data_transfer_time (starpu_task_bundle_t bundle, unsigned memory_node)
+\ingroup API_Task_Bundles
+Return the time (in micro-seconds) expected to transfer all data used within \p bundle.
+
+*/

+ 68 - 0
doc/doxygen/chapters/api/task_lists.doxy

@@ -0,0 +1,68 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Task_Lists Task Lists
+
+\struct starpu_task_list
+Stores a doubly-linked list of tasks
+\ingroup API_Task_Lists
+\var starpu_task_list::head
+head of the list
+\var starpu_task_list::tail
+tail of the list
+
+\fn void starpu_task_list_init(struct starpu_task_list *list)
+\ingroup API_Task_Lists
+Initialize a list structure
+
+\fn void starpu_task_list_push_front(struct starpu_task_list *list, struct starpu_task *task)
+\ingroup API_Task_Lists
+Push \p task at the front of \p list
+
+\fn void starpu_task_list_push_back(struct starpu_task_list *list, struct starpu_task *task)
+\ingroup API_Task_Lists
+Push \p task at the back of \p list
+
+\fn struct starpu_task * starpu_task_list_front(struct starpu_task_list *list)
+\ingroup API_Task_Lists
+Get the front of \p list (without removing it)
+
+\fn struct starpu_task * starpu_task_list_back(struct starpu_task_list *list)
+\ingroup API_Task_Lists
+Get the back of \p list (without removing it)
+
+\fn int starpu_task_list_empty(struct starpu_task_list *list)
+\ingroup API_Task_Lists
+Test if \p list is empty
+
+\fn void starpu_task_list_erase(struct starpu_task_list *list, struct starpu_task *task)
+\ingroup API_Task_Lists
+Remove \p task from \p list
+
+\fn struct starpu_task * starpu_task_list_pop_front(struct starpu_task_list *list)
+\ingroup API_Task_Lists
+Remove the element at the front of \p list
+
+\fn struct starpu_task * starpu_task_list_pop_back(struct starpu_task_list *list)
+\ingroup API_Task_Lists
+Remove the element at the back of \p list
+
+\fn struct starpu_task * starpu_task_list_begin(struct starpu_task_list *list)
+\ingroup API_Task_Lists
+Get the first task of \p list.
+
+\fn struct starpu_task * starpu_task_list_end(struct starpu_task_list *list)
+\ingroup API_Task_Lists
+Get the end of \p list.
+
+\fn struct starpu_task * starpu_task_list_next(struct starpu_task *task)
+\ingroup API_Task_Lists
+Get the next task of \p list. This is not erase-safe.
+
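The begin/end/next trio supports the usual iteration idiom; a sketch (remember the traversal is not erase-safe):

```c
#include <stdio.h>
#include <starpu.h>

/* Sketch: count the tasks in a list without modifying it. */
void count_tasks(struct starpu_task_list *list)
{
	int n = 0;
	struct starpu_task *task;

	for (task = starpu_task_list_begin(list);
	     task != starpu_task_list_end(list);
	     task = starpu_task_list_next(task))
		n++; /* do not erase tasks inside this loop */

	fprintf(stderr, "%d task(s) in list\n", n);
}
```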
+*/
+

+ 216 - 0
doc/doxygen/chapters/api/top.doxy

@@ -0,0 +1,216 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_StarPUTop_Interface StarPU-Top Interface
+
+\enum starpu_top_data_type
+\ingroup API_StarPUTop_Interface
+StarPU-Top Data type
+\var starpu_top_data_type::STARPU_TOP_DATA_BOOLEAN
+\ingroup API_StarPUTop_Interface
+todo
+\var starpu_top_data_type::STARPU_TOP_DATA_INTEGER
+\ingroup API_StarPUTop_Interface
+todo
+\var starpu_top_data_type::STARPU_TOP_DATA_FLOAT
+\ingroup API_StarPUTop_Interface
+todo
+
+\enum starpu_top_param_type
+\ingroup API_StarPUTop_Interface
+StarPU-Top Parameter type
+\var starpu_top_param_type::STARPU_TOP_PARAM_BOOLEAN
+\ingroup API_StarPUTop_Interface
+todo
+\var starpu_top_param_type::STARPU_TOP_PARAM_INTEGER
+\ingroup API_StarPUTop_Interface
+todo
+\var starpu_top_param_type::STARPU_TOP_PARAM_FLOAT
+\ingroup API_StarPUTop_Interface
+todo
+\var starpu_top_param_type::STARPU_TOP_PARAM_ENUM
+\ingroup API_StarPUTop_Interface
+todo
+
+\enum starpu_top_message_type
+\ingroup API_StarPUTop_Interface
+StarPU-Top Message type
+\var starpu_top_message_type::TOP_TYPE_GO
+\ingroup API_StarPUTop_Interface
+todo
+\var starpu_top_message_type::TOP_TYPE_SET
+\ingroup API_StarPUTop_Interface
+todo
+\var starpu_top_message_type::TOP_TYPE_CONTINUE
+\ingroup API_StarPUTop_Interface
+todo
+\var starpu_top_message_type::TOP_TYPE_ENABLE
+\ingroup API_StarPUTop_Interface
+todo
+\var starpu_top_message_type::TOP_TYPE_DISABLE
+\ingroup API_StarPUTop_Interface
+todo
+\var starpu_top_message_type::TOP_TYPE_DEBUG
+\ingroup API_StarPUTop_Interface
+todo
+\var starpu_top_message_type::TOP_TYPE_UNKNOW
+\ingroup API_StarPUTop_Interface
+todo
+
+\struct starpu_top_data
+todo
+\ingroup API_StarPUTop_Interface
+\var starpu_top_data::id
+todo
+\var starpu_top_data::name
+todo
+\var starpu_top_data::int_min_value
+todo
+\var starpu_top_data::int_max_value
+todo
+\var starpu_top_data::double_min_value
+todo
+\var starpu_top_data::double_max_value
+todo
+\var starpu_top_data::active
+todo
+\var starpu_top_data::type
+todo
+\var starpu_top_data::next
+todo
+
+\struct starpu_top_param
+todo
+\ingroup API_StarPUTop_Interface
+\var starpu_top_param::id
+todo
+\var starpu_top_param::name
+todo
+\var starpu_top_param::type
+todo
+\var starpu_top_param::value
+todo
+\var starpu_top_param::enum_values
+Only for the enum type; can be NULL
+\var starpu_top_param::nb_values
+todo
+\var starpu_top_param::callback
+todo
+\var starpu_top_param::int_min_value
+Only for the integer type
+\var starpu_top_param::int_max_value
+todo
+\var starpu_top_param::double_min_value
+Only for the double type
+\var starpu_top_param::double_max_value
+todo
+\var starpu_top_param::next
+todo
+
+@name Functions to call before the initialisation
+\ingroup API_StarPUTop_Interface
+
+\fn struct starpu_top_data *starpu_top_add_data_boolean(const char* data_name, int active)
+\ingroup API_StarPUTop_Interface
+This function registers a data item named \p data_name of type
+boolean. If \p active=0, the value will NOT be displayed to the user
+by default. Any other value will make the value displayed by default.
+
+\fn struct starpu_top_data * starpu_top_add_data_integer(const char* data_name, int minimum_value, int maximum_value, int active)
+\ingroup API_StarPUTop_Interface
+This function registers a data item named \p data_name of type
+integer. The minimum and maximum values will be useful to define the
+scale in the UI. If \p active=0, the value will NOT be displayed to
+the user by default. Any other value will make the value displayed by
+default.
+
+\fn struct starpu_top_data* starpu_top_add_data_float(const char* data_name, double minimum_value, double maximum_value, int active)
+\ingroup API_StarPUTop_Interface
+This function registers a data item named \p data_name of type float.
+The minimum and maximum values will be useful to define the scale in
+the UI. If \p active=0, the value will NOT be displayed to the user by
+default. Any other value will make the value displayed by default.
+
+\fn struct starpu_top_param* starpu_top_register_parameter_boolean(const char* param_name, int* parameter_field, void (*callback)(struct starpu_top_param*))
+\ingroup API_StarPUTop_Interface
+This function registers a parameter named \p param_name, of type
+boolean. The \p callback function will be called when the parameter is
+modified by the UI, and can be NULL.
+
+\fn struct starpu_top_param* starpu_top_register_parameter_float(const char* param_name, double* parameter_field, double minimum_value, double maximum_value, void (*callback)(struct starpu_top_param*))
+\ingroup API_StarPUTop_Interface
+This function registers a parameter named \p param_name, of type
+float. The minimum and maximum values will be used to prevent the user
+from setting an incorrect value. The \p callback function will be
+called when the parameter is modified by the UI, and can be NULL.
+
+\fn struct starpu_top_param* starpu_top_register_parameter_integer(const char* param_name, int* parameter_field, int minimum_value, int maximum_value, void (*callback)(struct starpu_top_param*))
+\ingroup API_StarPUTop_Interface
+This function registers a parameter named \p param_name, of type
+integer. The minimum and maximum values will be used to prevent the
+user from setting an incorrect value. The \p callback function will be
+called when the parameter is modified by the UI, and can be NULL.
+
+\fn struct starpu_top_param* starpu_top_register_parameter_enum(const char* param_name, int* parameter_field, char** values, int nb_values, void (*callback)(struct starpu_top_param*))
+\ingroup API_StarPUTop_Interface
+This function registers a parameter named \p param_name, of type enum.
+The \p values array of \p nb_values entries lists the acceptable
+values. The \p callback function will be called when the parameter is
+modified by the UI, and can be NULL.
+
+@name Initialisation
+\ingroup API_StarPUTop_Interface
+
+\fn void starpu_top_init_and_wait(const char *server_name)
+\ingroup API_StarPUTop_Interface
+This function must be called once all parameters and data have been
+registered AND initialised (for parameters). It will wait for a
+StarPU-Top UI to connect, send the initialisation sentences, and wait
+for the GO message.
+
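A hedged end-to-end sketch combining registration, initialisation, and an update; the counter name "load", the server name, and the update site are placeholders:

```c
#include <starpu_top.h>

static struct starpu_top_data *load_data;

void setup_monitoring(void)
{
	/* register before initialisation; displayed by default (active=1) */
	load_data = starpu_top_add_data_float("load", 0.0, 100.0, 1);

	/* blocks until a StarPU-Top UI connects and sends GO */
	starpu_top_init_and_wait("my_app");
}

void report_load(double load)
{
	starpu_top_update_data_float(load_data, load);
}
```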
+@name To call after initialisation
+\ingroup API_StarPUTop_Interface
+
+\fn void starpu_top_update_parameter(const struct starpu_top_param *param)
+\ingroup API_StarPUTop_Interface
+This function should be called after every modification of a parameter
+from something other than StarPU-Top. It notifies the UI that the
+configuration has changed.
+
+\fn void starpu_top_update_data_boolean(const struct starpu_top_data *data, int value)
+\ingroup API_StarPUTop_Interface
+This function updates the value of \p data on the UI.
+
+\fn void starpu_top_update_data_integer(const struct starpu_top_data *data, int value)
+\ingroup API_StarPUTop_Interface
+This function updates the value of \p data on the UI.
+
+\fn void starpu_top_update_data_float(const struct starpu_top_data *data, double value)
+\ingroup API_StarPUTop_Interface
+This function updates the value of \p data on the UI.
+
+\fn void starpu_top_task_prevision(struct starpu_task *task, int devid, unsigned long long start, unsigned long long end)
+\ingroup API_StarPUTop_Interface
+This function notifies the UI that \p task has been planned to run
+from \p start to \p end on the computation core \p devid.
+
+\fn void starpu_top_debug_log(const char *message)
+\ingroup API_StarPU-Top_Interface
+This function is useful in debug mode. The StarPU developer does not
+need to check whether debug mode is active: this is checked by
+starpu_top itself. It simply sends a message to be displayed by the UI.
+
+\fn void starpu_top_debug_lock(const char *message)
+\ingroup API_StarPU-Top_Interface
+This function is useful in debug mode. The StarPU developer does not
+need to check whether debug mode is active: this is checked by
+starpu_top itself. It sends a message and waits for a continue message
+from the UI before returning. The lock (which creates a stop-point)
+should be called only by the main thread. Calling it from more than
+one thread is not supported.
+
+*/
+

+ 28 - 0
doc/doxygen/chapters/api/versioning.doxy

@@ -0,0 +1,28 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Versioning Versioning
+
+\def STARPU_MAJOR_VERSION
+\ingroup API_Versioning
+Define the major version of StarPU. This is the version used when compiling the application.
+
+\def STARPU_MINOR_VERSION
+\ingroup API_Versioning
+Define the minor version of StarPU. This is the version used when compiling the application.
+
+\def STARPU_RELEASE_VERSION
+\ingroup API_Versioning
+Define the release version of StarPU. This is the version used when compiling the application.
+
+\fn void starpu_get_version(int *major, int *minor, int *release)
+\ingroup API_Versioning
+Return as 3 integers the version of StarPU used when running the application.
+
+*/
+

+ 178 - 0
doc/doxygen/chapters/api/workers.doxy

@@ -0,0 +1,178 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \defgroup API_Workers_Properties Workers’ Properties
+
+\enum starpu_node_kind
+\ingroup API_Workers_Properties
+TODO
+\var starpu_node_kind::STARPU_UNUSED
+\ingroup API_Workers_Properties
+TODO
+\var starpu_node_kind::STARPU_CPU_RAM
+\ingroup API_Workers_Properties
+TODO
+\var starpu_node_kind::STARPU_CUDA_RAM
+\ingroup API_Workers_Properties
+TODO
+\var starpu_node_kind::STARPU_OPENCL_RAM
+\ingroup API_Workers_Properties
+TODO
+\var starpu_node_kind::STARPU_MIC_RAM
+\ingroup API_Workers_Properties
+TODO
+\var starpu_node_kind::STARPU_SCC_RAM
+\ingroup API_Workers_Properties
+This node kind is not used anymore, but implementations in interfaces
+will be useful for MPI.
+\var starpu_node_kind::STARPU_SCC_SHM
+\ingroup API_Workers_Properties
+TODO
+
+\enum starpu_worker_archtype
+\ingroup API_Workers_Properties
+Worker Architecture Type
+\var starpu_worker_archtype::STARPU_ANY_WORKER
+\ingroup API_Workers_Properties
+any worker, used in the hypervisor
+\var starpu_worker_archtype::STARPU_CPU_WORKER
+\ingroup API_Workers_Properties
+CPU core
+\var starpu_worker_archtype::STARPU_CUDA_WORKER
+\ingroup API_Workers_Properties
+NVIDIA CUDA device
+\var starpu_worker_archtype::STARPU_OPENCL_WORKER
+\ingroup API_Workers_Properties
+OpenCL device
+\var starpu_worker_archtype::STARPU_MIC_WORKER
+\ingroup API_Workers_Properties
+Intel MIC device
+\var starpu_worker_archtype::STARPU_SCC_WORKER
+\ingroup API_Workers_Properties
+Intel SCC device
+
+
+\fn unsigned starpu_worker_get_count(void)
+\ingroup API_Workers_Properties
+This function returns the number of workers (i.e. processing
+units executing StarPU tasks). The returned value should be at most
+\ref STARPU_NMAXWORKERS.
+
+\fn int starpu_worker_get_count_by_type(enum starpu_worker_archtype type)
+\ingroup API_Workers_Properties
+Returns the number of workers of the given type. A positive (or zero)
+value is returned in case of success, while -EINVAL indicates that the
+type is not valid.
+
+\fn unsigned starpu_cpu_worker_get_count(void)
+\ingroup API_Workers_Properties
+This function returns the number of CPUs controlled by StarPU. The
+returned value should be at most \ref STARPU_MAXCPUS.
+
+\fn unsigned starpu_cuda_worker_get_count(void)
+\ingroup API_Workers_Properties
+This function returns the number of CUDA devices controlled by
+StarPU. The returned value should be at most \ref STARPU_MAXCUDADEVS.
+
+\fn unsigned starpu_mic_worker_get_count(void)
+\ingroup API_Workers_Properties
+This function returns the number of MIC workers controlled by StarPU.
+
+\fn unsigned starpu_mic_device_get_count(void)
+\ingroup API_Workers_Properties
+This function returns the number of MIC devices controlled by StarPU.
+The returned value should be at most \ref STARPU_MAXMICDEVS.
+
+\fn unsigned starpu_scc_worker_get_count(void)
+\ingroup API_Workers_Properties
+This function returns the number of SCC devices controlled by StarPU.
+The returned value should be at most \ref STARPU_MAXSCCDEVS.
+
+\fn unsigned starpu_opencl_worker_get_count(void)
+\ingroup API_Workers_Properties
+This function returns the number of OpenCL devices controlled by
+StarPU. The returned value should be at most \ref STARPU_MAXOPENCLDEVS.
+
+\fn int starpu_worker_get_id (void)
+\ingroup API_Workers_Properties
+This function returns the identifier of the current worker, i.e.
+the one associated to the calling thread. The returned value is either
+-1 if the current context is not a StarPU worker (i.e. when called
+from the application outside a task or a callback), or an integer
+between 0 and starpu_worker_get_count() - 1.
+
+\fn int starpu_worker_get_ids_by_type(enum starpu_worker_archtype type, int *workerids, int maxsize)
+\ingroup API_Workers_Properties
+This function gets the list of identifiers of workers with the
+given type. It fills the array \p workerids with the identifiers of the
+workers that have the type indicated in the first argument. The
+argument \p maxsize indicates the size of the array \p workerids. The returned
+value gives the number of identifiers that were put in the array.
+-ERANGE is returned if \p maxsize is lower than the number of workers
+with the appropriate type: in that case, the array is filled with the
+first \p maxsize elements. To avoid such overflows, the value of \p maxsize
+can be chosen by means of the function
+starpu_worker_get_count_by_type(), or by passing a value greater than or
+equal to \ref STARPU_NMAXWORKERS.
+
+\fn int starpu_worker_get_by_type(enum starpu_worker_archtype type, int num)
+\ingroup API_Workers_Properties
+This returns the identifier of the num-th worker that has the
+specified type \p type. If there is no such worker, -1 is returned.
+
+\fn int starpu_worker_get_by_devid(enum starpu_worker_archtype type, int devid)
+\ingroup API_Workers_Properties
+This returns the identifier of the worker that has the specified type
+\p type and device id \p devid (which may not be the n-th, if some
+devices are skipped for instance). If there is no such worker, -1 is returned.
+
+\fn int starpu_worker_get_devid(int id)
+\ingroup API_Workers_Properties
+This function returns the device id of the given worker. The
+worker should be identified with the value returned by the
+starpu_worker_get_id() function. In the case of a CUDA worker, this
+device identifier is the logical device identifier exposed by CUDA
+(used by the function cudaGetDevice() for instance). The device
+identifier of a CPU worker is the logical identifier of the core on
+which the worker was bound; this identifier is either provided by the
+OS or by the library <c>hwloc</c> in case it is available.
+
+\fn enum starpu_worker_archtype starpu_worker_get_type(int id)
+\ingroup API_Workers_Properties
+This function returns the type of processing unit associated to
+a worker. The worker identifier is a value returned by the function
+starpu_worker_get_id(). The returned value indicates the
+architecture of the worker: ::STARPU_CPU_WORKER for a CPU core,
+::STARPU_CUDA_WORKER for a CUDA device, and ::STARPU_OPENCL_WORKER for an
+OpenCL device. The value returned for an invalid identifier is
+unspecified.
+
+\fn void starpu_worker_get_name(int id, char *dst, size_t maxlen)
+\ingroup API_Workers_Properties
+This function gets the name of a given worker. StarPU
+associates a unique human readable string to each processing unit.
+This function copies at most the first \p maxlen bytes of the unique
+string associated to the worker identified by \p id into the
+\p dst buffer. The caller is responsible for ensuring that \p dst is a
+valid pointer to a buffer of at least \p maxlen bytes. Calling this
+function on an invalid identifier results in unspecified behaviour.
+
+\fn unsigned starpu_worker_get_memory_node(unsigned workerid)
+\ingroup API_Workers_Properties
+This function returns the identifier of the memory node
+associated to the worker identified by \p workerid.
+
+\fn enum starpu_node_kind starpu_node_get_kind(unsigned node)
+\ingroup API_Workers_Properties
+Returns the type of the given node as defined by
+::starpu_node_kind. For example, when defining a new data interface,
+this function should be used in the allocation function to determine
+on which device the memory needs to be allocated.
+
+*/

+ 732 - 0
doc/doxygen/chapters/basic_examples.doxy

@@ -0,0 +1,732 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page BasicExamples Basic Examples
+
+\section HelloWorldUsingTheCExtension Hello World Using The C Extension
+
+This section shows how to implement a simple program that submits a task
+to StarPU using the StarPU C extension (\ref cExtensions). The complete example, along with additional examples,
+is available in the <c>gcc-plugin/examples</c> directory of the StarPU
+distribution. A similar example showing how to directly use StarPU's API is shown
+in \ref HelloWorldUsingStarPUAPI.
+
+GCC version 4.5 and above permits the use of the StarPU GCC plug-in (\ref cExtensions). This makes writing a task both simpler and less error-prone.
+In a nutshell, all it takes is to declare a task, declare and define its
+implementations (for CPU, OpenCL, and/or CUDA), and invoke the task like
+a regular C function.  The example below defines <c>my_task</c> which
+has a single implementation for CPU:
+
+\snippet hello_pragma.c To be included
+
+The code can then be compiled and linked with GCC and the <c>-fplugin</c> flag:
+
+\verbatim
+$ gcc `pkg-config starpu-1.2 --cflags` hello-starpu.c \
+    -fplugin=`pkg-config starpu-1.2 --variable=gccplugin` \
+    `pkg-config starpu-1.2 --libs`
+\endverbatim
+
+The code can also be compiled without the StarPU C extension and will
+behave as a normal sequential code.
+
+\verbatim
+$ gcc hello-starpu.c
+hello-starpu.c:33:1: warning: ‘task’ attribute directive ignored [-Wattributes]
+$ ./a.out
+Hello, world! With x = 42
+\endverbatim
+
+As can be seen above, the C extensions allow programmers to
+use StarPU tasks by essentially annotating ``regular'' C code.
+
+\section HelloWorldUsingStarPUAPI Hello World Using StarPU's API
+
+This section shows how to achieve the same result as in the previous
+section using StarPU's standard C API.
+
+\subsection RequiredHeaders Required Headers
+
+The header starpu.h should be included in any code using StarPU.
+
+\code{.c}
+#include <starpu.h>
+\endcode
+
+\subsection DefiningACodelet Defining A Codelet
+
+\code{.c}
+struct params
+{
+    int i;
+    float f;
+};
+void cpu_func(void *buffers[], void *cl_arg)
+{
+    struct params *params = cl_arg;
+
+    printf("Hello world (params = {%i, %f} )\n", params->i, params->f);
+}
+
+struct starpu_codelet cl =
+{
+    .where = STARPU_CPU,
+    .cpu_funcs = { cpu_func, NULL },
+    .cpu_funcs_name = { "cpu_func", NULL },
+    .nbuffers = 0
+};
+\endcode
+
+A codelet is a structure that represents a computational kernel. Such a codelet
+may contain an implementation of the same kernel on different architectures
+(e.g. CUDA, x86, ...). For compatibility, make sure that the whole
+structure is properly initialized to zero, either by using the
+function starpu_codelet_init(), or by letting the
+compiler implicitly do it as exemplified above.
+
+The field starpu_codelet::nbuffers specifies the number of data buffers that are
+manipulated by the codelet: here the codelet does not access or modify any data
+that is controlled by our data management library. Note that the argument
+passed to the codelet (the field starpu_task::cl_arg) does not count
+as a buffer since it is not managed by our data management library,
+but just contains trivial parameters.
+
+\internal
+TODO need a crossref to the proper description of "where" see bla for more ...
+\endinternal
+
+We create a codelet which may only be executed on the CPUs. The field
+starpu_codelet::where is a bitmask that defines where the codelet may
+be executed. Here, the value ::STARPU_CPU means that only CPUs can
+execute this codelet. Note that field starpu_codelet::where is
+optional; when unset, its value is automatically set based on the
+availability of the different fields <c>XXX_funcs</c>.
+When a CPU core executes a codelet, it calls the function
+<c>cpu_func</c>, which \em must have the following prototype:
+
+\code{.c}
+void (*cpu_func)(void *buffers[], void *cl_arg);
+\endcode
+
+In this example, we can ignore the first argument of this function which gives a
+description of the input and output buffers (e.g. the size and the location of
+the matrices) since there is none.
+The second argument is a pointer to a buffer passed as an
+argument to the codelet by the means of the field starpu_task::cl_arg.
+
+\internal
+TODO rewrite so that it is a little clearer ?
+\endinternal
+
+Be aware that this may be a pointer to a
+\em copy of the actual buffer, and not the pointer given by the programmer:
+if the codelet modifies this buffer, there is no guarantee that the initial
+buffer will be modified as well: this for instance implies that the buffer
+cannot be used as a synchronization medium. If synchronization is needed, data
+has to be registered to StarPU, see \ref VectorScalingUsingStarPUAPI.
+
+\subsection SubmittingATask Submitting A Task
+
+\code{.c}
+void callback_func(void *callback_arg)
+{
+    printf("Callback function (arg %x)\n", callback_arg);
+}
+
+int main(int argc, char **argv)
+{
+    /* initialize StarPU */
+    starpu_init(NULL);
+
+    struct starpu_task *task = starpu_task_create();
+
+    task->cl = &cl; /* Pointer to the codelet defined above */
+
+    struct params params = { 1, 2.0f };
+    task->cl_arg = &params;
+    task->cl_arg_size = sizeof(params);
+
+    task->callback_func = callback_func;
+    task->callback_arg = 0x42;
+
+    /* starpu_task_submit will be a blocking call */
+    task->synchronous = 1;
+
+    /* submit the task to StarPU */
+    starpu_task_submit(task);
+
+    /* terminate StarPU */
+    starpu_shutdown();
+
+    return 0;
+}
+\endcode
+
+Before submitting any tasks to StarPU, starpu_init() must be called. The
+<c>NULL</c> argument specifies that we use the default configuration. Tasks cannot
+be submitted after the termination of StarPU by a call to
+starpu_shutdown().
+
+In the example above, a task structure is allocated by a call to
+starpu_task_create(). This function only allocates and fills the
+corresponding structure with the default settings, but it does not
+submit the task to StarPU.
+
+\internal
+not really clear ;)
+\endinternal
+
+The field starpu_task::cl is a pointer to the codelet which the task will
+execute: in other words, the codelet structure describes which computational
+kernel should be offloaded on the different architectures, and the task
+structure is a wrapper containing a codelet and the piece of data on which the
+codelet should operate.
+
+The optional field starpu_task::cl_arg is a pointer to a buffer
+(of size starpu_task::cl_arg_size) with some parameters for the kernel
+described by the codelet. For instance, if a codelet implements a
+computational kernel that multiplies its input vector by a constant,
+the constant could be specified by the means of this buffer, instead
+of registering it as a StarPU data. It must however be noted that
+StarPU avoids making copies whenever possible and rather passes the
+pointer as such, so the buffer which is pointed to must be kept allocated
+until the task terminates, and if several tasks are submitted with
+various parameters, each of them must be given a pointer to its own
+buffer.
+
+Once a task has been executed, an optional callback function is called.
+While the computational kernel could be offloaded on various architectures, the
+callback function is always executed on a CPU. The pointer
+starpu_task::callback_arg is passed as an argument of the callback
+function. The prototype of a callback function must be:
+
+\code{.c}
+void (*callback_function)(void *);
+\endcode
+
+If the field starpu_task::synchronous is non-zero, task submission
+will be synchronous: the function starpu_task_submit() will not return
+until the task has been executed. Note that the function starpu_shutdown()
+does not guarantee that asynchronous tasks have been executed before
+it returns, starpu_task_wait_for_all() can be used to that effect, or
+data can be unregistered (starpu_data_unregister()), which will
+implicitly wait for all the tasks scheduled to work on it, unless
+explicitly disabled thanks to
+starpu_data_set_default_sequential_consistency_flag() or
+starpu_data_set_sequential_consistency_flag().
+
+\subsection ExecutionOfHelloWorld Execution Of Hello World
+
+\verbatim
+$ make hello_world
+cc $(pkg-config --cflags starpu-1.2)  $(pkg-config --libs starpu-1.2) hello_world.c -o hello_world
+$ ./hello_world
+Hello world (params = {1, 2.000000} )
+Callback function (arg 42)
+\endverbatim
+
+\section VectorScalingUsingTheCExtension Vector Scaling Using the C Extension
+
+The previous example has shown how to submit tasks. In this section,
+we show how StarPU tasks can manipulate data.
+
+We will first show how to use the C language extensions provided by
+the GCC plug-in (\ref cExtensions). The complete example, along with
+additional examples, is available in the <c>gcc-plugin/examples</c>
+directory of the StarPU distribution. These extensions map directly
+to StarPU's main concepts: tasks, task implementations for CPU,
+OpenCL, or CUDA, and registered data buffers. The standard C version
+that uses StarPU's standard C programming interface is given in the
+next section (\ref VectorScalingUsingStarPUAPI).
+
+First of all, the vector-scaling task and its simple CPU implementation
+have to be defined:
+
+\code{.c}
+/* Declare the `vector_scal' task.  */
+static void vector_scal (unsigned size, float vector[size],
+                         float factor)
+  __attribute__ ((task));
+
+/* Define the standard CPU implementation.  */
+static void
+vector_scal (unsigned size, float vector[size], float factor)
+{
+  unsigned i;
+  for (i = 0; i < size; i++)
+    vector[i] *= factor;
+}
+\endcode
+
+Next, the body of the program, which uses the task defined above, can be
+implemented:
+
+\snippet hello_pragma2.c To be included
+
+The <c>main</c> function above does several things:
+
+<ul>
+<li>
+It initializes StarPU.
+</li>
+<li>
+It allocates <c>vector</c> in the heap; it will automatically be freed
+when its scope is left.  Alternatively, good old <c>malloc</c> and
+<c>free</c> could have been used, but they are more error-prone and
+require more typing.
+</li>
+<li>
+It registers the memory pointed to by <c>vector</c>.  Eventually,
+when OpenCL or CUDA task implementations are added, this will allow
+StarPU to transfer that memory region between GPUs and the main memory.
+Removing this <c>pragma</c> is an error.
+</li>
+<li>
+It invokes the <c>vector_scal</c> task.  The invocation looks the same
+as a standard C function call.  However, it is an asynchronous
+invocation, meaning that the actual call is performed in parallel with
+the caller's continuation.
+</li>
+<li>
+It waits for the termination of the <c>vector_scal</c>
+asynchronous call.
+</li>
+<li>
+Finally, StarPU is shut down.
+</li>
+</ul>
+
+The program can be compiled and linked with GCC and the <c>-fplugin</c>
+flag:
+
+\verbatim
+$ gcc `pkg-config starpu-1.2 --cflags` vector_scal.c \
+    -fplugin=`pkg-config starpu-1.2 --variable=gccplugin` \
+    `pkg-config starpu-1.2 --libs`
+\endverbatim
+
+And voilà!
+
+\subsection AddingAnOpenCLTaskImplementation Adding an OpenCL Task Implementation
+
+Now, this is all fine and great, but you certainly want to take
+advantage of these newfangled GPUs that your lab just bought, don't you?
+
+So, let's add an OpenCL implementation of the <c>vector_scal</c> task.
+We assume that the OpenCL kernel is available in a file,
+<c>vector_scal_opencl_kernel.cl</c>, not shown here.  The OpenCL task
+implementation is similar to that used with the standard C API
+(\ref DefinitionOfTheOpenCLKernel).  It is declared and defined
+in our C file like this:
+
+\code{.c}
+/* The OpenCL programs, loaded from 'main' (see below). */
+static struct starpu_opencl_program cl_programs;
+
+static void vector_scal_opencl (unsigned size, float vector[size],
+                                float factor)
+  __attribute__ ((task_implementation ("opencl", vector_scal)));
+
+static void
+vector_scal_opencl (unsigned size, float vector[size], float factor)
+{
+  int id, devid, err;
+  cl_kernel kernel;
+  cl_command_queue queue;
+  cl_event event;
+
+  /* VECTOR is a GPU memory pointer, not a main memory pointer. */
+  cl_mem val = (cl_mem) vector;
+
+  id = starpu_worker_get_id ();
+  devid = starpu_worker_get_devid (id);
+
+  /* Prepare to invoke the kernel.  In the future, this will be largely automated.  */
+  err = starpu_opencl_load_kernel (&kernel, &queue, &cl_programs,
+                                   "vector_mult_opencl", devid);
+  if (err != CL_SUCCESS)
+    STARPU_OPENCL_REPORT_ERROR (err);
+
+  err = clSetKernelArg (kernel, 0, sizeof (size), &size);
+  err |= clSetKernelArg (kernel, 1, sizeof (val), &val);
+  err |= clSetKernelArg (kernel, 2, sizeof (factor), &factor);
+  if (err)
+    STARPU_OPENCL_REPORT_ERROR (err);
+
+  size_t global = 1, local = 1;
+  err = clEnqueueNDRangeKernel (queue, kernel, 1, NULL, &global,
+                                &local, 0, NULL, &event);
+  if (err != CL_SUCCESS)
+    STARPU_OPENCL_REPORT_ERROR (err);
+
+  clFinish (queue);
+  starpu_opencl_collect_stats (event);
+  clReleaseEvent (event);
+
+  /* Done with KERNEL. */
+  starpu_opencl_release_kernel (kernel);
+}
+\endcode
+
+The OpenCL kernel itself must be loaded from <c>main</c>, sometime after
+the <c>initialize</c> pragma:
+
+\code{.c}
+starpu_opencl_load_opencl_from_file ("vector_scal_opencl_kernel.cl",
+                                       &cl_programs, "");
+\endcode
+
+And that's it.  The <c>vector_scal</c> task now has an additional
+implementation, for OpenCL, which StarPU's scheduler may choose to use
+at run-time.  Unfortunately, the <c>vector_scal_opencl</c> above still
+has to go through the common OpenCL boilerplate; in the future,
+additional extensions will automate most of it.
+
+\subsection AddingACUDATaskImplementation Adding a CUDA Task Implementation
+
+Adding a CUDA implementation of the task is very similar, except that
+the implementation itself is typically written in CUDA, and compiled
+with <c>nvcc</c>.  Thus, the C file only needs to contain an external
+declaration for the task implementation:
+
+\code{.c}
+extern void vector_scal_cuda (unsigned size, float vector[size],
+                              float factor)
+  __attribute__ ((task_implementation ("cuda", vector_scal)));
+\endcode
+
+The actual implementation of the CUDA task goes into a separate
+compilation unit, in a <c>.cu</c> file.  It is very close to the
+implementation when using StarPU's standard C API (\ref DefinitionOfTheCUDAKernel).
+
+\code{.c}
+/* CUDA implementation of the `vector_scal' task, to be compiled with `nvcc'. */
+
+#include <starpu.h>
+#include <stdlib.h>
+
+static __global__ void
+vector_mult_cuda (unsigned n, float *val, float factor)
+{
+  unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
+
+  if (i < n)
+    val[i] *= factor;
+}
+
+/* Definition of the task implementation declared in the C file. */
+extern "C" void
+vector_scal_cuda (unsigned size, float vector[], float factor)
+{
+  unsigned threads_per_block = 64;
+  unsigned nblocks = (size + threads_per_block - 1) / threads_per_block;
+
+  vector_mult_cuda <<< nblocks, threads_per_block, 0,
+    starpu_cuda_get_local_stream () >>> (size, vector, factor);
+
+  cudaStreamSynchronize (starpu_cuda_get_local_stream ());
+}
+\endcode
+
+The complete source code, in the <c>gcc-plugin/examples/vector_scal</c>
+directory of the StarPU distribution, also shows how an SSE-specialized
+CPU task implementation can be added.
+
+For more details on the C extensions provided by StarPU's GCC plug-in,
+\ref cExtensions.
+
+\section VectorScalingUsingStarPUAPI Vector Scaling Using StarPU's API
+
+This section shows how to achieve the same result as explained in the
+previous section using StarPU's standard C API.
+
+The full source code for
+this example is given in \ref FullSourceCodeVectorScal.
+
+\subsection SourceCodeOfVectorScaling Source Code of Vector Scaling
+
+Programmers can describe the data layout of their application so that StarPU is
+responsible for enforcing data coherency and availability across the machine.
+Instead of handling complex (and non-portable) mechanisms to perform data
+movements, programmers only declare which piece of data is accessed and/or
+modified by a task, and StarPU makes sure that when a computational kernel
+starts somewhere (e.g. on a GPU), its data are available locally.
+
+Before submitting those tasks, the programmer first needs to declare the
+different pieces of data to StarPU using the functions
+<c>starpu_*_data_register</c>. To ease the development of applications
+for StarPU, it is possible to describe multiple types of data layout.
+A type of data layout is called an <b>interface</b>. There are
+different predefined interfaces available in StarPU: here we will
+consider the <b>vector interface</b>.
+
+The following lines show how to declare an array of <c>NX</c> elements of type
+<c>float</c> using the vector interface:
+
+\code{.c}
+float vector[NX];
+
+starpu_data_handle_t vector_handle;
+starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector, NX,
+                            sizeof(vector[0]));
+\endcode
+
+The first argument, called the <b>data handle</b>, is an opaque pointer which
+designates the array in StarPU. This is also the structure which is used to
+describe which data is used by a task. The second argument is the node number
+where the data originally resides. Here it is 0 since the <c>vector</c> array is in
+the main memory. Then comes the pointer <c>vector</c> where the data can be found in main memory,
+the number of elements in the vector and the size of each element.
+The following shows how to construct a StarPU task that will manipulate the
+vector and a constant factor.
+
+\code{.c}
+float factor = 3.14;
+struct starpu_task *task = starpu_task_create();
+
+task->cl = &cl;                      /* Pointer to the codelet defined below */
+task->handles[0] = vector_handle;    /* First parameter of the codelet */
+task->cl_arg = &factor;
+task->cl_arg_size = sizeof(factor);
+task->synchronous = 1;
+
+starpu_task_submit(task);
+\endcode
+
+Since the factor is a mere constant float value parameter,
+it does not need a preliminary registration, and
+can just be passed through the pointer starpu_task::cl_arg like in the previous
+example.  The vector parameter is described by its handle.
+starpu_task::handles should be set with the handles of the data, the
+access modes for the data are defined in the field
+starpu_codelet::modes (::STARPU_R for read-only, ::STARPU_W for
+write-only and ::STARPU_RW for read and write access).
+
+The definition of the codelet can be written as follows:
+
+\code{.c}
+void scal_cpu_func(void *buffers[], void *cl_arg)
+{
+    unsigned i;
+    float *factor = cl_arg;
+
+    /* length of the vector */
+    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
+    /* CPU copy of the vector pointer */
+    float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
+
+    for (i = 0; i < n; i++)
+        val[i] *= *factor;
+}
+
+struct starpu_codelet cl =
+{
+    .cpu_funcs = { scal_cpu_func, NULL },
+    .cpu_funcs_name = { "scal_cpu_func", NULL },
+    .nbuffers = 1,
+    .modes = { STARPU_RW }
+};
+\endcode
+
+The first argument of <c>scal_cpu_func</c> is an array that gives
+a description of all the buffers passed in the array starpu_task::handles. The
+size of this array is given by the field starpu_codelet::nbuffers. For
+the sake of genericity, this array contains pointers to the different
+interfaces describing each buffer.  In the case of the <b>vector
+interface</b>, the location of the vector (resp. its length) is
+accessible in the field starpu_vector_interface::ptr (resp.
+starpu_vector_interface::nx) of this interface. Since the vector is
+accessed in a read-write fashion, any modification will automatically
+affect future accesses to this vector made by other tasks.
+
+The second argument of the function <c>scal_cpu_func</c> contains a
+pointer to the parameters of the codelet (given in
+starpu_task::cl_arg), so that we read the constant factor from this
+pointer.
+
+\subsection ExecutionOfVectorScaling Execution of Vector Scaling
+
+\verbatim
+$ make vector_scal
+cc $(pkg-config --cflags starpu-1.2)  $(pkg-config --libs starpu-1.2)  vector_scal.c   -o vector_scal
+$ ./vector_scal
+0.000000 3.000000 6.000000 9.000000 12.000000
+\endverbatim
+
+\section VectorScalingOnAnHybridCPUGPUMachine Vector Scaling on a Hybrid CPU/GPU Machine
+
+Contrary to the previous examples, the task submitted in this example may not
+only be executed by the CPUs, but also by a CUDA device.
+
+\subsection DefinitionOfTheCUDAKernel Definition of the CUDA Kernel
+
+The CUDA implementation can be written as follows. It needs to be compiled with
+a CUDA compiler such as nvcc, the NVIDIA CUDA compiler driver. It must be noted
+that the vector pointer returned by ::STARPU_VECTOR_GET_PTR is here a
+pointer in GPU memory, so that it can be passed as such to the
+<c>vector_mult_cuda</c> kernel call.
+
+\code{.c}
+#include <starpu.h>
+
+static __global__ void vector_mult_cuda(unsigned n, float *val,
+                                        float factor)
+{
+    unsigned i =  blockIdx.x*blockDim.x + threadIdx.x;
+    if (i < n)
+        val[i] *= factor;
+}
+
+extern "C" void scal_cuda_func(void *buffers[], void *_args)
+{
+    float *factor = (float *)_args;
+
+    /* length of the vector */
+    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
+    /* CUDA copy of the vector pointer */
+    float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
+    unsigned threads_per_block = 64;
+    unsigned nblocks = (n + threads_per_block-1) / threads_per_block;
+
+    vector_mult_cuda<<<nblocks,threads_per_block, 0, starpu_cuda_get_local_stream()>>>
+                    (n, val, *factor);
+
+    cudaStreamSynchronize(starpu_cuda_get_local_stream());
+}
+\endcode
+
+\subsection DefinitionOfTheOpenCLKernel Definition of the OpenCL Kernel
+
+The OpenCL implementation can be written as follows. StarPU provides
+tools to compile an OpenCL kernel stored in a file.
+
+\code{.c}
+__kernel void vector_mult_opencl(int nx, __global float* val, float factor)
+{
+        const int i = get_global_id(0);
+        if (i < nx) {
+                val[i] *= factor;
+        }
+}
+\endcode
+
+Contrary to the CUDA and CPU cases, ::STARPU_VECTOR_GET_DEV_HANDLE has to be
+used: it returns a <c>cl_mem</c> (an OpenCL handle, not a device
+pointer) which can be passed as such to the OpenCL kernel. The difference is
+important when using partitioning, see \ref PartitioningData.
+
+\code{.c}
+#include <starpu.h>
+
+extern struct starpu_opencl_program programs;
+
+void scal_opencl_func(void *buffers[], void *_args)
+{
+    float *factor = _args;
+    int id, devid, err;     /* OpenCL specific code */
+    cl_kernel kernel;       /* OpenCL specific code */
+    cl_command_queue queue; /* OpenCL specific code */
+    cl_event event;         /* OpenCL specific code */
+
+    /* length of the vector */
+    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
+    /* OpenCL copy of the vector pointer */
+    cl_mem val = (cl_mem) STARPU_VECTOR_GET_DEV_HANDLE(buffers[0]);
+
+    { /* OpenCL specific code */
+        id = starpu_worker_get_id();
+        devid = starpu_worker_get_devid(id);
+
+	err = starpu_opencl_load_kernel(&kernel, &queue, &programs,
+	                       "vector_mult_opencl", devid);   /* Name of the codelet defined above */
+        if (err != CL_SUCCESS) STARPU_OPENCL_REPORT_ERROR(err);
+
+        err = clSetKernelArg(kernel, 0, sizeof(n), &n);
+        err |= clSetKernelArg(kernel, 1, sizeof(val), &val);
+        err |= clSetKernelArg(kernel, 2, sizeof(*factor), factor);
+        if (err) STARPU_OPENCL_REPORT_ERROR(err);
+    }
+
+    {  /* OpenCL specific code */
+        size_t global=n;
+        size_t local=1;
+        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, &event);
+        if (err != CL_SUCCESS) STARPU_OPENCL_REPORT_ERROR(err);
+    }
+
+    {  /* OpenCL specific code */
+        clFinish(queue);
+        starpu_opencl_collect_stats(event);
+        clReleaseEvent(event);
+
+        starpu_opencl_release_kernel(kernel);
+    }
+}
+\endcode
+
+
+\subsection DefinitionOfTheMainCode Definition of the Main Code
+
+The CPU implementation is the same as in the previous section.
+
+Here is the source of the main application. Note that the fields
+starpu_codelet::cuda_funcs and starpu_codelet::opencl_funcs are set to
+define the pointers to the CUDA and OpenCL implementations of the
+task.
+
+\snippet vector_scal_c.c To be included
+
+\subsection ExecutionOfHybridVectorScaling Execution of Hybrid Vector Scaling
+
+The Makefile given at the beginning of the section must be extended to
+give the rules to compile the CUDA source code. Note that the source
+file of the OpenCL kernel does not need to be compiled now; it will
+be compiled at run-time when calling the function
+starpu_opencl_load_opencl_from_file().
+
+\verbatim
+CFLAGS  += $(shell pkg-config --cflags starpu-1.2)
+LDFLAGS += $(shell pkg-config --libs starpu-1.2)
+CC       = gcc
+
+vector_scal: vector_scal.o vector_scal_cpu.o vector_scal_cuda.o vector_scal_opencl.o
+
+%.o: %.cu
+       nvcc $(CFLAGS) $< -c -o $@
+
+clean:
+       rm -f vector_scal *.o
+\endverbatim
+
+\verbatim
+$ make
+\endverbatim
+
+and to execute it with the default configuration:
+
+\verbatim
+$ ./vector_scal
+0.000000 3.000000 6.000000 9.000000 12.000000
+\endverbatim
+
+or, for example, by disabling CPU devices:
+
+\verbatim
+$ STARPU_NCPU=0 ./vector_scal
+0.000000 3.000000 6.000000 9.000000 12.000000
+\endverbatim
+
+or by disabling CUDA devices (which may allow the use of OpenCL,
+see \ref EnablingOpenCL):
+
+\verbatim
+$ STARPU_NCUDA=0 ./vector_scal
+0.000000 3.000000 6.000000 9.000000 12.000000
+\endverbatim
+
+*/

+ 292 - 0
doc/doxygen/chapters/building.doxy

@@ -0,0 +1,292 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Université de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page BuildingAndInstallingStarPU Building and Installing StarPU
+
+\section InstallingABinaryPackage Installing a Binary Package
+
+Since one of the StarPU developers is a Debian Developer, the packages
+are well integrated and kept up to date. To see which packages are
+available, simply type:
+
+\verbatim
+$ apt-cache search starpu
+\endverbatim
+
+To install what you need, type:
+
+\verbatim
+$ sudo apt-get install libstarpu-1.2 libstarpu-dev
+\endverbatim
+
+\section InstallingFromSource Installing from Source
+
+StarPU can be built and installed by the standard means of the GNU
+autotools. The following sections briefly recall how these tools
+can be used to install StarPU.
+
+\subsection OptionalDependencies Optional Dependencies
+
+The <a href="http://www.open-mpi.org/software/hwloc"><c>hwloc</c> topology
+discovery library</a> is not mandatory to use StarPU but strongly
+recommended.  It allows for topology aware scheduling, which improves
+performance.  <c>hwloc</c> is available in major free operating system
+distributions, and for most other operating systems.
+
+If <c>hwloc</c> is not available on your system, the option
+\ref without-hwloc "--without-hwloc" should be explicitly given when calling the
+<c>configure</c> script. If <c>hwloc</c> is installed with a <c>pkg-config</c> file,
+no option is required, it will be detected automatically, otherwise
+\ref with-hwloc "--with-hwloc" should be used to specify the location of
+<c>hwloc</c>.
+
+\subsection GettingSources Getting Sources
+
+StarPU's sources can be obtained from the <a href="http://runtime.bordeaux.inria.fr/StarPU/files/">download page of
+the StarPU website</a>.
+
+All releases and the development tree of StarPU are freely available
+on INRIA's gforge under the LGPL license. Some releases are available
+under the BSD license.
+
+The latest release can be downloaded from the <a href="http://gforge.inria.fr/frs/?group_id=1570">INRIA's gforge</a> or
+directly from the <a href="http://runtime.bordeaux.inria.fr/StarPU/files/">StarPU download page</a>.
+
+The latest nightly snapshot can be downloaded from the <a href="http://starpu.gforge.inria.fr/testing/">StarPU gforge website</a>.
+
+\verbatim
+$ wget http://starpu.gforge.inria.fr/testing/starpu-nightly-latest.tar.gz
+\endverbatim
+
+Finally, the current development version is also accessible via svn.
+It should be used only if you need the very latest changes (i.e. less
+than a day old!). Note that a Subversion client can be obtained from
+http://subversion.tigris.org. If you
+are running on Windows, you will probably prefer to use <a href="http://tortoisesvn.tigris.org/">TortoiseSVN</a>.
+
+\verbatim
+$ svn checkout svn://scm.gforge.inria.fr/svn/starpu/trunk StarPU
+\endverbatim
+
+\subsection ConfiguringStarPU Configuring StarPU
+
+Running <c>autogen.sh</c> is not necessary when using the tarball
+releases of StarPU.  If you are using the source code from the svn
+repository, you first need to generate the configure scripts and the
+Makefiles. This requires the availability of <c>autoconf</c> (version
+2.60 or higher) and <c>automake</c>.
+
+\verbatim
+$ ./autogen.sh
+\endverbatim
+
+You then need to configure StarPU. Details about options that are
+useful to give to <c>./configure</c> are given in \ref CompilationConfiguration.
+
+\verbatim
+$ ./configure
+\endverbatim
+
+If <c>configure</c> does not detect some software or produces errors, please
+make sure to post the content of <c>config.log</c> when reporting the issue.
+
+By default, the files produced during the compilation are placed in
+the source directory. As the compilation generates a lot of files, it
+is advised to put them all in a separate directory. It is then
+easier to clean up, and this allows compiling several configurations
+out of the same source tree. To do so, simply enter the directory
+where you want the compilation to produce its files, and invoke the
+<c>configure</c> script located in the StarPU source directory.
+
+\verbatim
+$ mkdir build
+$ cd build
+$ ../configure
+\endverbatim
+
+\subsection BuildingStarPU Building StarPU
+
+\verbatim
+$ make
+\endverbatim
+
+Once everything is built, you may want to test the result. An
+extensive set of regression tests is provided with StarPU. Running the
+tests is done by calling <c>make check</c>. These tests are run every night
+and the result from the main profile is publicly <a href="http://starpu.gforge.inria.fr/testing/">available</a>.
+
+\verbatim
+$ make check
+\endverbatim
+
+\subsection InstallingStarPU Installing StarPU
+
+In order to install StarPU at the location that was specified during
+configuration:
+
+\verbatim
+$ make install
+\endverbatim
+
+Libtool interface versioning information is included in
+library names (libstarpu-1.2.so, libstarpumpi-1.2.so and
+libstarpufft-1.2.so).
+
+\section SettingUpYourOwnCode Setting up Your Own Code
+
+\subsection SettingFlagsForCompilingLinkingAndRunningApplications Setting Flags for Compiling, Linking and Running Applications
+
+Compiling and linking an application against StarPU may require
+specific flags or libraries (for instance <c>CUDA</c> or <c>libspe2</c>).
+To this end, StarPU provides a <c>pkg-config</c> file from which the
+relevant compiler and linker flags can be obtained.
+
+If StarPU was not installed at some standard location, the path of StarPU's
+library must be specified in the <c>PKG_CONFIG_PATH</c> environment variable so
+that <c>pkg-config</c> can find it. For example if StarPU was installed in
+<c>$prefix_dir</c>:
+
+\verbatim
+$ PKG_CONFIG_PATH=$PKG_CONFIG_PATH:$prefix_dir/lib/pkgconfig
+\endverbatim
+
+The flags required to compile or link against StarPU are then
+accessible with the following commands:
+
+\verbatim
+$ pkg-config --cflags starpu-1.2  # options for the compiler
+$ pkg-config --libs starpu-1.2    # options for the linker
+\endverbatim
+
+Note that it is still possible to use the API provided in the version
+1.0 of StarPU by calling <c>pkg-config</c> with the <c>starpu-1.0</c> package.
+Similar packages are provided for <c>starpumpi-1.0</c> and <c>starpufft-1.0</c>.
+It is also possible to use the API provided in the version
+0.9 of StarPU by calling <c>pkg-config</c> with the <c>libstarpu</c> package.
+Similar packages are provided for <c>libstarpumpi</c> and <c>libstarpufft</c>.
+
+Make sure that <c>pkg-config --libs starpu-1.2</c> actually produces some output
+before going further: <c>PKG_CONFIG_PATH</c> has to point to the place where
+<c>starpu-1.2.pc</c> was installed during <c>make install</c>.
+
+Also pass the <c>--static</c> option if the application is to be
+linked statically.
+
+It is also necessary to set the variable <c>LD_LIBRARY_PATH</c> to
+locate dynamic libraries at runtime.
+
+\verbatim
+$ LD_LIBRARY_PATH=$prefix_dir/lib:$LD_LIBRARY_PATH
+\endverbatim
+
+When using a Makefile, the following lines can be added to set the
+options for the compiler and the linker:
+
+\verbatim
+CFLAGS          +=      $$(pkg-config --cflags starpu-1.2)
+LDFLAGS         +=      $$(pkg-config --libs starpu-1.2)
+\endverbatim
+
+\subsection RunningABasicStarPUApplication Running a Basic StarPU Application
+
+Basic examples using StarPU are built in the directory
+<c>examples/basic_examples/</c> (and installed in
+<c>$prefix_dir/lib/starpu/examples/</c>). You can for example run the example
+<c>vector_scal</c>.
+
+\verbatim
+$ ./examples/basic_examples/vector_scal
+BEFORE: First element was 1.000000
+AFTER: First element is 3.140000
+\endverbatim
+
+When StarPU is used for the first time, the directory
+<c>$STARPU_HOME/.starpu/</c> is created; performance models will be stored in
+that directory (\ref STARPU_HOME).
+
+Please note that buses are benchmarked when StarPU is launched for the
+first time. This may take a few minutes, or less if <c>hwloc</c> is
+installed. This step is done only once per user and per machine.
+
+\subsection KernelThreadsStartedByStarPU Kernel Threads Started by StarPU
+
+StarPU automatically binds one thread per CPU core. It does not use
+SMT/hyperthreading because kernels are usually already optimized for using a
+full core, and using hyperthreading would make kernel calibration rather random.
+
+Since driving GPUs is a CPU-consuming task, StarPU dedicates one core
+per GPU.
+
+While StarPU tasks are executing, the application is not supposed to do
+computations in the threads it starts itself; tasks should be used instead.
+
+TODO: add a StarPU function to bind an application thread (e.g. the main thread)
+to a dedicated core (and thus disable the corresponding StarPU CPU worker).
+
+\subsection EnablingOpenCL Enabling OpenCL
+
+When both CUDA and OpenCL drivers are enabled, StarPU will launch an
+OpenCL worker for NVIDIA GPUs only if CUDA is not already running on them.
+This design choice was necessary as OpenCL and CUDA cannot run at the
+same time on the same NVIDIA GPU, since there is currently no interoperability
+between them.
+
+To enable OpenCL, you need either to disable CUDA when configuring StarPU:
+
+\verbatim
+$ ./configure --disable-cuda
+\endverbatim
+
+or when running applications:
+
+\verbatim
+$ STARPU_NCUDA=0 ./application
+\endverbatim
+
+OpenCL will automatically be started on any device not yet used by
+CUDA. So on a machine with 4 GPUs, it is possible to
+enable CUDA on 2 devices and OpenCL on the other 2 devices as
+follows:
+
+\verbatim
+$ STARPU_NCUDA=2 ./application
+\endverbatim
+
+\section BenchmarkingStarPU Benchmarking StarPU
+
+Some interesting benchmarks are installed among examples in
+<c>$prefix_dir/lib/starpu/examples/</c>. Make sure to try various
+schedulers, for instance <c>STARPU_SCHED=dmda</c>.
+
+\subsection TaskSizeOverhead Task Size Overhead
+
+This benchmark gives a glimpse into how large tasks should be for the StarPU
+overhead to be low enough.  Run <c>tasks_size_overhead.sh</c>; it will generate
+a plot of the speedup of tasks of various sizes, depending on the number of
+CPUs being used.
+
+\subsection DataTransferLatency Data Transfer Latency
+
+<c>local_pingpong</c> performs a ping-pong between the first two CUDA nodes, and
+prints the measured latency.
+
+\subsection MatrixMatrixMultiplication Matrix-Matrix Multiplication
+
+<c>sgemm</c> and <c>dgemm</c> perform a blocked matrix-matrix
+multiplication using BLAS and cuBLAS. They output the obtained GFlops.
+
+\subsection CholeskyFactorization Cholesky Factorization
+
+<c>cholesky\*</c> perform a Cholesky factorization (single precision). They use different dependency primitives.
+
+\subsection LUFactorization LU Factorization
+
+<c>lu\*</c> perform an LU factorization. They use different dependency primitives.
+
+*/

+ 360 - 0
doc/doxygen/chapters/c_extensions.doxy

@@ -0,0 +1,360 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Université de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page cExtensions C Extensions
+
+When GCC plug-in support is available, StarPU builds a plug-in for the
+GNU Compiler Collection (GCC), which defines extensions to languages of
+the C family (C, C++, Objective-C) that make it easier to write StarPU
+code. This feature is only available for GCC 4.5 and later; it
+is known to work with GCC 4.5, 4.6, and 4.7.  You
+may need to install a specific <c>-dev</c> package of your distro, such
+as <c>gcc-4.6-plugin-dev</c> on Debian and derivatives.  In addition,
+the plug-in's test suite is only run when <a href="http://www.gnu.org/software/guile/">GNU Guile</a> is found at
+<c>configure</c>-time.  Building the GCC plug-in
+can be disabled by configuring with \ref disable-gcc-extensions "--disable-gcc-extensions".
+
+Those extensions include syntactic sugar for defining
+tasks and their implementations, invoking a task, and manipulating data
+buffers.  Use of these extensions can be made conditional on the
+availability of the plug-in, leading to valid C sequential code when the
+plug-in is not used (\ref UsingCExtensionsConditionally).
+
+When StarPU has been installed with its GCC plug-in, programs that use
+these extensions can be compiled this way:
+
+\verbatim
+$ gcc -c -fplugin=`pkg-config starpu-1.2 --variable=gccplugin` foo.c
+\endverbatim
+
+When the plug-in is not available, the above <c>pkg-config</c>
+command returns the empty string.
+
+In addition, the <c>-fplugin-arg-starpu-verbose</c> flag can be used to
+obtain feedback from the compiler as it analyzes the C extensions used
+in source files.
+
+This section describes the C extensions implemented by StarPU's GCC
+plug-in.  It does not require detailed knowledge of the StarPU library.
+
+Note: this is still an area under development and subject to change.
+
+\section DefiningTasks Defining Tasks
+
+The StarPU GCC plug-in views tasks as "extended" C functions:
+
+<ul>
+<li>
+tasks may have several implementations---e.g., one for CPUs, one written
+in OpenCL, one written in CUDA;
+</li>
+<li>
+tasks may have several implementations of the same target---e.g.,
+several CPU implementations;
+</li>
+<li>
+when a task is invoked, it may run in parallel, and StarPU is free to
+choose any of its implementations.
+</li>
+</ul>
+
+Tasks and their implementations must be <em>declared</em>.  These
+declarations are annotated with attributes
+(http://gcc.gnu.org/onlinedocs/gcc/Attribute-Syntax.html#Attribute-Syntax):
+the declaration of a task is a regular C function declaration with an
+additional <c>task</c> attribute, and task implementations are
+declared with a <c>task_implementation</c> attribute.
+
+The following function attributes are provided:
+
+<dl>
+
+<dt><c>task</c></dt>
+<dd>
+Declare the given function as a StarPU task.  Its return type must be
+<c>void</c>.  When a function declared as <c>task</c> has a user-defined
+body, that body is interpreted as the implicit definition of the
+task's CPU implementation (see example below).  In all cases, the
+actual definition of a task's body is automatically generated by the
+compiler.
+
+Under the hood, declaring a task leads to the declaration of the
+corresponding <c>codelet</c> (\ref CodeletAndTasks).  If one or
+more task implementations are declared in the same compilation unit,
+then the codelet and the function itself are also defined; they inherit
+the scope of the task.
+
+Scalar arguments to the task are passed by value and copied to the
+target device if need be---technically, they are passed as the buffer
+starpu_task::cl_arg (\ref CodeletAndTasks).
+
+Pointer arguments are assumed to be registered data buffers---the
+handles argument of a task (starpu_task::handles); <c>const</c>-qualified
+pointer arguments are viewed as read-only buffers (::STARPU_R), and
+non-<c>const</c>-qualified buffers are assumed to be used read-write
+(::STARPU_RW).  In addition, the <c>output</c> type attribute can be used
+as a type qualifier for output pointer or array parameters
+(::STARPU_W).
+</dd>
+
+<dt><c>task_implementation (target, task)</c></dt>
+<dd>
+Declare the given function as an implementation of <c>task</c> to run on
+<c>target</c>.  <c>target</c> must be a string, currently one of
+<c>"cpu"</c>, <c>"opencl"</c>, or <c>"cuda"</c>.
+\internal
+FIXME: Update when OpenCL support is ready.
+\endinternal
+</dd>
+</dl>
+
+Here is an example:
+
+\code{.c}
+#define __output  __attribute__ ((output))
+
+static void matmul (const float *A, const float *B,
+                    __output float *C,
+                    unsigned nx, unsigned ny, unsigned nz)
+  __attribute__ ((task));
+
+static void matmul_cpu (const float *A, const float *B,
+                        __output float *C,
+                        unsigned nx, unsigned ny, unsigned nz)
+  __attribute__ ((task_implementation ("cpu", matmul)));
+
+
+static void
+matmul_cpu (const float *A, const float *B, __output float *C,
+            unsigned nx, unsigned ny, unsigned nz)
+{
+  unsigned i, j, k;
+
+  for (j = 0; j < ny; j++)
+    for (i = 0; i < nx; i++)
+      {
+        for (k = 0; k < nz; k++)
+          C[j * nx + i] += A[j * nz + k] * B[k * nx + i];
+      }
+}
+\endcode
+
+A <c>matmul</c> task is defined; it has only one implementation,
+<c>matmul_cpu</c>, which runs on the CPU.  Variables <c>A</c> and
+<c>B</c> are input buffers, whereas <c>C</c> is considered an input/output
+buffer.
+
+For convenience, when a function declared with the <c>task</c> attribute
+has a user-defined body, that body is assumed to be that of the CPU
+implementation of a task, which we call an implicit task CPU
+implementation.  Thus, the above snippet can be simplified like this:
+
+\code{.c}
+#define __output  __attribute__ ((output))
+
+static void matmul (const float *A, const float *B,
+                    __output float *C,
+                    unsigned nx, unsigned ny, unsigned nz)
+  __attribute__ ((task));
+
+/* Implicit definition of the CPU implementation of the
+   `matmul' task.  */
+static void
+matmul (const float *A, const float *B, __output float *C,
+        unsigned nx, unsigned ny, unsigned nz)
+{
+  unsigned i, j, k;
+
+  for (j = 0; j < ny; j++)
+    for (i = 0; i < nx; i++)
+      {
+        for (k = 0; k < nz; k++)
+          C[j * nx + i] += A[j * nz + k] * B[k * nx + i];
+      }
+}
+\endcode
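
Since the implicit CPU implementation is plain C, it can be exercised without the plug-in or even StarPU. The check below repeats the same loop nest under a hypothetical name (note that the kernel accumulates into <c>C</c>, so <c>C</c> must be zero-initialized):

```c
/* Same triple loop as the matmul task body above, without the
 * plug-in attributes.  C is ny x nx and is accumulated into, so it
 * must start zeroed. */
void matmul_check(const float *A, const float *B, float *C,
                  unsigned nx, unsigned ny, unsigned nz)
{
    unsigned i, j, k;

    for (j = 0; j < ny; j++)
        for (i = 0; i < nx; i++)
            for (k = 0; k < nz; k++)
                C[j * nx + i] += A[j * nz + k] * B[k * nx + i];
}
```

Multiplying by a 2x2 identity matrix, for instance, must leave the other operand unchanged.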
+
+Use of implicit CPU task implementations as above has the advantage that
+the code is valid sequential code when StarPU's GCC plug-in is not used
+(\ref UsingCExtensionsConditionally).
+
+CUDA and OpenCL implementations can be declared in a similar way:
+
+\code{.c}
+static void matmul_cuda (const float *A, const float *B, float *C,
+                         unsigned nx, unsigned ny, unsigned nz)
+  __attribute__ ((task_implementation ("cuda", matmul)));
+
+static void matmul_opencl (const float *A, const float *B, float *C,
+                           unsigned nx, unsigned ny, unsigned nz)
+  __attribute__ ((task_implementation ("opencl", matmul)));
+\endcode
+
+The CUDA and OpenCL implementations typically either invoke a kernel
+written in CUDA or OpenCL (for similar code, \ref CUDAKernel, and
+\ref OpenCLKernel), or call a library function that uses CUDA or
+OpenCL under the hood, such as CUBLAS functions:
+
+\code{.c}
+static void
+matmul_cuda (const float *A, const float *B, float *C,
+             unsigned nx, unsigned ny, unsigned nz)
+{
+  cublasSgemm ('n', 'n', nx, ny, nz,
+               1.0f, A, 0, B, 0,
+               0.0f, C, 0);
+  cudaStreamSynchronize (starpu_cuda_get_local_stream ());
+}
+\endcode
+
+A task can be invoked like a regular C function:
+
+\code{.c}
+matmul (&A[i * zdim * bydim + k * bzdim * bydim],
+        &B[k * xdim * bzdim + j * bxdim * bzdim],
+        &C[i * xdim * bydim + j * bxdim * bydim],
+        bxdim, bydim, bzdim);
+\endcode
+
+This leads to an asynchronous invocation, whereby <c>matmul</c>'s
+implementation may run in parallel with the continuation of the caller.
+
+The next section describes how memory buffers must be handled in
+StarPU-GCC code.  For a complete example, see the
+<c>gcc-plugin/examples</c> directory of the source distribution, and
+\ref VectorScalingUsingTheCExtension.
+
+
+\section InitializationTerminationAndSynchronization Initialization, Termination, and Synchronization
+
+The following pragmas allow user code to control StarPU's life time and
+to synchronize with tasks.
+
+<dl>
+
+<dt><c>\#pragma starpu initialize</c></dt>
+<dd>
+Initialize StarPU.  This call is compulsory and is <em>never</em> added
+implicitly.  One of the reasons this has to be done explicitly is that
+it provides greater control to user code over its resource usage.
+</dd>
+
+<dt><c>\#pragma starpu shutdown</c></dt>
+<dd>
+Shut down StarPU, giving it an opportunity to write profiling info to a
+file on disk, for instance (\ref Off-linePerformanceFeedback).
+</dd>
+
+<dt><c>\#pragma starpu wait</c></dt>
+<dd>
+Wait for all task invocations to complete, as with
+starpu_task_wait_for_all().
+</dd>
+</dl>
+
+\section RegisteredDataBuffers Registered Data Buffers
+
+Data buffers such as matrices and vectors that are to be passed to tasks
+must be registered.  Registration allows StarPU to handle data
+transfers among devices---e.g., transferring an input buffer from the
+CPU's main memory to a task scheduled to run on a GPU (\ref StarPUDataManagementLibrary).
+
+The following pragmas are provided:
+
+<dl>
+
+<dt><c>\#pragma starpu register ptr [size]</c></dt>
+<dd>
+Register <c>ptr</c> as a <c>size</c>-element buffer.  When <c>ptr</c> has
+an array type whose size is known, <c>size</c> may be omitted.
+Alternatively, the <c>registered</c> attribute can be used (see below).
+</dd>
+
+<dt><c>\#pragma starpu unregister ptr</c></dt>
+<dd>
+Unregister the previously-registered memory area pointed to by
+<c>ptr</c>.  As a side-effect, <c>ptr</c> points to a valid copy in main
+memory.
+</dd>
+
+<dt><c>\#pragma starpu acquire ptr</c></dt>
+<dd>
+Acquire in main memory an up-to-date copy of the previously-registered
+memory area pointed to by <c>ptr</c>, for read-write access.
+</dd>
+
+<dt><c>\#pragma starpu release ptr</c></dt>
+<dd>
+Release the previously-registered memory area pointed to by <c>ptr</c>,
+making it available to the tasks.
+</dd>
+</dl>
+
+Additionally, the following attributes offer a simple way to allocate
+and register storage for arrays:
+
+<dl>
+
+<dt><c>registered</c></dt>
+<dd>
+This attribute applies to local variables with an array type.  Its
+effect is to automatically register the array's storage, as per
+<c>\#pragma starpu register</c>.  The array is automatically unregistered
+when the variable's scope is left.  This attribute is typically used in
+conjunction with the <c>heap_allocated</c> attribute, described below.
+</dd>
+
+<dt><c>heap_allocated</c></dt>
+<dd>
+This attribute applies to local variables with an array type.  Its
+effect is to automatically allocate the array's storage on
+the heap, using starpu_malloc() under the hood.  The heap-allocated array is automatically
+freed when the variable's scope is left, as with
+automatic variables.
+</dd>
+</dl>
+
+The following example illustrates use of the <c>heap_allocated</c>
+attribute:
+
+\snippet cholesky_pragma.c To be included
+
+\section UsingCExtensionsConditionally Using C Extensions Conditionally
+
+The C extensions described in this chapter are only available when GCC
+and its StarPU plug-in are in use.  Yet, it is possible to make use of
+these extensions when they are available---leading to hybrid CPU/GPU
+code---and discard them when they are not available---leading to valid
+sequential code.
+
+To that end, the GCC plug-in defines the C preprocessor macro
+<c>STARPU_GCC_PLUGIN</c> when it is being used. When defined, this
+macro expands to an integer denoting the version of the supported C
+extensions.
+
+The code below illustrates how to define a task and its implementations
+in a way that allows it to be compiled without the GCC plug-in:
+
+\snippet matmul_pragma.c To be included
+
+The above program is a valid StarPU program when StarPU's GCC plug-in is
+used; it is also a valid sequential program when the plug-in is not
+used.
+
+Note that attributes such as <c>task</c> as well as <c>starpu</c>
+pragmas are simply ignored by GCC when the StarPU plug-in is not loaded.
+However, <c>gcc -Wall</c> emits a warning for unknown attributes and
+pragmas, which can be inconvenient.  In addition, other compilers may be
+unable to parse the attribute syntax (in practice, Clang and
+several proprietary compilers do implement attributes), so you may want to
+wrap attributes in macros like this:
+
+\snippet matmul_pragma2.c To be included
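
For reference, such a wrapper can be sketched as follows (the macro name <c>TASK</c> is hypothetical, not necessarily what the snippet above uses): when the plug-in is not loaded, <c>STARPU_GCC_PLUGIN</c> is undefined and the macro expands to nothing, so the declaration compiles as plain C with any compiler.

```c
/* Hypothetical wrapper macro: expands to the task attribute only
 * when the StarPU GCC plug-in defines STARPU_GCC_PLUGIN. */
#ifdef STARPU_GCC_PLUGIN
# define TASK __attribute__ ((task))
#else
# define TASK
#endif

/* Declared as a task under the plug-in, as a plain function otherwise. */
void vector_scal(unsigned n, float *vector, float factor) TASK;

void vector_scal(unsigned n, float *vector, float factor)
{
    unsigned i;
    for (i = 0; i < n; i++)
        vector[i] *= factor;
}
```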
+
+*/
+

+ 50 - 0
doc/doxygen/chapters/code/cholesky_pragma.c

@@ -0,0 +1,50 @@
+/* StarPU --- Runtime system for heterogeneous multicore architectures.
+ *
+ * Copyright (C) 2010-2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2010-2013  Université de Bordeaux 1
+ *
+ * StarPU is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ *
+ * StarPU is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * See the GNU Lesser General Public License in COPYING.LGPL for more details.
+ */
+
+//! [To be included]
+extern void cholesky(unsigned nblocks, unsigned size,
+                    float mat[nblocks][nblocks][size])
+  __attribute__ ((task));
+
+int
+main (int argc, char *argv[])
+{
+#pragma starpu initialize
+
+  /* ... */
+
+  int nblocks, size;
+  parse_args (&nblocks, &size);
+
+  /* Allocate an array of the required size on the heap,
+     and register it.  */
+
+  {
+    float matrix[nblocks][nblocks][size]
+      __attribute__ ((heap_allocated, registered));
+
+    cholesky (nblocks, size, matrix);
+
+#pragma starpu wait
+
+  }   /* MATRIX is automatically unregistered & freed here.  */
+
+#pragma starpu shutdown
+
+  return EXIT_SUCCESS;
+}
+//! [To be included]

+ 25 - 0
doc/doxygen/chapters/code/complex.c

@@ -0,0 +1,25 @@
+/* StarPU --- Runtime system for heterogeneous multicore architectures.
+ *
+ * Copyright (C) 2010-2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2010-2013  Université de Bordeaux 1
+ *
+ * StarPU is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ *
+ * StarPU is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * See the GNU Lesser General Public License in COPYING.LGPL for more details.
+ */
+
+//! [To be included]
+#define STARPU_COMPLEX_GET_REAL(interface)	\
+        (((struct starpu_complex_interface *)(interface))->real)
+#define STARPU_COMPLEX_GET_IMAGINARY(interface)	\
+        (((struct starpu_complex_interface *)(interface))->imaginary)
+#define STARPU_COMPLEX_GET_NX(interface)	\
+        (((struct starpu_complex_interface *)(interface))->nx)
+//! [To be included]

+ 42 - 0
doc/doxygen/chapters/code/forkmode.c

@@ -0,0 +1,42 @@
+/* StarPU --- Runtime system for heterogeneous multicore architectures.
+ *
+ * Copyright (C) 2010-2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2010-2013  Université de Bordeaux 1
+ *
+ * StarPU is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ *
+ * StarPU is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * See the GNU Lesser General Public License in COPYING.LGPL for more details.
+ */
+
+//! [To be included]
+void scal_cpu_func(void *buffers[], void *_args)
+{
+    unsigned i;
+    float *factor = _args;
+    struct starpu_vector_interface *vector = buffers[0];
+    unsigned n = STARPU_VECTOR_GET_NX(vector);
+    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
+
+#pragma omp parallel for num_threads(starpu_combined_worker_get_size())
+    for (i = 0; i < n; i++)
+        val[i] *= *factor;
+}
+
+static struct starpu_codelet cl =
+{
+    .modes = { STARPU_RW },
+    .where = STARPU_CPU,
+    .type = STARPU_FORKJOIN,
+    .max_parallelism = INT_MAX,
+    .cpu_funcs = {scal_cpu_func, NULL},
+    .cpu_funcs_name = {"scal_cpu_func", NULL},
+    .nbuffers = 1,
+};
+//! [To be included]

+ 46 - 0
doc/doxygen/chapters/code/hello_pragma.c

@@ -0,0 +1,46 @@
+/* StarPU --- Runtime system for heterogeneous multicore architectures.
+ *
+ * Copyright (C) 2010-2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2010-2013  Université de Bordeaux 1
+ *
+ * StarPU is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ *
+ * StarPU is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * See the GNU Lesser General Public License in COPYING.LGPL for more details.
+ */
+
+//! [To be included]
+#include <stdio.h>
+
+/* Task declaration.  */
+static void my_task (int x) __attribute__ ((task));
+
+/* Definition of the CPU implementation of `my_task'.  */
+static void my_task (int x)
+{
+  printf ("Hello, world!  With x = %d\n", x);
+}
+
+int main ()
+{
+  /* Initialize StarPU. */
+#pragma starpu initialize
+
+  /* Do an asynchronous call to `my_task'. */
+  my_task (42);
+
+  /* Wait for the call to complete.  */
+#pragma starpu wait
+
+  /* Terminate. */
+#pragma starpu shutdown
+
+  return 0;
+}
+//! [To be included]

+ 43 - 0
doc/doxygen/chapters/code/hello_pragma2.c

@@ -0,0 +1,43 @@
+/* StarPU --- Runtime system for heterogeneous multicore architectures.
+ *
+ * Copyright (C) 2010-2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2010-2013  Université de Bordeaux 1
+ *
+ * StarPU is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ *
+ * StarPU is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * See the GNU Lesser General Public License in COPYING.LGPL for more details.
+ */
+
+//! [To be included]
+int main (void)
+{
+#pragma starpu initialize
+
+#define NX     0x100000
+#define FACTOR 3.14
+
+  {
+    float vector[NX]
+       __attribute__ ((heap_allocated, registered));
+
+    size_t i;
+    for (i = 0; i < NX; i++)
+      vector[i] = (float) i;
+
+    vector_scal (NX, vector, FACTOR);
+
+#pragma starpu wait
+  } /* VECTOR is automatically freed here. */
+
+#pragma starpu shutdown
+
+  return valid ? EXIT_SUCCESS : EXIT_FAILURE;
+}
+//! [To be included]

+ 73 - 0
doc/doxygen/chapters/code/matmul_pragma.c

@@ -0,0 +1,73 @@
+/* StarPU --- Runtime system for heterogeneous multicore architectures.
+ *
+ * Copyright (C) 2010-2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2010-2013  Université de Bordeaux 1
+ *
+ * StarPU is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ *
+ * StarPU is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * See the GNU Lesser General Public License in COPYING.LGPL for more details.
+ */
+
+//! [To be included]
+/* This program is valid, whether or not StarPU's GCC plug-in
+   is being used.  */
+
+#include <stdlib.h>
+
+/* The attribute below is ignored when GCC is not used.  */
+static void matmul (const float *A, const float *B, float * C,
+                    unsigned nx, unsigned ny, unsigned nz)
+  __attribute__ ((task));
+
+static void
+matmul (const float *A, const float *B, float * C,
+        unsigned nx, unsigned ny, unsigned nz)
+{
+  /* Code of the CPU kernel here...  */
+}
+
+#ifdef STARPU_GCC_PLUGIN
+/* Optional OpenCL task implementation.  */
+
+static void matmul_opencl (const float *A, const float *B, float * C,
+                           unsigned nx, unsigned ny, unsigned nz)
+  __attribute__ ((task_implementation ("opencl", matmul)));
+
+static void
+matmul_opencl (const float *A, const float *B, float * C,
+               unsigned nx, unsigned ny, unsigned nz)
+{
+  /* Code that invokes the OpenCL kernel here...  */
+}
+#endif
+
+int
+main (int argc, char *argv[])
+{
+  /* The pragmas below are simply ignored when StarPU-GCC
+     is not used.  */
+#pragma starpu initialize
+
+  float A[123][42][7], B[123][42][7], C[123][42][7];
+
+#pragma starpu register A
+#pragma starpu register B
+#pragma starpu register C
+
+  /* When StarPU-GCC is used, the call below is asynchronous;
+     otherwise, it is synchronous.  */
+  matmul ((float *) A, (float *) B, (float *) C, 123, 42, 7);
+
+#pragma starpu wait
+#pragma starpu shutdown
+
+  return EXIT_SUCCESS;
+}
+//! [To be included]

+ 29 - 0
doc/doxygen/chapters/code/matmul_pragma2.c

@@ -0,0 +1,29 @@
+/* StarPU --- Runtime system for heterogeneous multicore architectures.
+ *
+ * Copyright (C) 2010-2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2010-2013  Université de Bordeaux 1
+ *
+ * StarPU is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ *
+ * StarPU is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * See the GNU Lesser General Public License in COPYING.LGPL for more details.
+ */
+
+//! [To be included]
+/* Use the `task' attribute only when StarPU's GCC plug-in
+   is available.   */
+#ifdef STARPU_GCC_PLUGIN
+# define __task  __attribute__ ((task))
+#else
+# define __task
+#endif
+
+static void matmul (const float *A, const float *B, float *C,
+                    unsigned nx, unsigned ny, unsigned nz) __task;
+//! [To be included]

+ 61 - 0
doc/doxygen/chapters/code/multiformat.c

@@ -0,0 +1,61 @@
+/* StarPU --- Runtime system for heterogeneous multicore architectures.
+ *
+ * Copyright (C) 2010-2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2010-2013  Université de Bordeaux 1
+ *
+ * StarPU is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ *
+ * StarPU is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * See the GNU Lesser General Public License in COPYING.LGPL for more details.
+ */
+
+//! [To be included]
+#define NX 1024
+struct point array_of_structs[NX];
+starpu_data_handle_t handle;
+
+/*
+ * The conversion of a piece of data is itself a task, though it is created,
+ * submitted and destroyed by StarPU internals and not by the user. Therefore,
+ * we have to define two codelets.
+ * Note that for now the conversion from the CPU format to the GPU format has to
+ * be executed on the GPU, and the conversion from the GPU to the CPU has to be
+ * executed on the CPU.
+ */
+#ifdef STARPU_USE_OPENCL
+void cpu_to_opencl_opencl_func(void *buffers[], void *args);
+struct starpu_codelet cpu_to_opencl_cl = {
+    .where = STARPU_OPENCL,
+    .opencl_funcs = { cpu_to_opencl_opencl_func, NULL },
+    .nbuffers = 1,
+    .modes = { STARPU_RW }
+};
+
+void opencl_to_cpu_func(void *buffers[], void *args);
+struct starpu_codelet opencl_to_cpu_cl = {
+    .where = STARPU_CPU,
+    .cpu_funcs = { opencl_to_cpu_func, NULL },
+    .cpu_funcs_name = { "opencl_to_cpu_func", NULL },
+    .nbuffers = 1,
+    .modes = { STARPU_RW }
+};
+#endif
+
+struct starpu_multiformat_data_interface_ops format_ops = {
+#ifdef STARPU_USE_OPENCL
+    .opencl_elemsize = 2 * sizeof(float),
+    .cpu_to_opencl_cl = &cpu_to_opencl_cl,
+    .opencl_to_cpu_cl = &opencl_to_cpu_cl,
+#endif
+    .cpu_elemsize = 2 * sizeof(float),
+    ...
+};
+
+starpu_multiformat_data_register(handle, 0, &array_of_structs, NX, &format_ops);
+//! [To be included]

+ 32 - 0
doc/doxygen/chapters/code/simgrid.c

@@ -0,0 +1,32 @@
+/* StarPU --- Runtime system for heterogeneous multicore architectures.
+ *
+ * Copyright (C) 2010-2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2010-2013  Université de Bordeaux 1
+ *
+ * StarPU is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ *
+ * StarPU is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * See the GNU Lesser General Public License in COPYING.LGPL for more details.
+ */
+
+//! [To be included]
+static struct starpu_codelet cl11 =
+{
+	.cpu_funcs = {chol_cpu_codelet_update_u11, NULL},
+	.cpu_funcs_name = {"chol_cpu_codelet_update_u11", NULL},
+#ifdef STARPU_USE_CUDA
+	.cuda_funcs = {chol_cublas_codelet_update_u11, NULL},
+#elif defined(STARPU_SIMGRID)
+	.cuda_funcs = {(void*)1, NULL},
+#endif
+	.nbuffers = 1,
+	.modes = {STARPU_RW},
+	.model = &chol_model_11
+};
+//! [To be included]

+ 128 - 0
doc/doxygen/chapters/code/vector_scal_c.c

@@ -0,0 +1,128 @@
+/* StarPU --- Runtime system for heterogeneous multicore architectures.
+ *
+ * Copyright (C) 2010-2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2010-2013  Université de Bordeaux 1
+ *
+ * StarPU is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ *
+ * StarPU is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * See the GNU Lesser General Public License in COPYING.LGPL for more details.
+ */
+
+//! [To be included]
+/*
+ * This example demonstrates how to use StarPU to scale an array by a factor.
+ * It shows how to manipulate data with StarPU's data management library.
+ *  1- how to declare a piece of data to StarPU (starpu_vector_data_register)
+ *  2- how to describe which data are accessed by a task (task->handles[0])
+ *  3- how a kernel can manipulate the data (buffers[0].vector.ptr)
+ */
+#include <starpu.h>
+
+#define    NX    2048
+
+extern void scal_cpu_func(void *buffers[], void *_args);
+extern void scal_sse_func(void *buffers[], void *_args);
+extern void scal_cuda_func(void *buffers[], void *_args);
+extern void scal_opencl_func(void *buffers[], void *_args);
+
+static struct starpu_codelet cl = {
+    .where = STARPU_CPU | STARPU_CUDA | STARPU_OPENCL,
+    /* CPU implementation of the codelet */
+    .cpu_funcs = { scal_cpu_func, scal_sse_func, NULL },
+    .cpu_funcs_name = { "scal_cpu_func", "scal_sse_func", NULL },
+#ifdef STARPU_USE_CUDA
+    /* CUDA implementation of the codelet */
+    .cuda_funcs = { scal_cuda_func, NULL },
+#endif
+#ifdef STARPU_USE_OPENCL
+    /* OpenCL implementation of the codelet */
+    .opencl_funcs = { scal_opencl_func, NULL },
+#endif
+    .nbuffers = 1,
+    .modes = { STARPU_RW }
+};
+
+#ifdef STARPU_USE_OPENCL
+struct starpu_opencl_program programs;
+#endif
+
+int main(int argc, char **argv)
+{
+    /* We consider a vector of float that is initialized just like any
+     * other C data */
+    float vector[NX];
+    unsigned i;
+    for (i = 0; i < NX; i++)
+        vector[i] = 1.0f;
+
+    fprintf(stderr, "BEFORE: First element was %f\n", vector[0]);
+
+    /* Initialize StarPU with default configuration */
+    starpu_init(NULL);
+
+#ifdef STARPU_USE_OPENCL
+        starpu_opencl_load_opencl_from_file(
+               "examples/basic_examples/vector_scal_opencl_kernel.cl", &programs, NULL);
+#endif
+
+    /* Tell StarPU to associate the "vector" vector with the "vector_handle"
+     * identifier. When a task needs to access a piece of data, it should
+     * refer to the handle that is associated to it.
+     * In the case of the "vector" data interface:
+     *  - the first argument of the registration method is a pointer to the
+     *    handle that should describe the data
+     *  - the second argument is the memory node where the data (i.e. "vector")
+     *    resides initially: 0 stands for an address in main memory, as
+     *    opposed to an address on a GPU for instance.
+     *  - the third argument is the address of the vector in RAM
+     *  - the fourth argument is the number of elements in the vector
+     *  - the fifth argument is the size of each element.
+     */
+    starpu_data_handle_t vector_handle;
+    starpu_vector_data_register(&vector_handle, 0, (uintptr_t)vector,
+                                NX, sizeof(vector[0]));
+
+    float factor = 3.14;
+
+    /* create a synchronous task: any call to starpu_task_submit will block
+     * until it is terminated */
+    struct starpu_task *task = starpu_task_create();
+    task->synchronous = 1;
+
+    task->cl = &cl;
+
+    /* the codelet manipulates one buffer in RW mode */
+    task->handles[0] = vector_handle;
+
+    /* an argument is passed to the codelet, beware that this is a
+     * READ-ONLY buffer and that the codelet may be given a pointer to a
+     * COPY of the argument */
+    task->cl_arg = &factor;
+    task->cl_arg_size = sizeof(factor);
+
+    /* execute the task on any eligible computational resource */
+    starpu_task_submit(task);
+
+    /* StarPU does not need to manipulate the array anymore so we can stop
+     * monitoring it */
+    starpu_data_unregister(vector_handle);
+
+#ifdef STARPU_USE_OPENCL
+    starpu_opencl_unload_opencl(&programs);
+#endif
+
+    /* terminate StarPU, no task can be submitted after */
+    starpu_shutdown();
+
+    fprintf(stderr, "AFTER: First element is %f\n", vector[0]);
+
+    return 0;
+}
+//! [To be included]

+ 78 - 0
doc/doxygen/chapters/code/vector_scal_cpu.c

@@ -0,0 +1,78 @@
+/* StarPU --- Runtime system for heterogeneous multicore architectures.
+ *
+ * Copyright (C) 2010, 2011, 2012  Centre National de la Recherche Scientifique
+ *
+ * StarPU is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ *
+ * StarPU is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * See the GNU Lesser General Public License in COPYING.LGPL for more details.
+ */
+
+//! [To be included]
+
+#include <starpu.h>
+#include <xmmintrin.h>
+
+/* This kernel takes a buffer and scales it by a constant factor */
+void scal_cpu_func(void *buffers[], void *cl_arg)
+{
+    unsigned i;
+    float *factor = cl_arg;
+
+    /*
+     * The "buffers" array matches the task->handles array: for instance
+     * task->handles[0] is a handle that corresponds to a data with
+     * vector "interface", so that the first entry of the array in the
+     * codelet is a pointer to a structure describing such a vector (i.e.
+     * struct starpu_vector_interface *). Here, we therefore manipulate
+     * the buffers[0] element as a vector: nx gives the number of elements
+     * in the array, ptr gives the location of the array (that was possibly
+     * migrated/replicated), and elemsize gives the size of each element.
+     */
+    struct starpu_vector_interface *vector = buffers[0];
+
+    /* length of the vector */
+    unsigned n = STARPU_VECTOR_GET_NX(vector);
+
+    /* get a pointer to the local copy of the vector: note that we have to
+     * cast it in (float *) since a vector could contain any type of
+     * elements so that the .ptr field is actually a uintptr_t */
+    float *val = (float *)STARPU_VECTOR_GET_PTR(vector);
+
+    /* scale the vector */
+    for (i = 0; i < n; i++)
+        val[i] *= *factor;
+}
+
+void scal_sse_func(void *buffers[], void *cl_arg)
+{
+    float *vector = (float *) STARPU_VECTOR_GET_PTR(buffers[0]);
+    unsigned int n = STARPU_VECTOR_GET_NX(buffers[0]);
+    unsigned int n_iterations = n/4;
+
+    __m128 *VECTOR = (__m128*) vector;
+    __m128 FACTOR STARPU_ATTRIBUTE_ALIGNED(16);
+    float factor = *(float *) cl_arg;
+    FACTOR = _mm_set1_ps(factor);
+
+    unsigned int i;
+    for (i = 0; i < n_iterations; i++)
+        VECTOR[i] = _mm_mul_ps(FACTOR, VECTOR[i]);
+
+    unsigned int remainder = n%4;
+    if (remainder != 0)
+    {
+        unsigned int start = 4 * n_iterations;
+        for (i = start; i < start+remainder; ++i)
+        {
+            vector[i] = factor * vector[i];
+        }
+    }
+}
+//! [To be included]

+ 45 - 0
doc/doxygen/chapters/code/vector_scal_cuda.cu

@@ -0,0 +1,45 @@
+/* StarPU --- Runtime system for heterogeneous multicore architectures.
+ *
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2010  Université de Bordeaux 1
+ *
+ * StarPU is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ *
+ * StarPU is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * See the GNU Lesser General Public License in COPYING.LGPL for more details.
+ */
+//! [To be included]
+#include <starpu.h>
+
+static __global__ void vector_mult_cuda(unsigned n, float *val,
+                                        float factor)
+{
+        unsigned i =  blockIdx.x*blockDim.x + threadIdx.x;
+        if (i < n)
+               val[i] *= factor;
+}
+
+extern "C" void scal_cuda_func(void *buffers[], void *_args)
+{
+        float *factor = (float *)_args;
+
+        /* length of the vector */
+        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
+        /* local copy of the vector pointer */
+        float *val = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
+        unsigned threads_per_block = 64;
+        unsigned nblocks = (n + threads_per_block-1) / threads_per_block;
+
+        vector_mult_cuda<<<nblocks,threads_per_block, 0, starpu_cuda_get_local_stream()>>>
+	                (n, val, *factor);
+
+        cudaStreamSynchronize(starpu_cuda_get_local_stream());
+}
+//! [To be included]
+

+ 72 - 0
doc/doxygen/chapters/code/vector_scal_opencl.c

@@ -0,0 +1,72 @@
+/* StarPU --- Runtime system for heterogeneous multicore architectures.
+ *
+ * Copyright (C) 2010, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2010  Institut National de Recherche en Informatique et Automatique
+ * Copyright (C) 2011  Université de Bordeaux 1
+ *
+ * StarPU is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ *
+ * StarPU is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * See the GNU Lesser General Public License in COPYING.LGPL for more details.
+ */
+
+//! [To be included]
+#include <starpu.h>
+
+extern struct starpu_opencl_program programs;
+
+void scal_opencl_func(void *buffers[], void *_args)
+{
+    float *factor = _args;
+    int id, devid, err;
+    cl_kernel kernel;
+    cl_command_queue queue;
+    cl_event event;
+
+    /* length of the vector */
+    unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
+    /* OpenCL copy of the vector pointer */
+    cl_mem val = (cl_mem)STARPU_VECTOR_GET_DEV_HANDLE(buffers[0]);
+
+    id = starpu_worker_get_id();
+    devid = starpu_worker_get_devid(id);
+
+    err = starpu_opencl_load_kernel(&kernel, &queue, &programs, "vector_mult_opencl",
+                                    devid);
+    if (err != CL_SUCCESS) STARPU_OPENCL_REPORT_ERROR(err);
+
+    err = clSetKernelArg(kernel, 0, sizeof(n), &n);
+    err |= clSetKernelArg(kernel, 1, sizeof(val), &val);
+    err |= clSetKernelArg(kernel, 2, sizeof(*factor), factor);
+    if (err) STARPU_OPENCL_REPORT_ERROR(err);
+
+    {
+        size_t global=n;
+        size_t local;
+        size_t s;
+        cl_device_id device;
+
+        starpu_opencl_get_device(devid, &device);
+        err = clGetKernelWorkGroupInfo (kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
+                                        sizeof(local), &local, &s);
+        if (err != CL_SUCCESS) STARPU_OPENCL_REPORT_ERROR(err);
+        if (local > global) local=global;
+
+        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0,
+                                     NULL, &event);
+        if (err != CL_SUCCESS) STARPU_OPENCL_REPORT_ERROR(err);
+    }
+
+    clFinish(queue);
+    starpu_opencl_collect_stats(event);
+    clReleaseEvent(event);
+
+    starpu_opencl_release_kernel(kernel);
+}
+//! [To be included]

+ 25 - 0
doc/doxygen/chapters/code/vector_scal_opencl_codelet.cl

@@ -0,0 +1,25 @@
+/* StarPU --- Runtime system for heterogeneous multicore architectures.
+ *
+ * Copyright (C) 2010, 2011, 2013  Centre National de la Recherche Scientifique
+ *
+ * StarPU is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or (at
+ * your option) any later version.
+ *
+ * StarPU is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+ *
+ * See the GNU Lesser General Public License in COPYING.LGPL for more details.
+ */
+
+//! [To be included]
+__kernel void vector_mult_opencl(int nx, __global float* val, float factor)
+{
+        const int i = get_global_id(0);
+        if (i < nx) {
+                val[i] *= factor;
+        }
+}
+//! [To be included]

+ 501 - 0
doc/doxygen/chapters/configure_options.doxy

@@ -0,0 +1,501 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page CompilationConfiguration Compilation Configuration
+
+The behavior of the StarPU library and tools may be tuned thanks to
+the following configure options.
+
+\section CommonConfiguration Common Configuration
+
+<dl>
+
+<dt>--enable-debug</dt>
+<dd>
+\anchor enable-debug
+\addindex __configure__--enable-debug
+Enable debugging messages.
+</dd>
+
+<dt>--enable-fast</dt>
+<dd>
+\anchor enable-fast
+\addindex __configure__--enable-fast
+Disable assertion checks, which saves computation time.
+</dd>
+
+<dt>--enable-verbose</dt>
+<dd>
+\anchor enable-verbose
+\addindex __configure__--enable-verbose
+Increase the verbosity of the debugging messages.  This can be disabled
+at runtime by setting the environment variable \ref STARPU_SILENT to
+any value.
+
+\verbatim
+$ STARPU_SILENT=1 ./vector_scal
+\endverbatim
+</dd>
+
+<dt>--enable-coverage</dt>
+<dd>
+\anchor enable-coverage
+\addindex __configure__--enable-coverage
+Enable flags for the coverage tool <c>gcov</c>.
+</dd>
+
+<dt>--enable-quick-check</dt>
+<dd>
+\anchor enable-quick-check
+\addindex __configure__--enable-quick-check
+Specify that tests and examples should be run on a smaller data set, i.e.
+allowing a faster execution time.
+</dd>
+
+<dt>--enable-long-check</dt>
+<dd>
+\anchor enable-long-check
+\addindex __configure__--enable-long-check
+Enable some exhaustive checks which take a really long time.
+</dd>
+
+<dt>--with-hwloc</dt>
+<dd>
+\anchor with-hwloc
+\addindex __configure__--with-hwloc
+Specify that <c>hwloc</c> should be used by StarPU. <c>hwloc</c> should be
+found by means of the tool <c>pkg-config</c>.
+</dd>
+
+<dt>--with-hwloc=<c>prefix</c></dt>
+<dd>
+\anchor with-hwloc
+\addindex __configure__--with-hwloc
+Specify that <c>hwloc</c> should be used by StarPU. <c>hwloc</c> should be
+found in the directory specified by <c>prefix</c>.
+</dd>
+
+<dt>--without-hwloc</dt>
+<dd>
+\anchor without-hwloc
+\addindex __configure__--without-hwloc
+Specify that <c>hwloc</c> should not be used by StarPU.
+</dd>
+
+<dt>--disable-build-doc</dt>
+<dd>
+\anchor disable-build-doc
+\addindex __configure__--disable-build-doc
+Disable the creation of the documentation. This should be done on a
+machine which does not have the tools <c>doxygen</c> and <c>latex</c>
+(plus the packages <c>latex-xcolor</c> and <c>texlive-latex-extra</c>).
+</dd>
+</dl>
+
+Additionally, the script <c>configure</c> recognizes many variables, which
+can be listed by typing <c>./configure --help</c>. For example,
+<c>./configure NVCCFLAGS="-arch sm_13"</c> adds a flag for the compilation of
+CUDA kernels.
+
+
+\section ConfiguringWorkers Configuring Workers
+
+<dl>
+
+<dt>--enable-maxcpus=<c>count</c></dt>
+<dd>
+\anchor enable-maxcpus
+\addindex __configure__--enable-maxcpus
+Use at most <c>count</c> CPU cores.  This information is then
+available as the macro ::STARPU_MAXCPUS.
+</dd>
+
+<dt>--disable-cpu</dt>
+<dd>
+\anchor disable-cpu
+\addindex __configure__--disable-cpu
+Disable the use of CPUs of the machine. Only GPUs etc. will be used.
+</dd>
+
+<dt>--enable-maxcudadev=<c>count</c></dt>
+<dd>
+\anchor enable-maxcudadev
+\addindex __configure__--enable-maxcudadev
+Use at most <c>count</c> CUDA devices.  This information is then
+available as the macro ::STARPU_MAXCUDADEVS.
+</dd>
+
+<dt>--disable-cuda</dt>
+<dd>
+\anchor disable-cuda
+\addindex __configure__--disable-cuda
+Disable the use of CUDA, even if a valid CUDA installation was detected.
+</dd>
+
+<dt>--with-cuda-dir=<c>prefix</c></dt>
+<dd>
+\anchor with-cuda-dir
+\addindex __configure__--with-cuda-dir
+Search for CUDA under <c>prefix</c>, which should notably contain the file
+<c>include/cuda.h</c>.
+</dd>
+
+<dt>--with-cuda-include-dir=<c>dir</c></dt>
+<dd>
+\anchor with-cuda-include-dir
+\addindex __configure__--with-cuda-include-dir
+Search for CUDA headers under <c>dir</c>, which should
+notably contain the file <c>cuda.h</c>. This defaults to
+<c>/include</c> appended to the value given to
+\ref with-cuda-dir "--with-cuda-dir".
+</dd>
+
+<dt>--with-cuda-lib-dir=<c>dir</c></dt>
+<dd>
+\anchor with-cuda-lib-dir
+\addindex __configure__--with-cuda-lib-dir
+Search for CUDA libraries under <c>dir</c>, which should notably contain
+the CUDA shared libraries---e.g., <c>libcuda.so</c>.  This defaults to
+<c>/lib</c> appended to the value given to
+\ref with-cuda-dir "--with-cuda-dir".
+</dd>
+
+<dt>--disable-cuda-memcpy-peer</dt>
+<dd>
+\anchor disable-cuda-memcpy-peer
+\addindex __configure__--disable-cuda-memcpy-peer
+Explicitly disable peer transfers when using CUDA 4.0.
+</dd>
+
+<dt>--enable-maxopencldev=<c>count</c></dt>
+<dd>
+\anchor enable-maxopencldev
+\addindex __configure__--enable-maxopencldev
+Use at most <c>count</c> OpenCL devices.  This information is then
+available as the macro ::STARPU_MAXOPENCLDEVS.
+</dd>
+
+<dt>--disable-opencl</dt>
+<dd>
+\anchor disable-opencl
+\addindex __configure__--disable-opencl
+Disable the use of OpenCL, even if the SDK is detected.
+</dd>
+
+<dt>--with-opencl-dir=<c>prefix</c></dt>
+<dd>
+\anchor with-opencl-dir
+\addindex __configure__--with-opencl-dir
+Search for an OpenCL implementation under <c>prefix</c>, which should
+notably contain <c>include/CL/cl.h</c> (or <c>include/OpenCL/cl.h</c>
+on Mac OS).
+</dd>
+
+<dt>--with-opencl-include-dir=<c>dir</c></dt>
+<dd>
+\anchor with-opencl-include-dir
+\addindex __configure__--with-opencl-include-dir
+Search for OpenCL headers under <c>dir</c>, which should notably contain
+<c>CL/cl.h</c> (or <c>OpenCL/cl.h</c> on Mac OS).  This defaults to
+<c>/include</c> appended to the value given to
+\ref with-opencl-dir "--with-opencl-dir".
+</dd>
+
+<dt>--with-opencl-lib-dir=<c>dir</c></dt>
+<dd>
+\anchor with-opencl-lib-dir
+\addindex __configure__--with-opencl-lib-dir
+Search for an OpenCL library under <c>dir</c>, which should notably
+contain the OpenCL shared libraries---e.g. <c>libOpenCL.so</c>. This defaults to
+<c>/lib</c> appended to the value given to
+\ref with-opencl-dir "--with-opencl-dir".
+</dd>
+
+<dt>--enable-opencl-simulator</dt>
+<dd>
+\anchor enable-opencl-simulator
+\addindex __configure__--enable-opencl-simulator
+Enable considering the provided OpenCL implementation as a simulator, i.e. use
+the kernel duration returned by OpenCL profiling information as wallclock time
+instead of the actual measured real time. This requires simgrid support.
+</dd>
+
+<dt>--enable-maximplementations=<c>count</c></dt>
+<dd>
+\anchor enable-maximplementations
+\addindex __configure__--enable-maximplementations
+Allow for at most <c>count</c> codelet implementations for the same
+target device.  This information is then available as the
+macro ::STARPU_MAXIMPLEMENTATIONS.
+</dd>
+
+<dt>--enable-max-sched-ctxs=<c>count</c></dt>
+<dd>
+\anchor enable-max-sched-ctxs
+\addindex __configure__--enable-max-sched-ctxs
+Allow for at most <c>count</c> scheduling contexts.
+This information is then available as the macro
+::STARPU_NMAX_SCHED_CTXS.
+</dd>
+
+<dt>--disable-asynchronous-copy</dt>
+<dd>
+\anchor disable-asynchronous-copy
+\addindex __configure__--disable-asynchronous-copy
+Disable asynchronous copies between CPU and GPU devices.
+The AMD implementation of OpenCL is known to
+fail when copying data asynchronously. When using this implementation,
+it is therefore necessary to disable asynchronous data transfers.
+</dd>
+
+<dt>--disable-asynchronous-cuda-copy</dt>
+<dd>
+\anchor disable-asynchronous-cuda-copy
+\addindex __configure__--disable-asynchronous-cuda-copy
+Disable asynchronous copies between CPU and CUDA devices.
+</dd>
+
+<dt>--disable-asynchronous-opencl-copy</dt>
+<dd>
+\anchor disable-asynchronous-opencl-copy
+\addindex __configure__--disable-asynchronous-opencl-copy
+Disable asynchronous copies between CPU and OpenCL devices.
+The AMD implementation of OpenCL is known to
+fail when copying data asynchronously. When using this implementation,
+it is therefore necessary to disable asynchronous data transfers.
+</dd>
+
+<dt>--disable-asynchronous-mic-copy</dt>
+<dd>
+\anchor disable-asynchronous-mic-copy
+\addindex __configure__--disable-asynchronous-mic-copy
+Disable asynchronous copies between CPU and MIC devices.
+</dd>
+
+</dl>
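+
+Several of the worker options above are typically combined in a single
+invocation. For instance, a build limited to 12 CPU cores, using a CUDA
+installation in a non-standard location, and with OpenCL disabled could be
+configured as follows (the CUDA path is only an illustration and should be
+adapted to the local installation):
+
+\verbatim
+$ ./configure --enable-maxcpus=12 --with-cuda-dir=/opt/cuda --disable-opencl
+\endverbatim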
+
+\section ExtensionConfiguration Extension Configuration
+
+<dl>
+
+<dt>--disable-socl</dt>
+<dd>
+\anchor disable-socl
+\addindex __configure__--disable-socl
+Disable the SOCL extension (\ref SOCLOpenclExtensions).  By
+default, it is enabled when an OpenCL implementation is found.
+</dd>
+
+<dt>--disable-starpu-top</dt>
+<dd>
+\anchor disable-starpu-top
+\addindex __configure__--disable-starpu-top
+Disable the StarPU-Top interface (\ref StarPU-TopInterface).  By default, it
+is enabled when the required dependencies are found.
+</dd>
+
+<dt>--disable-gcc-extensions</dt>
+<dd>
+\anchor disable-gcc-extensions
+\addindex __configure__--disable-gcc-extensions
+Disable the GCC plug-in (\ref cExtensions).  By default, it is
+enabled when the GCC compiler provides plug-in support.
+</dd>
+
+<dt>--with-mpicc=<c>path</c></dt>
+<dd>
+\anchor with-mpicc
+\addindex __configure__--with-mpicc
+Use the compiler <c>mpicc</c> located at <c>path</c> for StarPU-MPI
+(\ref MPISupport).
+</dd>
+
+<dt>--enable-mpi-progression-hook</dt>
+<dd>
+\anchor enable-mpi-progression-hook
+\addindex __configure__--enable-mpi-progression-hook
+Enable the activity polling method for StarPU-MPI.
+</dd>
+
+</dl>
+
+\section AdvancedConfiguration Advanced Configuration
+
+<dl>
+
+<dt>--enable-perf-debug</dt>
+<dd>
+\anchor enable-perf-debug
+\addindex __configure__--enable-perf-debug
+Enable performance debugging through gprof.
+</dd>
+
+<dt>--enable-model-debug</dt>
+<dd>
+\anchor enable-model-debug
+\addindex __configure__--enable-model-debug
+Enable performance model debugging.
+</dd>
+
+<dt>--enable-stats</dt>
+<dd>
+\anchor enable-stats
+\addindex __configure__--enable-stats
+(see ../../src/datawizard/datastats.c)
+Enable gathering of various data statistics (\ref DataStatistics).
+</dd>
+
+<dt>--enable-maxbuffers</dt>
+<dd>
+\anchor enable-maxbuffers
+\addindex __configure__--enable-maxbuffers
+Define the maximum number of buffers that tasks will be able to take
+as parameters, then available as the macro ::STARPU_NMAXBUFS.
+</dd>
+
+<dt>--enable-allocation-cache</dt>
+<dd>
+\anchor enable-allocation-cache
+\addindex __configure__--enable-allocation-cache
+Enable the use of a data allocation cache to avoid the cost of memory
+allocations with CUDA. Still experimental.
+</dd>
+
+<dt>--enable-opengl-render</dt>
+<dd>
+\anchor enable-opengl-render
+\addindex __configure__--enable-opengl-render
+Enable the use of OpenGL for the rendering of some examples.
+\internal
+TODO: rather default to enabled when detected
+\endinternal
+</dd>
+
+<dt>--enable-blas-lib</dt>
+<dd>
+\anchor enable-blas-lib
+\addindex __configure__--enable-blas-lib
+Specify the BLAS library to be used by some of the examples. The
+library must be 'atlas' or 'goto'.
+</dd>
+
+<dt>--disable-starpufft</dt>
+<dd>
+\anchor disable-starpufft
+\addindex __configure__--disable-starpufft
+Disable the build of libstarpufft, even if <c>fftw</c> or <c>cuFFT</c> is available.
+</dd>
+
+<dt>--with-magma=<c>prefix</c></dt>
+<dd>
+\anchor with-magma
+\addindex __configure__--with-magma
+Search for MAGMA under <c>prefix</c>.  <c>prefix</c> should notably
+contain <c>include/magmablas.h</c>.
+</dd>
+
+<dt>--with-fxt=<c>prefix</c></dt>
+<dd>
+\anchor with-fxt
+\addindex __configure__--with-fxt
+Search for FxT under <c>prefix</c>.
+FxT (http://savannah.nongnu.org/projects/fkt) is used to generate
+traces of scheduling events, which can then be rendered using ViTE
+(\ref Off-linePerformanceFeedback).  <c>prefix</c> should
+notably contain <c>include/fxt/fxt.h</c>.
+</dd>
+
+<dt>--with-perf-model-dir=<c>dir</c></dt>
+<dd>
+\anchor with-perf-model-dir
+\addindex __configure__--with-perf-model-dir
+Store performance models under <c>dir</c>, instead of the current user's
+home.
+</dd>
+
+<dt>--with-goto-dir=<c>prefix</c></dt>
+<dd>
+\anchor with-goto-dir
+\addindex __configure__--with-goto-dir
+Search for GotoBLAS under <c>prefix</c>, which should notably contain
+<c>libgoto.so</c> or <c>libgoto2.so</c>.
+</dd>
+
+<dt>--with-atlas-dir=<c>prefix</c></dt>
+<dd>
+\anchor with-atlas-dir
+\addindex __configure__--with-atlas-dir
+Search for ATLAS under <c>prefix</c>, which should notably contain
+<c>include/cblas.h</c>.
+</dd>
+
+<dt>--with-mkl-cflags=<c>cflags</c></dt>
+<dd>
+\anchor with-mkl-cflags
+\addindex __configure__--with-mkl-cflags
+Use <c>cflags</c> to compile code that uses the MKL library.
+</dd>
+
+<dt>--with-mkl-ldflags=<c>ldflags</c></dt>
+<dd>
+\anchor with-mkl-ldflags
+\addindex __configure__--with-mkl-ldflags
+Use <c>ldflags</c> when linking code that uses the MKL library.  Note
+that the MKL website
+(http://software.intel.com/en-us/articles/intel-mkl-link-line-advisor/)
+provides a script to determine the linking flags.
+</dd>
+
+<dt>--disable-build-examples</dt>
+<dd>
+\anchor disable-build-examples
+\addindex __configure__--disable-build-examples
+Disable the build of examples.
+</dd>
+
+
+<dt>--enable-sc-hypervisor</dt>
+<dd>
+\anchor enable-sc-hypervisor
+\addindex __configure__--enable-sc-hypervisor
+Enable the Scheduling Context Hypervisor plugin (\ref SchedulingContextHypervisor).
+By default, it is disabled.
+</dd>
+
+<dt>--enable-memory-stats</dt>
+<dd>
+\anchor enable-memory-stats
+\addindex __configure__--enable-memory-stats
+Enable memory statistics (\ref MemoryFeedback).
+</dd>
+
+<dt>--enable-simgrid</dt>
+<dd>
+\anchor enable-simgrid
+\addindex __configure__--enable-simgrid
+Enable simulation of execution in simgrid, to allow easy experimentation with
+various numbers of cores and GPUs, or amount of memory, etc. Experimental.
+
+The path to simgrid can be specified through the <c>SIMGRID_CFLAGS</c> and
+<c>SIMGRID_LIBS</c> environment variables, for instance:
+
+\verbatim
+export SIMGRID_CFLAGS="-I/usr/local/simgrid/include"
+export SIMGRID_LIBS="-L/usr/local/simgrid/lib -lsimgrid"
+\endverbatim
+
+</dd>
+
+</dl>
+
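+As an example of combining several of the options above, a configure
+invocation could look as follows (the installation paths are purely
+illustrative and should be adapted to the local setup):
+
+\verbatim
+./configure --with-fxt=/usr/local/fxt --with-perf-model-dir=/tmp/perfmodels
+\endverbatim
+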
+*/

+ 531 - 0
doc/doxygen/chapters/environment_variables.doxy

@@ -0,0 +1,531 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page ExecutionConfigurationThroughEnvironmentVariables Execution Configuration Through Environment Variables
+
+The behavior of the StarPU library and tools may be tuned thanks to
+the following environment variables.
+
+\section ConfiguringWorkers Configuring Workers
+
+<dl>
+
+<dt>STARPU_NCPU</dt>
+<dd>
+\anchor STARPU_NCPU
+\addindex __env__STARPU_NCPU
+Specify the number of CPU workers (thus not including workers
+dedicated to control accelerators). Note that by default, StarPU will
+not allocate more CPU workers than there are physical CPUs, and that
+some CPUs are used to control the accelerators.
+</dd>
+
+<dt>STARPU_NCPUS</dt>
+<dd>
+\anchor STARPU_NCPUS
+\addindex __env__STARPU_NCPUS
+This variable is deprecated. You should use \ref STARPU_NCPU.
+</dd>
+
+<dt>STARPU_NCUDA</dt>
+<dd>
+\anchor STARPU_NCUDA
+\addindex __env__STARPU_NCUDA
+Specify the number of CUDA devices that StarPU can use. If
+\ref STARPU_NCUDA is lower than the number of physical devices, it is
+possible to select which CUDA devices should be used by the means of the
+environment variable \ref STARPU_WORKERS_CUDAID. By default, StarPU will
+create as many CUDA workers as there are CUDA devices.
+</dd>
+
+<dt>STARPU_NOPENCL</dt>
+<dd>
+\anchor STARPU_NOPENCL
+\addindex __env__STARPU_NOPENCL
+OpenCL equivalent of the environment variable \ref STARPU_NCUDA.
+</dd>
+
+<dt>STARPU_OPENCL_ON_CPUS</dt>
+<dd>
+\anchor STARPU_OPENCL_ON_CPUS
+\addindex __env__STARPU_OPENCL_ON_CPUS
+By default, the OpenCL driver only enables GPU and accelerator
+devices. By setting the environment variable \ref
+STARPU_OPENCL_ON_CPUS to 1, the OpenCL driver will also enable CPU
+devices.
+</dd>
+
+<dt>STARPU_OPENCL_ONLY_ON_CPUS</dt>
+<dd>
+\anchor STARPU_OPENCL_ONLY_ON_CPUS
+\addindex __env__STARPU_OPENCL_ONLY_ON_CPUS
+By default, the OpenCL driver enables GPU and accelerator
+devices. By setting the environment variable \ref
+STARPU_OPENCL_ONLY_ON_CPUS to 1, the OpenCL driver will ONLY enable
+CPU devices.
+</dd>
+
+<dt>STARPU_WORKERS_NOBIND</dt>
+<dd>
+\anchor STARPU_WORKERS_NOBIND
+\addindex __env__STARPU_WORKERS_NOBIND
+Setting it to non-zero will prevent StarPU from binding its threads to
+CPUs. This is for instance useful when running the testsuite in parallel.
+</dd>
+
+<dt>STARPU_WORKERS_CPUID</dt>
+<dd>
+\anchor STARPU_WORKERS_CPUID
+\addindex __env__STARPU_WORKERS_CPUID
+Passing an array of integers (starting from 0) in \ref STARPU_WORKERS_CPUID
+specifies on which logical CPU the different workers should be
+bound. For instance, if <c>STARPU_WORKERS_CPUID = "0 1 4 5"</c>, the first
+worker will be bound to logical CPU #0, the second worker will be bound to
+logical CPU #1, and so on.  Note that the logical ordering of the CPUs is either
+determined by the OS, or provided by the library <c>hwloc</c> in case it is
+available.
+
+Note that the first workers correspond to the CUDA workers, then come the
+OpenCL workers, and finally the CPU workers. For example if
+we have <c>STARPU_NCUDA=1</c>, <c>STARPU_NOPENCL=1</c>, <c>STARPU_NCPU=2</c>
+and <c>STARPU_WORKERS_CPUID = "0 2 1 3"</c>, the CUDA device will be controlled
+by logical CPU #0, the OpenCL device will be controlled by logical CPU #2, and
+the logical CPUs #1 and #3 will be used by the CPU workers.
+
+If the number of workers is larger than the array given in \ref
+STARPU_WORKERS_CPUID, the workers are bound to the logical CPUs in a
+round-robin fashion: if <c>STARPU_WORKERS_CPUID = "0 1"</c>, the first
+and the third (resp. second and fourth) workers will be put on CPU #0
+(resp. CPU #1).
+
+This variable is ignored if the field
+starpu_conf::use_explicit_workers_bindid passed to starpu_init() is
+set.
+
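+For instance, the combined CUDA/OpenCL/CPU example above can be set up as:
+
+\verbatim
+export STARPU_NCUDA=1
+export STARPU_NOPENCL=1
+export STARPU_NCPU=2
+export STARPU_WORKERS_CPUID="0 2 1 3"
+\endverbatim
+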
+</dd>
+
+<dt>STARPU_WORKERS_CUDAID</dt>
+<dd>
+\anchor STARPU_WORKERS_CUDAID
+\addindex __env__STARPU_WORKERS_CUDAID
+Similarly to the \ref STARPU_WORKERS_CPUID environment variable, it is
+possible to select which CUDA devices should be used by StarPU. On a machine
+equipped with 4 GPUs, setting <c>STARPU_WORKERS_CUDAID = "1 3"</c> and
+<c>STARPU_NCUDA=2</c> specifies that 2 CUDA workers should be created, and that
+they should use CUDA devices #1 and #3 (the logical ordering of the devices is
+the one reported by CUDA).
+
+This variable is ignored if the field
+starpu_conf::use_explicit_workers_cuda_gpuid passed to starpu_init()
+is set.
+</dd>
+
+<dt>STARPU_WORKERS_OPENCLID</dt>
+<dd>
+\anchor STARPU_WORKERS_OPENCLID
+\addindex __env__STARPU_WORKERS_OPENCLID
+OpenCL equivalent of the \ref STARPU_WORKERS_CUDAID environment variable.
+
+This variable is ignored if the field
+starpu_conf::use_explicit_workers_opencl_gpuid passed to starpu_init()
+is set.
+</dd>
+
+<dt>STARPU_SINGLE_COMBINED_WORKER</dt>
+<dd>
+\anchor STARPU_SINGLE_COMBINED_WORKER
+\addindex __env__STARPU_SINGLE_COMBINED_WORKER
+If set, StarPU will create several workers which won't be able to work
+concurrently. It will by default create combined workers whose size goes from 1
+to the total number of CPU workers in the system. \ref STARPU_MIN_WORKERSIZE
+and \ref STARPU_MAX_WORKERSIZE can be used to change this default.
+</dd>
+
+<dt>STARPU_MIN_WORKERSIZE</dt>
+<dd>
+\anchor STARPU_MIN_WORKERSIZE
+\addindex __env__STARPU_MIN_WORKERSIZE
+When \ref STARPU_SINGLE_COMBINED_WORKER is set, \ref STARPU_MIN_WORKERSIZE
+allows specifying the minimum size of the combined workers (instead of the
+default of 1).
+</dd>
+
+<dt>STARPU_MAX_WORKERSIZE</dt>
+<dd>
+\anchor STARPU_MAX_WORKERSIZE
+\addindex __env__STARPU_MAX_WORKERSIZE
+When \ref STARPU_SINGLE_COMBINED_WORKER is set, \ref STARPU_MAX_WORKERSIZE
+allows specifying the maximum size of the combined workers (instead of the
+total number of CPU workers in the system).
+</dd>
+
+<dt>STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER</dt>
+<dd>
+\anchor STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER
+\addindex __env__STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER
+Let the user decide how many elements are allowed between combined workers
+created from hwloc information. For instance, in the case of sockets with 6
+cores without shared L2 caches, if \ref STARPU_SYNTHESIZE_ARITY_COMBINED_WORKER is
+set to 6, no combined worker will be synthesized beyond one for the socket
+and one per core. If it is set to 3, 3 intermediate combined workers will be
+synthesized, to divide the socket cores into 3 chunks of 2 cores. If it is set
+to 2, 2 intermediate combined workers will be synthesized, to divide the socket
+cores into 2 chunks of 3 cores, and then 3 additional combined workers will be
+synthesized, to divide the former synthesized workers into groups of 2 cores,
+and the remaining core (for which no combined worker is synthesized since there
+is already a normal worker for it).
+
+The default, 2, thus makes StarPU tend to build binary trees of combined
+workers.
+</dd>
+
+<dt>STARPU_DISABLE_ASYNCHRONOUS_COPY</dt>
+<dd>
+\anchor STARPU_DISABLE_ASYNCHRONOUS_COPY
+\addindex __env__STARPU_DISABLE_ASYNCHRONOUS_COPY
+Disable asynchronous copies between CPU and GPU devices.
+The AMD implementation of OpenCL is known to
+fail when copying data asynchronously. When using this implementation,
+it is therefore necessary to disable asynchronous data transfers.
+</dd>
+
+<dt>STARPU_DISABLE_ASYNCHRONOUS_CUDA_COPY</dt>
+<dd>
+\anchor STARPU_DISABLE_ASYNCHRONOUS_CUDA_COPY
+\addindex __env__STARPU_DISABLE_ASYNCHRONOUS_CUDA_COPY
+Disable asynchronous copies between CPU and CUDA devices.
+</dd>
+
+<dt>STARPU_DISABLE_ASYNCHRONOUS_OPENCL_COPY</dt>
+<dd>
+\anchor STARPU_DISABLE_ASYNCHRONOUS_OPENCL_COPY
+\addindex __env__STARPU_DISABLE_ASYNCHRONOUS_OPENCL_COPY
+Disable asynchronous copies between CPU and OpenCL devices.
+The AMD implementation of OpenCL is known to
+fail when copying data asynchronously. When using this implementation,
+it is therefore necessary to disable asynchronous data transfers.
+</dd>
+
+<dt>STARPU_DISABLE_ASYNCHRONOUS_MIC_COPY</dt>
+<dd>
+\anchor STARPU_DISABLE_ASYNCHRONOUS_MIC_COPY
+\addindex __env__STARPU_DISABLE_ASYNCHRONOUS_MIC_COPY
+Disable asynchronous copies between CPU and MIC devices.
+</dd>
+
+<dt>STARPU_DISABLE_CUDA_GPU_GPU_DIRECT</dt>
+<dd>
+\anchor STARPU_DISABLE_CUDA_GPU_GPU_DIRECT
+\addindex __env__STARPU_DISABLE_CUDA_GPU_GPU_DIRECT
+Disable direct CUDA transfers from GPU to GPU, and let CUDA copy through RAM
+instead. This permits testing the performance effect of GPU-Direct.
+</dd>
+
+</dl>
+
+\section ConfiguringTheSchedulingEngine Configuring The Scheduling Engine
+
+<dl>
+
+<dt>STARPU_SCHED</dt>
+<dd>
+\anchor STARPU_SCHED
+\addindex __env__STARPU_SCHED
+Choose between the different scheduling policies proposed by StarPU: work
+stealing, random, greedy, with performance models, etc.
+
+Use <c>STARPU_SCHED=help</c> to get the list of available schedulers.
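+
+For instance, to run an application with the <c>dmda</c> scheduler (the
+application name here is of course only a placeholder):
+
+\verbatim
+STARPU_SCHED=dmda ./my_application
+\endverbatim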
+</dd>
+
+<dt>STARPU_CALIBRATE</dt>
+<dd>
+\anchor STARPU_CALIBRATE
+\addindex __env__STARPU_CALIBRATE
+If this variable is set to 1, the performance models are calibrated during
+the execution. If it is set to 2, the previous values are dropped to restart
+calibration from scratch. Setting this variable to 0 disables calibration;
+this is the default behaviour.
+
+Note: this currently only applies to <c>dm</c> and <c>dmda</c> scheduling policies.
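+
+For instance, one possible way to drop previous values and recalibrate from
+scratch with a performance-model-based scheduler is (<c>./my_application</c>
+being a placeholder for any StarPU program):
+
+\verbatim
+STARPU_CALIBRATE=2 STARPU_SCHED=dmda ./my_application
+\endverbatim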
+</dd>
+
+<dt>STARPU_BUS_CALIBRATE</dt>
+<dd>
+\anchor STARPU_BUS_CALIBRATE
+\addindex __env__STARPU_BUS_CALIBRATE
+If this variable is set to 1, the bus is recalibrated during initialization.
+</dd>
+
+<dt>STARPU_PREFETCH</dt>
+<dd>
+\anchor STARPU_PREFETCH
+\addindex __env__STARPU_PREFETCH
+This variable indicates whether data prefetching should be enabled (0 means
+that it is disabled). If prefetching is enabled, when a task is scheduled to be
+executed e.g. on a GPU, StarPU will request an asynchronous transfer in
+advance, so that data is already present on the GPU when the task starts. As a
+result, computation and data transfers are overlapped.
+Note that prefetching is enabled by default in StarPU.
+</dd>
+
+<dt>STARPU_SCHED_ALPHA</dt>
+<dd>
+\anchor STARPU_SCHED_ALPHA
+\addindex __env__STARPU_SCHED_ALPHA
+To estimate the cost of a task StarPU takes into account the estimated
+computation time (obtained thanks to performance models). The alpha factor is
+the coefficient to be applied to it before adding it to the communication part.
+</dd>
+
+<dt>STARPU_SCHED_BETA</dt>
+<dd>
+\anchor STARPU_SCHED_BETA
+\addindex __env__STARPU_SCHED_BETA
+To estimate the cost of a task StarPU takes into account the estimated
+data transfer time (obtained thanks to performance models). The beta factor is
+the coefficient to be applied to it before adding it to the computation part.
+</dd>
+
+<dt>STARPU_SCHED_GAMMA</dt>
+<dd>
+\anchor STARPU_SCHED_GAMMA
+\addindex __env__STARPU_SCHED_GAMMA
+Define the execution time penalty of a joule (\ref Power-basedScheduling).
+</dd>
+
+<dt>STARPU_IDLE_POWER</dt>
+<dd>
+\anchor STARPU_IDLE_POWER
+\addindex __env__STARPU_IDLE_POWER
+Define the idle power of the machine (\ref Power-basedScheduling).
+</dd>
+
+<dt>STARPU_PROFILING</dt>
+<dd>
+\anchor STARPU_PROFILING
+\addindex __env__STARPU_PROFILING
+Enable on-line performance monitoring (\ref EnablingOn-linePerformanceMonitoring).
+</dd>
+
+</dl>
+
+\section Extensions Extensions
+
+<dl>
+
+<dt>SOCL_OCL_LIB_OPENCL</dt>
+<dd>
+\anchor SOCL_OCL_LIB_OPENCL
+\addindex __env__SOCL_OCL_LIB_OPENCL
+The SOCL test suite is only run when the environment variable \ref
+SOCL_OCL_LIB_OPENCL is defined. It should contain the location
+of the file <c>libOpenCL.so</c> of the OCL ICD implementation.
+</dd>
+
+<dt>STARPU_COMM_STATS</dt>
+<dd>
+\anchor STARPU_COMM_STATS
+\addindex __env__STARPU_COMM_STATS
+Communication statistics for starpumpi (\ref MPISupport)
+will be enabled when the environment variable \ref STARPU_COMM_STATS
+is defined to a value other than 0.
+</dd>
+
+<dt>STARPU_MPI_CACHE</dt>
+<dd>
+\anchor STARPU_MPI_CACHE
+\addindex __env__STARPU_MPI_CACHE
+Communication cache for starpumpi (\ref MPISupport) will be
+disabled when the environment variable \ref STARPU_MPI_CACHE is set
+to 0. It is enabled by default, and for any other value of the variable
+\ref STARPU_MPI_CACHE.
+</dd>
+
+<dt>STARPU_MIC_HOST</dt>
+<dd>
+\anchor STARPU_MIC_HOST
+\addindex __env__STARPU_MIC_HOST
+Defines the value of the parameter <c>--host</c> passed to
+<c>configure</c> for the cross-compilation. The current default is
+<c>x86_64-k1om-linux</c>.
+</dd>
+
+<dt>STARPU_MIC_CC_PATH</dt>
+<dd>
+\anchor STARPU_MIC_CC_PATH
+\addindex __env__STARPU_MIC_CC_PATH
+Defines the path to the MIC cross-compiler. The current default is
+<c>/usr/linux-k1om-4.7/bin/</c>.
+</dd>
+
+<dt>STARPU_COI_DIR</dt>
+<dd>
+\anchor STARPU_COI_DIR
+\addindex __env__STARPU_COI_DIR
+Defines the path to the COI library. The current default is
+<c>/opt/intel/mic/coi</c>.
+</dd>
+
+</dl>
+
+\section MiscellaneousAndDebug Miscellaneous And Debug
+
+<dl>
+
+<dt>STARPU_HOME</dt>
+<dd>
+\anchor STARPU_HOME
+\addindex __env__STARPU_HOME
+This specifies the main directory in which StarPU stores its
+configuration files. The default is <c>$HOME</c> on Unix environments,
+and <c>$USERPROFILE</c> on Windows environments.
+</dd>
+
+<dt>STARPU_HOSTNAME</dt>
+<dd>
+\anchor STARPU_HOSTNAME
+\addindex __env__STARPU_HOSTNAME
+When set, force the hostname to be used when dealing with performance model
+files. Models are indexed by machine name. When running for example on
+a homogeneous cluster, it is possible to share the models between
+machines by setting <c>export STARPU_HOSTNAME=some_global_name</c>.
+</dd>
+
+<dt>STARPU_OPENCL_PROGRAM_DIR</dt>
+<dd>
+\anchor STARPU_OPENCL_PROGRAM_DIR
+\addindex __env__STARPU_OPENCL_PROGRAM_DIR
+This specifies the directory where the OpenCL codelet source files are
+located. The function starpu_opencl_load_program_source() looks
+for the codelet in the current directory, in the directory specified
+by the environment variable \ref STARPU_OPENCL_PROGRAM_DIR, in the
+directory <c>share/starpu/opencl</c> of the installation directory of
+StarPU, and finally in the source directory of StarPU.
+</dd>
+
+<dt>STARPU_SILENT</dt>
+<dd>
+\anchor STARPU_SILENT
+\addindex __env__STARPU_SILENT
+This variable allows disabling verbose mode at runtime when StarPU
+has been configured with the option \ref enable-verbose "--enable-verbose". It also
+disables the display of StarPU information and warning messages.
+</dd>
+
+<dt>STARPU_LOGFILENAME</dt>
+<dd>
+\anchor STARPU_LOGFILENAME
+\addindex __env__STARPU_LOGFILENAME
+This variable specifies the file in which the debugging output should be saved.
+</dd>
+
+<dt>STARPU_FXT_PREFIX</dt>
+<dd>
+\anchor STARPU_FXT_PREFIX
+\addindex __env__STARPU_FXT_PREFIX
+This variable specifies in which directory to save the trace generated if FxT is enabled. It needs to have a trailing '/' character.
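+
+For instance, to store traces under a hypothetical <c>/tmp/traces/</c>
+directory (note the trailing '/'):
+
+\verbatim
+export STARPU_FXT_PREFIX=/tmp/traces/
+\endverbatim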
+</dd>
+
+<dt>STARPU_LIMIT_CUDA_devid_MEM</dt>
+<dd>
+\anchor STARPU_LIMIT_CUDA_devid_MEM
+\addindex __env__STARPU_LIMIT_CUDA_devid_MEM
+This variable specifies the maximum number of megabytes that should be
+available to the application on the CUDA device with the identifier
+<c>devid</c>. This variable is intended to be used for experimental
+purposes as it emulates devices that have a limited amount of memory.
+When defined, the variable overwrites the value of the variable
+\ref STARPU_LIMIT_CUDA_MEM.
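+
+For instance, to restrict the CUDA device with identifier 0 to 1024
+megabytes:
+
+\verbatim
+export STARPU_LIMIT_CUDA_0_MEM=1024
+\endverbatim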
+</dd>
+
+<dt>STARPU_LIMIT_CUDA_MEM</dt>
+<dd>
+\anchor STARPU_LIMIT_CUDA_MEM
+\addindex __env__STARPU_LIMIT_CUDA_MEM
+This variable specifies the maximum number of megabytes that should be
+available to the application on each CUDA device. This variable is
+intended to be used for experimental purposes as it emulates devices
+that have a limited amount of memory.
+</dd>
+
+<dt>STARPU_LIMIT_OPENCL_devid_MEM</dt>
+<dd>
+\anchor STARPU_LIMIT_OPENCL_devid_MEM
+\addindex __env__STARPU_LIMIT_OPENCL_devid_MEM
+This variable specifies the maximum number of megabytes that should be
+available to the application on the OpenCL device with the identifier
+<c>devid</c>. This variable is intended to be used for experimental
+purposes as it emulates devices that have a limited amount of memory.
+When defined, the variable overwrites the value of the variable
+\ref STARPU_LIMIT_OPENCL_MEM.
+</dd>
+
+<dt>STARPU_LIMIT_OPENCL_MEM</dt>
+<dd>
+\anchor STARPU_LIMIT_OPENCL_MEM
+\addindex __env__STARPU_LIMIT_OPENCL_MEM
+This variable specifies the maximum number of megabytes that should be
+available to the application on each OpenCL device. This variable is
+intended to be used for experimental purposes as it emulates devices
+that have a limited amount of memory.
+</dd>
+
+<dt>STARPU_LIMIT_CPU_MEM</dt>
+<dd>
+\anchor STARPU_LIMIT_CPU_MEM
+\addindex __env__STARPU_LIMIT_CPU_MEM
+This variable specifies the maximum number of megabytes that should be
+available to the application on each CPU device. This variable is
+intended to be used for experimental purposes as it emulates devices
+that have a limited amount of memory.
+</dd>
+
+<dt>STARPU_GENERATE_TRACE</dt>
+<dd>
+\anchor STARPU_GENERATE_TRACE
+\addindex __env__STARPU_GENERATE_TRACE
+When set to <c>1</c>, this variable indicates that StarPU should automatically
+generate a Paje trace when starpu_shutdown() is called.
+</dd>
+
+<dt>STARPU_MEMORY_STATS</dt>
+<dd>
+\anchor STARPU_MEMORY_STATS
+\addindex __env__STARPU_MEMORY_STATS
+When set to 0, disable the display of memory statistics on data which
+have not been unregistered at the end of the execution (\ref MemoryFeedback).
+</dd>
+
+<dt>STARPU_BUS_STATS</dt>
+<dd>
+\anchor STARPU_BUS_STATS
+\addindex __env__STARPU_BUS_STATS
+When defined, statistics about data transfers will be displayed when calling
+starpu_shutdown() (\ref Profiling).
+</dd>
+
+<dt>STARPU_WORKER_STATS</dt>
+<dd>
+\anchor STARPU_WORKER_STATS
+\addindex __env__STARPU_WORKER_STATS
+When defined, statistics about the workers will be displayed when calling
+starpu_shutdown() (\ref Profiling). When combined with the
+environment variable \ref STARPU_PROFILING, it displays the power
+consumption (\ref Power-basedScheduling).
+</dd>
+
+<dt>STARPU_STATS</dt>
+<dd>
+\anchor STARPU_STATS
+\addindex __env__STARPU_STATS
+When set to 0, data statistics will not be displayed at the
+end of the execution of an application (\ref DataStatistics).
+</dd>
+
+</dl>
+
+*/

+ 518 - 0
doc/doxygen/chapters/fdl-1.3.doxy

@@ -0,0 +1,518 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page GNUFreeDocumentationLicense The GNU Free Documentation License
+
+<center>
+Version 1.3, 3 November 2008
+</center>
+
+<BLOCKQUOTE>
+\copyright 2000, 2001, 2002, 2007, 2008 Free Software Foundation, Inc.
+http://fsf.org/
+
+Everyone is permitted to copy and distribute verbatim copies
+of this license document, but changing it is not allowed.
+</BLOCKQUOTE>
+
+<ol>
+<li>
+PREAMBLE
+
+The purpose of this License is to make a manual, textbook, or other
+functional and useful document <em>free</em> in the sense of freedom: to
+assure everyone the effective freedom to copy and redistribute it,
+with or without modifying it, either commercially or noncommercially.
+Secondarily, this License preserves for the author and publisher a way
+to get credit for their work, while not being considered responsible
+for modifications made by others.
+
+This License is a kind of ``copyleft'', which means that derivative
+works of the document must themselves be free in the same sense.  It
+complements the GNU General Public License, which is a copyleft
+license designed for free software.
+
+We have designed this License in order to use it for manuals for free
+software, because free software needs free documentation: a free
+program should come with manuals providing the same freedoms that the
+software does.  But this License is not limited to software manuals;
+it can be used for any textual work, regardless of subject matter or
+whether it is published as a printed book.  We recommend this License
+principally for works whose purpose is instruction or reference.
+
+</li>
+<li>
+APPLICABILITY AND DEFINITIONS
+
+This License applies to any manual or other work, in any medium, that
+contains a notice placed by the copyright holder saying it can be
+distributed under the terms of this License.  Such a notice grants a
+world-wide, royalty-free license, unlimited in duration, to use that
+work under the conditions stated herein.  The ``Document'', below,
+refers to any such manual or work.  Any member of the public is a
+licensee, and is addressed as ``you''.  You accept the license if you
+copy, modify or distribute the work in a way requiring permission
+under copyright law.
+
+A ``Modified Version'' of the Document means any work containing the
+Document or a portion of it, either copied verbatim, or with
+modifications and/or translated into another language.
+
+A ``Secondary Section'' is a named appendix or a front-matter section
+of the Document that deals exclusively with the relationship of the
+publishers or authors of the Document to the Document's overall
+subject (or to related matters) and contains nothing that could fall
+directly within that overall subject.  (Thus, if the Document is in
+part a textbook of mathematics, a Secondary Section may not explain
+any mathematics.)  The relationship could be a matter of historical
+connection with the subject or with related matters, or of legal,
+commercial, philosophical, ethical or political position regarding
+them.
+
+The ``Invariant Sections'' are certain Secondary Sections whose titles
+are designated, as being those of Invariant Sections, in the notice
+that says that the Document is released under this License.  If a
+section does not fit the above definition of Secondary then it is not
+allowed to be designated as Invariant.  The Document may contain zero
+Invariant Sections.  If the Document does not identify any Invariant
+Sections then there are none.
+
+The ``Cover Texts'' are certain short passages of text that are listed,
+as Front-Cover Texts or Back-Cover Texts, in the notice that says that
+the Document is released under this License.  A Front-Cover Text may
+be at most 5 words, and a Back-Cover Text may be at most 25 words.
+
+A ``Transparent'' copy of the Document means a machine-readable copy,
+represented in a format whose specification is available to the
+general public, that is suitable for revising the document
+straightforwardly with generic text editors or (for images composed of
+pixels) generic paint programs or (for drawings) some widely available
+drawing editor, and that is suitable for input to text formatters or
+for automatic translation to a variety of formats suitable for input
+to text formatters.  A copy made in an otherwise Transparent file
+format whose markup, or absence of markup, has been arranged to thwart
+or discourage subsequent modification by readers is not Transparent.
+An image format is not Transparent if used for any substantial amount
+of text.  A copy that is not ``Transparent'' is called ``Opaque''.
+
+Examples of suitable formats for Transparent copies include plain
+ASCII without markup, Texinfo input format, LaTeX input
+format, SGML or XML using a publicly available
+DTD, and standard-conforming simple HTML,
+PostScript or PDF designed for human modification.  Examples
+of transparent image formats include PNG, XCF and
+JPG.  Opaque formats include proprietary formats that can be
+read and edited only by proprietary word processors, SGML or
+XML for which the DTD and/or processing tools are
+not generally available, and the machine-generated HTML,
+PostScript or PDF produced by some word processors for
+output purposes only.
+
+The ``Title Page'' means, for a printed book, the title page itself,
+plus such following pages as are needed to hold, legibly, the material
+this License requires to appear in the title page.  For works in
+formats which do not have any title page as such, ``Title Page'' means
+the text near the most prominent appearance of the work's title,
+preceding the beginning of the body of the text.
+
+The ``publisher'' means any person or entity that distributes copies
+of the Document to the public.
+
+A section ``Entitled XYZ'' means a named subunit of the Document whose
+title either is precisely XYZ or contains XYZ in parentheses following
+text that translates XYZ in another language.  (Here XYZ stands for a
+specific section name mentioned below, such as ``Acknowledgements'',
+``Dedications'', ``Endorsements'', or ``History''.)  To ``Preserve the Title''
+of such a section when you modify the Document means that it remains a
+section ``Entitled XYZ'' according to this definition.
+
+The Document may include Warranty Disclaimers next to the notice which
+states that this License applies to the Document.  These Warranty
+Disclaimers are considered to be included by reference in this
+License, but only as regards disclaiming warranties: any other
+implication that these Warranty Disclaimers may have is void and has
+no effect on the meaning of this License.
+
+</li>
+<LI>
+VERBATIM COPYING
+
+You may copy and distribute the Document in any medium, either
+commercially or noncommercially, provided that this License, the
+copyright notices, and the license notice saying this License applies
+to the Document are reproduced in all copies, and that you add no other
+conditions whatsoever to those of this License.  You may not use
+technical measures to obstruct or control the reading or further
+copying of the copies you make or distribute.  However, you may accept
+compensation in exchange for copies.  If you distribute a large enough
+number of copies you must also follow the conditions in section 3.
+
+You may also lend copies, under the same conditions stated above, and
+you may publicly display copies.
+
+</li>
+<li>
+COPYING IN QUANTITY
+
+If you publish printed copies (or copies in media that commonly have
+printed covers) of the Document, numbering more than 100, and the
+Document's license notice requires Cover Texts, you must enclose the
+copies in covers that carry, clearly and legibly, all these Cover
+Texts: Front-Cover Texts on the front cover, and Back-Cover Texts on
+the back cover.  Both covers must also clearly and legibly identify
+you as the publisher of these copies.  The front cover must present
+the full title with all words of the title equally prominent and
+visible.  You may add other material on the covers in addition.
+Copying with changes limited to the covers, as long as they preserve
+the title of the Document and satisfy these conditions, can be treated
+as verbatim copying in other respects.
+
+If the required texts for either cover are too voluminous to fit
+legibly, you should put the first ones listed (as many as fit
+reasonably) on the actual cover, and continue the rest onto adjacent
+pages.
+
+If you publish or distribute Opaque copies of the Document numbering
+more than 100, you must either include a machine-readable Transparent
+copy along with each Opaque copy, or state in or with each Opaque copy
+a computer-network location from which the general network-using
+public has access to download using public-standard network protocols
+a complete Transparent copy of the Document, free of added material.
+If you use the latter option, you must take reasonably prudent steps,
+when you begin distribution of Opaque copies in quantity, to ensure
+that this Transparent copy will remain thus accessible at the stated
+location until at least one year after the last time you distribute an
+Opaque copy (directly or through your agents or retailers) of that
+edition to the public.
+
+It is requested, but not required, that you contact the authors of the
+Document well before redistributing any large number of copies, to give
+them a chance to provide you with an updated version of the Document.
+</li>
+
+<li>
+MODIFICATIONS
+
+You may copy and distribute a Modified Version of the Document under
+the conditions of sections 2 and 3 above, provided that you release
+the Modified Version under precisely this License, with the Modified
+Version filling the role of the Document, thus licensing distribution
+and modification of the Modified Version to whoever possesses a copy
+of it.  In addition, you must do these things in the Modified Version:
+
+<ol>
+<li>
+Use in the Title Page (and on the covers, if any) a title distinct
+from that of the Document, and from those of previous versions
+(which should, if there were any, be listed in the History section
+of the Document).  You may use the same title as a previous version
+if the original publisher of that version gives permission.
+</li>
+<li>
+List on the Title Page, as authors, one or more persons or entities
+responsible for authorship of the modifications in the Modified
+Version, together with at least five of the principal authors of the
+Document (all of its principal authors, if it has fewer than five),
+unless they release you from this requirement.
+</li>
+<li>
+State on the Title page the name of the publisher of the
+Modified Version, as the publisher.
+</li>
+<li>
+Preserve all the copyright notices of the Document.
+</li>
+<li>
+Add an appropriate copyright notice for your modifications
+adjacent to the other copyright notices.
+</li>
+<li>
+Include, immediately after the copyright notices, a license notice
+giving the public permission to use the Modified Version under the
+terms of this License, in the form shown in the Addendum below.
+</li>
+<li>
+Preserve in that license notice the full lists of Invariant Sections
+and required Cover Texts given in the Document's license notice.
+</li>
+<li>
+Include an unaltered copy of this License.
+</li>
+<li>
+Preserve the section Entitled ``History'', Preserve its Title, and add
+to it an item stating at least the title, year, new authors, and
+publisher of the Modified Version as given on the Title Page.  If
+there is no section Entitled ``History'' in the Document, create one
+stating the title, year, authors, and publisher of the Document as
+given on its Title Page, then add an item describing the Modified
+Version as stated in the previous sentence.
+</li>
+<li>
+Preserve the network location, if any, given in the Document for
+public access to a Transparent copy of the Document, and likewise
+the network locations given in the Document for previous versions
+it was based on.  These may be placed in the ``History'' section.
+You may omit a network location for a work that was published at
+least four years before the Document itself, or if the original
+publisher of the version it refers to gives permission.
+</li>
+<li>
+For any section Entitled ``Acknowledgements'' or ``Dedications'', Preserve
+the Title of the section, and preserve in the section all the
+substance and tone of each of the contributor acknowledgements and/or
+dedications given therein.
+</li>
+<li>
+Preserve all the Invariant Sections of the Document,
+unaltered in their text and in their titles.  Section numbers
+or the equivalent are not considered part of the section titles.
+</li>
+<li>
+Delete any section Entitled ``Endorsements''.  Such a section
+may not be included in the Modified Version.
+</li>
+<li>
+Do not retitle any existing section to be Entitled ``Endorsements'' or
+to conflict in title with any Invariant Section.
+</li>
+<li>
+Preserve any Warranty Disclaimers.
+</li>
+</ol>
+
+If the Modified Version includes new front-matter sections or
+appendices that qualify as Secondary Sections and contain no material
+copied from the Document, you may at your option designate some or all
+of these sections as invariant.  To do this, add their titles to the
+list of Invariant Sections in the Modified Version's license notice.
+These titles must be distinct from any other section titles.
+
+You may add a section Entitled ``Endorsements'', provided it contains
+nothing but endorsements of your Modified Version by various
+parties---for example, statements of peer review or that the text has
+been approved by an organization as the authoritative definition of a
+standard.
+
+You may add a passage of up to five words as a Front-Cover Text, and a
+passage of up to 25 words as a Back-Cover Text, to the end of the list
+of Cover Texts in the Modified Version.  Only one passage of
+Front-Cover Text and one of Back-Cover Text may be added by (or
+through arrangements made by) any one entity.  If the Document already
+includes a cover text for the same cover, previously added by you or
+by arrangement made by the same entity you are acting on behalf of,
+you may not add another; but you may replace the old one, on explicit
+permission from the previous publisher that added the old one.
+
+The author(s) and publisher(s) of the Document do not by this License
+give permission to use their names for publicity for or to assert or
+imply endorsement of any Modified Version.
+</li>
+
+<li>
+COMBINING DOCUMENTS
+
+You may combine the Document with other documents released under this
+License, under the terms defined in section 4 above for modified
+versions, provided that you include in the combination all of the
+Invariant Sections of all of the original documents, unmodified, and
+list them all as Invariant Sections of your combined work in its
+license notice, and that you preserve all their Warranty Disclaimers.
+
+The combined work need only contain one copy of this License, and
+multiple identical Invariant Sections may be replaced with a single
+copy.  If there are multiple Invariant Sections with the same name but
+different contents, make the title of each such section unique by
+adding at the end of it, in parentheses, the name of the original
+author or publisher of that section if known, or else a unique number.
+Make the same adjustment to the section titles in the list of
+Invariant Sections in the license notice of the combined work.
+
+In the combination, you must combine any sections Entitled ``History''
+in the various original documents, forming one section Entitled
+``History''; likewise combine any sections Entitled ``Acknowledgements'',
+and any sections Entitled ``Dedications''.  You must delete all
+sections Entitled ``Endorsements.''
+</li>
+
+<li>
+COLLECTIONS OF DOCUMENTS
+
+You may make a collection consisting of the Document and other documents
+released under this License, and replace the individual copies of this
+License in the various documents with a single copy that is included in
+the collection, provided that you follow the rules of this License for
+verbatim copying of each of the documents in all other respects.
+
+You may extract a single document from such a collection, and distribute
+it individually under this License, provided you insert a copy of this
+License into the extracted document, and follow this License in all
+other respects regarding verbatim copying of that document.
+</li>
+
+<li>
+AGGREGATION WITH INDEPENDENT WORKS
+
+A compilation of the Document or its derivatives with other separate
+and independent documents or works, in or on a volume of a storage or
+distribution medium, is called an ``aggregate'' if the copyright
+resulting from the compilation is not used to limit the legal rights
+of the compilation's users beyond what the individual works permit.
+When the Document is included in an aggregate, this License does not
+apply to the other works in the aggregate which are not themselves
+derivative works of the Document.
+
+If the Cover Text requirement of section 3 is applicable to these
+copies of the Document, then if the Document is less than one half of
+the entire aggregate, the Document's Cover Texts may be placed on
+covers that bracket the Document within the aggregate, or the
+electronic equivalent of covers if the Document is in electronic form.
+Otherwise they must appear on printed covers that bracket the whole
+aggregate.
+</li>
+
+<li>
+TRANSLATION
+
+Translation is considered a kind of modification, so you may
+distribute translations of the Document under the terms of section 4.
+Replacing Invariant Sections with translations requires special
+permission from their copyright holders, but you may include
+translations of some or all Invariant Sections in addition to the
+original versions of these Invariant Sections.  You may include a
+translation of this License, and all the license notices in the
+Document, and any Warranty Disclaimers, provided that you also include
+the original English version of this License and the original versions
+of those notices and disclaimers.  In case of a disagreement between
+the translation and the original version of this License or a notice
+or disclaimer, the original version will prevail.
+
+If a section in the Document is Entitled ``Acknowledgements'',
+``Dedications'', or ``History'', the requirement (section 4) to Preserve
+its Title (section 1) will typically require changing the actual
+title.
+</li>
+
+<li>
+TERMINATION
+
+You may not copy, modify, sublicense, or distribute the Document
+except as expressly provided under this License.  Any attempt
+otherwise to copy, modify, sublicense, or distribute it is void, and
+will automatically terminate your rights under this License.
+
+However, if you cease all violation of this License, then your license
+from a particular copyright holder is reinstated (a) provisionally,
+unless and until the copyright holder explicitly and finally
+terminates your license, and (b) permanently, if the copyright holder
+fails to notify you of the violation by some reasonable means prior to
+60 days after the cessation.
+
+Moreover, your license from a particular copyright holder is
+reinstated permanently if the copyright holder notifies you of the
+violation by some reasonable means, this is the first time you have
+received notice of violation of this License (for any work) from that
+copyright holder, and you cure the violation prior to 30 days after
+your receipt of the notice.
+
+Termination of your rights under this section does not terminate the
+licenses of parties who have received copies or rights from you under
+this License.  If your rights have been terminated and not permanently
+reinstated, receipt of a copy of some or all of the same material does
+not give you any rights to use it.
+</li>
+
+<li>
+FUTURE REVISIONS OF THIS LICENSE
+
+The Free Software Foundation may publish new, revised versions
+of the GNU Free Documentation License from time to time.  Such new
+versions will be similar in spirit to the present version, but may
+differ in detail to address new problems or concerns.  See
+http://www.gnu.org/copyleft/.
+
+Each version of the License is given a distinguishing version number.
+If the Document specifies that a particular numbered version of this
+License ``or any later version'' applies to it, you have the option of
+following the terms and conditions either of that specified version or
+of any later version that has been published (not as a draft) by the
+Free Software Foundation.  If the Document does not specify a version
+number of this License, you may choose any version ever published (not
+as a draft) by the Free Software Foundation.  If the Document
+specifies that a proxy can decide which future versions of this
+License can be used, that proxy's public statement of acceptance of a
+version permanently authorizes you to choose that version for the
+Document.
+</li>
+
+<li>
+RELICENSING
+
+``Massive Multiauthor Collaboration Site'' (or ``MMC Site'') means any
+World Wide Web server that publishes copyrightable works and also
+provides prominent facilities for anybody to edit those works.  A
+public wiki that anybody can edit is an example of such a server.  A
+``Massive Multiauthor Collaboration'' (or ``MMC'') contained in the
+site means any set of copyrightable works thus published on the MMC
+site.
+
+``CC-BY-SA'' means the Creative Commons Attribution-Share Alike 3.0
+license published by Creative Commons Corporation, a not-for-profit
+corporation with a principal place of business in San Francisco,
+California, as well as future copyleft versions of that license
+published by that same organization.
+
+``Incorporate'' means to publish or republish a Document, in whole or
+in part, as part of another Document.
+
+An MMC is ``eligible for relicensing'' if it is licensed under this
+License, and if all works that were first published under this License
+somewhere other than this MMC, and subsequently incorporated in whole
+or in part into the MMC, (1) had no cover texts or invariant sections,
+and (2) were thus incorporated prior to November 1, 2008.
+
+The operator of an MMC Site may republish an MMC contained in the site
+under CC-BY-SA on the same site at any time before August 1, 2009,
+provided the MMC is eligible for relicensing.
+</li>
+</ol>
+
+\section ADDENDUM ADDENDUM: How to use this License for your documents
+
+To use this License in a document you have written, include a copy of
+the License in the document and put the following copyright and
+license notices just after the title page:
+
+<blockquote>
+  Copyright (C)  <em>year</em>  <em>your name</em>.
+  Permission is granted to copy, distribute and/or modify this document
+  under the terms of the GNU Free Documentation License, Version 1.3
+  or any later version published by the Free Software Foundation;
+  with no Invariant Sections, no Front-Cover Texts, and no Back-Cover
+  Texts.  A copy of the license is included in the section entitled ``GNU
+  Free Documentation License''.
+</blockquote>
+
+If you have Invariant Sections, Front-Cover Texts and Back-Cover Texts,
+replace the ``with...Texts.'' line with this:
+
+<blockquote>
+    with the Invariant Sections being <em>list their titles</em>, with
+    the Front-Cover Texts being <em>list</em>, and with the Back-Cover Texts
+    being <em>list</em>.
+</blockquote>
+
+If you have Invariant Sections without Cover Texts, or some other
+combination of the three, merge those two alternatives to suit the
+situation.
+
+If your document contains nontrivial examples of program code, we
+recommend releasing these examples in parallel under your choice of
+free software license, such as the GNU General Public License,
+to permit their use in free software.
+
+*/

+ 71 - 0
doc/doxygen/chapters/fft_support.doxy

@@ -0,0 +1,71 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page FFTSupport FFT Support
+
+StarPU provides <c>libstarpufft</c>, a library whose design is very similar to
+both fftw and cufft, the difference being that it takes advantage of both CPUs
+and GPUs. It should however be noted that GPUs do not have the same precision as
+CPUs, so the results may differ by a negligible amount.
+
+Different precisions are available, namely float, double and long
+double precisions, with the following fftw naming conventions:
+
+<ul>
+<li>
+double precision structures and functions are named e.g. starpufft_execute()
+</li>
+<li>
+float precision structures and functions are named e.g. starpufftf_execute()
+</li>
+<li>
+long double precision structures and functions are named e.g. starpufftl_execute()
+</li>
+</ul>
+
+The documentation below is given with names for double precision, replace
+<c>starpufft_</c> with <c>starpufftf_</c> or <c>starpufftl_</c> as appropriate.
+
+Only complex numbers are supported at the moment.
+
+The application has to call starpu_init() before calling starpufft functions.
+
+Either main memory pointers or data handles can be provided.
+
+<ul>
+<li>
+To provide main memory pointers, use starpufft_start() or
+starpufft_execute(). Only one FFT can be performed at a time, because
+StarPU will have to register the data on the fly. In the starpufft_start()
+case, starpufft_cleanup() needs to be called to unregister the data.
+</li>
+<li>
+To provide data handles (which is preferable),
+use starpufft_start_handle() or
+starpufft_execute_handle(). Several FFT tasks can be submitted
+for a given plan, which makes it possible e.g. to start a series of FFTs with
+just one plan. starpufft_start_handle() is preferable since it does not wait
+for task completion, and thus allows enqueueing a series of tasks.
+</li>
+</ul>
+
+All functions are defined in \ref API_FFT_Support.
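+
+As an illustration, the main-memory-pointer variant described above can be
+sketched as follows. This is only a hedged sketch: the buffer size and the
+<c>compute_fft</c> wrapper are hypothetical, and error handling is elided.
+
+\code{.c}
+#include <starpu.h>
+#include <starpufft.h>
+
+#define N 1024
+
+int compute_fft(starpufft_complex *in, starpufft_complex *out)
+{
+    if (starpu_init(NULL) != 0) return -1;
+
+    /* 1D double-precision plan of size N, forward direction */
+    starpufft_plan plan = starpufft_plan_dft_1d(N, STARPUFFT_FORWARD, 0);
+
+    /* Main-memory pointers: the data is registered on the fly, so only
+     * one FFT can be performed at a time with this variant. */
+    starpufft_execute(plan, in, out);
+
+    starpufft_destroy_plan(plan);
+    starpu_shutdown();
+    return 0;
+}
+\endcode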
+
+\section Compilation Compilation
+
+The flags required to compile or link against the FFT library are accessible
+with the following commands:
+
+\verbatim
+$ pkg-config --cflags starpufft-1.2  # options for the compiler
+$ pkg-config --libs starpufft-1.2    # options for the linker
+\endverbatim
+
+Also pass the <c>--static</c> option if the application is to be linked statically.
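+
+For instance, a program using the FFT library could be compiled along these
+lines (the source file name <c>fft_example.c</c> is of course hypothetical):
+
+\verbatim
+$ gcc fft_example.c $(pkg-config --cflags starpufft-1.2) \
+      $(pkg-config --libs starpufft-1.2) -o fft_example
+\endverbatim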
+
+*/

+ 242 - 0
doc/doxygen/chapters/introduction.doxy

@@ -0,0 +1,242 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+*/
+
+/*! \mainpage Introduction
+
+\htmlonly
+<h1><a class="anchor" id="Foreword"></a>Foreword</h1>
+\endhtmlonly
+\htmlinclude version.html
+\htmlinclude foreword.html
+
+\section Motivation Motivation
+
+\internal
+complex machines with heterogeneous cores/devices
+\endinternal
+
+The use of specialized hardware such as accelerators or coprocessors offers an
+interesting approach to overcome the physical limits encountered by processor
+architects. As a result, many machines are now equipped with one or several
+accelerators (e.g. a GPU), in addition to the usual processor(s). While a lot of
+effort has been devoted to offloading computation onto such accelerators, very
+little attention has been paid to portability concerns on the one hand, and to the
+possibility of having heterogeneous accelerators and processors interact on the other hand.
+
+StarPU is a runtime system that offers support for heterogeneous multicore
+architectures. It not only offers a unified view of the computational resources
+(i.e. CPUs and accelerators at the same time), but also takes care of
+efficiently mapping and executing tasks onto a heterogeneous machine while
+transparently handling low-level issues such as data transfers in a portable
+fashion.
+
+\internal
+this leads to a complicated distributed memory design
+which is not (easily) manageable by hand
+
+added value/benefits of StarPU
+   - portability
+   - scheduling, perf. portability
+\endinternal
+
+\section StarPUInANutshell StarPU in a Nutshell
+
+StarPU is a software tool aiming to allow programmers to exploit the
+computing power of the available CPUs and GPUs, while relieving them
+from the need to specially adapt their programs to the target machine
+and processing units.
+
+At the core of StarPU is its run-time support library, which is
+responsible for scheduling application-provided tasks on heterogeneous
+CPU/GPU machines.  In addition, StarPU comes with programming language
+support, in the form of extensions to languages of the C family
+(\ref cExtensions), as well as an OpenCL front-end (\ref SOCLOpenclExtensions).
+
+StarPU's run-time and programming language extensions support a
+task-based programming model. Applications submit computational
+tasks, with CPU and/or GPU implementations, and StarPU schedules these
+tasks and associated data transfers on available CPUs and GPUs.  The
+data that a task manipulates are automatically transferred among
+accelerators and the main memory, so that programmers are freed from the
+scheduling issues and technical details associated with these transfers.
+
+StarPU takes particular care of scheduling tasks efficiently, using
+well-known algorithms from the literature (\ref TaskSchedulingPolicy).
+In addition, it allows scheduling experts, such as compiler or
+computational library developers, to implement custom scheduling
+policies in a portable fashion (\ref DefiningANewSchedulingPolicy).
+
+The remainder of this section describes the main concepts used in StarPU.
+
+\internal
+explain the notion of codelet and task (i.e. g(A, B)
+\endinternal
+
+\subsection CodeletAndTasks Codelet and Tasks
+
+One of StarPU's primary data structures is the \b codelet. A codelet describes a
+computational kernel that can possibly be implemented on multiple architectures
+such as a CPU, a CUDA device or an OpenCL device.
+
+\internal
+TODO insert illustration f: f_spu, f_cpu, ...
+\endinternal
+
+Another important data structure is the \b task. Executing a StarPU task
+consists in applying a codelet on a data set, on one of the architectures on
+which the codelet is implemented. A task thus describes the codelet that it
+uses, but also which data are accessed, and how they are
+accessed during the computation (read and/or write).
+StarPU tasks are asynchronous: submitting a task to StarPU is a non-blocking
+operation. The task structure can also specify a \b callback function that is
+called once StarPU has properly executed the task. It also contains optional
+fields that the application may use to give hints to the scheduler (such as
+priority levels).
+
+By default, task dependencies are inferred from data dependencies (sequential
+coherence) by StarPU. The application can however disable sequential coherency
+for some data, in which case dependencies can be expressed by hand.
+A task may be identified by a unique 64-bit number chosen by the application,
+which we refer to as a \b tag.
+Task dependencies can be enforced by hand either by means of callback functions, by
+submitting other tasks, or by expressing dependencies
+between tags (which can thus correspond to tasks that have not been submitted
+yet).
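+
+The notions above can be sketched as follows. This is only a schematic
+example: the kernel function and the tag value are hypothetical, and a
+complete program would also register the data handle beforehand.
+
+\code{.c}
+/* A codelet with a single CPU implementation. */
+void scal_cpu_func(void *buffers[], void *cl_arg); /* hypothetical kernel */
+
+struct starpu_codelet scal_cl =
+{
+    .cpu_funcs = { scal_cpu_func },
+    .nbuffers = 1,          /* the task accesses one data handle */
+    .modes = { STARPU_RW }, /* read-write access */
+};
+
+void submit_scal(starpu_data_handle_t handle)
+{
+    struct starpu_task *task = starpu_task_create();
+    task->cl = &scal_cl;
+    task->handles[0] = handle;
+    task->use_tag = 1;
+    task->tag_id = 42;          /* application-chosen tag */
+    starpu_task_submit(task);   /* non-blocking submission */
+}
+\endcode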
+
+\internal
+TODO insert illustration f(Ar, Brw, Cr) + ..
+\endinternal
+
+\internal
+DSM
+\endinternal
+
+\subsection StarPUDataManagementLibrary StarPU Data Management Library
+
+Because StarPU schedules tasks at runtime, data transfers have to be
+done automatically and ``just-in-time'' between processing units,
+relieving the application programmer from explicit data transfers.
+Moreover, to avoid unnecessary transfers, StarPU keeps data
+where it was last needed, even if it was modified there, and it
+allows multiple copies of the same data to reside at the same time on
+several processing units as long as it is not modified.
+
+\section ApplicationTaskification Application Taskification
+
+TODO
+
+\internal
+TODO: section describing what taskifying an application means: before
+porting to StarPU, turn the program into:
+"pure" functions, which only access data from their passed parameters
+a main function which just calls these pure functions
+
+and then it's trivial to use StarPU or any other kind of task-based library:
+simply replace calling the function with submitting a task.
+\endinternal
+
+\section Glossary Glossary
+
+A \b codelet records pointers to various implementations of the same
+theoretical function.
+
+A <b>memory node</b> can be either the main RAM or GPU-embedded memory.
+
+A \b bus is a link between memory nodes.
+
+A <b>data handle</b> keeps track of replicates of the same data (\b registered by the
+application) over various memory nodes. The data management library takes care
+of keeping them coherent.
+
+The \b home memory node of a data handle is the memory node from which the data
+was registered (usually the main memory node).
+
+A \b task represents a scheduled execution of a codelet on some data handles.
+
+A \b tag is a rendez-vous point. Tasks typically have their own tag, and can
+depend on other tags. The value is chosen by the application.
+
+A \b worker executes tasks. There is typically one worker per CPU computation core and
+one per accelerator (for which a whole CPU core is dedicated).
+
+A \b driver drives a given kind of worker. There are currently CPU, CUDA,
+and OpenCL drivers. They usually start several workers to actually drive
+them.
+
+A <b>performance model</b> is a (dynamic or static) model of the performance of a
+given codelet. Codelets can have an execution time performance model as well as
+a power consumption performance model.
+
+A data \b interface describes the layout of the data: for a vector, a pointer
+to the start, the number of elements and the size of each element; for a matrix,
+a pointer to the start, the number of elements per row, the offset between rows,
+and the size of each element; etc. To access their data, codelet functions are
+given interfaces for the local memory node replicates of the data handles of the
+scheduled task.
+
+\b Partitioning data means dividing the data of a given data handle (called
+\b father) into a series of \b children data handles which designate various
+portions of the former.
+
+A \b filter is the function which computes children data handles from a father
+data handle, and thus describes how the partitioning should be done (horizontal,
+vertical, etc.)
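+
+For instance, partitioning a registered vector handle into two children with
+the predefined block filter can be sketched as follows (the handle name is
+hypothetical):
+
+\code{.c}
+struct starpu_data_filter f =
+{
+    .filter_func = starpu_vector_filter_block, /* predefined filter */
+    .nchildren = 2,
+};
+starpu_data_partition(vector_handle, &f);
+
+/* Children can then be retrieved and used as regular data handles. */
+starpu_data_handle_t sub = starpu_data_get_sub_data(vector_handle, 1, 0);
+\endcode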
+
+\b Acquiring a data handle can be done from the main application, to safely
+access the data of a data handle from its home node, without having to
+unregister it.
+
+
+\section ResearchPapers Research Papers
+
+Research papers about StarPU can be found at
+http://runtime.bordeaux.inria.fr/Publis/Keyword/STARPU.html.
+
+A good overview is available in the research report at
+http://hal.archives-ouvertes.fr/inria-00467677.
+
+\section FurtherReading Further Reading
+
+The documentation chapters include
+
+<ol>
+<li> Part: Using StarPU
+<ul>
+<li> \ref BuildingAndInstallingStarPU
+<li> \ref BasicExamples
+<li> \ref AdvancedExamples
+<li> \ref HowToOptimizePerformanceWithStarPU
+<li> \ref PerformanceFeedback
+<li> \ref TipsAndTricksToKnowAbout
+<li> \ref MPISupport
+<li> \ref FFTSupport
+<li> \ref MICSCCSupport
+<li> \ref cExtensions
+<li> \ref SOCLOpenclExtensions
+<li> \ref SchedulingContexts
+<li> \ref SchedulingContextHypervisor
+</ul>
+</li>
+<li> Part: Inside StarPU
+<ul>
+<li> \ref ExecutionConfigurationThroughEnvironmentVariables
+<li> \ref CompilationConfiguration
+<li> \ref ModuleDocumentation
+<li> \ref deprecated
+</ul>
+<li> Part: Appendix
+<ul>
+<li> \ref FullSourceCodeVectorScal
+<li> \ref GNUFreeDocumentationLicense
+</ul>
+</ol>
+
+
+Make sure to have had a look at those too!
+
+*/

+ 56 - 0
doc/doxygen/chapters/mic_scc_support.doxy

@@ -0,0 +1,56 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page MICSCCSupport MIC/SCC Support
+
+\section Compilation Compilation
+
+SCC support just needs the presence of the RCCE library.
+
+MIC support actually needs two compilations of StarPU, one for the host and one for
+the device. The script <c>mic-configure</c> can be used to achieve this: it basically
+calls <c>configure</c> as appropriate from two new directories: <c>build_mic</c> and
+<c>build_host</c>. <c>make</c> and <c>make install</c> can then be used as usual and will
+recurse into both directories.
+
+\internal
+TODO: move to configuration section?
+\endinternal
+
+It can be parameterized with the environment variables \ref
+STARPU_MIC_HOST, \ref STARPU_MIC_CC_PATH and \ref STARPU_COI_DIR.
+
+
+\section PortingApplicationsToMICSCC Porting Applications To MIC/SCC
+
+The simplest way to port an application to MIC/SCC is to set the field
+starpu_codelet::cpu_funcs_name, to provide StarPU with the function
+name of the CPU implementation. StarPU will thus simply use the
+existing CPU implementation (cross-rebuilt in the MIC case). The
+functions have to be globally-visible (i.e. not <c>static</c>) for
+StarPU to be able to look them up.
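+
+For instance (a sketch, with a hypothetical kernel name):
+
+\code{.c}
+/* The function must not be static, so that StarPU can look it up. */
+void vector_scal_cpu(void *buffers[], void *cl_arg); /* hypothetical */
+
+struct starpu_codelet cl =
+{
+    .cpu_funcs = { vector_scal_cpu },
+    .cpu_funcs_name = { "vector_scal_cpu" },
+    .nbuffers = 1,
+    .modes = { STARPU_RW },
+};
+\endcode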
+
+For SCC execution, the function starpu_initialize() also has to be
+used instead of starpu_init(), so as to pass <c>argc</c> and
+<c>argv</c>.
+
+\section LaunchingPrograms Launching Programs
+
+SCC programs are started through RCCE.
+
+MIC programs are started from the host. StarPU automatically
+starts the same program on MIC devices. It however needs to get
+the MIC-cross-built binary. It will look for the file given by the
+environment variable \ref STARPU_MIC_SINK_PROGRAM_NAME or in the
+directory given by the environment variable \ref
+STARPU_MIC_SINK_PROGRAM_PATH, or in the field
+starpu_config::mic_sink_program_path. It will also look in the current
+directory for the same binary name plus the suffix <c>-mic</c> or
+<c>_mic</c>.
+
+*/

+ 377 - 0
doc/doxygen/chapters/mpi_support.doxy

@@ -0,0 +1,377 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page MPISupport MPI Support
+
+The integration of MPI transfers within task parallelism is done in a
+very natural way by means of asynchronous interactions between the
+application and StarPU.  This is implemented in a separate libstarpumpi library
+which basically provides "StarPU" equivalents of <c>MPI_*</c> functions, where
+<c>void *</c> buffers are replaced with ::starpu_data_handle_t, and all
+GPU-RAM-NIC transfers are handled efficiently by StarPU-MPI.  The user has to
+use the usual <c>mpirun</c> command of the MPI implementation to start StarPU on
+the different MPI nodes.
+
+An MPI Insert Task function provides an even more seamless transition to a
+distributed application, by automatically issuing all required data transfers
+according to the task graph and an application-provided distribution.
+
+\section SimpleExample Simple Example
+
+The flags required to compile or link against the MPI layer are
+accessible with the following commands:
+
+\verbatim
+$ pkg-config --cflags starpumpi-1.2  # options for the compiler
+$ pkg-config --libs starpumpi-1.2    # options for the linker
+\endverbatim
+
+You also need to pass the <c>--static</c> option if the application is
+to be linked statically.
+
+\code{.c}
+void increment_token(void)
+{
+    struct starpu_task *task = starpu_task_create();
+
+    task->cl = &increment_cl;
+    task->handles[0] = token_handle;
+
+    starpu_task_submit(task);
+}
+
+int main(int argc, char **argv)
+{
+    int rank, size;
+
+    starpu_init(NULL);
+    starpu_mpi_initialize_extended(&rank, &size);
+
+    starpu_vector_data_register(&token_handle, 0, (uintptr_t)&token, 1, sizeof(unsigned));
+
+    unsigned nloops = NITER;
+    unsigned loop;
+
+    unsigned last_loop = nloops - 1;
+    unsigned last_rank = size - 1;
+
+    for (loop = 0; loop < nloops; loop++) {
+        int tag = loop*size + rank;
+
+        if (loop == 0 && rank == 0)
+        {
+            token = 0;
+            fprintf(stdout, "Start with token value %d\n", token);
+        }
+        else
+        {
+            starpu_mpi_irecv_detached(token_handle, (rank+size-1)%size, tag,
+                    MPI_COMM_WORLD, NULL, NULL);
+        }
+
+        increment_token();
+
+        if (loop == last_loop && rank == last_rank)
+        {
+            starpu_data_acquire(token_handle, STARPU_R);
+            fprintf(stdout, "Finished: token value %d\n", token);
+            starpu_data_release(token_handle);
+        }
+        else
+        {
+            starpu_mpi_isend_detached(token_handle, (rank+1)%size, tag+1,
+                    MPI_COMM_WORLD, NULL, NULL);
+        }
+    }
+
+    starpu_task_wait_for_all();
+
+    starpu_mpi_shutdown();
+    starpu_shutdown();
+
+    if (rank == last_rank)
+    {
+        fprintf(stderr, "[%d] token = %d == %d * %d ?\n", rank, token, nloops, size);
+        STARPU_ASSERT(token == nloops*size);
+    }
+
+    return 0;
+}
+\endcode
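+
+The final assertion relies on each of the <c>size</c> ranks incrementing the
+token once per trip around the ring. A plain serial model of that logic (no
+MPI or StarPU involved; the function name is ours, for illustration) shows
+why the expected final value is <c>nloops*size</c>:
+
+\code{.c}
+/* Serial model of the ring: every rank increments the token once per
+ * trip, so after nloops trips around a ring of `size` ranks the token
+ * equals nloops * size. */
+static unsigned ring_final_token(unsigned nloops, unsigned size)
+{
+    unsigned token = 0;
+    unsigned loop, rank;
+    for (loop = 0; loop < nloops; loop++)
+        for (rank = 0; rank < size; rank++)
+            token++; /* what the increment codelet does on each rank */
+    return token;
+}
+\endcode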
+
+\section PointToPointCommunication Point To Point Communication
+
+The standard point to point communications of MPI have been
+implemented. The semantics are similar to MPI's, but adapted to the
+DSM provided by StarPU. An MPI request will only be submitted when the
+data is available in the main memory of the node submitting the
+request.
+
+There are two types of asynchronous communications: classic
+asynchronous communications and detached communications. Classic
+asynchronous communications (starpu_mpi_isend() and
+starpu_mpi_irecv()) need to be followed by a call to starpu_mpi_wait()
+or starpu_mpi_test() to wait for or test the completion of the
+communication. Waiting for or testing the completion of detached
+communications is not possible: this is done internally by StarPU-MPI,
+and on completion the resources are automatically released. This
+mechanism is similar to the pthread detach state attribute, which
+determines whether a thread will be created in a joinable or a
+detached state.
+
+For any communication, the call to the function results in the
+creation of a StarPU-MPI request; the function
+starpu_data_acquire_cb() is then called to asynchronously request
+StarPU to fetch the data into main memory. When the data is available
+there, a StarPU-MPI function puts the new request in the list of ready
+requests if it is a send request, or in a hashmap if it is a receive
+request.
+
+Internally, all MPI communications submitted by StarPU use a single
+tag, which has a default value and can be accessed with the functions
+starpu_mpi_get_communication_tag() and
+starpu_mpi_set_communication_tag().
+
+The matching of tags with corresponding requests is done within
+StarPU-MPI. To handle this, every communication is a double
+communication, based on an envelope + data system. Before any piece of
+data is sent, an envelope describing it (in particular its tag) is
+sent first, so that the receiver can get the matching pending receive
+request from the hashmap, and submit it to receive the data correctly.
+
+To this aim, the StarPU-MPI progression thread has a
+permanently-submitted request destined to receive incoming envelopes
+from all sources.
+
+The StarPU-MPI progression thread regularly polls this list of ready
+requests. For each new ready request, the appropriate function is
+called to post the corresponding MPI call. For example, calling
+starpu_mpi_isend() will result in posting <c>MPI_Isend</c>. If
+the request is marked as detached, the request will be put in the list
+of detached requests.
+
+The StarPU-MPI progression thread also polls the list of detached
+requests. For each detached request, it regularly tests the completion
+of the MPI request by calling <c>MPI_Test</c>. On completion, the data
+handle is released, and if a callback was defined, it is called.
+
+Finally, the StarPU-MPI progression thread checks whether an envelope
+has arrived. If one has, it checks whether the corresponding receive
+has already been submitted by the application. If so, it submits the
+request just as it does with those on the list of ready requests.
+Otherwise, it allocates a temporary handle to store the data that will
+arrive just after, so that when the corresponding receive request is
+eventually submitted by the application, the data is copied from this
+temporary handle instead of a new StarPU-MPI request being submitted.
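+
+A minimal sketch of this match-or-buffer logic (hypothetical names, and a
+plain array indexed by tag standing in for the hashmap) may help fix ideas:
+
+\code{.c}
+#define MAX_TAG 64
+
+/* One slot per tag: either the application posted the receive first,
+ * or the envelope (and data) arrived first and was parked in a
+ * temporary handle. */
+struct slot {
+    int posted; /* application already posted a receive for this tag */
+    int early;  /* data arrived before the matching receive */
+};
+
+static struct slot table[MAX_TAG];
+
+/* Called when an envelope for `tag` arrives from the network.
+ * Returns 1 if a matching pending receive was found. */
+static int on_envelope(int tag)
+{
+    if (table[tag].posted)
+        return 1;         /* matching receive found: submit it */
+    table[tag].early = 1; /* park the data in a temporary handle */
+    return 0;
+}
+
+/* Called when the application posts a receive for `tag`.
+ * Returns 1 if the data had already arrived. */
+static int on_irecv(int tag)
+{
+    if (table[tag].early)
+        return 1;          /* data already here: copy from the temporary */
+    table[tag].posted = 1; /* remember the pending receive */
+    return 0;
+}
+\endcode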
+
+\ref MPIPtpCommunication "Communication" gives the list of all the
+point to point communications defined in StarPU-MPI.
+
+\section ExchangingUserDefinedDataInterface Exchanging User Defined Data Interface
+
+New data interfaces defined as explained in \ref
+DefiningANewDataInterface can also be used within StarPU-MPI and
+exchanged between nodes. Two functions need to be defined through the
+type starpu_data_interface_ops. The function
+starpu_data_interface_ops::pack_data takes a handle and returns a
+contiguous memory buffer, along with its size, into which the data to
+be conveyed to another node should be copied. The reverse operation is
+implemented in the function starpu_data_interface_ops::unpack_data,
+which takes a contiguous memory buffer and recreates the data handle.
+
+\code{.c}
+static int complex_pack_data(starpu_data_handle_t handle, unsigned node, void **ptr, ssize_t *count)
+{
+  STARPU_ASSERT(starpu_data_test_if_allocated_on_node(handle, node));
+
+  struct starpu_complex_interface *complex_interface =
+    (struct starpu_complex_interface *) starpu_data_get_interface_on_node(handle, node);
+
+  *count = complex_get_size(handle);
+  *ptr = malloc(*count);
+  memcpy(*ptr, complex_interface->real, complex_interface->nx*sizeof(double));
+  memcpy((char *)*ptr + complex_interface->nx*sizeof(double), complex_interface->imaginary,
+         complex_interface->nx*sizeof(double));
+
+  return 0;
+}
+
+static int complex_unpack_data(starpu_data_handle_t handle, unsigned node, void *ptr, size_t count)
+{
+  STARPU_ASSERT(starpu_data_test_if_allocated_on_node(handle, node));
+
+  struct starpu_complex_interface *complex_interface =
+    (struct starpu_complex_interface *) starpu_data_get_interface_on_node(handle, node);
+
+  memcpy(complex_interface->real, ptr, complex_interface->nx*sizeof(double));
+  memcpy(complex_interface->imaginary, (char *)ptr + complex_interface->nx*sizeof(double),
+         complex_interface->nx*sizeof(double));
+
+  return 0;
+}
+
+static struct starpu_data_interface_ops interface_complex_ops =
+{
+  ...
+  .pack_data = complex_pack_data,
+  .unpack_data = complex_unpack_data
+};
+\endcode
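+
+The pack/unpack pair can be checked in isolation. The sketch below
+(hypothetical helper names, no StarPU calls) packs two <c>double</c> arrays
+into one contiguous buffer, real part first, exactly as
+<c>complex_pack_data()</c> does, and unpacks them back:
+
+\code{.c}
+#include <stdlib.h>
+#include <string.h>
+
+/* Pack two arrays of nx doubles into one contiguous buffer. */
+static void *complex_pack(const double *real, const double *imaginary,
+                          size_t nx, size_t *count)
+{
+    *count = 2 * nx * sizeof(double);
+    char *buf = malloc(*count);
+    memcpy(buf, real, nx * sizeof(double));
+    memcpy(buf + nx * sizeof(double), imaginary, nx * sizeof(double));
+    return buf;
+}
+
+/* Reverse operation: split the buffer back into the two arrays. */
+static void complex_unpack(const void *ptr, size_t nx,
+                           double *real, double *imaginary)
+{
+    const char *buf = ptr;
+    memcpy(real, buf, nx * sizeof(double));
+    memcpy(imaginary, buf + nx * sizeof(double), nx * sizeof(double));
+}
+\endcode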
+
+\section MPIInsertTaskUtility MPI Insert Task Utility
+
+To save the programmer from having to make all communications explicit, StarPU
+provides an "MPI Insert Task Utility". The principle is that the application
+decides a distribution of the data over the MPI nodes by allocating it and
+notifying StarPU of that decision, i.e. telling StarPU which MPI node "owns"
+which data. It also decides, for each handle, an MPI tag which will be used to
+exchange the content of the handle. All MPI nodes then process the whole task
+graph, and StarPU automatically determines which node actually executes which
+task, and triggers the required MPI transfers.
+
+The list of functions is described in \ref MPIInsertTask "MPI Insert Task".
+
+Here is a stencil example showing how to use starpu_mpi_insert_task(). One
+first needs to define a distribution function which specifies the
+locality of the data. Note that the distribution information needs to
+be given to StarPU by calling starpu_data_set_rank(). An MPI tag
+should also be defined for each data handle by calling
+starpu_data_set_tag().
+
+\code{.c}
+/* Returns the MPI node number where data is */
+int my_distrib(int x, int y, int nb_nodes) {
+  /* Block distrib */
+  return ((int)(x / sqrt(nb_nodes) + (y / sqrt(nb_nodes)) * sqrt(nb_nodes))) % nb_nodes;
+
+  // /* Other examples useful for other kinds of computations */
+  // /* / distrib */
+  // return (x+y) % nb_nodes;
+
+  // /* Block cyclic distrib */
+  // unsigned side = sqrt(nb_nodes);
+  // return x % side + (y % side) * side;
+}
+\endcode
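+
+Because of the floating-point arithmetic, the block distribution above does
+not produce perfectly square blocks; it does, however, map every index to a
+valid rank, and for instance balances a 4x4 grid perfectly over 4 nodes. A
+self-contained check (the checker function is ours, for illustration):
+
+\code{.c}
+#include <math.h>
+
+/* Same block distribution as in the example above. */
+static int my_distrib(int x, int y, int nb_nodes)
+{
+    return ((int)(x / sqrt(nb_nodes) + (y / sqrt(nb_nodes)) * sqrt(nb_nodes))) % nb_nodes;
+}
+
+/* Check that every index of an X x Y grid maps to a valid rank and
+ * that each of the nb_nodes ranks owns the same number of indices. */
+static int distrib_is_balanced(int X, int Y, int nb_nodes)
+{
+    int count[64] = {0}; /* enough for these small checks */
+    int x, y, n;
+    for (x = 0; x < X; x++)
+        for (y = 0; y < Y; y++) {
+            int rank = my_distrib(x, y, nb_nodes);
+            if (rank < 0 || rank >= nb_nodes)
+                return 0;
+            count[rank]++;
+        }
+    for (n = 0; n < nb_nodes; n++)
+        if (count[n] != X * Y / nb_nodes)
+            return 0;
+    return 1;
+}
+\endcode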
+
+Now the data can be registered within StarPU. Data which are not
+owned but will be needed for computations can be registered through
+the lazy allocation mechanism, i.e. with a <c>home_node</c> set to -1.
+StarPU will automatically allocate the memory when it is used for the
+first time.
+
+One can note an optimization here (the <c>else if</c> test): we only register
+data which will be needed by the tasks that we will execute.
+
+\code{.c}
+    unsigned matrix[X][Y];
+    starpu_data_handle_t data_handles[X][Y];
+
+    for(x = 0; x < X; x++) {
+        for (y = 0; y < Y; y++) {
+            int mpi_rank = my_distrib(x, y, size);
+            if (mpi_rank == my_rank)
+                /* Owning data */
+                starpu_variable_data_register(&data_handles[x][y], 0,
+                                              (uintptr_t)&(matrix[x][y]), sizeof(unsigned));
+            else if (my_rank == my_distrib(x+1, y, size) || my_rank == my_distrib(x-1, y, size)
+                  || my_rank == my_distrib(x, y+1, size) || my_rank == my_distrib(x, y-1, size))
+                /* I don't own that index, but will need it for my computations */
+                starpu_variable_data_register(&data_handles[x][y], -1,
+                                              (uintptr_t)NULL, sizeof(unsigned));
+            else
+                /* I know it's useless to allocate anything for this */
+                data_handles[x][y] = NULL;
+            if (data_handles[x][y]) {
+                starpu_data_set_rank(data_handles[x][y], mpi_rank);
+                starpu_data_set_tag(data_handles[x][y], x*X+y);
+            }
+        }
+    }
+\endcode
+
+Now starpu_mpi_insert_task() can be called for the different
+steps of the application.
+
+\code{.c}
+    for(loop=0 ; loop<niter; loop++)
+        for (x = 1; x < X-1; x++)
+            for (y = 1; y < Y-1; y++)
+                starpu_mpi_insert_task(MPI_COMM_WORLD, &stencil5_cl,
+                                       STARPU_RW, data_handles[x][y],
+                                       STARPU_R, data_handles[x-1][y],
+                                       STARPU_R, data_handles[x+1][y],
+                                       STARPU_R, data_handles[x][y-1],
+                                       STARPU_R, data_handles[x][y+1],
+                                       0);
+    starpu_task_wait_for_all();
+\endcode
+
+I.e. all MPI nodes process the whole task graph, but as mentioned above, for
+each task, only the MPI node which owns the data being written to (here,
+<c>data_handles[x][y]</c>) will actually run the task. The other MPI nodes will
+automatically send the required data.
+
+This can be a concern with a growing number of nodes. To avoid it, the
+application can prune the task insertion loops according to the data
+distribution, so as to only submit tasks on the nodes which have to care about
+them (either to execute them, or to send the required data).
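+
+A sketch of such pruning (hypothetical helper, shown here with the simple
+<c>(x+y) % nb_nodes</c> distribution from above): a node only has to submit a
+stencil task if it executes it, i.e. it owns the written block, or if it owns
+one of the read neighbour blocks and thus has data to send.
+
+\code{.c}
+/* The simple "/ distrib" from the earlier example. */
+static int my_distrib(int x, int y, int nb_nodes)
+{
+    return (x + y) % nb_nodes;
+}
+
+/* Return non-zero if `rank` has to submit the stencil task writing
+ * block (x, y): either it executes the task, or it owns one of the
+ * read blocks. The task insertion loops can then skip every (x, y)
+ * for which this returns 0. */
+static int task_concerns_rank(int x, int y, int rank, int nb_nodes)
+{
+    return rank == my_distrib(x, y, nb_nodes)
+        || rank == my_distrib(x - 1, y, nb_nodes)
+        || rank == my_distrib(x + 1, y, nb_nodes)
+        || rank == my_distrib(x, y - 1, nb_nodes)
+        || rank == my_distrib(x, y + 1, nb_nodes);
+}
+\endcode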
+
+\section MPICollective MPI Collective Operations
+
+The functions are described in \ref MPICollectiveOperations "MPI Collective Operations".
+
+\code{.c}
+if (rank == root)
+{
+    /* Allocate the vector */
+    vector = malloc(nblocks * sizeof(float *));
+    for(x=0 ; x<nblocks ; x++)
+    {
+        starpu_malloc((void **)&vector[x], block_size*sizeof(float));
+    }
+}
+
+/* Allocate data handles and register data to StarPU */
+data_handles = malloc(nblocks*sizeof(starpu_data_handle_t));
+for(x = 0; x < nblocks ;  x++)
+{
+    int mpi_rank = my_distrib(x, nodes);
+    if (rank == root) {
+        starpu_vector_data_register(&data_handles[x], 0, (uintptr_t)vector[x],
+                                    block_size, sizeof(float));
+    }
+    else if ((mpi_rank == rank) || ((rank == mpi_rank+1 || rank == mpi_rank-1))) {
+        /* I own that index, or I will need it for my computations */
+        starpu_vector_data_register(&data_handles[x], -1, (uintptr_t)NULL,
+                                    block_size, sizeof(float));
+    }
+    else {
+        /* I know it's useless to allocate anything for this */
+        data_handles[x] = NULL;
+    }
+    if (data_handles[x]) {
+        starpu_data_set_rank(data_handles[x], mpi_rank);
+        starpu_data_set_tag(data_handles[x], x);
+    }
+}
+
+/* Scatter the matrix among the nodes */
+starpu_mpi_scatter_detached(data_handles, nblocks, root, MPI_COMM_WORLD);
+
+/* Calculation */
+for(x = 0; x < nblocks ;  x++) {
+    if (data_handles[x]) {
+        int owner = starpu_data_get_rank(data_handles[x]);
+        if (owner == rank) {
+            starpu_insert_task(&cl, STARPU_RW, data_handles[x], 0);
+        }
+    }
+}
+
+/* Gather the matrix on main node */
+starpu_mpi_gather_detached(data_handles, nblocks, 0, MPI_COMM_WORLD);
+\endcode
+
+*/

+ 522 - 0
doc/doxygen/chapters/optimize_performance.doxy

@@ -0,0 +1,522 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page HowToOptimizePerformanceWithStarPU How To Optimize Performance With StarPU
+
+TODO: improve!
+
+Simply encapsulating application kernels into tasks already makes it possible
+to seamlessly support CPUs and GPUs at the same time. To achieve good
+performance, a few additional changes are needed.
+
+\section DataManagement Data Management
+
+When the application allocates data, whenever possible it should use
+the function starpu_malloc(), which will ask CUDA or OpenCL to make
+the allocation itself and pin the corresponding allocated memory. This
+is needed to permit asynchronous data transfers, i.e. to let data
+transfers overlap with computations. Otherwise, the trace will show
+that the <c>DriverCopyAsync</c> state takes a lot of time: this is
+because CUDA or OpenCL then reverts to synchronous transfers.
+
+By default, StarPU leaves replicates of data wherever they were used, in case they
+will be re-used by other tasks, thus saving the data transfer time. When some
+task modifies some data, all the other replicates are invalidated, and only the
+processing unit which ran that task will have a valid replicate of the data. If the application knows
+that this data will not be re-used by further tasks, it should advise StarPU to
+immediately replicate it to a desired list of memory nodes (given through a
+bitmask). This can be understood like the write-through mode of CPU caches.
+
+\code{.c}
+starpu_data_set_wt_mask(img_handle, 1<<0);
+\endcode
+
+will for instance request to always automatically transfer a replicate into the
+main memory (node 0), as bit 0 of the write-through bitmask is being set.
+
+\code{.c}
+starpu_data_set_wt_mask(img_handle, ~0U);
+\endcode
+
+will request to always automatically broadcast the updated data to all memory
+nodes.
+
+Setting the write-through mask to <c>~0U</c> can also be useful to make sure all
+memory nodes always have a copy of the data, so that it is never evicted when
+memory gets scarce.
+
+Implicit data dependency computation can become expensive if a lot
+of tasks access the same piece of data. If no dependency is required
+on some piece of data (e.g. because it is only accessed in read-only
+mode, or because write accesses are actually commutative), use the
+function starpu_data_set_sequential_consistency_flag() to disable
+implicit dependencies on that data.
+
+In the same vein, accumulation of results in the same data can become a
+bottleneck. The use of the mode ::STARPU_REDUX makes it possible to optimize
+such accumulation (see \ref DataReduction). To a lesser extent, the use of
+the flag ::STARPU_COMMUTE keeps the bottleneck, but at least allows
+the accumulation to happen in any order.
+
+Applications often need a piece of data just for temporary results. In such a
+case, registration can be made without an initial value; for instance, this
+produces a vector data:
+
+\code{.c}
+starpu_vector_data_register(&handle, -1, 0, n, sizeof(float));
+\endcode
+
+StarPU will then allocate the actual buffer only when it is actually needed,
+e.g. directly on the GPU without allocating in main memory.
+
+In the same vein, once the temporary results are not useful any more, the
+data should be thrown away. If the handle is not to be reused, it can be
+unregistered:
+
+\code{.c}
+starpu_unregister_submit(handle);
+\endcode
+
+actual unregistration will be done after all tasks working on the handle
+terminate.
+
+If the handle is to be reused, instead of unregistering it, it can simply be invalidated:
+
+\code{.c}
+starpu_invalidate_submit(handle);
+\endcode
+
+the buffers containing the current value will then be freed, and reallocated
+only when another task writes some value to the handle.
+
+\section TaskGranularity Task Granularity
+
+Like any other runtime, StarPU has some overhead for managing tasks. Since
+it does smart scheduling and data management, that overhead is not always
+negligible. Its order of magnitude is typically a couple of
+microseconds, which is actually quite a bit smaller than the CUDA overhead
+itself. The amount of work that a task does should thus be somewhat
+bigger, to make sure that the overhead becomes negligible. The offline
+performance feedback can provide a measure of task length, which should thus be
+checked if bad performance is observed. To get a sense of the scalability
+achievable for a given task size, one can run
+<c>tests/microbenchs/tasks_size_overhead.sh</c>, which draws curves of the
+speedup of independent tasks of very small sizes.
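+
+As a rough rule of thumb, with a per-task overhead of a couple of
+microseconds (an assumed figure; measure your own), keeping that overhead
+below a given fraction of the execution time dictates a minimum task
+duration:
+
+\code{.c}
+/* Minimum task duration (in us) such that a fixed per-task overhead
+ * stays below the given fraction of the total time:
+ * overhead / (overhead + duration) <= fraction. */
+static double min_task_duration_us(double overhead_us, double max_fraction)
+{
+    return overhead_us * (1.0 - max_fraction) / max_fraction;
+}
+\endcode
+
+For a 2 us overhead and a 1% budget, this gives tasks of at least 198 us,
+i.e. a couple of hundred microseconds.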
+
+The choice of scheduler also has an impact on the overhead: for instance, the
+<c>dmda</c> scheduler takes time to make a decision, while <c>eager</c> does
+not. <c>tasks_size_overhead.sh</c> can again be used to get a sense of how much
+impact that has on the target machine.
+
+\section TaskSubmission Task Submission
+
+To let StarPU make online optimizations, tasks should be submitted
+asynchronously as much as possible. Ideally, all tasks should be
+submitted first, with only calls to starpu_task_wait_for_all() or
+starpu_data_unregister() made afterwards to wait for
+termination. StarPU will then be able to rework the whole schedule, overlap
+computation with communication, manage accelerator local memory usage, etc.
+
+\section TaskPriorities Task Priorities
+
+By default, StarPU will consider the tasks in the order they are submitted by
+the application. If the application programmer knows that some tasks should
+be performed in priority (for instance because their output is needed by many
+other tasks and may thus be a bottleneck if not executed early
+enough), the field starpu_task::priority should be set to transmit the
+priority information to StarPU.
+
+\section TaskSchedulingPolicy Task Scheduling Policy
+
+By default, StarPU uses the <c>eager</c> simple greedy scheduler. This is
+because it provides correct load balance even if the application codelets do not
+have performance models. If your application codelets have performance models
+(\ref PerformanceModelExample), you should change the scheduler thanks
+to the environment variable \ref STARPU_SCHED. For instance <c>export
+STARPU_SCHED=dmda</c> . Use <c>help</c> to get the list of available schedulers.
+
+The <b>eager</b> scheduler uses a central task queue, from which workers draw tasks
+to work on. This however does not allow prefetching data, since the scheduling
+decision is taken late. If a task has a non-0 priority, it is put at the front of the queue.
+
+The <b>prio</b> scheduler also uses a central task queue, but sorts tasks by
+priority (between -5 and 5).
+
+The <b>random</b> scheduler distributes tasks randomly according to assumed worker
+overall performance.
+
+The <b>ws</b> (work stealing) scheduler schedules tasks on the local worker by
+default. When a worker becomes idle, it steals a task from the most loaded
+worker.
+
+The <b>dm</b> (deque model) scheduler takes task execution performance models
+into account to perform an HEFT-like scheduling strategy: it schedules tasks
+where their termination time will be minimal.
+
+The <b>dmda</b> (deque model data aware) scheduler is similar to dm, but it also
+takes data transfer time into account.
+
+The <b>dmdar</b> (deque model data aware ready) scheduler is similar to dmda,
+it also sorts tasks on per-worker queues by number of already-available data
+buffers.
+
+The <b>dmdas</b> (deque model data aware sorted) scheduler is similar to dmda, it
+also supports arbitrary priority values.
+
+The <b>heft</b> (heterogeneous earliest finish time) scheduler is deprecated. It
+is now just an alias for <b>dmda</b>.
+
+The <b>pheft</b> (parallel HEFT) scheduler is similar to heft, it also supports
+parallel tasks (still experimental).
+
+The <b>peager</b> (parallel eager) scheduler is similar to eager, it also
+supports parallel tasks (still experimental).
+
+\section PerformanceModelCalibration Performance Model Calibration
+
+Most schedulers are based on an estimation of codelet duration on each kind
+of processing unit. For this to be possible, the application programmer needs
+to configure a performance model for the codelets of the application (see
+\ref PerformanceModelExample for instance). History-based performance models
+use on-line calibration.  StarPU will automatically calibrate codelets
+which have never been calibrated yet, and save the result in
+<c>$STARPU_HOME/.starpu/sampling/codelets</c>.
+The models are indexed by machine name. To share the models between
+machines (e.g. for a homogeneous cluster), use <c>export
+STARPU_HOSTNAME=some_global_name</c>. To force continuing calibration,
+use <c>export STARPU_CALIBRATE=1</c> . This may be necessary if your application
+has not-so-stable performance. StarPU will force calibration (and thus ignore
+the current result) until 10 (<c>_STARPU_CALIBRATION_MINIMUM</c>) measurements have been
+made on each architecture, to avoid badly scheduling tasks just because the
+first measurements were not so good. Details on the current performance model status
+can be obtained from the command <c>starpu_perfmodel_display</c>: the <c>-l</c>
+option lists the available performance models, and the <c>-s</c> option selects
+the performance model to be displayed. The result looks like:
+
+\verbatim
+$ starpu_perfmodel_display -s starpu_dlu_lu_model_22
+performance model for cpu
+# hash    size     mean          dev           n
+880805ba  98304    2.731309e+02  6.010210e+01  1240
+b50b6605  393216   1.469926e+03  1.088828e+02  1240
+5c6c3401  1572864  1.125983e+04  3.265296e+03  1240
+\endverbatim
+
+This shows that for the LU 22 kernel with a 1.5MiB matrix, the average
+execution time on CPUs was about 11ms, with a 3ms standard deviation, over
+1240 samples. It is a good idea to check this before doing actual performance
+measurements.
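+
+The <c>mean</c> and <c>dev</c> columns are plain sample statistics over the
+<c>n</c> measurements of each entry. For reference, they can be recomputed
+as follows (a sketch, not StarPU's actual code):
+
+\code{.c}
+#include <math.h>
+
+/* Sample mean and standard deviation of n measurements. */
+static void sample_stats(const double *t, int n, double *mean, double *dev)
+{
+    double sum = 0.0, sq = 0.0;
+    int i;
+    for (i = 0; i < n; i++) {
+        sum += t[i];
+        sq += t[i] * t[i];
+    }
+    *mean = sum / n;
+    *dev = sqrt(sq / n - (*mean) * (*mean));
+}
+\endcode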
+
+A graph can be drawn by using the tool <c>starpu_perfmodel_plot</c>:
+
+\verbatim
+$ starpu_perfmodel_plot -s starpu_dlu_lu_model_22
+98304 393216 1572864
+$ gnuplot starpu_starpu_dlu_lu_model_22.gp
+$ gv starpu_starpu_dlu_lu_model_22.eps
+\endverbatim
+
+If a kernel source code was modified (e.g. for a performance improvement), the
+calibration information is stale and should be dropped, so as to re-calibrate
+from scratch. This can be done by using <c>export STARPU_CALIBRATE=2</c>.
+
+Note: due to CUDA limitations, to be able to measure kernel duration,
+calibration mode needs to disable asynchronous data transfers. Calibration thus
+disables data transfer / computation overlapping, and should thus not be used
+for final benchmarks. Note 2: history-based performance models are calibrated
+only if a performance-model-based scheduler is chosen.
+
+The history-based performance models can also be explicitly filled by the
+application without execution, if e.g. the application already has a series of
+measurements. This can be done by using starpu_perfmodel_update_history(),
+for instance:
+
+\code{.c}
+static struct starpu_perfmodel perf_model = {
+    .type = STARPU_HISTORY_BASED,
+    .symbol = "my_perfmodel",
+};
+
+struct starpu_codelet cl = {
+    .where = STARPU_CUDA,
+    .cuda_funcs = { cuda_func1, cuda_func2, NULL },
+    .nbuffers = 1,
+    .modes = {STARPU_W},
+    .model = &perf_model
+};
+
+void feed(void) {
+    struct my_measure *measure;
+    struct starpu_task task;
+    starpu_task_init(&task);
+
+    task.cl = &cl;
+
+    for (measure = &measures[0]; measure < measures[last]; measure++) {
+        starpu_data_handle_t handle;
+	starpu_vector_data_register(&handle, -1, 0, measure->size, sizeof(float));
+	task.handles[0] = handle;
+	starpu_perfmodel_update_history(&perf_model, &task,
+	                                STARPU_CUDA_DEFAULT + measure->cudadev, 0,
+	                                measure->implementation, measure->time);
+	starpu_task_clean(&task);
+	starpu_data_unregister(handle);
+    }
+}
+\endcode
+
+Measurements have to be provided in milliseconds for the completion time models,
+and in Joules for the energy consumption models.
+
+\section TaskDistributionVsDataTransfer Task Distribution Vs Data Transfer
+
+Distributing tasks to balance the load induces data transfer penalty. StarPU
+thus needs to find a balance between both. The target function that the
+<c>dmda</c> scheduler of StarPU
+tries to minimize is <c>alpha * T_execution + beta * T_data_transfer</c>, where
+<c>T_execution</c> is the estimated execution time of the codelet (usually
+accurate), and <c>T_data_transfer</c> is the estimated data transfer time. The
+latter is estimated based on bus calibration before execution start,
+i.e. with an idle machine, thus without contention. You can force bus
+re-calibration by running the tool <c>starpu_calibrate_bus</c>. The
+beta parameter defaults to 1, but it can be worth trying to tweak it
+by using <c>export STARPU_SCHED_BETA=2</c> for instance, since during
+real application execution, contention makes transfer times bigger.
+This is of course imprecise, but in practice a rough estimation
+already gives results as good as a precise estimation would.
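+
+The trade-off can be written down directly. A toy version of this objective
+(simplified: the real <c>dmda</c> also accounts for when each worker will
+become available) picks the worker minimizing the weighted sum:
+
+\code{.c}
+struct worker_estimate {
+    double t_execution;     /* predicted codelet duration on this worker (us) */
+    double t_data_transfer; /* predicted transfer time to this worker (us) */
+};
+
+/* Pick the worker minimizing alpha*T_execution + beta*T_data_transfer. */
+static int pick_worker(const struct worker_estimate *w, int n,
+                       double alpha, double beta)
+{
+    int best = 0, i;
+    double best_cost = alpha * w[0].t_execution + beta * w[0].t_data_transfer;
+    for (i = 1; i < n; i++) {
+        double cost = alpha * w[i].t_execution + beta * w[i].t_data_transfer;
+        if (cost < best_cost) {
+            best_cost = cost;
+            best = i;
+        }
+    }
+    return best;
+}
+\endcode
+
+With a CPU estimated at 100 us with local data and a GPU at 10 us plus a
+50 us transfer, the default beta = 1 picks the GPU; raising beta to 2, as
+<c>STARPU_SCHED_BETA=2</c> would, makes the transfer too costly and the CPU
+wins.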
+
+\section DataPrefetch Data Prefetch
+
+The <c>heft</c>, <c>dmda</c> and <c>pheft</c> scheduling policies
+perform data prefetch (see \ref STARPU_PREFETCH):
+as soon as a scheduling decision is taken for a task, requests are issued to
+transfer its required data to the target processing unit, if needed, so that
+when the processing unit actually starts the task, its data will hopefully be
+already available and it will not have to wait for the transfer to finish.
+
+The application may want to perform some manual prefetching, for several reasons
+such as excluding initial data transfers from performance measurements, or
+setting up an initial statically-computed data distribution on the machine
+before submitting tasks, which will thus guide StarPU toward an initial task
+distribution (since StarPU will try to avoid further transfers).
+
+This can be achieved by giving the function starpu_data_prefetch_on_node()
+the handle and the desired target memory node.
+
+\section Power-basedScheduling Power-based Scheduling
+
+If the application can provide some power performance model (through
+the <c>power_model</c> field of the codelet structure), StarPU will
+take it into account when distributing tasks. The target function that
+the <c>dmda</c> scheduler minimizes becomes <c>alpha * T_execution +
+beta * T_data_transfer + gamma * Consumption</c> , where <c>Consumption</c>
+is the estimated task consumption in Joules. To tune this parameter, use
+<c>export STARPU_SCHED_GAMMA=3000</c> for instance, to express that each Joule
+(i.e. kW during 1000 us) is worth 3000 us of execution time penalty. Setting
+<c>alpha</c> and <c>beta</c> to zero makes the scheduler take only power consumption into account.
+
+This is however not sufficient to correctly optimize power: the scheduler would
+simply tend to run all computations on the most energy-conservative processing
+unit. To account for the consumption of the whole machine (including idle
+processing units), the idle power of the machine should be given by setting
+<c>export STARPU_IDLE_POWER=200</c> for 200W, for instance. This value can often
+be obtained from the machine power supplier.
+
+The power actually consumed by the total execution can be displayed by setting
+<c>export STARPU_PROFILING=1 STARPU_WORKER_STATS=1</c> .
+
+On-line task consumption measurement is currently only supported through the
+<c>CL_PROFILING_POWER_CONSUMED</c> OpenCL extension, implemented in the MoviSim
+simulator. Applications can however provide explicit measurements by
+using the function starpu_perfmodel_update_history() (exemplified in \ref PerformanceModelExample
+with the <c>power_model</c> performance model). Fine-grain
+measurement is often not feasible with the feedback provided by the hardware, so
+the user can for instance run a given task a thousand times, measure the global
+consumption for that series of tasks, divide it by a thousand, repeat for
+varying kinds of tasks and task sizes, and eventually feed StarPU
+with these manual measurements through starpu_perfmodel_update_history().
+
+\section StaticScheduling Static Scheduling
+
+In some cases, one may want to force some scheduling, for instance force a given
+set of tasks to GPU0, another set to GPU1, etc. while letting some other tasks
+be scheduled on any other device. This can indeed be useful to guide StarPU into
+some work distribution, while still letting some degree of dynamism. For
+instance, to force execution of a task on CUDA0:
+
+\code{.c}
+task->execute_on_a_specific_worker = 1;
+task->worker = starpu_worker_get_by_type(STARPU_CUDA_WORKER, 0);
+\endcode
+
+\section Profiling Profiling
+
+A quick view of how many tasks each worker has executed can be obtained by setting
+<c>export STARPU_WORKER_STATS=1</c>. This is a convenient way to check that
+execution did happen on accelerators without penalizing performance with
+the profiling overhead.
+
+A quick view of how much data transfers have been issued can be obtained by setting
+<c>export STARPU_BUS_STATS=1</c> .
+
+More detailed profiling information can be enabled by using <c>export STARPU_PROFILING=1</c> or by
+calling starpu_profiling_status_set() from the source code.
+Statistics on the execution can then be obtained by setting <c>export
+STARPU_BUS_STATS=1</c> and <c>export STARPU_WORKER_STATS=1</c>.
+More details on performance feedback are provided in the next chapter.
+
+\section CUDA-specificOptimizations CUDA-specific Optimizations
+
+Due to CUDA limitations, StarPU will have a hard time overlapping its own
+communications and the codelet computations if the application does not use a
+dedicated CUDA stream for its computations instead of the default stream,
+which synchronizes all operations of the GPU. StarPU provides a dedicated stream
+through starpu_cuda_get_local_stream(), which should be used by all CUDA codelet
+operations to avoid this issue. For instance:
+
+\code{.c}
+func <<<grid,block,0,starpu_cuda_get_local_stream()>>> (foo, bar);
+cudaStreamSynchronize(starpu_cuda_get_local_stream());
+\endcode
+
+StarPU already makes the appropriate calls for the CUBLAS library.
+
+Unfortunately, some CUDA libraries do not have stream variants of
+kernels. That will lower the potential for overlapping.
+
+\section PerformanceDebugging Performance Debugging
+
+To get an idea of what is happening, a lot of performance feedback is available,
+as detailed in the next chapter. The following points should be checked:
+
+<ul>
+<li>
+What does the Gantt diagram look like? (see \ref CreatingAGanttDiagram)
+<ul>
+  <li> If it's mostly green (tasks running in the initial context), or if a
+  context-specific color prevails, then the machine is properly
+  utilized, and perhaps the codelets are just slow. Check their performance, see
+  \ref PerformanceOfCodelets.
+  </li>
+  <li> If it's mostly purple (FetchingInput), tasks keep waiting for data
+  transfers, do you perhaps have far more communication than computation? Did
+  you properly use CUDA streams to make sure communication can be
+  overlapped? Did you use data-locality aware schedulers to avoid transfers as
+  much as possible?
+  </li>
+  <li> If it's mostly red (Blocked), tasks keep waiting for dependencies,
+  do you have enough parallelism? It might be a good idea to check what the DAG
+  looks like (see \ref CreatingADAGWithGraphviz).
+  </li>
+  <li> If only some workers are completely red (Blocked), for some reason the
+  scheduler didn't assign tasks to them. Perhaps the performance model is bogus,
+  check it (see \ref PerformanceOfCodelets). Do all your codelets have a
+  performance model?  When some of them don't, the scheduler switches to a
+  greedy algorithm, which thus performs badly.
+  </li>
+</ul>
+</li>
+</ul>
+
+You can also use the Temanejo task debugger (see \ref UsingTheTemanejoTaskDebugger) to
+visualize the task graph more easily.
+
+\section SimulatedPerformance Simulated Performance
+
+StarPU can use Simgrid in order to simulate execution on an arbitrary
+platform.
+
+\subsection Calibration Calibration
+
+The idea is to first compile StarPU normally, and run the application,
+so as to automatically benchmark the bus and the codelets.
+
+\verbatim
+$ ./configure && make
+$ STARPU_SCHED=dmda ./examples/matvecmult/matvecmult
+[starpu][_starpu_load_history_based_model] Warning: model matvecmult
+   is not calibrated, forcing calibration for this run. Use the
+   STARPU_CALIBRATE environment variable to control this.
+$ ...
+$ STARPU_SCHED=dmda ./examples/matvecmult/matvecmult
+TEST PASSED
+\endverbatim
+
+Note that we force the use of the dmda scheduler to generate performance
+models for the application. The application may need to be run several
+times before the model is calibrated.
+
+\subsection Simulation Simulation
+
+Then, recompile StarPU, passing \ref enable-simgrid "--enable-simgrid"
+to <c>./configure</c>, and re-run the application:
+
+\verbatim
+$ ./configure --enable-simgrid && make
+$ STARPU_SCHED=dmda ./examples/matvecmult/matvecmult
+TEST FAILED !!!
+\endverbatim
+
+It is normal that the test fails: since the computations are not actually done
+(that is the whole point of simgrid), the result is of course wrong.
+
+If the performance model is not calibrated enough, the following error
+message will be displayed:
+
+\verbatim
+$ STARPU_SCHED=dmda ./examples/matvecmult/matvecmult
+[starpu][_starpu_load_history_based_model] Warning: model matvecmult
+    is not calibrated, forcing calibration for this run. Use the
+    STARPU_CALIBRATE environment variable to control this.
+[starpu][_starpu_simgrid_execute_job][assert failure] Codelet
+    matvecmult does not have a perfmodel, or is not calibrated enough
+\endverbatim
+
+The number of devices can be chosen as usual with \ref STARPU_NCPU,
+\ref STARPU_NCUDA, and \ref STARPU_NOPENCL.  For now, only the number of
+CPUs can be arbitrarily chosen. The numbers of CUDA and OpenCL devices have to be
+lower than the real numbers on the current machine.
+
+The amount of simulated GPU memory is for now unbounded by default, but
+it can be chosen by hand through the \ref STARPU_LIMIT_CUDA_MEM,
+\ref STARPU_LIMIT_CUDA_devid_MEM, \ref STARPU_LIMIT_OPENCL_MEM, and
+\ref STARPU_LIMIT_OPENCL_devid_MEM environment variables.
+
+The Simgrid default stack size is small; to increase it use the
+parameter <c>--cfg=contexts/stack_size</c>, for example:
+
+\verbatim
+$ ./example --cfg=contexts/stack_size:8192
+TEST FAILED !!!
+\endverbatim
+
+Note: of course, if the application uses <c>gettimeofday</c> to make its
+performance measurements, the real time will be used, which will be bogus. To
+get the simulated time, it has to use starpu_timing_now(), which returns the
+virtual timestamp in microseconds.
+
+\subsection SimulationOnAnotherMachine Simulation On Another Machine
+
+The simgrid support even permits simulations to be run on another machine,
+typically your desktop. To achieve this, one still needs to perform the Calibration
+step on the actual machine to be simulated, then copy the resulting performance
+models to the desktop machine (the <c>$STARPU_HOME/.starpu</c> directory). One can then perform the
+Simulation step on the desktop machine, by setting the environment
+variable \ref STARPU_HOSTNAME to the name of the actual machine, to
+make StarPU use the performance models of the simulated machine even
+on the desktop machine.
+
+If the desktop machine does not have CUDA or OpenCL, StarPU is still able to
+use simgrid to simulate execution with CUDA/OpenCL devices, but the application
+source code will probably disable the CUDA and OpenCL codelets in that
+case. Since the codelet functions are not actually called during simgrid
+execution, one can use dummy functions such as the following to still permit
+CUDA or OpenCL execution:
+
+\snippet simgrid.c To be included
+
+*/

+ 580 - 0
doc/doxygen/chapters/performance_feedback.doxy

@@ -0,0 +1,580 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page PerformanceFeedback Performance Feedback
+
+\section UsingTheTemanejoTaskDebugger Using The Temanejo Task Debugger
+
+StarPU can connect to Temanejo (see
+http://www.hlrs.de/temanejo) to permit
+nice visual task debugging. To do so, build Temanejo's <c>libayudame.so</c>,
+install <c>Ayudame.h</c> to e.g. <c>/usr/local/include</c>, apply the patch
+<c>tools/patch-ayudame</c> to it to fix the C build, re-run <c>./configure</c>, make
+sure that it found it, and rebuild StarPU.  Then run the Temanejo GUI, and give it the path
+to your application, any options you want to pass it, and the path to <c>libayudame.so</c>.
+
+Make sure to specify at least the same number of CPUs in the dialog box as your
+machine has, otherwise an error will happen during execution. Future versions
+of Temanejo should be able to tell StarPU the number of CPUs to use.
+
+Tag numbers have to be below <c>4000000000000000000ULL</c> to be usable for
+Temanejo (so as to distinguish them from tasks).
+
+\section On-linePerformanceFeedback On-line Performance Feedback
+
+\subsection EnablingOn-linePerformanceMonitoring Enabling On-line Performance Monitoring
+
+In order to enable online performance monitoring, the application can
+call starpu_profiling_status_set() with the parameter
+::STARPU_PROFILING_ENABLE. It is possible to detect whether monitoring
+is already enabled or not by calling starpu_profiling_status_get().
+Enabling monitoring also reinitializes all previously collected
+feedback. The environment variable \ref STARPU_PROFILING can also be
+set to 1 to achieve the same effect.
+
+Likewise, performance monitoring is stopped by calling
+starpu_profiling_status_set() with the parameter
+::STARPU_PROFILING_DISABLE. Note that this does not reset the
+performance counters so that the application may consult them later
+on.
+
+More details about the performance monitoring API are available in \ref API_Profiling.
+
+\subsection Per-taskFeedback Per-task Feedback
+
+If profiling is enabled, a pointer to a structure
+starpu_profiling_task_info is put in the field
+starpu_task::profiling_info when a task terminates. This structure is
+automatically destroyed when the task structure is destroyed, either
+automatically or by calling starpu_task_destroy().
+
+The structure starpu_profiling_task_info indicates the date when the
+task was submitted (starpu_profiling_task_info::submit_time), started
+(starpu_profiling_task_info::start_time), and terminated
+(starpu_profiling_task_info::end_time), relative to the initialization
+of StarPU with starpu_init(). It also specifies the identifier of the worker
+that has executed the task (starpu_profiling_task_info::workerid).
+These dates are stored as <c>timespec</c> structures, which the user may convert
+into micro-seconds by using the helper function
+starpu_timing_timespec_to_us().
+
+It is worth noting that the application may directly access this structure from
+the callback executed at the end of the task. The structure starpu_task
+associated to the callback currently being executed is indeed accessible with
+the function starpu_task_get_current().
+
+\subsection Per-codeletFeedback Per-codelet Feedback
+
+The field starpu_codelet::per_worker_stats is
+an array of counters. The i-th entry of the array is incremented every time a
+task implementing the codelet is executed on the i-th worker.
+This array is not reinitialized when profiling is enabled or disabled.
+
+\subsection Per-workerFeedback Per-worker Feedback
+
+The second argument returned by the function
+starpu_profiling_worker_get_info() is a structure
+starpu_profiling_worker_info that gives statistics about the specified
+worker. This structure specifies when StarPU started collecting
+profiling information for that worker
+(starpu_profiling_worker_info::start_time), the
+duration of the profiling measurement interval
+(starpu_profiling_worker_info::total_time), the time spent executing
+kernels (starpu_profiling_worker_info::executing_time), the time
+spent sleeping because there is no task to execute at all
+(starpu_profiling_worker_info::sleeping_time), and the number of tasks that were executed
+while profiling was enabled. These values give an estimation of the
+proportion of time spent doing real work, versus the time spent either
+sleeping because there are not enough executable tasks or simply
+wasted in pure StarPU overhead.
+
+Calling starpu_profiling_worker_get_info() resets the profiling
+information associated to a worker.
+
+When an FxT trace is generated (see \ref GeneratingTracesWithFxT), it is also
+possible to use the tool <c>starpu_workers_activity</c> (see \ref
+MonitoringActivity) to generate a graphic showing the evolution of
+these values over time, for the different workers.
+
+\subsection Bus-relatedFeedback Bus-related Feedback
+
+TODO: add \ref STARPU_BUS_STATS
+
+\internal
+how to enable/disable performance monitoring
+what kind of information do we get ?
+\endinternal
+
+The bus speed measured by StarPU can be displayed by using the tool
+<c>starpu_machine_display</c>, for instance:
+
+\verbatim
+StarPU has found:
+        3 CUDA devices
+                CUDA 0 (Tesla C2050 02:00.0)
+                CUDA 1 (Tesla C2050 03:00.0)
+                CUDA 2 (Tesla C2050 84:00.0)
+from    to RAM          to CUDA 0       to CUDA 1       to CUDA 2
+RAM     0.000000        5176.530428     5176.492994     5191.710722
+CUDA 0  4523.732446     0.000000        2414.074751     2417.379201
+CUDA 1  4523.718152     2414.078822     0.000000        2417.375119
+CUDA 2  4534.229519     2417.069025     2417.060863     0.000000
+\endverbatim
+
+\subsection StarPU-TopInterface StarPU-Top Interface
+
+StarPU-Top is an interface which remotely displays the on-line state of a StarPU
+application and permits the user to change parameters on the fly.
+
+Variables to be monitored can be registered by calling the functions
+starpu_top_add_data_boolean(), starpu_top_add_data_integer(),
+starpu_top_add_data_float(), e.g.:
+
+\code{.c}
+starpu_top_data *data = starpu_top_add_data_integer("mynum", 0, 100, 1);
+\endcode
+
+The application should then call starpu_top_init_and_wait() to give its name
+and wait for StarPU-Top to get a start request from the user. The name is used
+by StarPU-Top to quickly reload a previously-saved layout of parameter display.
+
+\code{.c}
+starpu_top_init_and_wait("the application");
+\endcode
+
+The new values can then be provided thanks to
+starpu_top_update_data_boolean(), starpu_top_update_data_integer(),
+starpu_top_update_data_float(), e.g.:
+
+\code{.c}
+starpu_top_update_data_integer(data, mynum);
+\endcode
+
+Updateable parameters can be registered thanks to starpu_top_register_parameter_boolean(), starpu_top_register_parameter_integer(), starpu_top_register_parameter_float(), e.g.:
+
+\code{.c}
+float alpha;
+starpu_top_register_parameter_float("alpha", &alpha, 0, 10, modif_hook);
+\endcode
+
+<c>modif_hook</c> is a function which will be called when the parameter is modified; it can for instance print the new value:
+
+\code{.c}
+void modif_hook(struct starpu_top_param *d) {
+    fprintf(stderr,"%s has been modified: %f\n", d->name, alpha);
+}
+\endcode
+
+Task schedulers should notify StarPU-Top when they have decided when a task will be
+scheduled, so that it can show it in its Gantt chart, for instance:
+
+\code{.c}
+starpu_top_task_prevision(task, workerid, begin, end);
+\endcode
+
+Starting StarPU-Top (via the binary <c>starpu_top</c>) and the application
+can be done in two ways:
+
+<ul>
+<li> The application is started by hand on some machine (and thus already
+waiting for the start event). In the Preference dialog of StarPU-Top, the SSH
+checkbox should be unchecked, and the hostname and port (default is 2011) on
+which the application is already running should be specified. Clicking on the
+connection button will thus connect to the already-running application.
+</li>
+<li> StarPU-Top is started first, and clicking on the connection button will
+start the application itself (possibly on a remote machine). The SSH checkbox
+should be checked, and a command line provided, e.g.:
+
+\verbatim
+$ ssh myserver STARPU_SCHED=dmda ./application
+\endverbatim
+
+If port 2011 of the remote machine cannot be accessed directly, an SSH port forwarding should be added:
+
+\verbatim
+$ ssh -L 2011:localhost:2011 myserver STARPU_SCHED=dmda ./application
+\endverbatim
+
+and "localhost" should be used as the IP address to connect to.
+</li>
+</ul>
+
+\section Off-linePerformanceFeedback Off-line Performance Feedback
+
+\subsection GeneratingTracesWithFxT Generating Traces With FxT
+
+StarPU can use the FxT library (see
+https://savannah.nongnu.org/projects/fkt/) to generate traces
+with a limited runtime overhead.
+
+You can either get a tarball:
+
+\verbatim
+$ wget http://download.savannah.gnu.org/releases/fkt/fxt-0.2.11.tar.gz
+\endverbatim
+
+or use the FxT library from CVS (autotools are required):
+
+\verbatim
+$ cvs -d :pserver:anonymous\@cvs.sv.gnu.org:/sources/fkt co FxT
+$ ./bootstrap
+\endverbatim
+
+Compiling and installing the FxT library in the <c>$FXTDIR</c> path is
+done following the standard procedure:
+
+\verbatim
+$ ./configure --prefix=$FXTDIR
+$ make
+$ make install
+\endverbatim
+
+In order to have StarPU generate traces, StarPU should be configured with
+the option \ref with-fxt "--with-fxt":
+
+\verbatim
+$ ./configure --with-fxt=$FXTDIR
+\endverbatim
+
+Or you can simply point <c>PKG_CONFIG_PATH</c> to
+<c>$FXTDIR/lib/pkgconfig</c> and pass
+\ref with-fxt "--with-fxt" to <c>./configure</c>.
+
+When FxT is enabled, a trace is generated when StarPU is terminated by calling
+starpu_shutdown(). The trace is a binary file whose name has the form
+<c>prof_file_XXX_YYY</c> where <c>XXX</c> is the user name, and
+<c>YYY</c> is the pid of the process that used StarPU. This file is saved in the
+<c>/tmp/</c> directory by default, or in the directory specified by
+the environment variable \ref STARPU_FXT_PREFIX.
+
+\subsection CreatingAGanttDiagram Creating a Gantt Diagram
+
+When the FxT trace file <c>filename</c> has been generated, it is possible to
+generate a trace in the Paje format by calling:
+
+\verbatim
+$ starpu_fxt_tool -i filename
+\endverbatim
+
+Or alternatively, setting the environment variable \ref STARPU_GENERATE_TRACE
+to <c>1</c> before application execution will make StarPU do it automatically at
+application shutdown.
+
+This will create a file <c>paje.trace</c> in the current directory that
+can be inspected with the <a href="http://vite.gforge.inria.fr/">ViTE trace
+visualizing open-source tool</a>.  It is possible to open the
+<c>paje.trace</c> file with ViTE by using the following command:
+
+\verbatim
+$ vite paje.trace
+\endverbatim
+
+To get names of tasks instead of "unknown", fill the optional
+starpu_codelet::name, or use a performance model for them.
+
+In the MPI execution case, collect the trace files from the MPI nodes, and
+specify them all on the <c>starpu_fxt_tool</c> command line, for instance:
+
+\verbatim
+$ starpu_fxt_tool -i filename1 -i filename2
+\endverbatim
+
+By default, all tasks are displayed using a green color. To display tasks with
+varying colors, pass option <c>-c</c> to <c>starpu_fxt_tool</c>.
+
+Traces can also be inspected by hand by using the tool <c>fxt_print</c>, for instance:
+
+\verbatim
+$ fxt_print -o -f filename
+\endverbatim
+
+Timings are in nanoseconds (while timings as seen in <c>vite</c> are in milliseconds).
+
+\subsection CreatingADAGWithGraphviz Creating a DAG With Graphviz
+
+When the FxT trace file <c>filename</c> has been generated, it is possible to
+generate a task graph in the DOT format by calling:
+
+\verbatim
+$ starpu_fxt_tool -i filename
+\endverbatim
+
+This will create a <c>dag.dot</c> file in the current directory. This file is a
+task graph described using the DOT language. It is possible to get a
+graphical output of the graph by using the graphviz library:
+
+\verbatim
+$ dot -Tpdf dag.dot -o output.pdf
+\endverbatim
+
+\subsection MonitoringActivity Monitoring Activity
+
+When the FxT trace file <c>filename</c> has been generated, it is possible to
+generate an activity trace by calling:
+
+\verbatim
+$ starpu_fxt_tool -i filename
+\endverbatim
+
+This will create an <c>activity.data</c> file in the current
+directory. A profile of the application showing the activity of StarPU
+during the execution of the program can be generated:
+
+\verbatim
+$ starpu_workers_activity activity.data
+\endverbatim
+
+This will create a file named <c>activity.eps</c> in the current directory.
+This picture is composed of two parts.
+The first part shows the activity of the different workers. The green sections
+indicate which proportion of the time was spent executing kernels on the
+processing unit. The red sections indicate the proportion of time spent in
+StarPU: a large overhead may indicate that the granularity is too
+low, and that bigger tasks may be needed to use the processing unit more
+efficiently. The black sections indicate that the processing unit was blocked
+because there was no task to process: this may indicate a lack of parallelism,
+which may be alleviated by creating more tasks when possible.
+
+The second part of the <c>activity.eps</c> picture is a graph showing the
+evolution of the number of tasks available in the system during the execution.
+Ready tasks are shown in black, and tasks that are submitted but not
+schedulable yet are shown in grey.
+
+\section PerformanceOfCodelets Performance Of Codelets
+
+The performance model of codelets (see \ref PerformanceModelExample)
+can be examined by using the tool <c>starpu_perfmodel_display</c>:
+
+\verbatim
+$ starpu_perfmodel_display -l
+file: <malloc_pinned.hannibal>
+file: <starpu_slu_lu_model_21.hannibal>
+file: <starpu_slu_lu_model_11.hannibal>
+file: <starpu_slu_lu_model_22.hannibal>
+file: <starpu_slu_lu_model_12.hannibal>
+\endverbatim
+
+Here, the codelets of the lu example are available. We can examine the
+performance of the 22 kernel (in micro-seconds), which is history-based:
+
+\verbatim
+$ starpu_perfmodel_display -s starpu_slu_lu_model_22
+performance model for cpu
+# hash      size       mean          dev           n
+57618ab0    19660800   2.851069e+05  1.829369e+04  109
+performance model for cuda_0
+# hash      size       mean          dev           n
+57618ab0    19660800   1.164144e+04  1.556094e+01  315
+performance model for cuda_1
+# hash      size       mean          dev           n
+57618ab0    19660800   1.164271e+04  1.330628e+01  360
+performance model for cuda_2
+# hash      size       mean          dev           n
+57618ab0    19660800   1.166730e+04  3.390395e+02  456
+\endverbatim
+
+We can see that for the given size, over a sample of a few hundred
+executions, the GPUs are about 20 times faster than the CPUs (numbers are in
+us). The standard deviation is extremely low for the GPUs, and less than 10% for
+the CPUs.
+
+This tool can also be used for regression-based performance models. It will then
+display the regression formula, and in the case of non-linear regression, the
+same performance log as for history-based performance models:
+
+\verbatim
+$ starpu_perfmodel_display -s non_linear_memset_regression_based
+performance model for cpu_impl_0
+	Regression : #sample = 1400
+	Linear: y = alpha size ^ beta
+		alpha = 1.335973e-03
+		beta = 8.024020e-01
+	Non-Linear: y = a size ^b + c
+		a = 5.429195e-04
+		b = 8.654899e-01
+		c = 9.009313e-01
+# hash		size		mean		stddev		n
+a3d3725e	4096           	4.763200e+00   	7.650928e-01   	100
+870a30aa	8192           	1.827970e+00   	2.037181e-01   	100
+48e988e9	16384          	2.652800e+00   	1.876459e-01   	100
+961e65d2	32768          	4.255530e+00   	3.518025e-01   	100
+...
+\endverbatim
+
+The same can also be achieved by using StarPU's library API, see
+\ref API_Performance_Model and notably the function
+starpu_perfmodel_load_symbol(). The source code of the tool
+<c>starpu_perfmodel_display</c> can be a useful example.
+
+The tool <c>starpu_perfmodel_plot</c> can be used to draw performance
+models. It writes a <c>.gp</c> file in the current directory, to be
+run in the <c>gnuplot</c> tool, which shows the corresponding curve.
+
+When the field starpu_task::flops is set, <c>starpu_perfmodel_plot</c> can
+directly draw a GFlops curve, by simply adding the <c>-f</c> option:
+
+\verbatim
+$ starpu_perfmodel_plot -f -s chol_model_11
+\endverbatim
+
+This will however disable displaying the regression model, for which we cannot
+compute GFlops.
+
+When the FxT trace file <c>filename</c> has been generated, it is possible to
+get a profiling of each codelet by calling:
+
+\verbatim
+$ starpu_fxt_tool -i filename
+$ starpu_codelet_profile distrib.data codelet_name
+\endverbatim
+
+This will create profiling data files, and a <c>.gp</c> file in the current
+directory, which draws the distribution of codelet time over the application
+execution, according to data input size.
+
+This is also available in the tool <c>starpu_perfmodel_plot</c>, by passing it
+the fxt trace:
+
+\verbatim
+$ starpu_perfmodel_plot -s non_linear_memset_regression_based -i /tmp/prof_file_foo_0
+\endverbatim
+
+It will produce a <c>.gp</c> file which contains both the performance model
+curves, and the profiling measurements.
+
+If you have the <c>R</c> statistical tool installed, you can additionally use
+
+\verbatim
+$ starpu_codelet_histo_profile distrib.data
+\endverbatim
+
+This will create one pdf file per codelet and per input size, showing a
+histogram of the codelet execution time distribution.
+
+\section TheoreticalLowerBoundOnExecutionTime Theoretical Lower Bound On Execution Time
+
+StarPU can record a trace of what tasks are needed to complete the
+application, and then, by using a linear system, provide a theoretical lower
+bound of the execution time (i.e. with an ideal scheduling).
+
+The computed bound is not really correct when dependencies are not taken into
+account, but for an application which has enough parallelism it is very
+close to the bound computed with dependencies enabled (which takes far
+more time to compute), and thus provides a good-enough estimation of the ideal
+execution time.
+
+\ref TheoreticalLowerBoundOnExecutionTimeExample provides an example on how to
+use this.
+
+\section MemoryFeedback Memory Feedback
+
+It is possible to enable memory statistics. To do so, you need to pass
+the option \ref enable-memory-stats "--enable-memory-stats" when running <c>configure</c>. It is then
+possible to call the function starpu_display_memory_stats() to
+display statistics about the current data handles registered within StarPU.
+
+Moreover, statistics will be displayed at the end of the execution on
+data handles which have not been cleared out. This can be disabled by
+setting the environment variable \ref STARPU_MEMORY_STATS to 0.
+
+For example, if you do not unregister data at the end of the complex
+example, you will get something similar to:
+
+\verbatim
+$ STARPU_MEMORY_STATS=0 ./examples/interface/complex
+Complex[0] = 45.00 + 12.00 i
+Complex[0] = 78.00 + 78.00 i
+Complex[0] = 45.00 + 12.00 i
+Complex[0] = 45.00 + 12.00 i
+\endverbatim
+
+\verbatim
+$ STARPU_MEMORY_STATS=1 ./examples/interface/complex
+Complex[0] = 45.00 + 12.00 i
+Complex[0] = 78.00 + 78.00 i
+Complex[0] = 45.00 + 12.00 i
+Complex[0] = 45.00 + 12.00 i
+
+#---------------------
+Memory stats:
+#-------
+Data on Node #3
+#-----
+Data : 0x553ff40
+Size : 16
+
+#--
+Data access stats
+/!\ Work Underway
+Node #0
+	Direct access : 4
+	Loaded (Owner) : 0
+	Loaded (Shared) : 0
+	Invalidated (was Owner) : 0
+
+Node #3
+	Direct access : 0
+	Loaded (Owner) : 0
+	Loaded (Shared) : 1
+	Invalidated (was Owner) : 0
+
+#-----
+Data : 0x5544710
+Size : 16
+
+#--
+Data access stats
+/!\ Work Underway
+Node #0
+	Direct access : 2
+	Loaded (Owner) : 0
+	Loaded (Shared) : 1
+	Invalidated (was Owner) : 1
+
+Node #3
+	Direct access : 0
+	Loaded (Owner) : 1
+	Loaded (Shared) : 0
+	Invalidated (was Owner) : 0
+\endverbatim
+
+\section DataStatistics Data Statistics
+
+Different data statistics can be displayed at the end of the execution
+of the application. To enable them, you need to pass the option
+\ref enable-stats "--enable-stats" when calling <c>configure</c>. When calling
+starpu_shutdown(), various statistics on the execution will be displayed:
+MSI cache statistics, allocation cache statistics, and data
+transfer statistics. The display can be disabled by setting the
+environment variable \ref STARPU_STATS to 0.
+
+\verbatim
+$ ./examples/cholesky/cholesky_tag
+Computation took (in ms)
+518.16
+Synthetic GFlops : 44.21
+#---------------------
+MSI cache stats :
+TOTAL MSI stats	hit 1622 (66.23 %)	miss 827 (33.77 %)
+...
+\endverbatim
+
+\verbatim
+$ STARPU_STATS=0 ./examples/cholesky/cholesky_tag
+Computation took (in ms)
+518.16
+Synthetic GFlops : 44.21
+\endverbatim
+
+\internal
+TODO: data transfer stats are similar to the ones displayed when
+setting STARPU_BUS_STATS
+\endinternal
+
+*/

+ 34 - 0
doc/doxygen/chapters/scaling-vector-example.doxy

@@ -0,0 +1,34 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page FullSourceCodeVectorScal Full source code for the ’Scaling a Vector’ example
+
+\section MainApplication Main Application
+
+\snippet vector_scal_c.c To be included
+
+\section CPUKernel CPU Kernel
+
+\snippet vector_scal_cpu.c To be included
+
+\section CUDAKernel CUDA Kernel
+
+\snippet vector_scal_cuda.cu To be included
+
+\section OpenCLKernel OpenCL Kernel
+
+\subsection InvokingtheKernel Invoking the Kernel
+
+\snippet vector_scal_opencl.c To be included
+
+\subsection SourceoftheKernel Source of the Kernel
+
+\snippet vector_scal_opencl_codelet.cl To be included
+
+*/
+

+ 145 - 0
doc/doxygen/chapters/scheduling_context_hypervisor.doxy

@@ -0,0 +1,145 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page SchedulingContextHypervisor Scheduling Context Hypervisor
+
+\section WhatIsTheHypervisor What Is The Hypervisor
+
+StarPU proposes a platform for constructing scheduling contexts, and for
+deleting and modifying them dynamically. A parallel kernel can thus
+be isolated into a scheduling context, and interferences between
+several parallel kernels are avoided. If the user knows exactly how
+many workers each scheduling context needs, he can assign them to the
+contexts at their creation time or modify them during the execution of
+the program.
+
+The Scheduling Context Hypervisor Plugin is available for users
+whose applications do not exhibit regular parallelism, who cannot know in
+advance the exact size of the contexts, and who need to resize the contexts
+according to the behavior of the parallel kernels.
+
+The Hypervisor receives information from StarPU concerning the
+execution of the tasks, the efficiency of the resources, etc. and it
+decides accordingly when and how the contexts can be resized. Basic
+strategies of resizing scheduling contexts already exist but a
+platform for implementing additional custom ones is available.
+
+\section StartTheHypervisor Start the Hypervisor
+
+The Hypervisor must be initialized once at the beginning of the
+application. At this point a resizing policy should be indicated. This
+strategy depends on the information the application is able to provide
+to the hypervisor, as well as on the accuracy needed for the resizing
+procedure. For example, the application may be able to provide an
+estimation of the workload of the contexts. In this situation the
+hypervisor may decide what resources the contexts need. However, if no
+information is provided, the hypervisor evaluates the behavior of the
+resources and of the application and makes a guess about the future.
+The hypervisor resizes only the registered contexts.
+
+\section InterrogateTheRuntime Interrogate The Runtime
+
+The runtime provides the hypervisor with information concerning the
+behavior of the resources and the application. This is done by using
+the performance counters: callbacks indicating when the resources
+are idle or not efficient, when the application submits tasks, or when
+it becomes too slow.
+
+\section TriggerTheHypervisor Trigger the Hypervisor
+
+The resizing is triggered either when the application requires it or
+when the initial distribution of resources degrades the performance of
+the application (the application is too slow or the resources are idle
+for too long a time, thresholds indicated by the user). When this
+happens, different resizing strategies are applied that target
+minimizing the total execution time of the application, the
+instantaneous speed, or the idle time of the resources.
+
+\section ResizingStrategies Resizing Strategies
+
+The plugin proposes several strategies for resizing the scheduling context.
+
+The <b>Application driven</b> strategy uses the user's input concerning the moments when the contexts should be resized.
+Thus, the user tags the tasks that should trigger the resizing
+process. The field starpu_task::hypervisor_tag can be set directly, or
+the macro ::STARPU_HYPERVISOR_TAG can be used in the function
+starpu_insert_task().
+
+\code{.c}
+task.hypervisor_tag = 2;
+\endcode
+
+or
+
+\code{.c}
+starpu_insert_task(&codelet,
+		    ...,
+		    STARPU_HYPERVISOR_TAG, 2,
+                    0);
+\endcode
+
+Then the user has to indicate that the contexts should be resized when a task with the specified tag is executed.
+
+\code{.c}
+sc_hypervisor_resize(sched_ctx, 2);
+\endcode
+
+The user can use the same tag to change the resizing configuration of the contexts if necessary.
+
+\code{.c}
+sc_hypervisor_ioctl(sched_ctx,
+                    HYPERVISOR_MIN_WORKERS, 6,
+                    HYPERVISOR_MAX_WORKERS, 12,
+                    HYPERVISOR_TIME_TO_APPLY, 2,
+                    NULL);
+\endcode
+
+
+The <b>Idleness</b> based strategy resizes the scheduling contexts every time one of their workers stays idle
+for a period longer than the one imposed by the user
+(see \ref UsersInputInTheResizingProcess "Users’ Input In The Resizing Process").
+
+\code{.c}
+int workerids[3] = {1, 3, 10};
+int workerids2[9] = {0, 2, 4, 5, 6, 7, 8, 9, 11};
+sc_hypervisor_ioctl(sched_ctx_id,
+            HYPERVISOR_MAX_IDLE, workerids, 3, 10000.0,
+            HYPERVISOR_MAX_IDLE, workerids2, 9, 50000.0,
+            NULL);
+\endcode
+
+The <b>Gflops rate</b> based strategy resizes the scheduling contexts such that they all finish at the same time.
+The speed of each context is considered, and once one of them becomes significantly slower, the resizing process is triggered.
+In order to do these computations, the user has to provide the total number of floating point operations (flops) to be executed by the
+parallel kernels and the number of flops to be executed by each
+task.
+
+The number of flops to be executed by a context is passed as a
+ parameter when the context is registered to the hypervisor
+ (<c>sc_hypervisor_register_ctx(sched_ctx_id, flops)</c>), and the number
+ to be executed by each task is passed when the task is submitted.
+ The corresponding field is starpu_task::flops and the corresponding
+ macro in the function starpu_insert_task() is ::STARPU_FLOPS
+ (<b>Caution</b>: make sure to pass a double, not an integer,
+ otherwise parameter passing will be bogus). When the task is executed,
+ the resizing process is triggered.
+
+\code{.c}
+task.flops = 100;
+\endcode
+
+or
+
+\code{.c}
+starpu_insert_task(&codelet,
+                    ...,
+                    STARPU_FLOPS, (double) 100,
+                    0);
+\endcode
+
+*/

+ 136 - 0
doc/doxygen/chapters/scheduling_contexts.doxy

@@ -0,0 +1,136 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page SchedulingContexts Scheduling Contexts
+
+TODO: improve!
+
+\section GeneralIdeas General Ideas
+
+Scheduling contexts represent abstract sets of workers that allow
+programmers to control the distribution of computational resources
+(i.e. CPUs and GPUs) to concurrent parallel kernels. The main goal is
+to minimize interferences between the executions of multiple parallel
+kernels, by partitioning the underlying pool of workers using
+contexts.
+
+\section CreatingAContext Creating A Context
+
+By default, the application submits tasks to an initial context, which
+contains all the computation resources available to StarPU (all
+the workers). If the application programmer plans to launch several
+parallel kernels simultaneously, by default these kernels will be
+executed within this initial context, using a single scheduling
+policy (see \ref TaskSchedulingPolicy). However, if the application
+programmer is aware of the demands of these kernels and of the
+specificities of the machine used to execute them, the workers can be
+divided between several contexts. These scheduling contexts will
+isolate the execution of each kernel, and they will permit the use of a
+scheduling policy proper to each one of them. In order to create the
+contexts, you have to know the identifiers of the workers running
+within StarPU. By passing a set of workers together with the
+scheduling policy to the function starpu_sched_ctx_create(), you will
+get an identifier of the created context, which you will use to
+indicate the context you want to submit the tasks to.
+
+\code{.c}
+/* the list of resources the context will manage */
+int workerids[3] = {1, 3, 10};
+
+/* indicate the scheduling policy to be used within the context, the list of
+   workers assigned to it, the number of workers, and the name of the context */
+int id_ctx = starpu_sched_ctx_create("dmda", workerids, 3, "my_ctx");
+
+/* let StarPU know that the following tasks will be submitted to this context */
+starpu_sched_ctx_set_task_context(id_ctx);
+
+/* submit the task to StarPU */
+starpu_task_submit(task);
+\endcode
+
+Note: Parallel greedy and parallel heft scheduling policies do not support the existence of several disjoint contexts on the machine.
+Combined workers are constructed depending on the entire topology of the machine, not only the one belonging to a context.
+
+\section ModifyingAContext Modifying A Context
+
+A scheduling context can be modified dynamically. The application may
+change its requirements during execution, and the programmer can
+add workers to a context or remove them if no longer needed. In
+the following example we have two scheduling contexts,
+<c>sched_ctx1</c> and <c>sched_ctx2</c>. After executing a part of the
+tasks, some of the workers of <c>sched_ctx1</c> will be moved to
+context <c>sched_ctx2</c>.
+
+\code{.c}
+/* the list of resources that context 1 will give away */
+int workerids[3] = {1, 3, 10};
+
+/* add the workers to context 2 */
+starpu_sched_ctx_add_workers(workerids, 3, sched_ctx2);
+
+/* remove the workers from context 1 */
+starpu_sched_ctx_remove_workers(workerids, 3, sched_ctx1);
+\endcode
+
+\section DeletingAContext Deleting A Context
+
+When a context is no longer needed, it must be deleted. The application
+can indicate which context should inherit the resources of the deleted
+one. All the tasks of the context should be executed before deleting it.
+If the application needs to avoid a barrier before moving the resources
+from the deleted context to the inheritor one, the application can
+instead indicate when the last task was submitted. Thus, once this last
+task has finished executing, the resources will be moved, but the
+context should still be deleted at some point in the application.
+
+\code{.c}
+/* when context 2 is deleted, context 1 will inherit its resources */
+starpu_sched_ctx_set_inheritor(sched_ctx2, sched_ctx1);
+
+/* submit tasks to context 2 */
+for (i = 0; i < ntasks; i++)
+    starpu_task_submit_to_ctx(task[i],sched_ctx2);
+
+/* indicate that context 2 finished submitting and that */
+/* as soon as the last task of context 2 finished executing */
+/* its workers can be moved to the inheritor context */
+starpu_sched_ctx_finished_submit(sched_ctx2);
+
+/* wait for the tasks of both contexts to finish */
+starpu_task_wait_for_all();
+
+/* delete context 2 */
+starpu_sched_ctx_delete(sched_ctx2);
+
+/* delete context 1 */
+starpu_sched_ctx_delete(sched_ctx1);
+\endcode
+
+\section EmptyingAContext Emptying A Context
+
+A context may not have any resources at the beginning or at a certain
+moment of the execution. Tasks can still be submitted to such contexts,
+and they will be executed as soon as the contexts have resources. A list
+of pending tasks is kept, and when workers are added to the contexts
+these tasks are submitted. However, if no resources are ever allocated,
+the program will not terminate. If these tasks do not have much
+priority, the programmer can prevent the application from submitting
+them by calling the function starpu_sched_ctx_stop_task_submission().
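+
+A sketch of this situation, built only from the functions presented in
+this chapter (passing an empty worker list to
+starpu_sched_ctx_create() is an assumption for illustration):
+
+\code{.c}
+/* create a context with no resources for now */
+int empty_ctx = starpu_sched_ctx_create("dmda", NULL, 0, "empty_ctx");
+starpu_sched_ctx_set_task_context(empty_ctx);
+
+/* these tasks stay pending until the context gets workers */
+starpu_task_submit(task);
+
+/* if the tasks are not important, give up on submitting them */
+starpu_sched_ctx_stop_task_submission();
+\endcode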
+
+\section ContextsSharingWorkers Contexts Sharing Workers
+
+Contexts may share workers when a single context cannot execute
+efficiently enough alone on these workers or when the application
+decides to express a hierarchy of contexts. The workers apply a
+``Round-Robin'' algorithm to choose the context from which they will
+``pop'' tasks next. By using the function
+starpu_sched_ctx_set_turn_to_other_ctx(), the programmer can impose
+that worker <c>workerid</c> ``pop'' tasks from the context
+<c>sched_ctx_id</c> next.
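+
+For instance, to force worker 2 to pop its next task from context
+<c>sched_ctx2</c> (the argument order is an assumption based on the
+parameter names above):
+
+\code{.c}
+/* worker 2 will pop from sched_ctx2 on its next turn */
+starpu_sched_ctx_set_turn_to_other_ctx(2, sched_ctx2);
+\endcode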
+
+*/

+ 21 - 0
doc/doxygen/chapters/socl_opencl_extensions.doxy

@@ -0,0 +1,21 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page SOCLOpenclExtensions SOCL OpenCL Extensions
+
+SOCL is an OpenCL implementation based on StarPU. It gives a unified access to
+every available OpenCL device: applications can now share entities such as
+Events, Contexts or Command Queues between several OpenCL implementations.
+
+In addition, command queues that are created without specifying a device provide
+automatic scheduling of the submitted commands on OpenCL devices contained in
+the context to which the command queue is attached.
+
+Note: this is still an area under development and subject to change.
+
+*/

+ 98 - 0
doc/doxygen/chapters/tips_and_tricks.doxy

@@ -0,0 +1,98 @@
+/*
+ * This file is part of the StarPU Handbook.
+ * Copyright (C) 2009--2011  Universit@'e de Bordeaux 1
+ * Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+ * Copyright (C) 2011, 2012 Institut National de Recherche en Informatique et Automatique
+ * See the file version.doxy for copying conditions.
+ */
+
+/*! \page TipsAndTricksToKnowAbout Tips and Tricks To Know About
+
+\section HowToInitializeAComputationLibraryOnceForEachWorker How To Initialize A Computation Library Once For Each Worker?
+
+Some libraries need to be initialized once for each concurrent instance that
+may run on the machine. For instance, a C++ computation class which is not
+thread-safe by itself, but for which several instantiated objects of that class
+can be used concurrently. This can be handled in StarPU by initializing one such
+object per worker. For instance, the libstarpufft example does the following to
+be able to use FFTW.
+
+A global array stores the instantiated objects:
+
+\code{.c}
+fftw_plan plan_cpu[STARPU_NMAXWORKERS];
+\endcode
+
+At initialization time of StarPU, the objects are initialized:
+
+\code{.c}
+int workerid;
+for (workerid = 0; workerid < starpu_worker_get_count(); workerid++) {
+    switch (starpu_worker_get_type(workerid)) {
+        case STARPU_CPU_WORKER:
+            plan_cpu[workerid] = fftw_plan(...);
+            break;
+    }
+}
+\endcode
+
+And in the codelet body, they are used:
+
+\code{.c}
+static void fft(void *descr[], void *_args)
+{
+    int workerid = starpu_worker_get_id();
+    fftw_plan plan = plan_cpu[workerid];
+    ...
+
+    fftw_execute(plan, ...);
+}
+\endcode
+
+Another way to go, which may be needed, is to execute some code from the
+workers themselves thanks to starpu_execute_on_each_worker(). This may
+be required for CUDA to behave properly due to threading issues. For
+instance, StarPU's starpu_cublas_init() looks like the following to call
+<c>cublasInit</c> from the workers themselves:
+
+\code{.c}
+static void init_cublas_func(void *args STARPU_ATTRIBUTE_UNUSED)
+{
+    cublasStatus cublasst = cublasInit();
+    cublasSetKernelStream(starpu_cuda_get_local_stream());
+}
+void starpu_cublas_init(void)
+{
+    starpu_execute_on_each_worker(init_cublas_func, NULL, STARPU_CUDA);
+}
+\endcode
+
+\section HowToLimitMemoryPerNode How to limit memory per node
+
+TODO
+
+Talk about
+\ref STARPU_LIMIT_CUDA_devid_MEM, \ref STARPU_LIMIT_CUDA_MEM,
+\ref STARPU_LIMIT_OPENCL_devid_MEM, \ref STARPU_LIMIT_OPENCL_MEM
+and \ref STARPU_LIMIT_CPU_MEM
+
+starpu_memory_get_available()
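+
+As a sketch, these variables can be set in the environment before
+launching the application (the values and their unit, assumed to be MB,
+are for illustration only):
+
+\verbatim
+$ export STARPU_LIMIT_CUDA_MEM=1024
+$ export STARPU_LIMIT_CPU_MEM=4096
+$ ./my_starpu_application
+\endverbatim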
+
+\section ThreadBindingOnNetBSD Thread Binding on NetBSD
+
+When using StarPU on a NetBSD machine, if the topology
+discovery library <c>hwloc</c> is used, thread binding will fail. To
+prevent this problem, you should use at least version 1.7 of
+<c>hwloc</c>, and also issue the following call:
+
+\verbatim
+$ sysctl -w security.models.extensions.user_set_cpu_affinity=1
+\endverbatim
+
+Or add the following line in the file <c>/etc/sysctl.conf</c>
+
+\verbatim
+security.models.extensions.user_set_cpu_affinity=1
+\endverbatim
+
+*/

+ 33 - 0
doc/doxygen/doxygen-config.cfg.in

@@ -0,0 +1,33 @@
+# StarPU --- Runtime system for heterogeneous multicore architectures.
+#
+# Copyright (C) 2009-2013  Université de Bordeaux 1
+# Copyright (C) 2010, 2011, 2012, 2013  Centre National de la Recherche Scientifique
+# Copyright (C) 2011  Télécom-SudParis
+# Copyright (C) 2011, 2012  Institut National de Recherche en Informatique et Automatique
+#
+# StarPU is free software; you can redistribute it and/or modify
+# it under the terms of the GNU Lesser General Public License as published by
+# the Free Software Foundation; either version 2.1 of the License, or (at
+# your option) any later version.
+#
+# StarPU is distributed in the hope that it will be useful, but
+# WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+#
+# See the GNU Lesser General Public License in COPYING.LGPL for more details.
+
+INPUT                  = @top_srcdir@/doc/doxygen/chapters \
+		       	 @top_srcdir@/doc/doxygen/chapters/api \
+                         @top_builddir@/include/starpu_config.h \
+			 @top_srcdir@/include/ \
+			 @top_srcdir@/mpi/include/ \
+			 @top_srcdir@/starpufft/starpufft.h \
+			 @top_srcdir@/sc_hypervisor/include
+
+EXAMPLE_PATH           = @top_srcdir@/doc/doxygen \
+		       	 @top_srcdir@/doc/doxygen/chapters \
+		       	 @top_srcdir@/doc/doxygen/chapters/code
+
+INPUT_FILTER           = @top_builddir@/doc/doxygen/doxygen_filter.sh
+
+LATEX_HEADER           = @top_srcdir@/doc/doxygen/refman.tex

Diff file suppressed because it is too large
+ 1904 - 0
doc/doxygen/doxygen.cfg


+ 9 - 0
doc/doxygen/doxygen_filter.sh.in

@@ -0,0 +1,9 @@
+#!/bin/bash
+
+if [ "$(basename $1)" == "starpufft.h" ] ; then
+    gcc -E $1 -I @top_srcdir@/include/ -I @top_builddir@/include/ |grep starpufft
+else
+    SUFFIX_C=$(basename $1 .c)
+    sed -e 's/STARPU_DEPRECATED//' $1
+fi
+

+ 20 - 0
doc/doxygen/foreword.html

@@ -0,0 +1,20 @@
+<br/>
+<br/>
+Copyright &copy; 2009–2013 Université de Bordeaux 1
+<br/>
+Copyright &copy; 2010-2013 Centre National de la Recherche Scientifique
+<br/>
+Copyright &copy; 2011, 2012 Institut National de Recherche en Informatique et Automatique
+<br/>
+<br/>
+<br/>
+
+<blockquote>
+Permission is granted to copy, distribute and/or modify this document
+under the terms of the GNU Free Documentation License, Version 1.3 or
+any later version published by the Free Software Foundation; with no
+Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A
+copy of the license is included in the section entitled “GNU Free
+Documentation License”.
+</blockquote>
+<br/>

+ 240 - 0
doc/doxygen/refman.tex

@@ -0,0 +1,240 @@
+\documentclass{book}
+\usepackage[a4paper,top=2.5cm,bottom=2.5cm,left=2.5cm,right=2.5cm]{geometry}
+\usepackage{makeidx}
+\usepackage{natbib}
+\usepackage{graphicx}
+\usepackage{multicol}
+\usepackage{float}
+\usepackage{listings}
+\usepackage{color}
+\usepackage{ifthen}
+\usepackage[table]{xcolor}
+\usepackage{textcomp}
+\usepackage{alltt}
+\usepackage{ifpdf}
+\usepackage{./version}
+\ifpdf
+\usepackage[pdftex,
+            pagebackref=true,
+            colorlinks=true,
+            linkcolor=blue,
+            unicode
+           ]{hyperref}
+\else
+\usepackage[ps2pdf,
+            pagebackref=true,
+            colorlinks=true,
+            linkcolor=blue,
+            unicode
+           ]{hyperref}
+\usepackage{pspicture}
+\fi
+\usepackage[utf8]{inputenc}
+\usepackage{mathptmx}
+\usepackage[scaled=.90]{helvet}
+\usepackage{courier}
+\usepackage{sectsty}
+\usepackage{amssymb}
+\usepackage[titles]{tocloft}
+\usepackage{doxygen}
+\lstset{language=C++,inputencoding=utf8,basicstyle=\footnotesize,breaklines=true,breakatwhitespace=true,tabsize=8,numbers=left }
+\makeindex
+\setcounter{tocdepth}{3}
+\renewcommand{\footrulewidth}{0.4pt}
+\renewcommand{\familydefault}{\sfdefault}
+\hfuzz=15pt
+\setlength{\emergencystretch}{15pt}
+\hbadness=750
+\tolerance=750
+\begin{document}
+\hypersetup{pageanchor=false,citecolor=blue}
+\begin{titlepage}
+\vspace*{4cm}
+{\Huge \textbf{StarPU Handbook}}\\
+\rule{\textwidth}{1.5mm}
+\begin{flushright}
+{\Large for StarPU \STARPUVERSION}
+\end{flushright}
+\rule{\textwidth}{1mm}
+~\\
+\vspace*{15cm}
+\begin{flushright}
+Generated by Doxygen $doxygenversion on $datetime
+\end{flushright}
+\end{titlepage}
+
+\begin{figure}[p]
+This manual documents the usage of StarPU version \STARPUVERSION. Its contents
+was last updated on \STARPUUPDATED.\\
+
+Copyright © 2009–2013 Université de Bordeaux 1\\
+
+Copyright © 2010-2013 Centre National de la Recherche Scientifique\\
+
+Copyright © 2011, 2012 Institut National de Recherche en Informatique et Automatique\\
+
+\medskip
+
+\begin{quote}
+Permission is granted to copy, distribute and/or modify this document
+under the terms of the GNU Free Documentation License, Version 1.3 or
+any later version published by the Free Software Foundation; with no
+Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A
+copy of the license is included in the section entitled “GNU Free
+Documentation License”.
+\end{quote}
+\end{figure}
+
+\clearemptydoublepage
+\pagenumbering{roman}
+\tableofcontents
+\clearemptydoublepage
+\pagenumbering{arabic}
+\hypersetup{pageanchor=true,citecolor=blue}
+
+\chapter{Introduction}
+\label{index}
+\hypertarget{index}{}
+\input{index}
+
+\part{Using StarPU}
+
+\chapter{Building and Installing StarPU}
+\label{BuildingAndInstallingStarPU}
+\hypertarget{BuildingAndInstallingStarPU}{}
+\input{BuildingAndInstallingStarPU}
+
+\chapter{Basic Examples}
+\label{BasicExamples}
+\hypertarget{BasicExamples}{}
+\input{BasicExamples}
+
+\chapter{Advanced Examples}
+\label{AdvancedExamples}
+\hypertarget{AdvancedExamples}{}
+\input{AdvancedExamples}
+
+\chapter{How to optimize performance with StarPU}
+\label{HowToOptimizePerformanceWithStarPU}
+\hypertarget{HowToOptimizePerformanceWithStarPU}{}
+\input{HowToOptimizePerformanceWithStarPU}
+
+\chapter{Performance Feedback}
+\label{PerformanceFeedback}
+\hypertarget{PerformanceFeedback}{}
+\input{PerformanceFeedback}
+
+\chapter{Tips and Tricks To Know About}
+\label{TipsAndTricksToKnowAbout}
+\hypertarget{TipsAndTricksToKnowAbout}{}
+\input{TipsAndTricksToKnowAbout}
+
+\chapter{MPI Support}
+\label{MPISupport}
+\hypertarget{MPISupport}{}
+\input{MPISupport}
+
+\chapter{FFT Support}
+\label{FFTSupport}
+\hypertarget{FFTSupport}{}
+\input{FFTSupport}
+
+\chapter{MIC/SCC Support}
+\label{MICSCCSupport}
+\hypertarget{MICSCCSupport}{}
+\input{MICSCCSupport}
+
+\chapter{C Extensions}
+\label{cExtensions}
+\hypertarget{cExtensions}{}
+\input{cExtensions}
+
+\chapter{SOCL OpenCL Extensions}
+\label{SOCLOpenclExtensions}
+\hypertarget{SOCLOpenclExtensions}{}
+\input{SOCLOpenclExtensions}
+
+\chapter{Scheduling Contexts}
+\label{SchedulingContexts}
+\hypertarget{SchedulingContexts}{}
+\input{SchedulingContexts}
+
+\chapter{Scheduling Context Hypervisor}
+\label{SchedulingContextHypervisor}
+\hypertarget{SchedulingContextHypervisor}{}
+\input{SchedulingContextHypervisor}
+
+\part{Inside StarPU}
+
+\chapter{Execution Configuration Through Environment Variables}
+\label{ExecutionConfigurationThroughEnvironmentVariables}
+\hypertarget{ExecutionConfigurationThroughEnvironmentVariables}{}
+\input{ExecutionConfigurationThroughEnvironmentVariables}
+
+\chapter{Compilation Configuration}
+\label{CompilationConfiguration}
+\hypertarget{CompilationConfiguration}{}
+\input{CompilationConfiguration}
+
+\chapter{Module Index}
+\input{modules}
+
+\chapter{Module Documentation a.k.a StarPU's API}
+\label{ModuleDocumentation}
+\hypertarget{ModuleDocumentation}{}
+
+\input{group__API__Versioning}
+\input{group__API__Initialization__and__Termination}
+\input{group__API__Standard__Memory__Library}
+\input{group__API__Workers__Properties}
+\input{group__API__Data__Management}
+\input{group__API__Data__Interfaces}
+\input{group__API__Data__Partition}
+\input{group__API__Multiformat__Data__Interface}
+\input{group__API__Codelet__And__Tasks}
+\input{group__API__Insert__Task}
+\input{group__API__Explicit__Dependencies}
+\input{group__API__Implicit__Data__Dependencies}
+\input{group__API__Performance__Model}
+\input{group__API__Profiling}
+\input{group__API__Theoretical__Lower__Bound__on__Execution__Time}
+\input{group__API__CUDA__Extensions}
+\input{group__API__OpenCL__Extensions}
+\input{group__API__MIC__Extensions}
+\input{group__API__SCC__Extensions}
+\input{group__API__Miscellaneous__Helpers}
+\input{group__API__FxT__Support}
+\input{group__API__FFT__Support}
+\input{group__API__MPI__Support}
+\input{group__API__Task__Bundles}
+\input{group__API__Task__Lists}
+\input{group__API__Parallel__Tasks}
+\input{group__API__Running__Drivers}
+\input{group__API__Expert__Mode}
+\input{group__API__StarPUTop__Interface}
+\input{group__API__Scheduling__Contexts}
+\input{group__API__Scheduling__Policy}
+\input{group__API__Scheduling__Context__Hypervisor}
+
+\chapter{Deprecated List}
+\label{deprecated}
+\hypertarget{deprecated}{}
+\input{deprecated}
+
+
+\addcontentsline{toc}{chapter}{Index}
+\printindex
+
+\part{Appendix}
+
+\chapter{Full Source Code for the ’Scaling a Vector’ Example}
+\label{FullSourceCodeVectorScal}
+\hypertarget{FullSourceCodeVectorScal}{}
+\input{FullSourceCodeVectorScal}
+
+\chapter{GNU Free Documentation License}
+\label{GNUFreeDocumentationLicense}
+\hypertarget{GNUFreeDocumentationLicense}{}
+\input{GNUFreeDocumentationLicense}
+
+\end{document}

doc/Makefile.am → doc/texinfo/Makefile.am


doc/chapters/advanced-examples.texi → doc/texinfo/chapters/advanced-examples.texi


doc/chapters/api.texi → doc/texinfo/chapters/api.texi


doc/chapters/basic-examples.texi → doc/texinfo/chapters/basic-examples.texi


doc/chapters/c-extensions.texi → doc/texinfo/chapters/c-extensions.texi


doc/chapters/configuration.texi → doc/texinfo/chapters/configuration.texi


doc/chapters/fdl-1.3.texi → doc/texinfo/chapters/fdl-1.3.texi


doc/chapters/fft-support.texi → doc/texinfo/chapters/fft-support.texi


doc/chapters/hypervisor_api.texi → doc/texinfo/chapters/hypervisor_api.texi


doc/chapters/installing.texi → doc/texinfo/chapters/installing.texi


doc/chapters/introduction.texi → doc/texinfo/chapters/introduction.texi


+ 3 - 3
doc/chapters/mic-scc-support.texi

@@ -19,14 +19,14 @@ recurse into both directories.
 It can be parameterized with the following environment variables:
 
 @table @asis
-@item @code{MIC_HOST}
+@item @code{STARPU_MIC_HOST}
 Defines the value of the @code{--host} parameter passed to @code{configure} for the
 cross-compilation. The current default is @code{x86_64-k1om-linux}.
 
-@item @code{MIC_CC_PATH}
+@item @code{STARPU_MIC_CC_PATH}
 Defines the path to the MIC cross-compiler. The current default is @code{/usr/linux-k1om-4.7/bin/}.
 
-@item @code{COI_DIR}
+@item @code{STARPU_COI_DIR}
 Defines the path to the COI library. The current default is @code{/opt/intel/mic/coi}
 @end table
 

doc/chapters/mpi-support.texi → doc/texinfo/chapters/mpi-support.texi


doc/chapters/perf-feedback.texi → doc/texinfo/chapters/perf-feedback.texi


doc/chapters/perf-optimization.texi → doc/texinfo/chapters/perf-optimization.texi


doc/chapters/sc_hypervisor.texi → doc/texinfo/chapters/sc_hypervisor.texi


doc/chapters/scaling-vector-example.texi → doc/texinfo/chapters/scaling-vector-example.texi


doc/chapters/sched_ctx.texi → doc/texinfo/chapters/sched_ctx.texi


doc/chapters/socl.texi → doc/texinfo/chapters/socl.texi


doc/chapters/tips-tricks.texi → doc/texinfo/chapters/tips-tricks.texi


doc/chapters/vector_scal_c.texi → doc/texinfo/chapters/vector_scal_c.texi


doc/chapters/vector_scal_cpu.texi → doc/texinfo/chapters/vector_scal_cpu.texi


doc/chapters/vector_scal_cuda.texi → doc/texinfo/chapters/vector_scal_cuda.texi


doc/chapters/vector_scal_opencl.texi → doc/texinfo/chapters/vector_scal_opencl.texi


doc/chapters/vector_scal_opencl_codelet.texi → doc/texinfo/chapters/vector_scal_opencl_codelet.texi


+ 0 - 0
doc/starpu.css


Some files were not shown because too many files changed in this diff