doc: corrections

Nathalie Furmento, 9 years ago
commit 911797a49a

+ 6 - 4
doc/doxygen/chapters/101_building.doxy

@@ -40,9 +40,10 @@ distributions, and for most operating systems.
 
 If <c>libhwloc</c> is not available on your system, the option
 \ref without-hwloc "--without-hwloc" should be explicitly given when calling the
-<c>configure</c> script. If <c>libhwloc</c> is installed with a <c>pkg-config</c> file,
-no option is required, it will be detected automatically, otherwise
-\ref with-hwloc "--with-hwloc" should be used to specify its location.
+<c>configure</c> script. If <c>libhwloc</c> is installed in a standard
+location, no option is required, it will be detected automatically,
+otherwise \ref with-hwloc "--with-hwloc=<directory>" should be used to specify its
+location.
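+
+For instance, if <c>hwloc</c> was installed under a hypothetical prefix
+<c>$HOME/hwloc</c>, one would run:
+
+\verbatim
+$ ./configure --with-hwloc=$HOME/hwloc
+\endverbatim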
 
 \subsection GettingSources Getting Sources
 
@@ -92,7 +93,8 @@ $ ./configure
 \endverbatim
 
 If <c>configure</c> does not detect some software or produces errors, please
-make sure to post the content of <c>config.log</c> when reporting the issue.
+make sure to post the contents of the file <c>config.log</c> when
+reporting the issue.
 
 By default, the files produced during the compilation are placed in
 the source directory. As the compilation generates a lot of files, it

+ 9 - 11
doc/doxygen/chapters/110_basic_examples.doxy

@@ -65,13 +65,13 @@ may contain an implementation of the same kernel on different architectures
 (e.g. CUDA, x86, ...). For compatibility, make sure that the whole
 structure is properly initialized to zero, either by using the
 function starpu_codelet_init(), or by letting the
-compiler implicitly do it as examplified above.
+compiler implicitly do it as exemplified below.
 
 The field starpu_codelet::nbuffers specifies the number of data buffers that are
 manipulated by the codelet: here the codelet does not access or modify any data
 that is controlled by our data management library.
 
-We create a codelet which may only be executed on the CPUs. When a CPU
+We create a codelet which may only be executed on CPUs. When a CPU
 core will execute a codelet, it will call the function
 <c>cpu_func</c>, which \em must have the following prototype:
 
@@ -100,17 +100,15 @@ struct starpu_codelet cl =
 \subsection SubmittingATask Submitting A Task
 
 Before submitting any tasks to StarPU, starpu_init() must be called. The
-<c>NULL</c> argument specifies that we use the default configuration. Tasks cannot
-be submitted after the termination of StarPU by a call to
-starpu_shutdown().
+<c>NULL</c> argument specifies that we use the default configuration.
+Tasks can then be submitted until StarPU is terminated by a call to
+starpu_shutdown().
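+
+The overall skeleton is thus (a minimal sketch):
+
+\code{.c}
+int ret = starpu_init(NULL);
+STARPU_CHECK_RETURN_VALUE(ret, "starpu_init");
+/* ... create and submit tasks ... */
+starpu_shutdown();
+\endcode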
 
-In the example above, a task structure is allocated by a call to
-starpu_task_create(). This function only allocates and fills the
-corresponding structure with the default settings, but it does not
+In the example below, a task structure is allocated by a call to
+starpu_task_create(). This function allocates and fills the
+task structure with its default settings; it does not
 submit the task to StarPU.
 
-// not really clear ;)
-
 The field starpu_task::cl is a pointer to the codelet which the task will
 execute: in other words, the codelet structure describes which computational
 kernel should be offloaded on the different architectures, and the task
@@ -529,7 +527,7 @@ starpu_vector_data_register(&vector_handle, STARPU_MAIN_RAM, (uintptr_t)vector,
 \endcode
 
 The first argument, called the <b>data handle</b>, is an opaque pointer which
-designates the array in StarPU. This is also the structure which is used to
+designates the array within StarPU. This is also the structure which is used to
 describe which data is used by a task. The second argument is the node number
 where the data originally resides. Here it is ::STARPU_MAIN_RAM since the array <c>vector</c> is in
 the main memory. Then comes the pointer <c>vector</c> where the data can be found in main memory,

+ 32 - 23
doc/doxygen/chapters/210_check_list_performance.doxy

@@ -16,7 +16,7 @@ performance, we give below a list of features which should be checked.
 \section ConfigurationImprovePerformance Configuration That May Improve Performance
 
 The \ref enable-fast "--enable-fast" configuration option disables all
-assertions. This makes starpu more performant for really small tasks by
+assertions. This makes StarPU more performant for really small tasks by
 disabling all sanity checks. Only use this for measurements and production, not for development, since this will drop all basic checks.
 
 
@@ -63,7 +63,7 @@ kernels. That will lower the potential for overlapping.
 
 Calling starpu_cublas_init() makes StarPU already do appropriate calls for the
 CUBLAS library. Some libraries like Magma may however change the current stream,
-one then has to call cublasSetKernelStream(starpu_cuda_get_local_stream()); at
+one then has to call <c>cublasSetKernelStream(starpu_cuda_get_local_stream())</c> at
 the beginning of the codelet to make sure that CUBLAS is really using the proper
 stream.
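+
+A minimal sketch of such a codelet function, where <c>my_cublas_kernel</c>
+stands for a hypothetical CUBLAS-based kernel:
+
+\code{.c}
+void cuda_func(void *buffers[], void *cl_arg)
+{
+	/* Re-select the per-worker stream, in case a library such as
+	 * Magma changed it behind our back */
+	cublasSetKernelStream(starpu_cuda_get_local_stream());
+	my_cublas_kernel(buffers, cl_arg);
+	cudaStreamSynchronize(starpu_cuda_get_local_stream());
+}
+\endcode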
 
@@ -71,7 +71,7 @@ If the kernel can be made to only use this local stream or other self-allocated
 streams, i.e. the whole kernel submission can be made asynchronous, then
 one should enable asynchronous execution of the kernel.  That means setting
 the flag ::STARPU_CUDA_ASYNC in the corresponding field starpu_codelet::cuda_flags, and dropping the
-cudaStreamSynchronize() call at the end of the cuda_func function, so that it
+<c>cudaStreamSynchronize()</c> call at the end of the <c>cuda_func</c> function, so that it
 returns immediately after having queued the kernel to the local stream. That way, StarPU will be
 able to submit and complete data transfers while kernels are executing, instead of only at each
 kernel submission. The kernel just has to make sure that StarPU can use the
@@ -89,7 +89,7 @@ If the kernel can be made to only use the StarPU-provided command queue or other
 queues, i.e. the whole kernel submission can be made asynchronous, then
 one should enable asynchronous execution of the kernel. This means setting
 the flag ::STARPU_OPENCL_ASYNC in the corresponding field starpu_codelet::opencl_flags and dropping the
-clFinish() and starpu_opencl_collect_stats() calls at the end of the kernel, so
+<c>clFinish()</c> and starpu_opencl_collect_stats() calls at the end of the kernel, so
 that it returns immediately after having queued the kernel to the provided queue.
 That way, StarPU will be able to submit and complete data transfers while kernels are executing, instead of
 only at each kernel submission. The kernel just has to make sure
@@ -110,11 +110,11 @@ any task for 10ms, but lets the application continue normally. In addition to th
 
 <c>export STARPU_WATCHDOG_CRASH=1</c> (\ref STARPU_WATCHDOG_CRASH)
 
-raises SIGABRT in that condition, thus allowing to catch the situation in gdb.
-It can also be useful to type "handle SIGABRT nopass" in gdb to be able to let
+raises <c>SIGABRT</c> in that condition, thus allowing to catch the situation in gdb.
+It can also be useful to type <c>handle SIGABRT nopass</c> in <c>gdb</c> to be able to let
 the process continue, after inspecting the state of the process.
 
-\section HowToLimitMemoryPerNode How to limit memory used by StarPU and cache buffer allocations
+\section HowToLimitMemoryPerNode How to Limit Memory Used By StarPU And Cache Buffer Allocations
 
 By default, StarPU makes sure to use at most 90% of the memory of GPU devices,
 moving data in and out of the device as appropriate and with prefetch and
@@ -143,8 +143,12 @@ starpu_malloc_on_node() which are used by the data interfaces
 (matrix, vector, etc.).  This does not include allocations performed by
 the application through e.g. malloc(). It does not include allocations
 performed through starpu_malloc() either, only allocations
-performed explicitly with the \ref STARPU_MALLOC_COUNT flag (i.e. by passing
-the parameter \ref STARPU_MALLOC_COUNT when calling starpu_malloc_flags())
+performed explicitly with the \ref STARPU_MALLOC_COUNT flag, i.e. by calling
+
+\code{.c}
+starpu_malloc_flags(&ptr, size, STARPU_MALLOC_COUNT)
+\endcode
+
 are taken into account.  If the
 application wants to make StarPU aware of its own allocations, so that StarPU
 knows precisely how much data is allocated, and thus when to evict allocation
@@ -158,9 +162,13 @@ the delta thanks to starpu_memory_allocate() and starpu_memory_deallocate() call
 starpu_memory_get_total() and starpu_memory_get_available()
 can be used to get an estimation of how much memory is available.
 starpu_memory_wait_available() can also be used to block until an
-amount of memory becomes available (but it may be preferrable to call
-starpu_memory_allocate() with the parameter \ref STARPU_MEMORY_WAIT)
-to reserve that amount immediately).
+amount of memory becomes available, but it may be preferable to call
+
+\code{.c}
+starpu_memory_allocate(node, size, STARPU_MEMORY_WAIT)
+\endcode
+
+to reserve that amount immediately.
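+
+A sketch of such accounting, with made-up node and size values:
+
+\code{.c}
+size_t size = 64*1024*1024;
+/* Block until 64MB can be reserved on memory node 0, then reserve it */
+starpu_memory_allocate(0, size, STARPU_MEMORY_WAIT);
+/* ... the application uses its own 64MB buffer here ... */
+starpu_memory_deallocate(0, size);
+\endcode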
 
 \section HowToReduceTheMemoryFootprintOfInternalDataStructures How To Reduce The Memory Footprint Of Internal Data Structures
 
@@ -182,8 +190,8 @@ execution. For example, in the Cholesky factorization (dense linear algebra
 application), the GEMM task uses up to 3 buffers, so it is possible to set the
 maximum number of task buffers to 3 to run a Cholesky factorization on StarPU.
 
-The size of the various structures of StarPU can be printed by the
-tests/microbenchs/display_structures_size.
+The size of the various structures of StarPU can be printed by 
+<c>tests/microbenchs/display_structures_size</c>.
 
 It is also often useless to submit *all* the tasks at the same time. One can
 make the starpu_task_submit() function block when a reasonable given number of
@@ -203,9 +211,9 @@ course this may reduce parallelism if the threshold is set too low. The precise
 balance depends on the application task graph.
 
 An idea of how much memory is used for tasks and data handles can be obtained by
-setting the \ref STARPU_MAX_MEMORY_USE environment variable to 1.
+setting the \ref STARPU_MAX_MEMORY_USE environment variable to <c>1</c>.
 
-\section HowtoReuseMemory How to reuse memory
+\section HowtoReuseMemory How To Reuse Memory
 
 When your application needs to allocate more data than the available amount of
 memory usable by StarPU (given by starpu_memory_get_available()), the
@@ -246,7 +254,7 @@ has not-so-stable performance. StarPU will force calibration (and thus ignore
 the current result) until 10 (<c>_STARPU_CALIBRATION_MINIMUM</c>) measurements have been
 made on each architecture, to avoid badly scheduling tasks just because the
 first measurements were not so good. Details on the current performance model status
-can be obtained from the command <c>starpu_perfmodel_display</c>: the <c>-l</c>
+can be obtained from the tool <c>starpu_perfmodel_display</c>: the <c>-l</c>
 option lists the available performance models, and the <c>-s</c> option permits
 to choose the performance model to be displayed. The result looks like:
 
@@ -346,26 +354,27 @@ STARPU_BUS_STATS=1</c> and <c>export STARPU_WORKER_STATS=1</c> .
 
 \ref OfflinePerformanceTools can already provide an idea of to what extent and
 which parts of StarPU bring overhead on the execution time. To get a more precise
-analysis of the parts of StarPU which bring most overhead, gprof can be used.
+analysis of the parts of StarPU which bring most overhead, <c>gprof</c> can be used.
 
-First, recompile and reinstall StarPU with gprof support:
+First, recompile and reinstall StarPU with <c>gprof</c> support:
 
 \code
 ./configure --enable-perf-debug --disable-shared --disable-build-tests --disable-build-examples
 \endcode
 
 Make sure not to leave a dynamic version of StarPU in the target path: remove
-any remaining libstarpu-*.so
+any remaining <c>libstarpu-*.so</c>.
 
 Then relink your application with the static StarPU library, and make sure that
-running ldd on your application does not mention libstarpu (i.e. it's really statically-linked).
+running <c>ldd</c> on your application does not mention any libstarpu
+(i.e. it's really statically-linked).
 
 \code
 gcc test.c -o test $(pkg-config --cflags starpu-1.3) $(pkg-config --libs starpu-1.3)
 \endcode
 
-Now you can run your application, and a gmon.out file should appear in the
-current directory, you can process by running gprof on your application:
+Now you can run your application, and a <c>gmon.out</c> file should appear in the
+current directory; you can process it by running <c>gprof</c> on your application:
 
 \code
 gprof ./test

+ 11 - 8
doc/doxygen/chapters/301_tasks.doxy

@@ -95,7 +95,7 @@ starpu_task_insert(&dummy_big_cl,
 The whole code for this complex data interface is available in the
 file <c>examples/basic_examples/dynamic_handles.c</c>.
 
-\section SettingVariableDataHandlesForATask Setting a Variable number Data Handles For a Task
+\section SettingVariableDataHandlesForATask Setting a Variable Number Of Data Handles For a Task
 
 Normally, the number of data handles given to a task is fixed in the
 starpu_codelet::nbuffers codelet field. This field can however be set to
@@ -294,7 +294,8 @@ And the call to the function starpu_task_insert():
 starpu_task_insert(&mycodelet,
                    STARPU_VALUE, &ifactor, sizeof(ifactor),
                    STARPU_VALUE, &ffactor, sizeof(ffactor),
-                   STARPU_RW, data_handles[0], STARPU_RW, data_handles[1],
+                   STARPU_RW, data_handles[0],
+		   STARPU_RW, data_handles[1],
                    0);
 \endcode
 
@@ -338,7 +339,9 @@ starpu_task_insert(&which_index, STARPU_W, i_handle, 0);
 
 /* And submit the corresponding task */
 STARPU_DATA_ACQUIRE_CB(i_handle, STARPU_R,
-                       starpu_task_insert(&work, STARPU_RW, A_handle[i], 0));
+                       starpu_task_insert(&work,
+		                          STARPU_RW, A_handle[i],
+					  0));
 \endcode
 
 The macro ::STARPU_DATA_ACQUIRE_CB submits an asynchronous request for
@@ -420,7 +423,7 @@ allowed to start to achieve the computation. The CPU binding mask for the whole
 set of CPUs is already enforced, so that threads created by the function will
 inherit the mask, and thus execute where StarPU expected, the OS being in charge
 of choosing how to schedule threads on the corresponding CPUs. The application
-can also choose to bind threads by hand, using e.g. sched_getaffinity to know
+can also choose to bind threads by hand, using e.g. <c>sched_getaffinity</c> to know
 the CPU binding mask that StarPU chose.
 
 For instance, using OpenMP (full source is available in
@@ -523,7 +526,7 @@ CPU and GPU tasks are not affected and can be run concurrently). The parallel
 task scheduler will however still try varying combined worker
 sizes to look for the most efficient ones.
 
-\subsection SynchronizationTasks Synchronization tasks
+\subsection SynchronizationTasks Synchronization Tasks
 
 For the application's convenience, it may be useful to define tasks which do not
 actually make any computation, but bear for instance dependencies between other
@@ -532,13 +535,13 @@ tasks or tags, or to be submitted in callbacks, etc.
 The obvious way is of course to make kernel functions empty, but such a task will
 thus have to wait for a worker to become ready, transfer data, etc.
 
-A much lighter way to define a synchronization task is to set its <c>cl</c>
+A much lighter way to define a synchronization task is to set its starpu_task::cl
 field to <c>NULL</c>. The task will thus be a mere synchronization point,
 without any data access or execution content: as soon as its dependencies become
 available, it will terminate, call the callbacks, and release dependencies.
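+
+A minimal sketch of such an empty task:
+
+\code{.c}
+struct starpu_task *task = starpu_task_create();
+task->cl = NULL;
+/* The task carries no computation, it only acts as a synchronization point */
+starpu_task_submit(task);
+\endcode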
 
-An intermediate solution is to define a codelet with its <c>where</c> field set
-to \ref STARPU_NOWHERE, for instance this:
+An intermediate solution is to define a codelet with its
+starpu_codelet::where field set to \ref STARPU_NOWHERE, for instance:
 
 \code{.c}
 struct starpu_codelet {

+ 13 - 13
doc/doxygen/chapters/310_data_management.doxy

@@ -8,7 +8,7 @@
 
 /*! \page DataManagement Data Management
 
-intro qui parle de coherency entre autres
+TODO: introduction discussing coherency, among other things
 
 \section DataManagement Data Management
 
@@ -217,16 +217,16 @@ struct starpu_data_filter f_vert =
 starpu_data_partition_plan(handle, &f_vert, vert_handle);
 \endcode
 
-starpu_data_partition_plan() returns the handles for the partition in vert_handle.
+starpu_data_partition_plan() returns the handles for the partition in <c>vert_handle</c>.
 
-One can submit tasks working on the main handle, but not yet on the vert_handle
+One can submit tasks working on the main handle, but not yet on the <c>vert_handle</c>
 handles. Now we submit the partitioning:
 
 \code{.c}
 starpu_data_partition_submit(handle, PARTS, vert_handle);
 \endcode
 
-And now we can submit tasks working on vert_handle handles (and not on the main
+And now we can submit tasks working on <c>vert_handle</c> handles (and not on the main
 handle any more). Eventually we want to work on the main handle again, so we
 submit the unpartitioning:
 
@@ -443,10 +443,10 @@ properly be serialized against accesses with this flag. For instance:
         0);
 \endcode
 
-The two tasks running cl2 will be able to commute: depending on whether the
-value of handle1 or handle2 becomes available first, the corresponding task
-running cl2 will start first. The task running cl1 will however always be run
-before them, and the task running cl3 will always be run after them.
+The two tasks running <c>cl2</c> will be able to commute: depending on whether the
+value of <c>handle1</c> or <c>handle2</c> becomes available first, the corresponding task
+running <c>cl2</c> will start first. The task running <c>cl1</c> will however always be run
+before them, and the task running <c>cl3</c> will always be run after them.
 
 If a lot of tasks use the commute access on the same set of data and a lot of
 them are ready at the same time, it may become interesting to use an arbiter,
@@ -474,7 +474,7 @@ be avoided by using several arbiters, thus separating sets of data for which
 arbitration will be done.  If a task accesses data from different arbiters, it
 will acquire them arbiter by arbiter, in arbiter pointer value order.
 
-See the tests/datawizard/test_arbiter.cpp example.
+See the <c>tests/datawizard/test_arbiter.cpp</c> example.
 
 Arbiters however do not support the ::STARPU_REDUX flag yet.
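+
+Creating an arbiter and attaching data to it is a matter of (sketch):
+
+\code{.c}
+starpu_arbiter_t arbiter = starpu_arbiter_create();
+starpu_data_assign_arbiter(handle1, arbiter);
+starpu_data_assign_arbiter(handle2, arbiter);
+\endcode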
 
@@ -526,7 +526,7 @@ but that would
 make them systematic and permanent. A more optimized way is to use
 the data access mode ::STARPU_SCRATCH, as exemplified below, which
 provides per-worker buffers without content consistency. The buffer is
-registered only once, using memory node -1, i.e. the application didn't allocate
+registered only once, using memory node <c>-1</c>, i.e. the application didn't allocate
 memory for it, and StarPU will allocate it on demand at task execution.
 
 \code{.c}
@@ -698,16 +698,16 @@ The whole code for this complex data interface is available in the
 directory <c>examples/interface/</c>.
 
 
-\section SpecifyingATargetNode Specifying a target node for task data
+\section SpecifyingATargetNode Specifying A Target Node For Task Data
 
 When executing a task on a GPU for instance, StarPU would normally copy all the
 needed data for the tasks on the embedded memory of the GPU.  It may however
 happen that the task kernel would rather have some of the data kept in the
 main memory instead of copied into the GPU, a pivoting vector for instance.
 This can be achieved by setting the starpu_codelet::specific_nodes flag to
-1, and then fill the starpu_codelet::nodes array (or starpu_codelet::dyn_nodes when
+<c>1</c>, and then filling the starpu_codelet::nodes array (or starpu_codelet::dyn_nodes when
 starpu_codelet::nbuffers is greater than \ref STARPU_NMAXBUFS) with the node numbers
-where data should be copied to, or -1 to let StarPU copy it to the memory node
+where data should be copied to, or <c>-1</c> to let StarPU copy it to the memory node
 where the task will be executed. For instance, with the following codelet:
 
 \code{.c}

+ 8 - 8
doc/doxygen/chapters/320_scheduling.doxy

@@ -93,7 +93,7 @@ latter is estimated based on bus calibration before execution start,
 i.e. with an idle machine, thus without contention. You can force bus
 re-calibration by running the tool <c>starpu_calibrate_bus</c>. The
 beta parameter defaults to <c>1</c>, but it can be worth trying to tweak it
-by using <c>export STARPU_SCHED_BETA=2</c> for instance, since during
+by using <c>export STARPU_SCHED_BETA=2</c> (\ref STARPU_SCHED_BETA) for instance, since during
 real application execution, contention makes transfer times bigger.
 This is of course imprecise, but in practice, a rough estimation
 already gives the good results that a precise estimation would give.
@@ -106,7 +106,7 @@ take it into account when distributing tasks. The target function that
 the scheduler <c>dmda</c> minimizes becomes <c>alpha * T_execution +
 beta * T_data_transfer + gamma * Consumption</c> , where <c>Consumption</c>
 is the estimated task consumption in Joules. To tune this parameter, use
-<c>export STARPU_SCHED_GAMMA=3000</c> for instance, to express that each Joule
+<c>export STARPU_SCHED_GAMMA=3000</c> (\ref STARPU_SCHED_GAMMA) for instance, to express that each Joule
 (i.e. kW during 1000us) is worth 3000us execution time penalty. Setting
 <c>alpha</c> and <c>beta</c> to zero permits to only take into account energy consumption.
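+
+For instance, with illustrative values:
+
+\verbatim
+$ export STARPU_SCHED_ALPHA=0 STARPU_SCHED_BETA=0 STARPU_SCHED_GAMMA=3000
+\endverbatim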
 
@@ -114,7 +114,7 @@ This is however not sufficient to correctly optimize energy: the scheduler would
 simply tend to run all computations on the most energy-conservative processing
 unit. To account for the consumption of the whole machine (including idle
 processing units), the idle power of the machine should be given by setting
-<c>export STARPU_IDLE_POWER=200</c> for 200W, for instance. This value can often
+<c>export STARPU_IDLE_POWER=200</c> (\ref STARPU_IDLE_POWER) for 200W, for instance. This value can often
 be obtained from the machine power supplier.
 
 The energy actually consumed by the total execution can be displayed by setting
@@ -198,7 +198,7 @@ methods of the policy.
 Make sure to have a look at the \ref API_Scheduling_Policy section, which
 provides a list of the available functions for writing advanced schedulers, such
 as starpu_task_expected_length(), starpu_task_expected_data_transfer_time(),
-starpu_task_expected_energy(), starpu_prefetch_task_input_node(), etc. Other
+starpu_task_expected_energy(), etc. Other
 useful functions include starpu_transfer_bandwidth(), starpu_transfer_latency(),
 starpu_transfer_predict(), ...
 
@@ -223,7 +223,7 @@ schedulers can be read in <c>src/sched_policies</c>, for
 instance <c>random_policy.c</c>, <c>eager_central_policy.c</c>,
 <c>work_stealing_policy.c</c>
 
-\section GraphScheduling Graph-based scheduling
+\section GraphScheduling Graph-based Scheduling
 
 For performance reasons, most of the schedulers shipped with StarPU use simple
 list-scheduling heuristics, assuming that the application has already set
@@ -251,7 +251,7 @@ heuristic based on the duration of the task over CPUs and GPUs to decide between
 the two queues. CPU workers can then pop from the CPU priority queue, and GPU
 workers from the GPU priority queue.
 
-\section DebuggingScheduling Debugging scheduling
+\section DebuggingScheduling Debugging Scheduling
 
 All the \ref OnlinePerformanceTools and \ref OfflinePerformanceTools can
 be used to get information about how well the execution proceeded, and thus the
@@ -260,11 +260,11 @@ overall quality of the execution.
 Precise debugging can also be performed by using the
 \ref STARPU_TASK_BREAK_ON_SCHED, \ref STARPU_TASK_BREAK_ON_PUSH, and
 \ref STARPU_TASK_BREAK_ON_POP environment variables. By setting the job_id of a task
-in these environment variables, StarPU will raise SIGTRAP when the task is being
+in these environment variables, StarPU will raise <c>SIGTRAP</c> when the task is being
 scheduled, pushed, or popped by the scheduler. That means that when one notices
 that a task is being scheduled in a seemingly odd way, one can just reexecute
 the application in a debugger, with some of those variables set, and the
 execution will stop exactly at the scheduling points of that task, thus allowing
-to inspect the scheduler state etc.
+to inspect the scheduler state, etc.
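+
+For instance, to stop in the debugger when the task whose job_id is
+<c>2048</c> (an application-dependent value) gets scheduled:
+
+\verbatim
+$ STARPU_TASK_BREAK_ON_SCHED=2048 gdb ./my_application
+\endverbatim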
 
 */

+ 3 - 14
doc/doxygen/chapters/330_scheduling_contexts.doxy

@@ -42,8 +42,8 @@ workers to each context according to their needs (\ref SchedulingContextHypervis
 
 Both cases require a call to the function
 starpu_sched_ctx_create(), which requires as input the worker
-list (the exact list or a NULL pointer) and a list of optional
-parameters such as the scheduling policy, terminated by a 0. The
+list (the exact list or a <c>NULL</c> pointer) and a list of optional
+parameters such as the scheduling policy, terminated by a <c>0</c>. The
 scheduling policy can be a character list corresponding to the name of
 a StarPU predefined policy or the pointer to a custom policy. The
 function returns an identifier of the context created which you will
@@ -77,7 +77,7 @@ Combined workers are constructed depending on the entire topology of the machine
 
 \section ModifyingAContext Modifying A Context
 
-A scheduling context can be modified dynamically. The applications may
+A scheduling context can be modified dynamically. The application may
 change its requirements during the execution and the programmer can
 add additional workers to a context or remove them if no longer needed. In
 the following example we have two scheduling contexts
@@ -155,15 +155,4 @@ If these tasks have low
 priority the programmer can forbid the application to submit them
 by calling the function starpu_sched_ctx_stop_task_submission().
 
-\section ContextsSharingWorkers Contexts Sharing Workers
-
-Contexts may share workers when a single context cannot execute
-efficiently enough alone on these workers or when the application
-decides to express a hierarchy of contexts. The workers apply an
-alogrithm of ``Round-Robin'' to chose the context on which they will
-``pop'' next. By using the function
-starpu_sched_ctx_set_turn_to_other_ctx(), the programmer can impose
-the <c>workerid</c> to ``pop'' in the context <c>sched_ctx_id</c>
-next.
-
 */

+ 17 - 13
doc/doxygen/chapters/340_scheduling_context_hypervisor.doxy

@@ -46,23 +46,23 @@ The hypervisor resizes only the registered contexts.
 
 The runtime provides the hypervisor with information concerning the
 behavior of the resources and the application. This is done by using
-the <c>performance_counters</c> which represent callbacks indicating 
-when the resources are idle or not efficient, when the application 
+the <c>performance_counters</c> which represent callbacks indicating
+when the resources are idle or not efficient, when the application
 submits tasks or when it becomes too slow.
 
 \section TriggerTheHypervisor Trigger the Hypervisor
 
-The resizing is triggered either when the application requires it 
-(<c> sc_hypervisor_resize_ctxs </c>) or
+The resizing is triggered either when the application requires it
+(sc_hypervisor_resize_ctxs()) or
 when the initial distribution of resources alters the performance of
 the application (the application is too slow or the resources are idle
-for too long time). If the environment 
-variable <c>SC_HYPERVISOR_TRIGGER_RESIZE</c> is set to <c>speed</c> 
+for too long). If the environment
+variable \ref SC_HYPERVISOR_TRIGGER_RESIZE is set to <c>speed</c>
 the monitored speed of the contexts is compared to a theoretical value
 computed with a linear program, and the resizing is triggered
-whenever the two values do not correspond. Otherwise, if the environment 
+whenever the two values do not correspond. Otherwise, if the environment
 variable is set to <c>idle</c> the hypervisor triggers the resizing algorithm
-whenever the workers are idle for a period longer than the threshold 
+whenever the workers are idle for a period longer than the threshold
 indicated by the programmer. When this
 happens, different resizing strategies are applied that target minimizing
 the total execution time of the application, the instant speed or the idle
@@ -128,7 +128,11 @@ task.
 
 The number of flops to be executed by a context is passed as a
  parameter when the context is registered to the hypervisor,
- (<c>sc_hypervisor_register_ctx(sched_ctx_id, flops)</c>) and the one
+\code{.c}
+sc_hypervisor_register_ctx(sched_ctx_id, flops)
+\endcode
+
+and the one
  to be executed by each task is passed when the task is submitted.
  The corresponding field is starpu_task::flops and the corresponding
  macro in the function starpu_task_insert() is ::STARPU_FLOPS.
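+
+For instance (a sketch with a made-up flops count):
+
+\code{.c}
+starpu_task_insert(&cl, STARPU_FLOPS, (double) 2e9, STARPU_RW, handle, 0);
+\endcode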
@@ -154,12 +158,12 @@ such that the application finishes in a minimum amount of time. As for the <b>Gf
 strategy the programmer has to indicate the total number of flops to be executed
 when registering the context. This number of flops may be updated dynamically during the execution
 of the application whenever this information is not very accurate from the beginning.
-The function <c>sc_hypervisor_update_diff_total_flop </c> is called in order add or remove
+The function sc_hypervisor_update_diff_total_flops() is called in order to add or to remove
 a difference to the flops left to be executed.
-Tasks are provided also the number of flops corresponding to each one of them. During the 
+Each task is also provided with its own number of flops. During the
 execution of the application the hypervisor monitors the consumed flops and recomputes
 the time left and the number of resources to use. The speed of each type of resource
-is (re)evaluated and inserter in the linear program in order to better adapt to the 
+is (re)evaluated and inserted in the linear program in order to better adapt to the
 needs of the application.
 
 The <b>Teft</b> strategy uses a linear program too, that considers all the types of tasks
@@ -170,7 +174,7 @@ in order to have good predictions of the execution time of each type of task.
 The types of tasks may be determined directly by the hypervisor when they are submitted.
 However there are applications that do not expose the whole task graph from the beginning.
 In this case, in order to let the hypervisor know about all the tasks, the function
-<c> sc_hypervisor_set_type_of_task </c> will just inform the hypervisor about future tasks
+sc_hypervisor_set_type_of_task() will just inform the hypervisor about future tasks
 without submitting them right away.
 
 The <b>Ispeed</b> strategy divides the execution of the application into several frames.

+ 53 - 54
doc/doxygen/chapters/350_modularized_scheduler.doxy

@@ -9,8 +9,8 @@
 
 \section Introduction
 
-StarPU's Modularized Schedulers are made of individual Scheduling Components 
-Modularizedly assembled as a Scheduling Tree. Each Scheduling Component has an 
+StarPU's Modularized Schedulers are made of individual Scheduling Components
+assembled modularly as a Scheduling Tree. Each Scheduling Component has a
 unique purpose, such as prioritizing tasks or mapping tasks over resources.
 A typical Scheduling Tree is shown below.
 
@@ -21,30 +21,30 @@ A typical Scheduling Tree is shown below.
                                   v
                             Fifo_Component
                                 |  ^
-                                |  |        
+                                |  |
                                 v  |
                            Eager_Component
                                 |  ^
-                                |  |    
+                                |  |
                                 v  |
                  --------><--------------><--------
                  |  ^                          |  ^
-                 |  |                          |  |        
+                 |  |                          |  |
                  v  |                          v  |
              Fifo_Component                 Fifo_Component
                  |  ^                          |  ^
-                 |  |                          |  |        
+                 |  |                          |  |
                  v  |                          v  |
             Worker_Component               Worker_Component
 </pre>
 
 When a task is pushed by StarPU in a Modularized Scheduler, the task moves from
 a Scheduling Component to another, following the hierarchy of the
-Scheduling Tree, and is stored in one of the Scheduling Components of the 
+Scheduling Tree, and is stored in one of the Scheduling Components of the
 strategy.
 When a worker wants to pop a task from the Modularized Scheduler, the
-corresponding Worker Component of the Scheduling Tree tries to pull a task from 
-its parents, following the hierarchy, and gives it to the worker if it succeded 
+corresponding Worker Component of the Scheduling Tree tries to pull a task from
+its parents, following the hierarchy, and gives it to the worker if it succeeded
 to get one.
 
 
@@ -52,7 +52,7 @@ to get one.
 
 \subsection ExistingModularizedSchedulers Existing Modularized Schedulers
 
-StarPU is currently shipped with the following pre-defined Modularized 
+StarPU is currently shipped with the following pre-defined Modularized
 Schedulers:
 
 - Eager-based Schedulers (with/without prefetching) : \n
@@ -60,11 +60,11 @@ Naive scheduler, which tries to map a task on the first available resource
 it finds.
 
 - Prio-based Schedulers (with/without prefetching) : \n
-Similar to Eager-Based Schedulers. Can handle tasks which have a defined 
+Similar to Eager-Based Schedulers. Can handle tasks which have a defined
 priority and schedule them accordingly.
 
 - Random-based Schedulers (with/without prefetching) : \n
-Selects randomly a resource to be mapped on for each task. 
+Randomly selects a resource for each task.
 
 - HEFT Scheduler : \n
 Heterogeneous Earliest Finish Time Scheduler.
@@ -73,8 +73,8 @@ defined performance model (\ref PerformanceModelCalibration)
 to work efficiently, but can handle tasks without a performance
 model.
 
-It is currently needed to set the environment variable \ref STARPU_SCHED 
-to use those Schedulers. Modularized Schedulers' naming is tree-*
+To use one of these schedulers, one can set the environment variable \ref STARPU_SCHED.
+All modularized schedulers are named following the pattern <c>tree-*</c>.
 
 \subsection ExampleTreeEagerPrefetchingStrategy An Example : The Tree-Eager-Prefetching Strategy
 
@@ -89,7 +89,7 @@ to use those Schedulers. Modularized Schedulers' naming is tree-*
                                 v  |
                           Eager_Component
                                 |  ^
-                                |  |    
+                                |  |
                                 v  |
               --------><-------------------><---------
               |  ^                                |  ^
@@ -104,22 +104,22 @@ to use those Schedulers. Modularized Schedulers' naming is tree-*
 
 \subsection Interface
 
-Each Scheduling Component must follow the following pre-defined Interface 
+Each Scheduling Component must implement the following pre-defined Interface
 to be able to interact with other Scheduling Components.
 
 	- Push (Caller_Component, Child_Component, Task) \n
-	The calling Scheduling Component transfers a task to its 
-	Child Component. When the Push function returns, the task no longer 
-	belongs to the calling Component. The Modularized Schedulers' 
+	The calling Scheduling Component transfers a task to its
+	Child Component. When the Push function returns, the task no longer
+	belongs to the calling Component. The Modularized Schedulers'
 	model relies on this function to perform prefetching.
 
 	- Pull (Caller_Component, Parent_Component)  ->  Task \n
 	The calling Scheduling Component requests a task from
-	its Parent Component. When the Pull function ends, the returned 
+	its Parent Component. When the Pull function ends, the returned
 	task belongs to the calling Component.
 
 	- Can_Push (Caller_Component, Parent_Component) \n
-	The calling Scheduling Component notifies its Parent Component that 
+	The calling Scheduling Component notifies its Parent Component that
 	it is ready to accept new tasks.
 
 	- Can_Pull (Caller_Component, Child_Component) \n
@@ -127,13 +127,13 @@ to be able to interact with other Scheduling Components.
 	that it is ready to give new tasks.
 
 
-\section BuildAModularizedScheduler Build a Modularized Scheduler
+\section BuildAModularizedScheduler Building a Modularized Scheduler
 
 \subsection PreImplementedComponents Pre-implemented Components
 
-StarPU is currently shipped with the following four Scheduling Components : 
+StarPU is currently shipped with the following four Scheduling Components:
 
-	- Flow-control Components : Fifo, Prio \n 
+	- Flow-control Components : Fifo, Prio \n
 	Components which store tasks. They can also prioritize them if
 	they have a defined priority. It is possible to define a threshold
 	for those Components following two criteria: the number of tasks
@@ -148,19 +148,19 @@ StarPU is currently shipped with the following four Scheduling Components :
 	Each Worker Component models a concrete worker.
 
 	- Special-Purpose Components : Perfmodel_Select, Best_Implementation \n
-	Components dedicated to original purposes. The Perfmodel_Select 
-	Component decides which Resource-Mapping Component should be used to 
+	Components dedicated to original purposes. The Perfmodel_Select
+	Component decides which Resource-Mapping Component should be used to
 	schedule a task. The Best_Implementation Component chooses which
 	implementation of a task should be used on the chosen resource.
 
 \subsection ProgressionAndValidationRules Progression And Validation Rules
 
-Some rules must be followed to ensure the correctness of a Modularized 
+Some rules must be followed to ensure the correctness of a Modularized
 Scheduler :
 
-	- At least one Flow-control Component without threshold per Worker Component 
-	is needed in a Modularized Scheduler, to store incoming tasks from StarPU 
-	and to give tasks to Worker Components who asks for it. It is possible to 
+	- At least one Flow-control Component without threshold per Worker Component
+	is needed in a Modularized Scheduler, to store incoming tasks from StarPU
+	and to give tasks to Worker Components which ask for them. It is possible to
 	use one Flow-control Component per Worker Component, or one for all Worker
 	Components, depending on how the Scheduling Tree is defined.
 
@@ -168,7 +168,7 @@ Scheduler :
 	Scheduler. Resource-Mapping Components are the only ones which can make
 	scheduling choices, and so the only ones which can have several children.
 
-\subsection ImplementAModularizedScheduler Implement a Modularized Scheduler
+\subsection ImplementAModularizedScheduler Implementing a Modularized Scheduler
 
 The following code shows how the Tree-Eager-Prefetching Scheduler
 shown in Section \ref ExampleTreeEagerPrefetchingStrategy is implemented:
@@ -188,7 +188,7 @@ static void initialize_eager_prefetching_center_policy(unsigned sched_ctx_id)
     (sched_ctx_id, STARPU_WORKER_LIST);
 
   /* Create the Scheduling Tree */
-  struct starpu_sched_tree * t = 
+  struct starpu_sched_tree * t =
     starpu_sched_tree_create(sched_ctx_id);
 
   /* The Root Component is a Flow-control Fifo Component */
@@ -199,16 +199,16 @@ static void initialize_eager_prefetching_center_policy(unsigned sched_ctx_id)
   struct starpu_sched_component * eager_component =
     starpu_sched_component_eager_create(NULL);
 
-  /* Create links between Components : the Eager Component is the child 
+  /* Create links between Components : the Eager Component is the child
    * of the Root Component */
   t->root->add_child
     (t->root, eager_component);
   eager_component->add_father
     (eager_component, t->root);
 
-  /* A task threshold is set for the Flow-control Components which will 
-   * be connected to Worker Components. By doing so, this Modularized 
-   * Scheduler will be able to perform some prefetching on the resources 
+  /* A task threshold is set for the Flow-control Components which will
+   * be connected to Worker Components. By doing so, this Modularized
+   * Scheduler will be able to perform some prefetching on the resources
    */
   struct starpu_sched_component_fifo_data fifo_data =
   {
@@ -218,11 +218,11 @@ static void initialize_eager_prefetching_center_policy(unsigned sched_ctx_id)
 
   unsigned i;
   for(i = 0;
-    i < starpu_worker_get_count() + 
+    i < starpu_worker_get_count() +
     starpu_combined_worker_get_count();
     i++)
   {
-    /* Each Worker Component has a Flow-control Fifo Component as 
+    /* Each Worker Component has a Flow-control Fifo Component as
      * father */
     struct starpu_sched_component * worker_component =
 	  starpu_sched_component_worker_get(i);
@@ -233,8 +233,8 @@ static void initialize_eager_prefetching_center_policy(unsigned sched_ctx_id)
     worker_component->add_father
       (worker_component, fifo_component);
 
-    /* Each Flow-control Fifo Component associated to a Worker 
-     * Component is linked to the Eager Component as one of its 
+    /* Each Flow-control Fifo Component associated to a Worker
+     * Component is linked to the Eager Component as one of its
      * children */
     eager_component->add_child
       (eager_component, fifo_component);
@@ -276,7 +276,7 @@ struct starpu_sched_policy _starpu_sched_tree_eager_prefetching_policy =
 };
 \endcode
 
-\section WriteASchedulingComponent Write a Scheduling Component
+\section WriteASchedulingComponent Writing a Scheduling Component
 
 \subsection GenericSchedulingComponent Generic Scheduling Component
 
@@ -284,10 +284,10 @@ Each Scheduling Component is instantiated from a Generic Scheduling Component,
 which implements a generic version of the Interface. The generic implementations
 of the Pull, Can_Pull and Can_Push functions are recursive calls to their parents
 (respectively to their children). However, as a Generic Scheduling Component does
-not know how much children it will have when it will be instantiated, it does 
+not know how many children it will have when instantiated, it does
 not implement the Push function.
 
-\subsection InstantiationRedefineInterface Instantiation : Redefine the Interface
+\subsection InstantiationRedefineInterface Instantiation : Redefining the Interface
 
 A Scheduling Component must implement all the functions of the Interface. It is
 so necessary to implement a Push function to instantiate a Scheduling Component.
@@ -297,7 +297,7 @@ to the Scheduling Component he is implementing, it is possible to reimplement
 all the functions of the Interface. For example, a Flow-control Component
 reimplements the Pull and the Can_Push functions of the Interface, allowing it
 to catch the generic recursive calls of these functions. The Pull function of
-a Flow-control Component can, for example, pop a task from the local storage 
+a Flow-control Component can, for example, pop a task from the local storage
 queue of the Component, and give it to the calling Component which asks for it.
 
 \subsection DetailedProgressionAndValidationRules Detailed Progression and Validation Rules
@@ -307,18 +307,18 @@ queue of the Component, and give it to the calling Component which asks for it.
 	Areas in the Scheduling Tree.
 
 	- A Pump is the driving engine of the Scheduler: it pushes/pulls tasks
-	to/from a Scheduling Component to an other. Native Pumps of a Scheduling 
-	Tree are located at the root of the Tree (incoming Push calls from StarPU), 
-	and at the leafs of the Tree (Pop calls coming from StarPU Workers). 
-	Pre-implemented Scheduling Components currently shipped with Pumps are 
-	Flow-Control Components and the Resource-Mapping Component Heft, within 
+	to/from one Scheduling Component to another. Native Pumps of a Scheduling
+	Tree are located at the root of the Tree (incoming Push calls from StarPU),
+	and at the leaves of the Tree (Pop calls coming from StarPU Workers).
+	Pre-implemented Scheduling Components currently shipped with Pumps are
+	Flow-Control Components and the Resource-Mapping Component Heft, within
 	their defined Can_Push functions.
 
-	- A correct Scheduling Tree requires a Pump per Scheduling Area and per 
-	Execution Flow. 
+	- A correct Scheduling Tree requires a Pump per Scheduling Area and per
+	Execution Flow.
 
 
-The Tree-Eager-Prefetching Scheduler shown in Section 
+The Tree-Eager-Prefetching Scheduler shown in Section
 \ref ExampleTreeEagerPrefetchingStrategy follows the previous assumptions:
 
 <pre>
@@ -335,7 +335,7 @@ The Tree-Eager-Prefetching Scheduler shown in Section
                                         v  |
  Area 2                           Eager_Component
                                         |  ^
-                                        |  |    
+                                        |  |
                                         v  |
                       --------><-------------------><---------
                       |  ^                                |  ^
@@ -350,4 +350,3 @@ The Tree-Eager-Prefetching Scheduler shown in Section
 </pre>
 
 */
-

+ 19 - 16
doc/doxygen/chapters/360_debugging_tools.doxy

@@ -13,7 +13,7 @@ can be generated and displayed graphically, see \ref GeneratingTracesWithFxT.
 
 \section DebuggingInGeneral Troubleshooting In General
 
-Generally-speaking, if you have troubles, pass <c>--enable-debug</c> to
+Generally-speaking, if you have troubles, pass \ref enable-debug "--enable-debug" to
 <c>./configure</c> to enable some checks which impact performance, but will
 catch common issues, possibly earlier than the actual problem you are observing,
 which may just be a consequence of a bug that happened earlier. If your program
@@ -25,11 +25,11 @@ Then, if your program crashes with an assertion error, a segfault, etc. you can
 thread apply all bt
 \endverbatim
 
-run in gdb at the point of the crash.
+run in <c>gdb</c> at the point of the crash.
 
 In case your program just hangs, but it may also be useful in case of a crash
 too, it helps to source <c>gdbinit</c> as described in the next section to be
-able to run and send us the output of:
+able to run and send us the output of the following commands:
 
 \verbatim
 starpu-workers
@@ -51,31 +51,34 @@ starpu-print-datas
 
 \section UsingGdb Using The Gdb Debugger
 
-Some gdb helpers are provided to show the whole StarPU state:
+Some <c>gdb</c> helpers are provided to show the whole StarPU state:
 
 \verbatim
 (gdb) source tools/gdbinit
 (gdb) help starpu
 \endverbatim
 
-For instance, one can print all tasks with <c>starpu-print-all-tasks</c>,
-print all datas with <c>starpu-print-datas</c>, print all pending data
-transfers with <c>starpu-print-prequests</c>, <c>starpu-print-requests</c>, <c>starpu-print-frequests</c>, <c>starpu-print-irequests</c>,
-print pending MPI requests with
-<c>starpu-mpi-print-detached-requests</c>
+For instance,
+<ul>
+<li> one can print all tasks with <c>starpu-print-all-tasks</c>, </li>
+<li> print all data with <c>starpu-print-datas</c>, </li>
+<li> print all pending data transfers with <c>starpu-print-prequests</c>, <c>starpu-print-requests</c>, <c>starpu-print-frequests</c>, <c>starpu-print-irequests</c>,</li>
+<li> print pending MPI requests with <c>starpu-mpi-print-detached-requests</c></li>
+</ul>
 
-Some functions can only work if <c>--enable-debug</c> was passed to <c>./configure</c>
+Some functions can only work if \ref enable-debug "--enable-debug"
+was passed to <c>./configure</c>
 (because they impact performance)
 
-\section UsingOtherDebugger Using other debugging tools
+\section UsingOtherDebugger Using Other Debugging Tools
 
-Valgrind can be used on StarPU: valgrind.h just needs to be found at ./configure
+Valgrind can be used on StarPU: <c>valgrind.h</c> just needs to be found at <c>./configure</c>
 time, to tell valgrind about some known false positives and disable host memory
 pinning. Other known false positives can be suppressed by giving the suppression
-files in tools/valgrind/ *.suppr to valgrind's --suppressions option.
+files in <c>tools/valgrind/*.suppr</c> to valgrind's <c>--suppressions</c> option.
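+
+For instance (the suppression file name is given as an illustration):
+
+\verbatim
+$ valgrind --suppressions=tools/valgrind/starpu.suppr ./my_application
+\endverbatim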
 
-The environment variable \ref STARPU_DISABLE_KERNELS can also be set to 1 to make
-StarPU do everything (schedule tasks, transfer memory, etc.) except actually
+The environment variable \ref STARPU_DISABLE_KERNELS can also be set to <c>1</c> to make
+StarPU do everything (schedule tasks, transfer memory, etc.) except actually
 calling the application-provided kernel functions, i.e. the computation will not
 happen. This permits to quickly check that the task scheme is working properly.
 
@@ -90,7 +93,7 @@ sure that it found it, rebuild StarPU.  Run the Temanejo GUI, give it the path
 to your application, any options you want to pass it, the path to <c>libayudame.so</c>.
 
 It permits to visualize the task graph, add breakpoints, continue execution
-task-by-task, and run gdb on a given task, etc.
+task-by-task, and run <c>gdb</c> on a given task, etc.
 
 \image html temanejo.png
 \image latex temanejo.png "" width=\textwidth

+ 9 - 9
doc/doxygen/chapters/370_online_performance_tools.doxy

@@ -57,8 +57,8 @@ the function starpu_task_get_current().
 \subsection Per-codeletFeedback Per-codelet Feedback
 
 The field starpu_codelet::per_worker_stats is
-an array of counters. The i-th entry of the array is incremented every time a
-task implementing the codelet is executed on the i-th worker.
+an array of counters. The <c>i</c>-th entry of the array is incremented every time a
+task implementing the codelet is executed on the <c>i</c>-th worker.
 This array is not reinitialized when profiling is enabled or disabled.
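+
+A sketch of dumping these counters for a codelet <c>cl</c>:
+
+\code{.c}
+unsigned worker;
+for (worker = 0; worker < starpu_worker_get_count(); worker++)
+	printf("worker %u ran the codelet %lu times\n", worker,
+	       (unsigned long) cl.per_worker_stats[worker]);
+\endcode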
 
 \subsection Per-workerFeedback Per-worker Feedback
@@ -83,7 +83,7 @@ Calling starpu_profiling_worker_get_info() resets the profiling
 information associated to a worker.
 
 To easily display all this information, the environment variable
-\ref STARPU_WORKER_STATS can be set to 1 (in addition to setting
+\ref STARPU_WORKER_STATS can be set to <c>1</c> (in addition to setting
 \ref STARPU_PROFILING to 1). A summary will then be displayed at program termination:
 
 \verbatim
@@ -140,7 +140,7 @@ CUDA 2  4534.229519     2417.069025     2417.060863     0.000000
 
 Statistics about the data transfers which were performed and temporal average
 of bandwidth usage can be obtained by setting the environment variable
-\ref STARPU_BUS_STATS to 1; a summary will then be displayed at program termination:
+\ref STARPU_BUS_STATS to <c>1</c>; a summary will then be displayed at program termination:
 
 \verbatim
 Data transfer stats:
@@ -211,7 +211,7 @@ starpu_top_task_prevision(task, workerid, begin, end);
 \endcode
 
 Starting StarPU-Top (StarPU-Top is started via the binary
-<c>starpu_top</c>.) and the application can be done two ways:
+<c>starpu_top</c>) and the application can be done in two ways:
 
 <ul>
 <li> The application is started by hand on some machine (and thus already
@@ -335,8 +335,8 @@ using the function starpu_perfmodel_update_history().
 The following is a small code example.
 
 If e.g. the code is recompiled with other compilation options, or several
-variants of the code are used, the symbol string should be changed to reflect
-that, in order to recalibrate a new model from zero. The symbol string can even
+variants of the code are used, the <c>symbol</c> string should be changed to reflect
+that, in order to recalibrate a new model from zero. The <c>symbol</c> string can even
 be constructed dynamically at execution time, as long as this is done before
 submitting any task using it.
 
@@ -362,8 +362,8 @@ Measured at runtime and refined by regression (model types
 ::STARPU_REGRESSION_BASED and ::STARPU_NL_REGRESSION_BASED). This
 still assumes performance regularity, but works
 with various data input sizes, by applying regression over observed
-execution times. ::STARPU_REGRESSION_BASED uses an a*n^b regression
-form, ::STARPU_NL_REGRESSION_BASED uses an a*n^b+c (more precise than
+execution times. ::STARPU_REGRESSION_BASED uses an <c>a*n^b</c> regression
+form, ::STARPU_NL_REGRESSION_BASED uses an <c>a*n^b+c</c> form (more precise than
 ::STARPU_REGRESSION_BASED, but costs a lot more to compute).
 
 For instance,

+ 31 - 29
doc/doxygen/chapters/380_offline_performance_tools.doxy

@@ -90,8 +90,8 @@ starpu_shutdown(). The trace is a binary file whose name has the form
 <c>/tmp/</c> directory by default, or by the directory specified by
 the environment variable \ref STARPU_FXT_PREFIX.
 
-The additional configure option \ref enable-fxt-lock "--enable-fxt-lock" can 
-be used to generate trace events which describes the locks behaviour during 
+The additional configure option \ref enable-fxt-lock "--enable-fxt-lock" can
+be used to generate trace events which describes the locks behaviour during
 the execution.
 
 \subsection CreatingAGanttDiagram Creating a Gantt Diagram
@@ -137,7 +137,7 @@ varying colors, pass option <c>-c</c> to <c>starpu_fxt_tool</c>.
 
 To identify tasks precisely, the application can set the starpu_task::tag_id field of the
 task (or use \ref STARPU_TAG_ONLY when using starpu_task_insert()), and with a recent
-enough version of vite (>= r1430) and the
+enough version of ViTE (>= r1430) and the
 \ref enable-paje-codelet-details "--enable-paje-codelet-details"
 StarPU configure option, the value of the tag will show up in the trace.
 
@@ -151,7 +151,7 @@ Traces can also be inspected by hand by using the tool <c>fxt_print</c>, for ins
 $ fxt_print -o -f /tmp/prof_file_something
 \endverbatim
 
-Timings are in nanoseconds (while timings as seen in <c>vite</c> are in milliseconds).
+Timings are in nanoseconds (while timings as seen in ViTE are in milliseconds).
 
 \subsection CreatingADAGWithGraphviz Creating a DAG With Graphviz
 
@@ -170,7 +170,7 @@ graphical output of the graph by using the graphviz library:
 $ dot -Tpdf dag.dot -o output.pdf
 \endverbatim
 
-\subsection TraceTaskDetails Getting task details
+\subsection TraceTaskDetails Getting Task Details
 
 When the FxT trace file <c>prof_file_something</c> has been generated, details on the
 executed tasks can be retrieved by calling:
@@ -219,7 +219,7 @@ evolution of the number of tasks available in the system during the execution.
 Ready tasks are shown in black, and tasks that are submitted but not
 schedulable yet are shown in grey.
 
-\subsection Animation Getting modular schedular animation
+\subsection Animation Getting Modular Scheduler Animation
 
 When using modular schedulers (i.e. schedulers which use a modular architecture,
 and whose name start with "modular-"), the command
@@ -232,26 +232,27 @@ will also produce a <c>trace.html</c> file which can be viewed in a
 javascript-enabled web browser. It shows the flow of tasks between the
 components of the modular scheduler.
 
-\subsection Limiting the scope of the trace
+\subsection LimitingScopeTrace Limiting The Scope Of The Trace
 
 For computing statistics, it is useful to limit the trace to a given portion of
 the time of the whole execution. This can be achieved by calling
 
-\verbatim
-starpu_fxt_autostart_profiling(0);
-\endverbatim
+\code{.c}
+starpu_fxt_autostart_profiling(0);
+\endcode
 
-before calling starpu_init(), to prevent tracing from starting immediately. Then
+before calling starpu_init(), to
+prevent tracing from starting immediately. Then
 
-\verbatim
+\code{.c}
 starpu_fxt_start_profiling();
-\endverbatim
+\endcode
 
-and 
+and
 
-\verbatim
+\code{.c}
 starpu_fxt_stop_profiling();
-\endverbatim
+\endcode
 
 can be used around the portion of code to be traced. This will show up as marks
 in the trace, and states of workers will only show up for that portion.
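+
+Putting it together (sketch):
+
+\code{.c}
+starpu_fxt_autostart_profiling(0);
+starpu_init(NULL);
+/* ... warm-up phase, not traced ... */
+starpu_fxt_start_profiling();
+/* ... portion of interest, traced ... */
+starpu_fxt_stop_profiling();
+starpu_shutdown();
+\endcode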
@@ -383,16 +384,17 @@ histogram of the codelet execution time distribution.
 \image html distrib_data_histo.png
 \image latex distrib_data_histo.eps "" width=\textwidth
 
-\section TraceStatistics Trace statistics
+\section TraceStatistics Trace Statistics
 
 More than just codelet performance, it is interesting to get statistics over all
 kinds of StarPU states (allocations, data transfers, etc.). This is particularly
 useful to check what may have gone wrong in the accuracy of the simgrid
 simulation.
 
-This requires the <c>R</c> statistical tool, with the plyr, ggplot2 and
-data.table packages. If your system distribution does not have packages for
-these, one can fetch them from CRAN:
+This requires the <c>R</c> statistical tool, with the <c>plyr</c>,
+<c>ggplot2</c> and <c>data.table</c> packages. If your system
+distribution does not have packages for these, one can fetch them from
+<c>CRAN</c>:
 
 \verbatim
 $ R
@@ -402,10 +404,10 @@ $ R
 > install.packages("knitr")
 \endverbatim
 
-The pj_dump tool from pajeng is also needed (see
+The <c>pj_dump</c> tool from <c>pajeng</c> is also needed (see
 https://github.com/schnorr/pajeng).
 
-One can then get textual or .csv statistics over the trace states:
+One can then get textual or <c>.csv</c> statistics over the trace states:
 
 \verbatim
 $ starpu_paje_state_stats -v native.trace simgrid.trace
@@ -417,12 +419,12 @@ $ starpu_paje_state_stats -v native.trace simgrid.trace
 $ starpu_paje_state_stats native.trace simgrid.trace
 \endverbatim
 
-An other way to get statistics of StarPU states (without installing R and
-pj_dump) is to use the <c>starpu_trace_state_stats.py</c> script which parses the
-generated trace.rec file instead of the paje.trace file. The output is similar
+Another way to get statistics of StarPU states (without installing <c>R</c> and
+<c>pj_dump</c>) is to use the <c>starpu_trace_state_stats.py</c> script which parses the
+generated <c>trace.rec</c> file instead of the <c>paje.trace</c> file. The output is similar
 to the previous script but it doesn't need any dependencies.
 
-The different prefixes used in trace.rec are:
+The different prefixes used in <c>trace.rec</c> are:
 
 \verbatim
 E: Event type
@@ -433,7 +435,7 @@ T: Thread ID
 S: Start time
 \endverbatim
 
-Here's an example how to use it:
+Here's an example of how to use it:
 
 \verbatim
 $ python starpu_trace_state_stats.py trace.rec | column -t -s ","
@@ -466,7 +468,7 @@ $ starpu_paje_summary native.trace simgrid.trace
 it includes Gantt charts, execution summaries, as well as state duration charts
 and time distribution histograms.
 
-Other external Pajé analysis tools can be used on these traces, one just needs
+Other external Paje analysis tools can be used on these traces; one just needs
 to sort the traces by timestamp order (which is not guaranteed at recording
 time, so as to make recording more efficient):
 
@@ -505,7 +507,7 @@ execution time of your tasks. If StarPU was compiled with the library
 <c>glpk</c> installed, starpu_bound_compute() can be used to solve it
 immediately and get the optimized minimum, in ms. Its parameter
 <c>integer</c> allows deciding whether integer resolution should be
-computed and returned 
+computed and returned.
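+
+For instance, a minimal sketch, assuming the bound recording is
+delimited with starpu_bound_start() and starpu_bound_stop() and that
+the application tasks are submitted in between:
+
+\code{.c}
+double res, integer_res;
+
+starpu_bound_start(0, 0);   /* no deps, no priorities */
+/* ... submit the tasks of the application ... */
+starpu_task_wait_for_all();
+starpu_bound_stop();
+
+/* solve the linear programming problem; result in ms */
+starpu_bound_compute(&res, &integer_res, 1);
+\endcode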
 
 The <c>deps</c> parameter tells StarPU whether to take tasks, implicit
 data, and tag dependencies into account. Tags released in a callback

+ 7 - 7
doc/doxygen/chapters/390_faq.doxy

@@ -1,7 +1,7 @@
 /*
  * This file is part of the StarPU Handbook.
  * Copyright (C) 2009--2011  Universit@'e de Bordeaux
- * Copyright (C) 2010, 2011, 2012, 2013, 2014  CNRS
+ * Copyright (C) 2010, 2011, 2012, 2013, 2014, 2016  CNRS
  * Copyright (C) 2011, 2012 INRIA
  * See the file version.doxy for copying conditions.
  */
@@ -14,7 +14,7 @@ Some libraries need to be initialized once for each concurrent instance that
 may run on the machine. For instance, a C++ computation class which is not
 thread-safe by itself, but for which several instantiated objects of that class
 can be used concurrently. This can be used in StarPU by initializing one such
-object per worker. For instance, the libstarpufft example does the following to
+object per worker. For instance, the <c>libstarpufft</c> example does the following to
 be able to use FFTW on CPUs.
 
 A global array stores the instantiated objects:
@@ -51,7 +51,7 @@ static void fft(void *descr[], void *_args)
 
 This however is not sufficient for FFT on CUDA: initialization has
 to be done from the workers themselves.  This can be done thanks to
-starpu_execute_on_each_worker().  For instance libstarpufft does the following.
+starpu_execute_on_each_worker().  For instance <c>libstarpufft</c> does the following.
 
 \code{.c}
 static void fft_plan_gpu(void *args)
@@ -164,10 +164,10 @@ and display it e.g. in the callback function.
 Some users had issues with MKL 11 and StarPU (versions 1.1rc1 and
 1.0.5) on Linux with MKL, using 1 thread for MKL and doing all the
 parallelism using StarPU (no multithreaded tasks), setting the
-environment variable MKL_NUM_THREADS to 1, and using the threaded MKL library,
-with iomp5.
+environment variable <c>MKL_NUM_THREADS</c> to <c>1</c>, and using the threaded MKL library,
+with <c>iomp5</c>.
 
-Using this configuration, StarPU uses only 1 core, no matter the value of
+Using this configuration, StarPU only uses 1 core, no matter the value of
 \ref STARPU_NCPU. The problem is actually a thread pinning issue with MKL.
 
 The solution is to set the environment variable <c>KMP_AFFINITY</c> to <c>disabled</c>
@@ -204,7 +204,7 @@ frozen), and stop them from polling for more work.
 Note that this does not prevent you from submitting new tasks, but
 they won't execute until starpu_resume() is called. Also note
 that StarPU must not be paused when you call starpu_shutdown(), and
-that this function pair works in a push/pull manner, ie you need to
+that this function pair works in a push/pull manner, i.e. you need to
 match the number of calls to these functions to clear their effect.
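+
+A minimal sketch (the task array and its size are illustrative):
+
+\code{.c}
+int i;
+
+starpu_pause();    /* workers stop polling for work */
+for (i = 0; i < N; i++)
+	starpu_task_submit(tasks[i]);   /* tasks are queued, not executed */
+starpu_resume();   /* workers start executing the queued tasks */
+\endcode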
 
 

+ 1 - 1
doc/doxygen/chapters/401_out_of_core.doxy

@@ -13,7 +13,7 @@ When using StarPU, one may need to store more data than what the main memory
 disk and to use it.
 
 The principle is that one first registers a disk location, seen by StarPU as
-a void*, which can be for instance a Unix path for the stdio or unistd case,
+a <c>void*</c>, which can be for instance a Unix path for the stdio or unistd case,
 or a database file path for a leveldb case, etc. The disk backend opens this
 place with the plug method.
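+
+For instance, a minimal sketch registering a 200 MB disk area with the
+unistd backend (the path and size are illustrative):
+
+\code{.c}
+int disk_node = starpu_disk_register(&starpu_disk_unistd_ops,
+                                     (void *)"/tmp/starpu_disk",
+                                     200*1024*1024);
+\endcode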
 

+ 17 - 11
doc/doxygen/chapters/410_mpi_support.doxy

@@ -10,7 +10,7 @@
 
 The integration of MPI transfers within task parallelism is done in a
 very natural way by the means of asynchronous interactions between the
-application and StarPU.  This is implemented in a separate libstarpumpi library
+application and StarPU.  This is implemented in a separate <c>libstarpumpi</c> library
 which basically provides "StarPU" equivalents of <c>MPI_*</c> functions, where
 <c>void *</c> buffers are replaced with ::starpu_data_handle_t, and all
 GPU-RAM-NIC transfers are handled efficiently by StarPU-MPI.  The user has to
@@ -96,11 +96,11 @@ instance:
     }
 \endcode
 
-In that case, libstarpumpi is not needed. One can also use MPI_Isend() and
-MPI_Irecv(), by calling starpu_data_release() after MPI_Wait() or MPI_Test()
+In that case, <c>libstarpumpi</c> is not needed. One can also use <c>MPI_Isend()</c> and
+<c>MPI_Irecv()</c>, by calling starpu_data_release() after <c>MPI_Wait()</c> or <c>MPI_Test()</c>
 have notified completion.
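+
+A minimal sketch, assuming <c>token_handle</c> is registered on a
+buffer <c>token</c> of one unsigned, and that <c>peer</c> and
+<c>tag</c> hold the destination rank and message tag:
+
+\code{.c}
+MPI_Request req;
+MPI_Status status;
+
+starpu_data_acquire(token_handle, STARPU_R);
+MPI_Isend(&token, 1, MPI_UNSIGNED, peer, tag, MPI_COMM_WORLD, &req);
+MPI_Wait(&req, &status);
+starpu_data_release(token_handle);
+\endcode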
 
-It is however better to use libstarpumpi, to save the application from having to
+It is however better to use <c>libstarpumpi</c>, to save the application from having to
 synchronize with starpu_data_acquire(), and instead just submit all tasks and
 communications asynchronously, and wait for the overall completion.
 
@@ -183,7 +183,7 @@ int main(int argc, char **argv)
     }
 \endcode
 
-We have here replaced MPI_Recv() and MPI_Send() with starpu_mpi_irecv_detached()
+We have here replaced <c>MPI_Recv()</c> and <c>MPI_Send()</c> with starpu_mpi_irecv_detached()
 and starpu_mpi_isend_detached(), which just submit the communication to be
 performed. The only remaining synchronization with starpu_data_acquire() is at
 the beginning and the end.
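+
+For instance, a minimal sketch of the detached calls (peer and tag
+values are illustrative, and no completion callback is used):
+
+\code{.c}
+starpu_mpi_isend_detached(token_handle, peer, tag, MPI_COMM_WORLD, NULL, NULL);
+starpu_mpi_irecv_detached(token_handle, peer, tag, MPI_COMM_WORLD, NULL, NULL);
+\endcode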
@@ -233,7 +233,7 @@ does the following:
 <li> it polls the <em>ready requests list</em>. For all the ready
 requests, the appropriate function is called to post the corresponding
 MPI call. For example, an initial call to starpu_mpi_isend() will
-result in a call to <c>MPI_Isend</c>. If the request is marked as
+result in a call to <c>MPI_Isend()</c>. If the request is marked as
 detached, the request will then be added in the <em>detached requests
 list</em>.
 </li>
@@ -241,7 +241,7 @@ list</em>.
 </li>
 <li> it polls the <em>detached requests list</em>. For all the detached
 requests, it tests the completion of the MPI request by calling
-<c>MPI_Test</c>. On completion, the data handle is released, and if a
+<c>MPI_Test()</c>. On completion, the data handle is released, and if a
 callback was defined, it is called.
 </li>
 <li> finally, it checks if a data envelope has been received. If so,
@@ -266,7 +266,13 @@ can also be used within StarPU-MPI and
 exchanged between nodes. Two functions need to be defined through the
 type starpu_data_interface_ops. The function
 starpu_data_interface_ops::pack_data takes a handle and returns a
-contiguous memory buffer allocated with starpu_malloc_flags(ptr, size, 0) along with its size where data to be conveyed
+contiguous memory buffer allocated with
+
+\code{.c}
+starpu_malloc_flags(ptr, size, 0)
+\endcode
+
+along with its size, into which the data to be conveyed
 to another node should be copied. The reverse operation is
 implemented in the function starpu_data_interface_ops::unpack_data which
 takes a contiguous memory buffer and recreates the data handle.
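+
+As a sketch, assuming a hypothetical interface storing its payload in
+<c>struct my_interface { size_t n; double *data; }</c>, and assuming
+the prototypes used below for the two fields:
+
+\code{.c}
+static int my_pack_data(starpu_data_handle_t handle, unsigned node,
+                        void **ptr, starpu_ssize_t *count)
+{
+	struct my_interface *itf = (struct my_interface *)
+		starpu_data_get_interface_on_node(handle, node);
+	*count = itf->n * sizeof(double);
+	starpu_malloc_flags(ptr, *count, 0);
+	memcpy(*ptr, itf->data, *count);
+	return 0;
+}
+
+static int my_unpack_data(starpu_data_handle_t handle, unsigned node,
+                          void *ptr, size_t count)
+{
+	struct my_interface *itf = (struct my_interface *)
+		starpu_data_get_interface_on_node(handle, node);
+	memcpy(itf->data, ptr, count);
+	return 0;
+}
+\endcode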
@@ -498,7 +504,7 @@ the cost of task submission.
 A function starpu_mpi_task_build() is also provided with the aim of
 only constructing the task structure. All MPI nodes need to call the
 function, only the node which is to execute the task will return a
-valid task structure, others will return NULL. That node must submit that task.
+valid task structure, others will return <c>NULL</c>. That node must submit that task.
 All nodes then need to call the function starpu_mpi_task_post_build() -- with the same
 list of arguments as starpu_mpi_task_build() -- to post all the
 necessary data communications.
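+
+For instance, a minimal sketch (the codelet <c>cl</c> and
+<c>data_handle</c> are assumed to be set up beforehand):
+
+\code{.c}
+struct starpu_task *task;
+
+task = starpu_mpi_task_build(MPI_COMM_WORLD, &cl, STARPU_RW, data_handle, 0);
+if (task) starpu_task_submit(task);
+starpu_mpi_task_post_build(MPI_COMM_WORLD, &cl, STARPU_RW, data_handle, 0);
+\endcode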
@@ -535,7 +541,7 @@ modify the current value, it can not decide by itself whether to flush the cache
 or not.  The application can however explicitly tell StarPU-MPI to flush the
 cache by calling starpu_mpi_cache_flush() or starpu_mpi_cache_flush_all_data(),
 for instance in case the data will not be used at all any more (see for instance
-the cholesky example in mpi/examples/matrix_decomposition), or at least not in
+the cholesky example in <c>mpi/examples/matrix_decomposition</c>), or at least not in
 the close future. If a newly-submitted task actually needs the value again,
 another transmission of D will be initiated from A to B.  A mere
 starpu_mpi_cache_flush_all_data() can for instance be added at the end of the whole
@@ -546,7 +552,7 @@ for the data deallocation will be the same, but it will additionally release som
 pressure from the StarPU-MPI cache hash table during task submission.
 
 The whole caching behavior can be disabled thanks to the \ref STARPU_MPI_CACHE
-environment variable. The variable \ref STARPU_MPI_CACHE_STATS can be set to 1
+environment variable. The variable \ref STARPU_MPI_CACHE_STATS can be set to <c>1</c>
 to enable the runtime to display messages when data are added or removed
 from the cache holding the received data.
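+
+For instance (the program name is illustrative), the cache can be
+disabled for a whole run with:
+
+\verbatim
+$ STARPU_MPI_CACHE=0 mpirun -np 4 ./my_mpi_app
+\endverbatim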
 

+ 3 - 3
doc/doxygen/chapters/430_mic_scc_support.doxy

@@ -13,10 +13,10 @@
 SCC support just needs the presence of the RCCE library.
 
 MIC Xeon Phi support actually needs two compilations of StarPU, one for the host and one for
-the device. The PATH environment variable has to include the path to the
+the device. The <c>PATH</c> environment variable has to include the path to the
 cross-compilation toolchain, for instance <c>/usr/linux-k1om-4.7/bin</c>.
-The SINK_PKG_CONFIG_PATH environment variable should include the path to the
-cross-compiled hwloc.pc.
+The <c>SINK_PKG_CONFIG_PATH</c> environment variable should include the path to the
+cross-compiled <c>hwloc.pc</c>.
 The script <c>mic-configure</c> can then be used to achieve the two compilations: it basically
 calls <c>configure</c> as appropriate from two new directories: <c>build_mic</c> and
 <c>build_host</c>. <c>make</c> and <c>make install</c> can then be used as usual and will

+ 3 - 3
doc/doxygen/chapters/450_native_fortran_support.doxy

@@ -62,7 +62,7 @@ building application codes with StarPU.
 All these examples assume that the standard Fortran module <c>iso_c_binding</c>
 is in use.
 
-- Specifying a NULL pointer
+- Specifying a <c>NULL</c> pointer
 \code{.f90}
         type(c_ptr) :: my_ptr  ! variable to store the pointer
         ! [...]
@@ -198,8 +198,8 @@ structure:
         call fstarpu_codelet_free(cl_vec)
 \endcode
 
-\section Notes Additional notes about the Native Fortran support
-\subsection OldFortran Using StarPU with older Fortran compilers
+\section Notes Additional Notes about the Native Fortran Support
+\subsection OldFortran Using StarPU with Older Fortran Compilers
 
 When using older compilers, Fortran applications may still interoperate
 with StarPU using C marshalling functions as exemplified in StarPU's

+ 8 - 8
doc/doxygen/chapters/470_simgrid.doxy

@@ -12,7 +12,7 @@ StarPU can use Simgrid in order to simulate execution on an arbitrary
 platform. This was tested with simgrid 3.11, 3.12 and 3.13; other versions may have
 compatibility issues.
 
-\section Preparing Preparing your application for simulation.
+\section Preparing Preparing Your Application For Simulation
 
 There are a few technical details which need to be handled for an application to
 be simulated through Simgrid.
@@ -28,7 +28,7 @@ into starpu_main(), and it is libstarpu which will provide the real main() and
 will call the application's main().
 
 To be able to test with crazy data sizes, one may want to only allocate
-application data if STARPU_SIMGRID is not defined.  Passing a NULL pointer to
+application data if STARPU_SIMGRID is not defined.  Passing a <c>NULL</c> pointer to
 starpu_data_register functions is fine; data will never be read/written to by
 StarPU in Simgrid mode anyway.
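+
+A minimal sketch (names and sizes are illustrative):
+
+\code{.c}
+starpu_data_handle_t vector_handle;
+float *vector;
+
+#ifdef STARPU_SIMGRID
+vector = NULL;   /* never dereferenced in simulation */
+#else
+vector = malloc(NX * sizeof(*vector));
+#endif
+starpu_vector_data_register(&vector_handle, STARPU_MAIN_RAM,
+                            (uintptr_t)vector, NX, sizeof(*vector));
+\endcode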
 
@@ -123,26 +123,26 @@ case. Since during simgrid execution, the functions of the codelet are actually
 not called by default, one can use dummy functions such as the following to
 still permit CUDA or OpenCL execution.
 
-\section SimulationExamples Simulation examples
+\section SimulationExamples Simulation Examples
 
 StarPU ships performance models for a few systems: attila,
 mirage, idgraf, and sirocco. See section \ref SimulatedBenchmarks for the details.
 
-\section FakeSimulations Simulations on fake machines
+\section FakeSimulations Simulations On Fake Machines
 
 It is possible to build fake machines which do not exist, by modifying the
 platform file in <c>$STARPU_HOME/.starpu/sampling/bus/machine.platform.xml</c>
 by hand: one can add more CPUs, add GPUs (but the performance model file has to
 be extended as well), change the available GPU memory size, PCI memory bandwidth, etc.
 
-\section TweakingSimulation Tweaking simulation
+\section TweakingSimulation Tweaking Simulation
 
 The simulation can be tweaked, to be able to tune it between a very accurate
 simulation and a very simple simulation (which is thus close to scheduling
 theory results), see the \ref STARPU_SIMGRID_CUDA_MALLOC_COST and
 \ref STARPU_SIMGRID_CUDA_QUEUE_COST environment variables.
 
-\section SimulationMPIApplications MPI applications
+\section SimulationMPIApplications MPI Applications
 
 StarPU-MPI applications can also be run in simgrid mode. It needs to be compiled
 with smpicc, and run using the <c>starpu_smpirun</c> script, for instance:
@@ -156,7 +156,7 @@ list of MPI nodes to be used. StarPU currently only supports homogeneous MPI
 clusters: for each MPI node it will just replicate the architecture referred by
 \ref STARPU_HOSTNAME.
 
-\section SimulationDebuggingApplications Debugging applications
+\section SimulationDebuggingApplications Debugging Applications
 
 By default, simgrid uses its own implementation of threads, which prevents gdb
 from being able to inspect stacks of all threads.  To be able to fully debug an
@@ -166,7 +166,7 @@ able to manipulate as usual.
 
 \snippet simgrid.c To be included. You should update doxygen if you see this text.
 
-\section SimulationMemoryUsage Memory usage
+\section SimulationMemoryUsage Memory Usage
 
 Since kernels are not actually run and data transfers are not actually
 performed, the data memory does not actually need to be allocated.  This allows

+ 9 - 9
doc/doxygen/chapters/501_environment_variables.doxy

@@ -750,7 +750,7 @@ memory is getting full. The default is unlimited.
 This variable allows the user to control the task submission flow by specifying
 to StarPU a maximum number of submitted tasks allowed at a given time, i.e. when
 this limit is reached task submission becomes blocking until enough tasks have
-completed, specified by STARPU_LIMIT_MIN_SUBMITTED_TASKS.
+completed, specified by \ref STARPU_LIMIT_MIN_SUBMITTED_TASKS.
 Setting it enables allocation cache buffer reuse in main memory.
 </dd>
 
@@ -858,8 +858,8 @@ dog is reached, thus allowing to catch the situation in gdb, etc
 When this variable contains a job id, StarPU will raise SIGTRAP when the task
 with that job id is being scheduled by the scheduler (at a scheduler-specific
 point), which will be nicely caught by debuggers.
-This only works for schedulers which have such a scheduling point defined.
-See \ref DebuggingScheduling
+This only works for schedulers which have such a scheduling point defined
+(see \ref DebuggingScheduling); an example invocation is given after this list.
 </dd>
 
 <dt>STARPU_TASK_BREAK_ON_PUSH</dt>
@@ -867,8 +867,8 @@ See \ref DebuggingScheduling
 \anchor STARPU_TASK_BREAK_ON_PUSH
 \addindex __env__STARPU_TASK_BREAK_ON_PUSH
 When this variable contains a job id, StarPU will raise SIGTRAP when the task
-with that job id is being pushed to the scheduler, which will be nicely catched by debuggers.
-See \ref DebuggingScheduling
+with that job id is being pushed to the scheduler, which will be nicely caught by debuggers
+(see \ref DebuggingScheduling).
 </dd>
 
 <dt>STARPU_TASK_BREAK_ON_POP</dt>
@@ -876,8 +876,8 @@ See \ref DebuggingScheduling
 \anchor STARPU_TASK_BREAK_ON_POP
 \addindex __env__STARPU_TASK_BREAK_ON_POP
 When this variable contains a job id, StarPU will raise SIGTRAP when the task
-with that job id is being popped from the scheduler, which will be nicely catched by debuggers.
-See \ref DebuggingScheduling
+with that job id is being popped from the scheduler, which will be nicely caught by debuggers
+(see \ref DebuggingScheduling).
 </dd>
 
 <dt>STARPU_DISABLE_KERNELS</dt>
@@ -906,7 +906,7 @@ average.
 The random scheduler and some examples use random numbers for their own
 working. Depending on the examples, the seed is by default always 0 or
 the current time() (unless simgrid mode is enabled, in which case it is always
-0). STARPU_RAND_SEED allows to set the seed to a specific value.
+0). \ref STARPU_RAND_SEED allows setting the seed to a specific value.
 </dd>
 
 <dt>STARPU_IDLE_TIME</dt>
@@ -924,7 +924,7 @@ sum of all the workers' idle time.
 \addindex __env__STARPU_GLOBAL_ARBITER
 When set to a positive value, StarPU will create an arbiter, which
 implements an advanced but centralized management of concurrent data
-accesses, see \ref ConcurrentDataAccess for the details.
+accesses (see \ref ConcurrentDataAccess).
 </dd>
 
 </dl>
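+
+For instance, a debugging session as sketched above (the job id and
+program name are illustrative):
+
+\verbatim
+$ STARPU_TASK_BREAK_ON_SCHED=42 gdb ./my_app
+\endverbatim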

+ 24 - 9
doc/doxygen/dev/starpu_check_documented.py

@@ -1,4 +1,4 @@
-#!/usr/bin/python
+#!/usr/bin/python3
 
 import os
 import sys
@@ -7,17 +7,27 @@ class bcolors:
     FAILURE = '\033[91m'
     NORMAL = '\033[0m'
 
-def loadFunctionsAndDatatypes(flist, dtlist, fname):
-    f = open(fname, 'r')
+def list_files(directory):
+    # header files in <directory>, skipping starpu_deprecated_api.h
+    return [directory + f for f in os.listdir(directory)
+            if f.count(".h") and not f.count("starpu_deprecated_api.h")]
+
+def loadFunctionsAndDatatypes(flist, dtlist, file_name):
+    f = open(file_name, 'r')
     for line in f:
         mline = line[:-1]
         if mline.count("\\fn"):
             if mline.count("fft") == 0:
                 func = mline.replace("\\fn ", "")
-                flist.append(list([func, fname]))
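+                # the function name is the last token before '(', with any '*' stripped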
+                l = func.split("(")[0].split()
+                func_name = l[len(l)-1].replace("*", "")
+                flist.append(list([func, func_name, file_name]))
         if mline.count("\\struct ") or mline.count("\\def ") or mline.count("\\typedef ") or mline.count("\\enum "):
             datatype = mline.replace("\\struct ", "").replace("\\def ", "").replace("\\typedef ", "").replace("\\enum ","")
-            dtlist.append(list([datatype, fname]))
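+            # for macros documented with arguments, keep only the name before '('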
+            l = datatype.split("(")
+            if len(l) > 1:
+                datatype_name = l[0]
+            else:
+                datatype_name = datatype
+            dtlist.append(list([datatype, datatype_name, file_name]))
     f.close()
 
 functions = []
@@ -30,14 +40,19 @@ for docfile in os.listdir(docfile_dir):
     if docfile.count(".doxy"):
         loadFunctionsAndDatatypes(functions, datatypes, docfile_dir+docfile)
 
-incfiles=dirname+"/../../../include/*.h " + dirname + "/../../../mpi/include/*.h " + dirname + "/../../../starpufft/include/*.h " + dirname + "/../../../sc_hypervisor/include/*.h " + dirname + "/../../../include/starpu_config.h.in"
+list_incfiles = [dirname + "/../../../include/starpu_config.h.in"]
+for d in [dirname+"/../../../include/", dirname + "/../../../mpi/include/", dirname + "/../../../starpufft/include/", dirname + "/../../../sc_hypervisor/include/"]:
+    list_incfiles.extend(list_files(d))
+incfiles=" ".join(list_incfiles)
+
 for function in functions:
     x = os.system("sed 's/ *STARPU_ATTRIBUTE_UNUSED *//g' " + incfiles + "| sed 's/ STARPU_WARN_UNUSED_RESULT//g' | fgrep \"" + function[0] + "\" > /dev/null")
     if x != 0:
-        print "Function <" + bcolors.FAILURE + function[0] + bcolors.NORMAL + "> documented in <" + function[1] + "> does not exist in StarPU's API"
+        print("Function <" + bcolors.FAILURE + function[0] + bcolors.NORMAL + "> documented in <" + function[2] + "> does not exist in StarPU's API")
+        os.system("grep " + function[1] + " " + dirname+"/../../../include/starpu_deprecated_api.h")
 
 for datatype in datatypes:
     x = os.system("fgrep -l \"" + datatype[0] + "\" " + incfiles + " > /dev/null")
     if x != 0:
-        print "Datatype <" + bcolors.FAILURE + datatype[0] + bcolors.NORMAL + "> documented in <" + datatype[1] + "> does not exist in StarPU's API"
-
+        print("Datatype <" + bcolors.FAILURE + datatype[0] + bcolors.NORMAL + "> documented in <" + datatype[2] + "> does not exist in StarPU's API")
+        os.system("grep " + datatype[1] + " " + dirname+"/../../../include/starpu_deprecated_api.h")